16 April 2004
Brewster Kahle gave the talk at the closing plenary of the CNI Spring Task Force meeting. Brewster just keeps on doing; he never seems to be daunted by the scope of large tasks. The amazing thing is that it works! He set out to capture the web, and the Internet Archive (IA) does that better than any other entity. He called on us to “put the best we have to offer within the reach of our children.” “Within reach,” to Brewster (and to our children), means “on the web.” He then walked us through a back-of-the-napkin calculation of what it would take, concluding that the goal is achievable today and within our budgets to boot. Are we ready to answer the call?
Books. The Library of Congress = 20M volumes = 26TB = $60,000 of disk space. At 2 hours/book to scan (without destroying the books) this is doable. Output back to book form costs $1/book. This print-on-demand solution is being demonstrated today by the BookMobile, which the Internet Archive has put on the streets not just of the USA, but also of India, Egypt, and most recently rural Uganda.
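For anyone who wants to check the napkin, the book figures can be run through a few lines of Python. This is only a sketch: the numbers are the ones Brewster quoted, while the decimal definition of a terabyte and the function name `book_napkin` are my own assumptions for illustration.

```python
# Figures as quoted in the talk.
VOLUMES = 20_000_000          # Library of Congress, in volumes
TOTAL_TB = 26                 # storage for all volumes
DISK_COST_USD = 60_000        # cost of that disk space
SCAN_HOURS_PER_BOOK = 2       # non-destructive scanning time
PRINT_COST_PER_BOOK_USD = 1   # print-on-demand cost

# Assumption (not from the talk): 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6


def book_napkin():
    """Derive per-volume and total figures from the quoted numbers."""
    return {
        "mb_per_volume": TOTAL_TB * BYTES_PER_TB / VOLUMES / BYTES_PER_MB,
        "disk_usd_per_tb": DISK_COST_USD / TOTAL_TB,
        "disk_usd_per_volume": DISK_COST_USD / VOLUMES,
        "total_scan_hours": VOLUMES * SCAN_HOURS_PER_BOOK,
        "total_print_usd": VOLUMES * PRINT_COST_PER_BOOK_USD,
    }


if __name__ == "__main__":
    for name, value in book_napkin().items():
        print(f"{name}: {value:,.3f}")
```

On these numbers each volume takes about 1.3 MB and about a third of a cent of disk, so storage is the cheap part. The 40 million scanning hours are where the real effort lies.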
Audio. 2M “saleable objects” of audio exist, but much of it sits behind IP restrictions that make it hard to deal with. The IA approached the “taper” community: fans of performance-oriented rock bands that followed the Grateful Dead’s lead in allowing fans to tape their shows and exchange the recordings for non-commercial use. “How would you like infinite bandwidth and infinite storage for free?” the IA asked the tapers. Guess what? They love the idea. 500 rock bands have given the IA permission to archive this material and share it for free. The tapers have already produced 10-20TB of concerts, now available on the IA.
Moving Images. Don’t just consider the 100,000-200,000 mainstream films (half of them from India). Consider the 2M films created in the 20th century that document daily life. Some of these may be in your very own basement. One hour of film costs about $100 to convert. One hour of video costs only $15. The IA is also now capturing 20 channels of video from around the world 24/7 for about $500,000. It is estimated there may be about 400 channels around the world.
Software. The IA has received a DMCA exemption allowing it to circumvent copy protection for the purpose of ripping some of the 50,000 software packages that exist to date. They are only allowed to rip titles from no-longer-supported operating systems.
Web. The IA now captures 20TB/month of web content. The WayBackMachine holds over 30B (yes, billion) pages from 50M sites on 15M hosts. Anna Patterson’s search engine based on this corpus searches 4 times the number of sites covered by Google.
The Internet Archive does all this on a budget of about $4M or $5M each year. I don’t know about you, but this leaves me breathless.
In order to preserve this growing corpus (libraries, Brewster notes, traditionally burn eventually) the IA seeks out partners around the world who can host copies of the data. The more different they are from the US the better. Right now a copy is held at the new library in Alexandria, and negotiations are under way with a northern European country. Brewster estimates that the resources needed to maintain a mirror of the IA are a PB of disk (that’s petabyte), a GB of bandwidth, and $100M to set up an appropriate endowment for continued operation.
But if the “Universal Access to All Human Knowledge” goal articulated by Raj Reddy of the Million Book Project is too vast, and even the “All Published Knowledge Available to the Kid in Uganda” is a bit far out, how about something easy, asks Brewster. What if we just tried to attack what we already have every right to collect? Let’s go for “Public Access to the Public Domain.”
In the USA the public domain is pre-1923 publications. In fact, Brewster points out, with the aid of Mike Klezman’s (?) recently completed electronic version of the copyright registry, it is now easy to find out which works from 1923-1964 did not have their copyright renewed and are now also in the public domain. Let’s go get this material! His proposal: give the IA a book and $10 and the IA will return to you the book unharmed plus a digital copy. Will we accept the offer? Oh, and by the way, the IA is also happy to accept video and $15/hour for the conversion of that to digital format. Oh, and did I mention that the IA will also host the digital documents on their servers “forever”?
I think we should take Brewster up on this offer. How much material do we have in the University of Minnesota collections that we could part with for a bit to let the IA digitize and store it? We should seriously consider a project to pump this material, and the limited dollars required, to the IA as fast as we can. This is a crazy idea at a crazy price point. Let’s try to sink Brewster under our enthusiastic response! The great thing is, we probably won’t; he has not sunk yet.
P.S. In response to a question, Brewster also tossed off an idea about how to archive blogs. His thought was that we should be able to subscribe to blog RSS feeds and simply archive everything we see announced via that mechanism. I wonder if we could auto-harvest RSS from UThink.
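To make the RSS idea a little more concrete, here is a rough sketch of what “archive everything we see announced” might look like, using only the Python standard library. Everything specific in it is hypothetical: the function name `harvest_items`, the sample feed, and the example.org addresses are mine, not Brewster’s or UThink’s. A real harvester would also need to fetch each feed on a schedule and store a copy of every linked page.

```python
import xml.etree.ElementTree as ET


def harvest_items(feed_xml, seen):
    """Return the feed items not yet in `seen`, and record them there.

    `feed_xml` is the text of an RSS 2.0 feed; `seen` is a set of item
    identifiers (guid if present, otherwise the link) archived so far.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        key = (item.findtext("guid") or link).strip()
        if not key or key in seen:
            continue  # nothing to identify the item by, or already archived
        seen.add(key)
        title = (item.findtext("title") or "").strip()
        fresh.append({"id": key, "link": link, "title": title})
    return fresh


# Hypothetical sample feed, standing in for a real blog's RSS.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/1</link>
      <guid>http://example.org/blog/1</guid>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/blog/2</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    archived = set()
    for entry in harvest_items(SAMPLE_FEED, archived):
        print(entry["title"], entry["link"])
```

Keeping the set of identifiers between runs is what lets the harvester poll the same feed repeatedly and pick up only the posts it has not seen before.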