Last month I had the privilege of attending the Fall 2010 CNI Task Force meeting. CNI is my favorite meeting of the digital library circuit because it combines a small scale with high engagement and a broad spectrum of interest. Last December’s meeting was well attended even though travel was difficult that week. My only disappointment was that the readiness of the audience to talk back and forth with speakers in the smaller briefings seems to have diminished a bit. The best CNI briefings, in my opinion, are about half presentation and half discussion.

DC sunset after Fall 2010 CNI Task Force Meeting

It looks like Access 2009 was a great conference, and they have many of their presentations online. The shame of this is that until a few hours ago I didn’t even know Access existed. With my US blinders on, I failed to realize that Canada hosted a conference that falls somewhere between DLF Forum and Code4Lib. It’s been going on for a long while; I have no excuse! I’d better start watching some video.


Brewster Kahle gave the talk at the closing plenary of the CNI Spring Task Force meeting. Brewster just keeps on doing; he never seems to be daunted by the scope of large tasks. The amazing thing is that it works! He set out to capture the web, and the Internet Archive (IA) does that better than any other entity. He called on us to “put the best we have to offer within the reach of our children.” Within reach, to Brewster (and to our children), means “on the web.” He then walked us through a back-of-the-napkin calculation of what it would take, concluding that the goal is within reach of us today and within our budgets to boot. Are we ready to answer the call?

Books. The Library of Congress = 20M volumes = 26TB = $60,000 disk space. At 2 hours/book (without destroying the books) this is doable. Output back to book form costs $1/book. This print-on-demand solution is being demonstrated today by the BookMobile the Internet Archive has put on the streets not just of the USA, but also India, Egypt, and most recently rural Uganda.
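For anyone who wants to check the napkin, here is a minimal sketch of the arithmetic implied by the book figures above. The only inputs are the numbers quoted from the talk; the decimal-terabyte conversion is my own assumption, and the derived per-book values are simple division, not anything Brewster stated.

```python
# Figures as quoted in the notes above.
VOLUMES = 20_000_000      # Library of Congress, volumes
STORAGE_TB = 26           # total size of the digitized collection
DISK_COST_USD = 60_000    # cost of that much disk space

# Assumption: decimal units, so 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000

mb_per_book = STORAGE_TB * MB_PER_TB / VOLUMES
disk_cost_per_tb = DISK_COST_USD / STORAGE_TB
disk_cost_per_book = DISK_COST_USD / VOLUMES

print(f"{mb_per_book:.1f} MB per book")
print(f"${disk_cost_per_tb:,.0f} per TB of disk")
print(f"${disk_cost_per_book:.3f} of disk per book")
```

In other words, the quoted totals work out to a little over a megabyte per volume and a fraction of a cent of disk per book, which is why storage is not the expensive part of the plan.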

Audio. 2M “saleable objects” of audio exist, but much of it is behind IP regulations that make it hard to deal with. The IA approached the “taper” community: fans of the performance-oriented rock bands that followed the Grateful Dead’s lead in allowing fans to tape their music and exchange it for non-commercial use. “How would you like infinite bandwidth and infinite storage for free?” the IA asked the tapers. Guess what? They love the idea. 500 rock bands have given the IA permission to archive this material and share it for free. The tapers have already produced 10-20TB of concerts available on the IA.

Moving Images. Don’t just consider the 100,000-200,000 mainstream films (half of them from India). Consider the 2M films created in the 20th century that document daily life. Some of these may be in your very own basement. One hour of film costs about $100 to convert. One hour of video costs only $15. The IA is also now capturing 20 channels of video from around the world 24/7 for about $500,000. It is estimated there may be about 400 channels around the world.

Software. The IA has received a DMCA exception to circumvent copy protection for the purpose of ripping some of the 50,000 software packages that exist to date. They are only allowed to rip titles from no-longer-supported operating systems.

Web. The IA now captures 20TB/month of web content. The WayBackMachine holds over 30B (yes, billion) pages from 50M sites on 15M hosts. Anna Patterson’s search engine based on this corpus searches 4 times the number of sites covered by Google.

The Internet Archive does all this on a budget of about $4M or $5M each year. I don’t know about you, but this leaves me breathless.

In order to preserve this growing corpus (libraries, Brewster notes, traditionally burn eventually), the IA seeks out partners around the world who can host copies of the data. The more different they are from the US the better. Right now a copy is held at the new library in Alexandria and negotiations are under way with a Northern European country. Brewster estimates that the resources needed to maintain a mirror of IA are a PB of disk (that’s petabyte), a GB of bandwidth, and $100M to set up an appropriate endowment for continued operation.

But if the “Universal Access to All Human Knowledge” goal articulated by Raj Reddy of the Million Book Project is too vast, and even the “All Published Knowledge Available to the Kid in Uganda” is a bit far out, how about something easy, asks Brewster. What if we just tried to attack what we already have every right to collect? Let’s go for “Public Access to the Public Domain.”

In the USA the public domain is pre-1923 publications. In fact, Brewster points out, with the aid of Mike Klezman’s (?) recently completed electronic version of the copyright registry, it is now easy to find out which material from 1923-1964 did not have its copyright renewed and is now also in the public domain. Let’s go get this material! His proposal: give the IA a book and $10 and the IA will return to you the book unharmed plus a digital copy. Will we accept the offer? Oh, and by the way, the IA is also happy to accept video and $15/hour for the conversion of that to digital format. Oh, and did I mention that the IA will also host the digital documents on their servers “forever”?

I think we should take Brewster up on this offer. How much material do we have in the University of Minnesota collections which we could part with for a bit to let the IA digitize and store it? We should seriously consider a project to pump this material and the limited dollars required to the IA as fast as we can. This is a crazy idea at a crazy price point. Let’s try to sink Brewster under our enthusiastic response! The great thing is, we probably won’t; he has not sunk yet.

P.S. Brewster also tossed off an idea about how to archive blogs in response to a question. His thought was that we should be able to subscribe to blog RSS feeds and simply archive everything we see announced via that mechanism. I wonder if we could auto-harvest RSS from UThink.
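To make the RSS idea a little more concrete, here is a rough sketch of the "archive everything the feed announces" loop. It is only an illustration under my own assumptions: the feed is plain RSS 2.0, items are identified by their guid (falling back to the link), and the `new_items` function and sample feed are hypothetical names of mine. It does not fetch anything over the network, and it is not how the IA or UThink actually work.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen):
    """Return items from an RSS 2.0 feed whose identifiers are not yet in `seen`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        # Prefer the guid when the feed supplies one; otherwise use the link.
        key = (item.findtext("guid") or link).strip()
        if not key or key in seen:
            continue
        seen.add(key)
        fresh.append({
            "id": key,
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "description": item.findtext("description") or "",
        })
    return fresh


# Hypothetical feed content, standing in for what a harvester would download.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>First post</title><link>http://example.org/1</link>
        <guid>http://example.org/1</guid><description>Hello</description></item>
  <item><title>Second post</title><link>http://example.org/2</link>
        <description>World</description></item>
</channel></rss>"""

archive_index = set()
first_pass = new_items(SAMPLE_FEED, archive_index)
second_pass = new_items(SAMPLE_FEED, archive_index)
print(len(first_pass), len(second_pass))
```

Running the same feed through twice shows the point of keeping an index of what has been seen: the second pass finds nothing new, so a harvester could poll on a schedule and store only fresh posts.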

I am concerned that the work of the Joint Committee of the Higher Education and Entertainment Communities may do more harm than good by legitimizing some role for higher ed in killing off P2P file sharing. I don’t think we have a role. I think this is a fight between the RIAA and MPAA and American society, and we will just get trampled in the middle. Still, a session updating us on the P2P issue at CNI was interesting. It is clear that EDUCAUSE is finding little workable technology to help satisfy industry demands (tools like Audible Magic and ICARUS are throwing out the legitimate baby with the illegal bathwater). Brewster Kahle was in the audience and asked us to please remember that the Internet Archive depends on P2P for distribution of its legitimate content. If we need an example of real life content dependent on P2P distribution, he welcomes us to point his way.

I am a pretty visual person and appreciate a well laid out graphical representation of an issue. I find one of the masters of our field to be Herbert Van de Sompel. I didn’t attend his session today on Federations of Institutional Repositories, but I see the handouts in the CNI packet and am struck again by what lean, direct, and illuminating illustrations he comes up with. I don’t know whether he makes this stuff up himself or employs some graphic talent on the back end, but his touch has been so consistent over the years in many contexts that I suspect the former. I hear many people laud the interface of SFX, few of whom realize just how much it is the vision of Herbert, who showed “rough” versions of SFX many years before it became a commercial product with virtually the same interface it still enjoys. If you want to see what I consider PowerPoint well-used, take a look at a presentation by Herbert some day.

By the way, his work on new roles for MPEG-21 & OAI & OpenURL in federating repositories is quite interesting, thinking way outside the box. Take a look at the D-Lib article he and a few colleagues wrote for a taste.

After lunch a few of us retired to a quieter corner of the hotel to discuss whether it would be worth our time and effort to try to make LOCKSS more of a preservation tool. There was a clear consensus among this group that LOCKSS is not preservation today, and that the project (though it claims a preservation role) is really not doing much (beyond its NSF grant attempt, anyway) to make accommodations in the software for preservation issues. These would include things like issue level manifests with metadata, file format recognition and metadata (perhaps via JHOVE, which I saw was announced today), or picking up formats other than HTML (maybe an OAI harvest of metadata followed by a harvest of the related deeper-web items). Right now LOCKSS is, in essence, a “bit store”; it is a backup mechanism. In some ways, building up LOCKSS installations might also remove some of the wins the system brings in terms of ease of setup and maintenance.

An interesting experiment might be to use the WayBackMachine to figure out how many of the current Humanities titles are captured in the Internet Archive.
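A sketch of how that experiment might be tallied, with loud caveats. I am assuming the Internet Archive's availability lookup at `archive.org/wayback/available`, which as far as I know answers with JSON containing an `archived_snapshots` object whose `closest` entry is present when a capture exists. The function names and the two sample responses below are made up for illustration, and nothing here actually calls the service.

```python
import json
from urllib.parse import urlencode

# Assumed lookup endpoint; check the IA's own documentation before relying on it.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(title_url):
    """Build the lookup URL for one journal title's home page."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": title_url})


def is_captured(response_text):
    """True when the (assumed) JSON response reports an available snapshot."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def coverage(responses):
    """Given {title URL: response text}, return (captured count, total count)."""
    captured = [url for url, body in responses.items() if is_captured(body)]
    return len(captured), len(responses)


# Illustrative responses only, not real data.
sample_responses = {
    "example.org/journal-a": json.dumps({"archived_snapshots": {"closest": {
        "available": True, "url": "http://web.archive.org/web/2004/http://example.org/journal-a"}}}),
    "example.org/journal-b": json.dumps({"archived_snapshots": {}}),
}

captured, total = coverage(sample_responses)
print(f"{captured} of {total} titles have a capture")
```

A real run would fetch `availability_url(...)` for each title on the list and feed the response bodies to `coverage`; a title's home page being captured says nothing about whether its articles were, so this would only be a first cut.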

A group funded by the Mellon Foundation is trying to define the bounds of interaction between course management systems (CMS) and repositories. Their report should be available on the DLF web site by the end of May. In today’s presentation to CNI they made three fundamental points: (1) users will be getting to repository content through a broad set of “course management” tools that extend well beyond CMS into PowerPoint, Weblogs, Citation Managers and the like; (2) repositories need to attend to a Checklist of requirements and desirables in order to interoperate with this layer of tools; and (3) the process used to build course content can be expressed as “Gather-Create-Share”.

This “Gather-Create-Share” seems like a weak echo of Apple’s “Rip, Mix, and Burn” campaign a few years ago. It is also the process that Lessig warns us is under threat given the intellectual property regime our country is putting into force. The session really didn’t touch on the impediments that copyright puts in the way of the “Gather” step, but I was told that IP issues will be part of the Checklist when the group reports out to the DLF.

Another mention of Chandler and its higher-ed alter-ego Westwood; this is something I should pay some attention to. Chandler is an open source personal information management tool under development.

One of the commitments the IT Council has made is to revise our Libraries privacy policy before the next school year begins. An overcommitted Sue Hallgren is leading this effort for the IT Council. I attended a CNI session on Security and Privacy to try to gather info for Sue and the small group working with her on this. SPEC Kits 277 and 278 came up again in this context, and 278 looks particularly helpful for developing a refreshed policy. A few tools were also mentioned that we might want to peek at, though I’m not sure any of them are actually appropriate for our context. Check out the privacy proxy they mentioned, and the public workstation privacy info and tool mentioned.

Random thought… Could we cut off much of the unwanted workstation traffic by limiting Public Browser in a new way? What would happen if public browser refused to allow more than 100 characters in any text field of any form? Would that be enough to kill use for email, but still allow research use?
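To think the rule through, here is a toy version of the check. It is purely a sketch: I have no idea how Public Browser handles forms internally, so the function names and the sample form data are hypothetical, and the only thing taken from the idea above is the 100-character ceiling.

```python
MAX_FIELD_CHARS = 100  # the ceiling proposed above


def oversized_fields(form_fields, limit=MAX_FIELD_CHARS):
    """Return the names of text fields whose contents exceed the limit."""
    return sorted(name for name, value in form_fields.items() if len(value) > limit)


def allow_submission(form_fields, limit=MAX_FIELD_CHARS):
    """A form is allowed only if every text field is within the limit."""
    return not oversized_fields(form_fields, limit)


# Hypothetical examples: a catalog search versus a webmail message.
catalog_search = {"query": "digital preservation LOCKSS", "index": "keyword"}
webmail_message = {"to": "friend@example.org", "subject": "hi", "body": "x" * 500}

print(allow_submission(catalog_search))    # short research query
print(allow_submission(webmail_message))   # long message body
print(oversized_fields(webmail_message))
```

The search form passes and the mail form is refused because of its body field. The open question is the middle ground: an interlibrary loan request or a long Boolean query could easily run past 100 characters too.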

Ralph Quarles (IU) and I found each other at the reception. Ralph has offered to help us evaluate our computer support and seek an appropriate model for future support. He noted that he is ready for some ongoing contact with his colleagues at other CIC institutions. The Library IT Directors have that kind of forum in the CIC, but staff at his level, those actually running technology support operations, really don’t have many opportunities to reach out to each other. I wonder if we should plan a day or two of professional “shoot the breeze” time at Minnesota for all the folks in these positions? We could do it as part of our investigative effort. This could both help this cohort build connections to one another and serve as a font of wisdom and warning for our own planning effort.

We ask our technology staff to do the seemingly impossible. Our staff is not nearly large enough to manage the kind of deployment we’ve got around the Libraries. How can only six staff manage 600 machines? On the other hand, could it be a failure of imagination? When I arrived at the U the first significant decision I made was to kill our attempt to use Sun Ray “appliance” computers to replace public workstations in the Libraries. We had good reasons for that decision, but the fundamental problem was staring us in the face then and remains at the core of our troubles in ITS: we cannot support a deployment of 600 Windows workstations with so few staff. Why can’t we change the rules? At MIT I watched an organization deploy and maintain thousands of workstations with fewer staff than we have available to us.

I believe we need to think outside the box. It may not be Sun Ray, but we must recognize our situation (a budget even more limited than it was in 2001) and devise creative solutions to meet our needs within those bounds. I am certain this means compromises, but not necessarily the ones that run our staff ragged without the reward of a computing infrastructure they can take pride in, tell the world about, and share with our community.

My frustration with our current situation expresses itself as a frustration with ugly machinery, and I do believe that computers should be in the process of fading out of sight, but that’s a red herring. My real frustration is that I’ve allowed our expectations to be diminished by accepting the limits we’ve imposed on ourselves. I wonder if we shouldn’t get the CIC equivalents of Directors of ITS together to share their frustrations and triumphs. We could certainly use some inspiration and, who knows, we might even be able to do something inspiring ourselves!

Reagan Moore from the San Diego Supercomputer Center discussed their SRB development. A very dense presentation left me with the basic impression that I need to understand this approach to data storage being developed as part of the NSF grid infrastructure. SRB “provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets.” An alternative to LOCKSS? I did hear from one colleague who has already been reviewing SRB that it is a very complex bit of software to install and maintain.

