

From Wikimania


THE AUDIENCE packs the room for the session. BREWSTER KAHLE stands at the front of the room before a projected set of slides. SJ KLEIN introduces:

SJ KLEIN: Welcome to the second day keynote. Brewster's goal is surprisingly close -- almost word-for-word -- to the goals of the Wikimedia Foundation. He's built many key tools, including the Internet Archive, the Million Book Project, and Alexa Internet. He's recently been thinking about what it means to build the digital library of the future in both tools and law, to make a comprehensive list of the knowledge that's out there and get it to every single person. Brewster Kahle.

BREWSTER KAHLE: This group is doing something really big. I think these talks are about trying to understand what it is and what our role is in it. What I'm really interested in is the connection between open source, open content, and technical non-profits. So maybe you can help me figure it out.

The grand goal

Marvin Minsky taught me to pick a nice big goal so you can work on it for your life and keep moving along. It's nice if you can't achieve your goal, because what do you do if you achieve it? So here's mine: Universal Access to All Human Knowledge. Not going to be achieving that.

Look at the Alexa data: something big is going on. And the nice thing is that you can unreservedly feel good about it. There's no "but...". It signs up to this big enlightenment goal of all knowledge.

The structural problems and solutions

When I started working on this goal, I found the problem was a structural one, not a technical one: the idea of copyright as property. It's one of the worst ideas of the century, with the exception of the domino theory. "Information is knowledge, knowledge in that science sense that you build on others' work." But the star of the other show is Jack Valenti, who pushed through the 1976 radical extension of the US Copyright regulation. It was a travesty, but the repercussions took a while to come out.

The first casualty was software. When I was at MIT, we built software for the Lisp machine, but it was part of the ethic not to sign the code. But MIT claimed it as their property and sold it to a company called Symbolics. RMS and Richard Greenblatt tried to keep up with the commercial version and lost. RMS "blew a fuse" and came up with the idea of the GPL.

Another casualty was music and video, which Larry Lessig responded to with Creative Commons.

Another response is organizations around community efforts on the Internet, in large part because we lost places like MIT -- it was hard to do things in that structure without it being hurt by a collection of lawyers. So we built other organizations, like the Free Software Foundation. But there were others, like DejaNews, which was structured as a for-profit, got sold to Google, and disappeared. Then there was IMDB, a community project that got bought by Amazon.com. CDDB collects the track information for CDs on your computer; that became Gracenote. There was WAIS, a pre-Web publishing system I started, which turned into a company (it was the only thing we knew how to do) which we sold to AOL and basically disappeared. FTP Software sold to NetManage, Cygnus sold to Red Hat. My point is companies built around community efforts don't last very long.

The Apache Software Foundation is an incorporated structure to work on free projects. The Open Source Applications Foundation, funded by foundations. Mozilla, funded by Google and making so much money they spun off a for-profit. Linux has a foundation. The Internet Archive -- 10 years, last year was $5M, this year will be $8M. Built around the open access model, pay for the administration to make everything available. The Wikimedia Foundation.

It seems like there's something going on with the technical non-profits. What does that mean? Let's watch it. In 10 or 15 years, we'll see what happens. I think we may have gone wrong with the over-corporatization after WWII. Small companies from centuries ago were more thin-skinned, more interesting.

Now there are also supporting organizations for these open systems: EFF, Public Knowledge, Open Content Alliance. These organizations are like infrastructure to serve us. The Internet isn't new, but we screwed up the legal structure and we need these organizations to reform it.

Then there's open hardware. We, for example, built the Petabox, a cheap petabyte-size machine. We couldn't find one, so we built it and open sourced all the tools. Much more important is the $100 laptop, with interest on the order of 5-10M. What happens when the next major laptop company is a non-profit? Mary Lou Jepsen, the CTO, says they succeed because governments trust the non-profit.

Solving the grand goal: a progress report

So now that we have these structures in place to respond to knowledge-as-property, we can get back to our grand goal. Now I'm on ground I know a little more about: how are we doing? I'm going to go down the types on the Amazon website: books, music, movies, software.

Text: We can't make all text available for free -- some of it will cost money -- but we can make it all available. Let's do everything -- how big is it? There are 26-28M books in the Library of Congress, and the text of each is about 1MB. 26M books is 26TB, which fits in a $60K Linux box the size of the table. The whole Library of Congress, spinning on the Net. The cost of a house.

So how do you get it there? What do you do with it once you have it? Well, digitized you can do searches and try different interfaces and so on. But I don't really want to read a book on a screen -- I like books. So what if we can do electronic printing at the last possible minute? So we said, "how hard can it be?" We made a Bookmobile -- a van with a satellite dish. We pull out a table and print, bind, and cut books right there. A book costs a penny a page -- a buck. Now we're in India, Egypt, Uganda -- bringing books to rural areas. So we're getting close.

But we needed to scan more books. So how do we do that? One way is to send them to India. We bought 100K books for the Million Books Project and it worked great, but other people weren't into it -- it didn't make them feel good. So the Indians started scanning their own books. And that's probably right. Similar projects in China and Egypt. So what we found was: put the scanners next to the books; shipping the books away isn't worth the savings.

So how much does it cost to scan it here? $10 in India, $30 in the US (assuming a 300pg book). We tried the robots but we ended up building our own system with two digital cameras that works without breaking the binding. So we have these scanners in four libraries now. Scanning about 400 books a day now. So we've got the tech to work. 26M books at $30 a book is about $750M -- a one-time shot and we have the Library of Congress, for everybody. For comparison: the LoC has a $500M/yr budget, the entire library system is $12B. Only 10%, that's pretty good. It could be corporate or private, but the technology is there.
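(Editorial aside: the back-of-envelope figures quoted above can be checked with a few lines of Python. This is only a sketch under stated assumptions: "about 1M" per book is read as roughly 1 MB of text, the low end of the 26-28M estimate is used, and all function and constant names are made up for illustration. On these inputs the scanning cost comes out near $780M, a little above the roughly $750M quoted in the talk, and that is about 6.5% of the $12B library-system figure, so "only 10%" is best read as a generous rounding.)

```python
# Rough check of the numbers quoted in the talk. All values are
# approximations taken from the transcript; names are illustrative only.
BOOKS = 26_000_000                 # low end of the 26-28M estimate
BYTES_PER_BOOK = 1_000_000         # assumption: "about 1M" means ~1 MB of text
COST_PER_BOOK_US = 30              # dollars, assuming a 300-page book
BOOKS_PER_DAY = 400                # scanning rate quoted in the talk
LIBRARY_SYSTEM_BUDGET = 12_000_000_000  # dollars per year, as quoted


def storage_terabytes(books=BOOKS, bytes_per_book=BYTES_PER_BOOK):
    """Total text size in decimal terabytes."""
    return books * bytes_per_book / 1e12


def scanning_cost(books=BOOKS, cost_per_book=COST_PER_BOOK_US):
    """One-time cost in dollars to scan every book."""
    return books * cost_per_book


def years_at_rate(books=BOOKS, per_day=BOOKS_PER_DAY):
    """Years needed if scanning never sped up beyond the quoted rate."""
    return books / per_day / 365


if __name__ == "__main__":
    print(f"storage: {storage_terabytes():.0f} TB")
    print(f"scanning cost: ${scanning_cost() / 1e6:.0f}M")
    print(f"share of library budget: {scanning_cost() / LIBRARY_SYSTEM_BUDGET:.1%}")
    print(f"years at 400 books/day: {years_at_rate():.0f}")
```

(The last figure is the reason scale matters: at 400 books a day the job would take well over a century, so the $30-a-book estimate only helps if many more scanners are running in parallel.)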

Of course, there's still a legal problem of orphaned works, so we started a lawsuit: Kahle v. Ashcroft (now Gonzales). We came up with the term "orphaned works" -- and since we framed the problem, we've almost already won. Valenti is brilliant at this -- artists v. pirates. So we channeled him and came up with orphaned works: who could be against the orphans? And it's working -- the copyright office asked what they should do about orphaned works and everyone responded: give the orphans a home! So then we need to move to out-of-print works, then in-print works. So books are doable. And through the Open Content Alliance, libraries, Yahoo! and Microsoft (Microsoft going open!).

Audio: ??M of published musical works, but you have to be careful about putting them online. Lots of people want to put their stuff online, but need bandwidth. So we partnered with one particular group (the rock-and-rollers!) who allowed bootlegging as long as there was no money involved, not even the cost of the tape. They tried to do it on the Internet, but it was technically tough. So we asked them if they wanted unlimited storage, unlimited bandwidth, forever for free. They said: We don't believe you. We said: Try us. They said: If you can do that, it would be our dream. Always a good sign! So we asked the band's permission, got lots of OKs. We now have over 2000 rock-and-roll bands, 1000s of recordings, 2800 just of the Grateful Dead. So it's working. And we're finding a role for technical non-profits to make it free to give something away. In this country, giving away is free--except on the Internet. If you get /.ed, it could be a nightmare -- you could lose your house from the ISP bills. We need more than just licenses, we need institutions to take care of the commons. Now the European Archive is digitizing LPs in Europe of classical music for free, announcement at the end of September. It's within our grasp -- $10/disc. All published music? It's doable.

Moving Images: Larger files, but fewer of them. Look at theatrical releases, it's only 200K films. We can do all of it! There are a lot of archival films that never got wide distribution; we now have 30K on the archive (dwarfed completely by YouTube -- an awesome service) and we're finding whole genres we didn't even know about, like Lego films. There are political events, lectures (usually end up in closets), etc. So we digitize them -- we have a guy who does it for $15 per video-hour. You guys have videos -- let's get them online!

Then there's television. We think there are 400 channels, we have 20 of them, 24 hours a day at DVD quality. Got about a petabyte now. Russian, Chinese, Japanese, Iraq, Al-Jazeera, ABC, CBS, CNN, Fox, etc. It's unfortunately mostly all unavailable. I remember Noam Chomsky saying you had to read 7 papers in the morning because papers had a point of view. I thought he was nuts -- it'd get in my corn flakes! But now I see he's right. After 9/11 CNN said Palestinians were dancing in the streets. Really? Let's look at Palestinian television. Let's do it all.

Software: 50K software titles, almost all rotting, got a DMCA exemption to be able to archive it. But we need a lot of help.

Web: Our best-known collection. We take a full snapshot every 2 months. It's growing...fast. But the disks have been growing too. It's about a petabyte. For preservation, we made a copy in the new Library of Alexandria (libraries burn...usually by governments). Working on one in Amsterdam. Try to do it in mutual swap agreements--back and forth. The Wayback Machine gets about 100 hits/sec. We find that most people are looking for their own stuff; people are reasonably excited.

Next Steps

It's possible to do it. There are open questions: public or private, profits or nonprofits. Google's project is quite possibly our only shot at digitizing Harvard, and it's going to one company, with lots of bizarre restrictions.

So let's go back to stuff I don't know much about. Let me advertise some stuff we need help with.

Build open networks. The Internet is fantastic, but it's being run by fewer and fewer telecom companies. What if we built open networks run by open organizations? Like community wireless. We're seeing the issue of net neutrality, since the telecom companies are back to their old tricks. Do we try to stop them with regulation or do we try to build a better system? One way is distributed ownership. People own their own PCs now, then their own routers, but we still connect through commercial links. What if we own those too? So there was an experiment in SF (SFlan) -- a rooftop hopping network. Just an idea.

Open and transparent Web search. Let's let the commercial guys do the ad thing, but let's build some alternatives. There are just too few search engines and they're not creative enough. We tried something with time-based search on the whole Web archive--done by just one woman and indexed more pages than Google. Unfortunately, we didn't have a business model so she went to work for Google. (Hopefully she'll be back when she cashes her stock options in.) There's also Nutch.

Privacy and anonymity. Let's solve this, guys, it's not that hard. It's now known that the US Government is watching all our traffic. We have a long history of anonymous publishing, let's take advantage of it. One example is Tor.

Patents. What about a GPL for patents? The Defensive Patent License (DPL) is being worked on by the EFF and USC Law Clinic. It lets organizations pool patents so their patents really are defensive.

Open textbooks. People say they need textbooks. There's a Wikipedia project, maybe it needs more help, maybe it needs seed, but it's a big issue and we need to nail it. Textbooks are the #1 request worldwide.

Add attribution to Wikipedia. Please. (some applause) Gutenberg should know what edition their books are and we should know where the facts on Wikipedia come from. Like Ted Nelson's ideas of transclusion. I was with Richard Feynman in 1982 and we were looking at the new Encyclopedia Britannica with the Propaedia and Micropaedia and we calculated that at five more levels it'd go all the way down to the Library of Congress. So just a request while I'm here.

Open library. We're scanning lots of books and we need to make them relevant. But what if we could annotate them: why is this book interesting? A new kind of catalog, a living one. Why you should care about this book.

The grand goal

The structures are there. The technical ability is there. We can pull this off. Universal access to all knowledge can be one of our greatest achievements, up there with the man on the moon in the mythology of humanity. Thanks.


Q: What do you do about rotting information?

A: Scan it, put it online. It's possible to do the whole thing, so let's actually do it. What we do is make copies in different locations. We do lose data -- but the only thing we know is keep 2 copies and toss the machines every 3 years. It's tough.

Q: How do you move to 2-way communication?

A: You're doing more than I ever dreamed of, including large-scale participation. We're trying to put up university lectures from around the world. We need to make it OK to put stuff somewhere else as long as it preserves attribution.

Q: Engineering societies submit papers, for free, to IEEE, who turns around to charge us.

A: Public Library of Science. I like the ideas of science but we let a few orgs take control of it. It's very painful. We're having to build whole new systems to replace them; it's terrible to have to repeat ourselves. So let's do it right next time.

Q: Can we help your storage challenge with something like Seti@Home?

A: Please, please do! The SETI guys offered the computers. The F451 idea -- everyone keeps a book -- makes a heck of a lot of sense. We're just not clever enough to be able to pull it off.

Q: So what? You have universal access. History is driven by its use. We had the enlightenment, but we're still on the brink of destroying ourselves because it's not being used right.

A: Wow. So your question isn't "so what?" but more "is this really going to help? is it going to hurt?" Is it going to help? It's a guess. I take it almost as a given. I think it's embedded in the enlightenment goals. You can ask whether it hurts -- but I take it as a given. Isn't that a wimp-out answer? (audience member: it's an honest answer.)

Q: What about the developing world? They missed the industrial revolution, we can't let them miss the information revolution.

A: "Developing" is the wrong way to look at it. Information is like biodiversity. Diversity of ideas, stories, languages, are not in the US. We can all sing the Gilligan's Island theme song. We made a small, dumb amount of knowledge universal. So let's do everything we can to make everything universal.

Q (continued): What about network access?

A: Yeah, that's broken. But we've come a long way. Maybe solar-powered wireless units you can throw out of airplanes and just have them repeat. Maybe we can get someplace.

Q: What about people who don't have the tools of access?

A: First, realize they're worth accessing. Second, just do it. $100 laptop. $30 book. But I've never been someplace so obscure I've been more than one day's walk from an Internet cafe -- they're everywhere. Maybe they're too expensive, but they could be the next generation of libraries.

Q: A lot of the info you talk about is public, but a lot of what we have from the past is private. Unless you're counting on the NSA to give you access to people's email, that all might get lost? What if you could BCC your email to a bot that strips out the identifiable names so we can have an archive of communication.

A: Yes. And we're losing the drafts people are writing on hard drives that will fail. We're overcollecting some and undercollecting others. Companies have document destruction policies, because they can be sued for holding on to information and have to produce it in discovery. We need to bring balance back, so it doesn't cost you to remember.

Q: Talk about peak oil, peak gas, the end of energy. What happens if the power goes out? What happens if an asteroid hits the Earth? Do you think about this kind of stuff? Just curious.

A: All the time, I live in California. One thing we technologists can do is to build tech to keep people off the roads -- let's make it productive to stay at home. Ambient video things, etc. to help the oil. As for the archive, I'm counting on problems. But not the idea of an Ice Age. The Long Now has been working more on that kind of thing, aiming at the archaeologist with an optical microscope in ten years.

Q: After 18 years there's finally a bill for a public intelligence agency. Can you say why it should not be owned by the spies instead of the diplomats?

A: Sounds good to me. What does intelligence look like in an era of data mining?

Q: Where monopolies? Where decentralized?

A: I come from small is beautiful, small is robust. We like to be replicated and have friendly competition. The key to longevity is use.


Note: This is not an exact transcript of the words said, but an impressionistic one of the ideas expressed.