April 2, 2012
In Preserving Our Digital Pre-History I nominated Jason Scott to be our generation's digital historian in residence. It looks like a few people must have agreed with me, because in March 2011, he officially became an archivist at the Internet Archive.
Jason recently invited me to visit the Internet Archive office in nearby San Francisco. The building alone is amazing; when you imagine the place where they store the entire freaking Internet, this enormous former Christian Science church seems … well, about right.
It's got a built-in evangelical aura of mission, with new and old computer equipment strewn like religious totems throughout.
Doesn't it look a bit like the place where we worship servers, with Jason Scott presiding over the invisible, omnipresent online flock? It's all that and so much more.
Maybe the religious context is appropriate, because I always thought the Internet Archive's mission – to create a permanent copy of every Internet page ever created, as it existed at the time – was audacious bordering on impossible. You'd need to be a true believer to even consider the possibility.
The Internet Archive is about the only ally we have in the fight against pernicious and pervasive linkrot all over the Internet. When I go back and review old Coding Horror blog entries I wrote in 2007, it's astonishing just how many of the links in those posts are now, after five years, gone. I've lost count of all the times I've used the Wayback Machine to retrieve historical Internet pages I once linked to that are now permanently offline – pages that would have otherwise been lost forever.
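For anyone who wants to automate that kind of rescue, here is a minimal sketch in Python. It assumes the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`) and its JSON response shape behave as I understand them; the function names `build_query`, `closest_snapshot`, and `lookup` are hypothetical helpers invented for illustration, not anything the Internet Archive ships.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the lookup URL for a page, optionally near a YYYYMMDD timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest archived snapshot URL out of a decoded JSON response.

    Returns None when the response lists no available snapshot.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timestamp: Optional[str] = None,
           timeout: float = 10.0) -> Optional[str]:
    """Ask the endpoint (over the network) for the snapshot closest to timestamp."""
    with urlopen(build_query(url, timestamp), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("example.com", "20070101")` should hand back an archived copy from around that date if one exists, or `None` if the page was never captured. Splitting the URL building and the response parsing away from the network call keeps both pieces easy to check without touching the network.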
The Internet Archive is a service so essential that its founding is bound to be looked back on with the fondness and respect that people now have for the public libraries seeded by Andrew Carnegie a century ago … Digitized information, especially on the Internet, has such rapid turnover these days that total loss is the norm. Civilization is developing severe amnesia as a result; indeed it may have become too amnesiac already to notice the problem properly. The Internet Archive is the beginning of a cure – the beginning of complete, detailed, accessible, searchable memory for society, and not just scholars this time, but everyone.
— Stewart Brand
Without the Internet Archive, the Internet would have no memory. As the world's foremost expert on backups I cannot emphasize enough how significant the Internet Archive is to the world, to any average citizen of the Internet who needs to source an old hyperlink. Yes, maybe it is just the world's largest and most open hard drive, but nobody else is doing this important work that I know of.
Let's Archive Atoms, Too
While what I wrote above is in no way untrue, it is only a small part of the Internet Archive's mission today. Where I always thought of the Internet Archive as, well, an archive of the bits on the Internet, they have long since broadened the scope of their efforts to include stuff made of filthy, dirty, nasty atoms. Stuff that was never on the Internet in the first place.
The Internet Archive isn't merely archiving the Internet any more; they are attempting to archive everything.
All of this, in addition to boring mundane stuff like taking snapshots of the entire Internet every so often. That's going to take, uh … a lot of hard drives. I snapped a picture of a giant pile of 3 TB drives waiting to be installed in one of the storage rooms.
The Internet Archive is a big organization now, with 30 employees in the main San Francisco office you're seeing above, and 200 staff all over the world. With a mission of such overwhelming scope and scale, they're going to need all the help they can get.
The Internet Archive Needs You
The Internet Archive is a non-profit organization, so you could certainly donate money. If your company does charitable donations and cares at all about the Internet, or free online access to human knowledge, I'd strongly encourage them to donate to the Internet Archive as well. I made sure that Stack Exchange donated every year.
But more than money, what the Internet Archive needs these days is … your stuff. I'll let Jason explain exactly what he's looking for:
I'm trying to acquire as much in the way of obscure video, obscure magazines, unusual pamphlets and printed items of a computer nature or even of things like sci-fi, zines – anything that wouldn't normally find itself inside most libraries. Hence my computer magazines collection – tens of thousands of issues in there. I'd love to get my hands on more.
Also as mentioned, I love, love, love shareware CDs. Those are the most bang for the buck with regards to data and history that I want to get my hands on.
Being the conscientious geeks that I know you are, I bet you have a collection of geeky stuff exactly like that somewhere in your home. If so, the best way you can help is to send it in as a contribution! Email firstname.lastname@example.org about what you have, and if you're worried about rejection, don't be:
There's seriously nothing we don't want. I don't question. I take it in, I put it in items. I am voracious. Omnivorous. I don't say no.
The Internet Archive has an impossible mission on an immense scale. It is an unprecedented kind of open source archiving, not driven by Google or Microsoft or some other commercial entity with ulterior motives, but a non-profit organization motivated by nothing more than the obvious common good of building a massive digital Library of Alexandria to preserve our history for future generations. Let's do our part to help support the important work of the Internet Archive in whatever way we can.
Posted by Jeff Atwood
Archive.org is one of the best sites I know from the internet.
Nevertheless, it raises a very good discussion: this way the internet never forgets... and there's some discussion about the troubles we have when we do something and it can (and will) always be remembered.
Don't you have the right to have your actions forgotten someday?
You've totally distracted me from this blog post now lol
It's a great resource, but it can be disappointing. So often it only collects say 1 or 2 pages from a site (repeatedly over time) and misses the rest (also repeatedly over time), even when the content is simple HTML and image links and should pose no obstacle (that is, not Flash, etc).
The format shifting and the change of reader imply some loss in the culture.
We see it with VHS to DVD (not all movies will be transferred), we will see it with DVD to Blu-ray, and then with full digital distribution.
The web has the same issues with the loss of rendering engines (a page rendered with Netscape will never look the same in modern browsers): all those Internet Explorer quirks will fade, and plugin content is nearly impossible to read (with the rise of WebGL, for example, VRML seems to have disappeared).
However I like to see my now defunct page back when I wanted to be a 3D graphics artist instead of a developer like I am today (not giving up the link, I was really bad at that time).
On a somewhat related, but totally unrelated, note, where can we find details about their hardware specs to accomplish this impossible task?
What about archiving software? I'm not sure if anyone has taken this task up, making copies of either source code or binaries.
Irony: Your third link -- to textfiles.com -- is currently returning a 500 error.
Interesting that they use retail packaged USB3 externals.
Is there some reasoning behind that?
The ASCII.TEXTFILES.COM weblog is currently down for the count due to a hardware failure. I appreciate the irony too. Machine will be back "later".
The retail packaged USB3 externals are because the usual supplier of disk drives is subject to the same extortionate prices due to the Thailand floods affecting a lot of drive purchases, but bulk buys of the USB3 externals are, believe it or not, currently cheaper. That will change and I'm sure the Internet Archive will move back to the more intuitive drives when the price comes down.
Never in my life have I seen that many individually wrapped drives. In fact I don't think I've ever seen a hard drive wrapped, ever, in anything except a static bag.
Haven't heard of OEM orders? I'm sure for quantities that large they'd oblige!
I restate again, Mr. Henderson: The current situation of drive costs due to the shortage related to Thailand floods is that the price of OEM drives has skyrocketed, often tripling or worse the price, as well as severely cutting back the ability to order any OEM drives at all. As a result of study, the Internet Archive found the external drives are currently cheaper than OEM drives, and are currently using piles of these drives for the need of the archive (an average of three drives a day have to be RMA'd). When the economic/supply issue is fixed, I'm sure the Archive will return to the method and approaches you are more familiar with.
That is just awesome.. I am not sure if it is back in Georgia, but I have a 101 shareware games CD from the 90s somewhere. I might send that in if I can find it somewhere in my spindles.
It scares me to think how much data is "created" on the internet everyday. Is there a pipeline big enough to shovel all that to the internet archive, and how much could that possibly cost to handle the download of all the information? My mind is spinning just thinking about it.
I love the Internet Archive. Besides what Jeff said, it once saved my ass from a lawsuit for plagiarism: I was accused of copying an article from a magazine (one that I had written), but the Internet Archive helped me prove that said content was present in my website before the magazine even existed - the tables were turned and I got the upper hand in this. In the end, the lawsuit never materialized.
I have a hypothesis that the Internet Archive will make my personal web pages accessible hundreds of years from now. I'm testing this with letters I'm writing to my descendants. The one to my grandchildren is here: http://www.leppik.net/david/7gen/1_OtherGrandchildren.html and a little context is here: http://www.leppik.net/david/blog/?p=292
One of the issues is writing a program that will run correctly for the first time 50 to 200 years from now. That's so that my letter is scrambled to discourage casual reading by unintended recipients. The intended recipients might not be technically savvy, so uncompiled C is out. My guess is that today's ECMAScript will still run in 200 years. My reasoning is at: http://www.leppik.net/david/blog/?p=208
I should add, by way of context, that my children are still small enough for me to carry, so the letter to my grandchildren is to as-yet-completely-hypothetical grandchildren, so it shouldn't get read for another 40 years.
I'm pretty sure today's ECMAScript will still be readable in 40 years. After all, it took over 10 years for browser makers to fully implement CSS.
The internet archive is an incredible resource. It's a shame there isn't a search facility because I'd love to search out all the Corewar / Programming Game material that has long since disappeared from the net.
Wow, I always dreamed of going there myself one day!
I use Internet Archive daily, especially for genealogy purposes, so many websites were created about family trees ten years ago and most of them are "all gone" (from google and bing of course) but they existed, and the information they contain as well!
I have a question: when I was in library school, we had a full-week class on web archiving, digital archives and preservation of hardware (disk drives, diskettes, etc). We heard of a museum of "hardware" in the US, where they keep, restore and preserve old computers, OSes, software and diskettes (like that report your secretary mom typed using WordPerfect 3.1 back in the day). Is it tied to the Internet Archive program? I went through my old school notes this morning and can't find it anywhere... I think it was in Texas.
Wow, I really didn't know that the Internet Archive is archiving so much more than 'only' (haha) the internet. There is some really cool stuff out there.
Thanks for pointing that out!
>> Yes, maybe it is just the world's largest and most open hard drive, but nobody else is doing this important work that I know of.
Really, the Internet Archive is doing a great job, but they are only a part of the picture. Maybe the single largest part, but what has actually been going on for the past decade or so is that many (or should I say "most") countries with an Internet presence have been archiving their own countries' websites insofar as they can, as part of preserving their culture.
It's called Web harvesting and it has been a whole field of study in information science for quite some time, dealing not only with physical preservation, but also logical preservation (changing formats, emerging new formats to archive, multimedia, etc.).
The task is frequently associated with the National Library of the country (the Library of Congress has a somewhat similar role in the US), because their task is usually to preserve the cultural heritage of the nation: all the printed books, magazines, newspapers and other public materials. Preserving the public web is just an extension of this.
Unfortunately, the task is such an enormous one that most of the countries are very selective about which websites they preserve. In this sense Internet Archive actually is quite unique, since it is non-discriminating. On the other hand, as far as my country is concerned, IA has been preserving about what... 1% of the websites here, I believe.
But yes, long-term digital preservation has been a topic in libraries (especially national ones), information science and computer science for quite some time now, and there are lots of interesting issues at stake. For example, if we compare this sort of "national" or "global" memory with human memory, we must not forget (forgive the pun) that it is quite important for humans to be able to forget things sometimes. In national and global terms, this could be rephrased as: maybe information should only be remembered while there is somebody to whom it is important enough to preserve, so he does it himself.
A very broad topic, anyway, but for the list of web archiving, you can also check http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives and related info.
Great post! I didn't know their office is right in SF! I wonder if they're open to the public. It does look like a church (because it was) and it's hilarious.