Sunday, October 19, 2003

All these words on the web, lost like tears in the rain?

Given the propensity of the collective masses of "the internet" to pour their hearts out in digital form alone, I am led to wonder just how long their words will survive after they are published. We are all familiar with the "404 Not Found" errors that signify another website has bitten the dust. On occasion the content can be recovered via the great Google cache in the sky, or the Wayback Machine at archive.org. Other times the content never made it into any semblance of semi-persistence. For instance, this blog does not appear on archive.org because it is bound to an IP address that is shared with other domains, and the Wayback Machine appears to hit only individual IPs and not registered domains. Even though available IP addresses probably outnumber registered domains by about 100 to 1, it seems a large percentage of web sites, and hence their content, will still slip through the cracks of archival, as this one has.

Even if my site were to end up archived by one of these remote data suckers, what is to say that the archived data will survive for any significant period of time? While, as I have previously noted, the cost of hard drive storage is still dropping precipitously, there is no guarantee anyone will have the interest or the money to keep archiving new web content or preserving old, no longer valid content. Furthermore, even if I rely on my own personal archives - a backup drive and the occasional copy to CD-R - I know these copies will probably not last more than ten years due to "data rot" and hardware obsolescence, and significantly less in the event of my untimely demise. How many of you have uncovered an "old" 5 1/4" floppy disk and wondered just what you would need to do to get data from it? Often, even discovering whether the disk has anything useful on it would require sending it to an expensive data retrieval service. Then what will you do with the old WordStar, WordPerfect or Word 1.0 files on it? Take them to the Tech Museum to find a computer able to run the software to decode them, or employ some data archaeologist to reverse engineer the storage format? It should be clear that the persistence and utility of archived data require a continuous effort to keep it maintained in a useful format on a reasonably persistent medium that is still supported by current and affordable hardware.

So it occurred to me that what should really be happening is that someone should be archiving data by simply printing it onto paper. Good acid-free paper kept away from fire and prolonged waterlogging will usually last a few hundred years at least. Furthermore, it requires no special reading equipment, and is readily converted back to digital format by contemporary scanning equipment. Eventually it is quite possible that entire books could be scanned without even opening them (using X-ray techniques). Thus I'm tempted to start a web preservation campaign called "bits to books". It'll be bad for trees, but if it requires planting large numbers of trees, turning them into paper and not burning them, it will actually be a great way to suck large quantities of carbon dioxide out of the environment on a long-term basis and do some good. I haven't yet done the math, but I think with a small font, one still readable without optical assistance, you could probably get 20 KB of text on a single 8"x10" sheet of paper. I doubt I could generate more than a gigabyte of original text content in my lifetime. That would be a lot of keystrokes, considering the average person only gets 2 billion seconds or so on this planet and is never going to spend every waking moment typing! I've yet to work out how many pounds of paper that gigabyte translates to, and hence how many trees will be required to archive my lifetime of data. I suspect it's actually not very much at all, especially compared to the 750 pounds of paper the average American consumes per year for non-archival purposes. Counting all the wood pulp that goes into packaging, toilet paper and the other wood fibre based products people use brings the total to over 3,000 pounds per year. Suddenly archiving my data to paper looks quite practical, so long as I can find somewhere to store it!
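A rough sketch of that deferred math, in Python, using the figures quoted above (20 KB per 8"x10" sheet, one gigabyte of text, 2 billion seconds of life, 750 pounds of paper per year). The paper weight is an assumption that does not come from the post: ordinary 20 lb bond, where a 500-sheet ream of 17"x22" stock weighs 20 pounds. The gigabyte is taken as 10^9 bytes, 20 KB as 20,000 bytes, and all function and constant names are made up for illustration. Tree counts are left out because they depend on pulp yield figures the post does not give.

```python
import math

# Figures quoted in the post.
BYTES_PER_SIDE = 20_000            # "20k bytes" on one side of an 8"x10" sheet
LIFETIME_BYTES = 1_000_000_000     # one gigabyte, taken as 10**9 bytes
LIFETIME_SECONDS = 2_000_000_000   # "2 billion seconds or so"
ANNUAL_PAPER_USE_LB = 750          # average American paper consumption per year
SHEET_AREA_SQ_IN = 8 * 10

# Assumption (not from the post): 20 lb bond paper, i.e. a ream of
# 500 sheets cut to the 17"x22" basis size weighs 20 pounds.
BASIS_WEIGHT_LB = 20.0
BASIS_REAM_SHEETS = 500
BASIS_SHEET_AREA_SQ_IN = 17 * 22


def sheets_needed(total_bytes, bytes_per_side, sides_per_sheet=1):
    """Number of sheets required to print total_bytes of text."""
    return math.ceil(total_bytes / (bytes_per_side * sides_per_sheet))


def sheet_weight_lb(area_sq_in):
    """Weight of one sheet of the assumed stock, scaled by area."""
    lb_per_sq_in = BASIS_WEIGHT_LB / (BASIS_REAM_SHEETS * BASIS_SHEET_AREA_SQ_IN)
    return lb_per_sq_in * area_sq_in


def archive_weight_lb(total_bytes, sides_per_sheet=1):
    """Total weight in pounds of a printed archive of total_bytes."""
    sheets = sheets_needed(total_bytes, BYTES_PER_SIDE, sides_per_sheet)
    return sheets * sheet_weight_lb(SHEET_AREA_SQ_IN)


def keystrokes_per_second(total_bytes, seconds):
    """Average typing rate needed to produce total_bytes in the given time."""
    return total_bytes / seconds


if __name__ == "__main__":
    for sides in (1, 2):
        sheets = sheets_needed(LIFETIME_BYTES, BYTES_PER_SIDE, sides)
        weight = archive_weight_lb(LIFETIME_BYTES, sides)
        years = weight / ANNUAL_PAPER_USE_LB
        print(f"{sides}-sided: {sheets} sheets, {weight:.0f} lb, "
              f"{years:.2f} years of average paper use")
    rate = keystrokes_per_second(LIFETIME_BYTES, LIFETIME_SECONDS)
    print(f"typing rate over a whole lifetime: {rate} bytes per second")
```

Under those assumptions the gigabyte comes to 50,000 single-sided sheets weighing roughly 430 pounds, or about half that if printed on both sides, which is less than one year of the 750 pound average. Producing the gigabyte at all would mean averaging one keystroke every two seconds for an entire lifetime, which supports the guess that a gigabyte is a generous upper bound.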
