Thursday, November 15, 2007

John Kunze: Preservation and Mass Digitization cage match

AKA the computer science perspective. Kunze is discussing some of the technical issues relating to digitization, which I find really interesting.

For instance, there is the problem of how you transfer all this data across the network. Lots of transfer tools were tested -- but parallelism is what works -- so the practical solution is to combine parallelism with common tools, e.g. run scp 20x. This means they know how to move millions of files.
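Kunze didn't show any code for this, but the "parallelism + common tools" idea is simple enough to sketch -- something like the following, where the file list, destination host, and the 20-worker count are just placeholders:

```python
# Sketch: run plain scp transfers in parallel instead of one serial copy.
# The file list and destination here are hypothetical placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

FILES = ["scan_0001.tif", "scan_0002.tif", "scan_0003.tif"]
DEST = "archive@example.org:/data/incoming/"

def copy(path):
    # Each worker just shells out to ordinary scp -- no exotic transfer tool.
    return subprocess.run(["scp", path, DEST], check=False).returncode

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(copy, FILES))

print(f"{results.count(0)} of {len(FILES)} transfers succeeded")
```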

Now: how to make the files smaller?

This requires a discussion of what mass digitization is -- mass digitization is, for us, not intended to replace the physical form.

For millions of files, we need to strike a balance between size of the files and quality of the reading experience -- AND images need to work with OCR.
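He didn't name specific tools here; as a rough illustration of the trade-off, imagine a step like this, where the filenames, target width, and JPEG quality are all my assumptions, not his:

```python
# Sketch of the size-vs-quality trade-off: downsample and recompress a page
# scan while keeping enough resolution for OCR to have a chance.
from PIL import Image

def shrink_page(src="page_raw.tif", dst="page_web.jpg", max_width=1600, quality=50):
    img = Image.open(src).convert("L")      # grayscale is usually enough for text
    if img.width > max_width:               # cap the width, keep the aspect ratio
        ratio = max_width / img.width
        img = img.resize((max_width, int(img.height * ratio)))
    img.save(dst, "JPEG", quality=quality, optimize=True)

shrink_page()
```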

There are also lots of technical problems with getting the OCR to work: problems with two-column pages, pages where the ink is too heavy or too light -- and coarse half-tones are problematic.

For an example of other media storage, the Swedish Archives are digitizing 8-track tapes and producing 42 terabytes of data per MONTH.
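(Back of the envelope: 42 TB over a 30-day month works out to roughly 1.4 TB a day, or a sustained write rate on the order of 16 MB per second, around the clock.)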

Given all this possible data... we need to think about disks.

* RAID -- used in the 80s
* JBOD (Brewster Kahle's approach) -- "just a bunch of disks" (not fancy disks, but lots of them) -- 1990s solution
* MAID -- massive arrays of idle disks -- today's approach

Lots of technology is coming out of the Internet Archive's examples/innovations, e.g. the ARC/WARC file format -- many files in one file for speed and ease.
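The point, as I understood it: pack many small files into one big container with an index, so you aren't paying filesystem and network overhead per file. This toy sketch is not the actual ARC/WARC spec (real WARC records also carry per-record headers such as the source URI and timestamp), just an illustration of the idea:

```python
# Toy container: concatenate many small files into one, with a JSON index
# of offsets and lengths so individual members can still be read back.
import json, os

def pack(paths, container="bundle.dat", index="bundle.idx.json"):
    offsets = {}
    with open(container, "wb") as out:
        for p in paths:
            offsets[p] = {"offset": out.tell(), "length": os.path.getsize(p)}
            with open(p, "rb") as f:
                out.write(f.read())
    with open(index, "w") as f:
        json.dump(offsets, f)

def read_member(name, container="bundle.dat", index="bundle.idx.json"):
    with open(index) as f:
        entry = json.load(f)[name]
    with open(container, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```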

Digitizing the digital... what to do with microfilm?

"Data desiccation" -- it's actually a very difficult problem to turn a book into plain-text because of all the formatting. CF Project Gutenberg.
