AKA the computer science perspective. Kunze is discussing some of the technical issues relating to digitization, which I find really interesting.
For instance, there is the problem of how you transfer all this data across the network. Lots of transfer tools were tested -- what works is parallelism -- so the practical solution is to combine parallelism with common tools, e.g. run 20 SCP transfers at once. That is how they can move millions of files.
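The talk doesn't name the exact tooling, so here is just a minimal sketch of the idea -- fanning transfers out over 20 plain scp processes -- with a hypothetical destination host and file list standing in for the real ones:

```python
# Sketch of "common tools + parallelism": run up to 20 scp processes at once.
# The host, file paths, and worker count are placeholders, not from the talk.
import subprocess
from concurrent.futures import ThreadPoolExecutor

DEST = "archive-host:/data/incoming/"                      # hypothetical destination
FILES = ["scans/page-%05d.tif" % i for i in range(1000)]   # hypothetical file list

def push(path):
    # Each worker just shells out to ordinary scp -- no special transfer tool.
    return subprocess.run(["scp", "-q", path, DEST], check=False).returncode

with ThreadPoolExecutor(max_workers=20) as pool:
    failed = [f for f, rc in zip(FILES, pool.map(push, FILES)) if rc != 0]

print("%d transfers failed" % len(failed))
```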
Now: how to make the files smaller?
This requires a discussion of what mass digitization is -- mass digitization is, for us, not intended to replace the physical form.
For millions of files, we need to strike a balance between file size and the quality of the reading experience -- AND the images need to work with OCR.
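The talk doesn't give their actual settings; as a rough illustration of the size/quality trade-off, here is how you might derive a smaller access copy from an archival master with Pillow (the halving of resolution and the JPEG quality value are made-up example numbers):

```python
# Illustrative only: shrink a master page scan into a smaller access copy.
# Resolution and quality values are placeholders, not the speaker's choices.
from PIL import Image

master = Image.open("page-00001.tif")          # hypothetical archival master
w, h = master.size
access = master.resize((w // 2, h // 2))       # e.g. halve the linear resolution
access.convert("RGB").save("page-00001.jpg", quality=40, optimize=True)
```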
There are also lots of technical problems with getting the OCR to work: problems with two-column pages, pages where the ink is too heavy or too light -- coarse half-tones are problematic.
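The talk doesn't describe their OCR pipeline; one common workaround for too-heavy or too-light ink is to binarize the scan with an explicit threshold before running OCR. A minimal sketch with Pillow and pytesseract (the threshold of 160 is an arbitrary example, not a value from the talk):

```python
# Binarize a scan before OCR so faint or heavy ink lands closer to
# clean black-on-white. The threshold value is an arbitrary example.
from PIL import Image
import pytesseract

page = Image.open("page-00002.tif").convert("L")            # grayscale
bw = page.point(lambda p: 255 if p > 160 else 0, mode="1")  # hard threshold
text = pytesseract.image_to_string(bw)
print(text[:200])
```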
As an example from other media: the Swedish Archives are digitizing 8-track tapes and producing 42 terabytes of data per MONTH.
Given all this possible data... we need to think about disks.
* RAID -- redundant arrays of inexpensive disks -- the 1980s approach
* JBOD (Brewster Kahle's approach) -- "just a bunch of disks" (not fancy disks, but lots of them) -- the 1990s solution
* MAID -- massive arrays of idle disks -- today's approach
Lots of technology is coming out of the Internet Archive's examples/innovations, e.g. the ARC/WARC file format -- many files bundled into one file for speed and ease of handling.
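The talk only names the format; as an illustration of the "many files in one file" idea, here is a sketch using the warcio Python library (my choice of library, not mentioned in the talk) to write two hypothetical page scans into a single compressed WARC and read them back:

```python
# Bundle many files into one WARC and iterate over it again.
# The library choice and file names are mine, not from the talk.
from warcio.warcwriter import WARCWriter
from warcio.archiveiterator import ArchiveIterator

with open("pages.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for name in ["page-00001.jpg", "page-00002.jpg"]:      # hypothetical scans
        with open(name, "rb") as f:
            record = writer.create_warc_record(
                "file://%s" % name, "resource",
                payload=f, warc_content_type="image/jpeg")
            writer.write_record(record)

with open("pages.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        print(record.rec_headers.get_header("WARC-Target-URI"),
              len(record.content_stream().read()))
```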
Digitizing the digital... what to do with microfilm?
"Data desiccation" -- it's actually a very difficult problem to turn a book into plain-text because of all the formatting. CF Project Gutenberg.