Thursday, November 15, 2007

Questions for the presenters -- Robin Chandler and John Kunze

Question: OCR -- what's the success of OCR, error-wise? What kind of editing do you have to do?

JK: We looked at the degradation of OCR over time vs. compression -- but he doesn't have any data on the average error rate per book. (An idea: have library school students go through and correct pages as part of learning about OCR.)*
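A side note, not something presented in the session: one common way to put a number on "error-wise" is the character error rate, i.e. the edit distance between the OCR output and a hand-corrected transcription, divided by the length of the correct text. Below is a minimal sketch, assuming you have a corrected page to compare against. The function names and sample strings are made up for illustration and are not from any of the projects discussed.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: two characters misread in a 15-character phrase.
    corrected = "digitized books"
    scanned = "digitizcd bo0ks"
    print(f"CER: {character_error_rate(corrected, scanned):.3f}")
```

Averaging this over a sample of hand-corrected pages would give a rough per-book figure, which is the kind of data JK said is not available yet.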

Question: Languages -- apparently there are certain languages that don't OCR well?

A: German and CJK (Chinese, Japanese, Korean scripts) and Greek are problematic -- but Google etc. don't have the tools to index these scripts either. There's a product called ABBYY FineReader (?) that the IA is using for Russian.

Question: Google faced controversy for being Western-based and U.S.-centric -- but the other day BHG was trying to find an Indian publication (in English) that wasn't available anywhere. Is there a push to move digitization beyond just the libraries we have heard about and into collections we couldn't get any other way?

Answer: Yes -- Google especially has moved into Europe and Japan, but not India yet. RC thinks they are pushing to get out more.

Q: is there room for bibliographers to give guidance to Google? Can we say, "do these, they're rare and public domain?"

A (RC): for our UC piece, the answer is yes

Michigan is the only library that has put content from the digitization up, and they have built a large rights database of who can use what

Q: are there any restrictions on our use of the project?

A (RC): Google has three different versions of what you can view -- title, snippet, full view

MS & OCA are only scanning material in the public domain

UC restrictions... we can't make copyrighted material available either -- PD material we can share (obviously) -- restricted in the percentage of public domain material we can share via Google (???)

content contract -- we can't allow the content to be "indexed or downloaded by a commercial service"

RC: Google is trying to follow all the laws in all the countries...

Q: who's doing the work on all these orphaned works to find out if they really are orphaned?

A: OCA, MS and Google are all interested in it, but OCA is doing a lot of the work

The Boston Library consortium has said that they will digitize things if someone requests them

Q: what is the speed of searching all these digitized books -- esp. if they put full text into WorldCat?

A: OCLC and Google have not finalized how they will put links to books into WorldCat.

Q: will this material be accessible to anyone, or will it only be accessible through a proxy server (e.g. if I'm helping a non-UC student)?

A (RC): in the pilot, when there's a link from the record to the item at Google or MS, anyone who can get into the catalog can see that book -- subject to copyright restrictions

What about copyrighted works that we own? Not addressed.

Q: limitations that we've agreed to in our contract -- are there any restrictions?

A (RC): yes. what do we want to do?
we have not agreed to restrictions beyond the copyright issues etc.

Q: Can they copyright the digital form that they have made of our books?

A: (RC) -- I don't think so...

Q: who is checking the quality of the digitized works? What are they doing?

A: (RC) -- Google puts a lot of effort into checking quality and quality algorithms. JK: At CDL we do a bunch of format checking to make sure the files are well formed. RC: we don't have the staff at CDL to go through every page, beyond the files... but the vendors are interested in getting error reports from individuals, though there's some question of how to do that and how to do rescans/insert pages, etc. Might fall on the library to help correct errors...

BHG: Google is no longer double-scanning, they're comfortable with the number of errors they are getting now.

Q) how are books scanned?
RC: in both processes, the scanning is manual -- it's people turning pages. But the processes are different... the artificial intelligence part comes in at error correction.

Google has several scanning centers around the country, but they are not outsourced. They are not using automatic machines.

q) what about fold-out maps, and other rare materials?

RC) None of the projects can do folios/large format. MS & IA -- when they scan, they have been skipping books with fold-out maps (but tracking them on lists). We've been working with MS/IA, telling them we'd like to get those foldouts scanned -- so IA has been working on getting them done in the future through an elaborate process.

Google is scanning the book, but not scanning the foldout.

Q: Is there a problem with mislaying titles?

A: (RC) -- They have not lost any books so far... with Google, there's one that has gone missing recently, and apparently that's the only book they have lost in all their 27 sites. They have "shipment reconciliation statements" from NRLF to Google -- they're on it all the time.

Q: do you have any interest in or pressure from faculty on what gets done?

A: (RC) -- Honestly, we haven't done enough to really engage the faculty yet. Re the access piece -- we haven't really done much; Google has been talking to digital humanities centers around the country, which is great, but that's our job as well.

Q: can you give us a preview of your digitization feedback group and what you plan to be doing, and how you plan to get feedback?

A: (RC) -- the challenge will be that any proposal that comes forward needs to have been signed off by a UL on campus etc etc.

* As a fairly recent grad, I have to say this doesn't sound like much fun...
