Thursday, November 15, 2007

dinner

Dinner, incidentally, was delicious. It was a Mexican buffet, with enchiladas, rice and beans, tortillas, chiles rellenos, churros and more. Pretty tasty. It was served in one of the conference rooms here in the library.

Afterwards, we went to explore Merced... more on that adventure later. Suffice it to say that I apparently cannot read a map.

the conference room where the assembly is



From Dana Peterman

questions for the presenters -- Robin Chandler and John Kunze

Question: OCR -- what's the success of OCR, error-wise? What kind of editing do you have to do?

JK: we looked at the degradation of OCR over time vs compression -- but we don't have any data on the average error rate per book. (An idea: have library school students go through and correct pages as part of learning about OCR).*

Question: Languages -- apparently there are certain languages that don't OCR well?

A: German and CJK (Chinese, Japanese, Korean scripts) and Greek are problematic -- but Google etc. don't have the tools to index these scripts either. There's a product called Abireader (?) that the IA is using for Russian.

Question: Google had controversy that they were western-based, U.S. centric -- but BHG was trying to find an Indian publication (in English) that wasn't available anywhere the other day. Is there a push to move digitization beyond just the libraries we have heard about and move it into collections we couldn't get any other way?

Answer: The answer is yes -- Google especially has moved into Europe and Japan, but not India yet. RC thinks they are pushing to get out more.

Q: is there room for bibliographers to give guidance to Google? Can we say, "do these, they're rare and public domain?"

A (RC): for our UC piece, the answer is yes

Michigan is the only library that has put content from the digitization up, and they have built a large rights database of who can use what

Q: are there any restrictions on our use of the project?

A (RC): Google has three different versions of what you can view -- title, snippet, full view

MS & OCA is only scanning material in public domain

UC restrictions... we can't make copyrighted material available either -- PD material we can share (obviously) -- restricted in the percentage of public domain material we can share via google (???)

content contract -- we can't allow the content to be "indexed or downloaded by a commercial service"

RC: Google is trying to follow all the laws in all the countries...

Q: who's doing the work on all these orphaned works to find out if they really are orphaned?

A: OCA, MS and Google are all interested in it, but OCA is doing a lot of the work

The Boston Library consortium has said that they will digitize things if someone requests them

Q: what is the speed of searching all these digitized books -- esp if they put full text into worldcat.

A: OCLC and Google have not finalized how they will put links to books into worldcat.

Q: will this material be accessible to anyone, or will it only be accessible through a proxy server (e.g. if I'm helping a non-UC student)


A (RC):
in the pilot, when there's a link from the record to the item at Google or MS, anyone who can get into the catalog can see that book -- restricted by copyright

what about copyrighted works that we own -- not addressed

Q: limitations that we've agreed to in our contract -- are there any restrictions

A (RC): yes. what do we want to do?
we have not agreed to restrictions beyond the copyright issues etc.

Q: Can they copyright the digital form that they have made of our books?

A: (RC) -- I don't think so...

Q: who is checking the quality of the digitized works? What are they doing?

A: (RC) - Google puts a lot of effort into checking quality and quality algorithms. JK: At CDL we do a bunch of format checking to make sure the files are well formed. RC: we don't have the staff at CDL to go through every page, beyond the files... but the vendors are interested in getting error reports from individuals, though there's some question of how to do that and how to do rescans/insert pages, etc. Might fall on the library to help correct errors...

BHG: Google is no longer double-scanning, they're comfortable with the number of errors they are getting now.

Q) how are books scanned?
RC: in both processes, the scanning is manual -- it's people turning pages. But the processes are different... the artificial intelligence part comes in at error correction.

Google has several scanning centers around the country, but they are not outsourced. They are not using automatic machines..

q) what about fold-out maps, and other rare materials?

RC) None of the projects can do folios/large format. MS & IA -- when they scan, they have been skipping books with fold-out maps (but tracking on those lists). We've been working with MS/IA, telling them we'd like to get those foldouts scanned -- so IA has been working on trying to get them done in future in an elaborate process.

Google is scanning the book, but not scanning the foldout.

Q: Is there a problem with mislaying titles?

A: (RC) -- They have not lost any books so far... with Google, there's one that has gone missing recently, and apparently that's the only book they have lost in all their 27 sites. They have "shipment reconciliation statements" from NRLF to Google -- they're on it all the time.

Q: do you have any interest in or pressure from faculty on what gets done?

A: (RC) -- Honestly, we haven't done enough to really engage the faculty yet. Re the access piece -- we haven't really done much; Google has been talking to digital humanities centers around the country, which is great, but that's also our job as well.

Q: can you give us a preview of your digitization feedback group and what you plan to be doing, and how you plan to get feedback?

A: (RC) -- the challenge will be that any proposal that comes forward needs to have been signed off by a UL on campus etc etc.

* As a fairly recent grad, I have to say this doesn't sound like much fun...

Questions about OCR

Quality of OCR, and what kind of editing is required? They looked at how OCR quality degrades: OCR performed well, with few errors, and then got worse with over-compression. It depends on the original book or item. There are no efforts under way right now for corrections. Perhaps we could assign library school students to correct pages 25-30 for homework, and over the years we'll get a lot more items corrected.

Google folks were talking about things that might be better than OCR. A lot of foreign language items are not coming out usable with OCR; what can we do? OCR does vary greatly by language, but these days results are getting better. We have to prioritize languages according to their usage by our patrons. German, Islamic languages & Greek do have a lot of problems. CJK is making a lot of progress. Abbey 8 is Google's OCR and it's in theory moving along pretty quickly. Google is working on that. They are commercial entities and they are looking at where they can make money, and so Google is very interested in Asia right now.

Google is expanding into Europe and Japan, bringing their collections into the fold, but India has not yet been approached. Google tried with France, but France is doing its own thing.

International Internet Preservation Consortium

referenced by Kunze:
http://netpreserve.org/about/index.php

John Kunze: Preservation and Mass Digitization cage match

AKA the computer science perspective. Kunze is discussing some of the technical issues relating to digitization, which I find really interesting.

For instance, there is the problem of how to transfer all this data across the network. Lots of transfer tools were tested -- but parallelism works -- so the practical solution is to combine parallelism with common tools -- e.g. run SCPI 20x. This means that they know how to move millions of files.
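
As a rough illustration of the parallelism idea, here is a minimal Python sketch. It is not the tooling Kunze described: it simply runs an ordinary copy function across 20 worker threads. The function name `copy_tree_parallel`, the default of 20 workers, and the use of `shutil.copy2` as a stand-in for a common transfer tool are all assumptions made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_parallel(src_dir, dst_dir, workers=20):
    """Copy every file under src_dir to dst_dir using `workers` parallel copies.

    Hypothetical helper: shutil.copy2 stands in for whatever ordinary
    transfer tool would really be used across the network.
    """
    src_root = Path(src_dir)
    dst_root = Path(dst_dir)
    files = [p for p in src_root.rglob("*") if p.is_file()]

    def copy_one(src_path):
        # Recreate the same relative path under the destination.
        target = dst_root / src_path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_path, target)
        return target

    # Each worker runs one plain copy at a time; the gain comes from
    # keeping many copies in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, files))
```

Calling `copy_tree_parallel("scans", "backup")` (both directory names hypothetical) returns the list of destination paths. Nothing about the copy tool itself is clever; the only change is how many copies run at the same time.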

Now: how to make the files smaller?

This requires a discussion of what mass digitization is -- mass digitization is, for us, not intended to replace the physical form.

For millions of files, we need to strike a balance between size of the files and quality of the reading experience -- AND images need to work with OCR.

There are also lots of technical problems with getting the OCR to work: problems with two-column pages, pages where the ink is too heavy, or where it's too light -- coarse half-tones are problematic.

For an example of other media storage, the Swedish Archives are digitizing 8-track tapes and producing 42 terabytes of data per MONTH.

Given all this possible data... we need to think about disks.

* RAID -- used in the 80s
* JBOD (Brewster Kahle's approach) "just a bunch of disks" (not fancy disks, but lots of them) - 1990s solution
* MAID -- massive arrays of idle disks -- today's approach

Lots of technology is coming out of the Internet Archive's examples/innovations, e.g. the W/ARC file format -- many files in one file for speed and ease
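
The "many files in one file" idea can be sketched in a few lines of Python. This is not the W/ARC format itself: the standard library's uncompressed tar container is swapped in purely to illustrate aggregation, and the function names `bundle_files` and `list_bundle` are hypothetical.

```python
import tarfile
from pathlib import Path


def bundle_files(paths, bundle_path):
    """Pack many small files into one container file (plain tar, not W/ARC)."""
    with tarfile.open(bundle_path, "w") as bundle:
        for path in paths:
            path = Path(path)
            # Store only the file name so the bundle does not depend on
            # where the files happened to live on disk.
            bundle.add(path, arcname=path.name)
    return Path(bundle_path)


def list_bundle(bundle_path):
    """Return the member names stored in the container, in the order added."""
    with tarfile.open(bundle_path, "r") as bundle:
        return bundle.getnames()
```

Moving or replicating one large container avoids paying the per-file overhead millions of times, which is the "speed and ease" point from the talk.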

Digitizing the digital... what to do with microfilm?

"Data desiccation" -- it's actually a very difficult problem to turn a book into plain-text because of all the formatting. CF Project Gutenberg.

Mass digitizations and preservations.

What's digital preservation:
Definition changes monthly, but basically it means storing digital objects while retaining a balance of usability and faithfulness to their creators' original intentions.

Policy challenges include:
  • how faithful do we have to be, how long, at what cost, how many replicas?
  • how much manipulation can the item tolerate? We have to manipulate when new technologies come out.
  • Rightsmare (copyright)
Technical challenges
  • Lots of files, lots of data can take months to move and replicate
  • explore data transfer and replication options
  • survey tool performance and usability
  • continuing conversations with the San Diego Supercomputer Center and the Library of Congress with goal of creating guidelines.
  • Making many files small - we need to learn how to make these files smaller so we can move them faster?
Why mass digitization?
For better access and search. Can act as a back up to safeguard against loss. It is NOT intended to replace the physical item.

Tradeoffs between size and quality. National Library of France, Harvard University Libraries and UC Berkeley did a lot of testing. Their findings include recommendations: the JPEG 2000 JP2 (ISO/IEC 15444-1) file format, and that an all color, all glossy solution is feasible. We can't forget audio/video. Now we need cheaper and still reliable disks to store all this data. One solution is to go to the aggregate W/ARC file format.

Espresso express

The Espresso Book Machine referenced in the last post is a printer that entire books can be printed on. The current ones are huge, but they are projected to become the size of a photocopier in the next generation. They can print out a book and perfect-bind it in six minutes. The Internet Archive has one that they've been printing some of the OCA digitized content on. It's one of the coolest things I've ever seen.

New York Public Library gets first Espresso Book Machine

While it looks like it's still a ways from setting up shop next to more traditional vending machines, those in New York City can now get their instant-book fix from the very first (non-beta) Espresso Book Machine, which has found a home in the New York Public Library's Science, Industry and Business Library. For the time being, most of the books on offer appear to be ones in the public domain, including over 200,000 titles from the Open Content Alliance database, which visitors to the library can print off books free of charge, the end result of which is supposedly "indistinguishable from the factory-made title."

Read more here: http://tinyurl.com/25u2jg

"jewels of the collection"

Chandler said that some of the jewels of the digitized collections include special collections, Bancroft Library, classic mathematics, children's books, cookbooks...

she showed some slides of some of these. Of course, there aren't links in our catalogs yet for these -- she suggested that when these come the project will become a little more "real" for all of us.

Chandler referenced this blog: http://landscape.blogspot.com/ which talks about the effect of google books on her work as a scholar.

About UC Merced



It's a lovely conference room here -- big table and comfy chairs, with two rows of chairs on either side.

Angela is impressed with the Library as well. It's really a beautiful space. We got a tour and walked past 2 aquariums, probably about 80 gallon freshwater tanks. We saw numerous large plasma televisions where images and announcements scroll. The instruction room is a beauty. There are empty desks which are equipped with laptops before each instruction session! We walked into the reading room and everyone just sighed, wishing we all had such a room in our own libraries! Another impressive feature is at the Info Desk. They have a projector behind the desk and it projects announcements about new services, items, and what not. It was really a brilliant idea. Sam Dunlap tested every chair, couch and other fun furniture! He gave his approval. He also helped out by shelving a few books that had been mis-shelved. Of course there are some problems, such as leaking over what would have been the special collections room. If you want to see what is a most welcoming library, come visit!!!

Angela will post pictures later on, so stay tuned!

re: digitization numbers

From Robin Chandler: NRLF is pulling 3000 books a day to go to Google for digitization, which means 3000 books a day are coming back as well -- so 6000 books a day are being touched. And that's just for Google... OCA is being more selective.

That's a lot of books, but still a drop in the bucket compared to what they want to do -- and what we have.
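
To put "a drop in the bucket" into numbers, here is a small back-of-the-envelope sketch in Python. The 3,000 books a day figure comes from Chandler; the 2.5 million books over 6 years is the Google projection reported elsewhere in these notes. The function names and the 250 pulling days per year used in the usage note are assumptions for illustration only.

```python
import math

BOOKS_PULLED_PER_DAY = 3000   # outbound to Google, per Chandler
PROJECTED_SCOPE = 2_500_000   # Google projection reported in these notes
PROJECT_YEARS = 6


def books_touched_per_day(outbound_per_day):
    """Each outbound book is matched by one returning book."""
    return outbound_per_day * 2


def days_to_reach(scope, per_day):
    """Whole days of pulling needed to send `scope` books at `per_day`."""
    return math.ceil(scope / per_day)


def required_daily_rate(scope, years, pulling_days_per_year):
    """Average pulls per day needed to finish `scope` within `years`."""
    return scope / (years * pulling_days_per_year)
```

At 3,000 books a day, `days_to_reach(PROJECTED_SCOPE, BOOKS_PULLED_PER_DAY)` gives 834 pulling days. With an assumed 250 pulling days a year, `required_daily_rate(PROJECTED_SCOPE, PROJECT_YEARS, 250)` works out to roughly 1,667 books a day; the talk did not say how many days a year books are actually pulled.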

Links to the two projects:
http://www.opencontentalliance.org/
http://books.google.com

The Fun Part - Mass Digitization

Presentation from two CDL Colleagues: Robin Chandler - Director of Data Acquisitions & John Kunze - Preservation Technologies Architect.

Robin will go first. Little tech glitch with microphone. Robin will talk loudly. Mass digitization at UC Libraries. It's been worked on since 2005. This will be a status report. We will get an environmental scan: impact, book discovery, user services & scholarly use studies.

Why are we doing this? Because we have a vision where people have the ability to discover and access books anywhere, anytime and essentially for free. The reality is that there is a funding opportunity we can take advantage of, with offers from Google and Microsoft. Cons include costs, and it is disruptive. It does allow us to explore new models for the library. Collection Management: digital reformatting can help support our efforts to build shared print collections. Curating through collaboration: digitization of local materials creates access to third party materials not currently available. Funding reallocation: what are MS and Google scanning, and what can we scan to complement their work?

Overview of the two projects:
Microsoft/Open Content Alliance (OCA) & Google Books

Libraries are supplying, curating and cataloging books. We provide bibliographic metadata. We also supply onsite scanning facilities and staff when appropriate. Google & Microsoft provide funding and manage the digitization vendor. Microsoft/OCA began production scanning in April 2006. Projected scope: 100,000 books per year. Pick list driven: limited to public domain. The scanning centers are at NRLF and SRLF.

Google began scanning in October 2006 and is currently scanning books from NRLF. Projected scope includes 2.5 million books during a 6 year period. Bulk pulling: public domain and in-copyright items. The scanning center is doing 3,000 books per day. Discussions are in place for expansion to UC Libraries - Phase one includes UCSC, UCSD and UCLA.

Only the UC was able to get the image coordinates in the contract with Google.

Process includes the following: select, retrieve, inspect, mass charge/physical charge, & physical transfer. Sharing bibliographic records - over 3,000 a day. Digitization includes creating content files - JPEG 2000, PDF, OCR, image coordinates. We then have to mass or manually discharge the books and return them to the shelves. Then download the content files to UC servers. Ingest content files into preservation repository. End result is access: UC/OCLC WorldCat Local Pilot Spring 2008 - OCLC eContent Synchronization enables links to mass dig books in UC/OCLC WorldCat Local pilot.

Mass Digitization Collection Advisory Group (MDCAG) has been assembled and is meeting regularly.

Some examples of what is being scanned: Special Collections (Bancroft & UCLA YRL) - American History, Children's literature & Oral histories. Mathematics - classic historical text.
For more info: Check out Phoebe's post on "jewels of the collection."

Environmental Scan: Impact, Book discovery, user services in development, scholarly use studies underway.
We are redefining collections for our users by leveraging the collections of other libraries. Interfaces to mass digitized collections: Internet Archive, Microsoft, Google. Web 2.0 services: Amazon, LibraryThing. Library networks: NCSU Libraries, WorldCat. These are all things we should think about in terms of discovery.

Check out Heather Christenson and Steve Toub's presentation on Book Discovery in a Mass Digitized Environment @ DLF 11/06/07. It's online at the DLF site. They cover the strengths and weaknesses.

Check out demo.openlibrary.org to see what books might look like online. They have faceted browsing of books. I spend so much time online with Facebook and MySpace and IMing and of course email and everything else, I'm probably legally blind by now if it were not for contacts and glasses. I'm only 30 years old. Will my kids be as blind as I am by age 10? (In case my parents are reading this, don't get your hopes up, I'm not pregnant and have no plans on getting pregnant in the near future, ie next 6 months).

Take a look at the North Carolina State University Libraries catalog. It looks ideal - subject genre vs topic, narrow by call number range, limit results to currently available items. Might this be what UC students will be able to do some day?

on the agenda ...

The LAUC president's report, from Bob Heyer-Gray; a report from Gary Lawrence, director of library planning and policy development, and committee reports (which were very brief). Then we start what Bob calls the "fun part" -- the program on mass and large scale digitization, with presentations from Robin Chandler and John Kunze from the CDL.

Taping

The meeting is being videotaped -- it will be posted later on. So stay tuned for that!

Gary Lawrence: Good news and bad news..

Gary Lawrence reported about UCOP. There is nothing but bad news about the budget. Potential shortfall next year of ten billion. There's currently a hiring freeze and all the functions in UCOP (including CDL) are being restructured.

Budget addressing employee salary is at the top of the list of priorities. Budget proposal 08/09 covers existing compact with the Governor which is 5% increase on salary.

UC is working on Faculty salary parity as a top priority over the next four years. Range for other staff in merit increase will be the same as this year. UC recognizes that this is not the best solution and will be seeking to work on this further as economic situation improves.

Next year there will be a budget deficit of $10 billion so we don't know if the UC can fund the compact.

UCOP is restructuring to find savings of $28 million. Planning for this is underway. Report will go to the Regents. Final UCOP budget will be in place in May. There will be cuts.

There is still a hiring freeze at UCOP. There is a new website: http://www.uc.edu/future that contains information about the restructuring process. Academic personnel policies review is now complete and will go out before Thanksgiving. There are 3 under review currently.

An aside: the weather in Merced

It's surprisingly hot in Merced -- at least people from the northern campuses think so :) It's sunny and clear, and currently 75 degrees. Perhaps that's why ASUMerced is hosting an ice cream social in the Library!

Bob Heyer-Gray's presidential message

BHG wants this year of LAUC assemblies to be program-driven -- and not bylaw-driven! He discussed his meeting with Google Books and the University librarians, and said that he wanted to get some of the information about Google books to UC librarians more widely.

He also talked about some of his committee work, the bylaw revisions, and what has been going on lately.

welcome from Bruce Miller

The meeting was called to order with a welcome from Bruce Miller, the University librarian at UC Merced. He started out by pointing out that Merced wins, with 83% of LAUC members from Merced at the meeting.

Miller gave a few thoughts on LAUC. First, Miller noted that the UC system is going through a big change: lots of new librarians are being hired -- "so I don't actually know most of you!" He stated that because of this, LAUC has a special chance now to help inculcate new librarians.

Next, Miller said, is there a way to include non-represented people in the library into LAUC? We all have colleagues who have the same concerns as we do, who may not be in traditional librarian positions, who ought to be at the table in these discussions.

Third, Miller talked about the LAUC research grant money -- he hears repeatedly that the research grant money goes unclaimed. He gave a challenge to LAUC: can we streamline this process?

Finally, Miller said, this is a time to step outside the box, step back, and try new things -- for instance new technologies in your work in LAUC. Give yourself permission to be liberated enough to try new things. (He gave the example of maybe starting a LAUC Facebook group, then said he didn't have a Facebook account himself. Of course, there actually is a UC Librarians Facebook group already).

He then welcomed the assembly to UC Merced.

Next on Agenda

Roll call of divisions and delegates by Mr. Careaga. Minutes, Spring Assembly 2007 approved as presented.

President's Report by Bob Heyer-Gray
Bob wanted to focus his year on programs of broad enough appeal and interest for newer librarians and across different divisions. Bylaws revisions were not on the horizon. Programs should inform ULs what we as professionals are about. Some fun to make LAUC look more attractive. Some ideas included: Melvyl OCLC pilot, Future of Public Services, and Grant Writing.
We are taping the meeting and will post it later on. Spring Assembly will be at UC Irvine on May 7. Bylaws revisions will be spoken to by Gary Lawrence later today. Bob met with all the ULs before the Fall Assembly in Oakland. The meeting went well. Google Books also gave a presentation about the project and status. Bob is an advisor to the Systemwide Academic Senate Committee on Scholarly Communication and Open Access Publishing. There may be a program about it later on. There is a search for a UL for CDL. Bob will participate on the search committee.