Tag: marc

MARC Tools & MARC::Record errors

I know next to nothing about MARC, though being a shambrarian I have to fight it sometimes. My knowledge is somewhat binary: absolutely nothing for most fields/subfields/tags, but ‘fairly ok’ for the bits I’ve had to wrestle with.

[If you don’t know that MARC21 is an ageing bibliographic metadata standard, move on. This is not the blog post you’re looking for]

Recent encounters with MARC

  • Importing MARC files into our library system (Talis Capita Alto), mainly for our e-journals (so users can search our catalogue and find a link to a journal if we subscribe to it online). Many of the MARC records were of poor quality and often did not even state that the item was (a) a journal or (b) online. Additionally, Alto will only import a record if there is a 001 field, even though the first thing it does is move the 001 field to the 035 field and create its own. To handle these I used a very simple script – using MARC::Record – to run through the MARC file and add an 001/006/007 where required (see the sketch after this list).
  • Setting up sabre – a web catalogue which searches the records of both the University of Sussex and the University of Brighton – where we needed to pre-process the MARC records to add extra fields, in particular a field to tell the software (vufind) which organisation the record came from.
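As an illustration of the first point, here is a minimal sketch of the sort of script involved, using MARC::Record/MARC::Batch. The filenames and the value generated for the 001 are made up for the example, not the ones actually used:

#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Field;

# Read a file of MARC records, add a 001 control field where one is
# missing, and write the records back out. Filenames are illustrative.
my $batch = MARC::Batch->new('USMARC', 'ejournals.mrc');
open my $out, '>', 'ejournals-fixed.mrc' or die "Cannot write output: $!";
binmode $out;    # MARC is a binary format

my $count = 0;
while (my $record = $batch->next()) {
    $count++;
    unless ($record->field('001')) {
        # 001 is a control field, so it takes a single value rather than subfields
        $record->insert_fields_ordered(MARC::Field->new('001', sprintf('ej%07d', $count)));
    }
    print {$out} $record->as_usmarc();
}
close $out;
print "Processed $count records\n";

The 006/007 fixed fields can be added in the same way with further insert_fields_ordered calls.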

Record problems

One of the issues was that not all the records from the University of Brighton were present in sabre. Where were they going missing? Were they being exported from the Brighton system? Copied to the sabre server OK? Output correctly by the Perl script? Lost during the vufind import process?
To answer these questions I needed to see what was in the MARC files. The problem is that MARC is a binary format, so you can’t just fire up vi to investigate. The first tool of the trade is a quick script using MARC::Record to convert a MARC file to a text file. But this wasn’t getting to the bottom of it. This led me to a few PC tools that were of use.
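The sort of thing I mean is below – a minimal sketch, with an illustrative filename:

#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Dump each record in a MARC file as human-readable text so it can be
# inspected with ordinary tools. The filename is illustrative.
my $batch = MARC::Batch->new('USMARC', 'brighton-export.mrc');
my $count = 0;
while (my $record = $batch->next()) {
    $count++;
    print "=== Record $count ===\n";
    print $record->as_formatted(), "\n\n";
}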

PC Tools

MarcEdit: Probably the best known PC application. It allows you to convert a MARC file to text, and contains an editor as well as a host of other tools. A good Swiss Army knife.
MARCView: Originally from Systems Planning and now provided by OCLC, I had not come across MARCView until recently. It allows you to browse and search through a file containing MARC records, though the browsing element does not work on larger files.
[Screenshot: MARCView]

USEMARCON is the final utility. It comes with a GUI, and both can be downloaded from the National Library of Finland. The British Library also have some information on it. Its main use is to convert MARC files from one type of MARC to another, something I haven’t looked into, but the GUI provides a way to delve into a set of MARC records.

Back to the problem…

So we were pre-processing MARC records from two Universities before importing them into vufind, using a Perl script which had been supplied by another University.

It turned out the script was crashing on certain records, and all records after the problematic record were not being processed. It wasn’t just that script: any Perl script using MARC::Record (and MARC::Batch) would crash when it hit a certain point.

By writing a simple script that just printed out each record we could at least see which record came immediately before the one causing the crash (i.e. the last in the output). This is where the PC applications were useful. Once we knew the record before the problematic record, we could find it using the PC viewers and then move on to the next record.

The issue was certain characters (here in the 245 and 700 fields). I haven’t got to the bottom of what the exact issue is. There are two popular encodings – MARC-8 and UTF-8 – and which one a record uses should be designated in the Leader (position 09). I think Alto (via its marcgrabber tool) exports in MARC-8, but perhaps the characters in the record did not match the specified encoding.
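As a rough illustration (not the script we used), the Leader’s character coding position can be read with MARC::Record – position 09 is ‘a’ for UTF-8 and blank for MARC-8; the filename is illustrative:

#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Report what each record's Leader claims about its encoding.
# Note a record bad enough to crash MARC::Record will still stop this
# loop - see the eval workaround further down.
my $batch = MARC::Batch->new('USMARC', 'export.mrc');
my $count = 0;
while (my $record = $batch->next()) {
    $count++;
    my $scheme = substr($record->leader(), 9, 1);
    printf "%d\t%s\n", $count, ($scheme eq 'a' ? 'UTF-8' : 'MARC-8');
}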

The title (245) on the original catalogue looks like this:

One workaround was to use a slightly hidden feature of MarcEdit to convert the file to UTF-8:

I was then able to run the records through the Perl script, and import them into vufind.

But clearly this was not a sustainable solution. Copying files to my PC and running MarcEdit was not something that would be easy to automate.

Back to MARC::Record

The error message produced looked something like this:

utf8 "\xC4" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174

I didn’t find much help via Google, though I did find a few mentions of this error related to working with MARC records.

The issue was that the script loops through each record, and the moment it tries to start an iteration with a record it does not like it crashes. So there is no way to check for problem characters in the record first – by the time you have the record it is already too late.

Unless we use something like exceptions. The closest thing Perl has to this out of the box is eval.

By putting the whole loop into an eval, if it hits a problem the eval simply passes the flow down to the ‘or do’ part of the code. But we want to continue processing the records, so this simply calls the eval again, until it reaches the end of the file. You can see a basic working example of this here.
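A minimal sketch of what I mean is below – the filename is illustrative and the record counting is simplified:

#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

my $batch = MARC::Batch->new('USMARC', 'records.mrc');
my $count    = 0;
my $finished = 0;

until ($finished) {
    eval {
        while (my $record = $batch->next()) {
            $count++;
            # ... normal processing/output of the record goes here ...
        }
        $finished = 1;    # reached the end of the file cleanly
    } or do {
        # A record the module could not handle caused a die inside next();
        # note it and go round again, carrying on from the following record.
        $count++;
        warn "Skipping record $count: $@";
    };
}

print "Finished: $count records seen (including any skipped)\n";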

So if you’re having problems processing a file of MARC records using Perl’s MARC::Record / MARC::Batch, try wrapping it in an eval. You’ll still lose the records it cannot process, but it won’t stop in its tracks (and you can output an error log to record the record numbers of the records with errors).

Post-script

So, after pulling my hair out, I finally found a way to process a file which contains records that cause MARC::Record to crash. It had caused me much stress as I needed to get this working, and quickly, in an automated manner. As I said, the script had been passed to us by another University and it already did quite a few things, so I was a little unwilling to rewrite it in another language (though a good candidate would be PHP, as the vufind script was written in that language and didn’t seem to have these problems).

But in writing this blog post, I was searching Google to re-find the various sites and pages I had come across when I first encountered the problem. And in doing so I found this: http://keeneworks.posterous.com/marcrecord-and-utf

Yes. I had actually already resolved the issue, and blogged about it, back in early May. I had somehow – worryingly – completely forgotten any of this. Unbelievable! You can find a copy of a script based on that solution (which is a little similar to the one above) here.

So there you are: a few PC applications and a couple of solutions to a Perl/MARC issue.

Library catalogues, search systems and data

Below is an email I sent to the UK Library e-resources mailing list (lis-e-resources@jiscmail.ac.uk). I’m putting it here for the same reason that I sent the original email: I think there are questions relating to the changing role of the library catalogue, and new models are developing in how and where metadata exists for the items libraries provide access to.

My points in a nutshell:

  • The way we work with library system (LMS) catalogues is changing, with the need to import and export large batches of records from other systems increasing, especially for online materials such as journals and e-books. This is quite a different need to those before the web, when item = physical thing in the library building and records were added one at a time.
  • Library systems have not adapted to this new need; batch importing is technically possible but often fiddly and can feel like a hack.
  • While it is possible to import batches of records, there are issues with keeping everything in sync. For example, libraries often subscribe to publisher (online) ‘journal bundles’, and the titles included in these bundles can change over time – but how do you easily update/sync the catalogue to reflect this? One option is to regularly delete the imported records and reimport from the data source completely, though, if I understand correctly, library systems often do not delete records but instead simply ‘suppress’ them from public view. So doing this for twenty thousand e-journal records each month would leave 240,000 hidden records on the system each year!
  • Why do we want them on the catalogue? Because users search the web interface to the catalogue to look for journals, books, theses etc. So we need to ensure our e-journals, e-books etc can be discovered via the catalogue interface.
  • A Library System (LMS) will typically have a catalogue, which cataloguers and other library staff maintain, and a public web front end for users to search (and locate) items.
  • However, ‘next generation’ web search systems are now on the market; these allow the user to search the LMS catalogue and other data sources simultaneously in one modern interface.
  • Setting up these systems to search other data sources (in addition to the library system catalogue), so that they include records for online journals and e-books (and more), is a much neater solution than trying to add/sync complete cataloguing records into the library system catalogue.
  • This to me (and I’m no scholar on the subject) has changed the model. The library system catalogue was one and the same as the public web catalogue: what was on one was on the other. Librarians would discuss ‘should we put X on to the catalogue?’. But now these two entities are separate. Do we want something to be discoverable on our web catalogue search system? Do we want a record on our back-end library system for something? These are two separate questions, and it is possible to have an item on one and not the other. It would be easy to say that, if you just want users to be able to discover a set of items, you should make them available on your next generation search system – if it wasn’t for the fact that…
  • Third-party systems can cross-search or import records from multiple library catalogues, getting the data from each library’s system. This used to be a simple thing to consider: do we want to allow these systems access to our records/holdings? If so, they would search our catalogue. And these are only examples – EndNote, for instance, allows you to search a library catalogue from within the application, and of course this is the library system catalogue.
  • This creates questions: which items do we want to make available to our users searching our web catalogue? Which items do we want to expose to other systems out there? Which items do we want to keep in our library system back-end catalogue for administrative/inventory purposes? With the old, simpler model these questions did not need to be asked.

I’ve drawn some rather embarrassingly bad diagrams to try and illustrate the point:

[Diagram: Original Library catalogue model (click on images for larger version)]

[Diagram: New Library catalogue model]

So after that rather lengthy nutshell (sorry), here is the original email, which does ramble and lacks a point in parts:

Over the last few years the need to add e-resources (journals/books) to our library catalogue has grown. The primary reason is that users expect (understandably) to find books and journals in the catalogue, and that includes online copies.

This has seen the way we use our catalogue change, from primarily adding individual records as we purchase items, to trying to add records in bulk from various third party systems.

These third party systems include the link resolver (for journal records), e-book suppliers and (experimentally) repository software (for theses).

I imagine many are in the same boat as us: we want to do this in a scalable way, and we don’t want to be editing individual records by hand when we could be looking at a very large number of records, both for journals and – as/if usage takes off – e-books.

For this to work, it requires high-quality (MARC) records from suppliers, and LMS (ILS) vendors adapting their systems for this change in behaviour. For example, it may have been reasonable in the past for an LMS supplier to presume that large numbers of records would not need to be regularly suppressed/dropped, though with ever-changing journal bundles this may be normal practice in the future.

Furthermore, just to add confusion, next generation web catalogues can search multiple sources. The assumption that ‘public web catalogue’ reflects the ‘LMS catalogue’ (i.e. what is in one is in the other) may no longer apply. Should e-content be kept out of the LMS but made seamlessly available to users using new web interfaces (Primo, Aquabrowser, etc etc)?

This seems like quite a big area, and a change in direction, with questions, and yet I haven’t seen much discussion of it (of course, this may well be down to a bad choice of mailing lists/blogs/articles on my part).

Are others grappling with this sort of thing?

Anyone else wishing they could import their entire online journal collection with a few clicks, but finding that dodgy records (for content we pay for!) and fussy library systems turn it into a very slow process?
And not quite sure how to keep them all in sync?

Would love to hear from you.
Who else has all their e-journals on the catalogue? Was it quick? Do you exclude free journals etc?

I also added this in a follow up email:

We already have Aquabrowser and this does seem to offer a nice solution to some of this. It looks like you just need to drop a MARC file in place and the records will be included. (See http://www.sussex.ac.uk/library/)

But this presumes that ‘keep the records out of the LMS’ is the right approach, and it is not for everyone.

Our (LMS) catalogue is exposed elsewhere, such as the M25 search and Suncat, and others will add COPAC and Worldcat to the list, plus other local union-ish services.

By simply adding these records to a next gen catalogue system, they will not be available to these types of systems. This may be desirable (does someone searching Suncat want to know that your users have online access to a journal?) but the opposite may also be true.

Let’s take a thesis, born digital, in a repository. It would seem desirable to add the thesis to the main LMS catalogue (especially as a printed/bound thesis would appear there), and to make it available to third-party union/cross-search systems.

Next gen catalogues are – I think – certainly part of the solution, but only when you just want to make the records available via your local web interface.

Owen Stephens has replied with some excellent points and thoughts on the matter which are worth reading.

Finally, I’m not a Librarian, cataloguer, or expert, so these are just my thoughts. There is stuff to think about in this area, and I’m not suggesting I have the answers or even that I have articulated what I think the issues are with any success.

Update: I’ve just come across a blog post from Lorcan Dempsey, which as ever articulates some of this very well.