Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the http::oai perl module by Tim Brody (who not-so-coincidentally is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I have a very basic script which will roughly report on the number of records and the number of full text items within a repository, you just need to pass it a URL for the OAI-PMH interface.

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual oai output for this record, you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, the next line shows the title, it then shows a dc.relation field which contains a full text item on the eprints server, because it looks like a full text item and on the same server the next line shows it has found a line that MATCHed the criteria which means we add this item to the count of items with full text items attached.

The next line is another dc.identifier, again pointing to a fulltext URL for this item. However this time it is on a different server (i.e. the publishers), so this line is not treated as a fulltext item, and so it does not show a MATCH (i.e. had the first identifier line not existed, this record would not be considered one with a fulltext item).

Finally another dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that the different repository software populate Simple Dublin Core in a different manner. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation url and see if it matches that of the OAI interface we are working with. This is no means solid, quite possible for a system to have more than one hostname and what the user gives as the OAI URL may not match what the system gives as the URLs for fulltext items.

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item which is also in a dc.identifier as well. (as an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext).

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. DC.identifier is used for a link to the html page for this item (like eprints2 but unlike eprints3 which uses dc.relation for this). However it does not mention either the fulltext item or the official/publisher url at all (this record has both). The only clue that this has a full text item is the dc.format (‘application/pdf’), and so my hacked up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those (who were) on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata, and more. Much of what I have read is clear and well written and yet I still struggle to understand it due to my own metal shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose, for example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way, however it is in the dc.identifier field, yet various other bits of information are also in the very same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, but ring any bells? I’ve always thought that the existence of the Bath profile was probably a sign of underlying problems with Z39.50 (though am almost totally ignorant on z39.50). even this will only solve some problems, the issue of multiple ‘real world’ elements being put in to the same field (both identifier and relation are used for a multiple of purposes), as mentioned above, is still a problem.

I know nothing about metadata nor web protocols (left with me, we would all revert to tab delimited files!), so am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC. Qualified DC being a candidate (I think, again, I know nuffing). see this page highlighting some of the issues with simple dc.

I guess one problem is that it is easy to fall in to the trap of presuming repository item = article/paper. When of course if could be almost anything, the former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common different item types (articles/theses/images). I think this is the point that people will point out that (a) this has been discussed a thousand times already (b) has probably already been done!. So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine readable way of telling clients if they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human text of an IR policy may forbid me sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of a OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number with ease (probably a simple SQL COUNT(*)) but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes, creating processing requirements for both server and client, to answer a simple question of how many items are there? I echo and support Peter’s idea of creating a count verb to resolve this.
  • Would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb. Would be very useful when trying to work around the differences between applications and software versions.

Back to the script

Finally. I’m trying to judge how good the little script is, does it report an accurate number of full text items. If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post. Ideally with your OAI-PMH URL and how many full text items you think you have, though neither are essential. I’ll attach the results to a comment to this post.

Food for thought, I’m pondering the need to check the dc.type of an item, and only count items of certain types, e.g. should we include images? one image of a piece of research sounds fine, 10,000 images suddenly distorts the numbers. Should it include all items, or just those that are of certain types (article, thesis etc)?