ircount : update

One Sunday morning in January this year I got an email sent automatically from the web hosting company. It contained the output of a script that runs weekly. When all is well the script produces no output; when something goes wrong, the error messages are emailed to me. Judging by the length of the email, something big had gone wrong.

The script collects data from http://roar.eprints.org/, to be used as this week’s ‘number of records’ for each repository.

The reason became clear quickly. A major revamp of ROAR had just been launched, showing off a new interface which used the Eprints software as a platform (essentially a repository of repositories). This was a great leap forward, but unfortunately it removed the simple text file I used to collect the data and, what was more, the IDs for each IR had changed.

I finally got around to fixing this in May. The most fiddly bit was linking the data I collected now with the data I already had. This involved matching URLs and repository names.
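For illustration, here is a minimal sketch of that kind of matching. It is not the ircount code: it is written in Python, the record layout (`id`, `url`, `name`) and the function names are assumptions, and the normalisation rules (ignore case, a leading `www.` and a trailing slash) are simply one plausible choice.

```python
from urllib.parse import urlsplit


def normalise_url(url):
    """Reduce a URL to a comparable key: host without 'www.', path without trailing slash."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def normalise_name(name):
    """Lower-case a repository name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def link_records(old_records, new_records):
    """Map old IDs to new IDs, matching on URL first and falling back to name."""
    by_url = {normalise_url(r["url"]): r["id"] for r in new_records}
    by_name = {normalise_name(r["name"]): r["id"] for r in new_records}
    links = {}
    for old in old_records:
        new_id = by_url.get(normalise_url(old["url"]))
        if new_id is None:
            new_id = by_name.get(normalise_name(old["name"]))
        if new_id is not None:
            links[old["id"]] = new_id
    return links
```

Anything left unmatched after both passes still has to be linked by hand, which is where the fiddly part comes in.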

Anyways. Things should be more or less as they were. A few little tweaks have been added. A few bugs still remain.

As ever you can view the code and changes here: http://trac.nostuff.org/ircount/browser/trunk

And checkout the svn here: http://svn.nostuff.org/ircount/

ircount can be found here: http://www.nostuff.org/ircount/

Event: The Data Imperative: Libraries and Research Data

Today I’m at the one day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of this, and a discrete area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR : ‘Enough talking,  let’s try and do it’, selected two research groups to work with, but not a pilot, a long term commitment. He talks about Oxford’s commitment to a data repository, it stresses cross agencies, mentions business models and feeds in to a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, national activity makes sense. But how much is done locally and how much is done nationally?

What is the vision of research management data? To what extent is managing research data the role of the Library/librarians? Is data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: i thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today. Where are the gaps? Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years? 26% has ‘indefinite’ value, i.e. ‘keep for ever’. Nearly all kept locally (memory stick, departmental server, [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project is now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with an urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS basic processes. Split between ‘Research Project process’, ‘Research data sharing process’ and ‘UKRDS Services and Administration’.

Diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity) as this seems like the most important part. Discovery:Google, Identity(auth):Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEI or public sector (museums etc)? A: a more complicated issue, but they are working with the likes of Connecting for Health and DEFRA.

Q: Copyright. A: HEI often don’t own copyright. Data Management Plan (Wellcome are funding Data planning as part of funding)

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phrase!] e.g. for the REF? A: Yes, in discussion with HEFCE.

Q: Research Councils A: they are in discussion with RCs but Wellcome very much taking the lead (leap of faith) in the area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap in to those already doing it? A: there is sometimes a missing link between researchers and subject national data centres. No real relationship between the two. Which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of an ‘in practice’ talk, rather than high level.

Luis Martinez

Scoping study: ‘DataShare project’. Talking to researchers, they found some couldn’t understand their own old data, some wanted to publish their own data, and some found data was lost when academics moved on.

Requirements: Advice/support across research cycle (where to store it, how, etc), Secure Storage for large datasets. Sustainable infrastructure.

Lots of different Oxford units need to be consulted (library, it, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework: (DAF) adapted this to Oxford needs and used it to document practices in research groups.

‘Policy-making for Research Data in Repositories: a guide’ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and agree to produce 3D models using the combined data. They are helping these groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog http://eidcsr.blogspot.com/

Sally Rumsey

Starts off by talking about the roles required regarding the library. They have Repository staff, librarians, curators, but not so sure about ‘data librarians’.

What sort of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (eg Oxford Supercomputing centre): who has the expertise to make these specialised datasets available? Some departments already have provision in place; fine, why rock the boat?

Long tail. Everything else (not above). No other home, lots of it, academics asking for it, highly individual (ie unique), humanities and sciences.

Things to consider: live or changing data? Freely available or restricted? Long term post project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram in which Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. Many applications can sit on top of it. The Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in the University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank’ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non academic repositories, such as flickr, preservation issues, if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using the DCC Curation lifecycle model, these are the outer ring. But not all of it, and they do things not on the ring, such as throw data away.

Shows picture of an academic’s office. Data is stored in random envelopes.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be one), experts in database technologies, ensure systems are in place, format migration. A ‘translation service’ between Researchers and computer experts.

Lots of facts about this, based on the research. Often fallen in to the role by accident, often started out as a researcher. Domain (maths, chemistry) related or computer training. Informatics skills: well advanced in biology and chemistry. Majority have a further degree. Need people skills. Rapidly evolving area.

Data Librarians

Only a handful in the UK. Specific skills in data care, curation. Bottom half (or bottom two thirds) of DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

train researchers to be more data aware

Pressing issue: inform researchers on data principles, eg ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough, datasets are what are needed to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some do (the usual suspects).

Funder may own Data, Employers may own data. No one seems sure. Several entities may own the data.

In some areas of research journals play role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers? [cjk: no! they should have nothing to do with this, the actors are Universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

Something of a problem, not easy to tackle. Would be a mistake for institutions to wait. The Call is designed to better understand how its data management facility can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

The Call will have an Advisory Group.

Exemplar projects and studies designed to help establish partnerships between researchers, institutions, research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make business case internally, institutional planning tools (building on DAT, DRAMBORA, and costing tools). Workshop 10th June DCC to review progress/outcomes of DAT project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and brings the various threads together.

The library research data pyramid. Things at the bottom need to be in place before things higher up. At the bottom, training in library (confidence), Library schools. Then develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, Lead on local policy. At the very top ‘influence national data agenda’.

Summary

An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same position: they are aware of the (Open) Research Data developments, but nothing has yet happened at their university, nor are academics queuing up to demand such a service. This is a good thing and it needs to happen, and Universities need to start acting now. But there are many pressures on University resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It's random, with no thought, many umms, a pointless 'one more thing' and basically wrong. laugh at it here]

Research in the Open: How Mandates Work in Practice

Today I’m at the RSP/RIN Research in the Open: How Mandates Work in Practice at the impressive RIBA 66 Portland Place.

Slides can be found here (not available when I made this post, as a semi-excuse as to why my notes miss so much). These are rough notes, which I’m making available in case others are interested; apologies for mistakes, and don’t take it as gospel!

After an introduction by Stéphane Goldstein, kicking off with Robert Kiley from the Wellcome Trust.

Wellcome Trust mandate since 2006: anyone receiving funding from the Wellcome Trust must deposit in to PubMed, now UK PubMed Central. SHERPA Juliet lists 48 funder policies/mandates.

Two routes to complying with their mandate. Route 1 (preferred): publish in an open access / hybrid journal; Wellcome will normally pay any associated fees. However, when paying the publisher they expect a certain level of service in return (deposited on behalf of the author, final version available at time of publication, a certain level of re-use). Route 2: the author self-archives the author’s final version within 6 months of publication. It was stressed that the first option is very much preferred.

“Publication costs are legitimate research costs”. To fund Open Access fees for ALL research they fund would, they estimate, take up 1-2% of their budget.

Risk of ‘Double payment’ (author fees and subscriptions). OUP have a good model here.

Still to do:

  • Improve compliance (roughly 33%, significant increase after letters to VCs),
  • improve mechanisms (Elsevier introduced OA workflow which resulted in significant increase in deposits, but funders/institutions/publishers all need to play a part here),
  • Clarifying publishers’ OA policies (and re-use rights, didn’t catch this).

Research Councils UK – Astrid Wissenburg, ESRC

Starts off by talking about drivers for OA in the RC. Value for money, ensuring research is used, infrastructure and more.

Principles: Accessible, Quality (peer review), preservation (she’s moving through the slides fast)

April 2009 study in to OA impact, provides options for RC to consider. Findings:

  • Significant shift in favour of OA over last decade
  • Knowledge/awareness still limited. Confusion
  • Engagement with OA varies by subject area.
  • Too early to assess impact of RCs’ policies.
  • Drivers
    • Not speed of dissemination
    • principles of free access
    • co-authors’ views are a big influence (mandates less so!)
    • some evidence that OA increases citation just after publication
    • limited compliance monitoring by funders
    • concern about impact on learned societies (but no evidence of libraries cancelling journals)
    • little evidence of use by non-researchers (CJK comment: interesting, I would imagine this may grow, wish newspapers would link/cite journal articles)

Both models (oa journals/repositories) supported by RCs, level playing field.

Pay to publish findings: limited use, barriers, costs, awareness, not RAE. Would lead to redistribution of costs from non-academic to academic areas.

OA Deposit (repositories): from grant application from 1 Oct 2006, so a three year project starting then will only be finished in Autumn 2009. Acknowledges embargos but ‘at earliest opportunity’.

75% of researchers were not aware of the mandate. Diversity across subjects. ‘In general, no active deposit’.

A slide showing % of awareness broken down by RC, interesting.

From the highest level RCs are committed to supporting OA (this will increase). But change takes time.

Some issues: what to do with embargo periods, difficult for funders to manage (are there incentives we could use?), depends on existence of repositories, multiple deposit options confusing to researchers, awareness/understanding.

UKPubMed Central – Paul Davey, Engagement Manager, UKPubMed Central

Aims to become the information resource of choice for biomedical sector.

Principles: freely available, added to UK pubmed central, freely copied and reused.

Department of Health have a clear policy to make research freely available.

95% of papers submitted are taken care of (deposited?) by the authors. only 0.5% submitted by academics (PIs/colleagues)

1.6 million papers in UK PubMed Central. 366 thousand downloads last month.

Core benefits: transparency, cutting down duplication, greater visibility.

Text mining, grabbing key terms from an article (a little like OpenCalais does).

Mentions EBI’s CiteXplore, encouraging academics to link to other research.

Pubmed UK includes funding/grant facilities search. Can link articles to funding grants.

In short, backing from key funders, will make researchers more efficient, researcher’s visibility will increase.

Beta out in the Autumn, new site in Jan 2012.

Questions:

Worried about text mining, need for humans to moderate this. Response: limited funding in this area, so human intervention is also limited. Really need a specialist to answer this fully.

Question about increasing visibility of UK pubmed central, referring to Google, response: getting indexed by Google very much part of increasing visibility.

Question about Canadian ‘pubmed central’, response confirms this and mentions talk of a European pubmed central. Potential of European funders using UK pubmed central as a place to deposit research (like everything here, not sure if I’ve noted this right).

PEER – Pioneering collaboration between publishers, repositories and researchers – Julia Wallace

Funded by EC, not a ‘publisher project’.

Three key stages of publication: NISO Author’s original, NISO Accepted Manuscript, NISO version of record.

Starts off talking about the project, interesting stuff but I failed to take notes.

From the website:

PEER (Publishing and the Ecology of European Research), supported by the EC eContentplus programme, will investigate the effects of the large-scale, systematic depositing of authors’ final peer-reviewed manuscripts (so called Green Open Access or stage-two research output) on reader access, author visibility, and journal viability, as well as on the broader ecology of European research. The project is a collaboration between publishers, repositories and researchers and will last from 2008 to 2011.

Seven members: including a publisher group, university, funders etc. Various publishers involved, big and small and about six European repositories taking part.

Approach / content:

  • Publishers contribute 300 journals, plus control
  • Maximises deposit and access in participating repositories
  • 50% publisher submitted 50% author submitted.
  • Good quality, range of impact factors. Publishers set embargo periods, up to 36 months.

Publishers will deposit articles in to the repositories via a central depot for their 50% of articles submitted (50% fulltext, metadata for the remaining 50%). Publishers will invite authors to deposit for the ‘author’ 50%

Technical: using PDF/A-1 (where possible) and SWORD.

Three strands: Behaviour, Usage (looking at raw log files), Economic. Also looking at Model Development (the three strands will look in to this).

Question about why they chose PDF (not very good for text mining). A: wide range of subjects and publishers means that PDF is the best fit.

Economic Implications of Alternative Scholarly Publishing Models, also Loughborough University’s Institutional Mandate – Charles Oppenheim, Loughborough University

‘Houghton report’ looks at costs and benefits of scholarly publishing.

Link to report http://hdl.handle.net/2134/4137

Link to main website and models http://www.cfses.com/EI-ASPM/

  • Massive savings by using OA, UK would benefit from this.
  • Savings include: quicker searching, less negotiations, savings not just in library budgets
  • 2,300 activity items costed.
  • This report currently final word in economics of OA.
  • Charles talks about the various methods and work involved in producing this report.
  • a 5% increase in accessibility would lead to savings (or extra money to spend) in research/HE/RCs
  • Hard to compare UK toll/open access publishing costs as one pays for UK access to content from across the world, the other pays for UK content to be world wide accessible.
  • Keen to roll this out to other countries
  • Publishers response to report: furious!

Now for something completely different: Loughborough approved a mandate a few months ago, to come in to effect Oct 09. It is an integral part of academic personal research plans (only those research items in the IR will be considered at the review). Now have over 4,000 items.

Lunch and audioboo

During lunch I did an experiment using audioboo. Would I be able to summarise the morning, on the fly with no planning, in a brief audio recording? The answer, as you can discover, is ‘no’, but it was fun to try, and made me think about what I had taken in during the morning. Link to audioboo recording, or try the embedded version below.

Institutional Mandates – Paul Ayris, University College London

Paul starts off by showing a number of Venn diagrams, for example: 90% of its research is available online, 40% available to an NHS hospital.

What do UCL academics want

  • as authors: visibility / impact
  • as readers: access
  • delivery 24×7 anywhere

UCL mandate, a case study:

Looking global is an important part of UCL (for PIs rankings etc). Number of systems in their publication system: Symplectic, IRIS, eprints, data mart (and portico, FIS, HR). Symplectic (or similar tool) and IRIS seem central in this model. Plan to automatically extract metadata from other external places (publication repositories).

How did they get the mandate? Paul spoke at UCL’s senate (Academic Board), which agreed: all academic staff should record their own publications on a UCL publication system, and teaching materials should all be deposited in their eprints system.

UCL are going to set up a publication board to oversee the OA rollout; to advise, monitor, oversee presentation and more.

Next steps: market/exploit, set standards for online publication, to advise on ongoing resource issues in this area. Also, establish processes, Statistics and management information, advise on multimedia, copyright issues.

‘Open Access is the natural way for a global university to achieve its objectives’

Question about blurring the line between dissemination and publication, and that some of UCL’s aims seem more fitting of ‘publication’. Paul agrees, still trying to figure this out.

HEFCE – Paul Hubbard, Head of Research Policy, HEFCE

Policy: Research is a process which leads to insights for sharing. So Scholarly Communication matters to HEFCE. Prompt and accessible publishing is essential for a world class research system.

Supporting research: JISC, RIN, Programmes to support national research libraries (UKRR), UKRDS. Mentions Boston Spa (BL) document centre as an example of our world class sharing.

Internet opens up new ways of scholarly communication and sharing.

What do HEFCE want to see:

  • Widest and earliest dissemination of public research.
  • IP shared effectively with the people best placed to exploit it (CJK comment, i don’t think it is publishers!)

Committed to: UK maintaining world leading research, funding that fosters autonomy and dynamism, research quality assessment regime that supports rather than inhibits new developments.

As we move forward, things may be unclear, but those HEIs with repositories will be at an advantage.

Paul finishes up with a personal view of scholarly communications in 2030. He sees two forms of communication: discussion (building up ideas), and writing up a formal firm idea/conclusion based on these. HEFCE supports, through the likes of JISC, a range of tools and systems to enable this. (Sorry, that was an awful summary, he said much more than that!)

Answered a question as to why IRs, HEIs are the places to administrate/manage. Websites people go to see research for a particular subject need to be overlay systems harvesting from IRs.

[hmm, does 'university requirement' sound better than mandate?]

Institutional Policies and Processes for Mandate Compliance – Bill Hubbard, SHERPA, University of Nottingham

99.9% of academics do not object to Open Access, but need to show it will not change how they work. Librarians going to be much more part of the research process. Most people (including most publishers) are in favour of Open Access.

Other pressures on the system: lack of peer reviewers, rising prices of journals, growing need for different forms of scholarly communications (e-lab books, multimedia), public demand for highest value for money (‘public should get what they pay for’).

Not if we change, but how we change. Research has to change seamlessly. Mandates have a value-added basis with fast delivery of benefits. Need integrated processes, need integrated support (we don’t want researchers to hear different messages from their Uni, funder, publisher, etc).

Authors need to know ‘what do I need to do’. Need to make it less confusing, need to make it clear when they can get help.

First step compliance: how can funders improve compliance, how can authors be supported?

All 1994 and Russell Group now have IRs (Reading, I think, just setting one up now).

Compliance for mandates makes it better for us admin/support staff, and for the University generally.

Institutions need a compliance officer (perhaps repository manager). Funders need to ensure these people have the information they need. Share compliance information.

I’ve missed so much of Bill’s talk here, he moves fast (and passionately) and lots of points.

After Bill’s talk there was a panel session.

Twitter

Finally check out some of the useful tweets from the day. (Twitter search only goes back about a month or so, so this link may not work after a certain date). Jim Richardson also created a permanent copy with the (new to me) webcitation website.

Conclusions

With such dodgy note taking I feel some concise summary is in order!

  • Mandates are happening, by Universities and by Funders.
  • HEFCE want research to be accessible to as many as possible as quickly as possible.  Coming from HEFCE, this holds a lot of weight.
  • Funders (Research Councils / Wellcome) put mandates in place several years ago. They have not sat back and said ‘job done’. They are building on this foundation. How can they check? How can they enforce/encourage? How can they assist? How can they automate? How can they work with publishers and HE to share this information? Expect more to come in this area.
  • Wellcome Trust prefers submission to Open Access Journals rather than author depositing in to a repository at a later date.
  • HE mandates are coming; we already have a few in the UK. Making them an integral part of an academic’s review seems like a good idea. My opinion is that this is reasonable, even if there are those who disagree: surely an employer can (and does in every other sector) ask for a record of what an employee has been working on, and a copy of the end output, i.e. the full text in an IR.
  • The report ‘Economic implications of alternative scholarly publishing models: exploring the costs and benefits. JISC EI-ASPM Project’ is a thorough, comprehensive look at the economic costs of Open Access and new forms of Scholarly Communications.
  • I think we are starting to see the larger Universities developing sophisticated networks of systems to manage research/publications/OA/research-funding. See slide 10 of Paul Ayris’s presentation, and this article about Imperial’s setup as two examples.
  • It makes sense to share information (between IT systems) between funders, HE and publishers. Examples: Funders sharing (bibliographic) information to a University about publications from its researchers, Universities (or publishers) passing information to funders linking publications to funding (or even the other way round?).
  • This is an area which is still developing, fast, and will of course involve a culture change. Publishers seem unsure how to handle this new world.

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the HTTP::OAI Perl module by Tim Brody (who not-so-coincidentally is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I have a very basic script which will roughly report on the number of records and the number of full text items within a repository, you just need to pass it a URL for the OAI-PMH interface.
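The record-counting half of such a script boils down to issuing ListRecords requests and following resumption tokens until the repository stops handing them out. Below is a rough sketch of that loop in Python using only the standard library. It is not the Perl script described here and does not use HTTP::OAI; the function names are made up for illustration, and it counts every record returned (including any marked as deleted).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, token=None):
    """Build a ListRecords request; a resumptionToken replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def fetch(url):
    """Download one page of the OAI-PMH response as bytes."""
    with urlopen(url) as response:
        return response.read()


def count_records(base_url, fetch=fetch):
    """Count records across all pages of a ListRecords harvest."""
    total, token = 0, None
    while True:
        root = ET.fromstring(fetch(list_records_url(base_url, token)))
        total += len(root.findall(f"{OAI}ListRecords/{OAI}record"))
        token_el = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:  # missing or empty token means the list is complete
            return total
```

Passing `fetch` as a parameter is only there so the paging logic can be tried against canned XML without touching a live repository.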

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual oai output for this record, you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, and the next line shows the title. It then shows a dc.relation field which contains a full text item on the eprints server. Because it looks like a full text item and is on the same server, the next line shows that the script has found a line that MATCHed the criteria, which means we add this item to the count of items with full text attached.

The next line is another dc.relation, again pointing to a fulltext URL for this item. However this time it is on a different server (i.e. the publisher’s), so this line is not treated as a fulltext item, and so it does not show a MATCH (i.e. had the first relation line not existed, this record would not be considered one with a fulltext item).

Finally a dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that different repository software populates Simple Dublin Core in different ways. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation URL and see if it matches that of the OAI interface we are working with. This is by no means solid: it is quite possible for a system to have more than one hostname, and what the user gives as the OAI URL may not match the hostname the system uses in the URLs for fulltext items.
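
As a rough illustration of that hostname check, here is a minimal Python sketch (not the Perl script itself). The function name, the list of file extensions and the OAI base URL used in the example are all my assumptions, chosen just to show the idea.

```python
from urllib.parse import urlparse

# Assumption: a short, hypothetical list of extensions that "look like" full text.
FULLTEXT_EXTENSIONS = (".pdf", ".doc", ".ps")


def looks_like_local_fulltext(relation_url, oai_base_url):
    """Guess whether a dc.relation URL is a full text file on the repository itself."""
    relation = urlparse(relation_url)
    base = urlparse(oai_base_url)
    # Same hostname as the OAI interface we were pointed at?
    same_host = relation.hostname is not None and relation.hostname == base.hostname
    # Does the path end in something that looks like a document?
    is_file = relation.path.lower().endswith(FULLTEXT_EXTENSIONS)
    return same_host and is_file


# Assumed OAI base URL for the Sussex example; the real one may differ.
OAI_URL = "http://eprints.sussex.ac.uk/perl/oai2"
local = looks_like_local_fulltext(
    "http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf", OAI_URL)
publisher = looks_like_local_fulltext(
    "http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf",
    OAI_URL)
```

With the Sussex record above, the first relation would count and the publisher copy would not. A repository that serves its files from a second hostname would be missed, which is exactly the weakness described.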

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, while the official/publisher URL is still in a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item, which is also in a dc.identifier. (As an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext.)
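
To make the two uses of dc.identifier concrete, here is a small Python sketch that separates the URL identifiers from everything else in the Warwick record above. The wrapper element and the function name are my own additions for illustration, and "anything that is not a URL is probably the citation" is only an assumption.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# The Warwick fields from above, wrapped in an oai_dc container so they parse.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:format>application/pdf</dc:format>
  <dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
  <dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
  <dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
  <dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>
</oai_dc:dc>"""


def classify_identifiers(record_xml):
    """Split dc:identifier values into URLs and everything else (e.g. citations)."""
    root = ET.fromstring(record_xml)
    urls, other = [], []
    for element in root.iter(DC_NS + "identifier"):
        value = (element.text or "").strip()
        if value.lower().startswith(("http://", "https://")):
            urls.append(value)
        else:
            other.append(value)
    return urls, other


fulltext_urls, citations = classify_identifiers(SAMPLE)
```

Here `fulltext_urls` ends up holding the PDF link and `citations` the Lafontaine and Slade reference, but only because this particular record happens to have one of each.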

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. dc.identifier is used for a link to the HTML page for this item (like Eprints2, but unlike Eprints3, which uses dc.relation for this). However it does not mention either the fulltext item or the official/publisher URL at all (this record has both). The only clue that this has a full text item is the dc.format (‘application/pdf’), and so my hacked up little script looks out for this as well.
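
The dc.format clue can be sketched in Python like this. Again the wrapper element and function name are mine, and treating ‘application/pdf’ as proof of full text is an assumption rather than anything the metadata promises.

```python
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# The Leicester fields from above, wrapped in an oai_dc container so they parse.
SAMPLE_DSPACE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
  <dc:format>350229 bytes</dc:format>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>"""


def has_pdf_format(record_xml):
    """Return True if any dc:format value is exactly 'application/pdf'."""
    root = ET.fromstring(record_xml)
    formats = [(element.text or "").strip().lower()
               for element in root.iter(DC_NS + "format")]
    return "application/pdf" in formats


dspace_has_fulltext = has_pdf_format(SAMPLE_DSPACE)
```

Note that the Warwick Eprints3 record also carries a dc.format of application/pdf, so a check like this would fire there too; a script combining the heuristics has to take care not to count the same record twice.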

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart, such as Herbert van de Sompel (I listened to a Talis podcast interview with him recently, interesting but would love to know more about the BL thing!) and many at UKOLN and Eduserv. When I read the articles and posts produced by those (who were) on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata, and more. Much of what I have read is clear and well written and yet I still struggle to understand it due to my own mental shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose. For example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way. However it is in the dc.identifier field, and various other bits of information are also in the very same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, but does that ring any bells? I’ve always thought that the existence of the Bath profile was probably a sign of underlying problems with Z39.50 (though I am almost totally ignorant on Z39.50). Even this will only solve some problems: the issue of multiple ‘real world’ elements being put in to the same field (both identifier and relation are used for multiple purposes), as mentioned above, is still a problem.

I know nothing about metadata nor web protocols (left with me, we would all revert to tab delimited files!), so am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC, Qualified DC being a candidate (I think; again, I know nuffing). See this page highlighting some of the issues with Simple DC.

I guess one problem is that it is easy to fall in to the trap of presuming repository item = article/paper, when of course it could be almost anything. The former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common different item types (articles/theses/images). I think this is the point where people will point out that (a) this has been discussed a thousand times already and (b) it has probably already been done! So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine readable way of telling clients if they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human text of an IR policy may forbid me sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of an OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number with ease (probably a simple SQL COUNT(*)), but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes and creates processing requirements for both server and client, just to answer the simple question of how many items there are. I echo and support Peter’s idea of creating a count verb to resolve this.
  • Would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb. Would be very useful when trying to work around the differences between applications and software versions.
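
To show why counting is so laborious for a client, here is a hedged Python sketch of the paging involved. It uses the standard ListIdentifiers verb and follows resumptionTokens until the server stops handing them out. The function names are mine, the fetch function is pluggable so the logic can be tried without hitting a real server, and it counts every header it sees (including any marked as deleted), which may or may not be what you want.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Default fetcher: return the raw bytes of an OAI-PMH response."""
    with urlopen(url) as response:
        return response.read()


def count_records(base_url, fetch=fetch):
    """Count record headers by paging through ListIdentifiers responses."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    total = 0
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        total += len(list(root.iter(OAI_NS + "header")))
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty resumptionToken means this was the last page.
        if token is None or not (token.text or "").strip():
            return total
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}
```

Every page is a separate HTTP request and a separate chunk of XML to parse, which is the cost Peter’s count verb would remove.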

Back to the script

Finally. I’m trying to judge how good the little script is: does it report an accurate number of full text items? If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post. Ideally with your OAI-PMH URL and how many full text items you think you have, though neither are essential. I’ll attach the results to a comment to this post.

Food for thought: I’m pondering the need to check the dc.type of an item, and only count items of certain types, e.g. should we include images? One image of a piece of research sounds fine, 10,000 images suddenly distorts the numbers. Should it include all items, or just those that are of certain types (article, thesis etc)?
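
If I did go down that route, the filter itself would be tiny. A Python sketch follows; the set of types is purely hypothetical, since repositories are free to put whatever they like in dc.type.

```python
# Assumption: a hypothetical list of types worth counting; real dc.type values vary.
COUNTED_TYPES = {"article", "thesis"}


def should_count(dc_types, counted=COUNTED_TYPES):
    """Return True if any of a record's dc.type values is one we want to count."""
    return any(value.strip().lower() in counted for value in dc_types)


count_article = should_count(["Article", "PeerReviewed"])
count_image = should_count(["Image"])
```

The hard part is not the code but agreeing on the list, and coping with every repository spelling its types differently.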

ircount : new location, new functionality

A while ago, I released a simple website which reported on the number of items in UK repositories over time. It collected its data from ROAR but, by collecting it on a weekly basis, could provide a table showing growth week by week.

First it has a new home: http://www.nostuff.org/ircount/

Secondly, it now collects data for every institutional (and departmental) repository registered in ROAR across the world. Not just the UK. It has been collecting the data since July.

The country integration isn’t perfect: you have to select a country, and then you are more or less restricted to that country (though you can hack it, see the ‘info&help’), and there is a lot of potential for improving this. There are also a couple of bugs, for example when comparing four repositories it seems to (a) forget which country you were dealing with, and (b) stop showing the graph/chart.

I’m currently looking at trying to make an educated guess at how many fulltext items are in a given repository. This is proving to be a steep learning curve in the joys of OAI-PMH, and how the different repository systems (and the different versions of these systems) have allocated information about the fulltext in to different Dublin Core (DC) elements. But this is for another post.

In the meantime, I hope the worldwide coverage is of some use, and feel free to leave any comments.