Category: libraries, library technologies & open data

Libraries, Library 2.0, Open Data, Library Technology

Amazon AWS EC2 and vufind

Today I saw a tweet from juliancheal, which mentioned he was setting up his virtual server on slicehost. I hadn’t heard of this company but their offering looks interesting. This got me thinking about cloud hosting and I decided it was time to actually try out Amazon’s AWS EC2.This allows you to run a virtual server (or multiple servers) in the Amazon cloud, servers can be created and destroyed by a click of a button.

First thing is to get a server ‘image’ to run in the cloud. Thankfully many have already been created. I went for a recent ubuntu server by Eric Hammond. This is basically a ubuntu server vanilla install, but with a few tweaks to work nicely as a EC2 virtual server. Perfect!

Signing up is quick and easy, it just uses your Amazon (the shop) credentials. Once created, you are taken back to the main control panel where you can see your new instance, including details like the all important public DNS name.  Just save a private key to your computer and use it to ssh in to your new server.

e.g.: ssh -i key1.pem

(you may need to chmod 400 the key file, but all this is documented)

Once in, well it’s a new server, what do you want to do with it?

I installed a LAMP stack (very easy in ubuntu: apt-get update and then tasksel install lamp-server). I initally couldn’t connect to apache (but could from the server itself using ‘telent localhost 80’). I presumed it was a ubuntu firewall issue, but it turned out you also control these things from the AWS control panel. The solution was to go to ‘security groups’ and modify the group I had created when setting things up and adding HTTP to ‘Allowed Connections’. This couldn’t of been easier. And then success, I could point my browser at the DNS name of the host and saw my test index page from the web server.

Amazon aws control panel, modify to allow http connections

So now what? I pondered this out loud via Twitter, and got this reply:


Excellent idea!

Good news: vufind has some good – and simple – documentation for installing on ubuntu;

Following the instructions (and editing them as well, they specified an earlier release and lacked a couple of steps if you weren’t also reading the more general install instructions) I quickly had a vufind installation up and running. Took around 20-25 minutes in all.

Now to add some catalogue data to the installation. I grabbed a MARC file with some journal records from one of our servers at work and copied it across as a test (I copied it just by using a scp command logged in to my ec2 server). After running the import script I had the following:

vufind results.If the server is still running when you read this then you can access it here:

EC2 is charged by the hour, and while cheap, I can’t afford to leave it running for ever. :)

So, a successful evening. Mainly due to the ease of both Amazon EC2 and Vufind.

A final note that if you are interested in EC2 you may want to look at some notes made by Joss Winn as part of the jiscpress project:

Both Ec2 and vufind are worth further investigation.

The Data Imperative: Libraries and Research Data : comment

I put this in to a seperate post. It continues on from my previous post, but didn’t want my notes of the day to be taken over by my ill thought views.

Personal Thoughts

Reluctant to give some thoughts as I know so little about the service. However… (!)

There seems to be two clear areas here: Data formatting and Data storing. There is some linkage (Preserving surely covers both, formats can become obsolete, Servers die), yet the two seem to be somewhat seperate.

Both require IT skills, but IT is a broad church, the former is technical metadata (and is very much IT and library) and in the general area that I sees covered in the Eduserv efoundations blog.

The latter in its simplest form is hard core infrastructure. Disks, sans, servers, security, but also has elements at the application level (how do we access it, using what software, repositories? CRIS? Fedora?).

On another issue, while it is easy to say that libraries should take the lead, I think we need to be cautious. With the current climate of frozen or decreasing budgets nationally, and journal subscription pressure, how wise is it to go to the University’s executive and demand funding for resources/staff for data management. We know it’s important and could make the process of research more efficient, but there are other things higher up a Universities list of priorities (NSS/atracting good students, REF, research funding). Even at a library level, journals help researchers do research (which brings funding), and keep students happy because we have the stuff they need (NSS). How many journals should we cancel to focus on Research Data? Why? The recent JISC call will help with providing a business case.

The problem at the moment is that there are not enough clear benefits for most Universities to steam ahead with this. Let’s clarify this: not enough benefits for the institution itself. The benefits are for the UK as a while (actually, the while world). It’s the UK-wide economy and research that will benefit. So maybe it needs UK-wide funding. It’s easier to convince someone (or something) to spend money when the benefits for them are clear. In this case the benefits are for UK so it should the UK which sets aside explicit cash (via HEFCE, JISC, and so on).

And this is happening, with the JISC call (talked about today), amongst other things it will help build examples.

But I’m not sure if the institutional level is the best one. Australia has been successful with a centralised approach. We have a number of small Universities, and those which only have one or two departments which are research active. Yet the resources/knowledge required of them will be similar to that of a large institution. Will this leave them at a disadvantage?

On another note, it seems the range of data is vast. When dicussing this, I always – incorrectly – picture text based data, of vearying size, perhaps using XML. Of course this is blinkered. For auido, images and similar should a data service just provide a method to download, or a method to browse and view/listen? When it comes to storage and delivery, should we just treat all data as ‘blobs’ – things to be downloaded as a file, and we no nothing more with it? This makes it easy and repository softwareapplications (eprints/dspace/fedora) are well placed to cater to this need. But I get the impression that this is somewhat simplistic. Perhaps this means a data service needs a clear scope, otherwise we could end up building front end applications which mimic flickr, youtube and all in one. A costly path to go down.

[all views are my own. are wrong, badly worded, ill thought, why are you reading this?, just think the opposite and it will be right, etc]

Event: The Data Imperative: Libraries and Research Data

Today I’m at the one day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of this, and an discreet area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR : ‘Enough talking,  let’s try and do it’, selected two research groups to work with, but not a pilot, a long term commitment. He talks about Oxford’s commitment to a data repository, it stresses cross agencies, mentions business models and feeds in to a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, it makes sense for national activity. but how much is done locally and how much is done nationally.

What is the vision of research management data? To what extent is managing research data the role of the Library/librarians? Is data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: i thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today? where are the gaps. Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years? 26% has ‘indefinite’ value, ie keep for ever’ Nearly all kept locally (memory stick, departmental server, [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project are now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with a urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS Basic processes. Split between ‘Research Project process’, Research data sharing process and UKRDS Services and Administration

Diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity) as this seems like the most important part. Discovery:Google, Identity(auth):Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEI or public sector (museums etc). A: a more complicated issue, but they are working with the liked of Connecting for Health and DEFRA.

Q: Copyright. A: HEI often don’t own copyright. Data Management Plan (Wellcome are funding Data planning as part of funding)

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phase!] e.g. for the REF. A: Yes, in discussion with HEFCE.

Q: Research Councils A: they are in discussion with RCs but Wellcome very much taking the lead (leap of faith) in the area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap in to those already doing it? A: there is sometimes a missing link between researchers and subject national data centres. No real relationship between the two. Which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of a ‘in practice’ talk, rather than high level.

Luis Martinez

Scoping study: ‘DataShare project‘. Talking to researchers they found some couldn’t understand they own old data, some wanted to publish their own data, some found data was lost when academics moved on.

Requirements: Advice/support across research cycle (where to store it, how, etc), Secure Storage for large dataset. Sustainably infrastructure.

Lots of different Oxford units need to be consulted (library, it, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework: (DAF) adapted this to Oxford needs and used it to document practices in research groups.

Policy-making for Research Data in Repositories : a guide‘ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and agree to produce 3d models using the combined data. They are helping this groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog

Sally Rumsey

Starts of by talking about the roles required regarding the library. They have Repository staff, librarians, curators, but not so sure about ‘data librarians’.

What should of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (eg Oxford Supercomputing centre), who has the expertise to make these specialised datasets available. Some departments already have provision in place, fine, why rock the boat.

Long tail. Every thing else (not above). No other home, lots of it, Academics asking for it, highly individual (ie unique), hums and sciences.

Things to consider: live or changing data Freely available or restricted? Long term post project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram showing that Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. many applications can sit on top of it. Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank‘ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non academic repositories, such as flickr, preservation issues, if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using DCC Curation lifecycle model, these are the out ring. But not all of it, and do things not on the ring, such as throw data away.

Shows picture of an academics office. Data is stored in random envelops.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be one), experts in database technologies, ensure systems are in place, format migration. A ‘translation service’ between Researchers and computer experts.

Lots of facts about this, based on the research. Often fallen in to the role by accident, often started out as a researcher. Domain (maths, chemistry) related or Computer training. Informatics Skills: well advanced in biology and chemistry. Majority have a further degree. Need People skills. Rapidly involving area.

Data Librarians

Only a handful in the UK. specific skills in data care, curation. Bottom half (or bottom two thirds) of DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

train researchers to be more data aware

Pressing issue inform researchers on data principles, eg ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough, datasets are what are needed to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some are (usual suspects)

Funder may own Data, Employers may own data. No one seems sure. Several entities may own the data.

In some areas of research journals play role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers [cjk: no! they should have nothing to do with this, the actors are Universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

Something problem, not easy to tackle. Would be a mistake for institutions to wait. The Call is designed to better understand how its data management facility can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

the Call will have an Advisory Group.

‘Exemplar projects and studies designed to help establish partnership between researchers, institutions, research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make business case internally, institutional planning tools (building on DAT, DRAMBORA, and costing tools). Workshop 1oth June DCC to review progress/outcomes of DAT project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and bring the various threads together.

The library research data pyramid. Things at the bottom need to be in place before things higher up. At the bottom, training in library (confidence), Library schools. Then develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, Lead on local policy. At the very top ‘influence national data agenda’.


An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same posistion, they are aware of the (Opene) Research Data developments, but nothing has yet happened at their university, nor academics queuing up to demand such a service. This is a good thing and it needs to happen, and Universities need to start acting now. But there are many preasures on University resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It’s random, with no thought, many umms, a pointless ‘one more thing’ and basically wrong. laugh at it here]

Research in the Open: How Mandates Work in Practice

Today I’m at the RSP/RIN Research in the Open: How Mandates Work in Practice at the impressive RIBA 66 Portland Place.

Slides can be found here (not available when I made this post, as semi excuse as to why my notes miss so much). These are rough notes, which I’m making available in case others are interested, apologies for mistakes and don’t take it as gospel!

After an introduction by Stéphane Goldstein, kicking off with Robert Kiley from the Wellcome Trust.

Wellcome trust mandate since 2006, anyone receiving funding from Wellcome Trust must deposit in to pubmed, now uk pubmed central. SHERPA Juliet lists 48 funder policies/mandates.

Two routes to complying to their mandate: (route 1) publisher in open access / hybrid journal (preferred), Wellcome will normally pay any associated fees. However when paying the publisher, they expect a certain level of service in return (deposited on behalf of author, final version available at time of publication, certain level of re-use. Route 2 Author self-archives author’s final version within 6 months of publication. It was stressed that the first option is very much preferred.

“Publication costs are legitimate research costs”. To fund Open Access fees for ALL research they fund would, they estimate. take up 1-2% of their budget.

Risk of ‘Double payment’ (author fees and subscriptions). OUP have a good model here.

Still to do:

  • Improve compliance (roughly 33%, significant increase after letters to VCs),
  • improve mechanisms (Elsevier introduced OA workflow which resulted in significant increase in deposits, but funders/institutions/publishers all need to play a part here),
  • Clarifying Publishers OA Policies  (and re-use rights, didn’t catch this).

Research Councils UK – Astrid Wissenburg, ESRC

Starts of by talking about drivers for OA in the RC. Value for money, ensuring research is used, infrastructure and more.

Principles: Accessible, Quality (peer review), preservation (she’s moving through the slides fast)

April 2009 study in to OA impact, provides options for RC to consider.Findings

  • Significant shift in favour of OA over last decade
  • Knowledge/awareness still limited. Confusion
  • Engagement with OA varies by subject area.
  • Too early to access impact of RCs policies.
  • Drivers
    • Not speed of dissemination
    • principles of free access
    • co-authors views are a big influence (mandates less so!)
    • some evidence that OA increases citation just after publication
    • limited compliance monitoring by finders
    • concern about impact of learned societies (but no evidence of libraries cancelling journals)
    • little evidence of use by non-researchers (CJK comment: interesting, I would imagine this may grow, wish newspapers would link/cite journal articles)

Both models (oa journals/repositories) supported by RCs, level playing field.

Pay to publish findings: limited use, barriers, costs, awareness, not RAE. would lead to redistribution of costs from non-academic to academic areas.

OA Deposit (repositories): from grant application from 1 Oct 2006, so a three year project starting then will only be finished in Autumn 2009. Acknowledges embargos but ‘at earliest opportunity’.

75% researchers were not aware of the mandate. diversity across subjects. ‘In general, no active deposit’.

A slide showing % of awareness broken down by RC, interesting.

From the highest level RCs are committed to supporting OA (this will increase). But change takes time.

Some issues: what do to with embargo periods, difficult for funders to manage (are there incentives we could use), depends on existence of repositories, multiple deposit options confusing to researchers, awareness/understanding.

UKPubMed Central – Paul Davey, Engagement Manager, UKPubMed Central

Aims to become the information resource of choice for biomedical sector.

Principles: freely available, added to UK pubmed central, freely copied and reused.

Departmental of Health have clear policy to make research freely available.

95% of papers submitted are taken care of (deposited?) by the authors. only 0.5% submitted by academics (PIs/colleagues)

1.6 milion papers in uk pubmed central. 366 thousand downloads last month.

Core benefits: transparency, cutting down duplication, greater visibility.

Text mining, grabbing key terms from an article  (a little like  OpenCalais does)

Mentions EBI’s CiteXplore, encouraging academics to ink to other research.

Pubmed UK includes funding/grant facilities search. Can link articles to funding grants.

In short, backing from key funders, will make researchers more efficient, researcher’s visibility will increase.

Beta out in the Autumn, new site in Jan 2012.


Worried about text mining, need for humans to moderate this. response: Limited finding in this area so human intervention also limited. really need specialist to answer this fully.

Question about increasing visibility of UK pubmed central, referring to Google, response: getting indexed by Google very much part of increasing visibility.

Question about Canadian ‘pubmed central’, response confirms this and mentions talk of a European pubmed central. Potential of European funders using UK pubmed central as a place to deposit research (like everything here, not sure if I’ve noted this right).

PEER – Pioneering collaboration between publishers, repositories and researchers – Julia Wallace

Funded by EC, not a ‘publisher project’.

Three key stages of publication: NISO Author’s original, NISO Accepted Manuscript, NISO version of record.

Starts of talking about the project, interesting stuff but failed to take notes.

From the website:

PEER (Publishing and the Ecology of European Research), supported by the EC eContentplus programme, will investigate the effects of the large-scale, systematic depositing of authors’ final peer-reviewed manuscripts (so called Green Open Access or stage-two research output) on reader access, author visibility, and journal viability, as well as on the broader ecology of European research. The project is a collaboration between publishers, repositories and researchers and will last from 2008 to 2011.

Seven members: including a publisher group, university, funders etc. Various publishers involved, big and small and about six European repositories taking part.

Approach / content:

  • Publishers contribute 300 journals, plus control
  • Maximises deposit and access in participating repositories
  • 50% publisher submitted 50% author submitted.
  • Good quality, range of impact factors. Publishers set embargo periods, up to 36 months.

Publishers will deposit articles in to the repositories via a central depot for their 50% of articles submitted (50% fulltext, metadata for the remaining 50%). Publishers will invite authors to deposit for the ‘author’ 50%

Technical: using PDFA-1 (where possible) and SWORD

Three strands: Behaviour, Usage (looking at raw log files), Economic. Also looking at Model Development (the three strands will look in to this).

Question about why they chose PDF (not very good for text mining). A: wide range of subjects and publishers means that PDF the best fit

Economic Implications of Alternative Scholarly Publishing Models, also Loughborough University’s Institutional Mandate – Charles Oppenheim, Loughborough University

‘Houghton report’ looks at costs and benefits of scholarly publishing.

Link to report

Link to main website and models

  • Massive savings by using OA, UK would benefit from this.
  • Savings include: quicker searching, less negotiations, savings not just in library budgets
  • 2,300 activity items costed.
  • This report currently final word in economics of OA.
  • Charles Talks about the various methods and work involved in producng this report.
  • a 5% incease in accessibility would lead to savings (or extra money to spend) in research/he/RCs
  • Hard to compare UK toll/open access publishing costs as one pays for UK access to content from across the world, the other pays for UK content to be world wide accessible.
  • Keen to role this out to other countires
  • Publishers response to report: furious!

Now for something completly different: Loughborough approve a mandae a few months a go, to come in to affect Oct 09. An intergral part of academic personal research plans (only those research items in the IR will be considered at the review). Now have over 4,000 items

Lunch and audioboo

During lunch I did an experiment using audioboo. Would I be able to summarise the morning, on the fly with no planning, in a brief audio recording. The answer, as you can discover, is ‘no’, but fun to try, and made me think of what I had taken in during the morning. Link to audioboo recording. or try the embedded version below.

Institutional Mandates – Paul Ayris, University College London

Paul starts off by shoing a number of Venn diagrams, for example: 90% of its research is available online, 40% available to an NHS hospital

What do UCL academics want

  • as authors: visbility / impact
  • as readers: access
  • delivery 24×7 anywhere

UCL madate, a case study:

Looking global is an important part of UCL (for PIs rankings etc). Number of systems in their publication system: Symplectic, IRIS, eprints, data mart (and portico, FIS, HR). Symplectic (or similar tool) and IRIS seem central in this model. Plan to automatically extra metadata from other external places (publication repositories.

How did they get the mandate? Paul spoke at UCLs senate (Academic Board), the agreed: all academic staff should record they own publication on a UCL publication system, and, teaching materials should all be deposited in their eprints systems.

UCL are going to set up a publication board to over see the OA rollout; to advise, monitor, oversee presentation and more.

Next steps: market/exploit, set standards for online publication, to advise on ongoing resource issues in this area. Also, establish processes, Statistics and management information, advise on multimedia, copyright issues.

‘Open Access is the natural way for a global university to achieve its objectives’

Question about blurring the line between dissemination and publication, and that some of UCLs aims seem more fitting of ‘publication’. Paul agrees, still trying to figure this out.

HEFCE – Paul Hubbard, Head of Research Policy, HEFCE

Policy: Research is a process which leads to insights for sharing. So Scholarly Communication matters to HEFCE. Prompt and accessible publishing is essential for a world class research system.

Supporting research: JISC, RIN, Programmes to support national research libraries (UKRR), UKRDS. Mentions Boston Spa (BL) document centre as an example of our world class sharing.

Internet opens up new ways of scholarly communication and sharing.

What do HEFCE want to see:

  • Widest and earliest dissemination of public research.
  • IP shared effectively with the people best placed to exploit it (CJK comment, i don’t think it is publishers!)

Committed to: UK maintaining world leading research, funding that fosters autonomy and dynamism, research quality assessment regime that supports rather than inhibits new developments.

As we move forward, things may be unclear those HEIs with repositories will be at an advantage.

Paul finishes up with a personal view of scholarly communications in 2030. He sees to forms of communication: discussion (building up ideas), and writing up a formal firm idea/conclusion based on these. HEFCE supports – through the likes of JISC – a range of tools and systems to enable this. (sorry that was an awful summary, he said much more than that!).

Answered a question as to why IRs, HEIs are the places to administrate/manage. Websites people go to see research for a particular subject need to be overlay systems harvesting from IRs.

[hmm, does ‘university requirement’ sound better than mandate?]

Institutional Policies and Processes for Mandate Compliance – Bill Hubbard, SHERPA, University of Nottingham

99.9% of academics do not object to Open Access, but need to show it will not change how they work. Librarians going to be much more part of the research process. Most people (including most publishers) are in favour of Open Access.

Other pressures on the systems, lack of peer reviewers, rising prices of journals, growing need for different forms of scholarly communications (e-lab books, multimedia), public demand for highest value for money ‘public should get what they pay for’,

Not if we change, but how we change. Research has to change seamlessly. Mandates have a value-added basis with fast delivery of benefits. Need integrated processes, need integrated support (we don’t want researchers to hear different messages from their Uni, funder, publisher, etc).

Authors need to know ‘what do i meed to do’. Need to make it less confusing, need to make it clear when they can get help.

First step compliance: how can funders improve compliance, how can authors be supported?

All 1994 and Russel Group now have IR (Reading, I think, just setting one up now).

Compliance for mandates makes it better for us admin/support staff, and for the University generally.

Institutions need a compliance officer (perhaps repository manager). Funders need to ensure these people have the information they need. Share compliance information

I’ve missed so much of Bill’s talk here, he moves fast (and passionately) and lots of points.

After Bill’s talk there was a panel session.


Finally check out some of the useful tweets from the day. (Twitter search only goes back about a month or so, so this link may not work after a certain date). Jim Richardson also created a permanent copy with the (new to me) webcitation website.


With such dodgy note taking I feel some concise summary is in order!

  • Mandates are happening, by Universities and by Funders.
  • HEFCE want research to be accessible to as many as possible as quickly as possible.  Coming from HEFCE, this holds a lot of weight.
  • Funders (Research Councils / Wellcome) put mandates in place several years a go. They have not sat back and said ‘job done’. They are building on this foundation. How can they check? How can they enforce/encourage? How can they assist? How can they automate? How can they work with publishers and HE to share this information? Expect more to come in this area.
  • Wellcome Trust prefers submission to Open Access Journals rather than author depositing in to a repository at a later date.
  • HE Mandates are coming, we alreay have a few in the UK. Making them an intergral part of an academic’s review seems like a good idea. My opinion is that this is reasonable – even if there are those who disagree – surely an employer can (and does in every other sector) ask for a record of what an employee has been working on, and a copy of the end output, i.e. the full text in an IR.
  • The report ‘Economic implications of alternative scholarly publishing models : exploring the costs and benefits. JISC EI-ASPM Project‘ is a thourough comprehensive look at the economic costs of Open Access and new forms of Scholorarly Communications.
  • I think we are starting to see the larger Universities developing sophisticated network of systems to manage research/publications/OA/research-funding. See slide 10 of Paul Ayris presentation, and this article about Imperial’s setup as two examples.
  • It makes sense to share information (between IT systems) between funders, HE and publishers. Examples: Funders sharing (bibliographic) information to a University about publications from its researchers, Universities (or publishers) passing information to funders linking publications to funding (or even the other way round?).
  • This is an area which is still developing, fast, and will of course involve a culture change. Publishers seem unsure how to handle this new world.

Library search/discovery apps : intro

There’s a lot of talk in the Library world about ‘next generation catalogues’, library search tools and ‘discovery’. There’s good reason for this talk, in this domain the world has been turned on its head.

History in a nutshell:

  • The card catalogue became the online catalogue, the online catalogue let users search for physical items within the Library.
  • Journals became online journals. Libraries needed to let users find the online journals they subscribed to through various large and small publishers and hosts. They built simple in-house databases (containing journal titles and links to their homepages), or added them to the catalogue, or used a third party web based tool. As the number of e-journals grew, most ended up using the last option, a third party tool (which could offer other services, such as link resolving, and do much of the heavy lifting with managing a knowledge base).
  • But users wanted one place to search. Quite understandable. If you are after a journal, why should you look in one place to see if there is a physical copy, and another place if they had access to it online. Same with books/ebooks.
  • So libraries started to try and find ways to add online content to catalogue systems in bulk (which weren’t really designed for this). Aquabrowser : Uni Sussex beta catalogue

The online catalogues (OPAC) were simple web interfaces supplied with the much larger Library management system (ILS or LMS) which ran the backend the public never saw. These were nearly always slow, ugly, unloved and not very useful.

A couple of years a go(ish), we saw the birth of the next generation catalogue, or search and discovery tools. I could list them, but the Disruptive Technology Library Jester does an excellent job here. I strongly suggest you take a look.

Personally, I think I first heard about Aquabrowser. At the time a new OPAC which was miles ahead of those supplied with Library systems and was (I think) unique as a web catalogue interface not associated with a particular system, and shock, not from an established Library Company. The second system I heard about was probably Primo from Ex Libris. At first not understanding what it was: It sounds like Metalib (another product from the same company which cross-searches various e-resource), is Primo replacing it? Or replacing the OPAC? It took a while to appreciate that this was something that sat on top of the rest. From then, VuFind, LibraryFind and more.

While some where traditional commercial products (Primo, Encore, Aquabrowser), many more were open source solutions, a number of which developed at American Libraries. Often built on common (and modern) technology stacks such as Apache solr/Lucene, Drupal, php/java, mysql/postgres etc.Primo : British Library

In the last year or so a number of major Libraries have started to use one of these ‘Discovery Systems’ for example: the BL and Oxford using Primo, National Libraries of Scotland & Wales and Harvard have purchased Aquabrowser and the LSE is trying VuFind. At Sussex (where I work) we have purchased and implemented Aquabrowser. We’ve added data enrichments such as table of contents (searchable and visible on records), book covers and the ability to tag and review items (tag/reviewing has been removed for various reasons) .

It would be a mistake to put all of these in to one basket. Some focus on being a OPAC replacement, others on being a unified search tool, searching both local and online items. Some focus on social tools, tagging & reviewing. Some work out the box others are just a set of components which a Library can sow together, and some are ‘SaaS’.

It’s an area that is fast changing. Just recently an established Library web app Company announced a forthcoming product called ‘Summon’, which takes searching a library’s online content a step further.

So what do libraries go for, it’s not just potentially backing the wrong horse, but backing the wrong horse when everyone one else had moved on to dog racing!

And within all this it is important to remember ‘what do users actually want’. From the conversations and articles I’ve read, they want a Google search box, but one which returns results from trusted sources and academic content. Whether they are looking for a specific book, specific journal, a reference/citation, or one/many keywords. And not just one which searches the metadata, but one which brings back results based on the full text of items as well. There are some that worry that too many results are confusing. As Google proves, an intelligent ranking system makes the number of results irrelevant.

Setting up (and even reviewing) most of these systems take time, and if users start to add data (tags, reviews) to one system, then changing could cause problems (so should we be using third party tag/rating/review systems?).

You may be interested in some other articles I’ve written around this:

There’s a lot talk about discovery tools, but what sort to go for, who to back? And many issues have yet to be resolved. I’m come on to those next…

Free e-books online via University of Pittsburgh Press

The University of Pittsburgh Press has put nearly 500 out of print books online and Open Access. You can access them via their Digital Editions website.  This is excellent news, making work which could be lost openly available to all.

iversity of Pittsburgh Press Digital Editions - Open Access free ebooks

University of Pittsburgh Press Digital Editions - Open Access free ebooks

For years there has been a movement towards making Journal articles Open Access, i.e. publicly available. However some subjects (especially in the Humanities) publish much of their research in books, not journals. Letting the world gain from the (normally publicly funded) research contained within books is more complex, and it’s not an area I fully understand. The author normally receives royalties from book sales. However I understand this are normally very small 99% of the time, and normally tail down to tiny amounts after a few years. What if funders and Universities demanded that any book written with their money (or during their employment) must be made publicly available after x number of years (let’s say 10 years)? Academics and Publishers would not welcome the move, but would still allow a window where they can gain revenue, and if this became the norm it would be something they just have to accept. Meanwhile, once open access, the book becomes much easier to archive and preserve, and ensure the knowledge is available to all in the long term. Just a thought. (more…)

ircount : Repository Record Statistics

I’ve updated Repository Record  Statistics (which I refer to as ircount).

Some key points:

  • Repository names with non-standard characters were not displaying properly. They now should, though there are some parts of the site where I have not updated the code yet. Many Russian IRs are also not showing the correct name, even though it is correct in the database, which I will look in to.
  • If a repository changes its name (in ROAR) then the new name should be shown in ircount
  • When looking at the details for one Repository, you can now compare it with any other repository, not just those from the same country. You’ll see two drop down boxes, one for those from the same country (for convenience) and one listing all repositories.
  • I’ve removed the Full Text numbers, which only appeared for some repositories. They were inaccurate and not very useful.
  • There’s now a link on the homepage to ircount news (which is actually just posts on this blog which have been tagged ‘ircount’, like this one).

The code is now in a subversion (svn) repository, my first time using such a system, which should help me keep track of changes. I can make it available to others if anyone is interested.

The future

There are other changes planned…. At the moment this is all in one big database table. Each week it collects lots of info about repositories from ROAR, including their name etc, and saves one row per repository (for each week). This means lots of infomation (such as a repository’s name) is duplicated each week. It also means that when selecting the name to be displayed you have to be careful to select the latest entry for the repository in question (something which hit me badly when trying to fix the name problem, and something I still haven’t got around to fixing for all SQL queries). I’m working on improving the back-end design.

I’m also thinking about periodically connecting to the OAI-PMH interface for each IR to collect certain details directly. Though this will be quite a change of direction, at the moment, ircount’s philosophy has been simply collecting from ROAR and reporting on what it gets back. Do I want to go down the road and loose this simple model.

I’m also pondering on ways to keep track of the number of full text items in each IR (you can see some initall thoughts on this here), though this will open a big can of worms.

The stats table which shows growth of repositories (based on number of records) over time for a given country (this one), could do with some improvements. RSS feeds for various things are also on the to do list.

Technical details

How did I fix these things, and add the new features. I made these changes a few months a go (on a test area) and the exact details have already slipped my mind.

Funny characters in Repository names

Any name with a non-standard character (a-z, A-Z, 0-9) has always displayed as garbage, this became more of an issue when I expanded ircount to include repositories world-wide. Below you can see an example from a Swedish repository: Växjö University.Example of incorrect name

The script which collects the data each week outputs its collected data in to a text file as well as writing it all to the database (really just as a backup and for debugging). The names displayed in the text file were the same messed up format, just as they were on the website. As the text file had nothing to do with the MySQL database or PHP front-end, I concluded the problem was with the actually grabbing the data from the ROAR website, which uses PERL’s LWP::simple.

This was a red herring. In the end I knocked up a script which just collected the file and output it to a textfile, and all worked fine. I gradually added code from the main script and it all was still working fine. So why did not main script not work.

In the end, I can’t remember the details but I think starting a new log file, or using a different filename (stupid I know) and other random things made the funny character problem go away.  Which meant the problem was now with the DB after all, which made more sense.

In the end I found this excellent page

Converting the database tables/fields to utf8_general_ci, and adding a couple of lines of code to the perl script to ensure the connection to the db was in utf (both outlined in the webpage linked to above) sorted this out. The final step was ensuring that the front end user interface selected the most recent repository name for a given IR, as older entries in the db would still have the incorrect name.

Country names

When I started collecting data for all repositories around the world I needed a way for users to be able to select a particular country. ROAR provided two digit codes, but how to display proper country names. I found a soultion using a simple to use PHP library detailed here (note it’s on two pages).

Known Bugs

  • Some Repository names still displayed wrong, e.g Russian names. Anyone know why?!
  • Table may show old names, and sometimes show two seperate rows if a repository has changed it’s name in ROAR
  • When comparing repositories, sometimes the graph does not display, especially when comparing four. This is due to the amount of data being passed to Google Charts is exceeding the maximum size of a URL (can’t remember of the top of my head but it is about 2,000 characters). This should be fixable as there is no need to pass so much data to Google charts, just need to be a little more intelligent in preparing the URL.
  • Some HTML tables are not displaying correctly, some border lines are missing, almost at random.

Linked data & RDF : draft notes for comment

I started to try and write an email about Linked Data, which I was planning to send to some staff in Library I work in.

I felt that it was a topic that will greatly impact on the Library / Information management world and wanted to sow the seeds until I could do something more. However after defining linked data, I felt I should also mention RDF and the Semantic Web, and also try and define these. And then in a nutshell try and explain what these were all about and why they are good. And then add some links.

Quickly it became the ultimate email that no one would read, would confuse the hell out of most people and was just a bit too random to come out of the blue.

So I think I will turn it in to a talk instead.

This will be quite a challenge, I’ve read a bit about all this, and get the general gist of it all, though don’t think I have a firm foundation of what RDF should be used for or how exactly it should be structured, and what bits of all this fall under the term ‘Semantic Web’. There’s a big difference to hearing/reading various introductions to a subject and being able to talk about it with confidence.

Anyway, below was the draft of the email, with various links and explanations, which turns in to just a list of links and notes as the realisation pops in that this will never be coherent enough to send.

If you know about this stuff (and good chance you know more than me), please comment on anything you think I should add, change, or perhaps where I have got things wrong.

I will probably regret putting a quick draft email, mistakes and all, on the web, but nevertheless, here is the email:

You have probably heard the phase ‘linked data’, and if not, you will do in the future.

It’s used in conjunction with RDF (a data format) and the Semantic Web.

“Linked Data is a term used to describe a method of exposing, sharing, and connecting data on the Web via dereferenceable URIs.”

It’s the idea about web pages (URLs) describing specific things. E.g. a page on our Repository describes (is about) a particular article. A page on the BBC describes the program ‘Dr Who’, another describes the Today program. A page on our new reading list system describes a list.

I know what these pages contain because I can look at them. But what if this was formalised so that systems (computers) could make use of these things, and extract information and links from within them?

Two companies prominent in this area are Talis and the BBC. The University of Southampton is also doing work and research around the semantic web (no coincidence Tim Berners-Lee worked there).

The following link is a diagram which is being used a lot in presentations and articles (and there’s a good chance you will see it crop up in the future):

Why should we care about all this?

Look at the image: Pubmed, IEEE, RAE, eprints, BBC and more. These are services we – and out users – use. This isn’t some distant technology thing, it’s happening in our domain.

Why should we care? (part 2)

Because when I searched for RDF I came across a result on the Department for Innovation, Universities and Skills (the one that funds us)…  It was a page called ‘library2.0’ ( that sounds relevant)…  The first link was to *Sussex’s* Resource list system… we’re already part of this.

(for those confused, The DIUS site simply shows things bookmarked at Delicious matching certain tags, someone from Eduserv while at UKSG had bookmarked our Resource List system as it was being used as an example of RDF/linked data, as UKSG was quite recent this link is at the top)

As I mention our Resource lists…

For each of our lists on Talis Aspire, there is the human readable version (html) and computer readable versions, eg



In JSON (don’t worry what these mean, they are just computer readable versions of the same info):

And in the same way for each BBC program/episode/series/person there is a webpage, and also a computer readable – rdf – version of the page page , e.g. The Today program:

There is tons of stuff I could point to about the BBC effort, here are some good introductions

I saw Tom Scott give this presentation at the  ‘confused’ Open Knowledge Conference:

A Paper on BBC/DBpedia with good introduction on background to /programmes



RDF is just a way of asserting facts (knowledge)

“RDF is designed to represent knowledge in a distributed world.”


“RDF is very simple. It is no more than a way to express and process a series of simple assertions. For example: This article is authored by Uche Ogbuji. This is called a statement in RDF and has three structural parts: a subject (“this article”), a predicate (“is authored by”), and an object (“Uche Ogbuji”). “



Imagine a book (referred by its ISBN).

Our catalogue asserts who wrote, and also asserts what copies we have

Amazon assert what price they offer for that book.

OCLC assert what other ISBNs are related to that item

Librarything assert what tags have been given for that item.

All these assertions are distributed across the web

But what if one system could use them all to display relevant information on one page, and created appropriate links to other pages.


Things to include…

(created for talis aspire – i think)

Tower and the cloud (not really linked data):

Semantic web for the working ontologist (we have a copy):|library/marc/talis|991152


Fantastic YouTube video from Davos by Tom Ilube:

Any pointers to good descriptions/explanations of RDF? (I think this is the most difficult area). Clearly this is mainly a set of links, and not a talk, but I will probably use this as a basis of what I will try and say.
All comment welcome.

Academic discovery and library catalogues

A slightly disjointed post. The useful website by Marshall Breeding announced the eXtensibleCataloge project has just released a number of webcasts in preparation for their software release later this year.

eXtensible Cataloge webcast screenshot

eXtensible Cataloge webcast screenshot

I’ve come across this project before, and put a little simply, is in the same field as the Next Generation Catalogues such as Primo, Aquabrowser and VuFind.

However where these are discreet packages, this seems like a more flexible set of tools and modules, and a framework which libraries can build on. I didn’t manage to watch all the screencasts but the 30mins or so that I watched was informative.

As an aside, while the screen consisted of a powerpoint presentation the presenter appeared in a small box at the bottom, and watching him speak oddly made listening to what was being said more easily digestible (or perhaps just gave my eyes something to focus on!).

This looks really interesting, and will be good to see how this compares to other offerings, certainly looks like they are taking a different angle, and perhaps the biggest question will be how much time will it take to configure such a flexible and powerful setup (especially with the small amount of technical staff found in most UK HE Libraries). Anyway, worth checking out, using various metadata standards and using – amongst others – SOLR and Drupal as a base.

While on the eXtensible Cataloge website I came across a link to this blog post from Alex Golub (Rex) an ‘adjunct assistant professor of anthropology at the University of Hawai’i Manoa‘. It talks about a typical day as he Discovers and evaluates research and learn about others in the same academic dicipline. Again, well worth a read.

It starts off with an email from recommending a particular book. He notes:

In exchange for giving too much of my money, I’ve trained it (or its trained me?) to tell me how to make it make me give it more money in exchange for books.

It doesn’t take a genius to see that the library catalogue could potentially offer a similar service. A Library Catalogue would be well placed to build up a history of what you have borrowed and produce a list of recommend items. But would this only suggest items your library has, and would it be limited by the relatively small user base; if there are only a few academics/researchers with a similar interest then this will be of limited use in producing books you may be interested in (i.e. serendipity).

This is where the JISC TILE project comes in (and I blogged about an event I attended about TILE a few months a go). If we could share this data at a national level (for example) we could create far more useful services, in this case it could draw on the borrowing habits of many researchers in the same field, and could – if you wish – recommend books not yet in your own Library. As well as the TILE project, Ex Libris have announced a new product called bx which sounds like it will do a similar thing with journals.

Another nugget from the blog post mentioned above is that he uses the recommendations & reviews in Amazon as a way to evaluate the book and its author:

So I click on the link and read more reviews, from authors whose work I know and respect.

I’ve been discussing with colleagues the merits and issues with allowing user reviews in an academic library catalogue. I hadn’t considered a use such as this. Local reviews would have been of limited use as other authors in the same field that a researcher respects (as he describes in the quote) are likely to be based at other institutions (and we would be naive to expect such a flood of reviews to a local system that every book had a number of good reviews). Again, maybe a more centralised review system is needed for academic libraries, though preferably not one which requires licensing from a third party at some expense!

And briefly, while we are talking about library catalogues. I see that the British Libraries ‘beta catalogue‘ (running on Primo) has Tag functionality out the box, and I’m pleased to see they have been this quite a central feature, with a ‘tag’ link right about the main search box. This link takes you to a list of the most frequently used and most recently added tags. Creating a new way to browse and discover items. What I love about the Folksonomy approach is that so often users find ways of using tags in ways you would never expect. For example, would a cataloger think to record an item in a museum as ‘over engineered‘? (I think the answer would be no, but it occurs to me I know nothing regarding museum cataloging standards). Could finding examples of over engineered items be useful for someone? of course! (from the Brooklyn Museum online collections, found via Mike Ellis’ excellent Electronic Museum blog). The Library of Congress on flickr pilot springs to mind as well.

So I guess to conclude all this, the quest continues in how we can ensure libraries (and their online catalogues and other systems) provide researchers and users with what they want, and use technology to enable them to discover items that in the past they might have missed.

JISC, Monitter and DIUS (Department of Innovation, Universities and Skills)

Earlier this week the Jisc 2009 Conference went ahead. A one day summary of where things are going in Jisc-land.

Like last year, I got a good feel of the day via twitter. I used a web app called for real time updates from anyone on twitter who used the tag #jisc09. allows you to track a number (3 columns by default) of words/searches, this works well as these can be usernames, tags or just a phase. I used ‘jisc09’, ‘brighton OR sussex’ and ‘library’.

The keynote talks was also streamed live on the web, the quality was excellent. Check out the main Jisc blog for the event.

Linking to all the different sites, searches and resources on the web after the event wouldn’t do it justice. The usefulness was in the way these were all being published during the day itself, using things like twitter (and bespoke sites) as a discovery mechanism for all these different things being added around the web. I didn’t know who most of the people were, but I was finding their contributions. That’s good.

An email came out the next day about the conference and announcing a guest blog post by David Lammy, the Minister for Higher Education, on the Jisc Blog.

He finished by asking for the conversation to continue, specifically on which is described as ‘a place to open up lines of communication between Ministers and the HE Community’. is set up to allow users to ask ‘famous people questions’. Its homepage suggests that it is designed for any kind of ‘famous person’ though seems to be dominated by UK politicians. Looks interesting but can’t help wonder if there are other sites which could facilitate a ‘discussion’ just as well or better.

The dius section of the site seems quite new. In fact my (rather quickly composed) question was the second to be added to the site. I think the idea of practitioners (yuck, did I just use that word?) raising issues directly with Ministers is an interesting one, and hope it takes off, and at very least, he/they answer the questions!

DIUS do seem to be making an effort to use web2.0 tools. I recently came across this sandbox idea of collecting sites from delicious based on tags, in this example, the library2.0 tag. Interesting stuff, but not specific to HE, it will work for any tag and really just creates a nice view of the latest items bookmarked with the tag in question. The code for it is here.

In any case, it is good to see a government department trying out such tools and also releasing the code under the GPL (even 10 Downing street’s flickr stream is under crown copyright, and don’t get me started on OS maps and Royal Mail postcodes). I’m reminded of the team who, when they found out there was a ‘hack the government‘ day to mashup and improve government web services, decided to join in.

DIUS homepage with web2.0 tools

DIUS homepage with web2.0 tools

On the DIUS hompage, just below the fold, they have a smart looking selection of tools, nice to see this stuff here, and so prominent, though the Netvibes link to me just a holding page when I tried it.

Finally, they have set up a blog on the jiscinvolve (WordPress MU) site. At the time of writing it has a few blogs posts which are one line questions, and has a couple of (good) responses. But I can’t help feeling that these sites need something more if they are to work. At the moment they are just there floating in space. How can they integrate these more into the places that HE staff and students inhabit. Perhaps by adding personal touches to the sites would encourage people to take part, for example the blog – a set of questions – is a little dry, it needs an introduction, host, and photos.

To sum up, some good stuff going on here, but need to see if it takes off, it must be difficult for a government department to interact with HE and students, the two are very different but they are trying.  I hope it proves useful, if you’re involved in HE why not take a look and leave a comment?


March 25, 2009