ircount : Repository Record Statistics

I’ve updated Repository Record  Statistics (which I refer to as ircount).

Some key points:

  • Repository names with non-standard characters were not displaying properly. They now should, though there are some parts of the site where I have not updated the code yet. Many Russian IRs are also not showing the correct name, even though it is correct in the database, which I will look in to.
  • If a repository changes its name (in ROAR) then the new name should be shown in ircount
  • When looking at the details for one Repository, you can now compare it with any other repository, not just those from the same country. You’ll see two drop down boxes, one for those from the same country (for convenience) and one listing all repositories.
  • I’ve removed the Full Text numbers, which only appeared for some repositories. They were inaccurate and not very useful.
  • There’s now a link on the homepage to ircount news (which is actually just posts on this blog which have been tagged ‘ircount’, like this one).

The code is now in a subversion (svn) repository, my first time using such a system, which should help me keep track of changes. I can make it available to others if anyone is interested.

The future

There are other changes planned…. At the moment this is all in one big database table. Each week it collects lots of info about repositories from ROAR, including their name etc, and saves one row per repository (for each week). This means lots of infomation (such as a repository’s name) is duplicated each week. It also means that when selecting the name to be displayed you have to be careful to select the latest entry for the repository in question (something which hit me badly when trying to fix the name problem, and something I still haven’t got around to fixing for all SQL queries). I’m working on improving the back-end design.

I’m also thinking about periodically connecting to the OAI-PMH interface for each IR to collect certain details directly. Though this will be quite a change of direction, at the moment, ircount’s philosophy has been simply collecting from ROAR and reporting on what it gets back. Do I want to go down the road and loose this simple model.

I’m also pondering on ways to keep track of the number of full text items in each IR (you can see some initall thoughts on this here), though this will open a big can of worms.

The stats table which shows growth of repositories (based on number of records) over time for a given country (this one), could do with some improvements. RSS feeds for various things are also on the to do list.

Technical details

How did I fix these things, and add the new features. I made these changes a few months a go (on a test area) and the exact details have already slipped my mind.

Funny characters in Repository names

Any name with a non-standard character (a-z, A-Z, 0-9) has always displayed as garbage, this became more of an issue when I expanded ircount to include repositories world-wide. Below you can see an example from a Swedish repository: Växjö University.Example of incorrect name

The script which collects the data each week outputs its collected data in to a text file as well as writing it all to the database (really just as a backup and for debugging). The names displayed in the text file were the same messed up format, just as they were on the website. As the text file had nothing to do with the MySQL database or PHP front-end, I concluded the problem was with the actually grabbing the data from the ROAR website, which uses PERL’s LWP::simple.

This was a red herring. In the end I knocked up a script which just collected the file and output it to a textfile, and all worked fine. I gradually added code from the main script and it all was still working fine. So why did not main script not work.

In the end, I can’t remember the details but I think starting a new log file, or using a different filename (stupid I know) and other random things made the funny character problem go away.  Which meant the problem was now with the DB after all, which made more sense.

In the end I found this excellent page http://www.gyford.com/phil/writing/2008/04/25/utf8_mysql_perl.php

Converting the database tables/fields to utf8_general_ci, and adding a couple of lines of code to the perl script to ensure the connection to the db was in utf (both outlined in the webpage linked to above) sorted this out. The final step was ensuring that the front end user interface selected the most recent repository name for a given IR, as older entries in the db would still have the incorrect name.

Country names

When I started collecting data for all repositories around the world I needed a way for users to be able to select a particular country. ROAR provided two digit codes, but how to display proper country names. I found a soultion using a simple to use PHP library detailed here (note it’s on two pages).

Known Bugs

  • Some Repository names still displayed wrong, e.g Russian names. Anyone know why?!
  • Table may show old names, and sometimes show two seperate rows if a repository has changed it’s name in ROAR
  • When comparing repositories, sometimes the graph does not display, especially when comparing four. This is due to the amount of data being passed to Google Charts is exceeding the maximum size of a URL (can’t remember of the top of my head but it is about 2,000 characters). This should be fixable as there is no need to pass so much data to Google charts, just need to be a little more intelligent in preparing the URL.
  • Some HTML tables are not displaying correctly, some border lines are missing, almost at random.