Ezproxy stats

Overview

This is a simple Perl script which produces usage statistics from an ezproxy log file. It can show usage by resource and usage by journal. The latter requires you to provide a mapping file between journal names and journal URLs.

The script is crude and rudimentary. The information it provides can be useful as rough usage statistics for electronic resources and journals; however, the figures it produces should not be taken as an accurate count.

March 2009 update

A new version of the script has been released. You can use the download link below to get the new version.

There are no major new features, though the code has been improved in parts, and the ‘makemini’ script, which filtered the log file down to only the lines we are interested in, is no longer needed.

You can now also pass a logfile name on the command line:

./stats.pl 0904access_log

If you don’t pass a filename on the command line, it will use the filename specified near the top of the script.

Sample output

These links show the first few lines of the reports run for our organisation. The output is unordered, tab-separated plain text, so it will look ugly in most browsers. These files could be imported into a program such as Excel to order and present the data.

resourcestats.txt (shows e-resource, total access, campus access, external access)

journalstats.txt – sample output for e-journals (shows title, host, total access, campus access, external access)

The host is shown in the e-journal report because a title can sometimes be accessed through more than one provider.

How the script works

The ezproxy log file is formatted as a web server log file.

This script presumes your setup is such that users on campus use ‘ezproxy URLs’ to access
electronic resources, and that your ezproxy is configured to redirect on-campus users
directly to the e-resource’s site (using ‘ExcludeIP’). However, this is not a requirement for the script to work.
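
For illustration, the relevant line in ezproxy.cfg might look something like this (the addresses are made up; check the ExcludeIP documentation for your ezproxy version):

ExcludeIP 139.184.0.0-139.184.255.255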

With the above in mind, the ezproxy log file will include:

  • One log entry for each ‘on campus’ session (after the first hit they access the resource directly)
  • One log entry for each hit/request during an ‘off campus’ session

The log lines generated by external users (i.e. every page/file requested) are of limited use, except for the first hit at the start of their session.

Therefore we need to ignore all hits other than the first for each user accessing a resource. The good news is that the first hit is different from the rest. See the examples below, where the ezproxy server is ‘ezproxy.sussex.ac.uk’.

First request at the start of a user session (for both internal and external users)

217.42.116.61 - - [07/May/2004:12:55:39 +0000] "GET http://ezproxy.sussex.ac.uk:2048/login?url=http://web.lexis-nexis.com/professional HTTP/1.1" 302 421

A request from someone off campus (after the first ‘hit’)

217.42.116.61 - - [07/May/2004:12:55:40 +0000] "GET http://web.lexis-nexis.com:80/professional HTTP/1.1" 302 173

As a rule, the first hit includes the ‘ezproxy.sussex.ac.uk’ prefix on the URL. By only including those lines that are prefixed in this way in our analysis, and hence only including one hit (the first) for each session,
we can count the number of sessions for each journal or provider.

So by filtering the log file to only include the log entries of the first hit of every session, we have one line per session. This can then provide some useful statistics.
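
As a rough sketch of this filtering step in Perl (an illustration only, not the released script; replace ezproxy.sussex.ac.uk with your own proxy host):

#!/usr/bin/perl
use strict;
use warnings;

my $proxyhost = 'ezproxy.sussex.ac.uk';   # your ezproxy server

while (my $line = <>) {
    # Keep only first-hit lines: the request URL carries the ezproxy host prefix,
    # e.g. "GET http://ezproxy.sussex.ac.uk:2048/login?url=..."
    print $line if $line =~ m{"GET https?://\Q$proxyhost\E[:/]};
}

Run as, say, ./firsthits.pl access_log > mini.log (hypothetical filenames), leaving roughly one line per session.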

Resources and Journals

As you can see from the sample output, the script can produce two reports, one for e-resources and one for e-journals.

For e-resources, the script works as follows:

  • It looks at each URL in the logfile.
  • It chops most of the URL off, down to the domain name (this is a little crude).
  • It then looks at the ezproxy.cfg to see if there is a Host or Domain entry which matches.
  • If it finds a match, it counts the hit against the corresponding ‘Title’ (as set in the ezproxy.cfg).
  • Based on the requesting client’s IP address for the log entry, it adds the session to the count for ‘on’ or ‘off’ campus.
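
Putting these steps together, a minimal sketch of the counting logic might look like this (the %titles entry and $localip value are made-up examples; the real stats.pl builds its mapping from the Title/Host/Domain lines in ezproxy.cfg):

#!/usr/bin/perl
use strict;
use warnings;

my %titles  = ('lexis-nexis.com' => 'Lexis Nexis');   # assumed example mapping
my $localip = '139.184';                              # assumed on-campus IP prefix
my (%total, %campus, %external);

while (my $line = <>) {
    # Pull out the client IP and the target domain from the login URL
    next unless $line =~ m{^(\S+) .*"GET \S*url=https?://([^/\s:"]+)};
    my ($ip, $domain) = ($1, $2);
    my $title = $titles{$domain};
    if (!$title && $domain =~ /^[^.]+\.(.+)$/) {
        $title = $titles{$1};     # Domain entries match as suffixes
    }
    $title ||= $domain;           # fall back to the domain as the title
    $total{$title}++;
    index($ip, $localip) == 0 ? $campus{$title}++ : $external{$title}++;
}

print join("\t", $_, $total{$_}, $campus{$_} || 0, $external{$_} || 0), "\n"
    for sort keys %total;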

Note: If you have more than one resource entry in the ezproxy.cfg using the same domain name, the script will count all usage of these resources together, giving one total.

The e-journals report

To produce a report for e-journals, we require a separate file which lists journal titles and their corresponding homepage URLs. Any log entry which matches a URL exactly will count as a hit for that journal title. This works well if the title~URL mapping file can be produced by a database or system which manages your e-journals or creates the links to e-journals on your library webpages.
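
A sketch of the journal-title lookup, assuming the tab-separated journalnames.txt format described in the setup section below:

use strict;
use warnings;

# Load the title ~ URL mapping into a hash keyed on the URL
open my $fh, '<', 'journalnames.txt' or die "journalnames.txt: $!";
my %journal;
while (my $row = <$fh>) {
    chomp $row;
    my ($title, $url) = split /\t/, $row;
    $journal{$url} = $title if $title && $url;
}
close $fh;

# Later, only an exact match counts as a hit for that journal:
# my $title = $journal{$requested_url};   # $requested_url is hypothetical; undef means no hit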

Download

Includes:

  • stats.pl (the main script)
  • sample journalnames.txt – a file that maps your journal titles to their URLs (e.g. Nature [tab] http://www.nature.com/nature/)
  • makemini.sh – a script which is no longer needed; it used to filter out unneeded lines from the source logfile
  • a file which tries to break the stats down by ezproxy groups; this was a bit of an experiment, so feel free to play!

If you use the script, or have any comments, please drop me an email.

Note: you use the script at your own risk; I cannot be held responsible in any way if its use causes a negative effect. It should also be made clear that this script has nothing to do with the University of Sussex or usefulutilities.

Setting up / Configuration

The Perl script should be fairly easy to edit to suit your needs. Follow the steps below to get it working:

1. Decide where to place the script

Decide on a directory to run the stats script within. We shall use /home/bob/ezproxystats/

2. Decide what reports you are going to create

The script can currently create two reports: E-Resources and E-Journals

Electronic resources only (with on/off campus usage figures). This only requires the ezproxy.cfg file, which all ezproxy installations will have. The script uses this file to map domains/hosts to electronic resource names.
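
For reference, a resource entry in a traditional ezproxy.cfg looks something like this (illustrative values):

Title Lexis Nexis Professional
URL http://web.lexis-nexis.com/professional
Domain lexis-nexis.com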

Electronic resources and electronic journal titles. This requires you to create an additional file, which maps journal titles to their homepage URLs.

If you only want the first report (resources), edit stats.pl and set the following config option at the top of the script to 0 (zero):

my $doJournalReport=0;

3. Create the input log file (makemini.sh)

Edit makemini.sh; this shell script will create a file which is more or less one line per session. (Note: with the March 2009 version, as mentioned above, this filtering step is no longer needed; the instructions below remain for older versions.)

The script in its default form will look in a particular directory for *.log.gz files for input. You will almost certainly need to edit this script, or replace it completely (it’s only one line). Once finished editing, you will need to run it and check the result.

The main stats Perl script does not currently restrict by date, so you will need to manually edit mini.log to include only the lines between the dates you want in the report. Another option would be to edit makemini or stats.pl to include such functionality (and then send it to me!).
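
If you would rather not trim the file by hand, a quick Perl one-liner can pull out a single month, for example (the filenames and the date pattern are illustrative; adjust both to your setup):

perl -ne 'print if m{\[\d\d/May/2004}' access_log > mini.log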

4. Create a text file of journal names & URLs (only required for the ‘Journals’ report)

This file looks like the following (the name and URL are separated by a tab, \t):

Advances in Physiology Education	http://advan.physiology.org/
Applied and Environmental Microbiology http://aem.asm.org/
African Affairs http://afraf.oupjournals.org/
Age and Aging http://ageing.oupjournals.org/
American Journal of Psychiatry http://ajp.psychiatryonline.org/

Such a list could be created from:

  • a database of electronic journals; many universities maintain such a database
    to provide e-journal listings (and a search facility) on the web. Popular
    database packages such as MS Access and MySQL support exporting tables to
    text files. [Note: for MS Access, I have found the quickest method is to
    use the ‘analyse in Excel’ option, and then use Excel to save to a text
    file.]
  • if your university adds e-journals to your library catalogue, it may
    be possible to use SQL to extract such information from your library management system.
  • if your university subscribes to an online e-journal management system (EBSCO, SWETS, etc.), this may offer a facility to export titles and homepage URLs.

This is only required if you are running the second report, which shows journal usage. The file takes the form of one journal per line, each line being ‘Journal Title \t Journal URL’.

The file should be saved into the same directory as the script, with the name “journalnames.txt”.

5. Copy your ezproxy.cfg file to the stats directory

Either copy your ezproxy.cfg file to your stats directory, or edit stats.pl to look for ezproxy.cfg in its normal location.

6. Edit stats.pl for your organisation

There are a number of config settings at the top of the script.

The main thing you need to edit is the IP range for your organisation, to enable it to report on on/off campus usage.

The University of Sussex is lucky, as this is quite simple for us: all our IP addresses start with ‘139.184’. This part of the script is therefore very basic; it simply looks for a matching string (in our case 139.184) to determine whether a request comes from on or off campus.

Those with more complex needs, e.g. various ranges of IP addresses, will either need to add their own code or ignore the on/off campus functionality. There is a third, as yet untested, option: use a simple utility to convert the IP addresses in the log file to hostnames (many such tools are out there on the web), then set the $localip variable at the top of the script to your organisation’s domain name instead of part of an IP address.
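
As a rough illustration of the first option, the single string match could be replaced with a list of prefixes (hypothetical code, not part of the released script; the prefixes shown are made up):

my @localprefixes = ('139.184.', '172.16.');   # assumed campus IP ranges

sub is_on_campus {
    my ($ip) = @_;
    for my $prefix (@localprefixes) {
        return 1 if index($ip, $prefix) == 0;
    }
    return 0;
}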

7. Run the script

Finally we can run the script:

./stats.pl

./stats.pl access_log

The former uses the logfile specified at the top of the script; the latter uses the logfile passed on the command line. Note that you can currently only pass it one logfile, and wildcards (e.g. *) do not work (it will only read the first file from the list of files that match the wildcard filter).

The output files can then be imported into a program such as MS Excel.

Issues and problems

The method used to match URLs (from the logfile) with ‘Domains’ and ‘Hosts’ in the ezproxy.cfg file is particularly crude. As such, there will be some URLs to which it is unable to assign a ‘Title’ (as given in the cfg file). In such cases, it will use the domain name as the e-resource’s title in the output.

One of the biggest issues is ensuring we have one and only one log entry per user ‘session’ (session is in quotes as it can mean different things to different people). Please bear this in mind: the numbers quoted in these reports give a rough estimate, but may be wide of the mark, and may be more accurate for some resources than for others.

Regarding the e-journal report: if a user clicks a link on your webpage to a particular e-journal, that should count as a hit. Once on the e-journal site, the user may of course follow links to other journals on the same resource, and their access to these journals will NOT be counted.

Finally, again, if you find this useful at all, please drop me an email.
