Ezproxy Statistics
General Overview
This is a simple Perl script which produces usage statistics from an ezproxy log file.
The script is crude and rudimentary, has various flaws, and the author can't code Perl for toffee. The information this script provides can be useful as rough usage statistics for electronic resources and journals; however, the figures it produces should not be taken as an accurate count.
While the setup below looks time consuming, it really only requires a couple of steps to get a basic report of e-resource usage.
Sample output
These links show the first few lines of the reports run for our organisation. The output here is unordered, tab-separated plain text, so it will look ugly in most browsers. These files could be imported into a program such as MS Excel to order and present the data.
Sample output for Electronic Resources (shows e resource, total access, campus access, external access)
Sample output for E journals (shows title, host, total access, campus access, external access)
Host is shown in the E journal report in case a title can be accessed through more than one provider.
How the script works
The ezproxy log file
The log file is basically in the format of a web logfile.
I presume your ezproxy setup is such that users on campus use 'ezproxy URLs' to access electronic resources, but ezproxy is configured to redirect on-campus users directly to the e-resource's site (using 'ExcludeIP'). This is not a requirement for the script to work, though.
With the above in mind, the ezproxy log file will include:
- One log entry for each ‘on campus’ session (after the first hit they access the resource directly)
- One log entry for each hit/request during an ‘off campus’ session
The log lines generated by external users (i.e. every page/file requested) are of limited use, except for the first hit.
For example, a site which has many images and requires the user to navigate through many pages will generate many more hits (log entries) than one that has fewer images or requires less navigation to reach a journal article. This of course makes it difficult to compare the stats from two different providers, as the one with the larger number of hits does not necessarily have the higher level of usage.
Therefore we need to ignore all hits other than the first for each user accessing a resource. The good news is that the first hit is different to the rest. See the examples below, where the ezproxy server is 'ezproxy.sussex.ac.uk'.
First request (for both internal and external)
217.42.116.61 - - [07/May/2004:12:55:39 +0000] "GET http://ezproxy.sussex.ac.uk:2048/login?url=http://web.lexis-nexis.com/professional HTTP/1.1" 302 421
A request from someone off campus (after the first 'hit')
217.42.116.61 - - [07/May/2004:12:55:40 +0000] "GET http://web.lexis-nexis.com:80/professional HTTP/1.1" 302 173
As a rule, the first hit includes the ‘ezproxy.sussex.ac.uk’ prefix on the URL. By only including those lines that are prefixed in this way in our analysis, and hence only including one hit (the first) for each session, we can count the number of sessions for each journal or provider.
So by filtering the log file to only include the log entries of the first hit of every session, we have one line per session. This can then provide some useful statistics.
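A minimal sketch of this filter, written in Python rather than the script's own Perl, and not taken from stats.pl. It assumes first hits always use the '/login?url=' form shown in the example above; the function name and constants are made up for illustration.

```python
from urllib.parse import urlsplit

# Assumed values: the host is the one from the examples above, and the
# '/login?url=' marker is taken from the sample first-hit line.
PROXY_HOST = "ezproxy.sussex.ac.uk"
LOGIN_MARKER = "/login?url="


def first_hit_target(log_line, proxy_host=PROXY_HOST):
    """Return the target URL if log_line is a first hit, otherwise None."""
    try:
        # The request is the first double-quoted field: "GET <url> HTTP/1.1"
        request = log_line.split('"')[1]
        _method, url, _protocol = request.split(" ", 2)
    except (IndexError, ValueError):
        return None  # malformed or unexpected line
    if urlsplit(url).hostname != proxy_host:
        return None  # a later hit, made against the resource itself
    marker_at = url.find(LOGIN_MARKER)
    if marker_at == -1:
        return None
    # Everything after the marker is the resource the user asked for.
    return url[marker_at + len(LOGIN_MARKER):]
```

Run against the two example lines, this should return the lexis-nexis URL for the first request and None for the second.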
Resources and Journals
As you can see from the sample output, the script can produce two reports, one for e-resources and one for e-journals.
For e-resources, the script works as follows:
- It looks at each URL in the logfile
- It chops most of it off, down to the domain name (this is a little crude)
- It then looks in ezproxy.cfg for a Host or Domain entry which matches.
If it finds a match, it counts this hit against the corresponding 'Title' (as set in the ezproxy.cfg). Based on the requesting client's IP address for this log entry, it will then add to the count for on or off campus.
There is one issue with this: multiple resources may use the same domain name. In that case the script tries to use the names of all the resources which share the domain as one big Title.
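The matching and counting described above can be sketched roughly as follows. This is Python rather than the script's Perl, and it is not the code in stats.pl. It assumes a config made of 'Title', 'Host' and 'Domain' lines (other directives are ignored), joins the titles that share a domain with ' / ', and uses the prefix '139.184.' for on-campus addresses. All function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def bare_host(value):
    """Reduce a URL or host[:port] string to a lower-case host name."""
    if "://" not in value:
        value = "//" + value  # lets urlsplit treat it as a network location
    return (urlsplit(value).hostname or "").lower()


def load_titles(cfg_lines):
    """Map each Host and Domain value to the Title(s) it appears under."""
    hosts, domains = defaultdict(list), defaultdict(list)
    title = None
    for raw in cfg_lines:
        parts = raw.split(None, 1)
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # blank line, comment, or directive without a value
        directive, value = parts[0].lower(), parts[1].strip()
        if directive in ("title", "t"):
            title = value
        elif title and directive in ("host", "h"):
            hosts[bare_host(value)].append(title)
        elif title and directive in ("domain", "d"):
            domains[bare_host(value)].append(title)
    return hosts, domains


def title_for(url, hosts, domains):
    """Pick a title for url; fall back to the host name if nothing matches."""
    host = bare_host(url)
    names = hosts.get(host)
    labels = host.split(".")
    # Try the host itself, then each parent domain, against Domain entries.
    for i in range(len(labels)):
        if names:
            break
        names = domains.get(".".join(labels[i:]))
    # Resources sharing a domain are reported under one combined title.
    return " / ".join(names) if names else host


def count_sessions(hits, hosts, domains, campus_prefix="139.184."):
    """hits is an iterable of (client_ip, target_url), one pair per session."""
    counts = defaultdict(lambda: {"total": 0, "campus": 0, "external": 0})
    for ip, url in hits:
        row = counts[title_for(url, hosts, domains)]
        row["total"] += 1
        row["campus" if ip.startswith(campus_prefix) else "external"] += 1
    return counts
```

Keeping Host and Domain entries in separate tables means a Host entry only matches that exact host, while a Domain entry also matches its sub-domains. That is a design choice of this sketch, not a description of how the script behaves.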
To produce a report for e-journals, we require a separate file which lists Journal Titles and their corresponding homepage URL. Any log entry which matches that URL exactly will count as a hit for that journal title. This can work well if this title~URL mapping file can be produced by a database or system which manages your e-journals or creates links to e-journals on your library webpages.
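As a rough illustration of the exact-match rule (in Python, not the script's Perl), the sketch below counts sessions per journal title. It assumes the first-hit target URLs have already been extracted and that the mapping is held as a dictionary of URL to title; both names are hypothetical.

```python
from collections import Counter


def count_journal_hits(target_urls, journals_by_url):
    """Count sessions whose first-hit URL exactly equals a journal homepage."""
    hits = Counter()
    for url in target_urls:
        title = journals_by_url.get(url)
        if title is not None:
            hits[title] += 1
    return hits
```

Because the comparison is exact, 'http://aem.asm.org/' and 'http://aem.asm.org' would count as different URLs, so the mapping file needs to use the same form as the links on your webpages.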
Download
Includes: makemini.sh (needs editing), stats.pl (main script), a sample journalnames.txt and oclc.sh (a stupidly crude file for counting usage of individual OCLC resources).
If you use the script, or have any comments, please drop me an email
Note: You use the script at your own risk; I cannot be held responsible in any way if its use has a negative effect. It should also be made clear that this script has nothing to do with the University of Sussex or usefulutilities.
Setting up / Configuration
The Perl script should be fairly easy to edit to suit your needs. Follow the steps below to get it working:
1. Decide where to place the script
Decide on a directory to run the stats script within. We shall use /home/bob/ezproxystats/
2. Decide what reports you are going to create
The script can currently create two reports: E-Resources and E-Journals
Electronic Resources only (with on/off campus usage figures). This only requires the ezproxy.cfg file, which all ezproxy installations will have. The script uses this file to map domains/hosts to Electronic Resource names.
If you only want this first report, edit stats.pl and comment out the following lines by putting a hash at the beginning of the line:
open (JOURNALNAMES, "journalnames.txt") or die "can not open journal names \n";
the above line is on or near line 77
printjournalscampustext();
is near line 261
Electronic Resources and Electronic Journal titles reports. This requires you to create an additional file, which maps journal titles to each title's homepage URL.
This file looks like the following (the name and URL are separated by a tab, \t):
Advances in Physiology Education http://advan.physiology.org/
Applied and Environmental Microbiology http://aem.asm.org/
African Affairs http://afraf.oupjournals.org/
Age and Aging http://ageing.oupjournals.org/
American Journal of Psychiatry http://ajp.psychiatryonline.org/
Such a list could be created from:
- A database of electronic journals. Many Universities maintain such a database for providing e-journal listings (and a search facility) on the web. Popular database packages such as MS Access and MySQL support exporting tables out to text files. [note: for MS Access, I have found the quickest method is to use the 'analyse in Excel' option, and then use Excel to save to a text file]
- If your University adds ejournals to your library catalogue, it may be possible to use SQL to extract such information from your Library management system.
- If your University subscribes to an online ejournal management system (EBSCO, SWETS, etc) this may offer such a facility to export titles and homepage urls.
If you are able to create such a mapping file, you will be able to create the second report of journal usage. If not, you will need to comment out the lines which run this report, mentioned above.
3. Create the input log file (makemini.sh)
Edit makemini.sh. This shell script creates a file (mini.log) which has more or less one line per session.
The script in its default form will look in a particular directory for *.log.gz files for input. You will almost certainly need to edit this script, or replace it completely (it's only one line). Once finished editing, you will need to run it and check the result.
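If you would rather replace makemini.sh completely, something along these lines might do. This is a Python sketch, not the contents of makemini.sh. It assumes that first hits are GET requests whose URL starts with your ezproxy server's address; the function name, directory and prefix are placeholders to change.

```python
import glob
import gzip
import os


def make_mini_log(log_dir, out_path, proxy_prefix="http://ezproxy.sussex.ac.uk"):
    """Copy first-hit lines from every *.log.gz in log_dir into out_path."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(os.path.join(log_dir, "*.log.gz"))):
            # errors="replace" keeps odd bytes in URLs from stopping the run
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
                for line in log:
                    # Assumption: only first hits are requested via the proxy host.
                    if f'"GET {proxy_prefix}' in line:
                        out.write(line)
                        kept += 1
    return kept
```

The return value is the number of lines written, which gives a quick sanity check when you run it and inspect the result.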
The main stats Perl script does not currently restrict by date, so you will need to manually edit mini.log to include only the lines between the dates you want in the report. Another option would be to edit makemini or stats.pl to include such functionality (and then send it to me!).
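One possible way to restrict by date without hand-editing is sketched below, in Python rather than Perl. It assumes every line carries a timestamp in the '[07/May/2004:12:55:39 +0000]' form shown in the examples; the function names are made up, and this is not part of makemini.sh or stats.pl.

```python
from datetime import date, datetime


def entry_date(log_line):
    """Return the date from the [dd/Mon/yyyy:hh:mm:ss +zzzz] field, or None."""
    start, end = log_line.find("["), log_line.find("]")
    if start == -1 or end < start:
        return None
    try:
        # %b expects English month abbreviations under the default locale.
        stamp = datetime.strptime(log_line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return stamp.date()


def lines_between(lines, first_day, last_day):
    """Yield the lines dated from first_day to last_day, both inclusive."""
    for line in lines:
        day = entry_date(line)
        if day is not None and first_day <= day <= last_day:
            yield line
```

For example, lines_between(open("mini.log"), date(2004, 5, 1), date(2004, 5, 31)) would keep only the entries for May 2004.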
4. [Optional] Create a text file of journal names & URLs (if required)
This is only required if you are running the second report, which shows journal usage. The methods which could be used to create such a file are listed above. The file takes the form of one journal per line, each line being 'Journal Title \t Journal URL'.
The file should be saved into the same directory as the script (it goes without saying that the script could be edited to look elsewhere), with the name "journalnames.txt".
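Before running the report it can be worth checking that every line of journalnames.txt really does contain a tab. A small Python sketch of such a check follows (it is not part of stats.pl, and the function name is hypothetical); it also returns the mapping as a dictionary of URL to title.

```python
def load_journal_names(lines):
    """Parse 'Title<TAB>URL' lines into a dict of URL -> title."""
    journals_by_url = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        title, sep, url = line.partition("\t")
        if not sep or not url.strip():
            # Usually means spaces were used instead of a tab.
            raise ValueError(f"line {number}: expected 'title<TAB>url'")
        journals_by_url[url.strip()] = title.strip()
    return journals_by_url
```

It could be called as load_journal_names(open("journalnames.txt")); a ValueError names the first line that is not in the expected form.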
5. Copy your ezproxy.cfg file to the stats directory
Either copy your ezproxy.cfg file to your stats directory, or edit stats.pl to look for ezproxy.cfg in its normal location.
6. Edit stats.pl for your organisation
The main thing you need to edit is the IP range for your organisation, to enable it to report on on/off campus usage.
The University of Sussex is lucky as this is quite simple for us: all our IP addresses start with '139.184'. Therefore this part of the script is very basic. It simply looks for a string (in our case 139.184) in the client address to decide whether a request is coming from on or off campus.
Those with more complex needs, e.g. various ranges of IP addresses, will either need to add their own code or ignore the on/off campus functionality. There is a third option, as yet untested: use a simple utility to convert IP addresses to hostnames in the log file (there are many such tools out there on the web), then use your organisation's domain name in the $localip variable set at the top of the script, instead of using part of an IP address.
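For those with several ranges, the on/off campus test could look something like the sketch below. It is written in Python using the standard ipaddress module, not in the script's Perl, so it shows the idea rather than a drop-in patch. The second range is invented for the example; 139.184.0.0/16 is the range equivalent to 'starts with 139.184.'.

```python
import ipaddress

# Hypothetical list of campus ranges in CIDR notation; edit to suit.
CAMPUS_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("139.184.0.0/16", "10.0.0.0/8")
]


def is_on_campus(client_ip, networks=CAMPUS_NETWORKS):
    """Return True if client_ip falls inside any of the campus ranges."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not an IP address, e.g. a hostname in the log
    return any(address in network for network in networks)
```

Unlike a plain string search, this will not mistake an address such as 10.139.184.7 or 139.18.4.1 for a 139.184 address.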
7. Run the script
Finally we can run the script:
./stats.pl
The output files can then be imported into a program such as MS Excel.
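If you would rather not use a spreadsheet, the unordered output can also be sorted with a few lines of code. The Python sketch below assumes the tab-separated column order described under 'Sample output', with the total in the second column for the e-resources report (use total_column=2 for the e-journals report) and no header line. The function name is hypothetical.

```python
import csv
import io


def sort_report(text, total_column=1):
    """Return the report rows sorted by total access, highest first."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip empty lines
    rows.sort(key=lambda row: int(row[total_column]), reverse=True)
    return rows
```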
Issues and problems
The method used to match URLs (from the logfile) with 'Domains' and 'Hosts' in the ezproxy.cfg file is particularly crude. As such, there will be some URLs which it is unable to give a 'Title' (as given in the cfg file). In such cases, it will use the domain name as the e-resource's title in the output.
One of the biggest issues is ensuring we have one and only one log entry per user 'session' (session is in quotes as it can mean different things to different people). Please bear this in mind: the numbers quoted in these reports will give a rough estimate, but may be wide of the mark, and may be more accurate for some resources than others.
Regarding the e-journal report: if a user clicks a link on your webpage to a particular e-journal, that should count as a hit. Once on the e-journal site, the user may of course follow links to other journals on the same resource, and their access to these journals will NOT be counted.
Finally, again, if you find this useful at all, please drop me an email.