- Dev8D
Published on February 28th, 2010 - ircount development
Published on November 26th, 2009 - ircount : Repository Record Statistics
Published on May 14th, 2009 - short urls, perl and base64
Published on March 9th, 2009
This isn’t a comprehensive review of dev8d, checkout the twitter stream (that link uses twapperkeeper by the way, the guy who created it was there and it was great talking to him), and wiki for and much else on the web for what went on.
I’ve updated Repository Record Statistics (which I refer to as ircount).
Some key points:
The code is now in a subversion (svn) repository, my first time using such a system, which should help me keep track of changes. I can make it available to others if anyone is interested.
There are other changes planned…. At the moment this is all in one big database table. Each week it collects lots of info about repositories from ROAR, including their name etc, and saves one row per repository (for each week). This means lots of infomation (such as a repository’s name) is duplicated each week. It also means that when selecting the name to be displayed you have to be careful to select the latest entry for the repository in question (something which hit me badly when trying to fix the name problem, and something I still haven’t got around to fixing for all SQL queries). I’m working on improving the back-end design.
I’m also thinking about periodically connecting to the OAI-PMH interface for each IR to collect certain details directly. Though this will be quite a change of direction, at the moment, ircount’s philosophy has been simply collecting from ROAR and reporting on what it gets back. Do I want to go down the road and loose this simple model.
I’m also pondering on ways to keep track of the number of full text items in each IR (you can see some initall thoughts on this here), though this will open a big can of worms.
The stats table which shows growth of repositories (based on number of records) over time for a given country (this one), could do with some improvements. RSS feeds for various things are also on the to do list.
How did I fix these things, and add the new features. I made these changes a few months a go (on a test area) and the exact details have already slipped my mind.
Funny characters in Repository names
Any name with a non-standard character (a-z, A-Z, 0-9) has always displayed as garbage, this became more of an issue when I expanded ircount to include repositories world-wide. Below you can see an example from a Swedish repository: Växjö University.![]()
The script which collects the data each week outputs its collected data in to a text file as well as writing it all to the database (really just as a backup and for debugging). The names displayed in the text file were the same messed up format, just as they were on the website. As the text file had nothing to do with the MySQL database or PHP front-end, I concluded the problem was with the actually grabbing the data from the ROAR website, which uses PERL’s LWP::simple.
This was a red herring. In the end I knocked up a script which just collected the file and output it to a textfile, and all worked fine. I gradually added code from the main script and it all was still working fine. So why did not main script not work.
In the end, I can’t remember the details but I think starting a new log file, or using a different filename (stupid I know) and other random things made the funny character problem go away. Which meant the problem was now with the DB after all, which made more sense.
In the end I found this excellent page http://www.gyford.com/phil/writing/2008/04/25/utf8_mysql_perl.php
Converting the database tables/fields to utf8_general_ci, and adding a couple of lines of code to the perl script to ensure the connection to the db was in utf (both outlined in the webpage linked to above) sorted this out. The final step was ensuring that the front end user interface selected the most recent repository name for a given IR, as older entries in the db would still have the incorrect name.
Country names
When I started collecting data for all repositories around the world I needed a way for users to be able to select a particular country. ROAR provided two digit codes, but how to display proper country names. I found a soultion using a simple to use PHP library detailed here (note it’s on two pages).
Known Bugs
One of my many many many faults is coming up with (in my blinkered eyes – good) ideas, thinking about them non-stop for 24hours, developing every little detail and aspect. Then spending a few hours doing some of the first things required. then getting bored and moving on to something else. Repeat ad nauseum.
Today’s brilliant plan (to take over the world)
Over the weekend it was ‘tinyurl.com’ services and specifically creating my own one.
I had been using is.gd almost non-stop all week, various things at work had meant sending out URLs to other people both formally and on services like twitter. Due to laziness it was nearly always easier to just make another shortURL for the real URL in question than to find the one I made earlier. It seemed a waste. One more short code used up when it was not really needed. The more slap-dash we are in needlessly creating short URLs, the quicker they become not-so-short URLs.
Creating my own one seemed like a fairly easy thing to do. Short domain name, bit of php or perl and a mysql database, create a bookmarklet button etc.
Developing the idea
But why would anyone use mine and not someone elses?
My mind went along the route of doing more with the data collected (compared to tinyurl.com and is.gd). I noticed that when a popular news item / website / viral come out, many people will be creating the same short URL (especially on twitter).
What if the service said how many – and who – had already shortened that URL. What if it made the list of all shortened URLs public (like the twitter homepage). The stats and information that could be produced with data about the urls being shortened, number of click throughs, etc, maybe even tags. Almost by accident I’m creating a bookmarking social networking site.
This would require the user to log in (where as most do not), not so good, but this would give it a slightly different edge to others, and help fight spam, and not so much of a problem if users only have to log in once.
I like getting all wrapped up in an idea as it allows me to bump in to things i would not otherwise. Like? like…
So I haven’t built my (ahem, brilliant) idea. Of course the very things that would have made it different (openly showing what URLs have been bookmarked, by who, and how many click throughs, and tags) were the very thing that would make it time consuming. And sites like snurl.com and tr.im had already done such a good job.
So while I’m not ruling out creating my own really simple service (and infact u.nostuff.org already exists) and I learned about mod_rewrite, base64 on cpan, and a bunch of other stuff, the world is spared yet-another short URL service for the time being.