PubSubHubbub instant RSS and Atom

I have just come across PubSubHubbub via Joss Winn’s Learning Lab blog at the University of Lincoln.

It’s a way for RSS/Atom feed consumers (feed readers etc.) to be instantly notified when a feed is updated.

In a nutshell, the RSS publisher notifies a specific hub when it has a new item. The hub then notifies – instantly – any subscribers who have requested the hub to contact them when there is an update.

This is all automatic and requires no special setup by users. Once the feed producer has set up PubSubHubbub and specified a particular hub, the feed itself gains an extra entry telling subscribing client systems that they can use that hub for this feed. Clients which do not understand this line will just ignore it and carry on as normal. Those that are compatible with PubSubHubbub can then contact the hub and ask to be notified when there are updates.

It has been developed by Google, and they’ve implemented it into various Google services such as Google Reader and Blogger. This should help give it momentum (which is crucial for this sort of thing). In a video on Joss’ post (linked to above) the developers demonstrate posting an article and Google Reader instantly updating the article count for that feed (in fact, before the blog software has even finished loading the page after the user has hit ‘publish’). It reminds me of the speed of Friendfeed: I will often see my Friendfeed stream webpage update with my latest tweet before I see it sent from Twhirl.

I’ve installed a PubSubHubbub WordPress plugin for this blog. Let’s hope it takes off.

UPDATE: I’ve just looked at the source of my feed ( http://www.nostuff.org/words/feed/ ) and saw the following line:

<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/>
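For the curious, the publisher’s side of the protocol is tiny: when something new is published, the blog just sends a ‘ping’ to the hub naming the feed that has changed, and the hub does the rest (the WordPress plugin handles this for you). Here is a minimal Perl sketch of that ping, assuming the hub accepts publish notifications at the URL advertised in the feed (some hubs expose a /publish endpoint instead):

#!/usr/bin/perl
# A minimal sketch of a PubSubHubbub publisher ping: tell the hub which feed
# has new content so it can fetch it and notify subscribers. Assumes the hub
# accepts 'publish' pings at this URL (some hubs use a /publish path).
use strict;
use warnings;
use LWP::UserAgent;

my $hub  = 'http://pubsubhubbub.appspot.com/';    # the hub advertised in the feed
my $feed = 'http://www.nostuff.org/words/feed/';  # the topic (feed) that changed

my $ua = LWP::UserAgent->new;
my $response = $ua->post($hub, {
    'hub.mode' => 'publish',
    'hub.url'  => $feed,
});

# the spec says a successful ping gets a 204 No Content back
print $response->code, ' ', $response->message, "\n";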

JISC, Monitter and DIUS (Department of Innovation, Universities and Skills)

Earlier this week the Jisc 2009 Conference took place: a one-day summary of where things are going in Jisc-land.

Like last year, I got a good feel of the day via twitter. I used a web app called monitter.com for real-time updates from anyone on twitter who used the tag #jisc09. monitter.com allows you to track a number of words/searches (3 columns by default); this works well as these can be usernames, tags or just a phrase. I used ‘jisc09’, ‘brighton OR sussex’ and ‘library’.

The keynote talks were also streamed live on the web, and the quality was excellent. Check out the main Jisc blog for the event.

Linking to all the different sites, searches and resources on the web after the event wouldn’t do it justice. The usefulness was in the way these were all being published during the day itself, using things like twitter (and bespoke sites) as a discovery mechanism for all these different things being added around the web. I didn’t know who most of the people were, but I was finding their contributions. That’s good.

An email came out the next day about the conference and announcing a guest blog post by David Lammy, the Minister for Higher Education, on the Jisc Blog.

He finished by asking for the conversation to continue, specifically on  http://www.yoosk.com/dius which is described as ‘a place to open up lines of communication between Ministers and the HE Community’. Yoosk.com is set up to allow users to ask famous people questions. Its homepage suggests it is designed for any kind of ‘famous person’, though it seems to be dominated by UK politicians. It looks interesting, but I can’t help wondering if there are other sites which could facilitate a ‘discussion’ just as well or better.

The dius section of the site seems quite new. In fact my (rather quickly composed) question was the second to be added to the site. I think the idea of practitioners (yuck, did I just use that word?) raising issues directly with Ministers is an interesting one, and I hope it takes off and that, at the very least, he/they answer the questions!

DIUS do seem to be making an effort to use web2.0 tools. I recently came across this sandbox idea of collecting sites from delicious based on tags, in this example the library2.0 tag. Interesting stuff, but not specific to HE: it will work for any tag, and really just creates a nice view of the latest items bookmarked with the tag in question. The code for it is here.

In any case, it is good to see a government department trying out such tools and also releasing the code under the GPL (even 10 Downing Street’s flickr stream is under Crown copyright, and don’t get me started on OS maps and Royal Mail postcodes). I’m reminded of the Direct.gov team who, when they found out there was a ‘hack the government’ day to mash up and improve government web services, decided to join in.

DIUS homepage with web2.0 tools

On the DIUS homepage, just below the fold, they have a smart-looking selection of tools. It’s nice to see this stuff here, and so prominent, though the Netvibes link took me to just a holding page when I tried it.

Finally, they have set up a blog on the jiscinvolve (WordPress MU) site. At the time of writing it has a few blog posts which are one-line questions, and a couple of (good) responses. But I can’t help feeling that these sites need something more if they are to work. At the moment they are just floating in space. How can they be integrated more into the places that HE staff and students inhabit? Perhaps adding personal touches would encourage people to take part; for example the blog – a set of questions – is a little dry, and needs an introduction, a host, and photos.

To sum up: some good stuff going on here, but we need to see if it takes off. It must be difficult for a government department to interact with HE and students (the two are very different), but they are trying. I hope it proves useful; if you’re involved in HE why not take a look and leave a comment?



short urls, perl and base64

One of my many many many faults is coming up with (in my blinkered eyes, good) ideas, thinking about them non-stop for 24 hours, developing every little detail and aspect. Then spending a few hours doing some of the first things required. Then getting bored and moving on to something else. Repeat ad nauseam.

Today’s brilliant plan (to take over the world)

Over the weekend it was ‘tinyurl.com’ services and specifically creating my own one.

I had been using is.gd almost non-stop all week; various things at work had meant sending out URLs to other people, both formally and on services like twitter. Due to laziness it was nearly always easier to just make another short URL for the real URL in question than to find the one I made earlier. It seemed a waste: one more short code used up when it was not really needed. The more slapdash we are in needlessly creating short URLs, the quicker they become not-so-short URLs.

Creating my own one seemed like a fairly easy thing to do. Short domain name, bit of php or perl and a mysql database, create a bookmarklet button etc.

Developing the idea

But why would anyone use mine and not someone else’s?

My mind went along the route of doing more with the data collected (compared to tinyurl.com and is.gd). I noticed that when a popular news item, website or viral link comes out, many people will be creating the same short URL (especially on twitter).

What if the service said how many – and who – had already shortened that URL? What if it made the list of all shortened URLs public (like the twitter homepage)? Think of the stats and information that could be produced with data about the URLs being shortened, number of click-throughs, etc., maybe even tags. Almost by accident I’m creating a social bookmarking/networking site.

This would require the user to log in (whereas most services do not), which is not so good, but it would give it a slightly different edge over others and help fight spam, and it is not so much of a problem if users only have to log in once.

I like getting all wrapped up in an idea as it allows me to bump into things I would not otherwise. Like what? Like…

  • This article runs through some of the current short URL services
  • The last one it mentions is snurl.com. I had come across the name on Twitter, but had no idea it offers so much more, with click-through stats and a record of the links you have shortened. It also has the domain name sn.im (.im being the Isle of Man). Looks excellent (but they stole some of my ideas!)

    snurl.com
  • Even though domains like is.gd clearly exist, it seems – from the domain registrars I tried – that you cannot buy two-letter .gd domains, though three-letter ones seem to start from $25 a year.
  • The .im domain looked like it could be good. But what to call any potential service? Hang on… what about tr.im! What a brilliant idea. Fits. Genius. Someone had, again, stolen my idea. Besides, when I saw it could cost several hundred pounds, other top-level domains started to look more attractive.
  • tr.im, mentioned above, is a little like snurl.com. It looks good, though it is mainly designed to work with twitter, and includes lots of stats. Both have a nice UI. Damn these people who steal my ideas and implement them far better than I ever could. :)
  • Meanwhile… Shortly is an app you can download to run your own short URL service.
  • Oh, and in terms of user authentication, the php user class seemed worth playing with.
  • Writing the code seemed fairly easy, but how would I handle creating those short codes (the random-looking digits after the domain name)? They seem to increment while staying as short as possible.
  • Meanwhile I remembered an old friend and colleague from Canterbury had written something like this years ago, and look! He had put the source code up as well.
  • This was good simple Perl, but I discovered that it just used hexadecimal numbers as the short codes, which are simply the hex version of the DB auto-increment id. Nice and simple, but it would mean the codes become longer more quickly than with other encodings.
  • I downloaded the script above and quickly got it working.
  • I asked on twitter and got lots of help from bencc (who wrote the script above) and lescarr.
  • Basically the path to go down was base 64 (i.e. 64 digits in the number system, instead of the usual 10), which was explained to me with the help of an awk script in a tweet. I got confused for a while as the most obvious base64 Perl lib actually encodes text/binary for MIME email, and created longer, not shorter, codes than the original (decimal) id numbers created by the database.
  • I did find a CPAN Perl module to convert decimal numbers to base 64, called Math::BaseCnv, which I was able to get working with ease.
  • It didn’t take long to edit the script from Ben’s spod.cx site and add the base 64 code so that it produced short codes using lower case, upper case and numbers (there’s a rough sketch of the conversion after this list).
  • You can see it yourself – if I haven’t broken it again – at http://u.nostuff.org/
  • You can even add a bookmarklet button using this code
  • Finally, I did something I should have done years ago and set up mod_rewrite to make the links look nice, e.g. http://u.nostuff.org/3
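To make the conversion bit concrete, here is a rough sketch of the base 64 part, assuming Math::BaseCnv’s cnv(NUMBER, FROM, TO) interface (the real script also does the database lookups and the redirect):

#!/usr/bin/perl
# Sketch: turn a database auto-increment id into a short code and back again.
# Math::BaseCnv counts in base 64 using 0-9, A-Z, a-z plus a couple of extra
# characters, which is what keeps the codes short (this is not MIME base64,
# which encodes binary data and makes things longer, not shorter).
use strict;
use warnings;
use Math::BaseCnv;

sub id_to_code { my ($id)   = @_; return cnv($id,   10, 64); }
sub code_to_id { my ($code) = @_; return cnv($code, 64, 10); }

# e.g. id 125 only needs two characters, and a million ids still only need four
for my $id (1, 125, 4000, 1_000_000) {
    my $code = id_to_code($id);
    printf "%8d => %-6s => %d\n", $id, $code, code_to_id($code);
}

The nice property is that codes only get longer when the id crosses a power of 64, so even a busy service stays at four or five characters for a very long time.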

So I haven’t built my (ahem, brilliant) idea. Of course the very things that would have made it different (openly showing what URLs have been bookmarked, by whom, how many click-throughs, and tags) were the very things that would make it time-consuming. And sites like snurl.com and tr.im have already done such a good job.

So, while I’m not ruling out creating my own really simple service (and in fact u.nostuff.org already exists), and I learned about mod_rewrite, base 64 on CPAN and a bunch of other stuff, the world is spared yet another short URL service for the time being.

webpad : a web based text editor

So I have WordPress (and in fact Drupal, Joomla, mediawiki, Moodle, damn those Dreamhost 1-click installs) as a way of running my website.

But there are still many pages which are outside of a content management system. Especially simple web app projects (such as ircount and stalisfield) and old html files.

It can be a pain to constantly FTP into the server, or use SSH. Editing via SSH can be a pain, especially over a dodgy wireless connection, or when you want to close the lid of your MacBook.

But trying to find something to fit this need didn’t turn up much. Many hits were TinyMCE clones: WYSIWYG HTML editors that convert input into HTML, no good for editing code.

Webpad screenshot

Until I came across Webpad. It not only suited my needs perfectly, but is also well designed and implemented.

After a quick install (more or less simply copying the files), you enter a specified username and password, and once authenticated you are presented with a line of icons at the top. Simply select the ‘open’ icon, browse to the file you wish to edit on your web server, and you’re away!

It’s simple, yet well written and serves its purpose well. If there was one thing I would suggest for future development it would be improved file management functionality. You can create directories and delete files from the file-open dialog box, but I can’t see a way to delete directories or move/copy files. Deleting directories would be useful, as many web apps (wikis, blogs, CMSs) require you to upgrade the software, edit a config file, and then delete the install directory, or similar.

Oh, and it’s free!

Check out webpad by Beau Lebens on dentedreality.com.au

Zoho and WordPress themes

Zoho.com is very impressive. The web-based apps are of very high quality, there are many of them, and all this in such a small space of time. It makes Google Docs look rather poor. They are free (for most things), and have a business model (such a rare thing these days!).

I had checked out the word processor before, but today looked at the other offerings. They even have a customer relationship manager and a project management tool; I have no real need for the former (I don’t have any customers or sell anything), but it is good to know these are there.

It was readwriteweb.com which reminded me of Zoho, a website I need to remember to read more often than I currently do. They have some great articles (I liked the look of tinychat), and focus on writing about the latest web apps rather than the companies behind them, as TechCrunch does.

An aside: When I have a spare five minutes in front of the laptop I find myself going to specialist news sites (and blogs) like this more and more, instead of going to my RSS reader. Seeing a long line of blogs waiting for me to read just seems like hard work. Sometimes it is nicer to just go to a website you haven’t been to for a while and see what is there. Maybe RSS readers need to work on a way of turning the long list of feeds into something more visual: cover view, or some sort of scrolling article headlines which you could pick from? </aside>

While reading readwriteweb.com I was impressed with the readability and appearance of their font. I used Firebug (which I’m really starting to find useful) to discover it was just Arial, with a particular size and line-height.

I decided to edit this blog’s theme to use the same style, but a tiny flaw at one stage meant I had no CSS formatting at all after refreshing the page in my browser. The problem was quickly fixed, but I was amazed how readable the old-fashioned Times New Roman on a white background was, and how pleasant it was to read the text without the distraction of widgets and menus running down the sides. I was almost motivated to use a minimal theme with all navigation links and menus at the top or bottom.

Instead I opted to convert this blog to the Georgia serif font, replacing the previous sans-serif Verdana. So, again using Firebug, I played with font sizes and line-heights in ems, and a few other bits and pieces, until I was happy with the new look. Which I am, and I hope it is ok for you too (though those of you reading via RSS readers probably couldn’t care less).

The theme is called Greening, and I noticed the link at the bottom of this page has stopped working. So I Googled for ‘Greening theme wordpress’ and, ummm, the first hits were to this very blog. Most odd, as the theme was quite popular and was a pre-installed option on Dreamhost. Now the theme and its owner seem to have disappeared.

If the situation doesn’t change, I may think about making the theme (and my changes) available for download from nostuff.org, with due credit to the original author of course.

At least I don’t have to worry about a lot of blogs using the same theme as me for now.

Mashed Libraries

Exactly a week ago I was coming home from Mashed Libraries in London (Birkbeck).

I won’t bore you with details of the day (or, more to the point, I’m lazy and others have already written it up better than I could; of course, I should have made each one of those words a link to a different blog post, but I’m laz… never mind).

Thanks to Owen Stephens for organising, UKOLN for sponsoring and Dave Flanders (and Birkbeck) for the room.

During the afternoon we all got to hacking with various sites and services.

I had previously played around with the Talis Platform (see long-winded commentary here; god, it seems weird that at the time I really didn’t have a clue what I was playing with, and it was only a year ago!).

I built a basic catalogue search based on the ukbib store. I called it Stalisfield (which is a small village in Kent).

But one area I had never got working was the Holdings. So I decided to set to work on that. Progress was slow, but then Rob Styles sat down next to me and things started to move. Rob helped create Talis Cenote (which I nicked most of the code from) and generally falls into that (somewhat large) group of ‘people much smarter than me’.

We (well, I) wanted to show which libraries had the book in question, and plot them on a Google Map. So once we had a list of libraries we needed to connect to another service to get the location of each of them. The service which fitted this need was the Talis Directory (Silkworm). This raised a point with me: it was a good job there was a Talis service which used the same underlying ID codes for the libraries, i.e. the holdings service and the directory both used the same ID number. It could have been a problem if we had needed to get the geo/location data from something like OCLC or Librarytechnology.org: what would we have searched on? A library’s name? Hardly a reliable term to use (e.g. the University of Sussex Library is called ‘UNIV OF SUSSEX LIBR’ in OCLC!). Do libraries need a code which can be used to cross-reference them between different web services (a little like ISBNs for books)?

Using the Talis Silkworm Directory was a little more challenging than first thought, and the end result was a very long URL containing a SPARQL query (something which looks like a steep learning curve to me!).

In the meantime, I signed up for Google Maps and gave myself a crash course in setting it up (I’m quite slow to pick these things up). So we had the longitude and latitude co-ordinates for each library, and we had a Google Map on the page; we just needed to connect the two.

Four people at Mashed Libraries trying to debug the last little bit of my code.

Time was running short, so I was glad to take a back seat and watch (and learn) while Rob went into speed-JavaScript mode. This last part proved elusive: the PHP code which was generating the JavaScript was just not quite working. In the end the (final) problem was related to the order in which I was outputting the code, but we were out of time, and it needed more than five minutes.

Back home, I fixed this (though I never would have known I needed to do this without help).

You can see an example here, and here and here (click on the link at the top to go back to the bib record for the item, which, by the way, should show a Google Book cover at the bottom, though this only works for a few books).

You can click on a marker to see the name of the library, and the balloon also has a link which should take you straight to the item in question on the library’s catalogue.

It is a little slow, partly due to my bad code and partly due to what it is doing (there is a rough sketch of the flow after this list):

  1. Connecting to the Talis Platform to get a list of libraries which have the book in question (quick)
  2. For each library, connect to the Talis Silkworm Directory and perform a SPARQL query to get back some XML which includes the geo co-ordinates. (geo details not available for all libraries)
  3. Finally generate some javascript code to plot each library on to a Google map.
  4. As this last point needs to be done in the <head> of the page, it is only at this point that we can push the page out to the browser.
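In rough pseudo-Perl (the real thing was PHP), the shape of the page generation looks something like the sketch below; get_holding_libraries() and get_library_coords() are made-up stand-ins for the Talis Platform holdings call and the Silkworm SPARQL query, returning dummy data so the sketch runs:

#!/usr/bin/perl
# A sketch of the flow only, not the actual Stalisfield code.
use strict;
use warnings;

my $isbn = shift || '1234567890';   # placeholder ISBN

# 1. which libraries hold this item? (Talis Platform holdings data)
my @libraries = get_holding_libraries($isbn);

# 2. look up co-ordinates for each library (Silkworm directory); some
#    libraries have no geo data, so they are skipped
my @markers;
for my $lib (@libraries) {
    my ($lat, $lng) = get_library_coords($lib->{id});
    next unless defined $lat && defined $lng;
    push @markers, [ $lib->{name}, $lat, $lng ];
}

# 3. emit the JavaScript that plots each marker; because this ends up in the
#    <head> of the page, nothing can be sent to the browser until all the
#    lookups above have finished, hence the slowness
print "var markers = [\n";
printf "  ['%s', %f, %f],\n", @$_ for @markers;
print "];\n";

# hypothetical stand-ins for the real HTTP calls described in the list above
sub get_holding_libraries {
    my ($isbn) = @_;
    return ( { id => 'lib1', name => 'Example University Library' },
             { id => 'lib2', name => 'Another Library' } );
}
sub get_library_coords {
    my ($lib_id) = @_;
    my %coords = ( lib1 => [ 50.86, -0.08 ] );   # lib2 deliberately has no geo data
    return $coords{$lib_id} ? @{ $coords{$lib_id} } : ();
}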

I added one last little feature.

It is all well and good to see which libraries have the item you are after, but you are probably interested in libraries near you. So I used the MaxMind GeoLite City library to get the user’s rough location, and then centred the map on this (which is clearly not good for those trying to use it outside the UK!). This seems to work most of the time, but it depends on your ISP; some seem more friendly in their design towards this sort of thing. Does the map centre on your location?
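For what it’s worth, the geo-targeting bit is only a few lines. A sketch, assuming the Geo::IP Perl module with MaxMind’s GeoLiteCity.dat database (the file path is a placeholder, and the printed lines assume the Google Maps JavaScript of the time):

#!/usr/bin/perl
# Sketch: guess the visitor's location from their IP address and emit the
# JavaScript that centres the map there, falling back to a UK-wide view.
use strict;
use warnings;
use Geo::IP;

my $ip = $ENV{REMOTE_ADDR} || '192.0.2.1';   # placeholder address when run from the command line

my $gi     = Geo::IP->open('/usr/local/share/GeoIP/GeoLiteCity.dat', GEOIP_STANDARD);
my $record = $gi->record_by_addr($ip);

if ($record && defined $record->latitude) {
    # centre fairly close in on the visitor's rough location
    printf "map.setCenter(new GLatLng(%f, %f), 7);\n", $record->latitude, $record->longitude;
} else {
    # no match (or a coy ISP): centre on the UK as a whole
    print "map.setCenter(new GLatLng(54.0, -2.0), 5);\n";
}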

ecto : first impressions

I’ve heard good things about PC/Mac clients for writing blog posts, so thought I would give one a go. I tried out ecto for OS X, publishing to my WordPress-based blog.

It did what was promised: it acted as a WYSIWYG blog composition tool, and in that sense it was easy to use and worked without problems. However, a few things:

  • I could only attach (as far as I could see) audio, pictures and movies. I wanted to attach a txt file (and may want to upload PDF/doc files), but could see no way of doing this.
  • I couldn’t send it to the blog as an unpublished draft, so I couldn’t send it to WordPress and then upload/link-to the text file using the wordpress interface before publishing.
  • Ecto is a generic blog tool, not specific to WordPress. While in many ways a good thing, this does have its downside: there are some options on the WordPress composition screen that I rarely use but do find useful occasionally, and it felt somewhat unsettling for them not to be there should I need them.
  • As a plus: the problem with the WordPress interface is that a lot of screen space is taken up by the top-of-screen menus and other misc stuff, the edit space is somewhat small, and it is annoying to need to scroll both the screen and the text box. The ecto UI does not have this issue. But then WordPress 2.7 may address this problem.
  • One of the main plus points is being able to carry on editing offline, but with Google Gears you should be able to edit happily offline (I haven’t tried this yet).

So ecto is certainly worth trying if you are after an OS X based blog client (I chose it above the others available based on reviews I had read), but for me, I think I will stick with the native web interface for the time being.

(posted using Ecto)

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the HTTP::OAI Perl module by Tim Brody (who, not so coincidentally, is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I had a very basic script which roughly reports the number of records and the number of full text items within a repository; you just need to pass it a URL for the OAI-PMH interface.
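Stripped of all the verbose reporting, the heart of the script looks roughly like this. It is a sketch rather than the real thing, assuming the HTTP::OAI harvester interface (which handles resumption tokens for you); the ‘does this record have full text?’ test is the rough heuristic explained in the rest of this post:

#!/usr/bin/perl
# Rough sketch: harvest a repository's simple Dublin Core records over
# OAI-PMH and guess how many have a full text item attached. The test below
# is a heuristic: a file-ish URL on the repository's own host (EPrints style)
# or a dc:format of application/pdf (DSpace style).
use strict;
use warnings;
use HTTP::OAI;
use XML::LibXML::XPathContext;
use URI;

my $base = shift || 'http://eprints.sussex.ac.uk/cgi/oai2';   # OAI-PMH base URL
my $host = URI->new($base)->host;

my $harvester = HTTP::OAI::Harvester->new( baseURL => $base );
my $records   = $harvester->ListRecords( metadataPrefix => 'oai_dc' );
die $records->message if $records->is_error;

my ($total, $fulltext) = (0, 0);
while (my $rec = $records->next) {
    next unless $rec->metadata;          # skip deleted records
    $total++;

    my $xc = XML::LibXML::XPathContext->new( $rec->metadata->dom );
    $xc->registerNs( dc => 'http://purl.org/dc/elements/1.1/' );

    my @values = map { $_->textContent }
                 $xc->findnodes('//dc:relation | //dc:identifier | //dc:format');

    $fulltext++ if grep {
        m{^https?://\Q$host\E/.+\.[a-z0-9]{2,4}$}i    # a file on our own server
        || $_ eq 'application/pdf'                    # DSpace-style format hint
    } @values;
}

print "$total records, of which $fulltext appear to have full text\n";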

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual OAI output for this record; you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, and the next line shows the title. It then shows a dc.relation field which contains a full text item on the eprints server; because it looks like a full text item and is on the same server, the next line shows it has found a line that MATCHed the criteria, which means we add this item to the count of items with full text attached.

The next line is another dc.relation, again pointing to a fulltext URL for this item. However this time it is on a different server (i.e. the publisher’s), so this line is not treated as a fulltext item and it does not show a MATCH (i.e. had the first relation line not existed, this record would not be considered one with a fulltext item).

Finally a dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, that it is the sixth record seen with fulltext, and that it is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear, on experimenting with different repositories, that different repository software populates Simple Dublin Core in different ways. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation URL and see if it matches that of the OAI interface we are working with. This is by no means solid: it is quite possible for a system to have more than one hostname, and what the user gives as the OAI URL may not match what the system gives as the URLs for fulltext items.
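In code the check boils down to comparing hostnames, something like this (a sketch; looks_like_local_fulltext is a made-up name):

# Sketch of the Eprints2 heuristic: only treat a dc:relation URL as full text
# if it lives on the same host as the OAI-PMH base URL we were given. As noted
# above, this breaks if the repository is known by more than one hostname.
use URI;

sub looks_like_local_fulltext {
    my ($relation_url, $oai_base_url) = @_;
    my $rel_host = eval { URI->new($relation_url)->host } or return 0;
    my $oai_host = URI->new($oai_base_url)->host;
    return lc($rel_host) eq lc($oai_host);
}

# e.g. comparing the Sussex PDF URL above against 'http://eprints.sussex.ac.uk/cgi/oai2' returns true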

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, while the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item, which is in a dc.identifier as well. (As an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext.)

Dspace: I’ll use Leicester’s repository; see the HTML and OAI-PMH for the record used. (I was going to use Bath’s, but it looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. dc.identifier is used for a link to the HTML page for this item (like Eprints2, but unlike Eprints3, which uses dc.relation for this). However it does not mention either the fulltext item or the official/publisher URL at all (this record has both). The only clue that this record has a full text item is the dc.format (‘application/pdf’), so my hacked-up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those who were on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata, and more. Much of what I have read is clear and well written, and yet I still struggle to understand it due to my own mental shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional websites/services which could be created to harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose. For example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way; however, it is in the dc.identifier field, and various other bits of information are also in that very same field, so anyone wishing to extract citations would need to run some sort of messy test to ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, does that ring any bells? I’ve always thought that the existence of the Bath Profile was probably a sign of underlying problems with Z39.50 (though I am almost totally ignorant of Z39.50). Even this would only solve some problems: the issue of multiple ‘real world’ elements being put into the same field (both identifier and relation are used for a multitude of purposes), as mentioned above, would remain.

I know nothing about metadata or web protocols (left to me, we would all revert to tab-delimited files!), so am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC, with Qualified DC being one candidate (I think; again, I know nuffing). See this page highlighting some of the issues with simple DC.

I guess one problem is that it is easy to fall into the trap of presuming repository item = article/paper, when of course it could be almost anything. The former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common item types (articles/theses/images). I think this is the point where people will point out that (a) this has been discussed a thousand times already and (b) it has probably already been done! So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine-readable way of telling clients if they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human text of an IR policy may forbid me from sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of an OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number (probably a simple SQL COUNT(*)), but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes and creates processing load for both server and client, just to answer the simple question of how many items there are. I echo and support Peter’s idea of creating a count verb to resolve this.
  • It would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb. This would be very useful when trying to work around the differences between applications and software versions.

Back to the script

Finally, I’m trying to judge how good the little script is: does it report an accurate number of full text items? If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post, ideally with your OAI-PMH URL and how many full text items you think you have, though neither is essential. I’ll attach the results as a comment to this post.

Food for thought: I’m pondering the need to check the dc.type of an item and only count items of certain types. For example, should we include images? One image of a piece of research sounds fine; 10,000 images suddenly distort the numbers. Should it include all items, or just those of certain types (article, thesis etc.)?

Navel gazing

I was having a quick think about the categories I use here. I have tried to use categories which match people’s interests, e.g. someone from Brighton can choose to read (and subscribe to) ‘Brighton’, and the same for technology or libraries.

I’ve recently started to blog a bit more about things related to my work. Which is best summed up as where technology (& web) and libraries (& information management) meet. This includes searching, metadata, cataloguing, making data and information accessible, and scholarly publishing (and changing it to be less stupid). My rule of thumb is that if I feel something would only be of interest to those in the library (or HE) tech area, I stick it in ‘libraries and technology’, if it could be of interest to those who are generally interested in techy stuff then it is added to the technology category.

So if you are interested in reading my ill-informed rants relating to libraries and technology (but don’t wish to have to suffer the rest of the crap I post) then you can subscribe to the following feed:

http://www.nostuff.org/words/category/information-searching-and-libraries/feed

Oh, but that’s a good point. I have started to talk about the library world more, in an ‘I’m presuming you know what I’m talking about’ type way. I’m hoping that hasn’t alienated my huge previous user base (if you were that reader, can you let me know?). Some keep separate blogs for work and home. I’ve resisted this: my thoughts about the things I encounter through work, and those I encounter through outside interests, are all basically me. If you like one or the other (but not both), just follow the RSS feed for the appropriate category (maybe I need one called ‘not work’). By the way, you can subscribe to a feed for a category by going to the category’s main page and adding ‘/feed/’ to the end.

(It would be great if you could create a feed which is a combination of several categories you are interested in.) Oh, and one weakness of the blogging model is that one person’s output is distributed and not easily connectable, so all the comments I have made on other blogs are disconnected from this blog (of course the alternative is to reply via this blog and rely on ping/trackback). This is one of the reasons why I don’t run multiple blogs: there’s no easy way to say ‘this blog should include any content I post to another specified blog’, or ‘include my comments on other blogs’, or ‘when posting this, also post it to blog X’. But I digress.

I occasionally chatter on about politics, but also talk about more general stuff happening in the world today; this can be anything from shops to phones to education. I tend to stick all this under ‘politics and current affairs’, but it is a broad church and I really need a better category for ‘stuff around me today which takes my interest’. Any ideas?

Peter Suber recently described me as an ‘anonymous’ blogger in a post of his. Which turned out to be true, so I have updated my blog theme (see earlier post) to show a mini profile at the top of the page.

You can also find me at:

And randomly some embedded stuff: Friendfeed and Dipity.

Radio Pop

Radio Pop is an interesting experimental site from the fantastic BBC radio labs.

It is a sort of social network site for radio listening. It only records your listening through the ‘radio pop’ live streams. I (like many) mainly listen via Listen Again and the radio iPlayer, and they are working on integrating with both. You can see my profile here.

Screenshot of radio pop - click for a larger version

You can ‘pop’ what you are currently listening to (basically an ‘I like this’ button). I’ve added my ‘pop’ RSS feed to my Dipity timeline.