Friday, January 18, 2008

An OpenRef implementation

Recently, Noel O'Boyle of Noel O'Blog proposed a new RESTful scheme for resolving publications, as an alternative to using DOI or PubMed ID (PMID) identifiers. Essentially, this would allow resolution of a publication like:

EL Willighagen, NM O'Boyle, H Gopalakrishnan, D Jiao, R Guha, C Steinbeck and D J Wild Userscripts for the Life Sciences BMC Bioinformatics 2007, 8, 487.

Using something like this:

openref://BMC Bioinformatics/2007/8/487 
or
http://dx.openref.org/BMC Bioinformatics/2007/8/487 

Simply using the journal title, publication year, volume and first page number. Read his post for a more detailed explanation.
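
To show how mechanical the mapping from a citation to an identifier is, here is a minimal Python 3 sketch that joins the four citation fields into an OpenRef-style string and splits it apart again. The function names `make_openref` and `parse_openref` are hypothetical, chosen for illustration only and not part of Noel's proposal, and the sketch assumes the journal title itself contains no "/" character.

```python
def make_openref(journal, year, volume, page, scheme="openref://"):
    """Join the citation fields into an OpenRef-style identifier."""
    return scheme + "/".join([journal, str(year), str(volume), str(page)])


def parse_openref(ref, scheme="openref://"):
    """Split an OpenRef-style identifier back into its citation fields."""
    if not ref.startswith(scheme):
        raise ValueError("not an OpenRef identifier: %r" % ref)
    parts = ref[len(scheme):].split("/")
    # Assumption: exactly journal/year/volume/page, no "/" in the journal title
    if len(parts) != 4:
        raise ValueError("expected journal/year/volume/page, got %r" % parts)
    journal, year, volume, page = parts
    return {"journal": journal, "year": year, "volume": volume, "page": page}


ref = make_openref("BMC Bioinformatics", 2007, 8, 487)
print(ref)
print(parse_openref(ref))
```

Journal titles that contain a slash, or citations without a volume, are exactly the kind of corner case such a scheme would still have to settle.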

While I think the scheme needs a little fleshing out, the idea is nice since, as Noel highlights, the "OpenRef" URL can be derived from the typical citation style used by academics, while the DOI and the PMID cannot (although the DOI is often printed on the journal article these days, it's generally not used in a reference list at the end of a paper). I'm sure there are lots of corner cases that could ultimately work to over-complicate this scheme and force it to lose its simplicity ... but at the moment it remains appealing.

It dawned upon me that an OpenRef resolver would actually be pretty straightforward to write with Turbogears (or just straight CherryPy), and a bit of Biopython EUtils magic to search PubMed.

So, without further ado ... here's the essential code for my quick implementation. It requires that you have installed Turbogears and made a quickstart project with tg-admin (see the Turbogears docs on how to do this). The code below should be added to the Root class in controllers.py, in addition to the autogenerated code that tg-admin makes for you:


(sorry, syntax highlighting isn't working for me ... it could be worth cut-n-pasting this into your favorite syntax highlighting editor to read it)



from turbogears import controllers, expose, flash, redirect
from model import *

# from openref import model
from Bio import EUtils
from Bio.EUtils import DBIdsClient

from xml.dom import minidom
import urllib

class Root(controllers.RootController):

    # we use *args and **kw here to accept a variable number of
    # arguments and keyword arguments
    # (eg Journal/Year/Page or Journal/Year/Volume/Page)
    # turbogears passes arguments to the function from the URL like
    # http://webapp:8080/arg1/arg2/arg3?keyword=stuff&keyword2=morestuff
    @expose()
    def openref(self, journal, *args, **kw):

        # deals with openref://Journal/Year/Page
        # (no volume argument)
        if len(args) == 2:
            year, page = args
            query = '"%s"[TA] AND "%s"[DP] AND "%s"[PG]' % \
                    (journal, year, page)
        # deal with openref://Journal/Year/Volume/Page
        # (including volume number)
        if len(args) == 3:
            year, volume, page = args
            query = '"%s"[TA] AND "%s"[DP] AND "%s"[VI] AND "%s"[PG]' % \
                    (journal, year, volume, page)

        # search NCBI PubMed with EUtils
        client = DBIdsClient.DBIdsClient()
        result = client.search(query, retmax = 1)
        res = result[0].efetch(retmode = "xml", rettype = "xml").read()

        # get doi link from eutils XML result, example:
        # <ArticleIdList>
        # <ArticleId IdType="pii">S0022-2836(07)01626-9</ArticleId>
        # <ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
        # <ArticleId IdType="pubmed">18187149</ArticleId>
        # </ArticleIdList>

        xml_doc = minidom.parseString(res)
        for tag in xml_doc.getElementsByTagName("ArticleId"):
            if tag.getAttribute("IdType") == "doi":
                doi = tag.childNodes[0].data
            if tag.getAttribute("IdType") == "pubmed":
                pmid = tag.childNodes[0].data

        # make the DOI resolution URL
        doi_url = urllib.basejoin("http://dx.doi.org/", doi)
        # make the Entrez Pubmed resolution URL
        pubmed_url = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=%s&dopt=Abstract" % (pmid)
        # and lets not forget a URL to HubMed
        hubmed_url = "http://www.hubmed.org/display.cgi?uids=%s" % (pmid)

        # decide where to redirect to based on "?redirect=xxx" argument
        if kw.has_key("redirect"):
            if kw['redirect'] == "doi":
                url = doi_url
            elif kw['redirect'] == "pubmed":
                url = pubmed_url
            elif kw['redirect'] == "hubmed":
                url = hubmed_url
        else:
            url = doi_url

        raise redirect(url)


Since this is seat-of-the-pants Friday arvo coding, there is very little in the way of error handling or exceptions in the above code. I might add some niceties like that later. If the PubMed query constructed from the URL gives no PubMed hits, or the PubMed result doesn't contain a DOI, you'll get some ugly and inelegant errors.
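
One possible way to soften the missing-DOI case is to collect every ArticleId into a dictionary first and only then decide what to do. The sketch below is written for Python 3 rather than the Python 2 of the controller above; `extract_article_ids` is a hypothetical helper name, and the XML sample is hand-written for illustration, not real EUtils output.

```python
from xml.dom import minidom


def extract_article_ids(xml_text):
    """Return a dict mapping each IdType (e.g. 'doi', 'pubmed') to its value."""
    ids = {}
    doc = minidom.parseString(xml_text)
    for tag in doc.getElementsByTagName("ArticleId"):
        id_type = tag.getAttribute("IdType")
        # Skip empty elements instead of failing with an IndexError
        if id_type and tag.firstChild is not None:
            ids[id_type] = tag.firstChild.data.strip()
    return ids


# Hand-written sample record that deliberately has no DOI
sample = """<ArticleIdList>
<ArticleId IdType="pubmed">18187149</ArticleId>
</ArticleIdList>"""

ids = extract_article_ids(sample)
doi = ids.get("doi")      # None here, instead of an unbound variable
pmid = ids.get("pubmed")
if doi is None:
    print("no DOI in this record; could fall back to PubMed for PMID", pmid)
```

With the identifiers in a dictionary, a resolver could fall back to the PubMed page when no DOI is present, or return a proper "not found" response when the dictionary is empty.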

Assuming that you run this Turbogears app locally on the default port 8080, you should be able to get redirected to the Willighagen et al Userscripts paper by going to:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487

(Firefox will properly escape the space character in the URL .. I'm not sure what other browsers may do).
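
For clients that do not escape the space themselves, the path can be percent-encoded before the request is made. This is a small Python 3 sketch using the standard library's `urllib.parse.quote` (in the Python 2 of the controller above, the equivalent function is `urllib.quote`); the host and port are simply the local defaults mentioned above.

```python
from urllib.parse import quote, unquote

path = "/openref/BMC Bioinformatics/2007/8/487"

# quote() leaves "/" untouched by default and turns the space into %20
escaped = quote(path)
print("http://localhost:8080" + escaped)

# Decoding gives back the original path
print(unquote(escaped) == path)
```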

By default you will be redirected to wherever dx.doi.org decides to send you (which is often the journal article at the publisher's site, but there is no rule that says this must be the case), but you can also choose to be redirected to PubMed or HubMed using:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=pubmed
or
http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=hubmed
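
The choice of target can also be expressed as a dictionary lookup with a default. This is only a sketch in Python 3: `choose_target` is a hypothetical helper, the identifiers are the sample values from the example in the code above, and sending an unrecognised `redirect` value to the DOI is an assumption about the desired behaviour rather than something the controller above promises.

```python
def choose_target(urls, redirect=None, default="doi"):
    """Return the URL for the requested target, or the default one."""
    return urls.get(redirect, urls[default])


# Sample identifiers taken from the example in the code above
doi, pmid = "10.1016/j.jmb.2007.12.021", "18187149"
urls = {
    "doi": "http://dx.doi.org/" + doi,
    "pubmed": "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
              "?cmd=Retrieve&db=PubMed&list_uids=%s&dopt=Abstract" % pmid,
    "hubmed": "http://www.hubmed.org/display.cgi?uids=%s" % pmid,
}

print(choose_target(urls))              # no ?redirect= given
print(choose_target(urls, "hubmed"))    # ?redirect=hubmed
print(choose_target(urls, "nonsense"))  # unknown value, falls back to the DOI
```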

I've got a working example running at http://openref.pansapiens.com/ if anyone would like to try it out (eg, try http://openref.pansapiens.com/openref/BMC Bioinformatics/2007/8/487 ). No promises that it will stay up for long (Turbogears apps seem to die quite a lot on my cheap little virtual hosting account ... I'm using supervisor2 now, which may help keep things more available).

It should be stressed that this is only a quick and dirty hack to demonstrate the proof of concept. It really only translates the 'paths' in the URLs provided by the user into PubMed queries, and uses the existing DOI infrastructure to ultimately redirect the user to the article; in reality I'd expect that an "OpenRef" resolver would have to be more independent and sophisticated than this. I can't imagine who would maintain a separate OpenRef database in order to make it independent of DOIs and PubMed.

Unfortunately the domain openref.org has already been registered .. and not by Noel. Maybe it's already time for a new name for this fledgling resolution scheme :) ??
