Thursday, March 27, 2008

Wednesday, March 19, 2008

The Biosciences in Google's Summer of Code

The Google Summer of Code project participants have been selected. I scanned the list to see how projects specifically aimed at the biosciences and bioinformatics fared:

  • GenMAPP (Gene Map Annotator and Pathway Profiler), a tool for visualizing gene expression data on top of graphical representations of biological pathways.
  • The NESCent (National Evolutionary Synthesis Center) Phyloinformatics project has a range of potential projects to do with phylogenetic analysis, covering things like phyloXML integration with BioPerl and BioRuby, phyloinformatics web services and tree analysis using MapReduce (with Hadoop).
  • OMII-UK, which covers a range of tools including the Taverna Workbench for workflow design and execution.
  • Also participating is OpenMRS, a medical record system aimed at developing countries.
There are also at least two platforms for cluster, parallel or grid computing on the list; I spotted the Globus Toolkit and OAR, but there are probably a few more in that broad category (eg, OMII-UK oversees a bunch of Grid-related projects too).

It's worth noting that I've ignored a bunch of really important pieces of software that are less field-specific, but are actually lower level components of the platforms critical for most large bioinformatics projects. Things like Python, Perl, R, various Open Source databases, and collaboration tools like wikis (MoinMoin) and CMSs (eg Drupal) are also participating.

I don't think coding for bioinformatics applications is as attractive to students as working on some of the other "sexier" projects available (eg the Second Life client, or the Apache web server), but kudos to Google for letting a few bioinformatics tools into the fray. Hopefully the students who hack on them learn something and hone their coding skills (you never know, they may even help improve these tools too :) ).

Monday, March 03, 2008

Buzzwordomics

I see Lars Juhl Jensen has come up with a fun tag cloud of recently popular buzzwords in the biosciences. He calls it a BuzzCloud. The buzzword from the cloud I've noticed most lately is "Quantitative Proteomics" ... quantitation is a good goal for the field of proteomics to aim for, since IMHO it doesn't really deserve the -omics suffix. "Omics" tends to imply the possibility of global proteome coverage, which proteomic studies rarely, if ever, achieve. But enough of the side-rants.

The way Lars' BuzzCloud is constructed, by extracting phrases ending in -ics, -ology, -omy, -phy, -chemistry, -medicine or -sciences, reminded me of a stupid little CGI application I wrote a few years back ... the Biotech company name generator. When you combine common prefixes like "Gene-", "Pept-" or "Chemi-" with suffixes like "-omics" or "-agen", it's amazing how often Googling the result turns up a real honest-to-goodness biotech company.
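Out of curiosity, the suffix-matching part of that kind of extraction is easy to approximate in a few lines of Python. This regex is just my guess at the kind of pattern involved; Lars' actual method is no doubt more sophisticated:

```python
import re

# Capitalised two-word phrases whose second word ends in a buzzwordy
# suffix -- a rough approximation of the BuzzCloud extraction idea.
BUZZ = re.compile(
    r"\b([A-Z]\w+\s+\w+(?:ics|ology|omy|phy|chemistry|medicine|sciences))\b")

text = "Recent advances in Quantitative Proteomics and Systems Biology"
print(BUZZ.findall(text))
# ['Quantitative Proteomics', 'Systems Biology']
```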

Feel free to comment on any "biotechie" suffixes and prefixes that I should add ... the hardcoded list in the script isn't that long.
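For anyone curious, the guts of such a generator really is just a couple of random list lookups. A minimal sketch, with illustrative prefix/suffix lists rather than the actual hardcoded ones from my script:

```python
import random

# Illustrative "biotechie" prefixes and suffixes -- not the script's
# actual hardcoded lists, just enough to show the idea.
PREFIXES = ["Gene", "Pept", "Chemi", "Prote", "Glyco"]
SUFFIXES = ["omics", "agen", "zyme", "gen", "tech"]

def company_name(rng=random):
    """Glue a random prefix onto a random suffix."""
    return rng.choice(PREFIXES) + rng.choice(SUFFIXES)

print(company_name())  # eg "Peptomics" or "Chemiagen"
```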

Friday, January 18, 2008

An OpenRef implementation

Recently, Noel O'Boyle of Noel O'Blog proposed a new RESTful scheme for resolving publications, as an alternative to using DOI or PubMed ID (PMID) identifiers. Essentially, this would allow resolution of a publication like:

EL Willighagen, NM O'Boyle, H Gopalakrishnan, D Jiao, R Guha, C Steinbeck and DJ Wild. Userscripts for the Life Sciences. BMC Bioinformatics 2007, 8:487.

Using something like this:

openref://BMC Bioinformatics/2007/8/487 
or
http://dx.openref.org/BMC Bioinformatics/2007/8/487 

These use only the journal title, publication year, volume and first page number. Read his post for a more detailed explanation.
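Building such a URL from the citation fields is about as mechanical as it gets. A minimal sketch (the dx.openref.org host is Noel's hypothetical resolver, not a live service):

```python
def openref_url(journal, year, volume, page,
                base="http://dx.openref.org"):
    """Build an OpenRef-style URL from the parts of a citation.
    The base host is hypothetical -- nothing actually resolves there."""
    parts = [str(p).replace(" ", "%20")  # escape spaces in journal titles
             for p in (journal, year, volume, page)]
    return base + "/" + "/".join(parts)

print(openref_url("BMC Bioinformatics", 2007, 8, 487))
# http://dx.openref.org/BMC%20Bioinformatics/2007/8/487
```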

While I think the scheme needs a little fleshing out, the idea is nice since, as Noel highlights, the "OpenRef" URL can be derived from the typical citation style used by academics, while the DOI and the PMID cannot (although the DOI is often printed on the journal article these days, it's generally not used in a reference list at the end of a paper). I'm sure there are lots of corner cases that could ultimately over-complicate this scheme and force it to lose its simplicity ... but at the moment it remains appealing.

It dawned on me that an OpenRef resolver would actually be pretty straightforward to write with Turbogears (or just straight CherryPy) and a bit of Biopython EUtils magic to search PubMed.

So, without further ado ... here's the essential code for my quick implementation. It requires that you have installed Turbogears and made a quickstart project with tg-admin (see the Turbogears docs on how to do this). The code below should be added to the Root class in controllers.py, in addition to the autogenerated code that tg-admin makes for you:


(sorry, syntax highlighting isn't working for me ... it could be worth cut-n-pasting this into your favorite syntax highlighting editor to read it)



from turbogears import controllers, expose, flash, redirect
from model import *

# from openref import model
from Bio import EUtils
from Bio.EUtils import DBIdsClient

from xml.dom import minidom
import urllib

class Root(controllers.RootController):

    # we use *args and **kw here to accept a variable number of
    # arguments and keyword arguments
    # (eg Journal/Year/Page or Journal/Year/Volume/Page)
    # turbogears passes arguments to the function from the URL like
    # http://webapp:8080/arg1/arg2/arg3?keyword=stuff&keyword2=morestuff
    @expose()
    def openref(self, journal, *args, **kw):

        # deal with openref://Journal/Year/Page
        # (no volume argument)
        if len(args) == 2:
            year, page = args
            query = '"%s"[TA] AND "%s"[DP] AND "%s"[PG]' % \
                    (journal, year, page)
        # deal with openref://Journal/Year/Volume/Page
        # (including the volume number)
        elif len(args) == 3:
            year, volume, page = args
            query = '"%s"[TA] AND "%s"[DP] AND "%s"[VI] AND "%s"[PG]' % \
                    (journal, year, volume, page)

        # search NCBI PubMed with EUtils
        client = DBIdsClient.DBIdsClient()
        result = client.search(query, retmax = 1)
        res = result[0].efetch(retmode = "xml", rettype = "xml").read()

        # get the DOI and PMID from the EUtils XML result, which
        # contains elements like:
        #
        # <ArticleId IdType="pii">S0022-2836(07)01626-9</ArticleId>
        # <ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
        # <ArticleId IdType="pubmed">18187149</ArticleId>
        #
        xml_doc = minidom.parseString(res)
        for tag in xml_doc.getElementsByTagName("ArticleId"):
            if tag.getAttribute("IdType") == "doi":
                doi = tag.childNodes[0].data
            if tag.getAttribute("IdType") == "pubmed":
                pmid = tag.childNodes[0].data

        # make the DOI resolution URL
        doi_url = urllib.basejoin("http://dx.doi.org/", doi)
        # make the Entrez PubMed resolution URL
        pubmed_url = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=%s&dopt=Abstract" % (pmid)
        # and let's not forget a URL to HubMed
        hubmed_url = "http://www.hubmed.org/display.cgi?uids=%s" % (pmid)

        # decide where to redirect to based on the "?redirect=xxx"
        # argument, defaulting to DOI resolution
        url = doi_url
        if kw.has_key("redirect"):
            if kw['redirect'] == "pubmed":
                url = pubmed_url
            elif kw['redirect'] == "hubmed":
                url = hubmed_url

        raise redirect(url)


Since this is seat-of-the-pants Friday arvo coding, there is very little in the way of error handling in the code above. I might add some niceties like that later. If the PubMed query constructed from the URL gives no hits, or the PubMed result doesn't contain a DOI, you'll get some ugly and inelegant errors.
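For what it's worth, the niceties would amount to something like this hypothetical helper (not wired into the controller above, just a sketch of the checks that are missing):

```python
# Guard against an empty PubMed result, and fall back to a PubMed
# redirect when the record carries no DOI. (Hypothetical helper --
# not part of the app above.)
def choose_target(search_hits, ids_by_type):
    """search_hits: the result list from DBIdsClient.search();
    ids_by_type: dict of IdType -> value parsed from the EFetch XML."""
    if len(search_hits) == 0:
        raise ValueError("no PubMed record matched the OpenRef query")
    if "doi" in ids_by_type:
        return ("doi", ids_by_type["doi"])
    if "pubmed" in ids_by_type:
        return ("pubmed", ids_by_type["pubmed"])
    raise ValueError("PubMed record contains no usable identifier")
```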

Assuming that you run this Turbogears app locally on the default port 8080, you should be able to get redirected to the Willighagen et al Userscripts paper by going to:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487

(Firefox will properly escape the space character in the URL ... I'm not sure what other browsers may do.)
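Either way, the escaped and unescaped forms decode to the same journal name on the server side. A quick check, using urllib's unquote (with an import fallback so the snippet runs on newer Pythons too):

```python
try:
    from urllib import unquote        # Python 2, as used in the app above
except ImportError:
    from urllib.parse import unquote  # newer Pythons

# "%20" and a literal space both arrive as the same journal name
print(unquote("BMC%20Bioinformatics"))  # BMC Bioinformatics
```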

By default you will be redirected to wherever dx.doi.org decides to send you (which is often the journal article at the publisher's site, though there is no rule that says this must be the case), but you can also choose to be redirected to PubMed or HubMed using:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=pubmed
or
http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=hubmed

I've got a working example running at http://openref.pansapiens.com/ if anyone would like to try it out (eg, try http://openref.pansapiens.com/openref/BMC Bioinformatics/2007/8/487 ). No promises that it will stay up for long (Turbogears apps seem to die quite a lot on my cheap little virtual hosting account ... I'm using supervisor2 now, which may help keep things more available).

It should be stressed that this is only a quick and dirty hack to demonstrate the concept. It really only translates the 'paths' in the URLs provided by the user into PubMed queries, and uses the existing DOI infrastructure to ultimately redirect the user to the article; in reality I'd expect an "OpenRef" resolver to be more independent and sophisticated than this. I can't imagine who would maintain a separate OpenRef database in order to make it independent of DOIs and PubMed.

Unfortunately the domain openref.org has already been registered ... and not by Noel. Maybe it's already time for a new name for this fledgling resolution scheme :) ?

Wednesday, January 09, 2008

Posts that didn't make it in 2007

Well, the New Year is fully in swing, so I thought it would be a good time to clean up my 'posts in progress'. There are a bunch of posts I started last year that, through lack of quality, lack of timeliness or general motivation, never made it out the gate.

I generally dislike this kind of 'meta-blogging', but this is the easiest way for me to let go of them and move on ... here is a list of the posts that could have been, but never were:

  • "Open Data in structural biology: share your structure factors and restraints" was a post spurred on by the Chang et al incident and a letter written by Alexander Wlodawer about the importance of sharing 'raw data' in structural biology, particularly to allow structures to be independently validated. I'm sure mandatory deposition of structure factors tied to publication will become the norm in the not too distant future. The post became a long essay which really went nowhere except to suggest that maybe there should be more incentive (and even enforcement) for sharing not only raw data, but also the source code used to process it.
  • There was the seed of a post on "Patenting", based around this link to Christopher Soghoian's blog "slight paranoia". I'm not against the patent system, but I'm not all for patenting anything and everything either. If I ever got time to flesh this post out, chances are it would have turned into a dumb Slashdot-style anti-patent rant anyway. No one wants to read that crap, so I canned it. I will say, however, that Christopher's post reminds me a lot of the situation of being a Postdoc in a fairly pro-patent Institute, and underscores why the incentive to patent often isn't there at the grass-roots level.
  • One potential post started as some scribblings about one of last year's new hot topics, "Open notebook science". I wanted to compare a public/private wiki system I was envisaging with the scheme presented on page 2 of this presentation by Jean-Claude Bradley, which shows the continuum from the Traditional Lab Notebook (unpublished science), through Traditional and Open Access Journals, to the Open Lab Notebook (full transparency). The whole thing never materialized.
  • Then there was a quick post to highlight an article about a GM crop in Nature Biotechnology. I never got it out in a timely fashion, but essentially the article discussed how the Italian media and politicians were continuing their blanket crusade against all GM crops, while conveniently ignoring independent academic trials showing that the MON810 corn strain had significantly lower levels of the fungal toxin fumonisin than its non-GM equivalents. This was (and still is) a topical and often emotive issue for Australians, as two states (Victoria and New South Wales) have recently lifted moratoria on the commercial release of GM crops. It's nice to see that a well-tested GM strain can sometimes be better for human health than an untested traditional strain that has only had the benefit of genetic modification by crossbreeding and selection rather than the new techniques of molecular biology.
  • Another post was my attempt at being funny. "Australian Government department concerned about organisms from space: Quarantine assessment of an asteroid". Actually, I appreciate that they take this type of thing seriously ... I wouldn't be laughing if we end up with an outbreak of some deadly alien virus or something (still not sure if I'm joking or not ....).

There are also a few beginnings of some posts I can't quite let go of yet, and they may appear in the future. One, while getting a little dated now, is "Why are there still 1000 uncharacterised yeast genes?", discussing a 2007 paper by Lourdes Pena-Castillo and Timothy R. Hughes. Another is "Has structural genomics paid off?" ... this is still pretty fresh and under discussion in the November and December issues of Structure. In fact, these letters to Structure are such a treasure trove of practical and philosophical arguments about structural biology that the topic probably warrants multiple posts. Finally, I planned to host the inaugural "Bioinformatics data-munging challenge: 2.0-style", but never felt I had time to devise and run the challenge properly. We can still come up with some guidelines (aka rules) and try it out if anyone is interested.

It's not a New Years resolution ... but I hope that this year I can produce more short, frequent and high quality posts. We'll see.