Thursday, December 20, 2007

Roll over Google Docs :: MonkeyTeX is here

Okay, so maybe Google Docs still has a (sizable) market, but MonkeyTeX is probably worth knowing about if you are a TeX-nerd like myself :)

Essentially, MonkeyTeX allows you to edit, store, share and collaborate on LaTeX documents via a web interface. It supports BibTeX and Sty files, and can render to PDF. I've only played with it briefly, but it could be useful on occasions when I get stuck without a working LaTeX installation (such as internet cafes with only Windows machines) but would like to produce a professional looking document. It may be a nice way for people to test and learn LaTeX with minimal barriers to entry too.

Via thekit at LiveJournal

Thursday, November 22, 2007

Information R/evolution

I just stumbled upon this great little short film by Michael Wesch of Kansas State University (it's a year old now, it could be that everyone else has seen it but me). It highlights the way that the organization and retrieval of information has been changing since the growth of the Web. I'm sure the question "Exactly what is Web2.0 ?" has been debated around in circles for the last couple of years ... well, beyond the flashy web-based UIs and 'social' aspects, this film nicely exposes one of the more important parts of Web2.0 in my opinion - the change in the way many of us organize and find information.



Saturday, October 27, 2007

Qutemol using Cedega

Pawel over at Freelancing Science recently highlighted Qutemol, a nice looking molecular viewer that does real-time ambient occlusion rendering. There isn't any official Linux version, but I found that the Windows version runs okay on Linux using Cedega (a version of Wine that has better DirectX support, especially for games). Since Cedega is based on the Open Source Wine code, you can compile your own command line version ... but it's a good idea to buy a maintenance subscription from Transgaming and support its further development, if you can afford it.

Here's a screenshot of Qutemol running under Cedega on Ubuntu Gutsy Gibbon, just to prove it.


No, it's not Photoshopped ... (or GIMPed) ... :)

Wednesday, September 19, 2007

If pipettes could talk, oh the tales they could tell !

Occasionally impoverished University labs and early career researchers go looking for bargain priced lab equipment on eBay ... sometimes hard to get but still very useful equipment also comes up. A colleague of mine found this entertaining auction for a P20 Gilson Pipette, much of it written from the pipette's perspective. Here's a quote:

For the purposes of full disclosure, this pipette has NOT resulted in data that has made it into Nature or Science (and in hindsight then the Cell paper may be considered to be a fluke). However, we choose to believe that that is the fault of both the editorial staff of these journals, and many a short-sighted peer-reviewer, rather than the pipette itself. Nonetheless, you may want to calibrate the pipette upon its arrival.
Fluke or not, a sentient pipette that produces Cell papers has got to be worth more than the mere $48 it's currently sitting at ! Sadly (or maybe happily for them), it looks like the seller is leaving bench science and selling up their gear. I've no idea who it is, but they live in the same suburb as me, so there's a good chance this is a fairly well-published senior scientist that I've crossed paths with at some stage. Also, don't miss the (*) footnote at the bottom of the auction info about impact factors ...

Wednesday, September 12, 2007

ARIA version 2.2 released

I don't usually post about NMR (Nuclear Magnetic Resonance) and structural biology related stuff, but I've always intended to. In this post I'm pulling out all the stops on specialist lingo and assumed background knowledge, so hopefully it isn't too incomprehensible to the non-structural biology crowd :).

ARIA version 2.2 has been released in the last few weeks. ARIA is an automated NOE assignment and structure calculation package, which (in theory) takes some of the pain and slowness out of producing protein (and DNA and/or RNA) structures from Nuclear Magnetic Resonance data. I'll say up front; I haven't tried this version yet, but some of the improvements look exciting.

Here are two new features worth noting ... followed by what I think it all means:


  • The assignment method has been improved with the introduction of a network-anchoring analysis (Herrmann et al., 2002) for filtering of the initial assignments.
  • The integration of the CCPN has been completed. The imported CCPN distance constraints lists can enter the ARIA process for calibration, violation analysis and network-anchoring analysis. The final constraint lists can be exported as well.

In the past I have done some quick and dirty tests comparing the quality of protein structures produced using ARIA 2.1 vs. Peter Güntert's CYANA 1.0.7 and 2.1, using the exact same NMR peak input lists (with slightly noisy data containing a number of incorrectly picked peaks). CYANA always won hands down, assigning more NOE crosspeaks correctly and producing an ensemble of model structures with much lower RMSD and generally better protein structure quality scores (i.e. using pretty much any decent pairwise pseudo-energy potential, and Procheck). Also, ARIA produced 'knotted' structures which were almost certainly incorrect, while CYANA did not. Other postdocs and students in my former lab had done similar independent tests with ARIA 1.2 vs. CYANA 1.0.7, and had come to similar conclusions.

The disclaimer: It should be noted here that assessment of the quality of an ensemble of NMR structure coordinates can be problematic, and is really the topic of another long post (and probably tens if not hundreds of peer-reviewed journal articles). So saying "CYANA version X is better than ARIA version X" based on the RMSD of the final calculated ensemble is a bit unfair ... in fact using RMSD of the ensemble to gauge structure quality is just plain wrong in this context. In my (unpublished, non-peer reviewed) tests, it is possible that ARIA could be producing high RMSD but essentially 'correct' structures, while CYANA could be producing tightly defined but 'incorrect' structures, but I doubt it. The gap between the output of each program was wide enough to suggest that under real-world conditions where the input peak list contained a number of 'noise' peaks, ARIA was failing to give a set of consistent solutions (probably due to lack of NOE assignments), while CYANA was giving a set of tightly defined structures (which may or may not have represented the 'correct' solution). Other evaluations (protein structure quality measures, Procheck, comparison to known structures of similar proteins) indicated that the CYANA structures were not grossly 'incorrect', so I'd say CYANA was just giving a better defined (i.e. lower ensemble RMSD) set of plausible solutions.
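To make the "ensemble RMSD" figure concrete, here is a minimal Python sketch of a mean pairwise RMSD over an ensemble of models. It is only an illustration: the function names and toy coordinates are made up, the models are assumed to be already superimposed (real tools do a least-squares fit first), and a low value only says the models agree with each other, not that they are correct.

```python
import math
from itertools import combinations


def rmsd(model_a, model_b):
    """RMSD between two models given as equal-length lists of (x, y, z) tuples.

    Assumes the two models are already superimposed; no fitting is done here.
    """
    if len(model_a) != len(model_b):
        raise ValueError("models must contain the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(model_a, model_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(model_a))


def mean_pairwise_rmsd(ensemble):
    """Average RMSD over every pair of models in the ensemble."""
    pairs = list(combinations(ensemble, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


# Toy two-atom "structures" (hypothetical numbers, not real data).
tight = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
]
loose = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)],
]

print(mean_pairwise_rmsd(tight), mean_pairwise_rmsd(loose))
```

The 'tight' ensemble scores far lower than the 'loose' one, which is exactly why the number measures precision rather than accuracy.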

My gut feeling is that ARIA 2.2 will perform much better than past versions, due to one key feature that has been 'borrowed' from CYANA: the introduction of a network-anchoring analysis. In a nutshell, network-anchoring scores essentially weight distance constraints (or NOE assignments) based on how 'connected' that constraint is within the graph formed by other constraints. This means that in effect a single, isolated constraint pulling two residues on opposite sides of a protein together is down-weighted, while if multiple constraints link those residues (or their neighboring residues) then those constraints are considered more trusted and hence weighted more heavily. For better or worse (usually better), this score simulates what the human NMR spectroscopist would do when assigning NOE crosspeaks manually ... usually two residues in contact will show multiple NOE crosspeaks connecting them and involve multiple different nuclei, however a single lonely NOE between two nuclei which are distant from each other in the primary protein sequence is heavily scrutinized and regarded with suspicion since it is likely to be mis-assigned. I'm very keen to test ARIA 2.2 on my old data set and see if I'm actually right (I may be able to try it with network anchoring turned on, and off, and see just what sort of contribution that score is making).
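As a rough illustration of the idea (and emphatically not the actual ARIA or CYANA algorithm, which works at the atom level with proper assignment probabilities), here is a toy Python sketch. It counts, for each residue-residue constraint, how many other constraints link the same or neighbouring residues. The function name, the `window` parameter and the residue numbers are all hypothetical.

```python
def network_support(constraints, window=1):
    """Count supporting constraints for each residue-pair constraint.

    A constraint (i, j) is 'supported' by another constraint (k, l) when
    k is within `window` residues of i and l is within `window` of j.
    This is a simplified stand-in for a real network-anchoring score.
    """
    normalised = [tuple(sorted(c)) for c in constraints]
    scores = []
    for idx, (i, j) in enumerate(normalised):
        support = sum(
            1
            for other_idx, (k, l) in enumerate(normalised)
            if other_idx != idx
            and abs(k - i) <= window
            and abs(l - j) <= window
        )
        scores.append(support)
    return scores


# Four mutually consistent contacts between residues 10-11 and 45-46,
# plus one lonely long-range constraint between residues 3 and 80.
constraints = [(10, 45), (10, 46), (11, 45), (11, 46), (3, 80)]
for constraint, score in zip(constraints, network_support(constraints)):
    print(constraint, score)
```

The clustered contacts each pick up three supporters while the isolated (3, 80) constraint gets none, so it would be the one treated with suspicion.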

Another completed feature, the integration between ARIA and the CCPN libraries/analysis package should also be a big plus. I haven't used the CCPN analysis software yet, but a few years ago I wrote some code to help make CYANA and the Sparky NMR assignment program work together better. The result was functional, but very hackish (and I'm probably the only person in the world who understands how it was intended to be used, since I still haven't got around to writing any documentation. Naughty, naughty). CCPN + ARIA may turn out to be the better option for spectral analysis and structure calculation in the future, as opposed to my currently preferred Sparky + CYANA combination.

I'm really itching to find a good reason to do an NMR structure project now ... back to work !!

Thursday, July 19, 2007

Searching biological databases from the Firefox search bar

There is currently no "Google for Bioinformatics", and so biologists/bioinformaticians typically need to search a number of separate databases to find the data they desire. While the Biobar Firefox extension helps search a dizzying array of biological databases (individually), I think sometimes it offers too much, and contains too many databases that I rarely if ever use. As useful as it is, most of the time I keep the Biobar toolbar hidden to reclaim the screen real-estate.

Instead, I prefer the more lightweight "search plugins" (accessed via the little search box up near the URL location bar) to fully fledged extensions. Here are some Firefox search plugins for common 'bioinformatic' search engines which I found scattered across the far reaches of the web:

  • HubMed is a clean and slick interface to search the PubMed database, with some features that the NCBI search doesn't have. You may already be familiar with it, but sadly the majority of life scientists appear not to know about it / use it. I prefer HubMed to the regular NCBI Entrez interface. You can install the HubMed Firefox search plugin at HubMed.
  • In case you want the plain vanilla NCBI PubMed interface, the UCSF library provides a PubMed search plugin.
  • The Mycroft project by Mozilla seems to be an official repository of Firefox search plugins. On Mycroft, I found plugins for searching SwissProt and the Protein Data Bank (PDB).

I also wanted some that I couldn't find, so I made them. Here are Firefox search plugins for Uniprot and Pfam:

Here is the code for the Pfam one, as an example:

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"
xmlns:moz="http://www.mozilla.org/2006/browser/search/">
<ShortName>Pfam</ShortName>
<Description>Search the Pfam database</Description>
<image width="16" height="16">%2F%2F%2F%2FAMWLMwDew54AjEcBAO22WwD658kAoWclAMSdbADYn0gA1LKBALiLVADt1rMA%2FffkAKp0PACcXQ0AtnomAMiXVwDxvmsA0ZY4ALuJQwDjqk8AzqZ3AKVoFQD27dkA2LuRAJJRFACVVAEA5s6lALyVYgCbXR0Aq3ElAPPhvwClbDIAzqBkAL2BLgDFkEAAtYJKAPS9YACQTwkA%2Fu3QALJxHgDZnD8A5bJfAMWTTwD%2B89cAnmIUAOiwVQDXtYkAl1gXAN6%2FlgDsuWYAxJpeAN6kTADoz60AwYg6AL6RWwDQrH0Ay6NwAMmPOQCNTgEAllcMAKVtIwDNlD4A9uPGAPfAZgCocjEAxp1mAJhYBQDZv5sAi0oLAKxvHgDUmkIAt4hIAJpeFACoaxoAn2EOAKBlGQDwuF4Aun8qALJ2JQCocDgA3KFEAJRSCAD%2B9twAjksFAJVWEQDt2rcAkFEOAMKVXQD57M4AklACAKBfGACkaigA%2BvDWAPXnyQDr068A7bNWALeHTQD24sEAu5NdAJlZCgDgxqEA6rZjALyOWADcvZIAzZE4ANi5jQDEmGIAj0oAAJVWBgDl0KoA67NaAMCELwDChzQA8%2BLDAJhZEgClbS4AqnM1AOSpSwDy3L4Ai0cFAJxfEQCiYxQAnWAbAJteIQChZikAx402ANyhSAD669EA8Ni2AJBNAgCXVwIAzqt6AOjTsACSUwUAl10WAKhpGADutl0A5a9UAMugbgCSUAwA1rB%2FAOCmSgD86s4A9ujGAI1MAgCbYBUApmoZAK9yHwCjai8AqHAmANWzhADFnm8Ajk8EAI9LCADlzasAk1MBAKBeDgCfZBYAm2AeAKpzOQDQlD0At4ROALuQXQD879QA%2BujLAJFPAACVVgMAmFcFAJRTFQCVVhgAmFsYAKVmFwDZtocA9L1jAPK7YQDLkDsAvJNdAMeVUAD77c8AkFEFAOGqUQDhqE8A3J9CAM2TOgD57NAA9OTHAPThwACPUAIAjk0HAOXNpwCWWAQA27uTAJ9jGACkZhUApmkbAKNsJADwul8A77dbAO60WQDRq38A67RYAOmxVwDPqXkA36ZNANmeRgC%2BlmEA%2FOvQAPvqzAD86MkA7di2AI1KAQCPTAEAjUoFAI9OAACOTQMA6NCrAJJOAgCRUQEAk1QDAJZVAwCTVQYAklENAJRWCACYWAcAmFcKAJtdDwCYWRkAnmMWANW0iQDyvWIA07ODAOu2XADnsFcAuHsmALh%2BKQDcpEsAtIJIALuTXACOSwMAkE0HAI9NCQCSUQUAkE8MAJhZAwCcXw8Anl8RAPK7XwDtuFwA7bVWAP745QD66s4A%2BebJAJBPAwCQUQMAPuWfn5%2Bfn5MfJIjln%2BWfLkIZgdGmwX0un5%2Bf5eWf9s%2FlwETAwMCp3EznUNjdRET1nY2jDLTROINERETARERE75%2FARMDARETdbgUmfz29wMCphNRWyV5v1sCERERERERkn8BEwMBERIT%2Bf7DGt5XznanARNvymdPx3cBERERERO%2BfwETAwMD1RBvgsbDHTj96kqjARKiNWd6oqUREREREZJ%2FARMDARERERNyHT3DHinBP4p2onaBJg4RERERERESyn8BEwMBERERERN08xM1O%2BcY1S5LkA5x5hERERERERLKfwETAwERERERERMCd8yMvxunGajQY6FWEREREREREsp%2FARMDAREREREREwMCoW%2FcTYQXHUhF1G4TARERERESyn8BEwMBERERERPVEwIQ8RguzKshOjw%2Fc9cBERERERO%2BfwETAwEREREREqYSd8UIy0LuOO8jnudj1REREREREsp%2FARMDAwMBERMCE1B7Mpga0kQhYieqLW4TARERERETvn8BEwMDAwMCEGydiV1rSWml1VRvblO2HqMBE
REREwGSfwETAwEREqRsnjIHSkIJovoOEwIT%2FiYTAREREREREZJ%2FARMDAwETA2EO0kHMWdIOEhMCEW60C%2FoRERERERERkn8BEwMDARBtWYy1gDlUbhKlEhDzjtou9wERERERERLKfwETAwMBEW10NhZvahEREwJ1T7PmvooOERERERERE75%2FARMDARPWdSjlWnYTAwN3%2Fw87H%2BcutG%2FVERERERMDvn8BEwMDAwN1NfKeERIT%2B4jv5ik7LR9vARPVEREREwO%2BfwETAwETAW0JFqp2dtVAVTgXGCXvbwEREREREREREsp%2FARMDAwIRbB%2Ftrq2VI%2BIpOLxCH%2F4SERERERERERESyn8BEwMBEwBs9vKWGLLj6BQIPvahERESpREREREREwO%2BfwETAwEREhJKkugZACnJcg6jAwEREREREREREREREZJ%2FARMDAwETAWykiQP3ReEMa2oREqURERERERERERESyn8BEwMBE9dxuyncUYNAGKGZ28Bup9URE9URERERERO%2BfwETAwMD1G55BcdXfbCDQBrTM5NX1RKlEREREREREZJ%2FARMDARESo4euHGxvYljC6Bvy%2FIYOERERERERERMBkn0T1wMBERBsXR731RIRbBCXZgZBf7tepRMDAwMDAwLKfwET1RMDAncPplL2oRPXd2jE6IChzoZ3AwMDAwMDA759EwMBEREQbTEHLUJqdwITAnW1RHFSY1oTARERERERk5UTAwMDAhN29gPjGSHv%2F3UTAhL305jbWwITAwMBERGTFn%2BXln5%2Bf5UqXKzMSZzfC5eXCn0p%2Brqzl5Z%2Bfn5%2F2HQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
</image>
<Url type="text/html" method="GET" template="http://www.sanger.ac.uk/cgi-bin/Pfam/qquerypfam.pl?">
<Param name="terms" value="{searchTerms}"/>
</Url>
</OpenSearchDescription>



To add this to your list of search engines, you also need to put the appropriate "link" tag in the head of an html page that you want the search plugin to be detected from, like this:


<link rel="search" href="opensearch_uniprot.xml"
type="application/opensearchdescription+xml"
title="Uniprot" />
<link rel="search" href="opensearch_pfam.xml"
type="application/opensearchdescription+xml"
title="Pfam" />
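If you'd rather not hand-edit XML, a description file along the same lines can be generated with a few lines of Python. This is just a sketch using the standard library: the `build_description` helper and the example.org search URL are made up for illustration, and the search terms go straight into the URL template rather than a separate parameter element.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, template):
    """Return a minimal OpenSearch description document as a string.

    `template` must contain the {searchTerms} placeholder, which the
    browser replaces with whatever is typed into the search box.
    """
    # Serialise the OpenSearch namespace as the default (prefix-less) one.
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical database and URL, purely for illustration.
xml_text = build_description(
    "MyDB",
    "Search MyDB",
    "http://example.org/search?q={searchTerms}",
)
print(xml_text)
```

Save the output as something like opensearch_mydb.xml and point a "link" tag at it, as shown above for Uniprot and Pfam.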


OpenSearch plugins are also partially supported by some versions of Internet Explorer, but I haven't tested it (there is no POST support for OpenSearch plugins in IE 7). [insert obligatory Firefox fanboy stab at IE here].

Chances are, you'd like to make one to search your favorite database. Here is the documentation I used for creating OpenSearch plugins for Firefox:




Edit: Argghh .. looks like this code is showing up fine on the web page, but is a bit broken when displayed from the RSS feed in akregator .. I assume other feed readers may also be having trouble .. Anyone know a reliable way to post code in Blogger ? Wordpress is starting to look attractive ...

Thursday, July 05, 2007

A simple but useful application for Google Co-op On-the-fly Custom Search

My del.icio.us linkstream is a bit like a window into my stream of consciousness.

I've always wanted to be able to 'search outward' from just my delicious links ... a recent post from Alf about Google Co-op 'on-the-fly' search engines just triggered a simple but really useful idea; make a Google Co-op search of my delicious linkstream ... this way I can search just within pages I've deemed worthy of bookmarking.

After a bit of wrangling with the various ways of getting data out of delicious (see my own comments to this post), I found this script for generating an html page of links from delicious. This page contains all my delicious links, not just the most recent ones (it seems you can only get a subset of the most recent links without authenticating with delicious, hence the PHP script using the delicious API).

So, after hosting the PHP script revealing all my links, I can point the Google Co-op Custom Search Engine to that page, and search away ! It's a nice way of using search to 'retrace' my steps, since it should generally restrict results to the parts of the web I've already seen and rated as worthy of re-examination.
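The script I actually used is PHP and talks to the delicious API, but the page it produces is nothing fancy. As a rough sketch of that last step only, here is a Python stand-in that turns a list of bookmarks into a bare HTML page of links for the custom search engine to crawl. The `links_page` function and the bookmark list are invented for illustration, and no API calls are made.

```python
from html import escape


def links_page(bookmarks, title="My bookmarks"):
    """Render (url, name) pairs as a plain HTML page containing a list of links."""
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(name)}</a></li>'
        for url, name in bookmarks
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n<ul>\n{items}\n</ul>\n</body></html>"
    )


# Hypothetical bookmarks; in practice these would come from the bookmarking service.
bookmarks = [
    ("http://example.org/?a=1&b=2", "An example page"),
    ("http://example.com/notes", "Notes <draft>"),
]
page = links_page(bookmarks)
print(page)
```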

Thursday, June 28, 2007

Google Desktop for Linux Released

Google Desktop for Linux has been officially released. It's a real-honest-to-god native linux application, and doesn't use Wine like the Linux version of Picasa.

I've just installed it on Ubuntu Feisty Fawn from the Google Linux software repositories, and while it's currently only indexed about 1 % of my files, my initial tests suggest it is pretty slick ... a quick Ctrl-Ctrl, and up pops the search box. Apart from all the things I'd expect, like indexing the content of pdf files, directories like "/usr/man" are included on the default path list, so I presume it also looks inside man pages. One problem I've noticed so far in my very quick testing is that it seems to not follow symlinks to directories and won't let me add them as paths to index. The effect is that my "/home/perry/documents", which is actually a symlink to a larger partition, does not get indexed unless I add it to the path list with its real path.
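If you have a few symlinked directories like this, the real paths to add can be found with a short Python snippet. This is only a sketch of the workaround: the `real_index_paths` helper is my own invention, and the demo uses a throwaway temporary directory rather than my actual home directory.

```python
import os
import tempfile


def real_index_paths(paths):
    """Resolve symlinks so each directory is listed once, by its real path."""
    resolved = []
    for path in paths:
        real = os.path.realpath(path)
        if real not in resolved:
            resolved.append(real)
    return resolved


# Demo: 'documents' is a symlink to a directory that lives elsewhere.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "bigdisk_documents")
    link = os.path.join(tmp, "documents")
    os.mkdir(target)
    os.symlink(target, link)
    resolved = real_index_paths([link, target])
    print(resolved)
```

Both the symlink and its target resolve to the same real directory, so only one path needs adding to the index list.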

While there are already similar offerings for Gnome (eg Beagle) and KDE (eg Kat), my gut feeling is that Google Desktop will be my preferred option for the moment. Maybe one day we will get lucky, and Google will even make it FOSS (not holding my breath though).

Wednesday, June 13, 2007

SDF Public Access Unix celebrates 20 years



The Super Dimensional Fortress Public Access Unix has now been in operation for 20 years ! SDF is a non-commercial member supported BBS, which offers free accounts with Unix shell access, and a friendly and vibrant community.

I've been a 'lifetime' ARPA member for a few years now, and have been using SDF for some lightweight web hosting (CGI in various flavours is supported). It's really handy to have a reliable shell account somewhere out in the aether to check network connectivity from, and my interactions with the community have always been fun. The photos in this post show one of my aging PCs, proudly displaying an SDF sticker.

To get a free account and check it out for yourself, telnet to sdf.lonestar.org and login as new.

Happy Birthday SDF !!


(yeh, I know I labeled this 'linux', and SDF actually runs on NetBSD. Different OS, similar audience).

Friday, May 25, 2007

Cleaning up the cesspool that is the PDB

Well .. maybe cesspool is a little strong ... there's a lot of great data in the Protein Data Bank, it's just that in the early days it was allowed to grow very large without enforcing better standardization of the data. Things that are being fixed include updating citations for structures from "To be published" to the actual publication if it exists (with PubMed ID), linking to sequence databases (ie UniProt), bringing atom names to standard IUPAC nomenclature (Hooray!!) and loads of other things I haven't mentioned. Don't fret ... none of the raw experimental data or coordinates are going to be changed :)

From the PDB remediation overview document (pdf):

When the RCSB PDB first addressed the remediation issues in 1998, it was with the intention of providing a uniform and consistent content across all formats. It was surprising and very disappointing to find that many PDB users at the time strongly objected to any changes in the released PDB entries, even if these changes addressed serious but correctable errors (e.g., consistency between chemical and coordinate sequence). As a result of this prevailing attitude toward changes in PDB format entries, the RCSB PDB released its corrections in a new set of mmCIF format data files and left the data in PDB file format unchanged. Since that initial release of mmCIF data, new data items and uniformity corrections have been added to the released mmCIF data files.

I've used coordinates from PDB format files for a lot of things over the years, but I've got to admit, I've never used an mmCIF file. The PDB file format is almost always supported by all legacy (and recent) structural biology analysis software, while using mmCIF is rarely an option (unless it's converted to PDB format first). If I'd known the mmCIF versions in the database have been 'remediated' I may have been more inclined to use them (or the somewhat equivalent XML/PDBML files) for some tasks, since the non-uniformity in atom naming in legacy PDB files can become a royal pain in the butt ....
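To show the sort of atom naming mismatch I mean, here is a small Python sketch that pulls the atom names out of PDB format ATOM/HETATM records (the name sits in columns 13-16 and the residue name in columns 18-20) and groups them by residue type. The function name and the two sets of example lines are made up for illustration; the coordinates are not from any real entry.

```python
from collections import defaultdict


def atom_names_by_residue(pdb_lines):
    """Map residue name -> sorted atom names found in ATOM/HETATM records."""
    names = defaultdict(set)
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            atom_name = line[12:16].strip()  # columns 13-16
            res_name = line[17:20].strip()   # columns 18-20
            names[res_name].add(atom_name)
    return {res: sorted(atoms) for res, atoms in names.items()}


# Illustrative records: an old-style hydrogen name vs. an IUPAC-style one.
legacy = [
    "ATOM      1  N   GLY A   1      11.104   6.134  -6.504  1.00  0.00",
    "ATOM      5 1HA  GLY A   1      12.251   7.480  -5.321  1.00  0.00",
]
remediated = [
    "ATOM      1  N   GLY A   1      11.104   6.134  -6.504  1.00  0.00",
    "ATOM      5  HA2 GLY A   1      12.251   7.480  -5.321  1.00  0.00",
]

legacy_names = atom_names_by_residue(legacy)
remediated_names = atom_names_by_residue(remediated)
only_legacy = set(legacy_names["GLY"]) - set(remediated_names["GLY"])
print(legacy_names, remediated_names, only_legacy)
```

Any script that looks atoms up by name has to cope with both spellings unless the files have been standardized first.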

Anyhow, everyone has until July 2007 to check out the new remediated files before the 'mainline' PDB changes over and provides these by default. All new structure releases will follow the remediated format after July. The old versions will still remain available ... but who would want them ... we are getting standardized goodness !!

Monday, March 19, 2007

Dapper : the screen scraper for everyone

I've been meaning to write about the Dapper 'screen scraping' service for a while, since I think it's mostly useful and pretty cool.

(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I'm a little suspicious that maybe this was a deliberate marketing trick to pull search traffic intended for Ubuntu ....).

Techcrunch describes Dapper well ... 'create an API for any site'. Essentially, Dapper is good at analyzing web pages which have a fixed format (eg Google search results) and will extract the content in a predictable fashion to provide the raw data in XML, RSS, JSON or CSV format. Depending on the type of data you extract from the page, Dapper can also display data as a Google Map, a Google Gadget or Netvibes module, send email alerts or output iCalendar format.

By making it simple to extract information from sites that do not provide their data as an RSS feed or other useful format, it is possible for non-programmers to use Dapper and 'liberate' these sites, producing feeds compatible with their favorite news feed reader from an otherwise dead and lifeless web page. Web application programmers can waste less time writing screen scrapers, and waste less time fixing broken code caused by slight changes in the page format. It may not be something you use for a mission critical component, but for a quick mashup or to get started quickly before writing your own more robust parser, I think Dapper will prove (and is proving) very valuable.

However, there is a downside. From playing with Dapper on-and-off for the past few months, I've established that Dapper works quite well extracting data from very uniform pages like Google search hits [however I didn't do that since it's against the terms of service. "No screen-scraping for you", says the Google-Nazi] or the front page of digg .. but usually fails on pages that don't follow a strict pattern. Getting the wrong data (or junk) in the wrong field 5% of the time may be tolerable for the occasional frivolous RSS feed, but it is annoying enough that for more important applications it is a real show stopper.

One of the reasons it took me several months to get around to posting about Dapper is that I desperately wanted a killer example of extracting data from a bioinformatics database or web site. I've found that most decent projects already make their data available in some useful format like XML or CSV and don't really require scraping with Dapper, while some of the less organized projects which only provide say, HTML tables, [I won't name names ... in fact I've forgotten about them already ... "no citation for you" says the citation-Soup-Nazi ] often failed to work well with Dapper's page analysis unless the page formatting was strictly uniform.

Pedro

I replicated Pedro's openKapow Ensembl orthologue search in Dapper as an example. It's not the best example since, as Pedro notes, Ensembl is one of the 'good guys' that already provide results in XML format.

First, I fed Dapper four URLs for Ensembl gene report pages, which contain a section with predicted orthologues. Apparently, giving Dapper several pages of the same format helps the analysis:


Then, I selected the Gene ID in the orthologues list .. Dapper colours fields it detects as the same type. There is a cryptic unlabeled slider which determines the 'greediness' of the selection:



After selecting "Save and continue", Dapper asks for the newly defined field to be named. In this case, I chose the same name as Pedro ("ort_geneID"), just for the hell of it:


This process was repeated to create a field for the species name, which I named "ort_spp". Dapper allows 'Fields' to be grouped into 'Groups', so I grouped the "ort_geneID" and "ort_spp" fields into a group called "orthologue": (data not shown :) ).

Now, we save the Dapp. In "Advanced Options", I changed the Ensembl gene ID part of the URL to {geneID}. This tells Dapper to make this part of the URL a query field, so that the user can provide any gene ID they like and have the orthologue results scraped:



Finally, we can test the saved Dapp, and retrieve XML formatted results for a particular gene ID:


The gene ID can be changed in the Dapper XML transform URL (http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&v=1&variableArg_0=ENSG00000100347) to get XML results for orthologues of other human genes.

Various other transforms, like a cruft-free CSV version, are also possible. Feel free to have a play yourself with my Ensembl Orthologues Dapp (like I can stop you now ! It's public & live & irrevocable).
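Swapping the gene ID by hand gets old quickly, so here is a tiny Python sketch that builds the transform URL for any Ensembl gene ID. It only assembles the URL using the parameter names visible in the link above; the `orthologue_dapp_url` helper is my own naming, nothing is fetched, and the second gene ID is just an arbitrary example.

```python
from urllib.parse import urlencode

DAPPER_RUN_URL = "http://www.dapper.net/RunDapp"


def orthologue_dapp_url(gene_id, dapp_name="EnsemblOrthologues", version=1):
    """Build the XML transform URL for the given Ensembl gene ID."""
    query = urlencode(
        {"dappName": dapp_name, "v": version, "variableArg_0": gene_id}
    )
    return f"{DAPPER_RUN_URL}?{query}"


for gene_id in ("ENSG00000100347", "ENSG00000139618"):
    print(orthologue_dapp_url(gene_id))
```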

Sunday, March 04, 2007

An Amazon EC2 cluster for BLAST searching ?

I've just been reading about the new Amazon Elastic Compute Cloud (EC2), which is essentially a pay-as-you go cluster, based on Xen virtual machine images. You can create and upload your own image using their tools, or use one of the pre-rolled GNU/Linux distro images already shared by other users of the EC2 system.

While it seems aimed at web service 'startups' that want a competitively priced hosting option which can quickly scale, I thought I'd attempt to figure out the economics of using something like this for some scientific computing. Would it be a cheap / easy / reliable alternative to the home-rolled Beowulf cluster ?

The advertised specs per node are: 1.7 GHz x86 (Xeon) processor, 1.75 GB of RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth. Nodes cost US$0.10 per instance hour. Bandwidth between nodes within EC2 is free, as is bandwidth from EC2 nodes to the S3 storage service; however, Internet traffic costs US$0.20 / GB.

First, let's think about BLAST, the bread-and-butter sequence search tool for many bioinformaticians. Now as far as I understand (and from my own experience), NCBI BLAST works best when the entire database can be cached in RAM ... otherwise lots of disk thrashing ensues and the search time is bounded by the disk I/O. The NCBI 'nr' (non-redundant) protein sequence database is currently about 3.3 GB (and growing) ... so it won't fit in RAM on one of these EC2 nodes. While I don't mind paying to thrash Amazon's servers' disks a little, it will slow the search down. However, if we use mpiBLAST the database gets split into chunks evenly distributed across the nodes, so if we were to use 3 nodes, the 'nr' database would be split into 1.1 GB chunks and should fit in the RAM of each node (leaving ~600 MB RAM for the OS and other overhead).

However, now the speed of the network interconnects between nodes matters, since we are no longer computing on a single node ... but from what I've read 250 Mb/s should be enough for mpiBLAST to run such that it is not bounded by the internode communication speed. (Actually EC2 instances have shared gigabit interconnects, but since several instances might share the same ethernet card, gigabit performance 'per node' can't be expected. I guess the 250 Mb/s figure means that there are probably four EC2 instances per physical server/ethernet card ??).

So with 3 nodes, this would cost US$0.30 per hour to run a (scalable) BLAST service. The performance should scale better than linearly with the number of nodes added. If you need the job done faster, just resplit the database and create more EC2 VM instances (mpiBLAST should work with the Portable Batch System to do this transparently, but I guess this would require some code to interface PBS with the control of your EC2 'elastic cluster'). It would only cost US$0.66 to upload the database to EC2 in the first place (3.3 GB at US$0.20 / GB), and about US$0.50 to store it in S3 per month. This seems well within reach of many academic departments, and would really suit 'sporadic' users with occasional big jobs ....
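The back-of-the-envelope numbers above are easy to redo for other database sizes or node counts. Here is a small Python sketch of the arithmetic. The node RAM, instance price and transfer price are the figures quoted above; the S3 storage price of US$0.15 per GB per month is my assumption (it is what the ~US$0.50 per month figure implies), as is the 0.5 GB of RAM reserved per node in `min_nodes`.

```python
import math

NODE_RAM_GB = 1.75                   # advertised RAM per EC2 node
NODE_COST_PER_HOUR = 0.10            # US$ per instance hour
TRANSFER_COST_PER_GB = 0.20          # US$ per GB of Internet traffic
S3_STORAGE_COST_PER_GB_MONTH = 0.15  # US$, assumed (not quoted above)


def estimate(db_size_gb, nodes):
    """Chunk size, spare RAM and costs for a database split evenly over nodes."""
    chunk = db_size_gb / nodes
    return {
        "chunk_gb": chunk,
        "ram_left_gb": NODE_RAM_GB - chunk,
        "cluster_cost_per_hour": nodes * NODE_COST_PER_HOUR,
        "upload_cost": db_size_gb * TRANSFER_COST_PER_GB,
        "storage_cost_per_month": db_size_gb * S3_STORAGE_COST_PER_GB_MONTH,
    }


def min_nodes(db_size_gb, reserve_gb=0.5):
    """Fewest nodes whose chunks fit in RAM with `reserve_gb` left per node."""
    return math.ceil(db_size_gb / (NODE_RAM_GB - reserve_gb))


est = estimate(3.3, 3)
for key, value in est.items():
    print(f"{key}: {value:.2f}")
print("minimum nodes:", min_nodes(3.3))
```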

Now for applications like molecular dynamics (MD) simulations (e.g. GROMACS, NAMD, CHARMM etc), a lot more internode communication bandwidth is required. Looking at these benchmarks for GROMACS, it looks like things should scale nicely to two or four EC2 nodes, but after that the scaling would probably drop off, due to the less-than-gigabit ethernet. That doesn't mean you won't get more speed for more nodes, just that at some point adding more nodes will give greatly diminishing returns. While I'm speculating here, I'd say it's probably better to leave this type of number crunching to the 'real' supercomputers or home-rolled purpose-built clusters; EC2 may not be worth the cost/effort here for big long running calculations. Others are using MPI applications on EC2 already though, and I'd love to be proved wrong.

One of the current difficulties for running database driven web applications on EC2 is that the virtual machine instances do not have persistent storage ... either a connection to a database running somewhere else needs to be used, or the precious data needs to be moved off each EC2 instance before shutting the server down. If it crashes before shifting data off ... goodbye database. I'm sure Amazon will come up with a solution to this, since it seems often requested on their forums. Having non-persistent data wouldn't be such a big deal for mpiBLAST ... the servers should rarely crash, the results could be stored in Amazon S3 or sent to a remote machine as they arrive, and the sequence database can also be stored in S3 (for about US$0.50 per month ... dirt cheap). There are already a few FUSE S3fs implementations floating about (like s3fs-fuse) ... I haven't tried them yet, but essentially they should allow S3 storage to be mapped transparently to the Linux filesystem. My guess is it would be a bad idea to host a large MySQL database file on S3 using s3fs-fuse (there is a 5 GB file size limit for starters) ... but for lots of little-ish files, as are often generated by bioinformatics software, s3fs-fuse might just do the trick.

Whew ! .. Now I'm really itching to spend some spare change and a few hours to see if running mpiBLAST on EC2 is as good an idea as it sounds.

Doh ! Just tried to set up an account and the Amazon EC2 limited beta is currently full ... I'll have to wait .. :(.


A few additional links I was also looking at while writing this post .. wow ... someone has some 'issues' with the NCBI BLAST implementation: http://blast.wustl.edu/blast/Memory.html and http://blast.wustl.edu/blast/cparms.html


Tuesday, February 13, 2007

Snow day


So, the reason I haven't posted in a while is that I've had infrequent Internet access since I've made a temporary move to Washington DC for three months. I'm visiting the laboratory of a collaborator at the NIH, learning some techniques in membrane protein purification, refolding and crystallography.

Federal Government employees in Washington DC, like those at the NIH, seem to get a pretty good deal. If more than just a dusting of snow falls, then much of the Federal Government 'closes' ... So this afternoon I got to go home early, since they called a "snow day" at 2 PM.

Normally this would be great ... a free afternoon off ... except with laboratory work of course it means that by dropping everything and leaving, my morning's work was for naught, and I'll need to start again fresh tomorrow. I could have stayed to finish things off, but I was told there was no guarantee that the Metro (trains/buses) was going to stay open.

Ah well .. tomorrow is another day ... (on which I can optionally come in to the lab two hours late ... due to snow :) )

Thursday, January 18, 2007

Flash Player 9 for Linux released : Quick install for Ubuntu Dapper

Adobe Flash Player 9 is finally out of beta ! No more feeling like a second class netizen on "flashy" sites !

Here's how I installed it on Ubuntu Dapper (the package is for Debian Sarge, but seems to work fine):

Download flashplugin-nonfree.

Use right-click, "Save Link As ..." and save it to /tmp.

$ cd /tmp

$ sudo dpkg -i flashplugin-nonfree_9.0.21.78.4~bpo1_i386.deb

(you'll be prompted for your password, and once you provide it, the install should happen)

You can check if it worked by typing about:plugins into the URL box in Firefox. You should see something like "Shockwave Flash 9.0 d78" on that page.

Now go view some F-F-F-Flash cartoons :) (I don't think Homestar Runner requires Flash 9, but it's the only Flash site I use on any regular basis)

Wednesday, January 10, 2007

Changing "Illustration" to "Figure" in OpenOffice Writer

I've decided to try and use OpenOffice Writer properly .. like take advantage of some of its more powerful features rather than just using it as a text editor with formatting.

For drafting manuscripts of scientific papers, pictures/photos/illustrations etc are usually referred to as "Figures", however when inserting a picture via "Insert -> Picture -> From File .." the default behavior of OpenOffice is to use the caption "Illustration". This will not do.

From the OpenOffice Writer Guide, Chapter 8 [pdf], here is how to get it to use "Figure" by default:

• Open the "Tools -> Options -> OpenOffice.org Writer -> AutoCaption" dialog box.

• In the "Add captions automatically when inserting" section, select OpenOffice.org Writer Picture, and make sure its checkbox is ticked.

• In the Category drop-down list, enter the name that you want added, e.g. Figure, overwriting whatever sequence name is already shown (it will probably be "Illustration" before you overwrite it). I also like my Figure label bold, so I also selected "Strong Emphasis" from the "Character Style" dropdown box. Press OK to save the changes.

Now you can insert a Picture using "Insert -> Picture -> From File .." and the label should be "Figure", not "Illustration". The picture comes in its own frame, and you can edit the figure legend directly in the document.

Hmmm ... LaTeX is not looking so bad again ....