Monday, March 19, 2007

Dapper: the screen scraper for everyone

I've been meaning to write about the Dapper 'screen scraping' service for a while, since I think it's mostly useful and pretty cool.

(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I'm a little suspicious that maybe this was a deliberate marketing trick to pull search traffic intended for Ubuntu ...).

TechCrunch describes Dapper well ... 'create an API for any site'. Essentially, Dapper is good at analyzing web pages that have a fixed format (e.g. Google search results) and will extract the content in a predictable fashion, providing the raw data in XML, RSS, JSON or CSV format. Depending on the type of data you extract from the page, Dapper can also display the data as a Google Map, a Google Gadget or a Netvibes module, send email alerts, or output iCalendar format.
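For a flavour of what that buys you: once a Dapp exists, its output can be consumed like any other feed. A minimal sketch, assuming the Dapp is exposed as RSS and using the third-party feedparser library (the URL is just a placeholder for whatever feed address Dapper gives you):

    import feedparser  # third-party feed-parsing library

    # Read a Dapp's RSS output like any normal feed. The URL below is a
    # placeholder; substitute the feed address Dapper generates for your Dapp.
    feed = feedparser.parse("http://www.dapper.net/your-dapp-rss-url")

    for entry in feed.entries:
        print(entry.title, entry.link)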

By making it simple to extract information from sites that don't provide their data as an RSS feed or in some other useful format, Dapper lets non-programmers 'liberate' these sites, producing feeds for their favorite news reader from an otherwise dead and lifeless web page. Web application programmers can spend less time writing screen scrapers, and less time fixing code broken by slight changes in page format. It may not be something you'd use for a mission-critical component, but for a quick mashup, or to get started quickly before writing your own more robust parser, I think Dapper will prove (and is proving) very valuable.

However, there is a downside. From playing with Dapper on-and-off for the past few months, I've established that it works quite well extracting data from very uniform pages like Google search hits [though I didn't actually do that, since it's against the terms of service. "No screen-scraping for you", says the Google-Nazi] or the front page of Digg ... but it usually fails on pages that don't follow a strict pattern. Getting the wrong data (or junk) in the wrong field 5% of the time may be tolerable for the occasional frivolous RSS feed, but it is annoying enough to be a real show-stopper for more important applications.
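One way to cope is to sanity-check each scraped record before trusting it. A minimal sketch, assuming the records arrive as dictionaries keyed by field name, and borrowing the Ensembl gene ID pattern from the example later in this post (both assumptions on my part):

    import re

    # Ensembl stable gene IDs look like ENSG00000100347 (ENSMUSG... for
    # mouse, and so on), so anything in the gene ID field that doesn't
    # match this pattern is probably scraper junk.
    GENE_ID = re.compile(r"^ENS[A-Z]{0,4}G\d{11}$")

    def clean(records):
        """Keep only records whose 'ort_geneID' field looks plausible."""
        return [r for r in records if GENE_ID.match(r.get("ort_geneID", ""))]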

One of the reasons it took me several months to get around to posting about Dapper is that I desperately wanted a killer example of extracting data from a bioinformatics database or web site. I've found that most decent projects already make their data available in some useful format like XML or CSV and don't really require scraping with Dapper, while some of the less organized projects which only provide, say, HTML tables [I won't name names ... in fact I've forgotten about them already ... "no citation for you", says the citation-Soup-Nazi] often failed to work well with Dapper's page analysis unless the page formatting was strictly uniform.


I replicated Pedro's openKapow Ensembl orthologue search in Dapper as an example. It's not the best example since, as Pedro notes, Ensembl is one of the 'good guys' that already provide results in XML format.

First, I fed Dapper four URLs for Ensembl gene report pages, which contain a section with predicted orthologues. Apparently, giving Dapper several pages of the same format helps the analysis:


Then, I selected the Gene ID in the orthologues list ... Dapper colours fields it detects as being of the same type. There is a cryptic unlabeled slider which determines the 'greediness' of the selection:



After I selected "Save and continue", Dapper asked me to name the newly defined field. In this case, I chose the same name as Pedro ("ort_geneID"), just for the hell of it:


This process was repeated to create a field for the species name, which I named "ort_spp". Dapper allows 'Fields' to be grouped into 'Groups', so I grouped the "ort_geneID" and "ort_spp" fields into a group called "orthologue": (data not shown :) ).

Now, we save the Dapp. In "Advanced Options", I changed the Ensembl gene ID part of the URL to {geneID}. This tells Dapper to make this part of the URL a query field, so that the user can provide any gene ID they like and have the orthologue results scraped:



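In code terms, the {geneID} query field simply becomes a parameter on the RunDapp URL. A tiny helper, assuming the URL pattern shown in the next step:

    from urllib.parse import urlencode

    def dapp_url(gene_id, dapp_name="EnsemblOrthologues"):
        """Build the RunDapp URL for a given Ensembl gene ID; the
        variableArg_0 parameter carries the {geneID} query field."""
        params = {"dappName": dapp_name, "v": 1, "variableArg_0": gene_id}
        return "http://www.dapper.net/RunDapp?" + urlencode(params)

    print(dapp_url("ENSG00000100347"))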
Finally, we can test the saved Dapp and retrieve XML-formatted results for a particular gene ID:


The gene ID can be changed in the Dapper XML transform URL (http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&v=1&variableArg_0=ENSG00000100347) to get XML results for orthologues of other human genes.
Various other transforms, like a cruft-free CSV version, are also possible. Feel free to have a play yourself with my Ensembl Orthologues Dapp (like I can stop you now! It's public & live & irrevocable).
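For completeness, here is a minimal sketch of consuming the XML transform from Python. The element names ("orthologue", "ort_geneID", "ort_spp") mirror the group and field names defined above, but the exact XML layout Dapper emits is an assumption on my part:

    import urllib.request
    import xml.etree.ElementTree as ET

    url = ("http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues"
           "&v=1&variableArg_0=ENSG00000100347")

    # Fetch the Dapp's XML output and walk each orthologue group,
    # printing the species name and gene ID fields.
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    for group in tree.iter("orthologue"):
        print(group.findtext("ort_spp"), group.findtext("ort_geneID"))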

5 comments:

stefanandr said...

Hey

I read your article, and noticed your frustration that Dapper does not work on more complex pages. This is exactly what openkapow can do. It is a full visual scripting language, very powerful, and it allows you to scrape virtually anything, even the most complex sites.

Check it out when you get time

Stefan, openkapow

Pedro Beltrao said...

I wonder if Stefan read the whole thing :) since you mentioned kapow in the article. Nice post. I have not tried Dapper yet, but it might be faster to use. For quick tests and simpler cases it is probably better to go with Dapper instead of kapow.

Unknown said...

I don't blame him if he didn't read the whole thing, or only scanned it quickly ... I've gotta try to make my posts shorter and more digestible :)

I've downloaded the openkapow robot builder application (three cheers for providing a Linux client!!); I'll get a chance to check it out soon.

I think you are right though, Pedro: Dapper is certainly good for quick jobs on uniform pages, but I'm guessing openkapow will be better on more complex scraping tasks.

Pedro Beltrao said...

Hi,

I could not find your contact details, so I am asking here in the comments. I want to include this blog post in today's edition of Bio::Blogs. Would you mind if I also include a copy of the post in a PDF version? If you read this today, you can reach me at pedrobeltrao in gmail. Thanks :)

Unknown said...

No problem .. I'm honored to be included in Bio::Blogs :)

[although I've always thought "from bio import blogs" would be a better name, being a Python snob :)]

I'll have to make my contact info a little easier to find.

Cheers