Saturday, October 21, 2006

Bioinformatics data (non-)formats

(spurred on by my own comment here)

Does anyone know if the Clustal alignment file format (e.g. ClustalW output) has a strict definition somewhere?

Some Googling suggests it has never been "formally" described. For example, from the ClustalX help:

"CLUSTAL format output is a self explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add a new sequence with a profile alignment."
Well, it is fairly self explanatory, and as a result there are lots of parsers around for Clustal format alignment data, and lots of programs that claim to output alignments in "Clustal format". I say claim, since many programs write Clustal alignments with a different header from the original ClustalW program (e.g. “MUSCLE” instead of “CLUSTAL”), and some parsers don’t handle that very gracefully (e.g. Biopython’s Bio.Clustalw).

Unfortunately, these ‘pseudo-Clustal’ formats aren’t going away, so it is probably up to the parsers to be a little more flexible. Fortunately, the variation is usually only in the header on the first line of the file, so it should be trivial to fix the Biopython parser so that it is more forgiving. One idea would be to simply add an optional keyword flag like "ignore_header = True" to the Bio.Clustalw.parse_file() function. This way, something like:
alignment = Bio.Clustalw.parse_file(my_muscle_align_file, alphabet=IUPAC.protein, ignore_header=True)
should happily slurp up most variations on the Clustal format.
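
To show how little flexibility is actually needed, here is a rough standalone sketch in plain Python. It is not Biopython code and not the proposed patch: the function name parse_clustal_like is made up for this example, and it assumes the usual block layout (sequence lines start with an identifier, conservation lines start with whitespace, and any trailing residue count can be ignored).

```python
def parse_clustal_like(text, ignore_header=True):
    """Parse Clustal-style alignment text into a list of (name, sequence) pairs.

    Hypothetical helper for illustration only. With ignore_header=True the
    first non-blank line is skipped without checking what it says, so files
    starting with "MUSCLE ..." are accepted as readily as "CLUSTAL ...".
    """
    lines = text.splitlines()

    # Find the header: the first non-blank line of the file.
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1
    if start == len(lines):
        raise ValueError("empty alignment")

    header = lines[start]
    if not ignore_header and not header.startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL header: %r" % header)

    chunks = {}   # name -> list of sequence pieces, one per block
    order = []    # names in order of first appearance
    for line in lines[start + 1:]:
        # Skip blank lines and conservation lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], if present, is an optional residue count; ignore it.
        name, piece = fields[0], fields[1]
        if name not in chunks:
            chunks[name] = []
            order.append(name)
        chunks[name].append(piece)

    return [(name, "".join(chunks[name])) for name in order]
```

Calling parse_clustal_like(open(my_muscle_align_file).read()) would then accept the file whatever its first line says, while passing ignore_header=False keeps the strict behaviour for anyone who wants it.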

Eventually I’ll get this to the Biopython mailing list (I'll probably write a proper patch first).
