Friday, May 25, 2007

Cleaning up the cesspool that is the PDB

Well... maybe cesspool is a little strong. There's a lot of great data in the Protein Data Bank; it's just that in the early days it was allowed to grow very large without much enforced standardization of the data. Things being fixed include updating citations for structures from "To be published" to the actual publication if it exists (with PubMed ID), linking to sequence databases (e.g. UniProt), bringing atom names in line with standard IUPAC nomenclature (Hooray!!), and loads of other things I haven't mentioned. Don't fret... none of the raw experimental data or coordinates are going to be changed :)

From the PDB remediation overview document (pdf):

When the RCSB PDB first addressed the remediation issues in 1998, it was with the intention of providing a uniform and consistent content across all formats. It was surprising and very disappointing to find that many PDB users at the time strongly objected to any changes in the released PDB entries, even if these changes addressed serious but correctable errors (e.g., consistency between chemical and coordinate sequence). As a result of this prevailing attitude toward changes in PDB format entries, the RCSB PDB released its corrections in a new set of mmCIF format data files and left the data in PDB file format unchanged. Since that initial release of mmCIF data, new data items and uniformity corrections have been added to the released mmCIF data files.

I've used coordinates from PDB format files for a lot of things over the years, but I've got to admit, I've never used an mmCIF file. The PDB file format is supported by nearly all structural biology analysis software, legacy and recent, while mmCIF is rarely an option (unless the file is converted to PDB format first). If I'd known that the mmCIF versions in the database had been 'remediated', I might have been more inclined to use them (or the somewhat equivalent XML/PDBML files) for some tasks, since the non-uniformity in atom naming in legacy PDB files can become a royal pain in the butt...
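To give a feel for how that naming mess shows up in practice, here is a minimal Python sketch that tallies atom names per residue type straight from the fixed-column ATOM/HETATM records of a PDB format file. The function names (`make_atom_line`, `atom_names_by_residue`) and the three sample records are made up for illustration and don't come from any real entry; the column positions (atom name in columns 13-16, residue name in columns 18-20) follow the PDB format description.

```python
from collections import Counter


def make_atom_line(serial, name, res, chain, seq):
    """Build the first 26 columns of an ATOM record (hypothetical helper,
    only used to generate illustrative sample lines)."""
    return f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{seq:>4}"


def atom_names_by_residue(pdb_lines):
    """Count (residue name, atom name) pairs in ATOM/HETATM records."""
    counts = Counter()
    for line in pdb_lines:
        # The record type occupies columns 1-6.
        if line[:6] in ("ATOM  ", "HETATM"):
            atom_name = line[12:16].strip()  # columns 13-16
            res_name = line[17:20].strip()   # columns 18-20
            counts[(res_name, atom_name)] += 1
    return counts


# Made-up sample: the same sugar carbon written with an asterisk in one
# record and with a prime in another, plus an ordinary alpha carbon.
sample = [
    "REMARK   illustrative records only",
    make_atom_line(1, " C5*", "DT", "A", 1),
    make_atom_line(2, " C5'", "DT", "B", 1),
    make_atom_line(3, " CA ", "ALA", "A", 2),
]

counts = atom_names_by_residue(sample)
for (res_name, atom_name), n in sorted(counts.items()):
    print(res_name, atom_name, n)
```

Run over a pile of legacy files, a tally like this makes naming variants for the same atom stand out quickly, which is the kind of inconsistency the remediation is meant to remove.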

Anyhow, everyone has until July 2007 to check out the new remediated files before the 'mainline' PDB switches over and serves them by default. All new structure releases will follow the remediated format after July. The old versions will still remain available... but who would want them? We are getting standardized goodness!!
