Proteomics/Protein Identification - Mass Spectrometry/Databases
Mass spectrometry databases face a particular challenge: the data generated by an MS experiment is both large and complex. The data itself, primarily the peak data discussed earlier in this chapter, is relatively linear, but the wide variation in the techniques that yield the spectra makes it difficult to standardize the format in which a mass spectrometry experiment must be presented. Although significant progress has been made in standardizing these data types, spectral databases still differ considerably from one another. This is due in part to the many variations of the technique and to its application dependency: the purpose of a mass spectrometry experiment heavily dictates the experimental design, including equipment, protocol, and materials, as well as the expected format of the results. Thus, while several standard data formats (the XML-based data types discussed previously) and many well-established databases exist, significant challenges persist for effective, purpose-driven spectral data mining.
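The contrast between simple peak data and variable experimental context can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Spectrum` class, its field names, and all values are hypothetical and do not correspond to any particular database schema or standard format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Spectrum:
    """One mass spectrum: a simple peak list plus free-form experiment metadata."""
    peaks: List[Tuple[float, float]]                      # (m/z, intensity) pairs
    metadata: Dict[str, str] = field(default_factory=dict)

    def base_peak(self) -> Tuple[float, float]:
        """Return the most intense peak."""
        return max(self.peaks, key=lambda p: p[1])

    def normalized(self) -> List[Tuple[float, float]]:
        """Scale intensities so that the base peak has intensity 100."""
        top = self.base_peak()[1]
        return [(mz, 100.0 * intensity / top) for mz, intensity in self.peaks]


# Two hypothetical records: the peak lists share one structure,
# but the metadata keys differ with the technique used.
a = Spectrum([(147.11, 1200.0), (262.14, 4800.0), (375.22, 2400.0)],
             {"instrument": "example ion trap", "ionization": "ESI"})
b = Spectrum([(91.05, 300.0), (119.05, 150.0)],
             {"ionization": "EI", "electron_energy_eV": "70"})
```

The peak lists can be handled by the same code regardless of origin, while the metadata dictionaries have no guaranteed keys in common. That mismatch is the part a standard format has to resolve.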
Perhaps the first effort to organize spectral data into a curated source was put forth by the National Institute of Standards and Technology (NIST), in conjunction with the Environmental Protection Agency (EPA) and the National Institutes of Health (NIH). Begun in 1970, the NIST standard reference database is a large collection of spectral data in a common data type. It requires both a minimal amount of information about the experiment and a standard format for presenting spectral data from a wide variety of MS applications. More focused repositories, which store and curate a specific subtype of spectra, are much more common. One example is the Mass Spectrometry Database Committee's comprehensive drug library, which contains spectral data for pharmaceutical substances, metabolites, and intermediate compounds. Databases that contain proteomics data include:
- Proteomics Identifications Database (PRIDE)
- Open Proteomics Database (OPD)
Table of common fragments: http://ull.chemistry.uakron.edu/gcms/
Compilation of databases: http://www.infochembio.ethz.ch/Links/en/spectrosc_mass.html
Table of spectrum data: http://www.lohninger.com/spectroscopy/dball.html
Mass spectrum information and profiling: http://www.dkfz.de/spec/glycosciences.de/sweetdb/start.php?action=form_profiling_search
Computerized spectrum databases: http://www.hellers.com/steve/resume/p125.html
New Articles to Summarize
Vizcaíno JA, Côté R, Reisinger F, Barsnes H, Foster JM, Rameseder J, Hermjakob H, Martens L. The Proteomics Identifications database: 2010 update. Nucleic Acids Res. 2010 Jan;38(Database issue):D736-42. Epub 2009 Nov 11.
Reviewer: Ben H.
This article highlights recent advances and modifications in the Proteomics Identifications Database (PRIDE) and points out the vital role it plays in the collection and storage of mass spectrometry (MS) data. The data in the database comes from a wide range of experiments and is stored in a common format that supports both simple and complex queries. The database has grown constantly, and in the two years between 2008 and 2010 it expanded to contain over two and a half million protein identifications, eleven and a half million peptides, and more than fifty million spectra. This data comes from approximately sixty different organisms. It consists primarily of protein and peptide identifications, mass spectra, and related metadata.
- Proteomics Identifications Database (PRIDE)
- A database that is centralized, public, and standards compliant and contains a variety of proteomics data. (http://www.ebi.ac.uk/pride/)
- Ontology Lookup Service (OLS)
- A query interface for controlled vocabulary and ontology lookup. (http://www.ebi.ac.uk/ontology-lookup/)
- Protein Identifier Cross-Reference System (PICR)
- A system designed to map protein sequences to protein identifiers. (http://www.ebi.ac.uk/Tools/picr/implementation.do)
- Database on Demand (DoD)
- A tool used to generate custom databases of FASTA sequences. (source: http://)
- PRIDE Spectrum Viewer
- A tool used to view spectra from within the Proteomics Identifications Database. (http://www.ebi.ac.uk/pride/viewSpectrumHelp.do)
- PRIDE Converter
- A tool used to convert a variety of proteomic data to PRIDE XML format to be submitted to PRIDE and to conform to submission standards. (http://code.google.com/p/pride-converter/#What_is_PRIDE_Converter?)
- ProteomeXchange
- A consortium established to provide a common point of submission for a variety of proteomics repositories. It also seeks to encourage the sharing of information between repositories. (http://www.proteomexchange.org/)
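To make the idea of a custom FASTA database (as in the Database on Demand entry above) more concrete, the following Python sketch filters a FASTA file down to the records whose header contains a keyword. It is not how Database on Demand is implemented; the function names, accessions, and sequences are all invented for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_custom_database(text, keyword):
    """Return FASTA text with only the records whose header mentions keyword."""
    kept = [(h, s) for h, s in parse_fasta(text) if keyword.lower() in h.lower()]
    return "".join(f">{h}\n{s}\n" for h, s in kept)


# Hypothetical input: accessions and sequences are made up.
EXAMPLE = """>EX0001 example kinase OS=Homo sapiens
MKTAYIAKQR
QISFVKSHFS
>EX0002 example protease OS=Rattus norvegicus
MSDNELLKQA
>EX0003 example transporter OS=Homo sapiens
MAVLLPQRST
"""

human_only = build_custom_database(EXAMPLE, "OS=Homo sapiens")
print(human_only)
```

Restricting the search database in this way (for example, to a single organism) is one common reason to build a custom sequence set before running a protein identification search.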
The Proteomics Identifications Database (PRIDE) was established in 2005 in response to the large amounts of proteomics data being generated. It is not the only repository of proteomics data; others include GPMDB, Proteinpedia, PeptideAtlas, and NCBI’s Peptidome. Data submitted to PRIDE can be shared anonymously with reviewers and editors through log-in accounts. This feature has made PRIDE the preferred place to submit data for a variety of journals, including Nature Biotechnology, Proteomics, and Nature Methods. Two tools have had a very positive influence on the growth of PRIDE: the Ontology Lookup Service (OLS) and the Protein Identifier Cross-Reference System (PICR). Database on Demand (DoD) is a third tool, added to increase the usefulness of the database.
The data contained within PRIDE is very diverse and becomes more so each year. As of 2010, human data is the most heavily represented, accounting for about 38% of all protein data and 36% of all peptide data. Bacteria are the most diverse group of organisms, with 20 different species represented in the database. The largest experiment submitted to PRIDE is approximately 85 GB and relates to the C. elegans genome. The largest experiment set currently in PRIDE is that of the rat’s secretory pathway. Surprisingly, the database also contains data from a variety of extinct animals, most notably Tyrannosaurus rex.
The primary improvement in the PRIDE web interface is the ability to submit fragment ion annotations, which can then be visualized with the online “PRIDE Spectrum Viewer.” Another new feature is the integration of PICR mappings into a variety of tools, including the Venn diagram tool, queries, and the BioMart interface. The “Identification Detail View” has also been modified to take the PICR mappings into account.
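A short Python sketch can show why identifier mappings matter for a comparison tool such as a Venn diagram. It does not use the PICR service or its interface: a plain dictionary stands in for the cross-reference, and every accession and name below is an invented placeholder.

```python
# Hypothetical cross-reference table: source accession -> common identifier.
# All accessions here are invented placeholders, not real database entries.
XREF = {
    "DBA:0001": "COMMON_1",
    "DBA:0002": "COMMON_2",
    "DBB:X17": "COMMON_2",
    "DBB:X42": "COMMON_3",
}


def to_common(accessions, xref):
    """Map accessions to common identifiers; collect any that cannot be mapped."""
    mapped, unmapped = set(), set()
    for acc in accessions:
        if acc in xref:
            mapped.add(xref[acc])
        else:
            unmapped.add(acc)
    return mapped, unmapped


def venn_regions(set_a, set_b):
    """Return the three regions of a two-set Venn diagram."""
    return {"only_a": set_a - set_b, "both": set_a & set_b, "only_b": set_b - set_a}


# Two experiments that report proteins using different accession systems.
experiment_a = ["DBA:0001", "DBA:0002"]
experiment_b = ["DBB:X17", "DBB:X42", "DBB:X99"]

a_ids, a_missing = to_common(experiment_a, XREF)
b_ids, b_missing = to_common(experiment_b, XREF)
regions = venn_regions(a_ids, b_ids)
```

Compared directly, the two accession lists share nothing. After mapping to a common identifier, the overlap becomes visible, and accessions with no mapping are reported separately rather than silently dropped.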
The submission process has been made much easier by the addition of PRIDE Converter, which has led to a sharp increase in data submission. The tool provides a simple wizard for converting various proteomics data formats to the PRIDE XML format required for submission. It is also now possible to submit very large data sets via an FTP server, which has essentially removed size limitations on data submissions.
The PRIDE BioMart interface is used to integrate information found within PRIDE with other resources, the number of which is continuously growing. Linking these resources together is essential for obtaining a clearer picture of biology as a whole. Future goals include implementing a mechanism for sharing proteomics data among all members of the community, using PRIDE and NCBI as the primary points of submission. All data will be shared and publicly available to ensure full exposure in the scientific community.
Relevance to a Traditional Proteomics Course
The field of proteomics has expanded rapidly in the past decade, and the amount of information has grown by leaps and bounds. Mass spectrometry is just one technique that has generated much of the data that now resides in PRIDE. Any proteomics course should convey the importance of being able to store and share data and the information derived from it. Having a single location for common access to shared data is immensely advantageous to anyone studying proteomics.