Structural Biochemistry/Bioinformatics

What Is Bioinformatics?

Bioinformatics is a rapidly developing field of science that uses computer technology to analyze problems in molecular biology. Its methods are drawn from statistics, linguistics, mathematics, chemistry, biochemistry, and physics. Scientists in the field work with sequence and structural data for nucleic acids and polypeptide chains, as well as experimental data^[1]. Within structural biochemistry, bioinformatics deals specifically with how sequence alignments are obtained and how the analysis of sequences can be used to generate phylogenetic trees. These relationships in turn contribute to knowledge of how the structures of macromolecules are represented and compared with one another.

Properties of a Protein Data Bank

The structures of many well-known macromolecules are archived as atomic coordinates. An atomic coordinate file is a data file that describes the three-dimensional structure of a molecule. These structures are archived in the Protein Data Bank (PDB), and the coordinates are publicly available through its website. Many scientific journals that publish results on macromolecular structures now require researchers to deposit atomic coordinates in the database. As a result, the bank holds more than 20,000 macromolecular structures, including proteins, nucleic acids, and carbohydrates, that were determined by techniques such as X-ray crystallography, other diffraction techniques, nuclear magnetic resonance (NMR), electron microscopy, and theoretical modeling. The bank continues to grow, with around 2,500 structures added each year.

When a structure is deposited, it is assigned a four-character identifier known as a Protein Data Bank identification code (PDBid). The first character must be a digit from 1 to 9, while the remaining three characters can be letters or digits; no distinction is made between uppercase and lowercase letters. For example, the myoglobin structure is coded as 1MBO in the PDB. Note, however, that an identifier does not necessarily have any relationship to the name of the macromolecule.
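The identifier format described above can be checked mechanically. The following is a minimal Python sketch, assuming that the three characters after the leading digit may be letters or digits and that case is not significant. The names `is_valid_pdb_id` and `normalize_pdb_id` are illustrative and are not part of any PDB software.

```python
import re

# Assumed pattern: one digit from 1 to 9, then three letters or digits.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def is_valid_pdb_id(code: str) -> bool:
    """Return True if the string has the four-character PDBid shape."""
    return PDB_ID_PATTERN.fullmatch(code) is not None


def normalize_pdb_id(code: str) -> str:
    """Return the identifier in uppercase, or raise ValueError if malformed."""
    if not is_valid_pdb_id(code):
        raise ValueError(f"not a well-formed PDBid: {code!r}")
    return code.upper()
```

A check of this kind only confirms that a string has the right shape. It says nothing about whether a structure with that identifier actually exists in the bank.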

An atomic coordinate file begins with information such as the identity and properties of the molecule under study, the date the file was deposited, the organism from which the macromolecule was obtained, and the author(s) who determined the structure, along with journal references. The file then describes how the structure was determined, its symmetry, and any residues that were not observed. The sequences of the individual chains are listed next, together with the descriptions and formulas of the heterogen groups (HET). Heterogens are molecular entities that are not standard amino acid or nucleotide residues: organic molecules such as the heme group, nonstandard residues such as Hyp, metal ions, and bound water molecules. The file goes on to list the elements of secondary structure and any disulfide bonds present.

The bulk of a PDB file consists of two series of record lines: ATOM records for standard residues and HETATM records for heterogens. Each ATOM or HETATM record gives the coordinates of one atom in the structure, identified by its serial number. The record lists the atom's Cartesian coordinates (X, Y, Z) relative to an arbitrary origin, followed by its occupancy, which is the fraction of sites that the atom actually occupies. The occupancy is normally 1.00, but for groups with multiple conformations, or for molecules that are only partially bound to a protein, it is a positive number less than one. The record also gives an isotropic temperature factor, which indicates the thermal mobility of the atom; a larger value means greater motion. For a structure determined by NMR, the PDB file contains ATOM and HETATM records for each member of the ensemble of structures calculated in solving the structure.
Finally, the PDB file ends with connectivity records (CONECT), which describe nonstandard connectivities between atoms, such as disulfide bonds and hydrogen bonds.
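To make the ATOM and HETATM records more concrete, here is a minimal Python sketch that reads the fields discussed above from individual record lines. It assumes the fixed-column layout of the legacy PDB text format (serial number in columns 7-11, coordinates in columns 31-54, occupancy in columns 55-60, temperature factor in columns 61-66). The two sample lines contain made-up values for illustration and are not taken from a real PDB entry; the names `AtomRecord` and `parse_atom_line` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    record: str          # "ATOM" or "HETATM"
    serial: int          # atom serial number
    atom_name: str
    residue_name: str
    chain_id: str
    residue_number: int
    x: float             # Cartesian coordinates in angstroms
    y: float
    z: float
    occupancy: float     # 1.00 for a fully occupied site
    temp_factor: float   # isotropic temperature factor


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one fixed-column ATOM or HETATM line."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not an ATOM/HETATM record: {record!r}")
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),
        atom_name=line[12:16].strip(),
        residue_name=line[17:20].strip(),
        chain_id=line[21],
        residue_number=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        temp_factor=float(line[60:66]),
    )


# Made-up sample lines: a backbone nitrogen and a partially occupied water.
SAMPLE_LINES = [
    "ATOM      1  N   VAL A   1      -2.900  17.600  15.500  1.00 13.79           N",
    "HETATM 1300  O   HOH A 201      10.250   5.125  -3.500  0.50 35.20           O",
]

atoms = [parse_atom_line(line) for line in SAMPLE_LINES]
for atom in atoms:
    print(atom.record, atom.serial, atom.residue_name,
          (atom.x, atom.y, atom.z), atom.occupancy, atom.temp_factor)
```

Slicing by column rather than splitting on whitespace is deliberate in this sketch, because adjacent fields in this format are not guaranteed to be separated by spaces. For real work, an established parser would be a safer choice than this hand-written one.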

Properties of The Nucleic Acid Database

Similar to the Protein Data Bank, the Nucleic Acid Database (NDB) contains the atomic coordinates of nucleic acids and is available through its own website. Its coordinate files use essentially the same format as PDB files. The NDB, however, has its own organization and search algorithms that are specific to nucleic acids. This is particularly important because proteins are typically identified by names such as myoglobin, whereas nucleic acids are identified by their sequences.

Viewing Macromolecular Structures in Three Dimensions

Studying the three-dimensional structure of a macromolecule is important because it reveals much about the molecule's reactive sites and functions. The most revealing way to examine the structure is with a molecular graphics program. One useful program is PyMOL, whose website describes its capabilities for viewing three-dimensional structures. Programs like PyMOL let a user interact with a molecular structure by rotating it, giving an impression of the molecule that conveys more than a two-dimensional image can. PyMOL, like other popular programs such as RasMol, takes PDB files as input for visualization.

Structural Classification and Comparison

Many of the proteins that have been discovered are structurally related to other proteins. This similarity arises because evolution tends to conserve the structures of proteins more than their sequences. The descriptions below cover some of the many publicly available websites that provide computational tools for classifying and comparing protein structures. These tools can be used to examine protein function, to detect distant evolutionary relationships that are not apparent from sequence comparisons, to generate libraries of unique folds for structure prediction, and to explain why certain structures are more common than others.

1. Class, Architecture, Topology and Homologous superfamily (CATH) categorizes proteins into a structural hierarchy with these four levels. First, "Class" is the highest level and has four categories based on secondary structure content: mainly alpha, mainly beta, alpha/beta, and few secondary structures. Second, "Architecture" is the overall arrangement of the secondary structure, independent of topology. Third, "Topology" refers to both the overall shape of the protein and the connectivity of its secondary structures. Fourth, "Homologous superfamily" groups proteins that are homologous to the selected protein. An interactive or still view of the protein can also be displayed. For example, the CATH classification of myoglobin is Class: mainly alpha; Architecture: orthogonal bundle; Topology: globin-like; Homologous superfamily: globin. CATH thus lets users browse up and down the structural hierarchy to make comparisons.
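The four-level hierarchy lends itself to a simple data structure. The sketch below is one possible representation in Python, not how the CATH database itself stores its data. The myoglobin values come from the example above; the second entry is a hypothetical protein invented for the comparison, and `deepest_shared_level` is an illustrative helper name.

```python
from dataclasses import dataclass, astuple
from typing import Optional

# Level names in order from the top of the hierarchy to the bottom.
LEVEL_NAMES = ("Class", "Architecture", "Topology", "Homologous superfamily")


@dataclass(frozen=True)
class CathClassification:
    protein_class: str
    architecture: str
    topology: str
    homologous_superfamily: str


def deepest_shared_level(a: CathClassification,
                         b: CathClassification) -> Optional[str]:
    """Return the lowest level at which two classifications still agree."""
    shared = None
    for name, value_a, value_b in zip(LEVEL_NAMES, astuple(a), astuple(b)):
        if value_a != value_b:
            break
        shared = name
    return shared


MYOGLOBIN = CathClassification(
    "Mainly alpha", "Orthogonal bundle", "Globin-like", "Globin")

# Hypothetical entry, invented only to illustrate the comparison.
HYPOTHETICAL_PROTEIN = CathClassification(
    "Mainly alpha", "Orthogonal bundle",
    "Hypothetical topology", "Hypothetical superfamily")

print(deepest_shared_level(MYOGLOBIN, HYPOTHETICAL_PROTEIN))
```

Walking the levels from the top down mirrors the way a user browses the hierarchy: two proteins that diverge at the Class level share nothing below it.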

What Are the Advantages of Bioinformatics?

1. Create an e-library of biological databases

A biological database is a collection of organized biological information that is stored electronically and can be retrieved. For example, a record in a biological database might hold a nucleic acid sequence together with its name, the input sequence, and the scientific name of the organism from which it was isolated^[2].

In the computing era, database storage makes communication between scientists far more convenient. The data in an e-library can be used by a wide audience, from scientists and students to knowledgeable laypeople.

2. New methods to interact with molecular biology

Because analyzing molecular biology data is one of the main activities in bioinformatics, research in the field focuses on creating new tools and methods to store, retrieve, and analyze material such as protein sequences.

The analysis methods are usually computer programs that help researchers determine the structure of a sample of interest or assign the sample to a family using stored data. One common program used in bioinformatics is BLAST, the Basic Local Alignment Search Tool. The result of a BLAST search is a list of sequence alignments that helps researchers identify sequences in a database of known sequences that are homologous to the query sequence^[3].
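The idea of a local alignment can be illustrated with a short Python sketch. Note that this is not BLAST, which uses faster heuristics to search large databases; it is the Smith-Waterman dynamic programming method, which finds the best-scoring local alignment between two sequences exactly. The scoring values (match +2, mismatch -1, gap -2) and the example sequences are arbitrary assumptions chosen for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = smith_waterman("TTACGTAA", "GGACGTCC")
print(score, aligned_a, aligned_b)  # 8 ACGT ACGT
```

Resetting negative scores to zero is what makes the alignment local: the unrelated flanking regions are ignored and only the shared segment is reported.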

3. Explore evolution

Proteins with a common ancestor have similar amino acid sequences^[3]. Therefore, using sequence and structural data, scientists can assign an unknown protein to a group and reconstruct its evolutionary history. Sequence alignment is a technique for detecting homologous genes or proteins. The evolutionary relationship between two genes or proteins is assessed by scoring their alignment with an identity matrix or a substitution matrix. Structural alignment, which compares the tertiary structures of proteins, can also reveal the evolutionary relationship between two proteins. Scientists can then build evolutionary trees for proteins as well as for life on this planet^[3].
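The difference between scoring with an identity matrix and scoring with a substitution matrix can be shown with a small Python sketch. It assumes the sequences are already aligned, have equal length, and contain no gaps. The substitution values here are a toy scheme invented for illustration (identical residues +4, the conservative pairs K/R and L/I +2, anything else -1) and are not taken from a published matrix such as BLOSUM; the three short sequences are also made up.

```python
# Toy set of conservative substitutions (an assumption for illustration).
CONSERVATIVE_PAIRS = {frozenset("KR"), frozenset("LI")}


def _check_aligned(a: str, b: str) -> None:
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")


def identity_score(a: str, b: str) -> int:
    """Identity matrix scoring: 1 for each identical position, 0 otherwise."""
    _check_aligned(a, b)
    return sum(1 for x, y in zip(a, b) if x == y)


def toy_substitution_score(a: str, b: str, match: int = 4,
                           conservative: int = 2, other: int = -1) -> int:
    """Toy substitution scoring that rewards conservative replacements."""
    _check_aligned(a, b)
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += match
        elif frozenset((x, y)) in CONSERVATIVE_PAIRS:
            total += conservative
        else:
            total += other
    return total


reference = "MKVLA"
conservative_variant = "MRVIA"   # K->R and L->I, both conservative
radical_variant = "MEVPA"        # K->E and L->P, neither in the toy set

for variant in (conservative_variant, radical_variant):
    print(variant,
          identity_score(reference, variant),
          toy_substitution_score(reference, variant))
```

Both variants match the reference at three of five positions, so identity scoring cannot tell them apart. The toy substitution scheme ranks the conservative variant higher, which is the kind of distinction a substitution matrix is meant to capture.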

Related Fields

Fields that are related to bioinformatics include^[4]:

Biophysics- a field in which biology is investigated using the techniques and concepts of the physical sciences.

Pharmacogenomics- as it relates to bioinformatics, a field in which bioinformatics techniques are used to store and process pharmacological and genetic information for the whole genome.

Pharmacogenetics- similar to pharmacogenomics, a field that uses bioinformatic and genomic techniques but focuses on one to a few genes in order to identify genomic correlates.

Medical informatics- a discipline in which computer applications such as algorithms and data structures are used to convey and process medical information effectively.

Mathematical biology- a field that uses mathematical tools and methods to represent, evaluate, and model biological processes.

Computational biology- much like bioinformatics, a field that uses computer applications and statistical methods to solve biological problems. It encompasses biological modeling, simulation, and imaging, along with techniques such as RNA structure prediction, gene prediction, sequence alignment algorithms, and multiple sequence alignment.

Proteomics- the study of the proteome, the complete collection of proteins expressed by a cell, tissue, or organism. The proteome is the protein complement of a given genome.

Genomics- a scientific branch that investigates the genome, an organism's complete DNA sequence, using methods of DNA sequencing and mapping.

Cheminformatics- the use of computers and information technology to solve problems in chemistry.

References

[1] Nelson, David L. and Cox, Michael M. Lehninger Principles of Biochemistry. New York: W. H. Freeman & Company. 2008

[2] National Center for Biotechnology Information <http://www.ncbi.nlm.nih.gov/>

[3] Berg, Jeremy M., Tymoczko, John L. and Stryer, Lubert. Biochemistry. New York: W. H. Freeman & Company. 2007

[4] Bioinformatics Organization. 2010. <http://wiki.bioinformatics.org/Bioinformatics_FAQ>