Structural Biochemistry/Bioinformatics/Homology: Difference between revisions

From Wikibooks, open books for an open world
m →‎Introduction: fixed link/typo
Tptran (discuss | contribs)
No edit summary
Line 52: Line 52:
The major sources of error in homology modeling are the poor selection of template and inaccurate template-target sequence alignment. This can be improved by using multiple sequences and structural alignment.


==Misuse of the Term==

The term “homology” is often misused when describing protein or nucleic acid sequences, because “homology is a concept of quality and cannot be ‘quantified’<ref name=Misuse/>”. In a recent analysis, the term “homology” was searched on [http://www.pubmed.gov PubMed] in the 2007 database, and 1966 abstracts contained the word in either the title or the abstract. Of these abstracts, 57% (1128) used the term properly, while 43% (838) used it incorrectly. Incorrect usage includes associating the term with a percentage value or with qualifiers such as “high”, “low”, and “significant”. The same analysis of abstracts in the 1986 database shows that the frequency of misusing the term “homology” has decreased slightly.<ref name=Habits/>


==References==
{{reflist|refs=

<ref name=Misuse>Lewin, R. (1987) When does homology mean something else? Science
237, 1570 </ref>

<ref name=Habits> "When it comes to homology, bad habits die hard." Trends in Biochemical Sciences.
Volume 34, Issue 3, March 2009, Pages 98-99. </ref>

}}
{{BookCat}}

Revision as of 00:03, 17 October 2010

Introduction

Homology is a concept that takes into account similarities that occur among the nucleic acid or protein sequences of two different organisms. The term was coined by Richard Owen in 1843. In practice, homology is inferred by comparing two amino acid sequences (for proteins) or two nucleotide sequences (for DNA) and assigning point values to the identical or similar residues that line up in an alignment. This type of analysis is useful in determining relationships between species and can help to trace ancestral descent as well as the evolutionary changes that have occurred over time in a given set of species. Today, techniques have been developed to assess the probability that two sequences are homologous, and this has become a major area of focus for bioinformaticians around the world.

Figure: Wings. The evolution of bird wings is an example of homology, as noted by Darwin on the basis of similarities in the bone structure of wings. http://www.talkorigins.org/faqs/precursors/precurscommdesc.html

Orthologs

Orthologs are homologous gene sequences found in two different species; they are closely related and often have the same function. The term ortholog stems from the Greek root "ortho", meaning "straight" or "correct", and was coined by Walter Fitch in 1970. Orthologs are separated by a speciation event: when a species diverges into two separate species, the copies of a single ancestral gene carried by the two new species diverge and become orthologous sequences.

An example of orthologous genes is the pair of genes that code for hemoglobin in cows and in humans. Mapping orthologs helps biologists construct evolutionary trees that are much more detailed and specific, so taxonomy and phylogenetic studies both benefit from orthologous sequences. A simple anatomical illustration is a bat and a bird: they belong to two different species, and yet their wings have the same function.

Paralogs

Paralogs are homologous gene sequences found within the same species that often exhibit different functions. Paralogs are usually the product of gene duplication, which can be caused by mechanisms such as transposition or unequal crossing-over. The duplicated genes initially have similar functions and can mutate further to take on other functions, which results in paralogs.

The number of differences, or substitutions, between paralogs is roughly proportional to the time that has passed since the gene was duplicated, which sheds light on the way genomes evolve. Myoglobin and hemoglobin are considered ancient paralogs, descended from a single ancestral globin gene.

The genes that encode hemoglobin and myoglobin are thought to be paralogs: both proteins have similar structures but differ in their oxygen-carrying duties. There are four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin B, and hemoglobin F), which are all paralogs of one another. Other examples of paralogs are actin and Hsp-70. Their tertiary structures are similar but their functions are different; actin is part of the cytoskeleton, while Hsp-70 is a heat shock protein.

Sequence Alignments Detect Homologs

To test whether or not two molecules are homologous, it is important to examine their nucleic acid or protein sequences for matches that occur between the two. Although both kinds of sequence can be used, the protein sequence is usually preferable because it is composed of 20 different building blocks (amino acids), while DNA and RNA are each composed of only four (nucleotides). A significant number of matches between protein sequences is therefore much stronger evidence of common ancestry than the same number of matches between nucleic acid sequences. In addition, the genetic code is redundant: several different codons can code for the same amino acid, so two genes can differ at the DNA level while encoding the same protein. Comparing proteins is therefore more sensitive in detecting similarities in function than comparing DNA or RNA.
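The redundancy of the genetic code can be illustrated with a small sketch. The two DNA sequences below (`gene_1`, `gene_2`) are made-up examples, and `CODON_TABLE` contains only the handful of standard codons they need, not the full genetic code. The helper names `translate` and `identity` are likewise chosen only for this illustration.

```python
# Partial codon table (standard genetic code): only the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}


def translate(dna):
    """Translate a DNA string, three bases at a time, into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical genes that differ only in the third base of three codons.
gene_1 = "ATGGCTAAAGGT"
gene_2 = "ATGGCGAAGGGC"

print(identity(gene_1, gene_2))                        # 0.75 at the DNA level
print(translate(gene_1), translate(gene_2))            # MAKG MAKG
print(identity(translate(gene_1), translate(gene_2)))  # 1.0 at the protein level
```

The two genes are only 75% identical as DNA, yet they encode the same peptide, which is why the protein comparison reflects function more directly.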

Two different protein sequences can be compared by counting the number of positions at which their amino acids match, either when the sequences are aligned directly above each other or when one sequence is slid past the other. For instance, amino acid 1 of the top sequence can sit directly above amino acid 1 of the second sequence, or it can be slid to the left or right of it, which causes different amino acids to align. The number of matches is then plotted against the alignment (the offset) in order to find the alignment at which the maximum number of matches occurs. It is important to understand that a large number of matches does not by itself mean that the two proteins are homologs.
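The sliding comparison described above can be sketched in a few lines. This is a minimal illustration, not a production alignment tool: the function names `matches_at_offset` and `best_offset` and the two toy sequences are assumptions made for the example, and no gaps are considered.

```python
def matches_at_offset(top, bottom, offset):
    """Count identical residues when `bottom` is shifted right by `offset`
    positions relative to `top` (a negative offset shifts it left)."""
    count = 0
    for i, residue in enumerate(bottom):
        j = i + offset  # position in `top` that sits above bottom[i]
        if 0 <= j < len(top) and top[j] == residue:
            count += 1
    return count


def best_offset(top, bottom):
    """Try every offset with at least one overlapping position and return
    the offset with the most matches, plus the full offset -> matches table."""
    offsets = range(-(len(bottom) - 1), len(top))
    scores = {k: matches_at_offset(top, bottom, k) for k in offsets}
    best = max(scores, key=scores.get)
    return best, scores


# Toy sequences in one-letter amino acid code (hypothetical).
top = "MKVLAAGIT"
bottom = "VLAAG"

best, scores = best_offset(top, bottom)
print(best, scores[best])  # 2 5
```

The `scores` dictionary is the data one would plot: offset on the horizontal axis, number of matches on the vertical axis.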

To account for mutations such as insertions and deletions, gaps may be inserted to create a better match. If two different offsets each give a good match over part of the sequences, a gap may be inserted so that both matching regions are accommodated in a single alignment. Scientists then score the alignment, for example with +10 points for each match and -25 points for each gap, no matter its size. This score must then be compared against a distribution of scores obtained by randomly shuffling one of the protein sequences and comparing it to the other many times, to ensure that the amino acid matches were not just due to chance. If the score deviates largely from the majority of the shuffled scores, then the two proteins are probably homologs. However, a low score does not rule out homology.
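The scoring scheme and the shuffling control can be sketched as follows. The +10 and -25 values come from the text; everything else is an assumption made for illustration. In particular, `alignment_score` expects an alignment that has already been written out with `-` for gap positions, and `shuffle_test` uses a simplified ungapped sliding comparison (`best_ungapped_score`) rather than a full gapped alignment of each shuffled sequence.

```python
import random

MATCH_SCORE = 10   # points per identical aligned pair (from the text)
GAP_PENALTY = -25  # points per gap, regardless of its length (from the text)


def alignment_score(aligned_a, aligned_b):
    """Score two equal-length aligned strings in which '-' marks a gap position.
    A run of consecutive '-' in the same sequence counts as a single gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None  # which sequence (if any) had a gap at the previous column
    for x, y in zip(aligned_a, aligned_b):
        gap = "a" if x == "-" else "b" if y == "-" else None
        if gap is not None:
            if gap != prev_gap:  # a new gap opens here
                score += GAP_PENALTY
        elif x == y:
            score += MATCH_SCORE
        prev_gap = gap
    return score


def best_ungapped_score(a, b):
    """Best match-only score over all ungapped offsets of b against a."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1 for i, r in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == r
        )
        best = max(best, matches * MATCH_SCORE)
    return best


def shuffle_test(a, b, trials=1000, seed=0):
    """Return the observed score and the fraction of shuffled versions of b
    that score at least as well against a."""
    rng = random.Random(seed)
    observed = best_ungapped_score(a, b)
    residues = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if best_ungapped_score(a, "".join(residues)) >= observed:
            at_least += 1
    return observed, at_least / trials


# One gap costs 25 points whether it spans one residue or three.
print(alignment_score("MKV-LAAG", "MKVTLAAG"))      # 45
print(alignment_score("MKV---LAAG", "MKVTTTLAAG"))  # 45
```

A small fraction returned by `shuffle_test` suggests that the observed score is unlikely to arise by chance; as the text notes, a large fraction does not rule out homology.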

Homolog Sequencing Technology: Matrices

Scores may be calculated using identity or substitution matrices. Identity matrices assign a value of one to matches between sequences and zero to non-matches. This method does not distinguish between likely and rare mutations and therefore does not give a clear answer about homology. Substitution matrices account for conservative mutations, which are less likely to be deleterious or to seriously change the function (such as exchanging glycine and alanine), by giving them a positive score. In other words, substitution matrices not only reward identical residues (which receive the highest scores) but, unlike identity matrices, also assign values to positions where one amino acid has been substituted by another. The more similar the two amino acids are, the larger the value the substitution receives. The more different they are, or the rarer the substitution (for example, A replaced by P), the more negative the value becomes. By distinguishing between the different types of mutations, better matches can be made and alignments that arise from random chance are more easily avoided.
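The difference between the two kinds of matrix can be shown with a toy example. The values in `TOY_SUBSTITUTION` are invented for illustration and are not taken from any published matrix; the function names and the three short sequences are also hypothetical.

```python
def identity_score(x, y):
    """Identity matrix: 1 for a match, 0 for anything else."""
    return 1 if x == y else 0


# Illustrative values only (NOT a published matrix), keyed by sorted residue pair.
TOY_SUBSTITUTION = {
    ("A", "A"): 4, ("G", "G"): 4, ("P", "P"): 4,
    ("A", "G"): 2,   # conservative substitution: positive score
    ("A", "P"): -3,  # rarer substitution: negative score
    ("G", "P"): -3,
}


def substitution_score(x, y):
    """Look up a residue pair in the toy matrix, ignoring the order of the pair."""
    return TOY_SUBSTITUTION[tuple(sorted((x, y)))]


def score(a, b, pair_score):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


reference = "AGAP"
conservative = "GGAP"  # A -> G at the first position
rare = "PGAP"          # A -> P at the first position

# The identity matrix cannot tell the two variants apart.
print(score(reference, conservative, identity_score))  # 3
print(score(reference, rare, identity_score))          # 3

# The substitution matrix ranks the conservative change higher.
print(score(reference, conservative, substitution_score))  # 14
print(score(reference, rare, substitution_score))          # 9
```

Keying the dictionary by the sorted pair keeps the matrix symmetric without storing every pair twice.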


Figure: Substitution matrix, from clcbio.com

Homology Modeling

The primary goal of homology modeling is to study the structure of macromolecules. X-ray crystallography and NMR are the principal experimental ways to obtain detailed structural information; however, these techniques involve elaborate procedures, and many proteins fail to crystallize or cannot be obtained or dissolved in adequate quantities for NMR analysis. In such cases, building a model on the basis of the known three-dimensional structure of a homologous protein is a reliable way to obtain structural information about the unknown protein. These are the main steps in homology modeling:

1. Finding homologous proteins in structure database files (the templates). Template selection is a critical step in homology modeling. Template identification can be aided by database search techniques.

2. Creation of the alignment, using single or multiple sequence alignments.

When more than one known structure is involved, the known sequences are aligned together first, and the unknown sequence is then aligned with the group; this helps ensure better domain conservation. Furthermore, the alignment can be corrected by the insertion or deletion of gaps. Although the introduction of gaps complicates the alignment, methods have been developed that use scoring systems to compare different alignments and penalize gaps to prevent unreasonable insertions. Scoring an alignment involves identity matrices or substitution matrices. Substitution matrices are generally considered the better choice; these methods are based on an analysis of the frequency with which a given amino acid is observed to be replaced by other amino acids among proteins whose sequences can be aligned.

3. Model generation: The information contained in the template and the alignment can be used to generate a three-dimensional structural model of the protein, which is represented as a set of Cartesian coordinates.

4. Model refinement: The major sources of error in homology modeling are poor template selection and inaccurate template-target sequence alignment. Both can be improved by using multiple sequence alignments and structural alignment.

Misuse of the Term

The term “homology” is often misused when describing protein or nucleic acid sequences, because “homology is a concept of quality and cannot be ‘quantified’[1]”. In a recent analysis, the term “homology” was searched on PubMed in the 2007 database, and 1966 abstracts contained the word in either the title or the abstract. Of these abstracts, 57% (1128) used the term properly, while 43% (838) used it incorrectly. Incorrect usage includes associating the term with a percentage value or with qualifiers such as “high”, “low”, and “significant”. The same analysis of abstracts in the 1986 database shows that the frequency of misusing the term “homology” has decreased slightly.[2]


References

  1. Lewin, R. (1987) When does homology mean something else? Science 237, 1570
  2. "When it comes to homology, bad habits die hard." Trends in Biochemical Sciences. Volume 34, Issue 3, March 2009, Pages 98-99.