Structural Biochemistry/Bioinformatics/Sequences Alignments

From Wikibooks, open books for an open world

There are over a million different genes, whose protein products fold into tens of thousands of different structures. Homologous structures, therefore, must exist. Because the number of structures is limited, many proteins have very similar structures, and sequence alignments are the tool used to detect this. The theory of homology rests on experimental evidence that genes and proteins are similar because they descend from common ancestors. Little is known about most genes, so homology can be used to predict a gene's function. There are two types of homologs: paralogs and orthologs. Paralogs are found within the same organism and have similar sequences and structures, such as hemoglobin and myoglobin, but serve different functions. Orthologs are found in different organisms but serve essentially the same function in each host organism, which is evidence of common evolutionary ancestry.

The human genome consists of over 3 billion base pairs and roughly 20,000 to 25,000 protein-coding genes. Alternative splicing allows a single gene to encode several different proteins.

Sequence alignments can be used to detect homology between two polypeptide chains. Working out sequence alignments can help reconstruct evolutionary origins and trace the function, structure, and mechanism of a gene or protein. Repeated motifs can be detected by aligning a sequence with itself. More than 10% of all proteins have two or more regions that are similar to one another. An example is the TATA-box-binding protein, which comprises two similar regions that were identified by aligning the protein's sequence with itself. The three-dimensional structure of this protein has been elucidated, and it confirms the two similar regions.

The percentage of similarity between two sequences is reported for the best possible alignment, that is, the highest-scoring one among all the alignments that can be made between them.

The simplest way to compare two protein sequences is to line them up and count the matching residues. One sequence is then slid along the other by one residue, and the matches are counted again. Repeating this for every possible offset produces an alignment score for each one.
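
The sliding comparison described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: no gaps are allowed, every identical pair counts as one point, and the function names and example sequences are made up for this sketch.

```python
def count_matches(seq_a, seq_b, offset):
    """Count identical residues when seq_b is shifted by `offset` along seq_a."""
    matches = 0
    for i, residue in enumerate(seq_b):
        j = i + offset  # position in seq_a that lines up with seq_b[i]
        if 0 <= j < len(seq_a) and seq_a[j] == residue:
            matches += 1
    return matches


def sliding_scores(seq_a, seq_b):
    """Return {offset: match count} for every offset at which the sequences overlap."""
    return {
        offset: count_matches(seq_a, seq_b, offset)
        for offset in range(-(len(seq_b) - 1), len(seq_a))
    }


def best_offset(seq_a, seq_b):
    """Return (offset, score) for the highest-scoring ungapped alignment."""
    scores = sliding_scores(seq_a, seq_b)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical example sequences, chosen only to show the idea.
print(best_offset("GAVLIMKTW", "VLIMK"))
```

Aligning a sequence with itself in the same way, and ignoring the trivial offset of zero, is one simple way to look for the internal repeats mentioned earlier.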

Amino acids can be very similar to each other, and can therefore replace one another over the course of evolution. Sequence alignments account for this by tolerating mismatches and weighing how probable each substitution is, in addition to counting identities.

Newly elucidated protein sequences can be compared against a large database of previously sequenced proteins. This procedure is called a BLAST (Basic Local Alignment Search Tool) search. Using BLAST, the homology of a newly sequenced protein can be determined, and its function and tertiary structure can be predicted. The first completed genome of a free-living organism, that of the bacterium Haemophilus influenzae, encoded roughly 1743 protein sequences. Using a BLAST search, researchers were able to identify possible functions and structures for 1007 of these protein sequences.

A sequence alignment, produced by ClustalO, of mammalian histone proteins.
Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[1]

Homology

With the thousands of genes that are currently known, it is less feasible to deduce complete information about each gene individually and more feasible to compare genes or proteins through their evolutionary relationships. Homologous genes and proteins, therefore, are those that have markedly similar characteristics.

Two sequences can share an evolutionary origin and be extremely similar, yet over time one of them may have lost a set of nucleotides or amino acids with barely any effect on the function of the gene or protein. Similar amino acids can also replace each other with little to no effect on function. Genes or proteins that differ by such deletions and substitutions are still homologous.

Gaps

Gaps are introduced when doing so lets the sequences be aligned with a greater number of matching residues. For example, if two different offsets each align part of the sequences well, a gap may be inserted so that both matching regions fit into a single alignment. Gaps also reflect the insertions and deletions of nucleotides that accumulate over time.

Gaps Increase Complexity

In principle, gaps of any size and number can be added at any place in a sequence. To avoid an excessive number of gaps, which would let the alignment stray too far from the original sequences, scoring systems with penalties are used. One example assigns a penalty of -25 to each gap, regardless of its size, and a score of +8 to each identity in the alignment. With 50 identities and 1 gap, the score is (50*8)-(1*25) = 375. If the aligned sequence has 86 residues, the identity is 50/86, or about 58%. The score and the percentage of identity [see below] together indicate how likely it is that the similarity is statistically meaningful.
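
The arithmetic of this scoring scheme can be checked with a short Python sketch. The constants are the values used in the example above; the function names are made up for illustration.

```python
IDENTITY_SCORE = 8  # points added per identical residue pair (value from the example)
GAP_PENALTY = 25    # points subtracted per gap, regardless of the gap's length


def alignment_score(identities, gaps):
    """Score an alignment from its number of identities and number of gaps."""
    return identities * IDENTITY_SCORE - gaps * GAP_PENALTY


def percent_identity(identities, aligned_length):
    """Percentage of aligned positions that are identical."""
    return 100.0 * identities / aligned_length


print(alignment_score(50, 1))                # 375
print(round(percent_identity(50, 86), 1))    # 58.1
```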


Shuffling

To check whether an alignment score is statistically significant, one of the original sequences is shuffled randomly. The shuffled sequence is then aligned to the other original sequence to produce an alignment score, and the procedure is repeated many times. The alignment score of the original sequences is then compared with the scores obtained from the shuffled sequences.

When the un-shuffled alignment score is compared with the shuffled alignment scores, an un-shuffled score that lies many standard deviations from the mean of the shuffled scores (an outlier) indicates that the sequences are likely homologous and that the similarities are not simply due to chance. In the textbook example, the probability of the un-shuffled alignment score deviating this much from the shuffled scores by chance is approximately 1 in 10^20,[2] which shows how strongly the authentic alignment stands out. A score that does not stand out in this test, however, does not rule out homology.
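
A shuffling test of this kind might be sketched as follows. This is a simplified illustration, not the procedure from the cited textbook: it assumes an ungapped alignment scored by counting identities, it reports a z-score rather than an exact probability, and the function names, the number of trials, and the example sequence are all arbitrary choices.

```python
import random
import statistics


def best_ungapped_score(seq_a, seq_b):
    """Highest identity count over all ungapped offsets of seq_b along seq_a."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        score = sum(
            1
            for i, residue in enumerate(seq_b)
            if 0 <= i + offset < len(seq_a) and seq_a[i + offset] == residue
        )
        best = max(best, score)
    return best


def shuffle_test(seq_a, seq_b, trials=200, seed=0):
    """Compare the real score with scores of randomly shuffled copies of seq_b.

    Returns (real score, mean of shuffled scores, their standard deviation, z-score).
    """
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    real = best_ungapped_score(seq_a, seq_b)
    shuffled_scores = []
    for _ in range(trials):
        residues = list(seq_b)
        rng.shuffle(residues)  # same composition, random order
        shuffled_scores.append(best_ungapped_score(seq_a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    stdev = statistics.pstdev(shuffled_scores)
    if stdev == 0:
        z = 0.0 if real == mean else float("inf")
    else:
        z = (real - mean) / stdev
    return real, mean, stdev, z


# Arbitrary example sequence, aligned against itself as an extreme case.
example = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTK"
print(shuffle_test(example, example))
```

A large z-score means the real alignment scores far above what sequences of the same composition achieve by chance.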

Identity Matrices

An identity matrix is a way to evaluate the likeness of two amino acid sequences. With an identity matrix, the two sequences are given a point every time the amino acids match exactly. It is all or nothing: two amino acids either match or they do not. Identity matrices are less accurate at evaluating the likelihood that two sequences are homologous, because mutations often change the amino acid sequence without changing the function of the protein, or while changing it very little. These mutations usually occur between similar amino acids such as leucine and isoleucine. For this reason, other techniques such as substitution matrices are preferred.

Substitution Matrices

Homology is an important tool in evolutionary biology. A substitution matrix is one way to study homology, in that it describes the similarities between protein sequences or DNA sequences. It does this with a point system in which each pair of aligned residues is scored according to how often that substitution is observed compared with what would be expected by chance. Amino acids differ in how readily they are replaced by one another. A hydrophobic amino acid (e.g. valine) has a greater chance of being replaced by another hydrophobic amino acid (e.g. leucine). Substitutions that occur often receive a high positive score, and rare substitutions are given negative scores. Identical amino acid matches are also given points in a substitution matrix. Many types of substitution matrix have been developed that assign different points to each substitution; examples are the PAM and BLOSUM matrices, which are used by tools such as BLAST. For proteins these are 20x20 matrices. BLOSUM (block substitution) matrices are derived by comparing blocks of conserved sequence in many sequence alignments. The blocks are assumed to have functional significance in evolutionary biology.

Sequence analysis using substitution matrices is much more sensitive than analysis using identity matrices because it accounts for conservative substitutions that may have happened over time and that do not significantly alter the structure of the protein. Substitution matrices can detect homology between sequences that would otherwise have been judged not homologous using simple identity matrices.
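
The difference between the two scoring approaches can be illustrated with a small Python sketch. The scores below are toy values invented for this example; they are not taken from any PAM or BLOSUM matrix, and the function names and peptides are hypothetical.

```python
# Toy scores for a few conservative substitutions (invented for illustration,
# NOT real PAM or BLOSUM values). Keys are alphabetically sorted residue pairs.
TOY_SUBSTITUTION = {
    ("I", "L"): 2,
    ("L", "V"): 1,
    ("I", "V"): 3,
    ("D", "E"): 2,
}
MATCH_SCORE = 4        # toy score for an identical pair
DEFAULT_MISMATCH = -2  # toy score for any substitution not listed above


def identity_score(seq_a, seq_b):
    """All-or-nothing scoring: one point per exact match."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


def substitution_score(seq_a, seq_b):
    """Score aligned residues with the toy substitution table."""
    total = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            total += MATCH_SCORE
        else:
            key = tuple(sorted((x, y)))
            total += TOY_SUBSTITUTION.get(key, DEFAULT_MISMATCH)
    return total


# Two hypothetical variants of the peptide LVDK, each with two changed residues.
print(identity_score("LVDK", "IVEK"), identity_score("LVDK", "WVPK"))
print(substitution_score("LVDK", "IVEK"), substitution_score("LVDK", "WVPK"))
```

Identity scoring rates both variants the same, whereas the substitution table ranks the variant with conservative changes (L to I, D to E) well above the other.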

A substitution matrix.

Probability of Identity

If two sequences are more than 25% identical over a chain of at least 100 amino acids, the likelihood that they are homologs is high. If the two sequences are less than 15% identical, the likelihood that they are homologs is low. Between 15% and 25%, other methods, such as comparison of the tertiary structures, must be used to confirm homology.
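
These thresholds amount to a simple decision rule, sketched below in Python. The function name and the wording of the returned labels are made up for illustration, and the rule is only the rough guide stated above, not a statistical test.

```python
def homology_call(percent_identity, length):
    """Apply the 25% / 15% rule of thumb to an alignment of `length` residues."""
    if length < 100:
        return "too short for this rule of thumb"
    if percent_identity > 25:
        return "probably homologous"
    if percent_identity < 15:
        return "probably not homologous"
    return "uncertain: compare tertiary structures"


print(homology_call(30, 150))
print(homology_call(20, 150))
```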

Sequence Templates

In sequence alignment, certain amino acid residues are more important to the function of the protein than others and are more highly conserved throughout evolution. The regions that are critical to function, and the amino acid residues that make them up, can be determined by examining the three-dimensional structure of the protein. For example, the proteins of the globin family (hemoglobin, myoglobin, leghemoglobin) bind oxygen via a heme group, and a histidine residue of the protein interacts with the iron in the heme group. This histidine residue is conserved in all proteins of the globin family. The region around it, which is significant to globin proteins, can be used as a sequence template that is characteristic of this family. Newly elucidated protein sequences can then be compared with this sequence template to assign the protein to a family or to determine whether the new protein has functions similar to those of the family.
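
One simple way to search a new sequence for such a template is a regular expression. The sketch below uses a hypothetical pattern built around a conserved histidine; the pattern, the function name, and the test sequences are invented for illustration and are not the actual globin template.

```python
import re

# Hypothetical template (invented for this sketch): a hydrophobic residue
# (L, I, V or F), any residue, a conserved histidine (H), any residue, glycine (G).
TEMPLATE = re.compile(r"[LIVF].H.G")


def find_template(sequence):
    """Return (start index, matched segment) for every template match."""
    return [(m.start(), m.group()) for m in TEMPLATE.finditer(sequence)]


print(find_template("MKTALSHAGQ"))  # contains the hypothetical template
print(find_template("MKTAAAAAGQ"))  # does not
```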

Methods of Sequencing

The Sanger dideoxy method is used to sequence DNA. This fast and simple process uses DNA polymerase to synthesize complementary strands that are terminated by fluorescently tagged dideoxynucleotides, with a different tag for each of the four bases. The resulting DNA fragments are separated via electrophoresis or chromatography and then passed through a detector. Another method used to sequence genomic DNA is the shotgun method.

Edman degradation is used to sequence proteins. Phenyl isothiocyanate reacts with the amino group of the N-terminal amino acid, and the labeled residue is then removed under acidic conditions. High pressure liquid chromatography (HPLC) is used to identify the released amino acid. The process is repeated for each of the following amino acids.

Databases

Isolating an individual sequence and comparing it with any other given sequence can be tedious and time consuming. Therefore, there exist databases of homologous sequences that can be readily obtained and used. The methods of sequence alignment listed above are tremendously useful when used alongside the broad databases and resources available on the Internet.

PAM and BLOSUM matrices are two of the most frequently used scoring techniques.

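BLOSUM, or Block Substitution Matrix, is a technique based on local multiple alignments of related sequences. BLOSUM 62 is the default matrix for protein searches in BLAST. The number refers to an identity threshold: BLOSUM 62 is built from blocks in which sequences that are at least 62% identical are grouped together, while BLOSUM 80 uses a threshold of 80% identity, etc.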

- The Basic Local Alignment Search Tool (BLAST) is hosted by the National Center for Biotechnology Information. An individual amino acid sequence can be searched through the web browser. There are over 3 million sequences in the database. In addition, the sequence entered can be compared with a chosen genome (such as the human genome) or with all the genomes currently in the database. The search returns a list of sequence alignments with a percentage of identity for each. It looks for similarity between DNA or protein sequences.

PAM stands for Point Accepted Mutation, also read as the percentage of accepted point mutations per 10^8 years. PAM matrices are derived from global alignments of closely related proteins. The base matrix, PAM1, corresponds to a divergence of 1%, that is, one accepted point mutation per 100 residues. Each entry of the mutation probability matrix gives the probability that the amino acid in column X is replaced by the amino acid in row Y over that period. By multiplying this matrix by itself repeatedly, new matrices can be made to measure greater evolutionary distances.
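
The repeated multiplication can be sketched in plain Python. The 3x3 matrix below is a toy stand-in invented for this example (a real PAM1 matrix is 20x20 and is not symmetric like this one); only the idea of raising the matrix to a power follows the description above.

```python
# Toy mutation probability matrix for a three-letter alphabet (invented values).
# Column = original residue, row = residue after mutation; each column sums to 1.
PAM1_TOY = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(matrix, power):
    """Multiply `matrix` by itself `power` times, starting from the identity matrix."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


pam250_toy = mat_pow(PAM1_TOY, 250)
# The chance that a residue is unchanged drops as the evolutionary distance grows.
print(PAM1_TOY[0][0], round(pam250_toy[0][0], 3))
```

Real scoring matrices are then obtained from such probabilities as log-odds scores, a step this sketch leaves out.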

There are three main databases for DNA: GenBank, EMBL, and DDBJ. These contain numerous entries, each holding the DNA sequence of a gene or of other DNA, such as the genetic mapping markers discovered and cloned by scientists so far. Each sequence entry is assigned a unique accession number.

NCBI (National Center for Biotechnology Information)- a collection of databases and analysis tools. This site is supported by the National Institutes of Health and is free for researchers or anyone else who is interested in it. On the NCBI website one can search for a sequence of protein, DNA, RNA, etc. Many of the databases at NCBI are linked through a search and retrieval system called Entrez, which allows text searches using key words.

ExPASy (Expert Protein Analysis System)- a very useful collection of protein and amino acid sequence analysis tools that is part of the server of the Swiss Institute of Bioinformatics.

Protein Data Bank- a database of protein structural information.

Clustal W- an online amino acid sequence alignment program that is part of the European Bioinformatics Institute website. This is a powerful tool for comparing protein sequences. After aligning, one can click on "show colors" to view a color-based representation of amino acid similarities.

How to look up a sequence in GenBank

The following is a step-by-step guide to using the programs and websites available online:

1. Go to the NCBI home page.

2. The menu next to "all databases" lists the different types of databases available. Pick the appropriate one. For example, to find a DNA sequence, pick nucleotide.

3. Use key words to find the sequence. The search will return many results, so which one is the one we are looking for? If we want a DNA sequence that contains the entire coding region of a gene, we should look for an mRNA entry, from which the introns have already been removed, or an entry marked "complete cds". It can be easier to find the desired sequence by typing the common name of the animal (if you are looking for an animal's gene).

4. The accession number is the ID tag for the specific sequence. It appears in blue once the desired sequence has been found.

5. The DNA sequence is given at the bottom of the page, and the numbering of the nucleotides is given alongside the sequence.

6. CDS stands for coding sequence.

If one wants to find homologs, BLAST is used:

1. Go to the NCBI home page and click on BLAST. There are many different alignment options; in this case, we will pick nucleotide BLAST.

2. Type the unknown sequence into the large field. Under "Choose Search Set", pick "Others", then click BLAST.

3. A summary page then lists the sequences that match the query nucleotide sequence, from highest similarity (top) to lowest (bottom).

4. Query coverage and maximum identity columns are also shown. Query coverage is the percentage of the query sequence that is covered by the alignment, and maximum identity is the percentage of aligned nucleotides that are identical. From these, the likely homologs of the unknown sequence can be determined.

BLAST can also be used to compare or align two DNA sequences to see how similar they are:

1. Get the entire gene sequence for both of the sequences to be compared (as described before).

2. Open the BLAST home page and click on align under Specialized BLAST.

3. In the query sequence box, you can enter either the accession number or the whole sequence.

4. Program selection offers many different programs. After selecting the appropriate one, click BLAST. The two selected DNA sequences will then be aligned.


Three stage approach to genome sequencing

The initial stage:

Cytogenetic maps of the chromosomes provided the starting point for more detailed mapping. With these cytogenetic maps in hand, the initial stage in sequencing the human genome was to construct a linkage map of several thousand genetic markers spaced throughout the chromosomes. The order of the markers and the relative distances between them on such a map are based on recombination frequencies. The markers can be genes or any other identifiable sequences in the DNA. The linkage map was also valuable as a framework for organizing more detailed maps of particular regions.

The second stage:

This stage was the physical mapping of the human genome. In a physical map, the distances between markers are expressed by some physical measure, usually the number of base pairs along the DNA. The key is to make fragments that overlap and then use probes or automated nucleotide sequencing of the ends to find the overlaps. In this way, fragments can be assigned to a sequential order that corresponds to their order in a chromosome. When working with a large genome, researchers carry out several rounds of DNA cutting, cloning, and physical mapping. After the long fragments are put in order, each fragment is cut into smaller pieces, which are cloned in plasmids or phages, ordered in turn, and finally sequenced.
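
The idea of ordering fragments by their overlapping ends can be sketched with a toy example in Python. This is a greatly simplified illustration, assuming error-free fragments from a single strand and a greedy merge of the best overlap; the function names, the minimum overlap length, and the example fragments are invented for this sketch.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def assemble(fragments, min_len=3):
    """Greedily merge the pair of fragments with the longest overlap until none remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; the fragments stay separate
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping fragments of a short sequence, given out of order.
print(assemble(["CGTTAGC", "ATGGCGTA", "CGTACGTT"]))
```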

The last stage:

The ultimate goal in mapping a genome is to determine the complete nucleotide sequence of each chromosome. For the human genome, this was accomplished by sequencing machines using the chain-termination method.

Sequence Alignment Programs: Geneious

There are many programs that are used to align sequences that have already been produced by sequencing companies. One widely used program is Geneious. Geneious is a cross-platform suite of bioinformatics applications for sequence alignment and for BLAST searches against NCBI. Its features range from a split-view genome browser, which eases restriction analysis and cloning workflows, to PCR primer design, which allows one to design and test degenerate primers that tolerate mismatches.

References

  1. "Clustal FAQ #Symbols". Retrieved 8 December 2014.
  2. Berg, Jeremy M., John L. Tymoczko, and Lubert Stryer. Biochemistry, Sixth Edition. W.H. Freeman and Company. New York, 2007.
  3. Coleman, Aaron, Meredith Gould, and Jose Luis Stephano. Biochemical Techniques. University of California, San Diego. 2009.
  4. "Genomes and their evolution." Biology. Campbell and Reece. 8th Ed. 2007. 500-600.