Next Generation Sequencing (NGS)/Alignment

From Wikibooks, open books for an open world
Jump to: navigation, search
Next Generation Sequencing (NGS)
Pre-processing Alignment DNA Variants

Introduction[edit]

Alignment, also called mapping,[1] of reads is an essential step in re-sequencing. Having sequenced an organism of a species before, and having constructed a reference sequence, re-sequencing more organisms of the same species allows us to see the genetic differences to the reference sequence, and, by extension, to each other. Alignments of data from these re-sequenced organisms is a relatively simple method of detecting variation in samples. There are certain instances (such as new genes in the sequenced sample that are not found in the existing reference sequence) that can not be detected by alignment alone; however, while other approaches, such as de novo assembly, are potentially more powerful, they are also much harder or, for some organisms, impossible to achieve with current sequencing methods.

Next-generation sequencing generally produces short reads or short read pairs, meaning short sequences of <~200 bases (as compared to long reads by Sanger sequencing, which cover ~1000 bases). To compare the DNA of the sequenced sample to its reference sequence, we need to find the corresponding part of that sequence for each read in our sequencing data. This is called aligning or mapping the reads against the reference sequence. Once this is done, we can look for variation (e.g. SNPs) within the sample. This poses a number of problems:

  • The short reads do not come with position information, that is, we do not know what part of the genome they came from; we need to use the sequence of the read itself to find the corresponding region in the reference sequence.
  • The reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region.
  • Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.
  • If we were only looking for perfect matches to the reference, we would never see any variation. Therefore, we need to allow some mismatches and small structural variation (InDels) in our reads.
  • Any sequencing technology produces errors. Similar to the "real" variation, we need to tolerate a low level of sequencing errors in our reads, and separate them from the "real" variation later.
  • We need to do that for each of the millions of reads in our sequencing data.

Short reads[edit]

Raw short reads often come in (or can be converted into) a file format called FASTQ[2]. It is a plain text format, where each single read occupies four consecutive lines:

@Read_id_1
CTGATGTGCCGCCTCACTTCGGTGGT
+
@@@DDDDDH8<BAHG@BHGIHIII>(
@Read_id_2
TGATGTGCCGCCTCACTACGGTGGTG
+
FHHHHHJIJIJIJIIIJJIIJGIGII
@Read_id_3
...

The four lines are:

  1. The name/ID of the read, preceded by a "@". For read pairs, there will be two entries with that name, either in the same or a second FASTQ file.
  2. The sequence of the read.
  3. A "+" sign. In very old FASTQ files, this is followed by the read name from the first line. Today, this line is present for historical reasons backwards compatibility only.
  4. The quality scores of the bases from line 2. The scores are generated by the sequencing machine, and encoded as ASCII (33+score) characters. The line should have the same length as line 2, as there is one quality score per base.

Alignment[edit]

For each of the short reads in the FASTQ file, a corresponding location in the reference sequence (or that no such region exists) needs to be determined. This is achieved by comparing the sequence of the read to that of the reference sequence. A mapping algorithm will try to locate a (hopefully unique) location in the reference sequence that matches the read, while tolerating a certain amount of mismatch to allow subsequence variation detection. Reads aligned (mapped) to a reference sequence will look like this:

GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG  Reference sequence
 CTGATGTGCCGCCTCACTTCGGTGGT        Short read 1
  TGATGTGCCGCCTCACTACGGTGGTG       Short read 2
   GATGTGCCGCCTCACTTCGGTGGTGA      Short read 3
GCTGATGTGCCGCCTCACTACGGTG          Short read 4
GCTGATGTGCCGCCTCACTACGGTG          Short read 5

You can see the reference sequence on the top row, and five short reads stacked below; this is called a pileup. While two of the reads are a perfect match to the reference, the three other reads show a mismatch each, highlighted in red ("A" in the read, instead of "T" in the reference). Since there are multiple reads showing the mismatch, at the same position, with the same difference, one could conclude that it is an actual genetic difference (point mutation or SNP), rather than a sequencing error or mismapping.

Mapping algorithms[edit]

There are several alignment algorithms in existence; you can find an (incomplete) list further down in software packages. Some notes on mapping algorithms:

  • The reference sequence, the short reads, or both, are often pre-processed into an indexed form for rapid searching. (See BWT)

Sources of errors[edit]

There are several potential sources for errors in an alignment, including (but not limited to):

  • PCR artefacts. Many NGS methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
  • Sequencing errors. The sequencing machine can make an erroneous call, either for physical reasons (e.g. oil on an Illumina slide), or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
  • Mapping errors. The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.

Alignment types[edit]

Alignments can be used for different purposes:

  • Whole-genome sequencing. This would be the "default" use; sequence all DNA from an organism and map it to the appropriate reference sequence, to find genetic variation.
  • Exome sequencing. For large genomes (e.g., human), capture just the exomic DNA before sequencing. This will return sequencing data for most of the genes, at a fraction of the cost.
  • Transcriptome sequencing (RNA-Seq). Sequencing of the transcriptome, that is, of the RNA present in the sample. This can show which genes are transcribed in the sample, and help fine-tune gene annotation (exon boundaries etc.). Mapping can be done either to the full reference sequence, or to a special "transcriptome reference".
  • ChIP-Seq (Protein-DNA interaction).

The SAM/BAM format[edit]

The SAM/BAM format has emerged as the de facto standard format for short read alignments. SAM[3] is the plain-text version of the binary, compressed BAM format. They can be converted into one another by the name-giving samtools[4] command-line tool. BAM (without alignment position data) is increasingly used as a space-saving alternative to FASTQ files for containing the short raw read data, and all current alignment software can generate SAM/BAM as an output format. Once in BAM format, the file can be indexed, giving quick access to any region of the reference sequence. Subsequently, using samtools or other software, BAM files can be analysed (e.g. for quality control), modified (removal of PCR duplicates, local realignment, base quality recomputation), or used to call variation, either small (SNPs, short InDels) or large (inversions, tandem duplications, deletions, translocations). BAM files can be visualised using tools like Artemis, ACT, or LookSeq[5]. Last not least, alignments in BAM format can be used to "morph" the reference sequence to correspond to the short read data with ICORN[6]; this can be useful to get an actual DNA sequence for a sample, or to construct a new reference sequence based on a closely related species.

Other useful SAM/BAM-related software includes:

SAM/BAM tools examples[edit]

  • Convert SAM to BAM format:
    • samtools view -bS aln.sam > aln.bam
  • Sort BAM file according to position on reference sequence
    • samtools sort aln.bam aln_sorted.bam
  • Index BAM file (needed for visualising the alignment)
    • samtools index aln_sorted.bam aln_sorted.bai
  • Extract the header information from a BAM file
    • samtools view –h aln_sorted.bam > aln.sam
  • Generate a FASTQ file from a BAM file
    • bam2fastq -o aligned.fastq --no-unaligned aln.bam

Software packages[edit]

Software Type Supported
technologies
Interface Notes
Partek Commercial All GUI
  • Free trial
  • Easy to use, no command line
  • Vast choice of publicly available aligners recommended by Illumina & Life Technologies, Ion Torrent
  • Guidance on alignment choice
BWA[9] Free software Illumina
SOLiD
454
Command line
  • The SAM/BAM output adhere to SAM format, contains mapped and unmapped data, easy to parse
  • Not fully threaded. sampe and samse can only utilize 1 CPU. bwasw (454 longer reads) can be fully threaded, though
  • Not as sensitive as Stampy and Novoalign
  • May be outperformed by BWA-MEM for 70-100bp Illumina reads.
Bowtie[10] Free software Illumina
SOLiD
Command line
  • Discussed in the SeqAnswers forum
  • Fast
  • No mapping quality reported
  • Not as sensitive as Stampy and Novoalign
Stampy[11] Free software Illumina Command line
  • Balance of speed and sensitivity
  • Can be slow even using BWA as premapper
SHRiMP2 Free software Illumina Command line
  • Higher sensitivity than BWA
  • One step mapping, Indexing of genome is not needed
  • Alignment can take less time than BWA is the reference sequence is short, e.g. mapping of reads against a targeted region
  • Alignment speed is slow IF mapping is done onto a large genome
TMAP Free software IonTorrent Command line
  • Uses a selection of algorithms to balance speed and sensitivity
SNP-o-matic[12] Free software Illumina Command line
  • Very fast, especially on genomes <100mbp
  • No/limited de novo variation discovery
  • Also works as a genotyper
CLC workstation Commercial All GUI
  • Easy to use
  • Expensive
  • Alignment is spurious based on our dataset
  • Alignment speed is NOT impressive at all compared to BWA or Bowtie (i7 860 + 16GB memory, windows 2008 R2-64bit)
Novoalign Commerical for multi-threaded version. Single threaded version is free Illumina Command Line Fast and accurate. Probably the best aligner as of 2013.
GSMapper Commerical 454 GUI /
SSAHA2 Free software 454 Command line Fast and accurate for all reads it can map
BLAT Free software 454 Command line Not designed for NGS data.
Mosaik Free software All Command line Tedious steps. Alignment speed can be slow. Huge memory requirement.
BWA-SW[13] Free software 454, IonTorrent Command line
  • For long sequences ranged from 70bp to 1Mbp.
  • Authors recommend to use BWA-MEM (which is the latest) instead of BWA-SW.
BWA-MEM[14] Free software 454, IonTorrent Command line
  • For long sequences ranged from 70bp to 1Mbp.
  • Newer version of BWA-SW, so recommended to use instead of BWA-SW.
  • May outperform by BWA for 70-100bp Illumina reads.
  • May outperform Novoalign for variants call [15]
Bfast[16] Free software SOLiD Command Line Speed of alignment may be too slow for large NGS data [17]
Tophat[18] Free software Illumina Command Line Transcriptome data only
Splicemap Free software Illumina Command Line Transcriptome data only
MapSplice Free software Illumina Command Line Transcriptome data only
AbMapper Free software Illumina Command Line Transcriptome data only
ERNE-map (rNA)[19] Free software Illumina Command line
  • Sensitive and efficient
  • Can be paired with an independent trimming module (ERNE-filter[20]) and a bisulfite-treated-specific read aligner program (ERNE-bs5[21])
  • Slow when dealing with gapped alignments
mrsFAST-Ultra[22] Free software Illumina Command line / GUI
  • Full sensitivity
  • Fast and efficient
  • Multi-threaded

An exhaustive list of NGS aligner is maintained at HTS mapper

Based on personal experience and prevalence and based on literature data on the performance[23][24][25], a quick primer on when to use which software:

  • If only speed matters use bowtie.
  • BWA is a bit slower but more sensitive.
  • If sensitivity and specificity is needed, try Stampy, Novoalign, or SHRiMP 2.

Further Reading Material and References[edit]

  1. Brief review of alignment aglorithm - Alignment section
  2. Cock et al (2009) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, doi:10.1093/nar/gkp1137
  3. SAM format specifications
  4. samtools, from the WTSI
  5. http://genome.cshlp.org/content/19/11/2125.long
  6. ICORN
  7. Picard by Broad Institute
  8. Bio-SamTools
  9. Li and Durbin, 2009
  10. Langmead et al., 2009
  11. Lunter and Goodson, 2010
  12. http://bioinformatics.oxfordjournals.org/content/25/18/2434
  13. Li and Durbin, 2010
  14. Li, submitted in 2013
  15. http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/
  16. http://en.wikipedia.org/wiki/BFAST
  17. http://massgenomics.org/short-read-aligners
  18. http://tophat.cbcb.umd.edu/
  19. Vezzi, Del Fabbro, Tomescu and Policriti, 2012
  20. ERNE website - filter
  21. ERNE website - bs5
  22. Hach et al., 2014
  23. Evaluation of next-generation sequencing software in mapping and assembly
  24. A list of old short reads aligners
  25. ROC curves, including bowtie2