Next Generation Sequencing (NGS)/Alignment

From Wikibooks, open books for an open world
Jump to navigation Jump to search
Next Generation Sequencing (NGS)
Pre-processing Alignment DNA Variants

Introduction[edit | edit source]

Alignment, also called mapping,[1] of reads is an essential step in re-sequencing. Having sequenced an organism of a species before, and having constructed a reference sequence, re-sequencing more organisms of the same species allows us to see the genetic differences to the reference sequence, and, by extension, to each other. Alignments of data from these re-sequenced organisms is a relatively simple method of detecting variation in samples. There are certain instances (such as new genes in the sequenced sample that are not found in the existing reference sequence) that can not be detected by alignment alone; however, while other approaches, such as de novo assembly, are potentially more powerful, they are also much harder or, for some organisms, impossible to achieve with current sequencing methods.

Next-generation sequencing generally produces short reads or short read pairs, meaning short sequences of <~200 bases (as compared to long reads by Sanger sequencing, which cover ~1000 bases). To compare the DNA of the sequenced sample to its reference sequence, we need to find the corresponding part of that sequence for each read in our sequencing data. This is called aligning or mapping the reads against the reference sequence. Once this is done, we can look for variation (e.g. SNPs) within the sample. This poses a number of problems:

  • The short reads do not come with position information, that is, we do not know what part of the genome they came from; we need to use the sequence of the read itself to find the corresponding region in the reference sequence.
  • The reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region.
  • Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.
  • If we were only looking for perfect matches to the reference, we would never see any variation. Therefore, we need to allow some mismatches and small structural variation (InDels) in our reads.
  • Any sequencing technology produces errors. Similar to the "real" variation, we need to tolerate a low level of sequencing errors in our reads, and separate them from the "real" variation later.
  • We need to do that for each of the millions of reads in our sequencing data.

Short reads[edit | edit source]

Raw short reads often come in (or can be converted into) a file format called FASTQ.[2] It is a plain text format, containing the sequence and quality scores for every read, where each single read normally occupies four consecutive lines:

@Read_id_1
CTGATGTGCCGCCTCACTTCGGTGGT
+
@@@DDDDDH8<BAHG@BHGIHIII>(
@Read_id_2
TGATGTGCCGCCTCACTACGGTGGTG
+
FHHHHHJIJIJIJIIIJJIIJGIGII
@Read_id_3
...

The four lines are:

  1. The name/ID of the read, preceded by a "@". For read pairs, there will be two entries with that name, either in the same or a second FASTQ file.
  2. The sequence of the read.
  3. A "+" sign. In very old FASTQ files, this is followed by the read name from the first line. Today, this line is present for historical reasons backwards compatibility only.
  4. The quality scores of the bases from line 2. The scores are generated by the sequencing machine, and encoded as ASCII (33+score) characters. The line should have the same length as line 2, as there is one quality score per base.

Alignment[edit | edit source]

For each of the short reads in the FASTQ file, a corresponding location in the reference sequence (or that no such region exists) needs to be determined. This is achieved by comparing the sequence of the read to that of the reference sequence. A mapping algorithm will try to locate a (hopefully unique) location in the reference sequence that matches the read, while tolerating a certain amount of mismatch to allow subsequence variation detection. Reads aligned (mapped) to a reference sequence will look like this:

GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG  Reference sequence
 CTGATGTGCCGCCTCACTTCGGTGGT        Short read 1
  TGATGTGCCGCCTCACTACGGTGGTG       Short read 2
   GATGTGCCGCCTCACTTCGGTGGTGA      Short read 3
GCTGATGTGCCGCCTCACTACGGTG          Short read 4
GCTGATGTGCCGCCTCACTACGGTG          Short read 5

You can see the reference sequence on the top row, and five short reads stacked below; this is called a pileup. While two of the reads are a perfect match to the reference, the three other reads show a mismatch each, highlighted in red ("A" in the read, instead of "T" in the reference). Since there are multiple reads showing the mismatch, at the same position, with the same difference, one could conclude that it is an actual genetic difference (point mutation or SNP), rather than a sequencing error or mismapping.

Mapping algorithms[edit | edit source]

There are several alignment algorithms in existence; you can find an (incomplete) list further down in software packages. Some notes on mapping algorithms:

  • The reference sequence, the short reads, or both, are often pre-processed into an indexed form for rapid searching. (See BWT)

Sources of errors[edit | edit source]

There are several potential sources for errors in an alignment, including (but not limited to):

  • Polymerase Chain Reaction artifacts (PCR artifacts). Many NGS methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
  • Sequencing errors. The sequencing machine can make an erroneous call, either for physical reasons (e.g. oil on an Illumina slide), or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
  • Mapping errors. The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.

Alignment types[edit | edit source]

Alignments can be used for different purposes:

  • Whole-genome sequencing. This would be the "default" use; sequence all DNA from an organism and map it to the appropriate reference sequence, to find genetic variation.
  • Exome sequencing. For large genomes (e.g., human), capture just the exomic DNA before sequencing. This will return sequencing data for most of the genes, at a fraction of the cost.
  • Transcriptome sequencing (RNA-Seq). Sequencing of the transcriptome, that is, of the RNA present in the sample. This can show which genes are transcribed in the sample, and help fine-tune gene annotation (exon boundaries etc.). Mapping can be done either to the full reference sequence, or to a special "transcriptome reference".
  • ChIP-Seq (Protein-DNA interaction).

The SAM/BAM format[edit | edit source]

The SAM/BAM format has emerged as the de facto standard format for short read alignments. SAM[3] is the plain-text version of the binary, compressed BAM format. They can be converted into one another by the name-giving samtools[4] command-line tool. BAM (without alignment position data) is increasingly used as a space-saving alternative to FASTQ files for containing the short raw read data, and all current alignment software can generate SAM/BAM as an output format. Once in BAM format, the file can be indexed, giving quick access to any region of the reference sequence. Subsequently, using samtools or other software, BAM files can be analysed (e.g. for quality control), modified (removal of PCR duplicates, local realignment, base quality recomputation), or used to call variation, either small (SNPs, short InDels) or large (inversions, tandem duplications, deletions, translocations). BAM files can be visualised using tools like Artemis, ACT, or LookSeq[5]. Last but not least, alignments in BAM format can be used to "morph" the reference sequence to correspond to the short read data with ICORN[6]; this can be useful to get an actual DNA sequence for a sample, or to construct a new reference sequence based on a closely related species.

Other useful SAM/BAM-related software includes:

SAM/BAM tools examples[edit | edit source]

  • Convert SAM to BAM format:
    • samtools view -bS aln.sam > aln.bam
  • Sort BAM file according to position on reference sequence
    • samtools sort aln.bam aln_sorted.bam
  • Index BAM file (needed for visualising the alignment)
    • samtools index aln_sorted.bam aln_sorted.bai
  • Extract the header information from a BAM file
    • samtools view –h aln_sorted.bam > aln.sam
  • Generate a FASTQ file from a BAM file
    • bam2fastq -o aligned.fastq --no-unaligned aln.bam

Software packages[edit | edit source]

Software Type Supported
technologies
Interface Notes
Partek Commercial All GUI
  • Free trial
  • Easy to use, no command line
  • Vast choice of publicly available aligners recommended by Illumina & Life Technologies, Ion Torrent
  • Guidance on alignment choice
BWA[9] Free software Illumina
SOLiD
454
Command line
  • The SAM/BAM output adhere to SAM format, contains mapped and unmapped data, easy to parse
  • Not fully threaded. sampe and samse can only utilize 1 CPU. bwasw (454 longer reads) can be fully threaded, though
  • Not as sensitive as Stampy and Novoalign
  • May be outperformed by BWA-MEM for 70-100bp Illumina reads.
Bowtie[10] Free software Illumina
SOLiD
Command line
  • Discussed in the SeqAnswers forum
  • Fast
  • No mapping quality reported
  • Not as sensitive as Stampy and Novoalign
Stampy[11] Free software Illumina Command line
  • Balance of speed and sensitivity
  • Can be slow even using BWA as premapper
SHRiMP2 Free software Illumina Command line
  • Higher sensitivity than BWA
  • One step mapping, Indexing of genome is not needed
  • Alignment can take less time than BWA is the reference sequence is short, e.g. mapping of reads against a targeted region
  • Alignment speed is slow IF mapping is done onto a large genome
TMAP Free software IonTorrent Command line
  • Uses a selection of algorithms to balance speed and sensitivity
SNP-o-matic[12] Free software Illumina Command line
  • Very fast, especially on genomes <100mbp
  • No/limited de novo variation discovery
  • Also works as a genotyper
CLC workstation Commercial All GUI
  • Easy to use
  • Expensive
  • Alignment is spurious based on our dataset
  • Alignment speed is NOT impressive at all compared to BWA or Bowtie (i7 860 + 16GB memory, windows 2008 R2-64bit)
NextGenMap[13] Open source Illumina, Ion Torrent Command Line Fast and accurate. Self adjusting to the underlying data. Robust for high polymorphism
  • Easy to use
  • Fast and accurate
  • Robust to SNPs
  • Self adapts to the data set.
Novoalign Commercial for multi-threaded version. Single threaded version is free Illumina Command Line Fast and accurate. Probably the best aligner as of 2013.
GSMapper Commercial 454 GUI /
SSAHA2 Free software 454 Command line Fast and accurate for all reads it can map
BLAT Free software 454 Command line Not designed for NGS data.
Mosaik Free software All Command line Tedious steps. Alignment speed can be slow. Huge memory requirement.
BWA-SW[14] Free software 454, IonTorrent Command line
  • For long sequences ranged from 70bp to 1Mbp.
  • Authors recommend to use BWA-MEM (which is the latest) instead of BWA-SW.
BWA-MEM[15] Free software 454, IonTorrent Command line
  • For long sequences ranged from 70bp to 1Mbp.
  • Newer version of BWA-SW, so recommended to use instead of BWA-SW.
  • May outperform BWA for 70-100bp Illumina reads.
  • May outperform Novoalign for variants call [16]
Bfast[17] Free software SOLiD Command Line Speed of alignment may be too slow for large NGS data [18]
Tophat[19] Free software Illumina Command Line Transcriptome data only
Splicemap Free software Illumina Command Line Transcriptome data only
MapSplice Free software Illumina Command Line Transcriptome data only
AbMapper Free software Illumina Command Line Transcriptome data only
ERNE-map (rNA)[20] Free software Illumina Command line
  • Sensitive and efficient
  • Can be paired with an independent trimming module (ERNE-filter) and a bisulfite-treated-specific read aligner program (ERNE-bs5)[21]
  • Slow when dealing with gapped alignments
mrsFAST-Ultra[22] Free software Illumina Command line / GUI
  • Full sensitivity
  • Fast and efficient
  • Multi-threaded

An exhaustive list of NGS aligner is maintained at HTS mapper

Based on personal experience and prevalence and based on literature data on the performance[23][24][25], a quick primer on when to use which software:

  • If only speed matters use bowtie.
  • BWA is a bit slower but more sensitive.
  • If sensitivity and specificity is needed, try NextGenMap, Stampy, Novoalign, or SHRiMP 2.

Furthermore a framework (Teaser) was proposed that automatically benchmarks several mappers within minutes.[26]

References[edit | edit source]

  1. Flicek, P.; Birney, E. (2009). "Sense from sequence reads: Methods for alignment and assembly". Nature Methods. 6 (11 Suppl): S6–S12. doi:10.1038/nmeth.1376. PMID 19844229.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  2. Cock, P.J.; Fields, C.J.; Goto, N.; Heuer, M.L.; Rice, P.M. (2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC 2847217. PMID 20015970.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  3. The SAM/BAM Format Specification Working Group (18 November 2015). "Sequence Alignment/Map Format Specification" (PDF). GitHub. p. 16. Retrieved 28 April 2016.
  4. "Samtools organisation and repositories". GitHub. Retrieved 28 April 2016.
  5. Manske, H.M.; Kwiatkowski, D.P. (2009). "LookSeq: A browser-based viewer for deep sequencing data". Genome Research. 19 (11): 2125–2132. doi:10.1101/gr.093443.109. PMC 2775587. PMID 19679872.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  6. "ICORN: Iterative Correction of Reference Nucleotides". SourceForge. Retrieved 28 April 2016.
  7. Broad Institute. "Picard". GitHub. Retrieved 28 April 2016.
  8. Stein, L.D. (12 February 2016). "Bio-SamTools". CPAN. Perl.org. Retrieved 28 April 2016.
  9. Li, H.; Durbin, R. (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  10. Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S.L. (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  11. Lunter, G.; Goodson, M. (2011). "Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads". Genome Research. 21 (6): 936–939. doi:10.1101/gr.111120.110. PMC 3106326. PMID 20980556.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  12. Manske, H.M.; Kwiatkowski, D.P. (2009). "SNP-o-matic". Bioinformatics. 25 (18): 2434–2435. doi:10.1093/bioinformatics/btp403. PMC 2735664. PMID 19574284.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  13. Cibiv. "NextGenMap". GitHub. Retrieved 28 April 2016.
  14. Li, H.; Durbin, R. (2010). "Fast and accurate long-read alignment with Burrows–Wheeler transform". Bioinformatics. 26 (5): 589–595. doi:10.1093/bioinformatics/btp698. PMC 2828108. PMID 20080505.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  15. Li, H. (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". Preprint: 1–3.
  16. Chapman, B. (6 May 2013). "Framework for evaluating variant detection methods: Comparison of aligners and callers". Blue Collar Bioinformatics. Retrieved 28 April 2016.
  17. Homer, N. "Blat-like Fast Accurate Search Tool". SourceForge. Retrieved 28 April 2016.
  18. Koboldt, D. (3 July 2011). "Aligners". MassGenomics. Retrieved 28 April 2016.
  19. Kim, D.; Salzberg, S. (23 Februart 2016). "TopHat". Johns Hopkins University Center for Computational Biology. Retrieved 28 April 2016. {{cite web}}: Check date values in: |date= (help)CS1 maint: multiple names: authors list (link)
  20. Vezzi, F.; Del Fabbro, C.; Tomescu, A.I.; Policriti, A. (2011). "rNA: a fast and accurate short reads numerical aligner". Bioinformatics. 28 (1): 123–124. doi:10.1093/bioinformatics/btr617. PMID 22084252.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  21. "Welcome to ERNE 2!". SourceForge. 29 September 2014. Retrieved 28 April 2016.
  22. Hach, F.; Sarrafi, I.; Hormozdiari, F.; Alkan, C.; Eichler, E.E.; Sahinalp, S.C. (2014). "mrsFAST-Ultra: A compact, SNP-aware mapper for high performance sequencing applications". Nucleic Acids Research. 42 (W1): W494–W500. doi:10.1093/nar/gku370. PMC 4086126. PMID 24810850.{{cite journal}}: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link)
  23. Bao, S.; Jiang, R.; Kwan, W.; Wang, B.; Ma, X.; Song, Y.Q. (2011). "Evaluation of next-generation sequencing software in mapping and assembly". Journal of Human Genetics. 56 (6): 406–14. doi:10.1038/jhg.2011.43. PMID 21525877.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  24. Li, H. (19 November 2009). "NGS Alignment Programs". SourceForge. Retrieved 28 April 2016.
  25. Li, H. (19 November 2009). "NGS mapper ROC curves". SourceForge. Retrieved 28 April 2016.
  26. CIBIV (9 January 2016). "Teaser: A tool for fast personalized benchmarks and optimization of NGS read mapping". University of Vienna. Retrieved 28 April 2016.