Next Generation Sequencing (NGS)/Introduction

From Wikibooks, open books for an open world


Platforms and Technologies

NGS platforms employ different technologies, but they share one purpose: to determine the identity of nucleotides or the modifications on them.

NGS platforms evolve quickly. New technologies and platforms are usually announced at the Advances in Genome Biology & Technology (AGBT) conference [1].

For educational purposes, see the reviews of NGS platforms published in 2011 [2]. Read more about the sequencing technologies here.

File format and terminology


The FASTA format, generally indicated by the suffix .fa or .fasta, is a straightforward, human-readable format. Normally, each file consists of a set of sequences, where each sequence is represented by a one-line header starting with the '>' character, followed by the corresponding nucleotide sequence wrapped across multiple lines of fixed width (generally 60 or 80 characters). In practice, some tools produce a header followed by a single long line of sequence. For more detailed information see the FASTA Wikipedia page.
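The layout described above can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_fasta` and the example records are made up for illustration, and it assumes the input is plain (uncompressed) text with unique headers.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped at 60 or 80 characters,
            # or written as one long line; joining handles both cases.
            records[header] += line
    return records


# Hypothetical records, used only to exercise the function.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2
TTGACA
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

Because wrapped lines are simply concatenated, the same function handles both multi-line records and the single-long-line variant that some tools emit.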


FASTQ is a human-readable text format that stores each sequence as a four-line entry:

  1. Sequence identifier
  2. The sequence
  3. A separator line starting with '+', optionally followed by the identifier again
  4. Quality scores

FASTQ format is commonly used to store sequencing reads, in particular from Illumina and Ion Torrent platforms.

Paired-end reads may be stored either in one FASTQ file (with the two reads of each pair alternating) or in two separate FASTQ files. The sequence identifiers of paired-end reads may end in "/1" and "/2" respectively.
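As an illustration of the single-file (alternating) layout, the sketch below splits a list of already-parsed records into forward and reverse reads. The names `split_interleaved` and `mate_number`, the tuple layout, and the example reads are assumptions made for this example, and the "/1" and "/2" check is applied only when both suffixes are present.

```python
from typing import Iterable, List, Tuple

# (identifier, sequence, quality); a layout assumed for this sketch
Record = Tuple[str, str, str]


def mate_number(identifier: str) -> int:
    """Return 1 or 2 for identifiers ending in /1 or /2, else 0."""
    if identifier.endswith("/1"):
        return 1
    if identifier.endswith("/2"):
        return 2
    return 0


def split_interleaved(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split alternating paired-end records into forward and reverse lists."""
    forward: List[Record] = []
    reverse: List[Record] = []
    iterator = iter(records)
    for first in iterator:
        second = next(iterator, None)
        if second is None:
            raise ValueError("odd number of records: last read has no mate")
        # When both suffixes are present, the names before them must match.
        if mate_number(first[0]) == 1 and mate_number(second[0]) == 2:
            if first[0][:-2] != second[0][:-2]:
                raise ValueError(f"mismatched pair: {first[0]} / {second[0]}")
        forward.append(first)
        reverse.append(second)
    return forward, reverse


# Hypothetical interleaved reads.
reads = [
    ("frag1/1", "ACGT", "IIII"),
    ("frag1/2", "TTGA", "IIII"),
    ("frag2/1", "GGCA", "IIII"),
    ("frag2/2", "CATG", "IIII"),
]
forward_reads, reverse_reads = split_interleaved(reads)
print(len(forward_reads), len(reverse_reads))
```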

Example FASTQ entry for one Illumina read:


FASTQ data is generally stored in files with the suffix .fq or .fastq, often compressed with gzip, which is indicated by an additional suffix .gz or .gzip.
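The sketch below reads four-line FASTQ entries and opens gzip-compressed files transparently based on the suffix. The helper names `open_fastq` and `read_fastq` and the example read are hypothetical; the reader assumes strictly four lines per entry (no wrapped sequence lines) and stops at the end of the file or at the first blank line.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file in text mode, using gzip for .gz/.gzip suffixes."""
    if path.endswith((".gz", ".gzip")):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line entry."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ entry near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Hypothetical entry held in memory; a real file would be opened
# with open_fastq("reads.fastq.gz") instead.
demo = io.StringIO("@read1/1\nGATTACA\n+\nIIIIHHG\n")
for identifier, sequence, quality in read_fastq(demo):
    print(identifier, sequence, quality)
```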

For more detailed information see the FASTQ Wikipedia page.


SFF is a binary file format used to encode sequencing reads from the 454 platform.


File formats such as SAM and BAM are used to encode short-read alignments. See Next_Generation_Sequencing_(NGS)/Alignment for more information.


FASTG is an emerging file format for genome assemblies that takes ambiguities into account. FASTG is like FASTA, but the G stands for ‘graph’.


The Variant Call Format (VCF) is a specification used in bioinformatics for storing gene sequence variations. See [1] for more information.
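To give a feel for the format, the following sketch splits one tab-separated VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The `Variant` type, the function name `parse_vcf_line`, and the example line are made up for illustration; header lines starting with '#' and per-sample genotype columns are deliberately ignored.

```python
from typing import Dict, List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int                # 1-based position
    variant_id: str
    ref: str
    alt: List[str]          # one or more alternate alleles
    qual: Optional[float]   # None when the value is missing (".")
    filter_status: str
    info: Dict[str, str]    # flag-only INFO keys map to ""


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = fields[:8]
    info_dict: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value
    return Variant(
        chrom=chrom,
        pos=int(pos),
        variant_id=variant_id,
        ref=ref,
        alt=alt.split(","),
        qual=None if qual == "." else float(qual),
        filter_status=filter_status,
        info=info_dict,
    )


# Hypothetical data line, not taken from a real call set.
variant = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=20;DB")
print(variant.chrom, variant.pos, variant.ref, variant.alt)
```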

Read lengths

As of February 2013, the read lengths of second-generation sequencing platforms are shorter than those of conventional Sanger sequencing, which creates challenges in read mapping and assembly.

  • Illumina: the most widely used platforms can produce reads of up to 250 bp. In practice, ~100 bp reads are what most researchers worldwide have access to.
  • Ion Torrent: varies, typically peaking at 400 bp
  • SOLiD: 50-75 bp


  • Single-end reads: each sequence fragment is sequenced from one direction only.
  • In paired-end sequencing, a single fragment is sequenced from both the 5' and the 3' end, giving rise to a forward and a reverse read. The two reads may be separated by a number of unsequenced bases (the inner insert size), or they may overlap, in which case they can be merged into one longer contiguous read. Using paired-end reads can improve the accuracy of read mapping onto a reference genome. The typical fragment size (external insert size) is 200 bp to 500 bp.
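The relationship between fragment size, read length and inner insert size is simple arithmetic, sketched below. The function name `inner_distance` is hypothetical, and the sketch assumes both reads have the same length and that the fragment is at least as long as one read.

```python
def inner_distance(fragment_size: int, read_length: int) -> int:
    """Unsequenced bases between the two reads of a pair.

    A negative result means the reads overlap by that many bases,
    so they could be merged into one read spanning the whole fragment.
    """
    if fragment_size < read_length:
        raise ValueError("fragment shorter than a single read")
    return fragment_size - 2 * read_length


# A 300 bp fragment sequenced with 2 x 100 bp reads leaves a gap.
print(inner_distance(300, 100))   # 100

# A 150 bp fragment sequenced with 2 x 100 bp reads overlaps by 50 bp.
print(inner_distance(150, 100))   # -50
```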


"Mate-pair" differs from "paired-end" in how the sequencing library is made. In mate-pair sequencing, 2-5 kb fragments are selected and sequenced from both ends, giving information on how nucleotides that lie far apart are linked together. Mate-pairs are better suited to studying genomic structural rearrangements and help de novo genome assembly. They also facilitate sensitive structural variant (SV) detection across a wider SV size spectrum and in repetitive areas of the genome.


Colorspace is a 2-base encoding system commercialized by Life Tech and used in SOLiD platforms. A technology overview is described here.
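In 2-base encoding, each color digit (0-3) describes a pair of adjacent bases rather than a single base. A compact way to express the commonly documented color table is an XOR of 2-bit base codes, as in the sketch below. The function names are hypothetical, and as a simplification the first base of the sequence itself serves as the known leading base, whereas real SOLiD reads start from a primer base.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colorspace(sequence: str) -> str:
    """Return the leading base followed by one color per adjacent base pair."""
    if not sequence:
        return ""
    colors = [
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b])
        for a, b in zip(sequence, sequence[1:])
    ]
    return sequence[0] + "".join(colors)


def decode_colorspace(encoded: str) -> str:
    """Rebuild the base sequence from a leading base and its colors."""
    if not encoded:
        return ""
    current = BASE_TO_BITS[encoded[0]]
    bases = [encoded[0]]
    for digit in encoded[1:]:
        # Each color tells how to step from the previous base to the next.
        current ^= int(digit)
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


encoded = encode_colorspace("ATGGC")
print(encoded)                     # A3103
print(decode_colorspace(encoded))  # ATGGC
```

Because every base is decoded relative to the previous one, a single wrong color changes all downstream bases, which is why colorspace reads are usually aligned in color space rather than converted first.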

Quality scores

A quality score indicates the probability that a base call is incorrect. Quality scores are stored alongside the sequence in the FASTQ format.

Various encoding schemes are available, the most common being Phred quality scores.
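A Phred score Q relates to the error probability P as Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1000. In FASTQ files each score is stored as a single ASCII character: an offset of 33 is used by the Sanger encoding and recent Illumina pipelines, while some older Illumina versions used 64. The sketch below decodes a quality string under that assumption; the function names and the example string are made up for illustration.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score to the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Decode the ASCII quality line of a FASTQ entry into Phred scores."""
    return [ord(character) - offset for character in quality]


# Hypothetical quality string for a 3-base read, Phred+33 encoding.
scores = decode_quality_string("I5!")
print(scores)
print([phred_to_error_probability(q) for q in scores])
```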

Error profiles & Sequencing biases

Uses of NGS




ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

Chromatin structure

General NGS Workflow Overview