Next Generation Sequencing (NGS)/Introduction

From Wikibooks, open books for an open world


  • The first four chapters are general introductions to broad concepts of bioinformatics and NGS in particular. They are 'required pre-requisites', and will be referred to in the rest of the book:
    • In the Introduction, we give a near-complete overview of the field, starting with sequencing technologies, their properties, strengths and weaknesses, then covering the various biologies that they assay and the common sequencing terminology. We finish with an overview of a typical sequencing workflow.
    • In Big Data we deal with some of the (perhaps unexpected) difficulties that arise when handling typical volumes of NGS data, from shipping hard drives around the world to the amount of memory you'll need in your computer to assemble the data when they arrive. We'll get into the file formats, archives, and algorithms that have been developed to deal with these problems.
    • In Bioinformatics from the outside we will discuss the interfaces used by bioinformaticians. We will present the command line with its text interface and blinking cursor, but also more user-friendly graphical user interfaces (GUIs) which were developed specially for bioinformatics pipelines.
    • In Pre-processing we will discuss the best practices for controlling the quality of an NGS dataset and cleaning out low-quality data.
  • The next five chapters describe the analyses which can be done using a reference genome sequence, assuming one is available:
    • In Alignment we will discuss how to map a set of reads to a reference dataset.
    • In DNA Variation we will describe how to call variants (either SNVs, CNVs or breakends) using mapped reads.
    • In RNA we will explain how to determine exons, isoforms and gene expression levels from mapped RNA-seq reads.
    • In Epigenetics we will describe pull-down assays which are used to determine epigenetic traits such as histone or CpG methylation.
    • In Chromatin structure we will discuss technologies used to determine the structure of the chromatin, e.g. the placement of the histones or the physical proximity of different chromosomal regions when the DNA lies in the nucleus.
  • Finally the last two chapters will describe analyses in the absence of a reference genome:
    • De novo assembly will describe how to assemble a genome from NGS reads.
    • De novo RNA assembly will explain how to assemble a transcriptome from NGS reads only.


Platforms and Technologies

NGS platforms employ different technologies, but they share one purpose: to determine the identity of the nucleotides or the modifications they carry.

NGS platforms evolve quickly. New technologies and platforms are usually announced at the Advances in Genome Biology & Technology (AGBT) conference [1].

For educational purposes, see the reviews of NGS platforms published in 2011 [2]. Read more about the sequencing technologies here.

File format and terminology


The FASTA format, generally indicated with the suffix .fa or .fasta, is a straightforward, human readable format. Normally, each file consists of a set of sequences, where each sequence is represented by a one line header, starting with the '>' character, followed by the corresponding nucleotide sequence, in multiple lines of regular width (generally 60 or 80 characters wide). In practice, some tools may produce a sequence with a header and a single long line of sequence. For more detailed information see the FASTA Wikipedia page.
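
The layout described above (a '>' header line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `parse_fasta` and the two example records are made up for illustration, and it handles both wrapped sequences and the single-long-line variant.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated, whatever their width.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical records, invented for illustration only.
example = """>seq1 hypothetical example
ACGTACGTAC
GTACGT
>seq2
TTGACA
"""
records = parse_fasta(example.splitlines())
print(records["seq2"])  # TTGACA
```

For real datasets an established library is usually a better choice than a hand-written parser; the sketch only makes the structure of the format concrete.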


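FASTQ is a human-readable text format that stores each sequence as a four-line entry:

  1. Sequence identifier, starting with the '@' character
  2. The sequence
  3. A separator line starting with '+', optionally followed by the sequence identifier again
  4. Quality scores, encoded as one character per base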

FASTQ format is commonly used to store sequencing reads, in particular from Illumina and Ion Torrent platforms.

Paired-end reads may be stored either in one FASTQ file (with the two mates alternating) or in two separate FASTQ files. The sequence identifiers of the two mates may end in "/1" and "/2" respectively.

Example FASTQ entry for one Illumina read:


FASTQ files generally carry the suffix .fq or .fastq and are often compressed with Gzip, which is indicated by an additional .gz or .gzip suffix.

For more detailed information see the FASTQ Wikipedia page.
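
As a rough illustration of the four-line layout and the Gzip convention, the sketch below reads FASTQ records from either a plain or a compressed file. The helper names `read_fastq` and `open_fastq` and the demo record are hypothetical, and the reader assumes the common case where sequence and quality each occupy exactly one line.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


def open_fastq(path: str) -> TextIO:
    """Open a plain or Gzip-compressed FASTQ file as text, based on its suffix."""
    if path.endswith((".gz", ".gzip")):
        return gzip.open(path, "rt")
    return open(path, "r")


# Hypothetical record, invented for illustration only.
demo = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n")
fastq_records = list(read_fastq(demo))
print(fastq_records)
```

With a file on disk, the two helpers combine as `with open_fastq("reads.fq.gz") as handle: ...`, where the file name is again only a placeholder.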


SFF is a binary file format used to encode sequencing reads from the 454 platform.


File formats used to encode short-read alignments. See Next_Generation_Sequencing_(NGS)/Alignment for more information.


FASTG is an emerging file format for genome assemblies that take ambiguities into account. FASTG is like FASTA, but the G stands for ‘graph’.


The Variant Call Format (VCF) is a specification used in bioinformatics for storing gene sequence variations. See [1] for more information.

Read lengths

As of February 2013, the read lengths of second-generation sequencing platforms are shorter than those of conventional Sanger sequencing, which creates challenges for read mapping and assembly.

  • Illumina: the most widely used platforms can produce reads of up to 250bp. In practice, ~100bp is what most researchers worldwide have access to.
  • Ion Torrent: varies, with read lengths typically peaking at 400bp
  • SOLiD: 50-75bp


  • Single-end reads means that each sequence fragment is sequenced from one direction only.
  • In paired-end sequencing, a single fragment is sequenced from both its 5' and 3' ends, giving rise to a forward and a reverse read. The two reads can be separated by a certain number of bases (the inner insert size), or they can overlap, in which case they can be merged into one longer contiguous single-end read. Using paired-end reads can improve the accuracy of read mapping onto a reference genome. The typical fragment size (the external insert size) is 200bp to 500bp.
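
The relationship between fragment size, read length, and inner insert size is simple arithmetic, sketched below. The function name `inner_distance` and the example numbers are illustrative assumptions; a negative result means the two reads overlap by that many bases.

```python
def inner_distance(fragment_len: int, read1_len: int, read2_len: int) -> int:
    """Bases between the two reads of a pair; negative if the reads overlap."""
    return fragment_len - read1_len - read2_len


# A 300bp fragment sequenced as 2 x 100bp leaves a 100bp unsequenced gap.
gap = inner_distance(300, 100, 100)
print(gap)  # 100

# A 150bp fragment sequenced as 2 x 100bp gives reads overlapping by 50bp,
# so the pair could be merged into a single 150bp read.
overlap = inner_distance(150, 100, 100)
print(overlap)  # -50
```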


"Mate-pair" differs from "paired-end" in how the sequencing library is made. In mate-pair sequencing, 2-5kb fragments are selected and sequenced from both ends, which gives information on how nucleotides that lie far apart are linked together. Mate-pairs are better suited to studying genomic structural rearrangements and help de novo genome assembly. They also facilitate sensitive structural variant (SV) detection across a widened SV size spectrum and in repetitive areas of the genome.


Colorspace is a 2-base encoding system commercialized by Life Tech and used in SOLiD platforms. A technology overview is described here.

Quality scores

A quality score indicates the probability that a base call is incorrect. Quality scores are stored alongside the sequence in the FASTQ format.

Various encoding schemes are available; the most common is the Phred quality score.
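
For the Phred scheme, the quality score Q and the error probability P are related by Q = -10 log10(P). The sketch below shows that conversion and how a FASTQ quality string can be decoded. It assumes an ASCII offset of 33 (the "Phred+33" convention); other encodings use a different offset, so the `offset` parameter is exposed. The function names and the example quality string are illustrative only.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33); adjust it for other encodings.
    """
    return [ord(char) - offset for char in qual]


# Hypothetical quality string, invented for illustration only.
scores = decode_quality("II5#")
print(scores)  # [40, 40, 20, 2]
print(phred_to_error_prob(20))  # about 0.01, i.e. a 1 in 100 chance of error
```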

Error profiles & Sequencing biases

Uses of NGS




ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

Chromatin structure

General NGS Workflow Overview