Next Generation Sequencing (NGS)

From Wikibooks, open books for an open world

The need for an up-to-date synthesis of Next Generation Sequencing know-how

The high demand for low-cost sequencing has driven the development of high-throughput sequencing, also termed next-generation sequencing (NGS). In an NGS run, thousands to millions of sequences are produced concurrently. Next-generation sequencing has become a commodity: with the commercialization of various affordable desktop sequencers, NGS is coming within reach of more traditional wet-lab biologists. As seen in recent years, genome-wide computational analysis is increasingly being used as a backbone to foster novel discovery in biomedical research. However, as the quantity of sequence data increases exponentially, the analysis bottleneck has yet to be solved.

The current sources of NGS informatics know-how are extremely fragmented. A novice could read review articles in various journals, follow discussion threads on forums such as Biostar[1] or SEQanswers[2], or sign up for courses organized by various institutes. Finding a centralized synthesis is much more difficult. Books are available, but the field develops so fast that book chapters risk being obsolete by the time they are printed. Moreover, continually updating a text would presumably demand a large share of the time of a handful of authors.

Drawing on the obvious goodwill and community spirit displayed on discussion forums, and exploiting the collaborative tools made available by the Wikimedia Foundation, we propose to initiate the editing of a collaborative WikiBook on NGS. Our plan is to collect enough text that people will be motivated to contribute to it, essentially providing the same information as a forum but in a tidier form. Ultimately, our goal is to create a collective lab book that explains the key concepts and describes best practices in NGS.

Target audience

This set of dynamic materials is designed for bench biologists: advanced PhD students and early-career postdoctoral researchers who have little or no bioinformatics experience and an interest in NGS data analysis. More advanced materials may be added as the community contributes and as the needs and trends in the field develop. The flexibility of online material should allow readers to skip details on a first read, yet have immediate access to the details they need. However, the overall structure and style should be designed primarily for the non-bioinformatician reader.

Some chapters come with practical exercises so that readers can familiarize themselves with the steps.

Stuck at data analysis?

Seek help from online communities such as Biostar and SEQanswers. Please make sure you follow the guidelines set out by Dall’Olio et al.[3]

Table of contents

  1. Introduction 50% developed
  2. Big Data 0% developed
  3. Bioinformatics from the outside 100% developed
  4. Pre-processing 50% developed
  5. Alignment 50% developed
  6. DNA Variants 50% developed
  7. RNA 50% developed
  8. Epigenetics 25% developed
  9. Metagenomics 50% developed
  10. Chromatin structure 0% developed
  11. De novo genome assembly 75% developed
  12. De novo RNA assembly 50% developed
  13. Genome Wide Association Studies 25% developed
  14. Integrative platforms 25% developed
  15. Authors
Wikibook development stages:
Sparse text: 0%; Developing text: 25%; Maturing text: 50%; Developed text: 75%; Comprehensive text: 100%

About this book

  • The first four chapters are general introductions to broad concepts of bioinformatics, and of NGS in particular. They are prerequisites and will be referred to throughout the rest of the book:
    • In the Introduction, we give a nearly complete overview of the field, starting with sequencing technologies, their properties, strengths and weaknesses, then covering the various biological processes they can assay, followed by a section on common sequencing terminology. We finish with an overview of a typical sequencing workflow.
    • In Big Data we deal with some of the (perhaps unexpected) difficulties that arise when dealing with typical volumes of NGS data. From shipping hard drives around the world, to the amount of memory you'll need in your computer to assemble the data when they arrive, these issues often take novices by surprise. We'll get into the file formats, archives, and algorithms that have been developed to deal with these problems.
    • In Bioinformatics from the outside we will discuss the interfaces used by bioinformaticians. We will present the command line with its text interface and blinking cursor, but also more user-friendly graphical user interfaces (GUIs) which were developed specifically for bioinformatics pipelines.
    • In Pre-processing we will discuss best practices for controlling the quality of an NGS dataset and cleaning out low-quality data.
  • The next five chapters describe the analyses which can be done using a reference genome sequence, assuming one is available:
    • In Alignment we will discuss how to map a set of reads to a reference dataset.
    • In DNA Variants we will describe how to call variants (SNVs, CNVs or breakends) using mapped reads.
    • In RNA we will explain how to determine exons, isoforms and gene expression levels from mapped RNA-seq reads.
    • In Epigenetics we will describe pull-down assays, which are used to determine epigenetic traits such as histone or CpG methylation.
    • In Chromatin structure we will discuss technologies used to determine the structure of the chromatin, e.g. the placement of the histones or the physical proximity of different chromosomal regions when the DNA lies in the nucleus.
  • Finally, the last two chapters will describe analyses in the absence of a reference genome:
    • De novo genome assembly will describe how to assemble a genome from NGS reads.
    • De novo RNA assembly will explain how to assemble a transcriptome from NGS reads only.


  1. In Pre-processing: FASTQ, QC, trimming, error correction, etc.
  2. In Alignment: formats, algorithms, assessment.
  3. In DNA Variants: protocols, formats, databases, visualization.
  4. In RNA: transcriptomics workflow, tools, gene prediction, formats, databases.
  5. In Epigenetics: ... bisulphite sequencing.
  6. In Chromatin structure: ... ChIP-seq?
  7. In De novo assembly: algorithms, workflows, tools, databases.
  8. In De novo RNA assembly: similarities, differences and challenges relative to DNA assembly.


  1. Parnell, Laurence D.; Lindenbaum, Pierre; Shameer, Khader; Dall'Olio, Giovanni Marco; Swan, Daniel C.; Jensen, Lars Juhl; Cockell, Simon J.; Pedersen, Brent S.; Mangan, Mary E.; Miller, Christopher A.; Albert, Istvan; Bourne, Philip E. (27 October 2011). "BioStar: An Online Question & Answer Resource for the Bioinformatics Community". PLoS Computational Biology 7 (10): e1002216. doi:10.1371/journal.pcbi.1002216. 
  2. Li, J.-W.; Schmieder, R.; Ward, R. M.; Delenick, J.; Olivares, E. C.; Mittelman, D. (13 March 2012). "SEQanswers: an open access community for collaboratively decoding genomes". Bioinformatics 28 (9): 1272–1273. doi:10.1093/bioinformatics/bts128. 
  3. Dall'Olio, Giovanni M.; Marino, Jacopo; Schubert, Michael; Keys, Kevin L.; Stefan, Melanie I.; Gillespie, Colin S.; Poulain, Pierre; Shameer, Khader; Sugar, Robert; Invergo, Brandon M.; Jensen, Lars J.; Bertranpetit, Jaume; Laayouni, Hafid; Bourne, Philip E. (28 September 2011). "Ten Simple Rules for Getting Help from Online Scientific Communities". PLoS Computational Biology 7 (9): e1002202. doi:10.1371/journal.pcbi.1002202.