Next Generation Sequencing (NGS)/Big Data

From Wikibooks, open books for an open world

Big Data

Data Deluge

The first problem you will probably face is the sheer size of NGS FASTQ files, the "data deluge" problem. You no longer deal only with microplate readings or digitized gel photos; NGS data can be huge. For example, the compressed FASTQ files from a single 60x human whole genome sequencing run can still require 200 GB. A small project with 10-20 whole genome sequencing (WGS) samples can generate ~4 TB of raw data. These estimates do not even include the disk space required for downstream analysis.
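The arithmetic behind these figures can be sketched in a few lines of Python. This is only a rough planning aid: the 200 GB per-sample default comes from the 60x example above, while the function name `raw_storage_tb` and the `downstream_factor` parameter are hypothetical and introduced here for illustration. The factor is a placeholder for whatever multiplier your own pipeline needs, not a measured value.

```python
def raw_storage_tb(n_samples, gb_per_sample=200.0, downstream_factor=1.0):
    """Estimate storage in TB (decimal units, 1 TB = 1000 GB).

    gb_per_sample: compressed FASTQ size per sample; 200 GB is the
        60x human WGS figure quoted in the text.
    downstream_factor: hypothetical multiplier for intermediate and
        result files; 1.0 means raw data only.
    """
    if n_samples < 0 or gb_per_sample < 0 or downstream_factor < 0:
        raise ValueError("inputs must be non-negative")
    return n_samples * gb_per_sample * downstream_factor / 1000.0


# Raw data only, for the 10-20 sample project mentioned above.
for n in (10, 20):
    print(f"{n} samples: ~{raw_storage_tb(n):.1f} TB of raw data")
```

With the defaults, 20 samples come out at about 4 TB, which matches the estimate in the text.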

Storing data

A BioStar post [1] summarizes the common storage options by budget:

  • Very high end: enterprise cluster and SAN.
  • High end: two mirrored servers in separate buildings, or the cloud.
  • Typical: external hard drives and/or a NAS with RAID 5/6.
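For the "typical" NAS option, it helps to know how much space is left after parity. The sketch below uses the standard rule that RAID 5 gives up one disk's worth of capacity to parity and RAID 6 gives up two. The function name and the 8 x 4 TB example array are hypothetical, and filesystem overhead and hot spares are ignored.

```python
def usable_capacity_tb(n_disks, disk_tb, raid_level):
    """Usable capacity of a RAID 5 or RAID 6 array of equal-size disks."""
    parity_disks = {5: 1, 6: 2}
    if raid_level not in parity_disks:
        raise ValueError("only RAID 5 and RAID 6 are handled here")
    # RAID 5 needs at least 3 disks, RAID 6 at least 4.
    min_disks = parity_disks[raid_level] + 2
    if n_disks < min_disks:
        raise ValueError(f"RAID {raid_level} needs at least {min_disks} disks")
    return (n_disks - parity_disks[raid_level]) * disk_tb


# Hypothetical array: 8 disks of 4 TB each.
for level in (5, 6):
    print(f"RAID {level}: {usable_capacity_tb(8, 4, level)} TB usable")
```

RAID 6 costs one more disk of capacity than RAID 5 but survives two simultaneous disk failures instead of one.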

Moving data

Moving data between collaborators is also non-trivial. For RNA-Seq samples, FTP may suffice, but for WGS data, shipping hard drives may be the only practical option.
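A back-of-the-envelope transfer time estimate shows why. The sketch below converts a data size and a link speed into hours. The function name, the `efficiency` parameter, and the 100 Mbit/s and 1 Gbit/s link speeds are assumptions chosen for illustration; real throughput depends on your network and is usually lower than the nominal rate.

```python
def transfer_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to move size_gb over a link of the given nominal speed.

    efficiency: fraction of the nominal speed actually achieved
        (hypothetical knob; 1.0 is the best case).
    """
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    seconds = size_megabits / (link_mbit_per_s * efficiency)
    return seconds / 3600


# One 200 GB sample versus a 4 TB project, on two assumed link speeds.
for size_gb in (200, 4000):
    for speed in (100, 1000):
        hours = transfer_hours(size_gb, speed)
        print(f"{size_gb} GB at {speed} Mbit/s: {hours:.1f} h")
```

Even in the best case, a 4 TB project ties up a 100 Mbit/s link for several days, which is why a courier with a hard drive can be faster.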

Externalizing compute requirements from the research group

It is difficult for a single lab to maintain sufficient computing facilities. A lab will probably own some basic computing hardware, but many tasks have computational demands (e.g. memory for de novo genome assembly) large enough that they must be run elsewhere. An institution or core facility may host a centralized cluster. Alternatively, one might consider running the task in the cloud.

  • NIH maintains a centralized computing cluster, Biowulf.
  • Cloud computing has been suggested for bioinformatics [2] [3]. EBI has adopted a cloud-based platform, Helix Nebula [4].