Next Generation Sequencing (NGS)/De novo RNA assembly
De novo RNA-seq assembly consists of assembling transcripts from RNA-seq reads without the support of a reference genome. This is done either because no genome assembly is available or to detect events that are inconsistent with the genome assembly (e.g. to detect fusion genes after rearrangements).
Most of the techniques of RNA-seq de novo assembly are derived from de novo genome assembly, and most of the issues of DNA assembly apply to RNA assembly. However, RNA-seq de novo assembly is arguably a step more complex than the DNA version. In particular, RNA-seq assembly must deal with extremely uneven coverage depths (across genes, isoforms, and even position along a transcript), conserved gene families which present high sequence identity, and alternative splicing.
The typical workflow is largely identical to the one described in de novo genome assembly. Below are points which are specific to RNA-seq analysis:
- Choosing a protocol
- Quality control and data filtering
- Adapting parameters to expression levels
- Merging assemblies
Amplification and normalization
Creating a dataset
Velvet and Oases can be used together to assemble de novo transcriptomes. A hash table must first be generated using velveth, and then velvetg is used to assemble the nodes. Finally, Oases is used to reassemble the nodes into transcripts, transcript variants, and splice junctions. A final validation step can be performed by mapping the reads back to the assembly using mapping software capable of accounting for transcript variants, such as TopHat.
The following is a sample of commands:
./velveth NewDirectoryName 21 -shortPaired reads.fa
Here NewDirectoryName is the output directory, 21 is the hash length, and reads.fa is a paired-end FASTA file in which each reverse read directly follows its forward read. Paired-end reads can also be supplied as two separate files using the option -separate.
./velvetg NewDirectoryName -read_trkg yes
The option -read_trkg yes must be set for Oases to run.
The output from Oases will be three files in the NewDirectoryName directory:
- transcripts.fa
- splicing_events.txt
- contig-ordering.txt
If this assembly is to be used as a reference to map additional reads, transcripts.fa should be used as the reference.
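The command sequence above can be sketched as a small Python helper that only builds the argument lists, so they can be inspected before anything is run. The function name is hypothetical, and the final `./oases` call on the same directory is an assumption, since that invocation is not shown in the sample commands.

```python
import shlex


def build_velvet_oases_commands(out_dir, hash_length, reads_file):
    """Return the workflow's command lines as argument lists (nothing is executed)."""
    # Step 1: build the hash table from an interleaved paired-end FASTA file.
    velveth = ["./velveth", out_dir, str(hash_length), "-shortPaired", reads_file]
    # Step 2: assemble the nodes; read tracking must be on for Oases.
    velvetg = ["./velvetg", out_dir, "-read_trkg", "yes"]
    # Step 3 (assumption): run Oases on the same output directory.
    oases = ["./oases", out_dir]
    return [velveth, velvetg, oases]


if __name__ == "__main__":
    for cmd in build_velvet_oases_commands("NewDirectoryName", 21, "reads.fa"):
        print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`, which stops the sequence at the first failing step.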
De-novo short read assemblers of Transcriptomes
Assembling a transcriptome brings its own challenges. Reads are not sampled uniformly across genes: the more highly a gene is expressed, the more reads it contributes.
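To give a feel for how uneven the sampling can be, here is a minimal sketch under a simplified model that is an assumption and not part of the text: each transcript receives reads in proportion to its relative abundance multiplied by its length. The function name and the example transcripts are hypothetical.

```python
def expected_read_counts(transcripts, total_reads):
    """Expected reads per transcript under a simple abundance-times-length model.

    transcripts maps a name to (relative_abundance, length_bp).
    """
    # Assumption: sampling weight is proportional to abundance * length.
    weights = {
        name: abundance * length
        for name, (abundance, length) in transcripts.items()
    }
    total_weight = sum(weights.values())
    return {name: total_reads * w / total_weight for name, w in weights.items()}


if __name__ == "__main__":
    example = {"high_expr": (100, 1000), "low_expr": (1, 1000)}
    for name, count in expected_read_counts(example, 101_000).items():
        print(name, round(count))
```

With two transcripts of equal length and a 100-fold difference in abundance, the weakly expressed one receives about 1% of the reads, which is why it may be poorly covered even in a deep data set.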
Some steps which are likely common to most assemblies:
- Before you start, make sure you have suitable hardware; you might need more than 100 GB of RAM (see below).
- If it is feasible and would not distort the biology, try to obtain strand-specific RNA-seq libraries.
- It may help to generate normalized cDNA libraries.
- Make sure that all libraries are of acceptable quality and that there is no major concern (Quality Control Software).
- Before submitting data to a de novo assembler, it is often a good idea to clean the data, e.g. to trim away bad bases towards the end of reads and/or to drop reads altogether. Because low-quality bases are more likely to contain errors, they can complicate the assembly process and lead to higher memory consumption. That said, Trinity, for example, can use the ALLPATHS-LG read correction module prior to assembly. In addition, remove adapter and/or primer sequences that might still be present. (Trimming Tools)
- Be prepared to have more than 50 million read pairs for mammals. (This is based on the Trinity publication, in which 52.6 million 76 bp read pairs gave a good result. More is probably even better.)
- Before running any large assembly, double and triple check the parameters you feed the assembler.
- After assembly, it is often advisable to check how well your read data really agree with the assembly and potentially to visualize the data (Assembly Visualization).
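The cleaning step in the list above (trimming bad bases towards the end of a read, or dropping the read altogether) can be illustrated with a minimal sketch. The function name and both thresholds are assumptions chosen for illustration; dedicated trimming tools use more refined criteria.

```python
def trim_3prime(seq, quals, min_quality=20, min_length=30):
    """Remove the trailing run of low-quality bases from a read.

    seq is the base string, quals the per-base quality scores (same length).
    Returns (trimmed_seq, trimmed_quals), or None if the read should be dropped.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk back from the 3' end while the last kept base is below the threshold.
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    # Drop reads that become too short to be useful (threshold is an assumption).
    if end < min_length:
        return None
    return seq[:end], quals[:end]
```

Note that only the trailing run is removed: a single low-quality base in the middle of an otherwise good read is kept.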
The recommendations below are based on personal experience and a survey of the literature. In particular, the original publications introducing new tools were searched for comparisons (even though these may often be biased towards the new tools introduced by the authors). In addition, data from manuscripts comparing transcript assemblers were consulted.
If you use 454 data => use an OLC-based (overlap-layout-consensus) assembler; you will probably obtain very good results with Newbler.
If you use Illumina data => try Trinity, Trans-ABySS or Velvet-Oases if you have the resources. Which method will perform best is a function of read length, sequencing coverage, and transcriptome complexity. Please consult the references for comparisons of the assemblers below.
If you have a CLC pipeline and no computer experience => this is probably good enough.
ABySS is a de novo assembler which can run on multiple nodes, using the Message Passing Interface (MPI) for communication. As ABySS distributes tasks, the amount of RAM needed per machine is smaller, and thus ABySS is able to cope with large genomes. For transcriptome assemblies it is usually combined with Trans-ABySS.
- Distributed: a cluster can be used
- Relatively slow
MIRA is a general purpose assembler that can integrate various platform data and perform true hybrid assemblies.
- very well documented and many switches
- can combine different sequencing technologies
- likely relatively good quality data
- Only partly multithreaded; as a result, and because of the underlying approach, extremely slow
- Probably not recommended to assemble larger transcriptomes
SOAPdenovo
SOAPdenovo is an all-purpose genome assembler. It was used to assemble the giant panda genome.
- SOAPdenovo uses a medium amount of RAM
- SOAPdenovo is relatively fast (probably the fastest free assembler)
- SOAPdenovo contains a scaffolder and a read corrector
- SOAPdenovo is relatively modular (read corrector, assembly, scaffolding, gap filler)
- The way in which contigs are built is potentially somewhat confusing
- SOAPdenovo has no special extension for transcriptome assemblies
Trinity is a set of three programs, Inchworm, Chrysalis and Butterfly, each fulfilling a different task. It runs best with strand-specific data. When the Trinity authors compared it to Trans-ABySS and SOAPdenovo, it performed better than these in recovering full-length mouse and yeast genes. Trinity recommends 1 GB of RAM per 1 million Illumina read pairs. Trinity can use the ALLPATHS-LG read corrector; however, this requires ALLPATHS to be installed.
- Produces very good transcriptome assemblies
- Takes time; Inchworm, the assembler (the first step), does not benefit much from multithreading
Oases
- Oases is one of the most sensitive and accurate de novo transcriptome assemblers
- Oases contains a module to merge several single-k assemblies into one
- Oases users get fast answers via the Oases mailing list
- Oases supports diverse input data types and formats
- According to the Velvet/Oases mailing list, assembling 200 million paired-end reads of ~100 bp each can require up to 200 GB of RAM. However, absolute memory consumption is a function of the complexity of the transcriptome and is hard to estimate a priori.
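The Trinity rule of thumb quoted earlier (1 GB of RAM per 1 million Illumina read pairs) lends itself to a quick back-of-the-envelope check before a run. The sketch below applies only that rule; the function names are hypothetical, and the rule should not be carried over to Oases, for which the text gives a single data point, not a scaling rule.

```python
# Rule of thumb quoted in the text for Trinity.
TRINITY_GB_PER_MILLION_PAIRS = 1.0


def estimate_trinity_ram_gb(read_pairs):
    """Rough RAM estimate in GB for a Trinity run on the given number of read pairs."""
    return read_pairs / 1_000_000 * TRINITY_GB_PER_MILLION_PAIRS


def fits_in_memory(read_pairs, available_gb):
    """True if the rough estimate does not exceed the available RAM."""
    return estimate_trinity_ram_gb(read_pairs) <= available_gb


if __name__ == "__main__":
    pairs = 52_600_000  # data set size mentioned for the Trinity publication
    print(f"{estimate_trinity_ram_gb(pairs):.1f} GB estimated")
```

Since real memory use depends on transcriptome complexity, the estimate is best treated as a lower bound for planning and not as a guarantee.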
The CLC assembly cell is a commercial assembler released by CLC. It is most likely based on a k-mer approach.
- CLC uses very little RAM
- CLC is very fast
- CLC is not free
Newbler is an assembler released by the Roche company.
- Newbler has been used in many assembly projects
- Newbler seems to be able to produce good N50 values
- Newbler is often relatively precise
- Newbler is usually available free of charge
- Newbler is tailored to (mostly) 454 data. Whilst it can accommodate some limited amount of Illumina data as has been described by bioinformatician Lex Nederbragt, this is not possible for larger data sets.
- As Newbler at least partly uses the OLC approach, large assemblies can take time
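N50, mentioned above as a quality indicator, is the contig length such that contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of the calculation follows; the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Accumulate from the longest contig down until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

In the example, the lengths sum to 32, and the two longest contigs (8 + 8) already cover half of that, so the N50 is 8.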
Further Reading Material
- Martin and Wang 2011 A review about transcriptome assembly
- Original publications
- Comparisons 454 data
- Kumar and Blaxter 2010 found that for 454 data, amongst the assemblers CAP3, MIRA, Newbler, SeqMan and CLC, Newbler performed best for their test data set.
- Garg et al. 2011, once again using 454 data, found that the short read assemblers Velvet and ABySS performed less well, whereas CLC performed almost comparably to MIRA, Newbler v2.3, Newbler v2.5p1, CAP3 and TGICL. Interestingly, Newbler v2.3 might have performed better than the newer version 2.5p1.
- Mundry et al. 2012 compare the CAP3, MIRA, Newbler, and Oases assemblers on simulated 454 data.
- Comparisons Illumina data
- Zhao et al. 2011 compared SOAPdenovo, ABySS, Trinity and Oases on three different RNA-seq data sets, analyzing the influence of merging different single-k assemblies.
- Zerbino, D. (29 August 2008). "Velvet Manual - version 1.1". NIH HPC Group. Archived from the original on 14 September 2015. https://web.archive.org/web/20150914160547/http://helix.nih.gov/Applications/velvet_manual.pdf. Retrieved 4 May 2016.
- Martin, J. (10 June 2011). "(Oases-users) Memory requirements". Oases-users mailing list. The European Bioinformatics Institute. http://listserver.ebi.ac.uk/pipermail/oases-users/2011-June/000198.html. Retrieved 4 May 2016.
- Nederbragt, L. (21 January 2011). "Newbler input II: Sequencing reads from other platforms". An assembly of reads, contigs and scaffolds. https://contig.wordpress.com/2011/01/21/newbler-input-ii-sequencing-reads-from-other-platforms/. Retrieved 4 May 2016.