Next Generation Sequencing (NGS)/Velvet
Velvet practical: Part 1
[edit | edit source]This practical will cover the following items:
- Compile Velvet
- Single end
- K-mer length
- Coverage cut-offs
- Whole genome sequence as input???
Prepare the environment
[edit | edit source]First make sure that you are in your home directory by typing:
cd
and making absolutely sure you're there by typing:
pwd
Now create sub-directories for this and the two other velvet practicals. All these directories will be made as sub-directories of a directory for the whole course called NGS. For this you can use the following commands:
mkdir -p NGS/velvet/{part1,part2,part3}
# The -p tells mkdir (make directory) not to worry if a parent directory is missing. # That is, if a sub-directory cannot be made because its parent directory does not exist, # just make the parent directory first rather than reporting an error. # The “one at a time” approach would be: mkdir NGS mkdir NGS/velvet mkdir NGS/velvet/part1 mkdir NGS/velvet/part2 mkdir NGS/velvet/part3
After creating the directories, examine the structure and move into the directory ready for the first velvet exercise by typing:
ls -R NGS;
cd NGS/velvet/part1; pwd;
Downloading and Compile Velvet
[edit | edit source]You can find the latest version of velvet at:
http://www.ebi.ac.uk/~zerbino/velvet/
You could go to this URL and download the latest velvet version, or equivalently, you could type the following, which will download, unpack, inspect, compile and execute your locally compiled version of velvet:
cd ~/NGS/velvet/part1; pwd;
cp ~/NGS/Data/velvet_1.2.07.tgz . ;
tar xzf ~/NGS/Data/velvet_1.2.07.tgz;
ls -R; cd velvet_1.2.07;
make velveth velvetg;
./velveth'''
Take a look at the executables you have created. They will be displayed as green by the command:
ls --color=always;
The switch --color, instructs that files be coloured according to their type. This is often the default. Here we are just being cautious. The =always is included as colouring is usually suppressed in scripts. If you run this exercise using the script provided, just --color would not be enough. --color=always insists on file colouring even from a script.
Have a look of the output the command produces and you will see the following parameters passed into the compiler:
“MAXKMERLENGTH=31” and “CATEGORIES=2”
This indicates that the default compilation was set for De Bruijn graph KMERs of maximum size 31 and to allow a maximum of just 2 read categories. You can override these, and other, default configuration choices using command line parameters. Assume, you want to run velvet with a KMER length of 41 using 3 categories, velvet needs to be recompiled to enable this functionality by typing:
make clean; make velveth velvetg MAXKMERLENGTH=41 CATEGORIES=3;
./velveth
velvet can also be used to process SOLID colour space data. To do this you need a further make parameter. With the following command clean away your last compilation and try the following parameters:
make clean; make MAXKMERLENGTH=41 CATEGORIES=3 color
./velveth_de
For a further description of velvet compile and runtime parameters please see the velvet Manual: https://github.com/dzerbino/velvet/wiki/Manual
Single ended read assembly
[edit | edit source]The data you will examine is from Staphylococcus aureus USA300 which has a genome of around 3MB. The reads are Illumina and are unpaired, also known as single-end library. Even though you have carefully installed velvet in your own workspace, we will use a pre-installed version. The data needed for this section can be obtained from the Sequence Read Archive (SRA). For the following example use the run data SRR022825 and SRR022823 from the SRA Sample SRS004748. The SRA experiment could be viewed by setting your browser to the URL: http://www.ebi.ac.uk/ena/data/view/SRS004748
The following exercise focuses on velvet using single-end reads, how the available parameters effect an assembly and how to measure and compare the changes. To begin with, first move back to the directory you prepared for this exercise, create a new folder with a suitable name for this part and move into it. The command to download the file from the internet would be:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022825/SRR022825.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022823/SRR022823.fastq.gz
or if you had the files installed locally, just create soft links to the files. Continue by copying (or typing):
<syntax highlight="text> cd ~/NGS/velvet/part1 mkdir SRS004748 cd SRS004748 pwd ln -s ~/NGS/Data/SRR022825.fastq.gz . ln -s ~/NGS/Data/SRR022823.fastq.gz . ls -l </syntaxhighlight>
You are ready to process your data with velvet, which is a compressed fastq file. Velvet has two main components: velveth -used to construct, from raw read data, a dataset organised in the fashion expected by the second component, velvetg. velvetg -the core of velvet where the de Bruijn graph assembly is built and manipulated. You can always get further information about the usage of both velvet programs by typing velvetg or velveth in your terminal.
Now run velveth for the reads in SRR022825.fastq.gz and SRR022823.fastq.gz using the following options:
- A de Bruijn graph k-mer of 25
- An output directory called run_25
velveth run_25 25 -fastq.gz -short SRR022825.fastq.gz SRR022823.fastq.gz
velveth talks to itself for a while and ends with some files in the output directory. Move into the output directory run_25 and take a look around at what velveth had done so far. The UNIX command less allows you to look at output files (press q for quit). Just in case you still need a hint:
cd run_25;
ls -l;
head Sequences;
Now move one directory level up and run velvetg on your output directory, with the commands:
cd ..
time velvetg run_25
Move back into your results directory to examine the effects of velvetg:
cd run_25; ls -l;
FOR YOU: once you run the command above:
- Q1: What extra files do you see in the folder run_25?
- Q2:What do you suppose they might represent?
- Q3: In the Log file in run_25, what is the N50?
N50 statistic: Broadly, it is the median (not average) of a sorted data set using the length of a set of sequences. Usually it is the length of the contig whose length. When added to the length of all longer contigs, makes a total greater that half the sum of the lengths of all contigs. Easy, but messy – a more formal definition can be found here: http://www.broadinstitute.org/crd/wiki/index.php/N50
Backup the contigs.fa file and calculate the N50 (and the N25,N75) value with the command:
cp contigs.fa contigs.fa.0
YOU now try:
gnx -min 100 -nx 25,50,75 contigs.fa
- Q4. Does the value of N50 agree with the value stored in the Log file?
- Q5. If not, why do you think this might be?
In order to improve our results, take a closer look at the standard options of velvetg by typing 'velvetg' without parameters. For the moment focus on the two options -cov_cutoff and -exp_cov. Clearly -cov_cutoff will allow you to exclude contigs for which the kmer coverage is low, implying unacceptably poor quality. The -exp_cov switch is used to give velvetg an idea of the coverage to expect. If the expected coverage of any contig is substantially in excess of the suggested expected value, maybe this would indicate a repeat. For further details of how to choose the parameters, go to 'Choice of a coverage cutoff': http://wiki.github.com/dzerbino/velvet/
Briefly, the Kmer coverage (and much more information) for each contig is stored in the file stats.txt and can be used with R to visualize the coverage distribution. Take a look at the stats.txt file, start R, load and visualize the data using the following commands:
R
library(plotrix)
data <- read.table("stats.txt", header=TRUE)
x11()
weighted.hist(data$short1_cov, data$lgth, breaks=0:50)
A weighted histogram is a better way of visualizing the coverage information, because of noise (lot of very short contigs). You can see an example output below:
After choosing the expected coverage and the coverage cut-off, you can exit R by typing:
q()
n
The weighted histogram suggests to me that the expected coverage is around 14 and that everything below 6 is likely to be noise.
Some coverage is also represented at around 20, 30 and greater 50, which might be contamination or repeats (depending on the dataset), but at the moment this should not worry you.
To see the improvements, rerun velvetg first with -cov_cutoff 6 and after checking the N50 use only / add -exp_cov 14 to the command line option.
Also keep a copy of the contigs file for comparison:
cd ~/NGS/velvet/part1/SRS004748
time velvetg run_25 -cov_cutoff 6
# Make a copy of the run
cp run_25/contigs.fa run_25/contigs.fa.1
time velvetg run_25 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.2
time velvetg run_25 -cov_cutoff 6 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.3
- Q6. What is the N50 with no parameter:
- Q7. What is the N50 with -cov_cutoff 6:
- Q8. What is the N50 with -exp_cov 14:
- Q9. What is the N50 with -cov_cutoff 6 -exp_cov 14:
- Q10Did you notice a variation in the time velvetg took to run? If so, can you explain why that might be?
You were running velvetg with the given -exp_cov and -cov_cutoff parameters. Now try to experiment using different cut-offs, expected parameters and also explore other settings (e.g. -max_coverage, -max_branch_length, -unused_reads, -amos_file, -read_trkg or see velvetg help menu).
In particular, look at the -amos_file parameter which instructs velvetg to create a version of the assembly that can be processed and viewed with a program called AMOS Hawkeye. Another program, called tablet, can also understand and display the AMOS file. For now, we will take a look at tablet.
Run velvetg with just -cov_cutoff 6 but requesting an amos file:
velvetg run_25 -cov_cutoff 6 -amos_file yes
Now take a swift look at the assembly with tablet:
tablet run_25/velvet_asm.afg &
velvet_asm.afg being the file velvetg made is response to the inclusion of the -amos_file yes switch. tablet will take quite a bit of memory and you can safely ignore the complaints from tablet. Once the file has loaded, select one of the longer contigs. Note the lack of Read information and the one dimensional nature of the contig display. Close down tablet when you have seen enough. Now rerun velvetg adding the additional parameter -exp_cov 7 with no need to save any files as the amos file needs to be changed. Now rerun velvetg adding the additional parameter -exp_cov 14 with no need to save any files as the AMOS file needs to be changed.
velvetg run_25 -cov_cutoff 6 -exp_cov 14 -amos_file yes
Again, view the amos file with tablet:
tablet run_25/velvet_asm.afg &
Select a longish contig as before and explore the display options before close down table. FOR YOU:
- Q11. Why do you think there was read information only for the second use of tablet?
- Q12. You may have noticed velvetg took a little longer with -exp_cov 14 on. Does this make sense ?
If you want to explore the behaviour of velvet even further, you could experiment with the following.
Reduce the sequence coverage by only choosing one input file for the assembly e.g. SRR022825.fastq.gz
Increase the sequence coverage by downloading further files from the same ENA sample SRS004748 (http://www.ebi.ac.uk/ena/data/view/SRS004748).
FOR YOU: How did the N50 change by
- Q13. reducing the coverage?
- Q14. increasing the coverage?