Assembly using velvet¶
De novo assembly of Illumina reads using velvet¶
Assembling short-reads with Velvet¶
We will use Velvet to assemble Illumina reads on their own. Velvet uses the de Bruijn graph approach.
We will assemble E. coli K12 strain MG1655 which was sequenced on an Illumina MiSeq. The instrument read 150 bases from each direction.
We wil first use paired end reads only:
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq
Building the Velvet Index File¶
Velvet requires an index file to be built before the assembly takes place. We must choose a k- mer value for building the index. Longer k- mers result in a more stringent assembly, at the expense of coverage. There is no definitive value of k for any given project. However, there are several absolute rules:
- k must be less than the read length
- it should be an odd number
Firstly we are going to run Velvet in single-end mode, ignoring the pairing information. Later on we will incorporate this information.
First, we need to make sure we can use velvet:
Set up the environment¶
For this part of the course, every time you log into the server you
need to execute the command below. IMPORTANT do not use spaces between
PATH=
and $PATH
!
export PATH=/data/bin/:$PATH
To be able to use velvet, load the following module:
module load velvet
Now, ‘go home’:
cd ~
or simply type
cd
Create the assembly folder:
mkdir assembly
cd assembly
mkdir velvet
cd velvet
A first assembly¶
Find a value of k (between 21 and 113) to start with, and record your
choice in this google spreadsheet:
bit.ly/INFBIO1. Run velveth
to build the
hash index (see below).
Program | Options | Explanation |
---|---|---|
velveth | Build the Velvet index file | |
foldername | use this name for the results folder | |
value_of_k | use k-mers of this size | |
-short | short reads (as opposed to long, Sanger-like reads) | |
-separate | read1 and read2 are in separate files | |
-fastq | read type is fastq |
Build the index as follows:
velveth ASM_NAME VALUE_OF_K \
-short -separate -fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq
NOTES
- Change
ASM_NAME
to something else of your choosing - Change
VALUE_OF_K
to the value you have picked - The command is split over several lines by adding a space, and a
\
(backslash) to each line. This trick makes long commands more readable. If you want, you can write the whole command on one line instead.
After velveth
is finished, look in the new folder that has the name
you chose. You should see the following files:
Log
Roadmaps
Sequences
The ‘Log
‘ file has a useful reminder of what commands you typed to
get this assembly result, for reproducing results later on.
‘Sequences
‘ contains the sequences we put in, and ‘Roadmaps
‘
contains the index you just created.
Now we will run the assembly with default parameters:
velvetg ASM_NAME
Velvet will end with a text like this:
Final graph has ... nodes and n50 of ..., max ..., total ..., using .../... reads
The number of nodes represents the number of nodes in the graph, which (more or less) is the number of contigs. Velvet reports its N50 (as well as everything else) in ‘kmer’ space. The conversion to ‘basespace’ is as simple as adding k-1 to the reported length.
Look again at the folder ASM_NAME
, you should see the following
extra files:
contigs.fa
Graph
LastGraph
PreGraph
stats.txt
The important files are:
contigs.fa
- the assembly itselfGraph
- a textual representation of the contig graphstats.txt
- a file containing statistics on each contigQuestions
- What k-mer did you use?
- What is the N50 of the assembly?
- What is the size of the largest contig?
- How many contigs are there in the
contigs.fa
file? Usegrep -c NODE contigs.fa
. Is this the same number as velvet reported?
Log your results in this google spreadsheet: bit.ly/INFBIO1
We will discuss the results together and determine *the optimal* k-mer for this dataset.
Advanced tip: You can also use VelvetOptimiser to automate this process of selecting appropriate k-mer values. VelvetOptimizer is included with the Velvet installation.
Now run velveth
and velvetg
for the kmer size determined by the
whole class. Use this kmer from now on!
Estimating and setting exp_cov
¶
Much better assemblies are produced if Velvet understands the expected
coverage for unique regions of your genome. This allows it to try and
resolve repeats. The data to determine this is in the stats.txt
file. The full description of this file is in the Velvet Manual, at
http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf.
A so-called Jupyter notebook has been provided to plot the distribution of the coverage of the nodes. In order to use it, you need to do the following on the local linux machine Not on the server:
NOTE: if you are on vetur
OR vor
, type:
ssh nordur
OR
ssh austur
and enter your password.
- install the Jupyter notebook and some python packages (this may take a few minutes):
pip install --user jupyter pandas numpy pysam
- prepare a folder on your linux machine
cd ~
mkdir assembly
cd assembly
mkdir velvet
cd velvet
mkdir ASM_NAME
cd ASM_NAME
- copy the
stats.txt
file from the server to this folder using thersync
command - copy the notebook file
/data/assembly/node_coverage.ipynb
from the server to this folder usingrsync
- start the notebook:
python notebook node_coverage.ipynb
OR
ipython notebook node_coverage.ipynb
- After a little while, your web browser will start with a new tab with the notebook in it
- follow the instructions in the notebook
Question:
- What do you think is the approximate expected k-mer coverage for your assembly?
When you are done with the Jupyter notebook:
- save the notebook
- close the browser windows
- in the terminal where you started Jupyter notebook, click ctrl-c and confirm.
Now run velvet again, supplying the value for exp_cov
(k-mer
coverage) corresponding to your answer:
velvetg ASM_NAME -exp_cov PEAK_K_MER_COVERAGE
Question:
- What improvements do you see in the assembly by setting a value for
exp_cov
?
Setting cov_cutoff
¶
You can also clean up the graph by removing low-frequency nodes from the
de Bruijn graph using the cov_cutoff
parameter. Low-frequency
nodes can result from sequencing errors, or from parts of the genome
with very little sequencing coverage. Removing them will often result in
better assemblies, but setting the cut-off too high will also result in
losing useful parts of the assembly. Using the histogram from
previously, estimate a good value for cov_cutoff
.
velvetg ASM_NAME -exp_cov YOUR_VALUE -cov_cutoff YOUR_VALUE
Try some different values for cov_cutoff
, keeping exp_cov
the
same and record your assembly results.
Asking velvet to determine the parameters¶
You can also ask Velvet to predict the values for you:
velvetg ASM_NAME -exp_cov auto -cov_cutoff auto
Questions:
- What values of exp_cov and cov_cutoff did Velvet choose?
- Check the output to the screen. Is this assembly better than your best one?
Incorporating paired-end information¶
Paired end information contributes additional information to the
assembly, allowing contigs to be scaffolded. We will first re-index your
reads telling Velvet to use paired-end information, by using
-shortPaired
instead of -short
for velveth
. Then, re-run
velvetg using the best value of k
, exp_cov
and cov_cutoff
from the previous step.
!!! IMPORTANT Pick a new name for your assembly !!!
velveth ASM_NAME2 VALUE_OF_K \
-shortPaired -fastq -separate \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq
velvetg ASM_NAME2 -exp_cov auto \
-cov_cutoff auto
Questions:
- How does doing this affect the assembly?
- what does velvet say about the insert size of the paired end library?
Scaffold and contig metrics¶
contigs.fa
file are actually scaffolds.assemblathon_stats.pl
script to generate metrics for this,
and all following assemblies.The assemblathon stats script
The assemblathon www.assemblathon.org used a perl script to obtain standardized metrics for the assemblies that were submitted. Here we use (a slightly modified version of) this script. It takes the size of the genome, and one sequence fasta file as input. The script breaks the sequences into contigs when there are 20 or more N’s, and reports all sorts of metrics.
Program | Options | Explanation |
---|---|---|
assemblathon_stats.pl | Provide basic assembly metrics | |
-size | size (in Mbp, million basepairs) of target genome (optional) | |
seq.fasta | fasta file of contigs or scaffolds to report on |
Example, for a 3.2 Mbp genome:
assemblathon_stats.pl -s 3.2 scaffolds.fasta
OR, save the output to a file with
assemblathon_stats.pl -s 3.2 scaffolds.fasta > metrics.txt
Here, >
(redirect) symbol used to ‘redirect’ what is written to the
screen to a file.
For this exercise, use the known length for this strain, 4.6 Mbp, for the genome size.
NOTE make sure you have run this command to enable the use of the script:
export PATH=/data/bin/:$PATH
Some of the metrics the script reports are:
- N50 is based on the total assembly size
- NG50 is based on the estimated/known genome size
- L50 (LG50) count: number of scaffolds/contigs at least N50 (NG50) bases
Questions
- How much of the estimated genome size is covered in the scaffolds
- how many gap bases (‘N’) are left in the scaffolds
Looking for repeats¶
Have a look for contigs which are long and have a much higher coverage
than the average for your genome. One tedious way to do this is to look
into the contigs.fa
file (with less
). You will see the name of
the contig (‘NODE’), it’s length and the kmer coverage. However, trying
to find long contigs with high coverage this way is not very efficient.
A faster was is to again use the stats.txt
file.
Relevant columns are:
- ID –> sequence ID, same as ‘NODE’ number in the
contigs.fa
file - lgth –> sequence ‘length’
- short1_cov –> kmer coverage (column 6)
Knowing this, we can use the awk
command to select lines for contigs
at least 1kb, with k-mer coverage greater than 60:
awk '($2>=1000 && $6>=60)' stats.txt
awk
is an amazing program for tabular data. In this case, we ask it
to check that column 2 ($2, the length) is at least 1000 and column 6
($6, coverage) at least 60. If this is the case, awk will print the
entire line. See http://bit.ly/QjbWr7 for more information on awk.
Find the contig with the highest coverage in the contigs.fa
file.
Perform a BLAST search using NCBI.
Question:
- What is it?
- Is this surprising? Why, or why not?
The effect of mate pair library reads¶
Long-range “mate-pair” libraries can also dramatically improve an assembly by scaffolding contigs. Typical sizes for Illumina are 2kb and 6kb, although any size is theoretically possible. You can supply a second library to Velvet. However, it is important that files are reverse-complemented first as Velvet expects a specific orientation. We have supplied a 3kb mate-pair library in the correct orientation.
!!! IMPORTANT Pick a new name for your assembly !!!
We will use -shortPaired
for the paired end library reads as before,
and add -shortPaired2
for the mate pairs. Also, to make sure we all
end up having the same assembly, the kmer size is given:
velveth ASM_NAME3 81 \
-shortPaired -separate -fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq \
-shortPaired2 -separate -fastq \
/data/assembly/Nextera_MP_R1_50x.fastq \
/data/assembly/Nextera_MP_R2_50x.fastq
We use auto values for velvetg because the addition of new reads will change the genome coverage. The assembly command then becomes:
velvetg ASM_NAME3 -cov_cutoff auto -exp_cov auto
Questions:
- What is the N50 of this assembly?
- How many scaffolds?
- How many bases are in gaps?
- What did velvet estimate for the insert length of the paired-end reads, and for the standard deviation? Use the last mention of this in the velvet output.
- And for the mate-pair library?
TIP Some mate pair libraries have a significant amount of paired end
reads present as a by-effect of the library preparation. This may
generate misassemblies. If this is the case for your data, add the
-shortMatePaired2 yes
to let Velvet know it.
Make a copy of the contigs file and call it velvet_pe+mp.fa
Optional: Skipping the paired end reads¶
As both the mate pairs and the paired end reads are of the same length, and provide the same coverage, it could be interesting to try an assembly of the mate pair reads only. The read sequences would still be used to build the contigs, and the mate pair information to build scaffolds.
!!! IMPORTANT Pick a new name for your assembly !!!
The assembly for this part then becomes:
velveth ASM_NAME4 81 \
-shortPaired -separate -fastq \
/data/assembly/Nextera_MP_R1_50x.fastq \
/data/assembly/Nextera_MP_R2_50x.fastq
velvetg ASM_NAME4 -cov_cutoff auto -exp_cov auto
Questions:
- What is the N50 of this assembly?
- How many scaffolds?
- How many bases are in gaps?
- How does this assembly compare to the previous ones?
Make a copy of the contigs file and call it velvet_mp_only
Next steps¶
Next, map the reads used for the assemblies back to the scaffolds. See the tutorial ‘Mapping reads to an assembly’