Assembly using velvet¶

De novo assembly of Illumina reads using velvet¶

Assembling short-reads with Velvet¶

We will use Velvet to assemble Illumina reads on their own. Velvet uses the de Bruijn graph approach.

We will assemble E. coli K12 strain MG1655 which was sequenced on an Illumina MiSeq. The instrument read 150 bases from each direction.

We wil first use paired end reads only:

/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq

/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq

Building the Velvet Index File¶

Velvet requires an index file to be built before the assembly takes place. We must choose a k- mer value for building the index. Longer k- mers result in a more stringent assembly, at the expense of coverage. There is no definitive value of k for any given project. However, there are several absolute rules:

k must be less than the read length
it should be an odd number

Firstly we are going to run Velvet in single-end mode, ignoring the pairing information. Later on we will incorporate this information.

First, we need to make sure we can use velvet:

Set up the environment¶

For this part of the course, every time you log into the server you need to execute the command below. IMPORTANT do not use spaces between PATH= and $PATH!

export PATH=/data/bin/:$PATH

To be able to use velvet, load the following module:

module load velvet

Now, ‘go home’:

cd ~

or simply type

cd

Create the assembly folder:

mkdir assembly
cd assembly
mkdir velvet
cd velvet

A first assembly¶

Find a value of k (between 21 and 113) to start with, and record your choice in this google spreadsheet: bit.ly/INFBIO1. Run velveth to build the hash index (see below).

Program	Options	Explanation
velveth		Build the Velvet index file
	foldername	use this name for the results folder
	value_of_k	use k-mers of this size
	-short	short reads (as opposed to long, Sanger-like reads)
	-separate	read1 and read2 are in separate files
	-fastq	read type is fastq

Build the index as follows:

velveth ASM_NAME VALUE_OF_K \
-short -separate -fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq

NOTES

Change ASM_NAME to something else of your choosing
Change VALUE_OF_K to the value you have picked
The command is split over several lines by adding a space, and a \ (backslash) to each line. This trick makes long commands more readable. If you want, you can write the whole command on one line instead.

After velveth is finished, look in the new folder that has the name you chose. You should see the following files:

Log
Roadmaps
Sequences

The ‘Log‘ file has a useful reminder of what commands you typed to get this assembly result, for reproducing results later on. ‘Sequences‘ contains the sequences we put in, and ‘Roadmaps‘ contains the index you just created.

Now we will run the assembly with default parameters:

velvetg ASM_NAME

Velvet will end with a text like this:

Final graph has ... nodes and n50 of ..., max ..., total ..., using .../... reads

The number of nodes represents the number of nodes in the graph, which (more or less) is the number of contigs. Velvet reports its N50 (as well as everything else) in ‘kmer’ space. The conversion to ‘basespace’ is as simple as adding k-1 to the reported length.

Look again at the folder ASM_NAME, you should see the following extra files:

contigs.fa
Graph
LastGraph
PreGraph
stats.txt

The important files are:

contigs.fa - the assembly itself
Graph - a textual representation of the contig graph
stats.txt - a file containing statistics on each contig

Questions

What k-mer did you use?
What is the N50 of the assembly?
What is the size of the largest contig?
How many contigs are there in the contigs.fa file? Use grep -c NODE contigs.fa. Is this the same number as velvet reported?

Log your results in this google spreadsheet: bit.ly/INFBIO1

We will discuss the results together and determine *the optimal* k-mer for this dataset.

Advanced tip: You can also use VelvetOptimiser to automate this process of selecting appropriate k-mer values. VelvetOptimizer is included with the Velvet installation.

Now run velveth and velvetg for the kmer size determined by the whole class. Use this kmer from now on!

Estimating and setting `exp_cov`¶

Much better assemblies are produced if Velvet understands the expected coverage for unique regions of your genome. This allows it to try and resolve repeats. The data to determine this is in the stats.txt file. The full description of this file is in the Velvet Manual, at http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf.

A so-called Jupyter notebook has been provided to plot the distribution of the coverage of the nodes. In order to use it, you need to do the following on the local linux machine Not on the server:

NOTE: if you are on vetur OR vor, type:

ssh nordur

OR

ssh austur

and enter your password.

install the Jupyter notebook and some python packages (this may take a few minutes):

pip install --user jupyter pandas numpy pysam

prepare a folder on your linux machine

cd ~
mkdir assembly
cd assembly
mkdir velvet
cd velvet
mkdir ASM_NAME
cd ASM_NAME

copy the stats.txt file from the server to this folder using the rsync command
copy the notebook file /data/assembly/node_coverage.ipynb from the server to this folder using rsync
start the notebook:

python notebook node_coverage.ipynb

OR

ipython notebook node_coverage.ipynb

After a little while, your web browser will start with a new tab with the notebook in it
follow the instructions in the notebook

Question:

What do you think is the approximate expected k-mer coverage for your assembly?

When you are done with the Jupyter notebook:

save the notebook
close the browser windows
in the terminal where you started Jupyter notebook, click ctrl-c and confirm.

Now run velvet again, supplying the value for exp_cov (k-mer coverage) corresponding to your answer:

velvetg ASM_NAME -exp_cov PEAK_K_MER_COVERAGE

Question:

What improvements do you see in the assembly by setting a value for exp_cov?

Setting `cov_cutoff`¶

You can also clean up the graph by removing low-frequency nodes from the de Bruijn graph using the cov_cutoff parameter. Low-frequency nodes can result from sequencing errors, or from parts of the genome with very little sequencing coverage. Removing them will often result in better assemblies, but setting the cut-off too high will also result in losing useful parts of the assembly. Using the histogram from previously, estimate a good value for cov_cutoff.

velvetg ASM_NAME -exp_cov YOUR_VALUE -cov_cutoff YOUR_VALUE

Try some different values for cov_cutoff, keeping exp_cov the same and record your assembly results.

Asking velvet to determine the parameters¶

You can also ask Velvet to predict the values for you:

velvetg ASM_NAME -exp_cov auto -cov_cutoff auto

Questions:

What values of exp_cov and cov_cutoff did Velvet choose?
Check the output to the screen. Is this assembly better than your best one?

Incorporating paired-end information¶

Paired end information contributes additional information to the assembly, allowing contigs to be scaffolded. We will first re-index your reads telling Velvet to use paired-end information, by using -shortPaired instead of -short for velveth. Then, re-run velvetg using the best value of k, exp_cov and cov_cutoff from the previous step.

!!! IMPORTANT Pick a new name for your assembly !!!

velveth ASM_NAME2 VALUE_OF_K \
-shortPaired -fastq -separate \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq

velvetg ASM_NAME2 -exp_cov auto \
-cov_cutoff auto

Questions:

How does doing this affect the assembly?
what does velvet say about the insert size of the paired end library?

Scaffold and contig metrics¶

The sequences in the contigs.fa file are actually scaffolds.

Use the assemblathon_stats.pl script to generate metrics for this, and all following assemblies.

The assemblathon stats script

The assemblathon www.assemblathon.org used a perl script to obtain standardized metrics for the assemblies that were submitted. Here we use (a slightly modified version of) this script. It takes the size of the genome, and one sequence fasta file as input. The script breaks the sequences into contigs when there are 20 or more N’s, and reports all sorts of metrics.

Program	Options	Explanation
assemblathon_stats.pl		Provide basic assembly metrics
	-size	size (in Mbp, million basepairs) of target genome (optional)
	seq.fasta	fasta file of contigs or scaffolds to report on

Example, for a 3.2 Mbp genome:

assemblathon_stats.pl -s 3.2 scaffolds.fasta

OR, save the output to a file with

assemblathon_stats.pl -s 3.2 scaffolds.fasta > metrics.txt

Here, > (redirect) symbol used to ‘redirect’ what is written to the screen to a file.

For this exercise, use the known length for this strain, 4.6 Mbp, for the genome size.

NOTE make sure you have run this command to enable the use of the script:

export PATH=/data/bin/:$PATH

Some of the metrics the script reports are:

N50 is based on the total assembly size
NG50 is based on the estimated/known genome size
L50 (LG50) count: number of scaffolds/contigs at least N50 (NG50) bases

Questions

How much of the estimated genome size is covered in the scaffolds
how many gap bases (‘N’) are left in the scaffolds

Looking for repeats¶

Have a look for contigs which are long and have a much higher coverage than the average for your genome. One tedious way to do this is to look into the contigs.fa file (with less). You will see the name of the contig (‘NODE’), it’s length and the kmer coverage. However, trying to find long contigs with high coverage this way is not very efficient.

A faster was is to again use the stats.txt file.

Relevant columns are:

ID –> sequence ID, same as ‘NODE’ number in the contigs.fa file
lgth –> sequence ‘length’
short1_cov –> kmer coverage (column 6)

Knowing this, we can use the awk command to select lines for contigs at least 1kb, with k-mer coverage greater than 60:

awk '($2>=1000 && $6>=60)' stats.txt

awk is an amazing program for tabular data. In this case, we ask it to check that column 2 ($2, the length) is at least 1000 and column 6 ($6, coverage) at least 60. If this is the case, awk will print the entire line. See http://bit.ly/QjbWr7 for more information on awk.

Find the contig with the highest coverage in the contigs.fa file. Perform a BLAST search using NCBI.

Question:

What is it?
Is this surprising? Why, or why not?

The effect of mate pair library reads¶

Long-range “mate-pair” libraries can also dramatically improve an assembly by scaffolding contigs. Typical sizes for Illumina are 2kb and 6kb, although any size is theoretically possible. You can supply a second library to Velvet. However, it is important that files are reverse-complemented first as Velvet expects a specific orientation. We have supplied a 3kb mate-pair library in the correct orientation.

!!! IMPORTANT Pick a new name for your assembly !!!

We will use -shortPaired for the paired end library reads as before, and add -shortPaired2 for the mate pairs. Also, to make sure we all end up having the same assembly, the kmer size is given:

velveth ASM_NAME3 81 \
-shortPaired -separate -fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R1.fastq \
/data/assembly/MiSeq_Ecoli_MG1655_50x_R2.fastq \
-shortPaired2 -separate -fastq \
/data/assembly/Nextera_MP_R1_50x.fastq \
/data/assembly/Nextera_MP_R2_50x.fastq

We use auto values for velvetg because the addition of new reads will change the genome coverage. The assembly command then becomes:

velvetg ASM_NAME3 -cov_cutoff auto -exp_cov auto

Questions:

What is the N50 of this assembly?
How many scaffolds?
How many bases are in gaps?
What did velvet estimate for the insert length of the paired-end reads, and for the standard deviation? Use the last mention of this in the velvet output.
And for the mate-pair library?

TIP Some mate pair libraries have a significant amount of paired end reads present as a by-effect of the library preparation. This may generate misassemblies. If this is the case for your data, add the -shortMatePaired2 yes to let Velvet know it.

Make a copy of the contigs file and call it velvet_pe+mp.fa

Optional: Skipping the paired end reads¶

As both the mate pairs and the paired end reads are of the same length, and provide the same coverage, it could be interesting to try an assembly of the mate pair reads only. The read sequences would still be used to build the contigs, and the mate pair information to build scaffolds.

!!! IMPORTANT Pick a new name for your assembly !!!

The assembly for this part then becomes:

velveth ASM_NAME4 81 \
-shortPaired -separate -fastq \
/data/assembly/Nextera_MP_R1_50x.fastq \
/data/assembly/Nextera_MP_R2_50x.fastq

velvetg ASM_NAME4 -cov_cutoff auto -exp_cov auto

Questions:

What is the N50 of this assembly?
How many scaffolds?
How many bases are in gaps?
How does this assembly compare to the previous ones?

Make a copy of the contigs file and call it velvet_mp_only

Next steps¶

Next, map the reads used for the assemblies back to the scaffolds. See the tutorial ‘Mapping reads to an assembly’

Assembly using velvet¶

De novo assembly of Illumina reads using velvet¶

Assembling short-reads with Velvet¶

Building the Velvet Index File¶

Set up the environment¶

A first assembly¶

Estimating and setting `exp_cov`¶

Setting `cov_cutoff`¶

Asking velvet to determine the parameters¶

Incorporating paired-end information¶

Scaffold and contig metrics¶

Looking for repeats¶

The effect of mate pair library reads¶

Optional: Skipping the paired end reads¶

Next steps¶

Table Of Contents

This Page

Assembly using velvet¶

De novo assembly of Illumina reads using velvet¶

Assembling short-reads with Velvet¶

Building the Velvet Index File¶

Set up the environment¶

A first assembly¶

Estimating and setting exp_cov¶

Setting cov_cutoff¶

Asking velvet to determine the parameters¶

Incorporating paired-end information¶

Scaffold and contig metrics¶

Looking for repeats¶

The effect of mate pair library reads¶

Optional: Skipping the paired end reads¶

Next steps¶

Estimating and setting `exp_cov`¶

Setting `cov_cutoff`¶