Sequence Assembly Practical

image
Prerequisites: Basic knowledge of Linux command-line usage.
Level: Intermediate.
Objectives:
To learn the basics of how to use Canu assembler.

Introduction

In this practical, we will learn the intricacies of assembling a ~5 Mb haploid genome using long read PacBio sequencing technology.

Familiarize with PacBio sequencing

See the video from time 1:26 to 2:36.

Pacific Biosciences Glossary of Terms

Now that you know the sequencing part, the next thing to do is to learn the genome assembly concepts. We will use the Canu assembler version 2.0 to perform the genome assembly. More information on Canu here: https://github.com/marbl/canu
More info on Canu publication here: https://genome.cshlp.org/content/27/5/722.long

There are three important stages when assembling a genome with Canu.

  • 1. Correction – This step is to increase the accuracy of bases in all reads.
  • 2. Trimming – This step is to trim reads to the regions that consist of high-quality bases and discard suspicious sequences such as the PacBio adapter (i.e. SMRTbell).
  • 3. Assembly – Reads are ordered based on their overlaps to create contigs and produce final consensus sequences. In addition, this assembler can also output the assembly graphs in GFA format.

Methods

Copy an example Escherichia coli K12 PacBio fastq to a folder that you will name assembly. This is a subset of reads from this source: https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly

#from any folder on your server wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq mv ecoli_p6_25x.filtered.fastq pacbio.fastq

ls
pwd
mkdir assembly
cd assembly/
cp /pacbio.fastq .

Perform contig assembly

canu
canu -version
canu -p ecoli -d ecoli-pacbio genomeSize=4.8m corThreads=2 -pacbio-raw pacbio.fastq

To understand what the options mean, please read https://canu.readthedocs.io/en/latest/parameter-reference.html. Write your answer next to the options.

The essential options for Canu are:

  -p          
    -d          
    genomeSize  
    corThreads  
    -pacbio-raw
  

Show answer

-p          assembly-prefix
  -d          assembly-directory
  genomeSize  expected genome size
              is 4.8 megabases
  corThreads  number of threads for
              the correction step
  -pacbio-raw raw PB data
  

The program will run for quite a while depending on your computer and the number of cores you decided to use. We will go through some exercises to cement your knowledge on assembly.

If you have used older Canu version e.g. v1.7 and failed at the precompute stage due to Java Virtual Machine -d64 error, you can fix it by changing mhap.sh and precompute.sh. Or get a newer Canu version e.g. v2.0. Show more…

Errors

Fixing error demonstrates what a typical day in the life of a bioinformatician may be like. It will also show a feature of Canu that allows incomplete assembly steps to be resumed. First, we need to understand the problem. In case you encounter errors, you can find the last command when Canu tried to run something but failed. See:

/precompute.sh 1 > ./precompute.000001.out 2>&1
./precompute.sh 2 > ./precompute.000002.out 2>&1

Find the script, precompute.sh, and the corresponding *.out.

cat precompute.000002.out

Example of an error output:


    Running job 2 based on command line options.
    Dumping reads from 9001 to 12528 (inclusive).
    Starting mhap precompute.
    Unrecognized option: -d64
    Error: Could not create the Java Virtual Machine.
    Error: A fatal exception has occurred. Program will exit.
    Mhap failed
  
  

That is our problem right there. It is good to Google for ‘JVM -d64 problem’ and stackoverflow sometimes have the solutions when other users encountered similar problems, see stackoverflow.com. It is also a good practice to check on the official Canu website to see if the problem has been asked and resolved before.

cd ecoli-pacbio/correction/1-overlapper/
grep "d64" *.sh

Output:


    mhap.sh:arch=`uname -m | sed s/x86_64/amd64/`
    mhap.sh: /home/student/miniconda3/envs/denovo/bin/java -d64 -server -Xmx6144m
    \
    precompute.sh:arch=`uname -m | sed s/x86_64/amd64/`
    precompute.sh:/home/student/miniconda3/envs/denovo/bin/java -d64 -server -
    Xmx6144m \
  
  

Fix the script by removing the‘-d64’ option.

sed 's/-d64//' mhap.sh > tmp; mv tmp mhap.sh
sed 's/-d64//' precompute.sh > tmp; mv tmp precompute.sh

Resume assembly by rerunning the assembly command.

cd "you path"/canu_tutorial
canu -p ecoli -d ecoli-pacbio genomeSize=4.8m corThreads=2 -pacbio-raw pacbio.fastq

Exercises

1. How many reads do we need to sequence to achieve 25X coverage of a genome with size of 4.8 Mb, and each read length is 1 kb? Show answer: 4.8e6 * 25 / 1e3 = 120000 reads

2. Do you think PacBio generates the same read length each time? Why yes or why not? Show answer: No. The polymerase in each ZMW fails at different time.

3. Below you will find 7 reads. Try to piece them together to rebuild a reference sequence. In your process to piece the reads together, what problems did you encounter? Try it for 15 minutes before reading section 3.1 – 3.6 in below link for a more comprehensive answer on assembly problems. The Principles of WGS Sequencing and Automated Fragment Assembly


    GACACTCGTG
    CGTGCACTT
    CACTTGAGTA
    GAGTAGGCTA
    ACTCGTGCAC
    TGCACTTGA
    ACTTGAGTAGGC

Show the answer to question 3:

GACACTCGTGCACTTGAGTAGGCTA

Examine the Canu output

4. When Canu has finished with ‘Computing correction layouts’, it will output some summary statistics of the reads. Verify the coverage value from below table. What are the advantages of PacBio sequencing over Illumina sequencing when it comes to genome assembly? To refresh your knowledge on Illumina sequencing, please see:

Category Original raw reads w/overlaps Original raw reads w/o/overlaps*
Number of Reads 12,348 180
Number of Bases 114,631,263 218,769
Coverage 23.882 0.046
Median 7,651 0
Mean 9,283 1,215
N50 14,222 12,135
Minimum 1,000 0
Maximum 42,279 17,689
Show answer to question 4:

114,631,263/4.8e6 = 23.882 PB has much longer reads than can span most repeats and has random sequencing error. Check out the mean of 9kb here vs 100 or 150 PE that is typical from Illumina!


Assembly of a diploid genome

E. coli is a haploid genome and its genome size of ~5 Mb is considered small nowadays. It i not a very challenging task to assemble a genome such as E. coli. What is still challenging but is increasingly easier to do is the assembly of larger diploid genomes e.g. cattle is diploid and its genome size is ~2.7 Gb. Can you think of a reason why the assembly of diploid genome is difficult?

Canu is the first assembler that can completely phase a diploid genome. See section ‘Trio Binning Assembly’ in the Quick Start (https://canu.readthedocs.io/en/latest/quick-start.html) to see an example of how to separate a synthetic mix of two E. coli strains (i.e. creating ascenario that is similar to assembling a diploid genome). If you want to try this assembly, please install Canu version 2.0 (https://github.com/marbl/canu/releases) as older versions of the assembler were not equipped with the ability of phase genomes.

For information on a ‘real’ genome project and the first phasing of cattle genomes, please read below article. Can you identify what is the key difference that makes phasing works in the Canu pipeline? In other words, try to identify its key innovation. Hint: study figure 1 of the paper in detail. doi: 10.1101/720797v3.full

You may also be interested in

Tutorial: Genome assembly Quality Metrics

Pair-wise sequence alignment

Pair-wise sequence alignment methods

Construction of substitution matrices

DNA scoring matrices

Multiple sequence alignment (MSA) tools and resources

Single nucleotide polymorphism (SNP) tools and resources

Back to Tutorials Main Page