Bacterial Genome

Genome assembly: PacBio, Illumina and Illumina+PacBio hybrid data

In this tutorial we will learn to assemble a bacteria genome with long reads (PacBio), short reads (Illumina) and the combination of both short and long reads (hybrid). Then we will compare them to assess their quality.

Background

We will use data from Lactobacillus fermentum strain (8633) that has been associated with an overall promotion of brain health in rats, including a reduction in anxiety- and depression-related behaviours (Ong, J.S. et al 2020, https://www.sciencedirect.com/science/article/pii/S0888754319309814). Genomic DNA of the L. fermentum 8633 was sequenced using PacBio RS2 and Illumina HiSeq 2000. The data associated with this work can be downloaded from here: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA312743/

Overall steps

Download raw sequences
Filter out bad sequences
Assemble the genome
Check the quality of the genome

1 - Download raw sequences

Illumina paired end (PE) reads

Software: install SRA Toolkit from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

Software version: SRA-Toolkit 2.8.2-1

Source of Illumina data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3205957

Download

$ wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3205957/SRR3205957_1_unpaired

$ prefetch SRR3205957

To use prefetch, you will need to have the SRA Toolkit installed.

$ fastq-dump --split-files SRR3205957

PacBio reads

Software: install smrtlink from https://www.pacb.com/support/software-downloads/

Software version: smrtlink v6.0.0

Source of PacBio data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3205956

Download

$ wget https://sra-pub-src-1.s3.amazonaws.com/SRR3205956/m140916_193254_42181_c100662392550000001823127111271464_s1_p0.1.bax.h5.1$ wget https://sra-pub-src-1.s3.amazonaws.com/SRR3205956/m140916_193254_42181_c100662392550000001823127111271464_s1_p0.2.bax.h5.1$ wget https://sra-pub-src-1.s3.amazonaws.com/SRR3205956/m140916_193254_42181_c100662392550000001823127111271464_s1_p0.3.bax.h5.1

As we only need the extension suffix to be *h5, rename the downloaded files. Example

$ bax2bam -o lfer8633pbbam m140916_193254_42181_c100662392550000001823127111271464_s1_p0.1.bax.h5 m140916_193254_42181_c100662392550000001823127111271464_s1_p0.2.bax.h5 m140916_193254_42181_c100662392550000001823127111271464_s1_p0.3.bax.h5$ bam2fastq -o lfer8633pb lfer8633pbbam.subreads.bam

2 – Filter out bad sequences

Use: Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)

Key features:

Find reads for removal. Mainly used to remove sequencing adapters but it can also be used to remove contaminants.
Produces compressed fastq files to save data storage space.
Keeps single end reads that are unpaired in paired-end reads.
Software version used: v0.32

$ java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.38.jar PE SRR3205957_1.fastq SRR3205957_2.fastq SRR3205957_1_paired.fastq.gz SRR3205957_1_unpaired.fastq.gz SRR3205957_2_paired.fastq.gz SRR3205957_2_unpaired.fastq.gz ILLUMINACLIP:/apps/software/Trimmomatic/0.38-Java-1.8.0_121/adapters/TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:30

3 - Assemble the genome

Use: SPAdes (https://github.com/ablab/spades)

Key features:

One of the best assemblers for small genomes such as bacteria.
de Bruijn assembler optimized for short reads.
Supports more than one sequencing platforms e.g. Illumina, PacBio.
Software version used: 3.12.0

Installation: Download the Linux binary from the website and unpack it. It also requires Python to be installed.

$ wget http://cab.spbu.ru/files/release3.12.0/SPAdes-3.12.0-Linux.tar.gz $ tar –xzf SPAdes-3.12.0-Linux.tar.gz

Assembly command

$ spades.py --pe1-1 SRR3205957_1_paired.fastq.gz --pe1-2 SRR3205957_2_paired.fastq.gz --pe1-s SRR3205957_1_unpaired.fastq.gz --pe1-s SRR3205957_2_unpaired.fastq.gz --careful -t 4 -o lfer8633_illupe

Paired-end reads that survived the Trimmomatic filter were used as input with the the --pe argument. The first number after “pe” refers to the arbitrary library number where the reads came from whereas the subsequent number refers to the pairing i.e. 1 for forward, 2 for reverse.
Reads that were turned into single-end reads after Trimmomatic can still be used as input for assembly by using the --pe argument with the addition of the –s flag.
We added the --careful argument to reduce mismatches and short indels even though this will increase computational time.
The -t argument is optional and specify the number of threads, which is 4 in this case.
The name for the output directory is given with the –o argument.

Important Output file(s):

(1) lfer8633_illupe/scaffolds.fasta

Rename scaffolds.fasta to lfer8633_illum.fasta

Summary results: 1347 scaffolds

3b-Hybrid assembly with both Illumina and PacBio reads

Use: SPAdes (https://github.com/ablab/spades)

Key features and installation: (refer to “Short paired-end read assembly” above).

In this demonstration, we have both Illumina and PacBio reads of L. fermentum 8633. You should only assemble reads from the same individual to ensure good assembly quality.

PacBio reads are excellent for gap closure and resolving long repeats because their average read length is long (e.g. >10 Kb). There are different categories of PacBio reads (e.g. CLR for continuous long reads, CCS for circular consensus sequencing reads) and for the purpose here, we have used filtered subreads in FASTQ format.

Assembly command

$ spades.py --pe1-1 SRR3205957_1_paired.fastq.gz --pe1-2 SRR3205957_2_paired.fastq.gz --pe1-s SRR3205957_1_unpaired.fastq.gz --pe1-s SRR3205957_2_unpaired.fastq.gz --pacbio lfer8633pb.fastq --careful -t 4 -o lfer8633_illupe_pacbio

An important parameter to adjust: -k, which determines the k-mer size to index.
Technically, the Spades assembler only used PacBio long reads to improve assembly continuity but not used them for assembly directly. The argument --pacbio indicates PacBio reads as input in fastq format.

Important Output file(s):

lfer8633_illupe_pacbio /scaffolds.fasta

Rename scaffolds.fasta to lfer8633_illum_pb.fasta

Summary results: 962 scaffolds.

3c-Long SE read assembly (PacBio)

We will use the Canu assembler (https://github.com/marbl/canu/releases/tag/v2.0) for long read assembly. This assembler is based on the Celera Assembler (http://wgs-assembler.sourceforge.net/).

Key features:

Overlap Layout Consensus (OLC) assembler optimized for long reads.
Mainly for sequencing platforms that output long reads e.g. Sanger, PacBio, Oxford Nanopore.
It implements correction, trimming and assembly as part of its pipeline.

For assembling long reads, we can use the Overlap Consensus (OLC) method. This is implemented in the Celera Assembler, which Canu has adapted its code.

Celera Assembler is a part of the SMRT Analysis suite of programs that is used to clean, process and assemble PacBio reads. There is a pipeline called Canu that contains steps to assemble the bacteria genome: http://canu.readthedocs.org/en/latest/quick-start.html#quickstart

Version used: Canu 2.0

Installation: Download the Linux binaries version from the website and unpack the file.

Assembly command

$ canu -p lfer8633 -d lfer8633_pacbio genomeSize=2.2m corThreads=2 gridOptions="-N 1" -pacbio lfer8633pb.fastq.gz

Canu detects the PacBio fastq file with the –pacbio argument.
The assembly output goes to the directory named using –d with files having prefixes named using –p.
The genomeSize parameter can be a rough estimate of the expected genome size.
Note that the gridOptions="-N 1" argument is only needed in this case as the run was set on a server that needs to know this information. You should try without this argument to see if it works.

Important Output file(s):

(1) lfer8633.contigs.fasta

Rename to lfer8633_pb.fasta.

Summary results: 5 scaffolds.

4-Check the quality of the genome

There are at least two assembly metrices to assess the quality of a genome:

Statistical
Evolutionary

The three assemblies are assessed as follows:

Statistical

Use: QUAST (http://quast.sourceforge.net/quast.html)

Key features:
Works both with and without a given genome reference.
Able to do multiple genome comparisons.
Generates interactive reports that can be opened in web browsers.

Software version used: 4.5

Installation: Download the source code from the website and unpack the file. This tool requires Python to be installed. QUAST is clever in that it also installs the necessary software during its first use even if you have not set them up (see the README.md).

Since this is a de novo assembly demonstration, let’s first performed QUAST without a reference genome. Then, in order to demonstrate how to compare to a reference genome, we will use the L. fermentum genome that was assembled using the PacBio HGAP software.

We will proceed to compare the metrices:

$ quast.py lfer8633_illum.fasta lfer8633_illum_pb.fasta lfer8633_pb.fasta --glimmer --scaffolds -o lfer8633_quast#compare to a reference$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/676/625/GCF_009676625.1_ASM967662v1/GCF_009676625.1_ASM967662v1_genomic.fna.gz$ gzip -d GCF_009676625.1_ASM967662v1_genomic.fna.gz $ mv GCF_009676625.1_ASM967662v1_genomic.fna lfer8633_hgap.fasta$quast.py lfer8633_illum.fasta lfer8633_illum_pb.fasta lfer8633_pb.fasta --glimmer --scaffolds -o lfer8633_quast -R lfer8633_hgap.fasta

The arguments should be self-explanatory. If not, run quast.py again to see details of its arguments.
It is a good idea to investigate the following parameters: --gage, --contig-thresholds, --use-all-alignments, --ambiguity-usage, --strict-NA, --extensive-mis-size

Important Output file(s):

report.pdf
report.html

The HTML files can be opened in a web browser for interactive reports.

The QUAST results show that the PacBio assembly is the best because it has only five contigs and scaffold N50 of 1,806,648 bp that is very close to the total length of the genome (~2.2 Mb).

Evolutionary

Use: BUSCO (http://busco.ezlab.org/)

Key features:

Uses universal single-copy orthologs to benchmark genome quality.
A successor to CEGMA (http://korflab.ucdavis.edu/datasets/cegma/).

Software version: v2.0.1

Installation: Download the program and unpack it. There are a number of dependencies needed to get BUSCO running. Please refer to the BUSCO_userguide.pdf file for detailed instructions. The following dependencies were used:

BLAST+ 2.6.0
hmmer-3.1b2
augustus-3.2.3

BUSCO searches for the presence of evolutionary conserved genes. A closely related model of expected single copy orthologs is used to compare against genes predicted in the newly assembled genome. Download the necessary BUSCO profile for your type of organism. In this case, we use the bacteria profile which can be downloaded:

$ wget http://busco.ezlab.org/v2/datasets/bacteria_odb9.tar.gz

The file should be unpacked in a location, such as BUSCO’s directory (/path/to/BUSCO_v2.0.1/) or the path where you have the assembly fasta files:

Make the runs for each assembly:

1. Illumina only assembly:

$ BUSCO.py -i lfer8633_illum.fasta -o lfer8633_illum -c 4 -l bacteria_odb9/ -m geno

2. Illumina + PacBio assembly:

$ BUSCO.py -i lfer8633_illum_pb.fasta -o lfer8633_illum_pb -c 4 -l bacteria_odb9/ -m geno

3. PacBio only assembly:

$ BUSCO.py -i lfer8633_pb.fasta -o lfer8633_pb -c 4 -l bacteria_odb9/ -m geno

Important Output file(s):

/run_lfer8633_illum_pb/short_summary_lfer8633_illum_pb.txt

/run_lfer8633_pb/short_summary_lfer8633_pb.txt

/run_lfer8633_illum/short_summary_lfer8633_illum.txt

Table 1. BUSCO results of the three assemblies.

	Illumina	Illumina+PacBio	PacBio
Complete Single-Copy BUSCOs	143	143	137
Complete Duplicated BUSCOs	2	2	2
Fragmented BUSCOs	1	1	1
Missing BUSCOs	2	2	5
Total BUSCO groups searched	148	148	148

The BUSCO results indicate that both Illumina and hybrid Illumina + PacBio assemblies are the best because they only missed 2 known BUSCO genes and had only 1 fragmented bacterial gene. The PacBio assembly is the worst according to the BUSCO metric as it has the highest number of fragmented and missing BUSCO genes.