Canu

Canu performs de novo genome assembly from noisy single-molecule long reads generated by Pacific Biosciences (PacBio) and Oxford Nanopore instruments, producing high-quality reference assemblies.


Key Features:

  • Support for Long-Read Sequencing: Optimized for long-read data from Pacific Biosciences (PacBio) and Oxford Nanopore for improved reconstruction of reference genomes.
  • Error Rate Management: Implements algorithmic strategies to handle high error rates in single-molecule long reads while resolving large repeats and closely related haplotypes.
  • Adaptive Overlapping Strategy: Uses a tf-idf weighted MinHash adaptive overlapping strategy to improve overlap detection accuracy and assembly continuity.
  • Sparse Assembly Graph Construction: Constructs sparse assembly graphs to prevent collapse of diverged repeats and haplotypes for more accurate assemblies.
  • Reduced Depth-of-Coverage Requirements: Lowers depth-of-coverage requirements by approximately half compared to Celera Assembler 8.2.
  • Improved Runtime Efficiency: Reduces runtime by an order of magnitude on large genomes relative to Celera Assembler 8.2.
  • High Assembly Continuity and Quality: Capable of producing complete microbial genomes and near-complete eukaryotic chromosomes, achieving contig NG50 > 21 Mbp on human and Drosophila melanogaster PacBio datasets.
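
The contig NG50 statistic quoted in the last feature can be made concrete with a short calculation. The Python sketch below is illustrative only: the function name `ng50` and the toy contig lengths are assumptions made for this example, not part of Canu. NG50 is the contig length at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the estimated genome size (N50 uses half of the total assembly length instead).

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    Contigs are visited from longest to shortest; the NG50 is the length of
    the contig at which the running total first reaches half of the
    estimated genome size. Returns 0 if the assembly never reaches it.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy values (hypothetical): five contigs and an estimated genome size of 200.
contigs = [50, 40, 30, 20, 10]
print(ng50(contigs, genome_size=200))  # running totals 50, 90, 120 -> prints 30
```

Because NG50 is normalized by genome size rather than by assembly size, it allows assemblies of the same genome to be compared even when their total lengths differ.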

Scientific Applications:

  • Complex genome assembly: Reconstruction of microbial and eukaryotic genomes from long-read sequencing data.
  • Reference-quality genome generation: Automated production of reference-quality genome assemblies for downstream genomic analyses.
  • Graph-based integration: Outputs assembly graphs in GFA format for integration with phasing and scaffolding techniques.
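
As a rough illustration of how a downstream phasing or scaffolding step might consume a GFA assembly graph, the Python sketch below reads segment (`S`) and link (`L`) records from GFA 1 text. It is a minimal sketch under stated assumptions: the function name `parse_gfa` and the contig names `tig1` and `tig2` are hypothetical, only `S` and `L` records are handled, and the exact tags present in a real Canu output file are not assumed here.

```python
def parse_gfa(text):
    """Parse S (segment) and L (link) records from GFA 1 text.

    Returns (segments, links), where segments maps a segment name to its
    length (or None if unknown) and links is a list of
    (from_name, from_orient, to_name, to_orient, overlap) tuples.
    """
    segments = {}
    links = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if fields[0] == "S":
            name, sequence = fields[1], fields[2]
            # The sequence may be "*" (not stored); the LN tag then gives the length.
            length = None if sequence == "*" else len(sequence)
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[len("LN:i:"):])
            segments[name] = length
        elif fields[0] == "L":
            links.append(tuple(fields[1:6]))
    return segments, links


# Hypothetical two-contig graph used only for demonstration.
example = (
    "H\tVN:Z:1.0\n"
    "S\ttig1\tACGTAC\n"
    "S\ttig2\t*\tLN:i:1200\n"
    "L\ttig1\t+\ttig2\t-\t4M\n"
)
segments, links = parse_gfa(example)
print(segments)  # {'tig1': 6, 'tig2': 1200}
print(links)     # [('tig1', '+', 'tig2', '-', '4M')]
```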

Methodology:

Uses novel overlapping and assembly algorithms including an adaptive overlapping strategy based on tf-idf weighted MinHash and sparse assembly graph construction to handle high error rates and avoid collapse of diverged repeats and haplotypes.
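
The idea behind tf-idf weighted MinHash can be sketched in a few lines of Python. This is a simplified illustration, not Canu's implementation: the weighting formula, the replication scheme (hashing a k-mer several times in proportion to its weight), and all names and parameters (`idf_table`, `weighted_sketch`, `num_hashes`, `max_weight`) are assumptions chosen for clarity. The point it shows is that k-mers occurring in many reads, which are likely repeats, receive a low weight and so influence the sketch less than rare, more informative k-mers.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_table(reads, k):
    """Inverse document frequency: log(N / number of reads containing the k-mer)."""
    doc_freq = Counter()
    for read in reads:
        doc_freq.update(set(kmers(read, k)))
    return {kmer: math.log(len(reads) / n) for kmer, n in doc_freq.items()}


def hash_value(kmer, seed, replica):
    """Deterministic 64-bit hash of (seed, replica, k-mer)."""
    data = f"{seed}:{replica}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_sketch(read, k, idf, num_hashes=64, max_weight=3):
    """MinHash sketch in which each k-mer is hashed `weight` times per hash
    function, so higher tf-idf k-mers are more likely to supply the minimum."""
    max_idf = max(idf.values(), default=0.0) or 1.0
    weights = {}
    for kmer, tf in Counter(kmers(read, k)).items():
        score = tf * idf.get(kmer, max_idf) / max_idf  # assumed scaling
        weights[kmer] = 1 + min(max_weight - 1, int(score * (max_weight - 1)))
    return [
        min((hash_value(kmer, seed, r)
             for kmer, w in weights.items() for r in range(w)), default=None)
        for seed in range(num_hashes)
    ]


def similarity(sketch_a, sketch_b):
    """Fraction of sketch positions where both reads share the same minimum."""
    same = sum(a == b and a is not None for a, b in zip(sketch_a, sketch_b))
    return same / len(sketch_a)


# Toy reads (hypothetical): two near-identical reads and one unrelated read.
reads = ["ACGTTGCAAGGCTA", "ACGTTGCAAGGCTT", "TTTTTTTTTTTTTT"]
idf = idf_table(reads, k=4)
sketches = [weighted_sketch(r, 4, idf) for r in reads]
print(similarity(sketches[0], sketches[1]))  # high for the overlapping pair
print(similarity(sketches[0], sketches[2]))  # no shared k-mers
```

Comparing fixed-size sketches instead of full k-mer sets keeps the cost of each read-to-read comparison independent of read length, which is what makes all-versus-all overlap detection tractable on long reads.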

Details

Tool Type: command-line tool
Operating Systems: Linux, Mac
Programming Languages: Shell, Perl
Added: 11/27/2017
Last Updated: 11/24/2024

Publications

Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. 2017;27(5):722-736. doi:10.1101/gr.215087.116. PMID:28298431. PMCID:PMC5411767.

Funding:

  • National Institutes of Health: HSHQDC-07-C-00020
  • National Science Foundation: NSF IOS-1237993