ntEdit

ntEdit polishes draft genome assemblies by correcting base substitutions and indels with a Bloom filter-based approach. It has been applied to datasets that include mammalian and conifer genomes.


Key Features:

  • Scalability: Scales linearly with input dataset size, enabling application to small model organisms and large genomes such as human and spruce.
  • Performance at Low Coverage: Operates effectively at low sequencing depth (<20×), achieving a correction rate above 97% for base substitutions and indels, with consistent performance across coverage levels.
  • Efficiency: Runs in under 14 seconds for Escherichia coli, under 3 minutes for Caenorhabditis elegans, and about 30–40 minutes for a sub-20× human genome dataset.
  • Application to Complex Genomes: Applied to long-read and linked-read assemblies of the human genome (NA12878) to correct frameshifts in coding sequences using high-coverage Illumina data, and to pseudo-haploid assemblies of large conifer genomes (interior and white spruce) within 4–5 hours.

Scientific Applications:

  • Genome Assembly Polishing: Corrects base-level errors in assembly drafts to improve sequence accuracy and reliability for downstream analyses.
  • Haploidization: Aids in haploidizing gene and genome sequences to simplify complex assemblies for further analysis.

Methodology:

ntEdit employs a Bloom filter-based approach to identify and correct sequence errors, enabling efficient processing of large datasets and effective operation at low read coverage.
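
The general idea of Bloom filter-based polishing can be illustrated with a minimal Python sketch. It is not ntEdit's implementation (ntEdit is written in C++ and also handles indels); it is a simplified, substitution-only illustration. All names here (BloomFilter, build_filter, support, polish_substitutions), the filter size, and the hash scheme are assumptions made for this example. The sketch stores read k-mers in a Bloom filter, flags draft positions where no overlapping k-mer is found in the filter, and keeps the alternative base that restores the most k-mer support.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filter(reads, k):
    """Insert every k-mer of every read into a new Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def support(seq, pos, k, bf):
    """Count the k-mers overlapping `pos` that are present in the filter."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return sum(seq[i:i + k] in bf for i in range(start, end + 1))


def polish_substitutions(draft, bf, k):
    """Replace bases that have no k-mer support with the best-supported base."""
    seq = draft
    for pos in range(len(seq)):
        if support(seq, pos, k, bf) > 0:
            continue  # at least one overlapping k-mer is backed by the reads
        best_seq, best_support = seq, 0
        for base in "ACGT":
            if base == seq[pos]:
                continue
            candidate = seq[:pos] + base + seq[pos + 1:]
            s = support(candidate, pos, k, bf)
            if s > best_support:
                best_seq, best_support = candidate, s
        seq = best_seq  # unchanged when no alternative base gains support
    return seq


if __name__ == "__main__":
    # Toy data: one error-free "read" and a draft with a single substitution.
    truth = "ACGTTGCATGGACCTAGGCTAACGTTAGC"
    draft = truth[:12] + "A" + truth[13:]
    bf = build_filter([truth], k=5)
    print(polish_substitutions(draft, bf, k=5) == truth)
```

Because a Bloom filter stores only set bits rather than the k-mers themselves, its memory footprint is fixed in advance, which is one reason this kind of design suits large genomes. The trade-off is a small false-positive rate, so an erroneous k-mer can occasionally appear supported.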

Details

License: GPL-3.0
Maturity: Mature
Cost: Free of charge
Tool Type: command-line tool
Operating Systems: Linux, Mac
Programming Languages: C++, C
Added: 8/9/2019
Last Updated: 6/16/2020

Publications

Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I. ntEdit: scalable genome sequence polishing. Bioinformatics. 2019;35(21):4430-4432. doi:10.1093/bioinformatics/btz400. PMID:31095290. PMCID:PMC6821332.

Funding:

  • Genome Canada and Genome BC: 243FOR, 281ANV
  • National Institutes of Health: 2R01HG007182-04A1
