ntEdit
ntEdit polishes draft genome assemblies by correcting base substitutions and indels with a Bloom filter-based approach. It has been used to improve assembly accuracy for datasets including mammalian and conifer genomes.
Key Features:
- Scalability: Scales linearly with input dataset size, enabling application to small model organisms and large genomes such as human and spruce.
- Performance at Low Coverage: Operates effectively at low sequence depths (<20×), achieving over 97% correction rate for base substitutions and indels and maintaining consistent performance across coverage levels.
- Efficiency: Demonstrated fast runtimes (under 14 seconds for Escherichia coli, under 3 minutes for Caenorhabditis elegans, and ~30–40 minutes for a sub-20× human genome dataset).
- Application to Complex Genomes: Applied to long-read and linked-read assemblies of the human genome (NA12878) to correct frameshifts in coding sequences using high-coverage Illumina data, and to pseudo-haploid assemblies of large conifer genomes (interior and white spruce) within 4–5 hours.
Scientific Applications:
- Genome Assembly Polishing: Corrects base-level errors in assembly drafts to improve sequence accuracy and reliability for downstream analyses.
- Haploidization: Aids in haploidizing gene and genome sequences to simplify complex assemblies for further analysis.
Methodology:
ntEdit employs a Bloom filter-based approach to identify and correct sequence errors, enabling efficient processing of large datasets and effective operation at low read coverage.
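The general idea of Bloom filter-based polishing can be illustrated with a minimal Python sketch. This is not ntEdit's implementation (which is written in C++/C); it is a simplified illustration under stated assumptions: k-mers from the reads are stored in a small Bloom filter, and a draft base is replaced only when an alternative base gives strictly more read-supported k-mers at that position. Indel correction is omitted, and all names (`BloomFilter`, `build_filter`, `support`, `polish_substitutions`) are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative only, not ntEdit's data structure)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filter(reads, k):
    """Insert every k-mer of every read into a Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def support(seq, pos, k, bf):
    """Return (present, total) for the k-mers of seq that overlap position pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    total = max(0, end - start + 1)
    present = sum(seq[i:i + k] in bf for i in range(start, end + 1))
    return present, total


def polish_substitutions(draft, k, bf):
    """Replace a base only when another base has strictly more k-mer support."""
    seq = draft
    for pos in range(len(seq)):
        best_support, total = support(seq, pos, k, bf)
        if best_support == total:
            continue  # every overlapping k-mer is already supported by the reads
        best_seq = seq
        for base in "ACGT":
            if base == seq[pos]:
                continue
            candidate = seq[:pos] + base + seq[pos + 1:]
            candidate_support, _ = support(candidate, pos, k, bf)
            if candidate_support > best_support:
                best_support, best_seq = candidate_support, candidate
        seq = best_seq
    return seq


# Toy data (made up for illustration): reads sampled from a known sequence,
# and a draft carrying a single substitution at position 20.
truth = "ACGTTGCAAGCTTAGGCTAACCGATTCGGATCCATGACTGA"
reads = [truth[i:i + 25] for i in range(0, len(truth) - 24, 4)]
draft = truth[:20] + "T" + truth[21:]
bf = build_filter(reads, k=12)
polished = polish_substitutions(draft, 12, bf)
print(polished == truth)
```

Because a Bloom filter stores only bit positions rather than the k-mers themselves, memory use stays fixed regardless of how many k-mers are inserted, at the cost of occasional false positives. Requiring strictly greater support before editing is one simple way to keep such false positives from triggering spurious changes in this sketch.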
Details
- License: GPL-3.0
- Maturity: Mature
- Cost: Free of charge
- Tool Type: command-line tool
- Operating Systems: Linux, Mac
- Programming Languages: C++, C
- Added: 8/9/2019
- Last Updated: 6/16/2020
Publications
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I. ntEdit: scalable genome sequence polishing. Bioinformatics. 2019;35(21):4430-4432. doi:10.1093/bioinformatics/btz400. PMID:31095290. PMCID:PMC6821332.
Funding:
- Genome Canada and Genome BC: 243FOR, 281ANV
- National Institutes of Health: 2R01HG007182-04A1
Downloads
- Source code: https://github.com/bcgsc/ntEdit/releases
Links
- Issue tracker: https://github.com/bcgsc/ntedit/issues