SourceFinder
SourceFinder classifies sequences in bacterial genome assemblies as chromosomal, plasmid, or bacteriophage in origin, using k-mer distributions and a random forest machine-learning classifier.
Key Features:
- Three-Class Classification: Distinguishes chromosomal, plasmid, and bacteriophage sequences using k-mer distribution analysis.
- Random Forest Classifier: Utilizes a random forest algorithm as the core predictive model.
- Training Dataset: Trained on 23,211 sequences from bacterial chromosomes, plasmids, and bacteriophages across hundreds of species.
- Fragmentation Strategy: Complete sequences were fragmented into 5,000-nucleotide segments to represent sequence plasticity and incomplete assemblies.
- K-mer Subdivision: Sequence fragments were subdivided into k-mers to capture short-subsequence patterns used for classification.
- Performance (AUC): Achieves a minimum area under the receiver operating characteristic curve (AUC) of 0.939, including on simulated metagenomic scaffolds.
- Adaptability to Incomplete Data: Training on subsampled and fragmented sequences enables handling of incomplete genomic assemblies.
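The fragmentation and k-mer subdivision steps listed above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names `fragment` and `count_kmers` are hypothetical, the value `K = 4` is an assumption (this entry does not state the k-mer length used), and keeping the shorter trailing segment is likewise a choice made only for this sketch.

```python
from collections import Counter

FRAGMENT_LENGTH = 5000  # segment size stated in this entry
K = 4                   # illustrative assumption; the entry does not state k


def fragment(sequence, length=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping segments.

    Keeping the shorter trailing segment is an assumption of this sketch.
    """
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def count_kmers(segment, k=K):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    segment = segment.upper()
    counts = Counter()
    for i in range(len(segment) - k + 1):
        kmer = segment[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts
```

For example, `fragment` applied to a 12,000-nucleotide sequence yields segments of 5,000, 5,000, and 2,000 nucleotides, and `count_kmers` can then be applied to each segment separately.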
Scientific Applications:
- Extra-chromosomal element identification: Detects plasmids and bacteriophages that encode antimicrobial resistance, metal resistance, and virulence genes.
- Gene dissemination analysis: Assesses the distribution and dissemination of accessory genes across microbial communities.
- Horizontal gene transfer and evolutionary studies: Supports analysis of horizontal gene transfer and evolutionary dynamics within microbiomes.
Methodology:
A random forest classifier was trained on 23,211 bacterial chromosome, plasmid, and bacteriophage sequences that were fragmented into 5,000-nucleotide segments and subdivided into k-mers to model k-mer distributions for classification.
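To make "k-mer distributions" concrete, the sketch below turns one fragment into a fixed-length vector of relative k-mer frequencies, the kind of feature vector a random forest implementation could take as input. It is an illustration under stated assumptions rather than the published pipeline: the function name `kmer_distribution` is hypothetical, `k=4` is an assumed default, and normalizing counts to frequencies is a choice made here, not something this entry confirms.

```python
from itertools import product


def kmer_distribution(segment, k=4):
    """Return relative k-mer frequencies over a fixed ACGT vocabulary.

    The vocabulary order is lexicographic, so every fragment maps to a
    vector of the same length (4**k). k=4 is an assumption of this sketch.
    """
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(vocabulary, 0)
    segment = segment.upper()
    for i in range(len(segment) - k + 1):
        kmer = segment[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    # An empty or fully ambiguous fragment yields an all-zero vector.
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]
```

Because every vector has the same length and ordering, fragments of different lengths become directly comparable, which suits training on fragmented and subsampled sequences.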
Details:
- Cost: Free of charge
- Tool Type: Web application
- Operating Systems: Mac, Linux, Windows
- Added: 1/25/2023
- Last Updated: 11/24/2024
Publications:
Aytan-Aktug D, Grigorjev V, Szarvas J, Clausen PTLC, Munk P, Nguyen M, Davis JJ, Aarestrup FM, Lund O. SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies. Microbiology Spectrum. 2022;10(6). doi:10.1128/spectrum.02641-22. PMID:36377945. PMCID:PMC9769690.