SourceFinder

SourceFinder classifies sequences in bacterial genome assemblies as chromosomal, plasmid, or bacteriophage in origin, using k-mer distributions as input to a random forest machine-learning classifier.


Key Features:

  • Three-Class Classification: Distinguishes chromosomal, plasmid, and bacteriophage sequences using k-mer distribution analysis.
  • Random Forest Classifier: Uses a random forest algorithm as the core predictive model.
  • Training Dataset: Trained on 23,211 sequences from bacterial chromosomes, plasmids, and bacteriophages across hundreds of species.
  • Fragmentation Strategy: Complete sequences were fragmented into 5,000-nucleotide segments to represent sequence plasticity and incomplete assemblies.
  • K-mer Subdivision: Sequence fragments were subdivided into k-mers to capture short-subsequence patterns used for classification.
  • Performance (AUC): Achieves a minimum area under the receiver operating characteristic curve (AUC) of 0.939, including on simulated metagenomic scaffolds.
  • Adaptability to Incomplete Data: Training on subsampled and fragmented sequences enables handling of incomplete genomic assemblies.
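The AUC figure above can be made concrete with a small sketch. It assumes a one-vs-rest evaluation, in which each class is scored against the other two and the lowest per-class AUC is reported; the description does not say how the minimum was taken, so this is one plausible reading and not the authors' evaluation code. The function names and the toy probabilities are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (rank-sum formulation); tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, true_classes, classes):
    """Per-class AUC: column i of prob_rows is the score for classes[i]."""
    result = {}
    for i, cls in enumerate(classes):
        scores = [row[i] for row in prob_rows]
        labels = [t == cls for t in true_classes]
        result[cls] = roc_auc(scores, labels)
    return result


# Toy, made-up class probabilities for four fragments (illustration only).
CLASSES = ("chromosome", "plasmid", "phage")
toy_probs = [
    (0.8, 0.1, 0.1),
    (0.2, 0.7, 0.1),
    (0.1, 0.2, 0.7),
    (0.5, 0.4, 0.1),
]
toy_truth = ["chromosome", "plasmid", "phage", "plasmid"]

per_class = one_vs_rest_auc(toy_probs, toy_truth, CLASSES)
minimum_auc = min(per_class.values())
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based rank computation would be the usual choice for large evaluation sets.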

Scientific Applications:

  • Extra-chromosomal element identification: Identifies plasmid and bacteriophage sequences, which can carry antimicrobial resistance, metal resistance, and virulence genes.
  • Gene dissemination analysis: Assesses the distribution and dissemination of accessory genes across microbial communities.
  • Horizontal gene transfer and evolutionary studies: Supports analysis of horizontal gene transfer and evolutionary dynamics within microbiomes.

Methodology:

A random forest classifier was trained on 23,211 bacterial chromosome, plasmid, and bacteriophage sequences that were fragmented into 5,000-nucleotide segments and subdivided into k-mers to model k-mer distributions for classification.
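The fragmentation and k-mer steps described above can be sketched as follows. Only the 5,000-nucleotide fragment length comes from the description; the value of k, the use of non-overlapping fragments, keeping the shorter final fragment, and normalizing counts to frequencies are all assumptions made for illustration, and the function names are hypothetical. The resulting vector is the kind of feature a random forest could be trained on.

```python
from collections import Counter
from itertools import product

FRAGMENT_LENGTH = 5000  # fragment size stated in the description
K = 4                   # hypothetical k; the description does not give a value


def fragment(sequence, size=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping fragments.
    The last fragment may be shorter than `size` (an assumption)."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def kmer_frequencies(seq, k=K):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order.
    K-mers containing characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Usage: one feature vector per fragment of a toy 12,000-nt sequence.
toy_sequence = "ACGT" * 3000
features = [kmer_frequencies(f) for f in fragment(toy_sequence)]
```

Each fragment yields a fixed-length vector (256 values for k = 4), so fragments of different lengths and from different assemblies remain comparable.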

Details

Cost: Free of charge
Tool Type: web application
Operating Systems: Mac, Linux, Windows
Added: 1/25/2023
Last Updated: 11/24/2024

Publications

Aytan-Aktug D, Grigorjev V, Szarvas J, Clausen PTLC, Munk P, Nguyen M, Davis JJ, Aarestrup FM, Lund O. SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies. Microbiology Spectrum. 2022;10(6). doi:10.1128/spectrum.02641-22. PMID:36377945. PMCID:PMC9769690.

Funding:

  • HHS | NIH | National Institute of Allergy and Infectious Diseases: 75N93019C00076
  • Novo Nordisk Fonden: NNF16OC0021856