LINflow
LINflow computes genomic similarity matrices for prokaryotic genomes by combining alignment-free and alignment-based approaches to support genome-based classification and identification of strains.
Key Features:
- Alignment-free candidate selection: Uses the alignment-free tool sourmash to identify the genome in a dataset that is most similar to a given query genome, which reduces computational demands.
- Alignment-based ANI calculation: Uses pyani to compute Average Nucleotide Identity (ANI) values between the query genome and the candidate identified by sourmash.
- Hybrid approach: Integrates alignment-free and alignment-based methods to optimize the trade-off between computational efficiency and ANI precision.
- Incremental dataset integration: Applies the same identification and ANI computation process iteratively when adding new genomes to an existing dataset.
- LIN storage and inference: Stores computed ANI values as Life Identification Numbers (LINs) and uses them to infer other pairwise ANI values within the dataset.
- Performance benchmarking: Demonstrated up to 150-fold speed improvement over pyani on four datasets totaling 484 genomes while maintaining high correlation with pyani-computed ANI values, with occasional minimal discrepancies.
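The LIN-based inference mentioned above can be illustrated with a minimal sketch. It assumes that each LIN position corresponds to an ANI threshold and that two genomes sharing a LIN prefix are roughly at least as similar as the threshold of the last shared position. The threshold values and the function names `shared_prefix_length` and `infer_min_ani` are hypothetical and are not taken from LINflow itself.

```python
from typing import Optional, Sequence

# Hypothetical ANI thresholds (percent), one per LIN position.
# These values are illustrative assumptions, not LINflow's actual scale.
THRESHOLDS = (70.0, 80.0, 90.0, 95.0, 99.0, 99.9)


def shared_prefix_length(lin_a: Sequence[int], lin_b: Sequence[int]) -> int:
    """Count the leading LIN positions at which both genomes agree."""
    n = 0
    for a, b in zip(lin_a, lin_b):
        if a != b:
            break
        n += 1
    return n


def infer_min_ani(
    lin_a: Sequence[int],
    lin_b: Sequence[int],
    thresholds: Sequence[float] = THRESHOLDS,
) -> Optional[float]:
    """Return the approximate lower ANI bound implied by the shared prefix.

    Returns None when the LINs already differ at the first position,
    in which case nothing can be inferred under this scheme.
    """
    k = min(shared_prefix_length(lin_a, lin_b), len(thresholds))
    return thresholds[k - 1] if k > 0 else None
```

For example, under these assumed thresholds, `infer_min_ani((0, 0, 1, 0, 0, 0), (0, 0, 1, 2, 0, 0))` yields the threshold of the third position, without any new alignment being computed.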
Scientific Applications:
- Genome-based classification and strain identification: Generates ANI matrices to support classification and identification of prokaryotic strains.
- Continuous dataset updating: Rapidly integrates newly sequenced genomes into existing similarity matrices for ongoing genomic surveillance or comparative studies.
- Large-scale ANI estimation and inference: Enables efficient estimation of ANI and inference of pairwise similarities across large prokaryotic genome collections.
Methodology:
LINflow first uses sourmash (alignment-free) to find the genome in the dataset most similar to a query genome, then uses pyani (alignment-based) to compute ANI between the query and that candidate. The computed ANI value is stored as a Life Identification Number (LIN). The same process is applied iteratively as new genomes are added, and the stored LINs enable inference of other pairwise ANI values.
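The control flow described here can be sketched as follows. This is an illustration only, not LINflow's code: the functions `add_genome` and `build_ani_cache` are hypothetical, and the two callables `fast_similarity` and `precise_ani` are stand-ins for the sourmash and pyani steps, whose real interfaces are not shown.

```python
from typing import Callable, Dict, List, Optional, Tuple

Similarity = Callable[[str, str], float]


def add_genome(
    query: str,
    dataset: List[str],
    fast_similarity: Similarity,   # stand-in for the alignment-free step
    precise_ani: Similarity,       # stand-in for the alignment-based step
    ani_cache: Dict[Tuple[str, str], float],
) -> Optional[Tuple[str, float]]:
    """Add one genome, running the precise comparison only once."""
    if not dataset:
        # The first genome has nothing to be compared against.
        dataset.append(query)
        return None
    # Cheap screen: pick the single most similar genome already present.
    candidate = max(dataset, key=lambda g: fast_similarity(query, g))
    # Expensive step: one precise ANI computation per added genome.
    ani = precise_ani(query, candidate)
    ani_cache[(query, candidate)] = ani
    dataset.append(query)
    return candidate, ani


def build_ani_cache(
    genomes: List[str],
    fast_similarity: Similarity,
    precise_ani: Similarity,
) -> Dict[Tuple[str, str], float]:
    """Apply the same step iteratively to every genome in order."""
    dataset: List[str] = []
    cache: Dict[Tuple[str, str], float] = {}
    for genome in genomes:
        add_genome(genome, dataset, fast_similarity, precise_ani, cache)
    return cache
```

In this sketch, n genomes require n - 1 precise comparisons instead of the n(n - 1)/2 needed for an all-against-all matrix, which is the kind of saving the hybrid approach aims for.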
Details
- License: MIT
- Tool Type: workflow
- Programming Languages: Python
- Added: 10/4/2021
- Last Updated: 10/4/2021
Publications
Tian L, Mazloom R, Heath LS, Vinatzer BA. LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes. PeerJ. 2021;9:e10906. doi:10.7717/peerj.10906. PMID:33828908. PMCID:PMC8000461.
Downloads
- Biological data: https://code.vt.edu/linbaseproject/linflow_datasets