LINflow

LINflow computes genomic similarity matrices for prokaryotic genomes by combining alignment-free and alignment-based approaches to support genome-based classification and identification of strains.


Key Features:

  • Alignment-free candidate selection: Uses the alignment-free tool sourmash to identify the genome in a dataset that is most similar to a given query genome, which limits the number of costly alignments required.
  • Alignment-based ANI calculation: Uses pyani to compute Average Nucleotide Identity (ANI) values between the query genome and the candidate identified by sourmash.
  • Hybrid approach: Integrates alignment-free and alignment-based methods to optimize the trade-off between computational efficiency and ANI precision.
  • Incremental dataset integration: Applies the same identification and ANI computation process iteratively when adding new genomes to an existing dataset.
  • LIN storage and inference: Stores computed ANI values as Life Identification Numbers (LINs) and uses them to infer other pairwise ANI values within the dataset.
  • Performance benchmarking: Ran up to 150 times faster than pyani on four datasets totaling 484 genomes. The resulting ANI values correlated highly with those computed by pyani and differed only minimally in occasional cases.
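
The candidate-selection and hybrid steps listed above can be illustrated with a minimal Python sketch. It does not call sourmash or pyani: a plain k-mer Jaccard index stands in for the alignment-free screen, and the caller supplies the expensive comparison as a function. All names here (kmers, jaccard, pick_candidate, add_genome, toy_identity) are hypothetical and are not part of LINflow.

```python
def kmers(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets (stand-in for an alignment-free score)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_candidate(query, dataset, k=4):
    """Return the name of the dataset genome with the highest k-mer Jaccard index."""
    q = kmers(query, k)
    return max(dataset, key=lambda name: jaccard(q, kmers(dataset[name], k)))


def add_genome(name, seq, dataset, ani_fn, k=4):
    """Add one genome: cheap screen first, then a single expensive comparison.

    ani_fn is a placeholder for an alignment-based ANI calculation.
    """
    if not dataset:
        dataset[name] = seq
        return None, None
    best = pick_candidate(seq, dataset, k)
    ani = ani_fn(seq, dataset[best])
    dataset[name] = seq
    return best, ani


def toy_identity(a, b):
    """Toy placeholder for ANI: fraction of matching positions, no alignment."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0
```

In this sketch each added genome triggers only one call to ani_fn, regardless of dataset size. That is the kind of saving the hybrid approach is aiming for, although the speedups reported above come from the real tools, not from this toy.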

Scientific Applications:

  • Genome-based classification and strain identification: Generates ANI matrices to support classification and identification of prokaryotic strains.
  • Continuous dataset updating: Rapidly integrates newly sequenced genomes into existing similarity matrices for ongoing genomic surveillance or comparative studies.
  • Large-scale ANI estimation and inference: Enables efficient estimation of ANI and inference of pairwise similarities across large prokaryotic genome collections.

Methodology:

LINflow first uses sourmash (alignment-free) to find the genome in the dataset most similar to a query. It then uses pyani (alignment-based) to compute ANI between the query and that candidate only. The computed ANI values are stored as Life Identification Numbers (LINs), from which other pairwise ANI values can be inferred. The same process is applied iteratively each time a new genome is added.
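
One way to picture how a LIN can encode an ANI value and allow other pairwise values to be inferred is the sketch below. It assumes that each LIN position corresponds to an ANI threshold, that a new genome copies the positions of its most similar genome up to the last threshold its ANI meets, and that the next position receives a new number. The five thresholds and the function names are hypothetical simplifications chosen for illustration; they are not LINflow's actual parameters or code.

```python
# Hypothetical, shortened threshold ladder: one ANI threshold per LIN position.
THRESHOLDS = (0.70, 0.80, 0.90, 0.95, 0.99)


def shared_positions(ani, thresholds=THRESHOLDS):
    """Count the leading thresholds that the ANI value meets."""
    n = 0
    for t in thresholds:
        if ani < t:
            break
        n += 1
    return n


def assign_lin(best_lin, ani, existing, thresholds=THRESHOLDS):
    """Derive a LIN for a new genome from its most similar genome and their ANI."""
    n = shared_positions(ani, thresholds)
    if n == len(thresholds):
        return tuple(best_lin)  # meets every threshold: same LIN
    prefix = tuple(best_lin[:n])
    # Take the next unused number at position n among LINs sharing the prefix.
    used = [lin[n] for lin in existing if tuple(lin[:n]) == prefix]
    next_val = max(used) + 1 if used else 0
    return prefix + (next_val,) + (0,) * (len(thresholds) - n - 1)


def infer_ani_range(lin_a, lin_b, thresholds=THRESHOLDS):
    """Approximate (lower, upper) ANI bracket implied by the shared LIN prefix.

    None means the bracket is open on that side.
    """
    p = 0
    for x, y in zip(lin_a, lin_b):
        if x != y:
            break
        p += 1
    lower = thresholds[p - 1] if p > 0 else None
    upper = thresholds[p] if p < len(thresholds) else None
    return lower, upper


lins = [(0, 0, 0, 0, 0)]                      # first genome in the dataset
lins.append(assign_lin(lins[0], 0.97, lins))  # second genome, ANI 0.97 to the first
lins.append(assign_lin(lins[1], 0.85, lins))  # third genome, ANI 0.85 to the second
```

Under these assumptions, infer_ani_range brackets the ANI of a pair that was never compared directly, such as the first and third genomes here. The bracket is only approximate, because each LIN is derived from a single comparison against the most similar genome.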

Details

License: MIT
Tool Type: workflow
Programming Languages: Python
Added: 10/4/2021
Last Updated: 10/4/2021

Publications

Tian L, Mazloom R, Heath LS, Vinatzer BA. LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes. PeerJ. 2021;9:e10906. doi:10.7717/peerj.10906. PMID:33828908. PMCID:PMC8000461.

Funding: National Science Foundation, IOS-1354215
