MMseqs

MMseqs performs fast sequence searching, clustering, and homology detection to reduce redundancy and enable large-scale sequence and metagenomic analyses.


Key Features:

  • Redundancy reduction: Clusters sequence databases so that the remaining sequences share at most 50% pairwise sequence identity, or less.
  • Speed and sensitivity: More sensitive than UBLAST and RAPsearch in the authors' homology detection benchmarks and 4-30× faster than those tools, although it does not reach BLAST sensitivity.
  • Comparative performance: Reported to cluster much deeper than CD-HIT and USEARCH and hundreds of times faster than BLASTclust.
  • Prefiltering module: Identifies similar k-mers between query and target sequences and sums their scores to quickly prefilter candidates.
  • Local alignment module: Computes local alignments using SSE2 vector instructions and multi-core parallelization.
  • Clustering module: Enables deep clustering of large databases down to approximately 30% sequence identity at speeds reported as hundreds of times faster than BLASTclust.
  • Cascaded clustering: Uses a cascaded clustering workflow for deep clustering, and can update an existing database clustering in linear instead of quadratic time.
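
The prefiltering idea in the list above (find similar k-mers between query and target, then sum their scores) can be illustrated with a minimal Python sketch. This is not MMseqs code: the toy scoring scheme, k = 3, the similarity threshold, and every function name are assumptions made for illustration. The real tool relies on a substitution matrix and optimized index structures.

```python
from collections import defaultdict


def toy_score(a, b):
    # Hypothetical residue score: +4 for identity, -1 otherwise.
    # A stand-in for a real substitution matrix.
    return 4 if a == b else -1


def kmer_score(x, y):
    # Score of two equal-length k-mers, summed position by position.
    return sum(toy_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    # All overlapping k-mers of a sequence.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(targets, k):
    # Map each k-mer to the set of target ids that contain it.
    index = defaultdict(set)
    for tid, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(tid)
    return index


def prefilter_scores(query, targets, k=3, min_kmer_score=7):
    # For every query k-mer, find target k-mers scoring at least
    # min_kmer_score and add that score to each target containing them.
    index = build_index(targets, k)
    totals = defaultdict(int)
    for qk in kmers(query, k):
        for tk, tids in index.items():
            s = kmer_score(qk, tk)
            if s >= min_kmer_score:
                for tid in tids:
                    totals[tid] += s
    return dict(totals)


targets = {"t1": "MKVLAA", "t2": "WWWWWW"}
scores = prefilter_scores("MKVLGA", targets)
# Targets with a high summed score would be passed on to the alignment step.
candidates = sorted(scores, key=scores.get, reverse=True)
```

With the toy scheme, a threshold of 7 admits 3-mers with at most one mismatch. Targets that share no similar k-mer with the query never appear in the result, which is what makes the step a cheap filter ahead of alignment.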

Scientific Applications:

  • Homology detection: Sensitive detection of homologs in large sequence datasets, with sensitivity exceeding UBLAST and RAPsearch.
  • Database clustering: Deep clustering of sequence databases to reduce redundancy and create representative sequence sets down to ~30% identity.
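  • Metagenomic sequence analysis: Analysis of metagenomic datasets in which many reads remain unmatched to known sequences because searching comprehensive databases with sensitive but slower tools such as BLAST or HMMER3 is too costly.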
  • Large-scale sequence processing: Scalable processing of massive datasets using cascaded clustering and parallelized alignment.
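
As a rough illustration of the database clustering application above, the following Python sketch builds a representative set by greedy clustering. It rests on stated assumptions: `toy_identity` is a hypothetical ungapped stand-in for alignment-based sequence identity, and the greedy longest-first strategy and all names are illustrative choices, not the MMseqs implementation.

```python
def toy_identity(a, b):
    # Hypothetical identity measure: position-wise matches without gaps,
    # divided by the longer length. Real tools derive identity from alignments.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, max_identity=0.5):
    # Visit sequences longest first. A sequence joins the first representative
    # it resembles above max_identity, otherwise it becomes a new representative.
    order = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    clusters = {}
    for sid in order:
        for rep in clusters:
            if toy_identity(seqs[sid], seqs[rep]) > max_identity:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


seqs = {
    "a": "MKVLAAGG",
    "b": "MKVLAAGC",
    "c": "WWWWWWWW",
    "d": "MKVLA",
}
clusters = greedy_cluster(seqs, max_identity=0.5)
representatives = list(clusters)
```

By construction, no two representatives exceed `max_identity` under the chosen measure, so the keys of `clusters` form the reduced, non-redundant set.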

Methodology:

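MMseqs combines three steps: a fast prefilter that identifies similar k-mers between query and target sequences and sums their scores, local alignment accelerated with SSE2 vector instructions and multi-core parallelization, and cascaded clustering. An existing clustering can be updated in linear rather than quadratic time.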
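
One way to picture the linear-time update is sketched below in Python. It assumes, as one plausible reading rather than a description of the MMseqs code, that each new sequence is compared only against existing cluster representatives instead of re-comparing all pairs. `toy_identity`, `update_clustering`, and the 0.5 threshold are hypothetical.

```python
def toy_identity(a, b):
    # Hypothetical ungapped identity: matches divided by the longer length.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def update_clustering(clusters, seqs, new_seqs, max_identity=0.5):
    # clusters: representative id -> list of member ids (left unmodified)
    # seqs: id -> sequence for already clustered sequences
    # new_seqs: id -> sequence for sequences added to the database
    updated = {rep: list(members) for rep, members in clusters.items()}
    all_seqs = dict(seqs)
    for sid, seq in new_seqs.items():
        all_seqs[sid] = seq
        # Compare the new sequence with representatives only.
        for rep in updated:
            if toy_identity(seq, all_seqs[rep]) > max_identity:
                updated[rep].append(sid)
                break
        else:
            # No representative is similar enough: start a new cluster.
            updated[sid] = [sid]
    return updated


old_clusters = {"a": ["a", "b"]}
old_seqs = {"a": "MKVLAAGG", "b": "MKVLAAGC"}
new_seqs = {"n1": "MKVLAAGA", "n2": "WWWWWWWW", "n3": "WWWWWWWA"}
new_clusters = update_clustering(old_clusters, old_seqs, new_seqs)
```

In this sketch the work grows with the number of new sequences times the number of representatives, and existing members are never re-compared with each other.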

Details

License: GPL-3.0
Maturity: Legacy
Cost: Free of charge
Tool Type: workflow
Operating Systems: Linux
Programming Languages: C++, C
Added: 8/3/2017
Last Updated: 11/25/2024

Publications

Hauser M, Steinegger M, Söding J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics. 2016;32(9):1323-1330. doi:10.1093/bioinformatics/btw006. PMID:26743509.

Related Tools

MMseqs2
Relation: hasNewVersion