MMseqs2

MMseqs2 performs high-throughput sequence searching and clustering to enable large-scale analysis and annotation of protein and nucleotide sequences, including metagenomic datasets.


Key Features:

  • Speed and efficiency: Runs sequence searches up to a reported 10,000-fold faster than BLAST; at more than 100-fold BLAST's speed, it retains near-equivalent sensitivity.
  • Profile searches: Supports profile-based searches with sensitivity comparable to PSI-BLAST at a reported speed more than 400 times higher.
  • Linclust algorithm: Implements Linclust with runtime scaling linear in input size N and independent of number of clusters K, enabling clustering of very large datasets (e.g., 1.6 billion metagenomic fragments in ~10 hours on a single server at 50% sequence identity) with reported >1000-fold speedups versus previous methods.
  • Parallelization and scalability: Leverages parallel processing across multiple cores and servers to scale to massive protein and nucleotide sequence datasets.
  • Fast search modes: Provides low-runtime-overhead search modes with sensitivities approaching BLAST for rapid query response.
  • Taxonomic annotation (MMseqs2 taxonomy): Extracts protein fragments from metagenomic contigs, retains the fragments useful for annotation, and assigns each contig a taxonomic label by weighted voting over its fragments. Includes modules for creating and manipulating taxonomic reference databases and for visualizing assignments. Reported speedups are 2–18× versus state-of-the-art tools.
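
The linear-time idea behind the Linclust feature above can be illustrated with a toy sketch. This is not the MMseqs2 implementation: the function names, the CRC32 hash, and the k-mer containment score (standing in for the real alignment stages) are all assumptions made for illustration. What it does show is why the work grows linearly with N: each sequence contributes only m k-mers, and is compared only to the centre of each k-mer group it falls into, never to every other sequence.

```python
import zlib
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    # Deterministic hash (Python's built-in hash() is salted per process).
    return zlib.crc32(kmer.encode())


def toy_linear_cluster(seqs, k=3, m=2, min_shared=0.5):
    """Toy clustering: seqs maps name -> sequence; returns name -> representative.

    Hypothetical illustration only; parameters are not MMseqs2 options.
    """
    # Step 1: every sequence is filed under its m lowest-hash k-mers.
    groups = defaultdict(list)
    for name, seq in seqs.items():
        for kmer in sorted(kmers(seq, k), key=stable_hash)[:m]:
            groups[kmer].append(name)

    # Step 2: in each group the longest sequence is the centre; members are
    # compared to the centre only, so at most m comparisons per sequence.
    links = defaultdict(set)
    for members in groups.values():
        centre = min(members, key=lambda n: (-len(seqs[n]), n))
        centre_kmers = kmers(seqs[centre], k)
        for name in members:
            if name == centre:
                continue
            own = kmers(seqs[name], k)
            # Assumed stand-in for alignment: fraction of shared k-mers.
            if len(own & centre_kmers) / len(own) >= min_shared:
                links[centre].add(name)

    # Step 3: greedy pass, longest first; unassigned sequences become
    # representatives and claim their still-unassigned linked members.
    assignment = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if name in assignment:
            continue
        assignment[name] = name
        for member in links.get(name, ()):
            assignment.setdefault(member, name)
    return assignment
```

Calling `toy_linear_cluster({"a": "MKTAYIAKQR", "b": "MKTAYIAKQ", "c": "WWWWPPPPGG"}, m=20)` groups `b` under `a` and leaves `c` as its own representative. Setting `m` at least as large as the number of k-mers per sequence, as here, makes the result independent of the hash function.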

Scientific Applications:

  • Functional annotation: Enables large-scale functional annotation and structure-prediction workflows for metagenomic datasets comprising billions of protein sequences.
  • Redundancy reduction and clustering: Performs similarity-based clustering to reduce redundancy and produce representative sequence sets for downstream analyses.
  • Taxonomic classification: Assigns taxonomic labels to metagenomic contigs via protein-fragment extraction and weighted voting to support taxonomic profiling.
  • Database construction and curation: Facilitates creation and manipulation of large reference sequence and taxonomic databases for genomic and metagenomic research.
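
As a sketch of how the redundancy-reduction and search applications above might be driven from a script, the helpers below assemble command lines for the `mmseqs easy-linclust` and `mmseqs easy-search` workflows. The helper names and all file paths are hypothetical. The options shown (`--min-seq-id`, `--threads`, `-s`) are ones the MMseqs2 command line is understood to accept, but they should be checked against `mmseqs <workflow> -h` for the installed version.

```python
import subprocess


def easy_linclust_cmd(fasta, out_prefix, tmp_dir, min_seq_id=0.5, threads=None):
    """Build an argv list for linear-time clustering of a FASTA file."""
    cmd = ["mmseqs", "easy-linclust", fasta, out_prefix, tmp_dir,
           "--min-seq-id", str(min_seq_id)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def easy_search_cmd(query_fasta, target_fasta, result_m8, tmp_dir, sensitivity=None):
    """Build an argv list for a query-versus-target sequence search."""
    cmd = ["mmseqs", "easy-search", query_fasta, target_fasta, result_m8, tmp_dir]
    if sensitivity is not None:
        cmd += ["-s", str(sensitivity)]
    return cmd


def run(cmd):
    """Execute a command list; raises CalledProcessError on non-zero exit."""
    subprocess.run(cmd, check=True)
```

For example, `run(easy_linclust_cmd("proteins.fasta", "clusterRes", "tmp"))` would cluster a hypothetical `proteins.fasta` at 50% sequence identity, provided `mmseqs` is on the `PATH`. Building the argument list separately from executing it keeps the commands easy to log and inspect.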

Methodology:

Uses the Linclust algorithm for linear-time clustering, supports profile searches comparable to PSI-BLAST, extracts protein fragments from contigs and assigns taxonomy via weighted voting, and employs parallel processing and optimized computational strategies for large-scale sequence search and clustering.
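
The weighted-voting step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MMseqs2 taxonomy code: the function name, the lineage representation (a tuple from root to leaf), the per-fragment weights, and the majority threshold are all hypothetical. Each fragment votes for its assigned taxon and every ancestor of that taxon, and the contig receives the most specific label whose share of the total weight exceeds the threshold.

```python
from collections import defaultdict


def weighted_vote(fragment_hits, majority=0.5):
    """Pick a contig label from per-fragment (lineage, weight) votes.

    fragment_hits: list of (lineage, weight), lineage being a tuple of
    taxon names ordered from root to leaf (assumed representation).
    Returns the deepest lineage prefix whose summed weight is strictly
    greater than `majority` of the total, or () if there is none.
    """
    total = sum(weight for _, weight in fragment_hits)
    if total <= 0:
        return ()

    # A fragment supports its own taxon and all ancestors of that taxon.
    support = defaultdict(float)
    for lineage, weight in fragment_hits:
        for depth in range(1, len(lineage) + 1):
            support[lineage[:depth]] += weight

    # With a strict threshold of at least one half, no more than one
    # prefix can qualify at each depth, so the deepest winner is unique.
    winners = [p for p, w in support.items() if w > majority * total]
    return max(winners, key=len, default=())
```

With votes of weight 3.0 for an Escherichia lineage, 2.0 for a Salmonella lineage, and 1.0 for a Bacillus lineage, no genus holds more than half of the weight, so the label falls back to the phylum the first two share. The example lineages are simplified for illustration.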

Details

License: MIT
Maturity: Mature
Cost: Free of charge
Tool Type: command-line tool
Operating Systems: Windows, Linux, Mac
Programming Languages: C++
Added: 7/3/2019
Last Updated: 11/10/2025

Operations

Sequence alignment
Sequence clustering

Publications

Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nature Communications. 2018;9(1). doi:10.1038/s41467-018-04964-5. PMID:29959318. PMCID:PMC6026198.

Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35(16):2856-2858. doi:10.1093/bioinformatics/bty1057. PMID:30615063. PMCID:PMC6691333.

Funding: - Horizon 2020 Framework Programme: 685778, Virus-X

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology. 2017;35(11):1026-1028. doi:10.1038/nbt.3988. PMID:29035372.

Steinegger M, Söding J. MMseqs2: sensitive protein sequence searching for the analysis of massive data sets. bioRxiv. 2016. doi:10.1101/079681.

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv. 2020. doi:10.1101/2020.11.27.401018.

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 2021;37(18):3029-3031. doi:10.1093/bioinformatics/btab184. PMID:33734313. PMCID:PMC8479651.

Funding: - ERC’s Horizon 2020 Framework Programme: 685778 - Korean government: 2019R1A6A1A10073437, NRF-2020M3A9G7103933

Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, Steinegger M. GPU-accelerated homology search with MMseqs2. Nature Methods. 2025;22(10):2024-2027. doi:10.1038/s41592-025-02819-8. PMID:40968302. PMCID:PMC12510879.

Funding: - National Research Foundation of Korea: 2020M3-A9G7-103933, 2021-M3A9-I4021220, 2021-R1C1-C102065, RS-2023-00250470, RS-2024-00396026 - Deutsche Forschungsgemeinschaft: 439669440 TRR319 RMaP TP C01 - Samsung: Creative-Pioneering Researchers Program - Novo Nordisk Fonden: NNF24SA0092560

Related Tools

linclust (Relation: includes)
mmseqs (Relation: isNewVersionOf)
conterminator (Relation: usedBy)
metaeuk (Relation: usedBy)
plass (Relation: usedBy)
spacepharer (Relation: usedBy)