MMseqs2

MMseqs2 performs high-throughput sequence searching and clustering to enable large-scale analysis and annotation of protein and nucleotide sequences, including metagenomic datasets.

Key Features:

Speed and efficiency: Runs sequence searches at reported speeds up to 10,000 times that of BLAST; at more than 100 times BLAST's speed, it retains nearly the same sensitivity.
Profile searches: Supports profile-based searches with sensitivity comparable to PSI-BLAST at reported speeds more than 400 times higher.
Linclust algorithm: Implements Linclust, whose runtime scales linearly with the input size N and is independent of the number of clusters K. This enables clustering of very large datasets (e.g., 1.6 billion metagenomic fragments in ~10 hours on a single server at 50% sequence identity), a reported >1000-fold speedup over previous methods.
Parallelization and scalability: Leverages parallel processing across multiple cores and servers to scale to massive protein and nucleotide sequence datasets.
Fast search modes: Provides low-runtime-overhead search modes with sensitivities approaching BLAST for rapid query response.
Taxonomic annotation (MMseqs2 taxonomy): Extracts protein fragments from metagenomic contigs, retains the fragments useful for annotation, and assigns each contig a taxonomic label by weighted voting over its fragments. Also includes modules for creating and manipulating taxonomic reference databases and for visualizing assignments. Reported speedups are 2–18× versus state-of-the-art tools.
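
The search and clustering features above are typically reached through the `easy-search` and `easy-linclust` workflows of the `mmseqs` command-line tool. The Python sketch below only assembles argument lists for those two workflows and does not run anything. The helper names, file names, and default values are illustrative assumptions, and the flags used (`-s`, `--threads`, `--min-seq-id`) should be checked against `mmseqs <workflow> -h` for the installed version.

```python
import shlex


def easy_search_cmd(query, target, result, tmp_dir, sensitivity=None, threads=None):
    """Build an `mmseqs easy-search` argument list (hypothetical helper, not part of MMseqs2)."""
    cmd = ["mmseqs", "easy-search", query, target, result, tmp_dir]
    if sensitivity is not None:
        # -s trades speed for sensitivity; higher values are slower but more sensitive.
        cmd += ["-s", str(sensitivity)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def easy_linclust_cmd(fasta, out_prefix, tmp_dir, min_seq_id=0.5):
    """Build an `mmseqs easy-linclust` argument list (hypothetical helper)."""
    # The 0.5 default mirrors the 50% identity example above; it is an assumption here.
    return ["mmseqs", "easy-linclust", fasta, out_prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id)]


if __name__ == "__main__":
    # Print the commands instead of executing them; file names are placeholders.
    print(shlex.join(easy_search_cmd("query.fasta", "target.fasta", "result.m8", "tmp",
                                     sensitivity=7.5, threads=8)))
    print(shlex.join(easy_linclust_cmd("proteins.fasta", "clusterRes", "tmp")))
```

To execute a command rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed.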

Scientific Applications:

Functional annotation: Enables large-scale functional annotation and structure-prediction workflows for metagenomic datasets comprising billions of protein sequences.
Redundancy reduction and clustering: Performs similarity-based clustering to reduce redundancy and produce representative sequence sets for downstream analyses.
Taxonomic classification: Assigns taxonomic labels to metagenomic contigs via protein-fragment extraction and weighted voting to support taxonomic profiling.
Database construction and curation: Facilitates creation and manipulation of large reference sequence and taxonomic databases for genomic and metagenomic research.
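
The weighted-voting step mentioned under taxonomic classification can be illustrated with a deliberately simplified sketch. It assumes each protein fragment already carries a flat taxon label and a non-negative weight, and it labels the contig only when one taxon collects more than a chosen fraction of the total weight. The function name, the example weights, and the 0.5 threshold are assumptions for illustration; this is not MMseqs2's actual implementation.

```python
from collections import defaultdict


def vote_contig_label(fragment_hits, min_fraction=0.5):
    """Return the taxon holding more than `min_fraction` of the total weight, else None.

    `fragment_hits` is an iterable of (taxon, weight) pairs, one per protein fragment.
    """
    totals = defaultdict(float)
    for taxon, weight in fragment_hits:
        totals[taxon] += weight

    total_weight = sum(totals.values())
    if total_weight <= 0:
        return None  # no informative fragments, so the contig stays unlabeled

    best_taxon = max(totals, key=totals.get)
    if totals[best_taxon] / total_weight > min_fraction:
        return best_taxon
    return None  # no taxon reaches the required share of the vote


if __name__ == "__main__":
    hits = [("Escherichia", 3.0), ("Salmonella", 1.0), ("Escherichia", 2.0)]
    print(vote_contig_label(hits))
```

Leaving a contig unlabeled when no taxon dominates is a design choice of this sketch: it favors precision over assigning every contig.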

Methodology:

Uses the Linclust algorithm for linear-time clustering, supports profile searches comparable to PSI-BLAST, extracts protein fragments from contigs and assigns taxonomy via weighted voting, and employs parallel processing and optimized computational strategies for large-scale sequence search and clustering.
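
To give a feel for why the clustering step can run in linear time, here is a toy sketch of the k-mer grouping idea described in the Linclust publication: each sequence contributes a fixed number of k-mers with the lowest hash values, sequences sharing a selected k-mer fall into one group, and every group member is compared only with the group's longest sequence. The function names, the CRC32 hash, and the values of `k` and `m` are assumptions chosen for illustration. The sketch stops at listing candidate pairs and omits the alignment and clustering stages.

```python
import zlib
from collections import defaultdict


def select_kmers(seq, k=5, m=2):
    """Pick up to m distinct k-mers of seq with the lowest hash values."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # CRC32 stands in for a hash function; the k-mer itself breaks ties deterministically.
    return sorted(kmers, key=lambda kmer: (zlib.crc32(kmer.encode()), kmer))[:m]


def candidate_pairs(seqs, k=5, m=2):
    """Return (center, member) pairs to compare, at most m per sequence."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        for kmer in select_kmers(seq, k, m):
            groups[kmer].append(name)

    pairs = set()
    for members in groups.values():
        # The longest sequence in the group acts as its center.
        center = max(members, key=lambda n: (len(seqs[n]), n))
        for name in members:
            if name != center:
                pairs.add((center, name))
    return pairs


if __name__ == "__main__":
    toy = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQRQISFVK", "c": "GGGGGGGGGG"}
    # m is set high here so every k-mer is kept and the shared ones are guaranteed to group.
    print(candidate_pairs(toy, k=5, m=100))
```

Because each sequence lands in at most `m` groups and is paired only with each group's center, the number of comparisons grows with the number of sequences rather than with the number of sequence pairs.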


Topics

Metagenomics, Sequence analysis, Proteins, Nucleic acids, Gene and protein families, Taxonomy

Details

License: MIT
Maturity: Mature
Cost: Free of charge
Tool Type: command-line tool
Operating Systems: Windows, Linux, Mac
Programming Languages: C++
Added: 7/3/2019
Last Updated: 11/10/2025

Operations

Data Inputs & Outputs

Sequence alignment

Inputs

Report
- DICOM format

Outputs

Score
- DICOM format

Sequence clustering

Inputs

Score

Outputs

Score
- xlsx

Taxonomic classification

Inputs

Mass spectrometry data
- SMILES

Outputs

Quality control report
- DICOM format

Publications

Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nature Communications. 2018;9(1). doi:10.1038/s41467-018-04964-5. PMID:29959318. PMCID:PMC6026198.

DOI: 10.1038/s41467-018-04964-5

PMID: 29959318

PMCID: PMC6026198

Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35(16):2856-2858. doi:10.1093/bioinformatics/bty1057. PMID:30615063. PMCID:PMC6691333.

DOI: 10.1093/bioinformatics/bty1057

PMID: 30615063

PMCID: PMC6691333

Funding:
- Horizon 2020 Framework Programme: 685778, Virus-X

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology. 2017;35(11):1026-1028. doi:10.1038/nbt.3988. PMID:29035372.

DOI: 10.1038/nbt.3988

PMID: 29035372

Steinegger M, Söding J. MMseqs2: sensitive protein sequence searching for the analysis of massive data sets. bioRxiv (preprint). 2016. doi:10.1101/079681.

DOI: 10.1101/079681

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv (preprint). 2020. doi:10.1101/2020.11.27.401018.

DOI: 10.1101/2020.11.27.401018

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 2021;37(18):3029-3031. doi:10.1093/bioinformatics/btab184. PMID:33734313. PMCID:PMC8479651.

DOI: 10.1093/bioinformatics/btab184

PMID: 33734313

PMCID: PMC8479651

Funding:
- ERC’s Horizon 2020 Framework Programme: 685778
- Korean government: 2019R1A6A1A10073437, NRF-2020M3A9G7103933

Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, Steinegger M. GPU-accelerated homology search with MMseqs2. Nature Methods. 2025;22(10):2024-2027. doi:10.1038/s41592-025-02819-8. PMID:40968302. PMCID:PMC12510879.

DOI: 10.1038/s41592-025-02819-8

Funding:
- National Research Foundation of Korea: 2020M3-A9G7-103933, 2021-M3A9-I4021220, 2021-R1C1-C102065, RS-2023-00250470, RS-2024-00396026
- Deutsche Forschungsgemeinschaft: 439669440 TRR319 RMaP TP C01
- Samsung: Creative-Pioneering Researchers Program
- Novo Nordisk Fonden: NNF24SA0092560

Documentation

General

https://github.com/soedinglab/MMseqs2/blob/master/README.md

User manual

https://github.com/soedinglab/mmseqs2/wiki

Training material

https://github.com/soedinglab/MMseqs2/wiki/Tutorials

Tutorial material

Downloads

Source code
https://github.com/soedinglab/MMseqs2/releases

Links

Issue tracker

https://github.com/soedinglab/mmseqs2/issues

Repository

https://github.com/soedinglab/mmseqs2

Related Tools

linclust

Relation: includes

mmseqs

Relation: isNewVersionOf
