OMArk
OMArk evaluates completeness, accuracy, and contamination in eukaryotic protein-coding gene repertoires by fast, alignment-free comparison of query proteomes to precomputed gene family profiles spanning the tree of life.
Key Features:
- Alignment-free comparison: Compares query sequences against precomputed gene family profiles spanning the tree of life without computing alignments, which keeps the comparison fast.
- Gene family mapping: Maps query protein sequences to gene families to determine family membership and placement.
- Lineage-aware quality metrics: Computes metrics that account for expected lineage-specific gene content and family placements.
- Completeness assessment: Quantifies recovery of expected single- and multicopy conserved genes to assess completeness.
- Taxonomic consistency assessment: Evaluates family placements relative to the target lineage to detect taxonomic inconsistency.
- Contamination and error detection: Flags proteins assigned to families from unrelated taxa or failing to map as indicators of contamination, overprediction, or misannotation.
- Comparison with existing methods: Reports both presence/absence and inconsistency signals, complementing completeness-focused methods such as BUSCO, EukCC, DOGMA, and CheckM.
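The alignment-free family mapping described in the features above can be illustrated with a toy k-mer voting scheme. This is a minimal sketch under stated assumptions, not OMArk's implementation: the function names (`kmers`, `build_index`, `assign_family`), the family names, the sequences, and the `k` and `min_shared` parameters are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families, k=3):
    """Precompute a k-mer -> family-names lookup (a stand-in for family profiles)."""
    index = {}
    for name, members in families.items():
        for seq in members:
            for kmer in kmers(seq, k):
                index.setdefault(kmer, set()).add(name)
    return index


def assign_family(query, index, k=3, min_shared=2):
    """Return the family sharing the most k-mers with the query, or None.

    No alignment is computed: each query k-mer simply votes for the
    families that contain it.
    """
    votes = Counter()
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    if not votes:
        return None  # the query fails to map to any family
    name, shared = votes.most_common(1)[0]
    return name if shared >= min_shared else None


# Made-up families and sequences, for illustration only.
FAMILIES = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GSHMLEDPVW", "GSHMLEEPVW"],
}
INDEX = build_index(FAMILIES)

print(assign_family("MKTAYIAKQK", INDEX))
print(assign_family("WWWWWWWW", INDEX))
```

The index is built once and reused for every query, which is the main reason a lookup-based comparison can be much cheaper than aligning each query against each family.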
Scientific Applications:
- Proteome quality control: Assess completeness, accuracy, and contamination in eukaryotic proteomes prior to comparative, evolutionary, or functional analyses.
- Curation prioritization: Prioritize high-confidence proteomes and identify proteomes requiring manual curation or reannotation.
- Contamination and error discovery: Detect cross-species contamination, overprediction, and lineage-specific error propagation in proteome datasets.
- Benchmarking evidence: Benchmarking across 1,805 UniProt Eukaryotic Reference Proteomes revealed widespread contamination events and lineage-specific error propagation, including inflated avian proteomes derived from fragmented zebra finch reference annotations.
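The curation-prioritization use case can be sketched as a simple triage over per-proteome quality fractions. This is an illustrative sketch only: the `triage` function, the metric keys, and every threshold value are hypothetical placeholders, not OMArk outputs or recommended cutoffs.

```python
def triage(metrics, min_complete=0.90, max_inconsistent=0.05, max_unmapped=0.10):
    """Return a curation label and the reasons behind it.

    `metrics` holds fractions in [0, 1] under hypothetical keys:
    "complete", "inconsistent" and "unmapped". All thresholds are
    placeholder values, not OMArk defaults.
    """
    reasons = []
    if metrics["complete"] < min_complete:
        reasons.append("low completeness")
    if metrics["inconsistent"] > max_inconsistent:
        reasons.append("possible contamination or misplaced families")
    if metrics["unmapped"] > max_unmapped:
        reasons.append("possible overprediction")
    label = "needs curation" if reasons else "high confidence"
    return label, reasons


# Made-up example values.
print(triage({"complete": 0.97, "inconsistent": 0.01, "unmapped": 0.03}))
print(triage({"complete": 0.80, "inconsistent": 0.12, "unmapped": 0.02}))
```

Returning the reasons alongside the label keeps the decision auditable, which matters when the result is used to decide which proteomes go to manual curation or reannotation.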
Methodology:
OMArk compares query proteins to precomputed gene family profiles using fast, alignment-free sequence comparison and maps each protein to a gene family. From these mappings it computes lineage-aware metrics: completeness, measured as the recovery of expected single- and multicopy conserved genes, and taxonomic consistency, measured by evaluating family placements against the target lineage. Proteins assigned to families from unrelated taxa, or that fail to map to any family, are flagged as potential contamination or erroneous gene models.
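The two metric families in this methodology can be sketched on toy data once each protein has a family assignment. This is a simplified illustration, not OMArk's code: the function names, the category labels (`single`, `duplicated`, `missing`, `consistent`, `inconsistent`, `unmapped`), the family identifiers, and the clade assignments are all assumptions made for the example.

```python
from collections import Counter


def completeness(expected_families, hits):
    """Fraction of expected conserved families found once, more than once, or not at all.

    `hits` maps protein id -> family name, or None when the protein did not map.
    """
    copies = Counter(fam for fam in hits.values() if fam in expected_families)
    counts = Counter()
    for fam in expected_families:
        n = copies[fam]
        if n == 0:
            counts["missing"] += 1
        elif n == 1:
            counts["single"] += 1
        else:
            counts["duplicated"] += 1
    total = len(expected_families)
    return {key: counts[key] / total for key in ("single", "duplicated", "missing")}


def consistency(hits, family_clade, target_lineage):
    """Count proteins whose family placement fits the target lineage.

    `family_clade` maps family -> clade in which the family is placed;
    `target_lineage` is the set of clades from the root to the query species.
    """
    labels = Counter()
    for fam in hits.values():
        if fam is None:
            labels["unmapped"] += 1        # candidate erroneous gene model
        elif family_clade[fam] in target_lineage:
            labels["consistent"] += 1
        else:
            labels["inconsistent"] += 1    # candidate contamination
    return labels


# Made-up assignments for a hypothetical bird proteome.
HITS = {"p1": "f1", "p2": "f2", "p3": "f2", "p4": "f9", "p5": None}
EXPECTED = {"f1", "f2", "f3", "f4"}
FAMILY_CLADE = {"f1": "Eukaryota", "f2": "Metazoa", "f3": "Aves",
                "f4": "Eukaryota", "f9": "Fungi"}
LINEAGE = {"Eukaryota", "Metazoa", "Aves"}

print(completeness(EXPECTED, HITS))
print(dict(consistency(HITS, FAMILY_CLADE, LINEAGE)))
```

Keeping completeness (which expected families were recovered) separate from consistency (where each protein was placed) mirrors the distinction drawn above between presence/absence signals and inconsistency signals.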
Details
- License: LGPL-3.0
- Maturity: Mature
- Cost: Free of charge
- Tool Type: command-line tool
- Programming Languages: Python
- Added: 3/18/2024
- Last Updated: 11/6/2024
Operations
Data Inputs & Outputs
Differential protein expression profiling
Publications
Nevers, Y., Warwick Vesztrocy, A., Rossier, V. et al. Quality assessment of gene repertoire annotations with OMArk. Nat Biotechnol 43, 124–133 (2025).