OMArk

OMArk evaluates completeness, accuracy, and contamination in eukaryotic protein-coding gene repertoires by fast, alignment-free comparison of query proteomes to precomputed gene family profiles spanning the tree of life.


Key Features:

  • Alignment-free comparison: Compares query proteomes against precomputed gene family profiles spanning the tree of life without computing sequence alignments, which keeps the comparison fast.
  • Gene family mapping: Maps query protein sequences to gene families to determine family membership and placement.
  • Lineage-aware quality metrics: Computes metrics that account for expected lineage-specific gene content and family placements.
  • Completeness assessment: Quantifies recovery of expected single- and multicopy conserved genes to assess completeness.
  • Taxonomic consistency assessment: Evaluates family placements relative to the target lineage to detect taxonomic inconsistency.
  • Contamination and error detection: Flags proteins that are assigned to families from unrelated taxa, or that fail to map to any family, as indicators of contamination, overprediction, or misannotation.
  • Comparison with existing methods: Reports both presence/absence and inconsistency signals, complementing completeness-focused methods such as BUSCO, EukCC, DOGMA, and CheckM.
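
The completeness assessment listed above can be illustrated with a small Python sketch. This is not OMArk's implementation or API: the function name `completeness_summary`, the family identifiers, and the input dictionaries are all hypothetical, and the sketch assumes that protein-to-family assignments have already been produced by the mapping step.

```python
from collections import Counter


def completeness_summary(expected_families, protein_to_family):
    """Summarise recovery of expected conserved gene families.

    expected_families: set of family IDs assumed to be conserved in the
        target lineage (hypothetical IDs in this sketch).
    protein_to_family: dict mapping protein ID -> family ID, or None when
        the protein did not map to any family.
    """
    if not expected_families:
        raise ValueError("expected_families must not be empty")

    # Count how many query proteins were assigned to each family.
    hits = Counter(f for f in protein_to_family.values() if f is not None)

    # Counter returns 0 for families that received no protein.
    single = {f for f in expected_families if hits[f] == 1}
    multi = {f for f in expected_families if hits[f] > 1}
    missing = expected_families - single - multi

    total = len(expected_families)
    return {
        "single": len(single) / total,
        "multicopy": len(multi) / total,
        "missing": len(missing) / total,
    }


# Toy data: four expected families, five query proteins.
expected = {"FAM1", "FAM2", "FAM3", "FAM4"}
assignments = {
    "p1": "FAM1",   # single copy of an expected family
    "p2": "FAM2",   # two proteins in the same expected family
    "p3": "FAM2",
    "p4": "FAM9",   # mapped, but not to an expected family
    "p5": None,     # failed to map
}
summary = completeness_summary(expected, assignments)
print(summary)
```

In this toy input, half of the expected families are recovered (one as a single copy, one in multiple copies) and half are missing. Proteins such as `p4` and `p5` do not affect completeness here; they are the kind of entries that a consistency check would examine.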

Scientific Applications:

  • Proteome quality control: Assess completeness, accuracy, and contamination in eukaryotic proteomes prior to comparative, evolutionary, or functional analyses.
  • Curation prioritization: Prioritize high-confidence proteomes and identify proteomes requiring manual curation or reannotation.
  • Contamination and error discovery: Detect cross-species contamination, overprediction, and lineage-specific error propagation in proteome datasets.
  • Benchmarking evidence: Benchmarking across 1,805 UniProt Eukaryotic Reference Proteomes revealed widespread contamination events and lineage-specific error propagation, including inflated avian proteomes derived from fragmented zebra finch reference annotations.

Methodology:

OMArk compares query proteins to precomputed gene family profiles using a fast, alignment-free method and maps each protein to a gene family. From these assignments it computes lineage-aware metrics: completeness, measured as the recovery of expected single- and multicopy conserved genes, and taxonomic consistency, measured by evaluating family placements against the target lineage. Proteins that are assigned to unrelated taxa or that fail to map are flagged as potential contamination or erroneous gene models.
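
The flagging step described above can be sketched as a simple classification in Python. This is a deliberately simplified illustration under stated assumptions, not OMArk's actual algorithm or output format: the function `classify_proteins`, the label names, and the hand-written lineage set are hypothetical, and each protein is assumed to come with the taxon of the family it was placed in.

```python
from collections import Counter


def classify_proteins(protein_to_taxon, target_lineage):
    """Label proteins by how their family placement relates to the target lineage.

    protein_to_taxon: dict mapping protein ID -> taxon name of the family
        placement, or None when the protein failed to map.
    target_lineage: set of taxon names regarded as consistent with the
        query species (hand-written for this sketch).
    """
    labels = {}
    for protein, taxon in protein_to_taxon.items():
        if taxon is None:
            # No family found: candidate erroneous or novel gene model.
            labels[protein] = "unmapped"
        elif taxon in target_lineage:
            labels[protein] = "consistent"
        else:
            # Placed in a family from an unrelated taxon: candidate contamination.
            labels[protein] = "flagged"
    return labels


# Toy data: a bird proteome with one fungal-looking protein and one unmapped protein.
lineage = {"Eukaryota", "Metazoa", "Aves"}
placements = {"p1": "Aves", "p2": "Metazoa", "p3": "Fungi", "p4": None}

labels = classify_proteins(placements, lineage)
counts = Counter(labels.values())
print(labels)
print(counts)
```

A set-membership test is used here only to keep the example short; a real lineage check would need a taxonomy to decide which placements are compatible with the query species.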

Details

License: LGPL-3.0
Maturity: Mature
Cost: Free of charge
Tool Type: command-line tool
Programming Languages: Python
Added: 3/18/2024
Last Updated: 11/6/2024

Operations

Data Inputs & Outputs

Differential protein expression profiling

Publications

Nevers, Y., Warwick Vesztrocy, A., Rossier, V. et al. Quality assessment of gene repertoire annotations with OMArk. Nat Biotechnol 43, 124–133 (2025).