GkmExplain

GkmExplain interprets the predictive sequence patterns learned by Support Vector Machines (SVMs) with gapped k-mer kernels. It assigns nucleotide-level importance scores to regulatory DNA sequences, for example in models of transcription factor binding.


Key Features:

  • Interpretation of Complex Models: Attributes sequence features learned by gapped k-mer SVMs (gkm-SVMs), including models with nonlinear components, and addresses limitations of deltaSVM, in-silico mutagenesis (ISM), and SHAP.
  • Computational Efficiency: Computes feature attributions several orders of magnitude faster than SHAP while maintaining accuracy.
  • Theoretical Foundation: Has formal connections to Integrated Gradients for feature attribution.
  • Accuracy and Reliability: Demonstrates high accuracy in simulations with regulatory DNA sequences and avoids pitfalls associated with deltaSVM and ISM.
  • Application to Regulatory Variant Identification: When applied to gkm-SVM models trained on in vivo transcription factor (TF) binding data, recovers consolidated, non-redundant TF motifs. In chromatin accessibility models, produces mutation impact scores that outperform deltaSVM and ISM.
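
For readers unfamiliar with the term, a gapped k-mer is a length-l subsequence in which only k positions are informative and the remaining l - k positions are wildcards. The Python sketch below enumerates them explicitly for illustration. The function names and the `N` wildcard symbol are assumptions made for this example; this is not how GkmExplain or any particular gkm-SVM implementation computes its kernel.

```python
from collections import Counter
from itertools import combinations
from typing import List


def gapped_kmers(lmer: str, k: int, gap: str = "N") -> List[str]:
    """Return every gapped k-mer of one l-mer: keep k positions, mask the rest."""
    out = []
    for keep in combinations(range(len(lmer)), k):
        kept = set(keep)
        out.append("".join(c if i in kept else gap for i, c in enumerate(lmer)))
    return out


def gapped_kmer_counts(seq: str, l: int, k: int) -> Counter:
    """Count gapped k-mers over all length-l windows of a sequence."""
    counts: Counter = Counter()
    for start in range(len(seq) - l + 1):
        counts.update(gapped_kmers(seq[start:start + l], k))
    return counts


print(gapped_kmers("ACG", 2))  # ['ACN', 'ANG', 'NCG']
```

Two sequences that share many gapped k-mers receive a high similarity under this kind of feature representation, which is the intuition behind a gapped k-mer kernel.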

Scientific Applications:

  • Regulatory DNA Sequence Analysis: Interpreting gapped k-mer patterns to elucidate functional signals in non-coding genomic regions.
  • Transcription Factor Binding Studies: Recovering and validating transcription factor motifs from in vivo TF binding gkm-SVM models.
  • Genetic Variant Impact Assessment: Scoring mutation impacts to prioritize regulatory genetic variants, including within chromatin accessibility models.
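
As a point of reference for mutation scoring, in-silico mutagenesis (ISM), one of the baselines named above, re-scores each single-base mutant and reports the change relative to the original sequence. The sketch below is a minimal Python illustration. `toy_score` is a hypothetical stand-in for a trained model (it simply counts an assumed `GATA` motif) and is not part of GkmExplain or any gkm-SVM package.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical model stand-in: number of 'GATA' occurrences in seq."""
    motif = "GATA"
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def ism_scores(seq: str, score: Callable[[str], float]) -> List[Dict[str, float]]:
    """For each position, map each alternative base to (mutant score - reference score)."""
    ref = score(seq)
    per_position = []
    for i, base in enumerate(seq):
        deltas = {}
        for alt in BASES:
            if alt == base:
                continue  # skip the reference base itself
            mutant = seq[:i] + alt + seq[i + 1:]
            deltas[alt] = score(mutant) - ref
        per_position.append(deltas)
    return per_position


scores = ism_scores("CCGATACC", toy_score)
print(scores[2])  # mutating the G of GATA removes the motif: every delta is -1.0
```

ISM needs one model evaluation per possible mutation, so a sequence of length n requires 3n extra evaluations in this sketch.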

Methodology:

GkmExplain is a feature attribution method: it computes nucleotide-level importance scores for gapped k-mer SVM models and has theoretical connections to Integrated Gradients.
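
The general Integrated Gradients recipe mentioned here can be sketched in a few lines of Python. This is a generic illustration on a hypothetical toy function (`toy_f` and `toy_grad` are assumptions for the example), not GkmExplain's own computation: each attribution is the input-minus-baseline difference multiplied by the average gradient along the straight path from the baseline to the input.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for step in range(steps):
        alpha = (step + 0.5) / steps  # midpoint of the step-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]


def toy_f(v: Sequence[float]) -> float:
    """Hypothetical toy function: F(v) = v0 * v1 + v2 ** 2."""
    return v[0] * v[1] + v[2] ** 2


def toy_grad(v: Sequence[float]) -> List[float]:
    """Analytic gradient of toy_f."""
    return [v[1], v[0], 2.0 * v[2]]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
# Completeness check: attributions sum to F(x) - F(baseline), up to rounding.
print(sum(attributions), toy_f(x) - toy_f(baseline))
```

The completeness property shown in the last line, where attributions add up to the change in the function's output, is one reason Integrated Gradients is used as a reference point for attribution methods.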

Details

Tool Type: command-line tool
Added: 11/14/2019
Last Updated: 12/3/2020

Publications

Shrikumar A, Prakash E, Kundaje A. GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics. 2019;35(14):i173-i182. doi:10.1093/bioinformatics/btz322. PMID:31510661. PMCID:PMC6612808.

Funding: National Institutes of Health: 1DP2GM123485, 1R01HG00967401, 1U01HG009431