GkmExplain

GkmExplain interprets Support Vector Machines (SVMs) that use gapped k-mer kernels. It attributes nucleotide-level importance scores to the predictive sequence patterns these models learn from regulatory DNA sequences, including models of transcription factor binding.

Key Features:

Interpretation of Complex Models: Attributes sequence features learned by gapped k-mer SVMs (gkm-SVMs), including models with nonlinear components, and addresses limitations of deltaSVM, in-silico mutagenesis (ISM), and SHAP.
Computational Efficiency: Computes feature attributions several orders of magnitude faster than SHAP while maintaining accuracy.
Theoretical Foundation: Has formal connections to Integrated Gradients for feature attribution.
Accuracy and Reliability: Demonstrates high accuracy in simulations with regulatory DNA sequences and avoids pitfalls associated with deltaSVM and ISM.
Application to Regulatory Variant Identification: Applied to gkm-SVM models trained on in vivo transcription factor (TF) binding data to recover consolidated, non-redundant TF motifs, and produces mutation impact scores that outperform deltaSVM and ISM in chromatin accessibility models.
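
For readers unfamiliar with the features behind the models listed above, the following is a minimal sketch of what a gapped k-mer is: a window of length l in which only k positions are informative and the remaining l - k are treated as gaps. The function names, the "-" gap symbol, and the example values of l and k are illustrative assumptions; this is not the feature-extraction code of gkm-SVM or GkmExplain.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(lmer, k, gap_char="-"):
    """Return every gapped k-mer of `lmer`: keep k positions, mask the rest."""
    l = len(lmer)
    if not 0 < k <= l:
        raise ValueError("k must satisfy 0 < k <= len(lmer)")
    results = []
    for keep in combinations(range(l), k):
        keep_set = set(keep)
        results.append(
            "".join(b if i in keep_set else gap_char for i, b in enumerate(lmer))
        )
    return results


def count_gapped_kmers(seq, l, k):
    """Count gapped k-mers over every length-l window of `seq`."""
    counts = Counter()
    for start in range(len(seq) - l + 1):
        counts.update(gapped_kmers(seq[start:start + l], k))
    return counts


# Illustrative values only: a 3-mer with 2 informative positions.
example = gapped_kmers("ACG", 2)
window_counts = count_gapped_kmers("ACGA", l=3, k=2)
```

Here `example` holds the three ways of masking one position of "ACG". Real gkm-SVM kernels avoid enumerating features explicitly, so this listing only illustrates the feature space.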

Scientific Applications:

Regulatory DNA Sequence Analysis: Interpreting gapped k-mer patterns to elucidate functional signals in non-coding genomic regions.
Transcription Factor Binding Studies: Recovering and validating transcription factor motifs from in vivo TF binding gkm-SVM models.
Genetic Variant Impact Assessment: Scoring mutation impacts to prioritize regulatory genetic variants, including within chromatin accessibility models.
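
As a point of reference for the mutation impact scoring mentioned above, the sketch below shows in-silico mutagenesis (ISM), one of the baselines GkmExplain is compared against. It is not GkmExplain itself. `toy_score` is a hypothetical stand-in for a trained model and simply counts occurrences of an arbitrary motif; any function mapping a sequence to a number could be substituted.

```python
BASES = "ACGT"


def toy_score(seq):
    """Hypothetical model score: number of 'GATA' occurrences in `seq`."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def ism_scores(seq, score_fn):
    """For each position, score every single-base substitution.

    Returns a list with one dict per position, mapping each alternative
    base to (score of mutated sequence) - (score of reference sequence).
    """
    ref_score = score_fn(seq)
    per_position = []
    for i, ref_base in enumerate(seq):
        deltas = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            deltas[alt] = score_fn(mutated) - ref_score
        per_position.append(deltas)
    return per_position


# Substitutions inside the motif lower the toy score; flanking ones do not.
scores = ism_scores("CCGATACC", toy_score)
```

ISM needs one model evaluation per possible substitution, so a sequence of length n requires 3n evaluations in addition to the reference.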

Methodology:

GkmExplain uses a feature attribution approach leveraging theoretical connections to Integrated Gradients to compute nucleotide-level importance scores for gapped k-mer SVM models.
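
To make the Integrated Gradients connection concrete, here is a minimal sketch of Integrated Gradients on a toy function. The attribution for input i is (x_i - b_i) times the average of the partial derivative with respect to input i along the straight line from a baseline b to the input x. The toy model, its hand-written gradient, the zero baseline, and the step count are all assumptions made for illustration; this is not the GkmExplain algorithm and does not operate on a gkm-SVM.

```python
import numpy as np


def toy_model(x):
    """Hypothetical smooth scalar function (not a gkm-SVM)."""
    return float(np.tanh(x[0] * x[1]) + x[2] ** 2)


def toy_grad(x):
    """Analytic gradient of toy_model."""
    s = 1.0 - np.tanh(x[0] * x[1]) ** 2
    return np.array([s * x[1], s * x[0], 2.0 * x[2]])


def integrated_gradients(x, baseline, grad_fn, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps


x = np.array([1.0, 2.0, 0.5])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline, toy_grad)

# Completeness: attributions should sum to f(x) - f(baseline),
# up to the numerical integration error.
gap = abs(attr.sum() - (toy_model(x) - toy_model(baseline)))
```

The completeness check at the end is a convenient sanity test for any Integrated Gradients implementation: a large `gap` usually means too few integration steps.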

Topics

Sequencing
Transcription factors and regulatory sites
Machine learning

Details

Tool Type: command-line tool
Added: 11/14/2019
Last Updated: 12/3/2020

Publications

Shrikumar A, Prakash E, Kundaje A. GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics. 2019;35(14):i173-i182. doi:10.1093/bioinformatics/btz322. PMID:31510661. PMCID:PMC6612808.

Funding: National Institutes of Health: 1DP2GM123485, 1R01HG00967401, 1U01HG009431