GkmExplain
GkmExplain interprets the predictive sequence patterns learned by support vector machines (SVMs) with gapped k-mer kernels. It attributes nucleotide-level importance within regulatory DNA sequences and in transcription factor binding models.
Key Features:
- Interpretation of Complex Models: Attributes sequence features learned by gapped k-mer SVMs (gkm-SVMs), including models with nonlinear components, and addresses limitations of deltaSVM, in-silico mutagenesis (ISM), and SHAP.
- Computational Efficiency: Computes feature attributions several orders of magnitude faster than SHAP while maintaining accuracy.
- Theoretical Foundation: Has formal connections to Integrated Gradients for feature attribution.
- Accuracy and Reliability: Demonstrates high accuracy in simulations with regulatory DNA sequences and avoids pitfalls associated with deltaSVM and ISM.
- Application to Regulatory Variant Identification: Recovers consolidated, non-redundant TF motifs when applied to gkm-SVM models trained on in vivo transcription factor (TF) binding data, and produces mutation impact scores that outperform deltaSVM and ISM in chromatin accessibility models.
Scientific Applications:
- Regulatory DNA Sequence Analysis: Interpreting gapped k-mer patterns to elucidate functional signals in non-coding genomic regions.
- Transcription Factor Binding Studies: Recovering and validating transcription factor motifs from in vivo TF binding gkm-SVM models.
- Genetic Variant Impact Assessment: Scoring mutation impacts to prioritize regulatory genetic variants, including within chromatin accessibility models.
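To make the terms "gapped k-mer" and "in-silico mutagenesis (ISM)" concrete, here is a minimal Python sketch. It is not GkmExplain and not a gkm-SVM: the scoring function is a hypothetical linear toy model over gapped k-mer counts, and the names `gapped_kmers`, `toy_score`, `ism_scores` and the `weights` dictionary are assumptions made up for illustration. A real gkm-SVM scores a sequence through a kernel over support vectors, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(seq, l=4, k=3):
    """Count gapped k-mers: each window of length l with k positions kept
    and the remaining l - k positions masked as 'N'."""
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in combinations(range(l), k):
            pattern = "".join(window[i] if i in keep else "N" for i in range(l))
            counts[pattern] += 1
    return counts


def toy_score(seq, weights, l=4, k=3):
    """Hypothetical linear score: weighted sum of gapped k-mer counts.
    This stands in for a trained model and is an assumption of the sketch."""
    counts = gapped_kmers(seq, l, k)
    return sum(weights.get(pattern, 0.0) * n for pattern, n in counts.items())


def ism_scores(seq, weights, l=4, k=3):
    """In-silico mutagenesis: score change for every single-base substitution."""
    reference = toy_score(seq, weights, l, k)
    deltas = {}
    for pos, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt != ref_base:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                deltas[(pos, alt)] = toy_score(mutant, weights, l, k) - reference
    return deltas


# Illustrative weights only; they do not come from any trained model.
example_weights = {"ACGN": 1.0}
example_deltas = ism_scores("ACGT", example_weights)
```

Each key of `example_deltas` is a `(position, alternative_base)` pair and each value is the change in the toy score caused by that substitution.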
Methodology:
GkmExplain uses a feature attribution approach leveraging theoretical connections to Integrated Gradients to compute nucleotide-level importance scores for gapped k-mer SVM models.
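For readers unfamiliar with Integrated Gradients, the following Python sketch shows the generic technique on a toy function. It is not the GkmExplain algorithm and does not touch a gkm-SVM; it only illustrates the attribution method that GkmExplain is stated to be theoretically connected to. The function names, the toy model, the all-zero baseline, and the numerical (finite-difference) gradient are all assumptions chosen for illustration.

```python
def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """Approximate Integrated Gradients for a scalar function f.

    attribution_i = (x_i - baseline_i) * average over the straight path from
    baseline to x of the partial derivative of f with respect to input i.
    The path integral uses the midpoint rule; gradients use central differences.
    """
    n = len(x)
    attributions = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = point[:]
            down = point[:]
            up[i] += eps
            down[i] -= eps
            grad_i = (f(up) - f(down)) / (2 * eps)
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions


def toy_model(v):
    """Hypothetical nonlinear model with an interaction term and a squared term."""
    return v[0] * v[1] + v[2] ** 2


inputs = [1.0, 2.0, 3.0]
zero_baseline = [0.0, 0.0, 0.0]
scores = integrated_gradients(toy_model, inputs, zero_baseline)
```

A useful sanity check is the completeness property of Integrated Gradients: the attributions in `scores` should sum, up to numerical error, to `toy_model(inputs) - toy_model(zero_baseline)`.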
Details
- Tool Type: command-line tool
- Added: 11/14/2019
- Last Updated: 12/3/2020
Publications
Shrikumar A, Prakash E, Kundaje A. GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics. 2019;35(14):i173-i182. doi:10.1093/bioinformatics/btz322. PMID:31510661. PMCID:PMC6612808.
Funding: National Institutes of Health: 1DP2GM123485, 1R01HG00967401, 1U01HG009431