GenomicDataCommons

GenomicDataCommons provides an R/Bioconductor interface to the National Cancer Institute’s Genomic Data Commons (GDC), enabling programmatic querying and retrieval of harmonized genomic, clinical, and biospecimen data for cancer genomics analyses.


Key Features:

  • RESTful API access: Exposes the GDC RESTful API enabling programmatic querying, filtering, and retrieval of metadata and molecular profiles directly from R.
  • Fluent query syntax: Constructs queries using a fluent, pipe-based syntax that mirrors the GDC Data Model and Data Dictionary to explore cases, files, annotations, and analytical results.
  • Harmonized sequencing data: Provides access to uniformly processed sequencing data generated by standardized pipelines for mutation calling, copy-number variation, structural variant detection, and other derived molecular features.
  • High-volume data transfer: Integrates with the GDC Data Transfer Tool to support large downloads of genomic files including BAM, FASTQ, VCF, and masked copy-number or mutation calls.
  • Controlled-access authentication: Facilitates authenticated retrieval of controlled-access datasets requiring dbGaP authorization while remaining compatible with open-access resources.
  • Reproducible Bioconductor workflows: Enables reproduction of GDC Data Portal analytical workflows within Bioconductor using downloaded harmonized datasets for downstream statistical analyses.
  • Analytical tool parity: Supports analyses comparable to GDC-provided methods such as mutation frequency visualizations, OncoGrid co-occurrence plots, survival analyses, cohort comparison utilities, and protein-domain mutation mapping.
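
The fluent query syntax listed above can be sketched roughly as follows. This is a minimal illustration, not a canonical workflow: the project ID (TCGA-OV) and the data type are assumed example values, and the code needs network access to the GDC API.

```r
library(GenomicDataCommons)
library(magrittr)  # provides %>%; loaded explicitly in case it is not attached

# Check that the GDC API is reachable before querying.
GenomicDataCommons::status()

# Describe a query against the "files" endpoint. The filter values below
# are illustrative assumptions; swap in the project and data type you need.
q <- files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-OV" &
        data_type == "Gene Expression Quantification"
    )

# Ask the GDC how many file records match the query.
n_files <- count(q)

# Retrieve metadata for the first five matching records.
first_hits <- results(q, size = 5)
str(first_hits, max.level = 1)
```

The same pattern applies to the other endpoints, for example `cases()` in place of `files()`. `available_fields("files")` lists the field names that can appear in a filter expression.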

Scientific Applications:

  • Aggregating cancer genomics data: Integrating genomic, clinical, and biospecimen data across major cancer research programs for cross-study analyses.
  • Mutation landscape analysis: Characterizing mutation frequencies, co-occurrence patterns, and protein-domain mutation distributions.
  • Copy-number and structural variant studies: Analyzing harmonized copy-number variation and structural variant calls across cohorts.
  • Survival and cohort comparisons: Performing survival analyses and cohort comparison studies linking molecular profiles to clinical outcomes.
  • Downstream reproducible analysis: Incorporating GDC harmonized datasets into Bioconductor pipelines for statistical and bioinformatic analyses.
  • Large-scale reanalysis: Downloading BAM, FASTQ, and VCF files for alignment, variant calling, or custom reanalysis workflows.

Methodology:

The package wraps the GDC RESTful API for programmatic querying, filtering, and retrieval. Queries are built with a fluent, pipe-based syntax that mirrors the GDC Data Model and Data Dictionary. High-volume downloads go through the GDC Data Transfer Tool, and controlled-access datasets that require dbGaP authorization are retrieved with authentication. The data served are uniformly processed sequencing data produced by standardized pipelines for mutation calling, copy-number variation, structural variant detection, and other derived molecular features.
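
A retrieval step following this methodology might look like the sketch below. It assumes open-access files from an example project (TCGA-OV) and an example workflow type, and it downloads only two files to keep the run small. Treat the filter values as placeholders rather than a recommended selection.

```r
library(GenomicDataCommons)
library(magrittr)  # provides %>%; loaded explicitly in case it is not attached

# Build a manifest (one row per file) for the files to download.
# Project and workflow type are assumed example values.
ge_manifest <- files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-OV" &
        analysis.workflow_type == "STAR - Counts"
    ) %>%
    manifest()

# Download the first two files through the API. gdcdata() returns the
# local paths of the downloaded files, which are stored in the GDC cache.
file_ids <- head(ge_manifest$id, 2)
file_paths <- gdcdata(file_ids)

# For controlled-access files, a token obtained from the GDC would be
# supplied instead, for example:
#   gdcdata(file_ids, token = gdc_token())
# For large transfers, the GDC Data Transfer Tool can be requested with:
#   gdcdata(file_ids, access_method = "client")
```

`gdc_token()` expects a token to be available locally (for instance through an environment variable or a token file), so the controlled-access line is left commented out here.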

Details

License: Artistic-2.0
Tool Type: library
Operating Systems: Linux, Windows, Mac
Programming Languages: R
Added: 7/9/2018
Last Updated: 12/10/2018