gdsfmt

'gdsfmt' (Genomic Data Structure format) is an R package used by the 'SeqArray' package to address the challenges of analyzing whole-genome sequencing (WGS) data, in particular the large file sizes and slow data retrieval of the Variant Call Format (VCF).
Within SeqArray, 'gdsfmt' helps implement a new WGS variant data format. The format is array-oriented and provides capabilities similar to VCF, with additional compression options and efficient data access through high-performance parallel computing. Benchmarks on 1000 Genomes Phase 3 data show smaller file sizes and faster genotype reading than VCF and binary VCF (BCF). The SeqArray package, including 'gdsfmt', offers a flexible, feature-rich, high-performance programming environment for analyzing WGS variant data within the R/Bioconductor framework.
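As a rough illustration of the array-oriented storage and compression described above, the following minimal sketch writes a small compressed array to a GDS file and reads back only part of it. It assumes the Bioconductor package gdsfmt is installed; the file path, the node name "genotype", and the toy matrix are hypothetical and chosen only for this example, not taken from SeqArray's actual file layout.

```r
# Assumes the Bioconductor package gdsfmt is installed.
library(gdsfmt)

# Hypothetical temporary file used only for this illustration.
gds_path <- tempfile(fileext = ".gds")

# Create a new GDS file and store a toy integer matrix with zlib compression.
gds <- createfn.gds(gds_path)
geno <- matrix(c(0L, 1L, 2L, 1L, 0L, 2L), nrow = 2)  # toy 2 x 3 matrix
add.gdsn(gds, "genotype", val = geno, compress = "ZIP", closezip = TRUE)
closefn.gds(gds)

# Reopen the file read-only and fetch only a slice of the stored array.
gds <- openfn.gds(gds_path)
node <- index.gdsn(gds, "genotype")
slice <- read.gdsn(node, start = c(1, 2), count = c(2, 2))  # both rows, columns 2 and 3
closefn.gds(gds)

# Remove the temporary file.
unlink(gds_path)
```

Reading with `start` and `count` retrieves a sub-array without loading the whole node, which is the kind of partial access that matters for large genotype matrices.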

Topic

Data management

Detail

  • Operation: Data handling

  • Software interface: Command-line user interface, Library

  • Language: R

  • License: The GNU General Public License v3.0

  • Cost: Free

  • Version name: 1.38.0

  • Credit: NIH.

  • Input: Nucleic acid features [Sequence variation annotation format]

  • Output: Nucleic acid features [Textual format] [Sequence variation annotation format]

  • Contact: Xiuwen Zheng zhengx@u.washington.edu

  • Collection: -

  • Maturity: Stable

Publications

  • SeqArray-a storage-efficient high-performance data format for WGS variant calls.
  • Zheng X, et al. SeqArray-a storage-efficient high-performance data format for WGS variant calls. Bioinformatics. 2017; 33:2251-2257. doi: 10.1093/bioinformatics/btx145
  • https://doi.org/10.1093/bioinformatics/btx145
  • PMID: 28334390
  • PMC: PMC5860110

Download and documentation
