TeraPCA

TeraPCA computes principal components from tera-scale genotype matrices (millions of individuals genotyped at millions of markers) to support population-structure analysis and large-scale studies of genetic variation.


Key Features:

  • Scalability and Efficiency: Handles datasets with millions of individuals and millions of markers, and computes principal components with low memory requirements (on the order of a few gigabytes of RAM).
  • In-Core and Out-of-Core Capabilities: Operates in-core when sufficient memory is available and out-of-core using disk storage when datasets exceed memory capacity.
  • Minimal Dependencies: Requires BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) as external libraries.
  • Performance: In multi-threaded mode, TeraPCA can compute the 10 leading principal components for a dataset with one million individuals genotyped at one million markers in under five hours.
  • Accuracy and Reliability: Published experiments report fast and accurate recovery of principal components for population-structure and genetic-variation studies.
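
To make the out-of-core idea in the list above concrete, here is a minimal Python sketch of streaming a matrix from disk in row blocks while accumulating the product A^T (A x). It is an illustration only, not TeraPCA's code: the raw float64 file layout and the helper names (write_matrix, iter_row_blocks, gram_times_vector, demo) are assumptions made for this example.

```python
import os
import struct
import tempfile


def write_matrix(path, rows):
    """Store a dense matrix row by row as little-endian float64 values.

    Hypothetical on-disk layout chosen for this sketch only.
    """
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(f"<{len(row)}d", *row))


def iter_row_blocks(path, n_cols, block_rows):
    """Yield lists of at most `block_rows` rows, reading one block at a time."""
    row_bytes = 8 * n_cols
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(row_bytes * block_rows)
            if not chunk:
                break
            n = len(chunk) // row_bytes
            yield [list(struct.unpack_from(f"<{n_cols}d", chunk, i * row_bytes))
                   for i in range(n)]


def gram_times_vector(path, n_cols, x, block_rows=2):
    """Accumulate y = A^T (A x) block by block, so A is never fully in memory."""
    y = [0.0] * n_cols
    for block in iter_row_blocks(path, n_cols, block_rows):
        for row in block:
            t = sum(a * b for a, b in zip(row, x))  # one entry of A x
            for j, a in enumerate(row):
                y[j] += a * t                       # add row^T * t to the result
    return y


def demo():
    """Write a tiny 3 x 2 matrix to a temporary file and stream it back."""
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_matrix(path, A)
        return gram_times_vector(path, 2, [1.0, 1.0], block_rows=2)
    finally:
        os.remove(path)
```

Only one block of rows and the length-n_cols accumulator are held in memory at any time, which is the property that lets peak memory stay far below the size of the matrix on disk.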

Scientific Applications:

  • Population-structure analysis: Principal component analysis of large genotype matrices to infer population structure in human genetics.
  • Genetic diversity and evolutionary biology: Examination of genetic variation and patterns relevant to evolutionary studies using large-scale genotype data.
  • Disease association studies: Large-cohort PCA for controlling population stratification in association analyses.
  • Tera-scale genotype dataset analysis: Enabling PCA on tera-scale datasets that comprise millions of samples and markers.

Methodology:

The core method is Randomized Subspace Iteration, which combines random projections with iterative refinement to compute principal components efficiently. In out-of-core mode it does so without loading the entire dataset into memory.
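
As a rough illustration of the idea, the following pure-Python sketch starts from a random Gaussian test matrix and repeatedly multiplies by A^T A, re-orthonormalizing after each step. It is a toy version under stated assumptions, not TeraPCA's implementation: the function names are hypothetical, the matrix is assumed small, dense, and already centered, and the final small eigenvalue solve that production codes typically add is omitted.

```python
import random


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def orthonormalize(Q):
    """Modified Gram-Schmidt on the columns of Q (assumes they are independent)."""
    basis = []
    for v in (list(c) for c in zip(*Q)):
        for u in basis:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return transpose(basis)


def randomized_subspace_iteration(A, k, iters=50, seed=0):
    """Approximate the k leading eigenvectors of A^T A (columns of the result)."""
    rng = random.Random(seed)
    n = len(A[0])
    At = transpose(A)
    # Random projection: an n x k Gaussian test matrix, orthonormalized.
    Q = orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)])
    for _ in range(iters):
        # Iterative refinement: apply A^T A, then restore orthonormal columns.
        Q = orthonormalize(matmul(At, matmul(A, Q)))
    return Q
```

For example, calling randomized_subspace_iteration([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]], k=2) returns a 3 x 2 matrix whose columns align, up to sign, with the first two coordinate axes. Each pass touches A only through the products A Q and A^T (A Q), which is why the same scheme can be run on blocks streamed from disk.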

Details

License: GPL-3.0
Maturity: Mature
Cost: Free of charge
Tool Type: library
Operating Systems: Linux, Windows, Mac
Programming Languages: C++
Added: 5/17/2019
Last Updated: 6/16/2020

Publications

Bose A, Kalantzis V, Kontopoulou E, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics. 2019;35(19):3679-3683. doi:10.1093/bioinformatics/btz157. PMID:30957838.

Funding: National Science Foundation (IIS-1661756, IIS-1661760, IIS-1715202)