TeraPCA
TeraPCA computes principal components from tera-scale genotype matrices (millions of individuals genotyped at millions of markers) to support population-structure analysis and large-scale studies of genetic variation.
Key Features:
- Scalability and Efficiency: Handles datasets with millions of individuals and millions of markers, and computes principal components with low memory requirements (on the order of a few gigabytes of RAM).
- In-Core and Out-of-Core Capabilities: Operates in-core when sufficient memory is available and out-of-core using disk storage when datasets exceed memory capacity.
- Minimal Dependencies: Requires BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) as external libraries.
- Performance: In multi-threaded mode, TeraPCA can compute the 10 leading principal components of a dataset with one million individuals genotyped at one million markers in under five hours.
- Accuracy and Reliability: Reported experiments show fast and accurate recovery of principal components for population-structure and genetic-variation studies.
Scientific Applications:
- Population-structure analysis: Principal component analysis of large genotype matrices to infer population structure in human genetics.
- Genetic diversity and evolutionary biology: Examination of genetic variation and patterns relevant to evolutionary studies using large-scale genotype data.
- Disease association studies: Large-cohort PCA for controlling population stratification in association analyses.
- Tera-scale genotype dataset analysis: Enabling PCA on tera-scale datasets that comprise millions of samples and markers.
Methodology:
The core method is Randomized Subspace Iteration, which combines random projections with iterative refinement to compute the leading principal components efficiently. It runs in-core when the data fit in memory and out-of-core otherwise, in which case the entire dataset never has to be loaded into memory at once.
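The idea behind randomized subspace iteration can be illustrated with a minimal C++ sketch. It is not TeraPCA's code or API: the names `Matrix`, `orthonormalize`, `leading_subspace`, and `block_rows` are hypothetical, plain loops stand in for the BLAS and LAPACK calls a real implementation would use, and the genotype matrix `G` (one row per marker, one column per individual) is assumed to be already mean-centered and normalized. The sketch starts from a random Gaussian basis, repeatedly multiplies it by G^T G while visiting `G` one block of rows at a time, and re-orthonormalizes after each pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // dense, row-major

// Orthonormalize the columns of Q in place (modified Gram-Schmidt).
void orthonormalize(Matrix& Q) {
    assert(!Q.empty());
    const std::size_t n = Q.size(), k = Q[0].size();
    for (std::size_t j = 0; j < k; ++j) {
        for (std::size_t p = 0; p < j; ++p) {
            double dot = 0.0;
            for (std::size_t i = 0; i < n; ++i) dot += Q[i][p] * Q[i][j];
            for (std::size_t i = 0; i < n; ++i) Q[i][j] -= dot * Q[i][p];
        }
        double norm = 0.0;
        for (std::size_t i = 0; i < n; ++i) norm += Q[i][j] * Q[i][j];
        norm = std::sqrt(norm);
        assert(norm > 0.0);  // columns are assumed linearly independent
        for (std::size_t i = 0; i < n; ++i) Q[i][j] /= norm;
    }
}

// Approximate an orthonormal basis (n x k) for the k leading eigenvectors
// of G^T G. G is visited in blocks of `block_rows` rows, so only one block
// would have to be resident at a time if the rows were streamed from disk.
Matrix leading_subspace(const Matrix& G, std::size_t k, int iters,
                        std::size_t block_rows, unsigned seed) {
    const std::size_t m = G.size();
    assert(m > 0);
    const std::size_t n = G[0].size();
    assert(k > 0 && k <= n && block_rows > 0);

    // Random Gaussian starting basis (the "random projection" step).
    std::mt19937 gen(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Matrix Q(n, std::vector<double>(k));
    for (auto& row : Q)
        for (double& x : row) x = gauss(gen);
    orthonormalize(Q);

    for (int it = 0; it < iters; ++it) {
        Matrix Y(n, std::vector<double>(k, 0.0));  // accumulates G^T (G Q)
        for (std::size_t start = 0; start < m; start += block_rows) {
            const std::size_t stop = std::min(m, start + block_rows);
            for (std::size_t r = start; r < stop; ++r) {
                // t = G[r] * Q (1 x k), then Y += G[r]^T * t.
                std::vector<double> t(k, 0.0);
                for (std::size_t i = 0; i < n; ++i)
                    for (std::size_t c = 0; c < k; ++c) t[c] += G[r][i] * Q[i][c];
                for (std::size_t i = 0; i < n; ++i)
                    for (std::size_t c = 0; c < k; ++c) Y[i][c] += G[r][i] * t[c];
            }
        }
        Q = Y;
        orthonormalize(Q);  // the "iterative refinement" step
    }
    return Q;
}
```

The returned columns span an approximation of the leading subspace rather than the individual principal components. Separating the components would take one more step that the sketch omits, typically a small k-by-k eigenproblem on the projected matrix. A call such as `leading_subspace(G, 10, 20, 4096, 1u)` would request a 10-dimensional basis after 20 passes over the data; the iteration count and block size here are illustrative values, not recommendations from the source.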
Details
- License: GPL-3.0
- Maturity: Mature
- Cost: Free of charge
- Tool Type: library
- Operating Systems: Linux, Windows, Mac
- Programming Languages: C++
- Added: 5/17/2019
- Last Updated: 6/16/2020
Publications
Bose A, Kalantzis V, Kontopoulou E, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics. 2019;35(19):3679-3683. doi:10.1093/bioinformatics/btz157. PMID:30957838.
Funding:
- National Science Foundation: IIS-1661756, IIS-1661760, IIS-1715202