Genotype Imputation Tutorial
Learning objectives: To understand how and why genotype imputation is done.
Introduction to Genotype Imputation
Missing genetic data is a common obstacle in genetics research, especially when working with extensive datasets. Genotype imputation software tools address it by using a reference panel of genetic data to infer the missing genotypes in a given dataset, often with high accuracy. Each software tool has strengths and weaknesses, and researchers can choose the one that best suits their research needs. But what is genotype imputation, and why is it essential? This tutorial covers the fundamentals of genotype imputation and the quality control that follows it, and points to some of the most popular software tools available to researchers.
What is Genotype Imputation?
Genotype imputation is used in genetics research to fill in missing genetic information in datasets.
Genotypes can be missing from a genetic dataset for various reasons. For instance, a particular individual may not have been genotyped for a specific variant, or the genotyping platform may not have captured information on a particular variant. In other cases, missing genetic information arises due to technical errors or quality control issues.
The absence of these genotypes can lead to incomplete datasets, hindering the accuracy of genetic analyses. Genotype imputation is used to fill in these gaps of missing information within a dataset. It is a powerful tool used in genetics research to infer the most likely genotypes for the missing data based on a reference panel of genetic data.
Genotype imputation allows researchers to work with incomplete datasets. The genotype imputation software tools use a reference panel of genetic data to impute missing genotypes in a given dataset. The reference panel typically contains genetic information from a large and diverse set of individuals, allowing for the imputation of missing data with high accuracy. The process of imputing missing genetic data can be complex, but the genotype imputation software tools have made it much more manageable.
How does Genotype Imputation work?
Genotype imputation is a statistical inference method: software tools compare the study dataset with a reference panel of genetic data and infer the genotypes that are missing. This allows researchers to work with incomplete datasets, including those in which a large share of the genetic information is missing.
The reference panel
The reference panel is a collection of genetic data from a large and diverse set of individuals, used as the basis for comparison when imputing missing genotypes in a given dataset. It is created by genotyping or sequencing many individuals from different populations and combining the data into a single dataset. The individuals in the reference panel are carefully selected to represent the genetic diversity of the population being studied.
Genotype imputation compares the genotypes in the given dataset with the genotypes in the reference panel and then uses statistical methods to infer the most likely genotypes for the missing data. The accuracy of the imputed genotypes depends on the quality and size of the reference panel used. A larger reference panel typically results in higher accuracy because it provides more genetic information for comparison. A diverse reference panel is also essential, because it makes it more likely that the haplotypes found in the studied population are represented.
The data in the reference panel is typically represented in a standardized format, such as VCF (Variant Call Format) or the PLINK binary format (.bed/.bim/.fam files). These formats allow for efficient storage and processing of large datasets. The reference panel includes information on each individual's genotypes and the genotyping data quality. Quality information is essential, allowing researchers to filter out low-quality genotypes when imputing missing data.
The process of genotype imputation
The process of genotype imputation consists of three main steps: prephasing, imputation, and post-imputation quality control. The first step is prephasing to determine the phase of the genotypes in the dataset.
Prephasing
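The phase describes which alleles sit together on the same copy of a chromosome. Each individual carries two copies of each chromosome, one inherited from each parent, and prephasing estimates which alleles at neighbouring variants belong to the same copy (the same haplotype). Statistical phasing does not by itself tell us which haplotype is maternal and which is paternal; that requires data from the parents. Phase is important because alleles on the same haplotype tend to be inherited together, and imputation relies on matching these haplotypes against the reference panel.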
Prephasing is typically done using phasing algorithms, which use statistical methods to infer the phase of the genotypes. These algorithms use large reference panels of genetic data to estimate the phase of the genotypes in the dataset. The reference panel is typically a large and diverse set of individuals with known phased genetic information, which is used as a basis for comparison. The algorithm then uses this information to determine the most likely phase for each variant in the dataset.
The output of the prephasing step is a phased dataset, which is used as input for imputation. In this dataset, each allele is assigned to one of the sample's two haplotypes, which makes it possible to match those haplotypes against the reference panel and impute the missing genotypes.
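To make the difference between unphased and phased data concrete, here is a minimal Python sketch based on the VCF convention that `/` separates the alleles of an unphased genotype and `|` separates the alleles of a phased one. The function names `parse_gt` and `haplotypes` are hypothetical helpers written for this illustration, not part of any imputation tool.

```python
def parse_gt(gt):
    """Parse a VCF GT string such as '0|1' (phased) or '0/1' (unphased).

    Returns (alleles, phased). Missing alleles ('.') become None.
    """
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = tuple(None if a == "." else int(a) for a in gt.split(separator))
    return alleles, phased


def haplotypes(gts):
    """Split phased diploid genotypes at consecutive variants into two haplotypes."""
    hap_a, hap_b = [], []
    for gt in gts:
        alleles, phased = parse_gt(gt)
        if not phased or len(alleles) != 2:
            raise ValueError(f"genotype {gt!r} is not a phased diploid call")
        hap_a.append(alleles[0])
        hap_b.append(alleles[1])
    return hap_a, hap_b


# One sample, three consecutive variants, already phased.
first, second = haplotypes(["0|1", "1|0", "1|1"])
print(first)   # alleles carried on one chromosome copy
print(second)  # alleles carried on the other copy
```

An unphased call such as `0/1` only says that the sample is heterozygous; the phased call `0|1` additionally says which of the two haplotypes carries the alternate allele.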
It's important to note that the accuracy of the imputation results depends on the phasing algorithm as well as on the size and diversity of the reference panel used. As with imputation itself, a larger and more diverse reference panel typically gives more accurate phasing.
Imputation
Imputation is the second step, where we compare the genotypes in the dataset to the genotypes in the reference panel, using statistical methods to infer the most likely genotypes for the missing data. The accuracy of the imputed genotypes depends on the quality and size of the reference panel used.
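The idea of copying alleles from a matching reference haplotype can be illustrated with a deliberately simplified Python sketch. It picks the single reference haplotype that best matches the target at the typed sites and copies its alleles into the untyped sites. This is only a toy nearest-haplotype copy: real imputation tools use probabilistic models that allow the best match to change along the chromosome. The function name `impute_haplotype` and the example data are made up for this illustration.

```python
def impute_haplotype(target, reference_panel):
    """Fill None entries in `target` by copying from the closest reference haplotype.

    target: list of alleles (0/1) with None at untyped sites.
    reference_panel: list of complete haplotypes of the same length.
    """
    typed_sites = [i for i, allele in enumerate(target) if allele is not None]

    def mismatches(reference_haplotype):
        # Number of typed sites where the reference haplotype disagrees.
        return sum(reference_haplotype[i] != target[i] for i in typed_sites)

    best = min(reference_panel, key=mismatches)
    return [best[i] if allele is None else allele for i, allele in enumerate(target)]


panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
target = [1, None, 0, None, 0]  # sites 1 and 3 were not genotyped

print(impute_haplotype(target, panel))  # prints [1, 1, 0, 1, 0]
```

Here the second panel haplotype agrees with the target at all three typed sites, so its alleles are copied into the two untyped sites.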
Post-imputation quality control
The final step is post-imputation quality control, which removes poorly imputed variants and samples with low imputation quality. Quality control is necessary to ensure the imputed data is reliable and accurate.
Minor allele frequency (MAF)
Rare variants are harder to impute accurately, so removing variants with low MAFs can improve the overall quality of the imputed dataset. For example, variants with MAFs less than 1% are often removed.
Minor allele frequency (MAF) is the frequency of the less common allele at a given locus in a population; the more common allele is the major allele. The frequency of an allele is the proportion of all chromosome copies in the population that carry that allele, so each diploid individual contributes two alleles to the count.
For example, suppose a particular genomic variant has two alleles, A and T. If allele A has a frequency of 0.8 in the population, then allele T has a frequency of 1 - 0.8 = 0.2. T is the minor allele because it has the lower frequency, so the MAF is 0.2, and A is the major allele.
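The calculation can be sketched in a few lines of Python. The sketch assumes each individual's genotype is coded as the number of copies of one chosen allele (0, 1, or 2), with `None` for a missing call; the function name `minor_allele_frequency` is a hypothetical helper for this illustration.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one biallelic variant.

    genotypes: per-individual allele counts (0, 1 or 2 copies of one allele),
    with None for missing calls. Each called individual contributes two alleles.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this variant")
    allele_frequency = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(allele_frequency, 1 - allele_frequency)


# Ten individuals carrying 4 copies of allele T out of 20 alleles in total.
t_counts = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
print(minor_allele_frequency(t_counts))  # prints 0.2
```

Counting alleles rather than carriers matters: in this example three of ten individuals carry T, but the allele frequency is 4/20 = 0.2.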
MAF also matters for downstream association testing. There, statistical significance is assessed by comparing the frequency of a genetic variant in cases (individuals with the trait or disease being studied) with its frequency in controls (individuals without the trait or disease being studied). If the frequency differs significantly between cases and controls, the variant is considered to be associated with the trait or disease.
Imputation quality score (INFO)
INFO scores estimate how reliably each variant was imputed. Variants with low INFO scores, typically less than 0.3, are often removed.
Imputation software tools report a per-variant quality metric alongside the imputed genotypes: IMPUTE2 reports the INFO score, while MaCH and Beagle report closely related r-squared measures. (SHAPEIT, which is often used in the same pipeline, is a phasing tool and does not itself impute genotypes.) The INFO score typically ranges from 0 to 1. A score near 1 indicates that the genotypes were imputed with high certainty, while a score near 0 indicates that the imputed genotypes are highly uncertain.
The INFO score is computed from the posterior genotype probabilities that the imputation algorithm produces for each individual, given the available genotype data and the reference panel. In the IMPUTE-style definition, it compares the uncertainty remaining in the imputed allele dosages with the variance that would be expected if genotypes were drawn at random at the estimated allele frequency; INFO is 1 minus the ratio of the two. It can be read as the fraction of statistical information about the allele frequency that is retained relative to directly observed genotypes.
Researchers typically set a threshold for the INFO score, below which the imputed variant is considered unreliable and removed from the dataset. The threshold may vary depending on the research question, the studied population, and the imputation software used.
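As an illustration, the following Python sketch computes an IMPUTE-style information measure from per-individual genotype probabilities. It assumes the commonly cited form INFO = 1 - sum_i(f_i - e_i^2) / (2N * theta * (1 - theta)), where e_i is the expected allele dosage of individual i, f_i is the expected squared dosage, and theta is the estimated allele frequency. Other tools define their quality metrics differently, so treat this as a sketch of one definition rather than the exact computation any given program performs. The function name `info_score` is hypothetical.

```python
def info_score(genotype_probs):
    """IMPUTE-style information measure for one variant.

    genotype_probs: list of (p0, p1, p2) tuples, the posterior probabilities
    of carrying 0, 1 or 2 copies of the coded allele, one tuple per individual.
    """
    n = len(genotype_probs)
    dosages = [p1 + 2 * p2 for _, p1, p2 in genotype_probs]          # e_i
    squared_dosages = [p1 + 4 * p2 for _, p1, p2 in genotype_probs]  # f_i
    theta = sum(dosages) / (2 * n)  # estimated allele frequency
    if theta <= 0.0 or theta >= 1.0:
        # Monomorphic in the imputed data; conventionally reported as 1.
        return 1.0
    # Variance left in the imputed dosages, summed over individuals.
    remaining_uncertainty = sum(f - e * e for e, f in zip(dosages, squared_dosages))
    return 1.0 - remaining_uncertainty / (2 * n * theta * (1 - theta))


# Genotypes known with certainty: no uncertainty remains, so INFO is 1.
certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
# Probabilities equal to the population expectation: nothing was learned, so INFO is 0.
uninformative = [(0.25, 0.5, 0.25)] * 4

print(info_score(certain))
print(info_score(uninformative))
```

Values between these two extremes indicate how much of the genotype uncertainty the imputation was able to resolve for that variant.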
Sample-level quality control
Sample-level quality control ensures the imputed data's accuracy and reliability. This is where we remove samples with low genotyping and imputation quality from the dataset.
The related quality control measures described below are genotype-level quality control, Hardy-Weinberg equilibrium, and population stratification.
Genotype-level quality control
Genotype-level quality control is a crucial step in improving the accuracy of imputed data. It focuses on identifying and removing poorly imputed variants, which are likely to lead to inaccurate results in downstream analyses. By removing these variants, researchers can improve the accuracy and reliability of their genetic analyses.
This quality control measure is typically performed by setting a threshold for the genotyping quality score, which is calculated by the imputation software tool. The genotyping quality score measures the accuracy of the imputed genotype, with higher scores indicating higher accuracy. Researchers typically set a threshold below which the imputed variant is considered unreliable and removed from the dataset. The threshold may vary depending on the research question, the studied population, and the imputation software used.
Additionally, researchers may choose to filter poorly imputed variants based on other quality control measures, such as minor allele frequency (MAF) or imputation quality score (INFO). MAF measures the frequency of the less common allele at a given locus in a population, while INFO scores indicate the imputation accuracy of each variant. Variants with low MAFs or INFO scores may be removed from the dataset to improve the accuracy of the genetic analyses.
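A threshold-based filter of this kind can be sketched as follows. The sketch assumes each variant is represented as a dictionary with hypothetical `id`, `maf`, and `info` keys, and uses the example thresholds mentioned in this tutorial (MAF of 1% and INFO of 0.3) as defaults; appropriate values depend on the study.

```python
def filter_variants(variants, min_maf=0.01, min_info=0.3):
    """Keep variants whose MAF and INFO score both meet the thresholds.

    variants: list of dicts with 'id', 'maf' and 'info' keys (hypothetical layout).
    Returns (kept, removed) so that the removed variants can be inspected.
    """
    kept, removed = [], []
    for variant in variants:
        if variant["maf"] >= min_maf and variant["info"] >= min_info:
            kept.append(variant)
        else:
            removed.append(variant)
    return kept, removed


variants = [
    {"id": "var1", "maf": 0.250, "info": 0.95},  # common and well imputed
    {"id": "var2", "maf": 0.004, "info": 0.90},  # too rare
    {"id": "var3", "maf": 0.120, "info": 0.20},  # poorly imputed
]

kept, removed = filter_variants(variants)
print([v["id"] for v in kept])     # prints ['var1']
print([v["id"] for v in removed])  # prints ['var2', 'var3']
```

Returning the removed variants as well makes it easy to report how many variants each threshold excluded.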
Hardy-Weinberg equilibrium
Variants that deviate significantly from Hardy-Weinberg equilibrium may be removed from the dataset, because such deviations can indicate genotyping errors or other issues.
Hardy-Weinberg equilibrium (HWE) is a principle that states that the frequency of alleles in a population should remain constant from generation to generation, assuming no evolutionary forces are acting on the population. In other words, the principle suggests that the proportion of different genotypes in a population will remain the same across generations, provided that certain assumptions are met.
The assumptions of HWE are as follows:
- The population is large and randomly mating
- There is no gene flow between populations
- There is no mutation, selection, or genetic drift
- The alleles are equally viable and fertile
If these assumptions are met, then the allele frequencies in the population will remain constant across generations. The HWE principle is essential in genetics research because it provides a basis for comparing genetic data. Deviations from HWE can indicate genotyping errors or other issues in the data, leading to the removal of variants or samples from the dataset.
An example of calculating Hardy-Weinberg equilibrium:
Suppose we have a population of 100 individuals and are interested in a particular genetic variant with two alleles, A and a. Suppose the frequency of the A allele is 0.6, and the frequency of the a allele is 0.4.
Under HWE, the frequency of each genotype can be calculated as follows:
- AA genotype frequency: (0.6)^2 = 0.36
- Aa genotype frequency: 2(0.6)(0.4) = 0.48
- aa genotype frequency: (0.4)^2 = 0.16
The sum of these frequencies should equal 1, indicating that all possible genotypes are accounted for in the population. In this case, the sum of the frequencies is:
0.36 + 0.48 + 0.16 = 1
Suppose we have observed genotypes in our population as follows:
- AA: 30 individuals
- Aa: 50 individuals
- aa: 20 individuals
We can calculate the observed genotype frequencies as follows:
- AA genotype frequency: 30/100 = 0.3
- Aa genotype frequency: 50/100 = 0.5
- aa genotype frequency: 20/100 = 0.2
We can then compare the observed frequencies to the expected frequencies under HWE to determine if there are any deviations. The expected frequencies under HWE are:
- AA genotype frequency: 0.36
- Aa genotype frequency: 0.48
- aa genotype frequency: 0.16
We can calculate the expected number of individuals with each genotype by multiplying the expected frequency by the total population size. For example, the expected number of AA individuals is:
0.36 x 100 = 36
We can then compare the expected and observed numbers of individuals with each genotype to determine if there are any deviations. For example, the expected and observed numbers of AA individuals are:
Expected: 36 and Observed: 30
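We can then test whether the observed and expected counts differ across all three genotypes using, for example, the chi-squared test. In practice, the allele frequencies are usually estimated from the observed genotype counts themselves (here, the frequency of A would be (2 x 30 + 50) / 200 = 0.55), which leaves one degree of freedom for the test. A significant deviation from Hardy-Weinberg equilibrium (HWE) suggests that the data may contain genotyping errors or other issues, and the variants responsible are commonly removed from the dataset.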
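The comparison for the worked example can be sketched in Python using only the standard library. The function name `hwe_chi_squared` is a hypothetical helper. If the allele frequency `p` is supplied (as with the 0.6 used in the example), the test has two degrees of freedom; if it is estimated from the observed counts, which is the usual practice, it has one. The p-value formulas used here are the closed forms for the chi-squared distribution with one and two degrees of freedom.

```python
import math


def hwe_chi_squared(n_AA, n_Aa, n_aa, p=None):
    """Chi-squared test of Hardy-Weinberg proportions for one biallelic variant.

    n_AA, n_Aa, n_aa: observed genotype counts.
    p: frequency of allele A; if None it is estimated from the counts.
    Returns (chi2, p_value).
    """
    n = n_AA + n_Aa + n_aa
    estimated = p is None
    if estimated:
        p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    if p <= 0 or q <= 0:
        raise ValueError("variant is monomorphic; the test is undefined")
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    if estimated:
        p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared, 1 degree of freedom
    else:
        p_value = math.exp(-chi2 / 2)             # chi-squared, 2 degrees of freedom
    return chi2, p_value


# Counts from the example, with the allele frequency fixed at 0.6 as in the text.
print(hwe_chi_squared(30, 50, 20, p=0.6))
# Same counts, with the allele frequency estimated from the sample (0.55).
print(hwe_chi_squared(30, 50, 20))
```

With the frequency fixed at 0.6, the statistic is about 2.08; with the frequency estimated from the sample, it is about 0.01. Neither would count as a significant deviation at the usual 0.05 level, so this variant would be kept. For variants with small expected counts, an exact test is generally preferred over the chi-squared approximation.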
Population stratification
In genetic research, population stratification is a prevalent problem that can lead to spurious associations, producing false positive or false negative results. Population stratification refers to differences in allele frequencies between subpopulations within a larger population, which can arise through various events, such as migration, genetic drift, and selection. These differences can be misinterpreted as genetic links between a trait and a certain genotype when, in fact, the differences in allele frequencies between subpopulations account for the association.
To remedy this, principal component analysis (PCA) may be used to detect population structure and to exclude samples that fall outside the main ancestry clusters. The method works by identifying the principal components that explain the most variance in the genetic data; individuals with similar ancestry have similar values on these components, so the components can be used to cluster individuals into groups.
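The following Python sketch shows the idea on a tiny made-up genotype matrix. It centers each variant, builds the sample-by-sample similarity (Gram) matrix, and finds its leading eigenvector by power iteration, which gives each sample's position along the first principal component up to scale and sign. It is a pure-Python illustration under these assumptions; real analyses use dedicated software on many thousands of variants, and the function name `first_principal_component` is hypothetical.

```python
import math


def first_principal_component(genotypes, iterations=200):
    """Return each sample's coordinate on PC1 (unit length, arbitrary sign).

    genotypes: list of samples, each a list of allele counts (0, 1 or 2).
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center every variant (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample Gram matrix of the centered data.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[k])) for k in range(n)]
            for i in range(n)]
    # Power iteration from a fixed, non-constant starting vector.
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(gram[i][k] * vector[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in product))
        if norm == 0:
            raise ValueError("no variation found in the genotype data")
        vector = [c / norm for c in product]
    return vector


# Made-up data: three samples from each of two subpopulations whose
# allele frequencies differ at all four variants.
samples = [
    [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1],  # subpopulation 1
    [2, 2, 2, 1], [2, 1, 2, 2], [2, 2, 1, 2],  # subpopulation 2
]
for label, score in zip("111222", first_principal_component(samples)):
    print(label, round(score, 3))
```

In this toy example the samples from the two subpopulations receive PC1 values of opposite sign, which is the kind of separation used to identify ancestry groups and outlying samples.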
Statistical methods like multidimensional scaling (MDS), admixture mapping, and structured association analysis can also address population stratification in genetic studies. MDS is similar to PCA, but instead of identifying principal components, it uses a distance matrix to identify clusters of genetically similar individuals. Admixture mapping is a statistical method that can identify loci associated with differences in allele frequencies between subpopulations. Structured association analysis is a method that can account for population stratification by including it as a covariate in the statistical model.
Overall, post-imputation quality control is essential for ensuring the imputed data is reliable and accurate. It is a critical step in genotype imputation that involves various measures, including MAF, INFO scores, sample-level quality control, genotype-level quality control, Hardy-Weinberg equilibrium, and population stratification. By performing quality control on the imputed dataset, researchers can obtain more reliable and accurate results in their genetic analyses.
Note that the specific thresholds used for each quality control measure may vary depending on the research question, the population being studied, and the imputation software used. For example, the MAF threshold may be higher or lower depending on the population's genetic diversity, and the INFO score threshold may be higher for high-coverage datasets.
In addition to these standard quality control measures, we may want to perform additional checks, such as comparing the imputed genotypes to other sources of genetic data or using imputation methods that incorporate external data sources to improve imputation accuracy.
Post-imputation quality control is an essential step in genotype imputation that ensures the accuracy and reliability of imputed data. We can identify and remove poorly imputed variants and samples by performing quality control checks, leading to more accurate and reliable genetic analyses.
Conclusion
Genotype imputation is a powerful tool in genetics research that enables researchers to work with incomplete datasets and impute missing genetic data with high accuracy. It comprises prephasing the dataset, imputing missing genotypes using a reference panel, and performing quality control on the imputed dataset. Researchers can choose the software tool that best suits their research needs and, with appropriate quality control, obtain reliable results. Genotype imputation software tools have been widely used in genetics research and have demonstrated high accuracy in imputing missing genetic data.
Note, however, that population structure also matters when choosing the reference panel. A reference panel that is not representative of the population being studied lowers imputation accuracy and can distort downstream analyses such as PCA-based clustering, which can, in turn, lead to false associations. Therefore, we should carefully select the reference panel based on the population being studied and the research needs.
Popular Genotype Imputation Software Tools
See our list of 80 Free Genotype Imputation Tools - Software and Resources
and also our software tool database which we are continually curating:
Database of Bioinformatics Software Tools and Resources