We live in a remarkable time for the study of human genetics. Nearly 150 years ago, Gregor Mendel published his laws of inheritance, which laid the foundation for understanding how the information that determines traits is passed from one generation to the next. Over 50 years ago, Watson and Crick discovered the structure of DNA, which is the molecule that encodes this genetic information. All humans share the same DNA sequence, roughly three billion positions long, at more than 99% of the positions. Almost 100 years ago, the first twin studies showed this small fraction of genetic differences in the sequence accounts for a substantial fraction of the diversity of human traits. These studies estimate the contribution of the genetic sequence to a trait by comparing the relative correlation of traits between pairs of identical twins (which inherit identical DNA sequences from their parents) and pairs of fraternal twins (which inherit a different mix of the genetic sequence from each parent).^{5,29} This contribution is referred to as the “heritability” of a trait. For example, twin studies have shown that genetic variation accounts for 80% of the variability of height in the population.^{5,15,26} The amount of information about a trait encoded in the genetic sequence suggests it is possible to predict the trait directly from the genetic sequence, and this is a central goal of human genetics.

### Key Insights

- Over the past several years, thousands of genetic variants that have been implicated in dozens of common diseases have been discovered.
- Despite this progress, only a fraction of the variants involved in disease have been discovered—a phenomenon referred to as “missing heritability.”
- Many challenges related to understanding the mystery of missing heritability and discovering the variants involved in human disease require analysis of large datasets that present opportunities for computer scientists.

Only in the past decade has technology made it cost-effective to obtain DNA sequence information from individuals, and since then a large number of the actual genetic differences have been identified and implicated in having an effect on traits. On average, individuals who carry such a genetic difference, often referred to as a genetic variant, will have a different value for a trait compared to individuals who do not carry the variant. For example, a recently published paper reporting on a large study to identify the genetic differences that affect height reported hundreds of variants in the DNA sequence that either increase or decrease an individual’s height if the individual carries the variant.^{2,23} Knowing these variants and their effects allows us to take the first steps in predicting traits using only genetic information. For example, if an individual carried many variants that increased height, we would predict the individual is taller than the population average. While predicting an easily measured trait such as height from genetic information seems like an academic exercise, the same ideas can be used to predict disease-related traits such as risk of heart attack or response to a certain drug. These predictions can help guide the selection of the best treatment options. Over 1,000 genetic variants have been implicated in hundreds of traits including many human disease-related traits of great medical importance.^{16,31}

A majority of these discoveries were made using a type of genetic study called a genome-wide association study (GWAS). In a GWAS, data from a large number of individuals is collected, including both a measurement of the disease-related trait as well as information on genetic variants from the individual. GWAS estimate the correlation between the collected disease trait and the collected genetic variants to identify genetic variants that are “associated” with disease.^{27} These associated variants are genetic variations that may have an effect on the disease risk of an individual.

While GWASes have been extremely successful in identifying variants involved in disease, their results have also raised a host of questions. Even though hundreds of variants have been implicated in some traits, their total contribution only explains a small fraction of the total genetic contribution that is known from twin studies. For example, the combined contributions of the 50 genes discovered to have an effect on height using GWASes through 2009, in studies of tens of thousands of individuals, only account for ~5% of the phenotypic variation, which is a far cry from the 80% heritability previously estimated from twin studies.^{32} The gap between the known heritability and the total genetic contribution from all variants implicated in genome studies is referred to as “missing heritability.”^{17}

After the first wave of GWAS results reported in 2007 through 2009, it became very clear the discovered variants were not going to explain a significant portion of the expected heritability. This observation was widely referred to as the “mystery of missing heritability.” A large number of possible explanations for the “missing heritability” were presented, including interactions between variants, interactions between variants and the environment, and rare variants.^{17} Missing heritability has very important implications for human health. A key challenge in personalized medicine is how to use an individual’s genome to predict disease risk. The genetic variants discovered from GWASes up to this point only utilize a fraction of the predictive information we know is present in the genome. In 2009 and 2010, a pair of papers shook the field by suggesting the missing heritability was not really “missing,” but actually accounted for by common variants with very small effects.^{21,32} This was consistent with the results of the larger GWASes performed in 2011 and 2012, which analyzed tens of thousands of individuals and reported even more variants involved in disease, many of them with very small effects as postulated. The results of these later studies provide a clearer picture of the genetic architecture of disease and motivate great opportunities for predicting disease risk for an individual using their genetic information. This article traces the history of the GWAS era from the first studies, through the mystery of missing heritability, to the current understanding of what GWAS has discovered.

What is exciting about the area of genetics is that many of these questions and challenges are “quantitative” in nature. Quantitative genetics is a field with a long and rich history dating back to the works of R.A. Fisher, Sewall Wright, and J.B.S. Haldane, which are deeply intertwined with the development of modern statistics. With the availability of large-scale genetic datasets^{1,28} including virtually all data from published GWASes, the critical challenges involve many computationally intensive data analysis problems. There are great opportunities for contributions to these important challenges from computer scientists.

### The Relation between Genotypes and Phenotypes

The genomes of any two humans are approximately 99.9% identical and the small amount of differences in the remaining genomic sequence accounts for the full range of phenotypic diversity we observe in the human population. A genetic variant is a position in the human genome where individuals in the population have different genetic content. The most common type of genetic variation is referred to as a single nucleotide polymorphism (SNP). For example, the SNP rs9939609 refers to the position 53820527 on chromosome 16, which is in the FTO gene and was implicated in Type 2 diabetes in one of the first genome-wide studies performed.^{30} For this SNP, 45% of the chromosomes in the European population have an “A” in that position while 55% have the “T” in that position.^{28} The occurring genomic content (“A” or “T”) is referred to as the “allele” of the variant and the frequency of the rarer allele of the variant (0.45) is referred to as the minor allele frequency (MAF). The less common allele (in this case “A”) is referred to as the minor allele and the more common allele (in this case “T”) is referred to as the major allele. The specific allele present in an individual is referred to as the genotype. Because mutations occur rarely in human history, for the vast majority of SNPs, only two alleles are present in the human population. Since humans are diploid—each individual has two copies of each chromosome—the possible genotypes are “TT,” “AT,” and “AA” typically encoded “0,” “1,” and “2” corresponding to the number of minor alleles the individual carries.
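To make the 0/1/2 encoding concrete, the following minimal sketch counts minor alleles in two-letter genotype calls and estimates the minor allele frequency from them. The function names and the five example calls are hypothetical and serve only as an illustration.

```python
import numpy as np

def encode_genotypes(calls, minor_allele):
    """Encode each two-letter genotype call as its number of minor alleles (0, 1, or 2)."""
    return np.array([call.count(minor_allele) for call in calls])

def minor_allele_frequency(genotypes):
    """Each diploid individual contributes two chromosomes, so divide by 2N."""
    return genotypes.sum() / (2 * len(genotypes))

# Hypothetical calls for five individuals at a SNP whose minor allele is "A".
calls = ["TT", "AT", "AA", "TT", "AT"]
g = encode_genotypes(calls, minor_allele="A")
maf = minor_allele_frequency(g)
```

In this toy sample, four of the ten chromosomes carry the minor allele, so the estimated MAF is 0.4.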

There are many kinds of genetic variation that are present in addition to SNPs, such as single-position insertions and deletions, referred to as indels, or even larger variants, referred to as structural variants, encompassing such phenomena as duplications or deletions of stretches of the genome or even inversions or other rearrangements of the genome. Virtually all GWASes collect SNP information because SNPs are by far the most common form of genetic variation in the genome and are present in virtually every region in the genome as well as amenable to experimental techniques that allow for large-scale collection of SNP information.^{7,18} While other types of genetic variation may be important in disease, since SNPs are so common in the genome, virtually every other type of genetic variant occurs near a SNP that is highly correlated with that variant. Thus genetic studies collecting SNPs can capture the genetic effects of both the SNPs they collect as well as the other genetic variants that are correlated with these SNPs.

Genetic variation can be approximately viewed as falling into one of two categories: common and rare variation. The minor allele frequency threshold separating common and rare variation is obviously subjective and the threshold is usually defined in the range of 1%–5% depending on the context. Variants that are more common tend to be more strongly “correlated” to other variants in the region. The genetics community, for historical reasons, refers to this correlation by “linkage disequilibrium.” Two variants are “correlated” if whether or not an individual carries the minor allele at one variant provides information on carrying the minor allele at another variant. This correlation structure between neighboring variants is a result of human population history and the biological processes that pass variation from one generation to the next. The study of these processes and how they shape genetic variation is the rich field of population genetics.^{8}
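As a rough illustration of this correlation, the sketch below computes the squared Pearson correlation between the minor-allele counts at two variants. This is an assumption made for simplicity: linkage disequilibrium is formally defined on haplotypes, and the genotype-based correlation is only a commonly used proxy. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def genotype_r2(g_a, g_b):
    """Squared Pearson correlation between two vectors of minor-allele counts."""
    r = np.corrcoef(g_a, g_b)[0, 1]
    return r ** 2

g_a = np.array([0, 1, 2, 0, 1, 2])
g_b = np.array([0, 1, 2, 0, 1, 2])  # always matches g_a: perfectly correlated
g_c = np.array([1, 0, 1, 1, 0, 1])  # carries no information about g_a

r2_ab = genotype_r2(g_a, g_b)
r2_ac = genotype_r2(g_a, g_c)
```

A value near 1 means that knowing the genotype at one variant nearly determines the genotype at the other, while a value near 0 means the two variants are uninformative about each other.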

The field of genetics assumes a standard mathematical model for the relationship between genetic variation and traits or phenotypes. This model is called the polygenic model. Despite its simplicity, the model is a reasonable approximation of how genetic variation affects traits and provides a rich starting point for understanding genetic studies. Here, we describe a variant of the classic polygenic model.

We assume our genetic study collects *N* individuals and the phenotype of individual *j* is denoted *y*_{j}. We assume a genetic study collects *M* variants and for simplicity, we assume all of the variants are independent of each other (not correlated). We denote the frequency of variant *i* in the population as *p*_{i}. We denote the genotype of the *i*th variant in the *j*th individual as *g*_{ij} ∈ {0, 1, 2}, which encodes the number of minor alleles for that variant present in the individual. In order to simplify the formulas later in this article, without loss of generality, we normalize the genotype values such that

$$x_{ij} = \frac{g_{ij} - 2p_i}{\sqrt{2p_i(1 - p_i)}}$$

since the mean and variance of the column vector of genotypes (*g*_{i}) is 2*p*_{i} and 2*p*_{i} (1 – *p*_{i}), respectively. Because of the normalization, the mean and variance of the vector of genotypes at a specific variant *i* denoted *X*_{i} is 0 and 1, respectively.
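The normalization can be sketched in a few lines of NumPy. This is a minimal illustration: it assumes the allele frequencies are estimated from the sample itself, that no variant is monomorphic, and that the simulated genotypes follow a binomial model. The function name is hypothetical.

```python
import numpy as np

def normalize_genotypes(G):
    """Map an (N, M) array of minor-allele counts to normalized genotypes.

    Column i is centered by 2*p_i and scaled by sqrt(2*p_i*(1 - p_i)),
    with p_i estimated from the sample (an assumption of this sketch).
    """
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0
    return (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# Simulated genotypes: each individual carries 0, 1, or 2 minor alleles per variant.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.5, size=50)
G = rng.binomial(2, p_true, size=(20000, 50))
X = normalize_genotypes(G)
```

Each column of `X` has mean 0 by construction, and variance close to 1 when the genotype counts behave like two independent draws per individual.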

The phenotype can then be modeled using

$$y_j = \mu + \sum_{i=1}^{M} \beta_i x_{ij} + e_j \tag{1}$$

where the effect of each variant on the phenotype is β_{i}, the model mean is *μ*, and *e*_{j} is the contribution of the environment to the phenotype, which is assumed to be normally distributed with variance $\sigma_e^2$, denoted $e_j \sim N(0, \sigma_e^2)$. We note that inherent to this model is the “additive” assumption in that the variants all contribute linearly to the phenotype value. More sophisticated models, which include nonadditive effects or gene-by-gene interactions, are an active area of research.

If we denote the vector of phenotypes *y*, the vector of effect sizes *β*, the matrix of normalized genotypes *X*, and the vector of environmental contributions **e**, then the model for the study population can be denoted

$$\mathbf{y} = \mu \mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \tag{2}$$

where **1** is a column vector of 1s, and **e** is a random vector drawn from the multivariate normal distribution with mean 0 and covariance matrix $\sigma_e^2 \mathbf{I}$, denoted $\mathbf{e} \sim N(0, \sigma_e^2 \mathbf{I})$.
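The matrix form of the model translates directly into a simulation. The sketch below is illustrative only: it uses standard normal draws as a stand-in for normalized genotypes, and the effect sizes, mean, and variances are arbitrary values chosen so that the genetic and environmental components each contribute about half of the variance.

```python
import numpy as np

def simulate_phenotypes(X, beta, mu, sigma_e, rng):
    """Draw y = mu*1 + X @ beta + e with e ~ N(0, sigma_e^2 I)."""
    e = rng.normal(0.0, sigma_e, size=X.shape[0])
    return mu + X @ beta + e

rng = np.random.default_rng(1)
N, M = 1000, 200
X = rng.standard_normal((N, M))                   # stand-in for normalized genotypes
beta = rng.normal(0.0, np.sqrt(0.5 / M), size=M)  # many small additive effects
y = simulate_phenotypes(X, beta, mu=170.0, sigma_e=np.sqrt(0.5), rng=rng)
```

Setting `sigma_e` to 0 removes the environmental term and leaves only the genetic mean of each individual.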

### Genome-Wide Association Studies

Genome-wide association studies (GWAS) collect disease trait information, referred to as phenotypes, and genetic information, referred to as genotypes, from a set of individuals. The phenotype is either a binary indicator of disease status or a quantitative measure of a disease-related trait such as an individual’s cholesterol level. Studies that collect binary trait information are referred to as case/control studies and typically collect an equal number of individuals with and without the disease. Studies that collect quantitative measures are often performed on a representative sample of a population, referred to as a population cohort, and select individuals using criteria designed to be representative of a larger population (for example, all individuals who were born in a specific location in a specific year^{25}).

GWASes focus on discovering the common variation involved in disease traits. Because of the correlation structure of the genome, GWASes only collect a subset of the common variation typically in the range of 500,000 variants. Studies have shown that collecting only this fraction of the common variants “captures” the full set of common variants in the genome. For the vast majority of common variants in the genome, at least 1 of the 500,000 variants that is collected is correlated with the variant. GWASes typically collect genotype information on these variants in thousands of individuals along with phenotypic information.

The general analysis strategy of GWAS is motivated by the assumptions of the polygenic model (Equation 1). In a GWAS, genotypes and phenotypes are collected from a set of individuals with the goal of discovering the associated variants. Intuitively, a GWAS identifies a variant involved in disease by splitting the set of individuals based on their genotype (“0,” “1,” or “2”) and computing the mean of the disease-related trait in each group. If the means are significantly different, then this variant is declared associated and may be involved in the disease. More formally, the analysis of GWAS data in the context of the model in Equation (1) corresponds to estimating the vector *β* from the data, and we refer to the estimated vector as $\hat{\beta}$, following the convention that estimates of unknown parameters from data are denoted with a “hat” over the parameter. Since the number of individuals is at least an order of magnitude smaller than the number of variants, it is impossible to simultaneously estimate all of the components of *β*. Instead, in a typical GWAS, the effect size for each variant is estimated one at a time and a statistical test is performed to determine whether or not the variant has a significant effect on the phenotype. This is done by estimating the maximum likelihood parameters of the following equation

$$y_j = \mu + \beta_k x_{kj} + e_j \tag{3}$$

which results in the estimates $\hat{\mu}$ and $\hat{\beta}_k$, and then performing a statistical test to see if the estimated value of $\hat{\beta}_k$ is non-zero. (See Appendix 1, available with this article in the ACM Digital Library, for more details on association statistics.) The results of an association study are then the set of significantly associated variants, which we denote using the set *A*, and their corresponding effect size estimates $\hat{\beta}_i$ for $i \in A$.
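The one-variant-at-a-time analysis can be sketched as a vectorized least-squares fit. This is a minimal illustration, not the statistic defined in Appendix 1: it assumes a normal approximation for the test statistic and a Bonferroni-corrected threshold, and the function name, simulated data, and effect size are all hypothetical.

```python
import math
import numpy as np

def association_scan(X, y):
    """Fit y_j = mu + beta_k * x_kj + e_j separately for every column k of X.

    Returns the per-variant effect estimates and two-sided p-values
    (normal approximation to the test statistic).
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)
    beta_hat = Xc.T @ yc / sxx
    # Residual sum of squares of each single-variant fit.
    rss = (yc ** 2).sum() - beta_hat ** 2 * sxx
    se = np.sqrt(rss / (N - 2) / sxx)
    z = beta_hat / se
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta_hat, p_values

# Simulated study: 20 variants, only variant 7 affects the phenotype.
rng = np.random.default_rng(2)
N, M, causal = 2000, 20, 7
G = rng.binomial(2, 0.3, size=(N, M))
X = (G - 2 * 0.3) / np.sqrt(2 * 0.3 * 0.7)
y = 1.0 + 0.5 * X[:, causal] + rng.standard_normal(N)

beta_hat, p_values = association_scan(X, y)
A = np.flatnonzero(p_values < 0.05 / M)  # set of associated variants
```

The scan costs one pass over the genotype matrix, which is why testing variants individually remains practical even for hundreds of thousands of variants.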

The results of GWASes can be directly utilized for personalized medicine. In personalized medicine, one of the challenges is to identify individuals that have high genetic risk for a particular disease. In our model from Equation (1), each individual’s phenotype can be decomposed into a genetic mean $\mu + \sum_i \beta_i x_{ij}$ and an environmental component (*e*_{j}). The genetic mean, which is unique to each individual and a function of the effect sizes and the individual’s genotypes, can be thought of as a measure of the individual’s genetic risk. Thus, inferring this genetic mean is closely related to identifying individuals at risk for a disease and since the environmental contribution has mean 0, predicting the genetic mean and the phenotype are closely related problems.

Knowing nothing about an individual’s genotypes or the effect sizes, the best prediction for an individual’s phenotype would be the estimated phenotypic mean $\hat{\mu}$. The more information we have on an individual’s genotypes and the effect sizes, the closer our phenotype prediction is to the true phenotype. Using the results of a GWAS and the genotypes of a new individual *x**, we can use the discovered associated loci to make a phenotype prediction, *y**, for the individual using $y^* = \hat{\mu} + \sum_{i \in A} \hat{\beta}_i x_i^*$. As we discuss below, while the prediction of a trait from GWAS is more informative than just using the mean, unfortunately, the predictions are not accurate enough to be clinically useful.
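A prediction from the associated variants is a short weighted sum. In the sketch below, the function name and all numeric values are hypothetical, and the new individual's genotypes are assumed to be normalized in the same way as the study genotypes.

```python
import numpy as np

def predict_phenotype(x_new, mu_hat, beta_hat, associated):
    """Predict y* = mu_hat + sum over i in A of beta_hat[i] * x_new[i]."""
    idx = np.asarray(associated, dtype=int)
    return mu_hat + float(np.dot(beta_hat[idx], x_new[idx]))

mu_hat = 170.0
beta_hat = np.array([0.5, -0.2, 0.1])  # effect estimates for three variants
x_new = np.array([1.0, 2.0, -1.0])     # normalized genotypes of the new individual

# Only variants 0 and 2 reached significance, so variant 1 is ignored.
y_pred = predict_phenotype(x_new, mu_hat, beta_hat, associated=[0, 2])
```

With an empty associated set, the prediction falls back to the estimated mean, which matches the case where nothing is known about the individual's genetics.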

### What GWAS has Discovered and the Mystery of Missing Heritability

In the genetics community, how much genetics influences a trait is quantified using “heritability,” which is the proportion of disease phenotypic variance explained by the genetics. The heritability of a trait can be measured using approaches that take advantage of related individuals. One such approach is the twin study. Twin studies measure the same trait in many pairs of twins. Some of these pairs of twins are monozygotic (MZ) twins, often referred to as identical twins, and some of the pairs are dizygotic (DZ) twins, often referred to as fraternal twins. The difference between MZ twins and DZ twins is that MZ twins have virtually identical genomes, while DZ twins only share about 50% of their genomes. By computing the relative correlation between trait values of MZ twins versus DZ twins, heritability of the trait can be estimated.^{29} Intuitively, if the MZ twins within a pair have very similar trait values while DZ twins within a pair have different trait values, then the trait is very heritable. If the difference in trait values within pairs of MZ twins is approximately the same as the difference between values within pairs of DZ twins, then the trait is not very heritable.
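One classical way to turn the two twin correlations into a heritability estimate is Falconer's formula, which doubles the difference between the MZ and DZ correlations. The article does not state which estimator the cited twin studies use, so the sketch below is an assumption for illustration, and the function names and correlation values are hypothetical.

```python
import numpy as np

def pair_correlation(pairs):
    """Correlation of a trait between the first and second twin of each pair."""
    pairs = np.asarray(pairs, dtype=float)
    return np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]

def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate: twice the gap between MZ and DZ trait correlations."""
    return 2.0 * (r_mz - r_dz)

# Hypothetical correlations: MZ twins resemble each other much more than DZ twins.
h2_twin = falconer_heritability(r_mz=0.9, r_dz=0.5)
```

When the MZ and DZ correlations are equal, the estimate is 0, matching the intuition that such a trait is not very heritable.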

In our model, the total phenotypic variance Var(*y*) can be decomposed into a genetic component and an environmental component. In our context, heritability refers to the proportion of the overall variance contributed by the variance of the genetic component (Σ_{i} β_{i}*X*_{i}). The variance corresponding to the environment is $\sigma_e^2$. Since the genotypes are normalized, the phenotypic variance accounted for by each variant is $\beta_i^2$, thus the total genetic variance is $\sum_i \beta_i^2$. The heritability, which is denoted *h*^{2} for historical reasons, is then

$$h^2 = \frac{\sum_{i} \beta_i^2}{\sum_{i} \beta_i^2 + \sigma_e^2} \tag{4}$$

Unfortunately, we do not know the true values of β_{i} or $\sigma_e^2$. The studies using twins have been shown to closely approximate the heritability as defined in Equation (4).

GWASes have been tremendously successful in discovering variation involved in traits. The initial studies found a few variants involved in disease. For example, one of the first GWASes was the Wellcome Trust Case Control Consortium study, which used 3,000 healthy individuals and 2,000 individuals from each of seven diseases.^{30} They found 24 associations. As sample sizes increased, more discoveries were made, particularly because many smaller GWASes were combined to enable a meta-analysis of a larger population. The results of all GWASes are catalogued at the National Human Genome Research Institute (http://www.genome.gov/gwastudies) and as of November 2013, GWASes have identified 11,996 variants associated with 17 disease categories.^{10}

While the large number of associations discovered can lead to new insights about the genetic basis of common diseases, the vast majority of discovered loci have very small effect sizes. Yet it is well known that genetics plays a large role in disease risk. For example, for many diseases, it is known that parental disease history is a strong predictor of disease risk.

Now let us use the results of GWAS to estimate the heritability. We can estimate the total phenotypic variance by estimating the variance of our phenotypes directly, Var(*y*), which is a reasonable approximation for the true phenotypic variance $\sum_i \beta_i^2 + \sigma_e^2$. Let *A* be the set of associated variants and for these variants, the estimate $\hat{\beta}_i$ is a reasonable estimate for β_{i}. We can use them to estimate the heritability explained by GWAS, which we denote $h_G^2$:

$$h_G^2 = \frac{\sum_{i \in A} \hat{\beta}_i^2}{\mathrm{Var}(y)} \tag{5}$$

We note the main difference between $h_G^2$ and *h*^{2} is there are only |*A*| terms in the numerator of $h_G^2$ while there are *M* terms in *h*^{2}. For this reason, $h_G^2$ < *h*^{2}. Intuitively, the difference between $h_G^2$ and *h*^{2} is the gap between the contribution of the variants that have been discovered by GWAS and the contribution of all variants to the genetic effect.
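A small numeric sketch shows how the gap arises. The effect sizes and environmental variance below are hypothetical, and for simplicity the sketch assumes the effect estimates of the discovered variants equal their true values.

```python
import numpy as np

def true_heritability(beta, sigma_e2):
    """Genetic variance (sum of squared effects) over total phenotypic variance."""
    genetic_var = np.sum(beta ** 2)
    return genetic_var / (genetic_var + sigma_e2)

def gwas_heritability(beta_hat, associated, var_y):
    """Variance explained by only the associated variants, over Var(y)."""
    idx = np.asarray(associated, dtype=int)
    return np.sum(beta_hat[idx] ** 2) / var_y

beta = np.array([0.4, 0.2, 0.2, 0.1, 0.1])  # one larger effect, several small ones
sigma_e2 = 0.74
var_y = np.sum(beta ** 2) + sigma_e2

h2 = true_heritability(beta, sigma_e2)
# Suppose only the largest effect was detected by the study.
h2_gwas = gwas_heritability(beta, associated=[0], var_y=var_y)
missing = h2 - h2_gwas
```

Here the single detected variant explains 0.16 of the variance while the full set of variants explains 0.26, so the remaining 0.10 is carried by variants the study did not detect.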

A landmark survey in 2009 compared the heritability estimates from twin studies to the heritability explained by GWAS results.^{17} This survey showed that the 18 variants implicated by GWAS in Type 2 Diabetes only explained 6% of the known heritability. Similarly, the 40 variants implicated in height at that time only explained 5% of the heritability. The large gap between the two estimates is referred to as the “missing heritability” and a large amount of effort has gone into finding this missing heritability.

Part of the picture of missing heritability can be explained by analyzing the statistical power of GWASes. An analysis of the statistical power shows that even very large GWAS studies often fail to detect trait-affecting variants that have low minor allele frequencies (see Appendix 1, available online, for a discussion and definition of statistical power). Thus, a possible explanation for missing heritability is that a very large number of variants with very small effects are present throughout the genome accounting for the remaining heritability and simply could not be discovered by GWAS due to power considerations. If this is the case, as study samples increase, more and more of these variants will be discovered and the amount of heritability explained by the GWAS results will slowly approach the total heritability of the trait. Unfortunately, there is a practical limit to how large GWASes can become due to cost considerations. Even putting cost aside, for some diseases, there are simply not enough individuals with the disease on the planet to perform large enough GWASes to discover all of the variants involved with the disease.
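The power argument can be made concrete with a standard normal approximation. The sketch below is an assumption and not the derivation in Appendix 1: it takes the non-centrality of the test to be the per-allele effect times the square root of 2p(1 − p)N, divided by the phenotypic standard deviation, and it uses 5e-8 as a conventional genome-wide significance threshold. The function name and example values are hypothetical.

```python
import math
from statistics import NormalDist

def gwas_power(effect_per_allele, maf, n, alpha=5e-8, sigma=1.0):
    """Approximate power of a two-sided single-variant test.

    Assumes the test statistic is normal with unit variance and mean
    effect_per_allele * sqrt(2 * maf * (1 - maf) * n) / sigma.
    """
    nd = NormalDist()
    ncp = effect_per_allele * math.sqrt(2.0 * maf * (1.0 - maf) * n) / sigma
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

power_common_small_study = gwas_power(0.1, maf=0.30, n=5000)
power_common_large_study = gwas_power(0.1, maf=0.30, n=50000)
power_rare_small_study = gwas_power(0.1, maf=0.01, n=5000)
```

Under this approximation, power grows with sample size and shrinks as the minor allele frequency or the effect size decreases, which is the pattern described in the text.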

Without the ability to perform even larger GWASes, it was not clear if we could identify whether there are enough small effect size variants in the genome corresponding to the missing heritability or the missing heritability was due to some other reasons such as interactions between variants, structural variation, rare variants, or interactions between genetics and environment.

### Mixed Models for Population Structure and Missing Heritability

Another insight into missing heritability emerged from what initially seemed like an unrelated development addressing an orthogonal problem in association studies. GWAS statistics (Appendix 1, available online) make the same assumptions as linear regression, which assumes the phenotype of each individual is independently distributed. Unfortunately, this is not always the case. The reason is the discrepancy between the statistical model that actually generated the data (Equation 2) and the statistical model that is assumed when performing a GWAS (Equation 3). The term that is missing from the testing model, Σ_{i≠k} β_{i}*x*_{ij}, is referred to as an unmodeled factor. This unmodeled factor corresponds to the effect of variants in the genome other than the variant being tested in the statistical test.

If the values for the unmodeled factor are independently distributed among individuals, then the factor will increase the amount of variance, but not violate the independently distributed assumption of the statistics. The effect of the unmodeled factor is that it will increase the estimate of the variance $\sigma_e^2$ in Equation (3) compared to the true environmental variance in Equation (2). However, if the unmodeled factor is not independently distributed, then this will violate the assumptions of the statistical test in Equation (3).

Unfortunately, in association studies, the effect of the rest of the genome on a trait is not independent when individuals who are related are present in the association studies. Consider a pair of siblings who are present in an association study as well as a pair of unrelated individuals. Since siblings share about half of their genome, for half of the genome, they will have identical genotypes. Many of these variants will have an effect on the phenotype. The values of Σ_{i≠k} β_{i}*x*_{ij} will be much closer to each other for siblings compared to a pair of unrelated individuals. This applies for more distant relationships as well. This problem is referred to as “population structure” where differing degrees of relatedness between individuals in the GWAS cause an inflation of the values of the association statistics leading to false positives. Many methods for addressing population structure have been presented over the years including genomic control^{4} that scales the statistics to avoid inflation, principal component based methods,^{20} and most recently mixed model methods.^{11,12,14,34}

The basis of the mixed model approach to population structure is the insight that the proportion of the genome shared corresponds to the expected similarity in the values of the unmodeled factors. In fact, the amount of similarity between the unmodeled factors in association studies will be proportional to the amount of the genome shared between individuals, particularly under some standard assumptions made about the effect sizes of the variants and the assumption that each variant has equal likelihood of being causal. More precisely, the covariance of the unmodeled factors is proportional to the amount of the genome shared. The amount of genome shared is referred to as the “kinship matrix” and since the genotypes are normalized, the kinship is simply **K** = *XX*^{T}/*M* where *X* is the *N* × *M* matrix of the normalized genotypes. We then add a term to the statistical model to capture these unmodeled factors resulting in the statistical model

$$\mathbf{y} = \mu \mathbf{1} + \beta_k \mathbf{x}_k + \mathbf{u} + \mathbf{e} \tag{6}$$

where *x*_{k} is a column vector of normalized genotypes for variant *k*, $\mathbf{e} \sim N(0, \sigma_e^2 \mathbf{I})$, and $\mathbf{u} \sim N(0, \sigma_g^2 \mathbf{K})$ represents the contributions of the unmodeled factors. When performing an association, mixed model methods estimate the maximum likelihood for parameters μ, β_{k}, $\sigma_g^2$, and $\sigma_e^2$ using the likelihood

$$L(N, \mathbf{y}, \mathbf{x}_k, \mu, \beta_k, \sigma_g^2, \sigma_e^2, \mathbf{K}) = \frac{1}{(2\pi)^{N/2} \left|\sigma_g^2 \mathbf{K} + \sigma_e^2 \mathbf{I}\right|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{y} - \mu \mathbf{1} - \beta_k \mathbf{x}_k)^{T} (\sigma_g^2 \mathbf{K} + \sigma_e^2 \mathbf{I})^{-1} (\mathbf{y} - \mu \mathbf{1} - \beta_k \mathbf{x}_k)\right) \tag{7}$$

and compare this maximum likelihood to the maximum likelihood when β_{k} is restricted to 0. By comparing these likelihoods, mixed model methods can obtain a significance for the association at variant *k* correcting for population structure. Mixed models were shown to perform well for a wide variety of population structure scenarios and correct for the structure in studies involving closely related individuals^{13} to studies with more distant relationships.^{11}
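The kinship matrix and the mixed model likelihood can be sketched with NumPy as follows. This is a direct, unoptimized evaluation of the multivariate normal log-likelihood implied by the model; practical mixed model software uses faster decompositions. The function names and the three-individual toy data are hypothetical.

```python
import numpy as np

def kinship(X):
    """Kinship matrix K = X X^T / M from an (N, M) matrix of normalized genotypes."""
    return X @ X.T / X.shape[1]

def mixed_model_loglik(y, x_k, mu, beta_k, sigma_g2, sigma_e2, K):
    """Log-likelihood of y ~ N(mu*1 + beta_k*x_k, sigma_g2*K + sigma_e2*I)."""
    N = len(y)
    V = sigma_g2 * K + sigma_e2 * np.eye(N)
    resid = y - mu - beta_k * x_k
    _, logdet = np.linalg.slogdet(V)
    quad = resid @ np.linalg.solve(V, resid)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

X = np.array([[ 1.0, -1.0,  0.5],
              [-1.0,  1.0,  0.5],
              [ 0.0,  0.0, -1.0]])
K = kinship(X)
x_k = X[:, 0]
y = 2.0 + 0.3 * x_k  # noise-free toy phenotype

ll_fit = mixed_model_loglik(y, x_k, mu=2.0, beta_k=0.3, sigma_g2=0.0, sigma_e2=1.0, K=K)
ll_null = mixed_model_loglik(y, x_k, mu=2.0, beta_k=0.0, sigma_g2=0.0, sigma_e2=1.0, K=K)
```

Comparing the maximized value of this function with and without the restriction that the effect of variant *k* is 0 gives the likelihood ratio used to test the variant.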

A major development related to the mystery of missing heritability was when the connection was made between the mixed model estimates $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$ and heritability. In a seminal paper, it was pointed out that these estimates from GWAS data for a population cohort can be used to estimate the heritability.^{32} We refer to this estimate as $h_M^2$ where

$$h_M^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_g^2 + \hat{\sigma}_e^2} \tag{8}$$

This method was applied to estimate the heritability of height from the full set of GWAS data and obtained an estimate of 0.45, which is dramatically higher than the estimate from the results of the association studies (the GWAS-explained heritability, $h_G^2$), which was 0.05. This study suggests the common variants capture a much larger portion of the heritability than just the associated variants, which provides strong support that the main cause of missing heritability is simply many variants with very small effects spread throughout the genome.
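A minimal sketch of this kind of estimate is given below. It is not the method of the cited paper: it assumes a plain maximum likelihood fit with no tested variant in the model, profiles out the mean and the total variance, and searches a grid over the variance ratio after rotating the data by the eigenvectors of the kinship matrix. The function name and the simulated cohort are hypothetical.

```python
import numpy as np

def estimate_mixed_model_heritability(X, y, grid=np.linspace(0.0, 0.99, 100)):
    """Grid-search ML estimate of sigma_g^2 / (sigma_g^2 + sigma_e^2).

    Model: y ~ N(mu*1, sigma^2 * (h*K + (1 - h)*I)) with K = X X^T / M.
    """
    N, M = X.shape
    K = X @ X.T / M
    s, U = np.linalg.eigh(K)  # rotating by U makes the covariance diagonal
    y_rot = U.T @ y
    one_rot = U.T @ np.ones(N)
    best_h, best_ll = 0.0, -np.inf
    for h in grid:
        d = h * s + (1.0 - h)  # diagonal covariance, up to sigma^2
        mu = np.sum(one_rot * y_rot / d) / np.sum(one_rot ** 2 / d)
        resid = y_rot - mu * one_rot
        sigma2 = np.mean(resid ** 2 / d)
        ll = -0.5 * (N * np.log(2.0 * np.pi * sigma2) + np.sum(np.log(d)) + N)
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h

# Simulated cohort in which every variant has a small effect and h^2 = 0.5.
rng = np.random.default_rng(3)
N, M, h2_true = 1000, 500, 0.5
G = rng.binomial(2, 0.3, size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)  # standardized by sample mean and std
beta = rng.normal(0.0, np.sqrt(h2_true / M), size=M)
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2_true), size=N)

h2_hat = estimate_mixed_model_heritability(X, y)
```

No individual variant in this simulation would pass a genome-wide significance threshold with high probability, yet the variance components still recover a heritability near the simulated value.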

Around the same time, another study showed that if the criterion for including variants in a prediction model is relaxed, so that the significance threshold is less stringent than in a standard GWAS, the predictions of the model are more accurate.^{21} In this study, not only significantly associated variants but also variants that had weaker effects were included in the model, and the resulting predictive model showed better performance when it was evaluated using cross-validation. This further suggests many weakly associated variants are contributing to the missing heritability. This concept of including more variants in the predictive model is analogous to the trade-off related to prediction accuracy and overfitting when including more features in a machine learning classifier.

While mixed model approaches are a step toward understanding the mystery of missing heritability, there are still many unanswered questions. There is still a significant discrepancy between estimates from related individuals and mixed model estimates. For example, height is estimated to have a heritability of 0.8, while mixed models only explain 0.45. One theory argues the remaining portion of the missing heritability can be explained by rare variants having small effects that are poorly correlated with the common variants used to compute kinship matrices.^{32} Other theories postulate that interactions between variants account for the remaining missing heritability.^{3,35} Additional questions are related to the fact that the interpretation of the mixed model estimate of heritability is only correct under the assumption that only the causal variants are used for estimating the kinship matrices.^{35} Unfortunately, which variants are causal is unknown and various approaches have been proposed to address this issue.^{6}

The developments in mixed models provide interesting opportunities for phenotype prediction, which is a problem with a rich history in genetics, particularly in the literature on the best linear unbiased predictor (BLUP).^{9,19,24} Consider the scenario where we have a population of individuals with known phenotypes *y* and genotypes *X.* Given a new individual’s genome *x**, we can predict the individual’s phenotype *y** using mixed models. In order to make predictions, we first estimate the parameters of the mixed model, $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$. We then compute the kinship values between the new individual and the set of individuals with known genotypes and phenotypes. We can then treat the new individual’s phenotype as missing and compute the most likely value for this phenotype value given the mixed model likelihood value.
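Under the mixed model, the most likely value of the missing phenotype is the conditional mean of a multivariate normal distribution, which gives a compact predictor. The sketch below assumes the mean and variance components have already been estimated and passes them in as plain numbers. The function name and the toy data are hypothetical.

```python
import numpy as np

def blup_predict(X, y, x_new, mu, sigma_g2, sigma_e2):
    """Predict a new phenotype as mu + sigma_g2 * k^T V^{-1} (y - mu).

    k holds the kinship between the new individual and each study
    individual, and V = sigma_g2 * K + sigma_e2 * I.
    """
    N, M = X.shape
    K = X @ X.T / M
    k_new = X @ x_new / M
    V = sigma_g2 * K + sigma_e2 * np.eye(N)
    return mu + sigma_g2 * float(k_new @ np.linalg.solve(V, y - mu))

X = np.array([[ 1.0, -1.0],
              [-1.0,  1.0],
              [ 0.5,  0.5]])
y = np.array([171.0, 169.0, 170.0])
x_new = np.array([1.0, -1.0])  # same genotypes as the first study individual

y_star = blup_predict(X, y, x_new, mu=170.0, sigma_g2=0.5, sigma_e2=0.5)
```

The prediction is pulled toward the phenotype of the genetically identical study individual (171) but shrunk toward the mean (170), because part of that individual's phenotype is attributed to the environment.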

### The Future of Phenotype Prediction

Phenotype prediction from genetic information is currently an active area of research. Clearly phenotype prediction using only associated variants ignores the information from the polygenic score obtained from mixed models and only leverages the information from the portion of the heritability that is accounted for in GWASes. However, using only the polygenic score from mixed models ignores variants that are clearly involved in the trait. Several strategies are utilizing both types of information by first utilizing the associated SNPs and then using a polygenic score from the rest of the genome.^{22,33} However, even these combined strategies seem to be missing out on information because variants that are just below the significance threshold have a higher chance of having an effect on the phenotype than other variants, yet all variants are grouped together when estimating the kinship matrix and the polygenic score from variants that are not associated. This problem is closely related to the standard classification problem widely investigated in the machine learning community.

Phenotype and genotype data for massive numbers of individuals is widely available. The actual disease study datasets are available through a National Center for Biotechnology Information database called the database of Genotypes and Phenotypes (dbGaP) available at http://www.ncbi.nlm.nih.gov/gap. Virtually all U.S. government-funded GWASes are required to submit their data into the dbGaP database. A similar project, the European Genome-Phenome Archive (EGA) hosted by the European Bioinformatics Institute (EBI) is another repository of genome wide association study data available at https://www.ebi.ac.uk/ega/. For both of these databases, investigators must apply for the data in order to guarantee they comply with certain restrictions on the use of the data due to the inherent privacy and ethical concerns. Hundreds of large datasets are available through both of these resources.

This computational challenge (as well as other computational challenges in human genetics listed in Appendix 2, available online) will have a great impact on human health and provide tremendous opportunities for important contributions from computer scientists.
