Evolution. Its Complicated.: Tajima's D test for nonneutral evolution

One of the classic statistical tests for nonneutral evolution is Tajima's D test. This will get a little math-y, but I hope to give you some intuitive understanding of the test in the process. This, and most other tests, are typically devised to try to find evidence of selection. More modern tests are much more powerful at detecting selection, but this is one of the classics so I figured I should cover it.

Heterozygosity (H) is a metric that measures how different individuals are from each other, on average, over a given chunk of their genome. Homozygosity (G) is defined as 1-H, or how similar individuals are to each other. Mathematically, heterozygosity is the probability of randomly sampling two alleles from a population and having them be different.
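To make that definition concrete, here is a minimal Python sketch. The function name `heterozygosity` and the four-copy toy sample are my own illustrative choices, and the calculation treats the observed allele frequencies as if they were the true population frequencies.

```python
from collections import Counter

def heterozygosity(alleles):
    """Probability that two copies drawn at random (with replacement)
    from the given frequencies carry different allele states."""
    n = len(alleles)
    counts = Counter(alleles)
    # Homozygosity G: chance that both draws have the same state.
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return 1.0 - homozygosity  # H = 1 - G

# Toy sample (made up): three copies carry "A", one carries "T".
sample = ["A", "A", "A", "T"]
print(heterozygosity(sample))  # 1 - (0.75**2 + 0.25**2) = 0.375
```

If every copy carries the same state, the function returns 0, which is the fixed case discussed next.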

Definition: "Allele" can be used in two ways. 1) A particular copy of a gene that is under consideration. For example, every genetically normal human has two copies of every gene, and therefore two alleles of every gene. 2) The state of the gene under consideration. If I have a mutation in my gene that you do not have in yours, I have a different allele than you do. If I don't have the mutation, we have the same allele.

As drift tends to cause the frequency of an allele to go to 0% or 100% (loss or fixation), it also drives heterozygosity to 0. This is because if a mutation becomes fixed in the population, there is a 100% chance that sampling two alleles from the population at that site will sample the same allele state, and therefore heterozygosity at that site is 0. However, mutation increases heterozygosity, as a mutation in a previously fixed site will take the probability of sampling two different allele states at that site from 0 to some small value.

Therefore, drift and mutation oppose each other in terms of heterozygosity.
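If you want to see the drift half of this tug-of-war, here is a small simulation sketch. It assumes a simple Wright-Fisher style model with no mutation (my choice for illustration, the post does not specify a model), and the function name, the 40 gene copies, the 2000 generations and the seed are all made-up values.

```python
import random

def drift_heterozygosity(num_copies, generations, start_freq=0.5, seed=1):
    """Track 2p(1-p) at one site with two allele states under drift alone.

    Assumed model: each gene copy in the next generation is drawn at
    random from the current generation. There is no mutation or selection.
    """
    rng = random.Random(seed)
    p = start_freq
    trajectory = [2 * p * (1 - p)]
    for _generation in range(generations):
        # Count how many of the new copies inherit the focal allele.
        count = sum(1 for _copy in range(num_copies) if rng.random() < p)
        p = count / num_copies
        trajectory.append(2 * p * (1 - p))
    return trajectory

traj = drift_heterozygosity(num_copies=40, generations=2000)
print(traj[0], traj[-1])
```

In a population this small the allele is lost or fixed long before generation 2000, and once that happens heterozygosity stays at 0 because nothing in this model can reintroduce variation. That is the role mutation plays.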

In a population with a constant population size, no migration, random mating and no selection, we can calculate the expected heterozygosity at equilibrium when these forces balance out. This value is

Heq = 4Nu / (1 + 4Nu) for diploids

where N is the number of individuals in the population, and u is the mutation rate per individual per site of the genome per generation.

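We define θ = 4Nu as one of the fundamental parameters of molecular evolutionary theory. It is proportional to the number of mutations entering the population at a particular site every generation: a diploid population of N individuals carries 2N copies of the site, so about 2Nu new mutations arise per generation, and θ is twice that.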
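Here is a quick numerical sketch of the two formulas above. The function names are mine, and the population sizes and the mutation rate of 1e-8 are arbitrary illustrative values, not estimates for any real species.

```python
def theta(N, u):
    """Population mutation parameter for diploids: theta = 4Nu."""
    return 4 * N * u

def equilibrium_heterozygosity(N, u):
    """Heq = 4Nu / (1 + 4Nu), the balance point between drift and mutation."""
    t = theta(N, u)
    return t / (1 + t)

# Hypothetical values: same mutation rate, three population sizes.
u = 1e-8
for N in (1_000, 100_000, 10_000_000):
    print(N, theta(N, u), equilibrium_heterozygosity(N, u))
```

Note that when θ is small, Heq is very close to θ itself, because the denominator 1 + θ is close to 1.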

Therefore, if there is no violation of the above assumptions (constant population size, no selection, no migration, random mating etc), we expect the heterozygosity in a population to be equal to Heq. N and u are hard to measure experimentally, but it is far easier to estimate the heterozygosity in the population.

It can be mathematically shown that, under these assumptions, the expected value of the average site heterozygosity across a stretch of genome (the average of 2 * p * (1-p), where p is the frequency of the allele at the given site) is θ. When p is measured in a sample rather than in the whole population, this average slightly underestimates the population value, so we correct it by multiplying by a factor of n/(n-1), where n is the number of sequences sampled. We call this estimate θpi, or the θ estimated from site heterozygosities.
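As a sketch of how this estimate could be computed from aligned sequences, here are two equivalent routes: averaging the number of differences over all pairs of sequences, and summing the site heterozygosities with the n/(n-1) sample correction. The function names and the four toy sequences are made up for illustration. Both functions return the total over the region; divide by the sequence length to get a per-site value.

```python
from itertools import combinations

def theta_pi(sequences):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)

def theta_pi_from_frequencies(sequences):
    """Same quantity from site heterozygosities, corrected by n/(n-1)."""
    n = len(sequences)
    total = 0.0
    for column in zip(*sequences):  # one alignment column per site
        freqs = [column.count(state) / n for state in set(column)]
        total += 1.0 - sum(f * f for f in freqs)  # equals 2p(1-p) for two states
    return total * n / (n - 1)

# Toy alignment (made up): 4 sequences, 5 sites, 2 of them variable.
seqs = ["ATGCA", "ATGCT", "ACGCA", "ACGCA"]
print(theta_pi(seqs), theta_pi_from_frequencies(seqs))
```

The two numbers agree (7/6 for this toy alignment, up to floating point rounding), which is why θpi is often described as the average number of pairwise differences.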

There are actually a number of ways of estimating θ, of which this heterozygosity based estimate is one. We can also estimate θ by the number of sites that are segregating (not fixed for a mutation) in a sample.

This formula is:

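θs = number of segregating sites / (1 + 1/2 + 1/3 + 1/4 + ... + 1/(n-1)), where n is again the number of sequences sampled from the population. To compare θs with θpi, both need to be on the same scale: either both per site, or both summed over the whole region.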
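A minimal sketch of this estimator, again with made-up function names and a made-up toy alignment. It returns the total over the region rather than a per-site value.

```python
def segregating_sites(sequences):
    """Number of alignment columns where the sample is not all one state."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)

def theta_s(sequences):
    """Segregating sites divided by 1 + 1/2 + ... + 1/(n-1)."""
    n = len(sequences)
    harmonic = sum(1.0 / i for i in range(1, n))
    return segregating_sites(sequences) / harmonic

# Toy alignment (made up): sites 2 and 5 are segregating.
seqs = ["ATGCA", "ATGCT", "ACGCA", "ACGCA"]
print(segregating_sites(seqs), theta_s(seqs))
```

Notice that this estimate only counts whether a site varies. It ignores how common each variant is, which is exactly the information θpi uses.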

Under our assumptions, we expect these estimates of θ to be equal.

We compute the test statistic, Tajima's D as

D = (θpi - θs) / C, where C is a normalizing constant whose computation is somewhat complex and unnecessary to go through here.
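For the curious, here is a sketch of the whole calculation, including C. The constants follow the variance formula as I understand it from Tajima (1989), so check them against the paper before relying on this. The function and argument names are my own. Here `pi` is the average number of pairwise differences summed over the region, and `num_segregating` is the number of segregating sites.

```python
import math

def tajimas_d(n, num_segregating, pi):
    """Tajima's D from sample size n, segregating sites S, and theta_pi."""
    if n < 4 or num_segregating < 1:
        # For n < 4 the normalizing constant is zero, so D is undefined.
        raise ValueError("need at least 4 sequences and 1 segregating site")
    S = num_segregating
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_s = S / a1
    # C: estimated standard deviation of (theta_pi - theta_s).
    C = math.sqrt(e1 * S + e2 * S * (S - 1))
    return (pi - theta_s) / C

# Hypothetical sample: 10 sequences, 16 segregating sites,
# about 3.89 pairwise differences on average.
print(tajimas_d(10, 16, 3.888889))
```

For these made-up inputs θs is about 5.66, well above θpi, so D comes out negative.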

If D is significantly different from zero, it suggests that at least one of these assumptions has been violated.

If D is significantly greater than 0, then θpi is much greater than θs, indicating excess heterozygosity for the number of segregating sites. Most mutations are common, and there are not enough rare mutations. This could be due to selection which promotes the maintenance of multiple alleles at a site for a long time, or due to the mixture of two different populations at equal proportions.

If D is significantly less than 0, then heterozygosity is too low for the number of segregating sites, indicating too many rare alleles. This could be the case if the population recently expanded in size (for example, while recovering from a bottleneck), or after a recent selective sweep which removed most of the variation in the region, so that the variants present now are new and still rare.
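To see how the sign of D tracks the mix of rare and common variants, here is a toy comparison. Both alignments are made up and have the same number of segregating sites. In one, every variant appears in a single sequence. In the other, every variant is carried by half the sample. The function name is mine, and the normalizing constants are again my reading of Tajima (1989).

```python
import math
from itertools import combinations

def tajimas_d_from_alignment(sequences):
    """Tajima's D computed directly from equal-length aligned sequences."""
    n = len(sequences)
    # S: number of columns with more than one allele state.
    S = sum(1 for col in zip(*sequences) if len(set(col)) > 1)
    # pi: average number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1 ** 2)
    variance = (c1 / a1) * S + (c2 / (a1 ** 2 + a2)) * S * (S - 1)
    return (pi - S / a1) / math.sqrt(variance)

# Rare variants only: sequences 0 to 4 each carry one private "T".
singletons = ["".join("T" if site == i else "A" for site in range(5))
              for i in range(10)]
# Common variants only: each of the 5 sites is split 5 versus 5.
balanced = ["TTTTT"] * 5 + ["AAAAA"] * 5

print(tajimas_d_from_alignment(singletons))  # negative: too many rare alleles
print(tajimas_d_from_alignment(balanced))    # positive: too many common alleles
```

With samples this tiny the values say nothing about significance. The point is only the direction of the effect.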

There are many other explanations for these values; I have only highlighted two possibilities for each. Significant values of D were originally interpreted as evidence for selection, but it is now clear that simple changes in population structure, such as migration or changes in population size, can also give a significant D. You can also do fancier analyses, like comparing values of D from gene-rich regions to values from gene-poor regions. The idea here is that gene-poor regions will likely not be subject to selection, while gene-rich regions will be. You can then use gene-poor regions to control for the effects of changes in population structure (migration, bottlenecks etc) and isolate the effects of selection in gene-rich regions.

It is worth noting that this test was invented in 1989, well before whole genome sequencing was a possibility. People used to sequence a single gene from a few dozen individuals to perform this test. With modern sequencing capacity, we can now devise more powerful genome-wide approaches that are much more sensitive to the impact of selection.

This test was extended by Fay and Wu, giving Fay and Wu's H statistic. They use a third estimate of θ that incorporates sequence data from an outgroup species (a species that is more distantly related to every individual in the sample than any pair of individuals in the sample are to each other). This is meant to help you tell apart the effect of population size changes from the effect of selection on the sample sequences. (Full disclosure: Justin Fay was my undergrad thesis adviser).

The articles linked below are publicly available.

Gillespie. "Population Genetics, a Concise Guide 2nd ed" 2004 Johns Hopkins University Press
Tajima F. "Statistical method for testing the neutral mutation hypothesis by DNA polymorphism" 1989 Genetics http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1203831/pdf/ge1233585.pdf
Fay and Wu. "Hitchhiking under positive Darwinian selection." 2000 Genetics. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1461156/pdf/10880498.pdf


Tuesday, April 16, 2013
