Evolution. Its Complicated.: The McDonald-Kreitman Test

Sorry for taking last week off! I was feeling lazy.

As one of my lab members recently published a paper detailing an improved version of this test, I figured I should make the test the subject of this week's blog post.

The original paper detailing the test is:

McDonald and Kreitman "Adaptive protein evolution at the Adh locus in Drosophila" Nature 1991.
http://www.nature.com/nature/journal/v351/n6328/abs/351652a0.html
pdf:
http://www.ecologia.unam.mx/laboratorios/evolucionmolecular/viejo/talleresycursos/arts/McDonald-Kreitman_1991.pdf

And the paper by my lab mate is:
Messer and Petrov. "Frequent adaptation and the McDonald-Kreitman test" Proc. Natl. Acad. Sci. 2013
http://www.pnas.org/content/early/2013/05/01/1220835110.abstract
prepublication PDF: http://arxiv.org/abs/1211.0060

The McDonald-Kreitman test (M-K test) is one of the classic tests for adaptive evolution in genomes. The logic of the test works as follows:

Proteins are encoded by DNA using a triplet code that converts a group of three DNA positions into a single amino acid. Each triplet of DNA bases is called a codon, and this conversion is almost perfectly conserved across all life. Since there are 64 possible triplet codons but only 20 amino acids plus the signal that ends a protein (the stop codons), most of the amino acids are coded for by multiple codons.

The human codon table: http://www.biogem.org/codon.jpg

You'll notice in this table that the codons coding for the same amino acid are not a random assortment. In particular, there are many cases where the first two bases of the codon specify the amino acid, and the 3rd position can be any one of the bases. For example, the amino acid glycine can be coded for by the codons GGU, GGC, GGA and GGG.

We can then classify mutations in protein coding regions by whether they change the amino acid sequence of the protein or not. Those that change the sequence are called nonsynonymous mutations, and those that don't are called synonymous. Most synonymous mutations happen at the 3rd position of the codon, while those in the first two are typically nonsynonymous.

We can then categorize every base in a protein-coding gene as a synonymous site (every possible mutation there is synonymous), a nonsynonymous site (every possible mutation there is nonsynonymous), or a mixed site that can have both kinds of mutation. For the purposes of this explanation we will ignore the mixed sites.
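Here is a small sketch of that classification in Python. It assumes the standard genetic code written in RNA letters; the function names `is_synonymous` and `classify_site` are my own for illustration and don't come from any particular library.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# UCAG order at each position ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon, position, new_base):
    """True if changing codon[position] to new_base keeps the same amino acid."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[mutant] == CODON_TABLE[codon]


def classify_site(codon, position):
    """Label one codon position using its three possible single-base changes."""
    outcomes = [
        is_synonymous(codon, position, base)
        for base in BASES
        if base != codon[position]
    ]
    if all(outcomes):
        return "synonymous"
    if not any(outcomes):
        return "nonsynonymous"
    return "mixed"


# The four glycine codons differ only at the 3rd position.
glycine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "G")
print(glycine_codons)

# Classify each of the three positions of the glycine codon GGU.
ggu_sites = [classify_site("GGU", i) for i in range(3)]
print(ggu_sites)

# The first position of the leucine codon UUA is a mixed site:
# CUA is still leucine, but AUA and GUA are not.
print(classify_site("UUA", 0))
```

Note that a change to a stop codon is counted as nonsynonymous here, which is a simplification.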

So what does this have to do with evolution? If a mutation happens at a synonymous site, it does not change the protein sequence, and to a first approximation, is a completely neutral mutation and will not experience selection. On the other hand, a mutation at a nonsynonymous site will change the protein sequence, and is likely to be selected on. I will get into caveats of this logic later.

There is an additional layer to this, as a mutation could be fixed (at 100% frequency in the population), or still at intermediate frequency in the population. In practice, you tell the two apart by sequencing a number of individuals from the population you want to analyze, plus one individual from a related but distinct species (e.g. sample 100 humans and 1 chimp). If the humans differ from each other at a site (e.g. 50 humans have an "A" and the other 50 have a "C"), the site is polymorphic, while if all humans share a base that is different from the other species (all humans have an "A" while the chimp has a "G"), the site is a fixed difference.
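As a toy illustration of that bookkeeping, here is a sketch that labels each column of a tiny alignment. The sequences are made up, and the function name `classify_alignment_columns` is hypothetical. It assumes the sequences are already aligned and gap-free, and it ignores complications like sequencing errors or sites that are also polymorphic in the other species.

```python
def classify_alignment_columns(ingroup, outgroup):
    """Label each site as 'polymorphic', 'fixed' or 'invariant'.

    ingroup:  list of equal-length sequences from the population of interest
    outgroup: one sequence of the same length from the related species
    """
    labels = []
    for i, outgroup_base in enumerate(outgroup):
        ingroup_bases = {seq[i] for seq in ingroup}
        if len(ingroup_bases) > 1:
            # Individuals within the population differ from each other.
            labels.append("polymorphic")
        elif outgroup_base not in ingroup_bases:
            # Everyone in the population agrees, but the outgroup differs.
            labels.append("fixed")
        else:
            labels.append("invariant")
    return labels


# Made-up data: four "humans" and one "chimp".
humans = ["ACGA", "ACGC", "ACGA", "ACGC"]
chimp = "ATGA"
site_labels = classify_alignment_columns(humans, chimp)
print(site_labels)
```

Combining these labels with the synonymous/nonsynonymous label of each site gives the four counts the test needs.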

The M-K test uses this logic to compare the mutations at nonsynonymous sites with those at synonymous sites. In particular, it tests for ancient selection acting over long periods of time, and is underpowered for studying recent or fast selective events. It computes the ratio of fixed differences to polymorphic differences at both nonsynonymous and synonymous sites. If the nonsynonymous ratio is much higher than the synonymous one, there is an excess of nonsynonymous fixed differences, which indicates positive selection, while a deficit indicates negative selection.

The test itself is a G test (similar to a chi-squared test) on a 2x2 table of four counts, where P = polymorphism count and D = fixed difference count:

              Nonsynonymous   Synonymous
Polymorphic   Pnonsyn         Psyn
Fixed         Dnonsyn         Dsyn
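To make the test concrete, here is a minimal G test of independence for a 2x2 table using only the standard library. The function name `g_test_2x2` is my own. The counts are the ones usually quoted for the Adh data in the 1991 paper (2 and 42 polymorphic, 7 and 17 fixed), so check them against the paper before relying on them. No Williams or continuity correction is applied, so the statistic will differ a little from values that use one.

```python
import math


def g_test_2x2(p_nonsyn, p_syn, d_nonsyn, d_syn):
    """G test of independence on the M-K table; returns (G, p_value)."""
    observed = [[p_nonsyn, p_syn], [d_nonsyn, d_syn]]
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]

    g = 0.0
    for i in range(2):
        for j in range(2):
            count = observed[i][j]
            expected = row_sums[i] * col_sums[j] / total
            if count > 0:  # a zero cell contributes nothing to the sum
                g += 2 * count * math.log(count / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error

    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2))
    return g, p_value


g_stat, p_value = g_test_2x2(p_nonsyn=2, p_syn=42, d_nonsyn=7, d_syn=17)
print(f"G = {g_stat:.2f}, p = {p_value:.4f}")
```

A small p-value says the fixed-to-polymorphic ratio differs between the two classes of site. You still have to look at the counts to see in which direction.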

Smith and Eyre-Walker ("Adaptive protein evolution in Drosophila", Nature 2002) proposed an extension to this idea to estimate the proportion of fixations that have occurred by natural selection as:

alpha = 1 - (Ds * Pn) / (Dn * Ps)

where n and s stand for nonsynonymous and synonymous. In words, alpha is 1 minus the D/P ratio at synonymous sites divided by the D/P ratio at nonsynonymous sites.

If Dn/Pn > Ds/Ps, alpha is positive and between 0 and 1, so there has been positive selection, while if the opposite is true, alpha is negative and there has been negative selection.
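The formula is simple enough to sketch directly. The function name `mk_alpha` is mine. The first call reuses the single-gene Adh counts commonly quoted from the 1991 paper purely as an illustration (real alpha estimates pool many genes), and the second uses made-up counts to show the sign flipping.

```python
def mk_alpha(d_nonsyn, d_syn, p_nonsyn, p_syn):
    """alpha = 1 - (Ds * Pn) / (Dn * Ps)."""
    if d_nonsyn == 0 or p_syn == 0:
        raise ValueError("alpha is undefined when Dn or Ps is zero")
    return 1 - (d_syn * p_nonsyn) / (d_nonsyn * p_syn)


# Dn/Pn = 7/2 is larger than Ds/Ps = 17/42, so alpha lands between 0 and 1.
alpha_excess = mk_alpha(d_nonsyn=7, d_syn=17, p_nonsyn=2, p_syn=42)
print(round(alpha_excess, 3))

# Made-up counts with the opposite pattern give a negative alpha.
alpha_deficit = mk_alpha(d_nonsyn=2, d_syn=42, p_nonsyn=7, p_syn=17)
print(round(alpha_deficit, 3))
```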

By this metric, 45% of fixation events in Drosophila, and 35% of events in primates have been fixed by selection.

As mentioned, there are a number of caveats to this approach:

1) This approach is based on the assumption of neutral theory that most evolution is random, without selective effects. Recent data suggest that selection happens extremely frequently, so the approach makes invalid approximations and assumptions. Specifically, the polymorphic sites, which the M-K test expects to be neutral, are likely not.
2) Synonymous and nonsynonymous sites, by virtue of being in the same gene, are physically very close to each other on the genome. Linkage, by which sites close to each other change in frequency together, means that if one site is selected on, all of the neighboring sites are dragged along with it. A single selective event will change the frequencies of all of the other sites in the area, including both synonymous and nonsynonymous mutations. This process is called genetic draft, and it seriously messes up the M-K test.
3) Slightly deleterious mutations are frequent and tend to be removed from the population, causing linked sites, including beneficial mutations, to decrease in frequency. This process is called background selection, and it further messes up the test.

In sum, linkage and the high frequency of selective events break the assumption of independence between sites that the M-K test depends on. This leads to chronic underestimates of the proportion of selected sites, as the supposedly neutral synonymous sites experience selection along with the selected sites.

An additional complication is codon bias, which means that mutations at synonymous sites are not truly neutral. Biochemically, different synonymous codons are more or less efficient for the protein synthesis machinery to process, and for proteins that need to be made in huge quantities (like hemoglobin in blood cells), efficiency is extremely important and is selected on. To get around this, some studies use non-protein-coding DNA regions that are typically not thought to have any function (like short intronic regions) instead of synonymous sites as their reference.

To circumvent some of these problems, Messer and Petrov compute alpha while binning the polymorphic sites by their frequency. This gets them an asymptotic estimate of the true alpha, which is much closer to the alpha fed into their simulation code than the estimates from older methods. When applying this to human data, they get an asymptotic MK estimate of 13%, and for flies an estimate of 57%. The paper also has a good (more detailed) summary of what the problems with the MK test actually are.
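To give a feel for the binning idea, here is a rough sketch that computes alpha separately for each frequency bin. The function name `alpha_by_frequency` and all of the counts are made up for illustration, and this is not the authors' code. It only covers the binning step: the asymptotic estimate comes from extrapolating the per-bin values towards high frequency, and that fitting step is left out here.

```python
def alpha_by_frequency(d_nonsyn, d_syn, binned_polymorphism):
    """Compute alpha for each frequency bin of polymorphic sites.

    binned_polymorphism: list of (frequency, p_nonsyn, p_syn) tuples, where
    the counts are the polymorphisms whose frequency falls in that bin.
    Returns a list of (frequency, alpha) pairs.
    """
    results = []
    for frequency, p_nonsyn, p_syn in binned_polymorphism:
        if p_syn == 0:
            continue  # alpha is undefined for this bin
        alpha_x = 1 - (d_syn / d_nonsyn) * (p_nonsyn / p_syn)
        results.append((frequency, alpha_x))
    return results


# Hypothetical counts. Slightly deleterious nonsynonymous mutations pile up
# at low frequency, so Pn/Ps shrinks as the frequency goes up.
bins = [
    (0.1, 400, 500),
    (0.3, 150, 250),
    (0.5, 80, 160),
    (0.7, 45, 100),
    (0.9, 30, 75),
]
alpha_curve = alpha_by_frequency(d_nonsyn=300, d_syn=600, binned_polymorphism=bins)
for frequency, alpha_x in alpha_curve:
    print(f"frequency {frequency:.1f}: alpha = {alpha_x:+.2f}")
```

With these invented numbers, alpha climbs from negative values in the low-frequency bins to positive values in the high-frequency ones, which is the kind of trend the binning is meant to expose.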

In any event, selection seems to be a very common part of genome evolution.


Tuesday, May 7, 2013
