Harvard-MIT Division of Health Sciences and Technology
HST.508: Quantitative Genomics, Fall 2005
Instructors: Leonid Mirny, Robert Berwick, Alvin Kho, Isaac Kohane
Prepared by Professor Robert Berwick.

Notes on population genetics and evolution: “Cheat sheet” for review

1. Genetic drift

Terminology. Genetic drift is the stochastic fluctuation in allele frequency due to random sampling in a population. Polymorphism describes sites (nucleotide positions, etc.) variable within a species; divergence describes sites variable between species.

1.1 Wright-Fisher model. The Wright-Fisher model describes the process of genetic drift within a finite population. The model assumes:

1. N diploid organisms (so, 2N gametes)
2. Monoecious reproduction with an infinite number of gametes (no sexual recombination)
3. Non-overlapping generations
4. Random mating
5. No mutation
6. No selection

The Wright-Fisher model assumes that the ancestors of the present generation are obtained by random sampling with replacement from the previous generation. Looking forward in time, consider the familiar starting point of classical population genetics: two alleles, A and a, segregating in the population. Let i be the number of copies of allele A, so that N − i is the number of copies of allele a. Thus the current frequency of A in the population is p = i/N, and the current frequency of a is 1 − p. We assume that there is no difference in fitness between the two alleles, that the population is not subdivided, and that mutations do not occur. This gives the familiar formula for the probability that a gene with i copies in the present generation is found in j copies in the next generation:

    P_{ij} = \binom{N}{j} p^j (1 - p)^{N - j}, \qquad 0 \le j \le N

Let the current generation be generation zero and let K_t represent the count of allele A in future generation t. The binomial equation above states that K_1 is binomially distributed with parameters N and p = i/N, given K_0 = i. From standard results in statistics, we know the mean and variance of K_1:

    E[K_1] = Np = i
    Var[K_1] = Np(1 - p)

So, the number of copies of A is expected to remain the same on average, but in fact it may take any value from zero to N. A particular variant may become extinct (go to zero copies) or fix (go to N copies) in the population even in a single generation. Over time, the frequency of A will drift randomly according to the Markov chain with transition probabilities given by the above formula, and eventually one or the other allele will be lost from the population.

Perhaps the easiest way to see how the Wright-Fisher binomial sampling model works is through a biologically motivated example. Imagine that before dying each individual in the population produces a very large number of gametes. However, the population size is tightly controlled, so that only N of these can be admitted into the next generation. The frequency of allele A in the gamete pool will be i/N, and because there are no fitness differences, the next generation is obtained by randomly choosing N alleles. The connection to the binomial distribution is clear: we perform N trials, each with p = i/N chance of success. Because the gamete pool is so large, we assume it is not depleted by this sampling, so the probability i/N is still the same for each trial. The distribution of the number of A alleles in the next generation is the binomial distribution with parameters (N, i/N), as expected.
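The binomial sampling step just described is easy to simulate. The following minimal Python sketch (not part of the original notes; the function and parameter names are invented for illustration) draws each new generation's count of allele A from a Binomial(N, i/N) distribution and stops when the allele is lost or fixed.

```python
import numpy as np

def wright_fisher(N, i0, generations, rng=None):
    """Sketch of Wright-Fisher drift: allele-A counts under binomial resampling.

    N           -- number of gene copies sampled each generation (as in P_ij above)
    i0          -- initial number of copies of allele A
    generations -- maximum number of generations to simulate
    """
    rng = rng or np.random.default_rng()
    counts = [i0]
    i = i0
    for _ in range(generations):
        p = i / N                      # current frequency of allele A
        i = rng.binomial(N, p)         # next generation ~ Binomial(N, p)
        counts.append(i)
        if i == 0 or i == N:           # loss or fixation: drift has run its course
            break
    return counts

# Example run: 100 gene copies, starting with allele A at frequency 0.5
trajectory = wright_fisher(N=100, i0=50, generations=1000)
print("final count of A:", trajectory[-1], "after", len(trajectory) - 1, "generations")
```

Repeating such runs shows the mean count staying near the starting value while individual trajectories wander to loss or fixation, as described above.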
The decay of heterozygosity. Before we take up the backward, ancestral process for the Wright-Fisher model, we will look at the classical forward derivation. The heterozygosity of a population is defined to be the probability that two randomly sampled gene copies are different. For a randomly mating diploid population, this is equivalent to the chance that an individual is heterozygous at a locus. Let the current generation be generation zero, and let p_0 be the frequency of A now. The heterozygosity of the population now is H_0 = 2 p_0 (1 - p_0), the binomial chance that one allele A (and one a) is chosen in two random draws. Let the random variable P_t represent the frequency of A in future generation t. Then, as we have seen in earlier lectures, in the next generation the heterozygosity will have changed to H_1 = 2 P_1 (1 - P_1). However, H_1 will vary depending on the random realization of the process of genetic drift. On average, heterozygosity (variation) will be lost through drift:

    E[H_1] = E[2 P_1 (1 - P_1)] = 2\left(E[P_1] - E[P_1]^2 - Var[P_1]\right) = 2 p_0 (1 - p_0)\left(1 - \frac{1}{2N}\right) = H_0 \left(1 - \frac{1}{2N}\right)

In the haploid case, we replace 2N by N. After t generations, we have:

    E[H_t] = H_0 \left(1 - \frac{1}{2N}\right)^t \approx H_0 e^{-t/(2N)}

The approximation is valid for large N. Thus, as we've seen, in the Wright-Fisher model heterozygosity decays at rate 1/N per generation (1/2N if diploid). The decrease of …

… sequences, etc. are drawn from a single species. (This is important for some of the statistical calculations testing for selection, below.) In general, T_i is the time until the coalescence of i lineages (genes, alleles, sequences, …). That is, after coalescence, the two genes are identical. We are interested in the distribution of the ‘waiting times’ until each coalescence, as well as the variance of these times, and, further, the expected waiting time and the total waiting time until all lineages have collapsed into a single common ancestor. It turns out that all this can be described as a stochastic process with rather simple properties. Note that each coalescent event is independent of all others – the waiting times are independent.

3.1 Basic results. Measured in discrete time, in a Wright-Fisher population of size 2N the distribution of waiting times until the collapse (coalescence, identity) of two sequences is geometric with probability of success p (= coalescence) = 1/(2N) in any one generation, and so probability of failure (= not coalescing) 1 - p = 1 - 1/(2N). (Note the close relation between this and the heterozygosity computation.) It is easy to see that the waiting times form a geometric distribution by considering the probability that up until time t a coalescent event has not occurred, as the product of t ‘not coalescing’ events, just as with the heterozygosity iteration. If we let P(T_2 > t) denote the probability that two lineages have not coalesced by time t = 0, 1, 2, …, then this is simply:

    P(T_2 > t) = \left(1 - \frac{1}{2N}\right)^t, \qquad t = 0, 1, 2, \ldots

so the probability that two lineages collapse at exactly time step t + 1 is:

    P(T_2 = t + 1) = P(T_2 > t) - P(T_2 > t + 1)
                   = \left(1 - \frac{1}{2N}\right)^t - \left(1 - \frac{1}{2N}\right)^t \left(1 - \frac{1}{2N}\right)
                   = \left(1 - \frac{1}{2N}\right)^t \left(1 - 1 + \frac{1}{2N}\right)
                   = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^t

And this is clearly a geometric distribution.

3.1.1 A very, very intuitive picture. We can gain a very intuitive picture of the same process by the following argument. We start by considering the coalescence time in a sample of two genes.
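As a numerical check on the geometric waiting-time result, here is a small sketch (added for illustration, not from the original notes; names are invented) that draws pairwise coalescence times with success probability 1/(2N) and compares the sample mean with 2N and the tail with (1 - 1/2N)^t.

```python
import numpy as np

def pairwise_coalescence_times(two_N, n_reps, rng=None):
    """Draw T_2 ~ Geometric(1/(2N)): generations until two lineages coalesce."""
    rng = rng or np.random.default_rng()
    # numpy's geometric is supported on {1, 2, ...}, matching
    # P(T_2 = t) = (1/2N) * (1 - 1/2N)**(t - 1)
    return rng.geometric(1.0 / two_N, size=n_reps)

two_N = 2 * 500                       # e.g. N = 500 diploids, so 2N = 1000 gene copies
t2 = pairwise_coalescence_times(two_N, n_reps=200_000)
print("mean T_2     :", round(t2.mean(), 1), " (theory: 2N =", two_N, ")")
print("P(T_2 > 2000):", round((t2 > 2000).mean(), 4),
      " (theory:", round((1 - 1 / two_N) ** 2000, 4), ")")
```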
Genes X and Y live in the present generation, and their common ancestor A lived t generations ago. Consequently, as we look backward from the present into the past, the two lines of descent remain distinct for t generations, at which time they coalesce into a single line of descent. In a given generation, the lines coalesce if the two genes in that generation are copies of a single parental gene in the generation before. Otherwise, the two lines remain distinct.

What can we say about the length of time, t, that they remain distinct? The problem is a lot like the following. Suppose that we are talking about the life-span of a piece of kitchen glassware. Eventually, someone will drop it and it will break. Suppose that the probability of breakage is h per day and its expected lifespan is T days. To see how h and T are related, consider the two things that can happen on day one: the glass either survives the first day or it breaks. It breaks with probability h, and in this case its lifespan is 1 day. It survives with probability 1 - h. Further, for surviving glasses, the mean lifespan is 1 + T. Why? Because a glass doesn't age; its hazard of breakage is always h regardless of how old it is. Consequently, the expected life remaining to a glass does not depend on how old it is. Our one-day-old glass can expect to live T additional days, so its expected lifespan is 1 + T. Putting these facts together gives an expression for T in terms of itself:

    T = h + (1 - h)(1 + T)

So, T = 1/h. (You can also derive the result using calculus.)

Returning to gene lineages, if we knew the ‘hazard,’ h, that the lines of descent will coalesce (or collide) during a generation, then this would tell us immediately the mean number of generations until the two lineages coalesce. But we do know this: if there are G distinct genes in the population, then h = 1/G. More generally, the probability that two genes are identical when drawn from a (diploid) population is 1/(2N).

3.1.2 Results derived from the geometric distribution of ‘waiting times’ until lineages coalesce

A geometric probability distribution may be described by Prob{x = i} = q^{i-1} p, where p is the probability of success on any one trial and q is the probability of failure. From basic statistical theory, we know that the mean of a geometric distribution is just the inverse of the probability of success, here p = 1/(2N), and its variance is q/p^2 = (1 - p)/p^2. So:

(i) The expected value of the time to coalescence for a sample of 2 genes (sequences, …), measured in generations, is just:

    E[T_2] = 2N = \frac{2N}{\binom{2}{2}}

Further, we also know the variance of the geometric in this case:

    Var(T_2) = \frac{1 - \frac{1}{2N}}{\left(\frac{1}{2N}\right)^2} = 2N(2N - 1) \approx 4N^2

Note that the variance is quite large. In general, for n lineages:

(ii) The expected time to coalescence from k to k - 1 lineages is:

    E[T_k] = \frac{4N}{k(k - 1)} = \frac{2N}{\binom{k}{2}}

So, for example, if we have 3 sequences, the time to the first coalescence will be, on average:

    E[T_3] = \frac{4N}{3(3 - 1)} = \frac{2N}{\binom{3}{2}} = \frac{1}{3} \cdot 2N

This makes sense, since for the first coalescence we have (3 choose 2) = 3 possible ways of collapsing 3 sequences together (1st and 2nd; 1st and 3rd; 2nd and 3rd) – there are more cars in the intersection, so a higher chance that they will ‘collide’, and so a lower waiting time until they do coalesce (specifically, 1/3 of the average time when there are only 2 sequences).
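The expected waiting times E[T_k] = 4N/(k(k-1)) can also be checked by simulation. This sketch (added for illustration, with invented names) treats the time while k lineages remain as geometric with success probability C(k,2)/(2N), which assumes at most one coalescence per generation – reasonable when k is much smaller than N.

```python
import numpy as np

def simulate_Tk(k, N, n_reps=100_000, rng=None):
    """Waiting time while k lineages remain: Geometric(C(k,2) / (2N)).

    Each generation, any of the C(k,2) pairs may coalesce, each with prob 1/(2N);
    multiple coalescences in one generation are ignored (fine for k << N).
    """
    rng = rng or np.random.default_rng()
    p = (k * (k - 1) / 2) / (2 * N)      # C(k,2) / (2N)
    return rng.geometric(p, size=n_reps)

N = 1000
for k in (2, 3, 4):
    t = simulate_Tk(k, N)
    print(f"k={k}: simulated mean {t.mean():8.1f}  vs  4N/(k(k-1)) = {4 * N / (k * (k - 1)):8.1f}")
```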
And so on: for four lineages (sequences), we initially have 4-choose-2 = 6 options to collapse, which gives an expected time to first collapse of 2N/6 = (1/6) · 2N, etc.

(iii) The total length of all the branches in the genealogy tree, E[T_tot], which is an important value that we'll use to figure out the expected nucleotide diversity, may be computed as follows:

    E[T_{tot}] = \sum_{i=2}^{n} i \, E[T_i] = \sum_{i=2}^{n} i \, \frac{2N}{\binom{i}{2}} = 4N \sum_{i=1}^{n-1} \frac{1}{i}

(iv) The time to coalescence of all n lineages (the so-called “time to the most recent common ancestor,” MRCA), and so the total expected depth of the coalescent, can be found as follows:

    E[T_{MRCA}] = \sum_{i=2}^{n} E[T_i] = \sum_{i=2}^{n} \frac{2N \cdot 2}{i(i - 1)} = 2N \cdot 2 \sum_{i=2}^{n} \left(\frac{1}{i - 1} - \frac{1}{i}\right) = 2N \cdot 2 \left(1 - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \ldots - \frac{1}{n - 1} + \frac{1}{n - 1} - \frac{1}{n}\right) = 2N \cdot 2 \left(1 - \frac{1}{n}\right)

Note that this expected time is ‘about’ 4N, a bit less by a small factor dependent on the sample size n. Therefore, sampling an (n+1)-st sequence adds only 2/n to what may already be a sizeable number. This has implications for the measurement of DNA sequence polymorphism, which we describe below. Further, the equation for the MRCA means that in generational units of 2N, the time to the MRCA is always very close to its asymptotic value of 2, even for moderate n. Thus, for all but the smallest samples, there will likely be a large number of coalescent events in the very recent history of the sample.

An important point: our model of mutation here is traditionally called the infinite sites model. Note that in doing this computation about neutral mutations and their ultimate ‘effect’ in showing up as segregating sites, via sprinkling on the coalescent branches, we have made implicit use of an assumption: each mutation is at a different site in the sequence, so that each mutation produces a distinct, segregating ‘spot’ on the DNA sequence. Roughly, this is what permits us to equate the number of segregating sites to the simple multiplication of the neutral mutation rate times the expected tree depth. You might want to think through what would happen if we allowed multiple ‘hits’ at the same nucleotide position. If we assume that the mutation rate is, say, 10^-6 – 10^-8 per base pair per replication, and that sequences are of ‘average’ length (like what?), then this assumption does not seem too bad, so the infinite sites model seems OK for sequences.

3.3 Using the coalescent to test hypotheses about nucleotide diversity: Tajima's D

Now we can actually construct a test of the neutral hypothesis, based on two estimators of theta. Another way we have of estimating θ is to just calculate the number of mutations separating individuals two at a time, and average over all pairs. This may be thought of as a sample average used to estimate a population average, and it is a common measure of nucleotide diversity. Denote by

    S_{ij} = the number of mutations separating individuals i and j.

Under the infinite sites assumption, we can calculate S_ij from a sample by counting the number of segregating sites between sequences i and j. If we average S_ij over all pairs (i, j) in a sample of size n, this is called the average number of pairwise differences. We denote this by:

    D_n = \frac{2}{n(n - 1)} \sum_{i < j} S_{ij}

Note that we can think of individuals (i, j) as a sample of size 2, so:

    E[S_{ij}] = E[S_2] = \theta

and so,

    E[D_n] = \frac{2}{n(n - 1)} \sum_{i < j} E[S_{ij}] = \theta

Thus, D_n is another, unbiased estimator of θ, called \hat{\theta}_T. Tajima (1981) was the first to investigate its properties.
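Computing \hat{\theta}_T from data is just bookkeeping over pairs. The sketch below (illustrative, not from the notes; the sequences are hypothetical) counts differing sites for every pair in an aligned sample and applies the D_n formula above.

```python
from itertools import combinations

def average_pairwise_differences(seqs):
    """theta_T-hat: average number of pairwise differences D_n in an aligned sample.

    seqs -- list of equal-length sequence strings (sample of size n).
    """
    n = len(seqs)
    total = 0
    for a, b in combinations(seqs, 2):               # all n(n-1)/2 pairs with i < j
        total += sum(x != y for x, y in zip(a, b))   # S_ij under infinite sites
    return 2 * total / (n * (n - 1))                 # D_n = 2/(n(n-1)) * sum_{i<j} S_ij

# Hypothetical aligned sample of four short sequences
sample = ["ACGTACGT",
          "ACGTACGA",
          "ACGAACGT",
          "TCGTACGT"]
print("theta_T-hat =", average_pairwise_differences(sample))
```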
He noticed that since E[D_n] = θ and E[S_n] = a_n θ, so that E[\hat{\theta}_W] = E[S_n]/a_n = θ (with a_n as above, i.e., a_n = \sum_{i=1}^{n-1} \frac{1}{i}), the expected value of the difference \hat{\theta}_T - \hat{\theta}_W should be zero under the standard neutral model. Significant deviations from zero should cause the null model to be rejected (i.e., there is possibly positive selection). Specifically, Tajima (1989) proposed the test statistic:

    D = \frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\widehat{Var}[\hat{\theta}_T - \hat{\theta}_W]}}

The denominator of Tajima's D is an attempt to normalize for the effect of sample size on the critical values. We have to estimate this denominator (hence the ‘hat’ on Var) from the data by using the formula:

    \widehat{Var}[\hat{\theta}_T - \hat{\theta}_W] = e_1 S + e_2 S(S - 1)

where

    e_1 = \frac{1}{a_n}\left(\frac{n + 1}{3(n - 1)} - \frac{1}{a_n}\right), \qquad e_2 = \frac{1}{a_n^2 + b_n}\left(\frac{2(n^2 + n + 3)}{9n(n - 1)} - \frac{n + 2}{n a_n} + \frac{b_n}{a_n^2}\right)

and

    b_n = \sum_{i=1}^{n-1} \frac{1}{i^2}

This looks formidably complicated, but it's really not (though it is tricky to derive): the coefficients come from computing the variance of the difference between the two estimators, just as we derived the variance of S_n above. To actually use this test, Tajima suggested that the distribution of D might be approximated by a certain form (not quite a normal distribution, but a beta distribution), and provided tables of critical values for the rejection of the standard neutral model. The upper (lower) critical value is the value above (below) which the observed value of the statistic cannot be explained by the null model. As with any statistical test, it is necessary to specify a significance level alpha, which represents the acceptable probability of rejecting the null model just by chance when it is true. Roughly, values of Tajima's D are significant at the 5% level (alpha = 0.05) if they are either greater than two or less than negative two. However, D is not exactly beta-distributed, and critical values are often determined using computer simulation. (This is an area of ongoing research.) There are several other related tests that you will probably encounter that are based on the same idea (e.g., Fu and Li's D* and F tests).

As for how the D value responds to deviations from the neutral model, which is the most important thing, this can be understood in the following way. First, the sign of the test is determined only by the sign of the numerator, since the denominator is always positive. The D value becomes negative when there is an excess of either low-frequency (rare) or high-frequency polymorphisms and a deficiency of middle-frequency polymorphisms. This might be caused by positive selection or, alternatively, by an expanding population size (note that the Tajima model assumes a constant population size for the null hypothesis). Large positive values of D can result from population contraction, or from balancing selection maintaining two alternative polymorphisms. The sensitivity to demographic parameters cannot be overstressed. (Below we turn to a test for selection that does not make any such demographic assumptions, the McDonald-Kreitman test; however, it is correspondingly less powerful.)

4. Testing selection vs. neutrality: Ka/Ks; the McDonald-Kreitman (MK) test

Recall from the redundancy of the genetic code that certain nucleotide changes have no effect on the corresponding amino acid coded for – these are called synonymous nucleotide substitutions. Otherwise, a substitution is nonsynonymous. (For example, both CAA and CAG code for glutamine, but CGA codes for arginine, so the one-letter change from CAA to CGA alters the amino acid coded for, while the change from CAA to CAG does not.)
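Putting the pieces together, here is a sketch of the D statistic using the e_1 and e_2 coefficients above (an illustrative implementation, not the course's own code; the example numbers are made up). It takes the sample size n, the number of segregating sites S, and \hat{\theta}_T as inputs.

```python
def tajimas_D(n, S, theta_T):
    """Tajima's D from sample size n, segregating sites S, and theta_T-hat
    (the average number of pairwise differences). Returns 0 if S == 0."""
    if S == 0:
        return 0.0
    a_n = sum(1.0 / i for i in range(1, n))          # a_n = sum_{i=1}^{n-1} 1/i
    b_n = sum(1.0 / i**2 for i in range(1, n))       # b_n = sum_{i=1}^{n-1} 1/i^2
    theta_W = S / a_n                                # Watterson's estimator
    e1 = (1.0 / a_n) * ((n + 1) / (3.0 * (n - 1)) - 1.0 / a_n)
    e2 = (1.0 / (a_n**2 + b_n)) * (2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
                                   - (n + 2) / (n * a_n) + b_n / a_n**2)
    var_hat = e1 * S + e2 * S * (S - 1)              # estimated Var[theta_T - theta_W]
    return (theta_T - theta_W) / var_hat**0.5

# Hypothetical numbers: 10 sequences, 16 segregating sites, mean pairwise difference 3.2
print(tajimas_D(n=10, S=16, theta_T=3.2))
```

The resulting value would then be compared against Tajima's tabulated (or simulated) critical values, roughly ±2 at the 5% level as noted above.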
The MK test compares polymorphic and fixed differences found at synonymous and nonsynonymous sites. Because synonymous and nonsynonymous sites are interleaved, one can assume they have the same mutation rate, and so (by taking ratios) we can factor out this usually unknown rate. So, we can test whether the ratio of polymorphism (within-species differences) to divergence (between-species differences) is the same for both synonymous and nonsynonymous sites. Call K_A the nonsynonymous fixed differences (the “A” reminding us that the change alters the coded-for amino acid), and K_S the synonymous changes. Similarly, within a species, using S for a segregating site as before, we have S_A and S_S. If the neutral theory holds, then K_A/K_S = S_A/S_S.

Here's how to use it. Consider the evolution of a protein-coding gene in two closely related species. Suppose a sample was taken from each of the species. When the sequences from these two samples or populations are aligned together, polymorphic (variable) nucleotide sites can be identified. Each polymorphic site can be classified by two criteria. One is whether the polymorphic site is a difference between samples or a difference between sequences within a sample. The other criterion is whether the change is synonymous: a change is synonymous if it leads to a synonymous codon, and otherwise nonsynonymous. The result is conveniently presented by the following four values:

                     Within species    Between species
    Synonymous             a                  b
    Nonsynonymous          c                  d

where a, for example, is the number of polymorphic sites that are both within-sample variation and synonymous changes. When mutations are selectively neutral, one can expect that the ratio of synonymous to nonsynonymous changes remains constant over time. Therefore, whether a mutation is synonymous should not depend on whether it is a within-sample polymorphism (occurred recently) or a between-sample difference (occurred a long time ago). In statistical terms, the two classifications of polymorphic sites are independent under the null hypothesis that mutations are selectively neutral. A simple test of the null hypothesis is a chi-square test:

    X^2 = \frac{n(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}

where n = a + b + c + d is the total number of polymorphic sites. When n is not small, X^2 follows approximately a chi-square distribution with one degree of freedom. So if the value of X^2 is larger than 3.841, the null hypothesis can be rejected at the 5% significance level.
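The chi-square computation for the 2x2 table is a one-liner; the sketch below (illustrative, with made-up counts, not real data) applies the formula and the 3.841 critical value for one degree of freedom.

```python
def mk_chi_square(a, b, c, d):
    """Chi-square statistic for the 2x2 MK-style table described above.

    a -- synonymous, within species       b -- synonymous, between species
    c -- nonsynonymous, within species    d -- nonsynonymous, between species
    Returns (X2, reject_at_5_percent) using the 1-d.f. critical value 3.841.
    """
    n = a + b + c + d
    x2 = n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))
    return x2, x2 > 3.841

# Hypothetical counts, e.g. tallied from an aligned coding region of two species
x2, reject = mk_chi_square(a=17, b=26, c=2, d=15)
print(f"X^2 = {x2:.3f}, reject neutrality at 5%: {reject}")
```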