Fox Y-DNA Surname Project
Technical Report 1
Average Marker Mutation Rate
Posing the Problem
In Y-DNA studies, the situation can arise where a common ancestor is suspected a number of generations back and one wants to know whether the test results support this suspicion. In the case of two Fox Surname Project participants it is known that, some 8 or 9 generations back, a direct ancestor in one line and an uncle in the other line traveled on the same ship together to Philadelphia, PA, from Plymouth, England, attended weddings and witnessed wills. Their genetic distance was 0 mutations in 12 markers, 2 one-step mutations in 25 markers (DYS447 and DYS458) and 2 more one-step mutations in 37 markers (DYS570 and DYS576). The question then becomes, given the actual test results, what are the chances of 2 mutations in 25 markers or 4 mutations in 37 markers over some 20 transmission events (10 generations in each line)? Ann Turner has developed a calculator to answer questions like this, which is based on the Poisson Distribution for Rare Events. It can be obtained at http://members.aol.com/dnafiler/MutationCalculator.exe
Using this calculator, one is faced with two questions:
• How should the average mutation rate over 25 or 37 markers be chosen when individual marker mutation rates are available but vary widely?
• All of the markers where mutations have occurred in the Fox study have a high mutation rate. Can this be taken into account?
To answer these questions, the writer has gone back to basic probability theory using mutation probabilities which vary from marker to marker, but are assumed constant for a given marker, to estimate the probability of 0, 1 or 2 mutations occurring among the markers tested by familytreedna.com (FTDNA). Because mutation rates are given for all of the first 25 markers, the analysis is limited to these 25 markers. These results are then compared to the Poisson Distribution to see whether an average value can be found which reproduces the more detailed analysis.
The answer to the first question is quite simple. The arithmetic average mutation rate for the 25 markers is 0.002828. Use of this average in the Poisson Equation gives a very close check against the probability of 0, 1 or 2 mutations over 10, 20 and 40 transmission events as calculated by probability theory using the individual marker rates. The only catch is that it is necessary to assume that the second mutation on the same marker does not return it to its original state, resulting in zero mutations. It is shown that this is a minor correction.
John Chandler has pointed out that, at the low mutation rates found for most Y-DNA markers, mutation rates become additive and, therefore, that this result could have been expected from the Distributive Law of Mathematics. In fact, low probabilities are a requirement for the Poisson Distribution to apply. These calculations show that the Poisson Distribution using the arithmetic average mutation rate gives very reasonable results for typical conditions found in genealogical studies.
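The additivity argument can be checked numerically. The sketch below compares the exact probability of zero mutations, computed marker by marker, with the Poisson estimate that uses only the sum of the rates. The 25 rates are placeholders chosen to span the reported range of 0.0005 to 0.0075; they are not the Barton/MacDonald values, and the function names are illustrative.

```python
import math

# HYPOTHETICAL per-marker rates spanning 0.0005 to 0.0075.
# These are placeholders, not the published values for the 25 markers.
rates = [0.0005 + 0.007 * (i / 24) ** 2 for i in range(25)]

def p_zero_exact(rates, events):
    """Probability that no marker mutates in `events` transmissions."""
    return math.prod((1.0 - p) ** events for p in rates)

def p_zero_poisson(rates, events):
    """Poisson estimate, which depends only on the sum of the rates."""
    return math.exp(-events * sum(rates))

for events in (10, 20, 40):
    exact = p_zero_exact(rates, events)
    approx = p_zero_poisson(rates, events)
    print(f"{events:2d} events: exact={exact:.4f} "
          f"poisson={approx:.4f} ratio={exact / approx:.4f}")
```

With rates this small the ratio stays within about one percent of 1, which is the sense in which the rates can be treated as additive.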
The answer to the second question is that the Poisson Distribution does not depend on knowing individual mutation rates but only on the sum of all the rates, which, of course, defines the average. The fact that all the mutations in the Fox comparison occurred at rapidly moving markers is, however, confirmation that nothing weird is going on.
The 25 marker average mutation rate was then applied to the Fox participants, with the following results:
20 Events: 0 mutations = 24%, 1 = 34%, 2 = 24%, 3+ = 19%
40 Events: 0 mutations = 6%, 1 = 16%, 2 = 23%, 3+ = 55%
Two or more mutations in 10 generations (20 events) seems reasonable; in 20 generations (40 events) it is the expected result.
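A minimal sketch of the Poisson calculation behind figures of this kind, assuming the expected number of mutations is simply markers × events × average rate. The function names are illustrative and are not taken from the Turner calculator.

```python
import math

AVG_RATE = 0.002828  # arithmetic average rate for the 25 markers (from the text)

def poisson_pmf(n, v):
    """Poisson probability of exactly n mutations when v are expected."""
    return v ** n * math.exp(-v) / math.factorial(n)

def mutation_table(markers, events, max_n, rate=AVG_RATE):
    """Return [P(0), ..., P(max_n), P(more than max_n)]."""
    v = markers * events * rate  # expected number of mutations
    probs = [poisson_pmf(n, v) for n in range(max_n + 1)]
    probs.append(1.0 - sum(probs))  # everything above max_n
    return probs

for events in (20, 40):
    row = mutation_table(markers=25, events=events, max_n=2)
    print(f"{events} events:", ", ".join(f"{100 * p:.1f}%" for p in row))
```

The printed percentages should land within a point or two of the figures quoted above; small differences are to be expected from rounding.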
Extension of the detailed analysis to 37 markers is not possible due to a lack of specific marker mutation rates. It is probably conservative, however, to assume that the same 25 marker average probability can be used. Results of this approach were as follows:
20 Events: 0 mutations = 12%, 1 = 26%, 2 = 27%, 3 = 19%, 4 = 10%, 5+ = 6%
40 Events: 0 mutations = 2%, 1 = 6%, 2 = 13%, 3 = 19%, 4 = 19%, 5+ = 41%
The extension to 37 markers did not help clarify the Fox study, but 4 or more mutations in 10 generations is still quite possible and in 20 generations it is more likely than not. Both of the additional markers that show a change are identified by FTDNA as fast markers, and DYS570 is identified by Kayser, et al. (Am J Hum Gen, 74, 1184-1197) as the most rapid of 48 loci tested.
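The statement about 4 or more mutations is a tail probability. The sketch below computes it under the same assumption made in the text, namely that the 25-marker average rate also applies to all 37 markers. The helper name prob_at_least is illustrative.

```python
import math

AVG_RATE = 0.002828  # 25-marker average, assumed here to hold for 37 markers

def prob_at_least(k, v):
    """Poisson probability of k or more mutations when v are expected."""
    below = sum(v ** n * math.exp(-v) / math.factorial(n) for n in range(k))
    return 1.0 - below

for events in (20, 40):
    v = 37 * events * AVG_RATE  # expected mutations across 37 markers
    print(f"{events} events: expected={v:.2f}, "
          f"P(4 or more)={100 * prob_at_least(4, v):.0f}%")
```

Adding the "4" and "5+" entries in each row above gives the same quantity, roughly 16% for 20 events and 60% for 40 events.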
The mutation rates for individual markers are taken from the Barton website and are based on Doug MacDonald's interpretation of data from the Sorenson labs. These reported mutation rates vary from 0.0005 to 0.0075. In supplying these numbers, MacDonald states "This is a very tricky procedure which is very sensitive to a certain normalization constant. It also depends on them (Sorenson) using the formulas 20a and 20b of Walsh's paper (Genetics 158: 897-912.) Because of the great sensitivity to parameter errors, the numbers that are farthest from the average are subject to the worst calculational errors, especially the SMALL ones."
Basically, the Poisson Distribution can be assumed for individual markers and the mutation rate for that marker estimated from the observed variation in father-son pairs or the variation from the norm over hundreds of test cases in family surname groups or in Haplogroups. Kayser, et al (Am J Hum Gen, 66, 1580-1588 and 74, 1184-1197) have studied a variety of markers and concluded that the length of the marker has an effect on meiosis so that different mutation rates may apply to subjects from different Haplogroups. In addition, the analysis is not clear-cut for a second mutation at a given marker. Finally, the effect of environmental conditions on mutation rates is a large unknown factor yet to be defined.
Probability Theory and the Poisson Distribution
The mathematics used is summarized in somewhat abbreviated fashion in Table 1. In these equations, N is the total number of transmission events, n the number of mutations and p the mutation probability per transmission event. Equations 1, 3 and 5 are the basic probability expressions for 0, 1 and 2 mutations using variable marker mutation probabilities. Equations 2, 4 and 6 show how these can be simplified for constant marker probabilities. Equation 7 is the general binomial expression for n mutations with constant probability, which looks complicated but simplifies down to Equations 2, 4 and 6. Finally, Equation 7 can be reduced to Equation 9, the Poisson Distribution, by substituting v as defined in Equation 8 and assuming that N is much larger than n.
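Table 1 itself is not reproduced in this text. The standard forms that match the description above are sketched below as a reconstruction, not a copy of the original table. The notation P(n) and the subscripted p_i (the mutation probability at trial i, where the N trials are the individual transmission events) are assumptions introduced here.

```latex
\begin{align}
P(0) &= \prod_{i=1}^{N} (1 - p_i) \tag{1} \\
P(0) &= (1 - p)^{N} \tag{2} \\
P(1) &= \sum_{i=1}^{N} p_i \prod_{j \ne i} (1 - p_j) \tag{3} \\
P(1) &= N p\,(1 - p)^{N-1} \tag{4} \\
P(2) &= \sum_{i<j} p_i\, p_j \prod_{k \ne i,j} (1 - p_k) \tag{5} \\
P(2) &= \frac{N(N-1)}{2}\, p^{2} (1 - p)^{N-2} \tag{6} \\
P(n) &= \frac{N!}{n!\,(N-n)!}\, p^{n} (1 - p)^{N-n} \tag{7} \\
v &= N p \tag{8} \\
P(n) &= \frac{v^{n} e^{-v}}{n!} \tag{9}
\end{align}
```

Setting n = 0, 1 and 2 in Equation 7 gives Equations 2, 4 and 6 directly, and letting N grow with v held fixed turns Equation 7 into Equation 9.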
In the Turner calculator, the Poisson Distribution is used and N is taken as the product of the number of transmission events and the number of markers. The number of transmission events is taken as the number of generations back to a common ancestor times 2, when comparing results for two individuals. It is not necessary for p to be constant and an average p can be used as long as each marker is assumed to behave independently. In general, N must be a large number but, for the very low mutation probability encountered in Y-DNA studies, it turns out that Equation 9 can be applied to rather modest values of N and still be accurate.
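How modest N can be is easy to check by comparing the binomial expression with the Poisson Distribution directly. The sketch below does this for N = 25 markers × 20 events = 500 and the 25-marker average rate; the function names are illustrative.

```python
import math

def binomial_pmf(n, N, p):
    """Exact probability of n mutations in N trials with constant p."""
    return math.comb(N, n) * p ** n * (1 - p) ** (N - n)

def poisson_pmf(n, v):
    """Poisson approximation with v = N * p expected mutations."""
    return v ** n * math.exp(-v) / math.factorial(n)

p = 0.002828      # 25-marker average rate from the text
N = 25 * 20       # markers times transmission events
for n in range(5):
    b = binomial_pmf(n, N, p)
    q = poisson_pmf(n, N * p)
    print(f"n={n}: binomial={b:.4f} poisson={q:.4f} diff={b - q:+.5f}")
```

At this rate the two columns differ by well under a tenth of a percentage point, even though N is only 500.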
Equations 2, 4 and 6 can be applied to estimate the probability of 0, 1 or 2 mutations occurring in each of FTDNA's first 25 markers over a given number of transmission events. While mutation probabilities for different markers vary widely, it is reasonable to assume that they are constant for any given marker over a number of transmission events. Barton and MacDonald's probabilities for individual markers can be found at http://worldfamilies.net/marker.htm
The case where there are 20 transmission events is shown in Table 2. [Note: Markers shown in red are those markers identified as fast markers by FTDNA. Markers shown in pink are additional fast markers (above p = .003) not identified by FTDNA. Markers shown in blue were identified as fast markers by FTDNA but are now identified as slow markers. The probabilities in each case should add up to 100%; any differences are due to rounding errors. Note that the 2nd mutation is assumed always to increase the genetic distance, a necessary condition for the Poisson Model.]
It is apparent that differences in mutation rate are magnified considerably at 20 transmission events. For the two markers in question in this study, the probabilities of 1 or more mutations are 8.6% for DYS447 and 12.4% for DYS458. Similar calculations, not shown, were done for 10 and 40 transmission events. Equation 9, the Poisson Distribution, gave essentially the same results in all cases, differing only in the last decimal place.
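The per-marker figures follow from Equation 2: the chance of one or more mutations at a marker is one minus the chance of none. The sketch below shows that relation and also inverts it to estimate the per-event rate implied by a quoted probability. The implied rates are back-calculated here for illustration and are not read from Table 2; the function names are hypothetical.

```python
def p_at_least_one(p, events):
    """Probability of one or more mutations at a single marker."""
    return 1.0 - (1.0 - p) ** events

def implied_rate(prob, events):
    """Per-event rate that would give `prob` of one or more mutations."""
    return 1.0 - (1.0 - prob) ** (1.0 / events)

# Probabilities quoted in the text for 20 transmission events.
for name, prob in (("DYS447", 0.086), ("DYS458", 0.124)):
    rate = implied_rate(prob, 20)
    print(f"{name}: implied rate per event = {rate:.4f}, "
          f"check = {100 * p_at_least_one(rate, 20):.1f}%")
```

The implied rates come out near 0.0045 and 0.0066, both above the 0.003 threshold used for fast markers.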
Mutation Probability with Variable Marker Mutation Rates
Equations 1, 3 and 5 indicate how the variable marker probabilities shown in Table 2 can be applied to the estimation of the probability of 0, 1, or 2 mutations in all 25 markers. For zero mutations, it is simply a question of multiplying all the probabilities of zero mutations for each marker. For 1 or 2 mutations, the probabilities of all the various permutations leading to 1 or 2 mutations must be summed. For 1 mutation there are just 25 possible permutations. For 2 mutations there are 625 permutations but, since half of these are duplicates, the result must be divided by two. Both these cases can readily be set up for spreadsheet type calculations. For 3 or more mutations, however, the calculation becomes prohibitively tedious.
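The spreadsheet procedure can be sketched in a few lines. The code below builds the per-marker probabilities of 0, 1 and 2 mutations from Equations 2, 4 and 6, combines them over all markers, and compares the result with the Poisson Distribution. The 25 rates are placeholders spanning the reported range, not the Barton/MacDonald values, and the function names are illustrative.

```python
import math
from itertools import combinations

# HYPOTHETICAL per-marker rates; placeholders, not the published values.
rates = [0.0005 + 0.007 * (i / 24) ** 2 for i in range(25)]

def marker_probs(p, events):
    """Probabilities of 0, 1 and 2 mutations at one marker."""
    q0 = (1 - p) ** events
    q1 = events * p * (1 - p) ** (events - 1)
    q2 = math.comb(events, 2) * p ** 2 * (1 - p) ** (events - 2)
    return q0, q1, q2

def panel_probs(rates, events):
    """Probabilities of 0, 1 and 2 mutations over the whole panel."""
    per = [marker_probs(p, events) for p in rates]
    p0 = math.prod(q0 for q0, _, _ in per)
    r = [q1 / q0 for q0, q1, _ in per]   # one mutation at this marker
    s = [q2 / q0 for q0, _, q2 in per]   # two mutations at this marker
    p1 = p0 * sum(r)
    two_markers = sum(a * b for a, b in combinations(r, 2))
    p2 = p0 * (two_markers + sum(s))
    return p0, p1, p2

def poisson_pmf(n, v):
    return v ** n * math.exp(-v) / math.factorial(n)

events = 20
v = events * sum(rates)
for n, exact in enumerate(panel_probs(rates, events)):
    print(f"n={n}: detailed={exact:.4f} poisson={poisson_pmf(n, v):.4f}")
```

Using combinations counts each pair of markers once, which is the same as summing over all ordered pairs and dividing by two; the case of two mutations at a single marker is added separately.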
Tables 3, 4 and 5 show the results of these spreadsheet calculations in column 1 and compare them to the Poisson Distribution for Rare Events using the arithmetic average 25-marker probability of 0.002828 in column 2. The ratio of the two results is shown in column 3, where it is seen that the agreement is excellent. Table 3 is for 10, Table 4 is for 20 and Table 5 is for 40 transmission events.
The situation where two mutations occur at the same marker requires further consideration. The calculations in column 1 assume all of these cases increase the genetic difference, which is not really true. Column 4 shows the corrected results if half of the second mutations at a given marker revert back to the original condition, giving additional zero mutation results and fewer double mutation results. It is apparent that the Poisson Equation does not take this situation into account but that the correction is a minor one.
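The size of this correction can be estimated with a short calculation. The sketch below treats each marker as an independent Poisson process, isolates the part of the two-mutation probability in which both mutations fall on the same marker, and moves half of it to the zero-mutation result. The rates are placeholders, not the Barton/MacDonald values, and the Poisson treatment of each marker is a simplifying assumption rather than the spreadsheet method used for the tables.

```python
import math

# HYPOTHETICAL per-marker rates; placeholders, not the published values.
rates = [0.0005 + 0.007 * (i / 24) ** 2 for i in range(25)]

def reversion_correction(rates, events, revert_fraction=0.5):
    """Return (P0, P2, corrected P0, corrected P2)."""
    v = [events * p for p in rates]      # expected mutations per marker
    total = sum(v)
    p0 = math.exp(-total)
    p2 = p0 * total ** 2 / 2
    # Portion of P2 where both mutations occur at the same marker.
    same_marker = p0 * sum(x ** 2 for x in v) / 2
    shift = revert_fraction * same_marker  # second mutation undoes the first
    return p0, p2, p0 + shift, p2 - shift

for events in (10, 20, 40):
    p0, p2, p0c, p2c = reversion_correction(rates, events)
    print(f"{events} events: P0 {p0:.4f} -> {p0c:.4f}, "
          f"P2 {p2:.4f} -> {p2c:.4f}")
```

With these placeholder rates the shift is under one percentage point at 20 events, consistent with the correction being a minor one.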
The Poisson Model for the Distribution of Rare Events is applicable to genetic mutations since mutation rates are quite low. It does not require that mutation rates be the same for all markers and an arithmetic average of the individual rates gives a very close approximation to the actual distribution. One limitation is that it does not account for a later mutation that cancels a previous one. The average mutation probability of 0.002828 obtained here compares favorably with an average of 0.0028 determined by Kayser, et al (Am J Hum Gen, 66, 1580-1588) for many of these same markers by father-son pair analysis.
For the specific Fox Surname project pair described at the beginning, the analysis shows that a common ancestor 10 generations back is well within the realm of probability. The unusual null value obtained by FTDNA for DYS439 in both subjects is highly significant, as is the fact that all mutations are in rapidly moving markers. The true test, however, would be to test more members of the family in both lines.
The writer would like to acknowledge the assistance of Doug MacDonald and John Chandler in reviewing the original version of this paper. Comments by John Chandler were useful in making this revision.