Saturday, February 19, 2011

Genetic Genealogy and the Single Segment

Last year, my wife Janet and I sent our DNA off to 23andMe for analysis. Among the tools that they provide is a "Relative Finder," which lists other people on the site who share regions of DNA that appear to be identical by descent. In my case, there are 476 people listed, each sharing between 0.07% and 0.46% of my genome, almost always as a single segment (there are 18 people with whom I share two segments). These people are generally anonymous, but you have an opportunity to make contact and invite them to "share genomes," which means only that you can see which regions are shared.

There are a lot of people on 23andMe who are quite interested in this tool, and who use it for genetic genealogy. Many of these same people also use Family Tree DNA and ancestry.com. As a result of my interactions with these 23andMe relatives, and following the discussions on the 23andMe community forums, I have been thinking about, and researching, what it means to share one segment of DNA by descent with someone. In the process, I have realized some things that are not fully appreciated by most of the genealogy buffs on 23andMe.

I am presenting these insights here, and will consider them one at a time.
  • Distant relatives often share no genetic material at all.
  • It is possible to share a segment with very distant relatives.
  • Sometimes, more distant relationships are more likely.
  • Most of your relatives may be descended from a small fraction of your ancestors.

Distant relatives (fourth cousins and beyond) often share no genetic material.
The chance of not sharing any DNA at all becomes appreciable with fourth cousins and rises to approximately half with fifth cousins. This is based on my own simplified calculations and those of Donnelly (1983), who opines that "proof of descent from William Shakespeare does little to increase the probability that the claimant has genes in common with him." There are limits to what can be accomplished by genetic genealogy that are imposed by the real chance that you simply do not share any DNA at all with distant relatives. The more distant the relationship, the more likely it is that no DNA is shared.

On the other hand, you have to inherit your DNA from somebody, so there are some blocks of identity by descent that have been transmitted many generations.

It is possible to share a segment with very distant relatives.
"The probability that fourth cousins share at least one IBD [identical by descent] segment is 77%, and the expected length of this segment is 10 cM." Now consider the next step. There is a 50% chance that that one shared segment will not be transmitted at all, but a 90% chance that if it is transmitted it will be just as big as it was (the same 10 cM.). What this means for genealogy on 23andMe is that for two people sharing one segment identical by descent there is no way to reliably estimate how far back the common ancestor was. Furthermore, no improvement in software can possibly change that, because the limitation is imposed by the genetics itself.

No matter how far back you go, every nucleotide of one's genome is derived from some ancestor, and even going back 20 generations, the chance that the bit which has been inherited is part of a block 5 cM. or greater is still appreciable. In fact, even for 19th cousins, there is a real chance (13%) that any segment of DNA they have inherited in common will be 5 cM. or greater. This number is based on the term (1 - P(rec))^n, where P(rec) is the probability that the segment will be broken up by recombination in a single generation (size/100, where size is in cM.) and n is the number of generations (meioses) separating the two cousins. For 19th cousins sharing a single ancestor, n is 40.
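
The 13% figure can be checked directly. The following is a minimal Python sketch of the term above, assuming (as this post does) that the chance of recombination within a segment is simply its length in cM. divided by 100 per generation. The function name is only illustrative.

```python
def survival_probability(size_cm: float, meioses: int) -> float:
    """Chance that a segment is never split by recombination over the given
    number of meioses, assuming it is transmitted each time.

    Assumption from the post: P(rec) = size_cm / 100 per meiosis.
    """
    p_rec = size_cm / 100.0
    return (1.0 - p_rec) ** meioses


# 19th cousins through a single ancestor are separated by 2 * 19 + 2 = 40 meioses.
print(round(survival_probability(5, 40), 3))  # about 0.129, i.e. roughly 13%
```

This only covers the "not broken up" part. The chance that the segment is transmitted at all is a separate factor of one-half per generation.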

Of course, as mentioned above, there is very little chance that two 19th cousins will share any IBD segments at all, but this is offset if one has many 19th cousins, which is often the case.

Sometimes, more distant relationships are more likely.
23andMe reports a "predicted relationship" (e.g. "4th cousin") and a "relationship range" (e.g. "3rd to 7th cousin"). However, these ranges are likely to be wildly inaccurate, because the likely distance to a common ancestor, given only the information that two people share a single IBD segment, can vary enormously, based largely on how many relatives one has.

Here is my estimate of these values. You can skip this paragraph if you're not interested in the details.
The probability that a segment, if transmitted, will not be broken up by recombination is 1 minus the probability of recombination, which is 5% for a 5 cM. segment, 10% for a 10 cM. segment and so on. (If you are moving up a pedigree, this is the probability the segment was transmitted rather than created by recombination, but the value is the same.)
The probability that a segment will be transmitted at all is one-half per generation.
Thus, for an nth cousin sharing a single ancestor, the probability is ((1-P(rec))/2)^(2n+2).
For an nth cousin sharing two ancestors (the usual case), the probability is
2(((1-P(rec))/2)^(2n+2)). For example, the probability of two 4th cousins sharing a specific 5 cM. segment is 2((0.95/2)^10) = 0.00117. If one has more than 855 4th cousins, then the expected number of 4th cousins sharing this segment will be greater than 1. Because every 4th cousin has the same chance of inheriting the segment, the expected number of 4th cousins who do share the segment will be directly proportional to the number of 4th cousins one has. In the case of 5th cousins, the probability of sharing a specific segment is 2((0.95/2)^12) = 0.00026, which would require 3,790 cousins for the expected number sharing the segment to exceed 1.0. In general, the number of cousins of a specific degree who should be expected to share a segment is given by

2(((1-P(rec))/2)^(2n+2)) x N

where N is the number of relatives of that degree. For a 5 cM. segment, if you have more than 4.43 times as many cousins of degree n+1 as cousins of degree n, then you expect more cousins of degree n+1 than cousins of degree n to share the segment. For a 10 cM. segment, this ratio is 4.94.
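
For readers who would rather check the arithmetic than take it on faith, here is a short Python sketch of the formula above. It assumes the same simplifications used in this post (P(rec) = size/100, one-half chance of transmission per generation, 2n+2 generations between nth cousins). The function names are mine and purely illustrative.

```python
def share_probability(degree: int, size_cm: float, shared_ancestors: int = 2) -> float:
    """Probability that two cousins of the given degree both carry one
    specific segment intact, under the post's simplified model."""
    per_meiosis = (1.0 - size_cm / 100.0) / 2.0
    meioses = 2 * degree + 2
    return shared_ancestors * per_meiosis ** meioses


def break_even_ratio(size_cm: float) -> float:
    """How many times more cousins of degree n+1 than degree n are needed
    before the more distant group is expected to share the segment as often."""
    per_meiosis = (1.0 - size_cm / 100.0) / 2.0
    return 1.0 / per_meiosis ** 2


for degree in (4, 5):
    p = share_probability(degree, 5)
    # Prints the probability and the number of cousins needed for an
    # expected count of 1 (about 855 and 3,790 in the text).
    print(degree, f"{p:.5f}", round(1 / p))

print(f"{break_even_ratio(5):.2f}", f"{break_even_ratio(10):.2f}")  # 4.43 4.94
```

Multiplying share_probability by the number of cousins of that degree gives the expected number who share the segment.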

Thus, if you have many more distant cousins, as would be expected if your ancestors had large families, then someone who shares a single IBD segment is more likely to be a distant cousin, because you have so many more distant cousins. Since the break-even ratio is a little under five, the point where the increase in the number of cousins outweighs the loss of shared segments is about five children per family. This is not extremely uncommon.

As an alternative to the math, consider the case of my (hypothetical) great-great-great-grandfather Joe. Let’s say that I have inherited a 5 cM. segment of DNA from him. (It’s likely that I have inherited at least one segment from him.) Our concern is whether a distant relative that shares this segment is more likely to be a fourth cousin also descended from Joe or a fifth cousin descended from Joe’s father Jacob. The chance that the 5 cM. segment was inherited by Joe, from Jacob, is slightly less than half (because of the possibility of recombination in that generation). Jacob had 12 children, so I can expect to have 12 times as many fifth cousins descended from Jacob as fourth cousins descended from Joe. That fact ends up being more significant than the chance of recombination, so I will share the segment in question with more fifth cousins than fourth cousins. This same logic applies to fifth vs. sixth cousins and so on.
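
The Joe and Jacob comparison can also be put in numbers. This is a minimal sketch, assuming the per-generation factor of (1 - P(rec))/2 used earlier and taking the "12 times as many fifth cousins" figure at face value. The function name is illustrative only.

```python
def fifth_to_fourth_ratio(size_cm: float, cousin_multiplier: float) -> float:
    """Expected number of fifth cousins sharing a segment, relative to the
    expected number of fourth cousins sharing it.

    Moving the common ancestor back one generation adds two meioses (one on
    each line of descent), so the per-cousin probability is multiplied by
    ((1 - P(rec)) / 2) ** 2, with P(rec) = size_cm / 100.
    """
    per_meiosis = (1.0 - size_cm / 100.0) / 2.0
    return cousin_multiplier * per_meiosis ** 2


ratio = fifth_to_fourth_ratio(5, 12)
print(f"{ratio:.2f}")  # 2.71
```

Under these assumptions, a 5 cM. segment from Joe would be shared with roughly 2.7 times as many fifth cousins as fourth cousins.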

Thus, my 23andMe relatives sharing one IBD segment might be fourth cousins, as predicted, or they might be distant cousins connected by prolific ancestors. There is no way to know.

The world population has increased perhaps 20-fold in the last millennium, but that works out to significantly less growth than the sustained doubling required to predict distant ancestry for people who share one IBD segment. Nevertheless, there are well-documented cases of rapid demographic expansion.

Most of your relatives may be descended from a small fraction of your ancestors.
Given that family size varies a great deal, it is no doubt common to have some ancestors who have left many more descendants than others. We all have 32 great-great-great-grandparents, typically in 16 couples. If one family among the 16 had five children and their descendants did as well, while the other families reproduced at replacement rates (two children per family), then your more prolific ancestors (the parents of just one of your 16 great-great-grandparents) would account for over 3/4 of your fourth cousins.
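
The "over 3/4" claim is easy to check with a toy model. The sketch below assumes every family in a line has the same number of children, with no intermarriage between lines. The number of ancestral couples is left as a parameter, and the function names are illustrative.

```python
def fourth_cousins_via_couple(children_per_family: int) -> int:
    """Fourth cousins descended from one ancestral couple: the siblings of
    your own ancestor among their children, followed four more generations down."""
    k = children_per_family
    return (k - 1) * k ** 4


def prolific_share(prolific_children: int, other_children: int, couples: int) -> float:
    """Fraction of all fourth cousins who descend from the one prolific couple."""
    prolific = fourth_cousins_via_couple(prolific_children)
    others = (couples - 1) * fourth_cousins_via_couple(other_children)
    return prolific / (prolific + others)


# One five-child lineage against replacement-rate lineages.
for couples in (16, 32):
    print(couples, f"{prolific_share(5, 2, couples):.2f}")
```

In this model the prolific couple contributes 2,500 fourth cousins and each replacement-rate couple contributes 16, so the prolific share stays above three-quarters whether 16 or 32 couples are counted.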

In summary, it is impossible to know the relationship one has to relatives who are discovered by virtue of their sharing a single autosomal segment of DNA. The "predicted relationship" is uncertain, and even the range is hard to be sure of. The extensive information provided by 23andMe is a very useful tool for genealogy, but it cannot tell you about relatives with whom you do not share any genetic material by descent. On the other hand, relatives with whom you do share genetic material by descent can be quite distant.

20 comments:

  1. Steve,
    I followed the link you gave on your 23andme discussion thread and enjoyed reading your post in its entirety.

    I am certainly not a geneticist, but I have found that my 23andme results do bear out on many of the points you made in your post.

    Luckily, my grandmother is still living and she submitted a sample, which really helps with narrowing down which side of my family the hits are coming from. She comes from a long line of really big families (I think they were averaging 10-15 kids per generation each, for a while) and I think she told me once that she has over 40 1st cousins on one side of her family. That being said, out of the 15 people I am sharing genomes with to date (I just received my results in early Feb.), 12 also match with my grandmother.

    Just an aside - I was able to prove a paper trail to a common ancestor with a match who showed up as a possible 5th cousin, with a range from 3rd-8th cousin. It turns out we are 6th cousins, which appears to be about the edge of 23andme's testing range.

    I liked reading your post and I think it helps put the genetic genealogy tool through 23andme in perspective.

    Thanks!
    -Paula Meixner Kallsen

  2. Thanks Steve. That is really a very helpful analysis. With large families the 23andme links will be for more distant family members, on the whole. So, if you have some sense of how many children your families produced, you can make a rough judgement of the likelihood of near and far distant relatives in 23andme's analysis.

    You briefly say that over the millennia there have been population surges in some areas. Given 23andme's spatial proximity maps of ancestry, it might be nice to see if there are differential replenishment rates in those populations to add a little more detail about the likelihood of recent and distant hits in those populations.

  3. Joe,
    Thanks.
    I have one comment with respect to this:
    "So, if you have some sense of how many children your families produced, you can make a rough judgement of the likelihood of near and far distant relatives in 23andme's analysis."
    The families whose sizes matter are the other descendants of your various ancestors (those who share an ancestor but aren't in your line), and those are the people who you probably don't know about. One distant great-great-great uncle who moved to a surprising and unexpected country and had many many descendants can account for almost all of your relatives. This happens, and people are perplexed.

  4. Steve,
    I'm wondering if you can shed light on my quandary, in light of your very helpful presentation here. I'm brand new to 23andMe: I currently have 660+ matches. My largest segment match is 44cM 9000+SNPs - it starts at 165 on ch. 2, goes to 215.2. What's interesting is that I match with 71 other people within those coordinates. These other matching segments range from 5cM to 7.1cM, averaging 5.5cM.
    Is there anything to be made of this, do you think? The fact that 71 people match me at similar lengths across a very nearly identical stretch of chromosome 2 - and the lengths are such they exceed 23andMe's threshold for calling it a match: What does that mean? Anything at all? Are we all related? Is this just statistical noise? Or something in between? Thanks!

    Dwight Holmes
    Riverdale Park MD

  5. Dwight,
    I assume that you were able to see that 71 people match in this interval using Ancestry Finder. If not, I'd be interested in what tool you used. It seems unlikely to me that 71 people have agreed to share genomes. What does Ancestry Finder tell you about their ethnicity? Are they all Ashkenazi? Colonials (4 grandparents born in the US)? This could be a clue that they do indeed share this segment by descent from a single (very prolific) ancestor. Another clue would be if the boundaries are frequently the same (Do many of the 71 matches all line up on one end or the other or both?). Also, how many segments do you share with the person who has the 44 cM. segment?
    Steve

    P.S. We're neighbors (I live in Univ. Pk.)

    P.P.S. Those on 23andMe who want to know how he got so much information might want to follow this procedure:

    1) Go to Ancestry Finder.
    2) Click on "Show Advanced Controls."
    3) Change "Number of Grandparents from the same country" to 1+
    4) Change "minimum segment size" to 5 cM.
    5) Download the data as a .csv file.
    "Download Ancestry Finder Matches (CSV)"

  6. Thank you for this very illuminating post to put things in perspective. A factor not taken into account by your calculations is a certain level of inbreeding. I am not talking about close inbreeding (such as cousin-cousin marriages) but that coming from population structure, e.g. you might inadvertently marry your 3rd cousin, and this might have been relatively common in the past, when people lived in villages and there was much less movement than today. Second, your calculations are valid to see the chances that two distant relatives share a specific segment, but won't you share many more segments with closer relatives, such as third or fourth cousins, than with distant cousins?

  7. Blackbird,
    Yes! You are right that nothing in this post depends on inbreeding. In particular, the possibility of sharing a single segment with more distant cousins than close ones arises even without inbreeding as a factor. Limited effective population size raises other issues that I have not considered.

    On the second point, you are of course right that third-cousins and closer are likely to share more than one segment by descent. However, "The chances of not sharing any DNA at all becomes appreciable with fourth cousins and rises to approximately half with fifth cousins" holds. It is only after the expected number of shared segments becomes less than one that the paradox of sharing segments with more distant cousins (e.g. 7th cousins) than close ones (e.g. 5th cousins) becomes likely.

  8. Thanks Steve. Good post.

    If I go down to 5cM segments on the Ancestry Finder and have 4 grandparents from the same country clicked, this is the list of countries that pops up:
    Germany, Russia, Czech Republic, Poland, Norway, Macedonia, Ireland, UK, Malta, Latvia, Italy, Iraq, Greece, Finland, Croatia, Belgium.

    On my Mom's side I'm Czech and Croatian and on my Father's side I'm English, German, French, Irish and since they've been here since the 1600s, probably some other things, but I doubt sincerely I am related to all these people who are 100% certain nationalities, especially Maltese, Latvian, Finnish, Norwegian, Iraqi.

    If I go to 10cM and 4 grandparents, I get one hit from the Czech republic, which makes sense, since my great grandparents were from there.

    I also find it interesting that out of all the people listed on the Relative Finder, none share my maternal haplogroup. I suppose the odds of finding a relative who shares your direct maternal line is low in the extreme.

    Stephen

  9. I should probably mention again the source of my numbers for the probability of no detectable relationship. Donnelly 1983 (Theoretical Population Biology) states that "there is an 82% probability of a detectable relationship between a person and his 8th generation ancestor, but only a 16% probability of detectable relationship with a 12th generation ancestor." This article is not online (I went to McKeldin library and copied it), is dense and is primarily devoted to the method rather than the results. A more recent and accessible treatment is Luke Jostins's blog post, Nov. 11, 2009, "How many ancestors share our DNA?" http://bit.ly/fcfVyx
    Remember that fourth cousins are 9th degree relatives.

  10. Thank you for that explanation. I don't exactly know what other people who use 23andMe think about RF itself and the relationships stated on RF for predicted cousins.

    I don't take it that seriously. Yes, I accept that I share segments of DNA but the exact relationship doesn't matter to me. At 23andMe a lot of folks are there for either medical reasons (medical indications based on SNPs), finding out direct familial information (adoptees, people without family) or genealogy. Most know nothing about genetics and have not been educated in any science beyond high school. That is one of the reasons the Y chromosome and mitochondrial haplogroups are considered important by most of them. To me, haplogroups are thousands of years old and tell me little about my ancestry in the last thousand years, which is more important to me.

    You said you shared one segment with your RF cousins. I find that odd as I share segments with my RF cousins in the proportion of one segment per 0.10% sharing. I don't contact my RF cousins mainly as I come from a small European island population in South Europe, and I cannot see the point of contacting people of various origins who live in the USA: WASPs, African Americans, Hispanic Americans and so on. I also have a family tree that extends beyond the founding of America, so I doubt I will get any interesting genealogical information from my RF cousins.

  11. Thanks, Steve - in reference to your response to my query on Fri Feb 25, 02:55:00 PM 2011:

    1] Yes, from Ancestry Finder (we can dream about everyone sharing genomes!)

    2] Ashkenazi count: ZERO.

    3] Colonials/4 gp’s born in US: 19
    “Not provided” on all 4: 13; others are scattered

    4] boundaries:
    This segment goes from 165.0 to 215.2
    Sorting by end point, I see that 47 of them end at 197.7. These start anywhere from early as 190.8 (6.2cM) and as late as 192.1 (5.1cM).
    Sorting by starting point, the largest group I have are the 24 segments that start at 192.0. These extend at least to 197.6 (5.1cM), most of them to 197.7 (5.2cM) and some as much as 198.8 (5.7cM). There are then 8 more that start at 192.1 and extend to 197.7-198.7. There are 9 beginning at 191.9 and going to 197.7 (5.2cM). 5 start at 191.5 and go to 197.7 (5.6cM).

    5] I have a 44.4cM shared segment with Mr K. And that’s the only segment (according to AF and RF; I’m trying to get him to upload his data to Gedmatch, he hasn’t said no, but hasn’t done it yet).

    P.S. So we are neighbors. Do you ever come to the Riverdale Park Farmers Market (starting up again next Thursday 4/14)? My wife is the ‘soap lady’ there – stop by and introduce yourself (the market is 3 – 7, I am usually there with her from 5 o’clock on).

    Dwight

  12. This is a very helpful post. Based on our recent article on a Europe-wide view of genetic genealogy (here at PLOS Biology) we've posted a somewhat similar discussion of single blocks.

  13. Sorry should have signed the above comment.

    Graham Coop.

  14. I see it has been a while since you posted this but I am finding how to look at the 23andMe genome is not easy. I have only had my result back for a couple of weeks with 993 matches. I have had about 60 respondents and am tracking the results in a spreadsheet and have a couple of genetic questions. I have one individual who matches me on three genomes (total 37 mb) all on the right. Does that indicate that she matches on my mother's side? I have another gal who matches me on the same genome on both the left and right side for a total of 33 mb and she knows nothing of her genealogy. These two do not match but I understand that they may have just not inherited the same piece of DNA. Is any of this relevant?

  15. I heard there was inbreeding with my grandmother...her uncle may have been the father of some of her many children. She died very young, so no one really knew her in my family.

    How will that inbreeding most likely show up with DNA results of those children?

    I have two unknown relative matches who have 4 matching segments, .79% matches, on 23andme. They haven't written back to me, yet, but I'm wondering if they are possibly relatives from the children of my grandmother.

    Would the 4 matching segments possibly show inbreeding on their side?

  16. Hi Steve,

    I recently got on 23andme and I have about 995 dna relatives. I am adopted and do not have much family history. Do you know if the number of segments you have in common with someone affects how close the relationship may be? Sarah

  17. Hello Steve, I truly enjoyed reading your fascinating post. Genetics and how it can be associated with genealogy is one of the parts of genetics that I find most interesting. Thank you for writing such an informative post. I'm glad that someone is able to clear up the simple misconceptions that have been created. Genes are wonderful parts of us that still have much to be discovered about them but they have been made out to be these miracle 'maps' to our entire ancestry paths, which is completely, as you rightly stated, false. Thank you again for your information. I have learnt a lot!

    Tayla Scott
    BSc student

  18. This comment has been removed by a blog administrator.

  19. To answer Sarah:
    You are likely to share more segments with closer relatives, but it's not a simple relationship. Longer segments are also indicative of a close relationship (because the typical segment shared by descent will get shorter in each generation). Also, people with ancestors in a group with small effective population size due to isolation (for example, Puerto Ricans and Ashkenazi) will recognize closer relatives as those with larger segments, not those with more segments.

  20. I was recently notified of an mtDNA HVR1 match in Haplogroup K1a1b1. I have hundreds of Y-DNA matches and to date have not been able to find any genealogical connections.

    The fact that this was through the mitochondrial side left me little hope of ever finding a connection to the other gentleman but in a few short days I was able to locate him, share our family trees and make the family connection.

    He and I are 10th cousins. We share Robert Rose born 1594 in Elmswell, Suffolk, England and his wife Margery Evered as 9th great-grandparents.

    Given the fact that we both had to have had mtDNA tests run at the same genetics testing lab, we both had to have done extensive research into our family history and that we connect over 400 years ago I was curious as to the odds of actually discovering this relationship.
