Oooh, A Mosaic Of Ancestry!

Principal Product Scientist Mike Macpherson, former Research Scientist Chuong “Tom” Do, and Computational Biologist Eric Durand have led a team that spent many months developing an innovative and accurate tool to determine your ancestry going as far back as 500 years.

One of the stunning aspects of Ancestry Composition is that it’s based on the newest advances in machine learning and thus will get better over time. “Ancestry Composition is truly innovative. Not only does it use public-genetic databases for reference, it also uses the data set from 23andMe, so as more people join 23andMe, the more powerful and more accurate Ancestry Composition will become,” Mike said.

The feature can very accurately detail the mosaic of your ancestral background, distinguishing British and Irish ancestry, for instance, or telling you the breakdown of your Scandinavian or Italian ancestral origins. It’s also a powerful tool for finding Ashkenazi Jewish ancestry.


Right now, Ancestry Composition is particularly interesting for people of mixed ancestry: individuals who have Native American, Latino, African American or mixed European heritage. An update is planned in the near future to add more detail for people with African and Asian ancestry. This will give a finer level of detail and help customers zero in on the regions of their ancestral origins. Even so, the feature can be enlightening for people of any background: it offers a view of an individual’s genetic ancestry, breaks down the mix of ancestry by percentage, and puts it all into an intuitive visualisation.

There are several other bells and whistles for those who want to dive in and find a few fun surprises. One of those is the Split View, which gives great detail for customers who have at least one parent also in the 23andMe community. If at least one parent has been tested and is linked through the Family Tree feature, Ancestry Composition’s Split View will tell you what mix of your ancestry comes from your mother and what mix of your ancestry comes from your father. Another add-on to the feature is a Chromosome View, which “paints” the ancestry on each of your 23 chromosomes.

If you'd like to see more detail on how this was done, you can look at either a white paper put together by Mike or a recent poster presented at the annual meeting of the American Society of Human Genetics, which outlines the technique. The new feature replaces 23andMe's "ancestry painting" and "global similarity," two tools that were equally pioneering when they were first introduced. With Ancestry Composition, 23andMe breaks new ground and sets a new standard for determining genetic ancestry.


Language Schmanguage, say American Physicists


A group of physicists recently collaborated on a statistical survey of words. You may be wondering why physicists are interested in language. In this case, it is not language per se, but how words imitate the statistical patterns of the stock market and animal populations. This group of researchers, led by Alexander Petersen of the IMT Lucca Institute for Advanced Studies, culled data from Google’s digitized books to analyze how word use varies over time.

In particular, the scientists looked at “word competition.” Why would words compete? Well, this isn’t about competition between different words. Obviously, for language as a whole to function, nouns need verbs, which need prepositions and adverbs. In this sense, competition refers to rivalry between different variations of the same word: is “color” used more than “colour”? It may be hard to imagine this, but before spell-check there were often misspelled words in newspapers and published books. As the researchers point out: “With the advent of spell-checkers in the digital era, the fitness of a ‘correctly’ spelled word is now larger than the fitness of related ‘incorrectly’ spelled words.”

How does spell-check (and grammar check) work in the first place?

Completely new words are often the product of an innovation, such as the internet, but languages also evolve because of new settings. Who knows if Americanisms like “skedaddle”, “rambunctious”, and “discombobulate” would have survived spell-check if they had arisen later in time?

The physicists also looked at synonym death. Have you ever heard of a radiogram? Probably not. The words radiogram and roentgenogram both mean an x-ray. This may come as a shock, but before the 20th century, the word “roentgenogram” was used most frequently. Today, x-ray is the dominant word, while radiogram and roentgenogram are nearly extinct. Shorter, more efficient words can eventually kill their longer, clunkier brethren.

Due to synonym death and the widespread use of spell-check, words are dying. Using complex algorithms, the scientists discovered that in the past 40 years more words have died than during any other period in their data (1800 to 2008). At the same time, fewer words are being successfully introduced into the language. As the scientists conclude: “In the past 10-20 years, the total number of distinct words has significantly decreased, which we find is due largely to the extinction of both misspelled words and nonsensical print errors, and simultaneously, the decreased birth rate of new misspelled variations.” Statistically speaking, the language is shrinking.

Academic paper: Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death


I do have to wonder if this article would have been written in quite the same way had the remit been given to someone at the OED, or indeed had the research been conducted outside of the United States. And in light of the (again, US) research that people who text more often are more reluctant to take on new words… who is responsible for the shrinkage of the English language?

Google NGrams

What are the most-used words in the English language?

One way to trace most-frequently-used words is with the search magic of Google Ngram viewer. With the Ngram viewer, you type a word, and it tells you how often that word is used over a specific time period, based on the Google Books database. For example, the word “the” is used about 5% of the time, which means that in a typical text of 100 words, about 5 of those words are “the.” Similarly, the word “of” is used about 3% of the time, and the word “and” is used about 2.5% of the time.
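To make the percentage idea concrete, here is a minimal sketch in Python of how a word's share of a text can be computed. It is not how the Ngram viewer itself works; the function name `relative_frequencies`, the simple tokenising rule, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re


def relative_frequencies(text):
    """Map each lowercase word in `text` to its share of all words (0.0 to 1.0)."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample sentence, far too short to reflect real English usage.
sample = "The cat sat on the mat and the dog sat by the door"
freqs = relative_frequencies(sample)

# Print the three most common words with their share of the text.
for word, share in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {share:.1%}")
```

On a single sentence the shares are much higher than the figures quoted above; percentages like 5% only emerge when the count runs over a very large body of text.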

The folks over at Oxford Dictionaries compiled a comprehensive analysis of English-language usage called the Oxford English Corpus. In this sense, a corpus is the entire body of words and phrases that constitute a language. See their complete analysis here.

Obviously, the bulk of the most commonly used words are short words that help build sentences. As in the previous sentence, the words “of,” “are,” “that,” and “the” join the parts of the sentence that make the idea. Linguists call these “function words.” 84 of the top 100 words are function words.

Things get interesting when you look at the list of top 10 nouns:


Some of the words, like “way,” have many different meanings, which may be why they are more frequently used. For example, you could say, “She’s lost her way” or “that’s the way to the grocery store.” It is the same word in both instances, but with very different meanings. Another reason certain words occur frequently has to do with their use in common phrases. So a word like “time” is used often on its own, and it also appears in many common phrases, like “last time,” “in time,” “next time,” etc.