“Fate tried to conceal him by naming him Smith.”

— Oliver Wendell Holmes, Sr., poet and physician

What can you discover about a person from just their first name? With the right data set, it turns out, quite a bit. Names might seem randomly distributed, but a person’s name says a lot about them: where they are from, their generation, and other demographic details. This has many practical uses, but for now let’s start by taking a look at the raw data.

Diving into the name database

The Social Security Administration maintains a database of baby names for each year going back to 1880 along with their rankings. Here are the most popular baby names for 2016:

Rank  Male      Female
1     Noah      Emma
2     Liam      Olivia
3     William   Ava
4     Mason     Sophia
5     James     Isabella
6     Benjamin  Mia
7     Jacob     Charlotte
8     Michael   Abigail
9     Elijah    Emily
10    Ethan     Harper
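
For readers who want to reproduce a ranking like this one, here is a minimal sketch in Python. It assumes each year's file uses a `name,sex,count` layout with one line per name and sex, which is how the Social Security Administration's national files have been distributed. The sample rows and counts below are invented for illustration, and `top_names` is a hypothetical helper, not part of any library.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for one year's file; the counts are invented for illustration.
SAMPLE_YEAR = """\
Emma,F,300
Olivia,F,290
Ava,F,250
Noah,M,310
Liam,M,280
William,M,240
"""

def top_names(lines, n=10):
    """Return {sex: [names ranked by count, highest first]} for one year."""
    by_sex = defaultdict(list)
    for name, sex, count in csv.reader(lines):
        by_sex[sex].append((int(count), name))
    return {
        sex: [name for count, name in sorted(rows, key=lambda r: -r[0])[:n]]
        for sex, rows in by_sex.items()
    }

ranking = top_names(io.StringIO(SAMPLE_YEAR), n=2)
print(ranking["M"])
print(ranking["F"])
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.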

Looking at this chart you may get an intuitive sense that some of these names are older than others. You may know Williams of all different ages, but Mason “feels” like a recent trend. Similarly for girls, Emma feels like it has a longer history than Harper. Looking at the prevalence over time shows this to be true. Here are those names, along with a few others:

Clearly, the popularity of a particular name varies widely over time. Graphing names this way lets us gather insights at a glance. Some names, like William, were once staggeringly common (at one point nearly 10% of the male population!) but have lost popularity over time. Emma is, at the time of this writing, the most common girls’ name, despite having gone nearly extinct in the 1970s. An interesting anomaly in this analysis is the tendency for some names to appear seemingly spontaneously. Often these are “invented” names from popular media.
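
One way to measure prevalence over time is a name's share of all births of that sex recorded in a given year. The sketch below is a rough Python illustration of that idea, not necessarily how the original plots were produced. It assumes the data has already been loaded into `(year, name, sex, count)` tuples; the rows are invented, and `yearly_share` is a hypothetical helper.

```python
from collections import defaultdict

# Invented rows: (year, name, sex, count). Real data has thousands of rows per year.
ROWS = [
    (1900, "William", "M", 90), (1900, "James", "M", 910),
    (2000, "William", "M", 20), (2000, "James", "M", 980),
]

def yearly_share(rows, name, sex):
    """Return {year: fraction of that sex's recorded births given `name`}."""
    totals = defaultdict(int)  # all recorded births of this sex, per year
    hits = defaultdict(int)    # births with the requested name, per year
    for year, row_name, row_sex, count in rows:
        if row_sex != sex:
            continue
        totals[year] += count
        if row_name == name:
            hits[year] += count
    return {year: hits[year] / totals[year] for year in sorted(totals)}

shares = yearly_share(ROWS, "William", "M")
for year, share in shares.items():
    print(year, f"{share:.1%}")
```

Dividing by the yearly total matters because the number of recorded births changes a great deal over the decades, so raw counts alone would exaggerate recent names.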

Pop culture and historical events

The name “Samantha” was popularized in 1964 by Elizabeth Montgomery’s character in the TV series Bewitched. The name existed before then but was rare, and it was reportedly chosen because it sounded particularly “witchy” to contemporary audiences. At the time, “Samantha” was in the same league as the names of her supporting characters “Endora”, “Tabitha”, and “Esmeralda” (although those never caught on!).

The first Samantha (1964)

“Madison” was chosen by Daryl Hannah’s character in the 1984 film Splash, after the street of the same name in New York City. “Madison’s not a name!” Tom Hanks’ character exclaims in the movie. Today it is one of the most popular girls’ names, so his objection may confuse modern viewers.



Just as media or current events can spontaneously create names, they can also lead to their extinction. While “Katrina” had already been in decline for a few decades, the 2005 hurricane led to the name being stigmatized. Even more dramatically, “Isis” enjoyed moderate success (about 500 girls per year) throughout much of the 2000s. Due to its association with the terrorist group of the same name, only 53 girls were given that name in 2016. I’m somewhat surprised that there are even 53 baby Isises out there, proving once again that people are delightfully complex.


While most names are exclusively male or female, some are androgynous (a phenomenon exploited to great effect in Saturday Night Live’s “It’s Pat!” series of skits). Occasionally a name will even flip genders, as with Leslie or Rory, the latter coinciding with the success of the television series “Gilmore Girls”:


A practical application: Classifying names by gender

Looking at these plots, you may get the sense that they simply confirm what you already knew about a name. It may be obvious to you that Mary is a woman’s name or that Clarence was more popular in the 1800s. Where the data really shines though is in resolving ambiguous examples. Suppose you got an email from a person named Jaime. Is Jaime a man or a woman? Let’s compare it to the seemingly equivalent name Jamie:


While these two names look and sound similar, these graphs show that Jaime is probably male while Jamie is probably female. We can use this same intuition to estimate the male/female probability for any name. However, what if we had a huge dataset, with hundreds of thousands of names? Manually classifying them all would be very tedious!

Instead, imagine using a data-driven approach to create an automated gender classifier. To do this we’ll leverage a tool from statistics known as Maximum Likelihood Estimation. If you’re interested in the technical details, Wikipedia has a great introduction to the subject. Otherwise, skip ahead to see the results.

The chart below shows the popularity of a few names, separated by male and female. While a few names are almost exclusively male or female, most names are at least occasionally given to both genders.


By applying the maximum likelihood estimator, we can get a single number that predicts if a given name is male or female:
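
For a single name, one simple way to get such a number is to treat each baby given the name as an independent coin flip with an unknown probability p of being female. Under that assumption, the maximum likelihood estimate of p is just the observed fraction: the female count divided by the total count. The Python sketch below illustrates that assumption and is not necessarily the exact model behind the chart. The counts are invented, and `p_female` and `classify` are hypothetical helper names.

```python
def p_female(female_count, male_count):
    """MLE of P(female | name) under a Bernoulli model: female / total."""
    total = female_count + male_count
    if total == 0:
        raise ValueError("name not found in the data")
    return female_count / total

def classify(female_count, male_count, threshold=0.5):
    """Label a name by comparing the estimate to a threshold."""
    estimate = p_female(female_count, male_count)
    return "female" if estimate >= threshold else "male"

# Invented counts: (female, male) births summed over all years.
COUNTS = {"Jamie": (750, 250), "Jaime": (200, 800)}

for name, (f, m) in COUNTS.items():
    print(name, round(p_female(f, m), 2), classify(f, m))
```

Pooling counts across all years ignores when a person was born. For names that have flipped genders over time, restricting the counts to plausible birth years would likely give a better estimate.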


While we used this technique to estimate sex from a given name, one can imagine applying it to other demographic features. Suppose you’re an advertiser trying to tailor your marketing campaign to a list of client leads. You could use the list of first names to estimate age, ethnicity, and potentially even geographic location. Another interesting application might be to take federal salary data and use the gender estimator to monitor for wage gaps in different federal departments. If you’re interested in exploring on your own, you can download the 2021 name database here.


Want to learn more? Join our meetup group! Bethesda Data Science Meetup