Thursday, April 10, 2014

U.S. baby names, 1880-2012: Diversity

U.S. baby names, 1880-2012: Diversity - Prooffreader

The United States Social Security Administration released a fascinating data set a few years ago: all of the names of newborns registered since 1880. There have already been some great analyses and visualizations of this data (there's a list below). Still, I think there are plenty more items of interest to be found in mining this data, or if that fails, surveying, dowsing and spelunking it.

When friends heard I was looking at this data set, their first reaction was inevitably, "So you must see more and more unique names as the years progress?" (Usually followed by, "Do you see a decline in 'Adolf' after WWII?" Short answer: yes.) Name diversification is anecdotally evident -- nobody named their babies Rumer, or North (or even Ashley) in 1900 -- but it's nice to have evidence. I'm far from the first to explore this phenomenon, but I think I've come up with some interesting displays:

This data seems to support the view that there are more and more names out there: in the 1880s, over 8% of the population whose birth records ended up reported in the Social Security database (an important distinction, as we'll see!) were named John or Mary; the most popular names nowadays are closer to 1% of the total, and their share has decreased rapidly since the late 1960s. Other analysts have shown that girls' names have more of a bandwagon effect than boys', and these graphs seem to bear that out, with higher peaks when names like Linda or Jessica become very popular for a few years, then fade into relative obscurity.
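For readers who want to reproduce this kind of figure, here is a minimal sketch of the "share of the most popular name" calculation. It is not the code behind the graphs; the function name `top_name_share`, the record layout `(year, name, sex, count)` and the toy counts are all assumptions made up for illustration, not real SSA figures.

```python
from collections import defaultdict

def top_name_share(records):
    """Map (year, sex) to (most common name, its share of recorded births).

    `records` is an iterable of (year, name, sex, count) tuples; this
    layout is an assumption for the sketch, not the SSA's own format.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for year, name, sex, count in records:
        totals[(year, sex)] += count
        counts[(year, sex)][name] += count

    result = {}
    for key, names in counts.items():
        top = max(names, key=names.get)  # name with the largest count
        result[key] = (top, names[top] / totals[key])
    return result

# Invented toy counts, chosen only to make the arithmetic easy to follow.
toy_records = [
    (1880, "John", "M", 40), (1880, "William", "M", 35), (1880, "James", "M", 25),
    (2012, "Jacob", "M", 12), (2012, "Mason", "M", 10), (2012, "Ethan", "M", 10),
    (2012, "Noah", "M", 9), (2012, "Liam", "M", 9),
]

shares = top_name_share(toy_records)
for (year, sex), (name, share) in sorted(shares.items()):
    print(year, sex, name, f"{share:.0%}")
```

Note that the denominator is the number of births that made it into the records, not the number of births that actually happened, which is exactly the distinction flagged above.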

The Social Security database isn't perfect, nor do its curators claim it to be; in particular, statistics before World War II are suspect because of missing data and the way the data was collected. This is most evident when we try to visualize diversity by looking at how many new names per birth were introduced into the records:

The huge peak in the 1910s isn't an explosion of weird names; it's an artifact of the database. The Social Security program was created in 1935, and adults who then signed up gave their birth years; what you are seeing is evidence that a lot of names are missing before then, which only stands to reason. How many people born in Appalachia or the Old West would the Social Security Administration find records of decades after their births? Once the system kicks into gear, we see an increase in overall diversity since the late 1960s, as expected.
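The "new names per birth" measure can be sketched as follows: for each year, count the names that have never appeared in any earlier year and divide by that year's recorded births. This is my reading of the measure written out as code, not the script that produced the graph; `new_names_per_birth` and the toy counts are hypothetical.

```python
from collections import defaultdict

def new_names_per_birth(records):
    """Map year to (names never seen in an earlier year) / (recorded births).

    `records` is an iterable of (year, name, count) tuples, an assumed
    layout for this sketch.
    """
    names_by_year = defaultdict(set)
    births_by_year = defaultdict(int)
    for year, name, count in records:
        names_by_year[year].add(name)
        births_by_year[year] += count

    seen = set()
    rates = {}
    for year in sorted(names_by_year):
        new_names = names_by_year[year] - seen  # first appearances only
        rates[year] = len(new_names) / births_by_year[year]
        seen |= names_by_year[year]
    return rates

# Invented toy counts, not real SSA figures.
toy_records = [
    (1900, "John", 50), (1900, "Mary", 50),
    (1901, "John", 40), (1901, "Mary", 40), (1901, "Ada", 20),
    (1902, "John", 50), (1902, "Ada", 50),
]

rates = new_names_per_birth(toy_records)
for year, rate in rates.items():
    print(year, rate)
```

Every name counts as new in the first year of any data set, so that year's value is meaningless and is best dropped from a plot. The same logic hints at why patchy early records inflate the measure: a name missing from the sparse early years registers as "new" when it finally shows up.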

This is a fascinating data set, and in upcoming posts I'll mine it some more. For now, though, I'll spelunk one name: the hook of the 1989 film Heathers is that there were a lot of teenage girls with that name at that time, and the data shows that is the case; it also shows how dated the film is, because by now the name has faded into the obscurity from whence it came.