Wednesday, September 17, 2014

Coming Soon: Automation!

WARNING: nit-picky name nerd rant!

Every year for the past decade or so, when the SSA comes out with the top names, I go through the lists and put all the alternate spellings of different names together. Back before 2010, this didn't take that long, because the SSA only released the top 1000 names (names with instances of 200 or more for boys and 250 or more for girls). This was pretty accurate, but I wanted more! Then the SSA started releasing names with 5 or more instances for every single year going back to 1880!!!  This is awesome! However, it means that every year, there are 14,000-20,000 names for each gender. Sure, when you account for all the different spellings, there are only about 7,000 names, but still-- it's a task that takes me a couple of months. 

Enter my programmer friend who is a whiz at SQL databases. He has managed to put all the names from every year into a database! Wouldn't it be awesome if every year, I could feed a database some data when it is released, and have it spit out a top 7000 list with the names already sorted into like names with different spellings? YES! However, this is proving to be a daunting task. Putting names together isn't as easy as one would think. For example, one would pretty much argue that Katy, Katie, Kati and Katee are all the same name. Ditto for Eric, Erik and Erick.

When you get into modern names, however, this gets confusing. Spelling names as creatively as possible has been the trend for at least a decade, and not every parent spells names so their pronunciations are clear. For example, Aleyah-- is it Aaliyah or Alaya? Alysa-- Alyssa, Aleesa, or Eliza? We would all pronounce Mia like /MEE a/ and Maya like /MY a/ (except for the couple of people I met who pronounce it /MAY a/). What is Miah? Is it like Maya (following in the footsteps of Maria/Mariah, Dina/Dinah?) or is it Mia with an extra h? In the 1990s, it probably would have been more likely to be Maya, since that name was more popular than Mia, and had more alternate spellings. However, now Mia is more common, and usually I would default ambiguous names to the more popular option.

Unfortunately, the data can't really do that. I had originally decided to assign alternate spellings to a certain name permanently (e.g., Catherine would always be an alias of Katherine, even in years when that spelling was more popular). This is proving to be a really daunting task!
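To make the permanent-alias idea concrete, here is a minimal sketch in Python. It is not my friend's actual SQL setup: the `ALIASES` table, the `rank_grouped` function, and the counts are all made up for illustration.

```python
from collections import Counter

# Hypothetical permanent alias table: spelling -> base name.
# Any spelling not listed here is treated as its own base name.
ALIASES = {
    "Catherine": "Katherine",
    "Kathryn": "Katherine",
    "Erik": "Eric",
    "Erick": "Eric",
}

def rank_grouped(rows, aliases=ALIASES):
    """Sum counts per base name and return (base, total) pairs, largest first."""
    totals = Counter()
    for spelling, count in rows:
        totals[aliases.get(spelling, spelling)] += count
    # Sort by total descending, then by name so ties come out in a fixed order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Made-up counts, for illustration only (these are not SSA figures).
sample = [
    ("Katherine", 50), ("Catherine", 70), ("Kathryn", 10),
    ("Eric", 60), ("Erik", 20), ("Erick", 5),
]
print(rank_grouped(sample))
```

Notice that Catherine stays filed under Katherine even though it has the bigger count in the sample. That is exactly the "permanent" part, and it is also why ambiguous spellings like Miah are a problem: a fixed table can only send each spelling to one place.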

Take another example: Adan and Aidan. Adan /ah DAHN/ is the Spanish form of Adam. In earlier days, it may also have been a typo or transcription error for Adam (these types of errors were common, especially with most documents being hand-written). Adan has been in the top 7000 since the early 1900s.

Aidan is an unrelated Irish name. Aidan, with two As, is the original and traditional Irish spelling, and it first appeared in 1936, although we didn't see it again until 1957. Aiden, by far the most popular spelling now, didn't make an appearance until 1970. What do Aidan and Aiden have to do with Adan? Today, many parents are using Adan as an alternative spelling of Aidan (there were 44 different spellings in 2013 if you include Adan: Aiden, Ayden, Aidan, Aden, Adan, Aydin, Aydan, Aidyn, Aaden, Aedan, Aayden, Adin, Adyn, Aaiden, Aydenn, Aiyden, Aeden, Eiden, Aydon, Adon, Aedyn, Aydyn, Aadyn, Aidin, Aidon, Eidan, Aydden, Aidden, Ayeden, Ayiden, Eyden, Aydn, Aadan, Aidenn, Aaidyn, Eydan, Adynn, Aedin, Aadon, Aidynn, Aedon, Aadin, Aiiden, Aiydan).

So, do we count Adan as a separate name, or as an alias of Aidan? And do we put Aidan as the base name, or Aiden, since it is much more popular in the long run? If we take Adan out of the Aidan/Aiden count, Jackson becomes the #1 name for boys in 2013! See? This *does* affect statistics!

There isn't an easy way to make spellings 100% consistent over time without going through every single year by hand and then checking it against previous years. It takes me an average of 3 months to go through the girls' and boys' name tables for each year and put them in order by spelling. At this rate, it would take me 33.25 years to do all the years going back to 1880! See why automation is appealing? But oh my, is it opening all sorts of new cans of worms!


Anonymous said...

Very interesting. I can't wait to hear more about this project.

Kelly said...

I feel your pain. I've done the same thing as you- load the data into a SQL database. I tried to automate it using this concept called fuzzy matching. It helps but isn't near perfect, so I'm going through the names one by one now. I've written a little program to iterate through the names in the database and as each name pops up on the screen I either assign it to a new group or assign it to an existing one (ex: Caitlin would go with Kaitlyn, Katelyn, etc). SO MANY NAMES, though!
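
For anyone curious what fuzzy matching looks like in practice, here is a rough sketch. The comment above doesn't say which tool was used, so this assumes Python's standard-library `difflib` as the similarity measure; the `suggest_groups` function, the sample `groups`, and the 0.7 cutoff are all hypothetical.

```python
import difflib

def suggest_groups(name, groups, n=3, cutoff=0.7):
    """Return base names of existing groups that contain a spelling similar to `name`."""
    # Map every known spelling (lowercased) back to its group's base name.
    spelling_to_base = {}
    for base, spellings in groups.items():
        for spelling in spellings:
            spelling_to_base[spelling.lower()] = base

    # get_close_matches returns the most similar spellings, best match first.
    matches = difflib.get_close_matches(
        name.lower(), list(spelling_to_base), n=n, cutoff=cutoff
    )

    # Turn matched spellings into group names, dropping repeats but keeping order.
    suggestions = []
    for match in matches:
        base = spelling_to_base[match]
        if base not in suggestions:
            suggestions.append(base)
    return suggestions

# Hypothetical groups built so far: base name -> known spellings.
groups = {
    "Kaitlyn": ["Kaitlyn", "Katelyn"],
    "Mia": ["Mia"],
    "Maya": ["Maya"],
}

print(suggest_groups("Caitlin", groups))
print(suggest_groups("Miah", groups))
```

This only compares letters, not sounds, so it can suggest a group but can't settle whether Miah is meant as Mia or Maya. That decision still needs a person at the keyboard.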

Ben Curthoys said...

I'm looking for a source of this data - name variants grouped together - to try and make a system more usable. If someone phones someone using my software, and says "Hi, my name's Kathryn Smith", and the operator types "Catherine" in, they aren't going to find them in the database, and then there's going to be an irritating "Sorry, how do you spell that?" exchange.

So, when I save a name in the database, I also transform it into a canonical form, and save that, and then do the same transformation on the search terms, so a search for one will match the other. Someone typing "Kate" into the search box would find both Kathryn and Catherine and so on... but there are SO MANY NAMES though.
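
The save-and-search idea above might look something like this. It is only a sketch under assumptions: the `VARIANTS` table, the `canonical` function, and the in-memory `Directory` class are hypothetical stand-ins, not the actual software or database described in the comment.

```python
# Hypothetical variant table: lowercase spelling -> canonical key.
VARIANTS = {
    "katherine": "katherine",
    "catherine": "katherine",
    "kathryn": "katherine",
    "kate": "katherine",
}

def canonical(name):
    """Canonical form of a first name; unknown names fall back to lowercase."""
    key = name.strip().lower()
    return VARIANTS.get(key, key)

class Directory:
    """Tiny in-memory stand-in for a database table of people."""

    def __init__(self):
        self._by_canonical = {}

    def save(self, first, last):
        # Keep the original spelling, but index it under the canonical form.
        self._by_canonical.setdefault(canonical(first), []).append((first, last))

    def search(self, first):
        # Apply the same transformation to the search term before looking it up.
        return list(self._by_canonical.get(canonical(first), []))

directory = Directory()
directory.save("Kathryn", "Smith")
directory.save("Catherine", "Jones")
print(directory.search("Kate"))
```

The hard part is not the lookup but filling in the variant table, which is the same grouping problem the post describes.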