
Data access and analysis of baby names statistics


babe

Note that the first time you use babe, you need Internet access, and it will take a few seconds (depending on bandwidth) to download the required data.

But this data is automatically saved in a local file so things are faster the next time around.
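The download-once behaviour described above is a common caching pattern. Here is a minimal sketch of it, assuming a hypothetical `fetch` callable and cache path. Neither is part of babe's API, and where or how the package actually stores its file is not described here.

```python
from pathlib import Path
from typing import Callable


def load_with_cache(cache_file: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the cached bytes if present; otherwise fetch once and store them.

    `fetch` is a hypothetical stand-in for whatever performs the download.
    """
    if cache_file.exists():
        # Fast path: the data is already on disk, no network needed.
        return cache_file.read_bytes()
    # Slow path: get the data (e.g. over HTTP) and save it for next time.
    payload = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(payload)
    return payload
```

Calling `load_with_cache` twice with the same path invokes `fetch` only the first time, which is why later runs are faster.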

To install:

pip install babe

Then, in a Python console or notebook:

from babe import UsNames

d = UsNames()

Intro to the data

The underlying data gives a popularity score (the number of babies recorded) for each (state, gender, name, year) tuple that has data, covering babies born in the US between 1910 and 2019.

d.data
       state  gender  year      name  popularity      name_g
0         AK       F  1910      Mary          14      Mary_F
1         AK       F  1910     Annie          12     Annie_F
2         AK       F  1910      Anna          10      Anna_F
3         AK       F  1910  Margaret           8  Margaret_F
4         AK       F  1910     Helen           7     Helen_F
...      ...     ...   ...       ...         ...         ...
28277     WY       M  2019      Theo           5      Theo_M
28278     WY       M  2019   Tristan           5   Tristan_M
28279     WY       M  2019   Vincent           5   Vincent_M
28280     WY       M  2019    Warren           5    Warren_M
28281     WY       M  2019    Waylon           5    Waylon_M

6122890 rows × 6 columns

print(f"{len(d.names)} unique names")
31862 unique names

But some names can be used for both genders, so most of the internals will use name_g, the name concatenated with the gender (_F or _M):

print(f"{len(d.name_gs)} unique names_g (gendered names)")
34952 unique names_g (gendered names)
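To see why there are more name_g values than names, here is a small sketch in plain Python. The (name, gender) pairs are taken from outputs shown in this README, and the variable names are illustrative only, not babe attributes.

```python
# (name, gender) pairs that appear in the outputs of this README
pairs = [
    ("Mary", "F"),
    ("Cora", "F"),
    ("Vanessa", "F"),
    ("Vanessa", "M"),
    ("Theo", "M"),
]

# A plain name ignores gender, so Vanessa is counted once.
names = {name for name, _ in pairs}

# A name_g is the name joined to its gender with an underscore,
# so Vanessa_F and Vanessa_M are two distinct entries.
name_gs = {f"{name}_{gender}" for name, gender in pairs}

print(len(names), "unique names")
print(len(name_gs), "unique name_gs")
```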

You can use resolve_name_g to get the name_g corresponding to a name as long as the name isn't used for more than one gender.

d.resolve_name_g('Cora')
'Cora_F'
try:
    d.resolve_name_g('Vanessa')
except AssertionError as e:
    print(e)
The Vanessa can be used for both genders. Specify Vanessa_F or Vanessa_M
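The resolution rule can be sketched as follows. This is a hypothetical stand-in written for illustration, not babe's implementation: the function name `resolve_gendered_name` and its `name_gs` argument are assumptions, and it raises `ValueError` where the example above shows the package raising `AssertionError`.

```python
def resolve_gendered_name(name: str, name_gs: set[str]) -> str:
    """Hypothetical sketch: map a plain name to its unique name_g."""
    # Collect the gendered forms of this name that exist in the data.
    candidates = sorted(g for g in (f"{name}_F", f"{name}_M") if g in name_gs)
    if not candidates:
        raise KeyError(f"{name} not found")
    if len(candidates) > 1:
        # The name is used for both genders, so the caller must choose.
        raise ValueError(
            f"{name} is used for both genders: specify {' or '.join(candidates)}"
        )
    return candidates[0]


known = {"Cora_F", "Vanessa_F", "Vanessa_M"}
print(resolve_gendered_name("Cora", known))
```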

by_state data

In some cases, it's more convenient to have a view indexed by (state, name_g, year). The by_state attribute provides that.

d.by_state
state  name_g      year
AK     Mary_F      1910    14
       Annie_F     1910    12
       Anna_F      1910    10
       Margaret_F  1910     8
       Helen_F     1910     7
                           ..
WY     Theo_M      2019     5
       Tristan_M   2019     5
       Vincent_M   2019     5
       Warren_M    2019     5
       Waylon_M    2019     5
Name: popularity, Length: 6122890, dtype: int64
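A view like this can be derived from a table shaped like `d.data` with standard pandas calls. The sketch below uses three rows copied from the `d.data` output above. It shows one plausible construction and is not necessarily how babe builds `by_state`.

```python
import pandas as pd

# Three rows copied from the d.data output shown earlier
data = pd.DataFrame(
    {
        "state": ["AK", "AK", "WY"],
        "gender": ["F", "F", "M"],
        "year": [1910, 1910, 2019],
        "name": ["Mary", "Annie", "Theo"],
        "popularity": [14, 12, 5],
    }
)

# name_g is the name joined to the gender with an underscore.
data["name_g"] = data["name"] + "_" + data["gender"]

# Index the popularity column by (state, name_g, year).
by_state = data.set_index(["state", "name_g", "year"])["popularity"]
print(by_state)
```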

This allows one to do things like getting the data for a given state only:

d.by_state['CA']
name_g      year
Mary_F      1910    295
Helen_F     1910    239
Dorothy_F   1910    220
Margaret_F  1910    163
Frances_F   1910    134
                   ... 
Zayvion_M   2019      5
Zeek_M      2019      5
Zhaire_M    2019      5
Zian_M      2019      5
Ziyad_M     2019      5
Name: popularity, Length: 387781, dtype: int64

... or, within a state, getting the popularity by year for a given name:

d.by_state['CA']['Cora_F']  # or d.by_state['CA', 'Cora_F']
year
1911      8
1912      9
1913     15
1914     15
1915     17
       ... 
2015    269
2016    244
2017    284
2018    282
2019    256
Name: popularity, Length: 109, dtype: int64

... and if you want the data for a given name (really a name_g) across all states, you can get it by slicing.

For example, suppose you're wondering how many little boys were called "Vanessa" and, more specifically, when and where:

d.by_state[:, 'Vanessa_M'] 
state  year
AZ     1988     8
CA     1980     7
       1981     5
       1982    20
       1983    19
       1984    14
       1985    12
       1986    13
       1987    13
       1988    26
       1989    17
       1990    16
       1991    18
       1992    17
       1993    17
       1994    10
       1995     9
       1996    10
       1997    11
       1998     7
DC     1989    11
NY     1982     5
       1983     9
       1986     6
       1988     6
       1989     6
TX     1981     5
       1982     7
       1983    12
       1984     9
       1985    10
       1986     8
       1987     9
       1988     8
       1989     5
       1990     6
       1991     5
       1992     5
       1994     5
Name: popularity, dtype: int64
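These lookups rely on ordinary pandas MultiIndex selection, so they can be tried on a small Series without downloading anything. The sketch below uses a handful of values copied from the outputs above. It uses `.loc` and `.xs`, which are explicit pandas equivalents of the bracket and slice forms shown in this README.

```python
import pandas as pd

# (state, name_g, year, popularity) rows copied from the outputs above
rows = [
    ("AZ", "Vanessa_M", 1988, 8),
    ("CA", "Cora_F", 2018, 282),
    ("CA", "Cora_F", 2019, 256),
    ("CA", "Vanessa_M", 1988, 26),
    ("TX", "Vanessa_M", 1988, 8),
]
index = pd.MultiIndex.from_tuples(
    [r[:3] for r in rows], names=["state", "name_g", "year"]
)
toy = pd.Series([r[3] for r in rows], index=index, name="popularity").sort_index()

# One state: the result is indexed by (name_g, year).
ca = toy.loc["CA"]

# One state and one name_g: the result is indexed by year.
cora_ca = toy.loc[("CA", "Cora_F")]

# One name_g across all states: the result is indexed by (state, year).
vanessa_m = toy.xs("Vanessa_M", level="name_g")
print(vanessa_m)
```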

national data

A national aggregation is available through the national attribute:

d.national
name_g      year
Aaban_M     2013     6
            2014     6
Aadam_M     2019     6
Aadan_M     2008    12
            2009     6
                    ..
Zyriah_F    2013     7
            2014     6
            2016     5
Zyron_M     2015     5
Zyshonne_M  1998     5
Name: popularity, Length: 633239, dtype: int64

The interface is the same as for the by_state attribute, but without the state level.

d.national.loc['Vanessa_F']
year
1935       5
1947      24
1948      32
1949      16
1950      41
        ... 
2015    1687
2016    1633
2017    1486
2018    1345
2019    1188
Name: popularity, Length: 74, dtype: int64
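For intuition, a national-style series can be approximated by summing a state-level series over states. The sketch below does this for the 1988 Vanessa_M counts copied from the by_state output earlier in this README. It is an assumption about what "aggregation" means here: babe may instead build `national` from a separate national file, in which case its values need not equal such a sum.

```python
import pandas as pd

# 1988 counts for Vanessa_M, copied from the by_state output earlier
rows = [
    ("AZ", "Vanessa_M", 1988, 8),
    ("CA", "Vanessa_M", 1988, 26),
    ("NY", "Vanessa_M", 1988, 6),
    ("TX", "Vanessa_M", 1988, 8),
]
index = pd.MultiIndex.from_tuples(
    [r[:3] for r in rows], names=["state", "name_g", "year"]
)
by_state = pd.Series([r[3] for r in rows], index=index, name="popularity")

# Collapse the state level by summing within each (name_g, year) group.
national_like = by_state.groupby(level=["name_g", "year"]).sum()
print(national_like)
```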

Plotting stuff

d.plot_popularity('Cora');


d.plot_popularity('Cora', 'GA');


d.plot_popularity(['Cora', 'Vanessa_F']);


d.plot_popularity('Cora', ['CA', 'GA']);


d.plot_popularity(['Cora', 'Vanessa_F'], ['CA', 'GA']);


Misc

gender-ambiguous names

We define the "femininity" of a name as the proportion of times it was used (all states, all years) to name a girl, and the "masculinity" of a name as the corresponding proportion for boys.
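As a sketch of how such proportions can be computed from a table shaped like `d.data`, the example below uses invented counts for placeholder names (not real statistics). It is one straightforward pandas approach and is not necessarily how babe computes `femininity_of_name`.

```python
import pandas as pd

# Invented counts for placeholder names; these are not real statistics
data = pd.DataFrame(
    {
        "name": ["NameA", "NameA", "NameB", "NameC"],
        "gender": ["F", "M", "F", "M"],
        "popularity": [30, 10, 50, 20],
    }
)

# Total count per (name, gender), pivoted so that F and M become columns.
counts = data.groupby(["name", "gender"])["popularity"].sum().unstack(fill_value=0)

# Share of each name's total usage that went to girls and to boys.
total = counts.sum(axis=1)
femininity = counts["F"] / total
masculinity = counts["M"] / total
print(femininity)
```

By construction, the femininity and masculinity of a name sum to 1.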

d.femininity_of_name.iloc[12000:12010]
Lemmie      0.161290
Kashmere    0.161290
Clary       0.162162
Sung        0.162393
Kyrie       0.163527
Cedar       0.163686
Masyn       0.163895
Naveen      0.165605
Chai        0.166667
Atlee       0.167382
dtype: float64
d.femininity_of_name.plot(figsize=(17, 5), ylabel='femininity');


d.masculinity_of_name.iloc[19000:19010]
Berkley     0.108889
Dasani      0.110092
Sharone     0.111111
Ifeoluwa    0.111111
Rama        0.111111
Scout       0.111486
Brownie     0.111732
Lashon      0.113158
Indigo      0.113364
Argie       0.113636
dtype: float64
d.masculinity_of_name.plot(figsize=(17, 5), ylabel='masculinity');


The (gender-)"ambiguity" of a name can thus be defined as twice the minimum of its femininity and masculinity.

Defined this way, the score lies between 0 and 1. It is maximal (1) when equal numbers of boys and girls were given the name, and minimal (0) when only one gender was given it.

Note that this score is raw (or "un-smoothed"). It's computed with the raw counts, so the extreme scores will usually be for names with very low counts.
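The definition can be written as a small function of raw counts. This is a sketch of the formula as stated in the text, with a hypothetical function name; it also shows why low-count names reach the extremes as easily as high-count ones.

```python
def ambiguity(n_girls: int, n_boys: int) -> float:
    """Twice the smaller of the femininity and masculinity proportions."""
    total = n_girls + n_boys
    if total == 0:
        raise ValueError("at least one recorded use is required")
    femininity = n_girls / total
    masculinity = n_boys / total
    return 2 * min(femininity, masculinity)


print(ambiguity(5, 5))        # 1.0: a rare name split evenly
print(ambiguity(5000, 5000))  # 1.0: a common name split evenly
print(ambiguity(40, 0))       # 0.0: used for one gender only
print(ambiguity(30, 10))      # 0.5: 2 * min(0.75, 0.25)
```

Because the score depends only on proportions, a name with 5 girls and 5 boys scores the same as one with 5000 of each.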

d.ambiguity_of_name
Munachiso    1.0
Addis        1.0
Deshone      1.0
Gal          1.0
Rajdeep      1.0
            ... 
Sharelle     0.0
Analy        0.0
Sharayah     0.0
Sharaya      0.0
Aaban        0.0
Length: 31862, dtype: float64
t = d.ambiguity_of_name
print(f"There are {len(t[t > 0])} (gender-)ambiguous names")
There are 3090 (gender-)ambiguous names
t = d.ambiguity_of_name
t[t > 0].plot(figsize=(17, 5), ylabel='gender-ambiguity');


t = list(d.ambiguous_names)
print(f"{len(t)} (gender-)ambiguous names:")
print(*t[:9], '...', sep=', ')
3090 (gender-)ambiguous names:
Nolie, Tyrese, Linn, Savannah, Bryn, Rei, Abby, Shilo, Tracy, ...


          

