Infer Gender from Indian Names
The ability to reliably and programmatically infer a person's social attributes from their name is useful for a broad set of tasks, from estimating bias in media coverage of women to estimating bias in lending against certain social groups. But unlike the American Census Bureau, which produces lists of first and last names that can be (and are) used to infer gender, race, ethnicity, etc., the Indian government produces no comparable datasets. Hence, the relationship between names and gender, ethnicity, language group, etc. has generally been inferred from small datasets constructed in an ad hoc manner.
We fill this yawning gap. Using data from the Indian Electoral Rolls (parsed data here), we estimate the proportion of people who are female, male, and third sex (see here) for a particular first name, year, and state.
Data
How is the underlying data produced?
We split each name into a first name and a last name and then aggregated by state to produce first_name, prop_female, n_female, n_male
This is used to provide the base prediction.
Because the association between prop_female and first_name may change over time, we also use each person's age. The data were collected in 2017, so we calculate the year each person was born and then group by birth year to create first_name, prop_female, n_female, n_male, year
We group across the 12 states to provide the aggregated view.
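The aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the toy records, the input column names (name, sex, age, state), and the helper aggregate are all assumptions made for the example, and the denominator here ignores third-gender registrations.

```python
import pandas as pd

# Hypothetical toy records standing in for parsed electoral roll rows.
# The input column names are assumptions for illustration only.
rolls = pd.DataFrame({
    "name": ["Yasmin Khan", "Vivek Sharma", "Yasmin Begum", "Siri Rao", "Siri Kumar"],
    "sex": ["female", "male", "female", "female", "male"],
    "age": [30, 45, 27, 52, 30],
    "state": ["delhi", "delhi", "delhi", "goa", "goa"],
})

ROLL_YEAR = 2017  # year the data were collected, per the description above

# Default arrangement: treat the first whitespace-separated token as the first name.
rolls["first_name"] = rolls["name"].str.strip().str.lower().str.split().str[0]
rolls["birth_year"] = ROLL_YEAR - rolls["age"]


def aggregate(df, keys):
    """Count women and men per group and compute the proportion female."""
    flagged = df.assign(
        is_female=df["sex"].eq("female"),
        is_male=df["sex"].eq("male"),
    )
    out = flagged.groupby(keys, as_index=False).agg(
        n_female=("is_female", "sum"),
        n_male=("is_male", "sum"),
    )
    out["prop_female"] = out["n_female"] / (out["n_female"] + out["n_male"])
    return out


by_state = aggregate(rolls, ["state", "first_name"])      # per-state view
by_year = aggregate(rolls, ["first_name", "birth_year"])  # per-birth-year view
overall = aggregate(rolls, ["first_name"])                # pooled across states
```

Computing counts first and the proportion second keeps n_female and n_male available, so a consumer can judge how much evidence sits behind each prop_female value.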
Issues with underlying data
Concerns:
Voter registration lists may not be accurate, systematically underrepresenting the poor, minorities, etc.
Voter registration lists at best reflect adult citizens. But to the extent that prejudice against women, etc., prevents some kinds of people from reaching adulthood, the data bakes those biases in.
Indian names are complicated. We do not have good parsers for them yet. We have gone for the default arrangement. Please go through the notebook to look at the judgments we make. We plan to improve the underlying data over time.
Gender Classifier
We start by providing a base model for first_name that gives the Bayes optimal solution: the proportion of people with that name who are women. We also provide a series of base models for cases where the state of residence is known. In the future, we plan to use an LSTM to learn the relationship between sequences of characters in the first name and gender.
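To illustrate what a Bayes optimal rule looks like when the first name is the only feature, here is a small sketch. The function bayes_label is hypothetical and not part of naampy. It assumes only female and male counts are considered, and it breaks exact ties toward "male" arbitrarily.

```python
def bayes_label(n_female, n_male):
    """Return (label, expected_error) for a first name with the given counts.

    Predicting the majority class minimizes the expected error, which
    equals the share of the minority class among people with that name.
    """
    total = n_female + n_male
    if total == 0:
        # No evidence for this name: decline to predict.
        return None, None
    prop_female = n_female / total
    label = "female" if prop_female > 0.5 else "male"
    expected_error = min(prop_female, 1 - prop_female)
    return label, expected_error


# Counts taken from the example output shown in the usage section.
yasmin = bayes_label(n_female=2635, n_male=24)
yoga = bayes_label(n_female=150, n_male=202)
```

A name such as yoga, whose prop_female is close to 0.5, still receives a label, but the expected error shows that the label carries little information.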
Installation
We strongly recommend installing naampy inside a Python virtual environment (see venv documentation).
pip install naampy
Usage
usage: in_rolls_fn_gender [-h] -f FIRST_NAME [-s STATE] [-y YEAR] [-o OUTPUT] input

Appends Electoral roll columns for prop_female, n_female, n_male n_third_gender by first name

positional arguments:
  input                 Input file

optional arguments:
  -h, --help            show this help message and exit
  -f FIRST_NAME, --first-name FIRST_NAME
                        Name or index location of column contains the first name
  -s STATE, --state STATE
                        State name of Indian electoral rolls data (default=all)
  -y YEAR, --year YEAR  Birth year in Indian electoral rolls data (default=all)
  -o OUTPUT, --output OUTPUT
                        Output file with Indian electoral rolls data columns
Using naampy
>>> import pandas as pd
>>> from naampy import in_rolls_fn_gender
>>> names = [{'name': 'yoga'},
...          {'name': 'yasmin'},
...          {'name': 'siri'},
...          {'name': 'vivek'}]
>>> df = pd.DataFrame(names)
>>> in_rolls_fn_gender(df, 'name')
     name  n_male  n_female  n_third_gender  prop_female
0    yoga     202       150               0     0.426136
1  yasmin      24      2635               0     0.990974
2    siri     115       556               0     0.828614
3   vivek    2252        13               0     0.005740
>>> help(in_rolls_fn_gender)
Help on method in_rolls_fn_gender in module naampy.in_rolls_fn:

in_rolls_fn_gender(df, namecol, state=None, year=None) method of builtins.type instance
    Appends additional columns from Female ratio data to the input DataFrame
    based on the first name.

    Removes extra space. Checks if the name is the Indian electoral rolls data.
    If it is, outputs data from that row.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the first name
            column.
        namecol (str or int): Column's name or location of the name in
            DataFrame.
        state (str): The state name of Indian electoral rolls data to be used.
            (default is None for all states)
        year (int): The year of Indian electoral rolls to be used.
            (default is None for all years)

    Returns:
        DataFrame: Pandas DataFrame with additional columns:-
            'prop_female', 'n_female', 'n_male', 'n_third_gender' by first name
License
The package is released under the MIT License.