Infer Caste from Indian Names
Using data on more than 140M Indians across 19 states from the Socio-Economic Caste Census (SECC; parsed data here), we estimate the proportion of people who are Scheduled Caste (SC), Scheduled Tribe (ST), or Other for a given last name, birth year, and state.
Why?
We provide this package so that people can assess, highlight, and fight unfairness.
How is the underlying data produced?
The script downloads the clean version of the SECC posted here and then applies the following steps:
- Remove names with non-alphabetical characters.
- Remove records with missing last names.
- Remove last names shorter than 2 characters.
- Remove rows with birth_date < 1900.
- Keep only last names shared by at least 1000 records.
- Group by last name, state, and year to produce the underlying data.
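The filtering and grouping steps above can be sketched in plain Python. This is only an illustration, not the package's actual script: the record keys (`last_name`, `state`, `birth_year`, `category`), the function name `clean_and_aggregate`, the lowercasing of names, and the order in which the filters run are all assumptions.

```python
from collections import Counter, defaultdict


def clean_and_aggregate(records, min_count=1000):
    """Filter raw records, then count categories per (last_name, state, year).

    `records` is an iterable of dicts with hypothetical keys:
    last_name, state, birth_year, category.
    """
    kept = []
    for rec in records:
        # A missing last name becomes an empty string and is dropped below.
        name = (rec.get("last_name") or "").strip().lower()
        # Drop names shorter than 2 characters or with non-alphabetical characters.
        if len(name) < 2 or not name.isalpha():
            continue
        # Drop rows with a birth year before 1900.
        if rec["birth_year"] < 1900:
            continue
        kept.append((name, rec["state"], rec["birth_year"], rec["category"]))

    # Count how many surviving records share each last name.
    name_counts = Counter(name for name, _, _, _ in kept)

    # Group by (last name, state, year), keeping only common last names.
    groups = defaultdict(Counter)
    for name, state, year, category in kept:
        if name_counts[name] >= min_count:
            groups[(name, state, year)][category] += 1
    return {key: dict(counts) for key, counts in groups.items()}
```

Each key of the returned dictionary is a (last name, state, year) triple, and each value holds the per-category record counts for that group.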
Base Classifier
We start by providing a base model for last_name that gives the Bayes optimal solution: the proportion of SC, ST, and Other among people with that last name. We also provide a series of base models for cases where the state of residence is known.
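As a rough illustration of what the base model returns, the proportions are simply the category counts for a last name divided by their total. The helper name `caste_proportions` is hypothetical and not part of outkast; the output keys mirror the `prop_sc`, `prop_st`, and `prop_other` columns the package reports.

```python
def caste_proportions(n_sc, n_st, n_other):
    """Return the share of SC, ST, and Other records for one last name."""
    total = n_sc + n_st + n_other
    if total == 0:
        raise ValueError("no records for this last name")
    return {
        "prop_sc": n_sc / total,
        "prop_st": n_st / total,
        "prop_other": n_other / total,
    }


# Counts for 'patel' as reported in the package's example output.
patel = caste_proportions(5681, 112302, 631393)
print({key: round(value, 6) for key, value in patel.items()})
```

With no other features available, these empirical shares are the best available estimate of the probability that a person with the given last name falls into each category.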
Installation
We strongly recommend installing outkast inside a Python virtual environment (see the venv documentation).
pip install outkast
Usage
usage: secc_caste [-h] -l LAST_NAME
                  [-s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}]
                  [-y YEAR] [-o OUTPUT]
                  input

Appends SECC 2011 data columns for sc, st, and other by last name

positional arguments:
  input                 Input file

optional arguments:
  -h, --help            show this help message and exit
  -l LAST_NAME, --last-name LAST_NAME
                        Name or index location of column contains the last name
  -s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}, --state {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}
                        State name of SECC data (default=all)
  -y YEAR, --year YEAR  Birth year in SECC data (default=all)
  -o OUTPUT, --output OUTPUT
                        Output file with SECC data columns
Using outkast
>>> import pandas as pd
>>> from outkast import secc_caste
>>>
>>> names = [{'name': 'patel'},
...          {'name': 'zala'},
...          {'name': 'lal'},
...          {'name': 'agarwal'}]
>>>
>>> df = pd.DataFrame(names)
>>>
>>> secc_caste(df, 'name')
      name    n_sc    n_st  n_other   prop_sc   prop_st  prop_other
0    patel    5681  112302   631393  0.007581  0.149861    0.842558
1     zala     667      14    34550  0.018932  0.000397    0.980670
2      lal  703595  241846  1314224  0.311371  0.107027    0.581601
3  agarwal      39      12     4375  0.008812  0.002711    0.988477
>>>
>>> help(secc_caste)
Help on method secc_caste in module outkast.secc_caste_ln:

secc_caste(df, namecol, state=None, year=None) method of builtins.type instance
    Appends additional columns from SECC data to the input DataFrame
    based on the last name.

    Removes extra space. Checks if the name is the SECC data.
    If it is, outputs data from that row.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the last name
            column.
        namecol (str or int): Column's name or location of the name in
            DataFrame.
        state (str): The state name of SECC data to be used.
            (default is None for all states)
        year (int): The year of SECC data to be used.
            (default is None for all years)

    Returns:
        DataFrame: Pandas DataFrame with additional columns:-
            'n_sc', 'n_st', 'n_other',
            'prop_sc', 'prop_st', 'prop_other' by last name
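The lookup behaviour described in the docstring (remove extra space, check whether the name is in the SECC data, and output that row if so) can be sketched without the package as follows. This is not outkast's implementation: `SECC_SAMPLE`, `normalize_name`, and `append_secc_columns` are hypothetical names, the table holds only two rows copied from the example output, and both the lowercasing step and the handling of unmatched names (left unchanged here) are assumptions.

```python
# Hypothetical two-row excerpt of the lookup table, copied from the example output.
SECC_SAMPLE = {
    "patel": {"n_sc": 5681, "n_st": 112302, "n_other": 631393,
              "prop_sc": 0.007581, "prop_st": 0.149861, "prop_other": 0.842558},
    "zala": {"n_sc": 667, "n_st": 14, "n_other": 34550,
             "prop_sc": 0.018932, "prop_st": 0.000397, "prop_other": 0.980670},
}


def normalize_name(raw):
    """Collapse extra whitespace and lowercase (lowercasing is an assumption)."""
    return " ".join(str(raw).split()).lower()


def append_secc_columns(rows, namecol, table=SECC_SAMPLE):
    """Return copies of `rows` with SECC columns added where the name matches."""
    result = []
    for row in rows:
        enriched = dict(row)  # copy so the caller's row is not modified
        match = table.get(normalize_name(row[namecol]))
        if match is not None:
            enriched.update(match)
        result.append(enriched)
    return result


rows = append_secc_columns([{"name": "  Patel "}, {"name": "unknownname"}], "name")
```

In this sketch the first row gains the six SECC columns, while the second row is returned unchanged because its name is not in the sample table.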
License
The package is released under the MIT License.