
Infer Caste from Indian Names


Using data on more than 140M Indians across 19 states from the Socio-Economic Caste Census (parsed data here), we estimate the proportion of people who are scheduled caste (SC), scheduled tribe (ST), and other for a given last name, birth year, and state.

Why?

We provide this package so that people can assess, highlight, and fight unfairness.

How is the underlying data produced?

  1. Download the clean version of the SECC posted here.

  2. Produce the base data frame and infer last names:

  • remove names with non-alphabetical characters

  • remove records with missing last names

  • remove last names shorter than 2 characters

  • remove rows with birth_date < 1900

  • keep only last names shared by at least 1,000 households (hh)

  3. Group by last name, state, and year to produce the underlying data.
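The steps above can be sketched in plain Python on toy records. This is a minimal illustration, not the script used to build the package data: the field names (`hh_id`, `last_name`, `state`, `birth_year`, `caste`), the helper names, and the lowered household threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; these field names are NOT the real SECC schema.
records = [
    {"hh_id": 1, "last_name": "patel", "state": "gujarat", "birth_year": 1980, "caste": "other"},
    {"hh_id": 1, "last_name": "patel", "state": "gujarat", "birth_year": 1980, "caste": "other"},
    {"hh_id": 2, "last_name": "patel", "state": "gujarat", "birth_year": 1980, "caste": "st"},
    {"hh_id": 3, "last_name": "lal", "state": "bihar", "birth_year": 1975, "caste": "sc"},
    {"hh_id": 4, "last_name": "pa7el", "state": "gujarat", "birth_year": 1980, "caste": "other"},
    {"hh_id": 5, "last_name": "", "state": "gujarat", "birth_year": 1980, "caste": "other"},
    {"hh_id": 6, "last_name": "k", "state": "kerala", "birth_year": 1990, "caste": "sc"},
    {"hh_id": 7, "last_name": "patel", "state": "gujarat", "birth_year": 1890, "caste": "other"},
]

# The README uses 1000 households; lowered here so the toy data survives.
MIN_HOUSEHOLDS = 2


def is_valid(rec):
    """Apply the row-level filters listed in step 2."""
    name = rec["last_name"]
    return (
        bool(name)                     # drop missing last names
        and name.isalpha()             # drop names with non-alphabetical characters
        and len(name) >= 2             # drop last names shorter than 2 characters
        and rec["birth_year"] >= 1900  # drop births before 1900
    )


def build_counts(records, min_households):
    """Return {(last_name, state, birth_year): Counter of caste categories}."""
    clean = [r for r in records if is_valid(r)]

    # Count distinct households per last name and keep the common names.
    households = defaultdict(set)
    for r in clean:
        households[r["last_name"]].add(r["hh_id"])
    common = {name for name, hh in households.items() if len(hh) >= min_households}

    # Step 3: group by last name, state, and year.
    grouped = defaultdict(Counter)
    for r in clean:
        if r["last_name"] in common:
            grouped[(r["last_name"], r["state"], r["birth_year"])][r["caste"]] += 1
    return dict(grouped)


counts = build_counts(records, MIN_HOUSEHOLDS)
print(counts)
```

On the toy data only the `patel` group survives: `lal` appears in a single household, and the other rows fail one of the row-level filters.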

Base Classifier

We start with a base model for last_name that gives the Bayes optimal solution: the proportion of SC, ST, and Other among people with that last name. We also provide a series of base models for cases where the state of residence is known.
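As a rough illustration of what the base model computes, the sketch below turns the three counts for a last name into proportions and picks the most likely category. It assumes that "Bayes optimal" refers to predicting from the empirical proportions (under 0-1 loss, the largest one); the function names are hypothetical, and the counts are the `patel` row from the example output in this README.

```python
def caste_proportions(n_sc, n_st, n_other):
    """Empirical share of each category among people with a given last name."""
    total = n_sc + n_st + n_other
    if total == 0:
        raise ValueError("no records for this last name")
    return {
        "sc": n_sc / total,
        "st": n_st / total,
        "other": n_other / total,
    }


def most_likely_category(props):
    """Category with the largest proportion (the 0-1 loss prediction)."""
    return max(props, key=props.get)


# Counts for 'patel' taken from the example output in this README.
props = caste_proportions(5681, 112302, 631393)
label = most_likely_category(props)
print(props, label)
```

The proportions are usually more informative than the single label, since they carry the uncertainty that the label hides.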

Installation

We strongly recommend installing outkast inside a Python virtual environment (see venv documentation).

pip install outkast

Usage

usage: secc_caste [-h] -l LAST_NAME
                [-s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}]
                [-y YEAR] [-o OUTPUT]
                input

Appends SECC 2011 data columns for sc, st, and other by last name

positional arguments:
input                 Input file

optional arguments:
-h, --help            show this help message and exit
-l LAST_NAME, --last-name LAST_NAME
                        Name or index location of column contains the last
                        name
-s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}, --state {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}
                        State name of SECC data (default=all)
-y YEAR, --year YEAR  Birth year in SECC data (default=all)
-o OUTPUT, --output OUTPUT
                        Output file with SECC data columns

Using outkast

>>> import pandas as pd
>>> from outkast import secc_caste
>>>
>>> names = [{'name': 'patel'},
...             {'name': 'zala'},
...             {'name': 'lal'},
...             {'name': 'agarwal'}]
>>>
>>> df = pd.DataFrame(names)
>>>
>>> secc_caste(df, 'name')
    name    n_sc    n_st  n_other   prop_sc   prop_st  prop_other
0    patel    5681  112302   631393  0.007581  0.149861    0.842558
1     zala     667      14    34550  0.018932  0.000397    0.980670
2      lal  703595  241846  1314224  0.311371  0.107027    0.581601
3  agarwal      39      12     4375  0.008812  0.002711    0.988477


>>>
>>> help(secc_caste)
Help on method secc_caste in module outkast.secc_caste_ln:

secc_caste(df, namecol, state=None, year=None) method of builtins.type instance
    Appends additional columns from SECC data to the input DataFrame
    based on the last name.

    Removes extra space. Checks if the name is the SECC data.
    If it is, outputs data from that row.

    Args:
        df (:obj:`DataFrame`): Pandas DataFrame containing the last name
            column.
        namecol (str or int): Column's name or location of the name in
            DataFrame.
        state (str): The state name of SECC data to be used.
            (default is None for all states)
        year (int): The year of SECC data to be used.
            (default is None for all years)

    Returns:
        DataFrame: Pandas DataFrame with additional columns:-
            'n_sc', 'n_st', 'n_other',
            'prop_sc', 'prop_st', 'prop_other' by last name
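The behavior described in the help text (remove extra space, look the name up, append the matching row) can be illustrated without the package. The sketch below is not outkast's implementation: the lookup table holds two rows copied from the example output above, and `normalize`, `append_secc`, and the choice to fill unmatched names with `None` are assumptions made for the example.

```python
# Toy lookup: last name -> (n_sc, n_st, n_other), copied from the example output.
SECC_TOY = {
    "patel": (5681, 112302, 631393),
    "zala": (667, 14, 34550),
}

COLUMNS = ("n_sc", "n_st", "n_other", "prop_sc", "prop_st", "prop_other")


def normalize(name):
    """Strip leading/trailing space and collapse runs of internal whitespace."""
    return " ".join(name.split())


def append_secc(rows, namecol, lookup):
    """Return copies of rows with the six SECC columns appended."""
    out = []
    for row in rows:
        new_row = dict(row)
        counts = lookup.get(normalize(row[namecol]))
        if counts is None:
            # Assumption: names missing from the lookup get empty values.
            new_row.update(dict.fromkeys(COLUMNS))
        else:
            total = sum(counts)
            props = tuple(c / total for c in counts)
            new_row.update(zip(COLUMNS, counts + props))
        out.append(new_row)
    return out


rows = [{"name": "patel"}, {"name": "  zala "}, {"name": "notaname"}]
result = append_secc(rows, "name", SECC_TOY)
for r in result:
    print(r)
```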

Authors

Suriyan Laohaprapanon and Gaurav Sood

License

The package is released under the MIT License.

