Instate: predict the state of residence from last name
instate: predict spoken language and the state of residence from last name
Using the Indian electoral rolls data (2017), we provide a Python package that takes the last name of a person and gives its distribution across states. This package can also predict the spoken language of the person based on the last name.
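To make "distribution across states" concrete, here is a minimal sketch of how P(state | last name) could be derived from raw records by counting and normalizing. The records and the `state_distribution` helper are hypothetical illustrations and are not part of the package; the package ships precomputed values from the electoral rolls.

```python
from collections import Counter, defaultdict

# Hypothetical (last name, state) records standing in for electoral roll rows.
records = [
    ("sharma", "Delhi"),
    ("sharma", "Delhi"),
    ("sharma", "Uttar Pradesh"),
    ("patel", "Gujarat"),
    ("patel", "Gujarat"),
    ("patel", "Maharashtra"),
    ("patel", "Gujarat"),
]


def state_distribution(rows):
    """Return {last_name: {state: P(state | last_name)}} from raw rows."""
    counts = defaultdict(Counter)
    for last_name, state in rows:
        # Normalize case and whitespace so "Patel " and "patel" are pooled.
        counts[last_name.strip().lower()][state] += 1

    dist = {}
    for last_name, state_counts in counts.items():
        total = sum(state_counts.values())
        dist[last_name] = {s: n / total for s, n in state_counts.items()}
    return dist


dist = state_distribution(records)
print(dist["patel"])  # {'Gujarat': 0.75, 'Maharashtra': 0.25}
```

Each inner dictionary sums to 1, so it can be read directly as a probability distribution over states for that last name.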
Potential Use Cases
India has 22 official languages. Serving such a linguistically diverse population is a challenge for businesses and surveyors. When a business has access only to a person's last name, and no other data with which to model that person's spoken language, the distribution of last names across states is the best available proxy.
Dataset
See lastname_langs_india.csv.tar.gz for the dataset used to predict or look up the spoken language from the last name.
See lastname_langs_india_top3.csv.tar.gz for the dataset used to predict the top-3 spoken languages from the last name. An LSTM model has been trained on this dataset to predict the top-3 spoken languages.
See notebooks for the notebooks used to prepare the above datasets and train the models.
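The datasets are distributed as gzipped tar archives containing a CSV. The sketch below shows one way to read such an archive with only the standard library. The helper name `read_csv_rows`, the demo file, and its column names (`last_name`, `language`) are assumptions for illustration; the real files may use different column names, so inspect the header before relying on any of them.

```python
import csv
import io
import os
import tarfile
import tempfile


def read_csv_rows(tar_path):
    """Return rows (as dicts) from the first .csv member of a .tar.gz archive."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return list(csv.DictReader(text))
    raise FileNotFoundError("no .csv member found in " + tar_path)


# Build a tiny demo archive so the example runs without the real dataset.
# The column names here are hypothetical.
payload = "last_name,language\nsharma,hindi\npatel,gujarati\n".encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.tar.gz")
    with tarfile.open(demo_path, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="demo.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    rows = read_csv_rows(demo_path)

print(rows[0])
```

To try this on the real data, pass the path of the downloaded archive to `read_csv_rows` instead of the demo path.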
Web UI
Streamlit app: https://appeler-instate-streamlitstreamlit-app-e39m4c.streamlit.app/
Installation
We strongly recommend installing instate inside a Python virtual environment (see the venv documentation).
pip install instate
Examples
import pandas as pd
from instate import last_state
# Load the input data
last_dat = pd.read_csv("last_dat.csv")
last_state_dat = last_state(last_dat, "dhingra")
print(last_state_dat)
API
instate provides four main functions for predicting state and language from Indian last names.
Electoral Rolls Lookup
- get_state_distribution - Get P(state|lastname) from 2017 electoral rolls data
import instate
# With list of names
names = ["sharma", "patel", "singh"]
result = instate.get_state_distribution(names)
print(result[["name", "Delhi", "Gujarat", "Punjab"]].head())
# With DataFrame
import pandas as pd
df = pd.DataFrame({"lastname": ["sharma", "patel"]})
result = instate.get_state_distribution(df, "lastname")
print(result.shape) # (2, 33) - 2 names + 31 state columns
- get_state_languages - Map states to their official languages
# Map states to languages
states = ["Delhi", "Punjab", "Gujarat"]
result = instate.get_state_languages(states)
print(result[["state", "official_languages"]])
# state official_languages
# 0 Delhi Hindi, English
# 1 Punjab Punjabi
# 2 Gujarat Gujarati
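One way the two lookups might be combined is to push a name's state probabilities through the state-to-language mapping to get rough language scores. The sketch below is an illustration only: the probabilities are made up, and splitting a state's probability equally among its official languages is a simplifying assumption, not something the package is documented to do.

```python
from collections import defaultdict

# Hypothetical P(state | last name) for one name; values are illustrative only.
state_probs = {"Delhi": 0.5, "Punjab": 0.3, "Gujarat": 0.2}

# State-to-language mapping, taken from the example output shown above.
state_languages = {
    "Delhi": ["Hindi", "English"],
    "Punjab": ["Punjabi"],
    "Gujarat": ["Gujarati"],
}


def language_scores(probs, mapping):
    """Split each state's probability equally among its official languages."""
    scores = defaultdict(float)
    for state, p in probs.items():
        langs = mapping.get(state, [])
        for lang in langs:
            scores[lang] += p / len(langs)
    return dict(scores)


scores = language_scores(state_probs, state_languages)
for lang, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {score:.2f}")
```

Official languages are only a proxy for what a person actually speaks, which is one reason the package also offers direct language prediction from the last name.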
Neural Network Predictions
- predict_state - Predict likely states using trained GRU model
# Predict top 3 most likely states
names = ["sharma", "patel", "singh"]
result = instate.predict_state(names, top_k=3)
print(result["predicted_states"].iloc[0])
# ['Delhi', 'Uttar Pradesh', 'Bihar']
- predict_language - Predict likely languages using LSTM or k-nearest neighbor
# LSTM neural network prediction (top 3)
result = instate.predict_language(names, model="lstm", top_k=3)
print(result["predicted_languages"].iloc[0])
# ['hindi', 'punjabi', 'urdu']
# K-nearest neighbor lookup (single best)
result = instate.predict_language(names, model="knn")
print(result["predicted_languages"].iloc[0])
# 'hindi'
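The `top_k` argument in the calls above asks for the k highest-ranked labels. As a rough illustration of what that selection involves, the sketch below picks the top k labels from a dictionary of scores. The scores and the `top_k_labels` helper are hypothetical; this is not how the package's models are implemented internally.

```python
import heapq


def top_k_labels(scores, k=3):
    """Return the k labels with the highest scores, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return [label for label, _ in best]


# Hypothetical model scores for a single last name.
scores = {
    "hindi": 0.46,
    "punjabi": 0.31,
    "urdu": 0.12,
    "bengali": 0.07,
    "tamil": 0.04,
}

top3 = top_k_labels(scores, k=3)
print(top3)  # ['hindi', 'punjabi', 'urdu']
```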
Complete Example
import pandas as pd
import instate
# Sample data
df = pd.DataFrame({
"person_id": [1, 2, 3],
"lastname": ["sharma", "patel", "singh"]
})
# Get state distributions from electoral rolls
state_dist = instate.get_state_distribution(df, "lastname")
print("Electoral rolls data shape:", state_dist.shape)
# Predict states with neural network
predicted_states = instate.predict_state(df, "lastname", top_k=3)
print("Top 3 predicted states:", predicted_states["predicted_states"].iloc[0])
# Predict languages
predicted_langs = instate.predict_language(df, "lastname", model="lstm", top_k=3)
print("Top 3 predicted languages:", predicted_langs["predicted_languages"].iloc[0])
# Map states to languages
states_df = pd.DataFrame({"state": ["Delhi", "Gujarat", "Punjab"]})
lang_map = instate.get_state_languages(states_df)
print("State language mapping:")
print(lang_map[["state", "official_languages"]])
Data
The underlying data for the package can be accessed at: https://doi.org/10.7910/DVN/ZXMVTJ
Evaluation
The model has a top-3 accuracy of 85.3% on unseen names. The KNN model does quite well; see the details here. The name-to-language lookup has an accuracy of 67.9%, and the name-to-language model prediction has an accuracy of 72.2%.
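For readers unfamiliar with the metric, top-3 accuracy counts a prediction as correct when the true label appears anywhere among the three highest-ranked guesses. A minimal sketch of the computation follows; the labels and predictions are invented for illustration and do not reproduce the reported figures.

```python
def top_k_accuracy(true_labels, ranked_predictions, k=3):
    """Fraction of examples whose true label is among the first k predictions."""
    if len(true_labels) != len(ranked_predictions):
        raise ValueError("true_labels and ranked_predictions differ in length")
    if not true_labels:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical evaluation data: one true state per name, predictions best first.
y_true = ["Delhi", "Gujarat", "Punjab", "Kerala"]
y_pred = [
    ["Delhi", "Uttar Pradesh", "Bihar"],
    ["Maharashtra", "Gujarat", "Rajasthan"],
    ["Haryana", "Delhi", "Himachal Pradesh"],
    ["Tamil Nadu", "Karnataka", "Kerala"],
]

top3_acc = top_k_accuracy(y_true, y_pred, k=3)
top1_acc = top_k_accuracy(y_true, y_pred, k=1)
print(top3_acc, top1_acc)  # 0.75 0.25
```

Top-k accuracy never decreases as k grows, so a top-3 figure should not be compared directly with a top-1 figure.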
Authors
Atul Dhingra, Gaurav Sood and Rajashekar Chintalapati
Contributor Code of Conduct
The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.
License
The package is released under the MIT License.
🔗 Adjacent Repositories
- appeler/ethnicolr2 — Ethnicolr implementation with new models in pytorch
- appeler/naampy — Infer Sociodemographic Characteristics from Names Using Indian Electoral Rolls
- appeler/ethnicolr — Predict Race and Ethnicity Based on the Sequence of Characters in a Name
- appeler/parsernaam — AI name parsing. Predict first or last name using a DL model.
- appeler/ethnicolor — Race and Ethnicity based on name using data from census, voter reg. files, etc.