ML-assisted name parser for Indian and international names
Parsernaam: ML-Assisted Name Parser
Most name parsers rely on crude pattern matching and word order (for example, treating the last word as the last name). This approach is fragile, especially for Indian names. We take a machine-learning approach instead: using large voter registration datasets from India and the US, we build name parsers that predict whether a string is a first or last name.
For Indian electoral rolls, we assume that the last name is the word in the name that is shared by multiple family members. (We defer support for compound last names, which are extremely rare in India, to the next iteration.)
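The shared-word assumption can be sketched in a few lines. This is an illustration only, not the package's training pipeline: the helper name `shared_last_name`, the whitespace tokenization, the lowercasing, and the sample household are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional


def shared_last_name(household: List[str]) -> Optional[str]:
    """Return the word that appears in the most household members' names.

    Hypothetical helper: returns None when no word is shared by at least
    two members. Ties between equally common words are broken arbitrarily.
    """
    counts: Counter = Counter()
    for name in household:
        # Count each word once per person so that a word repeated inside
        # one name does not look like a shared family name.
        counts.update(set(name.lower().split()))
    if not counts:
        return None
    word, n_members = counts.most_common(1)[0]
    return word if n_members >= 2 else None


# Made-up household for illustration
household = ["Ramesh Kumar Sharma", "Sunita Sharma", "Sharma Anil"]
print(shared_last_name(household))  # prints: sharma
```

Note that the shared word is found regardless of its position in the name, which is the point of the heuristic: it does not depend on word order.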
Gradio App.
Installation
pip install parsernaam
Usage
Python API
import pandas as pd
from parsernaam.parse import ParseNames
# Create DataFrame with names to parse
df = pd.DataFrame({'name': ['Jan', 'Nicholas Turner', 'Petersen', 'Nichols Richard', 'Piet',
'John Smith', 'Janssen', 'Kim Yeon']})
# Parse names using ML models
results = ParseNames.parse(df)
print(results.to_markdown())
Output:
| | name | parsed_name |
|---:|:----------------|:------------------------------------------------------------------------------|
| 0 | Jan | {'name': 'Jan', 'type': 'first', 'prob': 0.677} |
| 1 | Nicholas Turner | {'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999} |
| 2 | Petersen | {'name': 'Petersen', 'type': 'last', 'prob': 0.534} |
| 3 | Nichols Richard | {'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.999} |
| 4 | Piet | {'name': 'Piet', 'type': 'first', 'prob': 0.538} |
| 5 | John Smith | {'name': 'John Smith', 'type': 'first_last', 'prob': 0.997} |
| 6 | Janssen | {'name': 'Janssen', 'type': 'first', 'prob': 0.593} |
| 7 | Kim Yeon | {'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999} |
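Each `parsed_name` value in the output above carries a `type` label, which can be used to split a name into first and last parts. The following is a minimal sketch, assuming the four labels shown in the table (`first`, `last`, `first_last`, `last_first`) are the ones of interest. The helper `split_name` is hypothetical and not part of parsernaam, and the handling of names with more than two words (a single boundary word taken as the last name) is an assumption.

```python
from typing import Dict, Optional, Tuple


def split_name(parsed: Dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (first, last) for one parsed_name dict (hypothetical helper)."""
    name = parsed["name"]
    words = name.split()
    kind = parsed["type"]
    if kind == "first":
        return name, None
    if kind == "last":
        return None, name
    if kind == "first_last" and len(words) >= 2:
        # Assumption: the final word is the last name.
        return " ".join(words[:-1]), words[-1]
    if kind == "last_first" and len(words) >= 2:
        # Assumption: the leading word is the last name.
        return " ".join(words[1:]), words[0]
    # Unknown label or too few words: leave both parts unset.
    return None, None


# Rows copied from the output table above
rows = [
    {"name": "Nicholas Turner", "type": "first_last", "prob": 0.999},
    {"name": "Kim Yeon", "type": "last_first", "prob": 0.999},
    {"name": "Petersen", "type": "last", "prob": 0.534},
]
for row in rows:
    print(row["name"], "->", split_name(row))
```

If the `parsed_name` column holds dicts as displayed, something like `results["parsed_name"].apply(split_name)` should apply the same logic to the whole DataFrame.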
Command Line Interface
parse_names input.csv -o output.csv -n name_column
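A quick way to try the command is to generate a small input file first. This sketch assumes, as the command above suggests, that `-n` names the CSV column holding the names and `-o` sets the output path. The file name `input.csv` and the column name `name_column` simply mirror the example command.

```python
import csv

# Sample names taken from the Python API example above
names = ["Nicholas Turner", "Kim Yeon", "John Smith"]

# Write a one-column CSV whose header matches the -n argument
with open("input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name_column"])
    writer.writerows([name] for name in names)
```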
Features
- Machine Learning Based: Uses LSTM neural networks trained on voter registration data
- Multi-language Support: Handles Indian, Western, and other international name patterns
- High Accuracy: Each prediction comes with a confidence score (the `prob` field in the output)
- Performance Optimized: Model caching and batch processing support
- Robust Error Handling: Handles edge cases like empty names, special characters, etc.
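The confidence scores make it possible to route uncertain predictions to manual review. Below is a minimal sketch of that idea. The helper `split_by_confidence` is hypothetical and the 0.9 threshold is an arbitrary assumption, not a value recommended by the package.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    parsed_rows: List[Dict], threshold: float = 0.9
) -> Tuple[List[Dict], List[Dict]]:
    """Partition parsed_name dicts into (confident, uncertain) lists."""
    confident = [r for r in parsed_rows if r["prob"] >= threshold]
    uncertain = [r for r in parsed_rows if r["prob"] < threshold]
    return confident, uncertain


# Rows copied from the example output in the Usage section
rows = [
    {"name": "John Smith", "type": "first_last", "prob": 0.997},
    {"name": "Janssen", "type": "first", "prob": 0.593},
    {"name": "Petersen", "type": "last", "prob": 0.534},
]
confident, uncertain = split_by_confidence(rows)
print([r["name"] for r in confident])  # prints: ['John Smith']
print([r["name"] for r in uncertain])  # prints: ['Janssen', 'Petersen']
```

In the example output, single-word names receive noticeably lower scores than two-word names, so a threshold like this mostly flags single-word inputs.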
Data
The model is trained on names from the Florida Voter Registration Data from early 2022. The data are available on the Harvard Dataverse.
Authors
Rajashekar Chintalapati and Gaurav Sood
Contributing
Contributions are welcome. Please open an issue if you find a bug or have a feature request.
🔗 Adjacent Repositories
- appeler/naamkaran — generative model for names
- appeler/ethnicolr2 — Ethnicolr implementation with new models in pytorch
- appeler/namesexdata — Data on international first names and sex of people with that name
- appeler/pranaam — pranaam: predict religion based on name
- appeler/graphic_names — Infer the gender of a person with a particular first name using Google image search and Clarifai
License
The package is released under the MIT License.