ML-assisted name parser for Indian and international names

Parsernaam: ML-Assisted Name Parser

Most name parsers rely on crude pattern matching and token order (for example, treating the last word as the last name). This approach is limited and fragile, especially for Indian names. We take a machine-learning approach instead. Using large voter registration datasets from India and the US, we build machine-learning-based name parsers that predict whether a string is a first name, a last name, or a combination of the two.
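To make the contrast concrete, here is a minimal sketch of the positional rule described above. The function naive_parse is hypothetical and is not part of parsernaam; it only illustrates the baseline that the ML models are meant to replace.

```python
def naive_parse(full_name: str) -> dict:
    """Positional baseline: first token is the first name, last token is the last name."""
    tokens = full_name.split()
    if not tokens:
        return {"first": None, "last": None}
    if len(tokens) == 1:
        # A single token is ambiguous; the rule cannot tell which kind of name it is,
        # so this sketch arbitrarily files it under "first".
        return {"first": tokens[0], "last": None}
    return {"first": tokens[0], "last": tokens[-1]}


print(naive_parse("Nicholas Turner"))  # {'first': 'Nicholas', 'last': 'Turner'}
print(naive_parse("Kim Yeon"))         # {'first': 'Kim', 'last': 'Yeon'}
```

The rule always assigns the first token to the first name, so it has no way to recognize a name written last-name-first, and it has nothing principled to say about single-token names.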

For Indian electoral rolls, we assume the last name is the word in the name that is shared by multiple family members. (We defer support for compound last names, which are extremely rare in India, to the next iteration.)
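The labeling assumption above can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the function shared_last_name, the lowercasing step, the "at least two members" cutoff, and the example household are all assumptions made for the sketch.

```python
from collections import Counter
from typing import List, Optional


def shared_last_name(household: List[str]) -> Optional[str]:
    """Return the word that occurs in the most members' names, if at least two share it."""
    counts = Counter()
    for name in household:
        # Count each word once per person so a word repeated inside
        # one name does not inflate the household count.
        counts.update(set(name.lower().split()))
    if not counts:
        return None
    word, n = counts.most_common(1)[0]
    return word if n >= 2 else None


# Made-up household for illustration.
household = ["Ramesh Kumar Sharma", "Sunita Sharma", "Anil Sharma"]
print(shared_last_name(household))  # sharma
```

A single-word match like this is also why compound last names are out of scope for the heuristic: it can only return one shared token.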

Gradio App

A Gradio demo is available: parsernaam on HF (Hugging Face).

Installation

pip install parsernaam

Usage

Python API

import pandas as pd
from parsernaam.parse import ParseNames

# Create DataFrame with names to parse
df = pd.DataFrame({'name': ['Jan', 'Nicholas Turner', 'Petersen', 'Nichols Richard', 'Piet',
                           'John Smith', 'Janssen', 'Kim Yeon']})

# Parse names using ML models
results = ParseNames.parse(df)
# DataFrame.to_markdown() needs the optional 'tabulate' package
print(results.to_markdown())

Output:

|    | name            | parsed_name                                                                   |
|---:|:----------------|:------------------------------------------------------------------------------|
|  0 | Jan             | {'name': 'Jan', 'type': 'first', 'prob': 0.677}                            |
|  1 | Nicholas Turner | {'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.999}           |
|  2 | Petersen        | {'name': 'Petersen', 'type': 'last', 'prob': 0.534}                        |
|  3 | Nichols Richard | {'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.999}           |
|  4 | Piet            | {'name': 'Piet', 'type': 'first', 'prob': 0.538}                           |
|  5 | John Smith      | {'name': 'John Smith', 'type': 'first_last', 'prob': 0.997}                |
|  6 | Janssen         | {'name': 'Janssen', 'type': 'first', 'prob': 0.593}                        |
|  7 | Kim Yeon        | {'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.999}                  |
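Judging by the table above, each parsed_name entry looks like a dict with name, type, and prob keys. Under that assumption, the sketch below expands the dicts into separate columns and keeps only confident predictions. The rows are copied by hand from the table so the example runs without the package, and the 0.9 cutoff is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Rows copied by hand from the output table above.
results = pd.DataFrame({
    "name": ["Jan", "Nicholas Turner", "Kim Yeon"],
    "parsed_name": [
        {"name": "Jan", "type": "first", "prob": 0.677},
        {"name": "Nicholas Turner", "type": "first_last", "prob": 0.999},
        {"name": "Kim Yeon", "type": "last_first", "prob": 0.999},
    ],
})

# Expand each dict into its own columns (assumes every entry is a dict with these keys).
expanded = pd.json_normalize(results["parsed_name"].tolist())
flat = pd.concat([results[["name"]], expanded[["type", "prob"]]], axis=1)

# Keep only predictions at or above a hypothetical confidence cutoff.
THRESHOLD = 0.9
confident = flat[flat["prob"] >= THRESHOLD]
print(confident)
```

Single-token names in the sample output have noticeably lower probabilities than two-token names, so a cutoff like this mostly filters out the ambiguous single-word cases.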

Command Line Interface

parse_names input.csv -o output.csv -n name_column
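The command reads a CSV file and takes the name of the column that holds the names. A minimal sketch for preparing such a file with the standard library follows; the file and column names simply mirror the placeholders in the command above, and only the flags shown there (-o and -n) are used.

```python
import csv
import tempfile
from pathlib import Path

names = ["Nicholas Turner", "Kim Yeon", "John Smith"]

# Write the input file in a temporary directory; the file and column
# names mirror the placeholders used in the command above.
workdir = Path(tempfile.mkdtemp())
input_path = workdir / "input.csv"
with input_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name_column"])
    writer.writerows([n] for n in names)

# Argument list for the documented command. Once parsernaam is installed,
# it can be passed to subprocess.run(cmd, check=True).
cmd = ["parse_names", str(input_path), "-o", str(workdir / "output.csv"), "-n", "name_column"]
print(" ".join(cmd))
```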

Features

  • Machine Learning Based: Uses LSTM neural networks trained on voter registration data
  • Multi-language Support: Handles Indian, Western, and other international name patterns
  • High Accuracy: Each prediction comes with a confidence score (the 'prob' field in the output)
  • Performance Optimized: Model caching and batch processing support
  • Robust Error Handling: Handles edge cases like empty names, special characters, etc.
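The package states that it handles edge cases such as empty names itself. If you prefer to normalize input before it reaches the parser, a small pre-cleaning step might look like the sketch below. The helper clean_names is hypothetical and not part of parsernaam.

```python
import re
from typing import Iterable, List


def clean_names(raw: Iterable[object]) -> List[str]:
    """Drop non-string and blank entries and collapse runs of whitespace."""
    cleaned = []
    for value in raw:
        if not isinstance(value, str):
            continue  # skips None, NaN, and numbers
        name = re.sub(r"\s+", " ", value).strip()
        if name:
            cleaned.append(name)
    return cleaned


print(clean_names(["  John   Smith ", "", None, "Kim Yeon"]))  # ['John Smith', 'Kim Yeon']
```

The cleaned list can then be placed in the 'name' column of the DataFrame passed to the parser, as in the Python API example.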

Data

The model is trained on names from the early-2022 Florida Voter Registration Data. The data are available on the Harvard Dataverse.

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributing

Contributions are welcome. Please open an issue if you find a bug or have a feature request.

License

The package is released under the MIT License.

