Parsernaam: Predict First and Last Name

Most common name parsers rely on crude pattern matching and on the position of each string in the name (e.g., treating the last word as the last name). This approach is limited and fragile, especially for Indian names. We take a machine-learning approach instead. Using large voter registration datasets from India and the US, we build machine-learning-based name parsers that predict whether a string is a first name or a last name.
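
To make the fragility concrete, here is a minimal sketch of the positional heuristic described above. `naive_parse` is a hypothetical helper written only for illustration; it is not part of parsernaam.

```python
def naive_parse(name: str) -> dict:
    """Positional heuristic: first token is the first name, last token is the last name."""
    tokens = name.split()
    if not tokens:
        return {"first": None, "last": None}
    if len(tokens) == 1:
        # A single token carries no positional signal, so the heuristic
        # simply assumes it is a first name.
        return {"first": tokens[0], "last": None}
    return {"first": tokens[0], "last": tokens[-1]}


for name in ["John Smith", "Kim Yeon", "Petersen"]:
    print(name, "->", naive_parse(name))
```

The heuristic treats "Kim Yeon" as first-then-last and "Petersen" as a first name, whereas the example output in the General API section labels them `last_first` and `last` respectively.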

For Indian electoral rolls, we assume the last name is the word in the name that is shared by multiple family members. (We defer support for compound last names, which are extremely rare in India, until the next iteration.)
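
The shared-word assumption can be sketched as follows. This is one possible reading of the rule, not the authors' labeling code: `shared_word` is a hypothetical helper, the household is made up, and lowercasing plus whitespace splitting are assumptions.

```python
from collections import Counter
from typing import List, Optional


def shared_word(household: List[str]) -> Optional[str]:
    """Return the word that appears in the most household members' names,
    provided at least two members share it; otherwise return None."""
    counts = Counter()
    for full_name in household:
        # Use a set so a word repeated within one name is counted once.
        counts.update(set(full_name.lower().split()))
    if not counts:
        return None
    word, n_members = counts.most_common(1)[0]
    return word if n_members >= 2 else None


# Made-up household, for illustration only.
household = ["Ramesh Sharma", "Sunita Sharma", "Amit Sharma"]
print(shared_word(household))
```

Under this sketch, a household with no shared word yields `None`, so its members would contribute no last-name label.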

Gradio App

parsernaam on Hugging Face

Installation

pip install parsernaam

General API

The general API is as follows:

# Import the library
import pandas as pd
from parsernaam.parsernaam import ParseNames

# ParseNames.parse takes one positional argument:
#   df    dataframe with the names to parse (in a column named 'name')

# Example
df = pd.DataFrame({'name': ['Jan', 'Nicholas Turner', 'Petersen', 'Nichols Richard', 'Piet',
                            'John Smith', 'Janssen', 'Kim Yeon']})
df = ParseNames.parse(df)
print(df.to_markdown())
|    | name            | parsed_name                                                                   |
|---:|:----------------|:------------------------------------------------------------------------------|
|  0 | Jan             | {'name': 'Jan', 'type': 'first', 'prob': 0.6769440174102783}                  |
|  1 | Nicholas Turner | {'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.9990382194519043} |
|  2 | Petersen        | {'name': 'Petersen', 'type': 'last', 'prob': 0.5342262387275696}              |
|  3 | Nichols Richard | {'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.9998832941055298} |
|  4 | Piet            | {'name': 'Piet', 'type': 'first', 'prob': 0.5381495952606201}                 |
|  5 | John Smith      | {'name': 'John Smith', 'type': 'first_last', 'prob': 0.9975730776786804}      |
|  6 | Janssen         | {'name': 'Janssen', 'type': 'first', 'prob': 0.5929554104804993}              |
|  7 | Kim Yeon        | {'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.9987115859985352}        |
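
Downstream code will often want the predicted type and probability as separate columns. The sketch below assumes that `parsed_name` holds Python dicts with the keys shown in the table above. It uses rows copied from that table rather than calling the model, and the 0.9 cutoff is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Rows copied from the example output; in practice this would be the
# dataframe returned by ParseNames.parse(df).
df = pd.DataFrame({
    "name": ["Jan", "Nicholas Turner", "Petersen"],
    "parsed_name": [
        {"name": "Jan", "type": "first", "prob": 0.6769440174102783},
        {"name": "Nicholas Turner", "type": "first_last", "prob": 0.9990382194519043},
        {"name": "Petersen", "type": "last", "prob": 0.5342262387275696},
    ],
})

# Expand the dict column into one column per key, then keep 'type' and 'prob'.
parsed = pd.json_normalize(df["parsed_name"].tolist())
flat = df[["name"]].join(parsed[["type", "prob"]])

# Keep only confident predictions (assumed threshold).
confident = flat[flat["prob"] >= 0.9]
print(confident)
```

Single-token names in the example output have probabilities close to 0.5 to 0.7, so a threshold like this mostly retains the two-token names.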

Data

The model is trained on names from the Florida Voter Registration Data from early 2022. The data are available on the Harvard Dataverse.

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributing

Contributions are welcome. Please open an issue if you find a bug or have a feature request.

License

The package is released under the MIT License.
