Name parser
Parsernaam: Predict First and Last Name
Most common name parsers rely on crude pattern matching and word order (for example, treating the last word as the last name) to parse names. This approach is limited and fragile, especially for Indian names. We take a machine-learning approach to the problem. Using large voter registration datasets from India and the US, we build machine-learning-based name parsers that predict whether a string is a first name or a last name.
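To make the fragility concrete, here is a minimal sketch of the kind of positional heuristic described above. The function `naive_parse` is hypothetical and written only for illustration; it is not part of parsernaam.

```python
def naive_parse(full_name: str) -> dict:
    """Positional heuristic: first token is the first name, last token is the last name."""
    tokens = full_name.split()
    if not tokens:
        return {"first": "", "last": ""}
    if len(tokens) == 1:
        # A single token is ambiguous; this heuristic simply calls it a first name.
        return {"first": tokens[0], "last": ""}
    return {"first": tokens[0], "last": tokens[-1]}


print(naive_parse("John Smith"))       # {'first': 'John', 'last': 'Smith'}
print(naive_parse("Nichols Richard"))  # {'first': 'Nichols', 'last': 'Richard'}
```

The rule works when the name happens to be written first-name-first, but it has no way to notice a name written last-name-first, and it cannot say whether a lone token such as "Petersen" is a first or a last name.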
For Indian electoral rolls, we assume the last name is the word in the name that is shared by multiple family members. (We defer support for compound last names, which are extremely rare in India, to the next iteration.)
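The shared-word assumption can be sketched roughly as follows. This is an illustration only, not the project's actual labeling code: the function `shared_family_word`, the idea that names arrive already grouped by household, and the example names are all assumptions made for the sketch.

```python
from collections import Counter


def shared_family_word(household_names):
    """Return the word that appears in the most names of one household,
    or None if no word is shared by at least two members."""
    counts = Counter()
    for name in household_names:
        # Count each word once per person, ignoring case.
        counts.update({word.lower() for word in name.split()})
    shared = [(n, word) for word, n in counts.items() if n >= 2]
    if not shared:
        return None
    # The highest count wins; ties are broken alphabetically for determinism.
    best_count = max(n for n, _ in shared)
    return min(word for n, word in shared if n == best_count)


# Hypothetical household, made up for this example.
household = ["Ramesh Kumar Sharma", "Sunita Sharma", "Sharma Anil"]
print(shared_family_word(household))  # sharma
```

Because the heuristic looks at which word recurs rather than where it sits, it picks out the family name regardless of word order.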
Gradio App.
Installation
pip install parsernaam
General API
The general API is as follows:
# Import the libraries
import pandas as pd
from parsernaam.parsernaam import ParseNames

# Positional arguments:
#   df    DataFrame with the names to parse (in a column named 'name')

# Example
df = pd.DataFrame({'name': ['Jan', 'Nicholas Turner', 'Petersen', 'Nichols Richard', 'Piet', 'John Smith', 'Janssen', 'Kim Yeon']})
df = ParseNames.parse(df)
print(df.to_markdown())
|    | name            | parsed_name                                                                   |
|---:|:----------------|:------------------------------------------------------------------------------|
|  0 | Jan             | {'name': 'Jan', 'type': 'first', 'prob': 0.6769440174102783}                  |
|  1 | Nicholas Turner | {'name': 'Nicholas Turner', 'type': 'first_last', 'prob': 0.9990382194519043} |
|  2 | Petersen        | {'name': 'Petersen', 'type': 'last', 'prob': 0.5342262387275696}              |
|  3 | Nichols Richard | {'name': 'Nichols Richard', 'type': 'last_first', 'prob': 0.9998832941055298} |
|  4 | Piet            | {'name': 'Piet', 'type': 'first', 'prob': 0.5381495952606201}                 |
|  5 | John Smith      | {'name': 'John Smith', 'type': 'first_last', 'prob': 0.9975730776786804}      |
|  6 | Janssen         | {'name': 'Janssen', 'type': 'first', 'prob': 0.5929554104804993}              |
|  7 | Kim Yeon        | {'name': 'Kim Yeon', 'type': 'last_first', 'prob': 0.9987115859985352}        |
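Each `parsed_name` entry in the output above is a dictionary with `name`, `type`, and `prob` keys. As a rough sketch of how one might turn the predicted `type` into separate first and last name fields, the helper below handles the four labels that appear in the example output. The function `split_parsed` is hypothetical and not part of parsernaam, and it assumes that two-part labels apply to names with exactly two words; anything else is left unresolved.

```python
def split_parsed(parsed: dict) -> dict:
    """Split a parsed_name dictionary into first and last name fields.

    Assumes the labels seen in the example output: 'first', 'last',
    'first_last', and 'last_first'. Other cases return None for both fields.
    """
    name, kind = parsed["name"], parsed["type"]
    tokens = name.split()
    if kind == "first":
        return {"first": name, "last": None}
    if kind == "last":
        return {"first": None, "last": name}
    if kind == "first_last" and len(tokens) == 2:
        return {"first": tokens[0], "last": tokens[1]}
    if kind == "last_first" and len(tokens) == 2:
        return {"first": tokens[1], "last": tokens[0]}
    # Unknown label or unexpected number of words: leave unresolved.
    return {"first": None, "last": None}


row = {"name": "Kim Yeon", "type": "last_first", "prob": 0.9987115859985352}
print(split_parsed(row))  # {'first': 'Yeon', 'last': 'Kim'}
```

If the `parsed_name` column holds dictionaries as printed above, something like `df['parsed_name'].apply(split_parsed)` should apply the helper to every row. Several single-word predictions in the example have probabilities near 0.5, so it may be worth checking `prob` before trusting the label.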
Data
The model is trained on names from the Florida Voter Registration Data from early 2022. The data are available on the Harvard Dataverse.
Contributing
Contributions are welcome. Please open an issue if you find a bug or have a feature request.
License
The package is released under the MIT License.