Demographic prediction from name
NameTrace
NameTrace is a Python package that identifies "real" human names and predicts the gender and geographic origin of a name. For example, it can take users' names from social media platforms such as X, identify which ones are real names, and predict the likely gender and origin of those users. The package was built to help researchers.
See here for a comprehensive blogpost about the package.
Features
- Human Name Detection: Distinguish between human names and other text (usernames, company names, etc.)
- Gender Prediction: Predict gender from names using deep learning models
- Geographic Origin: Predict geographic subregion from names
- High Performance: Uses BiLSTM neural networks with rule-based fallbacks
- Easy to Use: Simple API with batch processing support
Installation
pip install nametrace
[!NOTE] nametrace requires PyTorch. On some platforms the latest versions of torch are not supported, and installing nametrace may fail with an error message. nametrace does not need the latest version of torch, so you can resolve this by first installing a torch version that is compatible with your system. For example, a macOS 12.7 system with an Intel chip can only run torch<=2.2.2. In that case, install PyTorch first and then nametrace:
pip install "torch==2.2.2"
pip install nametrace
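To see which torch version, if any, is already installed before deciding what to pin, a small standard-library check can help. This is only a minimal sketch: the helper name `installed_version` is illustrative and not part of nametrace.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what is currently installed in this environment.
for pkg in ("torch", "nametrace"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```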
Quick Start
from nametrace import NameTracer
# Initialize the predictor
nt = NameTracer()
# Predict for a single name
result = nt.predict("John Smith")
print(result)
# {
# 'is_human': True,
# 'gender': 'male',
# 'subregion': 'Northern Europe',
# 'confidence': {
# 'human': 1.0,
# 'gender': 0.9563450217247009,
# 'subregion': 0.40897873044013977
# }
# }
# Batch prediction over a list of names
names = ["Maria Garcia", "user123", "Ahmed Hassan"]
results = nt.predict(names, batch_size=12)
for name, result in zip(names, results):
    print(f"{name}: {result['is_human']}")
# Maria Garcia: True
# user123: False
# Ahmed Hassan: True
# Top-k prediction
result = nt.predict("John Smith", topk=3)
# {
# 'is_human': True,
# 'gender': [
# ('male', 0.9563450217247009),
# ('female', 0.04365495219826698)],
# 'subregion': [
# ('Northern Europe', 0.40897873044013977),
# ('North America', 0.32769879698753357),
# ('Australia and New Zealand', 0.16957755386829376)
# ],
# 'confidence': {
# 'human': 1.0,
# 'gender': 0.9563450217247009,
# 'subregion': 0.40897873044013977
# }
# }
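Because every prediction comes with confidence scores, downstream code may want to keep only labels above some cut-off. The sketch below works on a result dictionary copied from the single-name example, so it runs without loading the model. The helper `confident_label` and the 0.8 threshold are illustrative assumptions, not part of the package.

```python
def confident_label(result, field, threshold=0.8):
    """Return result[field] only if the name is human and confidence >= threshold."""
    if not result.get("is_human"):
        return None
    if result["confidence"].get(field, 0.0) < threshold:
        return None
    return result[field]


# Sample output copied from the single-name example (not a live model call).
sample = {
    "is_human": True,
    "gender": "male",
    "subregion": "Northern Europe",
    "confidence": {
        "human": 1.0,
        "gender": 0.9563450217247009,
        "subregion": 0.40897873044013977,
    },
}

print(confident_label(sample, "gender"))     # male
print(confident_label(sample, "subregion"))  # None
```

Here the gender label passes the cut-off while the subregion label, at roughly 0.41 confidence, is dropped. The right threshold depends on the analysis and should be chosen deliberately.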
API Reference
NameTracer
The main class for name prediction.
__init__(device=None)
Initialize the tracer.
Parameters:
- device (str, optional): Device for model inference ('cpu', 'cuda', or None for auto-detection)
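How NameTracer resolves `device=None` internally is not documented here. A common pattern for this kind of auto-detection is sketched below; `pick_device` is a hypothetical helper, and it falls back to 'cpu' when torch is not importable so the sketch runs anywhere.

```python
def pick_device(device=None):
    """Return `device` if given, else 'cuda' when available and 'cpu' otherwise."""
    if device is not None:
        return device
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: without torch, report 'cpu'.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```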
predict(names, batch_size=None, topk=1)
Predict whether each name is a human name and return its predicted demographics.
Parameters:
- names (str or list): Input name string or list of name strings
- topk (int, optional): Number of top predictions to return, defaults to 1
- batch_size (int, optional): Batch size for batch inference, defaults to None (i.e. single batch)
Returns:
If names is a single name:
- dict: Prediction results with keys:
  - is_human (bool): Whether the name is human
  - gender (str): Predicted gender ('male'/'female') or None
  - subregion (str): Predicted geographic subregion or None
  - confidence (dict): Confidence scores for each prediction
If names is a list of names:
- list: One dict of the above form per input name
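As the Quick Start output shows, with topk greater than 1 the gender and subregion fields hold ranked lists of (label, probability) pairs instead of a single string. Code that should work for either setting can normalize the field first. The helper `top_label` below is a hypothetical convenience function, and the list shape is assumed from the Quick Start example.

```python
def top_label(value):
    """Return the most likely label from a field that is a str, None, or a ranked list."""
    if value is None or isinstance(value, str):
        return value
    if not value:
        return None
    # Assumed shape for topk > 1: [(label, probability), ...]
    best = max(value, key=lambda pair: pair[1])
    return best[0]


ranked = [
    ("Northern Europe", 0.40897873044013977),
    ("North America", 0.32769879698753357),
    ("Australia and New Zealand", 0.16957755386829376),
]
print(top_label("male"))   # male
print(top_label(ranked))   # Northern Europe
print(top_label(None))     # None
```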
Training Details
NameTrace uses a two-stage approach:
- Human Detection: Rule-based lookup against known name databases, with BiLSTM fallback for unknown names
- Demographics Prediction: Character-level BiLSTM model for joint gender and geographic origin prediction
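The control flow of such a two-stage design can be sketched in plain Python. This is not the package's implementation: the name set, the character-share score standing in for the BiLSTM, the 0.9 threshold, and all function names are toy assumptions. The sketch also assumes that stage two only runs for names classified as human.

```python
KNOWN_FIRST_NAMES = {"maria", "ahmed", "john"}  # toy stand-in for a name database


def fallback_score(text):
    """Toy stand-in for a learned model: share of alphabetic or space characters."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


def is_human_name(text, threshold=0.9):
    """Stage 1: rule-based lookup first, model-style fallback for unknown names."""
    tokens = text.lower().split()
    if tokens and tokens[0] in KNOWN_FIRST_NAMES:
        return True, "lookup"
    return fallback_score(text) >= threshold, "fallback"


def predict_demographics(text):
    """Stage 2 placeholder: a real model would return gender and subregion here."""
    return {"gender": None, "subregion": None}


def trace(text):
    """Run stage 1, then stage 2 only for names judged to be human."""
    human, source = is_human_name(text)
    result = {"is_human": human, "source": source, "gender": None, "subregion": None}
    if human:
        result.update(predict_demographics(text))
    return result


for name in ["Maria Garcia", "user123", "Zoltan Kovacs"]:
    print(name, trace(name)["is_human"], trace(name)["source"])
```

In this toy version "Maria Garcia" is resolved by the lookup, while "user123" and "Zoltan Kovacs" reach the fallback, which rejects the first and accepts the second.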
Performance on test data
- Human Detection: Acc: 74.46; F1: 76.49
- Gender Prediction: Acc: 95.57; Macro F1: 93.01
- Geographic Origin: Acc: 66.55; Macro F1: 44.83
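Accuracy and macro F1 can diverge noticeably, as they do for geographic origin above. One common cause is class imbalance, since macro F1 weights every class equally regardless of its size. Whether that explains the gap here is not stated in this document. The toy example below, with made-up labels, only illustrates how the two metrics are computed and how they can differ.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up data: a majority class "A" and a minority class "B" that is never predicted.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

print(accuracy(y_true, y_pred))            # 0.8
print(round(macro_f1(y_true, y_pred), 2))  # 0.44
```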
Requirements
- Python 3.9+
- PyTorch 2.0+
- nameparser
Data and credits
This package was built to offer gender and geographic origin prediction through a simple, modern API. It benefited heavily from the previous work of other authors and packages:
- name2nat (Kyubyong Park, 2020). I built on the nationality data collected by Kyubyong Park and converted it to geographic regions. I also extended his dataset by collecting the gender of the names in the dataset from Wikipedia.
- gender_guesser (ByRatings, 2016). I take a list of first names from this package, which is originally taken from a C package by Jörg Michael (2008).
- Gender By Name (Rupinder Singh Rana, 2024), which also contains a list of first names that I use for training the human name detection.
Citation
If you use this package, please remember to cite it:
@misc{
bose-nametrace-2025,
url={https://github.com/parobo/nametrace},
journal={GitHub},
author={Bose, Paul},
year={2025},
month={Jun}
}
License
GNUv3 License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file nametrace-0.1.1.tar.gz.
File metadata
- Download URL: nametrace-0.1.1.tar.gz
- Upload date:
- Size: 4.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c563bdc6d05a054bad98ecce9b25c5eb71b028ba61a534e6b315beadc9adb50b |
| MD5 | 8c5028af9dc0505ccce22f4c5e8094d6 |
| BLAKE2b-256 | 5fc217c45daa3b4b421506b52b2664336d56365f01dc852755e7777b256f344c |
File details
Details for the file nametrace-0.1.1-py3-none-any.whl.
File metadata
- Download URL: nametrace-0.1.1-py3-none-any.whl
- Upload date:
- Size: 4.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 670f9d47ba3f02f2a6022ed770f0600c8442b10f469f61424e2b206d92e44ccf |
| MD5 | 0ab2eafa13e804ce4f78a5cbb5499621 |
| BLAKE2b-256 | dd58e28f43d3551da7c2324ebc8ef790814922ac37b046f7045bdf92ba5c24e9 |