Demographic prediction from name

NameTrace

NameTrace is a Python package that identifies "real" human names and predicts the gender and geographic origin associated with a name. A typical use is to take users' display names from social media platforms such as X, separate real names from other text, and predict the likely gender and origin of those users. The package was built to help researchers.

Features

  • Human Name Detection: Distinguish between human names and other text (usernames, company names, etc.)
  • Gender Prediction: Predict gender from names using deep learning models
  • Geographic Origin: Predict geographic subregion from names
  • High Performance: Uses BiLSTM neural networks with rule-based fallbacks
  • Easy to Use: Simple API with batch processing support

Installation

pip install nametrace

Quick Start

from nametrace import NameTracer

# Initialize the predictor
nt = NameTracer()

# Predict for a single name
result = nt.predict("John Smith")
print(result)
# {
#   'is_human': True,
#   'gender': 'male',
#   'subregion': 'Northern Europe',
#   'confidence': {
#     'human': 1.0,
#     'gender': 0.9563450217247009,
#     'subregion': 0.40897873044013977
#   }
# }

# Batch prediction
names = ["Maria Garcia", "user123", "Ahmed Hassan"]
results = nt.predict(names, batch_size=12)
for name, result in zip(names, results):
    print(f"{name}: {result['is_human']}")

# Maria Garcia: True
# user123: False
# Ahmed Hassan: True


# Top-k prediction
result = nt.predict("John Smith", topk=3)
# {
#   'is_human': True,
#   'gender': [
#     ('male', 0.9563450217247009),
#     ('female', 0.04365495219826698)
#   ],
#   'subregion': [
#     ('Northern Europe', 0.40897873044013977),
#     ('North America', 0.32769879698753357),
#     ('Australia and New Zealand', 0.16957755386829376)
#   ],
#   'confidence': {
#     'human': 1.0,
#     'gender': 0.9563450217247009,
#     'subregion': 0.40897873044013977
#   }
# }
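
Since every result carries per-task confidence scores, a downstream analysis may want to keep only predictions above some threshold. The following is a minimal sketch that operates on plain dicts in the output shape shown above. `filter_confident` and its default threshold are hypothetical and not part of NameTrace, and the values in the second sample entry are made up for illustration.

```python
from typing import Any


def filter_confident(
    results: list[dict[str, Any]],
    min_gender: float = 0.9,
) -> list[dict[str, Any]]:
    """Keep results flagged as human whose gender confidence meets the threshold."""
    kept = []
    for res in results:
        # Skip anything not detected as a human name before reading its scores.
        if not res["is_human"]:
            continue
        if res["confidence"]["gender"] >= min_gender:
            kept.append(res)
    return kept


# Sample dicts in the documented output shape.
# The second entry (a non-human input) uses invented values.
sample = [
    {
        "is_human": True,
        "gender": "male",
        "subregion": "Northern Europe",
        "confidence": {
            "human": 1.0,
            "gender": 0.9563450217247009,
            "subregion": 0.40897873044013977,
        },
    },
    {
        "is_human": False,
        "gender": None,
        "subregion": None,
        "confidence": {"human": 0.1, "gender": 0.0, "subregion": 0.0},
    },
]

confident = filter_confident(sample)
print(len(confident))
```

Which threshold is appropriate depends on the study; the subregion score could be filtered in the same way.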

API Reference

NameTracer

The main class for name prediction.

__init__(device=None)

Initialize the tracer.

Parameters:

  • device (str, optional): Device for model inference ('cpu', 'cuda', or None for auto-detection)

predict(names, batch_size=None, topk=1)

Predict whether one or more names are human names and return the predicted demographics.

Parameters:

  • names (str or list): A single name string or a list of name strings
  • batch_size (int, optional): Batch size for batch inference; defaults to None (i.e. a single batch)
  • topk (int, optional): Number of top predictions to return; defaults to 1
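
To make the effect of batch_size concrete, the sketch below splits a list of names into consecutive chunks. How NameTrace batches its inputs internally is not documented here; `iter_batches` is a hypothetical helper that only illustrates the assumed semantics, with None meaning one batch containing all names.

```python
from typing import Iterator, Optional


def iter_batches(
    names: list[str],
    batch_size: Optional[int] = None,
) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most batch_size names (None means one batch)."""
    if batch_size is None:
        yield list(names)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


names = ["Maria Garcia", "user123", "Ahmed Hassan"]
batches = list(iter_batches(names, batch_size=2))
print(batches)
```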

Returns: If names is a single name:

  • dict: Prediction results with keys:
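    • is_human (bool): Whether the name is a human name
    • gender (str): Predicted gender ('male'/'female') or None; when topk is greater than 1, a list of (label, score) tuples
    • subregion (str): Predicted geographic subregion or None; when topk is greater than 1, a list of (label, score) tuples
    • confidence (dict): Confidence scores for each prediction, under the keys 'human', 'gender', and 'subregion'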

If names is a list of names: a list of such dicts, one per input name.
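
In the Quick Start output, gender and subregion are a plain string with topk=1 and a list of (label, score) tuples with topk=3. Code that consumes both forms may find it convenient to normalize them first. The helper below is a sketch under that assumption; `as_ranked` is a hypothetical name and not part of the NameTrace API.

```python
from typing import Optional, Union

Ranked = list[tuple[str, float]]


def as_ranked(value: Union[Optional[str], Ranked], confidence: float) -> Ranked:
    """Return a list of (label, score) pairs regardless of the topk setting."""
    if value is None:
        # No prediction available, e.g. for a non-human input.
        return []
    if isinstance(value, str):
        # topk=1 form: pair the single label with its confidence score.
        return [(value, confidence)]
    # topk>1 form: already a ranked list of (label, score) tuples.
    return [(label, float(score)) for label, score in value]


single = as_ranked("male", 0.9563450217247009)
ranked = as_ranked(
    [("male", 0.9563450217247009), ("female", 0.04365495219826698)],
    0.9563450217247009,
)
print(single[0][0], ranked[1][0])
```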

Training Details

NameTrace uses a two-stage approach:

  1. Human Detection: Rule-based lookup against known name databases, with BiLSTM fallback for unknown names
  2. Demographics Prediction: Character-level BiLSTM model for joint gender and geographic origin prediction
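
The control flow of the two stages can be sketched as follows. This is a toy illustration, not NameTrace's implementation: the name set, the digit-based fallback score, the threshold, and the stub demographics function are all hypothetical stand-ins for the real name databases and BiLSTM models. Returning None for gender and subregion on non-human input is also an assumption.

```python
from typing import Callable, Optional

# Toy stand-in for a database of known first names.
KNOWN_FIRST_NAMES = {"john", "maria", "ahmed"}


def toy_fallback_score(text: str) -> float:
    """Stand-in for a learned classifier: penalize digits and underscores."""
    if not text:
        return 0.0
    bad = sum(ch.isdigit() or ch == "_" for ch in text)
    return 1.0 - bad / len(text)


def is_human_name(
    text: str,
    fallback: Callable[[str], float] = toy_fallback_score,
    threshold: float = 0.9,
) -> bool:
    """Stage 1: lookup first, fallback model only for unknown names."""
    tokens = text.lower().split()
    if tokens and tokens[0] in KNOWN_FIRST_NAMES:
        return True
    return fallback(text) >= threshold


def stub_demographics(text: str) -> tuple[str, str]:
    """Placeholder for the stage-2 model; returns fixed dummy labels."""
    return ("unknown", "unknown")


def predict_two_stage(
    text: str,
    demographics: Callable[[str], tuple[str, str]] = stub_demographics,
) -> dict[str, Optional[object]]:
    """Run stage 2 only for inputs that pass stage 1."""
    if not is_human_name(text):
        return {"is_human": False, "gender": None, "subregion": None}
    gender, subregion = demographics(text)
    return {"is_human": True, "gender": gender, "subregion": subregion}


print(predict_two_stage("user123")["is_human"])
print(predict_two_stage("John Smith")["is_human"])
```

The point of the ordering is that a cheap lookup answers the common cases, and the model is consulted only when the lookup has nothing to say.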

Performance on test data

  • Human Detection: Accuracy 74.46%; F1 76.49%
  • Gender Prediction: Accuracy 95.57%; Macro F1 93.01%
  • Geographic Origin: Accuracy 66.55%; Macro F1 44.83%
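
Accuracy and macro F1 can differ considerably, as they do for geographic origin above. Macro F1 averages the per-class F1 scores with equal weight, so classes that are rarely predicted correctly pull it down even when overall accuracy is reasonable. The sketch below computes both metrics on invented toy labels; it is not NameTrace's test data or evaluation code, and class imbalance is only one possible reading of the reported gap.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: the model always predicts the majority class "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(round(acc, 4), round(mf1, 4))
```

Here accuracy is 0.8 while macro F1 is about 0.44, because class "B" contributes an F1 of zero.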

Requirements

  • Python 3.9+
  • PyTorch 2.0+
  • nameparser

Data and credits

This package was built to offer gender and geographic origin prediction through a simple, modern API. As such, it benefited heavily from the previous work of other authors and packages:

  • name2nat (Kyubyong Park, 2020). I built on the nationality data collected by Kyubyong Park and converted it to geographic regions. I also extended his dataset by collecting the gender of the names in the dataset from Wikipedia.
  • gender_guesser (ByRatings, 2016). I take a list of first names from this package, which was originally taken from a C package by Jörg Michael (2008).
  • Gender By Name (Rupinder Singh Rana, 2024), which also contains a list of first names that I use to train the human name detection.

License

GNUv3 License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
