
Predict nationality, ethnicity, gender, region, language and religion from names - 238 countries, 6 major religions, 5.9M+ names, complete religious coverage


EthniData - Ethnicity and Nationality Prediction


Predict nationality, ethnicity, and demographics from names using a comprehensive global database built from multiple authoritative sources.

🌟 Features

  • 190+ Countries - Comprehensive coverage from Wikipedia/Wikidata
  • 106 Countries - Enhanced with names-dataset
  • 120 Years of Olympic athlete names
  • Multiple Sources - Phone directories, census data, public records
  • Fast Predictions - SQLite-based for instant lookups
  • Normalized Data - Unicode-aware, case-insensitive matching
  • Ethnicity Support - Where available in source data
  • Simple API - Easy to use Python interface

📊 Data Sources

  1. Wikipedia/Wikidata - 190+ countries, biographical data with ethnicity
  2. names-dataset - 106 countries, curated name lists
  3. Olympics Dataset - 120 years of athlete names (271,116 records)
  4. Phone Directories - Public domain name lists from multiple countries
  5. Census Data - US Census and other government open data

🚀 Installation

pip install ethnidata

📖 Usage

Basic Usage

from ethnidata import EthniData

# Initialize
ed = EthniData()

# Predict nationality from first name
result = ed.predict_nationality("Ahmet", name_type="first")
print(result)
# {
#   'name': 'ahmet',
#   'country': 'TUR',
#   'country_name': 'Turkey',
#   'confidence': 0.89,
#   'top_countries': [
#     {'country': 'TUR', 'country_name': 'Turkey', 'probability': 0.89},
#     {'country': 'DEU', 'country_name': 'Germany', 'probability': 0.07},
#     ...
#   ]
# }

# Predict from last name
result = ed.predict_nationality("Tanaka", name_type="last")
print(result['country'])  # 'JPN'

# Predict from full name (combines both)
result = ed.predict_full_name("Wei", "Chen")
print(result['country'])  # 'CHN'

# Predict ethnicity (when available)
result = ed.predict_ethnicity("Muhammad", name_type="first")
print(result)
# {
#   'name': 'muhammad',
#   'ethnicity': 'Arab',
#   'country': 'SAU',
#   'country_name': 'Saudi Arabia'
# }

Advanced Usage

# Get top 10 predictions
result = ed.predict_nationality("Maria", name_type="first", top_n=10)

for country in result['top_countries']:
    print(f"{country['country_name']}: {country['probability']:.2%}")
# Spain: 35.40%
# Italy: 28.20%
# Portugal: 15.10%
# ...

# Database statistics
stats = ed.get_stats()
print(stats)
# {
#   'total_first_names': 123456,
#   'total_last_names': 234567,
#   'countries_first': 195,
#   'countries_last': 198
# }

🏗️ Project Structure

ethnidata/
├── ethnidata/                # Main package
│   ├── __init__.py
│   ├── predictor.py          # Core prediction logic
│   └── ethnidata.db          # SQLite database
├── scripts/                  # Data collection scripts
│   ├── 1_fetch_names_dataset.py
│   ├── 2_fetch_wikipedia.py
│   ├── 3_fetch_olympics.py
│   ├── 4_fetch_phone_directories.py
│   ├── 5_merge_all_data.py
│   └── 6_create_database.py
├── tests/                    # Unit tests
├── examples/                 # Example scripts
├── docs/                     # Documentation
├── setup.py
├── pyproject.toml
└── README.md

🔬 Accuracy & Methodology

How it works

  1. Name Normalization: Names are lowercased and Unicode-normalized (e.g., "José" → "jose")
  2. Database Lookup: Queries SQLite database for matching names
  3. Frequency-Based Scoring: Countries are ranked by how often the name appears
  4. Probability Calculation: Frequencies are converted to probabilities
  5. Full Name Combination: First-name and last-name probabilities are combined with weights of 40% and 60%, respectively
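
The five steps above can be sketched end to end. This is a minimal illustration, not EthniData's actual implementation: the `name_counts` table, its columns, the sample counts, and the helper names are all hypothetical, and the real schema of `ethnidata.db` may differ. Stripping diacritics via NFKD decomposition is likewise an assumption about how "Unicode-normalized" is achieved. Only the 40%/60% weighting comes from the description above.

```python
import sqlite3
import unicodedata

FIRST_WEIGHT = 0.4  # weights quoted in the methodology list
LAST_WEIGHT = 0.6


def normalize(name: str) -> str:
    """Lowercase a name and strip diacritics (assumed approach: NFKD)."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def country_probabilities(conn, name, name_type):
    """Look up per-country counts for one name and convert them to probabilities."""
    rows = conn.execute(
        "SELECT country, freq FROM name_counts WHERE name = ? AND name_type = ?",
        (normalize(name), name_type),
    ).fetchall()
    total = sum(freq for _, freq in rows)
    if total == 0:
        return {}  # name not present in the (hypothetical) table
    return {country: freq / total for country, freq in rows}


def combine(first_probs, last_probs):
    """Weighted sum of first-name and last-name probabilities, best first."""
    countries = set(first_probs) | set(last_probs)
    scores = {
        c: FIRST_WEIGHT * first_probs.get(c, 0.0) + LAST_WEIGHT * last_probs.get(c, 0.0)
        for c in countries
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical in-memory table with made-up counts, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_counts (name TEXT, name_type TEXT, country TEXT, freq INTEGER)"
)
conn.executemany(
    "INSERT INTO name_counts VALUES (?, ?, ?, ?)",
    [
        ("jose", "first", "ESP", 60),
        ("jose", "first", "MEX", 40),
        ("garcia", "last", "ESP", 50),
        ("garcia", "last", "MEX", 50),
    ],
)

ranked = combine(
    country_probabilities(conn, "José", "first"),
    country_probabilities(conn, "García", "last"),
)
for country, score in ranked:
    print(f"{country}: {score:.2f}")
# ESP: 0.54
# MEX: 0.46
```

In this sketch the combined scores sum to 1 only when both names are found. If one lookup comes back empty, the scores sum to the other name's weight, so a real implementation would probably renormalize or fall back to the single-name prediction.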

Limitations

  • Bias: The database reflects historical Olympic participation and Wikipedia coverage, so it over-represents the people those sources document
  • Missing Names: Rare or new names may not be in the database
  • Ethnicity: Only available where the source data included it
  • Migration: Does not account for diaspora or modern migration patterns
  • Multiple Origins: Common names (e.g., "Ali", "Maria") exist in many cultures
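
Given these limitations, it can be worth refusing to act on a prediction that is weak or ambiguous. The helper below is a sketch of one way to do that. It operates on a plain dict shaped like the `predict_nationality` output shown under Basic Usage; the function name, the thresholds, and the assumption that an unknown name yields `None` or an empty `top_countries` list are all hypothetical and not part of the EthniData API.

```python
from typing import Optional


def confident_country(
    result: Optional[dict],
    min_confidence: float = 0.6,  # hypothetical threshold
    min_margin: float = 0.2,      # hypothetical gap to the runner-up
) -> Optional[str]:
    """Return the top country code only when the prediction is clear-cut."""
    if not result or not result.get("top_countries"):
        return None  # name missing from the database (assumed shape)

    ranked = sorted(
        result["top_countries"], key=lambda c: c["probability"], reverse=True
    )
    best = ranked[0]
    runner_up = ranked[1]["probability"] if len(ranked) > 1 else 0.0

    if best["probability"] < min_confidence:
        return None  # too weak overall
    if best["probability"] - runner_up < min_margin:
        return None  # too close to call, e.g. a name shared by many cultures
    return best["country"]


# Hand-written dicts mirroring the sample outputs in this README.
ahmet = {
    "top_countries": [
        {"country": "TUR", "country_name": "Turkey", "probability": 0.89},
        {"country": "DEU", "country_name": "Germany", "probability": 0.07},
    ]
}
maria = {
    "top_countries": [
        {"country": "ESP", "country_name": "Spain", "probability": 0.354},
        {"country": "ITA", "country_name": "Italy", "probability": 0.282},
        {"country": "PRT", "country_name": "Portugal", "probability": 0.151},
    ]
}

print(confident_country(ahmet))  # TUR
print(confident_country(maria))  # None
```

The thresholds are placeholders; suitable values depend on how costly a wrong guess is in the application at hand.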

🛠️ Development

Build Database from Scratch

git clone https://github.com/teyfikoz/ethnidata.git
cd ethnidata

# Install dependencies
pip install -r requirements.txt

# Fetch all data (takes 10-30 minutes)
cd scripts
python 1_fetch_names_dataset.py
python 2_fetch_wikipedia.py
python 3_fetch_olympics.py
python 4_fetch_phone_directories.py
python 5_merge_all_data.py
python 6_create_database.py
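
The six commands above have to run in order. As a convenience, they could be driven from a small Python wrapper like the sketch below. This wrapper is not part of the repository; `ordered_scripts` and `run_pipeline` are hypothetical names, and the sketch assumes only that the scripts keep their numeric filename prefixes.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(names):
    """Return script names like '3_fetch_olympics.py' sorted by numeric prefix.

    Files without a numeric prefix are ignored.
    """
    numbered = []
    for name in names:
        prefix = name.split("_", 1)[0]
        if name.endswith(".py") and prefix.isdigit():
            numbered.append((int(prefix), name))
    return [name for _, name in sorted(numbered)]


def run_pipeline(scripts_dir: Path) -> None:
    """Run each numbered script in order, stopping at the first failure."""
    names = [path.name for path in scripts_dir.glob("*.py")]
    for name in ordered_scripts(names):
        print(f"Running {name} ...")
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, name], cwd=scripts_dir, check=True)


# Usage, from the repository root:
# run_pipeline(Path("scripts"))
```

Sorting on the integer prefix rather than the raw filename keeps the order correct if a tenth script is ever added, since plain string sorting would place `10_...` before `2_...`.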

Run Tests

pip install -e ".[dev]"
pytest tests/ -v

📜 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

📚 Citations

If you use this database in research, please cite:

@software{ethnidata_2024,
  title = {EthniData: Ethnicity and Nationality Prediction from Names},
  author = {Oz, Teyfik},
  year = {2024},
  url = {https://github.com/teyfikoz/ethnidata}
}

Data Source Citations

  • Olympics Data: Randi Griffin (2018). 120 years of Olympic history. Kaggle
  • names-dataset: Philippe Remy (2021). name-dataset
  • Wikidata: Wikimedia Foundation. Wikidata


Built with ❤️ using open data
