Predict nationality, ethnicity, gender, region, and language from names using a global database of 310K+ names
EthniData - Ethnicity and Nationality Prediction
Predict nationality, ethnicity, and demographics from names using a comprehensive global database built from multiple authoritative sources.
Features
- 190+ Countries - Comprehensive coverage from Wikipedia/Wikidata
- 106 Countries - Enhanced with names-dataset
- 120 Years of Olympic athlete names
- Multiple Sources - Phone directories, census data, public records
- Fast Predictions - SQLite-based for instant lookups
- Normalized Data - Unicode-aware, case-insensitive matching
- Ethnicity Support - Where available in source data
- Simple API - Easy to use Python interface
Data Sources
- Wikipedia/Wikidata - 190+ countries, biographical data with ethnicity
- names-dataset - 106 countries, curated name lists
- Olympics Dataset - 120 years of athlete names (271,116 records)
- Phone Directories - Public domain name lists from multiple countries
- Census Data - US Census and other government open data
Installation
pip install ethnidata
Usage
Basic Usage
from ethnidata import EthniData
# Initialize
ed = EthniData()
# Predict nationality from first name
result = ed.predict_nationality("Ahmet", name_type="first")
print(result)
# {
#     'name': 'ahmet',
#     'country': 'TUR',
#     'country_name': 'Turkey',
#     'confidence': 0.89,
#     'top_countries': [
#         {'country': 'TUR', 'country_name': 'Turkey', 'probability': 0.89},
#         {'country': 'DEU', 'country_name': 'Germany', 'probability': 0.07},
#         ...
#     ]
# }
# Predict from last name
result = ed.predict_nationality("Tanaka", name_type="last")
print(result['country']) # 'JPN'
# Predict from full name (combines both)
result = ed.predict_full_name("Wei", "Chen")
print(result['country']) # 'CHN'
# Predict ethnicity (when available)
result = ed.predict_ethnicity("Muhammad", name_type="first")
print(result)
# {
#     'name': 'muhammad',
#     'ethnicity': 'Arab',
#     'country': 'SAU',
#     'country_name': 'Saudi Arabia'
# }
Advanced Usage
# Get top 10 predictions
result = ed.predict_nationality("Maria", name_type="first", top_n=10)
for country in result['top_countries']:
    print(f"{country['country_name']}: {country['probability']:.2%}")
# Spain: 35.40%
# Italy: 28.20%
# Portugal: 15.10%
# ...
# Database statistics
stats = ed.get_stats()
print(stats)
# {
#     'total_first_names': 123456,
#     'total_last_names': 234567,
#     'countries_first': 195,
#     'countries_last': 198
# }
Project Structure
ethnidata/
├── ethnidata/                  # Main package
│   ├── __init__.py
│   ├── predictor.py            # Core prediction logic
│   └── ethnidata.db            # SQLite database
├── scripts/                    # Data collection scripts
│   ├── 1_fetch_names_dataset.py
│   ├── 2_fetch_wikipedia.py
│   ├── 3_fetch_olympics.py
│   ├── 4_fetch_phone_directories.py
│   ├── 5_merge_all_data.py
│   └── 6_create_database.py
├── tests/                      # Unit tests
├── examples/                   # Example scripts
├── docs/                       # Documentation
├── setup.py
├── pyproject.toml
└── README.md
Accuracy & Methodology
How it works
- Name Normalization: Names are lowercased and Unicode-normalized (e.g., "José" → "jose")
- Database Lookup: Queries SQLite database for matching names
- Frequency-Based Scoring: Countries are ranked by how often the name appears
- Probability Calculation: Frequencies are converted to probabilities
- Full Name Combination: First-name and last-name predictions are combined with weights of 40% and 60%, respectively
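The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the package's actual implementation: the `names` table and its columns, the helper names (`normalize_name`, `country_probabilities`, `combine`), and the toy counts are all assumptions made for the example, and diacritics are stripped here with NFKD decomposition, which may differ from what `predictor.py` does.

```python
import sqlite3
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name and strip diacritics, e.g. "José" -> "jose"."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def country_probabilities(conn, name, name_type):
    """Look up per-country counts for a name and convert them to probabilities."""
    rows = conn.execute(
        "SELECT country, frequency FROM names WHERE name = ? AND name_type = ?",
        (normalize_name(name), name_type),
    ).fetchall()
    total = sum(freq for _, freq in rows)
    if total == 0:
        return {}  # name not present in the (hypothetical) table
    return {country: freq / total for country, freq in rows}


def combine(first_probs, last_probs, first_weight=0.4, last_weight=0.6):
    """Weighted sum of both distributions, renormalized so it sums to 1."""
    combined = {}
    for country, p in first_probs.items():
        combined[country] = combined.get(country, 0.0) + first_weight * p
    for country, p in last_probs.items():
        combined[country] = combined.get(country, 0.0) + last_weight * p
    total = sum(combined.values())
    return {c: p / total for c, p in combined.items()} if total else {}


# Toy in-memory database with made-up counts (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (name TEXT, name_type TEXT, country TEXT, frequency INTEGER)"
)
conn.executemany(
    "INSERT INTO names VALUES (?, ?, ?, ?)",
    [
        ("jose", "first", "ESP", 60),
        ("jose", "first", "MEX", 40),
        ("garcia", "last", "ESP", 50),
        ("garcia", "last", "MEX", 50),
    ],
)

first = country_probabilities(conn, "José", "first")
last = country_probabilities(conn, "García", "last")
ranked = sorted(combine(first, last).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

With these toy counts the first name favors ESP (0.6 vs 0.4) and the last name is evenly split, so the 40/60 combination ranks ESP first. Renormalizing in `combine` keeps the result a valid distribution even when only one of the two names is found.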
Limitations
- Bias: The database reflects historical Olympic participation and Wikipedia coverage
- Missing Names: Rare or new names may not be in the database
- Ethnicity: Only available where source data included it
- Migration: Doesn't account for diaspora or modern migration patterns
- Multiple Origins: Common names (e.g., "Ali", "Maria") exist in many cultures
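Given these limitations, callers may want to treat weak or ambiguous predictions as "unknown" rather than take the top country at face value. The helper below is a hypothetical sketch that works on result dictionaries shaped like the ones in the usage examples; `confident_country` and both thresholds are illustrative assumptions, and how the library reports a name that is missing from the database is not documented here, so the function simply treats an empty or `None` result as no prediction.

```python
from typing import Optional


def confident_country(
    result: Optional[dict],
    min_confidence: float = 0.6,
    min_margin: float = 0.2,
) -> Optional[str]:
    """Return the top country code only when the prediction is clear-cut."""
    if not result or result.get("country") is None:
        return None  # nothing usable was predicted
    if result.get("confidence", 0.0) < min_confidence:
        return None  # top country is too weak on its own
    top = result.get("top_countries", [])
    if len(top) >= 2 and top[0]["probability"] - top[1]["probability"] < min_margin:
        return None  # runner-up is too close to call
    return result["country"]


# Dictionaries shaped like the README examples (values copied from them).
ahmet = {
    "country": "TUR",
    "confidence": 0.89,
    "top_countries": [
        {"country": "TUR", "probability": 0.89},
        {"country": "DEU", "probability": 0.07},
    ],
}
maria = {
    "country": "ESP",
    "confidence": 0.354,
    "top_countries": [
        {"country": "ESP", "probability": 0.354},
        {"country": "ITA", "probability": 0.282},
    ],
}

print(confident_country(ahmet))
print(confident_country(maria))
```

A name such as "Maria", which is spread across many countries, falls below the confidence threshold and yields `None`, while a concentrated name such as "Ahmet" passes. The thresholds would need tuning for any real use.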
Development
Build Database from Scratch
git clone https://github.com/teyfikoz/ethnidata.git
cd ethnidata
# Install dependencies
pip install -r requirements.txt
# Fetch all data (takes 10-30 minutes)
cd scripts
python 1_fetch_names_dataset.py
python 2_fetch_wikipedia.py
python 3_fetch_olympics.py
python 4_fetch_phone_directories.py
python 5_merge_all_data.py
python 6_create_database.py
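Because the scripts are numbered in execution order, the sequence above could also be driven from a small Python helper. This is only a sketch under that assumption: `ordered_scripts` and `run_pipeline` are hypothetical names and are not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(script_dir):
    """Return the numbered pipeline scripts sorted by their numeric prefix."""
    scripts = [
        p for p in Path(script_dir).glob("*.py")
        if p.name.split("_", 1)[0].isdigit()
    ]
    return sorted(scripts, key=lambda p: int(p.name.split("_", 1)[0]))


def run_pipeline(script_dir="scripts"):
    """Run each step from inside script_dir and stop at the first failure."""
    for script in ordered_scripts(script_dir):
        print(f"Running {script.name} ...")
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run([sys.executable, script.name], cwd=script_dir, check=True)
```

Calling `run_pipeline()` from the repository root would run the six steps in order, mirroring the `cd scripts` plus `python ...` sequence above. Sorting by the integer prefix rather than by file name keeps the order correct if a step 10 or later is ever added.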
Run Tests
pip install -e ".[dev]"
pytest tests/ -v
License
MIT License - see LICENSE file for details
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Citations
If you use this database in research, please cite:
@software{ethnidata_2024,
title = {EthniData: Ethnicity and Nationality Prediction from Names},
author = {Oz, Tefik Yavuz},
year = {2024},
url = {https://github.com/teyfikoz/ethnidata}
}
Data Source Citations
- Olympics Data: Randi Griffin (2018). 120 years of Olympic history. Kaggle
- names-dataset: Philippe Remy (2021). name-dataset
- Wikidata: Wikimedia Foundation. Wikidata
Related Projects
- ethnicolr - Ethnicity prediction using LSTM
- name-dataset - Name database (106 countries)
- gender-guesser - Gender prediction
Contact
- GitHub Issues: Report bugs or request features
- GitHub: @teyfikoz
Built with ❤️ using open data
File details
Details for the file ethnidata-1.2.0-py3-none-any.whl.
File metadata
- Download URL: ethnidata-1.2.0-py3-none-any.whl
- Upload date:
- Size: 6.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6664f3b566252b74f683dbf179947e2ae93247816a010a6b92bf3b0bcd9c355c |
| MD5 | 2f47d1a897e801a74bcdbb3bf45641e4 |
| BLAKE2b-256 | e4066038d7179122f8c1cf76d831644ab72adaa7d14912bc497cc53d351ca52f |