Predict nationality, ethnicity, gender, region, and language from names using a global database of 310K+ names

EthniData - Ethnicity and Nationality Prediction

Predict nationality, ethnicity, and demographics from names using a comprehensive global database built from multiple authoritative sources.

🌟 Features

190+ Countries - Comprehensive coverage from Wikipedia/Wikidata
106 Countries - Enhanced with names-dataset
120 Years of Olympic athlete names
Multiple Sources - Phone directories, census data, public records
Fast Predictions - SQLite-based for instant lookups
Normalized Data - Unicode-aware, case-insensitive matching
Ethnicity Support - Where available in source data
Simple API - Easy to use Python interface

📊 Data Sources

Wikipedia/Wikidata - 190+ countries, biographical data with ethnicity
names-dataset - 106 countries, curated name lists
Olympics Dataset - 120 years of athlete names (271,116 records)
Phone Directories - Public domain name lists from multiple countries
Census Data - US Census and other government open data

🚀 Installation

pip install ethnidata

📖 Usage

Basic Usage

from ethnidata import EthniData

# Initialize
ed = EthniData()

# Predict nationality from first name
result = ed.predict_nationality("Ahmet", name_type="first")
print(result)
# {
#   'name': 'ahmet',
#   'country': 'TUR',
#   'country_name': 'Turkey',
#   'confidence': 0.89,
#   'top_countries': [
#     {'country': 'TUR', 'country_name': 'Turkey', 'probability': 0.89},
#     {'country': 'DEU', 'country_name': 'Germany', 'probability': 0.07},
#     ...
#   ]
# }

# Predict from last name
result = ed.predict_nationality("Tanaka", name_type="last")
print(result['country'])  # 'JPN'

# Predict from full name (combines both)
result = ed.predict_full_name("Wei", "Chen")
print(result['country'])  # 'CHN'

# Predict ethnicity (when available)
result = ed.predict_ethnicity("Muhammad", name_type="first")
print(result)
# {
#   'name': 'muhammad',
#   'ethnicity': 'Arab',
#   'country': 'SAU',
#   'country_name': 'Saudi Arabia'
# }

Advanced Usage

# Get top 10 predictions
result = ed.predict_nationality("Maria", name_type="first", top_n=10)

for country in result['top_countries']:
    print(f"{country['country_name']}: {country['probability']:.2%}")
# Spain: 35.40%
# Italy: 28.20%
# Portugal: 15.10%
# ...

# Database statistics
stats = ed.get_stats()
print(stats)
# {
#   'total_first_names': 123456,
#   'total_last_names': 234567,
#   'countries_first': 195,
#   'countries_last': 198
# }
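
The result dictionaries shown above can be post-processed with plain Python. The following is a minimal sketch, assuming the result has the shape documented for predict_nationality; the helper name format_top_countries and the min_probability cutoff are hypothetical and not part of the package. The sample values are the illustrative ones from the example output.

```python
from typing import Any


def format_top_countries(result: dict[str, Any], min_probability: float = 0.05) -> list[str]:
    """Return 'Country: 12.34%' lines for candidates at or above the cutoff.

    Hypothetical helper: it only reads the 'top_countries' list from a
    result dictionary and does not call the library itself.
    """
    lines = []
    for entry in result.get("top_countries", []):
        if entry["probability"] >= min_probability:
            lines.append(f"{entry['country_name']}: {entry['probability']:.2%}")
    return lines


# Sample shaped like the documented predict_nationality() output
# (values are illustrative, copied from the example above).
sample = {
    "name": "ahmet",
    "country": "TUR",
    "country_name": "Turkey",
    "confidence": 0.89,
    "top_countries": [
        {"country": "TUR", "country_name": "Turkey", "probability": 0.89},
        {"country": "DEU", "country_name": "Germany", "probability": 0.07},
    ],
}

for line in format_top_countries(sample, min_probability=0.10):
    print(line)
# Turkey: 89.00%
```

With a real predictor, the same helper would be applied to the return value of ed.predict_nationality(...), assuming the keys match the documented output.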

🏗️ Project Structure

ethnidata/
├── ethnidata/                # Main package
│   ├── __init__.py
│   ├── predictor.py          # Core prediction logic
│   └── ethnidata.db          # SQLite database
├── scripts/                  # Data collection scripts
│   ├── 1_fetch_names_dataset.py
│   ├── 2_fetch_wikipedia.py
│   ├── 3_fetch_olympics.py
│   ├── 4_fetch_phone_directories.py
│   ├── 5_merge_all_data.py
│   └── 6_create_database.py
├── tests/                    # Unit tests
├── examples/                 # Example scripts
├── docs/                     # Documentation
├── setup.py
├── pyproject.toml
└── README.md

🔬 Accuracy & Methodology

How it works

Name Normalization: Names are lowercased and Unicode-normalized (e.g., "José" → "jose")
Database Lookup: Queries SQLite database for matching names
Frequency-Based Scoring: Countries are ranked by how often the name appears
Probability Calculation: Frequencies are converted to probabilities
Full Name Combination: First-name and last-name probabilities are combined with weights of 40% and 60%, respectively
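
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in predictor.py: the function names normalize_name, to_probabilities, and combine_scores are hypothetical, the accent stripping uses NFKD decomposition as one plausible way to get "José" → "jose", and the per-country counts are made up. Only the 40%/60% weights come from the description above.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name and strip diacritics (assumed normalization scheme)."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    # Drop combining marks such as the acute accent left over from "é".
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def to_probabilities(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-country frequencies into probabilities that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


def combine_scores(
    first: dict[str, float],
    last: dict[str, float],
    first_weight: float = 0.4,
    last_weight: float = 0.6,
) -> dict[str, float]:
    """Weighted sum of first-name and last-name probabilities, highest first."""
    countries = set(first) | set(last)
    scores = {
        c: first_weight * first.get(c, 0.0) + last_weight * last.get(c, 0.0)
        for c in countries
    }
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


print(normalize_name("José"))  # jose

# Made-up frequency counts, for illustration only.
first_probs = to_probabilities({"CHN": 50, "TWN": 50})
last_probs = to_probabilities({"CHN": 100})
print(combine_scores(first_probs, last_probs))
```

A country that appears for only one of the two names still receives a score here, because a missing entry is treated as probability 0.0. Whether the library handles that case the same way is not stated in this README.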

Limitations

Bias: The database reflects the biases of its sources, such as historical Olympic participation and Wikipedia coverage
Missing Names: Rare or new names may not be in the database
Ethnicity: Only available where the source data included it
Migration: Does not account for diaspora or modern migration patterns
Multiple Origins: Common names (e.g., "Ali", "Maria") exist in many cultures
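
One way to act on the "Multiple Origins" limitation is to check how close the top two candidates are before trusting a prediction. The sketch below assumes a result dictionary with the documented 'top_countries' list; the helper names top_margin and is_ambiguous, the 0.2 threshold, and the choice to treat an empty candidate list as ambiguous are all assumptions made for illustration, not library behavior.

```python
from typing import Any


def top_margin(result: dict[str, Any]) -> float:
    """Gap between the best and second-best candidate probabilities."""
    probs = sorted(
        (entry["probability"] for entry in result.get("top_countries", [])),
        reverse=True,
    )
    if not probs:
        return 0.0
    if len(probs) == 1:
        return probs[0]
    return probs[0] - probs[1]


def is_ambiguous(result: dict[str, Any], min_margin: float = 0.2) -> bool:
    """Flag predictions whose top candidate does not clearly lead (assumed rule)."""
    return top_margin(result) < min_margin


# Illustrative values taken from the "Maria" and "Ahmet" examples in this README.
maria = {
    "top_countries": [
        {"country_name": "Spain", "probability": 0.354},
        {"country_name": "Italy", "probability": 0.282},
        {"country_name": "Portugal", "probability": 0.151},
    ]
}
ahmet = {
    "top_countries": [
        {"country_name": "Turkey", "probability": 0.89},
        {"country_name": "Germany", "probability": 0.07},
    ]
}

print(is_ambiguous(maria))  # True
print(is_ambiguous(ahmet))  # False
```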

🛠️ Development

Build Database from Scratch

git clone https://github.com/teyfikoz/ethnidata.git
cd ethnidata

# Install dependencies
pip install -r requirements.txt

# Fetch all data (takes 10-30 minutes)
cd scripts
python 1_fetch_names_dataset.py
python 2_fetch_wikipedia.py
python 3_fetch_olympics.py
python 4_fetch_phone_directories.py
python 5_merge_all_data.py
python 6_create_database.py
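
The six commands above can also be driven from a single Python script. This is a hedged sketch rather than something shipped with the repository: the helper names ordered_scripts and run_pipeline are hypothetical, and it assumes every pipeline script follows the N_name.py naming seen above and can be run from inside the scripts directory.

```python
import re
import subprocess
import sys
from pathlib import Path


def ordered_scripts(scripts_dir: Path) -> list[Path]:
    """Return scripts named like '3_fetch_olympics.py', sorted by numeric prefix."""
    numbered = []
    for path in scripts_dir.glob("*.py"):
        match = re.match(r"(\d+)_", path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def run_pipeline(scripts_dir: Path) -> None:
    """Run each numbered script in order, stopping at the first failure."""
    for script in ordered_scripts(scripts_dir):
        print(f"Running {script.name} ...")
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        subprocess.run([sys.executable, script.name], cwd=scripts_dir, check=True)
```

Calling run_pipeline(Path("scripts")) from the repository root would then run the steps in numeric order. Sorting by the integer prefix, rather than by file name, keeps a hypothetical 10_*.py step after 2_*.py.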

Run Tests

pip install -e ".[dev]"
pytest tests/ -v

📜 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

📚 Citations

If you use this database in research, please cite:

@software{ethnidata_2024,
  title = {EthniData: Ethnicity and Nationality Prediction from Names},
  author = {Oz, Tefik Yavuz},
  year = {2024},
  url = {https://github.com/teyfikoz/ethnidata}
}

Data Source Citations

Olympics Data: Randi Griffin (2018). 120 years of Olympic history. Kaggle
names-dataset: Philippe Remy (2021). name-dataset
Wikidata: Wikimedia Foundation. Wikidata

🔗 Related Projects

ethnicolr - Ethnicity prediction using LSTM
name-dataset - Name database (106 countries)
gender-guesser - Gender prediction

📧 Contact

GitHub Issues: Report bugs or request features
GitHub: @teyfikoz

Built with ❤️ using open data

Release history

4.4.0

Mar 1, 2026

4.3.1

Jan 22, 2026

4.3.0

Jan 22, 2026

4.2.0

Jan 22, 2026

4.1.4

Jan 22, 2026

4.1.3

Jan 22, 2026

4.1.2

Jan 22, 2026

4.1.1

Dec 23, 2025

4.1.0

Dec 23, 2025

4.0.3

Dec 22, 2025

4.0.2

Dec 22, 2025

4.0.1

Dec 22, 2025

4.0.0

Dec 20, 2025

3.1.5

Dec 19, 2025

3.1.4

Dec 14, 2025

3.1.0

Dec 9, 2025

3.0.3

Dec 2, 2025

3.0.1

Nov 10, 2025

3.0.0

Nov 9, 2025

2.0.0

Nov 9, 2025

1.3.0

Nov 9, 2025

This version

1.2.0

Nov 9, 2025

1.1.0

Nov 9, 2025

1.0.0

Nov 9, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

ethnidata-1.2.0-py3-none-any.whl (6.1 MB view details)

Uploaded Nov 9, 2025 Python 3

File details

Details for the file ethnidata-1.2.0-py3-none-any.whl.

File metadata

Download URL: ethnidata-1.2.0-py3-none-any.whl
Upload date: Nov 9, 2025
Size: 6.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for ethnidata-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6664f3b566252b74f683dbf179947e2ae93247816a010a6b92bf3b0bcd9c355c`
MD5	`2f47d1a897e801a74bcdbb3bf45641e4`
BLAKE2b-256	`e4066038d7179122f8c1cf76d831644ab72adaa7d14912bc497cc53d351ca52f`

ethnidata 1.2.0
