The Python library to handle names

Project description

First and Last Names Database

To download the raw CSV data for your analysis, browse here.

This Python library provides information about names:

Popularity (rank)
Country (105 countries are supported)
Gender
Fuzzy search (search with an erroneous name, ISABLE -> ISABEL)
Auto-complete search (realtime, for example all names starting with ISA*.)

It can answer questions such as:

Who is Zoe? Likely a Female, United Kingdom.
Know Philippe? Likely a Male, France. And with the spelling Philipp? Male, Germany.
How about Nikki? Likely a Female, United States.

Composition

730K first names and 983K last names, extracted from the massive Facebook dump (533M users).

Installation

Available on PyPI:

pip install names-dataset

Usage

NOTE: The library requires 3.2GB of RAM to load the full dataset in memory. Make sure you have enough RAM to avoid MemoryError.

Once it's installed, run these commands to familiarize yourself with the library:

from names_dataset import NameDataset, NameWrapper

# The library takes time to initialize because the database is massive. A tip is to include its initialization in your app's startup process.
nd = NameDataset()

print(NameWrapper(nd.search('Philippe')).describe)
# Male, France

print(NameWrapper(nd.search('Zoe')).describe)
# Female, United Kingdom

print(nd.search('Walter'))
# {'first_name': {'country': {'Argentina': 0.062, 'Austria': 0.037, 'Bolivia, Plurinational State of': 0.042, 'Colombia': 0.096, 'Germany': 0.044, 'Italy': 0.295, 'Peru': 0.185, 'United States': 0.159, 'Uruguay': 0.036, 'South Africa': 0.043}, 'gender': {'Female': 0.007, 'Male': 0.993}, 'rank': {'Argentina': 37, 'Austria': 34, 'Bolivia, Plurinational State of': 67, 'Colombia': 250, 'Germany': 214, 'Italy': 193, 'Peru': 27, 'United States': 317, 'Uruguay': 44, 'South Africa': 388}}, 'last_name': {'country': {'Austria': 0.036, 'Brazil': 0.039, 'Switzerland': 0.032, 'Germany': 0.299, 'France': 0.121, 'United Kingdom': 0.048, 'Italy': 0.09, 'Nigeria': 0.078, 'United States': 0.172, 'South Africa': 0.085}, 'gender': {}, 'rank': {'Austria': 106, 'Brazil': 805, 'Switzerland': 140, 'Germany': 39, 'France': 625, 'United Kingdom': 1823, 'Italy': 3564, 'Nigeria': 926, 'United States': 1210, 'South Africa': 1169}}}

print(nd.search('White'))
# {'first_name': {'country': {'United Arab Emirates': 0.044, 'Egypt': 0.294, 'France': 0.061, 'Hong Kong': 0.05, 'Iraq': 0.094, 'Italy': 0.117, 'Malaysia': 0.133, 'Saudi Arabia': 0.089, 'Taiwan, Province of China': 0.044, 'United States': 0.072}, 'gender': {'Female': 0.519, 'Male': 0.481}, 'rank': {'Taiwan, Province of China': 6940, 'United Arab Emirates': None, 'Egypt': None, 'France': None, 'Hong Kong': None, 'Iraq': None, 'Italy': None, 'Malaysia': None, 'Saudi Arabia': None, 'United States': None}}, 'last_name': {'country': {'Canada': 0.035, 'France': 0.016, 'United Kingdom': 0.296, 'Ireland': 0.028, 'Iraq': 0.016, 'Italy': 0.02, 'Jamaica': 0.017, 'Nigeria': 0.031, 'United States': 0.5, 'South Africa': 0.04}, 'gender': {}, 'rank': {'Canada': 46, 'France': 1041, 'United Kingdom': 18, 'Ireland': 66, 'Iraq': 1307, 'Italy': 2778, 'Jamaica': 35, 'Nigeria': 425, 'United States': 47, 'South Africa': 416}}}

print(nd.search('محمد'))
# {'first_name': {'country': {'Algeria': 0.018, 'Egypt': 0.441, 'Iraq': 0.12, 'Jordan': 0.027, 'Libya': 0.035, 'Saudi Arabia': 0.154, 'Sudan': 0.07, 'Syrian Arab Republic': 0.062, 'Turkey': 0.022, 'Yemen': 0.051}, 'gender': {'Female': 0.035, 'Male': 0.965}, 'rank': {'Algeria': 4, 'Egypt': 1, 'Iraq': 2, 'Jordan': 1, 'Libya': 1, 'Saudi Arabia': 1, 'Sudan': 1, 'Syrian Arab Republic': 1, 'Turkey': 18, 'Yemen': 1}}, 'last_name': {'country': {'Egypt': 0.453, 'Iraq': 0.096, 'Jordan': 0.015, 'Libya': 0.043, 'Palestine, State of': 0.016, 'Saudi Arabia': 0.118, 'Sudan': 0.146, 'Syrian Arab Republic': 0.058, 'Turkey': 0.017, 'Yemen': 0.037}, 'gender': {}, 'rank': {'Egypt': 2, 'Iraq': 3, 'Jordan': 1, 'Libya': 1, 'Palestine, State of': 1, 'Saudi Arabia': 3, 'Sudan': 1, 'Syrian Arab Republic': 2, 'Turkey': 44, 'Yemen': 1}}}

print(nd.get_top_names(n=10, gender='Male', country_alpha2='US'))
# {'US': {'M': ['Jose', 'David', 'Michael', 'John', 'Juan', 'Carlos', 'Luis', 'Chris', 'Alex', 'Daniel']}}

print(nd.get_top_names(n=5, country_alpha2='ES'))
# {'ES': {'M': ['Jose', 'Antonio', 'Juan', 'Manuel', 'David'], 'F': ['Maria', 'Ana', 'Carmen', 'Laura', 'Isabel']}}

print(nd.get_country_codes(alpha_2=True))
# ['AE', 'AF', 'AL', 'AO', 'AR', 'AT', 'AZ', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BN', 'BO', 'BR', 'BW', 'CA', 'CH', 'CL', 'CM', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'ET', 'FI', 'FJ', 'FR', 'GB', 'GE', 'GH', 'GR', 'GT', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JM', 'JO', 'JP', 'KH', 'KR', 'KW', 'KZ', 'LB', 'LT', 'LU', 'LY', 'MA', 'MD', 'MO', 'MT', 'MU', 'MV', 'MX', 'MY', 'NA', 'NG', 'NL', 'NO', 'OM', 'PA', 'PE', 'PH', 'PL', 'PR', 'PS', 'PT', 'QA', 'RS', 'RU', 'SA', 'SD', 'SE', 'SG', 'SI', 'SV', 'SY', 'TM', 'TN', 'TR', 'TW', 'US', 'UY', 'YE', 'ZA']

print(nd.auto_complete('isa', n=3)) # very fast, can be used in a loop in realtime.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isaac', 'rank': 266}, {'name': 'Isa', 'rank': 450}]

print(nd.fuzzy_search('isablel', n=3)) # slow to compute.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isabela', 'rank': 1228}, {'name': 'Isabele', 'rank': 2386}]

nd.first_names
# Dictionary of all the first names with their attributes.

nd.last_names
# Dictionary of all the last names with their attributes.
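
The tip above about initializing the dataset at startup can be sketched as a small load-once holder. This is an illustrative pattern, not part of the library: LazyLoader and fake_loader are hypothetical names, and the stand-in loader below avoids loading the real dataset. In an application, the loader would presumably be NameDataset itself.

```python
import threading


class LazyLoader:
    """Build an expensive object once and reuse it (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._value = None
        self._loaded = False

    def get(self):
        # Double-checked locking: only the first caller pays the load cost.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


# Stand-in for the slow constructor. With the real library this would be
# LazyLoader(NameDataset), with get() called once during application startup.
calls = []


def fake_loader():
    calls.append(1)
    return {"loaded": True}


dataset = LazyLoader(fake_loader)
first = dataset.get()
second = dataset.get()
print(first is second, len(calls))
```

Every later call to get() returns the same object, so the load cost is paid once per process.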

API

The search call provides information about:

country: The probability that the name belongs to a country. Only the top 10 countries matching the name are returned.
gender: The probability that the person is Male or Female.
rank: The rank of the name in its country. 1 means the most popular name.
NOTE: Results are split into first_name and last_name. Gender does not apply to last_name, so it is returned empty there.
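
As an illustration of reading this structure, the sketch below picks the most probable gender and country from a search result. The result dictionary is an abridged copy of the Walter output shown in the Usage section, so the example runs without loading the dataset. most_likely is a hypothetical helper, not part of the library, and it is not necessarily how NameWrapper.describe works.

```python
def most_likely(probabilities):
    """Return the key with the highest probability, or None if empty."""
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Abridged first_name/last_name entries from the search('Walter') output.
result = {
    "first_name": {
        "country": {"Italy": 0.295, "Peru": 0.185, "United States": 0.159},
        "gender": {"Female": 0.007, "Male": 0.993},
        "rank": {"Italy": 193, "Peru": 27, "United States": 317},
    },
    "last_name": {
        "country": {"Germany": 0.299, "France": 0.121, "United States": 0.172},
        "gender": {},  # gender does not apply to last names
        "rank": {"Germany": 39, "France": 625, "United States": 1210},
    },
}

first = result["first_name"]
gender = most_likely(first["gender"])
country = most_likely(first["country"])
last_gender = most_likely(result["last_name"]["gender"])
print(gender, country, last_gender)
```

Returning None for an empty dictionary keeps the last_name case from raising an error.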

The get_top_names call gives the most popular names:

n: The number of names to return that match the criteria. Default is 100.
gender: Filters on Male or Female. Default is None (both are returned).
use_first_names: Whether to return first names (True) or last names (False). Default is True.
country_alpha2: Filters on the country (e.g. GB is the United Kingdom). Default is None (all countries are returned).
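
The return value is nested by country code and then by gender key (M or F), as in the ES example output in the Usage section. A minimal sketch of flattening it into rows, using that output as sample data. flatten_top_names is a hypothetical helper, and treating list position as popularity order is an assumption.

```python
def flatten_top_names(top_names):
    """Turn {country: {gender: [names]}} into (country, gender, position, name) rows."""
    rows = []
    for country, by_gender in top_names.items():
        for gender, names in by_gender.items():
            for position, name in enumerate(names, start=1):
                rows.append((country, gender, position, name))
    return rows


# Sample copied from the get_top_names(n=5, country_alpha2='ES') output.
sample = {
    "ES": {
        "M": ["Jose", "Antonio", "Juan", "Manuel", "David"],
        "F": ["Maria", "Ana", "Carmen", "Laura", "Isabel"],
    }
}

rows = flatten_top_names(sample)
print(rows[0])
print(len(rows))
```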

The get_country_codes call returns the supported country codes (or full pycountry objects).

alpha_2: If True, returns only the 2-character country codes. Default is False.
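
One possible use of the code list is to validate user input before filtering by country. The sketch below assumes a list like the one returned by get_country_codes(alpha_2=True). SUPPORTED is a short excerpt of the output shown in the Usage section, and normalize_country_code is a hypothetical helper, not part of the library.

```python
SUPPORTED = ["AE", "DE", "ES", "FR", "GB", "US"]  # excerpt only


def normalize_country_code(code, supported=SUPPORTED):
    """Return the upper-cased alpha-2 code, or raise ValueError if unsupported."""
    normalized = code.strip().upper()
    if normalized not in supported:
        raise ValueError(f"Unsupported country code: {code!r}")
    return normalized


print(normalize_country_code(" gb "))
```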

Full dataset

The dataset is available here: name_dataset.zip (3.3GB).

The data contains 491,655,925 records from 106 countries.
The uncompressed version takes around 10GB on the disk.
Each country is in a separate CSV file.
A CSV file contains rows of this format: first_name,last_name,gender,country_code.
Each record is a real person.
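
Given the row format above, a country file can be streamed with the standard csv module rather than loaded whole. The sketch below assumes the files have no header row, are UTF-8 encoded, and list the fields in the stated order. The sample rows and their gender values are invented for illustration and are not taken from the dataset. count_first_names is a hypothetical helper.

```python
import csv
import io
from collections import Counter

FIELDS = ["first_name", "last_name", "gender", "country_code"]


def count_first_names(lines):
    """Count first-name occurrences from an iterable of CSV lines."""
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    counts = Counter()
    for row in reader:
        if row["first_name"]:  # skip rows with an empty first name
            counts[row["first_name"]] += 1
    return counts


# Invented sample rows. A real file would be opened with
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "Zoe,Smith,F,GB\n"
    "Walter,White,M,US\n"
    "Zoe,Jones,F,GB\n"
)
counts = count_first_names(sample)
print(counts.most_common(1))
```

Because the reader consumes one line at a time, memory use stays flat regardless of file size, apart from the counter itself.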

Ports

For Ruby see names_dataset.

License

This version was generated from the massive Facebook Leak (533M accounts).
Lists of names are not copyrightable, generally speaking, but if you want to be completely sure you should talk to a lawyer.

Countries

Afghanistan, Albania, Algeria, Angola, Argentina, Austria, Azerbaijan, Bahrain, Bangladesh, Belgium, Bolivia, Plurinational State of, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czechia, Denmark, Djibouti, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Fiji, Finland, France, Georgia, Germany, Ghana, Greece, Guatemala, Haiti, Honduras, Hong Kong, Hungary, Iceland, India, Indonesia, Iran, Islamic Republic of, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Korea, Republic of, Kuwait, Lebanon, Libya, Lithuania, Luxembourg, Macao, Malaysia, Maldives, Malta, Mauritius, Mexico, Moldova, Republic of, Morocco, Namibia, Netherlands, Nigeria, Norway, Oman, Palestine, State of, Panama, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Russian Federation, Saudi Arabia, Serbia, Singapore, Slovenia, South Africa, Spain, Sudan, Sweden, Switzerland, Syrian Arab Republic, Taiwan, Province of China, Tunisia, Turkey, Turkmenistan, United Arab Emirates, United Kingdom, United States, Uruguay, Yemen.

🇲🇹🇪🇬🇧🇴🇳🇦🇹🇳🇷🇸🇯🇲🇦🇷🇯🇵🇰🇿🇸🇦🇺🇸🇦🇪🇭🇺🇭🇰🇶🇦🇸🇬🇩🇪🇾🇪🇲🇾🇭🇹🇵🇷🇨🇳🇦🇴🇹🇼🇸🇩🇧🇭🇧🇪🇪🇹🇪🇪🇨🇴🇬🇷🇧🇷🇷🇺🇱🇾🇸🇻🇰🇼🇰🇷🇦🇱🇸🇾🇧🇫🇨🇿🇨🇦🇴🇲🇩🇰🇨🇱🇧🇩🇧🇼🇫🇯🇮🇶🇮🇪🇿🇦🇨🇷🇯🇴🇰🇭🇵🇪🇺🇾🇮🇷🇲🇩🇫🇷🇲🇴🇳🇱🇬🇭🇨🇾🇩🇿🇮🇹🇬🇧🇧🇮🇮🇳🇫🇮🇦🇫🇵🇭🇦🇿🇬🇪🇨🇲🇮🇱🇪🇸🇱🇹🇩🇯🇬🇹🇱🇺🇵🇸🇹🇷🇵🇱🇮🇸🇳🇬🇵🇦🇭🇷🇸🇮🇭🇳🇦🇹🇲🇺🇸🇪🇲🇦🇨🇭🇧🇳🇲🇻🇳🇴🇪🇨🇮🇩🇧🇬🇵🇹🇲🇽🇱🇧🇹🇲

NOTE: It is unfortunately not possible to support more countries because the missing ones were not included in the original dataset.

Citation

@misc{NameDataset2021,
  author = {Philippe Remy},
  title = {Name Dataset},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/philipperemy/name-dataset}},
}

Project details

Release history

3.3.1 (this version): Apr 8, 2025
3.1.0: May 7, 2022
3.0.3: Apr 23, 2022
3.0.2: Jan 10, 2022
3.0.1: Jan 10, 2022
3.0.0: Jan 10, 2022
2.1.0: Oct 2, 2021
1.9.1: Apr 26, 2020
1.9.0: Jun 9, 2019
1.8.0: Jun 9, 2019
1.7.0: Jun 9, 2019
1.6.0: Jun 9, 2019
1.5.0: Apr 15, 2019
1.4.0: Apr 12, 2019
1.3.0: Feb 27, 2019
1.2.0: Oct 11, 2018
1.1.0: Oct 11, 2018
1.0.0: Oct 11, 2018

Download files

Source Distribution

names_dataset-3.3.1.tar.gz (55.7 MB)

Uploaded Apr 8, 2025 Source

Built Distribution

names_dataset-3.3.1-py3-none-any.whl (55.7 MB)

Uploaded Apr 8, 2025 Python 3

File details

Details for the file names_dataset-3.3.1.tar.gz.

File metadata

Download URL: names_dataset-3.3.1.tar.gz
Upload date: Apr 8, 2025
Size: 55.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for names_dataset-3.3.1.tar.gz
Algorithm	Hash digest
SHA256	`55f17d0fff976d69ac59d5181e3dd4c3601b18523ce876cbbd83f87f0ef7fd11`
MD5	`29b097ddcc212a2723212cec7f0e00e8`
BLAKE2b-256	`cad15f9d26f4090035e482f8971a9c3f6b04ed3fc92340c6bb0c349e408aa62b`


File details

Details for the file names_dataset-3.3.1-py3-none-any.whl.

File metadata

Download URL: names_dataset-3.3.1-py3-none-any.whl
Upload date: Apr 8, 2025
Size: 55.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for names_dataset-3.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9fb1b60a82de8bea004e7755b6133df0b4a029f93b9ee18bfa37f356f2b8c75e`
MD5	`fa28e6d4f9a26ed4512f3e8438ba0251`
BLAKE2b-256	`804d11c1373e5e21e03963d7b5d63181a073ee4ea0d213efc5cef36cd74aa69c`

