Deterministic author-name parser: split a raw name string into title/first/middle/last/suffix/nickname, across Latin, CJK, Korean, Cyrillic, and Arabic scripts. From OpenAlex.
Project description
whatsername
Figure out what someone's name actually is.
whatsername is a deterministic Python parser that splits a raw author-name
string into structured components — title, first, middle, last, suffix,
nickname — across Latin, Chinese, Japanese, Korean, Cyrillic, and Arabic /
Persian scripts. It's the name parser extracted from the
OpenAlex author entity resolution pipeline.
It runs locally and fast (no network, no model weights), and every parse comes with a confidence level (high, medium, or low) so you can route the hard cases to a human or an LLM.
from whatsername import parse_name
parse_name("John Maynard Smith")
# {'first': 'john', 'middle': 'maynard', 'last': 'smith', 'confidence': 'medium', ...}
# (3 tokens, no surname prefix -> the boundary is a guess, hence 'medium')
parse_name("Smith, John M.") # comma form -> surname is explicit
# {'first': 'john', 'middle': 'm.', 'last': 'smith', 'confidence': 'high', ...}
parse_name("John M. Harris Jr MD") # post-nominal credentials removed
# {'first': 'john', 'middle': 'm.', 'last': 'harris', 'suffix': 'jr', ...}
parse_name("张伟") # Chinese: surname-first
# {'first': 'wei', 'last': 'zhang', 'confidence': 'medium', ...}
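The surname-first split in the Chinese example can be pictured as a longest-prefix lookup against a surname table. The sketch below is only an illustration of that idea, not the library's code: the `SURNAMES` set and the `split_surname_first` helper are hypothetical, the table is a tiny made-up sample, and romanization to pinyin is left out entirely.

```python
# Hypothetical, tiny surname table for illustration only; the real
# gazetteers shipped with the package are far larger.
SURNAMES = {"张", "王", "李", "欧阳", "司马"}


def split_surname_first(name):
    """Split a surname-first Han-character name into (surname, given).

    Tries a two-character (compound) surname before a one-character one.
    Returns None when no known surname prefix leaves a non-empty given name.
    """
    for length in (2, 1):
        candidate = name[:length]
        if candidate in SURNAMES and len(name) > length:
            return candidate, name[length:]
    return None


print(split_surname_first("张伟"))
print(split_surname_first("欧阳修"))
```

Trying the longer prefix first keeps compound surnames such as 欧阳 from being cut after the first character.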
Install
pip install whatsername
For better romanization of Japanese, Korean, and Cyrillic names, install the optional extra (the parser still works without it, at lower confidence):
pip install "whatsername[cjk]"
What you get back
parse_name(s) returns a dict. All string values are lowercase ASCII (diacritics
removed, apostrophes dropped, hyphens and initials' periods preserved), or None.
| key | example | notes |
|---|---|---|
| title | dr. | recognized academic/professional titles |
| first | john | given name |
| middle | maynard | middle name(s) / patronymic |
| last | smith | family name (compound prefixes like van der kept) |
| suffix | jr., phd | generational + post-nominal credentials |
| nickname | jack | text found in (...) / [...] |
| confidence | high | high, medium, or low; threshold on this |
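To make the "lowercase ASCII" rule concrete, here is a minimal sketch of that kind of normalization for Latin-script input, using only the standard library. It is an assumption about the general approach, not the package's implementation: `normalize_component` is a hypothetical helper, and non-Latin scripts (which the package transliterates) are out of scope here.

```python
import unicodedata


def normalize_component(value):
    """Lowercase a Latin-script name part and reduce it to ASCII.

    Diacritics are removed and apostrophes dropped; hyphens and periods
    pass through unchanged. Hypothetical helper, not part of whatsername.
    """
    if value is None:
        return None
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop both the ASCII apostrophe and the typographic one (U+2019).
    no_apostrophes = no_marks.replace("'", "").replace("\u2019", "")
    # Anything still outside ASCII (for example 'ø') is simply discarded here.
    ascii_only = no_apostrophes.encode("ascii", "ignore").decode("ascii")
    # Assumption: an empty result is reported as None rather than "".
    return ascii_only.lower() or None


for raw in ["José", "O'Brien", "Jean-Luc", "M."]:
    print(raw, "->", normalize_component(raw))
```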
The confidence field is what makes routing possible:
- high: comma-delimited (Last, First), a clear compound surname prefix, or a simple 1–2 token Latin name.
- medium: 3+ token Latin with no prefix (surname boundary is a guess), or a CJK/Korean name resolved via the surname tables.
- low: Arabic/Persian (short vowels aren't written), ambiguous CJK, or Cyrillic transliteration. Good candidates to send to an LLM.
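One way to act on these levels is a small dispatch table. The sketch below works on plain dicts shaped like the `parse_name` output shown earlier, so it runs without the package installed. The `ROUTES` mapping and the queue names are assumptions for illustration; the policy itself is yours to choose.

```python
# Hypothetical routing policy: the queue names are made up for this example.
ROUTES = {
    "high": "accept",        # use the parse as-is
    "medium": "spot_check",  # sample for manual review
    "low": "escalate",       # send to a human or an LLM
}


def route(parsed):
    """Return the queue for one parse result based on its confidence level.

    Missing or unrecognized confidence values are escalated, which is the
    conservative choice.
    """
    return ROUTES.get(parsed.get("confidence"), "escalate")


# Dicts shaped like the README examples (other keys omitted).
examples = [
    {"first": "john", "middle": "m.", "last": "smith", "confidence": "high"},
    {"first": "wei", "last": "zhang", "confidence": "medium"},
    {"first": "ali", "last": "hassan", "confidence": "low"},
]
for parsed in examples:
    print(parsed["last"], "->", route(parsed))
```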
The OpenAlex-internal form
parse_human_name(s) returns the exact 6-field form OpenAlex uses for author
matching: empty strings instead of None, surname particles stripped from
last (so "de Oliveira" and "Oliveira" match), and a
nameparser HumanName fallback for
low-confidence Latin names.
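The difference between the two forms can be sketched as a post-processing step over a `parse_name`-style dict. This is a rough illustration under stated assumptions, not the actual `parse_human_name` logic: `to_matching_form` and the `PARTICLES` set are hypothetical, the particle list is a small sample, and the nameparser fallback is not modeled.

```python
# Hypothetical sample of surname particles; the real list is not shown
# in this README.
PARTICLES = {"de", "da", "do", "dos", "van", "der", "den", "von", "di", "la", "le"}
FIELDS = ("title", "first", "middle", "last", "suffix", "nickname")


def to_matching_form(parsed):
    """Convert a parse_name-style dict into a 6-field, matching-oriented dict.

    None becomes "", and leading particles are stripped from the surname
    while at least one token remains.
    """
    out = {field: (parsed.get(field) or "") for field in FIELDS}
    tokens = out["last"].split()
    while len(tokens) > 1 and tokens[0] in PARTICLES:
        tokens.pop(0)
    out["last"] = " ".join(tokens)
    return out


a = to_matching_form({"first": "joao", "middle": None, "last": "de oliveira"})
b = to_matching_form({"first": "joao", "middle": None, "last": "oliveira"})
print(a["last"] == b["last"])
```

Stopping while one token remains means a surname that is itself a particle word is never reduced to an empty string.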
Accuracy
Benchmarked against the public human-name-parser-gold-standard (15,309 OpenAlex author names):
| metric | accuracy |
|---|---|
| Full match (all 6 fields exact) | 88.8% |
| last (family name) | 90.6% |
| last (surname particles stripped, i.e. matching-relevant) | 91.4% |
| first | 94.3% |
| middle | 94.7% |
| title / suffix / nickname | ≥99.5% |
Run it yourself:
pip install pytest
pytest tests/test_benchmark.py -s
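For readers who want to see how the two kinds of metric relate, here is a minimal scoring sketch: per-field accuracy counts each field independently, while full match requires all six fields to agree at once. It is not the project's benchmark harness (that lives in tests/test_benchmark.py); the `score` function is hypothetical and the two sample pairs are made up purely to exercise it.

```python
FIELDS = ("title", "first", "middle", "last", "suffix", "nickname")


def score(pairs):
    """Compute per-field and full-match accuracy.

    pairs: iterable of (predicted, gold) dicts keyed by the six fields.
    Missing keys compare as None.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    field_hits = {field: 0 for field in FIELDS}
    full_hits = 0
    for predicted, gold in pairs:
        hits = [predicted.get(field) == gold.get(field) for field in FIELDS]
        for field, hit in zip(FIELDS, hits):
            field_hits[field] += hit
        full_hits += all(hits)
    report = {field: field_hits[field] / len(pairs) for field in FIELDS}
    report["full_match"] = full_hits / len(pairs)
    return report


# Made-up toy data, not drawn from the gold standard.
toy_pairs = [
    ({"first": "john", "last": "smith"}, {"first": "john", "last": "smith"}),
    ({"first": "ana", "last": "de oliveira"}, {"first": "ana", "last": "oliveira"}),
]
report = score(toy_pairs)
print(report["full_match"], report["last"], report["first"])
```

Because one wrong field fails the whole record, full-match accuracy can never exceed the lowest per-field accuracy.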
The largest error sources are inherent and hard: compound-surname boundaries,
name-order disambiguation in romanized CJK names, and unusual scripts. That's
what the confidence field is for.
About the benchmark. The gold standard is LLM-annotated — each name was parsed by Claude Opus 4.6, not labeled by hand. It's a strong reference set but not infallible, especially on the same hard cases the parser struggles with (compound surnames, CJK name order, rare scripts). Treat the accuracy numbers as indicative, not gospel.
Why is this GPL-licensed?
whatsername depends on Unidecode for
transliteration (Chinese pinyin, Cyrillic, Arabic fallback), and Unidecode is
licensed under the GPL. A project that depends on a GPL library must itself
be GPL-compatible, so whatsername is released under GPL-3.0-or-later.
If you need a permissively licensed name parser, you'll want a different library
(or one that doesn't transliterate non-Latin scripts).
Credits
Built by OpenAlex / OurResearch as part of the author entity resolution (AER) work. The deterministic parser and its surname gazetteers come from the OpenAlex data pipeline; the benchmark was generated with Claude Opus 4.6. See the gold standard repo for the labeling protocol and reproduction harness.
License
GPL-3.0-or-later. See LICENSE.
Download files
File details
Details for the file whatsername-0.1.0.tar.gz.
File metadata
- Download URL: whatsername-0.1.0.tar.gz
- Upload date:
- Size: 782.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b79b5cf4ff5a9342efc4b78a2aae4dbd16e159fe61fef75f5f543429f2037935 |
| MD5 | cddb4c493cd5bce8f781b0bb1957ef84 |
| BLAKE2b-256 | fcf7f7501f3f8f8552284c98f44f891f4f2f7e5d82385eb2796b48096b94700d |
File details
Details for the file whatsername-0.1.0-py3-none-any.whl.
File metadata
- Download URL: whatsername-0.1.0-py3-none-any.whl
- Upload date:
- Size: 48.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eea9b849a9233aae7249170996734243b6c194105820625cc28583bbe3384deb |
| MD5 | 05b3ff52fb074a2c9423b273f7d1692e |
| BLAKE2b-256 | 46e26055f86c4c4880f2cc1e03296584caca6fe218c18cd8887f9a189358ea8d |