Deterministic author-name parser: split a raw name string into title/first/middle/last/suffix/nickname, across Latin, CJK, Korean, Cyrillic, and Arabic scripts. From OpenAlex.

Project description

whatsername

Figure out what someone's name actually is.

whatsername is a deterministic Python parser that splits a raw author-name string into structured components — title, first, middle, last, suffix, nickname — across Latin, Chinese, Japanese, Korean, Cyrillic, and Arabic / Persian scripts. It's the name parser extracted from the OpenAlex author entity resolution pipeline.

It runs entirely locally (no network, no model weights), so it is fast, and every parse comes with a confidence label (high, medium, or low) so you can route the hard cases to a human or an LLM.

from whatsername import parse_name

parse_name("John Maynard Smith")
# {'first': 'john', 'middle': 'maynard', 'last': 'smith', 'confidence': 'medium', ...}
# (3 tokens, no surname prefix -> the boundary is a guess, hence 'medium')

parse_name("Smith, John M.")             # comma form -> surname is explicit
# {'first': 'john', 'middle': 'm.', 'last': 'smith', 'confidence': 'high', ...}

parse_name("John M. Harris Jr MD")       # post-nominal credentials removed
# {'first': 'john', 'middle': 'm.', 'last': 'harris', 'suffix': 'jr', ...}

parse_name("张伟")                        # Chinese: surname-first
# {'first': 'wei', 'last': 'zhang', 'confidence': 'medium', ...}

Install

pip install whatsername

For better romanization of Japanese, Korean, and Cyrillic names, install the optional extra (the parser still works without it, at lower confidence):

pip install "whatsername[cjk]"

What you get back

parse_name(s) returns a dict. All string values are lowercase ASCII (diacritics removed, apostrophes dropped, hyphens and initials' periods preserved), or None.

key         example    notes
title       dr.        recognized academic/professional titles
first       john       given name
middle      maynard    middle name(s) / patronymic
last        smith      family name (compound prefixes like van der kept)
suffix      jr., phd   generational + post-nominal credentials
nickname    jack       text found in (...) / [...]
confidence  high       high | medium | low; threshold on this
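
The output convention described above (lowercase ASCII, diacritics removed, apostrophes dropped, hyphens and periods kept) can be approximated for Latin-script input with the standard library alone. This is only an illustrative sketch: `normalize_component` is a hypothetical helper, not part of whatsername, and it does no transliteration of non-Latin scripts. Returning None for an empty result is an assumption made here.

```python
import unicodedata


def normalize_component(value):
    """Approximate the output convention: lowercase ASCII, or None.

    Hypothetical helper for illustration; not whatsername's own code.
    """
    if value is None:
        return None
    # Decompose accented characters so each accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop combining marks (the diacritics) and straight or curly apostrophes.
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in "'\u2019"
    ]
    # Discard anything still outside ASCII; hyphens and periods pass through.
    ascii_text = "".join(kept).encode("ascii", "ignore").decode("ascii")
    result = ascii_text.lower().strip()
    # Assumption: an empty result is reported as None rather than "".
    return result or None


print(normalize_component("José"))      # jose
print(normalize_component("O'Brien"))   # obrien
print(normalize_component("Jean-Luc"))  # jean-luc
```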

The confidence field is what makes routing possible:

  • high — comma-delimited (Last, First), a clear compound surname prefix, or a simple 1–2 token Latin name.
  • medium — 3+ token Latin with no prefix (surname boundary is a guess), or a CJK/Korean name resolved via the surname tables.
  • low — Arabic/Persian (short vowels aren't written), ambiguous CJK, or Cyrillic transliteration. Good candidates to send to an LLM.
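
The three levels above map naturally onto a small dispatch function. The sketch below assumes only that a parse result is a dict with a 'confidence' key, as in the README examples; the destination names (`accept`, `spot_check`, `llm_or_human_review`) are invented for illustration, and the sample dicts are hand-written so the code runs without whatsername installed.

```python
def route(parsed):
    """Pick a destination from the 'confidence' value of a parse result."""
    confidence = parsed.get("confidence")
    if confidence == "high":
        return "accept"
    if confidence == "medium":
        return "spot_check"
    # 'low', missing, or unexpected values all go to review.
    return "llm_or_human_review"


# Hand-written dicts shaped like parse_name output (not real parser output).
samples = [
    {"first": "john", "middle": "m.", "last": "smith", "confidence": "high"},
    {"first": "wei", "last": "zhang", "confidence": "medium"},
    {"first": None, "last": None, "confidence": "low"},
]

for parsed in samples:
    print(parsed["confidence"], "->", route(parsed))
```

Sending unknown values to review rather than raising is a deliberately conservative choice: a parse that cannot be classified is treated like a hard case.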

The OpenAlex-internal form

parse_human_name(s) returns the exact 6-field form OpenAlex uses for author matching: empty strings instead of None, surname particles stripped from last (so "de Oliveira" and "Oliveira" match), and a nameparser HumanName fallback for low-confidence Latin names.
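
To make the difference between the two forms concrete, here is a rough sketch of converting a parse_name-style dict into a matching-oriented form: None becomes an empty string, and leading particles are stripped from the surname. `to_matching_form` and the `PARTICLES` set are hypothetical and deliberately incomplete; the real parse_human_name uses its own rules and also has the nameparser fallback, which is not reproduced here.

```python
FIELDS = ("title", "first", "middle", "last", "suffix", "nickname")

# Hypothetical, deliberately short particle list for illustration only.
PARTICLES = {"de", "da", "do", "dos", "van", "der", "den", "von", "di", "del", "la", "le"}


def to_matching_form(parsed):
    """Return the six fields with "" instead of None and particles stripped from 'last'."""
    out = {field: parsed.get(field) or "" for field in FIELDS}
    tokens = out["last"].split()
    # Strip leading particles, but always keep at least one surname token.
    while len(tokens) > 1 and tokens[0] in PARTICLES:
        tokens.pop(0)
    out["last"] = " ".join(tokens)
    return out


a = to_matching_form({"first": "joao", "middle": None, "last": "de oliveira"})
b = to_matching_form({"first": "joao", "last": "oliveira"})
print(a["last"], b["last"], a == b)  # oliveira oliveira True
```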

Accuracy

Benchmarked against the public human-name-parser-gold-standard (15,309 OpenAlex author names):

metric                                                      accuracy
Full match (all 6 fields exact)                             88.8%
last (family name)                                          90.6%
last (surname particles stripped, i.e. matching-relevant)   91.4%
first                                                       94.3%
middle                                                      94.7%
title / suffix / nickname                                   ≥99.5%
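
For readers who want to see how metrics of this shape can be computed, the sketch below scores a list of predicted parses against gold parses, per field and for the full six-field match. It is not the project's benchmark harness: `field_accuracy` is a hypothetical function, and treating None and "" as equal is an assumption made here.

```python
FIELDS = ("title", "first", "middle", "last", "suffix", "nickname")


def field_accuracy(gold, predicted):
    """Return per-field and full-match accuracy as fractions in [0, 1]."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one record")
    per_field = {field: 0 for field in FIELDS}
    full = 0
    for g, p in zip(gold, predicted):
        # Assumption: a missing field, None, and "" all count as the same value.
        hits = {f: (g.get(f) or "") == (p.get(f) or "") for f in FIELDS}
        for field, hit in hits.items():
            per_field[field] += hit
        full += all(hits.values())
    total = len(gold)
    scores = {field: count / total for field, count in per_field.items()}
    scores["full_match"] = full / total
    return scores


# Toy data: the second prediction swaps given name and family name.
gold = [{"first": "john", "last": "smith"}, {"first": "wei", "last": "zhang"}]
pred = [{"first": "john", "last": "smith"}, {"first": "zhang", "last": "wei"}]
scores = field_accuracy(gold, pred)
print(scores["first"], scores["last"], scores["full_match"])  # 0.5 0.5 0.5
```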

Run it yourself:

pip install pytest
pytest tests/test_benchmark.py -s

The largest error sources are inherent and hard: compound-surname boundaries, name-order disambiguation in romanized CJK names, and unusual scripts. That's what the confidence field is for.

About the benchmark. The gold standard is LLM-annotated — each name was parsed by Claude Opus 4.6, not labeled by hand. It's a strong reference set but not infallible, especially on the same hard cases the parser struggles with (compound surnames, CJK name order, rare scripts). Treat the accuracy numbers as indicative, not gospel.

Why is this GPL-licensed?

whatsername depends on Unidecode for transliteration (Chinese pinyin, Cyrillic, Arabic fallback), and Unidecode is licensed under the GPL. Because a project that depends on a GPL library must itself be GPL-compatible, whatsername is released under GPL-3.0-or-later. If you need a permissively licensed name parser, you'll want a different library (or one that doesn't transliterate non-Latin scripts).

Credits

Built by OpenAlex / OurResearch as part of the author entity resolution (AER) work. The deterministic parser and its surname gazetteers come from the OpenAlex data pipeline; the benchmark was generated with Claude Opus 4.6. See the gold standard repo for the labeling protocol and reproduction harness.

License

GPL-3.0-or-later. See LICENSE.

Download files

Source Distribution

whatsername-0.1.0.tar.gz (782.9 kB)

Uploaded Source

Built Distribution

whatsername-0.1.0-py3-none-any.whl (48.7 kB)

Uploaded Python 3

File details

Details for the file whatsername-0.1.0.tar.gz.

File metadata

  • Download URL: whatsername-0.1.0.tar.gz
  • Upload date:
  • Size: 782.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for whatsername-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b79b5cf4ff5a9342efc4b78a2aae4dbd16e159fe61fef75f5f543429f2037935
MD5 cddb4c493cd5bce8f781b0bb1957ef84
BLAKE2b-256 fcf7f7501f3f8f8552284c98f44f891f4f2f7e5d82385eb2796b48096b94700d

File details

Details for the file whatsername-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: whatsername-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 48.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for whatsername-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 eea9b849a9233aae7249170996734243b6c194105820625cc28583bbe3384deb
MD5 05b3ff52fb074a2c9423b273f7d1692e
BLAKE2b-256 46e26055f86c4c4880f2cc1e03296584caca6fe218c18cd8887f9a189358ea8d
