Deterministic classifier for personal names

onomaspy

  usage: onomaspy [-h] [-v] [-F FILE] [-o tagged|csv|padded] [-t TRANSLITERATE]
                  [-u UNKNOWN_AS_FAMILY] [-b BREAK_FULL_NAMES]
                  givens_file families_file

  Deterministic classifier for personal names

  positional arguments:
    givens_file           givens filename
    families_file         families filename

  optional arguments:
    -h, --help            show this help message and exit
    -v, --version         show program's version number and exit
    -F FILE, --file FILE  input filename
    -o tagged|csv|padded, --output-format tagged|csv|padded
                          output format. `tagged` by default
    -t TRANSLITERATE, --transliterate TRANSLITERATE
                          transliterate names
    -u UNKNOWN_AS_FAMILY, --unknown-as-family UNKNOWN_AS_FAMILY
                          Treat unknown names as family names
    -b BREAK_FULL_NAMES, --break-full-names BREAK_FULL_NAMES
                          Force split of ambiguous full names

TL;DR

onomastic is an algorithm for classifying personal names deterministically, using lists of given names and family names. onomastic tries to minimize misclassifications, and does not make inferences about ambiguous personal names unless forced to do so.

Sample usage:

$ echo "Bulgarelli Manfroni Franco Leonardo" | onomastic --givens test/data/givens.txt  --families test/data/families.txt --bonus given  -utb
GivenAndFamily:Franco Leonardo,Bulgarelli Manfroni

It is available in both Python and Haskell implementations, and can be used as a library or as a command-line tool.

The problem

Human names, also known as personal names, are usually composed of several individual names (first name, middle name and last name, for example), which can be grouped into two main sections: given names and family names.

There is no single proper way of writing a personal name; the order of the individual names may even vary from culture to culture. Two common formats are display order, where the given names come first, and sort order, where the family names come first and are usually followed by a comma.

Thanks to those common and easily recognizable formats, it is simple to analyse a string that represents a name and infer which parts correspond to the given names and which to the family names. For example, in "Bulgarelli, Franco" the comma shows that the family name comes first.

Many good packages perform this task effectively using parsers or regular expressions.

However, people do not always follow those conventions when they enter personal names manually. It is common to deal with lists in which the order varies from entry to entry, the comma is missing and capitalization is inconsistent.

In such situations, format-based algorithms will not solve the problem. You need something that actually understands individual names. Although you could use a machine learning algorithm (see this article), misclassifying personal names can be a sensitive matter. Also, obtaining a large list of real personal names can be troublesome.

For these reasons, a deterministic algorithm that only requires datasets of given names and family names, rather than complete personal names, is a better approach.

The solution

onomastic classifies personal names using lists of given names and family names, which can be obtained from different sources depending on your country or location. onomastic is designed to classify names only when they are not ambiguous, but this restriction can be relaxed using different flags.
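The idea of classifying only unambiguous names can be illustrated with a minimal Python sketch. This is not onomastic's actual implementation: the function name `classify`, the use of in-memory sets, and the rule that a name is rejected as soon as one token is unknown or appears in both lists are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Set, Tuple


def classify(
    name: str, givens: Set[str], families: Set[str]
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (given_names, family_names), or None when the name is ambiguous.

    Hypothetical helper: `givens` and `families` are sets of case-folded names.
    """
    tokens = name.split()
    labels = []
    for token in tokens:
        key = token.casefold()
        is_given = key in givens
        is_family = key in families
        # A token found in both lists, or in neither, cannot be decided.
        if is_given == is_family:
            return None
        labels.append("G" if is_given else "F")

    pattern = "".join(labels)
    # Display order: all given names first, then all family names.
    if re.fullmatch(r"G+F+", pattern):
        cut = pattern.index("F")
        return tokens[:cut], tokens[cut:]
    # Sort order: all family names first, then all given names.
    if re.fullmatch(r"F+G+", pattern):
        cut = pattern.index("G")
        return tokens[cut:], tokens[:cut]
    # Interleaved or single-group names are left unclassified.
    return None


givens = {"franco", "leonardo"}
families = {"bulgarelli", "manfroni"}
print(classify("Franco Bulgarelli", givens, families))
# (['Franco'], ['Bulgarelli'])
print(classify("Bulgarelli Manfroni Franco Leonardo", givens, families))
# (['Franco', 'Leonardo'], ['Bulgarelli', 'Manfroni'])
```

Returning `None` instead of guessing mirrors the stated design goal of minimizing misclassifications; a real implementation would relax this under the flags described in the next section.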

Usage

Available options

-g, --givens

onomastic needs a list of given names, one name per line. The givens file content should look like the following:

Alvesio
Alvia
Alvida
Alvin
Alvina
Alvis
Alvita
Alys
Alysa
Alyson
Alyssa
Alzena
Amabel
Amabelia
Amabelio
Amable
Amada
Amadeo

This is a required option.
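If you want to work with such a file from your own Python code, a minimal sketch for reading it could look like the following. The helper name `load_names` is hypothetical and is not part of the package's API; the case-folding step is also an assumption, added so that lookups ignore capitalization.

```python
from pathlib import Path
from typing import Set


def load_names(path: str) -> Set[str]:
    """Read a one-name-per-line file into a set of case-folded names.

    Hypothetical helper, not provided by the package.
    """
    names = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name:  # skip blank lines
            names.add(name.casefold())
    return names
```

The same helper works for the families file, since both files share the one-name-per-line layout.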

-f,--families

onomastic needs a list of family names (also known as surnames), one name per line. The families file content should look like the following:

Acasio
Accino
Accorinti
Acebo
Aceituno
Acero
Aceval
Acevedo
Aceñero
Acha
Achacata
Achaval
Achemon
Achilli
Achucarro

This is a required option.

-F,--file

Input filename

-o,--output-format

Output format {tagged|csv|padded}. Default is 'tagged'

-B,--bonus BONUS

Try to maximize length of a name group. Options are {no|given|family}. Default is 'no'
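One way to picture the bonus is as a tie-breaker between several candidate splits of the same name. The sketch below is an assumption about the idea, not the tool's real code: the `pick_split` function, the `Split` type, and the choice to keep the first candidate when the bonus is `no` are all hypothetical.

```python
from typing import List, Sequence, Tuple

# (given names, family names)
Split = Tuple[List[str], List[str]]


def pick_split(candidates: Sequence[Split], bonus: str = "no") -> Split:
    """Choose one candidate split according to the bonus setting."""
    if not candidates:
        raise ValueError("no candidate splits")
    if bonus == "given":
        # Prefer the split with the longest given-names group.
        return max(candidates, key=lambda split: len(split[0]))
    if bonus == "family":
        # Prefer the split with the longest family-names group.
        return max(candidates, key=lambda split: len(split[1]))
    if bonus == "no":
        # Arbitrary choice made for this sketch: keep the first candidate.
        return candidates[0]
    raise ValueError(f"unknown bonus: {bonus!r}")


candidates = [
    (["Leonardo"], ["Bulgarelli", "Manfroni", "Franco"]),
    (["Franco", "Leonardo"], ["Bulgarelli", "Manfroni"]),
]
print(pick_split(candidates, "given"))
# (['Franco', 'Leonardo'], ['Bulgarelli', 'Manfroni'])
print(pick_split(candidates, "family"))
# (['Leonardo'], ['Bulgarelli', 'Manfroni', 'Franco'])
```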

-t,--transliterate

Transliterate names before classifying them
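The exact transliteration rules are not described here. As a rough illustration only, one common approach is to strip diacritics using Unicode decomposition, so that a name such as Aceñero matches an unaccented list entry. The function below is a sketch under that assumption and may differ from what the tool actually does.

```python
import unicodedata


def transliterate(name: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(transliterate("Aceñero"))  # Acenero
print(transliterate("María"))    # Maria
```

Note that this approach only handles accented Latin letters; it does not convert between scripts.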

-u,--unknown-as-family

Treat unknown names as family names

-b,--break-full-names

Force split of ambiguous full names

Examples

$ echo "Franco Bulgarelli" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Franco,Bulgarelli

$ echo "Franco Bulgarelli Manfroni" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Franco,Bulgarelli Manfroni

# some ambiguous splits will be prevented by default
$ echo "Bulgarelli Manfroni Franco Leonardo" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
FullName:Bulgarelli Manfroni Franco Leonardo

# however you can force them using the -b flag
$ echo "Bulgarelli Manfroni Franco Leonardo" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -utb
GivenAndFamily:Leonardo,Bulgarelli Manfroni Franco

$ echo "Feldfeber Kivelski Ivana" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Ivana,Feldfeber Kivelski

$ echo "Julian Berbel Alt" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Julian,Berbel Alt

$ echo "BERBEL ALT julian" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Julian,Berbel Alt

$ echo "Finzi Nadia Giselle" | onomastic --givens test/data/givens.txt  --families test/data/families.txt  -ut
GivenAndFamily:Nadia Giselle,Finzi

$ echo "Bulgarelli Manfroni Franco Leonardo" | onomastic --givens test/data/givens.txt  --families test/data/families.txt --bonus family  -utb
GivenAndFamily:Leonardo,Bulgarelli Manfroni Franco

$ echo "Bulgarelli Manfroni Franco Leonardo" | onomastic --givens test/data/givens.txt  --families test/data/families.txt --bonus given  -utb
GivenAndFamily:Franco Leonardo,Bulgarelli Manfroni
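If you consume the `tagged` output from another program, the lines shown above can be split into a tag and its comma-separated fields. The following is a minimal sketch that assumes every output line has the `Tag:field[,field]` shape seen in these examples; the `parse_tagged` helper is hypothetical, and only the `GivenAndFamily` and `FullName` tags appear in this document.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> Tuple[str, List[str]]:
    """Split a tagged output line into (tag, fields)."""
    tag, _, payload = line.strip().partition(":")
    return tag, payload.split(",")


print(parse_tagged("GivenAndFamily:Franco Leonardo,Bulgarelli Manfroni"))
# ('GivenAndFamily', ['Franco Leonardo', 'Bulgarelli Manfroni'])
print(parse_tagged("FullName:Bulgarelli Manfroni Franco Leonardo"))
# ('FullName', ['Bulgarelli Manfroni Franco Leonardo'])
```

For `GivenAndFamily` lines, the first field holds the given names and the second holds the family names, as in the examples above.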

Caveats and future work

onomastic is far from perfect. It does not currently deal with:

  • titles, initials and nicknames
  • gender
  • compound names like Juan Cruz or María de los Angeles

Development

Installation

TL;DR setup:

$ ./devinit

Basic setup:

# Create and activate a virtual env
$ python3 -m venv .venv
$ source .venv/bin/activate
# install the project
$ pip install -e .

Install testing dependencies:

$ pip install -e ".[testing]"

Install tox for full test, build and publish lifecycle:

$ pip install tox

Run tests

# basic, quick run
$ pytest
# run as part of standard tox lifecycle
$ tox

Publish project

# update package version in setup.cfg
# then run these commands:
$ git tag <version>
$ git push origin HEAD --tags
# clean project
$ tox -e clean
# build the package
$ tox -e build
# publish to test.pypi
$ tox -e publish
# publish to real pypi
$ tox -e publish -- --repository pypi
