Russian names parser, gender identification and processing tools

# Russian Names

russiannames is a Python 3 library for parsing Russian first names, surnames and middle names (midnames), and for identifying a person's gender and the format in which a full name is written. It uses MongoDB as a backend to speed up name parsing.

## Documentation

Documentation is built automatically and is available at https://russiannames.readthedocs.org/en/latest/

## Installation

To install the library, run `pip install russiannames`, or run `python setup.py install` from a source checkout.

To use the database you need a MongoDB instance. Download and unpack the db_dump_bson.zip file from https://github.com/datacoon/russiannames/blob/master/data/bson/db_dump_bson.zip, then use the mongorestore command to restore the names database with its 3 collections: names, surnames and midnames.
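After restoring the dump, it can help to confirm that all three collections are actually present. The sketch below shows one way to do that with pymongo. The connection URI and the database name are assumptions (this README does not state them), and `missing_collections` and `check_database` are hypothetical helper names, not part of russiannames.

```python
# Collections the README says the dump contains.
EXPECTED_COLLECTIONS = ("names", "surnames", "midnames")


def missing_collections(existing):
    """Return the expected collections that are absent from `existing`."""
    present = set(existing)
    return [name for name in EXPECTED_COLLECTIONS if name not in present]


def check_database(db_name, uri="mongodb://localhost:27017"):
    """Report which expected collections are missing from `db_name`.

    Both `db_name` and `uri` are assumptions: use the values from your own setup.
    """
    # Imported lazily so missing_collections() works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return missing_collections(client[db_name].list_collection_names())
    finally:
        client.close()
```

An empty list from `check_database` means the restore produced all three collections.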

## Features

Database of names used for identification

  • 375449 surnames - collection: surnames

  • 32134 first names - collection: names

  • 48274 midnames - collection: midnames

Detailed database statistics by gender and collection

collection | total | males | females | universal or unidentified
--- | --- | --- | --- | ---
names | 32134 | 19297 | 8278 | 1196
midnames | 48274 | 30114 | 16143 | 0
surnames | 375274 | 124662 | 111534 | 38827

Supports 13 writing formats for Russian full names

Format | Example | Description
--- | --- | ---
f | Ольга | only first name
s | Петров | only surname
Fs | О. Сидорова | first letter of first name and full surname
sF | Николаев С. | full surname and first letter of first name
sf | Абрамов Семен | full surname and full first name
fs | Соня Камиуллина | full first name and full surname
fm | Иван Петрович | full first name and full middle name
SFM | М.Д.М. | first letters of surname, first name and middle name
FMs | А.Н. Егорова | first letters of first and middle names and full surname
sFM | Николаенко С.П. | full surname and first letters of first and middle names
sfM | Петракова Зинаида М. | full surname, full first name and first letter of middle name
sfm | Казаков Ринат Артурович | full name as surname, first name and middle name
fms | Светлана Архиповна Волкова | full name as first name, middle name and surname
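The format codes in the table follow a simple letter scheme: `f`, `s` and `m` stand for first name, surname and middle name, a lowercase letter means the part is written in full, and an uppercase letter means only its initial is given. The sketch below only illustrates that notation; it is not the library's parser, and `PARTS` and `describe_format` are hypothetical names.

```python
# Letter -> name part, following the format table above.
PARTS = {"f": "first name", "s": "surname", "m": "middle name"}


def describe_format(code):
    """Expand a format code such as 'sFM' into a list of readable parts.

    A lowercase letter stands for the full word, an uppercase letter for an initial.
    """
    described = []
    for letter in code:
        part = PARTS.get(letter.lower())
        if part is None:
            raise ValueError(f"unknown format letter: {letter!r}")
        kind = "initial of" if letter.isupper() else "full"
        described.append(f"{kind} {part}")
    return described


print(describe_format("sFM"))
# ['full surname', 'initial of first name', 'initial of middle name']
```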

Supports ethnic identification of names

9 ethnic types are supported for first names, surnames and middle names

key | name (en) | name (rus)
--- | --- | ---
arab | Arabic | Арабское
arm | Armenian | Армянское
geor | Georgian | Грузинское
germ | German | Немецкие
greek | Greek | Греческие
jew | Jewish | Еврейские
polsk | Polish | Польские
slav | Slavic (Russian) | Славянские
tur | Turkic | Тюркские (тюркоязычные)

## Limitations

  • very rare first names, surnames or middle names may not be parsed

  • ethnic identification is still at an early stage

## Speed optimization

  • preconfigured and preindexed MongoDB collections are used

## Usage and Examples

### Parse name and identify gender

Parses names and returns: format, surname, first name, middle name, parsed (True/False) and gender

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.parse('Нигматуллин Ринат Ахметович')
{'format': 'sfm', 'sn': 'Нигматуллин', 'fn': 'Ринат', 'mn': 'Ахметович', 'gender': 'm', 'text': 'Нигматуллин Ринат Ахметович', 'parsed': True}
>>> parser.parse('Петрова C.Я.')
{'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я', 'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}

The gender field can have one of the following values:

  • m: Male

  • f: Female

  • u: Unknown / unidentified

  • -: Impossible to identify
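Because the keys in the result depend on the detected format (`fn` and `mn` for full parts, `fn_s` and `mn_s` for initials in the examples above), code that consumes the result is safer reading it with `dict.get`. The sketch below works on a dict copied from the example output rather than calling the library, so it runs without MongoDB. `GENDER_LABELS` and `summarize` are hypothetical names, and the shape of an unparsed result is an assumption.

```python
# Readable labels for the gender codes listed above.
GENDER_LABELS = {
    "m": "male",
    "f": "female",
    "u": "unknown / unidentified",
    "-": "impossible to identify",
}


def summarize(result):
    """Build a one-line summary from a dict shaped like the parse() output."""
    # Assumption: an unparsed result still carries 'text' and a falsy 'parsed'.
    if not result.get("parsed"):
        return f"{result.get('text', '')}: not parsed"
    # Full parts use 'sn'/'fn'/'mn'; initials use 'fn_s'/'mn_s' (seen in the sFM example).
    parts = [
        result.get("sn"),
        result.get("fn") or result.get("fn_s"),
        result.get("mn") or result.get("mn_s"),
    ]
    name = " ".join(p for p in parts if p)
    gender = GENDER_LABELS.get(result.get("gender"), "unrecognized gender code")
    return f"{name} ({result['format']}): {gender}"


sample = {'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я',
          'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}
print(summarize(sample))
```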

### Ethnic identification (experimental)

Takes a surname, first name and middle name and tries to identify the person's ethnic affiliation

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.classify('Нигматуллин', 'Ринат', 'Ахметович')
{'ethnics': ['tur'], 'gender': 'm'}
>>> parser.classify('Алексеева', 'Ольга', 'Ивановна')
{'ethnics': ['slav'], 'gender': 'f'}
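The `ethnics` list holds the short keys from the ethnic types table. A minimal sketch of turning them into English labels follows; it uses a dict copied from the example output instead of calling the library, and `ETHNIC_LABELS` and `ethnic_names` are hypothetical names introduced for illustration.

```python
# Key -> English label, taken from the ethnic types table.
ETHNIC_LABELS = {
    "arab": "Arabic",
    "arm": "Armenian",
    "geor": "Georgian",
    "germ": "German",
    "greek": "Greek",
    "jew": "Jewish",
    "polsk": "Polish",
    "slav": "Slavic (Russian)",
    "tur": "Turkic",
}


def ethnic_names(classification):
    """Map the 'ethnics' keys of a classify()-style dict to English labels.

    Unknown keys are returned unchanged instead of raising an error.
    """
    return [ETHNIC_LABELS.get(key, key) for key in classification.get("ethnics", [])]


sample = {'ethnics': ['tur'], 'gender': 'm'}
print(ethnic_names(sample))
# ['Turkic']
```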

## Supported languages

  • Russian

## Requirements

  • pymongo

  • click

## Acknowledgements
