Russian names parser, gender identification and processing tools
# Russian Names
russiannames is a Python 3 library for parsing Russian first names, surnames and middle names, and for identifying a person's gender and the format in which a full name is written. It uses MongoDB as a backend to speed up name parsing.
## Documentation
Documentation is built automatically and is available at https://russiannames.readthedocs.org/en/latest/
## Installation
To install the Python library, run pip install russiannames, or run python setup.py install from a source checkout.

To use the database you need a MongoDB instance. Download and unpack the db_dump_bson.zip file from https://github.com/datacoon/russiannames/blob/master/data/bson/db_dump_bson.zip
and use the mongorestore command to restore the names database, which contains 3 collections: names, surnames and midnames.
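After restoring the dump, it can help to confirm that the three collections are actually present. The sketch below is not part of russiannames. It assumes a MongoDB instance reachable at the default local URI, and the database name is left to the caller because this README does not state it. `missing_collections` and `check_database` are hypothetical helper names introduced only for illustration.

```python
# Hypothetical post-restore sanity check; not part of the russiannames API.
EXPECTED_COLLECTIONS = ("names", "surnames", "midnames")


def missing_collections(existing):
    """Return the expected collections that are absent from `existing`."""
    present = set(existing)
    return [name for name in EXPECTED_COLLECTIONS if name not in present]


def check_database(db_name, uri="mongodb://localhost:27017"):
    """Connect to MongoDB and list which expected collections are missing.

    `db_name` and `uri` are assumptions: pass the values used for your restore.
    """
    # Imported lazily so missing_collections() works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return missing_collections(client[db_name].list_collection_names())
    finally:
        client.close()
```

An empty list returned by `check_database` would suggest that the restore created all three collections.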
## Features
* Database of names used for identification:
    * 375449 surnames (collection: surnames)
    * 32134 first names (collection: names)
    * 48274 middle names (collection: midnames)
* Detailed database statistics by gender and collection
* Supports 12 formats of writing Russian full names
* Supports ethnic identification of names: 9 ethnic types are supported in first names, surnames and middle names
## Limitations
* Very rare first names, surnames or middle names may not be parsed
* Ethnic identification is still at an early stage
## Speed optimization
* Preconfigured and pre-indexed MongoDB collections are used
## Usage and Examples
### Parse name and identify gender
Parses a full name and returns the detected format, surname, first name, middle name, the original text, a parsed flag (True/False) and gender.

    >>> from russiannames.parser import NamesParser
    >>> parser = NamesParser()
    >>> parser.parse('Нигматуллин Ринат Ахметович')
    {'format': 'sfm', 'sn': 'Нигматуллин', 'fn': 'Ринат', 'mn': 'Ахметович', 'gender': 'm', 'text': 'Нигматуллин Ринат Ахметович', 'parsed': True}
    >>> parser.parse('Петрова C.Я.')
    {'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я', 'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}

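The two results above use different keys: full names appear under `fn` and `mn`, while initials appear under `fn_s` and `mn_s`. A minimal sketch of handling both shapes is shown below. `display_name` is a hypothetical helper, not part of the library, and it only covers the keys visible in the example output; the other supported formats may use keys that are not handled here.

```python
def display_name(result):
    """Build a readable name from a parse result dict.

    Key names ('sn', 'fn', 'mn', 'fn_s', 'mn_s', 'parsed') are taken from the
    example output in this README; other formats may need extra handling.
    """
    if not result.get("parsed"):
        return None
    parts = [result.get("sn")]
    # Prefer full names; fall back to initials when only those were parsed.
    for full_key, short_key in (("fn", "fn_s"), ("mn", "mn_s")):
        if result.get(full_key):
            parts.append(result[full_key])
        elif result.get(short_key):
            parts.append(result[short_key] + ".")
    return " ".join(part for part in parts if part)
```

Returning `None` for unparsed input keeps the caller responsible for deciding what to do with names the parser could not handle.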
The gender field can have one of the following values:
* m: Male
* f: Female
* u: Unknown / unidentified
* -: Impossible to identify
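When displaying results, the one-character gender codes listed above can be mapped to their labels. The sketch below assumes the result is a dict with a `gender` key, as in the parse examples. `GENDER_LABELS` and `gender_label` are hypothetical names and are not provided by russiannames.

```python
# Labels copied from the list of gender values in this README.
GENDER_LABELS = {
    "m": "Male",
    "f": "Female",
    "u": "Unknown / unidentified",
    "-": "Impossible to identify",
}


def gender_label(result):
    """Return a human-readable label for the 'gender' field of a result dict."""
    code = result.get("gender")
    # Codes outside the documented set are reported rather than guessed at.
    return GENDER_LABELS.get(code, "Unrecognized code: {!r}".format(code))
```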
### Ethnic identification (experimental)
Parses the surname, first name and middle name and tries to identify the person's ethnic affiliation.

    >>> from russiannames.parser import NamesParser
    >>> parser = NamesParser()
    >>> parser.classify('Нигматуллин', 'Ринат', 'Ахметович')
    {'ethnics': ['tur'], 'gender': 'm'}
    >>> parser.classify('Алексеева', 'Ольга', 'Ивановна')
    {'ethnics': ['slav'], 'gender': 'f'}

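For a list of people, the `ethnics` values returned by `classify` can be tallied. The sketch below is illustrative only: `count_ethnics` is a hypothetical helper, and it assumes `classify` returns a dict shaped like the example output, with `ethnics` as a list. How the library behaves for names it cannot classify is not documented here, so a missing or empty result is simply skipped.

```python
from collections import Counter


def count_ethnics(parser, people):
    """Tally ethnic tags over an iterable of (surname, first, middle) tuples.

    `parser` is any object with a classify(surname, first_name, middle_name)
    method, such as a NamesParser instance.
    """
    counts = Counter()
    for surname, first_name, middle_name in people:
        result = parser.classify(surname, first_name, middle_name) or {}
        # Treat a missing 'ethnics' key the same as an empty list (assumption).
        counts.update(result.get("ethnics") or [])
    return counts
```

Passing the parser in as an argument keeps the helper usable with a stub object when no MongoDB instance is available.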
## Supported languages
* Russian

## Requirements
* pymongo
* click
## Acknowledgements