Moscow toponym extractor for Russian texts

License
- OSI Approved :: MIT License
Natural Language
- Russian
Programming Language
- Python :: 3
Topic
- Text Processing

Project description

Moscow Toponym Extractor

This module extracts Moscow toponyms from Russian texts using the Python libraries spaCy, Natasha, and pymorphy2.

Each extracted Moscow toponym is returned with the following attributes:

toponym - the toponym as it appears in the text, in its inflected form (e.g., Кремле)
lemmatized_toponym - the toponym in its base form (e.g., Кремль)
start_char - index of the first character (e.g., 79)
stop_char - index just past the last character (e.g., 85)

Installation

Install the package using pip:

pip install moscow-toponyms

Download the spaCy model ru_core_news_sm:

pip install https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.1.0/ru_core_news_sm-3.1.0.tar.gz
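
To confirm that both installation steps succeeded, a quick check along the following lines may help. This is a minimal sketch that uses only the standard library; `is_installed` is a hypothetical helper, and the check assumes that the spaCy model is importable under the name `ru_core_news_sm`, as pip-installed spaCy models normally are.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names assumed from the examples in this description.
for name in ("moscow_toponyms", "spacy", "ru_core_news_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any name reported as missing points to the pip command that needs to be re-run.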

Quick start

>>> from moscow_toponyms import QuickExtract
>>> text = "Однажды весною, в час небывало жаркого заката, в Москве, на Патриарших прудах, появились два гражданина."
>>> toponyms = QuickExtract(text)
>>> toponyms.extract()
[{'toponym': 'Патриарших прудах',
  'lemmatized_toponym': 'Патриаршие пруды',
  'start_char': 60,
  'stop_char': 77}]
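
The character indices in the result can be used to slice the original text. The sketch below does not call the package; it copies the record from the output above and shows that `stop_char` behaves as an exclusive end index, so that an ordinary Python slice recovers the inflected form.

```python
text = ("Однажды весною, в час небывало жаркого заката, в Москве, "
        "на Патриарших прудах, появились два гражданина.")

# Record copied from the Quick start output, not produced by running the package.
record = {
    "toponym": "Патриарших прудах",
    "lemmatized_toponym": "Патриаршие пруды",
    "start_char": 60,
    "stop_char": 77,
}

# stop_char points just past the last character, so a plain slice works.
span = text[record["start_char"]:record["stop_char"]]
print(span == record["toponym"])
```

The same slice can be used to highlight or replace the toponym in the source text.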

Usage

>>> from moscow_toponyms import ExtractMosToponyms
>>> text = "Однажды весною, в час небывало жаркого заката, в Москве, на Патриарших прудах, появились два гражданина."
>>> extract_toponyms = ExtractMosToponyms(text)

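Extract toponyms and their positions in the text with spaCy, then lemmatize them with pymorphy2. The method returns a tuple of two dicts keyed by start character index: toponyms first, personal names second. (The sample outputs in this section appear to come from a different input text, one that mentions Саша Панкратов and Смоленская площадь, not from the sentence above.)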

>>> spacy_extracted = extract_toponyms.spacy_extract()
>>> print(spacy_extracted)
({51: 'смоленский площадь'}, {0: 'саша панкратов'})
>>> spacy_dict = spacy_extracted[0]
>>> spacy_names = spacy_extracted[1]

Extract toponyms and their positions in the text with Natasha. Each toponym entry holds the inflected form, the base form, and the end character index:

>>> natasha_extractor = extract_toponyms.natasha_extract()
>>> print(natasha_extractor)
({51: ['Смоленской площади', 'Смоленская площадь', 69]}, {0: 'Саша Панкратов'})
>>> natasha_dict = natasha_extractor[0]
>>> natasha_names = natasha_extractor[1]

Merge the extracted personal names into the existing blacklist for cleaner output:

>>> black_list = extract_toponyms.merging_blacklists(spacy_names, natasha_names)

Filter all extracted toponyms and return only Moscow toponyms in inflected and base forms, with their start and end character indices:

>>> final_results = extract_toponyms.inner_merging_filtering(black_list, spacy_dict, natasha_dict)
>>> print(final_results)
[{'toponym': 'Смоленской площади', 'lemmatized_toponym': 'Смоленская площадь', 'start_char': 51, 'stop_char': 69}]
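
To make the data shapes in this walkthrough concrete, here is a minimal sketch of how two dicts keyed by start index could be combined into the final list of records. It is not the package's implementation: `merge_by_start` is a hypothetical helper, the blacklist is assumed to be a set of lowercase strings, and the policy (keep a Natasha entry only if spaCy found a toponym at the same start index) is just one possible choice.

```python
from typing import Dict, List, Set, Union

Record = Dict[str, Union[str, int]]


def merge_by_start(
    spacy_dict: Dict[int, str],
    natasha_dict: Dict[int, List[Union[str, int]]],
    black_list: Set[str],
) -> List[Record]:
    """Combine two extractor outputs keyed by start index (hypothetical policy)."""
    results: List[Record] = []
    for start in sorted(natasha_dict):
        toponym, lemma, stop = natasha_dict[start]
        # Assumed policy: require agreement between the two extractors.
        if start not in spacy_dict:
            continue
        # Assumed blacklist format: lowercase base forms to exclude.
        if str(lemma).lower() in black_list:
            continue
        results.append({
            "toponym": toponym,
            "lemmatized_toponym": lemma,
            "start_char": start,
            "stop_char": stop,
        })
    return results


# Inputs copied from the sample outputs shown in this section.
spacy_dict = {51: "смоленский площадь"}
natasha_dict = {51: ["Смоленской площади", "Смоленская площадь", 69]}
merged = merge_by_start(spacy_dict, natasha_dict, black_list={"саша панкратов"})
print(merged)
```

Keying both dicts by start index is what makes the merge a simple lookup: two extractors that found the same span share the same key.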

Release history

Version 0.1.0, released Feb 28, 2023

Download files


Source Distribution

moscow_toponyms-0.1.0.tar.gz (269.8 kB)

Uploaded Feb 28, 2023 Source

File details

Details for the file moscow_toponyms-0.1.0.tar.gz.

File metadata

Download URL: moscow_toponyms-0.1.0.tar.gz
Upload date: Feb 28, 2023
Size: 269.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.8

File hashes

Hashes for moscow_toponyms-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d016b2614329afa90b8ed0ddf9d417fd3546ec7c83a848c29a02d7c40082266f`
MD5	`da696ac608a9066aa642a226d4865799`
BLAKE2b-256	`f23019170152f0f07b6fdf7a8c96e85b1c5e9654a142ad54ea0b8edb4cc51b67`

