Moscow toponym extractor for Russian texts

Moscow Toponym Extractor

This module extracts Moscow toponyms from Russian texts using the Python libraries SpaCy, Natasha, and PyMorphy2.

Attributes returned for each extracted Moscow toponym:

  • toponym - toponym in an inflected form (e.g., Кремле)
  • lemmatized_toponym - toponym in the base form (e.g., Кремль)
  • start_char - start character index (e.g., 79)
  • stop_char - end character index, exclusive (e.g., 85)
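
Judging by the examples in this description, stop_char is one past the last character of the toponym, so the two indices can be used directly as a Python slice. A minimal sketch that needs no library call; the sentence and the record are copied from the Quick start example, and the variable names are only illustrative:

```python
# Sentence and record copied from the Quick start example in this description.
text = ("Однажды весною, в час небывало жаркого заката, в Москве, "
        "на Патриарших прудах, появились два гражданина.")
record = {
    "toponym": "Патриарших прудах",
    "lemmatized_toponym": "Патриаршие пруды",
    "start_char": 60,
    "stop_char": 77,
}

# stop_char behaves like an exclusive end index, so a plain slice
# recovers the inflected form exactly as it appears in the text.
surface = text[record["start_char"]:record["stop_char"]]
print(surface == record["toponym"])
```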

Installation

  1. Install the package using pip:
pip install moscow-toponyms
  2. Download the ru_core_news_sm model for SpaCy:
pip install https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.1.0/ru_core_news_sm-3.1.0.tar.gz
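
After both commands finish, it can help to confirm that the packages are importable before running the extractor. The sketch below uses only the standard library. It assumes the package is importable as moscow_toponyms (as in the import statements in this description) and that the model is importable as ru_core_news_sm, which is how SpaCy model packages are normally installed; missing_modules is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Module names are assumptions based on the import statements and the model name.
required = ["moscow_toponyms", "ru_core_news_sm"]
missing = missing_modules(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules are installed.")
```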

Quick start

>>> from moscow_toponyms import QuickExtract
>>> text = "Однажды весною, в час небывало жаркого заката, в Москве, на Патриарших прудах, появились два гражданина."
>>> toponyms = QuickExtract(text)
>>> toponyms.extract()
[{'toponym': 'Патриарших прудах',
  'lemmatized_toponym': 'Патриаршие пруды',
  'start_char': 60,
  'stop_char': 77}]
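
The character indices make it straightforward to mark the extracted toponyms in the original text. The following is a sketch only: highlight is a hypothetical helper, not part of moscow_toponyms, and the records list is copied from the output above rather than produced by calling the library.

```python
def highlight(text, records, left="[", right="]"):
    """Wrap each extracted span in markers, using start_char/stop_char."""
    # Process spans from the end of the text so earlier offsets stay valid.
    for rec in sorted(records, key=lambda r: r["start_char"], reverse=True):
        start, stop = rec["start_char"], rec["stop_char"]
        text = text[:start] + left + text[start:stop] + right + text[stop:]
    return text


text = ("Однажды весною, в час небывало жаркого заката, в Москве, "
        "на Патриарших прудах, появились два гражданина.")
# Copied from the Quick start output; normally this comes from extract().
records = [{"toponym": "Патриарших прудах",
            "lemmatized_toponym": "Патриаршие пруды",
            "start_char": 60,
            "stop_char": 77}]

print(highlight(text, records))
```

This version assumes the spans do not overlap.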

Usage

>>> from moscow_toponyms import ExtractMosToponyms
>>> text = "Однажды весною, в час небывало жаркого заката, в Москве, на Патриарших прудах, появились два гражданина."
>>> extract_toponyms = ExtractMosToponyms(text)

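Use SpaCy to extract toponyms and their positions in the text; the extracted toponyms are lemmatized with PyMorphy2. The sample outputs in this section appear to come from a different input text than the sentence above. The call returns two dictionaries keyed by start character index, one with the lemmatized toponyms and one with the extracted names: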

>>> spacy_extracted = extract_toponyms.spacy_extract()
>>> print(spacy_extracted)
({51: 'смоленский площадь'}, {0: 'саша панкратов'})
>>> spacy_dict = spacy_extracted[0]
>>> spacy_names = spacy_extracted[1]

Use Natasha to extract toponyms and their positions in the text:

>>> natasha_extractor = extract_toponyms.natasha_extract()
>>> print(natasha_extractor)
({51: ['Смоленской площади', 'Смоленская площадь', 69]}, {0: 'Саша Панкратов'})
>>> natasha_dict = natasha_extractor[0]
>>> natasha_names = natasha_extractor[1]
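
In the printed output, each value of the first dictionary looks like [inflected form, base form, end index], keyed by the start index. Assuming that layout (it is inferred from the sample output, not from documentation), the entries map onto the final record fields as sketched below. The helper natasha_entries_to_records is hypothetical. It illustrates the data structure only and does no Moscow-specific filtering.

```python
def natasha_entries_to_records(natasha_dict):
    """Convert {start: [inflected, lemma, stop]} entries into record dicts.

    The entry layout is an assumption based on the sample output.
    """
    records = []
    for start, (inflected, lemma, stop) in sorted(natasha_dict.items()):
        records.append({
            "toponym": inflected,
            "lemmatized_toponym": lemma,
            "start_char": start,
            "stop_char": stop,
        })
    return records


# Sample copied from the printed output of natasha_extract().
sample = {51: ["Смоленской площади", "Смоленская площадь", 69]}
print(natasha_entries_to_records(sample))
```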

Add the extracted names to the existing black list for cleaner output:

>>> black_list = extract_toponyms.merging_blacklists(spacy_names, natasha_names)

Filter all extracted toponyms and return only the Moscow toponyms, in inflected and base forms, with their start and end character indices:

>>> final_results = extract_toponyms.inner_merging_filtering(black_list, spacy_dict, natasha_dict)
>>> print(final_results)
[{'toponym': 'Смоленской площади', 'lemmatized_toponym': 'Смоленская площадь', 'start_char': 51, 'stop_char': 69}]
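
The four steps above can be chained in one function. This is a sketch under the assumption that the methods behave as shown in this section. The names run_pipeline and extract_moscow_toponyms are hypothetical wrappers, not part of the package.

```python
def run_pipeline(extractor):
    """Run the four documented steps on an ExtractMosToponyms-like object."""
    # Each extractor returns (toponyms keyed by start index, extracted names).
    spacy_dict, spacy_names = extractor.spacy_extract()
    natasha_dict, natasha_names = extractor.natasha_extract()
    # Names found by either extractor are added to the black list.
    black_list = extractor.merging_blacklists(spacy_names, natasha_names)
    # Keep only Moscow toponyms, with both forms and character indices.
    return extractor.inner_merging_filtering(black_list, spacy_dict, natasha_dict)


def extract_moscow_toponyms(text):
    """Convenience wrapper; requires moscow-toponyms to be installed."""
    # Imported lazily so this sketch loads even without the package.
    from moscow_toponyms import ExtractMosToponyms
    return run_pipeline(ExtractMosToponyms(text))
```

With the package and the model installed, extract_moscow_toponyms(text) should return the same kind of list as final_results above.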

Source Distribution

moscow_toponyms-0.1.0.tar.gz (269.8 kB)

File details

Details for the file moscow_toponyms-0.1.0.tar.gz.

File metadata

  • Download URL: moscow_toponyms-0.1.0.tar.gz
  • Upload date:
  • Size: 269.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.8

File hashes

Hashes for moscow_toponyms-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        d016b2614329afa90b8ed0ddf9d417fd3546ec7c83a848c29a02d7c40082266f
MD5           da696ac608a9066aa642a226d4865799
BLAKE2b-256   f23019170152f0f07b6fdf7a8c96e85b1c5e9654a142ad54ea0b8edb4cc51b67
