
Analyzes Russian text to identify and extract words borrowed from other languages. The package uses the pymorphy2 library to normalize words and includes a dictionary of borrowed words for comparison.

Reason this release was yanked:

incorrect handling of 'ё' and dashes

Project description

russborrow

This is a test version of the russborrow module.

Overview

russborrow is a Python module that analyzes Russian text to identify and extract words borrowed from other languages. It normalizes each word with the Pymorphy2 package and then checks the normalized form against a dictionary of borrowed words.

Usage

The russborrow module provides various options for analyzing text and extracting borrowed words.

Make sure to import it in your script or Python environment:

import russborrow

Note: If you assign the result to a variable (borrowed in the examples), that variable holds the returned object. If you do not assign it, the module still reads the input and writes the output file when one is provided. If you neither assign the result nor provide an output file, the call will not crash, but its result is discarded and nothing is written.

Option 1: Analyze Text String

Example 1:

borrowed = russborrow.extract("""
Значимость этих проблем настолько очевидна, что синтетическое тестирование
требует анализа экспериментов, поражающих по своей масштабности и грандиозности.
""")

Example 2:

text_to_analyze = """
Значимость этих проблем настолько очевидна, что синтетическое тестирование
требует анализа экспериментов, поражающих по своей масштабности и грандиозности.
"""
borrowed = russborrow.extract(text_to_analyze)

Option 2: Analyze Text File

Example 1, Standard Usage with ~:

file_path = '~/Desktop/text.txt'
borrowed = russborrow.extract(file_path)

Example 2, Windows Path:

file_path = r"C:\Users\Username\Documents\file.txt"
borrowed = russborrow.extract(file_path)

Example 3, File in the Same Directory as the Code:

borrowed = russborrow.extract('text.txt')
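
Example 1 suggests that extract accepts a path starting with ~, but the text above does not say how paths are resolved. If you would rather resolve the path yourself before passing it in, the following is a minimal sketch using only the standard library. The helper name resolve_input_path is hypothetical and is not part of russborrow.

```python
from pathlib import Path


def resolve_input_path(raw_path: str) -> Path:
    """Expand a leading ~ and make the path absolute.

    Hypothetical helper for illustration; not part of russborrow.
    """
    return Path(raw_path).expanduser().resolve()


path = resolve_input_path("~/Desktop/text.txt")
print(path)            # absolute path under the current user's home directory
print(path.is_file())  # True only if the file actually exists
```

The resolved value can then be converted with str(path) and passed to russborrow.extract in the same way as the string paths above.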

Option 3: Provide Output File Path

Specify an output file path to create a file containing the analysis results. The provided file path must end with either .txt or .csv. If the specified file already exists, it will be overwritten.

Example:

# text_to_analyze is the string defined in Option 1, Example 2
output_path = '~/Desktop/newoutput.txt'
borrowed = russborrow.extract(text_to_analyze, output_path)
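
Because the output path must end with .txt or .csv and an existing file is overwritten, it can help to check the path before calling extract. The sketch below is one way to do that with pathlib. The helper check_output_path is hypothetical, and treating the extension check as case-sensitive is an assumption, since the text above does not specify it.

```python
from pathlib import Path

# The two extensions named in the description above.
ALLOWED_SUFFIXES = {".txt", ".csv"}


def check_output_path(raw_path: str) -> Path:
    """Return the expanded path, or raise ValueError for an unsupported extension.

    Hypothetical helper for illustration; not part of russborrow.
    """
    path = Path(raw_path).expanduser()
    if path.suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"output path must end with .txt or .csv, got {path.name!r}")
    return path


output_path = check_output_path("~/Desktop/newoutput.txt")
if output_path.exists():
    # Warn before a call that would replace the existing file.
    print(f"{output_path} already exists and would be overwritten")
```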

Returned Object Attributes

The russborrow.extract function returns an object of the Borrowed class with the following attributes:

  • borrowed.len: Total number of words in the text.

  • borrowed.bor: Number of borrowed words in the provided text.

  • borrowed.percent: Percentage of borrowed words in the provided text.

  • borrowed.dict: A dictionary whose keys are the normalized forms of the borrowed words. Each value is itself a dictionary with the following entries:

    • value['Repeats']: Number of times the word (in any form that normalizes to this key) occurs in the text.

    • value['Value']: Description of the word.

    • value['Origin']: Language of origin of the borrowed word.

    • value['Instances']: List of the word forms, as they appear in the original text before normalization, that normalize to this key.

Note: Object attributes have no setters.
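
To show how the structure of borrowed.dict can be consumed, here is a small sketch that ranks borrowed words by their 'Repeats' count. It runs on a hand-written stand-in dictionary with the documented shape, so the entries, counts, and placeholder descriptions are illustrative and not real russborrow output. The function most_frequent is a hypothetical helper.

```python
# Stand-in for borrowed.dict with the documented shape.
# The entries and counts are illustrative, not real russborrow output.
sample_dict = {
    "гламур": {
        "Repeats": 2,
        "Value": "glamer, от gramarye «магия, заклинание»",
        "Origin": "Из гэльского (шотландского)",
        "Instances": ["гламур", "гламура"],
    },
    "эксперимент": {
        "Repeats": 1,
        "Value": "(description placeholder)",
        "Origin": "(origin placeholder)",
        "Instances": ["экспериментов"],
    },
}


def most_frequent(borrowed_dict, limit=10):
    """Return (word, repeats, origin) tuples sorted by descending repeat count."""
    rows = [(word, info["Repeats"], info["Origin"]) for word, info in borrowed_dict.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


for word, repeats, origin in most_frequent(sample_dict):
    print(f"{word}: {repeats} ({origin})")
```

With a real result, the same function would be called as most_frequent(borrowed.dict).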

Used Resources

Pymorphy2

Pymorphy2 is a Python package for morphological analysis and inflection. The russborrow module uses it to normalize words so that they can be compared with the entries in the dictionary.
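
The normalize-then-look-up idea can be sketched as follows. This is not russborrow's actual implementation: the function find_borrowed, the simple regex tokenizer, and the toy lemma table are all assumptions made so that the example runs without Pymorphy2 installed. With Pymorphy2 available, the normalizer could instead be lambda w: morph.parse(w)[0].normal_form, where morph = pymorphy2.MorphAnalyzer().

```python
import re
from typing import Callable, Dict, List, Set


def find_borrowed(text: str, normalize: Callable[[str], str], known: Set[str]) -> Dict[str, List[str]]:
    """Map each normalized borrowed word to the surface forms found in the text."""
    found: Dict[str, List[str]] = {}
    # Simplistic tokenizer: runs of lowercase Cyrillic letters only (no hyphen handling).
    for token in re.findall(r"[а-яё]+", text.lower()):
        lemma = normalize(token)
        if lemma in known:
            found.setdefault(lemma, []).append(token)
    return found


# Toy normalizer and toy word set standing in for Pymorphy2 and the real dictionary.
toy_lemmas = {"экспериментов": "эксперимент", "анализа": "анализ"}
result = find_borrowed(
    "Тестирование требует анализа экспериментов.",
    normalize=lambda word: toy_lemmas.get(word, word),
    known={"эксперимент", "анализ"},
)
print(result)  # {'анализ': ['анализа'], 'эксперимент': ['экспериментов']}
```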

Borrowed Words Dictionary

The dictionary used for identifying borrowed words is sourced from Wiktionary. It is stored in the borrowed_dictionary.csv file within the russborrow module. The dictionary format includes the following columns:

  • Key: Borrowed word

  • Value: Description of the word

  • Origin: Language of origin

Example entry in the dictionary:

гламур, — glamer, от gramarye «магия, заклинание», Из гэльского (шотландского)
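
As a rough illustration of the three-column layout, the sketch below parses a CSV with Key, Value, and Origin columns into a lookup table using the standard csv module. The header row, the quoting of fields that contain commas, and the helper load_dictionary are assumptions; the actual layout of borrowed_dictionary.csv may differ.

```python
import csv
import io

# Illustrative CSV content; the header row and the quoting are assumptions.
sample_csv = (
    "Key,Value,Origin\n"
    'гламур,"glamer, от gramarye «магия, заклинание»",Из гэльского (шотландского)\n'
)


def load_dictionary(handle):
    """Read rows into {word: {"Value": ..., "Origin": ...}}."""
    return {
        row["Key"]: {"Value": row["Value"], "Origin": row["Origin"]}
        for row in csv.DictReader(handle)
    }


entries = load_dictionary(io.StringIO(sample_csv))
print(entries["гламур"]["Origin"])  # Из гэльского (шотландского)
```

For a file on disk, the same function could be given a handle opened with open(path, encoding="utf-8", newline="").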

Exclusion Note: The word “они” (鬼 «демон»: демоны-людоеды, умеющие обращаться в людей) has been intentionally excluded from the dictionary for the following reasons:

  • Pymorphy2 and its resources recognize the word “они” solely as a pronoun (местоимение), with no noun form (существительное).

  • Retaining “они” in the dictionary leads the program to classify the highly common pronoun “они” as a borrowed word.

  • According to Wiktionary, the usage of “они” as a borrowed word is infrequent. If your text focuses on Japanese folklore or demons, it is advisable to manually verify the output for accuracy.
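
For the manual check suggested above, a quick first step is to count how often “они” occurs in the text, so you know how many occurrences need to be read in context. This is a minimal sketch with a hypothetical helper count_word; it only counts whole-word matches and cannot tell the pronoun from the noun.

```python
import re


def count_word(text: str, word: str = "они") -> int:
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


print(count_word("Они пришли. Говорят, они видели демона."))  # 2
```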



Download files


Source Distribution

russborrow-0.0.8a1.tar.gz (153.1 kB view details)

Uploaded Source

Built Distribution


russborrow-0.0.8a1-py3-none-any.whl (155.3 kB view details)

Uploaded Python 3

File details

Details for the file russborrow-0.0.8a1.tar.gz.

File metadata

  • Download URL: russborrow-0.0.8a1.tar.gz
  • Size: 153.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.1

File hashes

Hashes for russborrow-0.0.8a1.tar.gz
Algorithm Hash digest
SHA256 5dc5bdb57587ca577e79f075ffec663a35b41b767d00d5b64335c5046a94d967
MD5 941e4c9ebc63723e564aa9039cf91980
BLAKE2b-256 06610dcb8f7ae8bf4ec90338442d023cf16c96dfbafc7f9e6f0ecb9b610f1a74


File details

Details for the file russborrow-0.0.8a1-py3-none-any.whl.

File metadata

  • Download URL: russborrow-0.0.8a1-py3-none-any.whl
  • Size: 155.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.1

File hashes

Hashes for russborrow-0.0.8a1-py3-none-any.whl
Algorithm Hash digest
SHA256 e5afb77825876c3ee4cf8537b928ba1d72a941ba9412694ab2120e784a50c0ea
MD5 a5918d4009e1ea42c39668d7c0122d0c
BLAKE2b-256 560c66a2acad33744887b3fc2e260053639e7ff49757f3e290affe8519343629
