A package for analyzing Russian text to identify and extract words borrowed from other languages. It uses the pymorphy2 library to normalize words and includes a dictionary of borrowed words for comparison.

russborrow

This is a test version of the russborrow module.

Overview

russborrow is a Python module for analyzing Russian text to identify and extract words borrowed from other languages. The module uses the pymorphy2 package to normalize words and compares them against a bundled dictionary of borrowed words.

Usage

The russborrow.extract function accepts either a text string or a path to a text file, and optionally a path for an output file.

Make sure to import it in your script or Python environment:

import russborrow

Note: If you assign the result to a variable (e.g., borrowed in the examples), that variable holds the returned object. If you do not assign it, the module works only with the provided input and output files. If you neither assign the result nor provide an output file, the code won’t crash, but it produces no output and nothing is returned to you.

Option 1: Analyze Text String

Example 1:

borrowed = russborrow.extract("""
Значимость этих проблем настолько очевидна, что синтетическое тестирование
требует анализа экспериментов, поражающих по своей масштабности и грандиозности.
""")

Example 2:

text_to_analyze = """
Значимость этих проблем настолько очевидна, что синтетическое тестирование
требует анализа экспериментов, поражающих по своей масштабности и грандиозности.
"""
borrowed = russborrow.extract(text_to_analyze)

Option 2: Analyze Text File

Example 1, Standard Usage with ~:

file_path = '~/Desktop/text.txt'
borrowed = russborrow.extract(file_path)

Example 2, Windows Path:

file_path = r"C:\Users\Username\Documents\file.txt"
borrowed = russborrow.extract(file_path)

Example 3 (File in the same directory as the code):

borrowed = russborrow.extract('text.txt')

Option 3: Provide Output File Path

Specify an output file path to create a file containing the analysis results. The provided file path must end with either .txt or .csv. If the specified file already exists, it will be overwritten.

Example:

output_path = '~/Desktop/newoutput.txt'
borrowed = russborrow.extract(text_to_analyze, output_path)  # text_to_analyze as defined in Option 1
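
The output path rules above (tilde expansion, a .txt or .csv suffix) can be checked before calling the module. The following is a minimal sketch using only the standard library. resolve_output_path is a hypothetical helper, not part of russborrow, and the module's own validation may behave differently (for example with upper-case extensions).

```python
from pathlib import Path

# The two extensions the README names for output files.
ALLOWED_SUFFIXES = {".txt", ".csv"}


def resolve_output_path(raw_path: str) -> Path:
    """Expand '~' and reject paths that do not end in .txt or .csv.

    Hypothetical helper for illustration; not part of russborrow.
    """
    path = Path(raw_path).expanduser()
    if path.suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"output path must end with .txt or .csv: {raw_path!r}")
    return path


if __name__ == "__main__":
    out = resolve_output_path("~/Desktop/newoutput.txt")
    if out.exists():
        # The README states that an existing file is overwritten.
        print(f"{out} already exists and would be overwritten")
```

Checking the path up front makes an accidental overwrite visible before any analysis runs.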

Returned Object Attributes

The russborrow.extract function returns an object of the Borrowed class with the following attributes:

  • borrowed.len: Total number of words in the text.

  • borrowed.bor: Number of borrowed words in the provided text.

  • borrowed.percent: Percentage of borrowed words in the provided text.

  • borrowed.dict: A dictionary whose keys are the normalized forms of the borrowed words. Each value is itself a dictionary with the following entries:

    • value['Repeats']: Count of the word (normalized version) in the text.

    • value['Value']: Description of the word.

    • value['Origin']: Language of origin of the borrowed word.

    • value['Instances']: List of the original (pre-normalization) word forms found in the text that normalize to this key.

Note: Object attributes have no setters.
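
To make the shape of the returned object concrete, here is a minimal sketch of a read-only container with the same attribute names. BorrowedSketch is a hypothetical stand-in, not the module's Borrowed class. The counts and entries are invented to show the structure, the "(description)" and "(language)" strings are placeholders, and computing percent as bor / len * 100 is an assumption.

```python
class BorrowedSketch:
    """Illustrative stand-in for the returned object (not russborrow's class)."""

    def __init__(self, total_words, borrowed_count, entries):
        self._len = total_words
        self._bor = borrowed_count
        self._dict = entries

    # Properties without setters: reading works, assignment raises AttributeError.
    @property
    def len(self):
        return self._len

    @property
    def bor(self):
        return self._bor

    @property
    def percent(self):
        # Assumption: percentage of borrowed words among all words.
        return 100 * self._bor / self._len if self._len else 0.0

    @property
    def dict(self):
        return self._dict


# Invented sample data in the documented layout.
result = BorrowedSketch(
    total_words=17,
    borrowed_count=2,
    entries={
        "анализ": {"Repeats": 1, "Value": "(description)",
                   "Origin": "(language)", "Instances": ["анализа"]},
        "эксперимент": {"Repeats": 1, "Value": "(description)",
                        "Origin": "(language)", "Instances": ["экспериментов"]},
    },
)

for lemma, info in result.dict.items():
    print(lemma, info["Repeats"], info["Origin"], info["Instances"])

try:
    result.len = 0  # no setter is defined, so this fails
except AttributeError:
    print("attributes are read-only")
```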

Used Resources

Pymorphy2

Pymorphy2 is a Python package for morphological analysis and inflection. It is used in the russborrow module to normalize words for comparison with the dictionary.
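
Normalization matters because Russian words inflect, so the form in a text rarely matches the dictionary key directly. With pymorphy2 the normal form is usually obtained via MorphAnalyzer().parse(word)[0].normal_form. The sketch below replaces that call with a tiny hand-written lemma table so that it runs without the dependency. All names here (LEMMAS, BORROWED, normalize, find_borrowed) are hypothetical, the "(language)" values are placeholders, and the module's actual tokenization and counting may differ.

```python
import re
from collections import defaultdict

# Stand-in for pymorphy2: a tiny hand-written lemma table (demo assumption).
LEMMAS = {"анализа": "анализ", "экспериментов": "эксперимент", "требует": "требовать"}

# Stand-in for the bundled dictionary: normalized form -> origin placeholder.
BORROWED = {"анализ": "(language)", "эксперимент": "(language)"}


def normalize(word: str) -> str:
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def find_borrowed(text: str) -> dict:
    """Group the surface forms found in the text under their normalized key."""
    found = defaultdict(lambda: {"Repeats": 0, "Instances": []})
    # Simple tokenizer: runs of Cyrillic letters, case-insensitive.
    for token in re.findall(r"[а-яё]+", text, flags=re.IGNORECASE):
        lemma = normalize(token)
        if lemma in BORROWED:
            found[lemma]["Repeats"] += 1
            found[lemma]["Instances"].append(token)
    return dict(found)


matches = find_borrowed("Анализ требует анализа экспериментов.")
for lemma, info in matches.items():
    print(lemma, info)
```

Two different surface forms ("Анализ" and "анализа") end up under the single key "анализ", which is the reason the lookup is done on normalized forms.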

Borrowed Words Dictionary

The dictionary used for identifying borrowed words is sourced from Wiktionary. It is stored in the borrowed_dictionary.csv file within the russborrow module. The dictionary format includes the following columns:

  • Key: Borrowed word

  • Value: Description of the word

  • Origin: Language of origin

Example entry in the dictionary:

гламур, — glamer, от gramarye «магия, заклинание», Из гэльского (шотландского)
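
Descriptions can contain commas, as in the example entry, so a CSV reader needs quoted fields to split a row into three columns. The following is a minimal sketch of reading such a file with the standard csv module. It assumes a header row named Key, Value, Origin and standard double-quote quoting; the actual layout of borrowed_dictionary.csv may differ. The split of the example entry into fields is likewise an interpretation, and load_dictionary is a hypothetical helper.

```python
import csv
import io

# In-memory stand-in for borrowed_dictionary.csv.
# Header names and quoting are assumptions, not confirmed by the module.
SAMPLE_CSV = (
    "Key,Value,Origin\n"
    'гламур,"— glamer, от gramarye «магия, заклинание»",Из гэльского (шотландского)\n'
)


def load_dictionary(handle) -> dict:
    """Map each borrowed word to its description and language of origin."""
    return {
        row["Key"]: {"Value": row["Value"], "Origin": row["Origin"]}
        for row in csv.DictReader(handle)
    }


entries = load_dictionary(io.StringIO(SAMPLE_CSV))
print(entries["гламур"]["Origin"])
```

To read a real file instead of the in-memory sample, pass a handle opened with open(path, encoding="utf-8", newline="").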

Exclusion Note: The word “они” (鬼 «демон»; демоны-людоеды, умеющие обращаться в людей) has been intentionally excluded from the dictionary for the following reasons:

  • pymorphy2 and its resources recognize the word “они” solely as a pronoun (местоимение), with no noun (существительное) reading.

  • Retaining “они” in the dictionary would lead the program to classify the highly common pronoun “они” as a borrowed word.

  • According to Wiktionary, the usage of “они” as a borrowed word is infrequent. If your text focuses on Japanese folklore or demons, it is advisable to manually verify the output for accuracy.
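
For the manual check suggested above, it can help to list every occurrence of “они” together with its surrounding text and decide case by case. This is a minimal sketch using only the standard library. contexts is a hypothetical helper and is not part of russborrow.

```python
import re


def contexts(text: str, word: str, width: int = 30) -> list:
    """Return a snippet of surrounding text for each whole-word match of `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].replace("\n", " "))
    return snippets


sample = "Кони устали. В японском фольклоре они живут в горах."
for snippet in contexts(sample, "они"):
    print(snippet)
```

The word-boundary match skips words that merely contain the letters, such as “Кони” in the sample, so only standalone occurrences are listed for review.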
