A collection of utilities for working with Greek text

Greek Language Utilities from the WMU Herculaneum Project.

This package provides a set of utilities for working with Greek text. It is designed to be used in conjunction with the WMU Herculaneum Project, but can be used independently.

Installation

poetry add wmu_greek_utils

Usage

Normalization Options

The Normalizer class provides several options for normalizing Greek text. These options can be combined to achieve the desired normalization effect. Below are the available options:

  • LOWERCASE: Converts all characters to lowercase.
  • UPPERCASE: Converts all characters to uppercase.
  • REMOVE_SPACES: Removes all spaces from the text.
  • REMOVE_NEWLINES: Removes all newline characters from the text.
  • REMOVE_PUNCTUATION: Removes all punctuation marks from the text.
  • REMOVE_ACCENTS: Removes all accent marks from the text.
  • REMOVE_BREATHING: Removes all breathing marks from the text.
  • IOTA_ADSCRIPT: Converts iota subscript to iota adscript.
  • NORMALIZE_SIGMA: Normalizes all sigma characters to a single form.
  • NORMALIZE_THETA: Normalizes all theta characters to a single form.
  • NORMALIZE_PHI: Normalizes all phi characters to a single form.
  • NORMALIZE_APOSTROPHE: Normalizes all apostrophe characters to a single form.
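
The options combine with `|`, which is the usual bit-flag pattern. The following is a minimal sketch of that pattern, assuming an `enum.Flag`-style design; the `Opt` class and the `apply_options` function are hypothetical stand-ins for illustration, not this package's API, and only three of the options are modelled.

```python
import enum
import unicodedata


class Opt(enum.Flag):
    """Hypothetical stand-in for a subset of the option flags."""
    LOWERCASE = enum.auto()
    REMOVE_SPACES = enum.auto()
    REMOVE_PUNCTUATION = enum.auto()


def apply_options(text: str, config: Opt) -> str:
    """Apply every option that is set in `config`, in a fixed order."""
    if Opt.LOWERCASE in config:
        text = text.lower()
    if Opt.REMOVE_PUNCTUATION in config:
        # Unicode general categories beginning with "P" are punctuation.
        text = "".join(
            ch for ch in text if not unicodedata.category(ch).startswith("P")
        )
    if Opt.REMOVE_SPACES in config:
        text = text.replace(" ", "")
    return text


combined = Opt.LOWERCASE | Opt.REMOVE_SPACES | Opt.REMOVE_PUNCTUATION
print(apply_options("Α Β, Γ.", combined))
```

Because each flag occupies its own bit, any subset can be requested in a single value, and the function decides the order in which the transformations run.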

Example Usage

from normalize import Normalizer, NormalizationOptions

# Standard normalization is LOWERCASE | NORMALIZE_THETA | NORMALIZE_PHI | NORMALIZE_APOSTROPHE
normalize = Normalizer()
# Note the variant thetas (ϑ, U+03D1) in the input text
text = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν ϑεόν, καὶ ϑεὸς ἦν ὁ Λόγος."
normalized_text = normalize(text)
print(normalized_text)  # Output: "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος."

# Create a normalizer with multiple options

from normalize import UPPERCASE, REMOVE_SPACES, REMOVE_NEWLINES, REMOVE_PUNCTUATION, REMOVE_ACCENTS, REMOVE_BREATHING, IOTA_ADSCRIPT, NORMALIZE_SIGMA, NORMALIZE_THETA, NORMALIZE_PHI, NORMALIZE_APOSTROPHE

radical_normalizer = Normalizer(config=UPPERCASE
        | REMOVE_SPACES
        | REMOVE_NEWLINES
        | REMOVE_PUNCTUATION
        | REMOVE_ACCENTS
        | REMOVE_BREATHING
        | IOTA_ADSCRIPT
        | NORMALIZE_SIGMA
        | NORMALIZE_THETA
        | NORMALIZE_PHI
        | NORMALIZE_APOSTROPHE
)

# The above is equivalent to Normalizer(config=NORMALIZATION_OPTIONS.ALL)

normalized_text = radical_normalizer(text)
print(normalized_text)  # Output: "ΕΝΑΡΧΗΙΗΝΟΛΟΓΟϹΚΑΙΟΛΟΓΟϹΗΝΠΡΟϹΤΟΝΘΕΟΝΚΑΙΘΕΟϹΗΝΟΛΟΓΟϹ"
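
One common way to remove accents and breathings and to turn iota subscript into iota adscript is to decompose the text (Unicode NFD), filter the combining marks, and recompose (NFC). The sketch below shows that technique using only the standard library; `strip_marks` and its keyword arguments are hypothetical names, and this is not necessarily how the Normalizer class is implemented.

```python
import unicodedata

ACCENTS = {"\u0301", "\u0300", "\u0342"}  # acute, grave, perispomeni
BREATHINGS = {"\u0313", "\u0314"}         # smooth, rough
YPOGEGRAMMENI = "\u0345"                  # combining iota subscript


def strip_marks(text, remove_accents=True, remove_breathing=True,
                iota_adscript=True):
    """Filter combining marks from decomposed text, then recompose."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if remove_accents and ch in ACCENTS:
            continue
        if remove_breathing and ch in BREATHINGS:
            continue
        if iota_adscript and ch == YPOGEGRAMMENI:
            # Replace the subscript with a full-size iota after the vowel.
            out.append("\u03b9")
            continue
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


print(strip_marks("ἀρχῇ"))
```

In canonical order the iota subscript follows the other marks on a vowel, so the adscript iota lands after any accent or breathing that is kept.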

AGDT morphological parsing

parse_morphology

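The parse_morphology function parses an AGDT morphological code, a nine-character string in which each position encodes one feature (part of speech, person, number, tense, mood, voice, gender, case, degree). A hyphen marks a feature that does not apply and is returned as None.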

Examples:

  1. Parsing a verb morphology code:
>>> parse_morphology("v3sasm---", include_names=False)
['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None]
  2. Parsing a noun morphology code:
>>> parse_morphology("n-s---mn-", include_names=False)
['noun', None, 'singular', None, None, None, 'masculine', 'nominative', None]
  3. Including the position names in the output:
>>> list(parse_morphology("n-s---mn-"))
[('part_of_speech', 'noun'), ('person', None), ('number', 'singular'), ('tense', None), ('mood', None), ('voice', None), ('gender', 'masculine'), ('case', 'nominative'), ('degree', None)]
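
The positional scheme can be illustrated with a small lookup table per position. This is a sketch only: `parse_code` and the `CODES` table are hypothetical, the table is deliberately partial, and the package's own tables may differ.

```python
POSITION_NAMES = (
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)

# Partial, illustrative code tables (assumption: not the package's full set).
CODES = {
    "part_of_speech": {"n": "noun", "v": "verb"},
    "person": {"1": "first person", "2": "second person", "3": "third person"},
    "number": {"s": "singular", "p": "plural"},
    "tense": {"p": "present", "a": "aorist"},
    "mood": {"i": "indicative", "s": "subjunctive"},
    "voice": {"a": "active", "m": "middle"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "case": {"n": "nominative", "g": "genitive"},
    "degree": {"c": "comparative", "s": "superlative"},
}


def parse_code(code):
    """Map each character to its meaning according to its position."""
    if len(code) != len(POSITION_NAMES):
        raise ValueError("expected a 9-character morphology code")
    # "-" is absent from every table, so .get() yields None for it.
    return [(name, CODES[name].get(ch)) for name, ch in zip(POSITION_NAMES, code)]


print(parse_code("n-s---mn-"))
```

The same letter can mean different things at different positions (for example `s` is singular at position 2 and subjunctive at position 4), which is why the lookup is keyed by position name first.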

morphology_string

Given a list of forms, produce the morphology string to the best of our ability.

Examples:

  1. Basic usage with a list of forms:
>>> morphology_string(['noun', 'masculine', 'singular', 'nominative'])
'n-s---mn-'
  2. Usage with a randomized list of forms (in other words, the order of the forms does not matter):
>>> import random
>>> forms = ['noun', 'masculine', 'singular', 'nominative']
>>> random.shuffle(forms)
>>> morphology_string(forms)
'n-s---mn-'
  3. Usage with abbreviated forms:
>>> morphology_string(['masc', 'sing', 'nom', 'n'])
'n-s---mn-'
  4. Usage with a more complex list of forms:
>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None])
'v3sasm---'
  5. Usage with a partial list of forms:
>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle'])
'v3sasm---'
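
Order independence follows naturally if each form name maps to a fixed slot. The sketch below shows that idea with a hypothetical `build_code` function and a partial `FORM_TO_SLOT` table; it handles full names only, so the abbreviation handling shown above (for example `'masc'` or `'n'`) is not reproduced here.

```python
# Form name -> (0-based position, code letter). Partial and illustrative.
FORM_TO_SLOT = {
    "noun": (0, "n"), "verb": (0, "v"),
    "first person": (1, "1"), "second person": (1, "2"), "third person": (1, "3"),
    "singular": (2, "s"), "plural": (2, "p"),
    "present": (3, "p"), "aorist": (3, "a"),
    "indicative": (4, "i"), "subjunctive": (4, "s"),
    "active": (5, "a"), "middle": (5, "m"),
    "masculine": (6, "m"), "feminine": (6, "f"), "neuter": (6, "n"),
    "nominative": (7, "n"), "genitive": (7, "g"),
    "comparative": (8, "c"), "superlative": (8, "s"),
}


def build_code(forms):
    """Fill a 9-slot template; unspecified slots stay as '-'."""
    slots = ["-"] * 9
    for form in forms:
        if form is None:
            continue  # None means "feature not specified"
        position, letter = FORM_TO_SLOT[form]
        slots[position] = letter
    return "".join(slots)


print(build_code(["nominative", "noun", "singular", "masculine"]))
```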

position_to_name

Given a 0-based position, return the name of the position.

    >>> position_to_name(0)
    'part_of_speech'
    >>> position_to_name(8)
    'degree'

name_to_position

Given a name, return the 0-based position. Some short or alternate names are also accepted.

    >>> name_to_position('part_of_speech')
    0
    >>> name_to_position('pos')
    0
    >>> name_to_position('degree')
    8

recreate_sentence

Given a list of (word, part-of-speech) tuples, recreate the sentence as a string, along with the inclusive (start, end) character positions of each token in that sentence.

words = [
        ("The", "det"),
        ("cat", "noun"),
        (",", "punctuation"),
        ("the", "det"),
        ("dog", "noun"),
        (",", "punctuation"),
        ("and", "conj"),
        ("the", "det"),
        ("frog", "noun"),
        ("sat", "verb"),
        ("on", "prep"),
        ("the", "det"),
        ("mat", "noun"),
        (".", "punctuation"),
    ]
sentence, poss = agdt.recreate_sentence(words)
assert sentence == "The cat, the dog, and the frog sat on the mat."
assert poss == [
        (0, 2),
        (4, 6),
        (7, 7),
        (9, 11),
        (13, 15),
        (16, 16),
        (18, 20),
        (22, 24),
        (26, 29),
        (31, 33),
        (35, 36),
        (38, 40),
        (42, 44),
        (45, 45),
    ]
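
The expected positions above are consistent with a simple rule: join tokens with a single space, except that punctuation attaches directly to the preceding token. A minimal sketch of that rule follows; `rebuild` is a hypothetical name, and the real function may treat cases such as opening quotes or brackets differently.

```python
def rebuild(tokens):
    """Return (sentence, spans) for a list of (word, tag) pairs.

    Each span is an inclusive (start, end) character offset.
    """
    sentence = ""
    spans = []
    for word, tag in tokens:
        # Insert a separating space unless this is the first token
        # or the token is punctuation.
        if sentence and tag != "punctuation":
            sentence += " "
        start = len(sentence)
        sentence += word
        spans.append((start, len(sentence) - 1))
    return sentence, spans


sentence, spans = rebuild([
    ("The", "det"), ("cat", "noun"), (",", "punctuation"),
    ("sat", "verb"), (".", "punctuation"),
])
print(sentence)
print(spans)
```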

Acknowledgements

This package was developed by the WMU Herculaneum Project.

I am grateful for James Tauber's greek_normalisation package, which served as a reference for the normalization options here; some of its code is used in this package.
