A collection of utilities for working with Greek text
Greek Language Utilities from the WMU Herculaneum Project.
This package provides a set of utilities for working with Greek text. It is designed to be used in conjunction with the WMU Herculaneum Project, but can be used independently.
Installation
poetry add wmu_greek_utils
Usage
Normalization Options
The Normalizer class provides several options for normalizing Greek text. These options can be combined to achieve the desired normalization effect. Below are the available options:
- LOWERCASE: Converts all characters to lowercase.
- UPPERCASE: Converts all characters to uppercase.
- REMOVE_SPACES: Removes all spaces from the text.
- REMOVE_NEWLINES: Removes all newline characters from the text.
- REMOVE_PUNCTUATION: Removes all punctuation marks from the text.
- REMOVE_ACCENTS: Removes all accent marks from the text.
- REMOVE_BREATHING: Removes all breathing marks from the text.
- IOTA_ADSCRIPT: Converts iota subscript to iota adscript.
- NORMALIZE_SIGMA: Normalizes all sigma characters to a single form.
- NORMALIZE_THETA: Normalizes all theta characters to a single form.
- NORMALIZE_PHI: Normalizes all phi characters to a single form.
- NORMALIZE_APOSTROPHE: Normalizes all apostrophe characters to a single form.
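The options are combined with the bitwise OR operator. As a rough illustration of how combinable flags of this kind can be modelled, here is a minimal sketch built on Python's standard enum.Flag. The names DemoOptions and apply_options, the three members, and the punctuation set are hypothetical stand-ins chosen for this example; they are not the package's actual definitions.

```python
from enum import Flag, auto


class DemoOptions(Flag):
    """Hypothetical stand-in for a set of combinable normalization flags."""
    LOWERCASE = auto()
    REMOVE_SPACES = auto()
    REMOVE_PUNCTUATION = auto()


def apply_options(text: str, config: DemoOptions) -> str:
    """Apply every option that is switched on in config, in a fixed order."""
    if DemoOptions.LOWERCASE in config:
        text = text.lower()
    if DemoOptions.REMOVE_PUNCTUATION in config:
        # A tiny, assumed punctuation set; a real tool would cover far more.
        text = "".join(ch for ch in text if ch not in ".,;·")
    if DemoOptions.REMOVE_SPACES in config:
        text = text.replace(" ", "")
    return text


# Flags are merged with |, and membership is tested with "in".
config = DemoOptions.LOWERCASE | DemoOptions.REMOVE_SPACES
result = apply_options("Ἐν ἀρχῇ", config)
```

Because each member occupies its own bit, any subset of options can be expressed as a single value and passed around as one argument.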
Example Usage
from normalize import Normalizer, NormalizationOptions
# Standard normalization is LOWERCASE | NORMALIZE_THETA | NORMALIZE_PHI | NORMALIZE_APOSTROPHE
normalize = Normalizer()
# notice odd thetas
text = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν ϑεόν, καὶ ϑεὸς ἦν ὁ Λόγος."
normalized_text = normalize(text)
print(normalized_text) # Output: "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος."
# Create a normalizer with multiple options
from normalize import UPPERCASE, REMOVE_SPACES, REMOVE_NEWLINES, REMOVE_PUNCTUATION, REMOVE_ACCENTS, REMOVE_BREATHING, IOTA_ADSCRIPT, NORMALIZE_SIGMA, NORMALIZE_THETA, NORMALIZE_PHI, NORMALIZE_APOSTROPHE
radical_normalizer = Normalizer(config=UPPERCASE
| REMOVE_SPACES
| REMOVE_NEWLINES
| REMOVE_PUNCTUATION
| REMOVE_ACCENTS
| REMOVE_BREATHING
| IOTA_ADSCRIPT
| NORMALIZE_SIGMA
| NORMALIZE_THETA
| NORMALIZE_PHI
| NORMALIZE_APOSTROPHE
)
# The above is equivalent to Normalizer(config=NORMALIZATION_OPTIONS.ALL)
normalized_text = radical_normalizer(text)
print(normalized_text) # Output: "ΕΝΑΡΧΗΙΗΝΟΛΟΓΟϹΚΑΙΟΛΟΓΟϹΗΝΠΡΟϹΤΟΝΘΕΟΝΚΑΙΘΕΟϹΗΝΟΛΟΓΟϹ"
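Options such as REMOVE_ACCENTS and REMOVE_BREATHING can be understood in terms of Unicode decomposition. The sketch below is not the package's implementation; it only illustrates one common approach using the standard unicodedata module. The helper name strip_marks and the two constant sets are assumptions made for this example.

```python
import unicodedata

# Combining marks that appear once polytonic Greek is decomposed (NFD).
ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, circumflex (perispomeni)
BREATHINGS = {"\u0313", "\u0314"}         # smooth (psili), rough (dasia)


def strip_marks(text: str, marks: set[str]) -> str:
    """Remove the given combining marks, leaving every other character intact."""
    # Split precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    # Recompose whatever marks remain (for example a breathing or iota subscript).
    return unicodedata.normalize("NFC", kept)


no_accents = strip_marks("ἦν", ACCENTS)               # breathing is preserved
bare = strip_marks("ἦν", ACCENTS | BREATHINGS)        # only the base letters remain
```

Keeping accents and breathings in separate sets mirrors the way the two options can be toggled independently.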
AGDT morphological parsing
parse_morphology
The parse_morphology function parses an AGDT morphological code into its component features, one per position of the code.
Examples:
- Parsing a verb morphology code:
>>> parse_morphology("v3sasm---", include_names=False)
['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None]
- Parsing a noun morphology code:
>>> parse_morphology("n-s---mn-", include_names=False)
['noun', None, 'singular', None, None, None, 'masculine', 'nominative', None]
- Including the position names in the output:
>>> list(parse_morphology("n-s---mn-"))
[('part_of_speech', 'noun'), ('person', None), ('number', 'singular'), ('tense', None), ('mood', None), ('voice', None), ('gender', 'masculine'), ('case', 'nominative'), ('degree', None)]
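To make the positional layout concrete, here is a minimal sketch of how a code of this shape could be decoded. It is not the package's implementation: the function name decode and the lookup tables are hypothetical, and the tables hold only a small illustrative subset of codes. The position names are the ones that appear in the example output above.

```python
from typing import Optional

# Position names, in order, taken from the example output above.
POSITION_NAMES = (
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)

# Illustrative subset only; a complete table would list many more codes.
CODES = {
    "part_of_speech": {"n": "noun", "v": "verb"},
    "person": {"1": "first person", "2": "second person", "3": "third person"},
    "number": {"s": "singular", "p": "plural", "d": "dual"},
    "tense": {"p": "present", "a": "aorist"},
    "mood": {"i": "indicative", "s": "subjunctive"},
    "voice": {"a": "active", "m": "middle", "p": "passive"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "case": {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"},
    "degree": {"c": "comparative", "s": "superlative"},
}


def decode(code: str) -> list[tuple[str, Optional[str]]]:
    """Pair each position name with the feature its character stands for."""
    if len(code) != len(POSITION_NAMES):
        raise ValueError(f"expected {len(POSITION_NAMES)} characters, got {len(code)}")
    # "-" (or any character missing from the table) decodes to None.
    return [(name, CODES[name].get(ch)) for name, ch in zip(POSITION_NAMES, code)]


pairs = decode("n-s---mn-")
values = [value for _, value in decode("v3sasm---")]
```

The same letter can mean different things in different positions (for example "n" for noun and for nominative), which is why each position needs its own table.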
morphology_string
Given a list of forms, produce the morphology string to the best of our ability.
Examples:
- Basic usage with a list of forms:
>>> morphology_string(['noun', 'masculine', 'singular', 'nominative'])
'n-s---mn-'
- Usage with a randomized list of forms (in other words, the order of the forms does not matter):
>>> import random
>>> forms = ['noun', 'masculine', 'singular', 'nominative']
>>> random.shuffle(forms)
>>> morphology_string(forms)
'n-s---mn-'
- Usage with abbreviated forms:
>>> morphology_string(['masc', 'sing', 'nom', 'n'])
'n-s---mn-'
- Usage with a more complex list of forms:
>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None])
'v3sasm---'
- Usage with a partial list of forms:
>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle'])
'v3sasm---'
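The order of the forms does not matter because each full form name belongs to exactly one position. A minimal sketch of that idea follows; the function name encode and the abbreviated tables are hypothetical and are not the package's code. Abbreviated inputs such as 'n', which could belong to more than one position, need extra disambiguation that this sketch does not attempt.

```python
from typing import Iterable, Optional

# One table per position, in code order (assumed, abbreviated):
# full form name -> code character.
TABLES = [
    {"noun": "n", "verb": "v"},
    {"first person": "1", "second person": "2", "third person": "3"},
    {"singular": "s", "plural": "p", "dual": "d"},
    {"present": "p", "aorist": "a"},
    {"indicative": "i", "subjunctive": "s"},
    {"active": "a", "middle": "m", "passive": "p"},
    {"masculine": "m", "feminine": "f", "neuter": "n"},
    {"nominative": "n", "genitive": "g", "dative": "d", "accusative": "a"},
    {"comparative": "c", "superlative": "s"},
]


def encode(forms: Iterable[Optional[str]]) -> str:
    """Build the positional code from full form names given in any order."""
    slots = ["-"] * len(TABLES)
    for form in forms:
        if form is None:
            continue  # None simply means "no value for some position"
        for index, table in enumerate(TABLES):
            if form in table:
                slots[index] = table[form]
                break
        else:
            raise ValueError(f"unknown form: {form!r}")
    return "".join(slots)


noun_code = encode(["nominative", "noun", "singular", "masculine"])
```

Positions that no form claims keep their "-" placeholder, so a partial list yields a valid code.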
position_to_name
Given a 0-based position, return the name of the position.
>>> position_to_name(0)
'part_of_speech'
>>> position_to_name(8)
'degree'
name_to_position
Given a name, return the 0-based position. Some short or alternate names are also accepted.
>>> name_to_position('part_of_speech')
0
>>> name_to_position('pos')
0
>>> name_to_position('degree')
8
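The two lookups are inverses of each other. A minimal sketch of how such a pair could be written is shown here; the function names position_of and name_of are hypothetical, and the alias table contains only the one alias ('pos') that the examples above demonstrate.

```python
# Position names in code order, matching the examples above.
POSITION_NAMES = (
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)

# Only the alias shown in the examples; any others would be guesses.
ALIASES = {"pos": "part_of_speech"}


def position_of(name: str) -> int:
    """Return the 0-based position for a canonical name or a known alias."""
    canonical = ALIASES.get(name, name)
    if canonical not in POSITION_NAMES:
        raise ValueError(f"unknown position name: {name!r}")
    return POSITION_NAMES.index(canonical)


def name_of(position: int) -> str:
    """Return the canonical name for a 0-based position."""
    if not 0 <= position < len(POSITION_NAMES):
        raise IndexError(f"position out of range: {position}")
    return POSITION_NAMES[position]
```

Resolving aliases to a canonical name first keeps the position table itself free of duplicates.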
recreate_sentence
Given a list of (word, part of speech) pairs, recreate the sentence, along with the start and end character positions of each word in the sentence. The positions are 0-based and the end position is inclusive.
words = [
("The", "det"),
("cat", "noun"),
(",", "punctuation"),
("the", "det"),
("dog", "noun"),
(",", "punctuation"),
("and", "conj"),
("the", "det"),
("frog", "noun"),
("sat", "verb"),
("on", "prep"),
("the", "det"),
("mat", "noun"),
(".", "punctuation"),
]
sentence, poss = agdt.recreate_sentence(words)
assert sentence == "The cat, the dog, and the frog sat on the mat."
assert poss == [
(0, 2),
(4, 6),
(7, 7),
(9, 11),
(13, 15),
(16, 16),
(18, 20),
(22, 24),
(26, 29),
(31, 33),
(35, 36),
(38, 40),
(42, 44),
(45, 45),
]
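The spans in the assertion above follow from a simple rule: a space goes before every token except the first one and except punctuation. A minimal sketch of that rule is given below. The function name join_tokens is hypothetical, and the assumption that every punctuation token attaches to the preceding word is a simplification that would not hold for opening quotes or brackets.

```python
def join_tokens(tokens: list[tuple[str, str]]) -> tuple[str, list[tuple[int, int]]]:
    """Join (word, tag) pairs into a sentence and record inclusive character spans."""
    sentence = ""
    spans = []
    for word, tag in tokens:
        # Assumed rule: punctuation hugs the previous token, everything else
        # is separated from it by a single space.
        if sentence and tag != "punctuation":
            sentence += " "
        start = len(sentence)
        sentence += word
        spans.append((start, len(sentence) - 1))  # end index is inclusive
    return sentence, spans


text, spans = join_tokens([
    ("The", "det"),
    ("cat", "noun"),
    (",", "punctuation"),
    ("sat", "verb"),
    (".", "punctuation"),
])
```

With inclusive end positions, a token can be recovered from the sentence as text[start:end + 1].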
Acknowledgements
This package was developed by the WMU Herculaneum Project.
I am grateful for James Tauber's greek_normalisation package, which served as a reference for the normalization options in this package; some of its code is used here.