A morphologically aware version of the T5 model.
morpht5
This package contains the source code for the MorphT5 models used in the Low Resource Interlinear Translation paper.
Installation
pip install morpht5
Components
Models
The package includes three model variants, incorporating morphological features through dedicated embedding layers:
- MorphT5SumModel: Model using positional summation for combining text and morphological tag embeddings
- MorphT5AutoModel: Base model with autoencoder-style morphological tag embeddings
- MorphT5ConcatModel: Model using concatenation for combining text and morphological tag embeddings
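The difference between summation and concatenation can be illustrated with a toy example. This is only a sketch in plain Python lists, not the package's implementation: the helper names `combine_sum` and `combine_concat` and the embedding widths are hypothetical, and how MorphT5 handles the resulting dimensions internally is not described here.

```python
from typing import List

Vector = List[float]


def combine_sum(text_emb: List[Vector], morph_emb: List[Vector]) -> List[Vector]:
    """Add text and tag embeddings position by position (equal widths required)."""
    if len(text_emb) != len(morph_emb):
        raise ValueError("sequences must have the same length")
    combined = []
    for t, m in zip(text_emb, morph_emb):
        if len(t) != len(m):
            raise ValueError("summation needs equal embedding widths")
        combined.append([a + b for a, b in zip(t, m)])
    return combined


def combine_concat(text_emb: List[Vector], morph_emb: List[Vector]) -> List[Vector]:
    """Append each tag embedding to the text embedding at the same position."""
    if len(text_emb) != len(morph_emb):
        raise ValueError("sequences must have the same length")
    return [t + m for t, m in zip(text_emb, morph_emb)]


# Toy data (hypothetical values): two positions, text embeddings of width 3.
text_emb = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
morph_same_width = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]  # width 3, usable for summation
morph_narrow = [[9.0, 8.0], [7.0, 6.0]]                # width 2, usable for concatenation

summed = combine_sum(text_emb, morph_same_width)
concatenated = combine_concat(text_emb, morph_narrow)
print(summed)        # each vector keeps width 3
print(concatenated)  # each vector grows to width 5
```

Summation keeps the embedding width unchanged but requires both embeddings to share it, while concatenation lets the tag embedding have its own width at the cost of a wider combined vector.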
Tokenizer
The package comes with a tokenizer that allows for encoding text and morphological tags into tokens:
>>> from morpht5 import MorphT5Tokenizer
>>> text = ["Λέγει", "αὐτῷ", "ὁ", "Ἰησοῦς", "Ἔγειρε", "ἆρον", "τὸν", "κράβαττόν", "σου", "καὶ", "περιπάτει"]
>>> tags = [
... "V-PIA-3S",
... "PPro-DM3S",
... "Art-NMS",
... "N-NMS",
... "V-PMA-2S",
... "V-AMA-2S",
... "Art-AMS",
... "N-AMS",
... "PPro-G2S",
... "Conj",
... "V-PMA-2S",
... ]
>>> tokenizer = MorphT5Tokenizer.from_pretrained("mrapacz/interlinear-en-philta-emb-auto-diacritics-bh")
>>> inputs = tokenizer(text=text, morph_tags=tags, return_tensors="pt")
>>> inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'input_morphs'])
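In the example above, the word list and the tag list have the same length, one tag per word. A small pre-check along the following lines can catch misaligned inputs before they reach the tokenizer. This is a sketch only: `check_aligned` is a hypothetical helper, not part of morpht5, and the one-tag-per-word requirement is an assumption drawn from the example.

```python
from typing import Sequence


def check_aligned(words: Sequence[str], morph_tags: Sequence[str]) -> None:
    """Raise ValueError unless there is exactly one tag per word (assumed requirement)."""
    if len(words) != len(morph_tags):
        raise ValueError(
            f"got {len(words)} words but {len(morph_tags)} tags; expected one tag per word"
        )


words = ["Λέγει", "αὐτῷ", "ὁ", "Ἰησοῦς"]
tags = ["V-PIA-3S", "PPro-DM3S", "Art-NMS", "N-NMS"]

check_aligned(words, tags)  # aligned input passes silently

# Dropping one tag makes the lists misaligned, which the check reports.
try:
    check_aligned(words, tags[:-1])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```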
Tagsets
The package comes with Enum classes for two morphological tagsets, one compiled from BibleHub and one compiled from Oblubienica:
from morpht5 import BibleHubTag, OblubienicaTag
Both tagsets cover the following morphological features:
- Parts of Speech (Verb, Noun, Adjective, etc.)
- Person (1st, 2nd, 3rd)
- Tense (Present, Imperfect, Future, Aorist, Perfect, Pluperfect)
- Mood (Indicative, Imperative, Subjunctive, Optative, Infinitive, Participle)
- Voice (Active, Middle, Passive)
- Case (Nominative, Vocative, Accusative, Genitive, Dative)
- Number (Singular, Plural)
- Gender (Masculine, Feminine, Neuter)
- Degree (Positive, Comparative, Superlative)
The tagsets differ in their annotation style:
- BibleHub: Compact format (e.g., V-PIA-3S for "Verb - Present Indicative Active - 3rd Person Singular")
- Oblubienica: Verbose format (e.g., vi Pres Act 3 Sg for the same morphological information)
>>> from morpht5 import BibleHubTag, OblubienicaTag
>>> len(BibleHubTag)
684
>>> len(OblubienicaTag)
1070
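To make the compact BibleHub format more concrete, the sketch below expands a finite-verb tag of the shape V-XXX-NN into a readable description. It does not use the package's Enum classes: `describe_verb_tag` is a hypothetical helper, and only the expansion of V-PIA-3S is given in this README. The other letter codes (A for Aorist, M for Imperative, P for Plural) are assumptions inferred from the example tags, so unknown letters are passed through unchanged.

```python
# Letter codes below are assumptions inferred from the example tags in this
# README; they are not the official BibleHub legend and are incomplete.
TENSE = {"P": "Present", "A": "Aorist"}
MOOD = {"I": "Indicative", "M": "Imperative"}
VOICE = {"A": "Active"}
PERSON = {"1": "1st", "2": "2nd", "3": "3rd"}
NUMBER = {"S": "Singular", "P": "Plural"}


def describe_verb_tag(tag: str) -> str:
    """Expand a finite-verb tag such as 'V-PIA-3S' into a readable description."""
    parts = tag.split("-")
    if len(parts) != 3 or parts[0] != "V" or len(parts[1]) != 3 or len(parts[2]) != 2:
        raise ValueError(f"not a finite-verb tag of the form V-XXX-NN: {tag!r}")
    tense, mood, voice = parts[1]
    person, number = parts[2]
    # Unknown codes fall back to the raw letter instead of guessing.
    return "Verb - {} {} {} - {} Person {}".format(
        TENSE.get(tense, tense),
        MOOD.get(mood, mood),
        VOICE.get(voice, voice),
        PERSON.get(person, person),
        NUMBER.get(number, number),
    )


print(describe_verb_tag("V-PIA-3S"))
# Verb - Present Indicative Active - 3rd Person Singular
```

For real use, the BibleHubTag and OblubienicaTag Enum classes shipped with the package are the authoritative source of valid tags.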
Formatting
There's also a utility function for formatting interlinear translations:
>>> from morpht5.utils.formatting import format_interlinear
>>> text = ['Λέγει', 'αὐτῷ', 'ὁ', 'Ἰησοῦς', 'Ἔγειρε', 'ἆρον', 'τὸν', 'κράβαττόν', 'σου', 'καὶ', 'περιπάτει']
>>> tags = ['V-PIA-3S', 'PPro-DM3S', 'Art-NMS', 'N-NMS', 'V-PMA-2S', 'V-AMA-2S', 'Art-AMS', 'N-AMS', 'PPro-G2S', 'Conj', 'V-PMA-2S']
>>> trans_pl = ['Mówi', 'mu', '-', 'Jezus', 'wstawaj', 'weź', '-', 'matę', 'swoją', 'i', 'chodź']
>>> trans_en = ['says', 'to him', '-', 'jesus', 'arise', 'take up', '-', 'mat', 'of you', 'and', 'walk']
>>> print(format_interlinear(text, tags, trans_en, trans_pl))
Λέγει | αὐτῷ | ὁ | Ἰησοῦς | Ἔγειρε | ἆρον | τὸν | κράβαττόν | σου | καὶ | περιπάτει
V-PIA-3S | PPro-DM3S | Art-NMS | N-NMS | V-PMA-2S | V-AMA-2S | Art-AMS | N-AMS | PPro-G2S | Conj | V-PMA-2S
says | to him | - | jesus | arise | take up | - | mat | of you | and | walk
Mówi | mu | - | Jezus | wstawaj | weź | - | matę | swoją | i | chodź
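The layout shown above puts one row per annotation layer, with tokens separated by " | ". A minimal stand-in that produces the same kind of layout might look like the following. It is a sketch under assumptions: `join_rows` is a hypothetical function, and the actual `format_interlinear` may differ, for example in how it pads or aligns columns.

```python
from typing import Sequence


def join_rows(*rows: Sequence[str], sep: str = " | ") -> str:
    """Render each row on its own line, joining its tokens with `sep`."""
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        raise ValueError("all rows must have the same number of tokens")
    return "\n".join(sep.join(row) for row in rows)


text = ["Λέγει", "αὐτῷ"]
tags = ["V-PIA-3S", "PPro-DM3S"]
trans_en = ["says", "to him"]

rendered = join_rows(text, tags, trans_en)
print(rendered)
```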
License
This project is licensed under the MIT License. See the LICENSE file for details.