A morphologically aware version of the T5 model.

morpht5

This package contains the source code for the MorphT5 models used in the Low Resource Interlinear Translation paper.

Installation

pip install morpht5

Components

Models

The package includes three model variants, incorporating morphological features through dedicated embedding layers:

  • MorphT5SumModel: Model using positional summation for combining text and morphological tag embeddings
  • MorphT5AutoModel: Base model with autoencoder-style morphological tag embeddings
  • MorphT5ConcatModel: Model using concatenation for combining text and morphological tag embeddings
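The difference between summation and concatenation can be sketched with plain Python lists. This is only an illustration of the two strategies named above, not the package's implementation: the helper names `combine_sum` and `combine_concat` are hypothetical, and the autoencoder-style variant is not covered.

```python
from typing import List

Vector = List[float]


def combine_sum(token_emb: List[Vector], tag_emb: List[Vector]) -> List[Vector]:
    """Add the tag embedding to the token embedding at each position.

    Assumes both embeddings share the same dimensionality.
    """
    if len(token_emb) != len(tag_emb):
        raise ValueError("token and tag sequences must have the same length")
    return [[t + m for t, m in zip(tok, tag)] for tok, tag in zip(token_emb, tag_emb)]


def combine_concat(token_emb: List[Vector], tag_emb: List[Vector]) -> List[Vector]:
    """Append the tag embedding to the token embedding at each position."""
    if len(token_emb) != len(tag_emb):
        raise ValueError("token and tag sequences must have the same length")
    return [tok + tag for tok, tag in zip(token_emb, tag_emb)]


# Two positions, toy embeddings of dimension 2 (values are made up).
token_emb = [[1.0, 2.0], [3.0, 4.0]]
tag_emb = [[0.5, 0.5], [1.0, -1.0]]

summed = combine_sum(token_emb, tag_emb)            # dimension stays 2
concatenated = combine_concat(token_emb, tag_emb)   # dimension grows to 4
```

Summation keeps the hidden size unchanged, while concatenation widens each position's vector, so a real model would typically need a projection back to the hidden size afterwards.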

Tokenizer

The package comes with a tokenizer that encodes words together with their morphological tags. The text and the tags are passed as parallel lists, one tag per word:

>>> from morpht5 import MorphT5Tokenizer
>>> text = ["Λέγει", "αὐτῷ", "ὁ", "Ἰησοῦς", "Ἔγειρε", "ἆρον", "τὸν", "κράβαττόν", "σου", "καὶ", "περιπάτει"]
>>> tags = [
...     "V-PIA-3S",
...     "PPro-DM3S",
...     "Art-NMS",
...     "N-NMS",
...     "V-PMA-2S",
...     "V-AMA-2S",
...     "Art-AMS",
...     "N-AMS",
...     "PPro-G2S",
...     "Conj",
...     "V-PMA-2S",
... ]
>>> tokenizer = MorphT5Tokenizer.from_pretrained("mrapacz/interlinear-en-philta-emb-auto-diacritics-bh")
>>> inputs = tokenizer(text=text, morph_tags=tags, return_tensors="pt")
>>> inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'input_morphs'])
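The README does not say how word-level tags are lined up with subword tokens. One common approach, shown here purely as an assumption and not as the behaviour of `MorphT5Tokenizer`, is to repeat each word's tag for every subword piece of that word. The helper `expand_tags` and the subword splits below are hypothetical.

```python
from typing import List


def expand_tags(pieces_per_word: List[List[str]], tags: List[str]) -> List[str]:
    """Repeat each word-level tag once per subword piece of that word."""
    if len(pieces_per_word) != len(tags):
        raise ValueError("expected exactly one tag per word")
    expanded: List[str] = []
    for pieces, tag in zip(pieces_per_word, tags):
        expanded.extend([tag] * len(pieces))
    return expanded


# Made-up subword segmentation of the first three words of the example.
pieces = [["Λέγ", "ει"], ["αὐτῷ"], ["ὁ"]]
word_tags = ["V-PIA-3S", "PPro-DM3S", "Art-NMS"]

piece_tags = expand_tags(pieces, word_tags)
```

Under this assumption the tag sequence has the same length as the subword sequence, which is what a per-position embedding lookup would need.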

Tagsets

The package comes with Enum classes for two morphological tagsets, one compiled from BibleHub and one compiled from Oblubienica:

from morpht5 import BibleHubTag, OblubienicaTag

Both tagsets cover comprehensive morphological features:

  • Parts of Speech (Verb, Noun, Adjective, etc.)
  • Person (1st, 2nd, 3rd)
  • Tense (Present, Imperfect, Future, Aorist, Perfect, Pluperfect)
  • Mood (Indicative, Imperative, Subjunctive, Optative, Infinitive, Participle)
  • Voice (Active, Middle, Passive)
  • Case (Nominative, Vocative, Accusative, Genitive, Dative)
  • Number (Singular, Plural)
  • Gender (Masculine, Feminine, Neuter)
  • Degree (Positive, Comparative, Superlative)

The tagsets differ in their annotation style:

  • BibleHub: Compact format (e.g., V-PIA-3S for "Verb - Present Indicative Active - 3rd Person Singular")
  • Oblubienica: Verbose format (e.g., vi Pres Act 3 Sg for the same morphological information)

The two enums also differ in size:

>>> from morpht5 import BibleHubTag, OblubienicaTag
>>> len(BibleHubTag)
684
>>> len(OblubienicaTag)
1070
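To make the compact BibleHub format concrete, here is a minimal decoder for finite verb tags such as V-PIA-3S. It is a sketch, not part of morpht5: the function `describe_finite_verb` is hypothetical, and the letter tables are assumptions based on the usual BibleHub-style convention. Only the reading of V-PIA-3S is confirmed by the text above. The `BibleHubTag` enum remains the authoritative list of tags.

```python
# Letter codes are assumptions (common BibleHub-style convention), not taken from morpht5.
TENSE = {"P": "Present", "I": "Imperfect", "F": "Future",
         "A": "Aorist", "R": "Perfect", "L": "Pluperfect"}
MOOD = {"I": "Indicative", "M": "Imperative", "S": "Subjunctive", "O": "Optative"}
VOICE = {"A": "Active", "M": "Middle", "P": "Passive"}
PERSON = {"1": "1st Person", "2": "2nd Person", "3": "3rd Person"}
NUMBER = {"S": "Singular", "P": "Plural"}


def describe_finite_verb(tag: str) -> str:
    """Expand a tag shaped like 'V-<tense><mood><voice>-<person><number>'."""
    parts = tag.split("-")
    if len(parts) != 3 or parts[0] != "V" or len(parts[1]) != 3 or len(parts[2]) != 2:
        raise ValueError(f"not a finite verb tag: {tag!r}")
    tense, mood, voice = parts[1]
    person, number = parts[2]
    try:
        return (f"Verb - {TENSE[tense]} {MOOD[mood]} {VOICE[voice]}"
                f" - {PERSON[person]} {NUMBER[number]}")
    except KeyError as exc:
        raise ValueError(f"unknown code {exc.args[0]!r} in tag {tag!r}") from None


description = describe_finite_verb("V-PIA-3S")
```

Tags for other parts of speech (for example N-NMS or PPro-DM3S) have a different shape and are deliberately rejected by this sketch.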

Formatting

There's also a utility function for formatting interlinear translations:

>>> from morpht5.utils.formatting import format_interlinear
>>> text = ['Λέγει', 'αὐτῷ', 'ὁ', 'Ἰησοῦς', 'Ἔγειρε', 'ἆρον', 'τὸν', 'κράβαττόν', 'σου', 'καὶ', 'περιπάτει']
>>> tags = ['V-PIA-3S', 'PPro-DM3S', 'Art-NMS', 'N-NMS', 'V-PMA-2S', 'V-AMA-2S', 'Art-AMS', 'N-AMS', 'PPro-G2S', 'Conj', 'V-PMA-2S']
>>> trans_pl = ['Mówi', 'mu', '-', 'Jezus', 'wstawaj', 'weź', '-', 'matę', 'swoją', 'i', 'chodź']
>>> trans_en = ['says', 'to him', '-', 'jesus', 'arise', 'take up', '-', 'mat', 'of you', 'and', 'walk']
>>> print(format_interlinear(text, tags, trans_en, trans_pl))
 Λέγει   |    αὐτῷ   |    ὁ    | Ἰησοῦς |  Ἔγειρε  |   ἆρον   |   τὸν   | κράβαττόν |   σου    | καὶ  | περιπάτει
V-PIA-3S | PPro-DM3S | Art-NMS | N-NMS  | V-PMA-2S | V-AMA-2S | Art-AMS |   N-AMS   | PPro-G2S | Conj |  V-PMA-2S
  says   |   to him  |    -    | jesus  |  arise   | take up  |    -    |    mat    |  of you  | and  |    walk
  Mówi   |     mu    |    -    | Jezus  | wstawaj  |   weź    |    -    |    matę   |  swoją   |  i   |   chodź
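The column layout above can be approximated in a few lines of standard-library Python. This is a sketch of the general idea (pad every cell to the width of the widest cell in its column), not the code behind `format_interlinear`. The name `format_rows` is hypothetical, and the exact padding may differ from the package's output.

```python
from typing import Sequence


def format_rows(*rows: Sequence[str], sep: str = " | ") -> str:
    """Center each cell within its column and join columns with `sep`."""
    if not rows:
        return ""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must contain the same number of cells")
    # Column width = length of the widest cell in that column.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = [
        sep.join(cell.center(width) for cell, width in zip(row, widths))
        for row in rows
    ]
    return "\n".join(lines)


table = format_rows(
    ["Λέγει", "αὐτῷ"],
    ["V-PIA-3S", "PPro-DM3S"],
    ["says", "to him"],
)
print(table)
```

Note that `len` counts code points, so text containing combining diacritics or wide characters may not line up perfectly in every terminal.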

License

This project is licensed under the MIT License. See the LICENSE file for details.
