Morpheme Segmentation in Multi- and Monolingual Wordlists

This package provides implementations for several algorithms by which words in a wordlist can be segmented into morphemes.

If you use this software package, please cite it accordingly:

Rubehn, A. and J.-M. List (2025). MorSeg: A Python package for morpheme segmentation in multi- and monolingual wordlists [Software Library, Version 0.1]. Chair for Multilingual Computational Linguistics, University of Passau.

Installation

This package can be conveniently installed using pip:

pip install morseg

Basic Usage

Loading data

Assuming your data is presented in a TSV file following the LingPy specifications (see /tests/test_data/german.tsv for an example), you can simply load your data with:

from morseg.utils.wrappers import WordlistWrapper

wl = WordlistWrapper.from_file(YOUR_FILE)

This creates a wordlist wrapper object: a representation of a wordlist with three annotation levels, namely the segmentations predicted by a model, the gold-standard segmentations, and the unsegmented forms. All models require the training data to be stored in this class.
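To make the annotation levels more concrete, the following sketch reads a tiny wordlist with the standard library only and derives the gold-standard and unsegmented levels from it. It does not use morseg or WordlistWrapper. The sample rows, the column names (ID, DOCULECT, CONCEPT, TOKENS), the "+" morpheme boundary marker, and the helper read_forms are assumptions made for illustration, loosely following common LingPy conventions; check /tests/test_data/german.tsv for the format the package actually expects.

```python
import csv
import io

# Hypothetical miniature wordlist. Column names and the "+" boundary
# marker are assumptions for this sketch, not taken from morseg itself.
SAMPLE_TSV = (
    "ID\tDOCULECT\tCONCEPT\tTOKENS\n"
    "1\tGerman\tchild\tk i n d\n"
    "2\tGerman\tchildren\tk i n d + e r\n"
)


def read_forms(handle, column="TOKENS", boundary="+"):
    """Return one dict per row with the gold and unsegmented levels."""
    entries = []
    for row in csv.DictReader(handle, delimiter="\t"):
        tokens = row[column].split()
        # Gold standard: a list of morphemes, each a list of segments.
        gold = [[]]
        for token in tokens:
            if token == boundary:
                gold.append([])
            else:
                gold[-1].append(token)
        # Unsegmented form: the same segments without boundary markers.
        unsegmented = [t for t in tokens if t != boundary]
        entries.append({"id": row["ID"], "gold": gold, "unsegmented": unsegmented})
    return entries


entries = read_forms(io.StringIO(SAMPLE_TSV))
for entry in entries:
    print(entry["id"], entry["unsegmented"], entry["gold"])
```

The third level, the predicted segmentation, would only be filled in once a model has been trained on the data.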

Training a model

The Tokenizer class offers a unified interface for all models implemented in this library. For example, to train an LSV (Letter Successor Variety) model, you can do the following:

from morseg.algorithms.tokenizer import LSVTokenizer

model = LSVTokenizer()
model.train(wl)
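For readers unfamiliar with Letter Successor Variety, the idea can be illustrated in a few lines of plain Python: for every word-initial prefix, count how many distinct segments follow it in the wordlist, and place a boundary where that count is high. The sketch below is a toy version under stated assumptions, not the implementation behind LSVTokenizer; the function names, the fixed threshold of 2, and the English sample words are all hypothetical.

```python
from collections import defaultdict


def successor_varieties(words):
    """Map each word-initial prefix to the set of segments that follow it."""
    successors = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            successors[tuple(word[:i])].add(word[i])
    return successors


def segment(word, successors, threshold=2):
    """Split after every prefix followed by at least `threshold` distinct segments."""
    morphemes, current = [], []
    for i, seg in enumerate(word):
        current.append(seg)
        is_last = i == len(word) - 1
        variety = len(successors.get(tuple(word[: i + 1]), ()))
        if not is_last and variety >= threshold:
            morphemes.append(current)
            current = []
    morphemes.append(current)
    return morphemes


# Hypothetical training data: each word is a list of segments.
words = [list(w) for w in ["walks", "walked", "walking", "talks", "talked", "talking"]]
successors = successor_varieties(words)

print(["".join(m) for m in segment(list("walked"), successors)])  # ['walk', 'ed']
```

In this toy data the prefix "walk" is followed by three different segments (s, e, i), so a boundary is placed after it, while all other prefixes have a single successor. Practical LSV variants typically use more refined criteria, such as local peaks or combining successor and predecessor counts.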

The current release covers implementations of the following models:

Furthermore, some popular models for subword tokenization are implemented:

Obtain segmentations

You can obtain the predicted segmentations from your training data by calling:

for segmented_word in model.get_segmentations():
    print(segmented_word)  # replace with your own processing

You can also segment unseen words, although the quality of the result depends on the model:

word = ["w", "o", "r", "d"]
segmented_word = model(word)
