Morpheme Segmentation in Multi- and Monolingual Wordlists
This package provides implementations for several algorithms by which words in a wordlist can be segmented into morphemes.
If you use this software package, please cite it accordingly:
Rubehn, A. and J.-M. List (2025). MorSeg: A Python package for morpheme segmentation in multi- and monolingual wordlists [Software Library, Version 0.1]. Chair for Multilingual Computational Linguistics, University of Passau.
Installation
This package can be conveniently installed using pip:
pip install morseg
Basic Usage
Loading data
Assuming your data is presented in a TSV file following the LingPy specifications (see /tests/test_data/german.tsv for an example), you can simply load your data with:
from morseg.utils.wrappers import WordlistWrapper
wl = WordlistWrapper.from_file(YOUR_FILE)
This creates a wordlist wrapper object, a representation of a wordlist with three annotation levels: the segmentations predicted by a model, the gold-standard segmentations, and the unsegmented forms. All models require the training data to be stored in this class.
Training a model
The Tokenizer class offers a unified interface for all models implemented in this library. For example, to train an LSV (Letter Successor Variety) model, you can do the following:
from morseg.algorithms.tokenizer import LSVTokenizer
model = LSVTokenizer()
model.train(wl)
The current release covers implementations of the following models:
- LSVTokenizer: Letter Successor Variety (Harris, 1955) with the following adaptations:
  - Letter Successor Entropy (Hafer and Weiss, 1974)
  - Letter Max-Drop Variety (Hammarström, 2009)
  - Normalized Letter Successor Variety (Çöltekin, 2010)
- LPVTokenizer: Letter Predecessor Variety (analogous to LSV, but processing the words backwards)
- LSPVTokenizer: A combination of Letter Successor Variety and Letter Predecessor Variety
- Morfessor: The Morfessor Baseline Model (Creutz and Lagus, 2002)
- SquareEntropyTokenizer (Méndez-Cruz et al., 2016)
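To give an intuition for Letter Successor Variety, the following is a minimal, self-contained sketch of the underlying idea. It is not the implementation shipped with this package: the function names `successor_varieties` and `segment`, the end-of-word marker `#`, and the fixed `threshold` are assumptions made purely for illustration. For each prefix, the sketch counts how many distinct letters can follow it in the wordlist and places a boundary wherever that count reaches the threshold.

```python
from collections import defaultdict


def successor_varieties(words):
    """Map each prefix to the set of symbols that follow it in the wordlist."""
    successors = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            successors[tuple(word[:i])].add(word[i])
        # Hypothetical end marker: a complete word counts as one continuation.
        successors[tuple(word)].add("#")
    return successors


def segment(word, successors, threshold=3):
    """Split after every prefix whose successor variety reaches the threshold."""
    morphemes, start = [], 0
    for i in range(1, len(word)):
        variety = len(successors.get(tuple(word[:i]), ()))
        if variety >= threshold:
            morphemes.append(word[start:i])
            start = i
    morphemes.append(word[start:])
    return morphemes


# Toy wordlist; every word is a list of segments, as in the examples above.
toy_words = [
    list(w)
    for w in ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
]
varieties = successor_varieties(toy_words)
walking = segment(list("walking"), varieties)
talked = segment(list("talked"), varieties)
```

In this toy wordlist, the prefix "walk" can be followed by four different symbols and "talk" by three, so both words are split after the stem. The adaptations listed above replace the raw count with other criteria such as entropy or normalized scores.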
Furthermore, some popular models for subword tokenization are implemented:
- PairEncoding: Byte-Pair Encoding (Sennrich et al., 2016)
- WordPiece (Schuster and Nakajima, 2012)
- UnigramSentencePiece (Kudo, 2018)
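As an illustration of how subword tokenizers of this kind operate, here is a simplified sketch of the merge-learning step of Byte-Pair Encoding. It is an assumption-laden toy version and not the package's PairEncoding class: the names `learn_bpe` and `merge_pair` are hypothetical, word frequencies and end-of-word markers are omitted, and ties between equally frequent pairs are broken alphabetically so that the result is deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break  # every word is already a single symbol
        # Illustrative tie-break: highest count first, then alphabetical order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


merges, corpus = learn_bpe(["low", "lower", "lowest", "low"], num_merges=2)
```

After two merges, the frequent stem "low" has become a single symbol, while the rarer endings remain split into individual letters. Note that the resulting units are driven by frequency and need not coincide with morphemes.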
Obtain segmentations
You can obtain the predicted segmentations from your training data by calling:
for segmented_word in model.get_segmentations():
    print(segmented_word)  # replace with whatever you want to do with each word
You can also try segmenting unseen words (depending on the model, this might work more or less well):
word = ["w", "o", "r", "d"]
segmented_word = model(word)