SigMorphon dataset utilities with typed TSV loading, downloads, and MorphDataset generation.
sigmorphon-vp
sigmorphon-vp is a typed utility package for downloading, converting, merging, and pre-encoding SigMorphon-style morphological reinflection datasets.
PyPI package name:
pip install sigmorphon-vp
Import name:
import sigmorphon
The package is designed to work well with chartoken-vp, but it stays publishable as its own package.
What it provides
sigmorphon-vp covers the data layer around morphological reinflection:
- download helpers for SigMorphon 2021 Task 0 style data
- conversion from upstream raw files into a consistent internal TSV format
- merge helpers for multi-language training corpora
- typed TSV loading
- an in-memory MorphDataset that pre-encodes examples into tensors
Internal TSV format
The package converts data into a normalized 4-column TSV format:
lemma<TAB>features<TAB>surface<TAB>lang
Example:
walk V;PST walked eng
Comment lines beginning with # are allowed and ignored by the loaders.
Installation
Requirements:
- Python >=3.14
- PyTorch >=2.0
- chartoken-vp >=2.1.0
Install from PyPI:
pip install sigmorphon-vp
Downloading datasets
The downloader exposes:
- DATASETS
- download_language
- download_all
- get_available_languages
- merge_tsv
Quick example:
from pathlib import Path
from sigmorphon import download_all, merge_tsv
out_dir = Path("data")
download_all(["rus", "bul", "spa"], out_dir)
train_files = sorted(out_dir.glob("*_train.tsv"))
merge_tsv(train_files, out_dir / "merged_train.tsv")
What download does
For each requested language, the package:
- downloads upstream raw files
- stores them under raw/
- converts them to the internal TSV layout
- deduplicates rows with an MD5 hash
- writes *_train.tsv and *_dev.tsv
If converted files already exist, the downloader skips them.
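One plausible reading of the MD5 deduplication step is hashing each converted row and keeping only the first occurrence. The sketch below is an assumption about the mechanism, not the package's actual code:

```python
import hashlib

def dedup_rows(lines: list[str]) -> list[str]:
    """Drop duplicate TSV rows, keyed by the MD5 digest of the full line."""
    seen: set[str] = set()
    out: list[str] = []
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(line)
    return out
```

Hashing the whole line means two rows are duplicates only if all four columns match exactly.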
Discovering languages
get_available_languages() tries to read the list of languages from the SigMorphon GitHub repository. If that request fails, it falls back to the built-in DATASETS mapping.
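The try-remote-then-fall-back pattern described above can be sketched as follows; `fetch_remote` is a hypothetical callable standing in for the GitHub request, and the exact exception handling is an assumption:

```python
def available_languages(fetch_remote, fallback: dict) -> list[str]:
    """Return language codes from the remote listing, or from the
    built-in fallback mapping when the request fails."""
    try:
        return sorted(fetch_remote())
    except Exception:
        # Network error, rate limit, changed repo layout, etc.
        return sorted(fallback)
```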
Loading data
Simple TSV loading:
from sigmorphon import load_tsv
rows = load_tsv("data/rus_train.tsv")
Glob pattern loading:
from sigmorphon import load_tsv_pattern
rows = load_tsv_pattern("data/*_train.tsv")
Returned row type:
MorphRow = tuple[str, list[str], str, str]
That is:
- lemma
- feature list
- surface form
- language code
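Because the language code is the fourth field, grouping rows and building the `lang_to_id` mapping that MorphDataset expects is straightforward. These helpers are illustrative, assuming sorted, zero-based language ids:

```python
from collections import defaultdict

MorphRow = tuple[str, list[str], str, str]

def rows_by_language(rows: list[MorphRow]) -> dict[str, list[MorphRow]]:
    """Group loaded rows by their language code (the fourth field)."""
    grouped: dict[str, list[MorphRow]] = defaultdict(list)
    for row in rows:
        grouped[row[3]].append(row)
    return dict(grouped)

def build_lang_to_id(rows: list[MorphRow]) -> dict[str, int]:
    """Assign stable integer ids to the language codes seen in rows."""
    return {lang: i for i, lang in enumerate(sorted({r[3] for r in rows}))}
```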
MorphDataset
MorphDataset is the package's main runtime component. It converts rows into ready-to-train tensors using:
- a CharVocab
- a FeatureVocab
- a lang_to_id mapping
Constructor inputs:
- rows
- char_vocab
- feature_vocab
- lang_to_id
- max_len
- max_features
- pin_memory
Produced tensors:
- source character ids
- target character ids
- feature ids
- feature masks
- language ids
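A simplified stand-in for this pre-encoding step is shown below: characters and feature tags are mapped to integer ids, padded to fixed lengths (`max_len`, `max_features`), with a 0/1 mask marking real features. The exact padding and masking scheme is an assumption:

```python
def encode_chars(text: str, char_to_id: dict[str, int],
                 max_len: int, pad_id: int = 0) -> list[int]:
    """Map characters to ids, truncating and right-padding to max_len."""
    ids = [char_to_id[ch] for ch in text[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))

def encode_features(tags: list[str], feat_to_id: dict[str, int],
                    max_features: int, pad_id: int = 0) -> tuple[list[int], list[int]]:
    """Map feature tags to ids, plus a 0/1 mask marking valid positions."""
    ids = [feat_to_id[t] for t in tags[:max_features]]
    mask = [1] * len(ids) + [0] * (max_features - len(ids))
    return ids + [pad_id] * (max_features - len(ids)), mask
```

Stacking these per-example lists across the corpus yields the dataset-wide tensors listed above.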
Batch serving
MorphDataset.epoch_batches(...) yields batches in the following typed layout:
MorphBatch = tuple[
torch.Tensor, # source
torch.Tensor, # target
torch.Tensor, # language ids
torch.Tensor, # feature ids
torch.Tensor, # feature mask
]
Behavior:
- shuffles sample order each epoch
- supports CUDA-aware non-blocking transfers
- uses a secondary CUDA stream for prefetch when available
- keeps all pre-encoded samples in RAM for fast iteration
Example:
import torch
from chartoken import CharVocab, FeatureVocab
from sigmorphon import MorphDataset, load_tsv
rows = load_tsv("data/rus_train.tsv")
char_vocab = CharVocab.from_texts([lemma for lemma, _, _, _ in rows] + [surface for _, _, surface, _ in rows])
feature_vocab = FeatureVocab.from_tags([tags for _, tags, _, _ in rows])
lang_to_id = {"rus": 0}
dataset = MorphDataset(rows, char_vocab, feature_vocab, lang_to_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for source, target, lang_ids, feature_ids, feature_mask in dataset.epoch_batches(32, device):
    print(source.shape, target.shape)
    break
Utility methods
MorphDataset.memory_bytes() estimates how much RAM the encoded tensors occupy.
That is useful when you want to:
- compare eager dataset sizes across languages
- plan batch sizes
- decide whether to split or merge corpora
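The arithmetic behind such an estimate is element count times element size; the helper below illustrates it and is not memory_bytes() itself:

```python
def tensor_bytes(shape: tuple[int, ...], itemsize: int) -> int:
    """Estimate a tensor's memory footprint: product of dims × bytes per element."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize

# e.g. 10,000 source sequences of length 32 stored as int64 (8 bytes each):
tensor_bytes((10_000, 32), 8)  # → 2,560,000 bytes, about 2.4 MiB
```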
Intended scope
This package intentionally focuses on morphological dataset handling. It does not contain:
- neural network layers
- training loops
- checkpointing
- model definitions
That separation keeps the package reusable across multiple applications, not just morphoformer.
File details
Details for the file sigmorphon_vp-2.1.4.tar.gz.
File metadata
- Download URL: sigmorphon_vp-2.1.4.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2bf31c73e1c91493649ac6d6fba3b8aedb94c8e7a41f8bfc46831114a482c70 |
| MD5 | 85ebd075b87f597343e3447393f4e5c5 |
| BLAKE2b-256 | 2d6be799c40a2cee756bee643b70c356264c1fc1e4b60038b830c910bdde2e80 |
File details
Details for the file sigmorphon_vp-2.1.4-py3-none-any.whl.
File metadata
- Download URL: sigmorphon_vp-2.1.4-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05fd04e350d962a6a0843db49206ee757cc0c122c1d4786c88cc5db5cddac6db |
| MD5 | a8102ee02cc7d4936ae83305dc571874 |
| BLAKE2b-256 | a9521aa038b5e5b52a2185fe622c39ad1930963e19ef673106862fc1fa22206b |