SigMorphon dataset utilities with typed TSV loading, downloads, and MorphDataset generation.

sigmorphon-vp

sigmorphon-vp is a typed utility package for downloading, converting, merging, and pre-encoding SigMorphon-style morphological reinflection datasets.

PyPI package name:

pip install sigmorphon-vp

Import name:

import sigmorphon

The package is designed to work well with chartoken-vp, but it remains usable and publishable as a standalone package.

What it provides

sigmorphon-vp covers the data layer around morphological reinflection:

  • download helpers for SigMorphon 2021 Task 0 style data
  • conversion from upstream raw files into a consistent internal TSV format
  • merge helpers for multi-language training corpora
  • typed TSV loading
  • an in-memory MorphDataset that pre-encodes examples into tensors

Internal TSV format

The package converts data into a normalized 4-column TSV format:

lemma<TAB>features<TAB>surface<TAB>lang

Example:

walk	V;PST	walked	eng

Comment lines beginning with # are allowed and ignored by the loaders.
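As an illustration only (not the package's actual loader), a minimal parser for this format might look like:

```python
def parse_morph_tsv(lines):
    """Parse internal 4-column TSV lines: lemma<TAB>features<TAB>surface<TAB>lang.

    Blank lines and comment lines starting with '#' are skipped.
    Features are split on ';' into a list.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        lemma, features, surface, lang = line.split("\t")
        rows.append((lemma, features.split(";"), surface, lang))
    return rows

rows = parse_morph_tsv(["# comment", "walk\tV;PST\twalked\teng"])
# rows == [("walk", ["V", "PST"], "walked", "eng")]
```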

Installation

Requirements:

  • Python >=3.14
  • PyTorch >=2.0
  • chartoken-vp>=2.1.0

Install from PyPI:

pip install sigmorphon-vp

Downloading datasets

The downloader exposes:

  • DATASETS
  • download_language
  • download_all
  • get_available_languages
  • merge_tsv

Quick example:

from pathlib import Path

from sigmorphon import download_all, merge_tsv

out_dir = Path("data")
download_all(["rus", "bul", "spa"], out_dir)

train_files = sorted(out_dir.glob("*_train.tsv"))
merge_tsv(train_files, out_dir / "merged_train.tsv")

What download does

For each requested language, the package:

  1. downloads upstream raw files
  2. stores them under raw/
  3. converts them to the internal TSV layout
  4. deduplicates rows with an MD5 hash
  5. writes *_train.tsv and *_dev.tsv

If converted files already exist, the downloader skips them.
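Step 4 above (hash-based deduplication) can be sketched as follows; `dedup_rows` is a hypothetical helper for illustration, not part of the package API:

```python
import hashlib

def dedup_rows(lines):
    """Drop exact-duplicate TSV rows, keyed by the MD5 of each line.

    MD5 is used here purely as a compact dedup key, not for security.
    """
    seen = set()
    out = []
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(line)
    return out
```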

Discovering languages

get_available_languages() tries to read the list of languages from the SigMorphon GitHub repository. If that request fails, it falls back to the built-in DATASETS mapping.
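The try-remote-then-fall-back structure can be sketched like this; the `DATASETS` entries and the `url` parameter are placeholders, not the package's real values:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical built-in mapping of language code -> source location.
DATASETS = {"eng": "<placeholder>", "rus": "<placeholder>"}

def get_available_languages(url=None):
    """Try to fetch a remote language listing; fall back to DATASETS keys."""
    if url is not None:
        try:
            with urlopen(url, timeout=10) as resp:
                return sorted(json.load(resp))
        except (URLError, ValueError, OSError):
            pass  # network or parse failure: fall back to the built-in list
    return sorted(DATASETS)
```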

Loading data

Simple TSV loading:

from sigmorphon import load_tsv

rows = load_tsv("data/rus_train.tsv")

Glob pattern loading:

from sigmorphon import load_tsv_pattern

rows = load_tsv_pattern("data/*_train.tsv")

Returned row type:

MorphRow = tuple[str, list[str], str, str]

That is:

  • lemma
  • feature list
  • surface form
  • language code
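Putting the type and the earlier TSV example together, a single row looks like:

```python
# The typed row layout: (lemma, feature list, surface form, language code).
MorphRow = tuple[str, list[str], str, str]

row: MorphRow = ("walk", ["V", "PST"], "walked", "eng")
lemma, features, surface, lang = row
```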

MorphDataset

MorphDataset is the package's main runtime component. It converts rows into ready-to-train tensors using:

  • CharVocab
  • FeatureVocab
  • a lang_to_id mapping

Constructor inputs:

  • rows
  • char_vocab
  • feature_vocab
  • lang_to_id
  • max_len
  • max_features
  • pin_memory

Produced tensors:

  • source character ids
  • target character ids
  • feature ids
  • feature masks
  • language ids

Batch serving

MorphDataset.epoch_batches(...) yields batches in the following typed layout:

MorphBatch = tuple[
    torch.Tensor,  # source
    torch.Tensor,  # target
    torch.Tensor,  # language ids
    torch.Tensor,  # feature ids
    torch.Tensor,  # feature mask
]

Behavior:

  • shuffles sample order each epoch
  • supports CUDA-aware non-blocking transfers
  • uses a secondary CUDA stream for prefetch when available
  • keeps all pre-encoded samples in RAM for fast iteration
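The per-epoch shuffling can be illustrated with a simplified, pure-Python batching loop; this is a sketch of the behavior only, since the real method also handles device transfer and prefetch:

```python
import random

def epoch_batches(samples, batch_size, seed=None):
    """Yield batches in a freshly shuffled order on each call."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]
```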

Example:

import torch

from chartoken import CharVocab, FeatureVocab
from sigmorphon import MorphDataset, load_tsv

rows = load_tsv("data/rus_train.tsv")
char_vocab = CharVocab.from_texts(
    [lemma for lemma, _, _, _ in rows] + [surface for _, _, surface, _ in rows]
)
feature_vocab = FeatureVocab.from_tags([tags for _, tags, _, _ in rows])
lang_to_id = {"rus": 0}

dataset = MorphDataset(rows, char_vocab, feature_vocab, lang_to_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for source, target, lang_ids, feature_ids, feature_mask in dataset.epoch_batches(32, device):
    print(source.shape, target.shape)
    break

Utility methods

MorphDataset.memory_bytes() estimates how much RAM the encoded tensors occupy.

That is useful when you want to:

  • compare eager dataset sizes across languages
  • plan batch sizes
  • decide whether to split or merge corpora
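Such an estimate reduces to summing numel() * element_size() over the stored tensors; the sketch below assumes this is roughly how memory_bytes() is computed, which is not confirmed by the package itself:

```python
import torch

def tensor_memory_bytes(tensors):
    """Estimate RAM used by a collection of tensors, in bytes."""
    return sum(t.numel() * t.element_size() for t in tensors)
```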

Intended scope

This package intentionally focuses on morphological dataset handling. It does not contain:

  • neural network layers
  • training loops
  • checkpointing
  • model definitions

That separation keeps the package reusable across multiple applications, not just morphoformer.

