SigMorphon dataset utilities with typed TSV loading, downloads, and MorphDataset generation.

sigmorphon-vp

sigmorphon-vp is a typed utility package for downloading, converting, merging, and pre-encoding SigMorphon-style morphological reinflection datasets.

PyPI package name:

pip install sigmorphon-vp

Import name:

import sigmorphon

The package is designed to work well with chartoken-vp, but it remains usable and publishable as a standalone package.

What it provides

sigmorphon-vp covers the data layer around morphological reinflection:

  • download helpers for SigMorphon 2021 Task 0 style data
  • conversion from upstream raw files into a consistent internal TSV format
  • merge helpers for multi-language training corpora
  • typed TSV loading
  • an in-memory MorphDataset that pre-encodes examples into tensors

Internal TSV format

The package converts data into a normalized 4-column TSV format:

lemma<TAB>features<TAB>surface<TAB>lang

Example:

walk	V;PST	walked	eng

Comment lines beginning with # are allowed and ignored by the loaders.
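As an illustration only (not the package's actual loader), a minimal parser for this format might look like:

```python
def parse_morph_tsv(lines):
    """Parse internal 4-column TSV lines: lemma<TAB>features<TAB>surface<TAB>lang.

    Blank lines and comment lines starting with '#' are skipped.
    Features are split on ';' into a list.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        lemma, features, surface, lang = line.split("\t")
        rows.append((lemma, features.split(";"), surface, lang))
    return rows

rows = parse_morph_tsv(["# comment", "walk\tV;PST\twalked\teng"])
# rows == [("walk", ["V", "PST"], "walked", "eng")]
```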

Installation

Requirements:

  • Python >=3.14
  • PyTorch >=2.0
  • chartoken-vp>=2.1.0

Install from PyPI:

pip install sigmorphon-vp

Downloading datasets

The downloader exposes:

  • DATASETS
  • download_language
  • download_all
  • get_available_languages
  • merge_tsv

Quick example:

from pathlib import Path

from sigmorphon import download_all, merge_tsv

out_dir = Path("data")
download_all(["rus", "bul", "spa"], out_dir)

train_files = sorted(out_dir.glob("*_train.tsv"))
merge_tsv(train_files, out_dir / "merged_train.tsv")

What download does

For each requested language, the package:

  1. downloads upstream raw files
  2. stores them under raw/
  3. converts them to the internal TSV layout
  4. deduplicates rows with an MD5 hash
  5. writes *_train.tsv and *_dev.tsv

If converted files already exist, the downloader skips them.
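Step 4 above (hash-based deduplication) can be sketched as follows; `dedup_rows` is a hypothetical helper for illustration, not part of the package API:

```python
import hashlib

def dedup_rows(lines):
    """Drop exact-duplicate TSV rows, keyed by the MD5 of each line.

    MD5 is used here purely as a compact dedup key, not for security.
    """
    seen = set()
    out = []
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(line)
    return out
```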

Discovering languages

get_available_languages() tries to read the list of languages from the SigMorphon GitHub repository. If that request fails, it falls back to the built-in DATASETS mapping.
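The try-remote-then-fall-back structure can be sketched like this; the `DATASETS` entries and the `url` parameter are placeholders, not the package's real values:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical built-in mapping of language code -> source location.
DATASETS = {"eng": "<placeholder>", "rus": "<placeholder>"}

def get_available_languages(url=None):
    """Try to fetch a remote language listing; fall back to DATASETS keys."""
    if url is not None:
        try:
            with urlopen(url, timeout=10) as resp:
                return sorted(json.load(resp))
        except (URLError, ValueError, OSError):
            pass  # network or parse failure: fall back to the built-in list
    return sorted(DATASETS)
```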

Loading data

Simple TSV loading:

from sigmorphon import load_tsv

rows = load_tsv("data/rus_train.tsv")

Glob pattern loading:

from sigmorphon import load_tsv_pattern

rows = load_tsv_pattern("data/*_train.tsv")

Returned row type:

MorphRow = tuple[str, list[str], str, str]

That is:

  • lemma
  • feature list
  • surface form
  • language code
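Putting the type and the earlier TSV example together, a single row looks like:

```python
# The typed row layout: (lemma, feature list, surface form, language code).
MorphRow = tuple[str, list[str], str, str]

row: MorphRow = ("walk", ["V", "PST"], "walked", "eng")
lemma, features, surface, lang = row
```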

MorphDataset

MorphDataset is the package's main runtime component. It converts rows into ready-to-train tensors using:

  • CharVocab
  • FeatureVocab
  • a lang_to_id mapping

Constructor inputs:

  • rows
  • char_vocab
  • feature_vocab
  • lang_to_id
  • max_len
  • max_features
  • pin_memory

Produced tensors:

  • source character ids
  • target character ids
  • feature ids
  • feature masks
  • language ids

Batch serving

MorphDataset.epoch_batches(...) yields batches in the following typed layout:

MorphBatch = tuple[
    torch.Tensor,  # source
    torch.Tensor,  # target
    torch.Tensor,  # language ids
    torch.Tensor,  # feature ids
    torch.Tensor,  # feature mask
]

Behavior:

  • shuffles sample order each epoch
  • supports CUDA-aware non-blocking transfers
  • uses a secondary CUDA stream for prefetch when available
  • keeps all pre-encoded samples in RAM for fast iteration
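The per-epoch shuffling can be illustrated with a simplified, pure-Python batching loop; this is a sketch of the behavior only, since the real method also handles device transfer and prefetch:

```python
import random

def epoch_batches(samples, batch_size, seed=None):
    """Yield batches in a freshly shuffled order on each call."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]
```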

Example:

import torch

from chartoken import CharVocab, FeatureVocab
from sigmorphon import MorphDataset, load_tsv

rows = load_tsv("data/rus_train.tsv")
char_vocab = CharVocab.from_texts(
    [lemma for lemma, _, _, _ in rows] + [surface for _, _, surface, _ in rows]
)
feature_vocab = FeatureVocab.from_tags([tags for _, tags, _, _ in rows])
lang_to_id = {"rus": 0}

dataset = MorphDataset(rows, char_vocab, feature_vocab, lang_to_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for source, target, lang_ids, feature_ids, feature_mask in dataset.epoch_batches(32, device):
    print(source.shape, target.shape)
    break

Utility methods

MorphDataset.memory_bytes() estimates how much RAM the encoded tensors occupy.

That is useful when you want to:

  • compare eager dataset sizes across languages
  • plan batch sizes
  • decide whether to split or merge corpora
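Such an estimate reduces to summing numel() * element_size() over the stored tensors; the sketch below assumes this is roughly how memory_bytes() is computed, which is not confirmed by the package itself:

```python
import torch

def tensor_memory_bytes(tensors):
    """Estimate RAM used by a collection of tensors, in bytes."""
    return sum(t.numel() * t.element_size() for t in tensors)
```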

Intended scope

This package intentionally focuses on morphological dataset handling. It does not contain:

  • neural network layers
  • training loops
  • checkpointing
  • model definitions

That separation keeps the package reusable across multiple applications, not just morphoformer.

