SigMorphon dataset utilities with typed TSV loading, downloads, and MorphDataset generation.

Project description

sigmorphon-vp

sigmorphon-vp is a typed utility package for downloading, converting, merging, and pre-encoding SigMorphon-style morphological reinflection datasets.

PyPI package name:

pip install sigmorphon-vp

Import name:

import sigmorphon

The package is designed to work well with chartoken-vp, but it remains independently publishable as a standalone package.

What it provides

sigmorphon-vp covers the data layer around morphological reinflection:

  • download helpers for SigMorphon 2021 Task 0 style data
  • conversion from upstream raw files into a consistent internal TSV format
  • merge helpers for multi-language training corpora
  • typed TSV loading
  • an in-memory MorphDataset that pre-encodes examples into tensors

Internal TSV format

The package converts data into a normalized 4-column TSV format:

lemma<TAB>features<TAB>surface<TAB>lang

Example:

walk	V;PST	walked	eng

Comment lines beginning with # are allowed and ignored by the loaders.
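
A minimal sketch of how a loader might interpret this layout (parse_morph_tsv is illustrative only, not the package's actual loader):

```python
# Hypothetical parser for the 4-column internal TSV format
# (lemma<TAB>features<TAB>surface<TAB>lang). Illustrative only.
MorphRow = tuple[str, list[str], str, str]

def parse_morph_tsv(lines: list[str]) -> list[MorphRow]:
    rows: list[MorphRow] = []
    for line in lines:
        line = line.rstrip("\n")
        # Comment lines beginning with '#' and blank lines are skipped.
        if not line or line.startswith("#"):
            continue
        lemma, features, surface, lang = line.split("\t")
        # Features are ';'-separated tags, e.g. "V;PST" -> ["V", "PST"].
        rows.append((lemma, features.split(";"), surface, lang))
    return rows
```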

Installation

Requirements:

  • Python >=3.14
  • PyTorch >=2.0
  • chartoken-vp>=2.1.0

Install from PyPI:

pip install sigmorphon-vp

Downloading datasets

The downloader exposes:

  • DATASETS
  • download_language
  • download_all
  • get_available_languages
  • merge_tsv

Quick example:

from pathlib import Path

from sigmorphon import download_all, merge_tsv

out_dir = Path("data")
download_all(["rus", "bul", "spa"], out_dir)

train_files = sorted(out_dir.glob("*_train.tsv"))
merge_tsv(train_files, out_dir / "merged_train.tsv")

What download does

For each requested language, the package:

  1. downloads upstream raw files
  2. stores them under raw/
  3. converts them to the internal TSV layout
  4. deduplicates rows with an MD5 hash
  5. writes *_train.tsv and *_dev.tsv

If converted files already exist, the downloader skips them.
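
The deduplication step (4) can be sketched like this; dedupe_rows is a stand-in, not the package's internal API:

```python
# Sketch of hash-based row deduplication: keep the first occurrence of
# each row, keyed by the MD5 digest of the row text. Illustrative only.
import hashlib

def dedupe_rows(lines: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(line)
    return unique
```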

Discovering languages

get_available_languages() tries to read the list of languages from the SigMorphon GitHub repository. If that request fails, it falls back to the built-in DATASETS mapping.
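
The try-then-fallback pattern looks roughly like the following sketch; fetch_remote and the DATASETS contents here are stand-ins, while the real function queries the SigMorphon GitHub repository:

```python
# Sketch of the network-with-fallback pattern. DATASETS entries and
# fetch_remote are hypothetical stand-ins for illustration.
DATASETS = {"rus": "...", "bul": "...", "spa": "..."}

def fetch_remote() -> list[str]:
    raise OSError("network unavailable")  # simulate a failed request

def get_available_languages_sketch() -> list[str]:
    try:
        return fetch_remote()
    except OSError:
        # Request failed: fall back to the built-in DATASETS mapping.
        return sorted(DATASETS)
```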

Loading data

Simple TSV loading:

from sigmorphon import load_tsv

rows = load_tsv("data/rus_train.tsv")

Glob pattern loading:

from sigmorphon import load_tsv_pattern

rows = load_tsv_pattern("data/*_train.tsv")

Returned row type:

MorphRow = tuple[str, list[str], str, str]

That is:

  • lemma
  • feature list
  • surface form
  • language code
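
Because each row is a plain typed tuple, standard unpacking and the collections module work directly; for instance, counting examples per language code (with a toy rows list, not real data):

```python
# Count loaded examples per language code, the fourth MorphRow field.
# The rows below are toy data for illustration.
from collections import Counter

rows = [
    ("walk", ["V", "PST"], "walked", "eng"),
    ("idti", ["V", "PST"], "shel", "rus"),
    ("run", ["V", "PST"], "ran", "eng"),
]
per_lang = Counter(lang for _, _, _, lang in rows)
```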

MorphDataset

MorphDataset is the package's main runtime component. It converts rows into ready-to-train tensors using:

  • CharVocab
  • FeatureVocab
  • a lang_to_id mapping

Constructor inputs:

  • rows
  • char_vocab
  • feature_vocab
  • lang_to_id
  • max_len
  • max_features
  • pin_memory

Produced tensors:

  • source character ids
  • target character ids
  • feature ids
  • feature masks
  • language ids

Batch serving

MorphDataset.epoch_batches(...) yields batches in the following typed layout:

MorphBatch = tuple[
    torch.Tensor,  # source
    torch.Tensor,  # target
    torch.Tensor,  # language ids
    torch.Tensor,  # feature ids
    torch.Tensor,  # feature mask
]

Behavior:

  • shuffles sample order each epoch
  • supports CUDA-aware non-blocking transfers
  • uses a secondary CUDA stream for prefetch when available
  • keeps all pre-encoded samples in RAM for fast iteration

Example:

import torch

from chartoken import CharVocab, FeatureVocab
from sigmorphon import MorphDataset, load_tsv

rows = load_tsv("data/rus_train.tsv")
char_vocab = CharVocab.from_texts(
    [lemma for lemma, _, _, _ in rows]
    + [surface for _, _, surface, _ in rows]
)
feature_vocab = FeatureVocab.from_tags([tags for _, tags, _, _ in rows])
lang_to_id = {"rus": 0}

dataset = MorphDataset(rows, char_vocab, feature_vocab, lang_to_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for source, target, lang_ids, feature_ids, feature_mask in dataset.epoch_batches(32, device):
    print(source.shape, target.shape)
    break

Utility methods

MorphDataset.memory_bytes() estimates how much RAM the encoded tensors occupy.

That is useful when you want to:

  • compare eager dataset sizes across languages
  • plan batch sizes
  • decide whether to split or merge corpora
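
A small helper like the following (not part of the package, just a logging convenience) makes the byte count returned by memory_bytes() easier to read:

```python
# Format a raw byte count as a human-readable string, e.g. for logging
# the result of MorphDataset.memory_bytes(). Illustrative helper only.
def format_bytes(n: float) -> str:
    for unit in ("B", "KiB", "MiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} GiB"
```

For example, print(format_bytes(dataset.memory_bytes())) would report the dataset footprint in whichever unit fits.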

Intended scope

This package intentionally focuses on morphological dataset handling. It does not contain:

  • neural network layers
  • training loops
  • checkpointing
  • model definitions

That separation keeps the package reusable across multiple applications, not just morphoformer.

Download files

Download the file for your platform.

Source Distribution

sigmorphon_vp-2.1.4.tar.gz (9.1 kB)


Built Distribution


sigmorphon_vp-2.1.4-py3-none-any.whl (7.8 kB)


File details

Details for the file sigmorphon_vp-2.1.4.tar.gz.

File metadata

  • Download URL: sigmorphon_vp-2.1.4.tar.gz
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for sigmorphon_vp-2.1.4.tar.gz:

  • SHA256: b2bf31c73e1c91493649ac6d6fba3b8aedb94c8e7a41f8bfc46831114a482c70
  • MD5: 85ebd075b87f597343e3447393f4e5c5
  • BLAKE2b-256: 2d6be799c40a2cee756bee643b70c356264c1fc1e4b60038b830c910bdde2e80


File details

Details for the file sigmorphon_vp-2.1.4-py3-none-any.whl.

File metadata

  • Download URL: sigmorphon_vp-2.1.4-py3-none-any.whl
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for sigmorphon_vp-2.1.4-py3-none-any.whl:

  • SHA256: 05fd04e350d962a6a0843db49206ee757cc0c122c1d4786c88cc5db5cddac6db
  • MD5: a8102ee02cc7d4936ae83305dc571874
  • BLAKE2b-256: a9521aa038b5e5b52a2185fe622c39ad1930963e19ef673106862fc1fa22206b

