sigmorphon-vp

SigMorphon 2021 dataset downloader, TSV parser with deduplication, and GPU-ready MorphDataset with CUDA stream prefetch.

Part of the MorphFormer project by Voluntas Progressus.

Installation

pip install sigmorphon-vp

Requires Python >= 3.14, PyTorch >= 2.0, and chartoken-vp >= 1.1.0.

Features

  • Download SigMorphon 2021 Task 0 data for 11+ languages directly from GitHub
  • TSV parsing with MD5-based deduplication and automatic column reordering
  • MorphDataset: pre-encoded character/feature tensors in RAM with pinned memory and CUDA stream prefetch
  • Merge multiple TSV files into a single training set
  • Language listing: get_available_languages() returns all supported ISO 639-3 codes
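
The MD5-based deduplication listed above could work roughly as follows. This is a minimal sketch, not the package's implementation: `dedupe_rows` is a hypothetical helper, and hashing the tab-joined fields of each row is an assumption about how the fingerprint is built.

```python
import hashlib


def dedupe_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order.

    Hypothetical helper: the real package may fingerprint rows differently.
    """
    seen = set()
    unique = []
    for row in rows:
        # MD5 serves as a compact fingerprint here, not as a security measure.
        # TSV fields cannot contain tabs, so the tab-joined string is unambiguous.
        key = hashlib.md5("\t".join(row).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Rows laid out the way the Quick Start indexes them: text, tags, text.
rows = [
    ("кот", "N;NOM;SG", "кот"),
    ("кот", "N;NOM;PL", "коты"),
    ("кот", "N;NOM;SG", "кот"),  # exact duplicate of the first row
]
unique_rows = dedupe_rows(rows)
print(len(unique_rows))
```

Storing digests rather than whole rows keeps the `seen` set small when rows are long, at the cost of one hash computation per row.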

Quick Start

from sigmorphon import download_all, load_tsv, MorphDataset, get_available_languages
from chartoken import CharVocab, FeatureVocab
from pathlib import Path

# See available languages
print(get_available_languages())

# Download Russian and German data
download_all(["rus", "deu"], Path("data/collections"))

# Load and encode
rows = load_tsv("data/collections/rus_train.tsv")
char_vocab = CharVocab.from_texts([r[0] for r in rows] + [r[2] for r in rows])
feat_vocab = FeatureVocab.from_tags([r[1] for r in rows])
lang_to_id = {"rus": 0}

dataset = MorphDataset(rows, char_vocab, feat_vocab, lang_to_id)
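
The Quick Start reads text from `r[0]` and `r[2]` and tags from `r[1]`, which suggests that rows come back as (lemma, tags, form). The sketch below shows one way a parser could produce that layout from a file whose columns are ordered lemma, form, tags. Both the input column order and the `parse_tsv` helper are assumptions for illustration; `load_tsv` itself may behave differently.

```python
import csv
import io


def parse_tsv(handle):
    """Parse tab-separated lines into (lemma, tags, form) tuples.

    Hypothetical helper. Assumes the input columns are lemma, form, tags.
    """
    rows = []
    # QUOTE_NONE keeps quote characters inside word forms as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        lemma, form, tags = fields
        rows.append((lemma, tags, form))  # reorder so the tags sit in the middle
    return rows


# An in-memory stand-in for a downloaded file, including one blank line.
sample = io.StringIO("laufen\tlief\tV;PST;3;SG\n\nHaus\tHäuser\tN;NOM;PL\n")
rows = parse_tsv(sample)
for row in rows:
    print(row)
```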

API

| Function / Class | Description |
| --- | --- |
| download_all(langs, path) | Download train/dev TSV files for given languages |
| merge_tsv(pattern, output) | Merge multiple TSV files into one |
| load_tsv(path) | Parse a single TSV file into rows |
| load_tsv_pattern(pattern) | Glob-load multiple TSV files |
| get_available_languages() | List all supported language codes |
| MorphDataset | PyTorch Dataset with CUDA prefetch |
| DATASETS | Dict of available SigMorphon 2021 datasets |
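
To illustrate what merging several TSV files by glob pattern involves, here is a standard-library sketch. It is not the package's `merge_tsv`: the helper name `merge_tsv_files`, its separate `directory` argument, and the decision to skip repeated lines are all assumptions made for this example.

```python
import tempfile
from pathlib import Path


def merge_tsv_files(directory, pattern, output):
    """Concatenate files matching `pattern` into `output`, skipping repeated lines.

    Hypothetical helper; returns the number of lines written.
    """
    # Resolve the inputs before creating the output file, in a stable order.
    paths = sorted(Path(directory).glob(pattern))
    seen = set()
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        written += 1
    return written


# Build two small training files in a temporary directory and merge them.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "rus_train.tsv").write_text("кот\tкоты\tN;NOM;PL\n", encoding="utf-8")
    (root / "deu_train.tsv").write_text("Haus\tHäuser\tN;NOM;PL\n", encoding="utf-8")
    merged = root / "merged.tsv"
    count = merge_tsv_files(root, "*_train.tsv", merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()

print(count)
```

Sorting the matched paths makes the merged file reproducible across runs, since glob order is not guaranteed.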

License

MIT
