SigMorphon 2021 dataset downloader, TSV parser, deduplication, and GPU-ready MorphDataset with CUDA stream prefetch
sigmorphon-vp
SigMorphon 2021 dataset downloader, TSV parser, and GPU-ready dataset with CUDA stream prefetch.
Part of the MorphFormer project by Voluntas Progressus.
Installation
```shell
pip install sigmorphon-vp
```
Requires Python >= 3.14, PyTorch >= 2.0, and chartoken-vp >= 1.1.0.
Features
- Download SigMorphon 2021 Task 0 data for 11+ languages directly from GitHub
- TSV parsing with MD5-based deduplication and automatic column reordering
- MorphDataset — pre-encoded character/feature tensors in RAM with pinned memory and CUDA stream prefetch
- Merge multiple TSV files into a single training set
- Language listing: `get_available_languages()` returns all supported ISO 639-3 codes
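To make the parsing and deduplication features concrete, here is a minimal standard-library sketch of the idea. It is not the package's implementation: the function name `parse_and_dedupe`, the assumed input column order (lemma, form, features), and the output order (lemma, features, form, chosen to match the indexing in the Quick Start) are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def parse_and_dedupe(tsv_text):
    """Parse SigMorphon-style TSV text and drop exact duplicate rows.

    Hypothetical helper: assumes input columns are (lemma, form, features)
    and returns rows reordered to (lemma, features, form).
    """
    seen = set()
    rows = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        lemma, form, features = (f.strip() for f in fields)
        # The MD5 digest of the normalized row is the deduplication key.
        # MD5 is used only as a fingerprint here, not for security.
        key = hashlib.md5(
            "\t".join((lemma, form, features)).encode("utf-8"),
            usedforsecurity=False,
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        rows.append((lemma, features, form))
    return rows


sample = "walk\twalked\tV;PST\nwalk\twalked\tV;PST\nwalk\twalking\tV;V.PTCP;PRS\n"
rows = parse_and_dedupe(sample)
print(rows)
```

Storing digests rather than full rows keeps the `seen` set at a fixed size per entry, which may matter when merging large multilingual files.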
Quick Start
```python
from sigmorphon import download_all, load_tsv, MorphDataset, get_available_languages
from chartoken import CharVocab, FeatureVocab
from pathlib import Path

# See available languages
print(get_available_languages())

# Download Russian and German data
download_all(["rus", "deu"], Path("data/collections"))

# Load and encode
rows = load_tsv("data/collections/rus_train.tsv")
char_vocab = CharVocab.from_texts([r[0] for r in rows] + [r[2] for r in rows])
feat_vocab = FeatureVocab.from_tags([r[1] for r in rows])
lang_to_id = {"rus": 0}
dataset = MorphDataset(rows, char_vocab, feat_vocab, lang_to_id)
```
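The Quick Start downloads two languages but maps only `"rus"` in `lang_to_id`. For a multilingual run, the mapping could be derived from the downloaded file names. The sketch below assumes files are named `<lang>_<split>.tsv`, as the `rus_train.tsv` path above suggests; the `rus_dev.tsv` name and the helper `build_lang_to_id` are hypothetical and not part of `sigmorphon-vp`.

```python
from pathlib import Path


def build_lang_to_id(paths):
    """Map each language code found in the file names to a stable integer id.

    Hypothetical helper: takes the text before the first underscore in each
    file name as the language code, then numbers the codes alphabetically.
    """
    codes = sorted({Path(p).name.split("_", 1)[0] for p in paths})
    return {code: idx for idx, code in enumerate(codes)}


files = [
    "data/collections/rus_train.tsv",
    "data/collections/rus_dev.tsv",   # assumed name for the dev split
    "data/collections/deu_train.tsv",
]
lang_to_id = build_lang_to_id(files)
print(lang_to_id)  # {'deu': 0, 'rus': 1}
```

Sorting the codes keeps the ids reproducible across runs, so a saved model and a freshly built dataset agree on which id means which language.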
API
| Function / Class | Description |
|---|---|
| `download_all(langs, path)` | Download train/dev TSV files for given languages |
| `merge_tsv(pattern, output)` | Merge multiple TSV files into one |
| `load_tsv(path)` | Parse a single TSV file into rows |
| `load_tsv_pattern(pattern)` | Glob-load multiple TSV files |
| `get_available_languages()` | List all supported language codes |
| `MorphDataset` | PyTorch Dataset with CUDA prefetch |
| `DATASETS` | Dict of available SigMorphon 2021 datasets |
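As a rough illustration of what glob-based merging involves, the following standard-library sketch concatenates every file matching a pattern into one output file. It is not the source of `merge_tsv`: the name `merge_tsv_files`, the return value, and the choice to skip blank lines are assumptions made for this example.

```python
import glob
import tempfile
from pathlib import Path


def merge_tsv_files(pattern, output):
    """Concatenate every TSV file matching `pattern` into `output`.

    Hypothetical helper. Returns the number of non-empty lines written.
    """
    written = 0
    output = Path(output)
    with output.open("w", encoding="utf-8", newline="\n") as out:
        for name in sorted(glob.glob(pattern)):
            if Path(name).resolve() == output.resolve():
                continue  # never read the file we are writing
            with open(name, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line.strip():  # drop blank lines between files
                        out.write(line + "\n")
                        written += 1
    return written


# Demo on two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "rus_train.tsv").write_text("дом\tдома\tN;GEN;SG\n", encoding="utf-8")
    (tmp / "deu_train.tsv").write_text("Haus\tHauses\tN;GEN;SG\n\n", encoding="utf-8")
    count = merge_tsv_files(str(tmp / "*_train.tsv"), tmp / "merged.tsv")
    merged = (tmp / "merged.tsv").read_text(encoding="utf-8")

print(count)
```

Files are processed in sorted order so the merged output is deterministic regardless of how the filesystem lists them.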
License
MIT
File details
Details for the file sigmorphon_vp-1.1.0.tar.gz.
File metadata
- Download URL: sigmorphon_vp-1.1.0.tar.gz
- Upload date:
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b601d8d153679f1f69c667f6937ed89b37fc54b7059ab7b603685ac750ef34b7 |
| MD5 | 135ef761376de4a75ed200488c465846 |
| BLAKE2b-256 | 834808a44cdc9210cb4504a3c1e377dcb2071b0c2d363d0f39f5dbaaab49e6b6 |
File details
Details for the file sigmorphon_vp-1.1.0-py3-none-any.whl.
File metadata
- Download URL: sigmorphon_vp-1.1.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d35e69248fc86cbb26abc7b518d3598bc24b983617fa9922ca25ba63c79308c |
| MD5 | 501ac4341fc2860037410b4cbbb0ac31 |
| BLAKE2b-256 | 04a584fbd251c25f9ebecc62e140b721b44ba2daa1566cc11245bd576e317063 |
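The published digests can be used to check a downloaded file before installing it. A minimal sketch using only `hashlib` follows; the helper name `sha256_matches` is hypothetical, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Usage (assumes the wheel has been downloaded to the working directory):
# sha256_matches(
#     "sigmorphon_vp-1.1.0-py3-none-any.whl",
#     "9d35e69248fc86cbb26abc7b518d3598bc24b983617fa9922ca25ba63c79308c",
# )
```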