
MorphFormer: multilingual morphological reinflection via character-level Transformer

Project description

MorphFormer v2


Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, language-conditioned adapters, GQA, RoPE, and SwiGLU.

Installation

pip install morphoformer

Or from source:

cd Morph_v2
pip install -e .

Requirements: Python >= 3.14, PyTorch >= 2.0

Quick start

# Download data (35+ SigMorphon 2021 languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (auto-tunes batch size / gradient checkpointing to available VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
  --checkpoint checkpoints/morphformer_epoch50.pt \
  --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt

Presets

Preset   d_model  Encoder    Decoder   dim_ff  ~Params  VRAM
small    384      4 layers   3 layers  1024    ~7M      < 4 GB
medium   512      8 layers   6 layers  1376    ~45M     4–8 GB
large    768      10 layers  8 layers  2048    ~120M    >= 8 GB
morphoformer init-config --preset medium --out config.toml

Architecture

  • Encoder-decoder Transformer with pre-norm (RMSNorm)
  • GQA (Grouped Query Attention): compact KV cache
  • RoPE: Rotary Position Embeddings
  • SwiGLU FFN
  • Conformer-style Conv: depthwise conv1d between self-attention and FFN in the encoder
  • Language-Conditioned Adapters: gated bottleneck per language
  • Structured Morph Encoder: embed + pool / cross-attention instead of a character-by-character tag string
  • Weight tying: output projection shares the character embedding matrix
  • CUDA Stream Prefetch: async batch transfer on a separate stream
  • Auto Memory Planning: automatic gradient checkpointing and batch resizing based on available VRAM
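Of the blocks above, the SwiGLU FFN is the simplest to illustrate. A minimal PyTorch sketch follows; the class and weight names are hypothetical, not the package's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, dim_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, dim_ff, bias=False)  # gating branch
        self.w_up = nn.Linear(d_model, dim_ff, bias=False)    # value branch
        self.w_down = nn.Linear(dim_ff, d_model, bias=False)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gate: SiLU(x W_gate) * (x W_up), then project down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

With the medium preset's dimensions (d_model=512, dim_ff=1376), the block maps a (batch, seq, 512) tensor back to the same shape.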

CLI

Command      Description
train        Train a model
infer        Single-example inference
serve        Interactive REPL
download     Download SigMorphon/UniMorph data
modules      List registered modules
init-config  Generate a TOML config template
morphoformer --help
morphoformer train --help

Data format

TSV: lemma\tfeatures\tword form\tlanguage

бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
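This four-column TSV can be read with the standard library alone. A minimal sketch (the dict keys are illustrative, not the package's internal schema):

```python
import csv
from io import StringIO

# Two sample rows in the lemma \t features \t form \t language layout.
sample = (
    "бежать\tV;IND;PRS;3;SG\tбежит\trus\n"
    "gehen\tV;IND;PRS;3;SG\tgeht\tdeu\n"
)

rows = []
for lemma, feats, form, lang in csv.reader(StringIO(sample), delimiter="\t"):
    rows.append({
        "lemma": lemma,
        "features": feats.split(";"),  # UniMorph tags, e.g. ["V", "IND", ...]
        "form": form,
        "lang": lang,
    })
```

Each line yields one (lemma, tag set, target form, language) training example.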

Configuration

[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"          # "attention" for large

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"          # cuda / rocm / xpu / mps / cpu

Examples: config_examples/

Python API

import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)

Modules (pluggable registry)

$ morphoformer modules
[attention]  gqa, mha, cross
[feedforward]  swiglu, gelu
[norm]  rmsnorm, layernorm
[conv]  local, none
[adapter]  language_conditioned, bottleneck, none
[morph_encoder]  pooled, attention
[position]  rope

Custom module:

import torch.nn as nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...
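For illustration, a registry of this shape can be built in a few lines of pure Python. This is a sketch; the actual implementation in morphoformer.modules.registry may differ:

```python
from collections import defaultdict

# kind -> name -> class, e.g. _REGISTRY["feedforward"]["my_ffn"] = MyFFN
_REGISTRY: dict[str, dict[str, type]] = defaultdict(dict)

def register(kind: str, name: str):
    """Class decorator that files the class under (kind, name)."""
    def decorator(cls: type) -> type:
        _REGISTRY[kind][name] = cls
        return cls  # leave the class itself untouched
    return decorator

def get(kind: str, name: str) -> type:
    """Look up a registered class, raising KeyError if absent."""
    return _REGISTRY[kind][name]

def list_modules() -> dict[str, list[str]]:
    """Names per kind, as `morphoformer modules` might print them."""
    return {kind: sorted(names) for kind, names in _REGISTRY.items()}

@register("feedforward", "my_ffn")
class MyFFN:
    ...
```

Registration happens at import time, which is why the API example imports morphoformer.modules before building a model.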

Devices

morphoformer train --device auto    # CUDA → XPU → MPS → CPU
morphoformer train --device cuda    # NVIDIA
morphoformer train --device rocm    # AMD
morphoformer train --device xpu     # Intel Arc
morphoformer train --device mps     # Apple Silicon
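The auto-selection order can be sketched as follows. This is an assumption based on the fallback chain above, not the package's actual code; note that ROCm builds of PyTorch expose the GPU as "cuda":

```python
import torch

def pick_device(requested: str = "auto") -> torch.device:
    """Resolve a --device flag, preferring CUDA > XPU > MPS > CPU."""
    if requested == "rocm":
        # ROCm builds of PyTorch use the "cuda" device type.
        return torch.device("cuda")
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    xpu = getattr(torch, "xpu", None)           # Intel Arc (newer PyTorch only)
    if xpu is not None and xpu.is_available():
        return torch.device("xpu")
    mps = getattr(torch.backends, "mps", None)  # Apple Silicon
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```

The getattr guards keep the sketch working on PyTorch builds that lack the xpu or mps backends.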

Project structure

morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / Attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning

Publishing

pip install build twine
python -m build
python -m twine upload dist/*

More details: USAGE.md

License

MIT
