# MorphFormer v2

Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, with language-conditioned adapters, GQA, RoPE, and SwiGLU.
## Installation

```bash
pip install morphoformer
```

Or from source:

```bash
cd Morph_v2
pip install -e .
```

Requirements: Python >= 3.14, PyTorch >= 2.0
## Quick start

```bash
# Download data (35+ SigMorphon 2021 languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (batch size / gradient checkpointing auto-tuned to fit VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
    --checkpoint checkpoints/morphformer_epoch50.pt \
    --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt
```
## Presets

| Preset | d_model | Encoder | Decoder | dim_ff | ~Params | VRAM |
|---|---|---|---|---|---|---|
| small | 384 | 4 layers | 3 layers | 1024 | ~7M | < 4 GB |
| medium | 512 | 8 layers | 6 layers | 1376 | ~45M | 4–8 GB |
| large | 768 | 10 layers | 8 layers | 2048 | ~120M | >= 8 GB |

```bash
morphoformer init-config --preset medium --out config.toml
```
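The ~45M figure for `medium` can be sanity-checked with rough arithmetic (a sketch that ignores embeddings, adapters, cross-attention, and GQA's smaller KV projections):

```python
# Back-of-the-envelope parameter count for the medium preset.
d_model, dim_ff, layers = 512, 1376, 8 + 6    # encoder + decoder layers
attn = 4 * d_model ** 2                       # Q, K, V, O projections ≈ 1.05M
ffn = 3 * d_model * dim_ff                    # SwiGLU has three weight matrices ≈ 2.11M
print(f"{layers * (attn + ffn) / 1e6:.1f}M")  # ≈ 44.3M, consistent with ~45M
```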
## Architecture

- Encoder-decoder Transformer with pre-norm (RMSNorm)
- GQA (Grouped-Query Attention): compressed KV cache
- RoPE (Rotary Position Embeddings)
- SwiGLU FFN
- Conformer-style conv: depthwise conv1d between self-attention and FFN in the encoder
- Language-conditioned adapters: gated bottleneck per language (see the sketch after this list)
- Structured morph encoder: embed + pool / cross-attention instead of a per-character tag string
- Weight tying: output projection shares the char embedding
- CUDA stream prefetch: async batches via a separate stream
- Auto memory planning: automatic gradient checkpointing + batch resizing based on VRAM
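As an illustration of the adapter idea, a gated bottleneck conditioned on a language embedding might look like the following (a minimal sketch; the actual `morphoformer.modules.adapter` implementation may differ in interface and details):

```python
import torch
import torch.nn as nn

class LanguageConditionedAdapter(nn.Module):
    """Gated bottleneck adapter conditioned on a language embedding (illustrative)."""

    def __init__(self, d_model: int, bottleneck: int, num_languages: int):
        super().__init__()
        self.lang_emb = nn.Embedding(num_languages, bottleneck)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); lang_id: (batch,) language indices
        h = torch.nn.functional.gelu(self.down(x) + self.lang_emb(lang_id).unsqueeze(1))
        # Gated residual: the gate lets the model scale the adapter's contribution.
        return x + torch.sigmoid(self.gate(x)) * self.up(h)
```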
## CLI

| Command | Description |
|---|---|
| `train` | Train a model |
| `infer` | Single-word inference |
| `serve` | Interactive REPL |
| `download` | Download SigMorphon/UniMorph data |
| `modules` | List registered modules |
| `init-config` | Generate a TOML template |

```bash
morphoformer --help
morphoformer train --help
```
## Data format

TSV: `lemma\tfeatures\tform\tlanguage`

```
бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
```
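Reading this format takes a few lines of standard-library Python (an illustrative sketch; `MorphDataset` in `morphoformer.data` handles this internally, and the file name here is hypothetical):

```python
import csv

# Each row: lemma, ";"-joined UniMorph features, target form, language code.
with open("data/rus_train.tsv", encoding="utf-8", newline="") as f:
    for lemma, feats, form, lang in csv.reader(f, delimiter="\t"):
        tags = feats.split(";")  # e.g. ["V", "IND", "PRS", "3", "SG"]
        print(lemma, tags, form, lang)
```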
## Configuration

```toml
[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"  # "attention" for large

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"  # cuda / rocm / xpu / mps / cpu
```

More examples: `config_examples/`
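Since the config is plain TOML, it can also be inspected with the standard library (a sketch; `morphoformer.config.loader` provides the project's own load/save):

```python
import tomllib  # standard library since Python 3.11

with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

print(cfg["model"]["d_model"])                # 512
print(cfg["model"]["encoder"]["num_layers"])  # 8
```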
## Python API

```python
import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)
```
## Modules (pluggable registry)

```
$ morphoformer modules
[attention] gqa, mha, cross
[feedforward] swiglu, gelu
[norm] rmsnorm, layernorm
[conv] local, none
[adapter] language_conditioned, bottleneck, none
[morph_encoder] pooled, attention
[position] rope
```

Custom module:

```python
import torch.nn as nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...
```
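A complete drop-in replacement might look like this (an illustrative sketch: the constructor arguments and the `(batch, seq, d_model)` tensor contract are assumptions based on the built-in SwiGLU/GeLU blocks, not a documented interface):

```python
import torch
import torch.nn as nn
from morphoformer.modules.registry import register

@register("feedforward", "relu_squared")
class ReluSquaredFFN(nn.Module):
    """Feed-forward block with a ReLU^2 activation (illustrative)."""

    def __init__(self, d_model: int, dim_ff: int, dropout: float = 0.0):
        super().__init__()
        self.w_in = nn.Linear(d_model, dim_ff)
        self.w_out = nn.Linear(dim_ff, d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, seq, d_model)
        return self.w_out(self.drop(torch.relu(self.w_in(x)).square()))
```

The registered name should then be selectable from the TOML config, e.g. `feedforward = "relu_squared"` (a hypothetical key; check `config/schema.py` for the actual field name).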
## Devices

```bash
morphoformer train --device auto   # CUDA → XPU → MPS → CPU
morphoformer train --device cuda   # NVIDIA
morphoformer train --device rocm   # AMD
morphoformer train --device xpu    # Intel Arc
morphoformer train --device mps    # Apple Silicon
```
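The `auto` fallback chain can be reproduced with plain PyTorch (a sketch of the idea, not the project's actual selection logic; note that ROCm builds of PyTorch expose AMD GPUs under the `cuda` device type):

```python
import torch

def auto_device() -> torch.device:
    """Pick the first available backend: CUDA → XPU → MPS → CPU."""
    if torch.cuda.is_available():                           # NVIDIA (or AMD via ROCm)
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel Arc
        return torch.device("xpu")
    if torch.backends.mps.is_available():                   # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")
```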
## Project structure

```
morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning
```
## Publishing

```bash
pip install build twine
python -m build
python -m twine upload dist/*
```

See USAGE.md for details.

## License