
MorphFormer: multilingual morphological reinflection via character-level Transformer


MorphFormer v2

PyPI version · Python 3.14+ · License: MIT · OS Independent

Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, language-conditioned adapters, GQA, RoPE, and SwiGLU.

Installation

pip install morphoformer

Or from source:

cd Morph_v2
pip install -e .

Requirements: Python >= 3.14, PyTorch >= 2.0

Quick start

# Download data (35+ SigMorphon 2021 languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (batch size / gradient checkpointing auto-tuned to fit VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
  --checkpoint checkpoints/morphformer_epoch50.pt \
  --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt

Presets

| Preset | d_model | Encoder   | Decoder  | dim_ff | ~Params | VRAM    |
|--------|---------|-----------|----------|--------|---------|---------|
| small  | 384     | 4 layers  | 3 layers | 1024   | ~7M     | < 4 GB  |
| medium | 512     | 8 layers  | 6 layers | 1376   | ~45M    | 4–8 GB  |
| large  | 768     | 10 layers | 8 layers | 2048   | ~120M   | >= 8 GB |

morphoformer init-config --preset medium --out config.toml

Architecture

  • Encoder-decoder Transformer with pre-norm (RMSNorm)
  • GQA (Grouped Query Attention) — compressed KV cache
  • RoPE — rotary position embeddings
  • SwiGLU FFN
  • Conformer-style conv — depthwise conv1d between self-attention and FFN in the encoder
  • Language-conditioned adapters — gated bottleneck per language
  • Structured morph encoder — embed + pool / cross-attention instead of a character-by-character tag string
  • Weight tying — output projection = char embedding
  • CUDA stream prefetch — async batches on a separate stream
  • Auto memory planning — automatic gradient checkpointing + batch resizing based on available VRAM
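To make the SwiGLU bullet concrete, here is a minimal sketch of a SwiGLU feed-forward block in PyTorch. The class and attribute names are illustrative, not the project's actual `feedforward.py` implementation; the standard formulation is FFN(x) = W2(SiLU(W1·x) ⊙ W3·x).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU feed-forward block (hypothetical names)."""
    def __init__(self, d_model: int, dim_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, dim_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, dim_ff, bias=False)  # value projection
        self.w2 = nn.Linear(dim_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated product, then project back to d_model
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFFN(512, 1376)(x).shape)  # torch.Size([2, 16, 512])
```

With `dim_ff = 1376` this matches the medium preset's feed-forward width from the table above.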

CLI

| Command     | Description                       |
|-------------|-----------------------------------|
| train       | Train a model                     |
| infer       | Single-example inference          |
| serve       | Interactive REPL                  |
| download    | Download SigMorphon/UniMorph data |
| modules     | List registered modules           |
| init-config | Generate a TOML template          |

morphoformer --help
morphoformer train --help

Data format

TSV: lemma\tfeatures\tword form\tlanguage

бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
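Rows in this layout can be read with nothing but the standard library. A minimal sketch (the dict keys are illustrative, not the project's internal field names):

```python
import csv
import io

# Two example rows in the lemma \t features \t form \t language layout above.
tsv = "бежать\tV;IND;PRS;3;SG\tбежит\trus\ngehen\tV;IND;PRS;3;SG\tgeht\tdeu\n"

rows = []
for lemma, feats, form, lang in csv.reader(io.StringIO(tsv), delimiter="\t"):
    # UniMorph features are semicolon-separated within the second column
    rows.append({"lemma": lemma, "features": feats.split(";"),
                 "form": form, "lang": lang})

print(rows[0])
# {'lemma': 'бежать', 'features': ['V', 'IND', 'PRS', '3', 'SG'], 'form': 'бежит', 'lang': 'rus'}
```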

Configuration

[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"          # "attention" для large

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"          # cuda / rocm / xpu / mps / cpu

Examples: config_examples/

Python API

import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)

Modules (pluggable registry)

$ morphoformer modules
[attention]  gqa, mha, cross
[feedforward]  swiglu, gelu
[norm]  rmsnorm, layernorm
[conv]  local, none
[adapter]  language_conditioned, bottleneck, none
[morph_encoder]  pooled, attention
[position]  rope

Custom module:

import torch.nn as nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...

Devices

morphoformer train --device auto    # CUDA → XPU → MPS → CPU
morphoformer train --device cuda    # NVIDIA
morphoformer train --device rocm    # AMD
morphoformer train --device xpu     # Intel Arc
morphoformer train --device mps     # Apple Silicon
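The `--device auto` fallback order can be sketched with plain PyTorch availability checks. This is an illustration of the CUDA → XPU → MPS → CPU chain, not the project's actual resolver (ROCm builds of PyTorch report as `cuda`):

```python
import torch

def pick_device() -> torch.device:
    """Fall back CUDA -> XPU -> MPS -> CPU, mirroring --device auto."""
    if torch.cuda.is_available():                            # NVIDIA (and ROCm builds)
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel Arc
        return torch.device("xpu")
    if torch.backends.mps.is_available():                    # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```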

Project structure

morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / Attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning
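training/scheduler.py implements cosine LR with warmup. A common formulation of that schedule (an assumption — not necessarily the exact code in scheduler.py) is linear warmup to the base LR, then cosine decay:

```python
import math

def lr_at(step: int, base_lr: float = 5e-4, warmup: int = 4000,
          total: int = 100_000) -> float:
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to 0.
    Defaults mirror the lr/warmup_steps values in the config example above;
    `total` is a hypothetical training length."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(2000))  # halfway through warmup: 0.00025
print(lr_at(4000))  # peak: 0.0005
```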

Publishing

pip install build twine
python -m build
python -m twine upload dist/*

More details: USAGE.md

License

MIT
