
MorphFormer: multilingual morphological reinflection via character-level Transformer

Project description

MorphFormer v2


Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, language-conditioned adapters, GQA, RoPE, and SwiGLU.

Installation

pip install morphoformer

Or from source:

cd Morph_v2
pip install -e .

Requirements: Python >= 3.14, PyTorch >= 2.0

Quick start

# Download data (35+ SigMorphon 2021 languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (batch size and gradient checkpointing auto-tuned to fit VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
  --checkpoint checkpoints/morphformer_epoch50.pt \
  --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt

Presets

| Preset | d_model | Encoder   | Decoder  | dim_ff | ~Params | VRAM    |
|--------|---------|-----------|----------|--------|---------|---------|
| small  | 384     | 4 layers  | 3 layers | 1024   | ~7M     | < 4 GB  |
| medium | 512     | 8 layers  | 6 layers | 1376   | ~45M    | 4–8 GB  |
| large  | 768     | 10 layers | 8 layers | 2048   | ~120M   | >= 8 GB |
morphoformer init-config --preset medium --out config.toml

Architecture

  • Encoder-decoder Transformer with pre-norm (RMSNorm)
  • GQA (Grouped Query Attention): compact KV cache
  • RoPE: Rotary Position Embeddings
  • SwiGLU FFN
  • Conformer-style convolution: depthwise conv1d between self-attention and FFN in the encoder
  • Language-conditioned adapters: gated bottleneck per language
  • Structured morph encoder: tag embedding + pooling / cross-attention instead of a flat character string of tags
  • Weight tying: output projection shares weights with the char embedding
  • CUDA stream prefetch: async batch transfer on a separate stream
  • Auto memory planning: automatic gradient checkpointing and batch resizing based on available VRAM
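To make the SwiGLU bullet concrete: the block gates an "up" projection with a SiLU-activated "gate" projection before projecting back down. A minimal NumPy sketch (not the package's implementation; weight names, shapes, and the init scale are assumptions):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(x @ w_gate) * (x @ w_up) )."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, dim_ff = 512, 1376            # matches the "medium" preset
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, dim_ff)) * 0.02
w_up   = rng.standard_normal((d_model, dim_ff)) * 0.02
w_down = rng.standard_normal((dim_ff, d_model)) * 0.02
out = swiglu_ffn(x, w_gate, w_up, w_down)
print(out.shape)  # (4, 512)
```

Note the shape is preserved end to end, which is what lets the block drop into a residual stream.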

CLI

| Command     | Description                       |
|-------------|-----------------------------------|
| train       | Train a model                     |
| infer       | Single inference                  |
| serve       | Interactive REPL                  |
| download    | Download SigMorphon/UniMorph data |
| modules     | List registered modules           |
| init-config | Create a TOML template            |
morphoformer --help
morphoformer train --help

Data format

TSV: lemma\tfeatures\tform\tlanguage

бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
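A row like the above can be parsed with nothing more than a tab split; a small sketch (field order follows the format above; the `Example` type and function name are hypothetical, not part of the package API):

```python
from typing import NamedTuple

class Example(NamedTuple):
    lemma: str
    features: list[str]   # UniMorph tags, e.g. ["V", "IND", "PRS", "3", "SG"]
    form: str
    lang: str

def parse_tsv_line(line: str) -> Example:
    # Four tab-separated fields; features are semicolon-joined UniMorph tags.
    lemma, feats, form, lang = line.rstrip("\n").split("\t")
    return Example(lemma, feats.split(";"), form, lang)

ex = parse_tsv_line("gehen\tV;IND;PRS;3;SG\tgeht\tdeu")
print(ex.lemma, ex.features, ex.form, ex.lang)
```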

Configuration

[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"          # "attention" for large

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"          # cuda / rocm / xpu / mps / cpu

Examples: config_examples/

Python API

import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)
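Under the hood, greedy decoding is just a loop of argmax steps until EOS or a length cap. A framework-free sketch with a toy scorer standing in for the model (the function names and the scorer are illustrative assumptions, not morphoformer's API):

```python
def greedy_decode_sketch(step_fn, sos_id, eos_id, max_out=128):
    """Repeatedly pick the highest-scoring next token until EOS or max_out."""
    tokens = [sos_id]
    for _ in range(max_out):
        scores = step_fn(tokens)              # one score per vocab id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop SOS

# Toy scorer: prefers token 3 for four steps, then emits EOS (id 1).
def toy_step(tokens, vocab=5):
    if len(tokens) > 4:
        return [1.0 if i == 1 else 0.0 for i in range(vocab)]
    return [1.0 if i == 3 else 0.0 for i in range(vocab)]

print(greedy_decode_sketch(toy_step, sos_id=0, eos_id=1))  # [3, 3, 3, 3]
```

The real `greedy_decode` additionally maintains a KV cache so each step reuses previous attention states instead of recomputing them.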

Modules (pluggable registry)

$ morphoformer modules
[attention]  gqa, mha, cross
[feedforward]  swiglu, gelu
[norm]  rmsnorm, layernorm
[conv]  local, none
[adapter]  language_conditioned, bottleneck, none
[morph_encoder]  pooled, attention
[position]  rope

A custom module:

import torch.nn as nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...

Devices

morphoformer train --device auto    # CUDA → XPU → MPS → CPU
morphoformer train --device cuda    # NVIDIA
morphoformer train --device rocm    # AMD
morphoformer train --device xpu     # Intel Arc
morphoformer train --device mps     # Apple Silicon

Project structure

morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / Attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning

Publishing

pip install build twine
python -m build
python -m twine upload dist/*

See USAGE.md for details.

License

MIT
