
MorphFormer: multilingual morphological reinflection via character-level Transformer


MorphFormer v2


Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, language-conditioned adapters, GQA, RoPE, and SwiGLU.

Installation

pip install morphoformer

Or from source:

cd Morph_v2
pip install -e .

Requirements: Python >= 3.14, PyTorch >= 2.0

Quick start

# Download data (SigMorphon 2021, 35+ languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (batch size / gradient checkpointing auto-tuned to available VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
  --checkpoint checkpoints/morphformer_epoch50.pt \
  --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt

Presets

Preset  d_model  Encoder    Decoder   dim_ff  ~Params  VRAM
small   384      4 layers   3 layers  1024    ~7M      < 4 GB
medium  512      8 layers   6 layers  1376    ~45M     4–8 GB
large   768      10 layers  8 layers  2048    ~120M    >= 8 GB
morphoformer init-config --preset medium --out config.toml

Architecture

  • Encoder-decoder Transformer with pre-norm (RMSNorm)
  • GQA (Grouped Query Attention): compact KV cache
  • RoPE: Rotary Position Embeddings
  • SwiGLU FFN
  • Conformer-style convolution: depthwise conv1d between self-attention and FFN in the encoder
  • Language-conditioned adapters: gated bottleneck per language
  • Structured morph encoder: embedding + pooling / cross-attention instead of a flat character string
  • Weight tying: output projection shares the char embedding matrix
  • CUDA stream prefetch: asynchronous batch loading on a separate stream
  • Auto memory planning: automatic gradient checkpointing and batch resizing based on available VRAM
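The language-conditioned adapter can be pictured as a bottleneck whose projections are indexed by language ID, with a learned gate blending the adapter output into the residual stream. The sketch below is an illustration under assumed names and shapes (`LanguageConditionedAdapter`, per-language `down`/`up`/`gate` parameters), not MorphFormer's actual implementation:

```python
import torch
import torch.nn as nn

class LanguageConditionedAdapter(nn.Module):
    """Gated bottleneck adapter, one set of projections per language (sketch)."""

    def __init__(self, d_model: int, bottleneck: int, num_languages: int):
        super().__init__()
        self.down = nn.Parameter(torch.randn(num_languages, d_model, bottleneck) * 0.02)
        # Up-projection starts at zero so the adapter is initially an identity map.
        self.up = nn.Parameter(torch.zeros(num_languages, bottleneck, d_model))
        self.gate = nn.Parameter(torch.zeros(num_languages, d_model))

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); lang_id: (batch,)
        down = self.down[lang_id]                 # (batch, d_model, bottleneck)
        up = self.up[lang_id]                     # (batch, bottleneck, d_model)
        h = torch.relu(x @ down) @ up             # per-language bottleneck
        g = torch.sigmoid(self.gate[lang_id]).unsqueeze(1)  # (batch, 1, d_model)
        return x + g * h                          # gated residual update
```

Because the up-projection is zero-initialized, a freshly created adapter passes activations through unchanged, which is a common way to make adapter insertion training-stable.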

CLI

Command      Description
train        Train a model
infer        Single-example inference
serve        Interactive REPL
download     Download SigMorphon/UniMorph data
modules      List registered modules
init-config  Create a TOML config template
morphoformer --help
morphoformer train --help

Data format

TSV: lemma\tfeatures\tword form\tlanguage

бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
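Rows in this format can be parsed with the standard-library `csv` module. The helper below (`read_morph_tsv` is a hypothetical name, not part of the package API) splits the semicolon-separated UniMorph features into a list:

```python
import csv
from io import StringIO

def read_morph_tsv(text: str) -> list[dict]:
    """Parse lemma<TAB>features<TAB>form<TAB>language rows (illustrative helper)."""
    rows = []
    for lemma, feats, form, lang in csv.reader(StringIO(text), delimiter="\t"):
        rows.append({"lemma": lemma, "feats": feats.split(";"),
                     "form": form, "lang": lang})
    return rows

rows = read_morph_tsv("gehen\tV;IND;PRS;3;SG\tgeht\tdeu\n")
assert rows[0]["feats"] == ["V", "IND", "PRS", "3", "SG"]
```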

Configuration

[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"          # "attention" for the large preset

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"          # cuda / rocm / xpu / mps / cpu

Examples: config_examples/

Python API

import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)
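Conceptually, `greedy_decode` runs a loop that feeds the tokens generated so far back into the model and takes the argmax at each step. The generic sketch below illustrates that loop only; the package's implementation also manages a KV cache, and `step_fn` and `greedy_loop` are illustrative names:

```python
import torch

@torch.no_grad()
def greedy_loop(step_fn, sos_id: int, eos_id: int, max_out: int) -> list[int]:
    """Greedy decoding sketch: step_fn maps the token prefix to next-token logits."""
    tokens = [sos_id]
    for _ in range(max_out):
        logits = step_fn(torch.tensor([tokens]))  # (1, vocab)
        nxt = int(logits.argmax(dim=-1))
        if nxt == eos_id:
            break
        tokens.append(nxt)
    return tokens[1:]  # strip SOS
```

With a real model, `step_fn` would embed the prefix, run the decoder against the cached encoder output, and return the final-position logits.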

Modules (pluggable registry)

$ morphoformer modules
[attention]  gqa, mha, cross
[feedforward]  swiglu, gelu
[norm]  rmsnorm, layernorm
[conv]  local, none
[adapter]  language_conditioned, bottleneck, none
[morph_encoder]  pooled, attention
[position]  rope

Custom module:

from torch import nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...

Devices

morphoformer train --device auto    # CUDA → XPU → MPS → CPU
morphoformer train --device cuda    # NVIDIA
morphoformer train --device rocm    # AMD
morphoformer train --device xpu     # Intel Arc
morphoformer train --device mps     # Apple Silicon
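The `auto` fallback chain can be sketched with plain PyTorch availability checks. The function name `pick_device` is illustrative; note that ROCm builds of PyTorch expose AMD GPUs under the `cuda` device type, so `--device rocm` is a CLI-level alias rather than a `torch.device` string:

```python
import torch

def pick_device(requested: str = "auto") -> torch.device:
    """Resolve 'auto' using the priority order CUDA -> XPU -> MPS -> CPU (sketch)."""
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```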

Project structure

morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / Attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning

Publishing

pip install build twine
python -m build
python -m twine upload dist/*

See USAGE.md for details.

License

MIT
