Morphoformer with CELMoE-based multilingual morphology, typed training pipeline, and publishable CLI.

morphoformer

morphoformer is the application package of the Morph_v4 stack. It combines character-level vocabularies, dataset tooling, typed training utilities, reusable Transformer blocks, and the generic CELMoE hierarchy into a trainable multilingual morphology system.

PyPI package name:

pip install morphoformer

Import name:

import morphoformer

What this package is

Unlike the libraries under libs/, morphoformer is not just a toolkit piece. It is the runnable application layer:

  • configuration loading
  • CLI commands
  • model wiring
  • trainer
  • inference entry points

It depends on these independently publishable packages:

  • chartoken-vp
  • celmoe-vp
  • sigmorphon-vp
  • torchblocks-vp
  • trainkit-vp

Architecture summary

The current model builds a three-level expert hierarchy:

  • universal
  • family
  • language

The actual orchestration is handled by HierarchicalCELMoE. morphoformer supplies the morphology-specific expert blocks, embeddings, routing, and output heads.

Input side:

  • character embeddings
  • feature embeddings
  • language embeddings
  • feature-to-token broadcast fusion
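
The input fusion above can be sketched roughly as follows. This is an illustrative module, not the real Morphoformer code: the class name `InputFusion`, the mean-pooling of features, and the vocabulary sizes are assumptions; only the dimension names (`d_model`, `feature_dim`) follow the `[model]` config section.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Sketch: fuse character, feature, and language embeddings."""

    def __init__(self, n_chars=100, n_feats=50, n_langs=10,
                 d_model=768, feature_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.feat_emb = nn.Embedding(n_feats, feature_dim)
        self.lang_emb = nn.Embedding(n_langs, d_model)
        self.feat_proj = nn.Linear(feature_dim, d_model)

    def forward(self, chars, feats, lang_id):
        # chars: (B, T) character ids; feats: (B, F) feature ids
        x = self.char_emb(chars)                 # (B, T, d_model)
        f = self.feat_emb(feats).mean(dim=1)     # pool features: (B, feature_dim)
        f = self.feat_proj(f).unsqueeze(1)       # (B, 1, d_model)
        l = self.lang_emb(lang_id).unsqueeze(1)  # (B, 1, d_model)
        # broadcast the feature and language vectors across all token positions
        return x + f + l                         # (B, T, d_model)

fusion = InputFusion()
out = fusion(torch.zeros(2, 9, dtype=torch.long),
             torch.zeros(2, 4, dtype=torch.long),
             torch.zeros(2, dtype=torch.long))
print(out.shape)  # torch.Size([2, 9, 768])
```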

Expert side:

  • MorphExpertStack built from torchblocks-vp
  • configurable attention, norm, feedforward, adapter, convolution, and position modules
  • routing by language family and language code
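
The routing idea can be shown in plain Python. The actual orchestration lives in HierarchicalCELMoE; the `select_experts` helper below is hypothetical, and the family assignments for `krl` and `afb` are illustrative (only `rus = slavic` appears in the example config).

```python
# The language-to-family map mirrors the [languages.<code>] config sections.
LANGUAGE_TO_FAMILY = {"rus": "slavic", "krl": "uralic", "afb": "semitic"}

def select_experts(lang: str) -> list[str]:
    """Return the expert stacks an input passes through, top to bottom."""
    family = LANGUAGE_TO_FAMILY[lang]
    # Every input visits the universal stack, then its family stack,
    # then its language-specific stack.
    return ["universal", f"family:{family}", f"language:{lang}"]

print(select_experts("rus"))  # ['universal', 'family:slavic', 'language:rus']
```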

Output side:

  • logits
  • universal_logits
  • family_logits
  • language_logits

Those outputs are consumed by the multi-loss training setup in trainkit-vp.

Installation

Requirements:

  • Python >=3.14
  • PyTorch >=2.0

Install from PyPI:

pip install morphoformer

For local development from this repository, publish or install the dependent libraries first, because they are versioned as separate packages.

CLI

The package exposes the morphoformer console command.

Available subcommands:

  • download
  • inspect-config
  • train
  • infer

Download data

List languages:

morphoformer download --list-languages

Download specific languages and merge them:

morphoformer download --lang rus,krl,afb --out-dir data --merge

Download everything known by the downloader:

morphoformer download --lang all --out-dir data

Inspect config

morphoformer inspect-config --config dev/config.toml

Train

morphoformer train --config dev/config.toml

The trainer writes the best checkpoint into the configured output directory.

Infer

morphoformer infer `
  --config dev/config.toml `
  --checkpoint artifacts/v4_omni/best.pt `
  --lemma write `
  --tags "V;PST" `
  --lang eng

(The trailing backticks are PowerShell line continuations; on POSIX shells use \ instead.)

Configuration

The TOML config is loaded into typed dataclasses:

  • DataConfig
  • LanguageConfig
  • ModelConfig
  • OptimizerConfig
  • TrainConfig
  • DecodeConfig
  • MorphoformerConfig

Main config sections:

  • [data]
  • [model]
  • [optimizer]
  • [train]
  • [decode]
  • [languages.<code>]

Example:

[data]
train_path = "data/merged_train.tsv"
dev_path = "data/merged_dev.tsv"
max_len = 96
max_features = 12

[model]
d_model = 768
dim_ff = 2304
num_heads = 12
num_kv_heads = 4
dropout = 0.12
max_positions = 256
feature_dim = 128
attention = "gqa"
feedforward = "swiglu"
norm = "rmsnorm"
adapter = "language_conditioned"
universal_layers = 8
family_layers = 2
language_layers = 2

[train]
stage = "joint"
epochs = 10
batch_size = 64
warmup_steps = 500
total_steps = 12000
output_dir = "artifacts/v4_omni"

[languages.rus]
family = "slavic"

Training flow

The trainer does the following:

  1. load train and dev TSV data
  2. build character and feature vocabularies
  3. build the language-to-id map from config
  4. pre-encode datasets into MorphDataset
  5. instantiate Morphoformer
  6. freeze or unfreeze stages according to train.stage
  7. optimize with AdamW, warmup cosine schedule, and AMP when enabled
  8. evaluate on the dev set each epoch
  9. save the best checkpoint
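
The warmup-plus-cosine schedule in step 7 is a standard shape; a minimal sketch using the example config's warmup_steps = 500 and total_steps = 12000 (the exact formula in trainkit-vp may differ):

```python
import math

def warmup_cosine(step: int, base_lr: float,
                  warmup_steps: int = 500, total_steps: int = 12000) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(warmup_cosine(250, 1e-3))    # halfway through warmup: 0.0005
print(warmup_cosine(500, 1e-3))    # peak: 0.001
print(warmup_cosine(12000, 1e-3))  # end of schedule: 0.0
```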

The loss is a weighted combination of:

  • final output loss
  • universal expert loss
  • family expert loss
  • language expert loss
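
The combination is plain weighted arithmetic over the four per-head losses. The weight values here are illustrative, not the defaults shipped in trainkit-vp:

```python
def combined_loss(final, universal, family, language,
                  weights=(1.0, 0.3, 0.3, 0.3)) -> float:
    """Weighted sum of the four per-head losses (weights are illustrative)."""
    w_final, w_uni, w_fam, w_lang = weights
    return (w_final * final + w_uni * universal
            + w_fam * family + w_lang * language)

# 1.0*2.0 + 0.3*(3.0 + 2.5 + 2.2) = 4.31 (up to float rounding)
print(combined_loss(2.0, 3.0, 2.5, 2.2))
```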

Checkpoint contents

Saved checkpoints include:

  • model_state
  • optimizer_state
  • char_vocab
  • feature_vocab
  • language_to_id
  • epoch

That is enough to restore the model together with the exact vocabularies used during training.
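
The checkpoint layout can be sketched as a plain dict round-tripped through pickle. The trainer itself presumably serializes with torch.save (which pickles under the hood); the values below are toy stand-ins, but the keys follow the list above:

```python
import io
import pickle

# Illustrative stand-ins for the real state dicts and vocab objects.
checkpoint = {
    "model_state": {"embed.weight": [0.1, 0.2]},
    "optimizer_state": {"step": 1200},
    "char_vocab": ["<pad>", "<bos>", "<eos>", "a", "b"],
    "feature_vocab": ["V", "PST"],
    "language_to_id": {"rus": 0, "krl": 1, "afb": 2},
    "epoch": 7,
}

buf = io.BytesIO()
pickle.dump(checkpoint, buf)
buf.seek(0)
restored = pickle.load(buf)

# Model weights and the exact training-time vocabularies come back together.
print(restored == checkpoint)  # True
```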

Inference path

predict_form(...):

  • encodes the lemma with CharVocab
  • encodes tags with FeatureVocab
  • maps the language string to language_id
  • runs greedy decoding through the model
  • decodes predicted ids back into a surface string
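
The greedy loop in the abstract looks like the sketch below, with a toy scoring function standing in for the model forward pass; predict_form's real loop runs the Morphoformer decoder and works over vocabulary ids rather than strings.

```python
# Toy next-token scorer: deterministically spells out "wrote" then EOS.
EOS = "<eos>"
SCRIPT = list("wrote") + [EOS]

def next_token_scores(prefix: list[str]) -> dict[str, float]:
    """Stand-in for a model forward pass: score candidate next tokens."""
    target = SCRIPT[len(prefix)]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(SCRIPT)}

def greedy_decode(max_len: int = 32) -> str:
    out: list[str] = []
    while len(out) < max_len:
        scores = next_token_scores(out)
        best = max(scores, key=scores.get)  # greedy: always take the argmax
        if best == EOS:
            break
        out.append(best)
    return "".join(out)

print(greedy_decode())  # wrote
```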

Relationship to celmoe-vp

This package is where the task-specific part begins.

celmoe-vp itself stays generic and knows nothing about morphology. morphoformer is responsible for:

  • choosing hierarchy levels
  • defining expert block structure
  • mapping languages to families
  • attaching morphology-specific heads
  • converting expert outputs into token logits

That split is important because the architecture package and the application package are published separately.

Publishing and versioning

In Morph_v4 the libraries are not bundled into one mega-package. Each package is published independently and morphoformer depends on versioned releases of the lower-level libs.

That means before publishing morphoformer, you should publish compatible versions of:

  • chartoken-vp
  • celmoe-vp
  • sigmorphon-vp
  • torchblocks-vp
  • trainkit-vp

The repository includes publish.ps1 to build, version, and publish the stack in dependency order.
