
Un-fairseq: UnFormers (Universal Transformers) — config-driven enc-dec chassis covering NLLB/mBART/Marian/mT5/UL2/t5gemma/TranslateGemma/Qwen/Gemma, plus Matryoshka encoder, Garg 2019 supervised attention, PyTorch IBM Models 1/2/HMM/4, Brown+k-means clustering, and portable char/byte alignment.

unfairseq / UnFormers

UnFormers (Universal Transformers) is a single configurable encoder-decoder Transformer implementation that covers the architectural choices of modern NMT / seq2seq model families through presets. One codebase, one set of modules, one HF-compatible PreTrainedModel — and a preset picks the knobs (attention kind, positional encoding, norm, FFN, bias policy, …) to reconstruct NLLB / mBART / Marian / mT5 / UL2 / t5gemma / TranslateGemma / Qwen / Gemma.

On top of the core it ships:

  • Matryoshka encoder (MatFormer-style depth pruning): train once, serve at multiple depths, prune permanently after training.
  • Supervised attention word alignment (Garg 2019) on a configurable decoder-layer / head, applied at every Matryoshka granularity.
  • Neural IBM alignment (IBM Model 1 / 2 / HMM) in pure PyTorch, GPU-batched, subword-native — an eflomal replacement that aligns directly on your tokenizer's ids.
  • Portable alignment format: char-span (or UTF-8 byte-span) records plus word-level aggregation via ICU, so alignments are usable by any downstream tokenizer.
  • UL2 mixture-of-denoisers corpus preprocessing (R / X / S denoisers).
  • Expert-parallel MoE, KV cache for generation, gradient checkpointing, and warm_start (Net2Net + bert2bert) to seed UnFormer weights from any HF checkpoint.

Installation

UnFormers has one native dependency chain you need to handle before pip install: PyICU, which wraps ICU4C (the Unicode library Chrome/Firefox/Java all use). ICU ships the word-break dictionaries for CJK / Thai / Khmer / Lao / Myanmar that make word-level alignment work for those languages.

1. Install ICU4C (system library)

macOS (Homebrew):

brew install icu4c
# Homebrew doesn't symlink icu4c by default; tell pkg-config where to find it:
echo 'export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
echo 'export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig"' >> ~/.zshrc

On Apple Silicon, Homebrew installs under /opt/homebrew, so use /opt/homebrew/opt/icu4c/... in place of /usr/local/opt/icu4c/... in both lines.

Debian / Ubuntu:

sudo apt install pkg-config libicu-dev

Fedora / RHEL:

sudo dnf install libicu-devel

Alpine:

apk add icu-dev pkgconfig

Windows: grab the ICU binaries from https://icu.unicode.org/download and ensure icu-config is on PATH, or use a pre-built PyICU wheel from the Python wheels index (2.16+ has Windows wheels).

2. Install PyICU (Python binding)

# after the system icu4c is in place:
# quote the requirement so the shell does not treat ">" as a redirect:
pip install 'PyICU>=2.11'

If PyICU's build fails with "u_init_74 not found" or similar, you have a version mismatch — icu-config --version must match the ICU the wheel was built against. Rebuild against your local ICU with:

PYICU_INCLUDES="$(icu-config --cppflags)" \
PYICU_LFLAGS="$(icu-config --ldflags)" \
pip install --no-binary=:all: PyICU

3. ICU data / dictionaries

ICU's word-break dictionaries for zh / ja / th / km / lo / my ship with the ICU4C install — you do not need to download anything separately. To verify the bundled dictionaries are available:

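import icu

print(icu.ICU_VERSION)  # confirms PyICU can load the ICU library
bi = icu.BreakIterator.createWordInstance(icu.Locale("zh"))
bi.setText("我爱北京天安门")
# Iterating yields the word-boundary offsets; with the dictionaries present,
# multi-character words such as 北京 come out as single segments.
print(list(bi))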

If icu.ICU_VERSION prints and BreakIterator segments Chinese correctly, you have the dictionaries. They live inside icudt{VERSION}l.dat in the ICU data directory (icu-config --icudatadir). On a minimal ICU install ("lite") the dict files are stripped; install the full ICU package (default on every major distro).

If you ever need a newer or language-specific ICU data bundle, download icu4c-*-data-bin-l.zip from https://icu.unicode.org/download and copy the .dat file into the directory printed by icu-config --icudatadir.

4. UnFormers itself

# from the repo root:
pip install -e .                 # editable (dev) install from a checkout
pip install .                    # regular install
pip install '.[align]'           # + eflomal (optional; UnFormers ships its own aligner)
pip install '.[dev]'             # + pytest, ruff

Once installed, sanity-check ICU integration:

python -c "import icu; print('ICU', icu.ICU_VERSION, 'PyICU', icu.__version__)"
python -c "from unformers.align import get_segmenter; print(get_segmenter('zh')('机器翻译系统'))"

Quick start

Build a model from a preset with any HF tokenizer

from transformers import AutoTokenizer
from unformers import UnFormerForConditionalGeneration
from unformers.presets import from_preset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
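# len(tok) counts added special tokens; tok.vocab_size can exclude them,
# which would leave pad/eos ids outside the embedding table.
# Compare against None so that a legitimate bos id of 0 is not discarded.
bos_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id
cfg = from_preset("ul2-mini-6-3", vocab_size=len(tok),
                  pad_token_id=tok.pad_token_id,
                  bos_token_id=bos_id,
                  eos_token_id=tok.eos_token_id)
model = UnFormerForConditionalGeneration(cfg)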

Train IBM-2 alignments and emit portable JSONL

python -m unformers.align.cli \
    --input parallel.tsv --src-col 0 --tgt-col 1 \
    --src-lang eng_Latn --tgt-lang zho_Hans \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --aligner-epochs 5 \
    --output aligned.jsonl

Each output line is tokenizer-agnostic:

{
  "src_text": "hello world",
  "tgt_text": "你好 世界",
  "src_lang": "eng_Latn",
  "tgt_lang": "zho_Hans",
  "char_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
  "word_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
  "byte_offsets": false,
  "segmenter_src": "icu:eng_Latn",
  "segmenter_tgt": "icu:zho_Hans"
}

Use --byte for UTF-8 byte offsets instead of char offsets.
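
Because the spans index into the raw strings, a consumer needs nothing beyond the standard library to use a record. The sketch below assumes half-open [start, end) offsets, as the example record suggests. The helper names aligned_pairs and char_span_to_byte_span are illustrative and not part of unformers.

```python
import json

# One record in the format shown above (normally one line of aligned.jsonl).
line = json.dumps({
    "src_text": "hello world",
    "tgt_text": "你好 世界",
    "word_alignments": [{"src": [0, 5], "tgt": [0, 2]},
                        {"src": [6, 11], "tgt": [3, 5]}],
    "byte_offsets": False,
}, ensure_ascii=False)


def aligned_pairs(record, key="word_alignments"):
    """Return (source substring, target substring) for each alignment link."""
    src, tgt = record["src_text"], record["tgt_text"]
    use_bytes = record["byte_offsets"]
    if use_bytes:
        # Byte-span records index into the UTF-8 encoding, not the str.
        src, tgt = src.encode("utf-8"), tgt.encode("utf-8")
    pairs = []
    for link in record[key]:
        (s0, s1), (t0, t1) = link["src"], link["tgt"]
        a, b = src[s0:s1], tgt[t0:t1]
        if use_bytes:
            a, b = a.decode("utf-8"), b.decode("utf-8")
        pairs.append((a, b))
    return pairs


def char_span_to_byte_span(text, start, end):
    """Convert a [start, end) char span to the matching UTF-8 byte span."""
    return (len(text[:start].encode("utf-8")), len(text[:end].encode("utf-8")))


record = json.loads(line)
pairs = aligned_pairs(record)
print(pairs)
print(char_span_to_byte_span(record["tgt_text"], 3, 5))
```

To feed the links to another tokenizer, map each char span onto that tokenizer's own offset mapping. Nothing in the record depends on the tokenizer that produced it.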

Train UL2-mini with Matryoshka + supervised attention

python examples/train_pure_pytorch.py \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --n-pairs 5000 --max-steps 1000 --batch-size 16 \
    --d-model 256 --num-heads 8 --ffn-size 512

Warm-start from an HF checkpoint

from transformers import AutoModelForSeq2SeqLM
from unformers import UnFormerForConditionalGeneration
from unformers.interop import warm_start
from unformers.presets import from_preset

source = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")
cfg = from_preset("mt5", vocab_size=source.config.vocab_size, size="mt5-base")
target = UnFormerForConditionalGeneration(cfg)
manifest = warm_start(target, source, strategy="auto")
print(manifest.summary())   # copied=..., padded=..., randomised=...

Preset capability matrix

| Preset | Positional | Norm | FFN | Attention | Bias | Align default | Notes |
|---|---|---|---|---|---|---|---|
| marian | sinusoidal | LayerNorm post | ReLU | MHA | | opt-in | classic vanilla Transformer |
| nllb | sinusoidal | LayerNorm pre | ReLU (MoE opt) | MHA | | opt-in | lang codes, MoE via moe_num_experts |
| mbart | learned abs | LayerNorm pre | GELU | MHA | | opt-in | |
| mt5 | T5 rel-bias | RMSNorm pre | GeGLU | MHA | | opt-in | untied lm_head |
| ul2 | T5 rel-bias | RMSNorm pre | SwiGLU (MoE opt) | MHA | | opt-in | prefix-LM, [R]/[X]/[S] tags |
| t5gemma | RoPE | RMSNorm pre | GeGLU | GQA | | opt-in | tied embed, √d scale |
| translategemma | RoPE 1M base | RMSNorm+preresid | GeGLU | GQA, QK-norm, logit-cap, sliding | | opt-in | 5:1 local/global interleave |
| qwen3.5 | RoPE 1M base | RMSNorm pre | SwiGLU | GQA, QK-norm | | opt-in | decoder-only family → enc-dec adapted |
| gemma4 | RoPE multi-freq | RMSNorm+preresid | GeGLU | GQA, QK-norm, logit-cap, sliding | | opt-in | local 10k / global 1M RoPE bases |
| ul2-mini-6-3 | RoPE | RMSNorm pre | SwiGLU | MHA | | on | Matryoshka [2,4,6], Garg 2019 demo |

All presets accept **kwargs to override d_model / encoder_layers / decoder_layers / num_heads / intermediate_size etc. so you can shrink a 2B preset into a test-sized version:

cfg = from_preset("gemma4", vocab_size=32000, d_model=64,
                  encoder_layers=2, decoder_layers=2,
                  num_heads=4, num_kv_heads=2, head_dim=16, intermediate_size=128)

Alignment supervision is available on every preset

Every preset exposes the same set of alignment_* kwargs to from_preset. Garg 2019 supervised cross-attention is off by default for all presets except ul2-mini-6-3 (the demo preset), where it's on. Enable and tune on any preset:

cfg = from_preset(
    "nllb",
    vocab_size=tok.vocab_size, size="nllb-600m-distilled",
    alignment_enabled=True,                     # turn Garg loss on
    alignment_loss_weight=0.05,                 # λ in total = ce + λ * align
    alignment_decoder_layer=-1,                 # which decoder layer to supervise (-1 = top)
    alignment_num_heads=1,                      # first N cross-attn heads, averaged
    alignment_full_context=False,               # second decoder pass w/o causal mask
    alignment_apply_to_all_granularities=True,  # Matryoshka × alignment
)

To disable on ul2-mini-6-3: pass alignment_enabled=False. For full control pass alignment=AlignmentConfig(...) as a kwarg — the explicit config overrides any individual alignment_* kwargs.
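
To make the alignment_loss_weight comment concrete, here is a minimal pure-Python sketch of how a Garg-style alignment term can combine with cross-entropy. It assumes the alignment term is the mean negative log of the supervised cross-attention weight on each gold link. The normalisation and head averaging actually used in unformers may differ, and alignment_nll and total_loss are illustrative names.

```python
import math


def alignment_nll(attn, links, eps=1e-9):
    """Mean negative log attention probability over gold alignment links.

    attn[t][s] is the cross-attention weight from target position t to
    source position s (each row sums to 1). links is a list of (t, s) pairs.
    """
    if not links:
        return 0.0
    return -sum(math.log(attn[t][s] + eps) for t, s in links) / len(links)


def total_loss(ce, attn, links, weight=0.05):
    """total = ce + weight * align, with weight playing the role of λ."""
    return ce + weight * alignment_nll(attn, links)


# Two target positions attending over three source positions.
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1]]
links = [(0, 0), (1, 1)]  # gold links as (target index, source index)

align = alignment_nll(attn, links)
loss = total_loss(ce=2.0, attn=attn, links=links)
print(round(align, 4), round(loss, 4))
```

The term shrinks toward zero as the supervised head puts more mass on the gold source positions, so a small weight is enough to steer that head without dominating the translation loss.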

What's in the box

Architecture (config-driven)

  • Attention: MHA / MQA / GQA, QK-norm, attention logit soft-cap, sliding window, per-layer local/global interleave.
  • Positional: sinusoidal, learned abs, T5 bucketed rel-bias, ALiBi, RoPE (single-freq + per-layer multi-freq with NTK / linear scaling).
  • Norm: LayerNorm (bias / no-bias), RMSNorm; pre- / post-norm; pre-residual norm (Gemma-style).
  • FFN: Dense (GELU/ReLU/SiLU), GLU (SwiGLU/GeGLU/ReGLU), MoE (single-GPU + expert-parallel).
  • Embedding: tied / untied; √d_model scale; final-logit soft-cap & scale.
  • Decoder: causal or prefix-LM; every-N cross-attention layers.
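
As an illustration of one knob from the list above, attention logit soft-capping is commonly written as cap * tanh(x / cap). The sketch below shows that formula in plain Python. The cap value of 50.0 is an arbitrary choice for the example, not a value taken from any preset, and the function names are illustrative.

```python
import math


def soft_cap(logits, cap):
    """Squash logits smoothly into (-cap, cap) via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


raw = [0.5, 5.0, 120.0]
capped = soft_cap(raw, cap=50.0)

print(capped)           # small logits are nearly unchanged, large ones saturate
print(softmax(raw))     # the outlier takes essentially all of the mass
print(softmax(capped))  # still peaked, but the logit gap is bounded
```

Because tanh is monotonic, the ranking of the logits is preserved. Only their dynamic range is limited.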

Training

  • UnFormerTrainer subclasses HF Seq2SeqTrainer; use it or fall back to examples/train_pure_pytorch.py when you don't want accelerate.
  • Losses: label-smoothed CE, Garg 2019 alignment NLL, Switch-style MoE aux.
  • Matryoshka depth sampling: joint / stochastic / sandwich.
  • Gradient checkpointing via model.gradient_checkpointing_enable() — skips the alignment-supervised layer so the Garg loss still backprops.
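
For readers unfamiliar with the Switch-style MoE auxiliary loss listed above, here is a small sketch of the load-balancing term from the Switch Transformer formulation: alpha * N * Σ f_e * P_e. It assumes top-1 routing, and the coefficient alpha=0.01 is an assumption for the example. The unformers implementation may differ in routing and scaling.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """Load-balancing loss for top-1 routing.

    router_probs[t][e] is the router softmax probability of expert e for
    token t. f_e is the fraction of tokens whose argmax is expert e, and
    P_e is the mean router probability assigned to expert e.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=probs.__getitem__)] += 1
        for e, p in enumerate(probs):
            mean_prob[e] += p / num_tokens
    frac = [c / num_tokens for c in counts]
    return alpha * num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Tokens split evenly between two experts vs. all tokens on expert 0.
balanced = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
collapsed = [[0.9, 0.1]] * 4

balanced_loss = switch_aux_loss(balanced)
collapsed_loss = switch_aux_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The loss is smallest when tokens and router probability are spread uniformly across experts, which is why it discourages routing collapse.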

Alignment

  • unformers.align.NeuralIBMAligner — IBM Model 1 / 2 / HMM, factored lexical table, GPU-batched, pharaoh output, fwd/rev + grow-diag-final-and symmetrisation.
  • unformers.align.PortableAlignment — char (default) or byte spans + word aggregation; python -m unformers.align.cli end-to-end runner.
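
For orientation, the classic IBM Model 1 that NeuralIBMAligner builds on can be written as a few lines of EM. The sketch below is a textbook toy version in pure Python (no NULL word, no batching, no factored lexical table) and is not the library's implementation. ibm_model1 and viterbi_align are illustrative names.

```python
from collections import defaultdict


def ibm_model1(pairs, epochs=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    pairs is a list of (source_tokens, target_tokens).
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform start
    for _ in range(epochs):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e generating anything
        # E-step: distribute each target token over the source tokens.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise expected counts per source token.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def viterbi_align(t, src, tgt):
    """Link each target token to its most probable source token (Pharaoh i-j)."""
    links = []
    for j, f in enumerate(tgt):
        i = max(range(len(src)), key=lambda k: t[(f, src[k])])
        links.append(f"{i}-{j}")
    return " ".join(links)


corpus = [("das Haus".split(), "the house".split()),
          ("das Buch".split(), "the book".split()),
          ("ein Buch".split(), "a book".split())]
t = ibm_model1(corpus)
print(viterbi_align(t, "das Haus".split(), "the house".split()))
```

Running the same procedure in the reverse direction and intersecting or growing the two link sets is what symmetrisation heuristics such as grow-diag-final-and do.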

Data

  • TokenizerWrapper — any HF tokenizer, handles UL2 denoiser tags and lang codes.
  • UL2 mixture-of-denoisers (R / X / S) preprocessing.
  • Seq2SeqWithAlignmentCollator — pads src/tgt, shifts decoder input, turns pharaoh alignments into flat loss-index tensors with inverse-frequency weights.
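
The UL2 denoisers listed above reduce to two operations on a token sequence: span corruption (R and X) and a prefix/continuation split (S). The sketch below shows both on a list of words. The spans are passed in explicitly to keep it deterministic, and the T5-style <extra_id_N> sentinels are an assumption; how unformers samples spans and names sentinels is not specified here. span_corrupt and prefix_lm are illustrative names.

```python
def span_corrupt(tokens, spans, tag="[R]"):
    """Replace each (start, end) span with a sentinel.

    Returns (encoder_input, decoder_target). spans must be sorted and
    non-overlapping. An X-denoiser example would use the same routine with
    longer spans or a higher corruption rate and tag="[X]".
    """
    inputs, targets, cursor = [tag], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        cursor = end
    inputs += tokens[cursor:]
    return inputs, targets


def prefix_lm(tokens, split, tag="[S]"):
    """S-denoiser: condition on a prefix and predict the continuation."""
    return [tag] + tokens[:split], tokens[split:]


toks = "the quick brown fox jumps over the lazy dog".split()
r_in, r_out = span_corrupt(toks, [(1, 3), (6, 7)])
s_in, s_out = prefix_lm(toks, split=4)

print(" ".join(r_in), "->", " ".join(r_out))
print(" ".join(s_in), "->", " ".join(s_out))
```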

Interop

  • warm_start(target, source, strategy="auto") — Net2Net (wider / deeper identity insertion) + bert2bert (cross-attn init from self-attn when source lacks cross-attn) + key-normalisation aliases for T5 / BART / NLLB / Marian / mBART / Llama / Qwen / Gemma naming. Returns a CopyManifest listing copied / padded / identity-inserted / randomised tensors.
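
The "wider" half of Net2Net rests on a function-preserving trick: duplicate hidden units, then divide each unit's outgoing weights by its number of copies. The sketch below demonstrates this on a two-layer MLP using plain lists. It is a toy illustration of the idea, not the code behind warm_start, and all names in it are illustrative.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def forward(w1, w2, x):
    """Two-layer MLP without biases: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def net2wider(w1, w2, new_width, rng):
    """Widen the hidden layer from len(w1) to new_width, preserving outputs.

    Extra hidden units copy a randomly chosen existing unit; every outgoing
    weight is divided by the number of copies of its unit.
    """
    old_width = len(w1)
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    copies = [mapping.count(j) for j in range(old_width)]
    wide_w1 = [list(w1[j]) for j in mapping]
    wide_w2 = [[row[j] / copies[j] for j in mapping] for row in w2]
    return wide_w1, wide_w2


rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # hidden=4, in=3
w2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # out=2, hidden=4
wide_w1, wide_w2 = net2wider(w1, w2, new_width=7, rng=rng)

x = [0.3, -1.2, 0.8]
before, after = forward(w1, w2, x), forward(wide_w1, wide_w2, x)
print(before, after)
```

The widened network computes the same outputs as the original, so training can continue from the source model's behaviour rather than from scratch.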

Generation

  • model.generate(...) via HF GenerationMixin. KV-cached decoding matches a full forward pass to within 2e-5. Greedy and beam search both work.

Development

Run the tests

pip install -e '.[dev]'
pytest                           # fast tests
pytest -v -m slow -k 0.5B        # large-scale param-tier tests
pytest tests/test_portable_alignment.py -v  # alignment + ICU tests

Layout

unformers/
  config.py          # UnFormerConfig + all nested dataclasses
  modules/           # attention, positional, norm, ffn, moe, embedding
  blocks/            # encoder_layer, decoder_layer
  model/             # encoder, decoder, seq2seq (PreTrainedModel)
  presets/           # one file per family + _helpers.py
  align/             # NeuralIBMAligner, portable alignment, segmenters, CLI
  data/              # tokenizer wrapper, collator, UL2 denoisers
  train/             # trainer, losses, Matryoshka policy
  interop/           # warm_start
examples/
  smoke_test.py             # HF Trainer path
  train_pure_pytorch.py     # plain torch loop
tests/
  test_presets.py
  test_preset_sizes.py      # 0.5B / 1B / 2B / 3B tiers (slow)
  test_warm_start.py
  test_gradient_checkpointing.py
  test_moe.py
  test_portable_alignment.py

License

See LICENSE in the repo root.

