Un-fairseq: UnFormers (Universal Transformers) — config-driven enc-dec chassis covering NLLB/mBART/Marian/mT5/UL2/t5gemma/TranslateGemma/Qwen/Gemma, plus Matryoshka encoder, Garg 2019 supervised attention, PyTorch IBM Models 1/2/HMM/4, Brown+k-means clustering, and portable char/byte alignment.
unfairseq / UnFormers
UnFormers (Universal Transformers) is a single configurable encoder-decoder
Transformer implementation that covers the architectural choices of modern
NMT / seq2seq model families through presets. One codebase, one set of
modules, one HF-compatible PreTrainedModel — and a preset picks the knobs
(attention kind, positional encoding, norm, FFN, bias policy, …) to
reconstruct NLLB / mBART / Marian / mT5 / UL2 / t5gemma / TranslateGemma /
Qwen / Gemma.
On top of the core it ships:
- Matryoshka encoder (MatFormer-style depth pruning): train once, serve at multiple depths, prune permanently after training.
- Supervised attention word alignment (Garg 2019) on a configurable decoder-layer / head, applied at every Matryoshka granularity.
- Neural IBM alignment (IBM Model 1 / 2 / HMM) in pure PyTorch, GPU-batched, subword-native: an eflomal replacement that aligns directly on your tokenizer's ids.
- Portable alignment format: char-span (or UTF-8 byte-span) records plus word-level aggregation via ICU, so alignments are usable by any downstream tokenizer.
- UL2 mixture-of-denoisers corpus preprocessing (R / X / S denoisers).
- Expert-parallel MoE, KV cache for generation, gradient checkpointing, and warm_start (Net2Net + bert2bert) to seed UnFormer weights from any HF checkpoint.
Installation
UnFormers has one native dependency chain you need to handle before pip install: PyICU, which wraps ICU4C (the Unicode library used by Chrome, Firefox,
and many other projects). ICU ships the word-break dictionaries for CJK / Thai / Khmer /
Lao / Myanmar that make word-level alignment work for those languages.
1. Install ICU4C (system library)
macOS (Homebrew):
brew install icu4c
# Homebrew doesn't symlink icu4c by default; tell pkg-config where to find it:
echo 'export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
echo 'export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig"' >> ~/.zshrc
Apple Silicon paths are under /opt/homebrew/opt/icu4c/... instead of
/usr/local/opt/....
Debian / Ubuntu:
sudo apt install pkg-config libicu-dev
Fedora / RHEL:
sudo dnf install libicu-devel
Alpine:
apk add icu-dev pkgconfig
Windows: grab the ICU binaries from
https://icu.unicode.org/download and ensure icu-config is on PATH, or use
a pre-built PyICU wheel from the Python wheels index (2.16+ has Windows
wheels).
2. Install PyICU (Python binding)
# after the system icu4c is in place (quote the spec so the shell does not treat > as a redirect):
pip install 'PyICU>=2.11'
If PyICU's build fails with "u_init_74 not found" or similar, you have a
version mismatch — icu-config --version must match the ICU the wheel was
built against. Rebuild against your local ICU with:
PYICU_INCLUDES="$(icu-config --cppflags)" \
PYICU_LFLAGS="$(icu-config --ldflags)" \
pip install --no-binary=:all: PyICU
3. ICU data / dictionaries
ICU's word-break dictionaries for zh / ja / th / km / lo / my ship with the ICU4C install — you do not need to download anything separately. To verify the bundled dictionaries are available:
import icu

print(icu.ICU_VERSION)  # version of the ICU library PyICU is linked against
bi = icu.BreakIterator.createWordInstance(icu.Locale("zh"))
bi.setText("我爱北京天安门")
print(list(bi))  # word-boundary offsets: the end position of each segmented word
If icu.ICU_VERSION prints and BreakIterator segments Chinese correctly,
you have the dictionaries. They live inside icudt{VERSION}l.dat in the ICU
data directory (icu-config --icudatadir). On a minimal ICU install ("lite")
the dict files are stripped; install the full ICU package (default on every
major distro).
If you ever need a newer or language-specific ICU data bundle, download
icu4c-*-data-bin-l.zip from https://icu.unicode.org/download and drop the
.dat file into icu-config --icudatadir.
4. UnFormers itself
pip install -e . # dev install from a checkout
# or from the repo root:
pip install . # regular install
pip install '.[align]' # + eflomal (optional, we ship our own)
pip install '.[dev]' # + pytest, ruff
Once installed, sanity-check ICU integration:
python -c "import icu; print('ICU', icu.ICU_VERSION, 'PyICU', icu.__version__)"
python -c "from unformers.align import get_segmenter; print(get_segmenter('zh')('机器翻译系统'))"
Quick start
Build a model from a preset with any HF tokenizer
from transformers import AutoTokenizer
from unformers import UnFormerForConditionalGeneration
from unformers.presets import from_preset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
# Fall back to EOS only when the tokenizer defines no BOS (id 0 is a valid BOS id).
bos_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id
cfg = from_preset(
    "ul2-mini-6-3",
    vocab_size=len(tok),  # len(tok) counts added special tokens; tok.vocab_size does not
    pad_token_id=tok.pad_token_id,
    bos_token_id=bos_id,
    eos_token_id=tok.eos_token_id,
)
model = UnFormerForConditionalGeneration(cfg)
Train IBM-2 alignments and emit portable JSONL
python -m unformers.align.cli \
    --input parallel.tsv --src-col 0 --tgt-col 1 \
    --src-lang eng_Latn --tgt-lang zho_Hans \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --aligner-epochs 5 \
    --output aligned.jsonl
Each output line is a tokenizer-agnostic JSON record (pretty-printed here for readability):
{
"src_text": "hello world",
"tgt_text": "你好 世界",
"src_lang": "eng_Latn",
"tgt_lang": "zho_Hans",
"char_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
"word_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
"byte_offsets": false,
"segmenter_src": "icu:eng_Latn",
"segmenter_tgt": "icu:zho_Hans"
}
Use --byte for UTF-8 byte offsets instead of char offsets.
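The relationship between the two offset kinds can be sketched in a few lines. This is an illustration only, not the CLI's code: the helper name `char_span_to_byte_span` is hypothetical, and it assumes spans are half-open `[start, end)` code-point offsets, which is what the sample record above suggests.

```python
def char_span_to_byte_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Map a half-open character span [start, end) to a UTF-8 byte span."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end


tgt_text = "你好 世界"
char_spans = [(0, 2), (3, 5)]  # the tgt spans from the sample record
byte_spans = [char_span_to_byte_span(tgt_text, s, e) for s, e in char_spans]
print(byte_spans)  # [(0, 6), (7, 13)]: each CJK character is 3 bytes in UTF-8
```

For pure-ASCII text the two kinds of offsets coincide, so the difference only shows up on the non-Latin side of a pair.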
Train UL2-mini with Matryoshka + supervised attention
python examples/train_pure_pytorch.py \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --n-pairs 5000 --max-steps 1000 --batch-size 16 \
    --d-model 256 --num-heads 8 --ffn-size 512
Warm-start from an HF checkpoint
from transformers import AutoModelForSeq2SeqLM
from unformers import UnFormerForConditionalGeneration
from unformers.interop import warm_start
from unformers.presets import from_preset

source = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")
cfg = from_preset("mt5", vocab_size=source.config.vocab_size, size="mt5-base")
target = UnFormerForConditionalGeneration(cfg)
manifest = warm_start(target, source, strategy="auto")
print(manifest.summary()) # copied=..., padded=..., randomised=...
Preset capability matrix
| Preset | Positional | Norm | FFN | Attention | Bias | Align default | Notes |
|---|---|---|---|---|---|---|---|
| marian | sinusoidal | LayerNorm post | ReLU | MHA | ✓ | opt-in | classic vanilla Transformer |
| nllb | sinusoidal | LayerNorm pre | ReLU (MoE opt) | MHA | ✓ | opt-in | lang codes, MoE via moe_num_experts |
| mbart | learned abs | LayerNorm pre | GELU | MHA | ✓ | opt-in | |
| mt5 | T5 rel-bias | RMSNorm pre | GeGLU | MHA | ✗ | opt-in | untied lm_head |
| ul2 | T5 rel-bias | RMSNorm pre | SwiGLU (MoE opt) | MHA | ✗ | opt-in | prefix-LM, [R]/[X]/[S] tags |
| t5gemma | RoPE | RMSNorm pre | GeGLU | GQA | ✗ | opt-in | tied embed, √d scale |
| translategemma | RoPE 1M base | RMSNorm+preresid | GeGLU | GQA, QK-norm, logit-cap, sliding | ✗ | opt-in | 5:1 local/global interleave |
| qwen3.5 | RoPE 1M base | RMSNorm pre | SwiGLU | GQA, QK-norm | ✗ | opt-in | decoder-only family → enc-dec adapted |
| gemma4 | RoPE multi-freq | RMSNorm+preresid | GeGLU | GQA, QK-norm, logit-cap, sliding | ✗ | opt-in | local 10k / global 1M RoPE bases |
| ul2-mini-6-3 | RoPE | RMSNorm pre | SwiGLU | MHA | ✗ | on | Matryoshka [2,4,6], Garg 2019 demo |
All presets accept **kwargs to override d_model / encoder_layers /
decoder_layers / num_heads / intermediate_size etc. so you can shrink a
2B preset into a test-sized version:
cfg = from_preset("gemma4", vocab_size=32000, d_model=64,
                  encoder_layers=2, decoder_layers=2,
                  num_heads=4, num_kv_heads=2, head_dim=16, intermediate_size=128)
Alignment supervision is available on every preset
Every preset exposes the same set of alignment_* kwargs to from_preset.
Garg 2019 supervised cross-attention is off by default for all presets except
ul2-mini-6-3 (the demo preset), where it's on. Enable and tune on any
preset:
cfg = from_preset(
    "nllb",
    vocab_size=tok.vocab_size, size="nllb-600m-distilled",
    alignment_enabled=True, # turn Garg loss on
    alignment_loss_weight=0.05, # λ in total = ce + λ * align
    alignment_decoder_layer=-1, # which decoder layer to supervise (-1 = top)
    alignment_num_heads=1, # first N cross-attn heads, averaged
    alignment_full_context=False, # second decoder pass w/o causal mask
    alignment_apply_to_all_granularities=True, # Matryoshka × alignment
)
To disable on ul2-mini-6-3: pass alignment_enabled=False. For full
control pass alignment=AlignmentConfig(...) as a kwarg — the explicit
config overrides any individual alignment_* kwargs.
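To make the `total = ce + λ * align` comment concrete, here is a minimal pure-Python sketch of how the alignment term could be computed. It is an assumption-laden illustration, not the UnFormers loss: the function `alignment_nll`, the nested-list layout of the attention probabilities, and the plain mean over gold links are all hypothetical choices made for readability.

```python
import math


def alignment_nll(attn_heads, aligned_pairs, num_heads=1):
    """Mean negative log-likelihood of gold links under head-averaged attention.

    attn_heads:    [head][tgt_pos][src_pos] cross-attention probabilities
                   from the supervised decoder layer (hypothetical layout).
    aligned_pairs: gold (tgt_pos, src_pos) links.
    num_heads:     average over the first N heads, as alignment_num_heads does.
    """
    heads = attn_heads[:num_heads]
    total = 0.0
    for t, s in aligned_pairs:
        avg_prob = sum(h[t][s] for h in heads) / len(heads)
        total -= math.log(max(avg_prob, 1e-9))  # clamp to avoid log(0)
    return total / len(aligned_pairs)


attn = [
    [[0.9, 0.1], [0.2, 0.8]],  # head 0: fairly sharp, mostly diagonal
    [[0.5, 0.5], [0.5, 0.5]],  # head 1: uniform, ignored when num_heads=1
]
gold = [(0, 0), (1, 1)]
ce_loss = 2.0   # stand-in for the label-smoothed cross-entropy
lam = 0.05      # alignment_loss_weight
align_loss = alignment_nll(attn, gold, num_heads=1)
total_loss = ce_loss + lam * align_loss
print(round(align_loss, 4), round(total_loss, 4))
```

With a small λ such as 0.05 the alignment term nudges the supervised head toward the gold links without dominating the translation loss.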
What's in the box
Architecture (config-driven)
- Attention: MHA / MQA / GQA, QK-norm, attention logit soft-cap, sliding window, per-layer local/global interleave.
- Positional: sinusoidal, learned abs, T5 bucketed rel-bias, ALiBi, RoPE (single-freq + per-layer multi-freq with NTK / linear scaling).
- Norm: LayerNorm (bias / no-bias), RMSNorm; pre- / post-norm; pre-residual norm (Gemma-style).
- FFN: Dense (GELU/ReLU/SiLU), GLU (SwiGLU/GeGLU/ReGLU), MoE (single-GPU + expert-parallel).
- Embedding: tied / untied; √d_model scale; final-logit soft-cap & scale.
- Decoder: causal or prefix-LM; every-N cross-attention layers.
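As an illustration of the soft-cap option listed above, the sketch below uses the `cap * tanh(x / cap)` form popularised by the Gemma family. Whether UnFormers uses exactly this form is an assumption, and `soft_cap` is a hypothetical helper, not a library function.

```python
import math


def soft_cap(logits, cap):
    """Squash logits smoothly into (-cap, cap) via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


raw = [-200.0, -1.0, 0.0, 1.0, 200.0]
capped = soft_cap(raw, cap=50.0)
# Small logits pass through almost unchanged; large ones saturate just below the cap.
print([round(v, 3) for v in capped])
```

Unlike hard clipping, the tanh form stays differentiable everywhere and preserves the ordering of the logits.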
Training
- UnFormerTrainer subclasses HF Seq2SeqTrainer; use it or fall back to examples/train_pure_pytorch.py when you don't want accelerate.
- Losses: label-smoothed CE, Garg 2019 alignment NLL, Switch-style MoE aux.
- Matryoshka depth sampling: joint / stochastic / sandwich.
- Gradient checkpointing via model.gradient_checkpointing_enable(), which skips the alignment-supervised layer so the Garg loss still backprops.
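The three depth-sampling names are not defined further here, so the following sketch shows one plausible reading of them. The semantics attached to each strategy are assumptions, and `sample_depths` is a hypothetical helper rather than part of the UnFormers API.

```python
import random


def sample_depths(granularities, strategy, rng):
    """Choose the encoder depths to train on for one step.

    The meaning given to each strategy is an assumption, not UnFormers' code.
    """
    depths = sorted(set(granularities))
    if strategy == "joint":
        return depths                       # every depth on every step
    if strategy == "stochastic":
        return [rng.choice(depths)]         # one random depth per step
    if strategy == "sandwich":
        picked = {depths[0], depths[-1]}    # always the shallowest and deepest
        middle = depths[1:-1]
        if middle:
            picked.add(rng.choice(middle))  # plus one random intermediate depth
        return sorted(picked)
    raise ValueError(f"unknown strategy: {strategy!r}")


rng = random.Random(0)
for name in ("joint", "stochastic", "sandwich"):
    print(name, sample_depths([2, 4, 6], name, rng))
```

Under this reading, joint costs one forward pass per granularity on every step, stochastic costs one, and sandwich sits in between once there are more than three granularities.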
Alignment
- unformers.align.NeuralIBMAligner: IBM Model 1 / 2 / HMM, factored lexical table, GPU-batched, pharaoh output, fwd/rev + grow-diag-final-and symmetrisation.
- unformers.align.PortableAlignment: char (default) or byte spans + word aggregation; python -m unformers.align.cli end-to-end runner.
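For readers new to IBM Model 1, the textbook count-based EM procedure fits in a short pure-Python sketch. This is not how NeuralIBMAligner is implemented (that one is described as factored and GPU-batched); the functions `train_ibm1`, `align`, and `to_pharaoh` are hypothetical, and the NULL source token is omitted for brevity.

```python
from collections import defaultdict


def train_ibm1(pairs, epochs=10):
    """EM for the IBM Model 1 lexical table t[(tgt_word, src_word)]."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)        # uniform initialisation
    for _ in range(epochs):
        count = defaultdict(float)          # expected (tgt, src) link counts
        total = defaultdict(float)          # expected counts per source word
        for src, tgt in pairs:
            for f in tgt:                   # E-step: posterior over source words
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts per source word
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def align(src, tgt, t):
    """Each target word links to its most probable source word: (src_idx, tgt_idx)."""
    return [(max(range(len(src)), key=lambda i: t[(f, src[i])]), j)
            for j, f in enumerate(tgt)]


def to_pharaoh(links):
    """Serialise links as pharaoh-style 'src-tgt' index pairs."""
    return " ".join(f"{i}-{j}" for i, j in links)


pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = train_ibm1(pairs)
print(to_pharaoh(align(["ein", "buch"], ["a", "book"], table)))  # 0-0 1-1
```

The same loop works unchanged on subword ids instead of words, which is the setting the aligner above targets.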
Data
- TokenizerWrapper: any HF tokenizer, handles UL2 denoiser tags and lang codes.
- UL2 mixture-of-denoisers (R / X / S) preprocessing.
- Seq2SeqWithAlignmentCollator: pads src/tgt, shifts decoder input, turns pharaoh alignments into flat loss-index tensors with inverse-frequency weights.
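One way to read "flat loss-index tensors with inverse-frequency weights" is sketched below with plain lists. Both the row-major `tgt * src_len + src` flattening and the weighting by one over the number of links per target position are assumptions about the collator, and `pharaoh_to_flat` is a hypothetical name.

```python
from collections import Counter


def pharaoh_to_flat(pharaoh, src_len):
    """Turn 'src-tgt' pharaoh links into flat indices and per-link weights.

    Flat index = tgt * src_len + src, i.e. a position in a flattened
    (tgt_len x src_len) attention matrix. Each link is weighted by
    1 / (number of links sharing its target position), so every aligned
    target token contributes the same total weight. Both are assumptions.
    """
    links = [tuple(int(i) for i in pair.split("-")) for pair in pharaoh.split()]
    links_per_tgt = Counter(t for _, t in links)
    flat = [t * src_len + s for s, t in links]
    weights = [1.0 / links_per_tgt[t] for _, t in links]
    return flat, weights


flat, weights = pharaoh_to_flat("0-0 1-1 2-1", src_len=3)
print(flat, weights)  # [0, 4, 5] [1.0, 0.5, 0.5]
```

Flat indices of this kind can be used to gather the relevant attention probabilities in one call, which avoids building a dense gold-alignment matrix per batch.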
Interop
- warm_start(target, source, strategy="auto"): Net2Net (wider / deeper identity insertion) + bert2bert (cross-attn init from self-attn when source lacks cross-attn) + key-normalisation aliases for T5 / BART / NLLB / Marian / mBART / Llama / Qwen / Gemma naming. Returns a CopyManifest listing copied / padded / identity-inserted / randomised tensors.
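The "wider" half of Net2Net relies on a function-preserving trick: duplicate a hidden unit and split its outgoing weights. The sketch below demonstrates that idea on a two-layer ReLU network with plain lists. It is a toy illustration under that assumption, not the warm_start implementation, and all helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forward(w1, w2, x):
    """Two-layer network: y = W2 @ relu(W1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def widen(w1, w2, unit):
    """Net2WiderNet step: duplicate hidden `unit`, halving its outgoing weights."""
    w1_wide = [row[:] for row in w1] + [w1[unit][:]]  # copy the incoming weights
    w2_wide = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = row[:]
        new_row[unit] = half                          # original unit keeps half
        w2_wide.append(new_row + [half])              # the clone gets the other half
    return w1_wide, w2_wide


w1 = [[1.0, -2.0], [0.5, 0.25]]  # hidden x input
w2 = [[2.0, -1.0]]               # output x hidden
x = [3.0, 1.0]
w1_wide, w2_wide = widen(w1, w2, unit=0)
print(forward(w1, w2, x), forward(w1_wide, w2_wide, x))  # identical outputs
```

Because the widened network computes the same function, training can resume from the source model's quality instead of from random initialisation.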
Generation
- model.generate(...) via HF GenerationMixin. KV cache verified against full-forward parity to within 2e-5. Greedy and beam search both work.
Development
Run the tests
pip install -e '.[dev]'
pytest # fast tests
pytest -v -m slow -k 0.5B # large-scale param-tier tests
pytest tests/test_portable_alignment.py -v # alignment + ICU tests
Layout
unformers/
    config.py # UnFormerConfig + all nested dataclasses
    modules/ # attention, positional, norm, ffn, moe, embedding
    blocks/ # encoder_layer, decoder_layer
    model/ # encoder, decoder, seq2seq (PreTrainedModel)
    presets/ # one file per family + _helpers.py
    align/ # NeuralIBMAligner, portable alignment, segmenters, CLI
    data/ # tokenizer wrapper, collator, UL2 denoisers
    train/ # trainer, losses, Matryoshka policy
    interop/ # warm_start
examples/
    smoke_test.py # HF Trainer path
    train_pure_pytorch.py # plain torch loop
tests/
    test_presets.py
    test_preset_sizes.py # 0.5B / 1B / 2B / 3B tiers (slow)
    test_warm_start.py
    test_gradient_checkpointing.py
    test_moe.py
    test_portable_alignment.py
License
See LICENSE in the repo root.