# NAI-T5 Wrapper
HuggingFace-compatible wrapper for NovelAI's T5 implementation with Flex Attention support for T5, MT5, and UMT5 models.
This is a community fork that adds:

- Drop-in replacements for HuggingFace's `T5EncoderModel`, `MT5EncoderModel`, and `UMT5EncoderModel`
- Automatic runtime conversion from HuggingFace weights
- MT5 and UMT5 model support
## Install

```shell
# Install from PyPI
pip install nai-t5-wrapper

# Or install from GitHub source
pip install git+https://github.com/bghira/nai-t5-wrapper.git
```
## Quick Start (HuggingFace-Compatible API)

The wrapper provides a drop-in replacement for HuggingFace T5 encoder models:

```python
from nai_t5_wrapper import NAIT5EncoderModel
from transformers import AutoTokenizer

# Load any supported T5 variant - weights are converted at runtime
model = NAIT5EncoderModel.from_pretrained('google/t5-v1_1-xxl').cuda()
# model = NAIT5EncoderModel.from_pretrained('google/mt5-xxl')
# model = NAIT5EncoderModel.from_pretrained('google/umt5-xxl')
tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-xxl')

# Use like HuggingFace's T5EncoderModel
inputs = tokenizer("Hello world", return_tensors="pt")
output = model(inputs.input_ids.cuda(), attention_mask=inputs.attention_mask.cuda())
embeddings = output.last_hidden_state  # or output[0]
```
## Supported Models

| Model Type | Example HuggingFace IDs |
|---|---|
| T5 v1.1 | `google/t5-v1_1-small`, `google/t5-v1_1-xl`, `google/t5-v1_1-xxl` |
| MT5 | `google/mt5-small`, `google/mt5-xl`, `google/mt5-xxl` |
| UMT5 | `google/umt5-small`, `google/umt5-xl`, `google/umt5-xxl` |
## Additional Options

```python
import torch

# Custom sequence length (default: 512)
model = NAIT5EncoderModel.from_pretrained('google/t5-v1_1-xxl', max_seq_len=256)

# Custom dtype
model = NAIT5EncoderModel.from_pretrained('google/t5-v1_1-xxl', dtype=torch.float16)

# Compile for additional speedup
model = NAIT5EncoderModel.from_pretrained('google/t5-v1_1-xxl').compile()
```
## Advanced Usage

For advanced usage with pre-converted weights and the underlying NAI-T5 API, see the encoder usage or decoder usage docs.

Other packages you may want:

```shell
# SentencePiece tokenizer (alternative to HF tokenizers)
pip install sentencepiece

# Tensorizer for pre-converted weight loading
pip install tensorizer async_timeout
```

See the weight loading docs for how to convert HF weights to tensorizer format.
## What's included

### Performance features
- torch SDPA attention in encoder + decoder
- Flex attention in encoder (optional)
  - ignores padding keys
  - ignores padding queries (uses safe_softmax to give these positions 0 probability)
- fused projections
  - QKV fusion in self-attention
  - KV fusion in cross-attention
  - in-projection fusion in GEGLU
- RMSNorm scales can be fused into subsequent linear projections
- KV cache support
  - just one big re-used tensor (avoids repeatedly reallocating a tensor as the sequence grows)
- UMT5 per-layer position embedding fusion (all layers computed concurrently)
- FFN out-proj is allowed to run in half precision without the use of autocast
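The fused-projection idea can be sketched as follows (an illustrative toy, not the package's actual modules): the three Q/K/V matmuls of self-attention collapse into one wider matmul whose output is chunked, which reduces kernel launches and improves GPU utilization.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 64
x = torch.randn(2, 8, d_model)  # [batch, seq, d_model]

# Separate projections, as in a conventional implementation.
q_proj = nn.Linear(d_model, d_model, bias=False)
k_proj = nn.Linear(d_model, d_model, bias=False)
v_proj = nn.Linear(d_model, d_model, bias=False)

# Fused projection: one [3*d_model, d_model] weight, a single matmul.
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
with torch.no_grad():
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight]))

# One matmul, then split the output back into q, k, v.
q, k, v = qkv_proj(x).chunk(3, dim=-1)
assert torch.allclose(q, q_proj(x), atol=1e-5)
assert torch.allclose(k, k_proj(x), atol=1e-5)
assert torch.allclose(v, v_proj(x), atol=1e-5)
```

The same trick applies to KV fusion in cross-attention (two matmuls into one) and to GEGLU's two in-projections.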
### PyTorch idioms
- RMSNorm built-in
- GELU built-in
### Compatibility
- masking
  - 3-dim packing mask or 2-dim padding mask
- support for v1.1 (GEGLU) and v1.0 (ReLU)
- support for UMT5 (e.g. EleutherAI's pile-t5) per-layer position embeddings
- SentencePiece tokenizer support
- support for disabling the attention scale, for compatibility with Google checkpoints
  - Google burned the attention scale into the weights, which had no detriment to training dynamics because the Adafactor optimizer scales each parameter's learning rate w.r.t. the RMS of the parameters (more detail here)
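A small numerical sketch (toy shapes, not actual checkpoint weights) of why the burned-in scale works: scaling the query projection weights by `head_dim**-0.5` yields the same attention scores as applying the scale explicitly inside attention, so a checkpoint with the scale folded in must be run with the explicit scale disabled to avoid double-scaling.

```python
import torch

torch.manual_seed(0)
head_dim = 64
x = torch.randn(2, 10, head_dim)       # токen activations entering the Q projection
w_q = torch.randn(head_dim, head_dim)  # toy query projection weight
k = torch.randn(2, 10, head_dim)       # toy keys

# Conventional: explicit scale inside the attention computation.
q = x @ w_q.T
scores_scaled = (q @ k.transpose(-1, -2)) * head_dim ** -0.5

# Google-style: scale folded into the query weights, no scale at runtime.
q_folded = x @ (w_q * head_dim ** -0.5).T
scores_folded = q_folded @ k.transpose(-1, -2)

# The two formulations produce identical attention scores.
assert torch.allclose(scores_scaled, scores_folded, atol=1e-5)
```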
### Training considerations

- weight init (basic attempt)
- supports conventional attention scale (`head_dim**-0.5`)
### Float16 considerations

See how we approached float16 support.

Float16 is appealing because its precision is better than bfloat16's, and because it can enable better performance on devices such as the 3090 and 4090. Ordinarily these consumer cards are speed-limited when computing float16 matmuls with float32 accumulation, but they support float16 matmuls with float16 accumulation at higher speeds (with the 4090 becoming comparable to an A100).

Support for float16 accumulation is being added to PyTorch.

Split-k matmul can be used to recover the accuracy that float16 matmuls lose with a float16 accumulator. See CublasOps for a cuBLAS implementation and gpu_poor for a Triton implementation of split-k matmul.
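The split-k intuition can be simulated in NumPy (a toy model, not the CublasOps/gpu_poor kernels): accumulating a long dot product entirely in float16 degrades accuracy, whereas splitting the reduction dimension into chunks, accumulating each chunk in float16, and summing the partials in float32 recovers most of it.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4096
a = rng.standard_normal(K).astype(np.float16)
b = rng.standard_normal(K).astype(np.float16)

# Reference: accumulate in float64.
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

# Naive: the running sum is kept in float16 for all K steps.
acc = np.float16(0)
for i in range(K):
    acc = np.float16(acc + a[i] * b[i])

# Split-k: float16 accumulation per 128-wide chunk, partials summed in float32.
partials = [
    np.sum(a[i:i + 128] * b[i:i + 128], dtype=np.float16)
    for i in range(0, K, 128)
]
split_k = np.sum(np.asarray(partials, dtype=np.float32))

# The split-k error against the float64 reference is typically much smaller.
print(abs(acc - ref), abs(split_k - ref))
```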
## Performance

nai-t5 supports two types of attention: SDPA and Flex.

T5 is challenging to run performantly because its relative position bias requires applying an arbitrary bias inside the attention calculation. SDPA typically falls back to its slowest backend, cutlassF (memory-efficient attention), when arbitrary biases are required. On H100s, SDPA can use the cuDNN backend as a faster alternative, though it is unknown how reliable cuDNN's correctness is: we encountered correctness problems with cuDNN SDPA during training, but inference on fixed workloads could behave differently. HF computes attention manually via math operations.

Flex attention is a far faster way to implement complex attention patterns. We measured a few approaches, and "just add an arbitrary bias" was the fastest under the parameters we tried. Flex attention is only fast when compiled, because otherwise it falls back to a vmapped math attention.
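To illustrate the "arbitrary bias" requirement (a minimal sketch with toy shapes and a random bias table, not T5's actual learned buckets): torch SDPA accepts an additive float mask, which is how a relative position bias enters the computation, and it matches the manual "math" attention that HF computes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 1, 2, 8, 16
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# A toy relative-position additive bias: one table of 2T-1 offsets per head.
rel = torch.arange(T)[None, :] - torch.arange(T)[:, None]  # [T, T] offsets
bias = torch.randn(H, 2 * T - 1)[:, rel + T - 1]           # [H, T, T]

# SDPA path: the bias is passed as an additive float attn_mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias.unsqueeze(0))

# Manual "math" attention, as HF computes it, for comparison.
scores = q @ k.transpose(-1, -2) * D ** -0.5 + bias
manual = torch.softmax(scores, dim=-1) @ v
assert torch.allclose(out, manual, atol=1e-5)
```

It is this additive-mask requirement that pushes SDPA off its fastest (flash) backend; Flex attention instead injects the bias via a `score_mod` callback that gets compiled into the kernel.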
## Benchmark

We measured the encoder performance of T5 v1.1 XXL at encoding a batch-of-1, 512-length context.

- For uncompiled inference, nai-t5 SDPA (cutlassF) is 60% faster than HF, or 68% faster with cuDNN.
- For compiled inference, nai-t5 SDPA (cutlassF) and HF are comparable; cuDNN is about 2% faster.
- nai-t5 Flex is 10% faster at 0% sparsity (full 512-length prompt).
- nai-t5 Flex is 16% faster at 93.75% sparsity (1-token prompt).
| Implementation | Compiled | FLOP/s | ms/iter | iter/s |
|---|---|---|---|---|
| hf | False | 226.6 TFLOP/s | 21.4 | 46.8 |
| nai_sdpa (cutlassF) | False | 363.5 TFLOP/s | 13.3 | 75.0 |
| nai_sdpa (cuDNN) | False | 381.5 TFLOP/s | 12.7 | 78.7 |
| nai_flex (512 tokens) | False | 23.3 TFLOP/s | 208.0 | 4.8 |
| nai_flex (1 token) | False | 23.2 TFLOP/s | 208.9 | 4.8 |
| hf | True | 499.4 TFLOP/s | 9.7 | 103.1 |
| nai_sdpa (cutlassF) | True | 501.0 TFLOP/s | 9.7 | 103.4 |
| nai_sdpa (cuDNN) | True | 513.2 TFLOP/s | 9.4 | 105.9 |
| nai_flex (512 tokens) | True | 552.8 TFLOP/s | 8.8 | 114.1 |
| nai_flex (1 token) | True | 579.4 TFLOP/s | 8.4 | 119.6 |
Performance measured via `benchmark_encoder.py`, using this environment:

- transformers 4.49.0
- torch 2.6.0
- CUDA 12.8
- NVIDIA driver 535.216.01
- triton 3.2.0+git35c6c7c6
- apex layernorm **not** used by transformers (commented out of modeling_t5.py to avoid import)
- NVIDIA H100 80GB HBM3

Typical invocation:

```shell
python -m nai_t5_wrapper.scripts.benchmark_encoder --ckpt v1_1-xxl --batch-size 1 --nai-fuse-norm-scales --bench-hf --bench-nai-sdpa --enable-cudnn-sdpa --bench-nai-flex
```
There are initiatives in HF transformers to introduce Flex attention and SDPA; these were still in review as of transformers v4.49.0.
## Precision

On T5 v1.1 XXL, we compare HF vs nai-t5 half-precision implementations to see how close each gets to HF float32. The nai-t5 half-precision implementations get closer than HF to the float32 reference in every quantile.

```
absmax diff quantiles:
[0.5000, 0.7500, 0.9000, 0.9500, 0.9900, 0.9990, 0.9999]

HF float32 vs HF pure-bf16:
[0.0006, 0.0011, 0.0018, 0.0022, 0.0033, 0.0047, 0.0080]
HF float32 vs NAI bf16:
[0.0003, 0.0005, 0.0008, 0.0010, 0.0014, 0.0022, 0.0037]

HF float32 vs HF fp16:
[4.7763e-05, 9.2119e-05, 1.4318e-04, 1.7839e-04, 2.4898e-04, 3.5743e-04, 6.5168e-04]
HF float32 vs NAI f16:
[3.7434e-05, 7.2266e-05, 1.1227e-04, 1.3884e-04, 1.9671e-04, 2.7551e-04, 4.2810e-04]
```
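The comparison above can be reproduced along these lines (a sketch with synthetic stand-in tensors; `t5_encoder_hf_precision_parity.py` is the actual script): take the elementwise absolute difference against the float32 reference and read off its quantiles.

```python
import torch

def absmax_diff_quantiles(ref: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    """Quantiles of the elementwise absolute difference vs a float32 reference."""
    q = torch.tensor([0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 0.9999])
    return torch.quantile((ref.float() - test.float()).abs().flatten(), q)

# Toy stand-ins for the float32 reference and a half-precision run.
torch.manual_seed(0)
ref = torch.randn(4, 512, 1024)
test = ref.to(torch.bfloat16).float()  # simulate a bf16 round-trip
print(absmax_diff_quantiles(ref, test))
```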
Precision compared via `t5_encoder_hf_precision_parity.py`, using the same environment as the benchmark above, with nai-t5 using norm fusion but not flex attention or torch compilation.
nai-t5 has the advantage of a float32 residual. HF has the advantage of running FFN out-projections in float32, which is costly. Runtime performance should be compared too, to understand the cost/benefit tradeoff of where the extra precision was purchased.

The two implementations take entirely different approaches to float16. HF takes the risk that activation clipping could impact outliers; nai-t5 takes the risk of float16 underflow in its residual stream. Both are more accurate than their bfloat16 counterparts.
## Philosophy

The main objective was to modernize T5 with torch SDPA attention and write it in a clearer code style:

- type hints
- document return types via NamedTuple
- document tensor shapes via einops rearrange
- pass the KV cache as a forward argument to be mutated; no impact on return types
- clearer separation of concerns between encoder/decoder/model
  - avoid weight-tying and shared references
- prefer to duplicate modules rather than add conditions to existing modules to make them multi-use
  - makes it clearer that there are 3 types of attention, which can be optimized differently
  - makes it clearer that the encoder does not use a KV cache
- eliminate unused configurables
  - for example, we do not keep "tie emb to lm_head"
  - keep only what's needed for final shipped models (e.g. v1.1 and v1.0), not ablations
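The KV-cache convention above can be sketched like so (a toy single-head decoder step with hypothetical names, not the package's real classes): the caller allocates one max-length tensor up front, and the forward pass writes into it in place, so return types are unaffected by caching.

```python
import torch

class ToyKVCache:
    """One preallocated tensor, reused across steps; no reallocation as length grows."""
    def __init__(self, batch: int, max_len: int, dim: int):
        self.kv = torch.zeros(2, batch, max_len, dim)  # [k/v, B, T_max, D]
        self.length = 0

def toy_attn_step(x: torch.Tensor, cache: ToyKVCache) -> torch.Tensor:
    # x: [B, 1, D] is the newly decoded token; identity K/V projections for brevity.
    t = cache.length
    cache.kv[0, :, t] = x[:, 0]  # write this step's K in place
    cache.kv[1, :, t] = x[:, 0]  # write this step's V in place
    cache.length = t + 1
    keys = cache.kv[0, :, : cache.length]  # [B, t+1, D] view, no copy
    vals = cache.kv[1, :, : cache.length]
    attn = torch.softmax(x @ keys.transpose(-1, -2) * x.shape[-1] ** -0.5, dim=-1)
    return attn @ vals  # [B, 1, D]; same return type with or without a cache

cache = ToyKVCache(batch=2, max_len=16, dim=8)
for _ in range(3):
    y = toy_attn_step(torch.randn(2, 1, 8), cache)
assert cache.length == 3 and y.shape == (2, 1, 8)
```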
## Shelved ideas

We considered fusing every cross-attention KV projection in the decoder, but it's questionable whether this would provide any speedup (KV work can be done concurrently with Q anyway), and it would complicate FSDP (the very wide fused KV projection would need to be chunked to achieve good compute/communication overlap).

MaskedTensor could be used to exploit sparsity on padded fixed-length sequences. Fixed-length sequences help to enable `torch.compile` with `dynamic=False`. This would be particularly beneficial when running decoder inference, as the sequence length keeps changing (but could be modelled as a fixed-length MaskedTensor).
## References
T5:

```bibtex
@article{10.5555/3455716.3455856,
  author    = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.},
  title     = {Exploring the limits of transfer learning with a unified text-to-text transformer},
  journal   = {J. Mach. Learn. Res.},
  year      = {2020},
  volume    = {21},
  number    = {1},
  articleno = {140},
  numpages  = {67},
  issn      = {1532-4435},
  publisher = {JMLR.org},
  keywords  = {transfer learning, natural language processing, multi-task learning, attention based models, deep learning}
}
```

UMT5:

```bibtex
@inproceedings{chung2023unimax,
  title     = {UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining},
  author    = {Hyung Won Chung and Xavier Garcia and Adam Roberts and Yi Tay and Orhan Firat and Sharan Narang and Noah Constant},
  booktitle = {The Eleventh International Conference on Learning Representations},
  year      = {2023},
  url       = {https://openreview.net/forum?id=kXwdL1cWOAi}
}
```

pile-t5:

```bibtex
@misc{2024PileT5,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
  title  = {Pile-T5},
  year   = {2024},
  url    = {https://blog.eleuther.ai/pile-t5/},
  note   = {Blog post}
}
```

Flex attention:

```bibtex
@misc{li2024flexattention,
  title         = {FlexAttention for Efficient High-Resolution Vision-Language Models},
  author        = {Junyan Li and Delin Chen and Tianle Cai and Peihao Chen and Yining Hong and Zhenfang Chen and Yikang Shen and Chuang Gan},
  year          = {2024},
  eprint        = {2407.20228},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2407.20228}
}
```

HuggingFace Transformers:

```bibtex
@inproceedings{wolf-etal-2020-transformers,
  title     = {Transformers: State-of-the-Art Natural Language Processing},
  author    = {Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  month     = oct,
  year      = {2020},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/2020.emnlp-demos.6},
  pages     = {38--45}
}
```

Graphcore, *Running Flan-T5-XL Inference In Float16 For IPU - How We Did It*
## Contribution
This is a community fork. Contributions, bug reports, and feature requests are welcome via GitHub Issues and Pull Requests.
## Acknowledgements
This wrapper is built on top of NovelAI's T5 implementation. We thank NovelAI for open-sourcing their optimized T5 codebase.
## License
Apache 2.0. Uses code from NovelAI/t5 and HF transformers, both Apache 2.0 licensed.