mlx-transformers is a machine learning library with an interface similar to Hugging Face Transformers.
Project description
MLX Transformers
mlx-transformers provides MLX implementations of several Hugging Face-style model architectures for Apple Silicon. The project keeps a familiar Transformers-style API while loading weights from Hugging Face checkpoints and running inference with MLX.
The repository is currently inference-focused. Some model families have broader parity than others, but the core usage pattern is the same across the package:
```python
import mlx.core as mx
from transformers import AutoConfig, AutoTokenizer

from mlx_transformers.models import BertModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

model = BertModel(config)
model.from_pretrained(model_name)

inputs = tokenizer("Hello from MLX", return_tensors="np")
inputs = {k: mx.array(v) for k, v in inputs.items()}
outputs = model(**inputs)
```
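For a sentence-embedding model like all-MiniLM-L6-v2, the token-level hidden states are usually reduced to one vector with attention-mask-aware mean pooling. The sketch below shows that pooling in plain NumPy over an assumed `(batch, seq_len, hidden)` array; the exact output attribute of the MLX model may differ, so treat this as an illustration of the pooling step, not the package's API.

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid div-by-zero
    return summed / counts

# Toy example: one sequence with two real tokens and one padding token.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # padding row is excluded -> [[2. 3.]]
```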
Quantized loading is supported through the same loader:
```python
model.from_pretrained(
    model_name,
    quantize=True,
    group_size=64,
    bits=4,
    mode="affine",
)
```
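For a rough sense of the memory savings, 4-bit affine quantization with `group_size=64` stores 4 bits per weight plus one scale and one bias per 64-weight group. Assuming fp16 scales and biases (an assumption about the storage layout, which is up to MLX), the effective cost per weight works out as:

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Quantized bits per weight, counting per-group scale and bias overhead."""
    overhead = 2 * scale_bits / group_size  # one scale + one bias per group
    return bits + overhead

bpw = effective_bits_per_weight(bits=4, group_size=64)
print(bpw)       # 4.5 bits per weight
print(16 / bpw)  # ~3.6x smaller than fp16 weights
```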
Pre-quantized MLX checkpoints can also be loaded directly without re-quantizing:
```python
from transformers import AutoConfig

from mlx_transformers.models import Phi3ForCausalLM

model_name = "mlx-community/Phi-3-mini-4k-instruct-4bit"
config = AutoConfig.from_pretrained(model_name)
model = Phi3ForCausalLM(config)
model.from_pretrained(model_name)
```
Requirements
- Apple Silicon Mac
- Python 3.10+
- MLX-compatible environment
Some models are gated on Hugging Face. If needed, set HF_TOKEN in your environment before calling from_pretrained(...).
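The token can also be set from Python, as long as it happens before the first call that touches the Hugging Face Hub (huggingface_hub reads HF_TOKEN from the environment):

```python
import os

# Set before the first from_pretrained(...) call that downloads weights.
os.environ["HF_TOKEN"] = "hf_xxx"  # replace with your actual token
```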
Installation
Install from PyPI:
```shell
pip install mlx-transformers
```
Install for local development:
```shell
pip install -r requirements.txt
pip install -e .
```
asitop is also useful if you want to monitor GPU and CPU usage on Apple Silicon:
```shell
pip install asitop
```
Available Models
Current exports from src/mlx_transformers/models/__init__.py:
- BERT: BertModel, BertForMaskedLM, BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering
- RoBERTa: RobertaModel, RobertaForSequenceClassification, RobertaForTokenClassification, RobertaForQuestionAnswering
- XLM-RoBERTa: XLMRobertaModel, XLMRobertaForSequenceClassification, XLMRobertaForTokenClassification, XLMRobertaForQuestionAnswering
- Causal LMs: LlamaModel, LlamaForCausalLM, PhiModel, PhiForCausalLM, Phi3Model, Phi3ForCausalLM, Qwen3Model, Qwen3ForCausalLM, Qwen3VLModel, Qwen3VLForConditionalGeneration, OpenELMModel, OpenELMForCausalLM, PersimmonForCausalLM, FuyuForCausalLM
- Translation: M2M100ForConditionalGeneration
Text Generation Benchmarking
The benchmark script measures prompt prefill time, decode time, and throughput for supported text-generation families and emits a markdown table suitable for the README.
Supported benchmark families currently include phi, phi3, llama, qwen3, openelm, persimmon, and gemma3_text.
When --dataset ultrachat is set, the script samples prompts from HuggingFaceH4/ultrachat_200k and benchmarks each token-length bucket separately.
Example results:
| Label | Hugging Face model | Bucket | Samples | Prompt tokens | New tokens | Prefill (s) | Prefill tok/s | Decode (s) | Decode tok/s | Full (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| phi3 | microsoft/Phi-3-mini-4k-instruct | 1-128 | 30 | 97 | 107 | 0.151 | 674.42 | 5.985 | 17.63 | 6.135 |
| phi3 | microsoft/Phi-3-mini-4k-instruct | 129-512 | 30 | 397 | 96 | 0.654 | 840.55 | 8.636 | 13.11 | 9.290 |
Benchmark commands
Generic multi-model run:
```shell
python examples/text_generation/benchmark_generation.py \
  --model phi3=microsoft/Phi-3-mini-4k-instruct \
  --model qwen3=Qwen/Qwen3-0.6B \
  --model openelm=apple/OpenELM-1_1B-Instruct \
  --dataset ultrachat \
  --bucket 1:128 \
  --bucket 129:512 \
  --bucket 513:1024 \
  --bucket 1025:2048 \
  --max-tokens 128 \
  --runs 3 \
  --warmup-runs 1 \
  --output-file benchmark_results.md
```
Phi-3 4k run used for the table above:
```shell
python examples/text_generation/benchmark_generation.py \
  --model phi3=microsoft/Phi-3-mini-4k-instruct \
  --dataset ultrachat \
  --bucket 1:128 \
  --bucket 129:512 \
  --output-file benchmark_results-phi3-4k.md \
  --samples-per-bucket 10
```
Examples
Sentence Embeddings with BERT
```shell
python examples/bert/sentence_transformers.py
```
LLaMA Text Generation
The LLaMA example now formats the input with the tokenizer chat template and stops on EOS.
```shell
python examples/text_generation/llama_generation.py \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --prompt "Write a short explanation of rotary embeddings." \
  --max-tokens 128 \
  --temp 0.0
```
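Stopping on EOS boils down to a decode loop that breaks as soon as the model emits the end-of-sequence id. The sketch below uses a stand-in `next_token` callable in place of the real model forward pass, so it illustrates the control flow rather than the example's actual implementation:

```python
def generate(next_token, prompt_ids: list[int], eos_id: int, max_tokens: int) -> list[int]:
    """Greedy decode loop that stops on EOS or when the token budget is spent."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        tok = next_token(ids)  # in the real example: model forward + argmax/sampling
        if tok == eos_id:
            break
        ids.append(tok)
    return ids

# Stub "model" that emits tokens 7, 8, then EOS (id 2).
script = iter([7, 8, 2])
out = generate(lambda ids: next(script), prompt_ids=[1, 5], eos_id=2, max_tokens=128)
print(out)  # [1, 5, 7, 8]
```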
Quantized LLaMA Text Generation
```shell
python examples/text_generation/quantized_llama_generation.py \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --prompt "Explain why 4-bit quantization can reduce memory usage." \
  --bits 4 \
  --group-size 64 \
  --mode affine \
  --max-tokens 128 \
  --temp 0.0
```
Phi-3 Text Generation
```shell
python examples/text_generation/phi3_generation.py \
  --model-name microsoft/Phi-3-mini-4k-instruct \
  --prompt "Explain attention masking." \
  --max-tokens 128 \
  --temp 0.0
```
OpenELM Text Generation
```shell
python examples/text_generation/openelm_generation.py \
  --model-name apple/OpenELM-1_1B-Instruct \
  --prompt "Summarize grouped-query attention." \
  --max-tokens 128
```
Qwen3 Text Generation
```shell
python examples/text_generation/qwen3_generation.py \
  --model-name Qwen/Qwen3-0.6B \
  --prompt "Explain grouped-query attention in one paragraph." \
  --max-tokens 128 \
  --temp 0.0
```
Qwen3-VL Image + Text Generation
```shell
python examples/text_generation/qwen3_vl_generation.py \
  --model-name Qwen/Qwen3-VL-2B-Instruct \
  --image-url "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg" \
  --prompt "Describe the image and mention the likely setting." \
  --max-tokens 128 \
  --temp 0.0
```
Quantized Qwen3-VL Image + Text Generation
```shell
python examples/text_generation/qwen3_vl_generation.py \
  --model-name Qwen/Qwen3-VL-2B-Instruct \
  --image-path /Users/odunayoogundepo/Desktop/screenshot.png \
  --prompt "What is happening in this image?" \
  --max-tokens 1048 \
  --temp 0.0 \
  --quantize \
  --mode nvfp4 \
  --quantize-input
```
--quantize-input is only valid with --mode nvfp4 or --mode mxfp8.
Gemma3 Image + Text Generation
```shell
python examples/text_generation/gemma3_generation.py \
  --model-name google/gemma-3-4b-it \
  --image-path /Users/odunayoogundepo/Desktop/screenshot.png \
  --prompt "What is happening in this image?" \
  --max-tokens 128 \
  --temp 0.0
```
Gemma3 Text Generation
```shell
python examples/text_generation/gemma3_text_generation.py \
  --model-name google/gemma-3-4b-it \
  --prompt "Explain grouped-query attention in one paragraph." \
  --max-tokens 128 \
  --temp 0.0
```
NLLB / M2M-100 Translation
```shell
python examples/translation/nllb_translation.py \
  --model_name facebook/nllb-200-distilled-600M \
  --source_language English \
  --target_language Yoruba \
  --text_to_translate "Let us translate text to Yoruba"
```
Chat Interface
A Streamlit chat UI is included under chat/.
```shell
cd chat
bash start.sh
```
Add or remove entries in chat/models.txt to control which models appear in the sidebar. The chat app now resolves supported text model families from the model config, including phi, phi3, llama, qwen3, openelm, persimmon, and gemma3_text.
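Resolving a family from a Hugging Face config typically keys off its model_type field. A minimal sketch of that lookup follows; the function name and mapping are illustrative assumptions, not the chat app's actual code:

```python
# Families the chat app supports, per the README.
SUPPORTED_FAMILIES = {"phi", "phi3", "llama", "qwen3", "openelm", "persimmon", "gemma3_text"}

def resolve_family(config: dict) -> str:
    """Map a Hugging Face config's model_type to a supported chat family."""
    model_type = config.get("model_type", "")
    if model_type not in SUPPORTED_FAMILIES:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return model_type

print(resolve_family({"model_type": "phi3"}))  # phi3
```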
Tests
The repository currently includes focused tests for:
- BERT
- RoBERTa
- XLM-RoBERTa
- LLaMA
- Phi
- Phi-3
Run the full test suite:
```shell
python -m unittest
```
Run a single module:
```shell
python -m unittest tests.test_bert
python -m unittest tests.test_llama
```
Some tests download model weights from Hugging Face on first run.
Repository Layout
- src/mlx_transformers/models/: model implementations and shared helpers
- examples/: runnable examples
- tests/: model parity and behavior tests
- chat/: Streamlit chat interface
Notes
- Model loading is handled through from_pretrained(...) in src/mlx_transformers/models/base.py.
- Pretrained models are loaded in eval mode by default.
- Causal generation support is present for the decoder-style model families, but parity and feature coverage still vary by architecture.
Contributing
Contributions are welcome. The highest-value contributions are usually:
- new model implementations
- parity fixes against Hugging Face behavior
- generation and cache correctness fixes
- tests for unsupported or weakly covered paths
File details
Details for the file mlx_transformers-0.2.0.tar.gz.
File metadata
- Download URL: mlx_transformers-0.2.0.tar.gz
- Size: 83.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4c638a99685e31d131ea03933f1638cac759bc5954fc56ac3f3c3a648e9aabb0 |
| MD5 | bbc396b9556b9cd21a31dd352dd6d342 |
| BLAKE2b-256 | 6d5ebb5de827ef57cc784f1e26ec2c2965f3ab4507d0aebd082d0127aff3ec09 |
Provenance
The following attestation bundles were made for mlx_transformers-0.2.0.tar.gz:
Publisher: publish.yml on ToluClassics/mlx-transformers
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_transformers-0.2.0.tar.gz
- Subject digest: 4c638a99685e31d131ea03933f1638cac759bc5954fc56ac3f3c3a648e9aabb0
- Sigstore transparency entry: 1154813207
- Permalink: ToluClassics/mlx-transformers@993957b6a691749095494ddc0abd3cccf751e4d2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ToluClassics
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@993957b6a691749095494ddc0abd3cccf751e4d2
- Trigger Event: release
File details
Details for the file mlx_transformers-0.2.0-py3-none-any.whl.
File metadata
- Download URL: mlx_transformers-0.2.0-py3-none-any.whl
- Size: 92.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f63782a0ed6abebbfd165e6fcfc1e5353f531b5fbeb0a1e5577e047a5745e022 |
| MD5 | 6c1645a1674b11b4528a392bdee0b6a4 |
| BLAKE2b-256 | c5f567c0b232a730dd7d319f6e3a791200a522a50e54f30957021998dcc5aa1c |
Provenance
The following attestation bundles were made for mlx_transformers-0.2.0-py3-none-any.whl:
Publisher: publish.yml on ToluClassics/mlx-transformers
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_transformers-0.2.0-py3-none-any.whl
- Subject digest: f63782a0ed6abebbfd165e6fcfc1e5353f531b5fbeb0a1e5577e047a5745e022
- Sigstore transparency entry: 1154813213
- Permalink: ToluClassics/mlx-transformers@993957b6a691749095494ddc0abd3cccf751e4d2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ToluClassics
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@993957b6a691749095494ddc0abd3cccf751e4d2
- Trigger Event: release