
MLX Transformers is a machine learning library with an interface similar to Hugging Face Transformers.


MLX Transformers


mlx-transformers provides MLX implementations of several Hugging Face-style model architectures for Apple Silicon. The project keeps a familiar Transformers-style API while loading weights from Hugging Face checkpoints and running inference with MLX.

The repository is currently inference-focused. Some model families have broader parity than others, but the core usage pattern is the same across the package:

import mlx.core as mx
from transformers import AutoConfig, AutoTokenizer

from mlx_transformers.models import BertModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

model = BertModel(config)
model.from_pretrained(model_name)

inputs = tokenizer("Hello from MLX", return_tensors="np")
inputs = {k: mx.array(v) for k, v in inputs.items()}

outputs = model(**inputs)
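For sentence-transformers checkpoints such as all-MiniLM-L6-v2, the sentence embedding is usually a mask-aware mean over the token embeddings. A minimal NumPy sketch of that pooling step (the same operations exist in mlx.core; `mean_pool` is an illustrative helper, not part of this package):

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy check: the padded position must not affect the pooled vector.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # mean of the two real tokens: [2., 3.]
```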

Quantized loading is supported through the same loader:

model.from_pretrained(
    model_name,
    quantize=True,
    group_size=64,
    bits=4,
    mode="affine",
)
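These parameters describe grouped affine quantization: each weight matrix is split into groups of `group_size` values, and each group is stored at the given bit width together with a shared scale and offset. A NumPy sketch of the round trip for a single group (illustrative only; MLX's actual packed storage layout differs):

```python
import numpy as np

def affine_quantize_group(w: np.ndarray, bits: int = 4):
    """Affine-quantize one group of weights so that w ~= q * scale + bias."""
    qmax = 2**bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    bias = lo
    q = np.round((w - bias) / scale).clip(0, qmax).astype(np.uint8)
    return q, scale, bias

def dequantize_group(q: np.ndarray, scale: float, bias: float) -> np.ndarray:
    return q.astype(np.float32) * scale + bias

rng = np.random.default_rng(0)
group = rng.normal(size=64).astype(np.float32)  # one group of group_size=64 weights
q, scale, bias = affine_quantize_group(group, bits=4)
err = float(np.abs(dequantize_group(q, scale, bias) - group).max())
# Rounding error is bounded by half a quantization step (scale / 2).
```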

Pre-quantized MLX checkpoints can also be loaded directly without re-quantizing:

from transformers import AutoConfig

from mlx_transformers.models import Phi3ForCausalLM

model_name = "mlx-community/Phi-3-mini-4k-instruct-4bit"

config = AutoConfig.from_pretrained(model_name)
model = Phi3ForCausalLM(config)
model.from_pretrained(model_name)

Requirements

  • Apple Silicon Mac
  • Python 3.10+
  • MLX-compatible environment

Some models are gated on Hugging Face. If needed, set HF_TOKEN in your environment before calling from_pretrained(...).

Installation

Install from PyPI:

pip install mlx-transformers

Install for local development:

pip install -r requirements.txt
pip install -e .

asitop is also useful if you want to monitor GPU and CPU usage on Apple Silicon:

pip install asitop

Available Models

Current exports from src/mlx_transformers/models/__init__.py:

  • BERT
    • BertModel
    • BertForMaskedLM
    • BertForSequenceClassification
    • BertForTokenClassification
    • BertForQuestionAnswering
  • RoBERTa
    • RobertaModel
    • RobertaForSequenceClassification
    • RobertaForTokenClassification
    • RobertaForQuestionAnswering
  • XLM-RoBERTa
    • XLMRobertaModel
    • XLMRobertaForSequenceClassification
    • XLMRobertaForTokenClassification
    • XLMRobertaForQuestionAnswering
  • Causal LMs
    • LlamaModel, LlamaForCausalLM
    • PhiModel, PhiForCausalLM
    • Phi3Model, Phi3ForCausalLM
    • Qwen3Model, Qwen3ForCausalLM
    • Qwen3VLModel, Qwen3VLForConditionalGeneration
    • OpenELMModel, OpenELMForCausalLM
    • PersimmonForCausalLM
    • FuyuForCausalLM
  • Translation
    • M2M100ForConditionalGeneration

Text Generation Benchmarking

The benchmark script measures prompt prefill time, decode time, and throughput for supported text-generation families and emits a markdown table suitable for the README.
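The derived columns follow directly from the raw timings. A minimal sketch of the arithmetic (field names are assumptions; the script may average per sample before aggregating, so table-level ratios need not match these formulas exactly):

```python
def throughput(prompt_tokens: int, new_tokens: int, prefill_s: float, decode_s: float) -> dict:
    """Derive the per-phase token rates reported in the benchmark table."""
    return {
        "prefill_tok_s": prompt_tokens / prefill_s,  # prompt tokens processed per second
        "decode_tok_s": new_tokens / decode_s,       # generated tokens per second
        "full_s": prefill_s + decode_s,              # end-to-end latency
    }

# Example with round numbers: 100-token prompt prefilled in 0.5 s,
# 50 new tokens decoded in 5.0 s.
stats = throughput(100, 50, 0.5, 5.0)
```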

Supported benchmark families currently include phi, phi3, llama, qwen3, openelm, persimmon, and gemma3_text.

When --dataset ultrachat is set, the script samples prompts from HuggingFaceH4/ultrachat_200k and benchmarks each token-length bucket separately.
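Bucketing amounts to matching each sampled prompt's token count against inclusive ranges. A sketch, assuming buckets are given as (lo, hi) pairs like the --bucket flags below:

```python
def assign_bucket(token_count: int, buckets: list[tuple[int, int]]):
    """Return the (lo, hi) bucket containing token_count, or None if it fits none."""
    for lo, hi in buckets:
        if lo <= token_count <= hi:
            return (lo, hi)
    return None

buckets = [(1, 128), (129, 512), (513, 1024), (1025, 2048)]
```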

Example results:

| Label | Hugging Face model | Bucket | Samples | Prompt tokens | New tokens | Prefill (s) | Prefill tok/s | Decode (s) | Decode tok/s | Full (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| phi3 | microsoft/Phi-3-mini-4k-instruct | 1-128 | 30 | 97 | 107 | 0.151 | 674.42 | 5.985 | 17.63 | 6.135 |
| phi3 | microsoft/Phi-3-mini-4k-instruct | 129-512 | 30 | 397 | 96 | 0.654 | 840.55 | 8.636 | 13.11 | 9.290 |

Benchmark commands

Generic multi-model run:

python examples/text_generation/benchmark_generation.py \
  --model phi3=microsoft/Phi-3-mini-4k-instruct \
  --model qwen3=Qwen/Qwen3-0.6B \
  --model openelm=apple/OpenELM-1_1B-Instruct \
  --dataset ultrachat \
  --bucket 1:128 \
  --bucket 129:512 \
  --bucket 513:1024 \
  --bucket 1025:2048 \
  --max-tokens 128 \
  --runs 3 \
  --warmup-runs 1 \
  --output-file benchmark_results.md

Phi-3 4k run used for the table above:

python examples/text_generation/benchmark_generation.py \
  --model phi3=microsoft/Phi-3-mini-4k-instruct \
  --dataset ultrachat \
  --bucket 1:128 \
  --bucket 129:512 \
  --output-file benchmark_results-phi3-4k.md \
  --samples-per-bucket 10

Examples

Sentence Embeddings with BERT

python examples/bert/sentence_transformers.py

LLaMA Text Generation

The LLaMA example formats the prompt with the tokenizer's chat template and stops generation at the EOS token.

python examples/text_generation/llama_generation.py \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --prompt "Write a short explanation of rotary embeddings." \
  --max-tokens 128 \
  --temp 0.0
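The EOS-stopping behaviour is a standard decode-loop pattern: sample a token, append it to the context, and break as soon as the tokenizer's EOS id appears. A pure-Python sketch with a stand-in sampler (the real script drives an MLX model and a transformers tokenizer):

```python
def generate(next_token_fn, prompt_ids, eos_token_id: int, max_tokens: int):
    """Greedy decode loop that stops at EOS or after max_tokens new tokens."""
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_tokens):
        tok = next_token_fn(ids)
        if tok == eos_token_id:
            break  # EOS: stop without emitting it
        generated.append(tok)
        ids.append(tok)  # feed the new token back as context
    return generated

# Stand-in "model": emits token 7 three times, then EOS (id 2).
script = iter([7, 7, 7, 2, 7])
out = generate(lambda ids: next(script), [1, 5], eos_token_id=2, max_tokens=128)
```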

Quantized LLaMA Text Generation

python examples/text_generation/quantized_llama_generation.py \
  --model-name meta-llama/Llama-3.2-1B-Instruct \
  --prompt "Explain why 4-bit quantization can reduce memory usage." \
  --bits 4 \
  --group-size 64 \
  --mode affine \
  --max-tokens 128 \
  --temp 0.0

Phi-3 Text Generation

python examples/text_generation/phi3_generation.py \
  --model-name microsoft/Phi-3-mini-4k-instruct \
  --prompt "Explain attention masking." \
  --max-tokens 128 \
  --temp 0.0

OpenELM Text Generation

python examples/text_generation/openelm_generation.py \
  --model-name apple/OpenELM-1_1B-Instruct \
  --prompt "Summarize grouped-query attention." \
  --max-tokens 128

Qwen3 Text Generation

python examples/text_generation/qwen3_generation.py \
  --model-name Qwen/Qwen3-0.6B \
  --prompt "Explain grouped-query attention in one paragraph." \
  --max-tokens 128 \
  --temp 0.0

Qwen3-VL Image + Text Generation

python examples/text_generation/qwen3_vl_generation.py \
  --model-name Qwen/Qwen3-VL-2B-Instruct \
  --image-url "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg" \
  --prompt "Describe the image and mention the likely setting." \
  --max-tokens 128 \
  --temp 0.0

Quantized Qwen3-VL Image + Text Generation

python examples/text_generation/qwen3_vl_generation.py \
  --model-name Qwen/Qwen3-VL-2B-Instruct \
  --image-path /Users/odunayoogundepo/Desktop/screenshot.png \
  --prompt "What is happening in this image?" \
  --max-tokens 1048 \
  --temp 0.0 \
  --quantize \
  --mode nvfp4 \
  --quantize-input

--quantize-input is only valid with --mode nvfp4 or --mode mxfp8.
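That constraint can be checked up front, before any weights are loaded. A sketch of the validation (a hypothetical helper, not the script's actual code):

```python
def validate_quantize_args(quantize: bool, mode: str, quantize_input: bool) -> bool:
    """Enforce: --quantize-input requires an FP quantization mode (nvfp4 or mxfp8)."""
    if quantize_input and mode not in ("nvfp4", "mxfp8"):
        raise ValueError("--quantize-input is only valid with --mode nvfp4 or --mode mxfp8")
    return True
```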

Gemma3 Image + Text Generation

python examples/text_generation/gemma3_generation.py \
  --model-name google/gemma-3-4b-it \
  --image-path /Users/odunayoogundepo/Desktop/screenshot.png \
  --prompt "What is happening in this image?" \
  --max-tokens 128 \
  --temp 0.0

Gemma3 Text Generation

python examples/text_generation/gemma3_text_generation.py \
  --model-name google/gemma-3-4b-it \
  --prompt "Explain grouped-query attention in one paragraph." \
  --max-tokens 128 \
  --temp 0.0

NLLB / M2M-100 Translation

python examples/translation/nllb_translation.py \
  --model_name facebook/nllb-200-distilled-600M \
  --source_language English \
  --target_language Yoruba \
  --text_to_translate "Let us translate text to Yoruba"

Chat Interface

A Streamlit chat UI is included under chat/.

cd chat
bash start.sh

Add or remove entries in chat/models.txt to control which models appear in the sidebar. The chat app now resolves supported text model families from the model config, including phi, phi3, llama, qwen3, openelm, persimmon, and gemma3_text.
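Resolving a family from the config typically reduces to a lookup on config.model_type. A sketch under that assumption (the mapping mirrors the families listed above; the helper name is illustrative):

```python
# Map Hugging Face config.model_type values to chat-supported families.
SUPPORTED_FAMILIES = {
    "phi": "phi",
    "phi3": "phi3",
    "llama": "llama",
    "qwen3": "qwen3",
    "openelm": "openelm",
    "persimmon": "persimmon",
    "gemma3_text": "gemma3_text",
}

def resolve_family(model_type: str):
    """Return the supported family for a config's model_type, or None if unsupported."""
    return SUPPORTED_FAMILIES.get(model_type)
```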


Tests

The repository currently includes focused tests for:

  • BERT
  • RoBERTa
  • XLM-RoBERTa
  • LLaMA
  • Phi
  • Phi-3

Run the full test suite:

python -m unittest

Run a single module:

python -m unittest tests.test_bert
python -m unittest tests.test_llama

Some tests download model weights from Hugging Face on first run.

Repository Layout

src/mlx_transformers/models/   model implementations and shared helpers
examples/                      runnable examples
tests/                         model parity and behavior tests
chat/                          streamlit chat interface

Notes

  • Model loading is handled through from_pretrained(...) in src/mlx_transformers/models/base.py.
  • Pretrained models are loaded in eval mode by default.
  • Causal generation support is present for the decoder-style model families, but parity and feature coverage still vary by architecture.

Contributing

Contributions are welcome. The highest-value contributions are usually:

  • new model implementations
  • parity fixes against Hugging Face behavior
  • generation and cache correctness fixes
  • tests for unsupported or weakly covered paths
