mlx-lm

LLMs with MLX and the Hugging Face Hub

Maintainers

awni mlx-dev

Project description

MLX LM

MLX LM is a Python package for generating text and fine-tuning large language models on Apple silicon with MLX.

Some key features include:

Integration with the Hugging Face Hub to easily use thousands of LLMs with a single command.
Support for quantizing and uploading models to the Hugging Face Hub.
Low-rank and full model fine-tuning with support for quantized models.
Distributed inference and fine-tuning with mx.distributed.

The easiest way to get started is to install the mlx-lm package:

With pip:

pip install mlx-lm

With conda:

conda install -c conda-forge mlx-lm

Quick Start

To generate text with an LLM, use:

mlx_lm.generate --prompt "How tall is Mt Everest?"

To chat with an LLM, use:

mlx_lm.chat

This will give you a chat REPL that you can use to interact with the LLM. The chat context is preserved during the lifetime of the REPL.

Commands in mlx-lm typically take command-line options that let you specify the model, sampling parameters, and more. Use -h to list the available options for a command, e.g.:

mlx_lm.generate -h

The default model for generation and chat is mlx-community/Llama-3.2-3B-Instruct-4bit. You can specify any MLX-compatible model with the --model flag. Thousands are available in the MLX Community Hugging Face organization.

Python API

You can use mlx-lm as a module:

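from mlx_lm import load, generate

# Download (if needed) and load the model and its tokenizer
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"

# Wrap the raw prompt in the model's chat template
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
)

# verbose=True prints the generated text as it is produced
text = generate(model, tokenizer, prompt=prompt, verbose=True)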

To see a description of all the arguments, run:

>>> help(generate)

Check out the generation example to see how to use the API in more detail. Check out the batch generation example to see how to efficiently generate continuations for a batch of prompts.

The mlx-lm package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.

You can convert models using the Python API:

from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)

This will generate a 4-bit quantized Mistral 7B and upload it to the repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit. It will also save the converted model in the path mlx_model by default.

To see a description of all the arguments, run:

>>> help(convert)

Streaming

For streaming generation, use the stream_generate function, which yields generation response objects as generation proceeds.

For example:

from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
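
If the full text is also needed after streaming finishes, the loop above can be wrapped in a small helper. The sketch below is illustrative rather than part of mlx-lm: collect_stream is a hypothetical name, and it assumes only that each yielded response exposes a text attribute, as in the example above.

```python
def collect_stream(responses, on_text=None):
    """Consume an iterable of response objects and return the joined text.

    `responses` is assumed to be what stream_generate yields; only the
    `.text` attribute of each response is used. `on_text` is an optional
    callback (hypothetical) invoked with every piece as it arrives.
    """
    pieces = []
    for response in responses:
        if on_text is not None:
            on_text(response.text)
        pieces.append(response.text)
    return "".join(pieces)


# Intended use with mlx-lm (not executed here):
# full_text = collect_stream(
#     stream_generate(model, tokenizer, prompt, max_tokens=512),
#     on_text=lambda piece: print(piece, end="", flush=True),
# )
```

Passing the generator in, rather than calling stream_generate inside the helper, keeps the helper independent of the model and easy to exercise with stand-in objects.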

Sampling

The generate and stream_generate functions accept sampler and logits_processors keyword arguments. A sampler is any callable which accepts a possibly batched logits array and returns an array of sampled tokens. The logits_processors must be a list of callables which take the token history and current logits as input and return the processed logits. The logits processors are applied in order.

Some standard sampling functions and logits processors are provided in mlx_lm.sample_utils.
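
To make the two callable contracts concrete, here is a minimal sketch that uses plain Python lists in place of MLX arrays and a single unbatched row of logits. All function names below are hypothetical and written for illustration; they are not the helpers shipped in mlx_lm.sample_utils, and real samplers and processors operate on MLX arrays.

```python
def greedy_sampler(logits):
    """Sampler: take logits, return the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def make_logit_bias(bias):
    """Build a logits processor that adds a fixed offset to chosen token ids."""
    def processor(tokens, logits):
        out = list(logits)
        for token_id, offset in bias.items():
            out[token_id] += offset
        return out
    return processor


def make_repetition_penalty(penalty):
    """Build a logits processor that down-weights tokens already generated."""
    def processor(tokens, logits):
        out = list(logits)
        for token_id in set(tokens):
            # Shrink positive logits and push negative ones further down.
            if out[token_id] > 0:
                out[token_id] /= penalty
            else:
                out[token_id] *= penalty
        return out
    return processor


def next_token(tokens, logits, sampler, logits_processors=()):
    """Apply the processors in order, then sample from the processed logits."""
    for processor in logits_processors:
        logits = processor(tokens, logits)
    return sampler(logits)


history = [1]             # token 1 has already been generated
logits = [1.0, 3.0, 2.0]  # raw scores for a three-token vocabulary

plain = next_token(history, logits, greedy_sampler)
penalized = next_token(history, logits, greedy_sampler, [make_repetition_penalty(2.0)])
biased = next_token(
    history, logits, greedy_sampler,
    [make_repetition_penalty(2.0), make_logit_bias({0: 5.0})],
)
```

Because processors run in order, the bias in the last call is applied to logits that the repetition penalty has already modified.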

Command Line

You can also use mlx-lm from the command line with:

mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"

This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.

For a full list of options run:

mlx_lm.generate --help

To quantize a model from the command line run:

mlx_lm.convert --model mistralai/Mistral-7B-Instruct-v0.3 -q

For more options run:

mlx_lm.convert --help

You can upload new models to Hugging Face by passing --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community, run:

mlx_lm.convert \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral

Models can also be converted and quantized directly in the mlx-my-repo Hugging Face Space.

Long Prompts and Generations

mlx-lm has some tools to scale efficiently to long prompts and generations:

A rotating fixed-size key-value cache.
Prompt caching.

To use the rotating key-value cache, pass the argument --max-kv-size n, where n can be any integer. Smaller values like 512 will use very little RAM but result in worse quality. Larger values like 4096 or higher will use more RAM but give better quality.
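
The memory and quality trade-off follows from how a fixed-size rotating buffer behaves: once it is full, each new entry displaces the oldest one, so memory stays bounded while older context is dropped. The sketch below illustrates the idea with a deque of placeholder strings. It is a conceptual illustration under that assumption, not mlx-lm's implementation, which stores tensors and may differ in what it evicts; RotatingCacheSketch is a hypothetical name.

```python
from collections import deque


class RotatingCacheSketch:
    """Hold at most `max_size` of the most recent (key, value) entries."""

    def __init__(self, max_size):
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        # A deque with maxlen drops the oldest item on overflow.
        self.entries = deque(maxlen=max_size)

    def update(self, key, value):
        """Add the entry for the newest position, evicting the oldest if full."""
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


cache = RotatingCacheSketch(max_size=4)
for position in range(10):
    cache.update(f"k{position}", f"v{position}")

# Only the four most recent positions remain available.
retained_keys = [key for key, _ in cache.entries]
```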

Caching prompts can substantially speed up reusing the same long context with different queries. To cache a prompt, use mlx_lm.cache_prompt. For example:

cat prompt.txt | mlx_lm.cache_prompt \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --prompt - \
  --prompt-cache-file mistral_prompt.safetensors

Then use the cached prompt with mlx_lm.generate:

mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."

The cached prompt is treated as a prefix to the supplied prompt. Note that when using a cached prompt, the model is read from the cache and need not be supplied explicitly.

Prompt caching can also be used in the Python API in order to avoid recomputing the prompt. This is useful in multi-turn dialogues or across requests that use the same context. See the example for more usage details.
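
As a rough sketch of the multi-turn case, the helper below reuses one cache object across several calls so that earlier turns do not have to be re-encoded. It rests on assumptions that should be checked against help(generate) and the project's example: that generate accepts a prompt_cache keyword argument, and that a cache can be created with make_prompt_cache from mlx_lm.models.cache. run_dialogue and its generate_fn parameter are hypothetical names introduced here.

```python
def run_dialogue(model, tokenizer, prompt_cache, user_turns, generate_fn):
    """Send each user turn through generate_fn, sharing one prompt cache.

    Only the new message is templated on each turn; the earlier context is
    assumed to live in `prompt_cache`, which generate_fn updates in place.
    """
    replies = []
    for content in user_turns:
        messages = [{"role": "user", "content": content}]
        prompt = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True,
        )
        replies.append(
            generate_fn(model, tokenizer, prompt=prompt, prompt_cache=prompt_cache)
        )
    return replies


# Assumed wiring with mlx-lm (not executed here):
# from mlx_lm import load, generate
# from mlx_lm.models.cache import make_prompt_cache
# model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
# replies = run_dialogue(
#     model, tokenizer, make_prompt_cache(model),
#     ["Hi, my name is Ada.", "What is my name?"], generate,
# )
```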

Supported Models

mlx-lm supports thousands of LLMs available on the Hugging Face Hub. If the model you want to run is not supported, file an issue or better yet, submit a pull request. Many supported models are available in various quantization formats in the MLX Community Hugging Face organization.

For some models, the tokenizer may require you to enable the trust_remote_code option. You can do this by passing --trust-remote-code on the command line. If you don't specify the flag explicitly, you will be prompted in the terminal to trust remote code when running the model.

Tokenizer options can also be set in the Python API. For example:

from mlx_lm import load

model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)

Large Models

Note: This requires macOS 15.0 or higher to work.

Models which are large relative to the total RAM available on the machine can be slow. mlx-lm will attempt to make them faster by wiring the memory occupied by the model and cache.

If you see the following warning message:

[WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in RAM then it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following sysctl:

sudo sysctl iogpu.wired_limit_mb=N

The value N should be larger than the size of the model in megabytes but smaller than the memory size of the machine.
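
One way to pick a value in that range is sketched below. It measures the model by summing the .safetensors files in a local model directory, then returns an integer strictly between the model size and the machine's memory size. The helper names, the 1024 MB default margin, and the choice to count only .safetensors files are assumptions made for illustration, not recommendations from mlx-lm.

```python
import math
from pathlib import Path


def model_size_mb(model_dir):
    """Sum the sizes of the .safetensors weight files in model_dir, in MB."""
    total_bytes = sum(p.stat().st_size for p in Path(model_dir).glob("*.safetensors"))
    return total_bytes / (1024 * 1024)


def suggest_wired_limit_mb(model_mb, total_ram_mb, margin_mb=1024):
    """Return an integer N with model_mb < N < total_ram_mb.

    margin_mb is extra room above the model size (an arbitrary default).
    Raises ValueError when no such N exists, i.e. the model does not fit.
    """
    lower = math.floor(model_mb) + 1  # smallest integer above the model size
    upper = total_ram_mb - 1          # largest integer below the RAM size
    if lower > upper:
        raise ValueError("model does not fit in RAM")
    return min(lower + margin_mb, upper)


# Example: a 4000 MB model on a machine with 16384 MB of RAM.
n = suggest_wired_limit_mb(4000.0, 16384)
```

The result would then be used as N in the sysctl command above.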

Release history

0.30.7

Feb 12, 2026

0.30.6

Feb 4, 2026

0.30.5

Jan 25, 2026

0.30.4

Jan 19, 2026

0.30.2

Jan 6, 2026

0.30.0

Dec 18, 2025

0.29.1

Dec 16, 2025

0.28.4

Dec 3, 2025

0.28.3

Oct 17, 2025

0.28.2

Oct 2, 2025

0.28.1

Sep 27, 2025

0.28.0

Sep 17, 2025

0.27.1

Sep 4, 2025

0.27.0

Aug 29, 2025

0.26.4

Aug 25, 2025

0.26.3

Aug 6, 2025

0.26.2

Jul 30, 2025

0.26.1

Jul 26, 2025

0.26.0

Jul 8, 2025

0.25.3

Jul 1, 2025

0.25.2

Jun 9, 2025

0.25.1

Jun 7, 2025

0.25.0

Jun 2, 2025

0.24.1

May 14, 2025

0.24.0

Apr 28, 2025

0.23.2

Apr 22, 2025

0.23.1

Apr 20, 2025

0.23.0

Apr 18, 2025

0.22.5

Apr 11, 2025

0.22.4

Apr 6, 2025

0.22.3

Apr 3, 2025

0.22.2

Mar 21, 2025

0.22.1

Mar 18, 2025

0.22.0

Mar 13, 2025

0.21.5

Feb 27, 2025

0.21.4

Feb 8, 2025

0.21.3

Feb 7, 2025

0.21.2

Feb 5, 2025

0.21.1

Jan 16, 2025

0.21.0

Jan 10, 2025

0.20.6

Jan 3, 2025

0.20.5

Dec 23, 2024

0.20.4

Dec 13, 2024

0.20.3

Dec 11, 2024

0.20.2

Dec 8, 2024

0.20.1

Nov 25, 2024

0.19.3

Nov 4, 2024

0.19.2

Oct 23, 2024

0.19.1

Oct 14, 2024

0.19.0

Oct 2, 2024

0.18.2

Sep 19, 2024

0.18.1

Aug 30, 2024

0.17.1

Aug 24, 2024

0.17.0

Aug 17, 2024

0.16.1

Jul 23, 2024

0.16.0

Jul 22, 2024

0.15.3

Jul 17, 2024

0.15.2

Jul 8, 2024

0.15.1

Jul 7, 2024

0.15.0

Jun 27, 2024

0.14.3

Jun 3, 2024

0.14.2

Jun 2, 2024

0.14.1

May 31, 2024

0.14.0

May 24, 2024

0.13.1

May 17, 2024

0.13.0

May 10, 2024

0.12.1

Apr 30, 2024

0.12.0

Apr 26, 2024

0.11.0

Apr 23, 2024

0.10.0

Apr 19, 2024

0.9.0

Apr 11, 2024

0.8.0

Apr 8, 2024

0.7.0

Apr 5, 2024

0.6.0

Apr 2, 2024

0.5.0

Mar 25, 2024

0.4.0

Mar 21, 2024

0.3.0

Mar 13, 2024

0.2.0

Mar 13, 2024

0.1.0

Mar 8, 2024

0.0.14

Mar 4, 2024

0.0.13

Feb 21, 2024

0.0.12

Feb 20, 2024

0.0.11

Feb 18, 2024

0.0.10

Feb 13, 2024

0.0.9

Feb 8, 2024

0.0.8

Feb 6, 2024

0.0.7

Feb 4, 2024

0.0.6

Jan 26, 2024

0.0.5

Jan 24, 2024

0.0.3

Jan 15, 2024

0.0.2

Jan 12, 2024

0.0.1

Jan 12, 2024

Download files

Source Distribution

mlx_lm-0.30.7.tar.gz (275.8 kB)

Uploaded Feb 12, 2026 Source

Built Distribution

mlx_lm-0.30.7-py3-none-any.whl (386.6 kB)

Uploaded Feb 12, 2026 Python 3

File details

Details for the file mlx_lm-0.30.7.tar.gz.

File metadata

Download URL: mlx_lm-0.30.7.tar.gz
Upload date: Feb 12, 2026
Size: 275.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_lm-0.30.7.tar.gz
Algorithm	Hash digest
SHA256	`e5f31ac58d9f2381f28e1ba639ff903e64f7cff1bdc245c0bc97f72264be329c`
MD5	`ab6df89af6567201af79f2fb18fac149`
BLAKE2b-256	`660d56542e2ae13ec6f542d3977d7cff89a205d4f6c5122e0ce23f33265f61c9`
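
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The following is a minimal sketch: the helper names are hypothetical, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest listed above for mlx_lm-0.30.7.tar.gz
EXPECTED_SDIST_SHA256 = "e5f31ac58d9f2381f28e1ba639ff903e64f7cff1bdc245c0bc97f72264be329c"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return True when the file's SHA256 digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Intended use (not executed here):
# ok = verify_sha256("mlx_lm-0.30.7.tar.gz", EXPECTED_SDIST_SHA256)
```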

Provenance

The following attestation bundles were made for mlx_lm-0.30.7.tar.gz:

Publisher: release.yml on ml-explore/mlx-lm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_lm-0.30.7.tar.gz
- Subject digest: e5f31ac58d9f2381f28e1ba639ff903e64f7cff1bdc245c0bc97f72264be329c
- Sigstore transparency entry: 945274447
- Sigstore integration time: Feb 12, 2026
Source repository:
- Permalink: ml-explore/mlx-lm@1974376d704a28a652fc8577cd93b1e2f767ecaa
- Branch / Tag: refs/tags/v0.30.7
- Owner: https://github.com/ml-explore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1974376d704a28a652fc8577cd93b1e2f767ecaa
- Trigger Event: push

File details

Details for the file mlx_lm-0.30.7-py3-none-any.whl.

File metadata

Download URL: mlx_lm-0.30.7-py3-none-any.whl
Upload date: Feb 12, 2026
Size: 386.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_lm-0.30.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`17442a4bf01c4c2d3bca1e647712fe44f19890c3f1eadc8589d389e57b44b9bf`
MD5	`a5a46236a1d457319553889a33950a80`
BLAKE2b-256	`1e17a41c798a3d9cbdc47f39c6db5bba4c2cd199203ead26bf911cb03b644070`

Provenance

The following attestation bundles were made for mlx_lm-0.30.7-py3-none-any.whl:

Publisher: release.yml on ml-explore/mlx-lm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlx_lm-0.30.7-py3-none-any.whl
- Subject digest: 17442a4bf01c4c2d3bca1e647712fe44f19890c3f1eadc8589d389e57b44b9bf
- Sigstore transparency entry: 945274605
- Sigstore integration time: Feb 12, 2026
Source repository:
- Permalink: ml-explore/mlx-lm@1974376d704a28a652fc8577cd93b1e2f767ecaa
- Branch / Tag: refs/tags/v0.30.7
- Owner: https://github.com/ml-explore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1974376d704a28a652fc8577cd93b1e2f767ecaa
- Trigger Event: push
