Skip to main content

LLMs on Apple silicon with MLX and the Hugging Face Hub

Project description

Generate Text with LLMs and MLX

The easiest way to get started is to install the mlx-lm package:

With pip:

pip install mlx-lm

With conda:

conda install -c conda-forge mlx-lm

The mlx-lm package also has:

Quick Start

To generate text with an LLM use:

mlx_lm.generate --prompt "Hi!"

To chat with an LLM use:

mlx_lm.chat

This will give you a chat REPL that you can use to interact with the LLM. The chat context is preserved during the lifetime of the REPL.

Commands in mlx-lm typically take command line options which let you specify the model, sampling parameters, and more. Use -h to see a list of available options for a command, e.g.:

mlx_lm.generate -h

Python API

You can use mlx-lm as a module:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

To see a description of all the arguments you can do:

>>> help(generate)

Check out the generation example to see how to use the API in more detail.

The mlx-lm package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.

You can convert models using the Python API:

from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)

This will generate a 4-bit quantized Mistral 7B and upload it to the repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit. It will also save the converted model in the path mlx_model by default.

To see a description of all the arguments you can do:

>>> help(convert)

Streaming

For streaming generation, use the stream_generate function. This yields a generation response object.

For example,

from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()

Command Line

You can also use mlx-lm from the command line with:

mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"

This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.

For a full list of options run:

mlx_lm.generate --help

To quantize a model from the command line run:

mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q

For more options run:

mlx_lm.convert --help

You can upload new models to Hugging Face by specifying --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community you can do:

mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral

Models can also be converted and quantized directly in the mlx-my-repo Hugging Face Space.

Long Prompts and Generations

mlx-lm has some tools to scale efficiently to long prompts and generations:

  • A rotating fixed-size key-value cache.
  • Prompt caching

To use the rotating key-value cache pass the argument --max-kv-size n where n can be any integer. Smaller values like 512 will use very little RAM but result in worse quality. Larger values like 4096 or higher will use more RAM but have better quality.

Caching prompts can substantially speedup reusing the same long context with different queries. To cache a prompt use mlx_lm.cache_prompt. For example:

cat prompt.txt | mlx_lm.cache_prompt \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --prompt - \
  --prompt-cache-file mistral_prompt.safetensors

Then use the cached prompt with mlx_lm.generate:

mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."

The cached prompt is treated as a prefix to the supplied prompt. Also notice when using a cached prompt, the model to use is read from the cache and need not be supplied explicitly.

Prompt caching can also be used in the Python API in order to to avoid recomputing the prompt. This is useful in multi-turn dialogues or across requests that use the same context. See the example for more usage details.

Supported Models

mlx-lm supports thousands of Hugging Face format LLMs. If the model you want to run is not supported, file an issue or better yet, submit a pull request.

Here are a few examples of Hugging Face models that work with this example:

Most Mistral, Llama, Phi-2, and Mixtral style models should work out of the box.

For some models (such as Qwen and plamo) the tokenizer requires you to enable the trust_remote_code option. You can do this by passing --trust-remote-code in the command line. If you don't specify the flag explicitly, you will be prompted to trust remote code in the terminal when running the model.

For Qwen models you must also specify the eos_token. You can do this by passing --eos-token "<|endoftext|>" in the command line.

These options can also be set in the Python API. For example:

model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)

Large Models

[!NOTE] This requires macOS 15.0 or higher to work.

Models which are large relative to the total RAM available on the machine can be slow. mlx-lm will attempt to make them faster by wiring the memory occupied by the model and cache. This requires macOS 15 or higher to work.

If you see the following warning message:

[WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in RAM then it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following sysctl:

sudo sysctl iogpu.wired_limit_mb=N

The value N should be larger than the size of the model in megabytes but smaller than the memory size of the machine.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mlx_lm-0.21.4.tar.gz (108.7 kB view details)

Uploaded Source

Built Distribution

mlx_lm-0.21.4-py3-none-any.whl (145.7 kB view details)

Uploaded Python 3

File details

Details for the file mlx_lm-0.21.4.tar.gz.

File metadata

  • Download URL: mlx_lm-0.21.4.tar.gz
  • Upload date:
  • Size: 108.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for mlx_lm-0.21.4.tar.gz
Algorithm Hash digest
SHA256 deecb0890fde405a8595cb22cd3336d7419ed141f84e0b55c5a7f01125a9bf0f
MD5 4cd567ee5a82774aca09e2906cd32b96
BLAKE2b-256 b3a7aa97ea584e09069a976cbafe2cc1d65b1825f2372b111cf6c94108cbb099

See more details on using hashes here.

File details

Details for the file mlx_lm-0.21.4-py3-none-any.whl.

File metadata

  • Download URL: mlx_lm-0.21.4-py3-none-any.whl
  • Upload date:
  • Size: 145.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for mlx_lm-0.21.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7353d74160577558d1c3c5b5853ab2bbe43e76a3777618b13d04515e366fb3f2
MD5 04187fa4eebebb17e9ea93231d580f9d
BLAKE2b-256 22e0be88ec77ede3a010bed4cb1d6312bc7f056ceb16c7d9eb5430c7899a0f29

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page