airllm

AirLLM runs 70B large language models on a single 4GB GPU without quantization, distillation or pruning. 405B Llama 3.1 on 8GB, DeepSeek-V3 671B on ~12GB.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

gavinli

These details have not been verified by PyPI

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

airllm_logo

Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM dramatically reduces inference memory usage, letting 70B large language models run on a single 4GB GPU card — without quantization, distillation, or pruning. You can even run 405B Llama 3.1 on 8GB, and DeepSeek-V3 (671B) on ~12GB.

AI Agents Recommendation:

Updates

[2026/06] v3.0: FP8 model support + the latest models. Run DeepSeek-V3 (671B) on ~12GB and Qwen3-235B on ~3GB, plus Qwen3, Llama 3.x/4, DeepSeek V2/V3, Phi-4, Gemma and more — all through a single AutoModel.

[2024/08/20] v2.11.0: Support Qwen2.5

[2024/08/18] v2.10.1 Support CPU inference. Support non sharded models. Thanks @NavodPeiris for the great work!

[2024/07/30] Support Llama3.1 405B (example notebook). Support 8bit/4bit quantization.

[2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.

[2023/12/25] v2.8.2: Support MacOS running 70B large language models.

[2023/12/20] v2.7: Support AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.

[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.

[2023/12/03] added support of ChatGLM, QWen, Baichuan, Mistral, InternLM!

[2023/12/02] added support for safetensors. Now support all top 10 models in open llm leaderboard.

[2023/12/01] airllm 2.0. Support compressions: 3x run time speed up!

[2023/11/20] airllm Initial version!

Quickstart

1. Install package

First, install the airllm pip package.

pip install airllm

2. Inference

Then, initialize AirLLMLlama2, pass in the huggingface repo ID of the model being used, or the local path, and inference can be performed similar to a regular transformer model.

(You can also specify the path to save the splitted layered model through layer_shards_saving_path when init AirLLMLlama2.

from airllm import AutoModel

MAX_LENGTH = 128
# just pass a hugging face repo id — works with almost any popular model:
model = AutoModel.from_pretrained("Qwen/Qwen3-32B")

# go bigger with the exact same one line:
#model = AutoModel.from_pretrained("Qwen/Qwen3-235B-A22B")     # 235B, runs in ~3GB
#model = AutoModel.from_pretrained("deepseek-ai/DeepSeek-V3")  # 671B, runs in ~12GB

# or use a model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-32B/snapshots/...")

input_text = [
        'What is the capital of United States?',
        #'I like',
    ]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=False)
           
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.

Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization-based model compression. Which can further speed up the inference speed for up to 3x , with almost ignorable accuracy loss! (see more performance evaluation and why we use block-wise quantization in this paper)

speed_improvement

How to enable model compression speed up:

Step 1. make sure you have bitsandbytes installed by pip install -U bitsandbytes
Step 2. make sure airllm verion later than 2.0.0: pip install -U airllm
Step 3. when initialize the model, passing the argument compression ('4bit' or '8bit'):

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization 
                    )

What are the differences between model compression and quantization?

Quantization normally needs to quantize both weights and activations to really speed things up. Which makes it harder to maintain accuracy and avoid the impact of outliers in all kinds of inputs.

While in our case the bottleneck is mainly at the disk loading, we only need to make the model loading size smaller. So, we get to only quantize the weights' part, which is easier to ensure the accuracy.

Configurations

When initialize the model, we support the following configurations:

compression: supported options: 4bit, 8bit for 4-bit or 8-bit block-wise quantization, or by default None for no compression
profiling_mode: supported options: True to output time consumptions or by default False
layer_shards_saving_path: optionally another path to save the splitted model
hf_token: huggingface token can be provided here if downloading gated models like: meta-llama/Llama-2-7b-hf
prefetching: prefetching to overlap the model loading and compute. By default, turned on. For now, only AirLLMLlama2 supports this.
delete_original: if you don't have too much disk space, you can set delete_original to true to delete the original downloaded hugging face model, only keep the transformed one to save half of the disk space.

MacOS

Just install airllm and run the code the same as on linux. See more in Quick Start.

make sure you installed mlx and torch
you probably need to install python native see more here
only Apple silicon is supported

Example [python notebook] (https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)

Example Python Notebook

Example colabs here:

example of other models (ChatGLM, QWen, Baichuan, Mistral, etc):

ChatGLM:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache= True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

QWen:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

Baichuan, InternLM, Mistral, etc:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

To request other model support: here

Supported Models

AirLLM works out of the box with virtually every popular open LLM — just pass its Hugging Face ID to AutoModel.from_pretrained(...). That covers all the major families:

Llama (2 / 3 / 3.1 / 3.3 / 4) · Qwen (1 / 2 / 2.5 / 3, including MoE and FP8) · DeepSeek (V2 / V3 / R1) · Mistral & Mixtral · Phi · Gemma · ChatGLM · Baichuan · InternLM · Yi — and most new models the day they're released.

Tiny GPU, huge models

The trick: AirLLM only ever keeps one layer on the GPU at a time, so the VRAM you need depends on the model's layer size — not its total size. That's how a 671B model fits on a hobbyist card:

Model	Size	GPU VRAM
Qwen3 / Mistral / Phi (≈8B)	8B	~1–2 GB
Qwen3-30B / Mixtral (MoE)	30–47B	~1–3 GB
Qwen3-235B (MoE)	235B	~3 GB
Llama 3.x 70B (full precision)	70B	~4 GB
Llama 3.1 405B	405B	~8 GB
DeepSeek-V3	671B	~12 GB

Same one line of code for all of them — no special setup.

Acknowledgement

A lot of the code are based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:

GitHub account @SimJeg, the code on Kaggle, the associated discussion.

FAQ

1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, most possible cause is you run out of disk space. The process of splitting model is very disk-consuming. See this. You may need to extend your disk space, clear huggingface .cache and rerun.

2. ValueError: max() arg is an empty sequence

Most likely you are loading QWen or ChatGLM model with Llama2 class. Try the following:

For QWen model:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

For ChatGLM model:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

3. 401 Client Error....Repo model ... is gated.

Some models are gated models, needs huggingface api token. You can provide hf_token:

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", #hf_token='HF_API_TOKEN')

4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some model's tokenizer doesn't have padding token, so you can set a padding token or simply turn the padding config off:

input_tokens = model.tokenizer(input_text,
   return_tensors="pt", 
   return_attention_mask=False, 
   truncation=True, 
   max_length=MAX_LENGTH, 
   padding=False  #<-----------   turn off padding 
)

Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTex entry:

@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}

Contribution

Welcomed contributions, ideas and discussions!

If you find it useful, please ⭐ or buy me a coffee! 🙏

Project details

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

gavinli

These details have not been verified by PyPI

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

3.0.1

Jun 30, 2026

3.0.0

Jun 30, 2026

2.11.0

Sep 21, 2024

2.10.2

Aug 23, 2024

2.10.1

Aug 18, 2024

2.9.1

Aug 3, 2024

2.9

Jul 31, 2024

2.8.6

Jul 29, 2024

2.8.4

Jul 29, 2024

2.8.3

Dec 27, 2023

2.8.2

Dec 25, 2023

2.8.1

Dec 25, 2023

2.8

Dec 25, 2023

2.7

Dec 21, 2023

2.6.2

Dec 20, 2023

2.6.1

Dec 20, 2023

2.6

Dec 20, 2023

2.5

Dec 19, 2023

2.4.5

Dec 17, 2023

2.4.4

Dec 17, 2023

2.4.3

Dec 17, 2023

2.4.2

Dec 8, 2023

2.4.1

Dec 5, 2023

2.4.0

Dec 4, 2023

2.3.1

Dec 4, 2023

2.3.0

Dec 4, 2023

2.2.0

Dec 3, 2023

2.1.1

Dec 3, 2023

2.1.0

Dec 3, 2023

2.0.0

Dec 2, 2023

0.9.5

Nov 21, 2023

0.9.4

Nov 18, 2023

0.9.3

Nov 17, 2023

0.9.2

Nov 17, 2023

0.9.1

Nov 17, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

airllm-3.0.1.tar.gz (40.0 kB view details)

Uploaded Jun 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

airllm-3.0.1-py3-none-any.whl (41.0 kB view details)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file airllm-3.0.1.tar.gz.

File metadata

Download URL: airllm-3.0.1.tar.gz
Upload date: Jun 30, 2026
Size: 40.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for airllm-3.0.1.tar.gz
Algorithm	Hash digest
SHA256	`b8cf8068238ef129b21d0b1ea2133f994a9a47a10252dd29c337a2c92e01b453`
MD5	`4f572970b70aad0fa8870c9a278a69b4`
BLAKE2b-256	`99d96cd7253a9a7e2c3975ab039aeb105453be78621b4843a4b4f4cd21eaa54e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for airllm-3.0.1.tar.gz:

Publisher: release.yml on lyogavin/airllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: airllm-3.0.1.tar.gz
- Subject digest: b8cf8068238ef129b21d0b1ea2133f994a9a47a10252dd29c337a2c92e01b453
- Sigstore transparency entry: 2027861987
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: lyogavin/airllm@6436dce1547c613e462fa88327c0ad07f00059f0
- Branch / Tag: refs/tags/v3.0.1
- Owner: https://github.com/lyogavin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6436dce1547c613e462fa88327c0ad07f00059f0
- Trigger Event: release

File details

Details for the file airllm-3.0.1-py3-none-any.whl.

File metadata

Download URL: airllm-3.0.1-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 41.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for airllm-3.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6b8190c9c06f6527ab625c5e683be9e3fcea0ada1d2aa215851f3cc910ceb1f`
MD5	`e45b0ca387345b751fa687fd13c9d825`
BLAKE2b-256	`6458d228a0f82b59957191359baa1d54836945c0467ab9698b83a2db262936ad`

See more details on using hashes here.

Provenance

The following attestation bundles were made for airllm-3.0.1-py3-none-any.whl:

Publisher: release.yml on lyogavin/airllm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: airllm-3.0.1-py3-none-any.whl
- Subject digest: e6b8190c9c06f6527ab625c5e683be9e3fcea0ada1d2aa215851f3cc910ceb1f
- Sigstore transparency entry: 2027862153
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: lyogavin/airllm@6436dce1547c613e462fa88327c0ad07f00059f0
- Branch / Tag: refs/tags/v3.0.1
- Owner: https://github.com/lyogavin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6436dce1547c613e462fa88327c0ad07f00059f0
- Trigger Event: release

airllm 3.0.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

AI Agents Recommendation:

Updates

Star History

Table of Contents

Quickstart

1. Install package

2. Inference

Model Compression - 3x Inference Speed Up!

How to enable model compression speed up:

What are the differences between model compression and quantization?

Configurations

MacOS

Example Python Notebook

example of other models (ChatGLM, QWen, Baichuan, Mistral, etc):

To request other model support: here

Supported Models

Tiny GPU, huge models

Acknowledgement

FAQ

1. MetadataIncompleteBuffer

2. ValueError: max() arg is an empty sequence

3. 401 Client Error....Repo model ... is gated.

4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Citing AirLLM

Sponsors

Run AI Agent Teams in the Cloud — Bloome

Contribution

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance