
CLI to quantize and release Hugging Face models in multiple formats


autopack

autopack makes your Hugging Face models easy to run, share, and ship. It quantizes once and exports to multiple runtimes, with sensible defaults and an automatic flow that produces a readable summary. It supports HF, ONNX, and GGUF (llama.cpp) formats and can publish to the Hugging Face Hub in one shot.

About · Requirements · Setup · Building Instructions · Running · Detailed Usage · Q&A


About The Project

What is autopack?

autopack is a CLI that helps you quantize and package Hugging Face models into multiple useful formats in a single pass, with an option to publish artifacts to the Hub.

Have a 120B LLM and want to optimise it so that people (not corporations with clusters of B200s) can run it on their 8 GB 2060? All you need to do is run:

autopack auto meta-llama/Llama-3-8B -o out/llama

Why use it?

  • Fast: generate multiple variants in one command.
  • Practical: built on Transformers, bitsandbytes, ONNX, and llama.cpp.
  • Portable: CPU- and GPU-friendly artifacts, good defaults.

Requirements

Core

  • Python 3.9+
  • PyTorch, Transformers, Hugging Face Hub
  • Optional: bitsandbytes (4/8-bit), optimum[onnxruntime] (ONNX), llama.cpp (GGUF tools)

Notes

  • GGUF export requires a built llama.cpp and llama-quantize in PATH.
  • Set HUGGINGFACE_HUB_TOKEN to publish, or pass --token.

Setup

Install

pip install -e .

Optional extras

# ONNX export support
pip install -e '.[onnx]'

# GGUF export helpers (converter deps)
pip install -e '.[gguf]'

# llama.cpp runtime bindings (llama-cpp-python)
pip install -e '.[llama]'

# Everything for llama.cpp functionality (GGUF export + runtime)
pip install -e '.[gguf,llama]'

Note: for GGUF and llama.cpp functionality you also need the llama.cpp tools (llama-quantize, llama-cli) available on your PATH. You can build the vendored copy and export PATH as shown in Vendored llama.cpp quick build.

Building Instructions

python -m build

Running the Application

Quickstart

autopack auto meta-llama/Llama-3-8B -o out/llama3 --output-format hf

Add ONNX and GGUF:

autopack auto meta-llama/Llama-3-8B -o out/llama3 --output-format hf onnx gguf

GGUF only (with default presets Q4_K_M, Q5_K_M, Q8_0):

autopack auto meta-llama/Llama-3-8B -o out/llama3-gguf --output-format gguf

Publish to Hub:

autopack publish out/llama3-4bit your-username/llama3-4bit --private \
  --commit-message "Add 4-bit quantized weights"

Detailed Usage

Commands Overview

auto

Run common HF quantization variants and optional ONNX/GGUF exports in one go, with a summary table and generated README in the output folder.

autopack auto <model_id_or_path> -o <out_dir> \
  --output-format hf [onnx] [gguf] \
  [--eval-dataset <dataset>[::<config>]] \
  [--revision <rev>] [--trust-remote-code]

Key points:

  • Default HF variants: bnb-4bit, bnb-8bit, int8-dynamic, bf16
  • Add ONNX and/or GGUF via --output-format
  • If --eval-dataset is provided, perplexity is computed for each HF variant
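To get a rough feel for what each default variant buys you, note that the weight footprint scales with bits per parameter. The sketch below is back-of-the-envelope arithmetic, not something autopack prints. The bits-per-weight figures are nominal assumptions (4 for bnb-4bit, 8 for bnb-8bit and int8-dynamic, 16 for bf16), and the estimate ignores quantization metadata, layers kept in higher precision, activations, and the KV cache.

```python
# Rough weight-only size estimate per variant (illustrative, not autopack output).
# NOMINAL_BITS is an assumption: real checkpoints carry extra metadata.
NOMINAL_BITS = {"bnb-4bit": 4, "bnb-8bit": 8, "int8-dynamic": 8, "bf16": 16}


def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 8e9  # an 8B-parameter model
    for variant, bits in NOMINAL_BITS.items():
        print(f"{variant:>12}: ~{weight_gigabytes(params, bits):.1f} GB")
```

For an 8B-parameter model this gives roughly 4 GB of weights at 4 bits and 16 GB at bf16, which is why the 4-bit variant is usually the one that fits on an 8 GB card.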

quantize

Produce specific formats with a chosen quantization strategy.

autopack quantize <model_id_or_path> -o <out_dir> \
  --output-format hf [onnx] [gguf] \
  [--quantization bnb-4bit|bnb-8bit|int8-dynamic|none] \
  [--dtype auto|float16|bfloat16|float32] \
  [--device-map auto|cpu] [--prune <0..0.95>] \
  [--revision <rev>] [--trust-remote-code]

publish

Upload an exported model folder to the Hugging Face Hub.

autopack publish <folder> <user_or_org/repo> \
  [--private] [--token $HUGGINGFACE_HUB_TOKEN] \
  [--branch <rev>] [--commit-message "..."] [--no-create]
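If you want to script publishing (for example from a CI job), one option is to assemble the command line in Python and hand it to subprocess. The helper below is a hypothetical convenience wrapper, not part of autopack. It only uses the flags listed in the synopsis above and assumes the token is supplied through HUGGINGFACE_HUB_TOKEN rather than on the command line.

```python
import shlex
from typing import List, Optional


def publish_argv(folder: str, repo: str, *, private: bool = False,
                 branch: Optional[str] = None,
                 commit_message: Optional[str] = None,
                 create: bool = True) -> List[str]:
    """Build the argument list for `autopack publish` (hypothetical helper)."""
    argv = ["autopack", "publish", folder, repo]
    if private:
        argv.append("--private")
    if branch:
        argv += ["--branch", branch]
    if commit_message:
        argv += ["--commit-message", commit_message]
    if not create:
        # Assumption: --no-create skips repo creation and expects the repo to exist.
        argv.append("--no-create")
    return argv


if __name__ == "__main__":
    argv = publish_argv("out/llama3-4bit", "your-username/llama3-4bit",
                        private=True,
                        commit_message="Add 4-bit quantized weights")
    # Print a shell-safe preview; run it with subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Passing a list to subprocess avoids shell quoting problems with commit messages that contain spaces or quotes.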

Common Options

  • --trust-remote-code: enable loading custom modeling code from Hub repos
  • --revision: branch/tag/commit to load
  • --device-map: set to cpu to force CPU; defaults to auto
  • --dtype: compute dtype for non-INT8 layers (applies to HF exports)
  • --prune: global magnitude pruning ratio across Linear layers (0..0.95)
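As an illustration of what "global magnitude pruning" means for --prune, the sketch below zeroes the smallest-magnitude weights across all layers using one shared threshold, rather than pruning each layer separately. It is a pure-Python toy on flat lists with hypothetical layer names. How autopack implements pruning internally is not documented here, so treat this as the general technique only.

```python
from typing import Dict, List


def global_magnitude_prune(layers: Dict[str, List[float]],
                           ratio: float) -> Dict[str, List[float]]:
    """Zero the `ratio` fraction of weights with the smallest |w| across all layers."""
    if not 0.0 <= ratio <= 0.95:
        raise ValueError("ratio must be within 0..0.95")
    # Pool magnitudes from every layer so the threshold is global.
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    k = int(len(magnitudes) * ratio)  # number of weights to drop
    if k == 0:
        return {name: list(ws) for name, ws in layers.items()}
    threshold = magnitudes[k - 1]  # largest magnitude that still gets pruned
    # Ties at the threshold are pruned too, so slightly more than k may be zeroed.
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }


if __name__ == "__main__":
    toy = {
        "fc1": [0.5, -0.01, 0.3],
        "fc2": [0.02, -0.9, 0.04, 0.7, -0.2, 0.1, 0.05],
    }
    print(global_magnitude_prune(toy, 0.2))
```

With a ratio of 0.2 over these ten weights, the two smallest magnitudes (-0.01 in fc1 and 0.02 in fc2) are zeroed. A layer full of small weights can therefore lose more than 20% of its entries while another loses none.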

Output Formats

  • hf: Transformers checkpoint with tokenizer and config
  • onnx: ONNX export using optimum[onnxruntime] for CausalLM
  • gguf: llama.cpp GGUF via convert_hf_to_gguf.py and llama-quantize

GGUF Details

  • Converter resolution order:
    1. --gguf-converter if provided
    2. $LLAMA_CPP_CONVERT env var
    3. Vendored script: third_party/llama.cpp/convert_hf_to_gguf.py
    4. ~/llama.cpp/convert_hf_to_gguf.py or ~/src/llama.cpp/convert_hf_to_gguf.py
  • Quant presets: uppercase (e.g., Q4_K_M). If omitted, autopack generates Q4_K_M, Q5_K_M, Q8_0 by default.
  • Isolation: by default, conversion runs in an isolated .venv inside the output dir. Disable with --gguf-no-isolation.
  • Architecture checks: pass --gguf-force to bypass the basic architecture guard.
  • Ensure llama-quantize is in PATH (typically in third_party/llama.cpp/build/bin).
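The converter resolution order above can be sketched as a simple first-match search. This is an illustrative reimplementation, not autopack's source. The function name and the repo_root parameter are hypothetical, and falling through to the next candidate when an explicit path does not exist is an assumption (the real CLI may report an error instead).

```python
import os
from pathlib import Path
from typing import Optional


def resolve_converter(cli_path: Optional[str] = None,
                      repo_root: str = ".") -> Optional[Path]:
    """Return the first existing convert_hf_to_gguf.py in the documented order."""
    script = "convert_hf_to_gguf.py"
    home = Path.home()
    candidates = [
        cli_path,                                                # 1. --gguf-converter
        os.environ.get("LLAMA_CPP_CONVERT"),                     # 2. environment variable
        Path(repo_root) / "third_party" / "llama.cpp" / script,  # 3. vendored script
        home / "llama.cpp" / script,                             # 4. common checkouts
        home / "src" / "llama.cpp" / script,
    ]
    for candidate in candidates:
        if not candidate:
            continue  # option not supplied
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None  # nothing found; GGUF export cannot proceed


if __name__ == "__main__":
    print(resolve_converter() or "convert_hf_to_gguf.py not found")
```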

ONNX Details

  • Requires: pip install 'optimum[onnxruntime]'
  • Uses ORTModelForCausalLM; non-CausalLM models may not be supported in this version.

Perplexity Evaluation

  • --eval-dataset accepts dataset or dataset::config, as in the auto synopsis (e.g., wikitext::wikitext-2-raw-v1)
  • Device selection is automatic (cuda if available, else cpu)
  • Only CausalLM architectures are supported for perplexity computation
  • Uses a bounded sample count and expects a text field in the dataset
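For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on a plain list of per-token log-probabilities (natural log). It is not autopack's evaluation code: details such as the sample bound, tokenization, and batching are not specified here and are left out.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to each of four tokens.
    log_probs = [math.log(0.25)] * 4
    print(round(perplexity(log_probs), 3))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, meaning it is as uncertain as a uniform choice among four options. Lower is better when comparing quantized variants against the bf16 baseline.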

More Examples

CPU-friendly int8 dynamic with pruning:

autopack quantize meta-llama/Llama-3-8B -o out/llama3-cpu \
  --output-format hf --quantization int8-dynamic --prune 0.2 --device-map cpu

BF16 only (no quantization):

autopack quantize meta-llama/Llama-3-8B -o out/llama3-bf16 \
  --output-format hf --quantization none --dtype bfloat16

Override GGUF presets:

autopack auto meta-llama/Llama-3-8B -o out/llama3-gguf \
  --output-format gguf --gguf-quant Q5_K_M Q8_0

Hello World (Transformers on CPU):

pip install -e .
autopack auto sshleifer/tiny-gpt2 -o out/tiny --output-format hf
python - <<'PY'
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained('out/tiny/bf16')
m   = AutoModelForCausalLM.from_pretrained('out/tiny/bf16', device_map='cpu')
ids = tok('Hello world', return_tensors='pt').input_ids
out = m.generate(ids, max_new_tokens=8)
print(tok.decode(out[0]))
PY

Hello World (GGUF with llama.cpp):

autopack auto sshleifer/tiny-gpt2 -o out/tiny-gguf --output-format gguf
./third_party/llama.cpp/build/bin/llama-cli -m out/tiny-gguf/gguf/model-Q4_K_M.gguf -p "Hello world" -n 16

Vendored llama.cpp quick build

cd third_party/llama.cpp
cmake -S . -B build -DGGML_NATIVE=ON
cmake --build build -j
# Make llama-quantize and llama-cli visible to autopack
export PATH="$PWD/build/bin:$PATH"

Troubleshooting

  • llama-quantize not found: build llama.cpp and ensure build/bin is in PATH.
  • bitsandbytes on Windows: not installed by default at the moment; prefer the CPU int8-dynamic flow.
  • Custom code prompt: pass --trust-remote-code to avoid the interactive confirmation.

Environment Variables

  • HUGGINGFACE_HUB_TOKEN: token to publish to the Hub
  • LLAMA_CPP_CONVERT: path to convert_hf_to_gguf.py
  • PATH: should include the directory with llama-quantize
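A small preflight check can catch a missing tool or variable before a long export starts. The helper below is a hypothetical sketch, not an autopack command. It only inspects the three variables listed above, using shutil.which to look up llama-quantize.

```python
import os
import shutil
from typing import Dict, List, Optional


def preflight(env: Optional[Dict[str, str]] = None) -> List[str]:
    """Return warnings for missing pieces; an empty list means nothing was flagged."""
    env = dict(os.environ) if env is None else env
    warnings: List[str] = []
    # GGUF export needs llama-quantize somewhere on PATH.
    if shutil.which("llama-quantize", path=env.get("PATH", "")) is None:
        warnings.append("llama-quantize not found on PATH (needed for GGUF export)")
    # Publishing needs a token, from the environment or via --token.
    if not env.get("HUGGINGFACE_HUB_TOKEN"):
        warnings.append("HUGGINGFACE_HUB_TOKEN not set (pass --token to publish)")
    # LLAMA_CPP_CONVERT is optional, but if set it should point at a real file.
    converter = env.get("LLAMA_CPP_CONVERT")
    if converter and not os.path.isfile(converter):
        warnings.append(f"LLAMA_CPP_CONVERT points to a missing file: {converter}")
    return warnings


if __name__ == "__main__":
    for message in preflight():
        print("warning:", message)
```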

Q&A

FAQs

What does “auto” do?

Generates HF variants (4-bit, 8-bit, int8-dynamic, bf16) and prints a summary; GGUF/ONNX are opt-in.

What if I omit --gguf-quant?

autopack will create multiple useful presets by default (Q4_K_M, Q5_K_M, Q8_0).


License: Apache-2.0
