CLI to quantize and release Hugging Face models in multiple formats
autopack
autopack makes your Hugging Face models easy to run, share, and ship. It quantizes once and exports to multiple runtimes, with sensible defaults and an automatic flow that produces a readable summary. It supports HF, ONNX, and GGUF (llama.cpp) formats and can publish to the Hugging Face Hub in one shot.
About · Requirements · Setup · Building Instructions · Running · Detailed Usage · Q&A
About The Project
What is autopack?
autopack is a CLI that helps you quantize and package Hugging Face models into multiple useful formats in a single pass, with an option to publish artifacts to the Hub.
Say you have a 120B LLM and want to optimize it so that people (not just corporations with clusters of B200s) can run it on their 8GB 2060. All you need to do is point autopack at a model, for example:
autopack sentence-transformers/all-MiniLM-L6-v2
Why use it?
- Fast: generate multiple variants in one command.
- Practical: built on Transformers, bitsandbytes, ONNX, and llama.cpp.
- Portable: CPU- and GPU-friendly artifacts, good defaults.
Requirements
Core
- Python 3.9+
- PyTorch, Transformers, Hugging Face Hub
- Optional: bitsandbytes (4/8-bit), optimum[onnxruntime] (ONNX), llama.cpp (GGUF tools)
Notes
- GGUF export requires a built llama.cpp and llama-quantize in PATH.
- Set HUGGINGFACE_HUB_TOKEN to publish, or pass --token.
Setup
Install
pip install autopack-grn
Optional extras
# ONNX export support
pip install 'autopack-grn[onnx]'
# GGUF export helpers (converter deps)
pip install 'autopack-grn[gguf]'
# llama.cpp runtime bindings (llama-cpp-python)
pip install 'autopack-grn[llama]'
# Everything for llama.cpp functionality (GGUF export + runtime)
pip install 'autopack-grn[gguf,llama]'
Note: for GGUF and llama.cpp functionality you also need the llama.cpp tools
(llama-quantize, llama-cli) available on your PATH. You can build the
vendored copy and export PATH as shown in
Vendored llama.cpp quick build.
From source (dev)
pip install -e .
# Optional extras while developing
pip install -e '.[onnx]'
pip install -e '.[gguf]'
pip install -e '.[llama]'
pip install -e '.[gguf,llama]'
Building Instructions
python -m build
Running the Application
Quickstart
autopack meta-llama/Llama-3-8B --output-format hf
Add ONNX and GGUF:
autopack meta-llama/Llama-3-8B --output-format hf onnx gguf --summary-json --skip-existing
GGUF only (with default presets Q4_K_M, Q5_K_M, Q8_0):
autopack meta-llama/Llama-3-8B --output-format gguf --skip-existing
Publish to Hub:
autopack publish out/llama3-4bit your-username/llama3-4bit --private \
--commit-message "Add 4-bit quantized weights"
Detailed Usage
Commands Overview
auto
Run common HF quantization variants and optional ONNX/GGUF exports in one go, with a summary table and generated README in the output folder.
autopack [auto] <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--eval-dataset <dataset>[::<config>]] \
[--revision <rev>] [--trust-remote-code] [--device auto|cpu|cuda] \
[--no-bench] [--bench-prompt "..."] [--bench-max-new-tokens 16] \
[--bench-warmup 0] [--bench-runs 1]
Key points:
- Default HF variants: bnb-4bit, bnb-8bit, int8-dynamic, bf16
- Add ONNX and/or GGUF via --output-format
- If -o/--output-dir is omitted, the output folder defaults to the last path segment of the model id/path (e.g., user/model -> model).
- Benchmarking is enabled by default in auto; use --no-bench to disable.
- If --eval-dataset is provided, perplexity is computed for each HF variant
- If benchmarking is enabled, autopack measures actual Tokens/s per backend and replaces heuristic speedups with real Tokens/s and speedup vs bf16 in the summary and the generated README.
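The default output folder rule above can be illustrated with a minimal sketch. The helper name `default_output_dir` is hypothetical and this is not autopack's own code; it only mirrors the documented behavior (last path segment of the model id or path).

```python
from pathlib import PurePosixPath


def default_output_dir(model_id_or_path: str) -> str:
    """Return the last path segment of a Hub id or local path (hypothetical helper)."""
    # Normalize Windows separators and drop trailing slashes so that
    # "checkpoints/my-model/" still resolves to "my-model".
    trimmed = model_id_or_path.replace("\\", "/").rstrip("/")
    return PurePosixPath(trimmed).name


if __name__ == "__main__":
    for target in ("user/model", "sshleifer/tiny-gpt2", "./checkpoints/my-model/"):
        print(target, "->", default_output_dir(target))
```

Under this rule, running autopack on `sshleifer/tiny-gpt2` without `-o` would write into a `tiny-gpt2` folder, which matches the paths used in the Hello World examples.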
quantize
Produce specific formats with a chosen quantization strategy.
autopack quantize <model_id_or_path> [-o <out_dir>] \
--output-format hf [onnx] [gguf] \
[--quantization bnb-4bit|bnb-8bit|int8-dynamic|none] \
[--dtype auto|float16|bfloat16|float32] \
[--device-map auto|cpu] [--prune <0..0.95>] \
[--revision <rev>] [--trust-remote-code]
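When scripting several quantize runs, it can help to build and validate the command line in Python before launching it. The sketch below is an assumption-laden illustration: `build_quantize_cmd` is a hypothetical helper, and it only uses flags and value ranges listed in the synopsis above. Optional flags are omitted when not set, because the CLI's own defaults are not documented here.

```python
import shlex

# Allowed values copied from the quantize synopsis above.
FORMAT_CHOICES = {"hf", "onnx", "gguf"}
QUANT_CHOICES = {"bnb-4bit", "bnb-8bit", "int8-dynamic", "none"}
DTYPE_CHOICES = {"auto", "float16", "bfloat16", "float32"}


def build_quantize_cmd(model, formats=("hf",), quantization=None,
                       dtype=None, prune=None, out_dir=None):
    """Build an argv list for `autopack quantize` (hypothetical helper)."""
    unknown = set(formats) - FORMAT_CHOICES
    if unknown:
        raise ValueError(f"unknown output format(s): {sorted(unknown)}")
    if quantization is not None and quantization not in QUANT_CHOICES:
        raise ValueError(f"unknown quantization: {quantization}")
    if dtype is not None and dtype not in DTYPE_CHOICES:
        raise ValueError(f"unknown dtype: {dtype}")
    if prune is not None and not 0.0 <= prune <= 0.95:
        raise ValueError("prune must be within 0..0.95")

    cmd = ["autopack", "quantize", model]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    cmd += ["--output-format", *formats]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    if dtype is not None:
        cmd += ["--dtype", dtype]
    if prune is not None:
        cmd += ["--prune", str(prune)]
    return cmd


if __name__ == "__main__":
    cmd = build_quantize_cmd("meta-llama/Llama-3-8B",
                             quantization="int8-dynamic", prune=0.2)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once autopack is installed.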
publish
Upload an exported model folder to the Hugging Face Hub.
autopack publish <folder> <user_or_org/repo> \
[--private] [--token $HUGGINGFACE_HUB_TOKEN] \
[--branch <rev>] [--commit-message "..."] [--no-create]
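For readers who want to see roughly what a publish step involves, here is a minimal sketch using the `huggingface_hub` client directly. It is not autopack's implementation: the function names are hypothetical, and the precedence (an explicit token wins over `HUGGINGFACE_HUB_TOKEN`) is an assumption, since the source only says either can be used.

```python
import os


def resolve_token(cli_token=None, env=None):
    """Pick the token: explicit value first, then HUGGINGFACE_HUB_TOKEN (assumed order)."""
    env = os.environ if env is None else env
    return cli_token or env.get("HUGGINGFACE_HUB_TOKEN")


def publish_folder(folder, repo_id, token=None, private=False,
                   branch=None, commit_message=None, create=True):
    """Upload an exported folder to the Hub (hypothetical helper)."""
    token = resolve_token(token)
    if token is None:
        raise RuntimeError("no token: set HUGGINGFACE_HUB_TOKEN or pass one explicitly")

    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    if create:
        # exist_ok=True makes repeated publishes to the same repo safe.
        api.create_repo(repo_id=repo_id, private=private, exist_ok=True)
    return api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        revision=branch,
        commit_message=commit_message,
    )
```

Calling `publish_folder("out/llama3-4bit", "your-username/llama3-4bit", private=True, commit_message="Add 4-bit quantized weights")` would correspond loosely to the Quickstart publish command; passing `create=False` plays the role of `--no-create` in this sketch.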
bench
Run standalone benchmarks on existing models/artifacts.
autopack bench <target> \
--backend hf [onnx] [gguf] \
[--prompt "Hello"] [--max-new-tokens 64] \
[--device auto] [--num-warmup 1] [--num-runs 3] \
[--trust-remote-code] [--llama-cli /path/to/llama-cli]
Notes:
- For HF, target can be a Hub id or local folder. For ONNX, pass the exported folder. For GGUF, pass a .gguf file or a folder containing one.
- ONNX benchmarking requires optimum[onnxruntime]. GGUF benchmarking requires llama-cli.
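The warmup/runs options and the "speedup vs bf16" figure can be pictured with a small backend-agnostic sketch. Everything here is illustrative: `measure_tokens_per_second` and `speedup_vs_baseline` are hypothetical names, and `generate` stands in for whatever backend call produces tokens and returns how many it generated. How autopack itself times each backend is not described in this document.

```python
import time


def measure_tokens_per_second(generate, max_new_tokens, num_warmup=1, num_runs=3):
    """Time `generate(max_new_tokens)` and return average tokens per second."""
    # Warmup runs are executed but not timed (caches, lazy initialization).
    for _ in range(num_warmup):
        generate(max_new_tokens)

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        produced = generate(max_new_tokens)  # must return the number of new tokens
        total_seconds += time.perf_counter() - start
        total_tokens += produced

    if total_seconds <= 0:
        raise RuntimeError("timed runs took no measurable time")
    return total_tokens / total_seconds


def speedup_vs_baseline(tokens_per_s, baseline="bf16"):
    """Express each variant's throughput relative to the baseline variant."""
    base = tokens_per_s[baseline]
    return {name: rate / base for name, rate in tokens_per_s.items()}
```

With measured rates collected per variant in a dict, `speedup_vs_baseline` yields the kind of relative figure the summary table reports, with the baseline itself at 1.0.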
Common Options
- --trust-remote-code: enable loading custom modeling code from Hub repos
- --revision: branch/tag/commit to load
- --device-map: set to cpu to force CPU; defaults to auto
- --dtype: compute dtype for non-INT8 layers (applies to HF exports)
- --prune: global magnitude pruning ratio across Linear layers (0..0.95)
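To make "global magnitude pruning" concrete: the ratio applies to all Linear weights pooled together, so the smallest weights overall are zeroed regardless of which layer they sit in. The sketch below shows the idea on plain Python lists. It is an illustration only (`global_magnitude_prune` is a hypothetical name, and autopack's actual implementation on model tensors is not shown in this document).

```python
def global_magnitude_prune(layers, ratio):
    """Zero the `ratio` fraction of weights with the smallest magnitude overall.

    `layers` maps a layer name to a flat list of weights. A new dict is returned;
    the input is left untouched.
    """
    if not 0.0 <= ratio <= 0.95:
        raise ValueError("ratio must be within 0..0.95")

    # Rank every weight across all layers by absolute value.
    ranked = sorted(
        (abs(w), name, i)
        for name, weights in layers.items()
        for i, w in enumerate(weights)
    )
    num_to_zero = int(len(ranked) * ratio)
    to_zero = {(name, i) for _, name, i in ranked[:num_to_zero]}

    return {
        name: [0.0 if (name, i) in to_zero else w for i, w in enumerate(weights)]
        for name, weights in layers.items()
    }


if __name__ == "__main__":
    layers = {"fc1": [0.1, -2.0, 0.5], "fc2": [-0.05, 3.0]}
    print(global_magnitude_prune(layers, 0.5))
```

Because the ranking is global, a layer with mostly small weights can lose a larger share of them than a layer with large weights, which is the main difference from per-layer pruning.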
Output Formats
- hf: Transformers checkpoint with tokenizer and config
- onnx: ONNX export using optimum[onnxruntime] for CausalLM
- gguf: llama.cpp GGUF via convert_hf_to_gguf.py and llama-quantize
GGUF Details
- Converter resolution order:
  1. --gguf-converter if provided
  2. $LLAMA_CPP_CONVERT env var
  3. Vendored script: third_party/llama.cpp/convert_hf_to_gguf.py
  4. ~/llama.cpp/convert_hf_to_gguf.py or ~/src/llama.cpp/convert_hf_to_gguf.py
- Quant presets: uppercase (e.g., Q4_K_M). If omitted, autopack generates Q4_K_M, Q5_K_M, Q8_0 by default.
- Isolation: by default, conversion runs in an isolated .venv inside the output dir. Disable with --gguf-no-isolation.
- Architecture checks: pass --gguf-force to bypass the basic architecture guard.
- Ensure llama-quantize is in PATH (typically in third_party/llama.cpp/build/bin).
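The converter resolution order listed above can be sketched as a simple first-match search. This is a hedged illustration, not autopack's code: `resolve_gguf_converter` is a hypothetical name, and falling through to the next candidate when an explicit path does not exist is an assumption (the real tool may report an error instead).

```python
import os
import shutil
from pathlib import Path

SCRIPT = "convert_hf_to_gguf.py"


def resolve_gguf_converter(cli_path=None, env=None, repo_root=".", home=None):
    """Return the first existing converter script in the documented order, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = [
        cli_path,                                          # 1. --gguf-converter
        env.get("LLAMA_CPP_CONVERT"),                      # 2. $LLAMA_CPP_CONVERT
        Path(repo_root) / "third_party/llama.cpp" / SCRIPT,  # 3. vendored script
        home / "llama.cpp" / SCRIPT,                       # 4. common checkout locations
        home / "src/llama.cpp" / SCRIPT,
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None


if __name__ == "__main__":
    print("converter:", resolve_gguf_converter())
    # shutil.which mirrors the PATH requirement for the quantizer binary.
    print("llama-quantize:", shutil.which("llama-quantize"))
```

Running the script is a quick way to check, before a long export, whether both the converter and `llama-quantize` are discoverable in the current environment.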
ONNX Details
- Requires: pip install 'optimum[onnxruntime]'
- Uses ORTModelForCausalLM; non-CausalLM models may not be supported in this version.
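As a rough idea of how an exported ONNX folder might be consumed, the sketch below loads it with `ORTModelForCausalLM` from `optimum.onnxruntime`. The function name is hypothetical, and it assumes the export folder also contains the tokenizer files, which this document does not confirm. The folder path is left as a parameter because the ONNX output layout is not described here.

```python
from pathlib import Path


def generate_with_onnx(export_dir, prompt, max_new_tokens=8):
    """Generate text from an ONNX export folder (hypothetical helper)."""
    folder = Path(export_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"ONNX export folder not found: {folder}")

    # Imported lazily: both packages are optional extras.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    # Assumption: tokenizer files were saved alongside the ONNX model.
    tokenizer = AutoTokenizer.from_pretrained(folder)
    model = ORTModelForCausalLM.from_pretrained(folder)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

If the tokenizer is not present in the export folder, loading it from the original model id instead should work the same way.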
Perplexity Evaluation
- --eval-dataset accepts dataset or dataset:config (e.g., wikitext-2-raw-v1)
- --eval-text-key controls which dataset column is used for text (default: text)
- Device selection is automatic (cuda if available, else cpu)
- Only CausalLM architectures are supported for perplexity computation
- Uses a bounded sample count and expects a text field in the dataset
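For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below applies that standard definition to a list of per-token log-probabilities; how autopack batches text, samples the dataset, or handles context windows is not covered by this document, so the function is only a generic illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each observed token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token behaves like a
    # uniform choice among 4 options.
    print(perplexity([math.log(0.25)] * 4))
```

Lower values mean the model finds the evaluation text less surprising, so comparing the quantized variants against bf16 shows how much quality each quantization costs.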
More Examples
CPU-friendly int8 dynamic with pruning:
autopack quantize meta-llama/Llama-3-8B \
--output-format hf --quantization int8-dynamic --prune 0.2 --device-map cpu
BF16 only (no quantization):
autopack quantize meta-llama/Llama-3-8B \
--output-format hf --quantization none --dtype bfloat16
Override GGUF presets:
autopack meta-llama/Llama-3-8B \
--output-format gguf --gguf-quant Q5_K_M Q8_0
Auto with benchmarking (reports Tokens/s and real speedup vs bf16):
autopack sshleifer/tiny-gpt2 --output-format hf
Hello World (Transformers on CPU):
pip install autopack-grn
autopack sshleifer/tiny-gpt2 --output-format hf
python - <<'PY'
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained('tiny-gpt2/bf16')
m = AutoModelForCausalLM.from_pretrained('tiny-gpt2/bf16', device_map='cpu')
ids = tok('Hello world', return_tensors='pt').input_ids
out = m.generate(ids, max_new_tokens=8)
print(tok.decode(out[0]))
PY
Hello World (GGUF with llama.cpp):
autopack sshleifer/tiny-gpt2 --output-format gguf
./third_party/llama.cpp/build/bin/llama-cli -m tiny-gpt2/gguf/model-Q4_K_M.gguf -p "Hello world" -n 16
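If the llama extra (llama-cpp-python) is installed, the same GGUF file can likely be run from Python instead of llama-cli. This is a minimal sketch under that assumption; `generate_with_gguf` is a hypothetical helper, and the model path is the one from the command above.

```python
from pathlib import Path


def generate_with_gguf(model_path, prompt, max_tokens=16):
    """Run a prompt through a GGUF model via llama-cpp-python (hypothetical helper)."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"GGUF file not found: {path}")

    # Imported lazily because llama-cpp-python is an optional extra.
    from llama_cpp import Llama

    llm = Llama(model_path=str(path), verbose=False)
    result = llm(prompt, max_tokens=max_tokens)
    # The completion dict follows an OpenAI-style layout.
    return result["choices"][0]["text"]


if __name__ == "__main__":
    print(generate_with_gguf("tiny-gpt2/gguf/model-Q4_K_M.gguf", "Hello world"))
```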
Vendored llama.cpp quick build
cd third_party/llama.cpp
cmake -S . -B build -DGGML_NATIVE=ON
cmake --build build -j
Troubleshooting
- llama-quantize not found: build llama.cpp and ensure build/bin is in PATH.
- BitsAndBytes on Windows: currently not installed by default; prefer CPU/int8-dynamic flows.
- Custom code prompt: pass --trust-remote-code to avoid the interactive confirmation.
Environment Variables
- HUGGINGFACE_HUB_TOKEN: token to publish to the Hub
- LLAMA_CPP_CONVERT: path to convert_hf_to_gguf.py
- PATH: should include the directory with llama-quantize
Q&A
FAQs
What does “auto” do?
Generates HF variants (4-bit, 8-bit, int8-dynamic, bf16) and prints a summary; GGUF/ONNX are opt-in.
What if I omit --gguf-quant?
autopack will create multiple useful presets by default (Q4_K_M, Q5_K_M, Q8_0).
License: Apache-2.0