CLI to quantize and release Hugging Face models in multiple formats

autopack

autopack makes your Hugging Face models easy to run, share, and ship. It quantizes once and exports to multiple runtimes, with sensible defaults and an automatic flow that produces a readable summary. It supports HF, ONNX, and GGUF (llama.cpp) formats and can publish to the Hugging Face Hub in one shot.

About · Requirements · Setup · Building Instructions · Running · Detailed Usage · Q&A

About The Project

What is autopack?

autopack is a CLI that helps you quantize and package Hugging Face models into multiple useful formats in a single pass, with an option to publish artifacts to the Hub.

Have a 120B LLM and want people (not corporations with clusters of B200s) to be able to run it on their 8 GB 2060? All you need to do is run:

autopack sentence-transformers/all-MiniLM-L6-v2

Why use it?

Fast: generate multiple variants in one command.
Practical: built on Transformers, bitsandbytes, ONNX, and llama.cpp.
Portable: CPU- and GPU-friendly artifacts, good defaults.

Requirements

Core

Python 3.9+
PyTorch, Transformers, Hugging Face Hub
Optional: bitsandbytes (4/8-bit), optimum[onnxruntime] (ONNX), llama.cpp (GGUF tools)

Notes

GGUF export requires a built llama.cpp and llama-quantize in PATH.
Set HUGGINGFACE_HUB_TOKEN to publish, or pass --token.
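
The two notes above can be checked before a long run. The following is a minimal sketch, not part of autopack: the function name preflight and its return keys are hypothetical, and it only looks for llama-quantize on PATH and for a token in --token or HUGGINGFACE_HUB_TOKEN.

```python
import os
import shutil
from typing import Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None,
              cli_token: Optional[str] = None) -> dict:
    """Report whether GGUF export and Hub publishing look possible.

    Hypothetical helper for illustration; autopack's own checks may differ.
    """
    env = os.environ if env is None else env
    # shutil.which searches the directories in PATH, like a shell would.
    quantize = shutil.which("llama-quantize", path=env.get("PATH"))
    # A token passed on the command line takes priority over the env var here.
    token = cli_token or env.get("HUGGINGFACE_HUB_TOKEN")
    return {
        "gguf_ready": quantize is not None,
        "llama_quantize": quantize,
        "can_publish": bool(token),
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

If gguf_ready is False, build llama.cpp and add its build/bin directory to PATH before requesting GGUF output.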

Setup

Install

pip install autopack-grn

Optional extras

# ONNX export support
pip install 'autopack-grn[onnx]'

# GGUF export helpers (converter deps)
pip install 'autopack-grn[gguf]'

# llama.cpp runtime bindings (llama-cpp-python)
pip install 'autopack-grn[llama]'

# Everything for llama.cpp functionality (GGUF export + runtime)
pip install 'autopack-grn[gguf,llama]'

Note: for GGUF and llama.cpp functionality you also need the llama.cpp tools (llama-quantize, llama-cli) available on your PATH. You can build the vendored copy and export PATH as shown in Vendored llama.cpp quick build.

From source (dev)

pip install -e .

# Optional extras while developing
pip install -e '.[onnx]'
pip install -e '.[gguf]'
pip install -e '.[llama]'
pip install -e '.[gguf,llama]'

Building Instructions

python -m build

Running the Application

Quickstart

autopack meta-llama/Llama-3-8B --output-format hf

Add ONNX and GGUF:

autopack meta-llama/Llama-3-8B --output-format hf onnx gguf --summary-json --skip-existing

GGUF only (with default presets Q4_K_M, Q5_K_M, Q8_0):

autopack meta-llama/Llama-3-8B --output-format gguf --skip-existing

Publish to Hub:

autopack publish out/llama3-4bit your-username/llama3-4bit --private \
  --commit-message "Add 4-bit quantized weights"

Detailed Usage

Commands Overview

auto

Run common HF quantization variants and optional ONNX/GGUF exports in one go, with a summary table and generated README in the output folder.

autopack [auto] <model_id_or_path> [-o <out_dir>] \
  --output-format hf [onnx] [gguf] \
  [--eval-dataset <dataset>[::<config>]] \
  [--revision <rev>] [--trust-remote-code] [--device auto|cpu|cuda] \
  [--no-bench] [--bench-prompt "..."] [--bench-max-new-tokens 16] \
  [--bench-warmup 0] [--bench-runs 1]

Key points:

Default HF variants: bnb-4bit, bnb-8bit, int8-dynamic, bf16
Add ONNX and/or GGUF via --output-format
If -o/--output-dir is omitted, the output folder defaults to the last path segment of the model id/path (e.g., user/model -> model).
Benchmarking is enabled by default in auto; use --no-bench to disable.
If --eval-dataset is provided, perplexity is computed for each HF variant.
If benchmarking is enabled, autopack measures actual Tokens/s per backend and replaces heuristic speedups with real Tokens/s and speedup vs bf16 in the summary and the generated README.
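
Two of the key points above reduce to simple arithmetic. The sketch below shows one way to derive the default output folder from a model id and to turn measured Tokens/s into a speedup relative to bf16. The function names and the sample numbers are hypothetical; autopack's internals may handle edge cases differently.

```python
from pathlib import PurePosixPath


def default_output_dir(model_id_or_path: str) -> str:
    """Return the last path segment, e.g. 'user/model' -> 'model'."""
    return PurePosixPath(model_id_or_path.rstrip("/")).name


def speedup_vs_baseline(tokens_per_s: dict, baseline: str = "bf16") -> dict:
    """Divide each variant's measured Tokens/s by the baseline's Tokens/s."""
    base = tokens_per_s[baseline]
    return {name: round(tps / base, 2) for name, tps in tokens_per_s.items()}


if __name__ == "__main__":
    print(default_output_dir("sshleifer/tiny-gpt2"))
    # Made-up measurements, only to show the shape of the calculation.
    print(speedup_vs_baseline({"bf16": 20.0, "bnb-4bit": 30.0}))
```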

quantize

Produce specific formats with a chosen quantization strategy.

autopack quantize <model_id_or_path> [-o <out_dir>] \
  --output-format hf [onnx] [gguf] \
  [--quantization bnb-4bit|bnb-8bit|int8-dynamic|none] \
  [--dtype auto|float16|bfloat16|float32] \
  [--device-map auto|cpu] [--prune <0..0.95>] \
  [--revision <rev>] [--trust-remote-code]

publish

Upload an exported model folder to the Hugging Face Hub.

autopack publish <folder> <user_or_org/repo> \
  [--private] [--token $HUGGINGFACE_HUB_TOKEN] \
  [--branch <rev>] [--commit-message "..."] [--no-create]

bench

Run standalone benchmarks on existing models/artifacts.

autopack bench <target> \
  --backend hf [onnx] [gguf] \
  [--prompt "Hello"] [--max-new-tokens 64] \
  [--device auto] [--num-warmup 1] [--num-runs 3] \
  [--trust-remote-code] [--llama-cli /path/to/llama-cli]

Notes:

For HF, target can be a Hub id or local folder. For ONNX, pass the exported folder. For GGUF, pass a .gguf file or a folder containing one.
ONNX benchmarking requires optimum[onnxruntime]. GGUF benchmarking requires llama-cli.
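
The --num-warmup and --num-runs options follow a common benchmarking pattern: run a few untimed generations first, then average over timed ones. A minimal sketch of that pattern, assuming a caller-supplied generate callable that returns the number of new tokens it produced (both the function name and that contract are hypothetical, not autopack's API):

```python
import time
from statistics import mean
from typing import Callable


def measure_tokens_per_s(generate: Callable[[], int],
                         num_warmup: int = 1,
                         num_runs: int = 3) -> float:
    """Average tokens/s over num_runs timed calls, after num_warmup untimed calls."""
    for _ in range(num_warmup):
        generate()  # warmup: lets caches and lazy initialisation settle
    rates = []
    for _ in range(num_runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(n_tokens / elapsed)
    return mean(rates)
```

Warmup matters most on GPU backends, where the first call often includes one-time setup cost that would otherwise skew the average.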

Common Options

--trust-remote-code: enable loading custom modeling code from Hub repos
--revision: branch/tag/commit to load
--device-map: set to cpu to force CPU; defaults to auto
--dtype: compute dtype for non-INT8 layers (applies to HF exports)
--prune: global magnitude pruning ratio across Linear layers (0..0.95)
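
To make the --prune option concrete: global magnitude pruning ranks weights by absolute value across all layers at once and zeroes the smallest fraction. The sketch below illustrates the idea on plain Python lists. It is an assumption about the general technique, not autopack's implementation, and the function name is hypothetical.

```python
from typing import Dict, List


def global_magnitude_prune(layers: Dict[str, List[float]],
                           ratio: float) -> Dict[str, List[float]]:
    """Zero the `ratio` fraction of weights with the smallest |w|, ranked across all layers."""
    if not 0.0 <= ratio <= 0.95:
        raise ValueError("ratio must be in [0, 0.95]")
    # One ranking over every layer is what makes the pruning "global".
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(len(magnitudes) * ratio)
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    threshold = magnitudes[n_prune - 1]
    # Ties at the threshold are pruned too, so slightly more than `ratio` may be zeroed.
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }
```

Because the threshold is shared, a layer whose weights are mostly small can lose a larger share of them than a layer with large weights.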

Output Formats

hf: Transformers checkpoint with tokenizer and config
onnx: ONNX export using optimum[onnxruntime] for CausalLM
gguf: llama.cpp GGUF via convert_hf_to_gguf.py and llama-quantize

GGUF Details

Converter resolution order:
1. --gguf-converter if provided
2. $LLAMA_CPP_CONVERT env var
3. Vendored script: third_party/llama.cpp/convert_hf_to_gguf.py
4. ~/llama.cpp/convert_hf_to_gguf.py or ~/src/llama.cpp/convert_hf_to_gguf.py
Quant presets: uppercase (e.g., Q4_K_M). If omitted, autopack generates Q4_K_M, Q5_K_M, Q8_0 by default.
Isolation: by default, conversion runs in an isolated .venv inside the output dir. Disable with --gguf-no-isolation.
Architecture checks: pass --gguf-force to bypass the basic architecture guard.
Ensure llama-quantize is in PATH (typically in third_party/llama.cpp/build/bin).
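
The converter resolution order above can be read as a first-match search. The following sketch mirrors the four steps. The function name, its parameters, and the choice to skip candidates that do not exist on disk are assumptions made for illustration; autopack may instead fail on an invalid explicit path.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

CONVERTER_NAME = "convert_hf_to_gguf.py"


def resolve_converter(cli_path: Optional[str] = None,
                      env: Optional[Mapping[str, str]] = None,
                      repo_root: Path = Path("."),
                      home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing converter script, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    candidates = [
        cli_path,                                                   # 1. --gguf-converter
        env.get("LLAMA_CPP_CONVERT"),                               # 2. environment variable
        repo_root / "third_party" / "llama.cpp" / CONVERTER_NAME,   # 3. vendored script
        home / "llama.cpp" / CONVERTER_NAME,                        # 4. home-directory fallbacks
        home / "src" / "llama.cpp" / CONVERTER_NAME,
    ]
    for candidate in candidates:
        # Unset options are None or empty and are skipped.
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None
```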

ONNX Details

Requires: pip install 'optimum[onnxruntime]'
Uses ORTModelForCausalLM; non-CausalLM models may not be supported in this version.

Perplexity Evaluation

--eval-dataset accepts dataset or dataset::config, matching the <dataset>[::<config>] form in the auto usage (e.g., wikitext-2-raw-v1)
--eval-text-key controls which dataset column is used for text (default: text)
Device selection is automatic (cuda if available, else cpu)
Only CausalLM architectures are supported for perplexity computation
Uses a bounded sample count and expects a text field in the dataset
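
For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch of the standard formula, assuming you already have natural-log probabilities for each evaluated token (the function name is hypothetical, and how autopack batches or samples the dataset is not shown):

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4:
    # it is as uncertain as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 8))
```

Comparing this value across the HF variants shows how much quality each quantization scheme gives up relative to bf16.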

More Examples

CPU-friendly int8 dynamic with pruning:

autopack quantize meta-llama/Llama-3-8B \
  --output-format hf --quantization int8-dynamic --prune 0.2 --device-map cpu

BF16 only (no quantization):

autopack quantize meta-llama/Llama-3-8B \
  --output-format hf --quantization none --dtype bfloat16

Override GGUF presets:

autopack meta-llama/Llama-3-8B \
  --output-format gguf --gguf-quant Q5_K_M Q8_0

Auto with benchmarking (reports Tokens/s and real speedup vs bf16):

autopack sshleifer/tiny-gpt2 --output-format hf

Hello World (Transformers on CPU):

pip install autopack-grn
autopack sshleifer/tiny-gpt2 --output-format hf
python - <<'PY'
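from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the bf16 variant from the output folder created by the command above
tok = AutoTokenizer.from_pretrained('tiny-gpt2/bf16')
m = AutoModelForCausalLM.from_pretrained('tiny-gpt2/bf16', device_map='cpu')
# Tokenize the prompt and generate up to 8 new tokens
ids = tok('Hello world', return_tensors='pt').input_ids
out = m.generate(ids, max_new_tokens=8)
# Decode the full sequence (prompt plus generated tokens)
print(tok.decode(out[0]))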
PY

Hello World (GGUF with llama.cpp):

autopack sshleifer/tiny-gpt2 --output-format gguf
./third_party/llama.cpp/build/bin/llama-cli -m tiny-gpt2/gguf/model-Q4_K_M.gguf -p "Hello world" -n 16

Vendored llama.cpp quick build

cd third_party/llama.cpp
cmake -S . -B build -DGGML_NATIVE=ON
cmake --build build -j
# Make llama-quantize and llama-cli discoverable
export PATH="$PWD/build/bin:$PATH"

Troubleshooting

llama-quantize not found: build llama.cpp and ensure build/bin is in PATH.
BitsAndBytes on Windows: currently not installed by default; prefer CPU/int8-dynamic flows.
Custom code prompt: pass --trust-remote-code to avoid the interactive confirmation.

Environment Variables

HUGGINGFACE_HUB_TOKEN: token to publish to the Hub
LLAMA_CPP_CONVERT: path to convert_hf_to_gguf.py
PATH: should include the directory with llama-quantize

Q&A

FAQs

What does “auto” do?

Generates HF variants (4-bit, 8-bit, int8-dynamic, bf16) and prints a summary; GGUF/ONNX are opt-in.

What if I omit `--gguf-quant`?

autopack will create multiple useful presets by default (Q4_K_M, Q5_K_M, Q8_0).

License: Apache-2.0

Release history

0.1.5.0: Oct 3, 2025
0.1.4.1: Sep 15, 2025
0.1.4.0: Sep 12, 2025
0.1.3.2: Sep 11, 2025
0.1.3.1: Sep 10, 2025 (this version)
0.1.3: Sep 9, 2025
0.1.2: Sep 8, 2025
0.1.1: Sep 8, 2025

Download files

Source Distribution

autopack_grn-0.1.3.1.tar.gz (124.1 kB)

Uploaded Sep 10, 2025 Source

Built Distribution

autopack_grn-0.1.3.1-py3-none-any.whl (27.2 kB)

Uploaded Sep 10, 2025 Python 3

File details

Details for the file autopack_grn-0.1.3.1.tar.gz.

File metadata

Download URL: autopack_grn-0.1.3.1.tar.gz
Upload date: Sep 10, 2025
Size: 124.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autopack_grn-0.1.3.1.tar.gz
Algorithm	Hash digest
SHA256	`cbee61cf2dfca6c109afcf0b53525cce02b2cc8c625b7c1af68fa95bd7e93026`
MD5	`0bd25ce89c0aca96097cc668b44c76f2`
BLAKE2b-256	`b10a524ac1d1676d61e2127c1642e0c7776b47a4320616e7a42cfbf92b115618`

Provenance

The following attestation bundles were made for autopack_grn-0.1.3.1.tar.gz:

Publisher: python-publish.yml on GranulaVision/autopack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autopack_grn-0.1.3.1.tar.gz
- Subject digest: cbee61cf2dfca6c109afcf0b53525cce02b2cc8c625b7c1af68fa95bd7e93026
- Sigstore transparency entry: 496489876
- Sigstore integration time: Sep 10, 2025
Source repository:
- Permalink: GranulaVision/autopack@f6bcf356a8a7f7751c291f085e8d97c74d1b3e51
- Branch / Tag: refs/tags/v0.1.3.1
- Owner: https://github.com/GranulaVision
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f6bcf356a8a7f7751c291f085e8d97c74d1b3e51
- Trigger Event: release

File details

Details for the file autopack_grn-0.1.3.1-py3-none-any.whl.

File metadata

Download URL: autopack_grn-0.1.3.1-py3-none-any.whl
Upload date: Sep 10, 2025
Size: 27.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for autopack_grn-0.1.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8ab6f30cb9c535712e666089023435c47ba1ccb9dd7228b0a5c131bbae684ce5`
MD5	`a45e59018faf21f00e7ffa3060bb9350`
BLAKE2b-256	`6c9d8704168ca50f0f1dd22955654d6956b0e358046dc7541e22b3c82692db98`

Provenance

The following attestation bundles were made for autopack_grn-0.1.3.1-py3-none-any.whl:

Publisher: python-publish.yml on GranulaVision/autopack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autopack_grn-0.1.3.1-py3-none-any.whl
- Subject digest: 8ab6f30cb9c535712e666089023435c47ba1ccb9dd7228b0a5c131bbae684ce5
- Sigstore transparency entry: 496489893
- Sigstore integration time: Sep 10, 2025
Source repository:
- Permalink: GranulaVision/autopack@f6bcf356a8a7f7751c291f085e8d97c74d1b3e51
- Branch / Tag: refs/tags/v0.1.3.1
- Owner: https://github.com/GranulaVision
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f6bcf356a8a7f7751c291f085e8d97c74d1b3e51
- Trigger Event: release
