EdgeForge
Developer-friendly CLI for quantizing, pruning, and exporting Hugging Face models for edge deployment.
Quantize, prune, and deploy Hugging Face LLMs to Google AI Edge Gallery (LiteRT / TFLite) from your terminal.
EdgeForge is a developer-friendly CLI toolkit that downloads Hugging Face transformer models, prepares them for compression, applies quantization and pruning workflows, and stages export artifacts for:
- LiteRT `.task` bundles for Google AI Edge Gallery
- TFLite `.tflite` models for mobile inference
- GGUF `.gguf` workflows for `llama.cpp`
Documentation
For the full practical guide, see:
USAGE_AND_SUPPORT.md
That guide covers:
- supported workflows
- Windows vs Linux / WSL / Colab guidance
- dependency compatibility
- GGUF vs TFLite vs LiteRT recommendations
- mobile size limits
- common error explanations and fixes
Install
pip install edgeforge
pip install "edgeforge[gptq]"
pip install "edgeforge[awq]"
pip install "edgeforge[litert,tflite]"
pip install "edgeforge[gguf]"
pip install "edgeforge[all]"
Recommended install strategy:
- use `.[torch,gguf]` for GGUF workflows
- use `.[torch,tflite,litert]` for TFLite / LiteRT workflows
- avoid mixing every backend in one notebook unless you really need to
For this workspace:
.\enve\python.exe -m pip install -e .
Examples:
.\enve\python.exe -m pip install -e ".[torch,gguf]"
.\enve\python.exe -m pip install -e ".[torch,tflite,litert]"
CLI overview
edgeforge auth login
edgeforge auth status
edgeforge download google/gemma-2b-it
edgeforge quantize google/gemma-2b-it --method gptq --bits 4
edgeforge prune ./models/gemma-2b-it-gptq --method magnitude --sparsity 0.3
edgeforge convert ./models/gemma-2b-it-gptq-pruned --format litert
edgeforge run google/gemma-2b-it --quant-method awq --bits 4 --export-format gguf
edgeforge chat ./models/gemma-2b-it.gguf
Step By Step For A New Model
Use this workflow when you want to process a new Hugging Face model from scratch.
1. Activate the environment
C:\Quanitization\enve\Scripts\activate
cd C:\Quanitization
2. Authenticate with Hugging Face
Only needed for gated or private models.
.\enve\Scripts\edgeforge.exe auth login --token hf_xxx
3. Download the model
Public model example:
.\enve\Scripts\edgeforge.exe download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --no-auth
Gated model example:
.\enve\Scripts\edgeforge.exe download google/gemma-2b-it
4. Quantize the model
For the most reliable GGUF workflow, use fp16 first.
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method fp16
Example:
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method fp16
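Conceptually, the fp16 step casts the model's float32 weights to half precision, halving storage at the cost of a small rounding error. A minimal numpy sketch of the idea (illustration only; the actual EdgeForge implementation loads and saves the full Hugging Face model):

```python
import numpy as np

# A stand-in for one float32 weight tensor from the model.
w32 = np.random.randn(256, 256).astype(np.float32)

# "fp16 quantization" is, at its core, a precision cast.
w16 = w32.astype(np.float16)

# Half the bytes, with a small per-element rounding error.
print(w16.nbytes / w32.nbytes)  # 0.5
max_err = np.max(np.abs(w32 - w16.astype(np.float32)))
```

This is why fp16 is the safest first step: it changes precision but not the model's structure, so downstream converters rarely choke on it.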
5. Optional pruning
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --method magnitude --sparsity 0.1
Example:
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\TinyLlama--TinyLlama-1.1B-Chat-v1.0-fp16-16bit" --method magnitude --sparsity 0.1
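Magnitude pruning zeroes the weights with the smallest absolute values until the requested sparsity is reached. A hedged numpy sketch of what `--method magnitude --sparsity 0.1` does conceptually (the real pruner operates on the saved Hugging Face artifact, not raw arrays):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights in a tensor."""
    k = int(w.size * sparsity)  # number of weights to drop
    if k == 0:
        return w.copy()
    # Threshold is the k-th smallest absolute value across the tensor.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(128, 128)
pruned = magnitude_prune(w, 0.1)
achieved_sparsity = np.mean(pruned == 0.0)  # ~0.1
```

Starting at a low sparsity like 0.1 keeps quality loss small; larger values prune more aggressively and need evaluation before deployment.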
6. Export to GGUF
Without pruning:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --format gguf
With pruning:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit-pruned-magnitude-10" --format gguf
7. Find the GGUF file
dir "C:\Users\vicky\.edgeforge\exports\MODEL_EXPORT_FOLDER\gguf" /s
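Once you have located the file, a quick sanity check is useful: every valid GGUF file begins with the 4-byte magic `GGUF`. A small script (the path handling is generic; swap in your own export path):

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # magic bytes at the start of every GGUF file

def looks_like_gguf(path: Path) -> bool:
    """Return True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

If this returns False for a `.gguf` file, the export was interrupted or produced a placeholder; remove the folder and rerun the conversion.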
8. Run chat with llama.cpp
.\enve\Scripts\edgeforge.exe chat "FULL_PATH_TO_MODEL.gguf" --backend gguf --executable "C:\Users\vicky\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\llama-cli.exe"
Recommended order
- Download the model.
- Quantize with `fp16`.
- Skip pruning for the first test.
- Export to `gguf`.
- Test chat with `llama.cpp`.
- Add pruning only after the base export works.
Full example
.\enve\Scripts\edgeforge.exe download microsoft/phi-2 --no-auth
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\microsoft--phi-2" --method fp16
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\microsoft--phi-2-fp16-16bit" --format gguf
dir "C:\Users\vicky\.edgeforge\exports\microsoft--phi-2-fp16-16bit\gguf" /s
INT8 Workflow
Use this path when you want an INT8-compressed Hugging Face artifact first.
1. Quantize to INT8
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method int8
Example:
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method int8
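As a rough intuition, int8 quantization maps each float tensor onto 256 integer levels via a per-tensor scale. A minimal symmetric-quantization sketch (EdgeForge's int8 path goes through its quantizer backend, not this code):

```python
import numpy as np

def int8_quantize(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ q * scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = int8_quantize(w)
w_hat = q.astype(np.float32) * scale

# Rounding error is bounded by half a quantization step.
max_err = np.max(np.abs(w - w_hat))
```

This quarters the weight storage relative to float32, which is why int8 artifacts are attractive for local experiments even when GGUF export later starts from fp16.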
2. Optional pruning
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-int8-8bit" --method magnitude --sparsity 0.1
3. Important note for GGUF export
- INT8 artifacts are useful for local Hugging Face style workflows.
- For GGUF export, `fp16` is usually the safer source format.
- If GGUF conversion from INT8 fails, convert from the `fp16` artifact instead.
LiteRT Workflow
Use this path only when you have real LiteRT/TFLite conversion dependencies installed.
1. Install conversion backends
.\enve\python.exe -m pip install ai-edge-torch tensorflow
2. Export to LiteRT
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert
3. Verify the output
- A real LiteRT `.task` file should contain a non-trivial `model.tflite`.
- If the bundle contains a tiny placeholder `model.tflite`, the real backend was not used.
- For many general Hugging Face LLMs, LiteRT conversion is still model-dependent and not guaranteed.
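One way to automate the placeholder check, assuming the `.task` bundle is a standard zip archive containing a `model.tflite` entry (true for AI Edge task bundles; the 1 KB threshold is an assumption you can tighten):

```python
import zipfile

PLACEHOLDER_LIMIT = 1024  # bytes; any real model is far larger

def has_real_tflite(task_path: str) -> bool:
    """Check that model.tflite inside a .task bundle is non-trivial."""
    with zipfile.ZipFile(task_path) as bundle:
        for info in bundle.infolist():
            if info.filename.endswith("model.tflite"):
                return info.file_size > PLACEHOLDER_LIMIT
    return False  # no model.tflite entry at all
```

If this returns False, install the real conversion backends (`ai-edge-torch`, `tensorflow`) and rerun the convert step.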
4. Best candidates
- Gemma-family models
- Phi-family models
- Smaller transformer models with simpler operator coverage
Common Errors And Fixes
Gated repo error
Error:
GatedRepoError / 403 Client Error
Fix:
- Request access on Hugging Face.
- Or test with a public model first.
bitsandbytes not installed
Error:
bitsandbytes is not installed or not supported
Fix:
.\enve\python.exe -m pip install bitsandbytes
If that still fails on Windows, use fp16 instead.
sentencepiece missing during GGUF export
Error:
ModuleNotFoundError: No module named 'sentencepiece'
Fix:
.\enve\python.exe -m pip install sentencepiece
GGUF folder exists but no .gguf file
Fix:
- Remove the broken export folder.
- Rerun conversion with the latest EdgeForge code.
- Prefer `fp16` as the input artifact for GGUF export.
LiteRT .task exists but is not real
Signs:
- `model.tflite` inside the bundle is only a few hundred bytes
- `ai-edge-torch` and `tensorflow` are not installed
Fix:
.\enve\python.exe -m pip install ai-edge-torch tensorflow
Then rerun:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert
llama-cli.exe not found
Error:
FileNotFoundError: [WinError 2]
Fix:
- Use the real path to `llama-cli.exe`.
- Do not leave `C:\path\to\llama-cli.exe` as a placeholder.
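A small helper can resolve the executable before you pass it to `--executable`. This is a sketch, not part of EdgeForge; `find_llama_cli` is a hypothetical name, and it only checks `PATH` plus any extra directories you supply (such as the WinGet packages folder):

```python
import shutil
from pathlib import Path

def find_llama_cli(extra_dirs=()):
    """Return a usable llama-cli path from PATH or extra directories, else None."""
    found = shutil.which("llama-cli") or shutil.which("llama-cli.exe")
    if found:
        return found
    for d in extra_dirs:
        for name in ("llama-cli.exe", "llama-cli"):
            candidate = Path(d) / name
            if candidate.is_file():
                return str(candidate)
    return None
```

Resolving the path up front avoids the `FileNotFoundError: [WinError 2]` failure at chat time.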
Current backend status
- `fp16` quantization is implemented with real Hugging Face model loading and saving.
- `dynamic` quantization uses real PyTorch dynamic quantization for `Linear` layers.
- `gptq` and `awq` are wired through `llmcompressor.oneshot()` with generated recipes.
- `gguf` export auto-detects `auto-round` plus `gguf` and uses the Python export path when available.
- `tflite` and `litert` use a real `ai-edge-torch` conversion path when that package is installed; otherwise EdgeForge writes an explicit fallback artifact instead of pretending conversion succeeded.
Recommended Usage Summary
- For the most reliable results, use a text-only model and export to `gguf`.
- For mobile experiments, use smaller models in the 1B to 3B range.
- For Google AI Edge Gallery, prefer Linux / WSL / Colab over native Windows.
- For larger 7B+ models, prefer GGUF over TFLite / LiteRT.
Package layout
src/edgeforge/
__init__.py
__main__.py
auth.py
chat.py
cli.py
config.py
converter.py
downloader.py
pipeline.py
pruner.py
quantizer.py
utils.py
SKILL.md
tests/
Reality check for LiteRT
EdgeForge is intentionally honest about deployment constraints:
- LiteRT export is best for models and operator sets that are already compatible with Google AI Edge tooling.
- General Hugging Face decoder-only LLMs often need model-family-specific conversion work before they become valid `.task` or `.tflite` artifacts.
- GGUF plus `llama.cpp` remains the most practical local runtime for many larger LLMs.
AWQ note
- `AutoAWQ` is deprecated upstream.
- EdgeForge now treats `llmcompressor` as the recommended AWQ dependency path.
- `llmcompressor` is documented upstream with Linux as the recommended environment for GPU workflows.
- On Windows, very large AWQ workflows may still be less reliable than Linux GPU environments.
The current project provides a strong orchestration layer, clear manifests, conversion plans, and extension points instead of pretending every Hugging Face model can be converted to LiteRT in one generic step.
Development
pip install -e ".[dev]"
pytest tests/ -v
python -m build