EdgeForge

Quantize, prune, and deploy Hugging Face LLMs to Google AI Edge Gallery (LiteRT / TFLite) from your terminal.

EdgeForge is a developer-friendly CLI toolkit that downloads Hugging Face transformer models, applies quantization and pruning workflows, and stages export artifacts for:

  • LiteRT .task for Google AI Edge Gallery
  • TFLite .tflite for mobile inference
  • GGUF .gguf workflows for llama.cpp

Documentation

For the full practical guide, see:

  • USAGE_AND_SUPPORT.md

That guide covers:

  • supported workflows
  • Windows vs Linux / WSL / Colab guidance
  • dependency compatibility
  • GGUF vs TFLite vs LiteRT recommendations
  • mobile size limits
  • common error explanations and fixes

Install

pip install edgeforge                      # core CLI only
pip install "edgeforge[gptq]"              # add GPTQ quantization support
pip install "edgeforge[awq]"               # add AWQ quantization support
pip install "edgeforge[litert,tflite]"     # add LiteRT / TFLite export support
pip install "edgeforge[gguf]"              # add GGUF export support
pip install "edgeforge[all]"               # everything (heavy; see the note below)

Recommended install strategy:

  • use .[torch,gguf] for GGUF workflows
  • use .[torch,tflite,litert] for TFLite / LiteRT workflows
  • avoid mixing every backend in one notebook unless you really need to

For this workspace:

.\enve\python.exe -m pip install -e .

Examples:

.\enve\python.exe -m pip install -e ".[torch,gguf]"
.\enve\python.exe -m pip install -e ".[torch,tflite,litert]"

CLI overview

edgeforge auth login                                  # store a Hugging Face token
edgeforge auth status                                 # check the stored credentials
edgeforge download google/gemma-2b-it                 # fetch a model locally
edgeforge quantize google/gemma-2b-it --method gptq --bits 4
edgeforge prune ./models/gemma-2b-it-gptq --method magnitude --sparsity 0.3
edgeforge convert ./models/gemma-2b-it-gptq-pruned --format litert
edgeforge run google/gemma-2b-it --quant-method awq --bits 4 --export-format gguf   # end-to-end pipeline
edgeforge chat ./models/gemma-2b-it.gguf              # interactive chat with an exported model

Step By Step For A New Model

Use this workflow when you want to process a new Hugging Face model from scratch.

1. Activate the environment

C:\Quanitization\enve\Scripts\activate
cd C:\Quanitization

2. Authenticate with Hugging Face

Only needed for gated or private models.

.\enve\Scripts\edgeforge.exe auth login --token hf_xxx

3. Download the model

Public model example:

.\enve\Scripts\edgeforge.exe download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --no-auth

Gated model example:

.\enve\Scripts\edgeforge.exe download google/gemma-2b-it

4. Quantize the model

For the most reliable GGUF workflow, use fp16 first.

.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method fp16

Example:

.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method fp16
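What the fp16 step does to each weight can be sketched in pure Python: every float32 value is rounded to IEEE 754 half precision. This is only an illustration of the per-weight rounding (the function name is ours, and real tools operate tensor-wide, not per scalar):

```python
import struct

def to_fp16_and_back(value: float) -> float:
    """Round-trip one float through IEEE 754 half precision (binary16).

    Illustrates the rounding an fp16 quantization pass applies to
    each weight; struct's 'e' format is the binary16 encoding.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]

w = 0.123456789
w16 = to_fp16_and_back(w)
# fp16 keeps roughly 3 significant decimal digits, so w16 is close
# to w but not identical.
```

Because the rounding error is tiny relative to typical weight magnitudes, fp16 is usually a near-lossless first step before further compression.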

5. Optional pruning

.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --method magnitude --sparsity 0.1

Example:

.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\TinyLlama--TinyLlama-1.1B-Chat-v1.0-fp16-16bit" --method magnitude --sparsity 0.1
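The idea behind --method magnitude with --sparsity 0.1 is to zero out the 10% of weights with the smallest absolute value. A pure-Python sketch of that idea (illustrative only, not EdgeForge's actual tensor-based implementation):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute value -- the core of magnitude pruning."""
    if not 0 <= sparsity < 1:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(len(weights) * sparsity)              # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    # Keep weights strictly above the threshold; ties at the threshold
    # are pruned too, which can remove slightly more than k entries.
    return [w if abs(w) > threshold else 0.0 for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.08, 0.6, 0.1]
pruned = magnitude_prune(w, 0.3)   # the 3 smallest-magnitude entries become 0.0
```

Sparse weights only shrink file size if the export format stores them sparsely or a later compression pass exploits the zeros, which is one reason to verify the base export before adding pruning.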

6. Export to GGUF

Without pruning:

.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --format gguf

With pruning:

.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit-pruned-magnitude-10" --format gguf

7. Find the GGUF file

dir "C:\Users\vicky\.edgeforge\exports\MODEL_EXPORT_FOLDER\gguf" /s
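A cross-platform alternative to the dir command is a short pathlib script that lists .gguf files newest first and skips zero-byte placeholders (the export path below is the same example path used in this guide):

```python
from pathlib import Path

def find_gguf_files(export_root):
    """Recursively list non-empty .gguf files under an export folder,
    newest first."""
    root = Path(export_root)
    files = [p for p in root.rglob("*.gguf") if p.stat().st_size > 0]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)

# Example:
# for f in find_gguf_files(r"C:\Users\vicky\.edgeforge\exports"):
#     print(f, f.stat().st_size, "bytes")
```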

8. Run chat with llama.cpp

.\enve\Scripts\edgeforge.exe chat "FULL_PATH_TO_MODEL.gguf" --backend gguf --executable "C:\Users\vicky\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\llama-cli.exe"

Recommended order

  1. Download the model.
  2. Quantize with fp16.
  3. Skip pruning for the first test.
  4. Export to gguf.
  5. Test chat with llama.cpp.
  6. Add pruning only after the base export works.
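The recommended order can also be driven from Python via subprocess; a sketch using the commands documented above (the helper names are ours, and the directory arguments must be the real paths from steps 3 and 4):

```python
import subprocess

EDGEFORGE = r".\enve\Scripts\edgeforge.exe"   # workspace path used in this guide

def first_run_commands(model_id, model_dir, artifact_dir):
    """Command sequence for the recommended first pass:
    download -> fp16 quantize -> GGUF export, with pruning deferred."""
    return [
        [EDGEFORGE, "download", model_id, "--no-auth"],
        [EDGEFORGE, "quantize", model_dir, "--method", "fp16"],
        [EDGEFORGE, "convert", artifact_dir, "--format", "gguf"],
    ]

def run_all(commands):
    for cmd in commands:
        subprocess.run(cmd, check=True)   # check=True stops at the first failure
```

Stopping on the first failed step matters here: a convert run against a missing or broken artifact would otherwise produce a misleading half-empty export folder.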

Full example

.\enve\Scripts\edgeforge.exe download microsoft/phi-2 --no-auth
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\microsoft--phi-2" --method fp16
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\microsoft--phi-2-fp16-16bit" --format gguf
dir "C:\Users\vicky\.edgeforge\exports\microsoft--phi-2-fp16-16bit\gguf" /s

INT8 Workflow

Use this path when you want an INT8-compressed Hugging Face artifact first.

1. Quantize to INT8

.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method int8

Example:

.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method int8

2. Optional pruning

.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-int8-8bit" --method magnitude --sparsity 0.1

3. Important note for GGUF export

  • INT8 artifacts are useful for local Hugging Face style workflows.
  • For GGUF export, fp16 is usually the safer source format.
  • If GGUF conversion from INT8 fails, convert from the fp16 artifact instead.
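Why INT8 is lossier than fp16 can be sketched with symmetric absmax quantization, one common INT8 scheme: every weight is mapped to an integer in [-127, 127] plus a shared scale. This is an illustration of the scheme, not EdgeForge's exact implementation:

```python
def quantize_int8(weights):
    """Symmetric absmax INT8 quantization: floats -> ints in [-127, 127]
    plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0:
        return [0] * len(weights), 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

w = [0.9, -0.05, 0.4, 0.01]
q, s = quantize_int8(w)
restored = dequantize_int8(q, s)
# Each restored weight differs from the original by at most scale / 2.
```

The bounded-but-coarser error (at most half a quantization step per weight) is why an fp16 artifact is usually the safer source for a further GGUF conversion.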

LiteRT Workflow

Use this path only when you have real LiteRT/TFLite conversion dependencies installed.

1. Install conversion backends

.\enve\python.exe -m pip install ai-edge-torch tensorflow

2. Export to LiteRT

.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert

3. Verify the output

  • A real LiteRT .task file should contain a non-trivial model.tflite.
  • If the bundle contains a tiny placeholder model.tflite, the real backend was not used.
  • For many general Hugging Face LLMs, LiteRT conversion is still model-dependent and not guaranteed.
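The placeholder check above can be automated. The sketch below assumes the .task bundle is a zip archive containing model.tflite (true for MediaPipe-style task bundles, but an assumption here), and the size threshold is an arbitrary heuristic, not an official limit:

```python
import zipfile

def looks_like_real_task(task_path, min_tflite_bytes=100_000):
    """Heuristic check that a .task bundle holds a non-trivial model.

    Assumes the bundle is a zip archive with a .tflite member inside;
    min_tflite_bytes is a rough guess at 'non-trivial'.
    """
    if not zipfile.is_zipfile(task_path):
        return False
    with zipfile.ZipFile(task_path) as zf:
        for info in zf.infolist():
            if info.filename.endswith(".tflite"):
                return info.file_size >= min_tflite_bytes
    return False
```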

4. Best candidates

  • Gemma-family models
  • Phi-family models
  • Smaller transformer models with simpler operator coverage

Common Errors And Fixes

Gated repo error

Error:

GatedRepoError / 403 Client Error

Fix:

  • Request access on Hugging Face.
  • Or test with a public model first.

bitsandbytes not installed

Error:

bitsandbytes is not installed or not supported

Fix:

.\enve\python.exe -m pip install bitsandbytes

If that still fails on Windows, use fp16 instead.

sentencepiece missing during GGUF export

Error:

ModuleNotFoundError: No module named 'sentencepiece'

Fix:

.\enve\python.exe -m pip install sentencepiece

GGUF folder exists but no .gguf file

Fix:

  • Remove the broken export folder.
  • Rerun conversion with the latest EdgeForge code.
  • Prefer fp16 as the input artifact for GGUF export.

LiteRT .task exists but is not real

Signs:

  • model.tflite inside the bundle is only a few hundred bytes
  • ai-edge-torch and tensorflow are not installed

Fix:

.\enve\python.exe -m pip install ai-edge-torch tensorflow

Then rerun:

.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert

llama-cli.exe not found

Error:

FileNotFoundError: [WinError 2]

Fix:

  • Use the real path to llama-cli.exe
  • Do not leave C:\path\to\llama-cli.exe as a placeholder
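A small pre-flight check avoids this error entirely: resolve the executable before launching chat. The helper below is illustrative (not part of EdgeForge); it accepts an absolute path or a name on PATH and rejects the documentation placeholder:

```python
import os
import shutil

def resolve_llama_cli(candidate):
    """Return a usable path to llama-cli, or raise with a clear message."""
    lowered = candidate.lower()
    if "path\\to" in lowered or "path/to" in lowered:
        raise FileNotFoundError("replace the placeholder with the real llama-cli path")
    if os.path.isfile(candidate):
        return candidate                  # explicit path that exists
    found = shutil.which(candidate)       # fall back to a PATH lookup
    if found:
        return found
    raise FileNotFoundError(f"llama-cli not found: {candidate}")
```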

Current backend status

  • fp16 quantization is implemented with real Hugging Face model loading and saving.
  • dynamic quantization uses real PyTorch dynamic quantization for Linear layers.
  • gptq and awq are wired through llmcompressor.oneshot() with generated recipes.
  • gguf export auto-detects the auto-round and gguf Python packages and uses the Python export path when available.
  • tflite and litert use a real ai-edge-torch conversion path when that package is installed; otherwise EdgeForge writes an explicit fallback artifact instead of pretending conversion succeeded.

Recommended Usage Summary

  • For the most reliable results, use a text-only model and export to gguf.
  • For mobile experiments, use smaller models in the 1B to 3B range.
  • For Google AI Edge Gallery, prefer Linux / WSL / Colab over native Windows.
  • For larger 7B+ models, prefer GGUF over TFLite / LiteRT.

Package layout

src/edgeforge/
  __init__.py
  __main__.py
  auth.py
  chat.py
  cli.py
  config.py
  converter.py
  downloader.py
  pipeline.py
  pruner.py
  quantizer.py
  utils.py
SKILL.md
tests/

Reality check for LiteRT

EdgeForge is intentionally honest about deployment constraints:

  • LiteRT export is best for models and operator sets that are already compatible with Google AI Edge tooling.
  • General Hugging Face decoder-only LLMs often need model-family-specific conversion work before they become valid .task or .tflite artifacts.
  • GGUF plus llama.cpp remains the most practical local runtime for many larger LLMs.

AWQ note

  • AutoAWQ is deprecated upstream.
  • EdgeForge now treats llmcompressor as the recommended AWQ dependency path.
  • llmcompressor is documented upstream with Linux as the recommended environment for GPU workflows.
  • On Windows, very large AWQ workflows may still be less reliable than Linux GPU environments.

Rather than pretending every Hugging Face model can be converted to LiteRT in one generic step, the project provides an orchestration layer with clear manifests, conversion plans, and extension points.

Development

pip install -e ".[dev]"
pytest tests/ -v
python -m build
