EdgeForge
Developer-friendly CLI for quantizing, pruning, and exporting Hugging Face models for edge deployment.
Quantize, prune, and deploy Hugging Face LLMs to Google AI Edge Gallery (LiteRT / TFLite) from your terminal.
EdgeForge is a developer-friendly CLI toolkit that downloads Hugging Face transformer models, prepares them for compression, applies quantization and pruning workflows, and stages export artifacts for:
- LiteRT `.task` bundles for Google AI Edge Gallery
- TFLite `.tflite` models for mobile inference
- GGUF `.gguf` workflows for `llama.cpp`
Documentation
For the full practical guide, see:
USAGE_AND_SUPPORT.md
That guide covers:
- supported workflows
- Windows vs Linux / WSL / Colab guidance
- dependency compatibility
- GGUF vs TFLite vs LiteRT recommendations
- mobile size limits
- common error explanations and fixes
Install
pip install edgeforge
pip install "edgeforge[gptq]"
pip install "edgeforge[awq]"
pip install "edgeforge[litert,tflite]"
pip install "edgeforge[gguf]"
pip install "edgeforge[all]"
Recommended install strategy:
- use `.[torch,gguf]` for GGUF workflows
- use `.[torch,tflite,litert]` for TFLite / LiteRT workflows
- avoid mixing every backend in one notebook unless you really need to
For this workspace:
.\enve\python.exe -m pip install -e .
Examples:
.\enve\python.exe -m pip install -e ".[torch,gguf]"
.\enve\python.exe -m pip install -e ".[torch,tflite,litert]"
CLI overview
edgeforge auth login
edgeforge auth status
edgeforge download google/gemma-2b-it
edgeforge quantize google/gemma-2b-it --method gptq --bits 4
edgeforge prune ./models/gemma-2b-it-gptq --method magnitude --sparsity 0.3
edgeforge convert ./models/gemma-2b-it-gptq-pruned --format litert
edgeforge run google/gemma-2b-it --quant-method awq --bits 4 --export-format gguf
edgeforge chat ./models/gemma-2b-it.gguf
Step By Step For A New Model
Use this workflow when you want to process a new Hugging Face model from scratch.
1. Activate the environment
C:\Quanitization\enve\Scripts\activate
cd C:\Quanitization
2. Authenticate with Hugging Face
Only needed for gated or private models.
.\enve\Scripts\edgeforge.exe auth login --token hf_xxx
3. Download the model
Public model example:
.\enve\Scripts\edgeforge.exe download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --no-auth
Gated model example:
.\enve\Scripts\edgeforge.exe download google/gemma-2b-it
4. Quantize the model
For the most reliable GGUF workflow, use fp16 first.
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method fp16
Example:
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method fp16
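Conceptually, the fp16 step casts the model's float32 weights to half precision, halving storage at the cost of a small rounding error. A minimal numpy sketch of the idea (illustration only; the actual EdgeForge implementation loads and saves the full Hugging Face model):

```python
import numpy as np

# A stand-in for one float32 weight tensor from the model.
w32 = np.random.randn(256, 256).astype(np.float32)

# "fp16 quantization" is, at its core, a precision cast.
w16 = w32.astype(np.float16)

# Half the bytes, with a small per-element rounding error.
print(w16.nbytes / w32.nbytes)  # 0.5
max_err = np.max(np.abs(w32 - w16.astype(np.float32)))
```

This is why fp16 is the safest first step: it changes precision but not the model's structure, so downstream converters rarely choke on it.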
5. Optional pruning
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --method magnitude --sparsity 0.1
Example:
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\TinyLlama--TinyLlama-1.1B-Chat-v1.0-fp16-16bit" --method magnitude --sparsity 0.1
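Magnitude pruning zeroes the weights with the smallest absolute values until the requested sparsity is reached. A hedged numpy sketch of what `--method magnitude --sparsity 0.1` does conceptually (the real pruner operates on the saved Hugging Face artifact, not raw arrays):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights in a tensor."""
    k = int(w.size * sparsity)  # number of weights to drop
    if k == 0:
        return w.copy()
    # Threshold is the k-th smallest absolute value across the tensor.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(128, 128)
pruned = magnitude_prune(w, 0.1)
achieved_sparsity = np.mean(pruned == 0.0)  # ~0.1
```

Starting at a low sparsity like 0.1 keeps quality loss small; larger values prune more aggressively and need evaluation before deployment.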
6. Export to GGUF
Without pruning:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit" --format gguf
With pruning:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-fp16-16bit-pruned-magnitude-10" --format gguf
7. Find the GGUF file
dir "C:\Users\vicky\.edgeforge\exports\MODEL_EXPORT_FOLDER\gguf" /s
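Once you have located the file, a quick sanity check is useful: every valid GGUF file begins with the 4-byte magic `GGUF`. A small script (the path handling is generic; swap in your own export path):

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # magic bytes at the start of every GGUF file

def looks_like_gguf(path: Path) -> bool:
    """Return True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

If this returns False for a `.gguf` file, the export was interrupted or produced a placeholder; remove the folder and rerun the conversion.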
8. Run chat with llama.cpp
.\enve\Scripts\edgeforge.exe chat "FULL_PATH_TO_MODEL.gguf" --backend gguf --executable "C:\Users\vicky\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\llama-cli.exe"
Recommended order
- Download the model.
- Quantize with `fp16`.
- Skip pruning for the first test.
- Export to `gguf`.
- Test chat with `llama.cpp`.
- Add pruning only after the base export works.
Full example
.\enve\Scripts\edgeforge.exe download microsoft/phi-2 --no-auth
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\microsoft--phi-2" --method fp16
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\microsoft--phi-2-fp16-16bit" --format gguf
dir "C:\Users\vicky\.edgeforge\exports\microsoft--phi-2-fp16-16bit\gguf" /s
INT8 Workflow
Use this path when you want an INT8-compressed Hugging Face artifact first.
1. Quantize to INT8
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\MODEL_FOLDER_NAME" --method int8
Example:
.\enve\Scripts\edgeforge.exe quantize "C:\Users\vicky\.edgeforge\models\TinyLlama--TinyLlama-1.1B-Chat-v1.0" --method int8
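As a rough intuition, int8 quantization maps each float tensor onto 256 integer levels via a per-tensor scale. A minimal symmetric-quantization sketch (EdgeForge's int8 path goes through its quantizer backend, not this code):

```python
import numpy as np

def int8_quantize(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ q * scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = int8_quantize(w)
w_hat = q.astype(np.float32) * scale

# Rounding error is bounded by half a quantization step.
max_err = np.max(np.abs(w - w_hat))
```

This quarters the weight storage relative to float32, which is why int8 artifacts are attractive for local experiments even when GGUF export later starts from fp16.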
2. Optional pruning
.\enve\Scripts\edgeforge.exe prune "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME-int8-8bit" --method magnitude --sparsity 0.1
3. Important note for GGUF export
- INT8 artifacts are useful for local Hugging Face style workflows.
- For GGUF export, `fp16` is usually the safer source format.
- If GGUF conversion from INT8 fails, convert from the `fp16` artifact instead.
LiteRT Workflow
Use this path only when you have real LiteRT/TFLite conversion dependencies installed.
1. Install conversion backends
.\enve\python.exe -m pip install ai-edge-torch tensorflow
2. Export to LiteRT
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert
3. Verify the output
- A real LiteRT `.task` file should contain a non-trivial `model.tflite`.
- If the bundle contains a tiny placeholder `model.tflite`, the real backend was not used.
- For many general Hugging Face LLMs, LiteRT conversion is still model-dependent and not guaranteed.
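One way to automate the placeholder check, assuming the `.task` bundle is a standard zip archive containing a `model.tflite` entry (true for AI Edge task bundles; the 1 KB threshold is an assumption you can tighten):

```python
import zipfile

PLACEHOLDER_LIMIT = 1024  # bytes; any real model is far larger

def has_real_tflite(task_path: str) -> bool:
    """Check that model.tflite inside a .task bundle is non-trivial."""
    with zipfile.ZipFile(task_path) as bundle:
        for info in bundle.infolist():
            if info.filename.endswith("model.tflite"):
                return info.file_size > PLACEHOLDER_LIMIT
    return False  # no model.tflite entry at all
```

If this returns False, install the real conversion backends (`ai-edge-torch`, `tensorflow`) and rerun the convert step.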
4. Best candidates
- Gemma-family models
- Phi-family models
- Smaller transformer models with simpler operator coverage
Common Errors And Fixes
Gated repo error
Error:
GatedRepoError / 403 Client Error
Fix:
- Request access on Hugging Face.
- Or test with a public model first.
bitsandbytes not installed
Error:
bitsandbytes is not installed or not supported
Fix:
.\enve\python.exe -m pip install bitsandbytes
If that still fails on Windows, use fp16 instead.
sentencepiece missing during GGUF export
Error:
ModuleNotFoundError: No module named 'sentencepiece'
Fix:
.\enve\python.exe -m pip install sentencepiece
GGUF folder exists but no .gguf file
Fix:
- Remove the broken export folder.
- Rerun conversion with the latest EdgeForge code.
- Prefer `fp16` as the input artifact for GGUF export.
LiteRT .task exists but is not real
Signs:
- `model.tflite` inside the bundle is only a few hundred bytes
- `ai-edge-torch` and `tensorflow` are not installed
Fix:
.\enve\python.exe -m pip install ai-edge-torch tensorflow
Then rerun:
.\enve\Scripts\edgeforge.exe convert "C:\Users\vicky\.edgeforge\artifacts\MODEL_NAME" --format litert
llama-cli.exe not found
Error:
FileNotFoundError: [WinError 2]
Fix:
- Use the real path to `llama-cli.exe`.
- Do not leave `C:\path\to\llama-cli.exe` as a placeholder.
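A small helper can resolve the executable before you pass it to `--executable`. This is a sketch, not part of EdgeForge; `find_llama_cli` is a hypothetical name, and it only checks `PATH` plus any extra directories you supply (such as the WinGet packages folder):

```python
import shutil
from pathlib import Path

def find_llama_cli(extra_dirs=()):
    """Return a usable llama-cli path from PATH or extra directories, else None."""
    found = shutil.which("llama-cli") or shutil.which("llama-cli.exe")
    if found:
        return found
    for d in extra_dirs:
        for name in ("llama-cli.exe", "llama-cli"):
            candidate = Path(d) / name
            if candidate.is_file():
                return str(candidate)
    return None
```

Resolving the path up front avoids the `FileNotFoundError: [WinError 2]` failure at chat time.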
Current backend status
- `fp16` quantization is implemented with real Hugging Face model loading and saving.
- `dynamic` quantization uses real PyTorch dynamic quantization for `Linear` layers.
- `gptq` and `awq` are wired through `llmcompressor.oneshot()` with generated recipes.
- `gguf` export auto-detects `auto-round` plus `gguf` and uses the Python export path when available.
- `tflite` and `litert` use a real `ai-edge-torch` conversion path when that package is installed; otherwise EdgeForge writes an explicit fallback artifact instead of pretending conversion succeeded.
Recommended Usage Summary
- For the most reliable results, use a text-only model and export to `gguf`.
- For mobile experiments, use smaller models in the 1B to 3B range.
- For Google AI Edge Gallery, prefer Linux / WSL / Colab over native Windows.
- For larger 7B+ models, prefer GGUF over TFLite / LiteRT.
Package layout
src/edgeforge/
__init__.py
__main__.py
auth.py
chat.py
cli.py
config.py
converter.py
downloader.py
pipeline.py
pruner.py
quantizer.py
utils.py
SKILL.md
tests/
Reality check for LiteRT
EdgeForge is intentionally honest about deployment constraints:
- LiteRT export is best for models and operator sets that are already compatible with Google AI Edge tooling.
- General Hugging Face decoder-only LLMs often need model-family-specific conversion work before they become valid `.task` or `.tflite` artifacts.
- GGUF plus `llama.cpp` remains the most practical local runtime for many larger LLMs.
AWQ note
- `AutoAWQ` is deprecated upstream.
- EdgeForge now treats `llmcompressor` as the recommended AWQ dependency path.
- `llmcompressor` is documented upstream with Linux as the recommended environment for GPU workflows.
- On Windows, very large AWQ workflows may still be less reliable than Linux GPU environments.
The current project provides a strong orchestration layer, clear manifests, conversion plans, and extension points instead of pretending every Hugging Face model can be converted to LiteRT in one generic step.
Development
pip install -e ".[dev]"
pytest tests/ -v
python -m build