TinyEdgeLLM

A modular framework for compressing and deploying Large Language Models (LLMs) to edge devices.

Problem

Cloud-hosted LLMs are a poor fit for many IoT and edge applications: every request incurs network latency, consumes bandwidth, and costs energy. TinyEdgeLLM addresses this by compressing models so that inference can run efficiently on the device itself.

Solution

TinyEdgeLLM provides a hybrid Python/C++ library that implements:

  • Advanced Quantization: GPTQ, AWQ, and BitsAndBytes 4-bit quantization
  • Structured Pruning: Attention head, neuron, and layer pruning algorithms
  • Knowledge Distillation: Teacher-student training for compressed models
  • Mixed-precision quantization (2-bit, 4-bit, 8-bit)
  • Cross-platform deployment to ONNX, TensorFlow Lite, and TorchScript
  • Edge-device optimization for TinyML-class hardware
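To make the effect of mixed-precision quantization concrete, the sketch below estimates weight storage when different layer groups are stored at different bit widths. The helper `estimate_size_mb`, the layer names, and the parameter counts are illustrative assumptions, not part of the TinyEdgeLLM API, and the estimate ignores quantization metadata such as scales and zero points.

```python
def estimate_size_mb(layer_params, layer_bits, default_bits=32):
    """Estimate weight storage in MB (10**6 bytes) for a per-layer bit-width plan.

    layer_params: dict mapping a layer name to its parameter count.
    layer_bits:   dict mapping a layer name to its bit width; layers that are
                  missing from it fall back to default_bits.
    """
    total_bits = sum(
        count * layer_bits.get(name, default_bits)
        for name, count in layer_params.items()
    )
    return total_bits / 8 / 1e6


# Hypothetical layer groups and parameter counts, for illustration only.
params = {"embeddings": 40_000_000, "attention": 30_000_000, "mlp": 55_000_000}

fp32_mb = estimate_size_mb(params, {})
mixed_mb = estimate_size_mb(params, {"embeddings": 8, "attention": 4, "mlp": 2})
print(f"fp32: {fp32_mb:.2f} MB, mixed 8/4/2-bit: {mixed_mb:.2f} MB")
```

Real on-disk sizes are usually larger than this estimate, because some tensors stay in higher precision and each quantized group carries its own scale.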

Features

  • Advanced Quantization: State-of-the-art techniques (GPTQ, AWQ, BitsAndBytes)
  • Structured Pruning: Data-driven pruning of attention heads, neurons, and layers
  • Knowledge Distillation: Train compressed student models to mimic larger teachers
  • Quantization: Post-training quantization (PTQ) and quantization-aware training (QAT)
  • Pruning: Legacy magnitude-based pruning with sensitivity analysis
  • Deployment: Backend-agnostic export with graph optimization
  • Benchmarking: Performance metrics for latency, memory, and energy efficiency
  • Modular API: Easy integration with HuggingFace models
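As a rough illustration of the kind of latency measurement a benchmarking step involves, the sketch below times an arbitrary callable with the standard library. The function `benchmark_latency` and its parameters are hypothetical and are not TinyEdgeLLM's benchmarking API; the workload is a stand-in for a model forward pass.

```python
import statistics
import time


def benchmark_latency(fn, warmup=3, runs=20):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "mean_ms": statistics.fmean(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; in practice this would wrap a model forward pass.
stats = benchmark_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting the median alongside the mean makes the result less sensitive to occasional slow runs caused by other processes on the device.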

Performance Results

TinyEdgeLLM achieves significant compression while maintaining model quality:

| Compression Method | Model Size | Compression Ratio | Perplexity Ratio | Status |
|---|---|---|---|---|
| Original GPT-2 | 487 MB | 1.0x | 1.00 | Baseline |
| Basic 8-bit Quantization | 249 MB | 1.95x | 1.00 | ✅ Working |
| Basic 4-bit Quantization | 249 MB | 1.95x | 1.00 | ✅ Working |
| 4-bit + Structured Pruning | ~174 MB | ~2.8x | ~1.05 | ✅ Working |
| 4-bit + Pruning + Distillation | ~152 MB | ~3.2x | ~1.02 | ✅ Working |
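
The Compression Ratio column is the original size divided by the compressed size. A minimal sketch of that arithmetic, using the sizes reported above (the helper name `compression_ratio` is an illustrative assumption, not a TinyEdgeLLM function):

```python
def compression_ratio(original_mb, compressed_mb):
    """Return how many times smaller the compressed model is."""
    if compressed_mb <= 0:
        raise ValueError("compressed_mb must be positive")
    return original_mb / compressed_mb


baseline_mb = 487  # original GPT-2 size from the table
reported = [
    ("8-bit quantization", 249),
    ("4-bit + structured pruning", 174),
    ("4-bit + pruning + distillation", 152),
]
for name, size_mb in reported:
    print(f"{name}: {compression_ratio(baseline_mb, size_mb):.2f}x")
```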

Key Achievements:

  • Up to 3.2x compression on GPT-2 with a small quality cost (a perplexity increase of about 2% in the table above)
  • Modular pipeline combining quantization, pruning, and distillation
  • Research-grade techniques including GPTQ, AWQ, and knowledge distillation
  • Production-ready with ONNX export and benchmarking tools
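
The quality figures above are stated as perplexity ratios. The source does not define the metric, so the sketch below assumes the usual reading: perplexity is the exponential of the mean negative log-likelihood per token, and the ratio divides the compressed model's perplexity by the baseline's. The function names and the log-probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_ratio(compressed_log_probs, baseline_log_probs):
    """Values above 1.0 mean the compressed model is worse than the baseline."""
    return perplexity(compressed_log_probs) / perplexity(baseline_log_probs)


# Made-up natural-log probabilities assigned to the same four tokens.
baseline = [-2.0, -1.5, -3.0, -2.5]
compressed = [-2.1, -1.5, -3.0, -2.5]
print(f"perplexity ratio: {perplexity_ratio(compressed, baseline):.3f}")
```

Under this definition a ratio of 1.02 corresponds to a 2% increase in perplexity over the uncompressed model.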

Advanced Compression Techniques

Quantization Methods

  • GPTQ (post-training quantization for generative pre-trained transformers): Accurate 4-bit weight quantization guided by approximate second-order (Hessian) information
  • AWQ (Activation-aware Weight Quantization): Protects salient weights based on activation patterns
  • BitsAndBytes: Efficient 4-bit quantization with hardware acceleration support
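
For orientation, the sketch below shows plain round-to-nearest symmetric quantization with a single scale per tensor. This is deliberately not GPTQ, AWQ, or BitsAndBytes; it is the simple baseline those methods improve on, and the function names are illustrative assumptions rather than TinyEdgeLLM functions.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization with one scale per tensor.

    Returns (codes, scale), where each code is an integer in
    [-(2**(bits - 1) - 1), 2**(bits - 1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.95, -0.61]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max error={max_error:.4f}")
```

The rounding error of each weight is at most half the scale, so a single large outlier inflates the error for every other weight in the tensor. Reducing that effect is the motivation for activation-aware and second-order methods.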

Structured Pruning

  • Attention Head Pruning: Removes redundant attention heads based on importance scores
  • Neuron Pruning: Magnitude-based pruning of neurons in linear layers
  • Layer Pruning: Removes entire transformer layers (experimental)
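
A minimal sketch of magnitude-based neuron pruning, assuming a linear layer stored as one row of weights per output neuron. Rows with the smallest L2 norm are zeroed rather than removed, so the layer keeps its shape. The function `prune_neurons` and the example matrix are hypothetical; how TinyEdgeLLM scores and removes neurons internally is not specified here.

```python
import math


def prune_neurons(weight_rows, pruning_ratio):
    """Zero the output neurons (rows) with the smallest L2 norm.

    weight_rows: list of rows, one row of input weights per output neuron.
    Returns (pruned_rows, pruned_indices). The matrix keeps its shape.
    """
    if not 0.0 <= pruning_ratio < 1.0:
        raise ValueError("pruning_ratio must be in [0, 1)")
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    num_to_prune = int(len(weight_rows) * pruning_ratio)
    # Indices of the lowest-norm rows.
    pruned = set(sorted(range(len(norms)), key=norms.__getitem__)[:num_to_prune])
    pruned_rows = [
        [0.0] * len(row) if i in pruned else list(row)
        for i, row in enumerate(weight_rows)
    ]
    return pruned_rows, sorted(pruned)


layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [-0.05, 0.03, 0.04],
]
pruned_layer, removed = prune_neurons(layer, pruning_ratio=0.5)
print("zeroed neurons:", removed)
```

Zeroing rows does not shrink a dense tensor by itself; the size reduction appears only when the zeroed rows are physically removed or the weights are stored in a sparse or compressed format.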

Knowledge Distillation

  • Teacher-Student Training: Compresses large models by training smaller models to mimic them
  • Distillation Loss: Combines a KL-divergence term on the teacher's soft targets with a cross-entropy term on the hard labels
  • Custom Student Architectures: Support for different model sizes and configurations
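
The loss described above can be sketched as follows, assuming the common formulation in which a temperature-softened KL term is weighted by `alpha` and scaled by the squared temperature, and the hard-label cross-entropy takes the remaining weight. The names, default values, and weighting scheme are assumptions for illustration; TinyEdgeLLM's `distill_model` may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy(label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: KL divergence between temperature-softened distributions.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-target term: cross-entropy against the true label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.2, 0.3]
print(f"loss: {distillation_loss(student, teacher, label=0):.4f}")
```

Multiplying the KL term by the squared temperature keeps its gradient magnitude comparable to the cross-entropy term as the temperature changes.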

Installation

pip install tinyedgellm

Quick Start

from tinyedgellm import quantize_and_prune
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pretrained model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Advanced compression pipeline - achieves ~3.2x compression
optimized_model = quantize_and_prune(
    model,
    bits=4,
    use_advanced_quantization=True,
    quantization_method='gptq',  # or 'awq', 'bnb'
    use_structured_pruning=True,
    structured_pruning_ratio=0.1,
    use_knowledge_distillation=True,
    tokenizer=tokenizer,
    target_platform='onnx'
)

# Result: ~152MB model (from 487MB) with <2% quality degradation

Advanced Usage

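# Use individual components.
# `model` and `tokenizer` are the objects loaded in Quick Start.
# `calibration_data` and `training_data` are datasets you supply; they are not defined in this snippet.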
from tinyedgellm import GPTQQuantizer, apply_structured_pruning, distill_model

# Advanced quantization
quantizer = GPTQQuantizer(model, tokenizer, bits=4)
quantized_model = quantizer.quantize(calibration_data)

# Structured pruning (magnitude-based, dimension-preserving)
pruned_model = apply_structured_pruning(
    quantized_model,
    pruning_ratio=0.1,
    tokenizer=tokenizer
)

# Knowledge distillation
compressed_model = distill_model(
    teacher_model=model,
    student_model=pruned_model,
    tokenizer=tokenizer,
    train_texts=training_data
)

Running the Demo

# Clone the repository
git clone https://github.com/krish567366/tinyedgellm.git
cd tinyedgellm

# Install dependencies
pip install -e .

# Run the advanced compression demo
python demo_advanced.py

# Or try the simpler example
python examples/simple_example.py

# Or run the comprehensive demo
python examples/demo_distilgpt2.py

The demo runs every compression technique and prints the performance results table shown above.

Documentation

For comprehensive documentation including architecture details, reproducibility instructions, advanced examples, and performance results, see the online documentation.

Key Sections:

  • Reproducibility: Exact environment setup and benchmark reproduction
  • Architecture: Detailed system design and component overview
  • Examples: Multiple usage examples from basic to advanced
  • Performance Results: Comprehensive benchmarks and comparisons
  • API Reference: Complete function and class documentation

Local Documentation

To build documentation locally:

pip install -e ".[docs]"
mkdocs serve

Contributing

We welcome contributions! Please see our contributing guide for details.

License

MIT License - see LICENSE for details.
