SLM Packager
A Unified Runtime & Developer Layer for Small Language Models
SLM Packager is an open-source toolkit for running, packaging, and evaluating Small Language Models (1B-7B parameters) across different formats and runtimes. Think of it as Terraform for SLMs: it makes model deployment simple, reproducible, and developer-friendly.
Features
- Model Registry: One-command downloads from HuggingFace with slm pull
- Multi-Runtime Support: llama.cpp (GGUF), Transformers (PyTorch), ONNX
- GPU Acceleration: MPS (Apple Silicon), CUDA (NVIDIA), Metal (llama.cpp)
- Auto-Quantization: On-device model quantization with automatic tool setup
- Benchmarking: Measure speed, memory, and latency across runtimes
- Config-Driven: YAML configs for reproducible deployments
- API Server: FastAPI-based serving with streaming support
Quick Start
Installation
git clone https://github.com/Ayo-Cyber/slm-packager.git
cd slm-packager
pip install -e .
Pull & Run a Model
# List available models
slm list
# Pull GPT-2 (500MB, fast for testing)
slm pull gpt2
# Run it
slm run gpt2 --prompt "Explain AI in one sentence"
That's it! The model downloads, auto-configures, and runs.
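Under the hood, pulling and running a Transformers-format model boils down to a Hugging Face download plus a standard generation call. The rough Python equivalent below is only an illustration of what the CLI automates; it uses plain transformers, not SLM Packager's internal API.

# Rough Python equivalent of `slm pull gpt2` + `slm run gpt2 --prompt ...`
# (illustration only; uses plain Hugging Face APIs, not the tool's internals)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # downloads and caches from the Hub
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain AI in one sentence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))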
Pull a GGUF Model
# Pull TinyLlama with llama.cpp (637MB)
slm pull tinyllama
# Run it with a different prompt
slm run tinyllama --prompt "Write a haiku"
Available Models
| Model | Size | Runtime | Description |
|---|---|---|---|
| gpt2 | 500MB | transformers | OpenAI GPT-2, fast to download |
| tinyllama | 637MB | llama.cpp | 1.1B chat model, CPU-optimized |
| phi-2 | 1.6GB | llama.cpp | Microsoft's 2.7B reasoning model |
| qwen-1.8b | 1.1GB | llama.cpp | Alibaba's efficient chat model |
View all: slm list
Pull with specific quantization: slm pull tinyllama --quant q8_0
CLI Commands
# Model management
slm list # Show available models
slm list --installed # Show downloaded models
slm pull <model> # Download a model
slm pull <model> --list-variants # Show quantization options
# Running models
slm run <model> --prompt "Your prompt"
slm run <config.yaml> --prompt "Your prompt"
# Quantization (auto-downloads tool)
slm quantize input.gguf output.gguf --type q4_k_m
# Benchmarking
slm benchmark <model>
# API server
slm serve --port 8000
# Manual config creation
slm init
GPU Acceleration
SLM Packager supports GPU acceleration across different hardware platforms.
Apple Silicon (MPS) - Zero Setup Required
No installation needed - works out of the box on M1/M2/M3 Macs!
# Create GPU-accelerated config
slm init --name gpt2 --path gpt2 --format transformers --runtime transformers --device mps -o gpt2-gpu.yaml
# Run on GPU
slm run gpt2-gpu.yaml --prompt "Explain quantum computing"
Real Performance (M2 Pro):
GPT-2 Performance Comparison
- CPU: 1.3 tokens/sec
- MPS: 2.4 tokens/sec (2.14x faster)
Tested on: M2 Pro, macOS 14.x, GPT-2 (124M parameters)
Requirements:
- macOS 12.3 or later
- Apple Silicon (M1/M2/M3 series)
- PyTorch 1.12+ (included with installation)
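To confirm that MPS is actually available before pointing a config at it, the standard PyTorch check is enough; this is plain PyTorch, not an SLM Packager API.

import torch

# Verify Apple Silicon GPU support before using `device: mps` in a config.
if torch.backends.mps.is_available():
    x = torch.ones(1, device="mps")
    print("MPS is available:", x)
else:
    print("MPS not available; use device: cpu instead")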
NVIDIA GPU (CUDA)
# Install llama.cpp with CUDA support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir
# Set gpu_layers in config
runtime:
  type: llama_cpp
  device: cuda
  gpu_layers: 32  # Offload layers to GPU
Expected Performance:
- 2-5x speedup vs CPU
- Depends on GPU, model size, and layers offloaded
llama.cpp Metal (Apple Silicon)
For GGUF models with llama.cpp on Apple Silicon:
# Rebuild with Metal support
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir
# Use GPU layers in config
runtime:
  type: llama_cpp
  device: cpu     # Metal auto-detected
  gpu_layers: 32  # Offload to GPU
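For reference, gpu_layers corresponds to the n_gpu_layers argument in llama-cpp-python, so the configs above are roughly equivalent to the direct call sketched below (the model path is a hypothetical local file).

from llama_cpp import Llama

# Offload 32 layers to the GPU (Metal or CUDA build of llama-cpp-python);
# this mirrors runtime.gpu_layers in the YAML config.
llm = Llama(
    model_path="models/tinyllama.Q4_K_M.gguf",  # hypothetical path to a local GGUF file
    n_gpu_layers=32,
    n_ctx=2048,
)
result = llm("Write a haiku", max_tokens=64)
print(result["choices"][0]["text"])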
Performance Benchmarks
Real-world performance on different hardware:
GPT-2 (124M parameters)
| Runtime | Device | Tokens/sec | Memory | Notes |
|---|---|---|---|---|
| transformers | CPU (M2 Pro) | 1.3 | 2.1GB | Baseline |
| transformers | MPS (M2 Pro) | 2.4 | 2.1GB | 2.14x speedup |
| ONNX | CPU (M2 Pro) | 13.8 | 600MB | With KV-cache |
| llama.cpp | CPU | ~15-20 | ~400MB | Quantized GGUF |
TinyLlama (1.1B parameters)
| Runtime | Device | Tokens/sec | Memory | Notes |
|---|---|---|---|---|
| llama.cpp | CPU | 15-20 | ~800MB | Q4_K_M quantization |
| llama.cpp | Metal (M1) | 40-60 | ~800MB | With GPU layers |
| transformers | CPU | 5-10 | 4GB | Full precision |
Performance varies based on hardware, model size, and configuration. Benchmarks collected on M2 Pro (Dec 2025).
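If you want a rough tokens/sec number outside of slm benchmark, a simple timing loop around a generation call is enough. The sketch below uses plain transformers and is not the tool's benchmark implementation; absolute numbers will differ from the tables above.

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The future of AI is", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")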
Runtime Comparison
Choose the right runtime for your use case:
| Runtime | Best For | Pros | Cons |
|---|---|---|---|
| llama.cpp | Production, efficiency | Fast, low memory, quantized | GGUF format only |
| transformers | Development, flexibility | Latest models, GPU support | Higher memory |
| ONNX | Cross-platform, optimization | Fast, portable, optimized | Requires model export |
When to Use Each
llama.cpp (GGUF):
- Production deployments
- Limited memory/CPU
- Want quantization
- Edge devices
transformers (PyTorch):
- Development & experimentation
- Latest HuggingFace models
- Fine-tuning workflows
- GPU available
ONNX:
- Cross-platform deployment
- ML pipeline integration
- Optimized inference graphs
- Already have ONNX models
Example Workflows
Developer: Fine-Tune & Quantize
# 1. Fine-tune your model (external tool)
# 2. Quantize it
slm quantize my-model.gguf my-model-q4.gguf --type q4_k_m
# 3. Test it
slm run my-model-q4.gguf --prompt "Test prompt"
# 4. Benchmark it
slm benchmark my-model-q4.gguf
Researcher: Compare Runtimes
# Pull same model, different runtimes
slm pull gpt2 # Transformers
slm pull tinyllama # llama.cpp
# Benchmark both
slm benchmark gpt2
slm benchmark tinyllama
# Compare results
MacBook User: GPU-Accelerated Inference
# Zero setup - just run!
slm pull gpt2
slm init --name gpt2 --path gpt2 --device mps -o gpt2-gpu.yaml
slm run gpt2-gpu.yaml --prompt "Hello!"
# 2.14x faster than CPU
ONNX User: Export & Run
# 1. Export model to ONNX
pip install "optimum[exporters]"
optimum-cli export onnx --model gpt2 --task text-generation-with-past models/gpt2-onnx/
# 2. Create config
slm init --name gpt2 --path models/gpt2-onnx --format onnx --runtime onnx -o gpt2-onnx.yaml
# 3. Run (13.8 tok/s on CPU!)
slm run gpt2-onnx.yaml --prompt "Hello world"
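To sanity-check the exported model outside of SLM Packager, it can also be loaded with Optimum's ONNX Runtime wrapper; this is the optimum.onnxruntime API, not the tool's ONNX layer, and the exact arguments can vary between Optimum versions.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load the directory produced by the optimum-cli export step above.
model = ORTModelForCausalLM.from_pretrained("models/gpt2-onnx")
tokenizer = AutoTokenizer.from_pretrained("models/gpt2-onnx")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))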
Configuration
Example Config
# my-model.yaml
model:
  name: my-custom-model
  path: /path/to/model.gguf
  format: gguf
  description: "My quantized model"
runtime:
  type: llama_cpp
  device: cpu
  threads: 8
  context_size: 2048
  gpu_layers: 0
params:
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 512
  stream: true
  stop: ["User:", "\n\n"]
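Because configs are plain YAML, they are easy to generate or validate from scripts. Below is a minimal sketch using PyYAML; the accepted values checked here are assumed from the formats and runtimes shown elsewhere in this README, not from a published schema.

import yaml

# Load a config and sanity-check a few fields before running `slm run`.
with open("my-model.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["model"]["format"] in {"gguf", "transformers", "onnx"}
assert cfg["runtime"]["type"] in {"llama_cpp", "transformers", "onnx"}
print(f'{cfg["model"]["name"]}: {cfg["runtime"]["type"]} on {cfg["runtime"]["device"]}')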
Create Config Interactively
slm init # Guided prompts
API Server
Start a FastAPI server for HTTP access:
# Start server
slm serve --port 8000
# Or with custom config
slm serve --config my-model.yaml --port 8000
API Usage
# Health check
curl http://localhost:8000/health
# Load model
curl -X POST http://localhost:8000/load \
-H "Content-Type: application/json" \
-d '{"config_path": "gpt2.yaml"}'
# Generate text
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 100,
"temperature": 0.8
}'
# Streaming
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{"prompt": "Hello", "stream": true}'
Documentation
Comprehensive guides for each component:
- Quick Start Guide - Complete walkthrough
- Model Formats Guide - GGUF vs PyTorch vs ONNX
- GGUF Setup Guide - Using llama.cpp with Metal/CUDA
- ONNX Guide - Export, run, and optimize ONNX models
- Init Guide - Creating configs manually
- Contributing Guide - Development setup and guidelines
Testing & Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=slm_packager --cov-report=html
# Code quality
black slm_packager tests
isort slm_packager tests
mypy slm_packager
# Pre-commit hooks
pre-commit install
pre-commit run --all-files
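New tests follow the usual pytest layout; here is a hypothetical example (the file name and its contents are illustrative, not taken from the project's actual test suite).

# tests/test_config_example.py  (hypothetical)
import yaml

def test_yaml_config_round_trip(tmp_path):
    cfg = {
        "model": {"name": "gpt2", "format": "transformers"},
        "runtime": {"type": "transformers", "device": "cpu"},
    }
    path = tmp_path / "gpt2.yaml"
    path.write_text(yaml.safe_dump(cfg))
    assert yaml.safe_load(path.read_text()) == cfg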
Test Results:
- 73 tests passing
- Coverage: 52% overall
  - API: 82%
  - Core runtime: 60%
  - CLI: 47%
See CONTRIBUTING.md for detailed development guidelines.
Roadmap
v0.2 (Current - December 2025)
- Automated test suite (Complete: 73 tests, 52% coverage)
- MPS GPU support for Apple Silicon (Complete: 2.14x speedup)
- ONNX runtime with KV-cache (Complete: 13.8 tok/s)
- API server improvements (Complete: 82% coverage)
- CUDA GPU acceleration testing
- Comprehensive benchmark suite
- Expand model registry
v1.0 (Future)
- vLLM integration for high-performance GPU serving
- ROCm support (AMD GPUs)
- Model conversion utilities
- Web UI for model management
- Advanced quantization options
- Multi-GPU support
Architecture
┌────────────────────────────────┐
│  CLI / API Server              │
├────────────────────────────────┤
│  Model Registry & Downloader   │
├────────────────────────────────┤
│  Runtime Abstraction Layer     │
│   ├─ llama.cpp (GGUF)          │
│   │   └─ Metal/CUDA support    │
│   ├─ Transformers (PyTorch)    │
│   │   └─ MPS/CUDA support      │
│   └─ ONNX Runtime              │
│       └─ Manual KV-cache       │
├────────────────────────────────┤
│  Quantization & Benchmarking   │
└────────────────────────────────┘
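Conceptually, the Runtime Abstraction Layer gives every backend the same load/generate surface, so the CLI and API server never care which engine is underneath. The sketch below is a hypothetical illustration of that pattern; the class and method names are not SLM Packager's actual internals.

from abc import ABC, abstractmethod

class Runtime(ABC):
    """Hypothetical common interface each backend (llama.cpp, transformers, ONNX) would implement."""

    @abstractmethod
    def load(self, model_path: str, **options) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class EchoRuntime(Runtime):
    """Toy backend showing the shape of a concrete implementation."""

    def load(self, model_path: str, **options) -> None:
        self.model_path = model_path

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return f"[{self.model_path}] {prompt}"

# The CLI / API server picks a backend from the config's runtime.type and
# calls the same methods regardless of which engine is behind them.
rt: Runtime = EchoRuntime()
rt.load("models/example.gguf")
print(rt.generate("Hello"))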
Why SLM Packager?
Problem: Running small language models involves juggling different formats (GGUF, PyTorch, ONNX), runtimes (llama.cpp, transformers, onnxruntime), and configuration options.
Solution: SLM Packager provides:
- Unified interface - One CLI/API for all runtimes
- Auto-configuration - Models work out-of-the-box
- GPU acceleration - Automatic MPS on Mac, easy CUDA setup
- Reproducibility - YAML configs for deployment
- Developer-friendly - Python API, FastAPI server, streaming support
Contributing
We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Testing guidelines
- Code style requirements
- Pull request process
License
MIT License - see LICENSE for details
Acknowledgments
- llama.cpp - Efficient GGUF runtime
- HuggingFace - Transformers and model hub
- ONNX Runtime - Optimized inference
- FastAPI - Modern API framework
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- GitHub: @Ayo-Cyber
Built with love for the AI community
Making small language models accessible, fast, and easy to deploy.