
SLM Packager

A Unified Runtime & Developer Layer for Small Language Models

SLM Packager is an open-source toolkit for running, packaging, and evaluating Small Language Models (1B-7B parameters) across different formats and runtimes. Think of it as Terraform for SLMs: making model deployment simple, reproducible, and developer-friendly.


✨ Features

  • 🎯 Model Registry: One-command downloads from HuggingFace with slm pull
  • 🔄 Multi-Runtime Support: llama.cpp (GGUF), Transformers (PyTorch), ONNX
  • ⚡ GPU Acceleration: MPS (Apple Silicon), CUDA (NVIDIA), Metal (llama.cpp)
  • ⚙️ Auto-Quantization: On-device model quantization with automatic tool setup
  • 📊 Benchmarking: Measure speed, memory, and latency across runtimes
  • 🛠️ Config-Driven: YAML configs for reproducible deployments
  • 🌐 API Server: FastAPI-based serving with streaming support

🚀 Quick Start

Installation

git clone https://github.com/Ayo-Cyber/slm-packager.git
cd slm-packager
pip install -e .

Pull & Run a Model

# List available models
slm list

# Pull GPT-2 (500MB, fast for testing)
slm pull gpt2

# Run it
slm run gpt2 --prompt "Explain AI in one sentence"

That's it! The model downloads, auto-configures, and runs.

Pull a GGUF Model

# Pull TinyLlama with llama.cpp (637MB)
slm pull tinyllama

# Run with different parameters
slm run tinyllama --prompt "Write a haiku"

📦 Available Models

Model      Size    Runtime       Description
gpt2       500MB   transformers  OpenAI GPT-2, fast to download
tinyllama  637MB   llama.cpp     1.1B chat model, CPU-optimized
phi-2      1.6GB   llama.cpp     Microsoft's 2.7B reasoning model
qwen-1.8b  1.1GB   llama.cpp     Alibaba's efficient chat model

View all: slm list
Pull with specific quantization: slm pull tinyllama --quant q8_0

๐Ÿ› ๏ธ CLI Commands

# Model management
slm list                    # Show available models
slm list --installed        # Show downloaded models
slm pull <model>            # Download a model
slm pull <model> --list-variants  # Show quantization options

# Running models
slm run <model> --prompt "Your prompt"
slm run <config.yaml> --prompt "Your prompt"

# Quantization (auto-downloads tool)
slm quantize input.gguf output.gguf --type q4_k_m

# Benchmarking
slm benchmark <model>

# API server
slm serve --port 8000

# Manual config creation
slm init

⚡ GPU Acceleration

SLM Packager supports GPU acceleration across different hardware platforms.

Apple Silicon (MPS) - Zero Setup Required! 🍎

No installation needed - works out of the box on M1/M2/M3 Macs!

# Create GPU-accelerated config
slm init --name gpt2 --path gpt2 --format transformers --runtime transformers --device mps -o gpt2-gpu.yaml

# Run on GPU
slm run gpt2-gpu.yaml --prompt "Explain quantum computing"

Real Performance (M2 Pro):

GPT-2 Performance Comparison
├─ CPU:  1.3 tokens/sec
└─ MPS:  2.4 tokens/sec  ⚡ ~1.8x faster!

Tested on: M2 Pro, macOS 14.x, GPT-2 (124M parameters)

Requirements:

  • macOS 12.3 or later
  • Apple Silicon (M1/M2/M3 series)
  • PyTorch 1.12+ (included with installation)
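
To confirm that PyTorch can actually see the Metal backend before creating an MPS config, you can run a quick check with plain PyTorch (standard torch API, not part of the SLM Packager CLI):

# Sanity check: is the MPS backend compiled in and usable on this machine?
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"MPS built:       {torch.backends.mps.is_built()}")      # compiled with MPS support
print(f"MPS available:   {torch.backends.mps.is_available()}")  # usable at runtime

If is_available() prints False on an M-series Mac, check the macOS and PyTorch versions against the requirements above.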

NVIDIA GPU (CUDA)

# Install llama.cpp with CUDA support
# (newer llama-cpp-python releases use -DGGML_CUDA=on in place of -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir

# Set gpu_layers in config
runtime:
  type: llama_cpp
  device: cuda
  gpu_layers: 32  # Offload layers to GPU

Expected Performance:

  • 2-5x speedup vs CPU
  • Depends on GPU, model size, and layers offloaded

llama.cpp Metal (Apple Silicon)

For GGUF models with llama.cpp on Apple Silicon:

# Rebuild with Metal support
# (newer llama-cpp-python releases use -DGGML_METAL=on; macOS arm64 builds
#  often enable Metal by default)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir

# Use GPU layers in config
runtime:
  type: llama_cpp
  device: cpu  # Metal auto-detected
  gpu_layers: 32  # Offload to GPU
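
The gpu_layers setting above corresponds to the n_gpu_layers argument in llama-cpp-python (an assumption based on the standard llama-cpp-python API; the model path below is a placeholder). A minimal sketch of driving the runtime directly:

# Sketch: GPU layer offload via llama-cpp-python.
# "models/tinyllama.gguf" is a hypothetical local path - use any GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/tinyllama.gguf",
    n_gpu_layers=32,  # same role as gpu_layers in the YAML config
    n_ctx=2048,       # context window, as in context_size
)

result = llm("Write a haiku about autumn:", max_tokens=64)
print(result["choices"][0]["text"])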

📊 Performance Benchmarks

Real-world performance on different hardware:

GPT-2 (124M parameters)

Runtime       Device        Tokens/sec  Memory  Notes
transformers  CPU (M2 Pro)  1.3         2.1GB   Baseline
transformers  MPS (M2 Pro)  2.4         2.1GB   ~1.8x speedup ⚡
ONNX          CPU (M2 Pro)  13.8        600MB   With KV-cache
llama.cpp     CPU           ~15-20      ~400MB  Quantized GGUF

TinyLlama (1.1B parameters)

Runtime       Device      Tokens/sec  Memory  Notes
llama.cpp     CPU         15-20       ~800MB  Q4_K_M quantization
llama.cpp     Metal (M1)  40-60       ~800MB  With GPU layers
transformers  CPU         5-10        4GB     Full precision

Performance varies based on hardware, model size, and configuration. Benchmarks collected on M2 Pro (Dec 2025).


📖 Runtime Comparison

Choose the right runtime for your use case:

Runtime       Best For                      Pros                         Cons
llama.cpp     Production, efficiency        Fast, low memory, quantized  GGUF format only
transformers  Development, flexibility      Latest models, GPU support   Higher memory
ONNX          Cross-platform, optimization  Fast, portable, optimized    Requires model export

When to Use Each

llama.cpp (GGUF):

  • ✅ Production deployments
  • ✅ Limited memory/CPU
  • ✅ Want quantization
  • ✅ Edge devices

transformers (PyTorch):

  • ✅ Development & experimentation
  • ✅ Latest HuggingFace models
  • ✅ Fine-tuning workflows
  • ✅ GPU available

ONNX:

  • ✅ Cross-platform deployment
  • ✅ ML pipeline integration
  • ✅ Optimized inference graphs
  • ✅ Already have ONNX models

🎯 Example Workflows

Developer: Fine-Tune & Quantize

# 1. Fine-tune your model (external tool)
# 2. Quantize it
slm quantize my-model.gguf my-model-q4.gguf --type q4_k_m

# 3. Test it
slm run my-model-q4.gguf --prompt "Test prompt"

# 4. Benchmark it
slm benchmark my-model-q4.gguf

Researcher: Compare Runtimes

# Pull same model, different runtimes
slm pull gpt2              # Transformers
slm pull tinyllama         # llama.cpp

# Benchmark both
slm benchmark gpt2
slm benchmark tinyllama

# Compare results

MacBook User: GPU-Accelerated Inference

# Zero setup - just run!
slm pull gpt2
slm init --name gpt2 --path gpt2 --device mps -o gpt2-gpu.yaml
slm run gpt2-gpu.yaml --prompt "Hello!"

# ~1.8x faster than CPU! ⚡

ONNX User: Export & Run

# 1. Export model to ONNX
pip install "optimum[exporters]"
optimum-cli export onnx --model gpt2 --task text-generation-with-past models/gpt2-onnx/

# 2. Create config  
slm init --name gpt2 --path models/gpt2-onnx --format onnx --runtime onnx -o gpt2-onnx.yaml

# 3. Run (13.8 tok/s on CPU!)
slm run gpt2-onnx.yaml --prompt "Hello world"

📦 Configuration

Example Config

# my-model.yaml
model:
  name: my-custom-model
  path: /path/to/model.gguf
  format: gguf
  description: "My quantized model"

runtime:
  type: llama_cpp
  device: cpu
  threads: 8
  context_size: 2048
  gpu_layers: 0

params:
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 512
  stream: true
  stop: ["User:", "\n\n"]
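
Because configs are plain YAML, they are easy to generate or inspect programmatically. A minimal sketch using PyYAML, assuming only the structure shown above (this is not a documented SLM Packager API):

# Read a config like the one above with PyYAML.
import yaml

with open("my-model.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["name"])          # my-custom-model
print(cfg["runtime"]["type"])        # llama_cpp
print(cfg["params"]["temperature"])  # 0.7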

Create Config Interactively

slm init  # Guided prompts

๐ŸŒ API Server

Start a FastAPI server for HTTP access:

# Start server
slm serve --port 8000

# Or with custom config
slm serve --config my-model.yaml --port 8000

API Usage

# Health check
curl http://localhost:8000/health

# Load model
curl -X POST http://localhost:8000/load \
  -H "Content-Type: application/json" \
  -d '{"config_path": "gpt2.yaml"}'

# Generate text
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.8
  }'

# Streaming
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"prompt": "Hello", "stream": true}'

📖 Documentation

Comprehensive guides for each component are available in the project repository.


🧪 Testing & Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=slm_packager --cov-report=html

# Code quality
black slm_packager tests
isort slm_packager tests
mypy slm_packager

# Pre-commit hooks
pre-commit install
pre-commit run --all-files

Test Results:

  • ✅ 73 tests passing
  • Coverage: 52% overall
    • API: 82% ⭐
    • Core runtime: 60%
    • CLI: 47%

See CONTRIBUTING.md for detailed development guidelines.


๐Ÿ—บ๏ธ Roadmap

v0.2 (Current - December 2025)

  • Automated test suite (✅ Complete: 73 tests, 52% coverage)
  • MPS GPU support for Apple Silicon (✅ Complete: ~1.8x speedup)
  • ONNX runtime with KV-cache (✅ Complete: 13.8 tok/s)
  • API server improvements (✅ Complete: 82% coverage)
  • CUDA GPU acceleration testing
  • Comprehensive benchmark suite
  • Expand model registry

v1.0 (Future)

  • vLLM integration for high-performance GPU serving
  • ROCm support (AMD GPUs)
  • Model conversion utilities
  • Web UI for model management
  • Advanced quantization options
  • Multi-GPU support

๐Ÿ—๏ธ Architecture

┌─────────────────────────────────┐
│   CLI / API Server              │
├─────────────────────────────────┤
│   Model Registry & Downloader   │
├─────────────────────────────────┤
│   Runtime Abstraction Layer     │
│   ├─ llama.cpp (GGUF)           │
│   │   └─ Metal/CUDA support     │
│   ├─ Transformers (PyTorch)     │
│   │   └─ MPS/CUDA support       │
│   └─ ONNX Runtime               │
│       └─ Manual KV-cache        │
├─────────────────────────────────┤
│   Quantization & Benchmarking   │
└─────────────────────────────────┘
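
To make the abstraction layer concrete, here is an illustrative sketch of how such a runtime interface can be shaped. The class names are hypothetical and do not reflect slm_packager's actual internals:

# Illustrative only: one way a runtime abstraction layer can look.
from abc import ABC, abstractmethod
from typing import Iterator

class Runtime(ABC):
    """Common interface the CLI / API server layer can target."""

    @abstractmethod
    def load(self, model_path: str) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> Iterator[str]:
        """Yield text chunks so both streaming and batch callers are served."""

class LlamaCppRuntime(Runtime):
    """Backend wrapping llama-cpp-python for GGUF models."""

    def load(self, model_path: str) -> None:
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path)

    def generate(self, prompt: str, max_tokens: int = 256) -> Iterator[str]:
        for chunk in self.llm(prompt, max_tokens=max_tokens, stream=True):
            yield chunk["choices"][0]["text"]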

💡 Why SLM Packager?

Problem: Running small language models involves juggling different formats (GGUF, PyTorch, ONNX), runtimes (llama.cpp, transformers, onnxruntime), and configuration options.

Solution: SLM Packager provides:

  • Unified interface - One CLI/API for all runtimes
  • Auto-configuration - Models work out-of-the-box
  • GPU acceleration - Automatic MPS on Mac, easy CUDA setup
  • Reproducibility - YAML configs for deployment
  • Developer-friendly - Python API, FastAPI server, streaming support

๐Ÿค Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Development setup
  • Testing guidelines
  • Code style requirements
  • Pull request process

๐Ÿ“ License

MIT License - see LICENSE for details


๐Ÿ™ Acknowledgments

  • llama.cpp - Efficient GGUF runtime
  • HuggingFace - Transformers and model hub
  • ONNX Runtime - Optimized inference
  • FastAPI - Modern API framework


Built with ❤️ for the AI community

Making small language models accessible, fast, and easy to deploy.
