# Quicksilver

**High-Performance GGUF Inference Engine**

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. On CPU it reaches 84 tok/s with SmolLM2-135M (95% faster than llama.cpp in our benchmark) through optimized C++ kernels with AVX2 SIMD and OpenMP parallelization.
## Install

```bash
pip install quicksilver-inference
```
## Performance
| Model | Size | Quicksilver | llama.cpp | Speedup |
|---|---|---|---|---|
| SmolLM2-135M | 101MB | 84 tok/s | 43 tok/s | +95% |
| Qwen2.5-1.5B | 1.0GB | 12 tok/s | ~12 tok/s | comparable |
*Benchmarked with `OMP_NUM_THREADS=8` on an Intel CPU.*
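The speedup column is the relative throughput delta; for SmolLM2-135M, using the numbers from the table:

```python
# Relative speedup = (new throughput / baseline throughput) - 1
quicksilver_tps = 84  # tok/s, SmolLM2-135M (table above)
llamacpp_tps = 43     # tok/s, same model and hardware

speedup = quicksilver_tps / llamacpp_tps - 1
print(f"{speedup:.0%}")  # → 95%
```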
## Backends
| Backend | Platform | Status |
|---|---|---|
| CPU | All (AVX2/NEON) | ✅ Production |
| CUDA | NVIDIA GPUs | ✅ Ready |
| Metal | Apple Silicon | ✅ Ready |
| CANN | Huawei Ascend NPU | ✅ Ready |
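The auto-selection order (CUDA > Metal > CPU) can be pictured as a simple priority scan. This is an illustrative sketch, not Quicksilver's actual API; the availability flags stand in for real device probes:

```python
def pick_backend(cuda_ok: bool, metal_ok: bool) -> str:
    """Return the highest-priority available backend.

    The flags are placeholders for real probes (e.g. checking
    for a CUDA driver or a Metal device at runtime).
    """
    if cuda_ok:
        return "cuda"
    if metal_ok:
        return "metal"
    return "cpu"  # the CPU backend is always available

print(pick_backend(cuda_ok=False, metal_ok=True))  # → metal
```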
## Features
- **Native GGUF Parsing** - Zero external dependencies for model loading
- **9 Quantization Types** - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
- **AVX2 SIMD** - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
- **OpenMP Parallelization** - Multi-threaded GEMV operations
- **Streaming Generation** - Real-time token-by-token output
- **Batch Inference** - Process multiple prompts efficiently
- **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API
- **Auto Backend Selection** - Automatically uses the best available backend (CUDA > Metal > CPU)
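Dependency-free GGUF loading is possible because the file starts with a small fixed-layout header. As a rough illustration based on the public GGUF specification (this is not Quicksilver's internal parser), the header is a magic string followed by a version and two counts:

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes):
    """Parse the fixed 24-byte GGUF header: magic, version,
    tensor count, and metadata key/value count (little-endian)."""
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: {magic!r}")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Synthetic header: version 3, 2 tensors, 5 metadata entries
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(header))  # → (3, 2, 5)
```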
## Installation

### CPU (All Platforms)

```bash
cd quicksilver
pip install -e .

# Build C++ kernels (requires a C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# For optimal performance
export OMP_NUM_THREADS=8
```
### CUDA (NVIDIA GPUs)

```bash
# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace
```
### Metal (macOS)

```bash
# Requires: Xcode (not just the Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py
```
### CANN (Huawei Ascend NPU)

```bash
# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu

# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace
```
## Quick Start

### Basic Inference

```python
from quicksilver.quantized_engine import QuantizedInferenceEngine

engine = QuantizedInferenceEngine("model.gguf")
tokens = engine.generate(prompt_tokens=[1, 2, 3], max_tokens=50)
```
### Streaming Generation

```python
from quicksilver.streaming import StreamingGenerator

generator = StreamingGenerator(engine, tokenizer)
for token in generator.stream(prompt="Hello!", max_tokens=50):
    print(token.token_text, end="", flush=True)
```
### Batch Processing

```python
from quicksilver.batch import BatchProcessor, BatchRequest

processor = BatchProcessor(engine, tokenizer)
requests = [
    BatchRequest(id="1", prompt="What is AI?"),
    BatchRequest(id="2", prompt="Explain quantum computing"),
]
results, metrics = processor.process_batch(requests)
```
### OpenAI-Compatible API Server

```bash
# Start server
python -m quicksilver.server --model model.gguf --port 8000
```

```python
# Use with the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="quicksilver",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # Streaming supported!
)
```
## CLI

```bash
quicksilver generate -m model.gguf -p "Once upon a time"
quicksilver benchmark -m model.gguf -n 100
```
## Architecture

```
quicksilver/
├── core/                    # GGUF parsing and quantization
│   ├── parser.py            # Native GGUF file parser
│   ├── tensor.py            # Quantized tensor operations
│   └── quantization.py      # Quantization type definitions
├── csrc/                    # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py      # Main inference engine
└── cli.py                   # Command-line interface
```
## Supported Quantization Types

| Type | Bits/weight | Block Size | Status |
|---|---|---|---|
| Q4_0 | 4.0 | 32 | ✅ Native GEMV |
| Q5_0 | 5.0 | 32 | ✅ Native GEMV |
| Q8_0 | 8.0 | 32 | ✅ Native GEMV |
| Q2_K | 2.5 | 256 | ✅ Native GEMV |
| Q3_K | 3.4 | 256 | ✅ Native GEMV |
| Q4_K | 4.5 | 256 | ✅ AVX2 SIMD |
| Q5_K | 5.5 | 256 | ✅ Native GEMV |
| Q6_K | 6.5 | 256 | ✅ AVX2 SIMD |
| F16 | 16 | 1 | ✅ Native GEMV |
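To see where the fractional bits-per-weight figures come from, consider Q4_0: each 32-value block packs 4-bit quants plus a shared f16 scale. Below is a hedged pure-Python sketch of dequantizing one such block, following the public GGUF/llama.cpp layout; Quicksilver's real kernels are the optimized C++ versions:

```python
import struct

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte Q4_0 block into 32 floats.

    Layout: a 2-byte f16 scale `d`, then 16 bytes holding 32
    unsigned 4-bit values; each weight is (q - 8) * d.
    """
    assert len(block) == 18
    (d,) = struct.unpack_from("<e", block, 0)  # f16 scale
    qs = block[2:]
    lo = [((b & 0x0F) - 8) * d for b in qs]  # low nibbles: weights 0..15
    hi = [((b >> 4) - 8) * d for b in qs]    # high nibbles: weights 16..31
    return lo + hi

# A block with scale 1.0 and every nibble = 0x8 dequantizes to zeros
block = struct.pack("<e", 1.0) + bytes([0x88] * 16)
print(set(dequantize_q4_0_block(block)))  # → {0.0}
```

At 18 bytes per 32 weights this is 4.5 bits/weight; Q4_0's table entry of 4.0 reflects the nibble payload alone, while the K-quant rows count their extra per-block metadata.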
## Roadmap

- [x] CPU inference with quantized kernels
- [x] Beat llama.cpp performance (84 vs 43 tok/s = +95%)
- [x] Support 9 quantization types (covers 95%+ of GGUF models)
- [x] Metal GPU backend shaders
- [x] CUDA GPU backend kernels
- [x] CANN backend for Huawei Ascend NPUs
- [x] OpenAI-compatible API server
- [x] Streaming generation
- [x] Batch inference
- [ ] Continuous batching
- [ ] Speculative decoding
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format and lint code
black quicksilver/
ruff check quicksilver/
```
## License
Apache 2.0
## File details

Details for the file `quicksilver_inference-0.2.2.tar.gz`.

- Download URL: quicksilver_inference-0.2.2.tar.gz
- Size: 98.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f311c0881c5f3e5c58888a124003e11e1f4eceb82a56c20b637d40176e29558b` |
| MD5 | `7a77429d7ea5f060476c0c7c92b95b73` |
| BLAKE2b-256 | `cd2a7e303217ac9dd6119ab363bc7ef752d99ae50da7c3db6463f024eded124b` |
## File details

Details for the file `quicksilver_inference-0.2.2-py3-none-any.whl`.

- Download URL: quicksilver_inference-0.2.2-py3-none-any.whl
- Size: 123.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `df855b984adc3417ae8bcb1f175cd667be2bf6f44893e8c84e3aa50bb7acf501` |
| MD5 | `66daa71efd0c401b411c5fd26a8d2d4d` |
| BLAKE2b-256 | `5c746e649134ab8b33d092e8e162c21ab72483559ed08f1616cd81c712daab9c` |