High-performance CPU inference for GGUF quantized models
Quicksilver CPU ⚡
Quicksilver CPU is a lightweight, standalone inference engine optimized for running quantized LLMs on CPUs. Using AVX2/AVX-512 SIMD optimizations, it reaches 95+ tokens/second with SmolLM2-135M Q4_K_M on an Intel Core i7-9750H, about 2.2x the llama.cpp throughput measured on the same setup (see Performance).
Features
- 🚀 Blazing Fast: 95+ tok/s on modern CPUs (2.2x faster than llama.cpp)
- 📦 Lightweight: Minimal dependencies, pure CPU focus
- 🔧 Native GGUF: Direct parsing without external libraries
- ⚡ SIMD Optimized: AVX2/AVX-512 + OpenMP parallelization
- 🎯 Quantization Support: Q4_K, Q5_0, Q6_K, Q8_0, and more
- 🔄 Streaming: Token-by-token generation with callbacks
- 📊 Batch Processing: Efficient multi-request handling
- 🧠 Smart Caching: Prompt cache + int8 KV compression (3.9x)
- 🛠️ CLI Tools: Benchmark, info, and generation commands
- 📈 Profiling: Built-in performance diagnostics
Installation
From PyPI
pip install quicksilver-cpu
From Source
git clone https://github.com/kossisoroyce/quicksilver-cpu.git
cd quicksilver-cpu
# Install dependencies
pip install pybind11 numpy
# Build and install
pip install -e .
# Or build just the C++ kernel
cd quicksilver_cpu/csrc
python setup.py build_ext --inplace
Quick Start
Basic Inference
from quicksilver_cpu import Engine
# Load model
engine = Engine("model.gguf")
# Generate tokens
tokens = engine.generate([1, 2, 3], max_tokens=50)
print(f"Generated: {tokens}")
Streaming Generation
from quicksilver_cpu import Engine, StreamingGenerator
engine = Engine("model.gguf")
generator = StreamingGenerator(engine)
for token in generator.stream(prompt_tokens=[1, 2, 3], max_tokens=50):
    print(f"Token: {token.token_id}", end=" ")
Batch Processing
from quicksilver_cpu import Engine, BatchProcessor, BatchRequest
engine = Engine("model.gguf")
processor = BatchProcessor(engine)
requests = [
    BatchRequest(id="1", prompt_tokens=[1, 2, 3], max_tokens=20),
    BatchRequest(id="2", prompt_tokens=[4, 5, 6], max_tokens=20),
]
results, metrics = processor.process_batch(requests)
print(f"Processed {len(results)} requests at {metrics.avg_tokens_per_second:.1f} tok/s")
Benchmarking
from quicksilver_cpu import benchmark
tok_per_sec = benchmark("model.gguf", n_tokens=100)
print(f"Speed: {tok_per_sec:.1f} tok/s")
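Tokens-per-second figures depend on how the timing is done. The sketch below shows one common way to measure decode throughput: run a few warm-up steps, then time a fixed number of steps with `time.perf_counter`. `measure_tok_per_sec` and `fake_step` are hypothetical stand-ins, not part of `quicksilver_cpu`, and this is not necessarily how `benchmark` is implemented.

```python
import time

def measure_tok_per_sec(step, n_tokens=100, warmup=5):
    """Time n_tokens calls to step() and return tokens per second."""
    for _ in range(warmup):          # warm caches before timing
        step()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_step():
    # Hypothetical stand-in for one decode step of a real engine
    sum(i * i for i in range(1000))

rate = measure_tok_per_sec(fake_step, n_tokens=50)
print(f"{rate:.1f} tok/s")
```

Warm-up steps matter on CPUs because the first iterations pay for page faults and cold caches, which would otherwise drag the average down.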
CPU Configuration
from quicksilver_cpu import configure_threads, get_cpu_info, print_cpu_info
# Show CPU info
print_cpu_info()
# Configure optimal threading
config = configure_threads(num_threads=8, bind_cores=True)
print(f"Using {config.num_threads} threads")
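A reasonable starting point for `num_threads` is the number of logical CPUs, clamped so that a request never exceeds what the machine has. The sketch below uses only the standard library. `pick_thread_count` is a hypothetical helper, and whether Quicksilver CPU honors `OMP_NUM_THREADS` is an assumption; OpenMP runtimes generally read that variable when they initialize, so it would need to be set before the extension module is imported.

```python
import os

def pick_thread_count(requested=None):
    """Clamp a requested thread count to the logical CPUs available."""
    available = os.cpu_count() or 1   # cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))

n = pick_thread_count(8)
# Assumption: the OpenMP runtime picks this up when it initializes
os.environ["OMP_NUM_THREADS"] = str(n)
print(f"Would run with {n} threads")
```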
Prompt Caching
from quicksilver_cpu import PromptCache
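# Example token IDs (placeholders); in practice these come from your tokenizer
system_prompt_tokens = [1, 2, 3, 4]
new_prompt_tokens = system_prompt_tokens + [5, 6]
# Cache repeated prompts for faster inference
cache = PromptCache(max_entries=100)
# Store prompt state
cache.put(system_prompt_tokens, cache_len=len(system_prompt_tokens))
# Find matching prefix for new prompts
match, prefix_len = cache.find_prefix_match(new_prompt_tokens)
if match:
    print(f"Reusing {prefix_len} cached tokens!")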
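Prefix matching is what makes a prompt cache useful: if a new prompt starts with tokens that were already processed, their KV state can be reused and only the remainder needs a forward pass. The following is a minimal pure-Python sketch of a longest-shared-prefix search. `longest_prefix_match` is a hypothetical helper for illustration, and the real `find_prefix_match` may use a different data structure or matching rule.

```python
def longest_prefix_match(cached_prompts, new_tokens):
    """Return (best_cached_prompt, shared_prefix_length)."""
    best, best_len = None, 0
    for cached in cached_prompts:
        n = 0
        for a, b in zip(cached, new_tokens):
            if a != b:
                break
            n += 1
        if n > best_len:
            best, best_len = cached, n
    return best, best_len

cached = [[1, 2, 3, 4], [1, 2, 9]]
match, prefix_len = longest_prefix_match(cached, [1, 2, 3, 7, 8])
print(match, prefix_len)  # [1, 2, 3, 4] 3
```

A linear scan like this is fine for a few hundred entries; a trie keyed on token IDs would scale better for large caches.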
KV Cache Compression
from quicksilver_cpu import KVCacheManager
# Use int8 compression for 3.9x memory savings
kv_cache = KVCacheManager(
    num_layers=32,
    num_kv_heads=8,
    head_dim=64,
    max_seq_len=4096,
    use_int8=True,  # 3.9x compression
)
print(f"Memory: {kv_cache.memory_usage_mb():.1f} MB")
print(f"Compression: {kv_cache.compression_ratio():.1f}x")
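To see where a figure close to 3.9x can come from, the arithmetic below sizes a KV cache with the same shape as the `KVCacheManager` arguments above. It assumes an fp32 baseline and one fp16 scale stored per head vector in the int8 layout. Both are assumptions made for illustration; the actual storage layout of `KVCacheManager` is not documented here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_seq_len,
                   bytes_per_value, scale_bytes_per_vector=0):
    """Bytes needed for keys + values across all layers."""
    vectors = 2 * num_layers * num_kv_heads * max_seq_len  # 2 = K and V
    return vectors * (head_dim * bytes_per_value + scale_bytes_per_vector)

shape = dict(num_layers=32, num_kv_heads=8, head_dim=64, max_seq_len=4096)
fp32 = kv_cache_bytes(**shape, bytes_per_value=4)
# Assumed int8 layout: 1 byte per value plus a 2-byte fp16 scale per vector
int8 = kv_cache_bytes(**shape, bytes_per_value=1, scale_bytes_per_vector=2)

print(f"fp32: {fp32 / 2**20:.0f} MiB")    # fp32: 512 MiB
print(f"int8: {int8 / 2**20:.0f} MiB")    # int8: 132 MiB
print(f"ratio: {fp32 / int8:.2f}x")       # ratio: 3.88x
```

The ratio falls short of a clean 4x because each quantized vector carries its own scale factor.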
Profiling
from quicksilver_cpu import get_profiler, Engine
engine = Engine("model.gguf")
profiler = get_profiler()
profiler.start("inference")
tokens = engine.generate([1, 2, 3], max_tokens=50)
profiler.stop("inference")
profiler.print_report()
CLI Usage
# Show model information
quicksilver-cpu info -m model.gguf
# Benchmark inference speed
quicksilver-cpu benchmark -m model.gguf -n 100 --threads 8
# Generate text
quicksilver-cpu generate -m model.gguf -p "Hello world" --max-tokens 50 --stream
Supported Quantization Types
| Type | Bits/Weight | Block Size | Status |
|---|---|---|---|
| Q4_K | 4.5 | 256 | ✅ AVX2 optimized |
| Q5_0 | 5.5 | 32 | ✅ Supported |
| Q6_K | 6.6 | 256 | ✅ AVX2 optimized |
| Q8_0 | 8.5 | 32 | ✅ Supported |
| Q4_0 | 4.5 | 32 | ✅ Supported |
| Q2_K | 2.6 | 256 | ✅ Supported |
| Q3_K | 3.4 | 256 | ✅ Supported |
| Q5_K | 5.5 | 256 | ✅ Supported |
| F16 | 16 | 1 | ✅ Supported |
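Bits per weight follows directly from the block layout: total bytes in a block times 8, divided by the number of weights the block encodes. The sketch below assumes the block byte sizes of the reference ggml layouts (for example, a Q4_0 block is a 2-byte fp16 scale plus 16 bytes of packed 4-bit values). The table lists rounded values.

```python
# (weights per block, bytes per block), assuming the reference ggml layouts
BLOCKS = {
    "Q4_0": (32, 18),
    "Q5_0": (32, 22),
    "Q8_0": (32, 34),
    "Q2_K": (256, 84),
    "Q3_K": (256, 110),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}

def bits_per_weight(name):
    """Effective storage cost per weight, including scales and mins."""
    weights, nbytes = BLOCKS[name]
    return nbytes * 8 / weights

for name in BLOCKS:
    print(f"{name}: {bits_per_weight(name):.4f}")
```

The fractional part is the per-block overhead for scales and minimums, which is why a nominally 4-bit format costs 4.5 bits per weight.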
Performance
Benchmarked on Intel Core i7-9750H with SmolLM2-135M Q4_K_M:
| Engine | Tokens/sec | Speedup |
|---|---|---|
| llama.cpp | 43 | 1.0x |
| Quicksilver CPU | 95.7 | 2.22x |
Key Optimizations
- AVX2 SIMD - 8-wide FMA operations for Q4_K/Q6_K GEMV
- Fused Operations - Combined gate+up projections for better cache locality
- OpenMP Parallelization - Multi-threaded layer computations
- Int8 KV Cache - 3.9x memory compression with minimal quality loss
- Prompt Caching - Reuse computations for repeated prefixes
- Memory Alignment - 64-byte aligned allocations for SIMD efficiency
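As an illustration of the int8 KV cache idea, here is a minimal sketch of symmetric absmax quantization in pure Python. It is a generic textbook scheme, not necessarily the one Quicksilver CPU uses, and `quantize_int8` / `dequantize_int8` are hypothetical names.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: returns (int8 codes, scale)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

vec = [0.5, -1.27, 0.0, 0.31]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, f"max error {max_err:.4f}")
```

With round-to-nearest, the reconstruction error per element is bounded by half the scale, which is why quality loss stays small when values in a vector have similar magnitudes.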
Requirements
- Python 3.9+
- NumPy
- C++17 compiler with AVX2 support
- pybind11 (for building)
Platform Support
| Platform | Status |
|---|---|
| macOS (Apple Silicon) | ✅ Tested |
| macOS (Intel) | ✅ Supported |
| Linux (x86_64) | ✅ Supported |
| Windows | ⚠️ Experimental |
API Reference
Engine
Engine(model_path: str, verbose: bool = True)
- generate(prompt_tokens, max_tokens, temperature, top_p) - Generate tokens
- forward(token_id) - Single forward pass, returns logits
- reset_cache() - Clear KV cache
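The `temperature` and `top_p` parameters are not described further here. In most LLM runtimes they control temperature scaling and nucleus (top-p) sampling over the logits of a forward pass. The sketch below shows that conventional technique in pure Python; `sample_top_p` is a hypothetical helper, and the engine's own sampler may differ in details such as tie-breaking or how a temperature of zero is handled.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample a token id using temperature scaling and nucleus filtering."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, most likely tokens first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.0, 0.1, -3.0]
token_id = sample_top_p(logits, temperature=0.8, top_p=0.9,
                        rng=random.Random(0))
print(token_id)
```

Lower temperatures sharpen the distribution toward the most likely token, while a smaller `top_p` cuts off the low-probability tail before sampling.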
StreamingGenerator
StreamingGenerator(engine, tokenizer=None, default_max_tokens=256)
- stream(prompt_tokens, max_tokens, temperature, top_p) - Yield tokens
- stream_async(...) - Async version
- stop() - Request early stop
BatchProcessor
BatchProcessor(engine, tokenizer=None)
- process_batch(requests, progress_callback) - Process multiple requests
License
Apache 2.0
Contributing
Contributions welcome! Please open an issue or PR on GitHub.
Related Projects
- Quicksilver - Full inference engine with GPU support
- llama.cpp - Original GGUF implementation