High-performance CPU inference for GGUF quantized models

Quicksilver CPU ⚡

Quicksilver CPU is a lightweight, standalone inference engine optimized for running quantized LLMs on CPUs. Using AVX2/AVX-512 SIMD kernels, it reaches 95.7 tokens/second on SmolLM2-135M Q4_K_M (Intel Core i7-9750H), about 2.2x the 43 tokens/second measured for llama.cpp in the same benchmark.

Features

  • 🚀 Blazing Fast: 95+ tok/s on SmolLM2-135M Q4_K_M with an Intel Core i7-9750H (2.2x the throughput of llama.cpp)
  • 📦 Lightweight: Minimal dependencies, pure CPU focus
  • 🔧 Native GGUF: Direct parsing without external libraries
  • SIMD Optimized: AVX2/AVX-512 + OpenMP parallelization
  • 🎯 Quantization Support: Q4_K, Q5_0, Q6_K, Q8_0, and more
  • 🔄 Streaming: Token-by-token generation with callbacks
  • 📊 Batch Processing: Efficient multi-request handling
  • 🧠 Smart Caching: Prompt cache + int8 KV compression (3.9x)
  • 🛠️ CLI Tools: Benchmark, info, and generation commands
  • 📈 Profiling: Built-in performance diagnostics

Installation

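From PyPI

pip install quicksilver-cpu

Version 0.1.0 is published as a source distribution and as a CPython 3.11 wheel for macOS 15.0+ (x86-64). On other platforms pip builds from source, which needs the build requirements listed under Requirements.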

From Source

git clone https://github.com/kossisoroyce/quicksilver-cpu.git
cd quicksilver-cpu

# Install dependencies
pip install pybind11 numpy

# Build and install
pip install -e .

# Or build just the C++ kernel
cd quicksilver_cpu/csrc
python setup.py build_ext --inplace

Quick Start

Basic Inference

from quicksilver_cpu import Engine

# Load model
engine = Engine("model.gguf")

# Generate up to 50 tokens; the prompt is a list of token IDs (placeholder values here)
tokens = engine.generate([1, 2, 3], max_tokens=50)
print(f"Generated: {tokens}")

Streaming Generation

from quicksilver_cpu import Engine, StreamingGenerator

engine = Engine("model.gguf")
generator = StreamingGenerator(engine)

for token in generator.stream(prompt_tokens=[1, 2, 3], max_tokens=50):
    print(f"Token: {token.token_id}", end=" ")

Batch Processing

from quicksilver_cpu import Engine, BatchProcessor, BatchRequest

engine = Engine("model.gguf")
processor = BatchProcessor(engine)

requests = [
    BatchRequest(id="1", prompt_tokens=[1, 2, 3], max_tokens=20),
    BatchRequest(id="2", prompt_tokens=[4, 5, 6], max_tokens=20),
]

results, metrics = processor.process_batch(requests)
print(f"Processed {len(results)} requests at {metrics.avg_tokens_per_second:.1f} tok/s")

Benchmarking

from quicksilver_cpu import benchmark

tok_per_sec = benchmark("model.gguf", n_tokens=100)
print(f"Speed: {tok_per_sec:.1f} tok/s")
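
The figure returned above is a throughput: generated tokens divided by wall-clock time. As a rough illustration of that measurement, the sketch below times a stand-in generation function. `measure_tokens_per_second` and `fake_generate` are hypothetical names used only here; they are not part of quicksilver_cpu, and the real `benchmark` may measure differently (for example with warm-up runs).

```python
import time
from typing import Callable, List


def measure_tokens_per_second(
    generate: Callable[[List[int], int], List[int]],
    prompt_tokens: List[int],
    n_tokens: int,
) -> float:
    """Time one generation call and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt_tokens, n_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt_tokens: List[int], max_tokens: int) -> List[int]:
    """Hypothetical stand-in for a model: sleeps about 1 ms per token."""
    out = []
    for i in range(max_tokens):
        time.sleep(0.001)
        out.append(i)
    return out


rate = measure_tokens_per_second(fake_generate, [1, 2, 3], 20)
print(f"Speed: {rate:.1f} tok/s")
```

To time a real model, pass a small wrapper around your engine's generate call in place of `fake_generate`.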

CPU Configuration

from quicksilver_cpu import configure_threads, get_cpu_info, print_cpu_info

# Show CPU info
print_cpu_info()

# Configure optimal threading
config = configure_threads(num_threads=8, bind_cores=True)
print(f"Using {config.num_threads} threads")
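
When choosing a value for `num_threads`, it can help to know how many CPUs the current process is actually allowed to use. The sketch below relies only on the Python standard library; `pick_thread_count` is a hypothetical helper, not a quicksilver_cpu function, and it does not model physical versus logical cores.

```python
import os
from typing import Optional


def pick_thread_count(requested: Optional[int] = None) -> int:
    """Clamp a requested thread count to the CPUs available to this process."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects taskset/cgroup CPU affinity masks.
        available = len(os.sched_getaffinity(0))
    else:
        # macOS/Windows: fall back to the logical CPU count.
        available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


print(f"Would use {pick_thread_count(8)} threads")
```

The result can then be passed as `num_threads` to `configure_threads`.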

Prompt Caching

from quicksilver_cpu import PromptCache

# Placeholder token IDs; in practice these come from your tokenizer
system_prompt_tokens = [1, 2, 3]
new_prompt_tokens = system_prompt_tokens + [4, 5]

# Cache repeated prompts for faster inference
cache = PromptCache(max_entries=100)

# Store prompt state
cache.put(system_prompt_tokens, cache_len=len(system_prompt_tokens))

# Find the longest cached prefix of a new prompt
match, prefix_len = cache.find_prefix_match(new_prompt_tokens)
if match:
    print(f"Reusing {prefix_len} cached tokens!")
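
To make the idea of prefix matching concrete, here is a minimal pure-Python sketch of a longest-prefix lookup. `TinyPrefixCache` is a hypothetical illustration, not the library's `PromptCache`; how Quicksilver stores entries, evicts them, or ties them to KV state is not described here.

```python
from typing import Dict, List, Optional, Tuple


class TinyPrefixCache:
    """Hypothetical illustration: maps cached token sequences to a state handle."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], object] = {}

    def put(self, tokens: List[int], state: object) -> None:
        """Remember the state reached after processing `tokens`."""
        self._entries[tuple(tokens)] = state

    def find_prefix_match(self, tokens: List[int]) -> Tuple[Optional[object], int]:
        """Return (state, length) for the longest cached sequence that prefixes `tokens`."""
        best_state, best_len = None, 0
        for cached, state in self._entries.items():
            n = len(cached)
            # A cached entry matches only if the prompt starts with all of it.
            if n > best_len and tuple(tokens[:n]) == cached:
                best_state, best_len = state, n
        return best_state, best_len


demo_cache = TinyPrefixCache()
demo_cache.put([1, 2], "short-state")
demo_cache.put([1, 2, 3], "system-state")
state, prefix_len = demo_cache.find_prefix_match([1, 2, 3, 4, 5])
print(state, prefix_len)
```

With a match of length `prefix_len`, only the remaining `len(tokens) - prefix_len` tokens need a forward pass. The linear scan keeps the sketch short; a trie would scale better with many entries.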

KV Cache Compression

from quicksilver_cpu import KVCacheManager

# Use int8 compression for 3.9x memory savings
kv_cache = KVCacheManager(
    num_layers=32,
    num_kv_heads=8,
    head_dim=64,
    max_seq_len=4096,
    use_int8=True,  # 3.9x compression
)

print(f"Memory: {kv_cache.memory_usage_mb():.1f} MB")
print(f"Compression: {kv_cache.compression_ratio():.1f}x")
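
One way a ratio close to 3.9x can arise is sketched below with NumPy. The assumptions are mine, not stated by the project: a float32 baseline, symmetric int8 quantization, and one float16 scale per 64-element head vector. Under those assumptions each vector shrinks from 256 bytes to 66 bytes, a ratio of about 3.88. `quantize_int8` and `dequantize_int8` are hypothetical helpers, not the `KVCacheManager` internals.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization with one float16 scale per last-axis vector."""
    max_abs = np.max(np.abs(x), axis=-1, keepdims=True)
    scale = (max_abs / 127.0).astype(np.float16)
    # Guard against all-zero vectors, which would give a zero scale.
    scale = np.where(scale == 0, np.float16(1.0), scale).astype(np.float16)
    q = np.rint(x / scale.astype(np.float32))
    return np.clip(q, -127, 127).astype(np.int8), scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float32 values from int8 codes and scales."""
    return q.astype(np.float32) * scale.astype(np.float32)


rng = np.random.default_rng(0)
# Toy cache slice: (kv_heads, positions, head_dim), far smaller than a real cache.
keys = rng.standard_normal((8, 16, 64)).astype(np.float32)

q, scale = quantize_int8(keys)
restored = dequantize_int8(q, scale)
ratio = keys.nbytes / (q.nbytes + scale.nbytes)

print(f"Compression: {ratio:.2f}x")
print(f"Max abs error: {float(np.max(np.abs(keys - restored))):.4f}")
```

The per-element error is bounded by roughly half a quantization step (half the vector's scale), which is why int8 caches usually cost little quality.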

Profiling

from quicksilver_cpu import get_profiler, Engine

engine = Engine("model.gguf")
profiler = get_profiler()

profiler.start("inference")
tokens = engine.generate([1, 2, 3], max_tokens=50)
profiler.stop("inference")

profiler.print_report()

CLI Usage

# Show model information
quicksilver-cpu info -m model.gguf

# Benchmark inference speed
quicksilver-cpu benchmark -m model.gguf -n 100 --threads 8

# Generate text
quicksilver-cpu generate -m model.gguf -p "Hello world" --max-tokens 50 --stream
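
Any tool that reports model information has to start with the fixed-size GGUF header. The sketch below reads it with the standard library only, following the public GGUF layout for version 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian). `read_gguf_header` is a hypothetical helper; it is not the parser Quicksilver ships and it does not decode metadata or tensors.

```python
import io
import struct
from typing import BinaryIO, Dict


def read_gguf_header(f: BinaryIO) -> Dict[str, int]:
    """Read the fixed-size GGUF header (little-endian, GGUF version 2 or later)."""
    raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", raw[4:])
    if version < 2:
        # Version 1 files used 32-bit counts, so this layout does not apply.
        raise ValueError(f"unsupported GGUF version {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for the demo; with a real file use open("model.gguf", "rb").
demo_bytes = b"GGUF" + struct.pack("<IQQ", 3, 2, 5)
header = read_gguf_header(io.BytesIO(demo_bytes))
print(header)
```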

Supported Quantization Types

Type   Bits/Weight   Block Size   Status
Q4_K   4.5           256          ✅ AVX2 optimized
Q5_0   5.5           32           ✅ Supported
Q6_K   6.5625        256          ✅ AVX2 optimized
Q8_0   8.5           32           ✅ Supported
Q4_0   4.5           32           ✅ Supported
Q2_K   2.625         256          ✅ Supported
Q3_K   3.4375        256          ✅ Supported
Q5_K   5.5           256          ✅ Supported
F16    16            1            ✅ Supported
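
The Bits/Weight column follows from the block layout: bytes per block times 8, divided by weights per block. The sketch below does that arithmetic using the block sizes of the ggml block structs, on the assumption that Quicksilver reads the same on-disk layouts. `BLOCK_LAYOUTS`, `bits_per_weight`, and `tensor_bytes` are hypothetical names for this example.

```python
# (weights per block, bytes per block), taken from ggml's block struct sizes.
# Assumption: Quicksilver parses these same layouts from GGUF files.
BLOCK_LAYOUTS = {
    "Q4_0": (32, 18),    # fp16 scale + 16 bytes of 4-bit values
    "Q5_0": (32, 22),    # fp16 scale + 4 bytes of high bits + 16 bytes of nibbles
    "Q8_0": (32, 34),    # fp16 scale + 32 int8 values
    "Q2_K": (256, 84),
    "Q3_K": (256, 110),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}


def bits_per_weight(quant_type: str) -> float:
    """Effective storage cost per weight, including scales and other metadata."""
    block_size, block_bytes = BLOCK_LAYOUTS[quant_type]
    return block_bytes * 8 / block_size


def tensor_bytes(quant_type: str, n_weights: int) -> int:
    """Storage for a tensor whose weight count is a multiple of the block size."""
    block_size, block_bytes = BLOCK_LAYOUTS[quant_type]
    if n_weights % block_size != 0:
        raise ValueError("weight count must be a multiple of the block size")
    return (n_weights // block_size) * block_bytes


for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name)} bits/weight")
```

For example, a 4096 x 4096 matrix in Q4_K takes `tensor_bytes("Q4_K", 4096 * 4096)` bytes, which is 4.5 bits for each of its weights.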

Performance

Benchmarked on Intel Core i7-9750H with SmolLM2-135M Q4_K_M:

Engine            Tokens/sec   Speedup
llama.cpp         43           1.0x
Quicksilver CPU   95.7         2.22x

Key Optimizations

  1. AVX2 SIMD - 8-wide FMA operations for Q4_K/Q6_K GEMV
  2. Fused Operations - Combined gate+up projections for better cache locality
  3. OpenMP Parallelization - Multi-threaded layer computations
  4. Int8 KV Cache - 3.9x memory compression with minimal quality loss
  5. Prompt Caching - Reuse computations for repeated prefixes
  6. Memory Alignment - 64-byte aligned allocations for SIMD efficiency
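
The fused gate+up idea from the list above can be illustrated in NumPy. The sketch assumes a SwiGLU-style feed-forward block (as used by LLaMA-family models), where the gate and up projections both read the same input vector; stacking their weight matrices turns two matrix-vector products into one. This shows only the arithmetic equivalence, not Quicksilver's C++ kernels, and all names and sizes are made up for the demo.

```python
import numpy as np


def silu(v: np.ndarray) -> np.ndarray:
    """SiLU activation, x * sigmoid(x)."""
    return v / (1.0 + np.exp(-v))


def ffn_separate(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """Two matrix-vector products: the input vector is read twice."""
    return silu(w_gate @ x) * (w_up @ x)


def ffn_fused(x: np.ndarray, w_gate_up: np.ndarray, ffn_dim: int) -> np.ndarray:
    """One product against the stacked matrix, then split the result."""
    both = w_gate_up @ x
    return silu(both[:ffn_dim]) * both[ffn_dim:]


rng = np.random.default_rng(0)
hidden_dim, ffn_dim = 64, 256  # toy sizes, chosen only for the demo
x = rng.standard_normal(hidden_dim).astype(np.float32)
w_gate = rng.standard_normal((ffn_dim, hidden_dim)).astype(np.float32)
w_up = rng.standard_normal((ffn_dim, hidden_dim)).astype(np.float32)

# Stack once at load time so inference makes a single pass over x.
w_gate_up = np.concatenate([w_gate, w_up], axis=0)

separate = ffn_separate(x, w_gate, w_up)
fused = ffn_fused(x, w_gate_up, ffn_dim)
print("max abs difference:", float(np.max(np.abs(separate - fused))))
```

Both paths compute the same values up to floating-point rounding; the benefit of fusing is that the input stays hot in cache and loop overhead is paid once.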

Requirements

  • Python 3.9+
  • NumPy
  • C++17 compiler with AVX2 support
  • pybind11 (for building)

Platform Support

Platform                Status
macOS (Apple Silicon)   ✅ Tested
macOS (Intel)           ✅ Supported
Linux (x86_64)          ✅ Supported
Windows                 ⚠️ Experimental

API Reference

Engine

Engine(model_path: str, verbose: bool = True)
  • generate(prompt_tokens, max_tokens, temperature, top_p) - Generate tokens
  • forward(token_id) - Single forward pass, returns logits
  • reset_cache() - Clear KV cache
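
For readers driving `forward` directly, the sketch below shows the common definition of temperature plus top-p (nucleus) sampling over a logits vector. It is a generic NumPy illustration; whether `generate` implements `temperature` and `top_p` exactly this way is not stated here, and `sample_top_p` is a hypothetical helper.

```python
from typing import Optional

import numpy as np


def sample_top_p(
    logits: np.ndarray,
    temperature: float = 1.0,
    top_p: float = 0.9,
    rng: Optional[np.random.Generator] = None,
) -> int:
    """Sample a token ID using temperature scaling and nucleus (top-p) filtering."""
    if temperature <= 0:
        return int(np.argmax(logits))  # treat as greedy decoding
    if rng is None:
        rng = np.random.default_rng()

    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for a numerically stable softmax
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


demo_logits = np.array([0.0, 5.0, 1.0])
print(sample_top_p(demo_logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng(0)))
```

Lower temperatures sharpen the distribution and lower `top_p` values shrink the candidate set; both make output more deterministic.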

StreamingGenerator

StreamingGenerator(engine, tokenizer=None, default_max_tokens=256)
  • stream(prompt_tokens, max_tokens, temperature, top_p) - Yield tokens
  • stream_async(...) - Async version
  • stop() - Request early stop

BatchProcessor

BatchProcessor(engine, tokenizer=None)
  • process_batch(requests, progress_callback) - Process multiple requests

License

Apache 2.0

Contributing

Contributions welcome! Please open an issue or PR on GitHub.
