Quicksilver CPU ⚡

High-performance CPU inference for GGUF quantized models

Quicksilver CPU is a lightweight, standalone inference engine for running quantized LLMs on CPUs. Using AVX2/AVX-512 SIMD kernels, it reaches 95.7 tokens/second on an Intel Core i7-9750H with SmolLM2-135M Q4_K_M, about 2.2x the 43 tokens/second measured for llama.cpp in the same benchmark.

Features

  • 🚀 Blazing Fast: 95+ tok/s on an Intel Core i7-9750H (about 2.2x the speed of llama.cpp, see Performance)
  • 📦 Lightweight: Minimal dependencies, pure CPU focus
  • 🔧 Native GGUF: Direct parsing without external libraries
  • ⚡ SIMD Optimized: AVX2/AVX-512 + OpenMP parallelization
  • 🎯 Quantization Support: Q4_K, Q5_0, Q6_K, Q8_0, and more
  • 🔄 Streaming: Token-by-token generation with callbacks
  • 📊 Batch Processing: Efficient multi-request handling
  • 🧠 Smart Caching: Prompt cache + int8 KV compression (3.9x)
  • 🛠️ CLI Tools: Benchmark, info, and generation commands
  • 📈 Profiling: Built-in performance diagnostics
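
To illustrate what "direct parsing" of GGUF involves, here is a minimal sketch that reads the fixed-size file header with only the standard library. It assumes the GGUF v2/v3 little-endian layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count). The function name `read_gguf_header` is hypothetical and is not part of the Quicksilver CPU API.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumed v2/v3, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration. With a real model you would pass
# the first HEADER_SIZE bytes of the file instead.
demo_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 2, 5)
info = read_gguf_header(demo_header)
print(info)
```

The metadata key-value pairs and tensor descriptors follow the header and need a type-dispatching reader, which is omitted here.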

Installation

From PyPI

pip install quicksilver-cpu

From Source

git clone https://github.com/kossisoroyce/quicksilver-cpu.git
cd quicksilver-cpu

# Install dependencies
pip install pybind11 numpy

# Build and install
pip install -e .

# Or build just the C++ kernel
cd quicksilver_cpu/csrc
python setup.py build_ext --inplace

Quick Start

Basic Inference

from quicksilver_cpu import Engine

# Load model
engine = Engine("model.gguf")

# Generate tokens
tokens = engine.generate([1, 2, 3], max_tokens=50)
print(f"Generated: {tokens}")

Streaming Generation

from quicksilver_cpu import Engine, StreamingGenerator

engine = Engine("model.gguf")
generator = StreamingGenerator(engine)

for token in generator.stream(prompt_tokens=[1, 2, 3], max_tokens=50):
    print(f"Token: {token.token_id}", end=" ")

Batch Processing

from quicksilver_cpu import Engine, BatchProcessor, BatchRequest

engine = Engine("model.gguf")
processor = BatchProcessor(engine)

requests = [
    BatchRequest(id="1", prompt_tokens=[1, 2, 3], max_tokens=20),
    BatchRequest(id="2", prompt_tokens=[4, 5, 6], max_tokens=20),
]

results, metrics = processor.process_batch(requests)
print(f"Processed {len(results)} requests at {metrics.avg_tokens_per_second:.1f} tok/s")

Benchmarking

from quicksilver_cpu import benchmark

tok_per_sec = benchmark("model.gguf", n_tokens=100)
print(f"Speed: {tok_per_sec:.1f} tok/s")
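
If you want to cross-check a throughput number independently, a timing loop like the one below is usually enough. This is a generic sketch, not the implementation of `benchmark`: `measure_tokens_per_second` and `fake_step` are hypothetical names, and `fake_step` merely stands in for one decode step.

```python
import time
from typing import Callable


def measure_tokens_per_second(step: Callable[[], None], n_tokens: int, warmup: int = 3) -> float:
    """Time n_tokens calls to step() and return calls per second."""
    for _ in range(warmup):  # warm caches and lazy initialization before timing
        step()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def fake_step() -> None:
    """Placeholder workload standing in for a single-token forward pass."""
    sum(i * i for i in range(1000))


rate = measure_tokens_per_second(fake_step, n_tokens=100)
print(f"{rate:.1f} steps/s")
```

With a real engine, `step` would wrap one call that produces a single token, so that the rate is comparable to a tokens/second figure.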

CPU Configuration

from quicksilver_cpu import configure_threads, get_cpu_info, print_cpu_info

# Show CPU info
print_cpu_info()

# Configure optimal threading
config = configure_threads(num_threads=8, bind_cores=True)
print(f"Using {config.num_threads} threads")
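
When choosing a value for `num_threads`, it helps to know how many logical CPUs the process is actually allowed to use. The sketch below uses only the standard library; `usable_cpu_count` is a hypothetical helper, and whether Quicksilver CPU honors `OMP_NUM_THREADS` is an assumption based on its use of OpenMP.

```python
import os


def usable_cpu_count() -> int:
    """Number of logical CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux, not on macOS or Windows
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


n_threads = usable_cpu_count()

# OpenMP runtimes read this variable when they initialize, so it has to be
# set before the native extension is imported. setdefault keeps any value
# the user already exported.
os.environ.setdefault("OMP_NUM_THREADS", str(n_threads))
print(f"logical CPUs available: {n_threads}")
```

On machines with SMT/Hyper-Threading, the number of physical cores is often a better starting point than the logical count, so treat the result as an upper bound.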

Prompt Caching

from quicksilver_cpu import PromptCache

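# Example token IDs; in practice these come from your tokenizer
system_prompt_tokens = [1, 2, 3, 4]
new_prompt_tokens = system_prompt_tokens + [5, 6]

# Cache repeated prompts for faster inference
cache = PromptCache(max_entries=100)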

# Store prompt state
cache.put(system_prompt_tokens, cache_len=len(system_prompt_tokens))

# Find matching prefix for new prompts
match, prefix_len = cache.find_prefix_match(new_prompt_tokens)
if match:
    print(f"Reusing {prefix_len} cached tokens!")
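
The idea behind prefix matching can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (a cached entry only counts if the whole entry is a prefix of the new prompt, and lookup is a linear scan); `longest_cached_prefix` is a hypothetical name and does not describe how `PromptCache` is implemented.

```python
from typing import List, Optional, Tuple

TokenSeq = Tuple[int, ...]


def longest_cached_prefix(cached: List[TokenSeq], prompt: List[int]) -> Tuple[Optional[TokenSeq], int]:
    """Return the longest cached sequence that is a prefix of prompt, and its length."""
    best: Optional[TokenSeq] = None
    best_len = 0
    for entry in cached:
        n = len(entry)
        # Only entries longer than the current best and no longer than the prompt can win
        if best_len < n <= len(prompt) and tuple(prompt[:n]) == entry:
            best, best_len = entry, n
    return best, best_len


cached_prompts = [(1, 2, 3), (1, 2, 3, 4, 5), (9, 9)]
match, prefix_len = longest_cached_prefix(cached_prompts, [1, 2, 3, 4, 5, 6, 7])
print(match, prefix_len)
```

After a hit, only the tokens past `prefix_len` need a forward pass, which is where the speedup for repeated system prompts comes from.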

KV Cache Compression

from quicksilver_cpu import KVCacheManager

# Use int8 compression for 3.9x memory savings
kv_cache = KVCacheManager(
    num_layers=32,
    num_kv_heads=8,
    head_dim=64,
    max_seq_len=4096,
    use_int8=True,  # 3.9x compression
)

print(f"Memory: {kv_cache.memory_usage_mb():.1f} MB")
print(f"Compression: {kv_cache.compression_ratio():.1f}x")
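
For intuition about where a figure close to 3.9x can come from, the sketch below quantizes vectors to int8 with one scale per vector and works out the memory arithmetic for the configuration above. The storage layout is an assumption (fp32 baseline, one fp16 scale per 64-element head vector); the library's actual scheme may differ. `quantize_int8`, `dequantize_int8`, and `kv_cache_bytes` are hypothetical helpers.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization with one fp16 scale per row (last axis)."""
    absmax = np.abs(x).max(axis=-1, keepdims=True)
    scale = (absmax / 127.0).astype(np.float16)
    # Guard against all-zero rows and fp16 underflow so the division is safe
    scale = np.maximum(scale, np.finfo(np.float16).tiny)
    q = np.clip(np.rint(x / scale.astype(np.float32)), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)


def kv_cache_bytes(num_layers, num_kv_heads, max_seq_len, bytes_per_vector):
    # Factor 2 accounts for storing both keys and values
    return 2 * num_layers * num_kv_heads * max_seq_len * bytes_per_vector


head_dim = 64
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, head_dim)).astype(np.float32)

q, scale = quantize_int8(keys)
max_error = float(np.abs(dequantize_int8(q, scale) - keys).max())

fp32_bytes = kv_cache_bytes(32, 8, 4096, head_dim * 4)      # 4 bytes per element
int8_bytes = kv_cache_bytes(32, 8, 4096, head_dim * 1 + 2)  # 1 byte per element + fp16 scale
ratio = fp32_bytes / int8_bytes

print(f"fp32: {fp32_bytes / 2**20:.1f} MiB, int8: {int8_bytes / 2**20:.1f} MiB")
print(f"ratio: {ratio:.2f}x, max round-trip error: {max_error:.4f}")
```

Under these assumptions the ratio is 256 / 66, roughly 3.88x, which is in line with the quoted 3.9x. The small gap to a full 4x is the per-vector scale overhead.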

Profiling

from quicksilver_cpu import get_profiler, Engine

engine = Engine("model.gguf")
profiler = get_profiler()

profiler.start("inference")
tokens = engine.generate([1, 2, 3], max_tokens=50)
profiler.stop("inference")

profiler.print_report()

CLI Usage

# Show model information
quicksilver-cpu info -m model.gguf

# Benchmark inference speed
quicksilver-cpu benchmark -m model.gguf -n 100 --threads 8

# Generate text
quicksilver-cpu generate -m model.gguf -p "Hello world" --max-tokens 50 --stream

Audio Support (TTS, STT, STS)

Quicksilver CPU includes full audio support for Text-to-Speech, Speech-to-Text, and Speech-to-Speech processing.

Text-to-Speech (TTS)

from quicksilver_cpu.audio import TTSEngine, TTSConfig

config = TTSConfig(
    model_path="qwen3-tts.gguf",
    sample_rate=24000,
    temperature=0.7,
)
engine = TTSEngine(config)

# Generate speech
result = engine.synthesize("Hello, welcome to Quicksilver!")
result.save("output.wav")

# Streaming generation
for chunk in engine.stream("Long text for streaming..."):
    play_audio(chunk)  # play_audio is a placeholder for your own playback function

Speech-to-Text (STT)

from quicksilver_cpu.audio import STTEngine, STTConfig

config = STTConfig(
    model_path="whisper.gguf",
    language="en",  # Auto-detect if None
)
engine = STTEngine(config)

# Transcribe audio
result = engine.transcribe("audio.wav")
print(result.text)

# Export subtitles (explicit UTF-8 so non-ASCII text survives on every platform)
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(result.to_srt())

# Streaming transcription
for segment in engine.stream("long_audio.wav"):
    print(f"[{segment.start:.1f}s] {segment.text}")
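
For reference, the SubRip (SRT) format that `to_srt()` exports is simple enough to produce by hand from (start, end, text) segments. The sketch below shows the format only; `srt_timestamp` and `segments_to_srt` are hypothetical helpers and the tuple-based segment shape is an assumption, not the library's data model.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


srt_text = segments_to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "world")])
print(srt_text)
```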

Speech-to-Speech (STS)

from quicksilver_cpu.audio import STSEngine, STSConfig

config = STSConfig(
    stt_model_path="whisper.gguf",
    tts_model_path="tts.gguf",
    source_language="es",
    target_language="en",
)
engine = STSEngine(config)

# Translate speech
result = engine.translate("spanish_audio.wav")
result.save("english_audio.wav")

# Voice conversion
result = engine.convert_voice("input.wav", target_voice="voice_sample.wav")

# Real-time streaming translation
for chunk in engine.stream("live_audio.wav"):
    play_audio(chunk)  # play_audio is a placeholder for your own playback function

Audio Utilities

from quicksilver_cpu.audio import load_audio, save_audio, AudioBuffer

# Load/save audio
audio, sr = load_audio("input.wav", target_sr=24000)
save_audio(audio, "output.wav", sample_rate=24000)

# Streaming buffer
buffer = AudioBuffer(sample_rate=24000)
buffer.append(audio)  # append each chunk as it arrives; here, the array loaded above
full_audio = buffer.get_audio()

Supported Quantization Types

Type   Bits/Weight   Block Size   Status
Q4_K   4.5           256          ✅ AVX2 optimized
Q5_0   5.5           32           ✅ Supported
Q6_K   6.5           256          ✅ AVX2 optimized
Q8_0   8.5           32           ✅ Supported
Q4_0   4.5           32           ✅ Supported
Q2_K   2.5           256          ✅ Supported
Q3_K   3.4           256          ✅ Supported
Q5_K   5.5           256          ✅ Supported
F16    16            1            ✅ Supported
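
The fractional bits-per-weight values come from per-block metadata (scales and offsets) stored alongside the packed weights. The sketch below derives a few of the table entries from block byte counts. The byte counts are my reading of the ggml block layouts and should be treated as assumptions; `bits_per_weight` and `tensor_bytes` are hypothetical helpers.

```python
# type -> (weights per block, bytes per block)
BLOCK_LAYOUTS = {
    "Q4_0": (32, 2 + 16),             # fp16 scale + 32 x 4-bit values
    "Q5_0": (32, 2 + 4 + 16),         # fp16 scale + 32 high bits + 32 x 4-bit low parts
    "Q8_0": (32, 2 + 32),             # fp16 scale + 32 x int8 values
    "Q4_K": (256, 2 + 2 + 12 + 128),  # fp16 scale and min + packed sub-block scales + 4-bit values
}


def bits_per_weight(qtype: str) -> float:
    weights, block_bytes = BLOCK_LAYOUTS[qtype]
    return block_bytes * 8 / weights


def tensor_bytes(qtype: str, n_weights: int) -> int:
    """Storage for a tensor whose weight count is a multiple of the block size."""
    weights, block_bytes = BLOCK_LAYOUTS[qtype]
    if n_weights % weights != 0:
        raise ValueError(f"{qtype} needs a multiple of {weights} weights")
    return n_weights // weights * block_bytes


for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.2f} bits/weight")
```

For example, a 4096 x 4096 matrix stored as Q4_K takes `tensor_bytes("Q4_K", 4096 * 4096)` bytes, a little over half the size of the same matrix in Q8_0.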

Performance

Benchmarked on Intel Core i7-9750H with SmolLM2-135M Q4_K_M:

Engine            Tokens/sec   Speedup
llama.cpp         43           1.0x
Quicksilver CPU   95.7         2.22x

Key Optimizations

  1. AVX2 SIMD - 8-wide FMA operations for Q4_K/Q6_K GEMV
  2. Fused Operations - Combined gate+up projections for better cache locality
  3. OpenMP Parallelization - Multi-threaded layer computations
  4. Int8 KV Cache - 3.9x memory compression with minimal quality loss
  5. Prompt Caching - Reuse computations for repeated prefixes
  6. Memory Alignment - 64-byte aligned allocations for SIMD efficiency
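
The fused gate+up idea from item 2 can be shown in NumPy: stacking the two projection matrices turns two matrix-vector products into one, so the input vector is read once. This is a sketch that assumes a SwiGLU-style feed-forward block as used in Llama-family models. It demonstrates numerical equivalence only and says nothing about how the C++ kernel is written; all names are hypothetical.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))


def ffn_separate(x, w_gate, w_up):
    """Two independent GEMVs: the input x is streamed from memory twice."""
    return silu(w_gate @ x) * (w_up @ x)


def ffn_fused(x, w_gate_up):
    """One GEMV over the stacked [gate; up] matrix, then split the result."""
    hidden = w_gate_up.shape[0] // 2
    y = w_gate_up @ x
    return silu(y[:hidden]) * y[hidden:]


rng = np.random.default_rng(0)
d_model, hidden = 64, 128
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((hidden, d_model)) * 0.1
w_up = rng.standard_normal((hidden, d_model)) * 0.1

# Stack once at load time so every forward pass uses the fused layout
w_gate_up = np.concatenate([w_gate, w_up], axis=0)

out_separate = ffn_separate(x, w_gate, w_up)
out_fused = ffn_fused(x, w_gate_up)
print(np.allclose(out_separate, out_fused))
```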

Requirements

Mandatory

  • Python 3.9+
  • NumPy >= 1.20
  • AVX2 CPU - Required for SIMD optimizations on x86-64 (Intel Haswell+, AMD Excavator+)
  • C++17 compiler - clang++ or g++
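
To check the AVX2 requirement on Linux before building, you can look for the `avx2` flag in `/proc/cpuinfo`. The helper below is a hypothetical sketch that takes the file contents as a string, so it can be tried with sample text. It does not cover macOS or Windows, where other tools are needed.

```python
def cpuinfo_has_flag(cpuinfo_text: str, flag: str = "avx2") -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


# Sample text for demonstration. On a Linux machine you would pass the real
# contents instead: cpuinfo_has_flag(open("/proc/cpuinfo").read())
sample = "processor\t: 0\nmodel name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2 fma\n"
print(cpuinfo_has_flag(sample))
print(cpuinfo_has_flag(sample, "avx512f"))
```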

Strongly Recommended

  • OpenMP - For multi-threaded inference (2-4x speedup)
    # macOS
    brew install libomp
    
    # Ubuntu/Debian
    sudo apt install libomp-dev
    
    # Fedora/RHEL
    sudo dnf install libomp-devel
    

For Building from Source

  • pybind11 - pip install pybind11

Platform Support

Platform                Status
macOS (Apple Silicon)   ✅ Tested
macOS (Intel)           ✅ Supported
Linux (x86_64)          ✅ Supported
Windows                 ⚠️ Experimental

API Reference

Engine

Engine(model_path: str, verbose: bool = True)
  • generate(prompt_tokens, max_tokens, temperature, top_p) - Generate tokens
  • forward(token_id) - Single forward pass, returns logits
  • reset_cache() - Clear KV cache
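
Since `forward` returns logits, you can plug in your own sampling. The sketch below shows generic temperature plus top-p (nucleus) sampling in NumPy. It is a textbook version, not necessarily what `generate` does internally; `sample_top_p` is a hypothetical name, and treating a temperature of 0 as greedy decoding is a choice made for this example.

```python
import numpy as np


def sample_top_p(logits, temperature: float = 1.0, top_p: float = 0.9, rng=None) -> int:
    """Sample a token ID from logits using temperature and nucleus filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(logits))  # greedy decoding

    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()

    order = np.argsort(-probs)  # token IDs from most to least likely
    cumulative = np.cumsum(probs[order])
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


demo_logits = [0.0, 10.0, 0.0]
token = sample_top_p(demo_logits, top_p=0.5, rng=np.random.default_rng(0))
print(token)
```

A decode loop would call the engine's `forward` on the previous token, pass the returned logits to a sampler like this one, and stop on an end-of-sequence token or a length limit.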

StreamingGenerator

StreamingGenerator(engine, tokenizer=None, default_max_tokens=256)
  • stream(prompt_tokens, max_tokens, temperature, top_p) - Yield tokens
  • stream_async(...) - Async version
  • stop() - Request early stop

BatchProcessor

BatchProcessor(engine, tokenizer=None)
  • process_batch(requests, progress_callback) - Process multiple requests

License

Apache 2.0

Contributing

Contributions welcome! Please open an issue or PR on GitHub.
