Quicksilver CPU ⚡

High-performance CPU inference for GGUF quantized models

Quicksilver CPU is a lightweight, standalone inference engine optimized for running quantized LLMs on CPUs. Using AVX2/AVX-512 SIMD kernels, it reaches 95.7 tokens/second in the benchmark reported below (SmolLM2-135M Q4_K_M on an Intel Core i7-9750H), about 2.2x the 43 tokens/second measured for llama.cpp.

Features

🚀 Blazing Fast: 95+ tok/s on an Intel Core i7-9750H with SmolLM2-135M Q4_K_M (2.2x faster than llama.cpp)
📦 Lightweight: Minimal dependencies, pure CPU focus
🔧 Native GGUF: Direct parsing without external libraries
⚡ SIMD Optimized: AVX2/AVX-512 + OpenMP parallelization
🎯 Quantization Support: Q4_K, Q5_0, Q6_K, Q8_0, and more
🔄 Streaming: Token-by-token generation with callbacks
📊 Batch Processing: Efficient multi-request handling
🧠 Smart Caching: Prompt cache + int8 KV compression (3.9x)
🛠️ CLI Tools: Benchmark, info, and generation commands
📈 Profiling: Built-in performance diagnostics

Installation

From PyPI

pip install quicksilver-cpu

From Source

git clone https://github.com/kossisoroyce/quicksilver-cpu.git
cd quicksilver-cpu

# Install dependencies
pip install pybind11 numpy

# Build and install
pip install -e .

# Or build just the C++ kernel
cd quicksilver_cpu/csrc
python setup.py build_ext --inplace

Quick Start

Basic Inference

from quicksilver_cpu import Engine

# Load model
engine = Engine("model.gguf")

# Generate tokens
tokens = engine.generate([1, 2, 3], max_tokens=50)
print(f"Generated: {tokens}")

Streaming Generation

from quicksilver_cpu import Engine, StreamingGenerator

engine = Engine("model.gguf")
generator = StreamingGenerator(engine)

for token in generator.stream(prompt_tokens=[1, 2, 3], max_tokens=50):
    print(f"Token: {token.token_id}", end=" ")

Batch Processing

from quicksilver_cpu import Engine, BatchProcessor, BatchRequest

engine = Engine("model.gguf")
processor = BatchProcessor(engine)

requests = [
    BatchRequest(id="1", prompt_tokens=[1, 2, 3], max_tokens=20),
    BatchRequest(id="2", prompt_tokens=[4, 5, 6], max_tokens=20),
]

results, metrics = processor.process_batch(requests)
print(f"Processed {len(results)} requests at {metrics.avg_tokens_per_second:.1f} tok/s")

Benchmarking

from quicksilver_cpu import benchmark

tok_per_sec = benchmark("model.gguf", n_tokens=100)
print(f"Speed: {tok_per_sec:.1f} tok/s")
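
To make the tokens-per-second figure concrete, here is a minimal sketch of a throughput measurement. It is not the library's `benchmark` implementation. `fake_generate` is a hypothetical stand-in for a call such as `engine.generate`, and the sketch assumes the call returns only the newly generated tokens.

```python
import time

def measure_tok_per_sec(generate_fn, prompt_tokens, n_tokens, warmup=1, runs=3):
    """Return the best tokens/second over several timed runs of generate_fn."""
    for _ in range(warmup):
        generate_fn(prompt_tokens, n_tokens)  # warm caches before timing
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        out = generate_fn(prompt_tokens, n_tokens)
        elapsed = time.perf_counter() - start
        best = max(best, len(out) / elapsed)
    return best

def fake_generate(prompt_tokens, max_tokens):
    """Hypothetical stand-in for engine.generate: returns max_tokens dummy ids."""
    time.sleep(0.001)  # simulate some work so the elapsed time is non-zero
    return list(range(max_tokens))

speed = measure_tok_per_sec(fake_generate, [1, 2, 3], n_tokens=100)
print(f"Speed: {speed:.1f} tok/s")
```

Taking the best of several runs after a warmup reduces noise from cold caches and background load; reporting the mean instead is an equally reasonable choice.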

CPU Configuration

from quicksilver_cpu import configure_threads, get_cpu_info, print_cpu_info

# Show CPU info
print_cpu_info()

# Configure optimal threading
config = configure_threads(num_threads=8, bind_cores=True)
print(f"Using {config.num_threads} threads")

Prompt Caching

from quicksilver_cpu import PromptCache

# Cache repeated prompts for faster inference
cache = PromptCache(max_entries=100)

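# Placeholder token IDs; in practice these come from your tokenizer
system_prompt_tokens = [1, 2, 3, 4]
new_prompt_tokens = system_prompt_tokens + [5, 6]

# Store prompt state
cache.put(system_prompt_tokens, cache_len=len(system_prompt_tokens))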

# Find matching prefix for new prompts
match, prefix_len = cache.find_prefix_match(new_prompt_tokens)
if match:
    print(f"Reusing {prefix_len} cached tokens!")
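
The idea behind prefix matching can be sketched in plain Python. This is an illustration of the technique only, assuming a match means the longest run of leading tokens shared with any cached prompt; `longest_prefix_match` is a hypothetical helper, not the `PromptCache` implementation.

```python
def common_prefix_len(a, b):
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_prefix_match(cached_prompts, new_tokens):
    """Return (cached_prompt, shared_len) for the best match, or (None, 0)."""
    best, best_len = None, 0
    for cached in cached_prompts:
        n = common_prefix_len(cached, new_tokens)
        if n > best_len:
            best, best_len = cached, n
    return best, best_len

system_prompt = [1, 15, 27, 99]
cached_prompts = [system_prompt, [1, 15, 8]]
match, prefix_len = longest_prefix_match(cached_prompts, [1, 15, 27, 99, 42, 7])
print(match, prefix_len)  # [1, 15, 27, 99] 4
```

Only the tokens after the shared prefix need a fresh forward pass, which is where the saving for repeated system prompts comes from.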

KV Cache Compression

from quicksilver_cpu import KVCacheManager

# Use int8 compression for 3.9x memory savings
kv_cache = KVCacheManager(
    num_layers=32,
    num_kv_heads=8,
    head_dim=64,
    max_seq_len=4096,
    use_int8=True,  # 3.9x compression
)

print(f"Memory: {kv_cache.memory_usage_mb():.1f} MB")
print(f"Compression: {kv_cache.compression_ratio():.1f}x")
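
As a rough illustration of int8 KV compression, the NumPy sketch below quantizes each head vector symmetrically to int8 and keeps one fp16 scale per vector. The storage layout, the fp16 scale, and the fp32 baseline are all assumptions; the source does not say how `KVCacheManager` stores its data or how the 3.9x figure is derived.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization with one fp16 scale per last-axis vector."""
    scale = (np.abs(x).max(axis=-1, keepdims=True) / 127.0).astype(np.float16)
    safe = np.where(scale == 0, 1.0, scale.astype(np.float32))  # avoid divide-by-zero
    q = np.clip(np.round(x / safe), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate fp32 values from int8 codes and their scales."""
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
head_dim = 64
# Assumed layout for the sketch: (kv_heads, seq_len, head_dim)
keys = rng.standard_normal((8, 16, head_dim)).astype(np.float32)

q, scale = quantize_int8(keys)
restored = dequantize_int8(q, scale)

fp32_bytes = keys.nbytes
int8_bytes = q.nbytes + scale.nbytes
max_err = float(np.abs(keys - restored).max())
print(f"Compression: {fp32_bytes / int8_bytes:.2f}x")  # Compression: 3.88x
print(f"Max abs error: {max_err:.4f}")
```

Under these assumptions each 64-element vector shrinks from 256 bytes to 66 bytes (64 int8 codes plus a 2-byte scale), a ratio of about 3.88x.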

Profiling

from quicksilver_cpu import get_profiler, Engine

engine = Engine("model.gguf")
profiler = get_profiler()

profiler.start("inference")
tokens = engine.generate([1, 2, 3], max_tokens=50)
profiler.stop("inference")

profiler.print_report()

CLI Usage

# Show model information
quicksilver-cpu info -m model.gguf

# Benchmark inference speed
quicksilver-cpu benchmark -m model.gguf -n 100 --threads 8

# Generate text
quicksilver-cpu generate -m model.gguf -p "Hello world" --max-tokens 50 --stream

Audio Support (TTS, STT, STS)

Quicksilver CPU includes audio engines for Text-to-Speech (TTS), Speech-to-Text (STT), and Speech-to-Speech (STS) processing.

Text-to-Speech (TTS)

from quicksilver_cpu.audio import TTSEngine, TTSConfig

config = TTSConfig(
    model_path="qwen3-tts.gguf",
    sample_rate=24000,
    temperature=0.7,
)
engine = TTSEngine(config)

# Generate speech
result = engine.synthesize("Hello, welcome to Quicksilver!")
result.save("output.wav")

# Streaming generation
# play_audio is a placeholder for your own playback function
for chunk in engine.stream("Long text for streaming..."):
    play_audio(chunk)  # Real-time playback

Speech-to-Text (STT)

from quicksilver_cpu.audio import STTEngine, STTConfig

config = STTConfig(
    model_path="whisper.gguf",
    language="en",  # Auto-detect if None
)
engine = STTEngine(config)

# Transcribe audio
result = engine.transcribe("audio.wav")
print(result.text)

# Export subtitles
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(result.to_srt())

# Streaming transcription
for segment in engine.stream("long_audio.wav"):
    print(f"[{segment.start:.1f}s] {segment.text}")
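
For reference, the SubRip (SRT) layout that an export like `to_srt()` is expected to produce can be sketched as follows. The helper names and the `(start, end, text)` tuple shape are hypothetical; the library's segment objects may expose different fields.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments):
    """Build SRT text from (start, end, text) tuples, numbering cues from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)

srt_text = segments_to_srt([(0.0, 2.5, "Hello."), (2.5, 5.0, "Goodbye.")])
print(srt_text)
```

Each cue is a sequence number, a time range with a comma before the milliseconds, the text, and a blank line separating it from the next cue.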

Speech-to-Speech (STS)

from quicksilver_cpu.audio import STSEngine, STSConfig

config = STSConfig(
    stt_model_path="whisper.gguf",
    tts_model_path="tts.gguf",
    source_language="es",
    target_language="en",
)
engine = STSEngine(config)

# Translate speech
result = engine.translate("spanish_audio.wav")
result.save("english_audio.wav")

# Voice conversion
result = engine.convert_voice("input.wav", target_voice="voice_sample.wav")

# Real-time streaming translation
# play_audio is a placeholder for your own playback function
for chunk in engine.stream("live_audio.wav"):
    play_audio(chunk)

Audio Utilities

from quicksilver_cpu.audio import load_audio, save_audio, AudioBuffer

# Load/save audio
audio, sr = load_audio("input.wav", target_sr=24000)
save_audio(audio, "output.wav", sample_rate=24000)

# Streaming buffer
buffer = AudioBuffer(sample_rate=24000)
buffer.append(audio)  # append each chunk as it arrives
full_audio = buffer.get_audio()
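
For readers who want to see what saving audio at 24 kHz involves, here is a small sketch using only NumPy and the standard-library `wave` module. `save_wav_pcm16` is a hypothetical helper that assumes mono float samples in [-1, 1]; it is not how `save_audio` is implemented.

```python
import os
import tempfile
import wave

import numpy as np

def save_wav_pcm16(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# One second of a 440 Hz tone at 24 kHz
t = np.arange(24000) / 24000.0
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
wav_path = os.path.join(tempfile.gettempdir(), "quicksilver_tone_example.wav")
save_wav_pcm16(tone, wav_path)
```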

Supported Quantization Types

Type	Bits/Weight	Block Size	Status
Q4_K	4.5	256	✅ AVX2 optimized
Q5_0	5.5	32	✅ Supported
Q6_K	6.5	256	✅ AVX2 optimized
Q8_0	8.5	32	✅ Supported
Q4_0	4.5	32	✅ Supported
Q2_K	2.5	256	✅ Supported
Q3_K	3.4	256	✅ Supported
Q5_K	5.5	256	✅ Supported
F16	16	1	✅ Supported
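
The bits-per-weight column follows from the size of each quantization block. The sketch below reproduces the arithmetic for several types, assuming the standard ggml block layouts (2-byte fp16 scales); under those layouts Q6_K works out to 6.5625, which the table rounds to 6.5.

```python
# name: (weights per block, bytes per block), assuming the ggml block layouts
BLOCK_LAYOUTS = {
    "Q4_0": (32, 2 + 16),                  # fp16 scale + 32 x 4-bit
    "Q5_0": (32, 2 + 4 + 16),              # fp16 scale + 32 high bits + 32 x 4-bit
    "Q8_0": (32, 2 + 32),                  # fp16 scale + 32 x 8-bit
    "Q4_K": (256, 2 + 2 + 12 + 128),       # d, dmin, packed scales, 256 x 4-bit
    "Q5_K": (256, 2 + 2 + 12 + 32 + 128),  # as Q4_K plus 256 high bits
    "Q6_K": (256, 128 + 64 + 16 + 2),      # low 4 bits, high 2 bits, int8 scales, d
}

def bits_per_weight(name):
    """Average storage cost per weight, including per-block metadata."""
    weights, nbytes = BLOCK_LAYOUTS[name]
    return nbytes * 8 / weights

for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.4f} bits/weight")
```

The fractional part above the nominal bit width is the per-block overhead of scales and minimums.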

Performance

Benchmarked on Intel Core i7-9750H with SmolLM2-135M Q4_K_M:

Engine	Tokens/sec	Speedup
llama.cpp	43	1.0x
Quicksilver CPU	95.7	2.22x

Key Optimizations

AVX2 SIMD - 8-wide FMA operations for Q4_K/Q6_K GEMV
Fused Operations - Combined gate+up projections for better cache locality
OpenMP Parallelization - Multi-threaded layer computations
Int8 KV Cache - 3.9x memory compression with minimal quality loss
Prompt Caching - Reuse computations for repeated prefixes
Memory Alignment - 64-byte aligned allocations for SIMD efficiency
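
The fused gate+up idea from the list above can be illustrated in NumPy. This sketch assumes a SwiGLU-style feed-forward block, which the source does not confirm; it only shows that stacking the two weight matrices gives the same result with a single pass over the input vector. The engine's actual kernels are C++ and are not reproduced here.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden, ffn = 64, 256  # illustrative sizes, not taken from any model
x = rng.standard_normal(hidden).astype(np.float32)
w_gate = rng.standard_normal((ffn, hidden)).astype(np.float32)
w_up = rng.standard_normal((ffn, hidden)).astype(np.float32)

# Unfused: two separate matrix-vector products, each reading x
unfused = silu(w_gate @ x) * (w_up @ x)

# Fused: stack the weights once, then one product and a split
w_fused = np.concatenate([w_gate, w_up], axis=0)
gate_up = w_fused @ x
fused = silu(gate_up[:ffn]) * gate_up[ffn:]

results_match = bool(np.allclose(unfused, fused, rtol=1e-4, atol=1e-4))
print(results_match)
```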

Requirements

Mandatory

Python 3.9+
NumPy >= 1.20
AVX2 CPU - Required for SIMD optimizations (Intel Haswell+, AMD Excavator+)
C++17 compiler - clang++ or g++

Strongly Recommended

OpenMP - For multi-threaded inference (2-4x speedup)

# macOS
brew install libomp

# Ubuntu/Debian
sudo apt install libomp-dev

# Fedora/RHEL
sudo dnf install libomp-devel

For Building from Source

pybind11 - pip install pybind11

Platform Support

Platform	Status
macOS (Apple Silicon)	✅ Tested
macOS (Intel)	✅ Supported
Linux (x86_64)	✅ Supported
Windows	⚠️ Experimental

API Reference

`Engine`

Engine(model_path: str, verbose: bool = True)

generate(prompt_tokens, max_tokens, temperature, top_p) - Generate tokens
forward(token_id) - Single forward pass, returns logits
reset_cache() - Clear KV cache
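
The `temperature` and `top_p` parameters conventionally control how a token is drawn from the logits. The sketch below shows the usual temperature scaling followed by nucleus (top-p) sampling in NumPy. It is an assumption about what these parameters mean, not the engine's sampler; `sample_top_p` and its default values are hypothetical.

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Draw a token id using temperature scaling and nucleus sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0.0:
        return int(np.argmax(logits))  # treat non-positive temperature as greedy

    # Softmax over temperature-scaled logits (shifted for numerical stability)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of top tokens whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    # Renormalize over the kept tokens and sample
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

token_id = sample_top_p([0.1, 2.0, 0.3, 1.5], rng=np.random.default_rng(0))
print(token_id)
```

With logits such as those returned by a single forward pass, a loop of sample, append, and forward again yields token-by-token generation.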

`StreamingGenerator`

StreamingGenerator(engine, tokenizer=None, default_max_tokens=256)

stream(prompt_tokens, max_tokens, temperature, top_p) - Yield tokens
stream_async(...) - Async version
stop() - Request early stop

`BatchProcessor`

BatchProcessor(engine, tokenizer=None)

process_batch(requests, progress_callback) - Process multiple requests

License

Apache 2.0

Contributing

Contributions welcome! Please open an issue or PR on GitHub.

Related Projects

Quicksilver - Full inference engine with GPU support
llama.cpp - Original GGUF implementation

Release history

0.2.0 - Jan 27, 2026 (this version)
0.1.0 - Jan 27, 2026

Download files

Source Distribution

quicksilver_cpu-0.2.0.tar.gz (66.0 kB)
Upload date: Jan 27, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

Algorithm	Hash digest
SHA256	`17296a6db9cd0f3c51f9f771d8c858810102f23bb4f96e7288a3e341eb6d0d7c`
MD5	`c75490d0069402a29574ad03b9fb2dcc`
BLAKE2b-256	`f8902cb01e77009e420808e94c46cffac9281191b2aa5dd0ee170b2dea7bb26b`

Built Distribution

quicksilver_cpu-0.2.0-cp311-cp311-macosx_15_0_x86_64.whl (152.8 kB)
Upload date: Jan 27, 2026
Tags: CPython 3.11, macOS 15.0+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

Algorithm	Hash digest
SHA256	`daaa7bc2da099e7452d943ba51cefe857deb3231f7c2787397fb475ebd034a7d`
MD5	`02c4dc3aa1d80744173a2aae20c666c5`
BLAKE2b-256	`6d92f7eed29a95221623092b4ac61ffa222af02b7568f5067f7f599312797857`
