# Quicksilver

**High-Performance GGUF Inference Engine**

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. On CPU it reaches 84 tok/s with SmolLM2-135M (95% faster than llama.cpp in our benchmark) through optimized C++ kernels with AVX2 SIMD and OpenMP parallelization.
## Install

```bash
pip install quicksilver-inference
```
## Performance
| Model | Size | Quicksilver | llama.cpp | Speedup |
|---|---|---|---|---|
| SmolLM2-135M | 101MB | 84 tok/s | 43 tok/s | +95% |
| Qwen2.5-1.5B | 1.0GB | 12 tok/s | ~12 tok/s | comparable |
*Benchmarked with `OMP_NUM_THREADS=8` on an Intel CPU.*
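The speedup column is the relative throughput delta; for SmolLM2-135M, using the numbers from the table:

```python
# Relative speedup = (new throughput / baseline throughput) - 1
quicksilver_tps = 84  # tok/s, SmolLM2-135M (table above)
llamacpp_tps = 43     # tok/s, same model and hardware

speedup = quicksilver_tps / llamacpp_tps - 1
print(f"{speedup:.0%}")  # → 95%
```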
## Backends
| Backend | Platform | Status |
|---|---|---|
| CPU | All (AVX2/NEON) | ✅ Production |
| CUDA | NVIDIA GPUs | ✅ Ready |
| Metal | Apple Silicon | ✅ Ready |
| CANN | Huawei Ascend NPU | ✅ Ready |
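The auto-selection order (CUDA > Metal > CPU) can be pictured as a simple priority scan. This is an illustrative sketch, not Quicksilver's actual API; the availability flags stand in for real device probes:

```python
def pick_backend(cuda_ok: bool, metal_ok: bool) -> str:
    """Return the highest-priority available backend.

    The flags are placeholders for real probes (e.g. checking
    for a CUDA driver or a Metal device at runtime).
    """
    if cuda_ok:
        return "cuda"
    if metal_ok:
        return "metal"
    return "cpu"  # the CPU backend is always available

print(pick_backend(cuda_ok=False, metal_ok=True))  # → metal
```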
## Features
- **Native GGUF Parsing** - Zero external dependencies for model loading
- **9 Quantization Types** - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
- **AVX2 SIMD** - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
- **OpenMP Parallelization** - Multi-threaded GEMV operations
- **Streaming Generation** - Real-time token-by-token output
- **Batch Inference** - Process multiple prompts efficiently
- **OpenAI-Compatible API** - Drop-in replacement for the OpenAI API
- **Auto Backend Selection** - Automatically uses the best available backend (CUDA > Metal > CPU)
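Dependency-free GGUF loading is possible because the file starts with a small fixed-layout header. As a rough illustration based on the public GGUF specification (this is not Quicksilver's internal parser), the header is a magic string followed by a version and two counts:

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes):
    """Parse the fixed 24-byte GGUF header: magic, version,
    tensor count, and metadata key/value count (little-endian)."""
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: {magic!r}")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Synthetic header: version 3, 2 tensors, 5 metadata entries
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(header))  # → (3, 2, 5)
```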
## Installation

### CPU (All Platforms)

```bash
cd quicksilver
pip install -e .

# Build C++ kernels (requires a C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# For optimal performance
export OMP_NUM_THREADS=8
```
### CUDA (NVIDIA GPUs)

```bash
# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace
```
### Metal (macOS)

```bash
# Requires: Xcode (not just the Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py
```
### CANN (Huawei Ascend NPU)

```bash
# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu

# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace
```
## Quick Start

### Basic Inference

```python
from quicksilver.quantized_engine import QuantizedInferenceEngine

engine = QuantizedInferenceEngine("model.gguf")
tokens = engine.generate(prompt_tokens=[1, 2, 3], max_tokens=50)
```
### Streaming Generation

```python
from quicksilver.streaming import StreamingGenerator

generator = StreamingGenerator(engine, tokenizer)
for token in generator.stream(prompt="Hello!", max_tokens=50):
    print(token.token_text, end="", flush=True)
```
### Batch Processing

```python
from quicksilver.batch import BatchProcessor, BatchRequest

processor = BatchProcessor(engine, tokenizer)
requests = [
    BatchRequest(id="1", prompt="What is AI?"),
    BatchRequest(id="2", prompt="Explain quantum computing"),
]
results, metrics = processor.process_batch(requests)
```
### OpenAI-Compatible API Server

```bash
# Start server
python -m quicksilver.server --model model.gguf --port 8000
```

```python
# Use with the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="quicksilver",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # Streaming supported!
)
```
## CLI

```bash
quicksilver generate -m model.gguf -p "Once upon a time"
quicksilver benchmark -m model.gguf -n 100
```
## Architecture

```
quicksilver/
├── core/                    # GGUF parsing and quantization
│   ├── parser.py            # Native GGUF file parser
│   ├── tensor.py            # Quantized tensor operations
│   └── quantization.py      # Quantization type definitions
├── csrc/                    # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py      # Main inference engine
└── cli.py                   # Command-line interface
```
## Supported Quantization Types

| Type | Bits/weight | Block Size | Status |
|---|---|---|---|
| Q4_0 | 4.0 | 32 | ✅ Native GEMV |
| Q5_0 | 5.0 | 32 | ✅ Native GEMV |
| Q8_0 | 8.0 | 32 | ✅ Native GEMV |
| Q2_K | 2.5 | 256 | ✅ Native GEMV |
| Q3_K | 3.4 | 256 | ✅ Native GEMV |
| Q4_K | 4.5 | 256 | ✅ AVX2 SIMD |
| Q5_K | 5.5 | 256 | ✅ Native GEMV |
| Q6_K | 6.5 | 256 | ✅ AVX2 SIMD |
| F16 | 16 | 1 | ✅ Native GEMV |
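To see where the fractional bits-per-weight figures come from, consider Q4_0: each 32-value block packs 4-bit quants plus a shared f16 scale. Below is a hedged pure-Python sketch of dequantizing one such block, following the public GGUF/llama.cpp layout; Quicksilver's real kernels are the optimized C++ versions:

```python
import struct

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte Q4_0 block into 32 floats.

    Layout: a 2-byte f16 scale `d`, then 16 bytes holding 32
    unsigned 4-bit values; each weight is (q - 8) * d.
    """
    assert len(block) == 18
    (d,) = struct.unpack_from("<e", block, 0)  # f16 scale
    qs = block[2:]
    lo = [((b & 0x0F) - 8) * d for b in qs]  # low nibbles: weights 0..15
    hi = [((b >> 4) - 8) * d for b in qs]    # high nibbles: weights 16..31
    return lo + hi

# A block with scale 1.0 and every nibble = 0x8 dequantizes to zeros
block = struct.pack("<e", 1.0) + bytes([0x88] * 16)
print(set(dequantize_q4_0_block(block)))  # → {0.0}
```

At 18 bytes per 32 weights this is 4.5 bits/weight; Q4_0's table entry of 4.0 reflects the nibble payload alone, while the K-quant rows count their extra per-block metadata.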
## Roadmap

- [x] CPU inference with quantized kernels
- [x] Beat llama.cpp performance (84 vs 43 tok/s = +95%)
- [x] Support 9 quantization types (covers 95%+ of GGUF models)
- [x] Metal GPU backend shaders
- [x] CUDA GPU backend kernels
- [x] CANN backend for Huawei Ascend NPUs
- [x] OpenAI-compatible API server
- [x] Streaming generation
- [x] Batch inference
- [ ] Continuous batching
- [ ] Speculative decoding
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format and lint code
black quicksilver/
ruff check quicksilver/
```
## License
Apache 2.0
## File details

Details for the file `quicksilver_inference-0.2.2.tar.gz`.

- Download URL: quicksilver_inference-0.2.2.tar.gz
- Size: 98.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f311c0881c5f3e5c58888a124003e11e1f4eceb82a56c20b637d40176e29558b` |
| MD5 | `7a77429d7ea5f060476c0c7c92b95b73` |
| BLAKE2b-256 | `cd2a7e303217ac9dd6119ab363bc7ef752d99ae50da7c3db6463f024eded124b` |
## File details

Details for the file `quicksilver_inference-0.2.2-py3-none-any.whl`.

- Download URL: quicksilver_inference-0.2.2-py3-none-any.whl
- Size: 123.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `df855b984adc3417ae8bcb1f175cd667be2bf6f44893e8c84e3aa50bb7acf501` |
| MD5 | `66daa71efd0c401b411c5fd26a8d2d4d` |
| BLAKE2b-256 | `5c746e649134ab8b33d092e8e162c21ab72483559ed08f1616cd81c712daab9c` |