
Quicksilver

High-Performance GGUF Inference Engine


Quicksilver is a standalone, production-ready library for GGUF quantized model inference. It reaches up to 84 tok/s on CPU (95% faster than llama.cpp on SmolLM2-135M) through optimized C++ kernels with AVX2 SIMD and OpenMP parallelization.

Install

pip install quicksilver-inference

Performance

| Model | Size | Quicksilver | llama.cpp | Speedup |
|--------------|--------|-------------|-----------|------------|
| SmolLM2-135M | 101 MB | 84 tok/s | 43 tok/s | +95% |
| Qwen2.5-1.5B | 1.0 GB | 12 tok/s | ~12 tok/s | comparable |

Benchmarked with OMP_NUM_THREADS=8 on Intel CPU
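The speedup column is relative throughput gain. A quick sanity check of the SmolLM2-135M numbers:

```python
def speedup_pct(ours_tok_s: float, baseline_tok_s: float) -> float:
    """Relative throughput gain over a baseline, in percent."""
    return (ours_tok_s / baseline_tok_s - 1.0) * 100.0

print(round(speedup_pct(84, 43)))  # 84 vs 43 tok/s → 95
```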

Backends

| Backend | Platform | Status |
|---------|--------------------|---------------|
| CPU | All (AVX2/NEON) | ✅ Production |
| CUDA | NVIDIA GPUs | ✅ Ready |
| Metal | Apple Silicon | ✅ Ready |
| CANN | Huawei Ascend NPU | ✅ Ready |

Features

  • Native GGUF Parsing - Zero external dependencies for model loading
  • 9 Quantization Types - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
  • AVX2 SIMD - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
  • OpenMP Parallelization - Multi-threaded GEMV operations
  • Streaming Generation - Real-time token-by-token output
  • Batch Inference - Process multiple prompts efficiently
  • OpenAI-Compatible API - Drop-in replacement for OpenAI API
  • Auto Backend Selection - Automatically uses best available (CUDA > Metal > CPU)
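Auto backend selection follows the CUDA > Metal > CPU priority order. A minimal sketch of that policy — the availability flags here are stand-ins for illustration, not Quicksilver's actual probing API:

```python
def pick_backend(cuda_available: bool, metal_available: bool) -> str:
    """Return the highest-priority backend usable on this machine
    (CUDA > Metal > CPU). CPU kernels (AVX2/NEON) always work."""
    if cuda_available:
        return "cuda"
    if metal_available:
        return "metal"
    return "cpu"

print(pick_backend(cuda_available=False, metal_available=True))  # → metal
```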

Installation

CPU (All Platforms)

# From a source checkout of the repository
cd quicksilver
pip install -e .

# Build C++ kernels (requires C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# Optimal performance
export OMP_NUM_THREADS=8

CUDA (NVIDIA GPUs)

# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace

Metal (macOS)

# Requires: Xcode (not just Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py

CANN (Huawei Ascend NPU)

# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu

# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace

Quick Start

Basic Inference

from quicksilver.quantized_engine import QuantizedInferenceEngine

engine = QuantizedInferenceEngine("model.gguf")
tokens = engine.generate(prompt_tokens=[1, 2, 3], max_tokens=50)
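`generate()` decodes token by token. The loop shape can be illustrated with a stub model in place of the real engine (the toy `logits_fn` below is purely for demonstration and is not Quicksilver's internals):

```python
from typing import Callable, List

def greedy_generate(logits_fn: Callable[[List[int]], List[float]],
                    prompt_tokens: List[int], max_tokens: int,
                    eos_id: int = 0) -> List[int]:
    """Repeatedly append the argmax token until max_tokens or EOS —
    the basic greedy decoding loop, with a stubbed model."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = logits_fn(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

# Toy "model" over a 5-token vocab: always favors (last_token + 1) % 5
toy = lambda toks: [1.0 if i == (toks[-1] + 1) % 5 else 0.0 for i in range(5)]
print(greedy_generate(toy, [1, 2, 3], max_tokens=4))  # → [1, 2, 3, 4] (stops at EOS 0)
```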

Streaming Generation

from quicksilver.streaming import StreamingGenerator

generator = StreamingGenerator(engine, tokenizer)

for token in generator.stream(prompt="Hello!", max_tokens=50):
    print(token.token_text, end="", flush=True)

Batch Processing

from quicksilver.batch import BatchProcessor, BatchRequest

processor = BatchProcessor(engine, tokenizer)
requests = [
    BatchRequest(id="1", prompt="What is AI?"),
    BatchRequest(id="2", prompt="Explain quantum computing"),
]
results, metrics = processor.process_batch(requests)

OpenAI-Compatible API Server

# Start server
python -m quicksilver.server --model model.gguf --port 8000

# Use with OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="quicksilver",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True  # Streaming supported!
)
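With `stream=True`, an OpenAI-compatible server sends chat completion chunks as server-sent events. A minimal parser for that wire format — the sample payload below is illustrative, not captured from a Quicksilver server:

```python
import json

def parse_sse_chunks(raw: str):
    """Yield content deltas from an OpenAI-style SSE stream body."""
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":        # end-of-stream sentinel
            return
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

sample = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo!"}}]}\n'
    'data: [DONE]\n'
)
print("".join(parse_sse_chunks(sample)))  # → Hello!
```

In practice the `openai` client shown above does this parsing for you; the sketch only shows what travels over the wire.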

CLI

quicksilver generate -m model.gguf -p "Once upon a time"
quicksilver benchmark -m model.gguf -n 100

Architecture

quicksilver/
├── core/           # GGUF parsing and quantization
│   ├── parser.py   # Native GGUF file parser
│   ├── tensor.py   # Quantized tensor operations
│   └── quantization.py  # Quantization type definitions
├── csrc/           # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py  # Main inference engine
└── cli.py          # Command-line interface
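`core/parser.py` handles native GGUF parsing. Per the public GGUF specification, a file opens with a fixed little-endian header: the magic bytes `GGUF`, a u32 version, then u64 tensor and metadata-KV counts. A minimal header reader in that spirit (a sketch against the spec, not necessarily `parser.py`'s exact code):

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(buf: bytes):
    """Parse the fixed GGUF header: magic, version, tensor_count, metadata_kv_count."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # little-endian: u32 version, u64 tensor_count, u64 metadata_kv_count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return version, tensor_count, kv_count

# Synthetic header: version 3, 2 tensors, 5 metadata keys
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(header))  # → (3, 2, 5)
```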

Supported Quantization Types

| Type | Bits | Block Size | Status |
|------|------|------------|----------------|
| Q4_0 | 4.0 | 32 | ✅ Native GEMV |
| Q5_0 | 5.0 | 32 | ✅ Native GEMV |
| Q8_0 | 8.0 | 32 | ✅ Native GEMV |
| Q2_K | 2.5 | 256 | ✅ Native GEMV |
| Q3_K | 3.4 | 256 | ✅ Native GEMV |
| Q4_K | 4.5 | 256 | ✅ AVX2 SIMD |
| Q5_K | 5.5 | 256 | ✅ Native GEMV |
| Q6_K | 6.5 | 256 | ✅ AVX2 SIMD |
| F16 | 16 | 1 | ✅ Native GEMV |
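Q4_0, for example, stores each block of 32 weights as one f16 scale plus 32 packed 4-bit quants, dequantized as `w = scale * (q - 8)`. A pure-Python sketch of one block, mirroring the standard ggml layout (Quicksilver's kernels do this in C++ with SIMD):

```python
def dequantize_q4_0_block(scale: float, packed: bytes) -> list:
    """Unpack one Q4_0 block: 16 bytes carry 32 4-bit quants; w = scale * (q - 8).
    In ggml's layout, byte i holds quant i (low nibble) and quant i+16 (high nibble)."""
    assert len(packed) == 16
    lo = [scale * ((b & 0x0F) - 8) for b in packed]   # elements 0..15
    hi = [scale * ((b >> 4) - 8) for b in packed]     # elements 16..31
    return lo + hi

block = bytes([0x98] * 16)  # low nibble 8 → 0, high nibble 9 → +1 before scaling
out = dequantize_q4_0_block(0.5, block)
print(out[0], out[16])  # → 0.0 0.5
```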

Roadmap

  • CPU inference with quantized kernels
  • Beat llama.cpp performance (84 vs 43 tok/s = +95%)
  • Support 9 quantization types (covers 95%+ of GGUF models)
  • Metal GPU backend shaders
  • CUDA GPU backend kernels
  • CANN backend for Huawei Ascend NPUs
  • OpenAI-compatible API server
  • Streaming generation
  • Batch inference
  • Continuous batching
  • Speculative decoding

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black quicksilver/
ruff check quicksilver/

License

Apache 2.0

Download files

Source Distribution

quicksilver_inference-0.2.2.tar.gz (98.6 kB)

Built Distribution

quicksilver_inference-0.2.2-py3-none-any.whl (123.4 kB)

File details

Details for the file quicksilver_inference-0.2.2.tar.gz.

File metadata

  • Size: 98.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

| Algorithm | Hash digest |
|-------------|-------------|
| SHA256 | f311c0881c5f3e5c58888a124003e11e1f4eceb82a56c20b637d40176e29558b |
| MD5 | 7a77429d7ea5f060476c0c7c92b95b73 |
| BLAKE2b-256 | cd2a7e303217ac9dd6119ab363bc7ef752d99ae50da7c3db6463f024eded124b |

File details

Details for the file quicksilver_inference-0.2.2-py3-none-any.whl.

File hashes

| Algorithm | Hash digest |
|-------------|-------------|
| SHA256 | df855b984adc3417ae8bcb1f175cd667be2bf6f44893e8c84e3aa50bb7acf501 |
| MD5 | 66daa71efd0c401b411c5fd26a8d2d4d |
| BLAKE2b-256 | 5c746e649134ab8b33d092e8e162c21ab72483559ed08f1616cd81c712daab9c |
