Quicksilver

High-Performance GGUF Inference Engine

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. It reaches 84 tok/s on CPU with SmolLM2-135M (95% faster than llama.cpp in the same setup) through optimized C++ kernels using AVX2 SIMD and OpenMP parallelization.

Install

pip install quicksilver-inference

Performance

Model          Size     Quicksilver   llama.cpp   Speedup
SmolLM2-135M   101 MB   84 tok/s      43 tok/s    +95%
Qwen2.5-1.5B   1.0 GB   12 tok/s      ~12 tok/s   comparable

Benchmarked with OMP_NUM_THREADS=8 on an Intel CPU.

Backends

Backend   Platform             Status
CPU       All (AVX2/NEON)      ✅ Production
CUDA      NVIDIA GPUs          ✅ Ready
Metal     Apple Silicon        ✅ Ready
CANN      Huawei Ascend NPU    ✅ Ready

Features

  • Native GGUF Parsing - Zero external dependencies for model loading
  • 9 Quantization Types - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
  • AVX2 SIMD - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
  • OpenMP Parallelization - Multi-threaded GEMV operations
  • Streaming Generation - Real-time token-by-token output
  • Batch Inference - Process multiple prompts efficiently
  • OpenAI-Compatible API - Drop-in replacement for OpenAI API
  • Auto Backend Selection - Automatically uses best available (CUDA > Metal > CPU)
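The CUDA > Metal > CPU preference order can be pictured as a simple probe chain. The sketch below is illustrative only: the `pick_backend` name and the use of torch as the CUDA probe are assumptions, not Quicksilver's actual selection code.

```python
import platform

def pick_backend() -> str:
    """Illustrative CUDA > Metal > CPU probe order
    (hypothetical helper, not Quicksilver's real API)."""
    try:
        import torch  # assumption: torch is available as the CUDA probe
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    # Apple Silicon Macs can use the Metal backend
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"
    return "cpu"  # always-available fallback
```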

Installation

CPU (All Platforms)

# From a source checkout
cd quicksilver
pip install -e .

# Build C++ kernels (requires a C++ compiler with OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# Optimal performance
export OMP_NUM_THREADS=8

CUDA (NVIDIA GPUs)

# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace

Metal (macOS)

# Requires: Xcode (not just Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py

CANN (Huawei Ascend NPU)

# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu

# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace

Quick Start

Basic Inference

from quicksilver.quantized_engine import QuantizedInferenceEngine

engine = QuantizedInferenceEngine("model.gguf")
# prompt_tokens are placeholder IDs here; produce them with your model's tokenizer
tokens = engine.generate(prompt_tokens=[1, 2, 3], max_tokens=50)

Streaming Generation

from quicksilver.streaming import StreamingGenerator

generator = StreamingGenerator(engine, tokenizer)

for token in generator.stream(prompt="Hello!", max_tokens=50):
    print(token.token_text, end="", flush=True)

Batch Processing

from quicksilver.batch import BatchProcessor, BatchRequest

processor = BatchProcessor(engine, tokenizer)
requests = [
    BatchRequest(id="1", prompt="What is AI?"),
    BatchRequest(id="2", prompt="Explain quantum computing"),
]
results, metrics = processor.process_batch(requests)

OpenAI-Compatible API Server

# Start server
python -m quicksilver.server --model model.gguf --port 8000

# Use with OpenAI client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="quicksilver",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # streaming supported
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

CLI

quicksilver generate -m model.gguf -p "Once upon a time"
quicksilver benchmark -m model.gguf -n 100

Architecture

quicksilver/
├── core/           # GGUF parsing and quantization
│   ├── parser.py   # Native GGUF file parser
│   ├── tensor.py   # Quantized tensor operations
│   └── quantization.py  # Quantization type definitions
├── csrc/           # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py  # Main inference engine
└── cli.py          # Command-line interface
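To give a feel for what a native parser like core/parser.py must do first, here is a minimal reader for the fixed GGUF file header. It follows the public GGUF spec (magic b"GGUF", a uint32 version, then uint64 tensor and metadata key-value counts, all little-endian) and is a standalone sketch, not Quicksilver's parser.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version,
    tensor count, and metadata key-value count (little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```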

Supported Quantization Types

Type   Bits/weight   Block size   Status
Q4_0   4.0           32           ✅ Native GEMV
Q5_0   5.0           32           ✅ Native GEMV
Q8_0   8.0           32           ✅ Native GEMV
Q2_K   2.5           256          ✅ Native GEMV
Q3_K   3.4           256          ✅ Native GEMV
Q4_K   4.5           256          ✅ AVX2 SIMD
Q5_K   5.5           256          ✅ Native GEMV
Q6_K   6.5           256          ✅ AVX2 SIMD
F16    16            1            ✅ Native GEMV
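As an illustration of what these kernels decode, here is a pure-Python dequantizer for a single Q4_0 block, following the standard GGUF layout (a float16 scale d followed by 16 bytes packing 32 4-bit quants, where each value is d * (q - 8)). This is a reference sketch for clarity, not the optimized C++ GEMV path.

```python
import struct

def dequant_q4_0(block: bytes) -> list[float]:
    """Dequantize one GGUF Q4_0 block (18 bytes -> 32 floats).

    Layout: float16 scale d, then 16 bytes packing 32 4-bit quants;
    byte j holds quant j in its low nibble and quant j+16 in its high nibble.
    """
    (d,) = struct.unpack_from("<e", block, 0)  # float16 scale
    qs = block[2:18]
    out = [0.0] * 32
    for j in range(16):
        out[j] = d * ((qs[j] & 0x0F) - 8)      # low nibble -> first half
        out[j + 16] = d * ((qs[j] >> 4) - 8)   # high nibble -> second half
    return out
```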

Roadmap

  • CPU inference with quantized kernels
  • Beat llama.cpp performance (84 vs 43 tok/s = +95%)
  • Support 9 quantization types (covers 95%+ of GGUF models)
  • Metal GPU backend shaders
  • CUDA GPU backend kernels
  • CANN backend for Huawei Ascend NPUs
  • OpenAI-compatible API server
  • Streaming generation
  • Batch inference
  • Continuous batching
  • Speculative decoding

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black quicksilver/
ruff check quicksilver/

License

Apache 2.0
