Efficient CPU inference for ternary language models
Litespark-Inference
Fast CPU inference for ternary neural networks
Litespark-Inference is a pip-installable Python library for efficient inference of ternary models on consumer CPUs. Custom SIMD kernels exploit the ternary weight structure ({-1, 0, +1}) to remove floating-point multiplication from the matrix-multiply inner loop, giving 18x to 97x higher throughput than standard PyTorch inference in the benchmarks below.
Key Results
Apple Silicon (M1-M5)
Performance comparison on Apple Silicon M5 Max. Litespark-Inference achieves 6.03x memory reduction, 7.15x faster TTFT, and 18.15x higher throughput compared to PyTorch.
| Metric | PyTorch | NEON | Improvement |
|---|---|---|---|
| Memory (MB) | 4,868.22 | 806.81 | 6.03x |
| TTFT (ms) | 4,213.92 | 589.02 | 7.15x |
| Throughput (tok/s) | 2.20 | 39.92 | 18.15x |
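The Improvement column is a plain ratio whose direction depends on the metric: baseline over optimized for memory and TTFT (lower is better), optimized over baseline for throughput (higher is better). A minimal sketch that reproduces the column from the table values; the helper name `improvement` is illustrative and not part of the library:

```python
def improvement(baseline: float, optimized: float, higher_is_better: bool) -> float:
    """Return the improvement factor of `optimized` over `baseline`, rounded to 2 places."""
    ratio = optimized / baseline if higher_is_better else baseline / optimized
    return round(ratio, 2)

# Values copied from the Apple Silicon M5 Max table above.
print(improvement(4868.22, 806.81, higher_is_better=False))  # memory
print(improvement(4213.92, 589.02, higher_is_better=False))  # TTFT
print(improvement(2.20, 39.92, higher_is_better=True))       # throughput
```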
Intel Ice Lake / AMD Zen4 (AVX-512 VNNI)
Performance comparison on Intel Ice Lake / AMD Zen4 using AVX-512 VNNI kernels.
| Metric | PyTorch | AVX-512 VNNI | Improvement |
|---|---|---|---|
| Memory (MB) | 4,892.38 | 789.88 | 6.19x |
| TTFT (ms) | 6,647.18 | 1,167.26 | 5.69x |
| Throughput (tok/s) | 0.43 | 41.20 | 95.81x |
Intel Core Ultra (AVX-VNNI)
Performance comparison on Intel Core Ultra using AVX-VNNI kernels.
| Metric | PyTorch | AVX-VNNI | Improvement |
|---|---|---|---|
| Memory (MB) | 4,601.55 | 775.84 | 5.93x |
| TTFT (ms) | 7,173.05 | 1,134.48 | 6.32x |
| Throughput (tok/s) | 0.41 | 39.96 | 97.46x |
Cross-Platform Comparison
Cross-platform performance comparison showing TTFT, throughput, and memory-efficiency improvements across Apple Silicon, Intel, and AMD processors.
Energy Consumption
| System | Metric | PyTorch | Litespark | Improvement |
|---|---|---|---|---|
| Apple M5 Max | Total energy (J) | 606.46 | 101.45 | 5.98x |
| Apple M5 Max | Energy/token (J) | 4.74 | 0.79 | 5.98x |
| AMD Ryzen Threadripper PRO 5965WX | Total energy (J) | 12,173.53 | 957.44 | 12.71x |
| AMD Ryzen Threadripper PRO 5965WX | Energy/token (J) | 95.11 | 7.48 | 12.71x |
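The two energy rows per system are consistent with energy per token being total energy divided by 128 tokens. That token count is inferred from the table values and is not stated explicitly, so treat it as an assumption. A short check:

```python
# Assumption: energy per token = total energy / 128 tokens.
# The token count is inferred from the table, not stated in the source.
TOKENS = 128

total_energy_j = {
    ("Apple M5 Max", "PyTorch"): 606.46,
    ("Apple M5 Max", "Litespark"): 101.45,
    ("AMD Ryzen Threadripper PRO 5965WX", "PyTorch"): 12173.53,
    ("AMD Ryzen Threadripper PRO 5965WX", "Litespark"): 957.44,
}

# Divide each total by the token count and round to the table's precision.
energy_per_token = {
    key: round(joules / TOKENS, 2) for key, joules in total_energy_j.items()
}

for (system, runtime), joules in energy_per_token.items():
    print(f"{system:36s} {runtime:10s} {joules:6.2f} J/token")
```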
Thread Scaling
We also measured Litespark-Inference with the pp128+tg128 protocol (a 128-token prompt prefill followed by 128 generated tokens) across thread counts, reporting prompt prefill and autoregressive token generation separately.
AMD EPYC 9R14 (AWS c7a.4xlarge)
| Threads | Prefill pp128 (tok/s) | Generation tg128 (tok/s) |
|---|---|---|
| 1 | 520.96 | 7.67 |
| 2 | 492.16 | 14.56 |
| 4 | 513.08 | 25.42 |
| 8 | 529.36 | 40.49 |
| 10 | 523.91 | 44.86 |
| 16 | 521.59 | 52.49 |
Intel Xeon Platinum 8488C (AWS c7i.4xlarge)
| Threads | Prefill pp128 (tok/s) | Generation tg128 (tok/s) |
|---|---|---|
| 1 | 102.30 | 6.32 |
| 2 | 105.11 | 11.02 |
| 4 | 112.38 | 16.93 |
| 8 | 120.44 | 25.70 |
| 10 | 135.73 | 23.25 |
| 16 | 131.43 | 30.43 |
Apple M5 Max (MacBook Pro)
Litespark-Inference scaling on Apple M5 Max. Prefill throughput keeps scaling through 16 threads, while token generation scales up to 4 threads and then flattens out (33.29 tok/s at 4 threads, 37.92 tok/s at 16).
| Threads | Prefill pp128 (tok/s) | Generation tg128 (tok/s) |
|---|---|---|
| 1 | 50.52 | 10.98 |
| 2 | 91.55 | 19.46 |
| 4 | 152.09 | 33.29 |
| 8 | 194.23 | 35.81 |
| 10 | 218.88 | 37.44 |
| 16 | 262.59 | 37.92 |
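One way to read a thread-scaling table is to convert throughput into speedup over the single-thread run and into parallel efficiency (speedup divided by thread count). The sketch below does this for the Apple M5 Max generation column; the `scaling` helper is illustrative, not part of the library:

```python
# Generation (tg128) throughput on Apple M5 Max, copied from the table above.
tg128_tok_s = {1: 10.98, 2: 19.46, 4: 33.29, 8: 35.81, 10: 37.44, 16: 37.92}

def scaling(throughput: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map thread count -> (speedup vs 1 thread, parallel efficiency)."""
    base = throughput[1]
    return {n: (t / base, t / base / n) for n, t in throughput.items()}

for n, (speedup, eff) in scaling(tg128_tok_s).items():
    print(f"{n:2d} threads: {speedup:4.2f}x speedup, {eff:6.1%} efficiency")
```

On these numbers the generation speedup is roughly 3.0x at 4 threads and roughly 3.5x at 16, so almost all of the gain comes from the first four threads.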
Supported Platforms
- Apple Silicon (M1/M2/M3/M4/M5) — NEON SDOT instructions
- Intel Ice Lake+ — AVX-512 VNNI instructions
- AMD Zen4+ — AVX-512 VNNI instructions
- Intel Core Ultra — AVX-VNNI (256-bit) instructions
- AMD Zen 2–3 / pre-Skylake-X Intel — AVX2 + FMA fallback (256-bit, no VNNI fast path)
Installation
pip install litespark-inference
Requirements:
- Python 3.10+
- PyTorch 2.4+
macOS (recommended):
brew install libomp
OpenMP enables multi-threaded kernel execution. Without it, inference will run single-threaded.
The torchless runtime ships as a setuptools extension and is built
during pip install. No JIT compilation happens at first inference, but
a C++ compiler must be available at install time (Xcode Command Line
Tools on macOS, build-essential on Linux).
Torchless Runtime
litespark-inference ships with a torchless runtime for the supported
BitNet and Falcon Edge models. It reads safetensors directly, stores ternary
weights in the native packed format used by the SIMD kernels, owns the KV
cache, and does not import torch for inference. The CLI dispatches to it
automatically:
litespark-inference generate "Hello, how are you?"
# [litespark-inference] torchless runtime (model=bitnet-2b).
Or invoke the torchless CLI directly:
python -m litespark_inference.torchless generate "Hello, how are you?" --max-tokens 32 --raw
python -m litespark_inference.torchless info
Headline numbers vs the PyTorch baseline (Apple Silicon M5 Max, bitnet-2b)
Same prompt, same workload (pp128 + tg128), measured with
litespark-benchmark --inference --pytorch:
| Metric | PyTorch bf16 | Litespark torchless (int4) | Speedup |
|---|---|---|---|
| Peak RSS | 4,868.22 MB | 806.81 MB | 6.03× |
| TTFT (pp128) | 4.21 s | 0.59 s | 7.15× |
| Throughput (tg128) | 2.20 tok/s | 39.92 tok/s | 18.15× |
Torchless generation is greedy-only today. Sampling flags are accepted by the
CLI for compatibility but ignored on torchless routes; set
LITESPARK_FORCE_TORCH=1 to force the legacy torch-backed path when sampling
is required.
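When driving the CLI from a script, the switch can be set per invocation rather than globally. A minimal sketch, assuming only the documented `LITESPARK_FORCE_TORCH` variable and the `generate` subcommand; the helper `torch_path_invocation` is hypothetical and no sampling flags are shown because their names are not given here:

```python
import os

def torch_path_invocation(prompt: str) -> tuple[list[str], dict[str, str]]:
    """Build (argv, env) for a CLI run that forces the legacy torch-backed path."""
    env = dict(os.environ)              # copy, so the parent environment is untouched
    env["LITESPARK_FORCE_TORCH"] = "1"  # documented switch for the torch path
    argv = ["litespark-inference", "generate", prompt]
    return argv, env

argv, env = torch_path_invocation("The meaning of life is")
print(argv)
# To actually run it (requires the package to be installed):
#   import subprocess; subprocess.run(argv, env=env, check=True)
```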
Usage
Supported Models
- bitnet-2b
- falcon-edge-1b
- falcon-edge-1b-instruct
- falcon-edge-3b
- falcon-edge-3b-instruct
Command Line
# Generate text with the default BitNet model
litespark-inference generate "The meaning of life is"
# Generate text with Falcon Edge instruct
litespark-inference generate "What is the capital of France?" --model falcon-edge-1b-instruct
# Interactive chat with Falcon Edge instruct
litespark-inference chat --model falcon-edge-1b-instruct
# Run benchmark on your hardware
litespark-inference benchmark
# Show system info and detected SIMD capabilities
litespark-inference info
Python API
The torchless runtime is the recommended path — pure numpy plus the native
SIMD kernels, with no torch import at inference.
BitNet-2B has a high-level class:
from litespark_inference.torchless import BitNet
# Auto-downloads from Hugging Face on first use
bn = BitNet.from_pretrained("bitnet-2b")
# chat=True applies the system + chat template and strips the prompt,
# giving a clean instruction-following answer (omit it for raw continuation).
print(bn.generate("Write a short poem about Austin", max_new_tokens=100, chat=True))
Falcon Edge uses the lower-level torchless functions (no high-level class yet):
from litespark_inference.torchless import (
    FALCON_TORCHLESS_REPOS, load_falcon_edge, load_tokenizer,
)
from litespark_inference.torchless.runtime import falcon_generate
name = "falcon-edge-1b-instruct"
model = load_falcon_edge(name) # auto-downloads
tokenizer = load_tokenizer(FALCON_TORCHLESS_REPOS[name])
# format_chat applies the system + chat template for the instruct models
prompt = tokenizer.format_chat("What is the capital of France?")
ids = tokenizer.encode(prompt)
out = falcon_generate(model, ids, max_new_tokens=64)
print(tokenizer.decode(out))
The legacy torch-backed path is load_model (from litespark_inference import load_model): it imports PyTorch and runs the dense float baseline. Prefer the torchless API above for fast, low-memory CPU inference; reach for load_model (or LITESPARK_FORCE_TORCH=1) only when you specifically need the torch path, e.g. for sampling.
Kernel Mode (Apple Silicon)
NEON is the default optimized path on Apple Silicon:
# NEON mode (default) — packed torchless inference, ~0.8 GB on BitNet-2B
litespark-inference generate "Hello" --mode neon
# In Python — NEON is automatic; BitNet.from_pretrained() already uses it.
from litespark_inference.torchless import BitNet
bn = BitNet.from_pretrained("bitnet-2b") # NEON SDOT kernel, ~0.8 GB
How It Works
Ternary models use weights constrained to {-1, 0, +1}. This means matrix multiplication reduces to simple addition and subtraction:
y = Σ x_j · w_j → y = Σ(w=+1) x_j - Σ(w=-1) x_j
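The identity above can be checked numerically. This is a plain-Python illustration of the arithmetic, not the library's kernel:

```python
import random

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(16)]     # activations
w = [random.choice((-1, 0, 1)) for _ in range(16)]  # ternary weights

# Reference: ordinary dot product, one multiplication per element.
y_ref = sum(xj * wj for xj, wj in zip(x, w))

# Multiplication-free form: add where w = +1, subtract where w = -1, skip zeros.
y_addsub = (
    sum(xj for xj, wj in zip(x, w) if wj == 1)
    - sum(xj for xj, wj in zip(x, w) if wj == -1)
)

print(y_ref, y_addsub)  # equal up to floating-point summation order
```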
Litespark-Inference exploits this structure with custom SIMD kernels that:
- Store weights as int8 — enabling direct use of hardware dot product instructions
- Quantize activations per-row — converting float32 inputs to int8 with scale factors
- Use hardware SIMD instructions — NEON SDOT (ARM) or AVX-512 VPDPBUSD (x86)
- Apply zero-point correction — maintaining numerical accuracy
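The steps above can be sketched in plain Python. Two assumptions are made that the text does not confirm: activations use symmetric per-row absmax quantization to [-127, 127], and the zero-point correction compensates for shifting activations into the unsigned range, which an unsigned-by-signed instruction such as VPDPBUSD requires. The function names are illustrative only.

```python
import random

def quantize_row(x: list[float]) -> tuple[list[int], float]:
    """Symmetric per-row int8 quantization: x ~= scale * q with q in [-127, 127]."""
    absmax = max(abs(v) for v in x) or 1.0  # avoid dividing by zero on an all-zero row
    scale = absmax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale

def ternary_dot_unsigned(q: list[int], w: list[int], scale: float) -> float:
    """Dot product with unsigned activations and signed ternary weights."""
    zero_point = 128
    u = [qi + zero_point for qi in q]           # shift int8 into the uint8 range [1, 255]
    acc = sum(ui * wi for ui, wi in zip(u, w))  # integer accumulate (the hardware dot product)
    acc -= zero_point * sum(w)                  # zero-point correction
    return acc * scale                          # one float rescale per output

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
w = [random.choice((-1, 0, 1)) for _ in range(64)]

q, scale = quantize_row(x)
exact = sum(xj * wj for xj, wj in zip(x, w))
approx = ternary_dot_unsigned(q, w, scale)
print(f"float reference: {exact:.4f}, int8 path: {approx:.4f}")
```

The correction is exact in integer arithmetic, so the only error left is activation rounding, bounded by half a quantization step per nonzero weight. A signed-by-signed instruction such as NEON SDOT would not need the shift under these assumptions.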
The library automatically detects your CPU's SIMD capabilities and dispatches to the optimal kernel.
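Dispatch of this kind usually comes down to a priority list over CPU feature flags. The sketch below is an assumption about the general shape, not the library's code: the flag names follow Linux /proc/cpuinfo spelling, and the kernel names and the scalar last resort are hypothetical.

```python
def pick_kernel(cpu_flags: set[str]) -> str:
    """Choose the preferred kernel given a set of CPU feature flags."""
    # Priority order follows the Supported Platforms list; names are illustrative.
    preferences = [
        ("neon-sdot", {"asimddp"}),        # ARM dot-product extension
        ("avx512-vnni", {"avx512_vnni"}),
        ("avx-vnni", {"avx_vnni"}),
        ("avx2-fma", {"avx2", "fma"}),     # fallback without a VNNI fast path
    ]
    for kernel, required in preferences:
        if required <= cpu_flags:          # all required flags are present
            return kernel
    return "scalar"  # hypothetical last resort; the source does not describe one

print(pick_kernel({"avx2", "fma", "avx512_vnni"}))
print(pick_kernel({"avx2", "fma"}))
```

To see what the library actually detected on a given machine, run litespark-inference info.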
Benchmarking
Run the built-in benchmark to measure performance on your hardware:
litespark-inference benchmark
The litespark-benchmark command (installed with the package) runs the
detailed profiling benchmarks:
# Litespark (torchless by default for bitnet-2b) vs the PyTorch baseline,
# pp128 + tg128, both runtimes' RSS captured. ~6 min total.
litespark-benchmark --inference --pytorch
# Quick torchless-only inference benchmark, no PyTorch baseline.
litespark-benchmark --inference --no-matrix
# Full sweep: matrix + thread-scaling + inference (each torch-backed
# kernel phase runs in its own subprocess so torch's libomp never
# coexists with our libomp).
litespark-benchmark --all
# Raw kernel benchmarks (matrix shapes, scaling) standalone.
litespark-benchmark
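For readers who want to time their own workloads, the two headline metrics can be approximated with a few lines of standard-library code. This is a sketch of the definitions, not the litespark-benchmark implementation: TTFT is approximated by the prefill time, and the `prefill` and `decode_one` callables are placeholders for whatever model calls you are measuring.

```python
import time
from typing import Callable

def measure(
    prefill: Callable[[], None],
    decode_one: Callable[[], None],
    n_tokens: int = 128,
) -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one prefill plus n_tokens decode steps."""
    t0 = time.perf_counter()
    prefill()                      # process the whole prompt (pp128 in the tables)
    ttft = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(n_tokens):      # autoregressive generation (tg128 in the tables)
        decode_one()
    elapsed = time.perf_counter() - t1
    return ttft, n_tokens / elapsed

# Stand-in workloads so the sketch runs without a model.
ttft, tok_s = measure(lambda: time.sleep(0.02), lambda: time.sleep(0.001), n_tokens=16)
print(f"TTFT {ttft * 1000:.1f} ms, throughput {tok_s:.1f} tok/s")
```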
Citation
If you use Litespark-Inference in your research, please cite:
@article{litespark2026,
title={Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks},
author={Dade, Nii Osae Osae and Morri, Tony and Rahat, Moinul Hossain and Pal, Sayandip},
year={2026}
}
License
Apache License 2.0. See LICENSE for details.