Efficient CPU inference for BitNet 1.58-bit models

Litespark-Inference

Fast CPU inference for ternary neural networks

Litespark-Inference is a pip-installable Python library that enables efficient inference of ternary models on consumer CPUs. By exploiting the ternary weight structure ({-1, 0, +1}) with custom SIMD kernels, we remove floating-point multiplication from the weight matrix products and run much faster than standard PyTorch inference (up to 52× higher throughput in the benchmarks below).

Key Results

Apple Silicon (M1–M4)

Performance comparison on Apple M4. With the NEON kernels, Litespark-Inference achieves ~14× memory reduction, 9.2× faster TTFT, and 52× higher throughput compared to PyTorch.

Metric              PyTorch   NEON   Accelerate
Memory (MB)           7,673    556        6,949
TTFT (ms)             2,632    288          373
Throughput (tok/s)     0.39   20.4         5.52

Intel Ice Lake / AMD Zen4 (AVX-512 VNNI)

Performance comparison on Intel Ice Lake / AMD Zen4 using AVX-512 VNNI kernels.

Metric              PyTorch   AVX-512 VNNI   Speedup
Memory (MB)           7,800            556     14.0×
TTFT (ms)             2,450            195     12.6×
Throughput (tok/s)     0.42           11.2     26.7×

Intel Core Ultra (AVX-VNNI)

Performance comparison on Intel Core Ultra using AVX-VNNI kernels.

Metric              PyTorch   AVX-VNNI   Speedup
Memory (MB)           7,750        556     13.9×
TTFT (ms)             2,580        310      8.3×
Throughput (tok/s)     0.40        8.5     21.3×

Cross-Platform Comparison

Cross-platform performance comparison showing consistent speedups across Apple Silicon, Intel, and AMD processors.

Comparison with BitNet.cpp v2

We benchmarked Litespark-Inference against Microsoft's BitNet.cpp v2 using their pp128+tg128 methodology (128-token prompt processing + 128-token generation).
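The two figures reported in the tables below are throughputs: prompt tokens processed per second (pp128) and generated tokens per second (tg128). The following is a minimal timing sketch of that calculation, not the harness used for these results. The names `time_call` and `measure_pp_tg` are hypothetical, and the two stand-in workloads simply sleep so the example runs without a model.

```python
import time

def time_call(fn):
    """Return the wall-clock seconds taken by one call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def measure_pp_tg(run_prefill, run_generate, n_prompt=128, n_gen=128):
    """Time two zero-argument callables and return (prefill tok/s, generation tok/s).

    run_prefill is expected to process n_prompt prompt tokens,
    run_generate is expected to produce n_gen new tokens.
    """
    t_pp = time_call(run_prefill)
    t_tg = time_call(run_generate)
    return n_prompt / t_pp, n_gen / t_tg

# Stand-in workloads (assumption: real code would call the model here).
def fake_prefill():
    time.sleep(0.01)

def fake_generate():
    time.sleep(0.02)

pp_rate, tg_rate = measure_pp_tg(fake_prefill, fake_generate)
print(f"pp128: {pp_rate:.1f} tok/s, tg128: {tg_rate:.1f} tok/s")
```

With a real model, the callables would wrap the actual prompt-processing and generation steps. How to isolate prefill from generation depends on the inference API, so treat any such split as an approximation.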

AMD EPYC 9R14 (AWS c7a.2xlarge)

Scaling behavior on AMD EPYC 9R14. BitNet.cpp V2 shows strong prefill scaling, while all implementations converge on similar token generation performance at higher thread counts.

All values in tok/s.

Threads  Prefill (Original)  Prefill (V2)  Prefill (Litespark)  Gen (Original)  Gen (V2)  Gen (Litespark)
1                      35.0          43.4                 38.2            10.0      15.6             15.9
2                      70.0          81.2                 74.7            18.0      28.7             28.1
4                     140.0         156.8                140.7            30.0      49.2             48.2
8                     210.0         291.8                230.7            42.0      66.2             67.5

Intel Xeon Platinum 8488C (AWS c7i.2xlarge)

Scaling behavior on Intel Xeon Platinum 8488C. Litespark-Inference maintains a consistent lead in prefill throughput across all thread configurations.

All values in tok/s.

Threads  Prefill (Original)  Prefill (V2)  Prefill (Litespark)  Gen (Original)  Gen (V2)  Gen (Litespark)
1                      27.0          43.4                 59.7            10.0      13.3             13.6
2                      40.0          65.8                 85.9            13.0      19.1             19.5
4                      55.0          77.9                110.2            16.0      24.3             25.0
6                      79.0         101.3                120.7            20.0      29.5             28.0

Apple M4 (MacBook Pro)

Litespark-Inference scaling on Apple M4. Prefill throughput rises about 3.1× from 1 to 4 threads (26.1 to 81.9 tok/s) and flattens beyond that; token generation dips at 8 threads and peaks at 19.6 tok/s when all 10 CPU cores are used.

Threads  Prefill pp128 (tok/s)  Generation tg128 (tok/s)
1                         26.1                       6.5
2                         43.1                      11.0
4                         81.9                      15.4
8                        101.2                      14.0
10                       108.8                      19.6
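To read the scaling numbers more precisely, it can help to convert them into speedup over one thread and parallel efficiency (speedup divided by thread count). The sketch below does that arithmetic on the Apple M4 figures from the table; the helper name `scaling` is made up for this example.

```python
# Apple M4 results from the table: threads -> (prefill tok/s, generation tok/s)
m4 = {
    1: (26.1, 6.5),
    2: (43.1, 11.0),
    4: (81.9, 15.4),
    8: (101.2, 14.0),
    10: (108.8, 19.6),
}

def scaling(results, column):
    """Return {threads: (speedup vs 1 thread, parallel efficiency)} for one column."""
    base = results[1][column]
    out = {}
    for threads, row in sorted(results.items()):
        speedup = row[column] / base
        out[threads] = (speedup, speedup / threads)
    return out

for threads, (speedup, eff) in scaling(m4, 0).items():
    print(f"{threads:>2} threads: {speedup:.2f}x prefill speedup, {eff:.0%} efficiency")
```

Passing `column=1` gives the same breakdown for token generation. An efficiency well below 100% at high thread counts usually means the workload is limited by something other than compute, such as memory bandwidth, though these numbers alone do not show which.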

Supported Platforms

  • Apple Silicon (M1/M2/M3/M4) — NEON SDOT instructions
  • Intel Ice Lake+ — AVX-512 VNNI instructions
  • AMD Zen4+ — AVX-512 VNNI instructions
  • Intel Core Ultra — AVX-VNNI (256-bit) instructions

Installation

pip install litespark-inference

Requirements:

  • Python 3.9+
  • PyTorch 2.4+

macOS (recommended):

brew install libomp

OpenMP enables multi-threaded kernel execution. Without it, inference will run single-threaded.
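OpenMP runtimes conventionally read the standard `OMP_NUM_THREADS` environment variable to decide how many threads to use. Assuming Litespark-Inference's kernels follow that convention (this is not confirmed here), the thread count could be capped from Python as in the sketch below. The helper `set_openmp_threads` is hypothetical and not part of the library.

```python
import os

def set_openmp_threads(n_threads: int) -> None:
    """Set OMP_NUM_THREADS for OpenMP runtimes loaded later in this process."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(n_threads)

# Use up to 4 threads, but never more than the number of logical CPUs.
requested = min(4, os.cpu_count() or 1)
set_openmp_threads(requested)

# The variable is typically read when the OpenMP runtime initializes,
# so set it before importing any library that loads OpenMP, e.g.:
# from litespark_inference import load_model
```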

Usage

Command Line

# Generate text
litespark-inference generate "The meaning of life is"

# Interactive chat
litespark-inference chat

# Run benchmark on your hardware
litespark-inference benchmark

# Show system info and detected SIMD capabilities
litespark-inference info

Python API

from litespark_inference import load_model

# Load the BitNet 2B model (auto-downloads from HuggingFace)
model, tokenizer = load_model("bitnet-2b")

# Generate text
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Kernel Modes (Apple Silicon)

Two inference modes are available on Apple Silicon:

# NEON mode (default) — fast int8 quantized inference, ~556 MB
litespark-inference generate "Hello" --mode neon

# Accelerate mode — float32 with Apple AMX, bit-exact accuracy, ~2.5 GB
litespark-inference generate "Hello" --mode accelerate

# In Python
model, tokenizer = load_model("bitnet-2b", mode="neon")       # default, fast
model, tokenizer = load_model("bitnet-2b", mode="accelerate") # accurate

How It Works

Ternary models use weights constrained to {-1, 0, +1}. This means matrix multiplication reduces to simple addition and subtraction:

y = Σ_j x_j · w_j  →  y = Σ_{j: w_j = +1} x_j − Σ_{j: w_j = −1} x_j
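The identity can be checked directly. Here is a minimal pure-Python sketch (not the library's kernel; `ternary_dot` is an illustrative name) that computes a ternary dot product using only additions and subtractions and compares it with the ordinary multiply-accumulate form.

```python
def ternary_dot(weights, x):
    """Dot product for weights in {-1, 0, +1} using only add and subtract."""
    acc = 0.0
    for w, xj in zip(weights, x):
        if w == 1:
            acc += xj      # weight +1: add the input
        elif w == -1:
            acc -= xj      # weight -1: subtract the input
        # weight 0: the input contributes nothing
    return acc

weights = [1, 0, -1, 1, -1, 0]
x = [0.5, 2.0, 1.5, -3.0, 0.25, 4.0]

# Reference result using ordinary multiplication.
reference = sum(w * xj for w, xj in zip(weights, x))
result = ternary_dot(weights, x)
print(result, reference)
```

A matrix-vector product applies the same rule once per output row.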

Litespark-Inference exploits this structure with custom SIMD kernels that:

  1. Store weights as int8 — enabling direct use of hardware dot product instructions
  2. Quantize activations per-row — converting float32 inputs to int8 with scale factors
  3. Use hardware SIMD instructions — NEON SDOT (ARM) or AVX-512 VPDPBUSD (x86)
  4. Apply zero-point correction — maintaining numerical accuracy
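The steps above can be sketched in pure Python. This is an illustration under stated assumptions, not the library's implementation: it assumes symmetric per-row quantization to [-127, 127] and one common form of zero-point correction, where signed activations are shifted by 128 into the unsigned range (x86 VPDPBUSD multiplies unsigned bytes by signed bytes) and the resulting offset is subtracted afterwards. The function names are hypothetical.

```python
def quantize_row(x):
    """Symmetric per-row quantization: returns (ints in [-127, 127], scale)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale

def int8_ternary_dot(x, w):
    """Approximate sum(x[j] * w[j]) for ternary w using integer arithmetic."""
    q, scale = quantize_row(x)
    # Shift signed activations into the unsigned range [1, 255].
    u = [qi + 128 for qi in q]
    # Integer multiply-accumulate (what a hardware dot-product instruction does).
    acc = sum(ui * wi for ui, wi in zip(u, w))
    # Zero-point correction: the shift added 128 * sum(w) to the accumulator.
    acc -= 128 * sum(w)
    # Rescale the integer result back to floating point.
    return scale * acc

x = [0.5, -1.0, 0.25, 2.0]
w = [1, -1, 0, 1]

exact = sum(xj * wj for xj, wj in zip(x, w))
approx = int8_ternary_dot(x, w)
print(f"exact={exact}, int8 approximation={approx:.4f}")
```

Because the weights are constant, the `128 * sum(w)` term can be precomputed once per weight row, so the correction costs a single subtraction per output.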

The library automatically detects your CPU's SIMD capabilities and dispatches to the optimal kernel.
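A rough picture of what such dispatch could look like from Python follows. It is a sketch only: the kernel names, the `pick_kernel` and `read_linux_cpu_flags` helpers, and the use of `/proc/cpuinfo` are assumptions for illustration, not the library's actual detection code. The flag names `avx512_vnni` and `avx_vnni` are the ones Linux reports in `/proc/cpuinfo`.

```python
import platform

def pick_kernel(machine, cpu_flags):
    """Map an architecture string and CPU feature flags to a kernel name."""
    machine = machine.lower()
    if machine in ("arm64", "aarch64"):
        # Apple Silicon supports SDOT; other ARM cores would need a feature check.
        return "neon_sdot"
    if machine in ("x86_64", "amd64"):
        if "avx512_vnni" in cpu_flags:
            return "avx512_vnni"   # prefer the 512-bit path when available
        if "avx_vnni" in cpu_flags:
            return "avx_vnni"      # 256-bit VNNI
    return "fallback"

def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags from /proc/cpuinfo, or an empty set."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith(("flags", "Features")):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # not Linux, or the file is unreadable
    return set()

print(pick_kernel(platform.machine(), read_linux_cpu_flags()))
```

To see what the library itself detects on a given machine, use the `litespark-inference info` command described under Usage.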

Benchmarking

Run the built-in benchmark to measure performance on your hardware:

litespark-inference benchmark

Or use the benchmark scripts for detailed profiling:

python benchmark_kernel.py      # Kernel-level benchmarks
python benchmark_synthetic.py   # Synthetic workload benchmarks

Citation

If you use Litespark-Inference in your research, please cite:

@article{litespark2024,
  title={Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks},
  author={Dade, Nii Osae Osae and Morri, Maurizio and Rahat, Moinul Hossain},
  year={2024}
}

License

Apache License 2.0. See LICENSE for details.
