Efficient CPU inference for BitNet 1.58-bit models

Litespark-Inference

Fast CPU inference for ternary neural networks

Litespark-Inference is a pip-installable Python library for efficient inference of ternary models on consumer CPUs. Custom SIMD kernels exploit the ternary weight structure ({-1, 0, +1}) to replace the floating-point multiplications in matrix products with integer additions and subtractions, yielding 21× to 52× higher throughput than standard PyTorch inference in the benchmarks below.

Key Results

Apple Silicon (M1–M4)

Performance comparison on Apple Silicon M4. In NEON mode, Litespark-Inference uses ~14× less memory, reaches the first token ~9.1× faster, and delivers ~52× higher throughput than PyTorch.

| Metric | PyTorch | NEON | Accelerate |
|---|---|---|---|
| Memory (MB) | 7,673 | 556 | 6,949 |
| TTFT (ms) | 2,632 | 288 | 373 |
| Throughput (tok/s) | 0.39 | 20.4 | 5.52 |
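
As a quick check on how the headline ratios follow from the table, the sketch below recomputes them from the PyTorch and NEON columns. The dictionary keys and the `improvement` helper are illustrative names only. Because the table values are rounded, the recomputed ratios may differ slightly from figures quoted elsewhere.

```python
# Values copied from the Apple Silicon table (PyTorch vs. NEON columns).
pytorch = {"memory_mb": 7673, "ttft_ms": 2632, "throughput_tok_s": 0.39}
neon = {"memory_mb": 556, "ttft_ms": 288, "throughput_tok_s": 20.4}


def improvement(metric: str) -> float:
    """Return how many times better NEON is than PyTorch for one metric."""
    if metric == "throughput_tok_s":
        # Higher is better, so divide NEON by PyTorch.
        return neon[metric] / pytorch[metric]
    # Lower is better (memory, latency), so divide PyTorch by NEON.
    return pytorch[metric] / neon[metric]


ratios = {metric: round(improvement(metric), 1) for metric in pytorch}
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio}x")
```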

Intel Ice Lake / AMD Zen4 (AVX-512 VNNI)

Performance comparison on Intel Ice Lake / AMD Zen4 using AVX-512 VNNI kernels.

| Metric | PyTorch | AVX-512 VNNI | Speedup |
|---|---|---|---|
| Memory (MB) | 7,800 | 556 | 14.0× |
| TTFT (ms) | 2,450 | 195 | 12.6× |
| Throughput (tok/s) | 0.42 | 11.2 | 26.7× |

Intel Core Ultra (AVX-VNNI)

Performance comparison on Intel Core Ultra using AVX-VNNI kernels.

| Metric | PyTorch | AVX-VNNI | Speedup |
|---|---|---|---|
| Memory (MB) | 7,750 | 556 | 13.9× |
| TTFT (ms) | 2,580 | 310 | 8.3× |
| Throughput (tok/s) | 0.40 | 8.5 | 21.3× |

Cross-Platform Comparison

Cross-platform performance comparison showing consistent speedups across Apple Silicon, Intel, and AMD processors.

Comparison with BitNet.cpp v2

We benchmarked Litespark-Inference against Microsoft's BitNet.cpp v2 using their pp128+tg128 methodology (128-token prompt processing + 128-token generation).

AMD EPYC 9R14 (AWS c7a.2xlarge)

Scaling behavior on AMD EPYC 9R14. BitNet.cpp v2 shows the strongest prefill scaling, while BitNet.cpp v2 and Litespark-Inference stay within about 2% of each other on token generation at every thread count, both ahead of the original implementation.

| Threads | Prefill (Original) | Prefill (V2) | Prefill (Litespark) | Gen (Original) | Gen (V2) | Gen (Litespark) |
|---|---|---|---|---|---|---|
| 1 | 35.0 | 43.4 | 38.2 | 10.0 | 15.6 | 15.9 |
| 2 | 70.0 | 81.2 | 74.7 | 18.0 | 28.7 | 28.1 |
| 4 | 140.0 | 156.8 | 140.7 | 30.0 | 49.2 | 48.2 |
| 8 | 210.0 | 291.8 | 230.7 | 42.0 | 66.2 | 67.5 |

Intel Xeon Platinum 8488C (AWS c7i.2xlarge)

Scaling behavior on Intel Xeon Platinum 8488C. Litespark-Inference maintains a consistent lead in prefill throughput across all thread configurations.

| Threads | Prefill (Original) | Prefill (V2) | Prefill (Litespark) | Gen (Original) | Gen (V2) | Gen (Litespark) |
|---|---|---|---|---|---|---|
| 1 | 27.0 | 43.4 | 59.7 | 10.0 | 13.3 | 13.6 |
| 2 | 40.0 | 65.8 | 85.9 | 13.0 | 19.1 | 19.5 |
| 4 | 55.0 | 77.9 | 110.2 | 16.0 | 24.3 | 25.0 |
| 6 | 79.0 | 101.3 | 120.7 | 20.0 | 29.5 | 28.0 |

Apple M4 (MacBook Pro)

Litespark-Inference scaling on Apple M4. Prefill throughput grows about 3.1× from 1 to 4 threads and flattens beyond that, while token generation peaks at 19.6 tok/s when all 10 CPU cores are used.

| Threads | Prefill pp128 (tok/s) | Generation tg128 (tok/s) |
|---|---|---|
| 1 | 26.1 | 6.5 |
| 2 | 43.1 | 11.0 |
| 4 | 81.9 | 15.4 |
| 8 | 101.2 | 14.0 |
| 10 | 108.8 | 19.6 |
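
To make the scaling behavior concrete, the following sketch derives speedup and parallel efficiency (speedup divided by thread count) from the Apple M4 numbers. The `scaling` helper is an illustrative name, not part of the library.

```python
# (threads, prefill tok/s, generation tok/s) copied from the Apple M4 table.
rows = [
    (1, 26.1, 6.5),
    (2, 43.1, 11.0),
    (4, 81.9, 15.4),
    (8, 101.2, 14.0),
    (10, 108.8, 19.6),
]


def scaling(rows, column):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread row."""
    baseline = rows[0][column]
    result = {}
    for row in rows:
        threads = row[0]
        speedup = row[column] / baseline
        result[threads] = (round(speedup, 2), round(speedup / threads, 2))
    return result


prefill = scaling(rows, column=1)
generation = scaling(rows, column=2)
for threads in prefill:
    print(threads, "prefill", prefill[threads], "generation", generation[threads])
```

An efficiency of 1.0 would mean perfectly linear scaling. On these numbers prefill efficiency is 0.78 at 4 threads and falls to 0.42 at 10 threads.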

Supported Platforms

  • Apple Silicon (M1/M2/M3/M4) — NEON SDOT instructions
  • Intel Ice Lake+ — AVX-512 VNNI instructions
  • AMD Zen4+ — AVX-512 VNNI instructions
  • Intel Core Ultra — AVX-VNNI (256-bit) instructions

Installation

pip install litespark-inference

Requirements:

  • Python 3.9+
  • PyTorch 2.4+

macOS (recommended):

brew install libomp

OpenMP enables multi-threaded kernel execution. Without it, inference will run single-threaded.
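
OpenMP runtimes conventionally read the `OMP_NUM_THREADS` environment variable to decide how many threads to use. The sketch below sets it from Python before any OpenMP-backed code is loaded. Whether Litespark-Inference honors this variable or exposes its own thread setting is an assumption here, and `configure_threads` is a hypothetical helper.

```python
import os


def configure_threads(num_threads=None):
    """Set OMP_NUM_THREADS unless a value is already present in the environment."""
    if num_threads is None:
        # Fall back to the number of logical CPUs reported by the OS.
        num_threads = os.cpu_count() or 1
    # setdefault keeps any value the user exported in their shell.
    os.environ.setdefault("OMP_NUM_THREADS", str(num_threads))
    return os.environ["OMP_NUM_THREADS"]


# Call this before importing the inference library so the OpenMP runtime
# sees the value when it initializes (assumption: the library uses OpenMP defaults).
print("OMP_NUM_THREADS =", configure_threads())
```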

Usage

Command Line

# Generate text
litespark-inference generate "The meaning of life is"

# Interactive chat
litespark-inference chat

# Run benchmark on your hardware
litespark-inference benchmark

# Show system info and detected SIMD capabilities
litespark-inference info

Python API

from litespark_inference import load_model

# Load the BitNet 2B model (auto-downloads from HuggingFace)
model, tokenizer = load_model("bitnet-2b")

# Generate text
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Kernel Modes (Apple Silicon)

Two inference modes are available on Apple Silicon:

# NEON mode (default) — fast int8 quantized inference, ~556 MB
litespark-inference generate "Hello" --mode neon

# Accelerate mode — float32 with Apple AMX, bit-exact accuracy, ~2.5 GB
litespark-inference generate "Hello" --mode accelerate

# In Python
model, tokenizer = load_model("bitnet-2b", mode="neon")       # default, fast
model, tokenizer = load_model("bitnet-2b", mode="accelerate") # accurate

How It Works

Ternary models use weights constrained to {-1, 0, +1}. This means matrix multiplication reduces to simple addition and subtraction:

y = Σ_j x_j · w_j  →  y = Σ_{j : w_j = +1} x_j − Σ_{j : w_j = −1} x_j
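
The identity above can be checked numerically. This is a minimal NumPy sketch for illustration, not the library's kernel; `ternary_dot` is a hypothetical name.

```python
import numpy as np


def ternary_dot(x, w):
    """Dot product with ternary weights using only additions and subtractions."""
    # Add activations where w == +1, subtract where w == -1, skip zeros.
    return x[w == 1].sum() - x[w == -1].sum()


rng = np.random.default_rng(0)
x = rng.standard_normal(16).astype(np.float32)   # activations
w = rng.integers(-1, 2, size=16).astype(np.int8)  # weights in {-1, 0, +1}

reference = np.dot(x, w.astype(np.float32))  # ordinary multiply-accumulate
print(np.isclose(ternary_dot(x, w), reference, atol=1e-5))
```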

Litespark-Inference exploits this structure with custom SIMD kernels that:

  1. Store weights as int8 — enabling direct use of hardware dot product instructions
  2. Quantize activations per-row — converting float32 inputs to int8 with scale factors
  3. Use hardware SIMD instructions — NEON SDOT (ARM) or AVX-512 VPDPBUSD (x86)
  4. Apply zero-point correction — maintaining numerical accuracy
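
The four steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes symmetric per-row scaling to [-127, 127] and a +128 offset to obtain the unsigned operand that an unsigned-by-signed dot product such as VPDPBUSD expects. The function names are hypothetical, and NumPy integer arithmetic stands in for the SIMD instructions.

```python
import numpy as np


def quantize_rows(x):
    """Per-row symmetric int8 quantization so that x ≈ scale * q."""
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # avoid dividing by zero
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


def ternary_matmul_int8(x, w):
    """x: (rows, k) float32, w: (k, n) int8 ternary. Returns (rows, n) float32."""
    q, scale = quantize_rows(x)
    w32 = w.astype(np.int32)
    # Shift activations to the unsigned range [1, 255] (assumed +128 offset).
    u = q.astype(np.int32) + 128
    acc = u @ w32  # integer dot products, as a hardware instruction would compute
    # Zero-point correction: the shift added 128 * sum(w) to every output column.
    acc = acc - 128 * w32.sum(axis=0, keepdims=True)
    return acc.astype(np.float32) * scale  # rescale back to float


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64)).astype(np.float32)
w = rng.integers(-1, 2, size=(64, 3)).astype(np.int8)
print(np.abs(ternary_matmul_int8(x, w) - x @ w.astype(np.float32)).max())
```

The printed value is the largest absolute difference from the float32 result, which comes only from rounding activations to int8; the zero-point correction itself is exact integer arithmetic.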

The library automatically detects your CPU's SIMD capabilities and dispatches to the optimal kernel.
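
How such a dispatch might look is sketched below. The kernel names, the `pick_kernel` function, and the detection strategy are all assumptions for illustration; the library's actual detection logic is not shown in this document. The flag names `avx512_vnni`, `avx_vnni`, and `asimddp` are the ones Linux reports in `/proc/cpuinfo`.

```python
import platform
from pathlib import Path


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags from /proc/cpuinfo, or an empty set."""
    try:
        text = Path(path).read_text()
    except OSError:
        return set()  # not Linux, or the file is unreadable
    for line in text.splitlines():
        key, _, value = line.partition(":")
        # x86 uses "flags"; ARM uses "Features".
        if key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()


def pick_kernel(system, machine, flags):
    """Map OS, architecture, and feature flags to a (hypothetical) kernel name."""
    machine = machine.lower()
    if machine in ("arm64", "aarch64"):
        # Apple Silicon is listed as supporting NEON SDOT; Linux ARM reports "asimddp".
        if system == "Darwin" or "asimddp" in flags:
            return "neon_sdot"
    elif machine in ("x86_64", "amd64"):
        if "avx512_vnni" in flags:
            return "avx512_vnni"  # prefer the wider 512-bit kernel
        if "avx_vnni" in flags:
            return "avx_vnni"
    return "generic"


print(pick_kernel(platform.system(), platform.machine(), read_linux_cpu_flags()))
```

To see what the library itself detects, run `litespark-inference info`.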

Benchmarking

Run the built-in benchmark to measure performance on your hardware:

litespark-inference benchmark

Or use the benchmark scripts for detailed profiling:

python benchmark_kernel.py      # Kernel-level benchmarks
python benchmark_synthetic.py   # Synthetic workload benchmarks
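
For ad-hoc timing from Python, a generic helper along these lines may be enough. It is a sketch that wraps any callable; `measure_tokens_per_second` and `fake_generate` are hypothetical names, and the stand-in workload should be swapped for a real call such as the `model.generate(...)` shown in the Python API section.

```python
import time


def measure_tokens_per_second(run, num_tokens, repeats=3):
    """Time run() several times and return tokens per second for the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return num_tokens / best


def fake_generate():
    """Stand-in workload; replace with e.g. model.generate(input_ids, max_new_tokens=128)."""
    time.sleep(0.01)


rate = measure_tokens_per_second(fake_generate, num_tokens=128)
print(f"{rate:.1f} tok/s")
```

Timing a full `generate` call blends prompt processing with token generation, so the result is not directly comparable to the separate pp128 and tg128 figures above.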

Citation

If you use Litespark-Inference in your research, please cite:

@article{litespark2024,
  title={Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks},
  author={Dade, Nii Osae Osae and Morri, Maurizio and Rahat, Moinul Hossain},
  year={2024}
}

License

Apache License 2.0. See LICENSE for details.
