Efficient CPU inference for BitNet 1.58-bit models

Litespark-Inference

Fast CPU inference for ternary neural networks

Litespark-Inference is a pip-installable Python library for efficient inference of ternary models on consumer CPUs. Custom SIMD kernels exploit the ternary weight structure ({-1, 0, +1}) so that matrix products run on integer dot-product instructions instead of floating-point multiplication, giving 21× to 52× higher throughput than standard PyTorch inference in the benchmarks below.

Key Results

Apple Silicon (M1–M4)

Performance comparison on Apple Silicon M4. With the NEON kernel, Litespark-Inference uses ~14× less memory, reaches the first token ~9× faster (TTFT), and delivers ~52× higher throughput than PyTorch.

| Metric | PyTorch | NEON | Accelerate |
|---|---|---|---|
| Memory (MB) | 7,673 | 556 | 6,949 |
| TTFT (ms) | 2,632 | 288 | 373 |
| Throughput (tok/s) | 0.39 | 20.4 | 5.52 |
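
As a quick sanity check, the headline ratios can be recomputed from the PyTorch and NEON columns of the table. This is only illustrative arithmetic; the dictionary layout and variable names are assumptions made for this example and are not part of the library.

```python
# Values copied from the Apple Silicon table (PyTorch vs. NEON columns).
pytorch = {"memory_mb": 7673, "ttft_ms": 2632, "throughput_tok_s": 0.39}
neon = {"memory_mb": 556, "ttft_ms": 288, "throughput_tok_s": 20.4}

# Lower is better for memory and TTFT, so the ratio is baseline / optimized.
memory_reduction = pytorch["memory_mb"] / neon["memory_mb"]
ttft_speedup = pytorch["ttft_ms"] / neon["ttft_ms"]
# Higher is better for throughput, so the ratio is optimized / baseline.
throughput_gain = neon["throughput_tok_s"] / pytorch["throughput_tok_s"]

print(f"memory reduction: {memory_reduction:.1f}x")
print(f"TTFT speedup:     {ttft_speedup:.1f}x")
print(f"throughput gain:  {throughput_gain:.1f}x")
```

The same calculation applies to the Speedup columns in the x86 tables further down.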

Intel Ice Lake / AMD Zen4 (AVX-512 VNNI)

Performance comparison on Intel Ice Lake / AMD Zen4 using AVX-512 VNNI kernels.

| Metric | PyTorch | AVX-512 VNNI | Speedup |
|---|---|---|---|
| Memory (MB) | 7,800 | 556 | 14.0× |
| TTFT (ms) | 2,450 | 195 | 12.6× |
| Throughput (tok/s) | 0.42 | 11.2 | 26.7× |

Intel Core Ultra (AVX-VNNI)

Performance comparison on Intel Core Ultra using AVX-VNNI kernels.

| Metric | PyTorch | AVX-VNNI | Speedup |
|---|---|---|---|
| Memory (MB) | 7,750 | 556 | 13.9× |
| TTFT (ms) | 2,580 | 310 | 8.3× |
| Throughput (tok/s) | 0.40 | 8.5 | 21.3× |

Cross-Platform Comparison

Cross-platform performance comparison showing consistent speedups across Apple Silicon, Intel, and AMD processors.

Comparison with BitNet.cpp v2

We benchmarked Litespark-Inference against Microsoft's BitNet.cpp v2 using their pp128+tg128 methodology (128-token prompt processing + 128-token generation).
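
A pp128+tg128-style measurement times prompt processing and token generation as two separate phases. The sketch below shows one way such a harness could be structured. `measure_pp_tg`, `prefill_fn`, and `decode_fn` are hypothetical names used only for illustration; this is not the benchmark code of BitNet.cpp or Litespark-Inference, and the stand-in workloads simply sleep so the example runs without a model.

```python
import time
from typing import Callable, Dict, List


def measure_pp_tg(
    prefill_fn: Callable[[List[int]], None],
    decode_fn: Callable[[], int],
    prompt_len: int = 128,
    gen_len: int = 128,
) -> Dict[str, float]:
    """Time prompt processing (pp) and token generation (tg) separately."""
    prompt = [0] * prompt_len  # placeholder token ids; content is irrelevant for timing

    start = time.perf_counter()
    prefill_fn(prompt)                 # process the whole prompt in one pass
    prefill_s = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(gen_len):           # generate one token per call
        decode_fn()
    decode_s = time.perf_counter() - start

    return {
        "pp_tok_s": prompt_len / prefill_s,
        "tg_tok_s": gen_len / decode_s,
    }


# Hypothetical stand-in workloads so the sketch runs without a model.
def fake_prefill(tokens: List[int]) -> None:
    time.sleep(0.01)


def fake_decode() -> int:
    time.sleep(0.001)
    return 0


result = measure_pp_tg(fake_prefill, fake_decode, prompt_len=8, gen_len=8)
print(result)
```

Reporting the two phases separately matters because prefill processes many tokens per pass while generation produces one token at a time, so the two numbers can scale very differently with thread count.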

AMD EPYC 9R14 (AWS c7a.2xlarge)

Scaling behavior on AMD EPYC 9R14. BitNet.cpp V2 has the highest prefill throughput at every thread count, while V2 and Litespark-Inference stay within about 2% of each other on token generation. The Original column trails both in generation.

| Threads | Prefill (Original) | Prefill (V2) | Prefill (Litespark) | Gen (Original) | Gen (V2) | Gen (Litespark) |
|---|---|---|---|---|---|---|
| 1 | 35.0 | 43.4 | 38.2 | 10.0 | 15.6 | 15.9 |
| 2 | 70.0 | 81.2 | 74.7 | 18.0 | 28.7 | 28.1 |
| 4 | 140.0 | 156.8 | 140.7 | 30.0 | 49.2 | 48.2 |
| 8 | 210.0 | 291.8 | 230.7 | 42.0 | 66.2 | 67.5 |

Intel Xeon Platinum 8488C (AWS c7i.2xlarge)

Scaling behavior on Intel Xeon Platinum 8488C. Litespark-Inference maintains a consistent lead in prefill throughput across all thread configurations.

| Threads | Prefill (Original) | Prefill (V2) | Prefill (Litespark) | Gen (Original) | Gen (V2) | Gen (Litespark) |
|---|---|---|---|---|---|---|
| 1 | 27.0 | 43.4 | 59.7 | 10.0 | 13.3 | 13.6 |
| 2 | 40.0 | 65.8 | 85.9 | 13.0 | 19.1 | 19.5 |
| 4 | 55.0 | 77.9 | 110.2 | 16.0 | 24.3 | 25.0 |
| 6 | 79.0 | 101.3 | 120.7 | 20.0 | 29.5 | 28.0 |
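
To quantify the prefill lead on this machine, the Litespark and V2 prefill columns can be divided row by row. The snippet is illustrative arithmetic over the table values; the variable names are assumptions for this example.

```python
# Prefill throughput from the Intel Xeon table, keyed by thread count.
prefill_v2 = {1: 43.4, 2: 65.8, 4: 77.9, 6: 101.3}
prefill_litespark = {1: 59.7, 2: 85.9, 4: 110.2, 6: 120.7}

# Relative lead of Litespark-Inference over BitNet.cpp V2 at each thread count.
lead = {t: prefill_litespark[t] / prefill_v2[t] for t in prefill_v2}

for threads, ratio in lead.items():
    print(f"{threads} threads: {ratio:.2f}x")
```

On these numbers the lead ranges from roughly 1.2× at 6 threads to roughly 1.4× at 4 threads.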

Apple M4 (MacBook Pro)

Litespark-Inference scaling on Apple M4. Prefill throughput grows about 3.1× from 1 to 4 threads and flattens beyond that. Token generation dips at 8 threads and peaks when all 10 CPU cores are used.

| Threads | Prefill pp128 (tok/s) | Generation tg128 (tok/s) |
|---|---|---|
| 1 | 26.1 | 6.5 |
| 2 | 43.1 | 11.0 |
| 4 | 81.9 | 15.4 |
| 8 | 101.2 | 14.0 |
| 10 | 108.8 | 19.6 |
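
Speedup and parallel efficiency relative to the single-thread run can be derived directly from the table. The helper `scaling` below is a hypothetical name used only for this illustration.

```python
# Litespark-Inference results from the Apple M4 table (tok/s).
prefill = {1: 26.1, 2: 43.1, 4: 81.9, 8: 101.2, 10: 108.8}
generation = {1: 6.5, 2: 11.0, 4: 15.4, 8: 14.0, 10: 19.6}


def scaling(results):
    """Return {threads: (speedup vs. 1 thread, parallel efficiency)}."""
    base = results[1]
    return {t: (v / base, v / base / t) for t, v in results.items()}


for name, table in (("prefill", prefill), ("generation", generation)):
    for threads, (speedup, efficiency) in scaling(table).items():
        print(f"{name:10s} {threads:2d} threads: {speedup:.2f}x, {efficiency:.0%} efficiency")
```

Prefill efficiency is roughly 78% at 4 threads and drops to roughly 42% at 10 threads, which is one way to see where adding threads stops paying off.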

Supported Platforms

  • Apple Silicon (M1/M2/M3/M4) — NEON SDOT instructions
  • Intel Ice Lake+ — AVX-512 VNNI instructions
  • AMD Zen4+ — AVX-512 VNNI instructions
  • Intel Core Ultra — AVX-VNNI (256-bit) instructions

Installation

git clone https://github.com/Mindbeam-AI/Litespark-Inference.git
cd Litespark-Inference
pip install -e .

Requirements:

  • Python 3.9+
  • PyTorch 2.0+
  • macOS: brew install libomp (for OpenMP support)

Usage

Command Line

# Generate text
litespark-inference generate "The meaning of life is"

# Interactive chat
litespark-inference chat

# Run benchmark on your hardware
litespark-inference benchmark

# Show system info and detected SIMD capabilities
litespark-inference info

Python API

from litespark_inference import load_model

# Load the BitNet 2B model (auto-downloads from HuggingFace)
model, tokenizer = load_model("bitnet-2b")

# Generate text
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Kernel Modes (Apple Silicon)

Two inference modes are available on Apple Silicon:

# NEON mode (default) — fast int8 quantized inference, ~556 MB
litespark-inference generate "Hello" --mode neon

# Accelerate mode — float32 with Apple AMX, bit-exact accuracy, ~2.5 GB
litespark-inference generate "Hello" --mode accelerate

# In Python
model, tokenizer = load_model("bitnet-2b", mode="neon")       # default, fast
model, tokenizer = load_model("bitnet-2b", mode="accelerate") # accurate

How It Works

Ternary models use weights constrained to {-1, 0, +1}. This means matrix multiplication reduces to simple addition and subtraction:

y = Σ_j x_j · w_j  →  y = Σ_{j: w_j = +1} x_j - Σ_{j: w_j = -1} x_j
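
The identity above can be checked in a few lines of plain Python. This is a scalar illustration of the arithmetic only, not the library's SIMD kernel; `dot_multiply` and `dot_ternary` are hypothetical names for this example.

```python
import random


def dot_multiply(x, w):
    """Reference dot product using ordinary multiplication."""
    return sum(xj * wj for xj, wj in zip(x, w))


def dot_ternary(x, w):
    """Same result using only additions and subtractions."""
    acc = 0
    for xj, wj in zip(x, w):
        if wj == 1:
            acc += xj   # weight +1: add the activation
        elif wj == -1:
            acc -= xj   # weight -1: subtract the activation
        # weight 0 contributes nothing and is skipped
    return acc


random.seed(0)
x = [random.randint(-128, 127) for _ in range(16)]  # integer activations
w = [random.choice((-1, 0, 1)) for _ in range(16)]  # ternary weights

assert dot_ternary(x, w) == dot_multiply(x, w)
print(dot_ternary(x, w))
```

Integer activations are used here so the two results can be compared exactly.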

Litespark-Inference exploits this structure with custom SIMD kernels that:

  1. Store weights as int8 — enabling direct use of hardware dot product instructions
  2. Quantize activations per-row — converting float32 inputs to int8 with scale factors
  3. Use hardware SIMD instructions — NEON SDOT (ARM) or AVX-512 VPDPBUSD (x86)
  4. Apply zero-point correction — maintaining numerical accuracy
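
The four steps can be sketched in scalar Python as follows. This is a minimal illustration under stated assumptions, not the library's kernel code: it assumes symmetric per-row quantization to the range [-127, 127], and it assumes the zero-point correction works by shifting activations to unsigned bytes (x86 VPDPBUSD multiplies unsigned bytes by signed bytes) and subtracting the shift afterwards. The function names are hypothetical.

```python
def quantize_row(x):
    """Symmetric per-row quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in x) or 1.0  # avoid division by zero for all-zero rows
    scale = max_abs / 127.0
    q = [int(round(v / scale)) for v in x]
    return q, scale


def int8_dot_with_zero_point(q, w, zero_point=128):
    """Integer dot product on unsigned activations, followed by a correction.

    Assumption for illustration: activations are shifted by `zero_point` so
    they fit an unsigned byte, and the shift is removed afterwards using the
    sum of the weights.
    """
    u = [qi + zero_point for qi in q]            # values in 1..255, fits in uint8
    raw = sum(ui * wi for ui, wi in zip(u, w))   # what the dot-product instruction accumulates
    correction = zero_point * sum(w)             # can be precomputed once per weight row
    return raw - correction


x = [0.5, -1.25, 2.0, 0.0, -0.75, 1.5, -2.0, 0.25]  # float32-style activations
w = [1, -1, 0, 1, -1, 1, 0, -1]                     # ternary weights stored as int8

q, scale = quantize_row(x)
y_int = int8_dot_with_zero_point(q, w)
y_approx = y_int * scale                            # dequantize with the row scale
y_exact = sum(xi * wi for xi, wi in zip(x, w))

print(y_exact, y_approx)
```

The correction term depends only on the weights, so in a scheme like this it adds no per-token cost beyond one subtraction per output. ARM SDOT operates on signed bytes for both operands, so a NEON path would not need the shift.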

The library automatically detects your CPU's SIMD capabilities and dispatches to the optimal kernel.
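
A dispatch step of this kind could look like the sketch below. It is a simplified illustration, not the library's detection logic: the kernel names and both functions are hypothetical, the flag names are the ones Linux reports in /proc/cpuinfo, and every arm64 machine is assumed to support SDOT, which a real implementation would verify.

```python
import platform
from typing import Iterable, Set


def select_kernel(machine: str, cpu_flags: Iterable[str]) -> str:
    """Pick a (hypothetical) kernel name from architecture and CPU feature flags."""
    flags = set(cpu_flags)
    if machine in ("arm64", "aarch64"):
        return "neon_sdot"          # simplification: assumes SDOT is available
    if "avx512_vnni" in flags:
        return "avx512_vnni"        # prefer the 512-bit path when present
    if "avx_vnni" in flags:
        return "avx_vnni"           # 256-bit VNNI
    return "pytorch_fallback"


def read_linux_cpu_flags(path: str = "/proc/cpuinfo") -> Set[str]:
    """Best-effort read of the CPU flag list on Linux; empty set elsewhere."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


print(select_kernel(platform.machine(), read_linux_cpu_flags()))
```

To see what the library itself detects, run `litespark-inference info`.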

Benchmarking

Run the built-in benchmark to measure performance on your hardware:

litespark-inference benchmark

Or use the benchmark scripts for detailed profiling:

python benchmark_kernel.py      # Kernel-level benchmarks
python benchmark_synthetic.py   # Synthetic workload benchmarks

Citation

If you use Litespark-Inference in your research, please cite:

@article{litespark2024,
  title={Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks},
  author={Dade, Nii Osae Osae and Morri, Maurizio and Rahat, Moinul Hossain},
  year={2024}
}

License

Apache License 2.0. See LICENSE for details.
