Efficient CPU inference for BitNet 1.58-bit models

Litespark-Inference

Fast CPU inference for ternary neural networks

Litespark-Inference is a pip-installable Python library for efficient inference of ternary models on consumer CPUs. Its custom SIMD kernels exploit the ternary weight structure ({-1, 0, +1}) so that the matrix-multiply inner loops need no floating-point multiplication, giving up to 52× higher throughput than standard PyTorch inference on an Apple M4 (see Key Results).

Key Results

Apple Silicon (M1–M4)

Performance comparison on Apple Silicon M4. Relative to PyTorch, the NEON kernels of Litespark-Inference use ~14× less memory, reach the first token 9.1× sooner (TTFT), and deliver 52× higher throughput.

Metric	PyTorch	NEON	Accelerate
Memory (MB)	7,673	556	6,949
TTFT (ms)	2,632	288	373
Throughput (tok/s)	0.39	20.4	5.52
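
The ratios quoted for this table can be recomputed from its columns. The sketch below does that arithmetic; the dictionary layout and the `improvement` helper are illustrative and not part of the library.

```python
# Values copied from the Apple Silicon M4 table above.
results = {
    "memory_mb": {"pytorch": 7673, "neon": 556, "accelerate": 6949},
    "ttft_ms": {"pytorch": 2632, "neon": 288, "accelerate": 373},
    "throughput_tok_s": {"pytorch": 0.39, "neon": 20.4, "accelerate": 5.52},
}


def improvement(metric, backend):
    """Return how many times better `backend` is than PyTorch for `metric`."""
    base = results[metric]["pytorch"]
    value = results[metric][backend]
    # Throughput is higher-is-better; memory and TTFT are lower-is-better.
    if metric == "throughput_tok_s":
        return value / base
    return base / value


for metric in results:
    for backend in ("neon", "accelerate"):
        print(f"{metric:18s} {backend:10s} {improvement(metric, backend):5.1f}x")
```

Treating memory and TTFT as lower-is-better keeps every reported ratio above 1 when the backend beats PyTorch.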

Intel Ice Lake / AMD Zen4 (AVX-512 VNNI)

Performance comparison on Intel Ice Lake / AMD Zen4 using AVX-512 VNNI kernels.

Metric	PyTorch	AVX-512 VNNI	Speedup
Memory (MB)	7,800	556	14.0×
TTFT (ms)	2,450	195	12.6×
Throughput (tok/s)	0.42	11.2	26.7×

Intel Core Ultra (AVX-VNNI)

Performance comparison on Intel Core Ultra using AVX-VNNI kernels.

Metric	PyTorch	AVX-VNNI	Speedup
Memory (MB)	7,750	556	13.9×
TTFT (ms)	2,580	310	8.3×
Throughput (tok/s)	0.40	8.5	21.3×

Cross-Platform Comparison

Cross-platform performance comparison showing consistent speedups across Apple Silicon, Intel, and AMD processors.

Comparison with BitNet.cpp v2

We benchmarked Litespark-Inference against Microsoft's BitNet.cpp v2 using their pp128+tg128 methodology (128-token prompt processing + 128-token generation).
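
Under this methodology, prefill throughput is the prompt length divided by the time spent processing the prompt, and generation throughput is the number of generated tokens divided by the decode time. The sketch below shows only that arithmetic. `prefill_fn` and `decode_fn` are hypothetical stand-ins for the engine under test, not Litespark-Inference or BitNet.cpp APIs.

```python
import time


def bench_pp_tg(prefill_fn, decode_fn, n_prompt=128, n_gen=128):
    """Time a prefill pass and a generation pass; return (pp_tok_s, tg_tok_s).

    prefill_fn(n) and decode_fn(n) are hypothetical callables that process
    n prompt tokens and generate n tokens, respectively.
    """
    t0 = time.perf_counter()
    prefill_fn(n_prompt)
    t1 = time.perf_counter()
    decode_fn(n_gen)
    t2 = time.perf_counter()
    return n_prompt / (t1 - t0), n_gen / (t2 - t1)


# Dummy workloads so the sketch runs without a model.
def fake_prefill(n):
    time.sleep(0.02)


def fake_decode(n):
    time.sleep(0.05)


pp, tg = bench_pp_tg(fake_prefill, fake_decode)
print(f"pp128: {pp:.1f} tok/s, tg128: {tg:.1f} tok/s")
```

A real harness would typically add warm-up runs and average several repetitions before reporting a figure.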

AMD EPYC 9R14 (AWS c7a.2xlarge)

Scaling behavior on AMD EPYC 9R14 (throughput in tok/s). BitNet.cpp v2 shows the strongest prefill scaling, while BitNet.cpp v2 and Litespark-Inference deliver near-identical token generation throughput at every thread count. The Original column trails both.

Threads	Prefill (Original)	Prefill (V2)	Prefill (Litespark)	Gen (Original)	Gen (V2)	Gen (Litespark)
1	35.0	43.4	38.2	10.0	15.6	15.9
2	70.0	81.2	74.7	18.0	28.7	28.1
4	140.0	156.8	140.7	30.0	49.2	48.2
8	210.0	291.8	230.7	42.0	66.2	67.5

Intel Xeon Platinum 8488C (AWS c7i.2xlarge)

Scaling behavior on Intel Xeon Platinum 8488C (throughput in tok/s). Litespark-Inference leads in prefill throughput at every thread count tested, from 1 to 6 threads.

Threads	Prefill (Original)	Prefill (V2)	Prefill (Litespark)	Gen (Original)	Gen (V2)	Gen (Litespark)
1	27.0	43.4	59.7	10.0	13.3	13.6
2	40.0	65.8	85.9	13.0	19.1	19.5
4	55.0	77.9	110.2	16.0	24.3	25.0
6	79.0	101.3	120.7	20.0	29.5	28.0

Apple M4 (MacBook Pro)

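Litespark-Inference scaling on Apple M4. Prefill throughput rises about 3.1× from 1 to 4 threads and keeps improving more slowly up to 10 threads. Token generation dips at 8 threads (14.0 tok/s, below the 15.4 tok/s at 4 threads) and peaks at 19.6 tok/s when all 10 CPU cores are used.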

Threads	Prefill pp128 (tok/s)	Generation tg128 (tok/s)
1	26.1	6.5
2	43.1	11.0
4	81.9	15.4
8	101.2	14.0
10	108.8	19.6
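
To make the scaling behavior concrete, the sketch below derives speedup over one thread and parallel efficiency (speedup divided by thread count) from the table. The `scaling` helper is illustrative only.

```python
# (threads, prefill tok/s, generation tok/s) from the Apple M4 table above.
m4 = [
    (1, 26.1, 6.5),
    (2, 43.1, 11.0),
    (4, 81.9, 15.4),
    (8, 101.2, 14.0),
    (10, 108.8, 19.6),
]


def scaling(rows, column):
    """Yield (threads, speedup vs 1 thread, parallel efficiency) per row."""
    base = rows[0][column]
    for row in rows:
        speedup = row[column] / base
        yield row[0], speedup, speedup / row[0]


for name, col in (("prefill", 1), ("generation", 2)):
    for threads, speedup, eff in scaling(m4, col):
        print(f"{name:10s} {threads:2d} threads: {speedup:4.2f}x speedup, {eff:4.0%} efficiency")
```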

Supported Platforms

Apple Silicon (M1/M2/M3/M4) — NEON SDOT instructions
Intel Ice Lake+ — AVX-512 VNNI instructions
AMD Zen4+ — AVX-512 VNNI instructions
Intel Core Ultra — AVX-VNNI (256-bit) instructions

Installation

pip install litespark-inference

Requirements:

Python 3.9+
PyTorch 2.4+

On macOS, installing OpenMP is recommended:

brew install libomp

OpenMP enables multi-threaded kernel execution. Without it, inference will run single-threaded.

Usage

Command Line

# Generate text
litespark-inference generate "The meaning of life is"

# Interactive chat
litespark-inference chat

# Run benchmark on your hardware
litespark-inference benchmark

# Show system info and detected SIMD capabilities
litespark-inference info

Python API

from litespark_inference import load_model

# Load the BitNet 2B model (auto-downloads from HuggingFace)
model, tokenizer = load_model("bitnet-2b")

# Generate text
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Kernel Modes (Apple Silicon)

Two inference modes are available on Apple Silicon:

# NEON mode (default) — fast int8 quantized inference, ~556 MB
litespark-inference generate "Hello" --mode neon

# Accelerate mode — float32 with Apple AMX, bit-exact accuracy, ~2.5 GB
litespark-inference generate "Hello" --mode accelerate

# In Python
model, tokenizer = load_model("bitnet-2b", mode="neon")       # default, fast
model, tokenizer = load_model("bitnet-2b", mode="accelerate") # accurate

How It Works

Ternary models use weights constrained to {-1, 0, +1}. This means matrix multiplication reduces to simple addition and subtraction:

y = Σ_j x_j · w_j  →  y = Σ_{j : w_j = +1} x_j − Σ_{j : w_j = −1} x_j
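
A minimal pure-Python sketch of this identity, comparing an ordinary dot product with a version that only adds and subtracts. The function names are illustrative, and the real kernels operate on SIMD registers rather than Python lists.

```python
def dot_float(x, w):
    """Reference dot product using multiplication."""
    return sum(xj * wj for xj, wj in zip(x, w))


def dot_ternary(x, w):
    """Multiplication-free dot product for weights in {-1, 0, +1}."""
    acc = 0.0
    for xj, wj in zip(x, w):
        if wj == 1:
            acc += xj  # +1 weight: add the activation
        elif wj == -1:
            acc -= xj  # -1 weight: subtract the activation
        # 0 weight: contributes nothing, so it is skipped
    return acc


x = [0.5, -1.25, 2.0, 3.0]
w = [1, 0, -1, 1]
print(dot_float(x, w), dot_ternary(x, w))
```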

Litespark-Inference exploits this structure with custom SIMD kernels that:

Store weights as int8: enables direct use of hardware dot product instructions
Quantize activations per-row: converts float32 inputs to int8 with scale factors
Use hardware SIMD instructions: NEON SDOT (ARM) or VPDPBUSD (x86, via AVX-512 VNNI or AVX-VNNI)
Apply zero-point correction: maintains numerical accuracy
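
The steps above can be sketched in plain Python. This is one plausible reading, not the library's actual kernel: it assumes symmetric absmax quantization per row, and it assumes the zero-point correction exists because an unsigned-by-signed dot product instruction (such as VPDPBUSD) needs activations shifted into the unsigned range. All function names are hypothetical.

```python
def quantize_row(x):
    """Symmetric per-row int8 quantization: returns (int values, scale)."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale


def ternary_dot_int8(x, w):
    """Approximate sum(x_j * w_j) with an integer-only inner loop."""
    q, scale = quantize_row(x)
    # Shift activations into the unsigned range, as an unsigned-by-signed
    # dot product instruction would require (assumption, see lead-in).
    u = [v + 128 for v in q]
    acc = sum(uj * wj for uj, wj in zip(u, w))  # integer accumulate
    # Zero-point correction: remove the 128 * sum(w) added by the shift.
    # sum(w) depends only on the weights, so it can be precomputed per row.
    acc -= 128 * sum(w)
    return acc * scale  # single float rescale per output


x = [0.5, -1.25, 2.0, 3.0]
w = [1, 0, -1, 1]
exact = sum(xj * wj for xj, wj in zip(x, w))
approx = ternary_dot_int8(x, w)
print(exact, approx)
```

The result differs slightly from the float32 answer because of activation rounding, while the weights themselves are represented exactly.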

The library automatically detects your CPU's SIMD capabilities and dispatches to the optimal kernel.
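
As an illustration of what such dispatch could look like, the sketch below reads CPU feature flags from `/proc/cpuinfo` on Linux and maps them to a kernel name. The kernel names, the priority order, and both helper functions are assumptions made for this example. The library's real detection logic is not documented here and may differ, especially on macOS, where `/proc/cpuinfo` does not exist.

```python
import platform


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags listed in /proc/cpuinfo, or an empty set."""
    flags = set()
    try:
        with open(path) as f:
            for line in f:
                key, _, value = line.partition(":")
                # x86 kernels report "flags", arm64 kernels report "Features".
                if key.strip().lower() in ("flags", "features"):
                    flags.update(value.split())
    except OSError:
        pass  # not Linux, or the file is unreadable
    return flags


def select_kernel(cpu_flags):
    """Map a set of CPU flags to a (hypothetical) kernel name."""
    if "avx512_vnni" in cpu_flags:
        return "avx512_vnni"
    if "avx_vnni" in cpu_flags:
        return "avx_vnni"
    if "asimddp" in cpu_flags:  # Arm dot product extension (SDOT/UDOT)
        return "neon_sdot"
    return "fallback"


print(platform.machine(), select_kernel(read_linux_cpu_flags()))
```

To see what the library itself detected, run `litespark-inference info`.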

Benchmarking

Run the built-in benchmark to measure performance on your hardware:

litespark-inference benchmark

Or use the benchmark scripts for detailed profiling:

python benchmark_kernel.py      # Kernel-level benchmarks
python benchmark_synthetic.py   # Synthetic workload benchmarks

Citation

If you use Litespark-Inference in your research, please cite:

@article{litespark2024,
  title={Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks},
  author={Dade, Nii Osae Osae and Morri, Maurizio and Rahat, Moinul Hossain},
  year={2024}
}

License

Apache License 2.0. See LICENSE for details.

Release history

Version	Date	Status
1.0.3	Jun 15, 2026	
1.0.2	Jun 11, 2026	yanked
1.0.1	Jun 11, 2026	yanked
1.0.0	Jun 11, 2026	yanked
0.1.4 (this version)	Mar 2, 2026	yanked
0.1.3	Feb 27, 2026	yanked
0.1.2	Feb 27, 2026	yanked
0.1.1	Feb 27, 2026	yanked
0.1.0	Feb 27, 2026	yanked

Download files

Source Distribution

litespark_inference-0.1.4.tar.gz (76.5 kB), uploaded Mar 2, 2026

Built Distribution

litespark_inference-0.1.4-py3-none-any.whl (80.7 kB), uploaded Mar 2, 2026

File details

Details for the file litespark_inference-0.1.4.tar.gz.

File metadata

Download URL: litespark_inference-0.1.4.tar.gz
Upload date: Mar 2, 2026
Size: 76.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for litespark_inference-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`16aa65ea8a35f5d4975ae79151b5f70c2e734eb44b14167d93d60be67484a8a8`
MD5	`66d2e4fefd09168e9f3a3c951e534e43`
BLAKE2b-256	`1ac8a971029b1c0c975c85f525df8a0f322b96b27acd226daf49dd2278244307`

File details

Details for the file litespark_inference-0.1.4-py3-none-any.whl.

File metadata

Download URL: litespark_inference-0.1.4-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 80.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for litespark_inference-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eea4552f614283aa1a69f0dbfeab61db1d15ebd09af5e83e68eb27e9dc1e91f6`
MD5	`ae2e311e0a56927a1854902a9dbc8be7`
BLAKE2b-256	`45da05434a0a22fe9b3ebd7593cc4486fd56c14370c59db67612bc388476f967`
