
distribution-coder

Optimal arithmetic coding over symbol probability distributions.

distribution-coder is a high-performance Python library for compressing sequences of symbols based on step-wise probability distributions. It is designed specifically for Neural Data Compression tasks, such as compressing the output of Large Language Models (LLMs), autoregressive transformers, or any next-token prediction system.

It is implemented in Rust using PyO3, offering zero-copy and zero-allocation operations for maximum throughput and low latency.

Features

  • Precision: Uses 32-bit frequency precision (backed by 128-bit integer arithmetic) to capture sharp probabilities without underflow, approaching the theoretical entropy limit.
  • Zero-Copy Dispatcher: Natively handles float32, float64, float16, and bfloat16 arrays. It reads memory directly from Numpy/PyTorch without casting or copying.
  • Framework Agnostic: Seamlessly accepts input from PyTorch, NumPy, JAX, TensorFlow, or standard Python lists.
  • Cache-Friendly: Uses a streaming, two-pass algorithm that never allocates heap memory for probability tables, preventing cache thrashing during long sequence generation.
  • Cross-Platform Determinism: Guarantees bit-exact reconstruction across different hardware architectures (x86, ARM, etc.) by avoiding hardware-specific floating-point intrinsics.

Installation

pip install distribution-coder

Quick Start

Basic Usage

The standard workflow involves an "Encoder" loop and a "Decoder" loop. Both must generate/receive the exact same probability distributions in the same order.

import numpy as np
from distribution_coder import DistributionCoder

# --- 1. Encoding ---
encoder = DistributionCoder()

# Mock probability distributions for a sequence of 3 steps
# (In reality, these come from your Neural Network)
step1_probs = [0.1, 0.7, 0.2] # Symbol 1 is likely
step2_probs = [0.8, 0.1, 0.1] # Symbol 0 is likely
step3_probs = [0.05, 0.05, 0.9] # Symbol 2 is likely

# The actual symbols that occurred
symbols = [1, 0, 2]

# Step-wise encoding
encoder.encode_step(step1_probs, symbols[0])
encoder.encode_step(step2_probs, symbols[1])
encoder.encode_step(step3_probs, symbols[2])

# Get compressed bytes
compressed_data = encoder.finish_encoding()
print(f"Compressed size: {len(compressed_data)} bytes")

# --- 2. Decoding ---
decoder = DistributionCoder()
decoder.start_decoding(compressed_data)

# Step-wise decoding
# We feed the SAME distributions and ask for the symbol back
decoded_sym1 = decoder.decode_step(step1_probs)
decoded_sym2 = decoder.decode_step(step2_probs)
decoded_sym3 = decoder.decode_step(step3_probs)

assert [decoded_sym1, decoded_sym2, decoded_sym3] == symbols
print("Successfully decoded sequence!")
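As a sanity check, the information content of the sequence above is the sum of -log2(p) over the symbols that were actually encoded; an arithmetic coder's output should land within a few bytes of this bound (the figure below is computed from the example probabilities, not measured from the library):

```python
import math

# Same distributions and symbols as the example above.
step_probs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.05, 0.05, 0.9],
]
symbols = [1, 0, 2]

# Shannon information content: -log2 of the probability the model
# assigned to each symbol that actually occurred.
bits = sum(-math.log2(p[s]) for p, s in zip(step_probs, symbols))
print(f"Theoretical cost: {bits:.2f} bits")  # about 0.99 bits
```

Because every step assigned high probability to the symbol that occurred, the whole 3-symbol sequence costs under 1 bit in theory; real output is larger only by the coder's fixed flush overhead.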

Advanced Usage

Working with PyTorch & Mixed Precision

distribution-coder is optimized for modern deep-learning workflows. It detects tensor dtypes and reads the underlying memory directly.

Supported Data Types:

  • float32 (Standard)
  • float16 (Half Precision - Zero Copy)
  • bfloat16 (Brain Floating Point - Zero Copy)
  • float64 (Double Precision)

import torch
from distribution_coder import DistributionCoder

coder = DistributionCoder()

# 1. PyTorch Tensor (CPU)
# Zero-copy access. No need to convert to numpy.
probs_fp32 = torch.softmax(torch.randn(100), dim=0)
coder.encode_step(probs_fp32, 5)

# 2. BFloat16 (TPU/Newer GPU format)
# Handled natively in Rust without casting to float32.
probs_bf16 = probs_fp32.to(torch.bfloat16)
coder.encode_step(probs_bf16, 10)

# 3. GPU Tensors
# Automatically moves to CPU for processing (copy required)
if torch.cuda.is_available():
    probs_gpu = torch.randn(100).cuda()
    coder.encode_step(probs_gpu, 2)

Minimizing Latency

For the lowest possible latency (e.g., real-time voice applications), ensure your probability arrays are:

  1. Contiguous: np.ascontiguousarray(probs) or tensor.contiguous().
  2. Native Types: Use float32, float16, or bfloat16. Python lists will trigger a fast C-level conversion, but native arrays are faster.
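As a rough sketch of that pre-flight step (assuming NumPy inputs; prepare_probs is a hypothetical helper, not part of the library):

```python
import numpy as np

# Hypothetical helper (not part of the library): normalize the input to a
# contiguous float32 array before entering the encode loop.
def prepare_probs(probs):
    arr = np.asarray(probs, dtype=np.float32)  # no-op if already float32
    if not arr.flags['C_CONTIGUOUS']:
        arr = np.ascontiguousarray(arr)        # copies only when needed
    return arr

# Slicing with a step produces a non-contiguous view; prepare_probs repairs it.
raw = np.linspace(0.0, 1.0, 200, dtype=np.float64)[::2]
fast = prepare_probs(raw)
print(fast.flags['C_CONTIGUOUS'], fast.dtype)  # True float32
```

Doing this once per step outside any hot inner loops keeps the encoder on its zero-copy path.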

Performance Architecture

Traditional Arithmetic Coders often allocate a Cumulative Distribution Function (CDF) array for every token. For a vocabulary of 50,000 tokens, this means allocating, writing, and freeing ~200KB of memory per step.

distribution-coder solves this bottleneck:

  1. Streaming Calculation: It uses a two-pass algorithm (Analysis Pass + Search Pass) that iterates over the probability array in L1 cache without allocating heap memory.
  2. Integer Math: Probabilities are quantized to 32-bit integers summing to 2^32. Intermediate calculations use u128 to prevent overflow, allowing "sharp" (high-confidence) probabilities to be coded in minimal bits.
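The two-pass idea can be sketched in plain Python (an illustration of the scheme described above, not the library's actual Rust implementation): pass 1 totals the raw probabilities, pass 2 walks the array again to locate the target symbol's integer interval, and no CDF array is ever allocated.

```python
TOTAL = 1 << 32  # quantized frequencies span a 32-bit range

def symbol_interval(probs, symbol):
    total = sum(probs)                # pass 1 (analysis): raw sum
    cum = 0.0
    for i, p in enumerate(probs):     # pass 2 (search): streaming CDF
        if i == symbol:
            low = round(cum / total * TOTAL)
            high = round((cum + p) / total * TOTAL)
            return low, high          # coder narrows its range to [low, high)
        cum += p
    raise IndexError("symbol out of range")

low, high = symbol_interval([0.1, 0.7, 0.2], 1)
print(low, high)  # interval spanning ~70% of the 32-bit range
```

Both passes touch each probability once in order, so for realistic vocabularies the data stays in cache and the per-step cost is two linear scans with no allocation.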

API Reference

DistributionCoder

__init__()

Creates a new coder instance with a fresh state.

encode_step(distribution, symbol: int)

Encodes a single symbol based on the provided probability distribution.

  • distribution: Union[list, np.ndarray, torch.Tensor, jax.Array]. The probability distribution. The values do not strictly need to sum to 1.0 (they are normalized internally), but the sum should be close.
  • symbol: int. The index of the symbol to encode (0 <= symbol < len(distribution)).

finish_encoding() -> bytes

Finalizes the arithmetic coding process, flushes the internal bit buffer, and returns the compressed byte sequence.

start_decoding(input_bytes: bytes)

Resets the state and loads a compressed byte sequence for decoding.

  • input_bytes: The bytes object returned by finish_encoding().

decode_step(distribution) -> int

Decodes the next symbol from the stream based on the provided probability distribution.

  • distribution: Must be identical to the distribution used at this step during encoding.
  • Returns: The decoded symbol index.


