Fast arithmetic coding over symbol probability distributions in Rust.

Project description

distribution-coder

Optimal arithmetic coding over symbol probability distributions.

distribution-coder is a high-performance Python library for compressing sequences of symbols based on step-wise probability distributions. It is designed specifically for Neural Data Compression tasks, such as compressing the output of Large Language Models (LLMs), autoregressive transformers, or any next-token prediction system.

It is implemented in Rust using PyO3, offering zero-copy and zero-allocation operations for maximum throughput and low latency.

Features

  • Precision: Uses 32-bit frequency precision (backed by 128-bit integer arithmetic) to capture probabilities without underflow, approaching the theoretical entropy limit.
  • Zero-Copy Dispatcher: Natively handles float32, float64, float16, and bfloat16 arrays. It reads memory directly from Numpy/PyTorch without casting or copying.
  • Framework Agnostic: Seamlessly accepts input from PyTorch, NumPy, JAX, TensorFlow, or standard Python lists.
  • Cache-Friendly: Uses a streaming, two-pass algorithm that never allocates heap memory for probability tables, preventing cache thrashing during long sequence generation.
  • Cross-Platform Determinism: Guarantees bit-exact reconstruction across different hardware architectures (x86, ARM, etc.) by avoiding hardware-specific floating-point intrinsics.

Installation

pip install distribution-coder

Quick Start

Basic Usage

The standard workflow involves an "Encoder" loop and a "Decoder" loop. Both must generate/receive the exact same probability distributions in the same order.

import numpy as np
from distribution_coder import DistributionCoder

# --- 1. Encoding ---
encoder = DistributionCoder()

# Mock probability distributions for a sequence of 3 steps
# (In reality, these come from your Neural Network)
step1_probs = [0.1, 0.7, 0.2] # Symbol 1 is likely
step2_probs = [0.8, 0.1, 0.1] # Symbol 0 is likely
step3_probs = [0.05, 0.05, 0.9] # Symbol 2 is likely

# The actual symbols that occurred
symbols = [1, 0, 2]

# Step-wise encoding
encoder.encode_step(step1_probs, symbols[0])
encoder.encode_step(step2_probs, symbols[1])
encoder.encode_step(step3_probs, symbols[2])

# Get compressed bytes
compressed_data = encoder.finish_encoding()
print(f"Compressed size: {len(compressed_data)} bytes")

# --- 2. Decoding ---
decoder = DistributionCoder()
decoder.start_decoding(compressed_data)

# Step-wise decoding
# We feed the SAME distributions and ask for the symbol back
decoded_sym1 = decoder.decode_step(step1_probs)
decoded_sym2 = decoder.decode_step(step2_probs)
decoded_sym3 = decoder.decode_step(step3_probs)

assert [decoded_sym1, decoded_sym2, decoded_sym3] == symbols
print("Successfully decoded sequence!")
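You can sanity-check the compressed size against the Shannon lower bound: each step costs about -log2(p) bits, where p is the probability the model assigned to the symbol that actually occurred. A minimal sketch for the three-step example above (pure Python, no coder required):

```python
import math

# Same mock distributions and symbols as the Quick Start example.
step_probs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.05, 0.05, 0.9],
]
symbols = [1, 0, 2]

# Shannon information content of the observed sequence.
bits = sum(-math.log2(p[s]) for p, s in zip(step_probs, symbols))
print(f"Theoretical minimum: {bits:.2f} bits (~{math.ceil(bits / 8)} byte(s))")
```

Because every chosen symbol was the most likely one, the whole sequence fits in under one bit per symbol, which is why the compressed output is so small.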

Advanced Usage

Working with PyTorch & Mixed Precision

distribution-coder is optimized for modern Deep Learning workflows. It detects Tensor types and reads their underlying memory directly.

Supported Data Types:

  • float32 (Standard)
  • float16 (Half Precision - Zero Copy)
  • bfloat16 (Brain Floating Point - Zero Copy)
  • float64 (Double Precision)
import torch
from distribution_coder import DistributionCoder

coder = DistributionCoder()

# 1. PyTorch Tensor (CPU)
# Zero-copy access. No need to convert to numpy.
probs_fp32 = torch.softmax(torch.randn(100), dim=0)
coder.encode_step(probs_fp32, 5)

# 2. BFloat16 (TPU/Newer GPU format)
# Handled natively in Rust without casting to float32.
probs_bf16 = probs_fp32.to(torch.bfloat16)
coder.encode_step(probs_bf16, 10)

# 3. GPU Tensors
# Automatically moves to CPU for processing (copy required)
if torch.cuda.is_available():
    probs_gpu = torch.randn(100).cuda()
    coder.encode_step(probs_gpu, 2)

Minimizing Latency

For the lowest possible latency (e.g., real-time voice applications), ensure your probability arrays are:

  1. Contiguous: np.ascontiguousarray(probs) or tensor.contiguous().
  2. Native Types: Use float32, float16, or bfloat16. Python lists will trigger a fast C-level conversion, but native arrays are faster.
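The two points above can be applied in one call with NumPy. A short sketch (the array and variable names are illustrative): slicing with a step yields a non-contiguous view, and `np.ascontiguousarray` repacks it while the `dtype` argument picks a natively supported type.

```python
import numpy as np

# Build a softmax distribution in float64.
logits = np.random.randn(200)
probs = np.exp(logits) / np.exp(logits).sum()

# A strided slice is a non-contiguous view of the buffer.
strided = probs[::2]
assert not strided.flags["C_CONTIGUOUS"]

# One call fixes both contiguity and dtype before encoding.
ready = np.ascontiguousarray(strided, dtype=np.float32)
assert ready.flags["C_CONTIGUOUS"]
assert ready.dtype == np.float32
```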

Performance Architecture

Traditional arithmetic coders often allocate a Cumulative Distribution Function (CDF) array for every token. For a vocabulary of 50,000 tokens, this means allocating, writing, and freeing ~200 KB of memory per step.
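The per-step figure follows directly from one 4-byte CDF entry per vocabulary token:

```python
vocab_size = 50_000
bytes_per_entry = 4            # one 32-bit CDF entry per token
cdf_bytes = vocab_size * bytes_per_entry
print(cdf_bytes)               # 200_000 bytes, i.e. ~200 KB per step
```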

distribution-coder solves this bottleneck:

  1. Streaming Calculation: It uses a two-pass algorithm (Analysis Pass + Search Pass) that iterates over the probability array in L1 cache without allocating heap memory.
  2. Integer Math: Probabilities are quantized to 32-bit integers summing to a fixed 32-bit total. Intermediate calculations use u128 to prevent overflow, allowing "sharp" (high-confidence) probabilities to be coded in minimal bits.
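The quantization step can be sketched as follows. This is a simplified illustration in NumPy, not the library's actual Rust implementation; the `quantize` function and `TOTAL` constant are hypothetical. The key invariants are that the integer frequencies sum exactly to the fixed total and that every symbol keeps a nonzero frequency so it stays decodable.

```python
import numpy as np

TOTAL = 1 << 32  # fixed 32-bit frequency total

def quantize(probs):
    """Map a float distribution to integer frequencies summing to TOTAL."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()  # normalize defensively
    # Reserve one count per symbol so no frequency is ever zero.
    freqs = np.floor(probs * (TOTAL - len(probs))).astype(np.int64) + 1
    # Hand any rounding remainder to the most likely symbol.
    freqs[np.argmax(freqs)] += TOTAL - freqs.sum()
    return freqs

freqs = quantize([0.1, 0.7, 0.2])
assert int(freqs.sum()) == TOTAL   # exact total preserved
assert int(freqs.min()) >= 1       # every symbol remains representable
```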

API Reference

DistributionCoder

__init__()

Creates a new coder instance with a fresh state.

encode_step(distribution, symbol: int)

Encodes a single symbol based on the provided probability distribution.

  • distribution: Union[list, np.ndarray, torch.Tensor, jax.Array]. The probability distribution. Sum does not strictly need to be 1.0 (it will be normalized), but it should be close.
  • symbol: int. The index of the symbol to encode (0 <= symbol < len(distribution)).

finish_encoding() -> bytes

Finalizes the arithmetic coding process, flushes the internal bit buffer, and returns the compressed byte sequence.

start_decoding(input_bytes: bytes)

Resets the state and loads a compressed byte sequence for decoding.

  • input_bytes: The bytes object returned by finish_encoding().

decode_step(distribution) -> int

Decodes the next symbol from the stream based on the provided probability distribution.

  • distribution: Must be identical to the distribution used at this step during encoding.
  • Returns: The decoded symbol index.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

distribution_coder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (221.4 kB)

Uploaded: PyPy, manylinux: glibc 2.17+ x86-64

distribution_coder-0.1.1-cp38-abi3-win_amd64.whl (126.5 kB)

Uploaded: CPython 3.8+, Windows x86-64

distribution_coder-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (225.7 kB)

Uploaded: CPython 3.8+, manylinux: glibc 2.17+ x86-64

distribution_coder-0.1.1-cp38-abi3-macosx_11_0_arm64.whl (201.5 kB)

Uploaded: CPython 3.8+, macOS 11.0+ ARM64

File details

Details for the file distribution_coder-0.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

SHA256: 7158aef77c709242fd2c2b2aa4853392781433f6f590bc3e98876f40b6194c7a
MD5: 9b25e85d847af89ee5d21409899da95a
BLAKE2b-256: dc4317d50cc2d717e3e0d94995fb2799fecbf87414dc31fa462d95c9289a835e

See more details on using hashes here.

File details

Details for the file distribution_coder-0.1.1-cp38-abi3-win_amd64.whl.

File hashes

SHA256: 2539063a121d690f3da04ce9ae9f30d34a0f7d2965a0c84772dd01507067c494
MD5: 47db95d329fc80dbf0de6e01bf1677b4
BLAKE2b-256: 69aa8dc77c1aef6d91ae2b1cbc15236db6286e2d65515dc5c04c5a0264e5849e

File details

Details for the file distribution_coder-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

SHA256: 3cb1a80a070ddfbd92f8bbb50321ef0a13aae636afe5e0329772eb82e7e0afba
MD5: 1f8469909927d2278e2235cdb02f5db2
BLAKE2b-256: b1390b1e37e0f33a03431d0b9e13d6231fd482786ecf611a5f80a3d0137cee97

File details

Details for the file distribution_coder-0.1.1-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

SHA256: aeea1fcf287e67f8dd7e0a0c4e9afffef794bbe559d7d514eaa3feb8e938557b
MD5: 051d440bad06247b4875cf0d93a63d4e
BLAKE2b-256: 9ec440b3690b2b14f5ccd604d7f112d02effc494a30e35340276f703ece8218a
