
William: A tool for data compression and machine learning automatization

Reason this release was yanked:

broken

Project description


WILLIAM - A general purpose data compression algorithm

Overview

WILLIAM is an inductive programming system based on the theory of Incremental Compression (IC) [Franz et al. 2021]. Its core principle is that learning = compression:
given a dataset x, the algorithm searches for short descriptions in the form of compositional features f_1, f_2, …, f_s such that

x = f_1(f_2(... f_s(r_s)))

where r_s is the residual left after s steps, with each step achieving some compression. This corresponds to an incremental approximation of the Kolmogorov complexity K(x):

K(x) ≈ Σ_{i=1..s} l(f_i*) + K(r_s) + O(s · log l(x))

where each f_i* is the shortest compressing feature at step i and l(·) denotes description length.
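The decomposition above can be illustrated with a toy example. This is only a minimal sketch and not WILLIAM's API: the two features ("add an offset", "multiply by a scale") are chosen by hand rather than found by search, and `int_code_length` is a hypothetical stand-in for description length based on an Elias-gamma code. The cost of naming each operator is ignored.

```python
def int_code_length(n: int) -> int:
    """Bits used by an Elias-gamma code for a signed integer (zigzag-mapped)."""
    z = 2 * n if n >= 0 else -2 * n - 1  # map signed -> non-negative
    return 2 * (z + 1).bit_length() - 1


def seq_code_length(values) -> int:
    """Total bits needed to code every integer in the sequence separately."""
    return sum(int_code_length(v) for v in values)


# Toy dataset: an arithmetic sequence with a large offset.
x = [1000 + 3 * i for i in range(32)]

# Step 1: feature f_1 = "add offset", leaving residual r_1.
offset = min(x)
r1 = [v - offset for v in x]

# Step 2: feature f_2 = "multiply by scale", leaving residual r_2.
scale = 3
r2 = [v // scale for v in r1]

# The composition x = f_1(f_2(r_2)) reproduces the data exactly.
reconstructed = [offset + scale * v for v in r2]

# Description length before and after each step (feature parameters + residual).
l0 = seq_code_length(x)
l1 = int_code_length(offset) + seq_code_length(r1)
l2 = int_code_length(offset) + int_code_length(scale) + seq_code_length(r2)

print("raw:", l0, "after step 1:", l1, "after step 2:", l2)
```

Each step shortens the total description, which is the sense in which every feature in the chain "achieves some compression". The final residual here is still highly regular, so a further step could compress it again.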

WILLIAM differs from classical ML approaches in that it does not optimize parameters in a fixed representation, but searches a broad algorithmic space for compressing autoencoders.
This yields machine learning algorithms (centralization, regression, classification, decision trees, outlier detection) as emergent special cases of general compression.

For theoretical background, see:

  • A Theory of Incremental Compression (Franz, Antonenko, Soletskyi, 2021)
  • WILLIAM: A Monolithic Approach to AGI (Franz, Gogulya, Löffler, 2019)
  • Experiments on the Generalization of Machine Learning Algorithms (Franz, 2020)

Key Concepts

  • Incremental Compression
    Decomposes data into features and residuals step by step, ensuring that each feature is independent and incompressible.

  • Features as Properties
    Features formalize algorithmic properties of data and can be related to Martin-Löf randomness tests:
    non-random regularities correspond to compressible features.

  • Universality
    Unlike specialized ML algorithms, WILLIAM discovers short descriptions exhaustively via directed acyclic graphs (DAGs) of operators, reusing values and cutting at information bottlenecks.

  • Emergent ML Algorithms
    Without any tuning, WILLIAM naturally rediscovers:

    • data centralization
    • outlier detection
    • linear regression
    • linear classification
    • decision tree induction
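To make the link between compression and one of these algorithms concrete, the following sketch treats linear regression as a compression step. It is an illustration under stated assumptions, not WILLIAM's implementation: the line is fitted with ordinary least squares instead of being discovered by search, the data and noise values are made up for the example, and `int_code_length` is a hypothetical description-length proxy based on an Elias-gamma code.

```python
def int_code_length(n: int) -> int:
    """Bits used by an Elias-gamma code for a signed integer (zigzag-mapped)."""
    z = 2 * n if n >= 0 else -2 * n - 1
    return 2 * (z + 1).bit_length() - 1


# Made-up data: a noisy line with integer values.
xs = list(range(20))
noise = [1, -2, 0, 3, -1, 2, -3, 0, 1, -1, 2, 0, -2, 1, 3, -1, 0, -3, 2, 1]
ys = [7 * x + 50 + e for x, e in zip(xs, noise)]

# Ordinary least squares, written out so the sketch needs no dependencies.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Round the parameters so they can be coded as integers.
a, b = round(slope), round(intercept)

# The residual is what the linear feature does not explain.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

# Description length: raw targets vs. (parameters + residuals).
raw_bits = sum(int_code_length(y) for y in ys)
model_bits = (
    int_code_length(a) + int_code_length(b)
    + sum(int_code_length(r) for r in residuals)
)

print("raw:", raw_bits, "bits; line + residuals:", model_bits, "bits")
```

Storing the two line parameters plus small residuals takes fewer bits than storing the targets directly, so a search that only rewards shorter descriptions has a reason to prefer the regression line.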

Limitations and Future Work

Overhead accumulation: IC theory implies additive overhead terms.

Alternative descriptions: currently only one compression path is explored at a time.

Reuse of functions: theory of memory/retrieval still open.

Performance: the Python prototype handles graphs of depth 4–5; C++/Rust backend and parallelization are natural next steps.

Despite these challenges, IC theory provides guarantees: incremental compression reaches Kolmogorov complexity up to logarithmic precision.

Installation

For a standard installation, use:

pip install william-occam

For a full installation with all dependencies for development, testing, and graphical output, run the following from a checkout of the repository:

pip install ".[dev]"

Compression examples

You can run various compression tests directly with pytest. Set

export WILLIAM_DEBUG=3

to get visual output after every compression step. Set it to 2 if you only want to see the compression results after every task. Now run:

pytest -v -s william/tests/test_alice.py

Type c and press Enter to step through the compression steps in the debugger and inspect the generated graphs.

During execution, WILLIAM will:

  • Generate synthetic training data for several regression problems.
  • Search for a minimal program (tree/DAG) that explains the data.
  • Display the compression progress (how the description length decreases).
  • Render the resulting Directed Acyclic Graphs (DAGs) as PDF files in your working directory.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). You are free to use, share, and modify the code for non-commercial purposes only, with proper attribution to the original author. For full license details, see the LICENSE.md file.

Releasing

Releases are published automatically when a tag is pushed to GitLab.

# Example for version 1.2.3
export RELEASE=v1.2.3

# Create an annotated tag
git tag -a "$RELEASE" -m "Version $RELEASE"

# Push the specific tag to trigger the CI pipeline
git push origin "$RELEASE"
