William: A tool for data compression and machine learning automatization
WILLIAM - A general-purpose data compression algorithm
Overview
WILLIAM is an inductive programming system based on the
theory of Incremental Compression (IC) [Franz et al. 2021].
Its core principle is that learning = compression:
given a dataset x, the algorithm searches for short descriptions in the form
of compositional features f_1, f_2, …, f_s such that
x = f_1(f_2(… f_s(r_s)))
where r_s is the residual left after s steps and each step achieves some compression. This corresponds to an incremental approximation of the Kolmogorov complexity K(x):
K(x) ≈ Σ_{i=1}^{s} l(f*_i) + K(r_s) + O(s · log l(x))
where each f*_i is the shortest compressing feature at step i and l(·) denotes the length of a description.
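As a toy illustration of this decomposition (not WILLIAM's actual search), the sketch below repeatedly applies one hand-picked feature, the cumulative sum, to a sequence of squares and keeps a step only if the total description gets shorter. The function `description_length`, the constant `FEATURE_COST`, and the choice of feature are assumptions made for this example; a crude bit count stands in for the uncomputable K.

```python
from itertools import accumulate


def description_length(values):
    """Crude stand-in for l(.): magnitude bits plus one sign bit per integer."""
    return sum(abs(v).bit_length() + 1 for v in values)


def diff(values):
    """Encoder: keep the first value, then store the increments."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def cumsum(values):
    """Feature (decoder): the cumulative sum undoes diff."""
    return list(accumulate(values))


FEATURE_COST = 8  # assumed fixed cost in bits for naming one feature

x = [i * i for i in range(64)]
residual = x
steps = 0
history = [description_length(x)]  # total description length after each step

while True:
    candidate = diff(residual)
    total = FEATURE_COST * (steps + 1) + description_length(candidate)
    if total >= history[-1]:
        break  # this step would not compress, so stop
    residual = candidate
    steps += 1
    history.append(total)

# Check that x = f(f(... f(r_s))) with `steps` nested applications of cumsum.
decoded = residual
for _ in range(steps):
    decoded = cumsum(decoded)

print(steps, history, decoded == x)
```

Each entry in `history` plays the role of the feature lengths plus the residual length accumulated so far, so the list shrinks for as long as another step still compresses.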
WILLIAM differs from classical ML approaches in that it does not optimize
parameters in a fixed representation, but searches a broad algorithmic space
for compressing autoencoders.
This yields machine learning algorithms (centralization, regression, classification, decision trees, outlier detection) as emergent special cases of general compression.
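To make the contrast with parameter fitting concrete, here is a minimal sketch that searches over compositions of invertible operators instead of tuning parameters inside one fixed model. The operator set, the `NAME_COST` constant, and the bit-count proxy are hypothetical choices for illustration; WILLIAM's own operator library and search strategy are not reproduced here.

```python
from itertools import accumulate, product


def bits(values):
    """Crude description length: magnitude bits plus one sign bit per integer."""
    return sum(abs(v).bit_length() + 1 for v in values)


def diff(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def cumsum(values):
    return list(accumulate(values))


def reverse(values):
    return values[::-1]


def negate(values):
    return [-v for v in values]


# name -> (encoder, decoder), with decoder(encoder(v)) == v
OPERATORS = {
    "cumsum": (diff, cumsum),
    "reverse": (reverse, reverse),
    "negate": (negate, negate),
}
NAME_COST = 2  # assumed bits needed to name one operator


def search(x, max_depth=3):
    """Try every operator composition up to max_depth; keep the shortest description."""
    best = ((), x, bits(x))
    for depth in range(1, max_depth + 1):
        for names in product(OPERATORS, repeat=depth):
            residual = x
            for name in names:
                residual = OPERATORS[name][0](residual)
            cost = NAME_COST * depth + bits(residual)
            if cost < best[2]:
                best = (names, residual, cost)
    return best


def decode(names, residual):
    """Rebuild x = f_1(f_2(... f_s(r_s))) by applying the decoders innermost first."""
    for name in reversed(names):
        residual = OPERATORS[name][1](residual)
    return residual


x = [5 * i for i in range(40)][::-1]
names, residual, cost = search(x)
print(names, cost, bits(x), decode(names, residual) == x)
```

The encoder/decoder pairs mirror the idea of a compressing autoencoder: the encoder maps data to a residual, and the decoder (the feature) reconstructs the data exactly.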
For theoretical background, see:
- A Theory of Incremental Compression (Franz, Antonenko, Soletskyi, 2021)
- WILLIAM: A Monolithic Approach to AGI (Franz, Gogulya, Löffler, 2019)
- Experiments on the Generalization of Machine Learning Algorithms (Franz, 2020)
Key Concepts
- Incremental Compression
  Decomposes data into features and residuals step by step, ensuring that each feature is independent and incompressible.
- Features as Properties
  Features formalize algorithmic properties of data and can be related to Martin-Löf randomness tests: non-random regularities correspond to compressible features.
- Universality
  Unlike specialized ML algorithms, WILLIAM discovers short descriptions exhaustively via directed acyclic graphs (DAGs) of operators, reusing values and cutting at information bottlenecks.
- Emergent ML Algorithms
  Without any tuning, WILLIAM naturally rediscovers:
  - data centralization
  - outlier detection
  - linear regression
  - linear classification
  - decision tree induction
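The simplest of these emergent cases, data centralization, can be seen as compression with a few lines of arithmetic. The sketch below is an illustration under assumed conventions (a bit-count proxy for description length and a hand-written dataset); it does not show how WILLIAM itself arrives at the same structure.

```python
def bits(values):
    """Crude description length: magnitude bits plus one sign bit per integer."""
    return sum(abs(v).bit_length() + 1 for v in values)


# Hypothetical data clustered around 1000.
x = [997, 1002, 1000, 1001, 999, 1004, 998, 1000]

# Feature: "add a common center"; residual: the small deviations from it.
center = round(sum(x) / len(x))
residuals = [v - center for v in x]

raw_cost = bits(x)
centered_cost = bits([center]) + bits(residuals)
restored = [center + r for r in residuals]

print(raw_cost, centered_cost, restored == x)  # 88 30 True
```

Storing the center once and the deviations separately is shorter than storing every value in full, which is why subtracting the mean counts as a compressing feature.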
Limitations and Future Work
- Overhead accumulation: IC theory implies additive overhead terms.
- Alternative descriptions: currently only one compression path is explored at a time.
- Reuse of functions: the theory of memory/retrieval is still open.
- Performance: the Python prototype handles graphs of depth 4–5; a C++/Rust backend and parallelization are natural next steps.
Despite these challenges, IC theory provides a guarantee: incremental compression reaches the Kolmogorov complexity up to logarithmic precision.
Installation
For a standard installation, use:
pip install william-occam
For a full installation of all dependencies for further development, testing and graphical output use:
pip install william-occam[dev]
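After installing, one way to confirm that the distribution is visible to the current interpreter is to query the standard library's package metadata. This is a generic check, not a WILLIAM feature; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("william-occam"))
```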
Compression examples
You can run various compression tests directly with pytest. Set
export WILLIAM_DEBUG=3
to get visual output after every compression step, or set it to 2 if you only want to see the compression results after every task. Now run:
pytest -v -s william/tests/test_alice.py
Type c and press Enter to step through the compression steps in the debugger and inspect the generated graphs.
During execution, WILLIAM will:
- Generate synthetic training data for several regression problems.
- Search for a minimal program (tree/DAG) that explains the data.
- Display the compression progress (how the description length decreases).
- Render the resulting Directed Acyclic Graphs (DAGs) as PDF files in your working directory.
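The shell commands above can also be assembled from Python, for example to launch the same test from a script. The sketch below only builds the command and environment; the helper name `build_run` is hypothetical, and the only project-specific details it relies on are the `WILLIAM_DEBUG` variable and the test path given above.

```python
import os
import sys


def build_run(debug_level, test_path="william/tests/test_alice.py"):
    """Return (command, environment) for running one test file with WILLIAM_DEBUG set."""
    env = dict(os.environ, WILLIAM_DEBUG=str(debug_level))
    cmd = [sys.executable, "-m", "pytest", "-v", "-s", test_path]
    return cmd, env


cmd, env = build_run(2)
print(cmd[1:], env["WILLIAM_DEBUG"])

# To actually start the run from the repository root:
#   import subprocess
#   subprocess.run(cmd, env=env, check=False)
```

Copying `os.environ` before adding the variable keeps the rest of the environment (such as `PATH`) intact for the child process.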
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). You are free to use, share, and modify the code for non-commercial purposes only, with proper attribution to the original author. For full license details, see the LICENSE.md file.
Releasing
Releases are published automatically when a tag is pushed to GitLab.
# Example for version 1.2.3
export RELEASE=v1.2.3
# Create a tag and push the specific tag to trigger the CI pipeline
git tag $RELEASE && git push origin $RELEASE