
William: A tool for data compression and machine learning automatization

Reason this release was yanked:

broken

Project description


WILLIAM - A general-purpose data compression algorithm

Overview

WILLIAM is an inductive programming system based on the theory of Incremental Compression (IC) [Franz et al. 2021]. Its core principle is that learning = compression:
given a dataset x, the algorithm searches for short descriptions in the form of compositional features f_1, f_2, …, f_s such that

x = f_1(f_2(... f_s(r_s)))

where r_s is the residual left after s steps and each step achieves some compression. This corresponds to an incremental approximation of the Kolmogorov complexity K(x):

K(x) ≈ Σ_{i=1..s} l(f*_i) + K(r_s) + O(s · log l(x))

where each f*_i is the shortest compressing feature at step i and l(·) denotes description length.
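The decomposition above can be illustrated with a toy sketch. It is not WILLIAM's implementation or API: the only feature considered is a cumulative sum, and `code_length` and `FEATURE_COST` are assumed, crude stand-ins for l(·) and l(f_i). The loop accepts a step only if it shortens the total description, mirroring the requirement that each step achieves some compression.

```python
from itertools import accumulate

FEATURE_COST = 8  # assumed fixed cost in bits for naming one feature


def code_length(values):
    """Crude description length in bits: a prefix-code-style cost per signed integer."""
    return sum(2 * abs(v).bit_length() + 2 for v in values)


def cumsum(residual):
    """The feature f: reconstructs the data from its residual."""
    return list(accumulate(residual))


def delta_encode(values):
    """Inverse of the feature: the residual r such that cumsum(r) == values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def incremental_compress(x, max_steps=5):
    """Greedily peel off features while each step still compresses."""
    residual, steps = list(x), 0
    while steps < max_steps:
        candidate = delta_encode(residual)
        if FEATURE_COST + code_length(candidate) >= code_length(residual):
            break  # no further compression from this feature
        residual, steps = candidate, steps + 1
    return steps, residual


def decompress(steps, residual):
    """Evaluate x = f_1(f_2(... f_s(r_s))) with every f_i being cumsum."""
    x = list(residual)
    for _ in range(steps):
        x = cumsum(x)
    return x


x = [3 * n * n + 2 * n + 1 for n in range(50)]  # a quadratic sequence
steps, r_s = incremental_compress(x)
total_bits = steps * FEATURE_COST + code_length(r_s)  # sum of l(f_i) plus l(r_s)
print(f"steps={steps}, raw={code_length(x)} bits, compressed={total_bits} bits")
```

The total is the sum of the feature costs plus the cost of the final residual, which is the structure of the approximation above without the overhead term.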

WILLIAM differs from classical ML approaches in that it does not optimize parameters in a fixed representation, but searches a broad algorithmic space for compressing autoencoders.
This yields machine learning algorithms (centralization, regression, classification, decision trees, outlier detection) as emergent special cases of general compression.

For theoretical background, see:

  • A Theory of Incremental Compression (Franz, Antonenko, Soletskyi, 2021)
  • WILLIAM: A Monolithic Approach to AGI (Franz, Gogulya, Löffler, 2019)
  • Experiments on the Generalization of Machine Learning Algorithms (Franz, 2020)

Key Concepts

  • Incremental Compression
    Decomposes data into features and residuals step by step, ensuring that each feature is independent and incompressible.

  • Features as Properties
    Features formalize algorithmic properties of data and can be related to Martin-Löf randomness tests:
    non-random regularities correspond to compressible features.

  • Universality
    Unlike specialized ML algorithms, WILLIAM discovers short descriptions exhaustively via directed acyclic graphs (DAGs) of operators, reusing values and cutting at information bottlenecks.

  • Emergent ML Algorithms
    Without any tuning, WILLIAM naturally rediscovers:

    • data centralization
    • outlier detection
    • linear regression
    • linear classification
    • decision tree induction
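To make the first item concrete, here is a minimal sketch of why data centralization counts as compression. It is an illustration under assumptions, not output from WILLIAM: `code_length` is an assumed, crude bit-cost model, and the data set is synthetic. Storing one center plus small residuals is cheaper than storing every large value directly.

```python
def code_length(values):
    """Crude description length in bits: a prefix-code-style cost per signed integer."""
    return sum(2 * abs(v).bit_length() + 2 for v in values)


def centralize(values):
    """Split the data into a center (the feature parameter) and small residuals."""
    center = round(sum(values) / len(values))
    return center, [v - center for v in values]


def decentralize(center, residual):
    """The feature itself: add the center back to recover the data."""
    return [center + r for r in residual]


# Synthetic data: values scattered closely around 1000.
data = [1000 + (7 * n) % 11 - 5 for n in range(40)]

center, residual = centralize(data)
raw_bits = code_length(data)
compressed_bits = code_length([center]) + code_length(residual)
print(f"center={center}, raw={raw_bits} bits, compressed={compressed_bits} bits")
```

Under this cost model an outlier would show up as a residual that is expensive to encode, which hints at how outlier detection can fall out of the same view.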

Limitations and Future Work

  • Overhead accumulation: IC theory implies additive overhead terms that accumulate with each compression step.
  • Alternative descriptions: currently only one compression path is explored at a time.
  • Reuse of functions: a theory of memory and retrieval is still an open problem.
  • Performance: the Python prototype handles graphs of depth 4–5; a C++/Rust backend and parallelization are natural next steps.

Despite these challenges, IC theory provides guarantees: incremental compression reaches Kolmogorov complexity up to logarithmic precision.

Installation

For a minimal installation, use:

pip install .

For a full installation of all dependencies for further development, testing, and graphical output, use (the quotes keep shells such as zsh from expanding the brackets):

pip install ".[tests,dev]"

Compression examples

You can run various compression tests directly with pytest. Set

export WILLIAM_DEBUG=3

to get visual output after every compression step. Set it to 2 if you only want to see the compression results after every task. Now run:

pytest -v -s william/tests/test_alice.py

Type c and press Enter to continue past each debugger stop and look at the generated graphs.

During execution, WILLIAM will:

  • Generate synthetic training data for several regression problems.
  • Search for a minimal program (tree/DAG) that explains the data.
  • Display the compression progress (how the description length decreases).
  • Render the resulting Directed Acyclic Graphs (DAGs) as PDF files in your working directory.
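The same run can also be launched from Python, which may be convenient in scripts or notebooks. This is a small sketch using only the standard library; the helper names `build_run` and `run_compression_example` are hypothetical, and the only project-specific pieces are the `WILLIAM_DEBUG` variable and the test path given above.

```python
import os
import subprocess
import sys


def build_run(debug_level=2, test_path="william/tests/test_alice.py"):
    """Build the pytest command and environment for one compression test run."""
    # Copy the current environment and set the debug level described above.
    env = dict(os.environ, WILLIAM_DEBUG=str(debug_level))
    # "python -m pytest" runs pytest with the interpreter that runs this script.
    cmd = [sys.executable, "-m", "pytest", "-v", "-s", test_path]
    return cmd, env


def run_compression_example(debug_level=2):
    """Run the test from the repository root and return pytest's exit code."""
    cmd, env = build_run(debug_level)
    return subprocess.run(cmd, env=env).returncode
```

Calling `run_compression_example(2)` from the repository root shows only the results after every task. With level 3 the debugger stops after every step, so the process needs an interactive terminal.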

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). You are free to use, share, and modify the code for non-commercial purposes only, with proper attribution to the original author. For full license details, see the LICENSE.md file.
