Mixture modelling by Minimum Message Length (MML) using Factor Snob

About this Project

This repository is an updated version of the open-source Factor Snob program for mixture modelling by Minimum Message Length (MML). The original source code was obtained from Lloyd Allison's website at http://www.allisons.org/ll/MML/Notes/SNOB-factor/. The original program was written in C and compiled to a binary executable; the goal of this project is to make it more accessible by providing a Python interface.

I have made many changes to the code, mainly to improve readability and to fix some minor bugs I discovered. The current version produces the same output as the original, except for a couple of display bugs present in the original version (see below). Here is a summary of the changes:

  • Refactored function declarations into a central "snob.h" include file.
  • The previous version included "glob.c" within other files. "glob.c" now contains some generic and exported functions.
  • Replaced old-style C function prototypes with modern C function prototypes.
  • Removed as many global variables as possible and eliminated many side effects. The previous version relied too much on side effects and global variables. The current version makes variable passing more explicit. In some cases where global variables are still used, they have been grouped into global structures instead. This should make it easier to eliminate them entirely in the future. There is still plenty of room for improvement.
  • Refactored many functions to replace goto statements with more modern control structures.
  • Replaced many print statements with a logging facility that can be dialed down using a debug level.
  • Added functions for printing a summary of the existing sample data.
  • Added functions for generating a JSON representation of the classification result equivalent to the data produced by the prclass command.
  • Added functions for generating class assignments for the sample items. These are also useful for classification through Python.
  • Fixed a bug in which the wrong class pointer was used to calculate one of the Factor costs when printing class information. This is why the total costs for the Factor model did not match the sum of the parameter, data and variable costs. It also affected the display of the boost character in the class header information.
  • Added functions to allow loading vset and Sample data through function calls rather than from file. This allows vset and sample information to be supplied through Python ctypes calls.
  • Added functions to allow saving and loading of the population model to and from file. This allows the population model to be saved and loaded through Python ctypes calls.
  • Added a Scikit-learn style interface for fitting and predicting using the Factor Snob model. This allows the model to be used in a similar way to other Scikit-learn classification models. The features include the ability to save fitted models and to predict on new data based on previously saved models.
  • Added a Python module that can provide the data, run the Factor Snob routines and extract the results within Python, using ctypes.

The original documentation file is included in the docs folder. It contains a lot of useful information about the original SNOB programs and may be useful for understanding some of the input and output parameters.

Any published work using this software should cite the original author's work as well as this repository. See the

The MML Book

[1] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer, 2005. ISBN-13: 978-0-387-23795-4.

Using the Python Module

The Python module can be installed using pip as follows:

pip install snob-factor

The module can be used in a similar way to other Scikit-learn classification models. The following example shows how to use the module to fit a Factor Snob model to some sample data and then use the model to predict the class of some new data.

import pandas as pd
import snob

# Load the sample data
train = pd.read_csv('train.csv')  
sfc = snob.SNOBClassifier(
    name='sst',
    attrs={                   # these are the features of the data
        'distance': 'real',   # a real-valued attribute
        'theta': 'radians',   # an angle in radians; angles are treated specially
        'phi': 'radians',     # another angle in radians 
    },
    cycles=3,                 # maximum number of cycles to run
    steps=40,                 # maximum number of estimation/assignment steps to run
    moves=2,                  # maximum number of failed class shuffles before giving up each cycle
    tol=0.01,                 # minimum percentage improvement in cost to indicate convergence
    seed=1234567              # random number seed
)

# Fit the model to the sample data
sfc.fit(train)
sfc.save_model('sst.mod')     # save the model to a file for use later

classes = sfc.get_classes()   # get the class parameters for the fitted model
snob.show_classes(classes)    # display the classification summary

# get class assignments for training data
train_pred = sfc.predict()    # assignments for the training data; note that predict is called without arguments
print(train_pred)             # train_pred is a pandas DataFrame with the class assignments

# Load some new data to predict
test = pd.read_csv('test.csv')

# Predict the class of the new data
test_pred = sfc.predict(test)
print(test_pred)

The SNOBClassifier class has a number of parameters that can be used to control the behaviour of the model. The attrs parameter is a dictionary that defines the features of the data. The keys are the names of the features and the values are the types of the features. The types can be one of the following:

  • 'real' - a real-valued feature
  • 'radians' - an angle in radians
  • 'degrees' - an angle in degrees
  • 'multi-state' - a categorical feature with more than two but preferably fewer than 20 states
  • 'binary' - a boolean or two-valued feature
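
Because the type names are plain strings, a misspelled type is easy to miss. The following is a minimal sketch of a pre-flight check, using only the five type names listed above. The helper check_attrs is hypothetical and is not part of the snob package.

```python
# The type names come from the list above. The helper itself is a
# hypothetical illustration and is not provided by the snob package.
VALID_TYPES = {"real", "radians", "degrees", "multi-state", "binary"}


def check_attrs(attrs):
    """Return the (name, type) pairs whose type string is not recognised."""
    return [(name, kind) for name, kind in attrs.items() if kind not in VALID_TYPES]


attrs = {
    "distance": "real",
    "theta": "radians",
    "phi": "radian",  # deliberate typo to show what the check reports
}
problems = check_attrs(attrs)
print(problems)
```

An empty list means every declared type is one of the documented names.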

Since SNOBClassifier is an unsupervised learning model, the fit() method does not require a target variable. The fit() method takes a pandas DataFrame as input. The DataFrame must contain columns for each of the features defined in the attrs parameter. The fit() method will run the Factor Snob algorithm on the data and produce a classification model for the data. After fitting, the classifier will be fully parameterized and can be used to predict the classes of new data.

The get_classes() method of a fully parameterized classifier can be used to get the class parameters for the fitted model. This returns a list of dictionaries with each dictionary representing one class. The show_classes() function from the snob package can be used to display a summary of the class parameters. The show_classes() function takes the list of class dictionaries as returned by the get_classes() method.

The model can be saved to a file using the save_model() method. This method takes a single argument, which is the name of the file to save the model to. A previously saved model can be used by specifying a from_file parameter during initialization of the classifier. The from_file parameter should be set to the name of the file containing the saved model. The attrs parameter is always required, even when a from_file parameter is provided, and should be the same as the one used to create the saved model.

The predict() method can then be used to predict the class of new data. The first time a restored model is used for prediction, the model will be loaded into memory and used to fully parameterize the classifier before predictions are performed. Details of class parameters will only be available after the classifier is fully parameterized.

The predict() method takes a pandas DataFrame as input. The DataFrame must contain columns for each of the features.
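
Since every feature named in attrs must be present as a column, it can help to check this before calling fit() or predict(). The sketch below assumes only that the column names are available as a list (for a pandas DataFrame, list(df.columns)). The helper missing_columns is hypothetical and not part of the snob package.

```python
def missing_columns(attrs, columns):
    """Return the feature names declared in attrs that are absent from columns."""
    available = set(columns)
    return [name for name in attrs if name not in available]


attrs = {"distance": "real", "theta": "radians", "phi": "radians"}
columns = ["distance", "theta"]  # stand-in for list(df.columns)

missing = missing_columns(attrs, columns)
if missing:
    print("missing feature columns:", missing)
```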

For example, the model above can be loaded from the saved model file and used as follows:

import pandas as pd
import snob

sfc = snob.SNOBClassifier(
    name='sst',
    attrs={                   # these are the features of the data
        'distance': 'real',   # a real-valued attribute
        'theta': 'radians',   # an angle in radians; angles are treated specially
        'phi': 'radians',     # another angle in radians 
    },
    from_file='sst.mod'       # load the model from the file
)
new_data = pd.read_csv('new_data.csv')
new_pred = sfc.predict(new_data)    # No need to run fit again, the model will be loaded from the file

class_info = sfc.get_classes()      # must run predict first to fully parameterize the model
snob.show_classes(class_info)
print(new_pred)

Description:

This program implements an unsupervised classification (or clustering) algorithm based on the Minimum Message Length (MML) principle. The fundamental goal is to find the best model to explain the structure of your data, where "best" means the model that allows for the most compact description of the data.

The total "message length" is the sum of two parts:

  1. Part 1: The Model Cost: The length of the message required to describe the classification model itself (i.e., the number of classes and all their parameters).
  2. Part 2: The Data Cost: The length of the message required to describe the data, given the model.
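
As a toy illustration of the two-part trade-off, the sketch below compares two descriptions of the same binary data. It is not how Factor Snob computes its costs: the 4 bits charged for stating the parameter is an assumed figure chosen for the example, whereas MML derives the parameter cost from the precision to which the parameter is stated.

```python
import math


def data_cost_bits(data, p):
    """Part 2: bits needed to encode binary data given P(x = 1) = p."""
    return -sum(math.log2(p if x else 1.0 - p) for x in data)


def total_length_bits(model_cost_bits, data, p):
    """Total message length = model cost (part 1) + data cost (part 2)."""
    return model_cost_bits + data_cost_bits(data, p)


data = [1] * 18 + [0] * 2

# Model A: a fair coin. Nothing needs to be stated, so part 1 costs 0 bits.
fair = total_length_bits(0.0, data, 0.5)

# Model B: a biased coin with p = 0.9. Part 1 is ASSUMED to cost 4 bits.
biased = total_length_bits(4.0, data, 0.9)

print(f"fair: {fair:.2f} bits, biased: {biased:.2f} bits")
```

Here the biased model pays for its extra parameter but saves more in the data cost, so its total message is shorter and it would be preferred.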

The algorithm works as a two-level iterative process to find the model that minimizes this total message length.

  1. The Outer Loop: Model Discovery ("Surgery")

This is the high-level search for the optimal number of classes and their relationships. The algorithm starts with an initial set of classes and then iteratively tries to improve the model by performing "surgical" operations:

  • Splitting: A single class is split into two.
  • Merging: Two classes are merged into one.
  • Deleting: An entire class is removed.

After each operation, the algorithm re-evaluates the total message length. The change is only kept if it results in a shorter, more efficient explanation of the data.
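
The keep-only-if-shorter rule can be sketched as a simple greedy search. This is a schematic illustration, not the program's actual search: the "model" here is just a class count, the cost function is a made-up stand-in for the real message length, and the real program re-estimates class parameters after each operation.

```python
def greedy_search(model, operations, message_length):
    """Apply candidate operations, keeping each only if it shortens the message."""
    best_len = message_length(model)
    improved = True
    while improved:
        improved = False
        for operation in operations:
            trial = operation(model)
            if trial is None:  # operation not applicable to this model
                continue
            trial_len = message_length(trial)
            if trial_len < best_len:
                model, best_len = trial, trial_len
                improved = True
    return model, best_len


# Toy stand-ins: the model is the number of classes, and the cost is
# an invented function that happens to be shortest at three classes.
def split(k):
    return k + 1


def merge_or_delete(k):
    return k - 1 if k > 1 else None


def toy_length(k):
    return (k - 3) ** 2 + 10


result = greedy_search(1, [split, merge_or_delete], toy_length)
print(result)
```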

  2. The Inner Loop: Parameter Estimation (like E-M)

For any given set of classes, the algorithm must find the optimal parameters for them. This inner loop is very similar to the well-known Expectation-Maximization (E-M) algorithm:

  • Assignment Step (like "E-Step"): For each data point, the algorithm calculates the probability that it belongs to each of the current classes. This is a "soft" assignment.
  • Update Step (like "M-Step"): The parameters of each class (e.g., the mean and standard deviation for a Gaussian attribute, or the probabilities for a categorical attribute) are recalculated based on the weighted collection of data points assigned to them in the previous step.

This inner loop repeats until the class parameters and assignments stabilize.
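
For readers unfamiliar with E-M, here is a minimal sketch of the two steps for a one-dimensional, two-class Gaussian mixture. It uses the standard maximum-likelihood updates, not the MML estimators used by Factor Snob, and the data and starting values are made up for the example.

```python
import math


def gauss_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_step(data, classes):
    """One assignment step followed by one update step.

    classes is a list of (weight, mean, sd) tuples.
    """
    # Assignment step: soft membership of every point in every class.
    memberships = []
    for x in data:
        dens = [w * gauss_pdf(x, mu, sd) for w, mu, sd in classes]
        total = sum(dens)
        memberships.append([d / total for d in dens])

    # Update step: re-estimate each class from its weighted members.
    updated = []
    for k in range(len(classes)):
        r = [row[k] for row in memberships]
        n_k = sum(r)
        mu = sum(ri * x for ri, x in zip(r, data)) / n_k
        var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, data)) / n_k
        sd = max(math.sqrt(var), 1e-3)  # floor to avoid a collapsing class
        updated.append((n_k / len(data), mu, sd))
    return updated


data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
classes = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]  # arbitrary starting guess
for _ in range(20):
    classes = em_step(data, classes)

for w, mu, sd in classes:
    print(f"weight={w:.2f} mean={mu:.2f} sd={sd:.2f}")
```

A fixed number of iterations is used here for brevity; a convergence test on the change in parameters or cost would normally replace it.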

The "Factor" in Factor Snob

This is not just a simple mixture model. Within each class, it can also model the covariance between attributes using a latent factor. This is a hidden, continuous variable that can influence multiple attributes simultaneously. By using a factor, the model can explain correlations between variables within a class, leading to a more powerful and compact model (i.e., a shorter message length) than if all attributes were assumed to be independent.
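
To see how one hidden variable can account for correlations, the sketch below assumes the usual single-factor form x_j = mu_j + a_j * v + sigma_j * e_j, where v is the factor score and e_j is independent noise, both standard normal. The loadings a_j and noise levels sigma_j are made-up values, and the function name is hypothetical; this illustrates the idea rather than the program's internal representation.

```python
def implied_covariance(loadings, noise_sd):
    """Covariance of attributes under x_j = mu_j + a_j * v + sigma_j * e_j.

    Off-diagonal entries are a_i * a_j; diagonal entries add sigma_j ** 2.
    """
    n = len(loadings)
    cov = [[loadings[i] * loadings[j] for j in range(n)] for i in range(n)]
    for j in range(n):
        cov[j][j] += noise_sd[j] ** 2
    return cov


# Three attributes: the first two load on the factor, the third does not.
cov = implied_covariance([2.0, 1.0, 0.0], [1.0, 1.0, 1.0])
for row in cov:
    print(row)
```

The first two attributes come out correlated purely through their shared factor, while the third, with a zero loading, stays independent. Describing this with one loading per attribute is cheaper than stating a full covariance matrix, which is where the shorter message comes from.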
