A scikit-learn implementation of BOOMER - an algorithm for learning gradient boosted multi-label classification rules

Project description

BOOMER - Gradient Boosted Multi-Label Classification Rules


This software package provides an implementation of BOOMER - an algorithm for learning gradient boosted multi-label classification rules that integrates with the popular scikit-learn machine learning framework.

The goal of multi-label classification is the automatic assignment of sets of labels to individual data points, for example, the annotation of text documents with topics. The BOOMER algorithm uses gradient boosting to learn an ensemble of rules that is built with respect to a given multivariate loss function. To provide a versatile tool for different use cases, great emphasis is put on the efficiency of the implementation. To ensure its flexibility, it is designed in a modular fashion and can therefore easily be adjusted to different requirements.
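As a minimal sketch of how training and prediction work, the estimator follows scikit-learn's usual fit/predict convention. The import path and class name used below (mlrl.boomer, Boomer) are assumptions based on the project's naming and may differ between versions:

    from sklearn.datasets import make_multilabel_classification

    # Assumed import path and class name; consult the documentation for your version.
    from mlrl.boomer import Boomer

    # A small synthetic multi-label dataset: each example is assigned a set of labels.
    x, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,
                                          random_state=42)

    clf = Boomer()         # default configuration
    clf.fit(x, y)          # learns an ensemble of boosted classification rules
    pred = clf.predict(x)  # binary label matrix of shape (n_samples, n_classes)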

References

The algorithm was first published in the following paper. A preprint version is publicly available here.

Michael Rapp, Eneldo Loza Mencía, Johannes Fürnkranz, Vu-Linh Nguyen, and Eyke Hüllermeier. Learning Gradient Boosted Multi-label Classification Rules. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2020, Springer.

If you use the algorithm in a scientific publication, we would appreciate citations to the mentioned paper. An overview of publications concerned with the BOOMER algorithm, together with information on how to cite them, can be found in the References section of the documentation.
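For convenience, the reference above corresponds to a BibTeX entry along the following lines (the citation key is arbitrary and fields such as page numbers are omitted, since they are not stated here):

    @inproceedings{rapp2020boomer,
      title     = {Learning Gradient Boosted Multi-label Classification Rules},
      author    = {Rapp, Michael and Loza Menc{\'i}a, Eneldo and F{\"u}rnkranz, Johannes and Nguyen, Vu-Linh and H{\"u}llermeier, Eyke},
      booktitle = {Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD)},
      publisher = {Springer},
      year      = {2020}
    }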

Features

The algorithm provided by this project currently supports the following core functionalities for learning an ensemble of boosted classification rules:

  • Different label-wise or example-wise loss functions can be minimized during training (optionally using L1 or L2 regularization); a configuration sketch follows this list.
  • The rules may predict for a single label or for all labels (which makes it possible to model local label dependencies).
  • When learning a new rule, random samples of the training examples, features, or labels may be used, including techniques such as sampling with or without replacement or stratification.
  • The impact of individual rules on the ensemble can be controlled using shrinkage.
  • Hyper-parameters that provide fine-grained control over the specificity/generality of rules are available.
  • The conditions of rules can be pruned based on a hold-out set.
  • The algorithm can natively handle numerical, ordinal and nominal features (without the need for pre-processing techniques such as one-hot encoding).
  • The algorithm is able to deal with missing feature values, i.e., occurrences of NaN in the feature matrix.
  • Different strategies for prediction, which can be tailored to the used loss function, are available.
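As referenced in the list above, the following sketch illustrates how such behavior might be configured. The parameter names and values are assumptions derived from the features listed here, not a definitive API; consult the documentation for the names supported by your version:

    from mlrl.boomer import Boomer  # assumed import, as above

    # Hypothetical configuration; parameter names and accepted values are assumptions.
    clf = Boomer(
        loss='logistic-example-wise',            # minimize an example-wise loss
        l2_regularization_weight=1.0,            # L2 regularization during training
        shrinkage=0.3,                           # limit the impact of individual rules
        feature_sampling='without-replacement',  # random feature subsets per rule
    )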

In addition, the following features that may speed up training or reduce the memory footprint are currently implemented:

  • Approximate methods for evaluating potential conditions of rules, based on unsupervised binning methods, can be used.
  • Gradient-based label binning (GBLB) can be used to assign the available labels to a limited number of bins. The use of label binning may speed up training significantly when using rules that predict for multiple labels to minimize a non-decomposable loss function.
  • Dense or sparse feature matrices can be used for training and prediction. The use of sparse matrices may speed up training significantly on some data sets (see the sketch after this list).
  • Dense or sparse label matrices can be used for training. The use of sparse matrices may reduce the memory footprint in case of large data sets.
  • Dense or sparse matrices can be used to store predictions. The use of sparse matrices may reduce the memory footprint in case of large data sets.
  • Multi-threading can be used to parallelize the evaluation of a rule's potential refinements across multiple CPU cores.
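Since the estimator follows scikit-learn conventions, sparse inputs can be expressed as SciPy matrices. A minimal sketch, assuming the Boomer import from above and that CSR matrices are accepted directly:

    import numpy as np
    from scipy.sparse import csr_matrix

    from mlrl.boomer import Boomer  # assumed import, as above

    rng = np.random.default_rng(0)

    # A mostly-zero feature matrix stored in CSR format and a dense label matrix.
    x = csr_matrix(rng.binomial(1, 0.05, size=(100, 500)).astype(np.float64))
    y = rng.binomial(1, 0.3, size=(100, 5))

    clf = Boomer()
    clf.fit(x, y)  # sparse input may reduce memory usage and speed up training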

Documentation

An extensive user guide, as well as API documentation for developers, is available at https://mlrl-boomer.readthedocs.io. If you are new to the project, the introductory topics in the user guide are a good place to start.

A collection of benchmark datasets that are compatible with the algorithm is provided in a separate repository.

For an overview of changes and new features that have been included in past releases, please refer to the changelog.

License

This project is open source software licensed under the terms of the MIT license. We welcome contributions to the project to enhance its functionality and make it more accessible to a broader audience. A frequently updated list of contributors is available here.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
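The package can also be installed directly from PyPI, in which case pip selects the wheel matching your platform and Python version automatically; assuming the project name mlrl-boomer, this amounts to running pip install mlrl-boomer==0.7.0.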

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

mlrl_boomer-0.7.0-cp39-cp39-manylinux2014_x86_64.whl (756.3 kB)

Uploaded CPython 3.9

mlrl_boomer-0.7.0-cp38-cp38-manylinux2014_x86_64.whl (757.1 kB)

Uploaded CPython 3.8

mlrl_boomer-0.7.0-cp37-cp37m-manylinux2014_x86_64.whl (755.1 kB)

Uploaded CPython 3.7m

File details

Details for the file mlrl_boomer-0.7.0-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: mlrl_boomer-0.7.0-cp39-cp39-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 756.3 kB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for mlrl_boomer-0.7.0-cp39-cp39-manylinux2014_x86_64.whl:

  • SHA256: b6da2795cf8da8c7b7504f04f4a377205dc5ce11a1f8869d5ec7219e7dd47e19
  • MD5: d379089be247f6cb105b4ebcea91aaa5
  • BLAKE2b-256: 8452d8bdcb901bd5d8fed934b7e425bde8ece18748d97b5c678835dea93b8aab

See more details on using hashes here.
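To verify a downloaded wheel against the SHA256 digest listed above, a short Python snippet suffices (the file path is a placeholder for your download location):

    import hashlib

    # Compare the SHA256 digest of the downloaded wheel to the published value.
    path = 'mlrl_boomer-0.7.0-cp39-cp39-manylinux2014_x86_64.whl'
    expected = 'b6da2795cf8da8c7b7504f04f4a377205dc5ce11a1f8869d5ec7219e7dd47e19'

    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    print('OK' if digest == expected else 'MISMATCH')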

File details

Details for the file mlrl_boomer-0.7.0-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: mlrl_boomer-0.7.0-cp38-cp38-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 757.1 kB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for mlrl_boomer-0.7.0-cp38-cp38-manylinux2014_x86_64.whl:

  • SHA256: d466f9682381b4ef933a7992f144bf2f485fd7364cbad6b9ec20c9dd114b3454
  • MD5: 1c7cb8275b15b6d49e0a657d6b8d86ca
  • BLAKE2b-256: 6b7a674f3639d0d3c663b3d2a9888ebf092ae96ea206c3d79519239497f162a0

See more details on using hashes here.

File details

Details for the file mlrl_boomer-0.7.0-cp37-cp37m-manylinux2014_x86_64.whl.

File metadata

  • Download URL: mlrl_boomer-0.7.0-cp37-cp37m-manylinux2014_x86_64.whl
  • Upload date:
  • Size: 755.1 kB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for mlrl_boomer-0.7.0-cp37-cp37m-manylinux2014_x86_64.whl:

  • SHA256: 7e8ad78757d4e261919386eeeb2c920dd287f3156df709fc11ef52565eaea590
  • MD5: d1b7c1f20d26bf3fcb0927da495e8c83
  • BLAKE2b-256: 9b44787440906a6c046078a97c82b72cbaa91c6ba34a8ec6258d751d0064d609

See more details on using hashes here.
