
A collection of lego bricks for scikit-learn pipelines

scikit-lego

We love scikit-learn, but we very often find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a single package that holds them to a consistent standard of code quality and testing. This project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.

Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.

The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this project.

Installation

Install scikit-lego via pip with

python -m pip install scikit-lego

Or via conda with

conda install -c conda-forge scikit-lego

Alternatively, to edit and contribute, fork/clone the repository and run the following from its root:

python -m pip install -e ".[dev]"

Documentation

The documentation can be found here.

Usage

We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.

# the scikit learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...

Features

Here's a list of features that this library currently offers:

  • sklego.datasets.load_abalone loads in the abalone dataset
  • sklego.datasets.load_arrests loads in a dataset with fairness concerns
  • sklego.datasets.load_chicken loads in the joyful chickweight dataset
  • sklego.datasets.load_heroes loads a heroes of the storm dataset
  • sklego.datasets.load_hearts loads a dataset about hearts
  • sklego.datasets.load_penguins loads a lovely dataset about penguins
  • sklego.datasets.fetch_creditcard fetches a fraud dataset from openml
  • sklego.datasets.make_simpleseries makes a simulated timeseries
  • sklego.pandas_utils.add_lags adds lag values in a pandas dataframe
  • sklego.pandas_utils.log_step a useful decorator to log your pipeline steps
  • sklego.dummy.RandomRegressor dummy benchmark that predicts random values
  • sklego.linear_model.DeadZoneRegressor experimental feature that has a deadzone in the cost function
  • sklego.linear_model.DemographicParityClassifier logistic classifier constrained on demographic parity
  • sklego.linear_model.EqualOpportunityClassifier logistic classifier constrained on equal opportunity
  • sklego.linear_model.ProbWeightRegression linear model that treats coefficients as probabilistic weights
  • sklego.linear_model.LowessRegression locally weighted linear regression
  • sklego.linear_model.LADRegression least absolute deviation regression
  • sklego.linear_model.QuantileRegression linear quantile regression, generalizes LADRegression
  • sklego.linear_model.ImbalancedLinearRegression punish over/under-estimation of a model directly
  • sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
  • sklego.naive_bayes.BayesianGaussianMixtureNB classifies by training a Bayesian 1D GMM per column per class
  • sklego.mixture.BayesianGMMClassifier classifies by training a Bayesian GMM per class
  • sklego.mixture.BayesianGMMOutlierDetector detects outliers based on a trained Bayesian GMM
  • sklego.mixture.GMMClassifier classifies by training a GMM per class
  • sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
  • sklego.meta.ConfusionBalancer experimental feature that allows you to balance the confusion matrix
  • sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
  • sklego.meta.EstimatorTransformer adds a model output as a feature
  • sklego.meta.OutlierClassifier turns outlier models into classifiers for gridsearch
  • sklego.meta.GroupedPredictor can split the data into groups and fit a model on each
  • sklego.meta.GroupedTransformer can split the data into groups and fit a transformer on each
  • sklego.meta.SubjectiveClassifier experimental feature to add a prior to your classifier
  • sklego.meta.Thresholder meta model that allows you to gridsearch over the threshold
  • sklego.meta.RegressionOutlierDetector meta model that finds outliers by adding a threshold to regression
  • sklego.meta.ZeroInflatedRegressor predicts zero or applies a regression based on a classifier
  • sklego.preprocessing.ColumnCapper limits extreme values of the model features
  • sklego.preprocessing.ColumnDropper drops columns from a pandas dataframe
  • sklego.preprocessing.ColumnSelector selects columns based on column name
  • sklego.preprocessing.InformationFilter transformer that can de-correlate features
  • sklego.preprocessing.IdentityTransformer returns the same data, allows for concatenating pipelines
  • sklego.preprocessing.LinearEmbedder reweights features using coefficients from a fitted linear model
  • sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
  • sklego.preprocessing.TypeSelector selects columns based on type
  • sklego.preprocessing.RandomAdder adds randomness in training
  • sklego.preprocessing.RepeatingBasisFunction repeating feature engineering, useful for timeseries
  • sklego.preprocessing.DictMapper assigns numeric values to categorical columns
  • sklego.preprocessing.OutlierRemover experimental method to remove outliers during training
  • sklego.preprocessing.MonotonicSplineTransformer re-uses SplineTransformer in an attempt to make monotonic features
  • sklego.model_selection.GroupTimeSeriesSplit timeseries Kfold for groups with a different number of observations per group
  • sklego.model_selection.KlusterFoldValidation experimental feature that does K folds based on clustering
  • sklego.model_selection.TimeGapSplit timeseries Kfold with a gap between train/test
  • sklego.pipeline.DebugPipeline adds debug information to make debugging easier
  • sklego.pipeline.make_debug_pipeline shorthand function to create a debuggable pipeline
  • sklego.metrics.correlation_score calculates correlation between model output and feature
  • sklego.metrics.equal_opportunity_score calculates equal opportunity metric
  • sklego.metrics.p_percent_score proxy for model fairness with regard to a sensitive attribute
  • sklego.metrics.subset_score calculates a score on a subset of your data (meant for fairness tracking)

New Features

We want to be fairly open in what we accept, but we do require three things before a feature is added to the project:

  1. any new feature contributes towards a demonstrable real-world use case
  2. any new feature passes standard unit tests (we use the ones from scikit-learn)
  3. the feature has been discussed in the issue list beforehand

We automate all of our testing and use pre-commit hooks to keep the code working.
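
For a rough idea of what the scikit-learn checks in point 2 look like, the sketch below runs scikit-learn's common estimator checks through `sklearn.utils.estimator_checks.check_estimator`. `LinearRegression` is only a placeholder for a newly proposed estimator, and the project's own test suite may wire these checks up differently.

```python
from sklearn.linear_model import LinearRegression
from sklearn.utils.estimator_checks import check_estimator

# Placeholder for the estimator under review (assumption for illustration).
candidate = LinearRegression()

# Runs scikit-learn's common API checks against the estimator and raises
# an exception if one of them fails.
check_estimator(candidate)
checks_passed = True
print("all common estimator checks passed")
```

Running these checks locally before opening a pull request is a quick way to catch API incompatibilities early.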

