Rethinking machine learning pipelines a bit.

Project description

scikit-play

What does scikit-play do?

I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B) but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.

Imagine that you are dealing with the Titanic dataset.

import pandas as pd

df = pd.read_csv("https://calmcode.io/static/data/titanic.csv")
df.head()

Here's what the dataset looks like.

survived  pclass  name                                                 sex     age  fare     sibsp  parch
0         3       Braund, Mr. Owen Harris                              male    22   7.25     1      0
1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   71.2833  1      0
1         3       Heikkinen, Miss. Laina                               female  26   7.925    0      0
1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   53.1     1      0
0         3       Allen, Mr. William Henry                             male    35   8.05     0      0

The goal is to predict who survived, so survived is the target column for a classification task. To make good predictions, the features first need to be encoded in a form a model can use. To do that, you might construct a preprocessing pipeline like this:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from skrub import SelectCols

pipe = make_union(
    SelectCols(["age", "fare", "sibsp", "parch"]),
    make_pipeline(
        SelectCols(["sex", "pclass"]),
        OneHotEncoder()
    )
)

This pipeline takes the age, fare, sibsp and parch features as-is. These features are already numeric, so they do not need to be changed. The sex and pclass features are categorical (pclass is stored as a number, but it denotes a passenger class), so the pipeline one-hot encodes them first.
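To make the encoding step concrete, here is a minimal sketch in plain Python of what one-hot encoding does to the sex and pclass columns. It is an illustration only and not how scikit-learn's OneHotEncoder is implemented; the helper name one_hot is hypothetical, and the three sample rows are a trimmed copy of the first rows of the table above.

```python
# Illustrative only: a dependency-free sketch of one-hot encoding.
# `one_hot` is a hypothetical helper, not part of scikit-learn or skplay.

rows = [
    {"age": 22, "fare": 7.25, "sex": "male", "pclass": 3},
    {"age": 38, "fare": 71.2833, "sex": "female", "pclass": 1},
    {"age": 26, "fare": 7.925, "sex": "female", "pclass": 3},
]


def one_hot(rows, column):
    """Return the sorted categories of `column` and one indicator list per row."""
    categories = sorted({row[column] for row in rows})
    encoded = [[1 if row[column] == cat else 0 for cat in categories] for row in rows]
    return categories, encoded


sex_categories, sex_encoded = one_hot(rows, "sex")
pclass_categories, pclass_encoded = one_hot(rows, "pclass")

# Numeric columns pass through unchanged; indicator columns are appended.
features = [
    [row["age"], row["fare"]] + sex + pclass
    for row, sex, pclass in zip(rows, sex_encoded, pclass_encoded)
]

print(sex_categories)     # ['female', 'male']
print(pclass_categories)  # [1, 3]
print(features[0])        # [22, 7.25, 0, 1, 0, 1]
```

Each category becomes its own 0/1 column, so a model never treats pclass 3 as "three times" pclass 1.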

The pipeline works, and it's fine, but you could wonder whether it is easy. You need to know scikit-learn fairly well to build a pipeline this way, and you need to be comfortable with Python too. There is also some nesting going on, which gets in the way for a novice or for anybody who just wants to make a quick model. All of this is understandable given that scikit-learn has to allow for elaborate pipelines, but if you just want something dead simple, you may appreciate another syntax instead.

Enter skplay.

Skplay offers an API that allows you to declare the aforementioned pipeline by doing this instead:

from skplay import feats, onehot

formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass")

The formula object accumulates components: each call such as feats(...) or onehot(...) contributes one, and + combines them.

# This object is a scikit-learn pipeline but with operator support!
formula

It's pretty much the same pipeline as before, but it's a lot easier to declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.
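For readers curious how a + syntax like this can accumulate components at all, the following is a minimal sketch of operator overloading in plain Python. It is not skplay's implementation: the Formula class and the feats and onehot functions below are hypothetical stand-ins that only mirror the names used in the snippet above, and they record column names without building a real scikit-learn pipeline.

```python
# Hypothetical sketch: how `+` could accumulate pipeline components.
# None of this is skplay's code; the names only mirror the snippet above.

class Formula:
    def __init__(self, steps):
        # Each step is an (encoding, column names) pair.
        self.steps = list(steps)

    def __add__(self, other):
        if not isinstance(other, Formula):
            return NotImplemented
        # Return a new Formula so neither operand is mutated.
        return Formula(self.steps + other.steps)

    def __repr__(self):
        return " + ".join(f"{kind}({', '.join(cols)})" for kind, cols in self.steps)


def feats(*columns):
    """Record columns that should be kept as-is."""
    return Formula([("feats", columns)])


def onehot(*columns):
    """Record columns that should be one-hot encoded."""
    return Formula([("onehot", columns)])


formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass")
print(formula)             # feats(age, fare, sibsp, parch) + onehot(sex, pclass)
print(len(formula.steps))  # 2
```

Returning a new object from __add__ keeps each piece reusable, so the same feats(...) component could appear in several formulas.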

This is what scikit-play is all about, but this is just the start of what it can do. If that sounds interesting, you can read more on the documentation page.

Alternatively, you can explore this tool by installing it via:

uv pip install scikit-play

Download files

Source Distribution

scikit_play-0.1.2.tar.gz (8.9 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.13

Algorithm    Hash digest
SHA256       adc117806fa3fba487cfd1eb0d49740a0c1f961c895eae1c3400f6078a1ab3e4
MD5          598245799eb62142e01b1c18d12e4e36
BLAKE2b-256  0956c63e6c79d050aa661e5407c4fc46cec42cb7c3e3618853776db847681f9d

Built Distribution

scikit_play-0.1.2-py3-none-any.whl (9.2 kB, Python 3)

Algorithm    Hash digest
SHA256       57fd11fc142ed36eca9e369589f7aef871669cf5c02cd9b388bfdb2da99f546d
MD5          cfb316cd1cf06243ffdbe66187dd4bc0
BLAKE2b-256  48732f1713bde1c4cf79e58f88dab0d4039379e84a5522e5c5bc08d03bf75cd2
