A flexible alternative to scikit-learn Pipelines

skdag - A more flexible alternative to scikit-learn Pipelines

scikit-dag (skdag) is an open-source, MIT-licensed library that provides advanced workflow management for any machine learning operations that follow scikit-learn conventions. Installation is simple:

pip install skdag

It works by introducing Directed Acyclic Graphs (DAGs) as a drop-in replacement for the traditional scikit-learn Pipeline. This gives you a simple interface for a range of use cases, including complex pre-processing, model stacking and benchmarking.

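from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from skdag import DAGBuilder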

dag = (
   DAGBuilder(infer_dataframe=True)
   .add_step("impute", SimpleImputer())
   .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
   .add_step(
      "blood",
      PCA(n_components=2, random_state=0),
      deps={"impute": ["s1", "s2", "s3", "s4", "s5", "s6"]}
   )
   .add_step(
      "rf",
      RandomForestRegressor(max_depth=5, random_state=0),
      deps=["blood", "vitals"]
   )
   .add_step("svm", SVR(C=0.7), deps=["blood", "vitals"])
   .add_step(
      "knn",
      KNeighborsRegressor(n_neighbors=5),
      deps=["blood", "vitals"]
   )
   .add_step("meta", LinearRegression(), deps=["rf", "svm", "knn"])
   .make_dag()
)

dag.show(detailed=True)

[Figure: the DAG diagram rendered by dag.show(detailed=True) (doc/_static/img/cover.png)]

The above DAG imputes missing values, runs PCA on the columns relating to blood test results and leaves the other columns as they are. The resulting features are then passed to three different regressors, whose predictions are passed on to a final meta-estimator. Because DAGs (unlike pipelines) allow predictors in the middle of a workflow, you can use them to implement model stacking. We also chose to run the DAG steps in parallel wherever possible.
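
The dependency structure described above is what determines which steps can run side by side. As a rough illustration only, the sketch below uses Python's standard-library graphlib (Python 3.9+) to group the same step names into "waves" of mutually independent steps. It is not skdag's scheduler and makes no claim about how skdag orders or parallelises work internally; the deps dictionary and the execution_waves helper are hypothetical names introduced for this example.

```python
from graphlib import TopologicalSorter

# Step name -> steps it depends on, copied from the example DAG above.
# (Column selections are omitted; only the step-to-step edges matter here.)
deps = {
    "impute": [],
    "vitals": ["impute"],
    "blood": ["impute"],
    "rf": ["blood", "vitals"],
    "svm": ["blood", "vitals"],
    "knn": ["blood", "vitals"],
    "meta": ["rf", "svm", "knn"],
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        # Every step whose dependencies have all been marked as done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this sketch the two feature branches (blood and vitals) land in the same wave, as do the three regressors, which is why a workflow shaped like this one can benefit from running steps in parallel.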

After building our DAG, we can treat it as any other estimator:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.2, random_state=0
)

dag.fit(X_train, y_train)
dag.predict(X_test)

Just like a pipeline, you can optimise it with a grid search, pickle it, and so on.

Note that this package does not deal with things like delayed dependencies or distributed architectures; consider an established solution for such use cases. skdag is just for building and executing local ensembles from estimators.

Read on to learn more about skdag
