A framework for doing stability analysis with PCS.

Project description

A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. the choice of imputation strategy)!

License: MIT. Requires Python 3.9+.

Why use vflow?

Using vflow's simple wrappers facilitates many best practices for data science, as laid out in the predictability, computability, and stability (PCS) framework for veridical data science. The goal of vflow is to make it easy to build data science pipelines that follow PCS by providing intuitive low-code syntax, efficient and flexible computational backends via Ray, and well-documented, reproducible experimentation via MLflow.

  • Computation: Automatic parallelization and caching throughout the pipeline
  • Reproducibility: Automatic experiment tracking and saving
  • Prediction: Filter the pipeline by training and validation performance
  • Stability: Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results

Here we show a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics) written simply using vflow.

import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args

# initialize data
X, y = make_classification()
X_train, X_test, y_train, y_test = init_args(
    train_test_split(X, y),
    names=["X_train", "X_test", "y_train", "y_test"],  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(
    name="subsampling", vfuncs=subsampling_funcs, output_matching=True
)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [LogisticRegression(), DecisionTreeClassifier()]
modeling_set = Vset(name="modeling", vfuncs=models, vfunc_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(
    name="binary_metrics",
    vfuncs=[accuracy_score, balanced_accuracy_score],
    vfunc_keys=["Acc", "Bal_Acc"],
)
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)

Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
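As a rough, library-independent illustration of what such a stability summary can look like, the sketch below groups scores by one perturbation axis and reports the mean and spread across the remaining ones. It assumes the results sit in a plain dict keyed by (subsample, model, metric) tuples. The key layout, the helper name `stability_by`, and the score values are hypothetical placeholders, not vflow's API and not output of the pipeline above.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical scores keyed by (subsample, model, metric). The values are
# made-up placeholders for illustration, not real results.
scores = {
    ("subsample_0", "LR", "Acc"): 0.90,
    ("subsample_1", "LR", "Acc"): 0.80,
    ("subsample_2", "LR", "Acc"): 0.85,
    ("subsample_0", "DT", "Acc"): 0.70,
    ("subsample_1", "DT", "Acc"): 0.90,
    ("subsample_2", "DT", "Acc"): 0.80,
}


def stability_by(scores, axis):
    """Group scores by one position of the key tuple and summarize each group."""
    groups = defaultdict(list)
    for key, value in scores.items():
        groups[key[axis]].append(value)
    return {
        name: {"mean": mean(vals), "std": pstdev(vals), "n": len(vals)}
        for name, vals in groups.items()
    }


# Axis 1 is the model: each summary shows how much that model's accuracy
# varies across the subsamples.
by_model = stability_by(scores, axis=1)
for model, stats in by_model.items():
    print(f"{model}: mean={stats['mean']:.3f} std={stats['std']:.3f} n={stats['n']}")
```

A small standard deviation within a group suggests the metric is stable to the perturbation being averaged over; passing `axis=0` instead would summarize each subsample across models.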

Documentation

See the docs for the API reference.

Notebook examples

Note that some of these require more dependencies than just those required for vflow. To install them all, run pip install "vflow[nb]" (the quotes keep shells such as zsh from interpreting the brackets).

Synthetic classification

Enhancer genomics

fMRI voxel prediction

Fashion-MNIST classification

Feature importance stability

Clinical decision rule vetting

Installation

Stable version

pip install vflow

Development version (unstable)

pip install vflow@git+https://github.com/Yu-Group/veridical-flow

References

@software{duncan2020vflow,
   author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
   doi = {10.21105/joss.03895},
   month = {1},
   title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
   url = {https://doi.org/10.21105/joss.03895},
   year = {2022}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vflow-0.1.4.tar.gz (6.2 MB)

Uploaded Source

Built Distribution

vflow-0.1.4-py3-none-any.whl (19.8 kB)

Uploaded Python 3

File details

Details for the file vflow-0.1.4.tar.gz.

File metadata

  • Download URL: vflow-0.1.4.tar.gz
  • Upload date:
  • Size: 6.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for vflow-0.1.4.tar.gz
Algorithm Hash digest
SHA256 f6585af251f9f086b45d7b86adfa2a6cf1bbb436b41912083bf1a317b4d280a5
MD5 ace09d57d7398593d74c80f1b7608283
BLAKE2b-256 1033bbcb091afc0402132e32c7582c9158227663728f7b17baf8645443685efc

See more details on using hashes here.
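One way to check a downloaded archive against the SHA256 digest listed above is sketched here using only Python's standard library. The local path vflow-0.1.4.tar.gz in the current directory and the helper names `sha256_of` and `matches` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for vflow-0.1.4.tar.gz.
EXPECTED_SHA256 = "f6585af251f9f086b45d7b86adfa2a6cf1bbb436b41912083bf1a317b4d280a5"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("vflow-0.1.4.tar.gz")  # assumed download location
    if archive.exists():
        print("digest OK" if matches(archive) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of archive size. The same function can be pointed at the wheel by passing its path and the wheel's SHA256 digest as `expected`.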

File details

Details for the file vflow-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: vflow-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for vflow-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 7c5e71fea2f4a7898b29e4bbafe1ae0056cf7dd39967e14a068615f515fd8223
MD5 0d686f89e14421d4f089142fcdc1857f
BLAKE2b-256 92da0fc964abbb196da785ecb9671e2bd11d7e7309f6c2518a5eca6fa709bce8

See more details on using hashes here.