Bodywork Pipeline Utilities

Utilities for helping with pipeline development and integration with third-party MLOps services.

|-- aws
    |-- Dataset
    |-- get_latest_csv_dataset_from_s3
    |-- get_latest_parquet_dataset_from_s3
    |-- put_csv_dataset_to_s3
    |-- put_parquet_dataset_to_s3
    |-- Model
    |-- get_latest_pkl_model_from_s3
|-- logging
    |-- configure_logger

AWS

A simple dataset and model management framework built on S3 object storage.

Datasets

Training data files in CSV or Parquet format are saved to an S3 bucket using filenames with an ISO timestamp component:

my-s3-project-bucket/
|
|-- datasets/
|    |-- ... 
|    |-- dataset_file_2021-07-10T07:42:23.csv
|    |-- dataset_file_2021-07-11T07:45:12.csv
|    |-- dataset_file_2021-07-12T07:41:02.csv
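The naming convention in the listing above can be reproduced and parsed with the standard library alone. The following is a minimal sketch, not the library's implementation: the helper names (`make_dataset_key`, `parse_key_timestamp`, `latest_key`) are hypothetical, the `dataset_file_` prefix is taken from the listing, and timestamps are assumed to be timezone-naive.

```python
from datetime import datetime
from typing import Iterable, Optional

# Prefix copied from the listing above; the helpers below are hypothetical
# and are not part of bodywork-pipeline-utils.
PREFIX = "dataset_file_"


def make_dataset_key(folder: str, timestamp: datetime, ext: str = "csv") -> str:
    """Build a key such as datasets/dataset_file_2021-07-12T07:41:02.csv."""
    stamp = timestamp.replace(microsecond=0).isoformat()
    return f"{folder}/{PREFIX}{stamp}.{ext}"


def parse_key_timestamp(key: str) -> Optional[datetime]:
    """Return the timestamp embedded in a key, or None if the key does not match."""
    filename = key.rsplit("/", 1)[-1]
    stem, _, _ext = filename.rpartition(".")
    if not stem.startswith(PREFIX):
        return None
    try:
        return datetime.fromisoformat(stem[len(PREFIX):])
    except ValueError:
        return None


def latest_key(keys: Iterable[str]) -> Optional[str]:
    """Pick the key with the most recent embedded timestamp, ignoring other keys."""
    dated = []
    for key in keys:
        timestamp = parse_key_timestamp(key)
        if timestamp is not None:
            dated.append((timestamp, key))
    return max(dated)[1] if dated else None
```

Parsing the timestamp, rather than sorting keys as strings, keeps the selection correct even if differently named files share the folder.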

You can use put_csv_dataset_to_s3 to persist a Pandas DataFrame directly to S3 with a compatible filename, or handle this yourself. The latest training data file can be retrieved using get_latest_csv_dataset_from_s3, which returns a Dataset object with the following fields:

class Dataset(NamedTuple):
    """Container for downloaded datasets and associated metadata."""

    data: DataFrame
    datetime: datetime
    bucket: str
    key: str
    hash: str

AWS S3 assigns an Entity Tag (ETag) to every object uploaded to it, which for single-part uploads is the MD5 hash of the object's contents. This is retrieved from S3 together with other basic metadata about the object. For example,

get_latest_csv_dataset_from_s3("my-s3-project-bucket", "datasets")
# Dataset(
#     data=...,
#     datetime=datetime(2021, 7, 12, 7, 41, 2),
#     bucket="my-s3-project-bucket",
#     key="datasets/dataset_file_2021-07-12T07:41:02.csv",
#     hash="759eccda4ceb7a07cda66ad4ef7cdfbc"
# )

This, together with S3 object versioning (if enabled), can be used to track the precise dataset used to train a model.
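To illustrate how the recorded hash can be used, the sketch below checks a local copy of a dataset against a stored hash value. It assumes the hash is a plain MD5 hex digest, which holds for single-part uploads without SSE-KMS or SSE-C encryption but not for multipart uploads. The function names are hypothetical and not part of the library.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of raw bytes."""
    return hashlib.md5(data).hexdigest()


def matches_recorded_hash(data: bytes, recorded_hash: str) -> bool:
    """Compare local bytes with a recorded hash (hypothetical helper).

    S3 APIs usually return the ETag wrapped in double quotes, so any
    surrounding quotes are stripped before comparing.
    """
    return md5_hex(data) == recorded_hash.strip('"').lower()
```

For example, `matches_recorded_hash(open("local_copy.csv", "rb").read(), dataset.hash)` would confirm that a local file (here a hypothetical `local_copy.csv`) is byte-for-byte the dataset that was downloaded.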

Models

The Model class is a simple wrapper for an ML model that adds basic model metadata and the ability to serialise the model directly to S3. It requires a Dataset object containing the data used to train the model, so that the model artefact can be explicitly linked to the precise version of the data used to train it. For example,

from sklearn.tree import DecisionTreeRegressor


dataset = get_latest_csv_dataset_from_s3("my-s3-project-bucket", "datasets")
model = Model("my-model", DecisionTreeRegressor(), dataset, {"features": ["x1", "x2"], "foo": "bar"})

model
# name: my-model
# model_type: <class 'sklearn.tree._classes.DecisionTreeRegressor'>
# model_timestamp: 2021-07-12 07:46:08
# model_hash: ab6f998e0f5d8829fcb0017819c45020
# train_dataset_key: datasets/dataset_file_2021-07-12T07:41:02.csv
# train_dataset_hash: 759eccda4ceb7a07cda66ad4ef7cdfbc
# pipeline_git_commit_hash: e585fd3

Model objects can be directly serialised to S3,

model.put_model_to_s3("my-s3-project-bucket", "models")

This creates objects in an S3 bucket as follows:

my-s3-project-bucket/
|
|-- models/
|    |-- ... 
|    |-- serialised_model_2021-07-10T07:47:33.pkl
|    |-- serialised_model_2021-07-11T07:49:14.pkl
|    |-- serialised_model_2021-07-12T07:46:08.pkl

The Model class is intended as a base class, suitable for pickle-able models (e.g. from Scikit-Learn). More complex model types (e.g. PyTorch or PyMC3 models) should inherit from Model and override the appropriate methods.
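The inherit-and-override pattern described above can be sketched with stand-in classes. This is an illustration only: `PickleModelSketch`, `JsonWeightsModelSketch`, `to_bytes`, `from_bytes` and `model_hash` are hypothetical names, and the methods that the real Model class expects subclasses to override may differ, so check the library source before relying on this shape.

```python
import hashlib
import json
import pickle
from typing import Any


class PickleModelSketch:
    """Stand-in base class that serialises with pickle (NOT the library's Model)."""

    file_extension = "pkl"

    def __init__(self, name: str, model: Any):
        self.name = name
        self.model = model

    def to_bytes(self) -> bytes:
        """Serialise the wrapped model; subclasses override this hook."""
        return pickle.dumps(self.model)

    @classmethod
    def from_bytes(cls, name: str, payload: bytes) -> "PickleModelSketch":
        """Rebuild a wrapper from serialised bytes; subclasses override this hook."""
        return cls(name, pickle.loads(payload))

    def model_hash(self) -> str:
        """MD5 of the serialised bytes, computed via whichever to_bytes is active."""
        return hashlib.md5(self.to_bytes()).hexdigest()


class JsonWeightsModelSketch(PickleModelSketch):
    """Hypothetical subclass for a model whose state is a JSON-friendly dict."""

    file_extension = "json"

    def to_bytes(self) -> bytes:
        # sort_keys makes the output, and therefore the hash, deterministic.
        return json.dumps(self.model, sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, name: str, payload: bytes) -> "JsonWeightsModelSketch":
        return cls(name, json.loads(payload.decode("utf-8")))
```

Only the serialisation hooks change in the subclass; metadata such as the hash is inherited and picks up the new format automatically.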

Logging

The configure_logger function returns a Python logger configured to print logs using the Bodywork log format. For example,

log = configure_logger()
log.info("foo")
# 2021-07-14 07:57:10,854 - INFO - pipeline.train - foo
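For readers who want a similar output format without the library, a comparable logger can be built with the standard `logging` module. This is a sketch under assumptions: the format string is inferred from the sample line above rather than taken from the library, and `make_logger_sketch` is a hypothetical name.

```python
import logging
import sys
from typing import Optional, TextIO

# Inferred from the sample output above; the library's actual format string
# is an assumption here.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(module)s.%(funcName)s - %(message)s"


def make_logger_sketch(
    name: str = "pipeline-sketch", stream: Optional[TextIO] = None
) -> logging.Logger:
    """Return an INFO-level logger that writes LOG_FORMAT lines to a stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # repeated calls must not stack handlers
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger
```

Calling `make_logger_sketch().info("foo")` inside a function named `train` in a file `pipeline.py` would produce a line shaped like the sample, with the current time in place of the timestamp.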

Release history

0.1.6.dev0 (pre-release): Apr 23, 2022
0.1.5 (this version): Jul 21, 2021
0.1.4: Jul 20, 2021
0.1.3: Jul 20, 2021
0.1.2: Jul 20, 2021
0.1.1: Jul 15, 2021
0.1.0: Jul 14, 2021
0.0.1b2 (pre-release): May 21, 2021
0.0.1b1 (pre-release): May 21, 2021
