A tracking infrastructure for PatternAg machine learning projects and experiments.
Machine Learning Tracking Infrastructure
This codebase contains an importable module that anyone can use during model development to track model performance and experiments. The module tracks ML experiments by recording all pertinent information in project-specific CSV files. The CSV is appended each time an experiment is performed, creating a running log of all experiments run within a project. All experiment information is displayed in a Looker Data Studio dashboard to make ML experimentation more accessible for viewing.
In addition to displaying tracking information, the dashboard shows performance visuals such as confusion matrices and regression plots to help assess model performance.
Installation
pip3 install ml_tracking
How to Use
Import the `ml_tracking` module and run `log_experiment()` with the arguments outlined below:

```
from ml_tracking.ml_tracking import log_experiment

log_experiment(
    kind = str,                   # classification/regression
    project = str,                # the name of the project
    parameters = dict,            # required information for tracking
    y_true = list,                # the actual y values
    y_pred = list,                # the predictions generated by the model
    extra_parameters = dict/None, # optional information to be tracked
    new_csv = bool/None           # whether or not a new csv will be created
)
```
Required Arguments:

- `kind`: The type of modeling being done, passed in as a string; must be either 'regression' or 'classification'.
- `project`: The overarching project identifier or name; trackers will be named using the convention `{project_name}_{date}_{i}.csv`. Project names containing slashes or underscores will have these characters replaced with dashes in the GCP bucket and dashboard. Example: `"tracker_testing_1_/first_run"` -> `"tracker-testing-1-first-run"`
- `parameters`: The required information needed from the user:

  ```
  parameters = {
      'dataset_uri': str,
      'target_column': str,
      'test_set': list,
      'model': sklearn or other model object
  }
  ```

  - `dataset_uri`: The path to the bucket containing the finalized dataset being used, as a string. If the dataset is not currently in a bucket, it should be uploaded to one; this is required for replicating experiments.
  - `target_column`: A string containing the name of the column used as the dependent variable.
  - `test_set`: A list of the sample_uuids, or other ids, that can be used to identify which samples are in the test set.
  - `model`: The model object itself; it must be the model object, not the model name as a string.
- `y_true`: A list containing the true y-values in the test set.
- `y_pred`: A list containing the model predictions; must be in the same order as `y_true`.

Optional Arguments:

- `extra_parameters`: A dictionary with any additional information the user wishes to track, for example:

  ```
  extra_parameters = {
      'scaler': 'MinMaxScaler',
      'data_cleaning': 'removed features > 50% nulls'
  }
  ```

- `new_csv`: Setting this to True will create a new tracking csv; if set to False, new experiments will be appended to the most recent tracking file created.
Viewing results in the dashboard
- Use the project drop-down menu to select a project
- Look in the tracking table to identify the correct prediction id
- If many experiments were logged under the same project name, there will be many prediction ids
- Select the prediction id of interest in the prediction id drop-down menu
- Steps 2 and 3 can be skipped by instead looking in the model scoring table and selecting the prediction id with the best performance.