
A tracking infrastructure for PatternAg machine learning projects and experiments.


Machine Learning Tracking Infrastructure

This codebase contains an importable module that anyone can use during model development to track model performance and experiments. It tracks ML experiments by recording all pertinent information in project-specific CSV files; each time an experiment is run, a record is appended to the project's CSV, creating a running log of every experiment within that project. All experiment information is displayed in a Looker Data Studio dashboard to make ML experimentation easier to review.
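
Because each tracker is an ordinary CSV, the running log can also be inspected directly, for example with pandas. The filename below is hypothetical, following the {project_name}_{date}_{i}.csv convention described under the project argument; the actual columns depend on the parameters logged:

import pandas as pd

# Hypothetical tracker filename; the real date format and index
# depend on when and how often experiments were logged.
log = pd.read_csv("tracker_testing_2023-06-01_0.csv")
print(log.tail())  # the most recently appended experiments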

In addition to the tracking information, performance visuals such as confusion matrices and regression plots are displayed in the dashboard to help assess model performance.

How to Use
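
The package is published on PyPI under the name ml_tracking, so it can be installed with pip:

pip install ml_tracking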

Import the ml_tracking module and run log_experiment() with the arguments outlined below:

from ml_tracking import log_experiment

log_experiment(
    kind = str, # classification/regression
    project = str, # the name of the project
    parameters = dict, # required information for tracking
    y_true = list, # the actual y values
    y_pred = list, # the predictions generated by the model
    extra_parameters = dict/None, # optional information to be tracked
    new_csv = bool/None # whether or not a new csv will be created
)
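
For example, a complete classification run might look like the following. This is a sketch using synthetic data: the project name, dataset URI, target column, and extra parameters are placeholders, and any scikit-learn-style model object can be passed in the same way.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from ml_tracking import log_experiment

# Synthetic stand-in data; in practice X, y, and the ids come
# from the finalized dataset stored in the bucket.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
sample_ids = [f"uuid-{i}" for i in range(len(y))]

X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, sample_ids, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

log_experiment(
    kind='classification',
    project='tracker_testing',  # hypothetical project name
    parameters={
        'dataset_uri': 'gs://my-bucket/finalized_dataset.csv',  # placeholder URI
        'target_column': 'target',
        'test_set': ids_test,   # ids identifying the held-out samples
        'model': model,         # the fitted object, not a string
    },
    y_true=list(y_test),
    y_pred=list(y_pred),
    extra_parameters={'notes': 'synthetic smoke test'},
    new_csv=True,               # start a fresh tracking CSV
)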

Required Arguments:

  • kind: The type of modeling being done, passed in as a string; should be either 'regression' or 'classification'.

  • project: The overarching project identifier or name; trackers are named using the convention {project_name}_{date}_{i}.csv. Project names containing slashes or underscores have these characters replaced with dashes in the GCP bucket and dashboard (a sketch reproducing this mapping appears after the argument list below).

    Example:

    "tracker_testing_1_/first_run" -> "tracker-testing-1-first-run"

  • parameters: A dictionary of required information supplied by the user:

    parameters = {
       'dataset_uri': str,
       'target_column': str,
       'test_set': list,
       'model': object  # an sklearn (or other) fitted model object
       }
    
    • dataset_uri: A string path to the bucket containing the finalized dataset being used. If the dataset is not currently in a bucket, it should be uploaded to one; this is required for replicating experiments.
    • target_column: A string containing the name of the column used as the dependent variable.
    • test_set: A list of the sample_uuids or other ids that identify which samples are in the test set.
    • model: The fitted model object itself, not the model name as a string.
  • y_true: A list containing the true y-values in the test set.

  • y_pred: A list containing the model predictions; must be in the same order as y_true.

  • extra_parameters: A dictionary with any additional information the user wishes to track.

    extra_parameters = {
        'scaler':'MinMaxScaler', 
        'data_cleaning':'removed features > 50% nulls'
        }
    
  • new_csv: Setting this to True creates a new tracking CSV; if set to False, new experiments are appended to the most recently created tracking file.
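
The exact sanitization logic for project names is not spelled out here, but the documented example under project suggests it could be reproduced with a single regex substitution that collapses runs of slashes and underscores into dashes. A minimal sketch under that assumption:

import re

def sanitize(project: str) -> str:
    # Collapse each run of slashes/underscores into a single dash,
    # matching the documented example.
    return re.sub(r'[/_]+', '-', project)

assert sanitize("tracker_testing_1_/first_run") == "tracker-testing-1-first-run"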

Viewing results in the dashboard

  1. Use the project drop-down menu to select a project
  2. Look in the tracking table to identify the correct prediction id
    • If many experiments were logged under the same project name, there will be many prediction ids
  3. Select the prediction id of interest in the prediction id drop-down menu
  4. Alternatively, skip steps 2 and 3 by looking in the model scoring table and selecting the prediction id with the best performance.

(Screenshot: the ML tracking dashboard)
