
Gooder AI Package

This package provides a streamlined way to evaluate ("valuate") machine learning models on the Gooder AI platform. It exports a simple yet powerful function, valuate_model, which is designed to work seamlessly with a variety of machine learning frameworks, including scikit-learn, XGBoost, PyTorch, and CatBoost. The function does the following:

  • Valuates models with Gooder AI.
  • Validates and uploads Gooder AI configurations and test datasets for secure storage and processing.
  • Creates or updates a shared "view" on Gooder AI, allowing users to interactively visualize and analyze their model's business performance.

Installation

Install the package using pip:

pip install gooder_ai

Sample Jupyter Notebook

Running valuate_model on fraud detection models (suitable for Jupyter notebooks)


Function Parameters

valuate_model(**kwargs)

The valuate_model function takes the following input arguments:

  1. models: list[ScikitModel]

    • Machine learning models that follow the ScikitModel protocol.
    • Each model must have a scoring function (e.g., predict_proba) that generates probability scores for classification.
    • Each model must also have a classes_ attribute representing the possible target classes.
  2. x_data: ndarray | DataFrame | list[str | int | float] | spmatrix

    • A dataset containing the input features for evaluation.
    • This is the dataset that will be fed into the model for prediction.
  3. y: ndarray | DataFrame | list[str | int | float] | spmatrix

    • A dataset representing the true target values (labels) corresponding to x_data.
    • This helps in validating model performance.
  4. config: dict

    • A dictionary containing model configuration settings.
    • To load a starter configuration provided by Gooder AI, you can use the following example:
      from gooder_ai.configs import load_starter_config
      config = load_starter_config()
      
  5. view_meta: ViewMeta

    • A dictionary containing metadata about the "view" (shared result visualization) being created or updated.
    • Structure:
      {
          "mode": Optional["public" | "protected" | "private"],  # Access control
          "view_id": Optional[str],  # ID of an existing view (if updating)
          "dataset_name": Optional[str]  # Name of the dataset (defaults to timestamp)
      }
      
    • If view_id is provided, an existing view is updated; otherwise, a new one is created.
  6. auth_credentials: Credentials

    • A dictionary with user authentication details for the Gooder AI platform.
    • Structure:
      {
          "email": str,  # User's email
          "password": str  # User's password
      }
      
    • These credentials are used to authenticate when uploading the dataset and configuration. They are optional if upload_data_to_gooder = False and upload_config_to_gooder = False.
  7. model_names: list[str]

    • Used to label the score columns in the output dataset and configuration.
    • If not provided, default names are generated from the model class names.
    • Example: for two models that output binary classification scores, columns named "model1_score" and "model2_score" will be created.
  8. scorer_names: list[str]

    • Used to specify the scorer function for each model.
    • If not provided, the predict_proba function is used by default.
  9. column_names: ColumnNames = {} (optional)

    • A dictionary specifying the column names for the dataset and scores.
    • Structure:
      {
          "dataset_column_names": Optional[list[str]],  # Feature names
          "dependent_variable_name": Optional[str]  # Name of the target variable
      }
      
  10. included_columns: list[str] = [] (optional)

  • An optional list of names specifying which columns to include in the dataset before valuating your models on the Gooder platform. If left unspecified, all columns are included, which generally results in an unnecessarily large data file, since Gooder typically uses only a small number of columns. Even when specified, the model score columns and the dependent variable are always included.
  11. upload_data_to_gooder: Boolean = True (optional)
  • Controls whether the dataset is uploaded to the Gooder AI platform.
  12. upload_config_to_gooder: Boolean = True (optional)
  • Controls whether the config is uploaded to the Gooder AI platform.
  13. aws_variables: AWSVariables = {} (optional)
  • A dictionary containing AWS-related variables.
  • Used for authentication and file uploads.
  • Structure:
    {
        "api_url": Optional[str],
        "app_client_id": Optional[str],
        "identity_pool_id": Optional[str],
        "user_pool_id": Optional[str],
        "bucket_name": Optional[str],
        "base_url": Optional[str],
        "validation_api_url": Optional[str]
    }
    
  • Defaults to global values if not provided.
  14. max_size_uploaded_data: int = 10 (optional)
  • Defines the maximum allowed memory size (in megabytes, MB) for the combined dataset when uploading to Gooder AI.
  • Before uploading, the function calculates the memory usage of the full dataset.
  • If the dataset exceeds this threshold and upload_data_to_gooder is True, the operation is aborted and an exception is raised.
  • This is a safety limit to prevent large uploads that could impact performance or exceed platform limits.
  • Default value is 10MB, which is suitable for most use cases.
  • Increase this value if you need to work with larger datasets, but be aware of potential performance implications.
  15. max_size_saved_data: int = 1000 (optional)
  • Defines the maximum allowed memory size (in megabytes, MB) for the combined dataset when saving locally.
  • Before saving, the function calculates the memory usage of the full dataset.
  • If the dataset exceeds this threshold and upload_data_to_gooder is False, the operation is aborted and an exception is raised.
  • This is a safety limit to prevent excessively large local files that could impact system performance.
  • Default value is 1000MB (approximately 1GB), which allows for much larger local datasets compared to uploads.
  • Increase this value if you need to work with very large datasets locally, but be aware of system memory constraints.
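Putting the parameters above together, a sketch of a call might look like the following. All names and values here are illustrative, not taken from the package; the call itself is shown commented out because it requires trained models and a Gooder AI account:

```python
# Illustrative parameter values; every name below is hypothetical.
view_meta = {
    "mode": "private",          # access control for the shared view
    "view_id": None,            # None -> a new view is created
    "dataset_name": "fraud_test_set",
}

auth_credentials = {
    "email": "user@example.com",
    "password": "your-password",
}

column_names = {
    "dataset_column_names": ["amount", "age", "n_transactions"],
    "dependent_variable_name": "is_fraud",
}

kwargs = dict(
    view_meta=view_meta,
    auth_credentials=auth_credentials,
    model_names=["xgb_model", "logreg_model"],
    column_names=column_names,
    included_columns=["amount"],   # score columns and the target are always kept
    upload_data_to_gooder=True,
    upload_config_to_gooder=True,
    max_size_uploaded_data=10,     # MB
)

# The actual call, with trained models and a starter config:
# from gooder_ai import valuate_model
# from gooder_ai.configs import load_starter_config
# result = valuate_model(
#     models=[model_a, model_b], x_data=X_test, y=y_test,
#     config=load_starter_config(), **kwargs,
# )
```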

Summary

  • The function takes one or more models, a dataset, user credentials, and configuration details.
  • If either the data or the config is set to be uploaded, it authenticates with the Gooder AI platform, validates the config, uploads the files, and then either creates a new shared view or updates an existing one.
  • Finally, it returns the view ID and URL, allowing users to access model evaluation results.

Logging configuration

To configure logging in your notebook, add the following code:

import logging
import sys

logging.basicConfig(
    format='%(asctime)s | %(levelname)s : %(message)s',
    level=logging.INFO,
    stream=sys.stdout
)

Log Levels

The logger supports three levels of verbosity:

  1. ERROR: Only prints error logs
  2. INFO: Prints information logs and error logs (default)
  3. DEBUG: Verbose mode that prints all logs, including warnings

By default, sample notebooks are configured to use the INFO level. You can adjust this level based on your requirements.
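For example, to switch a notebook between verbosity levels after the configuration above, adjust the root logger level using the standard logging API:

```python
import logging
import sys

# Same configuration as shown above, starting at INFO.
logging.basicConfig(
    format='%(asctime)s | %(levelname)s : %(message)s',
    level=logging.INFO,
    stream=sys.stdout,
)

root = logging.getLogger()
root.setLevel(logging.DEBUG)    # verbose mode: print all logs, including warnings
# root.setLevel(logging.ERROR)  # or: only print error logs
```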


Custom Model Wrappers

To work with PyTorch models, the Gooder AI package provides a ModelWrapper abstract base class, which lets valuate_model handle PyTorch models internally the same way it handles scikit-learn and XGBoost models.

Using ModelWrapper

The ModelWrapper class provides a standardized interface that any model can implement:

from gooder_ai import ModelWrapper

class YourCustomModel(ModelWrapper):
    def predict_proba(self, x):
        """Must return probability predictions as a numpy array."""
        ...

    @property
    def classes_(self):
        """Must return an array of class labels."""
        ...
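As a concrete illustration, here is a toy wrapper that satisfies the interface. The ABC below is a local stand-in so the sketch runs standalone; in practice you would subclass gooder_ai's ModelWrapper, and predict_proba would call your trained PyTorch model:

```python
import numpy as np
from abc import ABC, abstractmethod

# Local stand-in mirroring the interface; in practice:
# from gooder_ai import ModelWrapper
class ModelWrapper(ABC):
    @abstractmethod
    def predict_proba(self, x): ...
    @property
    @abstractmethod
    def classes_(self): ...

class ConstantModelWrapper(ModelWrapper):
    """Toy model: returns the same fixed class probabilities for every row."""

    def __init__(self, probs, classes):
        self._probs = np.asarray(probs, dtype=float)
        self._classes = np.asarray(classes)

    def predict_proba(self, x):
        # One row of probabilities per input row, as a numpy array.
        return np.tile(self._probs, (len(x), 1))

    @property
    def classes_(self):
        return self._classes

wrapper = ConstantModelWrapper([0.9, 0.1], classes=[0, 1])
proba = wrapper.predict_proba([[1.0], [2.0], [3.0]])
```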

Sample Workbook: Using valuate_model with PyTorch Models


Common Issues

  1. Mismatch in column names: Ensure that the number of column names matches the dataset shape.
  2. Invalid model type: Ensure that the model conforms to the ScikitModel or XGBoost interface and implements a scoring function (e.g., predict_proba).
  3. Authentication failure: Double-check credentials and the Gooder AI endpoint URL.
  4. Dataset size limits: If you encounter size-related errors, adjust the max_size_uploaded_data or max_size_saved_data parameters.
  5. Model naming issues: Ensure that the model_names list has the same length as the models list; otherwise, default names are generated.

Running within Databricks

1. Sample notebook for Databricks

  • This version is ready for use in Databricks and does not contain any %pip install commands.
  • The %pip install commands are removed because:
    • They can cause cold start issues in Databricks
    • They may conflict with cluster-level package management
    • They can lead to inconsistent environments across users
    • Databricks best practices recommend managing dependencies at the cluster level

2. Setting Up the Databricks Environment

  • Use Environment version 2
  • Dependencies:
    • Ensure the following packages are added to your Databricks cluster environment (via the UI, not by the notebook):
      • gooder_ai
      • seaborn
      • matplotlib
      • xgboost
      • numpy
      • pandas
      • scikit-learn
  • Cluster State:
    • Wait for the cluster to show a "Connected" state before running any cells.

3. Handling Large Data Files

  • The Databricks workspace has a 500 MB file size limit for uploads and downloads.
  • For datasets larger than 500 MB:
    • Split them into multiple smaller files.
    • Upload the split files to Databricks.
    • Add a cell in your notebook to combine the files into a single DataFrame.
    • Pass the combined DataFrame to valuate_model.
    • Operations will fail if they attempt to:
      • Create a file exceeding 500 MB in the workspace
      • Upload a file larger than 500 MB to the workspace
      • Download a file larger than 500 MB into the workspace.
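The split-and-recombine steps above can be sketched as follows (file names, part count, and the toy dataset are illustrative; pandas is assumed to be available, as in the cluster setup above):

```python
import math
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for a large dataset; in practice this is your full test set.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

out_dir = Path(tempfile.mkdtemp())
n_parts = 3  # choose so each part file stays under the 500 MB workspace limit
rows_per_part = math.ceil(len(df) / n_parts)

# Split into part files (each would be uploaded to Databricks separately).
for i in range(n_parts):
    chunk = df.iloc[i * rows_per_part:(i + 1) * rows_per_part]
    chunk.to_csv(out_dir / f"data_part{i}.csv", index=False)

# In the notebook: combine the uploaded parts back into one DataFrame.
parts = sorted(out_dir.glob("data_part*.csv"))
combined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
# combined can now be passed to valuate_model as x_data / y.
```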

Important:

After successful execution, the notebook will provide you with:

  • A configuration file to be used with Gooder AI
  • A CSV file containing the scored test data

If you have not instructed valuate_model to pass these files through the cloud to the Gooder AI application, you must download them locally and then upload them to the Gooder AI application to visualize the business performance of your models.

Note

  • valuate_model can be configured to reduce the size of the output CSV file by using the included_columns parameter to specify which columns to include.
