Skip to main content

A Datasette plugin providing an MLOps platform to train, eval and predict machine learning models

Project description

Datasette ML

Bringing Machine Learning models near your data, not the other way around!

Datasette ML is a Datasette plugin providing an MLOps platform to train, evaluate and make predictions from machine learning models.

All the underlying features are provided by sqlite-ml.

PyPI CI/CD Coverage Status License

WARNING: this plugin is still experimental and not ready for production. Some breaking changes might happen between releases before reaching a stable version. Use it at your own risks!

Installation

Install this plugin in the same environment as Datasette:

$ datasette install datasette-ml

Usage

Define configuration within metadata.yml / metadata.json:

plugins:
  datasette-ml:
    db: sqml

A new menu entry is now available, pointing at /-/ml to access the MLOps dashboard.

Configuration properties

Property Type Description
db string Database to store ML models (default is sqml)

Tutorial

Using datasette-ml you can start training Machine Learning models directly along your data, simply by using custom SQL functions! Let's get started by training a classifier against the famous "Iris Dataset" to predict flower types.

Loading the dataset

First let's load our data. For a real world project, your data may live with its own table or being accessed through an SQL view. For the purpose of this tutorial, we can use the sqml_load_dataset function to load standard Scikit-Learn datasets:

SELECT sqml_load_dataset('iris') AS dataset;

It will return the following data:

dataset
{"table": "dataset_iris", "feature_names": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"], "target_names": ["setosa", "versicolor", "virginica"], "size": 150}

The Iris dataset is loaded into a table nammed dataset_iris, containing 150 examples, 4 features and 3 classes to be predicted.

Training a classifier

Now that our dataset is ready, let's train a first machine learning model to perform a classification task using the sqml_train function:

SELECT sqml_train(
  'Iris prediction',
  'classification',
  'logistic_regression',
  'dataset_iris',
  'target'
) AS training;

It will return the following data:

training
{"experiment_name": "Iris prediction", "prediction_type": "classification", "algorithm": "logistic_regression", "deployed": true, "score": 0.9473684210526315}

We have just trained our first machine learning model! The output data informs us that our model has been trained, yields a score of 0.94 and has been deployed.

Performing predictions

Now that we have trained our classifier, let's use it to make predictions!

Predict the target label for the first row of dataset_iris using the sqml_predict function:

SELECT
  dataset_iris.*,
  sqml_predict(
    'Iris prediction',
    json_object(
      'sepal length (cm)', [sepal length (cm)],
      'sepal width (cm)', [sepal width (cm)],
      'petal length (cm)', [petal length (cm)],
      'petal width (cm)', [petal width (cm)]
    )
  ) AS prediction
FROM dataset_iris
LIMIT 1;

This will output the following data:

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target prediction
5.1 3.5 1.4 0.2 0.0 0.0

Yay! Our prediction is matching the target label!

Let's see if we can find some predictions not matching the target label. To perform lots of predictions, we will use sqml_predict_batch which is more efficient than sqml_predict:

SELECT
  dataset_iris.*,
  batch.value AS prediction,
  dataset_iris.target = batch.value AS match
FROM
  dataset_iris
  JOIN json_each (
    (
      SELECT
        sqml_predict_batch(
          'Iris prediction',
          json_group_array(
            json_object(
              'sepal length (cm)', [sepal length (cm)],
              'sepal width (cm)', [sepal width (cm)],
              'petal length (cm)', [petal length (cm)],
              'petal width (cm)', [petal width (cm)]
            )
          )
        )
      FROM
        dataset_iris
    )
  ) batch ON (batch.rowid + 1) = dataset_iris.rowid
WHERE match = FALSE;

This will yield the following output data:

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target prediction match
5.9 3.2 4.8 1.8 1.0 2.0 0
6.7 3.0 5.0 1.7 1.0 2.0 0
6.0 2.7 5.1 1.6 1.0 2.0 0
4.9 2.5 4.5 1.7 2.0 1.0 0

Oh no! 4 predictions have not predicted the correct target label!

Let's see if we can train a better algorithm to enhance the prediction quality.

Training a new model

Let's use a Support Vector Machine algorithm, usually yielding better results compared to the more simplistic Logistic Regression:

SELECT sqml_train(
  'Iris prediction',
  'classification',
  'svc',
  'dataset_iris',
  'target'
) AS training;

This will yield the following data:

training
{"experiment_name": "Iris prediction", "prediction_type": "classification", "algorithm": "svc", "deployed": true, "score": 0.9736842105263158}

We can already see that the score of this new model is higher than the previous one and it has been deployed.

Let's try our new classifier on the same dataset:

SELECT
  dataset_iris.*,
  batch.value AS prediction,
  dataset_iris.target = batch.value AS match
FROM
  dataset_iris
  JOIN json_each (
    (
      SELECT
        sqml_predict_batch(
          'Iris prediction',
          json_group_array(
            json_object(
              'sepal length (cm)', [sepal length (cm)],
              'sepal width (cm)', [sepal width (cm)],
              'petal length (cm)', [petal length (cm)],
              'petal width (cm)', [petal width (cm)]
            )
          )
        )
      FROM
        dataset_iris
    )
  ) batch ON (batch.rowid + 1) = dataset_iris.rowid
WHERE match = FALSE;

This will lead the following results:

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target prediction match
5.9 3.2 4.8 1.8 1.0 2.0 0
6.7 3.0 5.0 1.7 1.0 2.0 0
6.0 2.7 5.1 1.6 1.0 2.0 0

Yay! We manage to predict one more target label with this new model!

Also note that we did not have to do anything to switch to the better model: exactly the same query is used to perform the prediction without having to specify anything about the new model! This is because new models are deployed automatically for the current experiment only if their score outperforms the score of the previously deployed model.

SQL functions

This plugin registers a few SQL functions to perform machine learning model training and predictions:

sqml_load_dataset(name, table)

  • name: str: name of the dataset to load
  • table: str: (optional) custom table name destination for the dataset

sqml_train(experiment_name, prediction_type, algorithm, dataset, target, test_size, split_strategy):

  • experiment_name: str: name of the experiment to train the model within
  • prediction_type: str: prediction task type to be performed for this experiment (regression, classification)
  • algorithm: str: algorithm type to be trained
  • dataset: str: name of the table or view containing the dataset
  • target: str: name of the column to be treated as target label
  • test_size: float: (optional) dataset test size ratio (default is 0.25)
  • split_strategy: str: (optional) dataset train/test split strategy (default is shuffle)

sqml_predict(experiment_name, features)

  • experiment_name: str: name of the experiment to train the model within
  • features: json object: JSON object containing the features

sqml_predict_batch(experiment_name, features)

  • experiment_name: str: name of the experiment to train the model within
  • features: json list: JSON list containing all feature objects

Development

To set up this plugin locally, first checkout the code. Then create a new virtual environment and the required dependencies:

poetry shell
poetry install

To run the QA suite:

black --check datasette_ml tests
flake8 datasette_ml tests
mypy datasette_ml tests
pytest -v --cov=datasette_ml --cov=tests --cov-branch --cov-report=term-missing tests

Demo

With the developmnent environment setup, you can run the demo locally:

python demo/generate.py
datasette --metadata demo/metadata.yml demo/sqml.db

Inspiration

All the things on the internet that have been inspiring this project:

License

Licensed under Apache License, Version 2.0

Copyright (c) 2023 - present Romain Clement

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datasette_ml-0.1.2.tar.gz (8.8 kB view details)

Uploaded Source

Built Distribution

datasette_ml-0.1.2-py3-none-any.whl (9.4 kB view details)

Uploaded Python 3

File details

Details for the file datasette_ml-0.1.2.tar.gz.

File metadata

  • Download URL: datasette_ml-0.1.2.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for datasette_ml-0.1.2.tar.gz
Algorithm Hash digest
SHA256 d47438bfb2acc2e3c5ada7a49b949abeb26cae6686dff80760490d5352a24af7
MD5 f49e8efb3012cea8787ea581bd0fb66f
BLAKE2b-256 70c66d228d0ae49216e753af46ec552d0c544f9d360afdd4fa8fa62e64b3ba47

See more details on using hashes here.

File details

Details for the file datasette_ml-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: datasette_ml-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Linux/5.15.0-1037-azure

File hashes

Hashes for datasette_ml-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2372c82c06dbeb20bdcc88960dc6028d61c467d821de0b15981de050ec13ecde
MD5 0cbf43a3c849fd6800fd623b7743b804
BLAKE2b-256 0a1e60f828b037d854cf03f012875640e13d44a410bce1ed55798f102760ce21

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page