Check for data drift with OAI data

ft-drift

ft-drift helps you check for data drift by comparing two OpenAI multi-turn chat jsonl files.
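As a rough illustration of the input, the sketch below reads a chat jsonl file in which each line is a JSON object with a "messages" list of role/content pairs (the OpenAI chat fine-tuning layout) and flattens each conversation into one string. The helper names load_chat_jsonl and flatten_messages are hypothetical, and the "role: content" flattening is an assumption made for illustration; ft_drift may prepare its text differently.

```python
import json


def load_chat_jsonl(path):
    """Read one chat example per line, skipping blank lines."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


def flatten_messages(example):
    """Join a conversation into a single string of 'role: content' lines.

    Hypothetical flattening scheme, not necessarily what ft_drift does.
    """
    parts = []
    for msg in example.get("messages", []):
        # "content" may be null, e.g. when the assistant makes a function call.
        content = msg.get("content") or ""
        parts.append(f"{msg.get('role', '')}: {content}")
    return "\n".join(parts)
```

Flattened strings like these are the kind of text in which token-level differences, such as a changed prompt template, become visible.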

Install

pip install ft_drift

Background

Checking for dataset drift can help you debug if:

  1. Your model is trained on data that doesn’t reflect production (different prompts, functions, etc.).
  2. Your training data contains unexpected or accidental artifacts.

In either situation, you can compare data from the relevant sources (e.g., production vs. fine-tuning) to find unwanted changes. This is one of the most common sources of errors when fine-tuning models!

The demo shows the CLI tool detecting data drift between two files, file_a.jsonl and file_b.jsonl. Afterwards, it displays a table of the important tokens that account for the drift, such as:

  • END-UI-FORMAT
  • UI-FORMAT
  • “```json”
  • etc.

Currently, ft_drift only detects drift in prompt templates, schemas and other token-based drift (as opposed to semantic drift).

Usage

After installing ft_drift, the CLI command detect_drift will be available to you.

How Does it Work?

The tool works in three steps:

  1. Fit a binary classifier (random forest) to discriminate between the two datasets.
  2. If the classifier can tell the datasets apart by a material margin (e.g., AUC >= 0.60), then we know there is drift: something is systematically different between the two datasets.
  3. Show the most important features from the classifier, which are tokens (segments of text), to help you debug what is different.

If this tool doesn’t detect drift, it doesn’t mean drift doesn’t exist. It just means we didn’t find it. For more background on this approach, see this slide from my talk on MLOps tools:

TODO

Other things that could be added:

  • Semantic drift by incorporating embeddings.
  • More features: length of messages, number of turns, etc.
  • Wiring up the function definition diff to the CLI (I don’t need this yet for my use case).
