Check for data drift with OAI data
Project description
ft-drift
ft-drift
helps you check for data drift by comparing two OpenAI
multi-turn chat jsonl
files.
Install
pip install ft_drift
Background
Checking for dataset drift can help you debug if:
- Your model is trained on data that doesn’t reflect production (different prompts, functions, etc).
- Your training data contains unexpected or accidental artifacts.
In either situation, you can compare data from relevant sources (i.e. production vs fine-tuning) to find unwanted changes. This is one of the most common source of errors when fine-tuning models!
The demo below shows a cli tool used to detect data drift between two
files, file_a.jsonl
and file_b.jsonl
. Afterwards, a table of
important tokens that account for the drift are shown, such as:
END-UI-FORMAT
UI-FORMAT
- “```json”
- etc.
Currently, ft_drift
only detects drift in prompt templates, schemas
and other token-based drift (as opposed to semantic drift).
Usage
After installing ft_drift
, the cli command detect_drift
will be
available to you.
How Does it Work?
This works by doing the following steps:
- Fit a binary classifier (random forest) to discriminate between two datasets.
- If the classifier can predict a material difference (ex: AUC >= 0.60) then we know there is drift (something is systematically different b/w the two datasets).
- We show the most important features from the classifier which are tokens (segments of text) to help you debug what is different.
If this tool doesn’t detect drift, it doesn’t mean drift doesn’t exist. It just means we didn’t find it. For more background on this approach, see this slide from my talk on MLOps tools:
TODO
Other things that could be added:
- Semantic drift by incorporating embeddings.
- More features: length of messages, # of turns etc.
- Wiring up the function definition diff to the CLI (I don’t need this yet for my use case).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file ft-drift-0.0.13.tar.gz
.
File metadata
- Download URL: ft-drift-0.0.13.tar.gz
- Upload date:
- Size: 13.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e19df9da79362e83cc800b67a1648685f4932f93ad95953fd7560f6562ff917e |
|
MD5 | 1c129cd5b79c9e799f8f28b6ccb65678 |
|
BLAKE2b-256 | 915e8993172c38d56eb702898717c4414ded66b9b94dd73e907c25038edcadf8 |
File details
Details for the file ft_drift-0.0.13-py3-none-any.whl
.
File metadata
- Download URL: ft_drift-0.0.13-py3-none-any.whl
- Upload date:
- Size: 12.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | dc204c7ddb4eb367fdeed69d3599be8d89412a08694a9efd773c46efca69e630 |
|
MD5 | dd116bdd4e9717349724af2f64b5f91d |
|
BLAKE2b-256 | cced7f2ce627c1ff40f3c55060650f2efe33dada52d91aab24e90dba5bed68c1 |