LLM-based automated dataset standardization and evaluation framework
dataset-preprocessing-agent
Automated dataset standardization using LLM agents.
HuggingFace datasets name their columns inconsistently (tweet_text, review_body, sentence1, …), so reusing a model across tasks usually requires manual column mapping. This library automates that step: given a raw dataset, an LLM inspects a small sample and produces a JSON mapping from raw column names to a canonical schema. The predicted mappings can be evaluated against ground truth derived from Unitxt and tasksource.
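The idea above can be illustrated with a minimal sketch. It is not the library's implementation: `build_prompt`, `parse_mapping`, and `fake_llm` are hypothetical names introduced here, the prompt wording is an assumption, and the model call is replaced by a canned reply so the example runs offline.

```python
import json


def build_prompt(sample_rows):
    """Render a few raw rows into an instruction asking for a JSON mapping."""
    columns = sorted({key for row in sample_rows for key in row})
    return (
        "Map the raw columns to the canonical fields task, text "
        "(or text_a and text_b), and label. Reply with JSON only.\n"
        f"Columns: {columns}\n"
        f"Samples: {json.dumps(sample_rows, ensure_ascii=False)}"
    )


def parse_mapping(reply, raw_columns):
    """Parse the model reply and reject columns that are not in the dataset."""
    mapping = json.loads(reply)
    for field, column in mapping.items():
        # "task" holds a task type, not a column name, so it is not checked.
        if field != "task" and column not in raw_columns:
            raise ValueError(f"{field!r} points at unknown column {column!r}")
    return mapping


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return '{"task": "classification", "text": "sentence", "label": "label"}'


rows = [{"sentence": "a charming film", "label": 1, "idx": 0}]
mapping = parse_mapping(fake_llm(build_prompt(rows)), set(rows[0]))
print(mapping)
```

Validating the reply against the real column names is a cheap guard against a model inventing a column that does not exist in the dataset.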
Installation
pip install dataset-preprocessing-agent
For notebook visualization support:
pip install "dataset-preprocessing-agent[notebook]"
Python 3.10+ required.
For the API backend, export your OpenRouter key:
export OPENROUTER_API_KEY="your_key_here"
Quick Start
from dataset_preprocessing_agent.standardize_api import load_standardized_dataset
result = load_standardized_dataset("glue", config="sst2")
print(result["mapping"])
# {"task": "classification", "text": "sentence", "label": "label"}
Evaluate against Unitxt ground truth
from dataset_preprocessing_agent.eval import evaluate
result = evaluate(hf_name="glue", hf_config="sst2", card_id="sst2")
print(result["score"])
print(result["gt_cols"]) # e.g. ['label', 'sentence']
print(result["pred_cols"]) # e.g. ['label', 'sentence']
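Assuming the score is the Jaccard similarity between the two column sets (as described under Architecture), it could be computed as in the sketch below. The `jaccard` helper is a hypothetical name, and treating two empty sets as a perfect match is an assumption, not documented behavior.

```python
def jaccard(gt_cols, pred_cols):
    """Jaccard similarity between ground-truth and predicted raw column sets."""
    gt, pred = set(gt_cols), set(pred_cols)
    if not gt and not pred:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(gt & pred) / len(gt | pred)


# Identical column sets share every element.
full_match = jaccard(["label", "sentence"], ["label", "sentence"])
# Two of three distinct columns are shared.
partial_match = jaccard(["label", "sentence1", "sentence2"], ["label", "sentence1"])
print(full_match, partial_match)
```

Because the comparison is on sets, column order and duplicates do not affect the score.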
Evaluate against tasksource ground truth
from dataset_preprocessing_agent.eval_ts import evaluate_ts
result = evaluate_ts("glue", "rte")
print(result["score"])
print(result["ts_gt"]) # GT mapping from tasksource preprocessing
Architecture
The pipeline runs in three stages:
- Standardization: an LLM inspects 5–10 raw samples and outputs a JSON mapping of raw column names to canonical fields (`task`, `text` / `text_a` + `text_b`, `label`).
- Mapping application: `apply_llm_mapping` renames columns and converts integer labels to class name strings.
- Evaluation: the predicted raw column set is compared to the ground-truth column set using Jaccard similarity on raw HuggingFace column names.
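The mapping-application stage can be sketched on plain Python dictionaries. This is an illustration only: `apply_mapping` is a hypothetical helper, not the library's `apply_llm_mapping`, and dropping unmapped columns is an assumption made to keep the example short.

```python
def apply_mapping(rows, mapping, class_names=None):
    """Rename mapped columns to canonical names and decode integer labels."""
    # "task" names the task type rather than a column, so it is not renamed.
    renames = {raw: canon for canon, raw in mapping.items() if canon != "task"}
    out = []
    for row in rows:
        # Assumption: columns that the mapping does not mention are dropped.
        new_row = {renames[col]: value for col, value in row.items() if col in renames}
        label = new_row.get("label")
        is_int_label = isinstance(label, int) and not isinstance(label, bool)
        if class_names is not None and is_int_label:
            new_row["label"] = class_names[label]
        out.append(new_row)
    return out


mapping = {"task": "classification", "text": "sentence", "label": "label"}
rows = [{"sentence": "a charming film", "label": 1, "idx": 0}]
standardized = apply_mapping(rows, mapping, class_names=["negative", "positive"])
print(standardized)
```

The mapping is stored as canonical field to raw column, so it is inverted once before the rows are renamed.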
Backends
| Module | Backend |
|---|---|
| `standardize_api` | Cloud LLM via OpenRouter API |
| `standardize_local` | Local HuggingFace model |
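OpenRouter exposes an OpenAI-compatible endpoint, so the `openai` client can be pointed at it with a base URL and the exported key. How `standardize_api` configures its client internally is not shown here; the sketch below, with the hypothetical helper `openrouter_client_kwargs`, only illustrates one way to assemble the settings and fail early when the key is missing.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_client_kwargs(env=os.environ):
    """Build keyword arguments for an OpenAI-compatible client (hypothetical helper)."""
    key = env.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {"base_url": OPENROUTER_BASE_URL, "api_key": key}


# A plain dict is passed here so the example runs without a real key.
kwargs = openrouter_client_kwargs({"OPENROUTER_API_KEY": "your_key_here"})
print(kwargs["base_url"])

# With the openai package installed, the client could then be created as:
#   from openai import OpenAI
#   client = OpenAI(**openrouter_client_kwargs())
```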
Baselines
| Baseline | Method |
|---|---|
| `baseline_keyword_match` | Synonym dictionary matching |
| `baseline_embedding_match` | Cosine similarity via all-MiniLM-L6-v2 |
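To make the keyword baseline concrete, here is a minimal sketch of synonym dictionary matching. The `SYNONYMS` table and the `keyword_match` function are hypothetical; the dictionary actually used by `baseline_keyword_match` and its matching rules are not shown in this README.

```python
# Hypothetical synonym lists, chosen only for illustration.
SYNONYMS = {
    "text": ["text", "sentence", "review", "tweet", "content", "body"],
    "label": ["label", "target", "class", "sentiment", "category"],
}


def keyword_match(columns):
    """Map each canonical field to the first column whose name contains a synonym."""
    mapping = {}
    for field, keywords in SYNONYMS.items():
        for column in columns:
            name = column.lower()
            if any(keyword in name for keyword in keywords):
                mapping[field] = column
                break  # keep the first matching column for this field
    return mapping


review_mapping = keyword_match(["idx", "review_body", "stars", "sentiment"])
tweet_mapping = keyword_match(["tweet_text", "label"])
print(review_mapping)
print(tweet_mapping)
```

Substring matching of this kind is brittle (for example, a `context` column would match `text`), which is one reason a baseline like this is useful mainly as a point of comparison for the LLM-based mapping.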
Evaluation backends
| Module | Ground truth |
|---|---|
| `eval` | Unitxt task cards |
| `eval_ts` | tasksource preprocessing objects |
Dependencies
| Package | Purpose |
|---|---|
| `unitxt` | Ground-truth task cards |
| `tasksource` | Ground-truth preprocessing objects |
| `datasets` | HuggingFace dataset loading |
| `transformers` + `torch` + `accelerate` | Local model inference |
| `openai` | OpenRouter API client |
| `sentence-transformers` | Embedding baseline |
| `pandas` | Result DataFrames |
File details
Details for the file dataset_preprocessing_agent-0.1.0.tar.gz.
File metadata
- Download URL: dataset_preprocessing_agent-0.1.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e86ae82fb6b52e9712c0f173cdd023cb5bda6ef651a31f471e63d1324f0c9c71 |
| MD5 | 74d6d248f2caa001347f435348093bbc |
| BLAKE2b-256 | a61989ef853e26683c246a174e2eae994d568adf42a28920c336713125c3b55a |
File details
Details for the file dataset_preprocessing_agent-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dataset_preprocessing_agent-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a926872388764d33b96f15d174c20915929cca6509cce5a1eae4fbb977b1567 |
| MD5 | 52e2d5cfb8d3718814689e10af95a2f1 |
| BLAKE2b-256 | a589b216c7814e6edf00d026899cadfb3aa78238844df7942523b3acc8af6bb7 |