LLM-based automated dataset standardization and evaluation framework

dataset-preprocessing-agent

Automated dataset standardization using LLM agents.

HuggingFace datasets name their columns differently (tweet_text, review_body, sentence1, …), so reusing a model across tasks usually requires manual mapping work. This library automates that step: given a raw dataset, an LLM inspects a small sample and produces a JSON mapping between the raw column names and a canonical schema. The resulting mappings can be scored against Unitxt and tasksource ground truths.

Installation

pip install dataset-preprocessing-agent

For notebook visualization support:

pip install "dataset-preprocessing-agent[notebook]"

Python 3.10+ required.

For the API backend, export your OpenRouter key:

export OPENROUTER_API_KEY="your_key_here"

Quick Start

from dataset_preprocessing_agent.standardize_api import load_standardized_dataset

result = load_standardized_dataset("glue", config="sst2")
print(result["mapping"])
# {"task": "classification", "text": "sentence", "label": "label"}

Evaluate against Unitxt ground truth

from dataset_preprocessing_agent.eval import evaluate

result = evaluate(hf_name="glue", hf_config="sst2", card_id="sst2")
print(result["score"])        # Jaccard similarity between gt_cols and pred_cols
print(result["gt_cols"])      # e.g. ['label', 'sentence']
print(result["pred_cols"])    # e.g. ['label', 'sentence']

Evaluate against tasksource ground truth

from dataset_preprocessing_agent.eval_ts import evaluate_ts

result = evaluate_ts("glue", "rte")
print(result["score"])
print(result["ts_gt"])        # GT mapping from tasksource preprocessing

Architecture

The pipeline runs in three stages:

1. Standardization: an LLM inspects 5–10 raw samples and outputs a JSON mapping between raw column names and the canonical fields (task, label, and either text or the pair text_a and text_b).
2. Mapping application: apply_llm_mapping renames columns and converts integer labels to class-name strings.
3. Evaluation: the predicted set of raw columns is compared to the ground-truth set using Jaccard similarity, computed on the raw HuggingFace column names.
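The evaluation metric can be illustrated with a few lines of Python. Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The function name `jaccard` and the handling of two empty sets are assumptions made for this sketch, not taken from the library.

```python
def jaccard(pred_cols: list[str], gt_cols: list[str]) -> float:
    """Jaccard similarity between two collections of raw column names."""
    pred, gt = set(pred_cols), set(gt_cols)
    if not pred and not gt:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(pred & gt) / len(pred | gt)


# Identical column sets give a perfect score.
perfect = jaccard(["label", "sentence"], ["sentence", "label"])

# Missing one of three ground-truth columns: 2 shared / 3 total.
partial = jaccard(["sentence1", "label"], ["sentence1", "sentence2", "label"])

print(perfect, round(partial, 3))
```

Because the comparison is on sets, column order does not matter, and a prediction is penalized both for missing ground-truth columns and for including extra ones.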

Backends

Module	Backend
`standardize_api`	Cloud LLM via OpenRouter API
`standardize_local`	Local HuggingFace model

Baselines

Baseline	Method
`baseline_keyword_match`	Synonym dictionary matching
`baseline_embedding_match`	Cosine similarity via `all-MiniLM-L6-v2`
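As a rough picture of what synonym dictionary matching can look like, the sketch below maps raw columns to canonical fields by exact lookup in a small hand-written dictionary. The `SYNONYMS` table and the `keyword_match` function are hypothetical and are not the contents or signature of `baseline_keyword_match`.

```python
# Hypothetical synonym table: canonical field -> raw names treated as equivalent.
SYNONYMS: dict[str, set[str]] = {
    "text": {"text", "sentence", "review_body", "tweet_text", "content"},
    "label": {"label", "labels", "target", "class"},
}


def keyword_match(columns: list[str]) -> dict[str, str]:
    """Map each canonical field to the first raw column found in its synonym set."""
    mapping: dict[str, str] = {}
    for canonical, synonyms in SYNONYMS.items():
        for col in columns:
            if col.lower() in synonyms:
                mapping[canonical] = col
                break  # keep the first match for this canonical field
    return mapping


sst2_like = keyword_match(["idx", "sentence", "label"])
tweet_like = keyword_match(["tweet_text", "target"])
unknown = keyword_match(["premise", "hypothesis"])

print(sst2_like)
print(tweet_like)
print(unknown)
```

A baseline of this kind only recognizes names listed in the table, so columns such as `premise` and `hypothesis` stay unmapped unless the dictionary is extended.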

Evaluation backends

Module	Ground truth
`eval`	Unitxt task cards
`eval_ts`	tasksource preprocessing objects

Dependencies

Package	Purpose
`unitxt`	Ground-truth task cards
`tasksource`	Ground-truth preprocessing objects
`datasets`	HuggingFace dataset loading
`transformers` + `torch` + `accelerate`	Local model inference
`openai`	OpenRouter API client
`sentence-transformers`	Embedding baseline
`pandas`	Result DataFrames

Project details

This version

0.1.0

Feb 26, 2026

Download files

Source Distribution

dataset_preprocessing_agent-0.1.0.tar.gz (16.2 kB)

Uploaded Feb 26, 2026 Source

Built Distribution

dataset_preprocessing_agent-0.1.0-py3-none-any.whl (20.1 kB)

Uploaded Feb 26, 2026 Python 3

File details

Details for the file dataset_preprocessing_agent-0.1.0.tar.gz.

File metadata

Download URL: dataset_preprocessing_agent-0.1.0.tar.gz
Upload date: Feb 26, 2026
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for dataset_preprocessing_agent-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e86ae82fb6b52e9712c0f173cdd023cb5bda6ef651a31f471e63d1324f0c9c71`
MD5	`74d6d248f2caa001347f435348093bbc`
BLAKE2b-256	`a61989ef853e26683c246a174e2eae994d568adf42a28920c336713125c3b55a`

File details

Details for the file dataset_preprocessing_agent-0.1.0-py3-none-any.whl.

File metadata

Download URL: dataset_preprocessing_agent-0.1.0-py3-none-any.whl
Upload date: Feb 26, 2026
Size: 20.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for dataset_preprocessing_agent-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3a926872388764d33b96f15d174c20915929cca6509cce5a1eae4fbb977b1567`
MD5	`52e2d5cfb8d3718814689e10af95a2f1`
BLAKE2b-256	`a589b216c7814e6edf00d026899cadfb3aa78238844df7942523b3acc8af6bb7`
