
LLM-based automated dataset standardization and evaluation framework


dataset-preprocessing-agent

Automated dataset standardization using LLM agents.

HuggingFace datasets each use their own column names (tweet_text, review_body, sentence1, …), so reusing a model across tasks usually requires manual column mapping. This library automates that step: given a raw dataset, an LLM inspects a small sample and produces a JSON mapping from a canonical schema to the raw column names. The predicted mappings are evaluated against Unitxt and tasksource ground truths.


Installation

pip install dataset-preprocessing-agent

For notebook visualization support:

pip install "dataset-preprocessing-agent[notebook]"

Requires Python 3.10 or later.

For the API backend, export your OpenRouter key:

export OPENROUTER_API_KEY="your_key_here"

Quick Start

from dataset_preprocessing_agent.standardize_api import load_standardized_dataset

result = load_standardized_dataset("glue", config="sst2")
print(result["mapping"])
# {"task": "classification", "text": "sentence", "label": "label"}
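Because the mapping comes from an LLM, it can be worth sanity-checking before use. The following is a minimal sketch, not part of the library: check_mapping is a hypothetical helper, and the validation rules (exact field sets, every mapped column must exist) are assumptions based on the canonical fields listed in this README.

```python
import json

# Canonical field names taken from this README; the validation rules are
# assumptions made for this sketch, not the library's documented behavior.
SINGLE_TEXT = {"task", "text", "label"}
PAIR_TEXT = {"task", "text_a", "text_b", "label"}


def check_mapping(raw_json: str, raw_columns: list[str]) -> dict[str, str]:
    """Parse an LLM reply and verify it looks like a usable column mapping."""
    mapping = json.loads(raw_json)
    if not isinstance(mapping, dict):
        raise ValueError("mapping must be a JSON object")
    keys = set(mapping)
    if keys != SINGLE_TEXT and keys != PAIR_TEXT:
        raise ValueError(f"unexpected canonical fields: {sorted(keys)}")
    # Every field except "task" should point at a column that really exists.
    missing = [v for k, v in mapping.items() if k != "task" and v not in raw_columns]
    if missing:
        raise ValueError(f"mapping refers to unknown columns: {missing}")
    return mapping


reply = '{"task": "classification", "text": "sentence", "label": "label"}'
mapping = check_mapping(reply, ["sentence", "label", "idx"])
print(mapping["text"])  # sentence
```

A failed check raises ValueError, which a caller could use as a signal to retry the LLM call or fall back to a baseline.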

Evaluate against Unitxt ground truth

from dataset_preprocessing_agent.eval import evaluate

result = evaluate(hf_name="glue", hf_config="sst2", card_id="sst2")
print(result["score"])        # Jaccard similarity of predicted vs. ground-truth columns
print(result["gt_cols"])      # e.g. ['label', 'sentence']
print(result["pred_cols"])    # e.g. ['label', 'sentence']
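
According to the Architecture section, the score is a Jaccard similarity between the predicted and ground-truth column sets. As a rough illustration of how such a score relates to gt_cols and pred_cols, here is a standalone sketch. The function name jaccard and the handling of two empty sets are assumptions, not the library's code.

```python
def jaccard(pred_cols, gt_cols) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of column names."""
    pred, gt = set(pred_cols), set(gt_cols)
    union = pred | gt
    if not union:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(pred & gt) / len(union)


# Identical column sets give a perfect score.
print(jaccard(["label", "sentence"], ["label", "sentence"]))  # 1.0

# Missing one of three ground-truth columns: 2 shared out of 3 in the union.
print(jaccard(["label", "sentence1"], ["label", "sentence1", "sentence2"]))
```

Because the comparison is on sets, column order and duplicates do not affect the score.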

Evaluate against tasksource ground truth

from dataset_preprocessing_agent.eval_ts import evaluate_ts

result = evaluate_ts("glue", "rte")
print(result["score"])
print(result["ts_gt"])        # GT mapping from tasksource preprocessing

Architecture

The pipeline runs in three stages:

  1. Standardization: an LLM inspects 5–10 raw samples and outputs a JSON mapping of raw column names to canonical fields (task, text / text_a + text_b, label).
  2. Mapping application: apply_llm_mapping renames columns and converts integer labels to class name strings.
  3. Evaluation: the predicted raw column set is compared to the ground-truth column set using Jaccard similarity on raw HuggingFace column names.
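
To make stage 2 concrete, the sketch below applies a mapping to plain Python rows. It does not reproduce apply_llm_mapping, whose signature is not shown in this README. The function apply_mapping_sketch, the choice to drop unmapped columns, and the example class names are all assumptions for illustration.

```python
def apply_mapping_sketch(rows, mapping, class_names):
    """Rename raw columns to canonical names and turn integer labels into strings.

    rows:        list of dicts keyed by raw column name
    mapping:     canonical field -> raw column name (plus a "task" entry)
    class_names: list where index i is the name of integer label i
    """
    # Invert canonical -> raw into raw -> canonical; "task" is not a column.
    rename = {raw: canon for canon, raw in mapping.items() if canon != "task"}
    out = []
    for row in rows:
        # Assumption: columns absent from the mapping (e.g. "idx") are dropped.
        new_row = {rename[col]: value for col, value in row.items() if col in rename}
        if isinstance(new_row.get("label"), int):
            new_row["label"] = class_names[new_row["label"]]
        out.append(new_row)
    return out


mapping = {"task": "classification", "text": "sentence", "label": "label"}
rows = [
    {"sentence": "a charming film", "label": 1, "idx": 0},
    {"sentence": "dull and lifeless", "label": 0, "idx": 1},
]
standardized = apply_mapping_sketch(rows, mapping, ["negative", "positive"])
print(standardized[0])  # {'text': 'a charming film', 'label': 'positive'}
```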

Backends

Module             Backend
standardize_api    Cloud LLM via OpenRouter API
standardize_local  Local HuggingFace model

Baselines

Baseline                  Method
baseline_keyword_match    Synonym dictionary matching
baseline_embedding_match  Cosine similarity via all-MiniLM-L6-v2
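
For intuition, synonym dictionary matching can be sketched in a few lines. This is not the code in baseline_keyword_match: the SYNONYMS table and the first-match rule below are hypothetical choices made for this example.

```python
# Hypothetical synonym table: canonical field -> raw column names treated as equivalent.
SYNONYMS = {
    "text": {"text", "sentence", "review_body", "tweet_text", "content"},
    "label": {"label", "labels", "target", "class", "sentiment"},
}


def keyword_match(raw_columns):
    """Map each canonical field to the first raw column found in its synonym set."""
    mapping = {}
    for canon, names in SYNONYMS.items():
        for col in raw_columns:
            if col.lower() in names:  # case-insensitive comparison
                mapping[canon] = col
                break
    return mapping


print(keyword_match(["idx", "sentence", "label"]))
# {'text': 'sentence', 'label': 'label'}
```

A dictionary baseline like this only recognizes column names someone has listed in advance, which is the gap the embedding baseline and the LLM approach try to close.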

Evaluation backends

Module   Ground truth
eval     Unitxt task cards
eval_ts  tasksource preprocessing objects

Dependencies

Package                            Purpose
unitxt                             Ground-truth task cards
tasksource                         Ground-truth preprocessing objects
datasets                           HuggingFace dataset loading
transformers + torch + accelerate  Local model inference
openai                             OpenRouter API client
sentence-transformers              Embedding baseline
pandas                             Result DataFrames

