
A dataset processor for evaluating samples through OpenAI-compatible endpoints.

Project description

OAI Dataset Processor

OAI Dataset Processor is a modular framework for processing large datasets using OpenAI-compatible endpoints. It provides SQL-based job persistence, worker-limited task distribution, and JSON schema validation.

Installation

pip install oai-dataset-processor

Key Features

  • Job Persistence: Uses SQLite by default, configurable to any SQLAlchemy database
  • Bulk Processing: Process multiple samples through OpenAI-compatible endpoints
  • Async Execution: Semaphore-based worker limits for efficient job execution
  • JSON Schema Validation: Enforce structured outputs using JSON schemas
  • Progress Monitoring: Live progress bar for async tasks
  • Extensibility: Easy to extend for custom storage or processing logic
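The semaphore-based worker limiting mentioned above can be sketched with asyncio. This is an illustrative pattern, not the package's actual implementation: at most `workers` coroutines hold a slot at once, so in-flight API calls stay bounded.

```python
import asyncio

async def process_sample(sem: asyncio.Semaphore, sample: str) -> str:
    # Acquire a worker slot; at most `workers` coroutines proceed at a time.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for an OpenAI API call
        return sample.upper()

async def run_job(samples: list[str], workers: int) -> list[str]:
    sem = asyncio.Semaphore(workers)
    # gather() preserves input order regardless of completion order.
    return list(await asyncio.gather(*(process_sample(sem, s) for s in samples)))

results = asyncio.run(run_job(["a", "b", "c", "d"], workers=2))
print(results)  # ['A', 'B', 'C', 'D']
```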

Quick Start

from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Define output schema
class SampleResponse(BaseModel):
    grade: int
    coherence: int

# Prepare samples
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr"
]

job_samples = [
    create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions="Grade the sentence for grammar and coherence (1-10 each)",
        input_data=sample,
        output_json_schema=SampleResponse.model_json_schema(),
        sample_id=idx
    ) for idx, sample in enumerate(samples)
]

# Process samples
processor = OpenAIDatasetProcessor(
    base_url="YOUR_BASE_URL_HERE",
    api_key="YOUR_API_KEY_HERE",
    workers=20
)

processor.ingest_samples(job_samples)
results = processor.run_job("job_123")

# Export results
results.to_jsonl("output_results.jsonl")
print(processor.get_job_status("job_123"))
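Assuming `to_jsonl` writes one JSON object per line with fields mirroring the output schema (an assumption; check the actual export format), the results file can be post-processed with the standard library alone. The snippet below creates a small stand-in file so it runs on its own:

```python
import json

# Stand-in export: one JSON object per line, fields mirroring SampleResponse.
with open("output_results.jsonl", "w") as f:
    f.write('{"sample_id": 0, "grade": 9, "coherence": 8}\n')
    f.write('{"sample_id": 1, "grade": 4, "coherence": 5}\n')

# Read the JSONL export back and compute a simple aggregate.
with open("output_results.jsonl") as f:
    rows = [json.loads(line) for line in f]

avg_grade = sum(r["grade"] for r in rows) / len(rows)
print(avg_grade)  # 6.5
```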

Configuration

  • Database: Defaults to sqlite:///datasetrunner.sqlite; point at any SQLAlchemy database via the db_url parameter of OpenAIDatasetProcessor
  • Parallelism: Set concurrent workers via the workers parameter
  • Schema Validation: Define output schemas using Pydantic models
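As a rough illustration of what JSON-schema enforcement involves, here is a minimal standard-library check against required fields and types. The package itself relies on Pydantic/JSON Schema; the field names below simply mirror the Quick Start schema:

```python
# Simplified JSON-schema-style check (illustrative, not the package's validator).
schema = {
    "required": ["grade", "coherence"],
    "properties": {
        "grade": {"type": "integer"},
        "coherence": {"type": "integer"},
    },
}

def validate(response: dict, schema: dict) -> bool:
    type_map = {"integer": int, "string": str}
    # Every required field must be present...
    for key in schema["required"]:
        if key not in response:
            return False
    # ...and every present field must have the declared type.
    for key, spec in schema["properties"].items():
        if key in response and not isinstance(response[key], type_map[spec["type"]]):
            return False
    return True

print(validate({"grade": 7, "coherence": 9}, schema))  # True
print(validate({"grade": "seven"}, schema))            # False
```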

Dependencies

  • openai
  • tqdm
  • pandas
  • sqlalchemy
  • pydantic

Contributing

Contributions are welcome! Please submit PRs for features, optimizations, or documentation.

Download files

Download the file for your platform.

Source Distribution

oai_dataset_processor-0.1.1.tar.gz (12.3 kB)

Uploaded Source

Built Distribution


oai_dataset_processor-0.1.1-py3-none-any.whl (12.9 kB)

Uploaded Python 3

File details

Details for the file oai_dataset_processor-0.1.1.tar.gz.

File metadata

  • Download URL: oai_dataset_processor-0.1.1.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for oai_dataset_processor-0.1.1.tar.gz

  • SHA256: 387efe12f086b5047cfb27bc075c4f3c830ef443592c8cfad480c975153eaf4c
  • MD5: 2114de45e979f3c730608cee34e5b1f3
  • BLAKE2b-256: 7cf4c1c3414c6a93341897b4f3b553519d39e064839fb1e88631d43e532c094b


File details

Details for the file oai_dataset_processor-0.1.1-py3-none-any.whl.

File hashes

Hashes for oai_dataset_processor-0.1.1-py3-none-any.whl

  • SHA256: baf0f1fcc4591fbb1a53627eadea97e80a6094d770b817a7d5d60a6fd68697f8
  • MD5: 26ad023a90a3d8937fa4ae85c9c3b0c9
  • BLAKE2b-256: d14c39ebe3353d070512fad0680bed5a8bef77b09068b26166e6a20f8879dbe4

