A sample dataset processor for evaluating datasets.

OAI Dataset Processor

OAI Dataset Processor is a modular, scalable, and fault-tolerant framework for processing large datasets or task batches against OpenAI-compatible endpoints. It provides job persistence through SQL databases, task distribution with configurable worker limits, and JSON-schema-based output validation.


Key Features

  • Job Persistence: Uses SQLite by default (configurable) to ensure job data survives crashes or errors.
  • Bulk Processing: Supports ingestion, storage, and processing of multiple samples using an OpenAI-compatible endpoint.
  • Async Execution: Uses Python’s asyncio with semaphore-based worker limits for efficient job execution.
  • JSON Schema Validation: Enforces structured outputs using flexible JSON schema definitions.
  • Progress Monitoring: Displays async task progress with a live progress bar.
  • Extensibility: Easy to expand for custom storage backends (e.g., Postgres) or additional processing logic.

Installation

To install from source, clone the repository and install the required dependencies:

pip install -r requirements.txt
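
A built distribution is also published on PyPI (see Download files below), so the package should be installable directly as well; assuming the published name matches the distribution files:

pip install oai_dataset_processor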

Usage

Code Example

from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Configure instructions and JSON schema
sample_instructions = "Please grade the sentence for grammar and coherence, 1-10 for each, respond with JSON."

class SampleResponse(BaseModel):
    grade: int
    coherence: int

json_schema = SampleResponse.model_json_schema()

# Prepare input samples
input_samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr"
]

samples = []
for idx, input_sample in enumerate(input_samples):
    samples.append(create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions=sample_instructions,
        input_data=input_sample,
        output_json_schema=json_schema,
        sample_id=idx
    ))

# Create the processor and ingest samples
processor = OpenAIDatasetProcessor(
    base_url="http://api.openai.compatible.endpoint/api",
    api_key="YOUR_API_KEY",
    workers=20
)
processor.ingest_samples(samples)

# Run the job and retrieve results
results = processor.run_job("job_123")

# Save results to JSONL
results.to_jsonl("output_results.jsonl")

# View Job Status
print(processor.get_job_status("job_123"))
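
Because results are written as JSON Lines, they can be loaded for downstream analysis with pandas (already a dependency of this project). A minimal sketch, assuming the output file written above:

import pandas as pd

# Each line of the JSONL file holds one processed sample
df = pd.read_json("output_results.jsonl", lines=True)
print(df.head())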

Configuration

  • Job Persistence: By default, jobs are stored in sqlite:///datasetrunner.sqlite. To use a different database:

    • Pass the desired db_url to OpenAIDatasetProcessor or StorageHandler (see the sketch under Key Classes and Functions below).
    • Supported databases: SQLite, PostgreSQL, and other SQLAlchemy-compatible backends.
  • Async Execution and Parallelism: Configure the number of concurrent workers with the workers parameter. The limit is enforced via Python’s asyncio.Semaphore, as illustrated in the sketch after this list.

  • Custom Output Validation: Define reusable validation schemas using Pydantic models. For instance, the SampleResponse model above provides a consistent structure for score-based outputs.
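
To make the worker-limit mechanism concrete, here is a minimal, self-contained sketch of the general asyncio.Semaphore pattern; it is illustrative only, not the library’s internal code:

import asyncio

async def process_sample(sample_id: int, limit: asyncio.Semaphore) -> str:
    # Only `workers` coroutines may hold the semaphore at once,
    # so at most that many requests are in flight concurrently.
    async with limit:
        await asyncio.sleep(0.1)  # stand-in for an API call
        return f"sample {sample_id} done"

async def main() -> None:
    workers = 20  # analogous to the processor's workers parameter
    limit = asyncio.Semaphore(workers)
    results = await asyncio.gather(*(process_sample(i, limit) for i in range(100)))
    print(f"{len(results)} samples processed")

asyncio.run(main())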


Key Classes and Functions

  • OpenAIDatasetProcessor: Main class to handle ingestion, processing, and retrieval of jobs and samples.
  • StorageHandler: Handles database operations for retrieving, saving, and managing datasets.
  • create_runner_sample: Simplifies the creation of samples with job-specific metadata.
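
As a sketch of the persistence configuration described above, switching the storage backend to PostgreSQL would be a single-argument change, assuming db_url is accepted by the constructor as the Configuration section states (the connection URL below is a placeholder):

from dataset_processor import OpenAIDatasetProcessor

# Any SQLAlchemy-compatible URL should work per the Configuration section;
# this PostgreSQL URL is a hypothetical example.
processor = OpenAIDatasetProcessor(
    base_url="http://api.openai.compatible.endpoint/api",
    api_key="YOUR_API_KEY",
    workers=20,
    db_url="postgresql://user:password@localhost:5432/datasetrunner",
)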

Dependencies

The following dependencies are required (managed via requirements.txt):

  • openai
  • tqdm
  • pandas
  • sqlalchemy
  • pydantic

Contributing

Contributions are welcome! Whether you want to add new features, optimize performance, or expand documentation, feel free to fork the repository and submit a pull request.



Download files

Source Distribution

oai_dataset_processor-0.1.0.tar.gz (13.5 kB)

Built Distribution

oai_dataset_processor-0.1.0-py3-none-any.whl (13.5 kB)
File details

Details for the file oai_dataset_processor-0.1.0.tar.gz.

File metadata

  • Download URL: oai_dataset_processor-0.1.0.tar.gz
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for oai_dataset_processor-0.1.0.tar.gz:

  • SHA256: 20cd6ad4bc4bffad3ca9d570d26640dd5902f1387276abd24fbd6de59ace6cff
  • MD5: 2c2b083f2c9965fcd8453352e24bc46e
  • BLAKE2b-256: 3dd4faa73294de7ce4c78411bbe9ae276ac435c387379aaed40b93eef0a33be4


File details

Details for the file oai_dataset_processor-0.1.0-py3-none-any.whl.

File hashes

Hashes for oai_dataset_processor-0.1.0-py3-none-any.whl:

  • SHA256: 26bdf89f406670aba3e4e0d08a85ffb4f4edf0a5dd21ebcf058c5c2f6dd8589b
  • MD5: e339b58e5adc9ff39c5ca4479955e487
  • BLAKE2b-256: 781b840483bf664040a19c302bdcbdb07ad8bf77184de349223367757d9c1c18

