A sample dataset processor for evaluating datasets.

OAI Dataset Processor

OAI Dataset Processor is a modular, scalable, and fault-tolerant framework for processing large datasets or task batches against OpenAI-compatible endpoints. It provides job persistence through SQL databases, task distribution with configurable worker limits, and JSON-schema-based output validation.


Key Features

  • Job Persistence: Uses SQLite by default (configurable) to ensure job data survives crashes or errors.
  • Bulk Processing: Supports ingestion, storage, and processing of multiple samples using an OpenAI-compatible endpoint.
  • Async Execution: Uses Python’s asyncio with semaphore-based worker limits for efficient job execution.
  • JSON Schema Validation: Enforces structured outputs using flexible JSON schema definitions.
  • Progress Monitoring: Displays async task progress with a live progress bar.
  • Extensibility: Easy to expand for custom storage backends (e.g., Postgres) or additional processing logic.

Installation

To install from source, clone the repository and install the required dependencies:

pip install -r requirements.txt
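
A built distribution is also published on PyPI (see Download files below), so the package should be installable directly as well; assuming the published name matches the distribution files:

pip install oai_dataset_processor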

Usage

Code Example

from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Configure instructions and JSON schema
sample_instructions = "Please grade the sentence for grammar and coherence, 1-10 for each, respond with JSON."

class SampleResponse(BaseModel):
    grade: int
    coherence: int

json_schema = SampleResponse.model_json_schema()

# Prepare input samples
input_samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr"
]

samples = []
for idx, input_sample in enumerate(input_samples):
    samples.append(create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions=sample_instructions,
        input_data=input_sample,
        output_json_schema=json_schema,
        sample_id=idx
    ))

# Create the processor and ingest samples
processor = OpenAIDatasetProcessor(
    base_url="http://api.openai.compatible.endpoint/api",
    api_key="YOUR_API_KEY",
    workers=20
)
processor.ingest_samples(samples)

# Run the job and retrieve results
results = processor.run_job("job_123")

# Save results to JSONL
results.to_jsonl("output_results.jsonl")

# View Job Status
print(processor.get_job_status("job_123"))
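
Because results are written as JSON Lines, they can be loaded for downstream analysis with pandas (already a dependency of this project). A minimal sketch, assuming the output file written above:

import pandas as pd

# Each line of the JSONL file holds one processed sample
df = pd.read_json("output_results.jsonl", lines=True)
print(df.head())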

Configuration

  • Job Persistence: By default, jobs are stored in sqlite:///datasetrunner.sqlite. To use a different database:

    • Pass the desired db_url to OpenAIDatasetProcessor or StorageHandler (see the sketch under Key Classes and Functions below).
    • Supported databases: SQLite, PostgreSQL, and other SQLAlchemy-compatible backends.
  • Async Execution and Parallelism: Configure the number of concurrent workers with the workers parameter. The limit is enforced via Python’s asyncio.Semaphore, as illustrated in the sketch after this list.

  • Custom Output Validation: Define reusable validation schemas using Pydantic models. For instance, the SampleResponse model above provides a consistent structure for score-based outputs.
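
To make the worker-limit mechanism concrete, here is a minimal, self-contained sketch of the general asyncio.Semaphore pattern; it is illustrative only, not the library’s internal code:

import asyncio

async def process_sample(sample_id: int, limit: asyncio.Semaphore) -> str:
    # Only `workers` coroutines may hold the semaphore at once,
    # so at most that many requests are in flight concurrently.
    async with limit:
        await asyncio.sleep(0.1)  # stand-in for an API call
        return f"sample {sample_id} done"

async def main() -> None:
    workers = 20  # analogous to the processor's workers parameter
    limit = asyncio.Semaphore(workers)
    results = await asyncio.gather(*(process_sample(i, limit) for i in range(100)))
    print(f"{len(results)} samples processed")

asyncio.run(main())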


Key Classes and Functions

  • OpenAIDatasetProcessor: Main class to handle ingestion, processing, and retrieval of jobs and samples.
  • StorageHandler: Handles database operations for retrieving, saving, and managing datasets.
  • create_runner_sample: Simplifies the creation of samples with job-specific metadata.
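
As a sketch of the persistence configuration described above, switching the storage backend to PostgreSQL would be a single-argument change, assuming db_url is accepted by the constructor as the Configuration section states (the connection URL below is a placeholder):

from dataset_processor import OpenAIDatasetProcessor

# Any SQLAlchemy-compatible URL should work per the Configuration section;
# this PostgreSQL URL is a hypothetical example.
processor = OpenAIDatasetProcessor(
    base_url="http://api.openai.compatible.endpoint/api",
    api_key="YOUR_API_KEY",
    workers=20,
    db_url="postgresql://user:password@localhost:5432/datasetrunner",
)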

Dependencies

The following dependencies are required (managed via requirements.txt):

  • openai
  • tqdm
  • pandas
  • sqlalchemy
  • pydantic

Contributing

Contributions are welcome! Whether you want to add new features, optimize performance, or expand documentation, feel free to fork the repository and submit a pull request.



Download files

Source Distribution

oai_dataset_processor-0.1.0.tar.gz (13.5 kB)

Built Distribution

oai_dataset_processor-0.1.0-py3-none-any.whl (13.5 kB)
File details

Details for the file oai_dataset_processor-0.1.0.tar.gz.

File metadata

  • Download URL: oai_dataset_processor-0.1.0.tar.gz
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for oai_dataset_processor-0.1.0.tar.gz:

  • SHA256: 20cd6ad4bc4bffad3ca9d570d26640dd5902f1387276abd24fbd6de59ace6cff
  • MD5: 2c2b083f2c9965fcd8453352e24bc46e
  • BLAKE2b-256: 3dd4faa73294de7ce4c78411bbe9ae276ac435c387379aaed40b93eef0a33be4


File details

Details for the file oai_dataset_processor-0.1.0-py3-none-any.whl.

File hashes

Hashes for oai_dataset_processor-0.1.0-py3-none-any.whl:

  • SHA256: 26bdf89f406670aba3e4e0d08a85ffb4f4edf0a5dd21ebcf058c5c2f6dd8589b
  • MD5: e339b58e5adc9ff39c5ca4479955e487
  • BLAKE2b-256: 781b840483bf664040a19c302bdcbdb07ad8bf77184de349223367757d9c1c18

