OAI Dataset Processor

A sample dataset processor for evaluating datasets.
OAI Dataset Processor is a modular, scalable, and fault-tolerant framework for processing large datasets or task batches against OpenAI-compatible endpoints. It provides job persistence through SQL databases, semaphore-limited task distribution across workers, and JSON schema-based validation of model outputs.
Key Features
- Job Persistence: Uses SQLite by default (configurable) to ensure job data survives crashes or errors.
- Bulk Processing: Supports ingestion, storage, and processing of multiple samples using an OpenAI-compatible endpoint.
- Async Execution: Uses Python's `asyncio` with semaphore-based worker limits for efficient job execution.
- JSON Schema Validation: Enforces structured outputs using flexible JSON schema definitions.
- Progress Monitoring: Displays async task progress with a live progress bar.
- Extensibility: Easy to expand for custom storage backends (e.g., Postgres) or additional processing logic.
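The semaphore-based worker limit mentioned above can be sketched with plain `asyncio` (a generic illustration of the pattern, not the library's actual internals):

```python
import asyncio

async def bounded_worker(sem: asyncio.Semaphore, sample_id: int, results: list) -> None:
    # Acquire the semaphore before doing work, so at most `workers`
    # coroutines run concurrently at any moment.
    async with sem:
        await asyncio.sleep(0)  # stand-in for an API call
        results.append(sample_id)

async def run_all(samples, workers: int = 20) -> list:
    sem = asyncio.Semaphore(workers)
    results: list = []
    # Launch every sample at once; the semaphore throttles actual concurrency.
    await asyncio.gather(*(bounded_worker(sem, s, results) for s in samples))
    return results

print(len(asyncio.run(run_all(range(100)))))  # 100
```

All tasks are scheduled immediately, but only `workers` of them hold the semaphore (and thus an in-flight request) at a time.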
Installation
To install the package, clone the repository and install the required dependencies:

```shell
pip install -r requirements.txt
```
Usage
Code Example
```python
from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Configure instructions and JSON schema
sample_instructions = "Please grade the sentence for grammar and coherence, 1-10 for each, respond with JSON."

class SampleResponse(BaseModel):
    grade: int
    coherence: int

json_schema = SampleResponse.model_json_schema()

# Prepare input samples
input_samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr",
]

samples = []
for idx, input_sample in enumerate(input_samples):
    samples.append(create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions=sample_instructions,
        input_data=input_sample,
        output_json_schema=json_schema,
        sample_id=idx,
    ))

# Create the processor and ingest samples
processor = OpenAIDatasetProcessor(
    base_url="http://api.openai.compatible.endpoint/api",
    api_key="YOUR_API_KEY",
    workers=20,
)
processor.ingest_samples(samples)

# Run the job and retrieve results
results = processor.run_job("job_123")

# Save results to JSONL
results.to_jsonl("output_results.jsonl")

# View job status
print(processor.get_job_status("job_123"))
```
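The JSONL file written by `to_jsonl` can be read back with the standard library alone. The field names below (`sample_id`, `output`) are illustrative assumptions, not the library's documented schema:

```python
import json

# Hypothetical rows standing in for what to_jsonl might have written;
# the field names here are assumptions for illustration.
rows = [
    {"sample_id": 0, "output": {"grade": 9, "coherence": 9}},
    {"sample_id": 1, "output": {"grade": 4, "coherence": 5}},
]
with open("output_results.jsonl", "w") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

# One JSON object per line is the entire JSONL format:
with open("output_results.jsonl") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded))  # 2
```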
Configurations
- Job Persistence: By default, jobs are stored in `sqlite:///datasetrunner.sqlite`. To use a different database, pass the desired `db_url` to `OpenAIDatasetProcessor` or `StorageHandler`. Supported DBs: SQLite, PostgreSQL, and other SQLAlchemy-compatible backends.
- Async Execution and Parallelism: Configure the number of workers using the `workers` parameter. The worker limit is enforced via Python's `asyncio.Semaphore`.
- Custom Output Validation: Define reusable validation schemas using Pydantic models. For instance, the `SampleResponse` model provides a consistent structure for score-based outputs.
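A minimal, stdlib-only sketch of what JSON-schema-style output validation amounts to (the library itself presumably delegates to Pydantic or a full JSON Schema validator):

```python
import json

# A hand-rolled check against a JSON-schema-shaped dict; illustrative only.
schema = {
    "type": "object",
    "required": ["grade", "coherence"],
    "properties": {
        "grade": {"type": "integer"},
        "coherence": {"type": "integer"},
    },
}

def validate(raw: str, schema: dict) -> dict:
    """Parse a raw model response and check it against the schema."""
    data = json.loads(raw)
    for key in schema["required"]:
        if key not in data:
            raise ValueError(f"missing field: {key}")
    for key, spec in schema["properties"].items():
        if key in data and spec["type"] == "integer" and not isinstance(data[key], int):
            raise ValueError(f"{key} must be an integer")
    return data

print(validate('{"grade": 8, "coherence": 7}', schema)["grade"])  # 8
```

A malformed response (e.g. a missing `coherence` field) raises `ValueError` instead of silently passing through, which is the behavior schema enforcement buys you.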
Key Classes and Functions
- `OpenAIDatasetProcessor`: Main class to handle ingestion, processing, and retrieval of jobs and samples.
- `StorageHandler`: Handles database operations for retrieving, saving, and managing datasets.
- `create_runner_sample`: Simplifies the creation of samples with job-specific metadata.
Dependencies
The following dependencies are required (managed via `requirements.txt`):

- openai
- tqdm
- pandas
- sqlalchemy
- pydantic
Contributing
Contributions are welcome! Whether you want to add new features, optimize performance, or expand documentation, feel free to fork the repository and submit a pull request.