# OAI Dataset Processor

A sample dataset processor for evaluating datasets.
OAI Dataset Processor is a modular framework for processing large datasets using OpenAI-compatible endpoints. It provides SQL-based job persistence, worker-limited task distribution, and JSON schema validation.
## Installation

```shell
pip install oai-dataset-processor
```
## Key Features
- Job Persistence: Uses SQLite by default, configurable to any SQLAlchemy database
- Bulk Processing: Process multiple samples through OpenAI-compatible endpoints
- Async Execution: Semaphore-based worker limits for efficient job execution
- JSON Schema Validation: Enforce structured outputs using JSON schemas
- Progress Monitoring: Live progress bar for async tasks
- Extensibility: Easy to extend for custom storage or processing logic
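The semaphore-based worker limit above can be sketched with plain `asyncio` (a minimal illustration of the technique, not the library's internals; all names here are hypothetical):

```python
import asyncio

async def run_with_limit(inputs, worker_limit=3):
    """Run one task per input, allowing at most worker_limit to run at once."""
    sem = asyncio.Semaphore(worker_limit)
    active = 0   # tasks currently holding the semaphore
    peak = 0     # highest observed concurrency

    async def worker(i):
        nonlocal active, peak
        async with sem:  # blocks when worker_limit tasks are already active
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an API call
            active -= 1
            return i * 2

    results = await asyncio.gather(*(worker(i) for i in inputs))
    return results, peak

results, peak = asyncio.run(run_with_limit(range(10), worker_limit=3))
print(results, "peak concurrency:", peak)
```

All tasks are scheduled at once, but the semaphore caps how many are in flight, which keeps pressure on the endpoint bounded regardless of job size.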
## Quick Start
```python
from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Define output schema
class SampleResponse(BaseModel):
    grade: int
    coherence: int

# Prepare samples
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr",
]

job_samples = [
    create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions="Grade the sentence for grammar and coherence (1-10 each)",
        input_data=sample,
        output_json_schema=SampleResponse.model_json_schema(),
        sample_id=idx,
    )
    for idx, sample in enumerate(samples)
]

# Process samples
processor = OpenAIDatasetProcessor(
    base_url="YOUR_BASE_URL_HERE",
    api_key="YOUR_API_KEY_HERE",
    workers=20,
)
processor.ingest_samples(job_samples)
results = processor.run_job("job_123")

# Export results
results.to_jsonl("output_results.jsonl")
print(processor.get_job_status("job_123"))
```
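The JSONL export writes one JSON object per line, so the results can be read back with only the standard library. A minimal sketch (the record fields below are illustrative, shaped like the quick-start output schema):

```python
import json
import os
import tempfile

# Illustrative records; real exports come from results.to_jsonl(...)
records = [
    {"sample_id": 0, "grade": 9, "coherence": 8},
    {"sample_id": 1, "grade": 4, "coherence": 6},
]

path = os.path.join(tempfile.mkdtemp(), "output_results.jsonl")

# Write JSONL: one JSON document per line
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back, one record per line
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```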
## Configuration
- Database: Defaults to `sqlite:///datasetrunner.sqlite`. Configure via the `db_url` parameter of `OpenAIDatasetProcessor`.
- Parallelism: Set the number of concurrent workers via the `workers` parameter.
- Schema Validation: Define output schemas using Pydantic models.
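Pointing the processor at a different SQLAlchemy database might look like the sketch below. The `db_url` and `workers` parameters are the ones named above; the Postgres URL itself is purely illustrative.

```python
from dataset_processor import OpenAIDatasetProcessor

# Illustrative: any SQLAlchemy-style URL; default is sqlite:///datasetrunner.sqlite
processor = OpenAIDatasetProcessor(
    base_url="YOUR_BASE_URL_HERE",
    api_key="YOUR_API_KEY_HERE",
    db_url="postgresql://user:pass@localhost/jobs",
    workers=8,
)
```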
## Dependencies

- openai
- tqdm
- pandas
- sqlalchemy
- pydantic
## Contributing
Contributions welcome! Please submit PRs for features, optimizations, or documentation.
## File details

Details for the file `oai_dataset_processor-0.1.4.tar.gz`.
### File metadata
- Download URL: oai_dataset_processor-0.1.4.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `0e7b3e612a72c684ccc7779df339348b8463941a2beb551b753642e28278abe8` |
| MD5 | `92a2b95d43093946f10729b7f4bb407e` |
| BLAKE2b-256 | `07b6120ac72c60e43713b81d04c55bbab98b87d9cf15eea28ffe09655582b233` |
## File details

Details for the file `oai_dataset_processor-0.1.4-py3-none-any.whl`.
### File metadata
- Download URL: oai_dataset_processor-0.1.4-py3-none-any.whl
- Upload date:
- Size: 12.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d6c629b92eb2bd20d145ae386660cb59c6ddae17e2ee6918d2b6b46d4b98fa3b` |
| MD5 | `6e3efaaef8bda6ffba21074449836032` |
| BLAKE2b-256 | `7258b46d1600957f172b6fd29f38de110403b3a8989362e0fbc61e36989db990` |