Ceres - The LLM Dataset Engine. Production-grade SDK for processing tabular datasets using LLMs with reliability, observability, and cost control

Ceres - LLM Dataset Engine

 ▄▄▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄▄▄▄▄
▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌
▐░█▀▀▀▀▀▀▀▀▀ ▐░█▀▀▀▀▀▀▀▀▀ ▐░█▀▀▀▀▀▀▀█░▌▐░█▀▀▀▀▀▀▀▀▀ ▐░█▀▀▀▀▀▀▀▀▀
▐░▌          ▐░▌          ▐░▌       ▐░▌▐░▌          ▐░▌
▐░▌          ▐░█▄▄▄▄▄▄▄▄▄ ▐░█▄▄▄▄▄▄▄█░▌▐░█▄▄▄▄▄▄▄▄▄ ▐░█▄▄▄▄▄▄▄▄▄
▐░▌          ▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌
▐░▌          ▐░█▀▀▀▀▀▀▀▀▀ ▐░█▀▀▀▀█░█▀▀ ▐░█▀▀▀▀▀▀▀▀▀  ▀▀▀▀▀▀▀▀▀█░▌
▐░▌          ▐░▌          ▐░▌     ▐░▌  ▐░▌                    ▐░▌
▐░█▄▄▄▄▄▄▄▄▄ ▐░█▄▄▄▄▄▄▄▄▄ ▐░▌      ▐░▌ ▐░█▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄▄▄█░▌
▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌▐░▌       ▐░▌▐░░░░░░░░░░░▌▐░░░░░░░░░░░▌
 ▀▀▀▀▀▀▀▀▀▀▀  ▀▀▀▀▀▀▀▀▀▀▀  ▀         ▀  ▀▀▀▀▀▀▀▀▀▀▀  ▀▀▀▀▀▀▀▀▀▀▀

Production-grade SDK for batch processing tabular datasets with LLMs. Built on LlamaIndex for provider abstraction, it adds batch orchestration, automatic cost tracking, checkpointing, and YAML configuration for dataset transformation at scale.

Features

Quick API: 3-line hello world with smart defaults and auto-detection
Simple API: Fluent builder pattern for full control when needed
Reliability: Automatic retries, checkpointing, error policies (99.9% completion rate)
Cost Control: Pre-execution estimation, budget limits, real-time tracking
Observability: Progress bars, structured logging, metrics, cost reports
Extensibility: Plugin architecture, custom stages, multiple LLM providers
Production Ready: Zero data loss on crashes, resume from checkpoint
Multiple Providers: OpenAI, Azure OpenAI, Anthropic Claude, Groq, MLX (Apple Silicon), and custom APIs
Local Inference: Run models locally with MLX (Apple Silicon) or Ollama - 100% free, private, offline-capable
Multi-Column Processing: Generate multiple output columns with composition or JSON parsing
Custom Providers: Integrate any OpenAI-compatible API (Together.AI, vLLM, Ollama, custom endpoints)

Quick Start

Option 1: Quick API (Recommended)

The simplest way to get started - just provide your data, prompt, and model:

from ceres import QuickPipeline

# Process data with smart defaults
pipeline = QuickPipeline.create(
    data="data.csv",
    prompt="Clean this text: {description}",
    model="gpt-4o-mini"
)

# Execute pipeline
result = pipeline.execute()
print(f"Processed {result.metrics.processed_rows} rows")
print(f"Total cost: ${result.costs.total_cost:.4f}")

What's auto-detected:

Input columns from {placeholders} in prompt
Provider from model name (gpt-4 → openai, claude → anthropic)
Parser type (JSON for multi-column, text for single column)
Sensible batch size and concurrency for the provider
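
The README does not show how the auto-detection is implemented. As a rough sketch of the first two items, the functions below extract placeholder names with the standard library's `string.Formatter` and infer a provider from a model-name prefix. The names `detect_input_columns`, `detect_provider`, and `PROVIDER_PREFIXES` are hypothetical and not part of the Ceres API.

```python
import string

# Hypothetical prefix table; the README only names the gpt-4 and claude mappings.
PROVIDER_PREFIXES = {"gpt-": "openai", "claude": "anthropic"}


def detect_input_columns(prompt: str) -> list[str]:
    """Return placeholder names in order of first appearance."""
    names: list[str] = []
    for _literal, field, _spec, _conversion in string.Formatter().parse(prompt):
        # field is None for trailing literal text and "" for a bare "{}"
        if field and field not in names:
            names.append(field)
    return names


def detect_provider(model: str) -> str:
    """Map a model name to a provider ID by prefix, or fail loudly."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"cannot infer a provider for model {model!r}")


columns = detect_input_columns("Clean this text: {description}")
provider = detect_provider("gpt-4o-mini")
```

Failing on an unknown prefix, rather than silently defaulting, is a design choice of this sketch; the library may behave differently.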

Option 2: Builder API (Full Control)

For advanced use cases requiring explicit configuration:

from ceres import PipelineBuilder

# Build with explicit settings
pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["description"],
              output_columns=["cleaned"])
    .with_prompt("Clean this text: {description}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    .with_batch_size(100)
    .with_concurrency(5)
    .build()
)

# Estimate cost before running
estimate = pipeline.estimate_cost()
print(f"Estimated cost: ${estimate.total_cost:.4f}")

# Execute pipeline
result = pipeline.execute()
print(f"Total cost: ${result.costs.total_cost:.4f}")

Installation

Using uv (recommended)

# Basic installation
uv add ceres

# With MLX support (Apple Silicon only)
uv add "ceres[mlx]"

Using pip

# Basic installation
pip install ceres

# With MLX support (Apple Silicon only)
pip install "ceres[mlx]"

Set up API keys

# For cloud providers
export OPENAI_API_KEY="your-key-here"
# or
export AZURE_OPENAI_API_KEY="your-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
# or
export ANTHROPIC_API_KEY="your-key-here"
# or
export GROQ_API_KEY="your-key-here"
# or
export TOGETHER_API_KEY="your-key-here"

# For MLX (Apple Silicon)
export HUGGING_FACE_HUB_TOKEN="your-token-here"  # For model downloads

# Local providers (Ollama, vLLM) don't need API keys

Usage Examples

1. Simple Data Processing

from ceres import DatasetProcessor

# Minimal configuration for simple use cases
processor = DatasetProcessor(
    data="reviews.csv",
    input_column="customer_review",
    output_column="sentiment",
    prompt="Classify sentiment as: Positive, Negative, or Neutral\nReview: {customer_review}\nSentiment:",
    llm_config={"provider": "openai", "model": "gpt-4o-mini"}
)

# Test on sample first
sample = processor.run_sample(n=10)
print(sample)

# Process full dataset
result = processor.run()

2. Structured Data Extraction

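import pandas as pd
from ceres import PipelineBuilder

# Load the source data; "products.csv" is a placeholder file name
df = pd.read_csv("products.csv")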

pipeline = (
    PipelineBuilder.create()
    .from_dataframe(
        df,
        input_columns=["product_description"],
        output_columns=["brand", "model", "price", "condition"]
    )
    .with_prompt("""
        Extract structured information and return JSON:
        {
          "brand": "...",
          "model": "...",
          "price": "...",
          "condition": "new|used|refurbished"
        }

        Description: {product_description}
    """)
    .with_llm(provider="openai", model="gpt-4o-mini", temperature=0.0)
    .build()
)

result = pipeline.execute()
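
The prompt above mixes literal JSON braces with a `{product_description}` placeholder. How Ceres renders templates is not stated in this README. If a template engine follows Python's `str.format` rules (an assumption), literal braces must be doubled, as this small sketch with a hypothetical `render` helper shows:

```python
def render(template: str, **row: str) -> str:
    """Render a prompt with str.format semantics (assumed, not confirmed for Ceres)."""
    return template.format(**row)


unescaped = 'Return JSON: {"brand": "..."}\nDescription: {product_description}'
escaped = 'Return JSON: {{"brand": "..."}}\nDescription: {product_description}'

# With single braces, str.format treats the JSON object as a replacement field.
try:
    render(unescaped, product_description="Used ThinkPad X1")
    unescaped_ok = True
except (KeyError, ValueError, IndexError):
    unescaped_ok = False

# With doubled braces, the JSON skeleton survives and only the placeholder is filled.
rendered = render(escaped, product_description="Used ThinkPad X1")
```

If the example prompt works as written in Ceres, the library uses a more lenient substitution scheme and no escaping is needed.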

3. With Cost Control

pipeline = (
    PipelineBuilder.create()
    .from_csv("large_dataset.csv",
              input_columns=["text"],
              output_columns=["summary"])
    .with_prompt("Summarize in 10 words: {text}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    # Cost control settings
    .with_max_budget(10.0)  # Maximum $10
    .with_batch_size(100)
    .with_concurrency(5)
    .with_rate_limit(60)  # 60 requests/min
    .with_checkpoint_interval(500)  # Checkpoint every 500 rows
    .build()
)

# Estimate first and stop before any spend if the estimate exceeds the budget
estimate = pipeline.estimate_cost()
if estimate.total_cost > 10.0:
    raise SystemExit("Estimated cost exceeds the $10 budget")

result = pipeline.execute()
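
For intuition about what a pre-execution estimate involves, the arithmetic can be sketched independently of the library. The function below is hypothetical and is not how `estimate_cost()` is implemented. It borrows the `input_cost_per_1k_tokens` and `output_cost_per_1k_tokens` parameter names used elsewhere in this README, and the token averages are made-up illustration values.

```python
def estimate_total_cost(
    rows: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_cost_per_1k_tokens: float,
    output_cost_per_1k_tokens: float,
) -> float:
    """Estimate USD cost as tokens / 1000 * price, summed over input and output."""
    input_cost = rows * avg_input_tokens / 1000 * input_cost_per_1k_tokens
    output_cost = rows * avg_output_tokens / 1000 * output_cost_per_1k_tokens
    return input_cost + output_cost


# 10,000 rows, ~200 prompt tokens and ~20 completion tokens per row (assumed),
# priced at 0.0006 USD per 1k tokens in both directions.
total = estimate_total_cost(10_000, 200, 20, 0.0006, 0.0006)
within_budget = total <= 10.0
```

Under these assumptions the estimate comes to about 1.32 USD (1.20 for input plus 0.12 for output), well inside a 10 USD budget.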

4. Multiple Input Columns

pipeline = (
    PipelineBuilder.create()
    .from_csv("products.csv",
              input_columns=["title", "description", "category"],
              output_columns=["optimized_title"])
    .with_prompt("""
        Optimize this product title for SEO.

        Current Title: {title}
        Description: {description}
        Category: {category}

        Optimized Title:
    """)
    .with_llm(provider="openai", model="gpt-4o-mini")
    .with_output("optimized_products.csv", format="csv")
    .build()
)

result = pipeline.execute()

5. Azure OpenAI

pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["text"], output_columns=["result"])
    .with_prompt("Process: {text}")
    .with_llm(
        provider="azure_openai",
        model="gpt-4",
        azure_endpoint="https://your-endpoint.openai.azure.com/",
        azure_deployment="your-deployment-name",
        api_version="2024-02-15-preview"
    )
    .build()
)

6. Anthropic Claude

pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["text"], output_columns=["analysis"])
    .with_prompt("Analyze: {text}")
    .with_llm(
        provider="anthropic",
        model="claude-3-opus-20240229",
        temperature=0.0,
        max_tokens=1024
    )
    .build()
)

7. Local Inference with MLX (Apple Silicon)

# 100% free, private, offline-capable inference on M1/M2/M3/M4 Macs
pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["text"], output_columns=["summary"])
    .with_prompt("Summarize: {text}")
    .with_llm(
        provider="mlx",
        model="mlx-community/Qwen3-1.7B-4bit",  # Fast, small model
        max_tokens=100,
        input_cost_per_1k_tokens=0.0,  # Free!
        output_cost_per_1k_tokens=0.0
    )
    .with_concurrency(1)  # MLX works best with concurrency=1
    .build()
)

Requirements:

macOS with Apple Silicon (M1/M2/M3/M4)
Install with: pip install "ceres[mlx]" (the quotes keep zsh, the default macOS shell, from interpreting the brackets)

8. Provider Presets (Simplified Configuration)

from ceres import PipelineBuilder
from ceres.core.specifications import LLMProviderPresets

# Use pre-configured providers (80% less boilerplate!)
pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["text"], output_columns=["result"])
    .with_prompt("Process: {text}")
    .with_llm_spec(LLMProviderPresets.TOGETHER_AI_LLAMA_70B)  # One line!
    .build()
)

# Available presets:
# - LLMProviderPresets.GPT4O_MINI
# - LLMProviderPresets.GPT4O
# - LLMProviderPresets.TOGETHER_AI_LLAMA_70B
# - LLMProviderPresets.TOGETHER_AI_LLAMA_8B
# - LLMProviderPresets.OLLAMA_LLAMA_70B (free, local)
# - LLMProviderPresets.OLLAMA_LLAMA_8B (free, local)
# - LLMProviderPresets.GROQ_LLAMA_70B
# - LLMProviderPresets.CLAUDE_SONNET_4

# Override preset settings:
custom = LLMProviderPresets.GPT4O_MINI.model_copy(
    update={"temperature": 0.9, "max_tokens": 500}
)
# Pass the customized spec to the builder before .build():
#   PipelineBuilder.create()....with_llm_spec(custom)

# Custom provider via factory:
custom_vllm = LLMProviderPresets.create_custom_openai_compatible(
    provider_name="My vLLM Server",
    model="mistral-7b-instruct",
    base_url="http://my-server:8000/v1"
)
# Likewise, on the builder: .with_llm_spec(custom_vllm)

Benefits:

Zero configuration errors (pre-validated)
Correct pricing and URLs built-in
IDE autocomplete for discovery
80% code reduction vs parameter-based config

9. Custom OpenAI-Compatible APIs (Parameter-Based)

# Alternative: Configure providers with individual parameters
import os

from ceres import PipelineBuilder

pipeline = (
    PipelineBuilder.create()
    .from_csv("data.csv", input_columns=["text"], output_columns=["result"])
    .with_prompt("Process: {text}")
    .with_llm(
        provider="openai_compatible",
        provider_name="Together.AI",  # Or "Ollama", "vLLM", etc.
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        base_url="https://api.together.xyz/v1",  # Custom endpoint
        api_key=os.environ["TOGETHER_API_KEY"],  # Read the key from the environment
        input_cost_per_1k_tokens=0.0006,
        output_cost_per_1k_tokens=0.0006
    )
    .build()
)

Supported APIs:

Ollama (local): http://localhost:11434/v1
Together.AI (cloud): https://api.together.xyz/v1
vLLM (self-hosted): Your custom endpoint
Any OpenAI-compatible API

10. Multi-Column Output with JSON Parsing

# Single LLM call generates multiple output columns
pipeline = (
    PipelineBuilder.create()
    .from_csv("products.csv",
              input_columns=["description"],
              output_columns=["brand", "category", "price"])  # Multiple outputs!
    .with_prompt("""
        Extract structured data from this product description.
        Return JSON format:
        {
          "brand": "...",
          "category": "...",
          "price": "..."
        }

        Description: {description}
    """)
    .with_llm(provider="openai", model="gpt-4o-mini", temperature=0.0)
    .build()
)

result = pipeline.execute()
# Result has 3 new columns: brand, category, price
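
The README does not describe the parser itself. As an illustration of the general technique, the sketch below pulls the first JSON object out of a model reply and maps it onto the requested output columns, filling missing keys with `None`. The function name `parse_columns` is hypothetical and not part of the Ceres API.

```python
import json
from typing import Any, Optional


def parse_columns(response: str, output_columns: list[str]) -> dict[str, Optional[Any]]:
    """Map a JSON object embedded in a model reply onto the output columns."""
    empty = {column: None for column in output_columns}
    # Models often wrap JSON in extra prose, so slice from the first "{" to the last "}".
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(response[start : end + 1])
    except json.JSONDecodeError:
        return empty
    return {column: data.get(column) for column in output_columns}


reply = 'Here you go: {"brand": "Lenovo", "category": "laptop", "price": "450"} Done.'
row = parse_columns(reply, ["brand", "category", "price"])
```

Returning `None` for unparseable replies keeps a failed row visible, so a retry step can find it instead of the pipeline silently dropping it.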

11. Pipeline Composition (Multi-Column with Dependencies)

from ceres import PipelineBuilder, PipelineComposer

# df is assumed to be a pandas DataFrame with a "review" column
# Create multiple pipelines with dependencies
composer = PipelineComposer(input_data=df)

# Pipeline 1: Generate sentiment score
sentiment_pipeline = (
    PipelineBuilder.create()
    .from_dataframe(df, input_columns=["review"], output_columns=["sentiment"])
    .with_prompt("Rate sentiment (0-100): {review}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    .build()
)

# Pipeline 2: Generate explanation (depends on sentiment)
explanation_pipeline = (
    PipelineBuilder.create()
    .from_dataframe(df,
                    input_columns=["review", "sentiment"],
                    output_columns=["explanation"])
    .with_prompt("Explain why this review has {sentiment}% sentiment: {review}")
    .with_llm(provider="openai", model="gpt-4o-mini")
    .build()
)

# Compose and execute
result = (
    composer
    .add_column("sentiment", sentiment_pipeline)
    .add_column("explanation", explanation_pipeline, depends_on=["sentiment"])
    .execute()
)
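
Resolving `depends_on` into an execution order is a topological sort. How `PipelineComposer` does this internally is not documented here. The sketch below shows the idea with the standard library's `graphlib`, and the `execution_order` helper is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return columns ordered so that every column follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


# Mirrors the composition above: explanation depends on sentiment.
order = execution_order({"sentiment": [], "explanation": ["sentiment"]})

# A circular dependency cannot be ordered and is reported as an error.
try:
    execution_order({"a": ["b"], "b": ["a"]})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Here `order` places `sentiment` before `explanation`, matching the order in which the two pipelines have to run.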

CLI Usage

Ceres includes a powerful command-line interface for processing datasets without writing code.

List Available Providers

# See all supported LLM providers
ceres list-providers

This shows:

Provider IDs (openai, azure_openai, anthropic, groq, mlx, openai_compatible)
Platform requirements
Cost estimates
Use cases
Required environment variables

Process Datasets

# Basic usage
ceres process --config config.yaml

# Override input/output
ceres process --config config.yaml --input data.csv --output results.csv

# Override provider and model
ceres process --config config.yaml --provider groq --model llama-3.3-70b-versatile

# Set budget limit
ceres process --config config.yaml --max-budget 10.0

# Dry run (estimate only, don't execute)
ceres process --config config.yaml --dry-run

# Estimate cost
ceres estimate --config config.yaml --input data.csv

# Inspect data
ceres inspect --input data.csv --head 10

Example Config File

# config.yaml
dataset:
  source_type: csv
  source_path: data.csv
  input_columns: [text]
  output_columns: [sentiment]

prompt:
  template: "Classify sentiment: {text}"

llm:
  provider: openai
  model: gpt-4o-mini
  temperature: 0.0

processing:
  batch_size: 100
  concurrency: 5
  max_budget: 10.0

output:
  destination_type: csv
  destination_path: output.csv
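
The CLI flags shown earlier (`--provider`, `--model`, `--max-budget`) override values from a config file like this one. A minimal sketch of that precedence follows, assuming flags win over file values and assuming the parsed YAML is a nested dict with the keys shown above. The `apply_overrides` function is hypothetical and not part of Ceres.

```python
import copy
from typing import Optional


def apply_overrides(
    config: dict,
    *,
    provider: Optional[str] = None,
    model: Optional[str] = None,
    max_budget: Optional[float] = None,
) -> dict:
    """Return a copy of config with any non-None override applied."""
    merged = copy.deepcopy(config)  # leave the loaded file contents untouched
    if provider is not None:
        merged["llm"]["provider"] = provider
    if model is not None:
        merged["llm"]["model"] = model
    if max_budget is not None:
        merged["processing"]["max_budget"] = max_budget
    return merged


file_config = {
    "llm": {"provider": "openai", "model": "gpt-4o-mini", "temperature": 0.0},
    "processing": {"batch_size": 100, "concurrency": 5, "max_budget": 10.0},
}
effective = apply_overrides(file_config, provider="groq", model="llama-3.3-70b-versatile")
```

Settings that no flag overrides, such as `temperature` and `batch_size`, carry over from the file unchanged.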

Architecture

The SDK follows a layered architecture:

┌─────────────────────────────────────────┐
│  Layer 4: High-Level API                │
│  (Pipeline, PipelineBuilder)            │
├─────────────────────────────────────────┤
│  Layer 3: Orchestration Engine          │
│  (PipelineExecutor, StateManager)       │
├─────────────────────────────────────────┤
│  Layer 2: Processing Stages             │
│  (DataLoader, LLMInvocation, Parser)    │
├─────────────────────────────────────────┤
│  Layer 1: Infrastructure Adapters       │
│  (LLMClient, DataReader, Checkpoint)    │
├─────────────────────────────────────────┤
│  Layer 0: Core Utilities                │
│  (RetryHandler, RateLimiter, Logging)   │
└─────────────────────────────────────────┘
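
Layer 0 lists a RetryHandler, but this README does not show its interface. The sketch below illustrates the general technique such a component typically uses: retries with exponential backoff and a little jitter. The `retry` function and its parameters are hypothetical and not the Ceres API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... plus up to 100 ms of jitter to avoid synchronized retries
            sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
    raise ValueError("attempts must be at least 1")


# Usage: a call that fails twice, then succeeds; delays are recorded instead of slept.
state = {"calls": 0}
delays: list[float] = []


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


outcome = retry(flaky, sleep=delays.append)
```

Injecting `sleep` keeps the helper testable without real waiting. A production handler would usually retry only on error types known to be transient.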

Key Design Principles

Simple: Straightforward solutions
DRY: No code duplication
Type Safe: Type hints throughout
Separation of Concerns: Configuration vs. execution

Supported LLM Providers

Provider	Platform	Cost	Use Case	Setup
OpenAI	Cloud (All)	$$	Production, high quality	`OPENAI_API_KEY`
Azure OpenAI	Cloud (All)	$$	Enterprise, compliance	`AZURE_OPENAI_API_KEY`
Anthropic	Cloud (All)	$$$	Long context, Claude models	`ANTHROPIC_API_KEY`
Groq	Cloud (All)	Free tier	Fast inference, development	`GROQ_API_KEY`
MLX	macOS (M1/M2/M3/M4)	Free	Local, private, offline	`pip install "ceres[mlx]"`
OpenAI-Compatible	Custom/Local/Cloud	Varies	Ollama, vLLM, Together.AI	`base_url` + optional API key

Run ceres list-providers to see detailed information about each provider.

Use Cases

Data Cleaning: Clean, normalize, standardize text data
Sentiment Analysis: Classify sentiment at scale
Information Extraction: Extract structured data from unstructured text
Categorization: Auto-categorize products, documents, emails
Content Generation: Generate descriptions, summaries, titles
Translation: Translate content to multiple languages
Data Enrichment: Enhance datasets with LLM-generated insights
Product Matching: Compare and score product similarity
Content Moderation: Flag inappropriate content at scale

Performance

Throughput: Process 1,000 rows in < 5 minutes (GPT-4o-mini, concurrency=5)
Reliability: 99.9% completion rate with automatic retries
Cost Efficiency: Pre-execution estimation within 10% accuracy
Memory: < 500MB for datasets up to 50K rows
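
The throughput figure interacts with any configured rate limit. The back-of-the-envelope helpers below make that relationship explicit. They are hypothetical and assume one request per row unless `rows_per_request` says otherwise; the README does not state how rows map to requests.

```python
import math


def minutes_needed(rows: int, requests_per_minute: int, rows_per_request: int = 1) -> float:
    """Lower bound on wall-clock minutes imposed by the rate limit alone."""
    requests = math.ceil(rows / rows_per_request)
    return requests / requests_per_minute


def required_rpm(rows: int, minutes: float, rows_per_request: int = 1) -> int:
    """Smallest requests-per-minute rate that finishes within the time budget."""
    requests = math.ceil(rows / rows_per_request)
    return math.ceil(requests / minutes)


# A 60 requests/min limit, as in the cost-control example, against 1,000 rows.
at_60_rpm = minutes_needed(1000, 60)
# The rate needed to process 1,000 rows in 5 minutes.
rpm_for_target = required_rpm(1000, 5)
```

Under the one-request-per-row assumption, 1,000 rows at 60 requests/min take at least about 16.7 minutes, and finishing in 5 minutes needs a sustained rate of at least 200 requests/min.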

Observability & Debugging

Enable distributed tracing with OpenTelemetry for production debugging:

from ceres.observability import enable_tracing

# Console exporter (development)
enable_tracing(exporter="console")

# Jaeger exporter (production)
enable_tracing(exporter="jaeger", endpoint="http://localhost:14268/api/traces")

# Your pipeline execution is now traced
result = pipeline.execute()

Features:

Per-stage latency tracking
LLM token usage and cost per call
Error traces with stack traces
PII-safe by default (prompts sanitized)
Export to Jaeger, Datadog, or any OpenTelemetry-compatible backend

Installation:

pip install "ceres[observability]"

See examples/18_observability.py for complete examples.

Configuration Options

Processing Configuration

.with_batch_size(100)          # Rows per batch
.with_concurrency(5)            # Parallel requests
.with_checkpoint_interval(500)  # Checkpoint frequency
.with_rate_limit(60)            # Requests per minute
.with_max_budget(10.0)          # Maximum USD budget
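
To illustrate what a checkpoint interval buys, here is a self-contained sketch of checkpoint-and-resume over a list of rows. It is not the Ceres checkpoint format: the JSON file layout and the `process_with_checkpoints` function are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def process_with_checkpoints(
    rows: list,
    transform: Callable,
    checkpoint_path: str,
    interval: int,
) -> list:
    """Process rows in order, saving results every `interval` rows and resuming if a checkpoint exists."""
    path = Path(checkpoint_path)
    # Resume: results already on disk correspond to the first len(results) rows.
    results = json.loads(path.read_text()) if path.exists() else []
    for index in range(len(results), len(rows)):
        results.append(transform(rows[index]))
        if len(results) % interval == 0:
            path.write_text(json.dumps(results))
    path.write_text(json.dumps(results))  # final save covers a partial last interval
    return results
```

With `interval=500`, a crash loses at most the rows processed since the last save, and rerunning the same call picks up from the saved position. A sturdier version would write to a temporary file and rename it, so an interrupted write cannot corrupt the checkpoint.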

LLM Configuration

.with_llm(
    provider="openai",
    model="gpt-4o-mini",
    temperature=0.0,        # 0.0-2.0
    max_tokens=1024,        # Max output tokens
    api_key="..."           # Or from env
)

Output Configuration

.with_output(
    path="output.csv",
    format="csv",              # csv, excel, parquet
    merge_strategy="replace"   # replace, append, update
)

Testing

# Run tests
uv run pytest

# With coverage
uv run pytest --cov=src --cov-report=html

# Run specific test
uv run pytest tests/test_pipeline.py

Documentation

README.md (this file): Quick start and usage guide
LLM_DATASET_ENGINE.md: Complete architecture and design documentation
examples/: Example scripts demonstrating various features
Code docstrings: Inline documentation for all public APIs

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository at https://github.com/ptimizeroracle/Ceres
Create a feature branch
Follow the existing code style (Black, Ruff)
Add tests for new features
Update documentation
Submit a pull request

License

MIT License - see LICENSE file for details

Acknowledgments

Built with LlamaIndex for LLM provider abstraction
Ceres adds batch processing, cost tracking, checkpointing, and configuration management on top of LlamaIndex's LLM clients
Thanks to the open-source community

Support

Repository: https://github.com/ptimizeroracle/Ceres
Issues: Open an issue on GitHub
Discussions: Use GitHub Discussions for questions
Email: git@binblok.com

Recent Updates

Version 1.0.0 (October 2025)

New Features:

✅ Provider Presets: Pre-configured LLMSpec objects for common providers (80% code reduction)
✅ Simplified Configuration: New with_llm_spec() method accepting LLMSpec objects
✅ MLX Integration: Local inference on Apple Silicon (M1/M2/M3/M4) - 100% free, private, offline
✅ OpenAI-Compatible Provider: Support for Ollama, vLLM, Together.AI, and custom APIs
✅ Multi-Column Processing: Generate multiple output columns with JSON parsing
✅ Pipeline Composition: Chain pipelines with dependencies between columns
✅ CLI Provider Discovery: ceres list-providers command to explore all providers
✅ Auto-Retry for Multi-Column: Automatic retry now checks all output columns for failures
✅ Custom LLM Clients: Extend LLMClient base class for exotic APIs

Improvements:

Zero configuration errors with validated presets
Enhanced error handling for multi-column outputs
Better streaming implementation
Improved documentation with provider comparison guide
More examples (14+ example files including provider presets demo)

Roadmap

Upcoming Features

RAG Integration (Next Release)

Retrieval-Augmented Generation for context-aware dataset processing
Custom retrieval stage via plugin architecture
Vector store integration (Pinecone, Weaviate, ChromaDB)
Dynamic context injection per row
See docs/DESIGN_IMPROVEMENT.md for detailed design exploration

Other Planned Features

Support for true streaming execution (in progress)
Multi-modal support (images, PDFs)
Distributed processing (Spark integration)
Web UI for pipeline management
Additional LLM providers (Cohere, AI21, Mistral)

Built with Python and LlamaIndex

Release history

Version	Released	Status
1.10.1	Apr 23, 2026
1.10.0	Apr 23, 2026
1.10.0rc1	Apr 22, 2026	pre-release, yanked
1.9.1	Apr 22, 2026	yanked
1.9.0	Apr 22, 2026	yanked
1.7.0	Mar 31, 2026	yanked
1.6.2	Mar 24, 2026
1.6.1	Mar 20, 2026
1.5.3	Mar 11, 2026
1.5.2	Mar 11, 2026
1.5.0	Mar 10, 2026
1.4.3	Mar 9, 2026
1.4.2	Mar 9, 2026
1.4.1	Mar 8, 2026
1.3.4	Nov 20, 2025
1.3.3	Nov 16, 2025
1.3.1	Nov 16, 2025
1.2.1	Nov 12, 2025
1.2.0	Nov 9, 2025
1.1.0	Nov 9, 2025
1.0.4	Oct 29, 2025
1.0.3	Oct 27, 2025
1.0.2	Oct 27, 2025
1.0.1	Oct 27, 2025
1.0.0	Oct 27, 2025	this version

Download files

Source Distribution

ondine-1.0.0.tar.gz (705.2 kB)

Uploaded Oct 27, 2025 Source

Built Distribution

ondine-1.0.0-py3-none-any.whl (111.8 kB)

Uploaded Oct 27, 2025 Python 3

File details

Details for the file ondine-1.0.0.tar.gz.

File metadata

Download URL: ondine-1.0.0.tar.gz
Upload date: Oct 27, 2025
Size: 705.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for ondine-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`cba5e46a78687d3f09b47f5195bbf24b2a13c4a5cee86ceaa9ed3ec47ae6ca83`
MD5	`15531853166601a6f82cba94f406cd7d`
BLAKE2b-256	`f8c99ddc19def79da8a074d1b82f6ac5fcbbeea7d9527fb491a9d78390103ba4`

File details

Details for the file ondine-1.0.0-py3-none-any.whl.

File metadata

Download URL: ondine-1.0.0-py3-none-any.whl
Upload date: Oct 27, 2025
Size: 111.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for ondine-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7205ea8018d56d8f457adf5e265e6ace47d5a4a982fd2d6648d6dc8128047d6`
MD5	`3c9864f141e55619cb09ca71f7b0f7c9`
BLAKE2b-256	`3416ba449bc11562cb9f86e9063674eb69cf7a7cf535b9fd2b4f34dadbfc8fb0`
