
Convert RAG logs into fine-tuning datasets

Project description

🧪 Distillery

Convert RAG logs into high-quality fine-tuning datasets.

Stop spending weeks manually labeling data. Distillery automatically transforms your RAG production logs into training datasets, helping you reduce costs and improve performance.

Why Distillery?

The Problem

You built a RAG system. It's in production. It works. But every query costs money (embeddings + retrieval + LLM calls), and at scale that adds up to hundreds or even thousands of dollars per month.

You know fine-tuning could help, but creating training data takes weeks.

The Solution

Distillery automatically converts your RAG logs into fine-tuning datasets:

# Analyze your RAG logs
distillery analyze --source langsmith --project my-rag-app

# Generate training data
distillery generate --output training_data.jsonl --min-score 0.85

# Compare costs
distillery compare --monthly-queries 50000

# Fine-tune
distillery train --dataset training_data.jsonl --model gpt-4o-mini

Result: 60-90% cost reduction, weeks saved on data prep.

Features

🔌 Universal Log Support

  • LangSmith: LangChain's official observability platform (the most popular source)
  • JSONL: Custom logs from any RAG system
  • Coming soon: LlamaIndex, Haystack, custom databases

🎯 Smart Quality Filtering

  • Filter by retrieval scores
  • Remove uncertain responses ("I don't know")
  • Keep only positive/neutral user feedback
  • Customizable thresholds

📊 Data Quality Metrics

  • Diversity scoring
  • Quality assessment
  • Topic distribution
  • Automatic deduplication

💰 Cost Calculator

  • Compare RAG vs fine-tuned costs
  • Calculate ROI and break-even
  • Project savings at scale

🚀 Multiple Formats

  • OpenAI (chat completion)
  • Llama
  • Mistral
  • Custom templates

🔒 Privacy-First

  • All processing happens locally
  • Your data never leaves your machine
  • No telemetry, no tracking

Installation

pip install distillery-ai

# With LangSmith support
pip install "distillery-ai[langsmith]"

Quick Start

1. From LangSmith

from distillery.connectors import create_langsmith_connector
from distillery.filters import filter_logs
from distillery.converters import convert_to_openai
from distillery.utils import estimate_savings

# Connect to LangSmith
connector = create_langsmith_connector("my-rag-project")
logs = list(connector.fetch_logs(limit=1000))

# Filter high-quality examples
filtered = filter_logs(logs, min_score=0.85)

# Convert to training format
training_examples = convert_to_openai(filtered)

# Calculate savings
comparison = estimate_savings(logs, training_examples, monthly_queries=50000)
print(comparison)

2. From JSONL Files

from distillery.connectors import create_jsonl_connector

# Point to your log files
connector = create_jsonl_connector("logs/*.jsonl")
logs = list(connector.fetch_logs())

# The rest of the pipeline (filtering, converting, estimating savings) is the same as above
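
The JSONL connector reads one log record per line. The exact schema isn't documented here, so the record below is a hypothetical example; the field names (query, retrieved_chunks, response, retrieval_score, feedback) are assumptions chosen to mirror the signals Distillery filters on:

{"query": "What's the refund policy?", "retrieved_chunks": ["Refunds are accepted within 30 days of purchase..."], "response": "Our refund policy allows returns within 30 days of purchase.", "retrieval_score": 0.91, "feedback": "thumbs_up"}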

CLI Usage

Analyze Logs

# From LangSmith
distillery analyze \
  --source langsmith \
  --project my-rag-project

# From files
distillery analyze \
  --source jsonl \
  --path "logs/*.jsonl"

# Output:
# 📊 Total queries: 10,234
# ✅ Successful (score > 0.8): 8,721 (85%)
# 📝 User feedback: thumbs_up: 892, thumbs_down: 143
# 🏷️  Topics: refunds (34%), shipping (28%), returns (18%)

Generate Training Data

distillery generate \
  --source langsmith \
  --project my-rag-project \
  --output training_data.jsonl \
  --min-score 0.85 \
  --format openai

# Output:
# ✅ Filtered 8,721 high-quality examples
# ✅ Generated 8,721 training examples
# 💰 Estimated training cost: $68.42
# 📂 Saved to: training_data.jsonl

Compare Costs

distillery compare \
  --logs logs/*.jsonl \
  --monthly-queries 50000

# Output:
# Current RAG: $93/month
# Fine-tuned: $3/month
# Savings: $90/month ($1,080/year)
# Break-even: 0.8 months
# ROI: 1,580% annual return
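
The comparison boils down to simple arithmetic. The exact formula Distillery uses isn't documented, but the figures above are reproduced by the sketch below (the $68.42 one-time training cost is taken from the earlier generate output):

# Hypothetical reconstruction of the comparison math -- not Distillery's actual code
rag_monthly = 93.0       # current RAG spend per month
ft_monthly = 3.0         # projected fine-tuned spend per month
training_cost = 68.42    # one-time fine-tuning cost (from the generate step)

monthly_savings = rag_monthly - ft_monthly             # 90.0  -> "$90/month"
annual_savings = monthly_savings * 12                  # 1080  -> "$1,080/year"
break_even_months = training_cost / monthly_savings    # ~0.76 -> "0.8 months"
annual_roi_pct = annual_savings / training_cost * 100  # ~1,579% -> "1,580% annual return"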

Fine-Tune

distillery train \
  --dataset training_data.jsonl \
  --model gpt-4o-mini \
  --suffix customer-support-v1

# Output:
# ✅ Uploaded training data
# ✅ Started fine-tune job: ftjob-abc123
# ✅ Model will be: ft:gpt-4o-mini:customer-support-v1
# ⏱️  Estimated completion: 2 hours
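
distillery train presumably wraps OpenAI's fine-tuning API. For orientation, here is a minimal sketch of the equivalent raw calls with the official openai SDK; this is an assumption about what happens under the hood, not Distillery's internal code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared JSONL training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job (OpenAI may require a dated model snapshot here)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="customer-support-v1",
)
print(job.id, job.status)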

Advanced Usage

Custom Quality Filters

from distillery.filters import QualityFilter, min_retrieval_score

# Create custom filter
quality_filter = QualityFilter()
quality_filter.add_filter(min_retrieval_score(0.9))  # Very strict
quality_filter.add_filter(lambda log: len(log.response) > 50)  # Keep only longer responses

filtered = quality_filter.filter(logs)

Data Augmentation

from distillery.augmenters import augment_dataset

# Generate variations of each query
augmented = augment_dataset(
    training_examples,
    variations=3,  # 3x dataset size
    model="gpt-4o-mini"
)
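
Under the hood, augment_dataset presumably prompts a small model to paraphrase each query. A generic, hypothetical sketch of that idea using the openai SDK directly (not Distillery's implementation):

from openai import OpenAI

client = OpenAI()

def paraphrase(query: str, n: int = 3) -> list[str]:
    # Ask the model for n alternative phrasings of the same question
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite the user's question with different wording but the same meaning."},
            {"role": "user", "content": query},
        ],
        n=n,
        temperature=0.9,
    )
    return [choice.message.content for choice in response.choices]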

Include Retrieved Context

from distillery.converters import convert_to_openai

# Bake context into training data
examples = convert_to_openai(
    logs,
    include_context=True  # Includes retrieved chunks
)

Real-World Examples

Example 1: Customer Support Bot

  • Input: 10,000 RAG queries over 30 days
  • After filtering: 8,500 high-quality examples
  • One-time training cost: $71
  • Monthly RAG cost: $180
  • Monthly fine-tuned cost: $6
  • Savings: $174/month ($2,088/year)
  • Break-even: ~2 weeks

Example 2: Documentation Q&A

  • Input: 50,000 queries/month
  • After filtering: 42,000 high-quality examples
  • One-time training cost: $298
  • Monthly RAG cost: $890
  • Monthly fine-tuned cost: $28
  • Savings: $862/month ($10,344/year)
  • Break-even: ~11 days

How It Works

1. RAG Logs Collection

Your RAG system logs:

  • User queries
  • Retrieved documents
  • LLM responses
  • User feedback (optional)

2. Quality Filtering

Distillery filters for the following (a standalone sketch of these checks follows the list):

  • High retrieval scores (> 0.8)
  • Confident responses (no "I don't know")
  • Positive/neutral feedback
  • Reasonable length
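
Here is that sketch over dict-style log records (illustrative only; Distillery's actual implementation and log schema may differ):

# Toy log record with assumed field names
logs = [
    {"query": "What's the refund policy?",
     "response": "Refunds are accepted within 30 days of purchase when the item is returned unused.",
     "retrieval_score": 0.92,
     "feedback": "thumbs_up"},
]

UNCERTAIN_PHRASES = ("i don't know", "i'm not sure", "i am not sure")

def is_high_quality(log: dict, min_score: float = 0.8) -> bool:
    response = log.get("response", "").strip()
    return (
        log.get("retrieval_score", 0.0) > min_score                    # high retrieval score
        and not any(p in response.lower() for p in UNCERTAIN_PHRASES)  # confident response
        and log.get("feedback") in (None, "neutral", "thumbs_up")      # positive/neutral feedback
        and 50 <= len(response) <= 4000                                # reasonable length
    )

high_quality = [log for log in logs if is_high_quality(log)]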

3. Format Conversion

Transforms to OpenAI format:

{
  "messages": [
    {"role": "user", "content": "What's the refund policy?"},
    {"role": "assistant", "content": "Our refund policy..."}
  ]
}
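
The conversion step is essentially a reshaping of each log into that messages array. A hypothetical sketch (field names assumed, as before):

def to_openai_example(log: dict) -> dict:
    # One chat-format training example per logged query/response pair
    return {
        "messages": [
            {"role": "user", "content": log["query"]},
            {"role": "assistant", "content": log["response"]},
        ]
    }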

4. Fine-Tuning

Upload to OpenAI and train:

  • Model learns domain knowledge
  • No retrieval needed at inference
  • 60-90% cost reduction
  • 3-5x faster responses

Architecture

RAG Production Logs
        ↓
   Connectors (LangSmith/JSONL)
        ↓
   Quality Filtering
        ↓
   Format Conversion
        ↓
   Training Data (JSONL)
        ↓
   Fine-Tuned Model

Requirements

  • Python 3.9+
  • OpenAI API key (for fine-tuning)
  • LangSmith API key (optional, for LangSmith logs)

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.


Roadmap

  • LangSmith connector
  • JSONL connector
  • OpenAI format converter
  • Quality filtering
  • Cost calculator
  • LlamaIndex connector
  • Data augmentation
  • Web UI
  • Team accounts
  • Continuous retraining

Built with ❤️ by the Distillery team.

Stop spending weeks on data labeling. Start saving money today.



Download files


Source Distribution

distillery_ai-0.1.0.tar.gz (39.3 kB)


Built Distribution


distillery_ai-0.1.0-py3-none-any.whl (44.9 kB)


File details

Details for the file distillery_ai-0.1.0.tar.gz.

File metadata

  • Download URL: distillery_ai-0.1.0.tar.gz
  • Size: 39.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for distillery_ai-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9db8856ef9d54621ad6868689f87853f446f050d7323b92afc8d550e78ec5a00
MD5 704c7e8cb177970c32c7489cb1b8df47
BLAKE2b-256 8868d5ee15dbfe7235bb12749d5bf0366c171e329ece148f6fe07fd652c82e37


File details

Details for the file distillery_ai-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: distillery_ai-0.1.0-py3-none-any.whl
  • Size: 44.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for distillery_ai-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8686b4348f5116c844dfe849ed1badd2cc31647a21fab7d70440dd9be45eae23
MD5 2f2192c337b0a8e5348d1255d916c6a4
BLAKE2b-256 c00cd866d308b62c3791c9a4324ba49c21fd6245941a1c3e3f1f38aed3bd4909

