🧪 Distillery
Convert RAG logs into high-quality fine-tuning datasets.
Stop spending weeks manually labeling data. Distillery automatically transforms your RAG production logs into training datasets, helping you reduce costs and improve performance.
Why Distillery?
The Problem
You built a RAG system. It's in production. It works. But every query costs money (embeddings + retrieval + LLM). At scale, you're spending $100s-$1000s per month.
You know fine-tuning could help, but creating training data takes weeks.
The Solution
Distillery automatically converts your RAG logs into fine-tuning datasets:
# Analyze your RAG logs
distillery analyze --source langsmith --project my-rag-app
# Generate training data
distillery generate --output training_data.jsonl --min-score 0.85
# Compare costs
distillery compare --monthly-queries 50000
# Fine-tune
distillery train --dataset training_data.jsonl --model gpt-4o-mini
Result: 60-90% cost reduction, weeks saved on data prep.
Features
🔌 Universal Log Support
- LangSmith: Official LangChain observability (most popular)
- JSONL: Custom logs from any RAG system
- Coming soon: LlamaIndex, Haystack, custom databases
🎯 Smart Quality Filtering
- Filter by retrieval scores
- Remove uncertain responses ("I don't know")
- Keep only positive/neutral user feedback
- Customizable thresholds
📊 Data Quality Metrics
- Diversity scoring
- Quality assessment
- Topic distribution
- Automatic deduplication
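A minimal sketch of what deduplication means here, using only the standard library. This is a generic approach (hash the normalized query and keep the first occurrence), not necessarily how Distillery implements it, and the "query" field name is an assumption, not a documented schema:

import hashlib

def dedupe_by_query(records):
    # Keep the first record for each normalized query string (illustrative only;
    # "query" is an assumed field name)
    seen, unique = set(), []
    for record in records:
        key = hashlib.sha256(record["query"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique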
💰 Cost Calculator
- Compare RAG vs fine-tuned costs
- Calculate ROI and break-even
- Project savings at scale
🚀 Multiple Formats
- OpenAI (chat completion)
- Llama
- Mistral
- Custom templates
🔒 Privacy-First
- All processing happens locally
- Your data never leaves your machine
- No telemetry, no tracking
Installation
pip install distillery-ai
# With LangSmith support
pip install distillery-ai[langsmith]
Quick Start
1. From LangSmith
from distillery.connectors import create_langsmith_connector
from distillery.filters import filter_logs
from distillery.converters import convert_to_openai
from distillery.utils import estimate_savings
# Connect to LangSmith
connector = create_langsmith_connector("my-rag-project")
logs = list(connector.fetch_logs(limit=1000))
# Filter high-quality examples
filtered = filter_logs(logs, min_score=0.85)
# Convert to training format
training_examples = convert_to_openai(filtered)
# Calculate savings
comparison = estimate_savings(logs, training_examples, monthly_queries=50000)
print(comparison)
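If you want to write the converted examples to disk yourself rather than via the CLI, and assuming convert_to_openai returns JSON-serializable dicts (an assumption about its return type), the standard library is enough:

import json

# One JSON object per line -- the JSONL layout fine-tuning APIs expect
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")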
2. From JSONL Files
from distillery.connectors import create_jsonl_connector
# Point to your log files
connector = create_jsonl_connector("logs/*.jsonl")
logs = list(connector.fetch_logs())
# Rest is the same...
CLI Usage
Analyze Logs
# From LangSmith
distillery analyze \
--source langsmith \
--project my-rag-project
# From files
distillery analyze \
--source jsonl \
--path "logs/*.jsonl"
# Output:
# 📊 Total queries: 10,234
# ✅ Successful (score > 0.8): 8,721 (85%)
# 📝 User feedback: thumbs_up: 892, thumbs_down: 143
# 🏷️ Topics: refunds (34%), shipping (28%), returns (18%)
Generate Training Data
distillery generate \
--source langsmith \
--project my-rag-project \
--output training_data.jsonl \
--min-score 0.85 \
--format openai
# Output:
# ✅ Filtered 8,721 high-quality examples
# ✅ Generated 8,721 training examples
# 💰 Estimated training cost: $68.42
# 📂 Saved to: training_data.jsonl
Compare Costs
distillery compare \
--logs logs/*.jsonl \
--monthly-queries 50000
# Output:
# Current RAG: $93/month
# Fine-tuned: $3/month
# Savings: $90/month ($1,080/year)
# Break-even: 0.8 months
# ROI: 1,580% annual return
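The break-even and ROI figures follow from simple arithmetic; here is a sketch that reproduces the numbers above (the monthly costs are taken from the example output, the formulas are the presumed ones):

# Reconstructing the comparison above (illustrative formulas)
monthly_queries = 50_000
rag_cost_per_month = 93.0        # embeddings + retrieval + LLM
finetuned_cost_per_month = 3.0   # fine-tuned model only, no retrieval
training_cost = 68.42            # one-time cost from the generate step

monthly_savings = rag_cost_per_month - finetuned_cost_per_month   # $90/month
break_even_months = training_cost / monthly_savings               # ~0.8 months
annual_roi = monthly_savings * 12 / training_cost                 # ~15.8x, i.e. ~1,580%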
Fine-Tune
distillery train \
--dataset training_data.jsonl \
--model gpt-4o-mini \
--suffix customer-support-v1
# Output:
# ✅ Uploaded training data
# ✅ Started fine-tune job: ftjob-abc123
# ✅ Model will be: ft:gpt-4o-mini:customer-support-v1
# ⏱️ Estimated completion: 2 hours
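Once the job completes, the fine-tuned model is called like any other OpenAI model. A minimal example with the official openai Python SDK (the model name is the placeholder from the output above; your actual model ID will differ):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="ft:gpt-4o-mini:customer-support-v1",  # replace with your fine-tuned model ID
    messages=[{"role": "user", "content": "What's the refund policy?"}],
)
print(response.choices[0].message.content)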
Advanced Usage
Custom Quality Filters
from distillery.filters import QualityFilter, min_retrieval_score
# Create custom filter
quality_filter = QualityFilter()  # avoid shadowing the built-in filter()
quality_filter.add_filter(min_retrieval_score(0.9))            # very strict
quality_filter.add_filter(lambda log: len(log.response) > 50)  # longer responses only
filtered = quality_filter.filter(logs)
Data Augmentation
from distillery.augmenters import augment_dataset
# Generate variations of each query
augmented = augment_dataset(
training_examples,
variations=3, # 3x dataset size
model="gpt-4o-mini"
)
Include Retrieved Context
from distillery.converters import convert_to_openai
# Bake context into training data
examples = convert_to_openai(
logs,
include_context=True # Includes retrieved chunks
)
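With include_context=True, each example presumably carries the retrieved chunks alongside the question. One plausible shape is shown below; the exact layout Distillery emits is not documented here, so treat the system-message placement as an assumption:

{
  "messages": [
    {"role": "system", "content": "Context:\n[retrieved chunk 1]\n[retrieved chunk 2]"},
    {"role": "user", "content": "What's the refund policy?"},
    {"role": "assistant", "content": "Our refund policy..."}
  ]
}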
Real-World Examples
Example 1: Customer Support Bot
- Input: 10,000 RAG queries over 30 days
- After filtering: 8,500 high-quality examples
- Training cost: $71
- Monthly RAG cost: $180
- Monthly fine-tuned cost: $6
- Savings: $174/month ($2,088/year)
- Break-even: 2 weeks
Example 2: Documentation Q&A
- Input: 50,000 queries/month
- After filtering: 42,000 high-quality examples
- Training cost: $298
- Monthly RAG cost: $890
- Monthly fine-tuned cost: $28
- Savings: $862/month ($10,344/year)
- Break-even: 11 days
How It Works
1. RAG Logs Collection
Your RAG system logs:
- User queries
- Retrieved documents
- LLM responses
- User feedback (optional)
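A single log record therefore looks roughly like the JSON line below. The field names are illustrative, not a documented schema:

{"query": "What's the refund policy?", "retrieved_docs": ["Refunds are accepted within 30 days of purchase..."], "retrieval_score": 0.91, "response": "Our refund policy allows returns within 30 days...", "feedback": "thumbs_up"}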
2. Quality Filtering
Distillery filters for:
- High retrieval scores (> 0.8)
- Confident responses (no "I don't know")
- Positive/neutral feedback
- Reasonable length
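The criteria above boil down to a simple predicate. The sketch below illustrates the logic only; it is not Distillery's actual implementation, and the retrieval_score and feedback attribute names are assumptions:

def looks_high_quality(log) -> bool:
    # Illustrative check mirroring the criteria listed above
    return (
        log.retrieval_score > 0.8                           # high retrieval score
        and "i don't know" not in log.response.lower()      # confident response
        and log.feedback in (None, "neutral", "thumbs_up")  # positive/neutral feedback
        and 50 < len(log.response) < 4000                   # reasonable length
    )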
3. Format Conversion
Transforms to OpenAI format:
{
  "messages": [
    {"role": "user", "content": "What's the refund policy?"},
    {"role": "assistant", "content": "Our refund policy..."}
  ]
}
4. Fine-Tuning
Upload to OpenAI and train:
- Model learns domain knowledge
- No retrieval needed at inference
- 60-90% cost reduction
- 3-5x faster responses
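The upload-and-train step maps onto two calls in the official openai SDK. This is a sketch of what distillery train would roughly do under the hood, not Distillery's actual code (note that OpenAI's fine-tuning API expects a dated model snapshot such as gpt-4o-mini-2024-07-18):

from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="customer-support-v1",
)
print(job.id)  # e.g. ftjob-abc123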
Architecture
RAG Production Logs
↓
Connectors (LangSmith/JSONL)
↓
Quality Filtering
↓
Format Conversion
↓
Training Data (JSONL)
↓
Fine-Tuned Model
Requirements
- Python 3.9+
- OpenAI API key (for fine-tuning)
- LangSmith API key (optional, for LangSmith logs)
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
Support
- 📧 Email: hello@distillery.ai
- 💬 Discord: Join our community
- 📚 Docs: docs.distillery.ai
Roadmap
- LangSmith connector
- JSONL connector
- OpenAI format converter
- Quality filtering
- Cost calculator
- LlamaIndex connector
- Data augmentation
- Web UI
- Team accounts
- Continuous retraining
Built with ❤️ by the Distillery team.
Stop spending weeks on data labeling. Start saving money today.
File details
Details for the file distillery_ai-0.1.0.tar.gz.
File metadata
- Download URL: distillery_ai-0.1.0.tar.gz
- Upload date:
- Size: 39.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9db8856ef9d54621ad6868689f87853f446f050d7323b92afc8d550e78ec5a00
|
|
| MD5 |
704c7e8cb177970c32c7489cb1b8df47
|
|
| BLAKE2b-256 |
8868d5ee15dbfe7235bb12749d5bf0366c171e329ece148f6fe07fd652c82e37
|
File details
Details for the file distillery_ai-0.1.0-py3-none-any.whl.
File metadata
- Download URL: distillery_ai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 44.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8686b4348f5116c844dfe849ed1badd2cc31647a21fab7d70440dd9be45eae23
|
|
| MD5 |
2f2192c337b0a8e5348d1255d916c6a4
|
|
| BLAKE2b-256 |
c00cd866d308b62c3791c9a4324ba49c21fd6245941a1c3e3f1f38aed3bd4909
|