Generate high-quality synthetic instruction-tuning data from seed examples. Simple API, built-in quality filtering, cost-aware.
Project description
castwright
Generate synthetic instruction-tuning data that doesn't look synthetic.
castwright takes a handful of seed examples and produces thousands of new instruction-output pairs using any LLM API. It handles the annoying parts — prompt engineering, JSON parsing, deduplication, quality filtering — so you can focus on the model you're actually training.
from castwright import generate, load_seeds, save_results, GenerationConfig
from castwright import OpenAIProvider
seeds = load_seeds("seeds.jsonl")
provider = OpenAIProvider(model="gpt-4o-mini")
result = generate(seeds, provider, GenerationConfig(n=500, temperature=0.9))
save_results(result, "training_data.jsonl")
print(f"Saved {len(result.examples)} examples ({result.n_filtered} filtered)")
Why castwright?
Building a fine-tuning dataset by hand is slow. Getting an LLM to generate training data sounds easy until you deal with:
- Refusals showing up in your training set
- Repetitive examples that add nothing
- Models talking about generating data instead of actually doing it
- Raw JSON extraction from markdown blocks
- Deduplication against your seeds so the model doesn't just copy them
distilabel tried to solve this but became a pipeline framework. Alpaca_eval is evaluation-only. Self-instruct is a research repo, not a library.
castwright is the missing middle ground: a pip-installable library with a clean API, built-in quality filters, and output in every format fine-tuning frameworks expect.
What you get:
- Pluggable LLM backends: OpenAI, Anthropic, or any OpenAI-compatible API
- Six fast heuristic filters that catch bad generations before they hit your training set
- Automatic dedup against your seed data
- Output in Alpaca, ShareGPT, or OpenAI chat format
- Multi-turn conversation generation
- A CLI for quick generation runs without writing Python
- Zero required dependencies (provider SDKs are optional extras)
Install
pip install castwright
With OpenAI support:
pip install castwright[openai]
With Anthropic support:
pip install castwright[anthropic]
Everything (both providers + CLI):
pip install castwright[all]
Seed file format
Create a JSONL file with your seed examples. You need at least a few good ones — castwright uses them to teach the LLM what you want:
{"instruction": "Explain the difference between TCP and UDP", "output": "TCP is a connection-oriented protocol that guarantees delivery..."}
{"instruction": "Write a Python function to flatten a nested list", "output": "def flatten(lst):\n result = []\n for item in lst:\n if isinstance(item, list):\n result.extend(flatten(item))\n else:\n result.append(item)\n return result"}
{"instruction": "What causes a segfault?", "input": "In C/C++ programs", "output": "A segmentation fault occurs when a program tries to access memory..."}
Also accepts JSON arrays and prompt/response field names.
Usage
Basic generation
from castwright import generate, GenerationConfig, Seed
from castwright import OpenAIProvider
seeds = [
Seed(instruction="Explain recursion", output="Recursion is when a function calls itself..."),
Seed(instruction="What is a hash table?", output="A hash table is a data structure that maps keys to values..."),
]
provider = OpenAIProvider(model="gpt-4o-mini")
config = GenerationConfig(n=100, temperature=0.9, diversity_factor=0.7)
result = generate(seeds, provider, config)
print(f"Generated: {result.n_generated}, Filtered: {result.n_filtered}, Kept: {len(result.examples)}")
Output formats
from castwright import save_results, OutputFormat
# Alpaca format (default) — works with axolotl, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.ALPACA)
# ShareGPT format — works with FastChat, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.SHAREGPT)
# OpenAI chat format — works with OpenAI fine-tuning API
save_results(result, "data.jsonl", OutputFormat.OPENAI)
Multi-turn conversations
from castwright import generate_multiturn, Seed
from castwright import OpenAIProvider
seeds = [Seed(instruction="Help me debug this Python code", output="Let me look at that...")]
provider = OpenAIProvider()
result = generate_multiturn(seeds, provider, n=50, turns=4)
Custom providers
Any OpenAI-compatible API works out of the box:
from castwright import OpenAIProvider
# vLLM, Ollama, Together, etc.
provider = OpenAIProvider(
model="meta-llama/Llama-3-70B-Instruct",
base_url="http://localhost:8000/v1",
api_key="not-needed",
)
Quality filters
castwright applies six filters by default:
| Filter | What it catches |
|---|---|
not_empty |
Blank instruction or output |
min_length |
Instructions shorter than 10 characters |
not_repetitive |
Output with >30% consecutive word repeats |
not_refusal |
"I'm sorry, I can't..." responses |
no_meta_talk |
"Here's an example..." meta-commentary |
balanced_formatting |
Unclosed code blocks |
You can also pass your own:
from castwright import filter_examples, GeneratedExample
def my_filter(ex: GeneratedExample) -> bool:
return len(ex.output) > 100
filtered = filter_examples(result.examples, filters=[my_filter])
Generation config
GenerationConfig(
n=100, # Number of examples to generate
model="gpt-4o-mini", # Model name (passed to provider)
temperature=0.9, # Sampling temperature (0.0-2.0)
max_retries=3, # Retries on parse failure
diversity_factor=0.7, # 0.0=similar to seeds, 1.0=very diverse
output_format=OutputFormat.ALPACA,
)
CLI
# Generate from seed file
castwright gen seeds.jsonl -n 200 -m gpt-4o-mini -o output.jsonl --provider openai
# Use Anthropic
castwright gen seeds.jsonl -n 100 -m claude-sonnet-4-20250514 -o output.jsonl --provider anthropic
# Preview your seed examples
castwright preview seeds.jsonl
# Test without API calls
castwright gen seeds.jsonl -n 10 -o test.jsonl --provider mock
Comparison with alternatives
| castwright | distilabel | self-instruct | manual | |
|---|---|---|---|---|
| pip install | yes | yes | clone repo | - |
| Simple API | 3 lines | pipeline DSL | scripts | - |
| Quality filters | built-in | separate step | none | human |
| Multi-provider | OpenAI, Anthropic, any compatible | varies | OpenAI only | - |
| Format output | Alpaca, ShareGPT, OpenAI | custom | Alpaca | any |
| Maintained | active | founders left | archived | - |
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file castwright-0.2.0.tar.gz.
File metadata
- Download URL: castwright-0.2.0.tar.gz
- Upload date:
- Size: 36.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
483ed365993842873b50691b057fb09727633f833a136215d63bd8c38f1823a5
|
|
| MD5 |
c234e5ebe8fd1d2214bbf5ee760e3f93
|
|
| BLAKE2b-256 |
e4167a30a9fae23c5e748ed881dc0ac9b584c45bc13acbb0ba345bf8eb75f95d
|
File details
Details for the file castwright-0.2.0-py3-none-any.whl.
File metadata
- Download URL: castwright-0.2.0-py3-none-any.whl
- Upload date:
- Size: 22.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d8de8bbcfd6cc595e11e2614f96a49dc4bff4a38d0bcfde3eafe5828a9080cf7
|
|
| MD5 |
f5bcdd11fc2b7ec470affa58e38c8872
|
|
| BLAKE2b-256 |
b759fbe9e78f8e0416cb070ad733722e2749130467e1339ff48fad77c44ae431
|