Generate high-quality synthetic instruction-tuning data from seed examples. Simple API, built-in quality filtering, cost-aware.

These details have not been verified by PyPI

Project links

Project description

castwright

Generate synthetic instruction-tuning data that doesn't look synthetic.

castwright takes a handful of seed examples and produces thousands of new instruction-output pairs using any LLM API. It handles the annoying parts — prompt engineering, JSON parsing, deduplication, quality filtering — so you can focus on the model you're actually training.

castwright generation report

from castwright import generate, load_seeds, save_results, GenerationConfig
from castwright import OpenAIProvider

seeds = load_seeds("seeds.jsonl")
provider = OpenAIProvider(model="gpt-4o-mini")

result = generate(seeds, provider, GenerationConfig(n=500, temperature=0.9))
save_results(result, "training_data.jsonl")
print(f"Saved {len(result.examples)} examples ({result.n_filtered} filtered)")

Why castwright?

Building a fine-tuning dataset by hand is slow. Getting an LLM to generate training data sounds easy until you deal with:

Refusals showing up in your training set
Repetitive examples that add nothing
Models talking about generating data instead of actually doing it
Raw JSON extraction from markdown blocks
Deduplication against your seeds so the model doesn't just copy them

distilabel tried to solve this but became a pipeline framework. Alpaca_eval is evaluation-only. Self-instruct is a research repo, not a library.

castwright is the missing middle ground: a pip-installable library with a clean API, built-in quality filters, and output in every format fine-tuning frameworks expect.

What you get:

Pluggable LLM backends: OpenAI, Anthropic, or any OpenAI-compatible API
Six fast heuristic filters that catch bad generations before they hit your training set
Automatic dedup against your seed data
Output in Alpaca, ShareGPT, or OpenAI chat format
Multi-turn conversation generation
A CLI for quick generation runs without writing Python
Zero required dependencies (provider SDKs are optional extras)

Install

pip install castwright

With OpenAI support:

pip install castwright[openai]

With Anthropic support:

pip install castwright[anthropic]

Everything (both providers + CLI):

pip install castwright[all]

Seed file format

Create a JSONL file with your seed examples. You need at least a few good ones — castwright uses them to teach the LLM what you want:

{"instruction": "Explain the difference between TCP and UDP", "output": "TCP is a connection-oriented protocol that guarantees delivery..."}
{"instruction": "Write a Python function to flatten a nested list", "output": "def flatten(lst):\n    result = []\n    for item in lst:\n        if isinstance(item, list):\n            result.extend(flatten(item))\n        else:\n            result.append(item)\n    return result"}
{"instruction": "What causes a segfault?", "input": "In C/C++ programs", "output": "A segmentation fault occurs when a program tries to access memory..."}

Also accepts JSON arrays and prompt/response field names.

Usage

Basic generation

from castwright import generate, GenerationConfig, Seed
from castwright import OpenAIProvider

seeds = [
    Seed(instruction="Explain recursion", output="Recursion is when a function calls itself..."),
    Seed(instruction="What is a hash table?", output="A hash table is a data structure that maps keys to values..."),
]

provider = OpenAIProvider(model="gpt-4o-mini")
config = GenerationConfig(n=100, temperature=0.9, diversity_factor=0.7)

result = generate(seeds, provider, config)
print(f"Generated: {result.n_generated}, Filtered: {result.n_filtered}, Kept: {len(result.examples)}")

Output formats

from castwright import save_results, OutputFormat

# Alpaca format (default) — works with axolotl, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.ALPACA)

# ShareGPT format — works with FastChat, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.SHAREGPT)

# OpenAI chat format — works with OpenAI fine-tuning API
save_results(result, "data.jsonl", OutputFormat.OPENAI)

Multi-turn conversations

from castwright import generate_multiturn, Seed
from castwright import OpenAIProvider

seeds = [Seed(instruction="Help me debug this Python code", output="Let me look at that...")]
provider = OpenAIProvider()

result = generate_multiturn(seeds, provider, n=50, turns=4)

Custom providers

Any OpenAI-compatible API works out of the box:

from castwright import OpenAIProvider

# vLLM, Ollama, Together, etc.
provider = OpenAIProvider(
    model="meta-llama/Llama-3-70B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

Quality filters

castwright applies six filters by default:

Filter	What it catches
`not_empty`	Blank instruction or output
`min_length`	Instructions shorter than 10 characters
`not_repetitive`	Output with >30% consecutive word repeats
`not_refusal`	"I'm sorry, I can't..." responses
`no_meta_talk`	"Here's an example..." meta-commentary
`balanced_formatting`	Unclosed code blocks

You can also pass your own:

from castwright import filter_examples, GeneratedExample

def my_filter(ex: GeneratedExample) -> bool:
    return len(ex.output) > 100

filtered = filter_examples(result.examples, filters=[my_filter])

Generation config

GenerationConfig(
    n=100,                    # Number of examples to generate
    model="gpt-4o-mini",     # Model name (passed to provider)
    temperature=0.9,          # Sampling temperature (0.0-2.0)
    max_retries=3,            # Retries on parse failure
    diversity_factor=0.7,     # 0.0=similar to seeds, 1.0=very diverse
    output_format=OutputFormat.ALPACA,
)

CLI

# Generate from seed file
castwright gen seeds.jsonl -n 200 -m gpt-4o-mini -o output.jsonl --provider openai

# Use Anthropic
castwright gen seeds.jsonl -n 100 -m claude-sonnet-4-20250514 -o output.jsonl --provider anthropic

# Preview your seed examples
castwright preview seeds.jsonl

# Test without API calls
castwright gen seeds.jsonl -n 10 -o test.jsonl --provider mock

Comparison with alternatives

	castwright	distilabel	self-instruct	manual
pip install	yes	yes	clone repo	-
Simple API	3 lines	pipeline DSL	scripts	-
Quality filters	built-in	separate step	none	human
Multi-provider	OpenAI, Anthropic, any compatible	varies	OpenAI only	-
Format output	Alpaca, ShareGPT, OpenAI	custom	Alpaca	any
Maintained	active	founders left	archived	-

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
datamix	Dataset mixing & curriculum optimization
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

License

Apache-2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Apr 10, 2026

This version

0.2.0

Apr 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

castwright-0.2.0.tar.gz (36.2 kB view details)

Uploaded Apr 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

castwright-0.2.0-py3-none-any.whl (22.5 kB view details)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file castwright-0.2.0.tar.gz.

File metadata

Download URL: castwright-0.2.0.tar.gz
Upload date: Apr 10, 2026
Size: 36.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for castwright-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`483ed365993842873b50691b057fb09727633f833a136215d63bd8c38f1823a5`
MD5	`c234e5ebe8fd1d2214bbf5ee760e3f93`
BLAKE2b-256	`e4167a30a9fae23c5e748ed881dc0ac9b584c45bc13acbb0ba345bf8eb75f95d`

See more details on using hashes here.

File details

Details for the file castwright-0.2.0-py3-none-any.whl.

File metadata

Download URL: castwright-0.2.0-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 22.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for castwright-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8de8bbcfd6cc595e11e2614f96a49dc4bff4a38d0bcfde3eafe5828a9080cf7`
MD5	`f5bcdd11fc2b7ec470affa58e38c8872`
BLAKE2b-256	`b759fbe9e78f8e0416cb070ad733722e2749130467e1339ff48fad77c44ae431`

See more details on using hashes here.

castwright 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

castwright

Why castwright?

Install

Seed file format

Usage

Basic generation

Output formats

Multi-turn conversations

Custom providers

Quality filters

Generation config

CLI

Comparison with alternatives

See Also

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes