Skip to main content

Generate high-quality synthetic instruction-tuning data from seed examples. Simple API, built-in quality filtering, cost-aware.

Project description

castwright

CI Python 3.9+ License: Apache 2.0

Generate synthetic instruction-tuning data that doesn't look synthetic.

castwright takes a handful of seed examples and produces thousands of new instruction-output pairs using any LLM API. It handles the annoying parts — prompt engineering, JSON parsing, deduplication, quality filtering — so you can focus on the model you're actually training.

castwright generation report

from castwright import generate, load_seeds, save_results, GenerationConfig
from castwright import OpenAIProvider

seeds = load_seeds("seeds.jsonl")
provider = OpenAIProvider(model="gpt-4o-mini")

result = generate(seeds, provider, GenerationConfig(n=500, temperature=0.9))
save_results(result, "training_data.jsonl")
print(f"Saved {len(result.examples)} examples ({result.n_filtered} filtered)")

Why castwright?

Building a fine-tuning dataset by hand is slow. Getting an LLM to generate training data sounds easy until you deal with:

  • Refusals showing up in your training set
  • Repetitive examples that add nothing
  • Models talking about generating data instead of actually doing it
  • Raw JSON extraction from markdown blocks
  • Deduplication against your seeds so the model doesn't just copy them

distilabel tried to solve this but became a pipeline framework. Alpaca_eval is evaluation-only. Self-instruct is a research repo, not a library.

castwright is the missing middle ground: a pip-installable library with a clean API, built-in quality filters, and output in every format fine-tuning frameworks expect.

What you get:

  • Pluggable LLM backends: OpenAI, Anthropic, or any OpenAI-compatible API
  • Six fast heuristic filters that catch bad generations before they hit your training set
  • Automatic dedup against your seed data
  • Output in Alpaca, ShareGPT, or OpenAI chat format
  • Multi-turn conversation generation
  • A CLI for quick generation runs without writing Python
  • Zero required dependencies (provider SDKs are optional extras)

Install

pip install castwright

With OpenAI support:

pip install castwright[openai]

With Anthropic support:

pip install castwright[anthropic]

Everything (both providers + CLI):

pip install castwright[all]

Seed file format

Create a JSONL file with your seed examples. You need at least a few good ones — castwright uses them to teach the LLM what you want:

{"instruction": "Explain the difference between TCP and UDP", "output": "TCP is a connection-oriented protocol that guarantees delivery..."}
{"instruction": "Write a Python function to flatten a nested list", "output": "def flatten(lst):\n    result = []\n    for item in lst:\n        if isinstance(item, list):\n            result.extend(flatten(item))\n        else:\n            result.append(item)\n    return result"}
{"instruction": "What causes a segfault?", "input": "In C/C++ programs", "output": "A segmentation fault occurs when a program tries to access memory..."}

Also accepts JSON arrays and prompt/response field names.

Usage

Basic generation

from castwright import generate, GenerationConfig, Seed
from castwright import OpenAIProvider

seeds = [
    Seed(instruction="Explain recursion", output="Recursion is when a function calls itself..."),
    Seed(instruction="What is a hash table?", output="A hash table is a data structure that maps keys to values..."),
]

provider = OpenAIProvider(model="gpt-4o-mini")
config = GenerationConfig(n=100, temperature=0.9, diversity_factor=0.7)

result = generate(seeds, provider, config)
print(f"Generated: {result.n_generated}, Filtered: {result.n_filtered}, Kept: {len(result.examples)}")

Output formats

from castwright import save_results, OutputFormat

# Alpaca format (default) — works with axolotl, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.ALPACA)

# ShareGPT format — works with FastChat, LLaMA-Factory
save_results(result, "data.jsonl", OutputFormat.SHAREGPT)

# OpenAI chat format — works with OpenAI fine-tuning API
save_results(result, "data.jsonl", OutputFormat.OPENAI)

Multi-turn conversations

from castwright import generate_multiturn, Seed
from castwright import OpenAIProvider

seeds = [Seed(instruction="Help me debug this Python code", output="Let me look at that...")]
provider = OpenAIProvider()

result = generate_multiturn(seeds, provider, n=50, turns=4)

Custom providers

Any OpenAI-compatible API works out of the box:

from castwright import OpenAIProvider

# vLLM, Ollama, Together, etc.
provider = OpenAIProvider(
    model="meta-llama/Llama-3-70B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

Quality filters

castwright applies six filters by default:

Filter What it catches
not_empty Blank instruction or output
min_length Instructions shorter than 10 characters
not_repetitive Output with >30% consecutive word repeats
not_refusal "I'm sorry, I can't..." responses
no_meta_talk "Here's an example..." meta-commentary
balanced_formatting Unclosed code blocks

You can also pass your own:

from castwright import filter_examples, GeneratedExample

def my_filter(ex: GeneratedExample) -> bool:
    return len(ex.output) > 100

filtered = filter_examples(result.examples, filters=[my_filter])

Generation config

GenerationConfig(
    n=100,                    # Number of examples to generate
    model="gpt-4o-mini",     # Model name (passed to provider)
    temperature=0.9,          # Sampling temperature (0.0-2.0)
    max_retries=3,            # Retries on parse failure
    diversity_factor=0.7,     # 0.0=similar to seeds, 1.0=very diverse
    output_format=OutputFormat.ALPACA,
)

CLI

# Generate from seed file
castwright gen seeds.jsonl -n 200 -m gpt-4o-mini -o output.jsonl --provider openai

# Use Anthropic
castwright gen seeds.jsonl -n 100 -m claude-sonnet-4-20250514 -o output.jsonl --provider anthropic

# Preview your seed examples
castwright preview seeds.jsonl

# Test without API calls
castwright gen seeds.jsonl -n 10 -o test.jsonl --provider mock

Comparison with alternatives

castwright distilabel self-instruct manual
pip install yes yes clone repo -
Simple API 3 lines pipeline DSL scripts -
Quality filters built-in separate step none human
Multi-provider OpenAI, Anthropic, any compatible varies OpenAI only -
Format output Alpaca, ShareGPT, OpenAI custom Alpaca any
Maintained active founders left archived -

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

Project What it does
tokonomics Token counting & cost management for LLM APIs
datacrux Training data quality — dedup, PII, contamination
datamix Dataset mixing & curriculum optimization
toksight Tokenizer analysis & comparison
trainpulse Training health monitoring
ckpt Checkpoint inspection, diffing & merging
quantbench Quantization quality analysis
infermark Inference benchmarking
modeldiff Behavioral regression testing
vibesafe AI-generated code safety scanner
injectionguard Prompt injection detection

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

castwright-0.2.0.tar.gz (36.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

castwright-0.2.0-py3-none-any.whl (22.5 kB view details)

Uploaded Python 3

File details

Details for the file castwright-0.2.0.tar.gz.

File metadata

  • Download URL: castwright-0.2.0.tar.gz
  • Upload date:
  • Size: 36.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for castwright-0.2.0.tar.gz
Algorithm Hash digest
SHA256 483ed365993842873b50691b057fb09727633f833a136215d63bd8c38f1823a5
MD5 c234e5ebe8fd1d2214bbf5ee760e3f93
BLAKE2b-256 e4167a30a9fae23c5e748ed881dc0ac9b584c45bc13acbb0ba345bf8eb75f95d

See more details on using hashes here.

File details

Details for the file castwright-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: castwright-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 22.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for castwright-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d8de8bbcfd6cc595e11e2614f96a49dc4bff4a38d0bcfde3eafe5828a9080cf7
MD5 f5bcdd11fc2b7ec470affa58e38c8872
BLAKE2b-256 b759fbe9e78f8e0416cb070ad733722e2749130467e1339ff48fad77c44ae431

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page