Generate conversational, tool-calling, structured-output, and preference datasets — easily and at scale
Project description
AfterImage
Synthetic conversational dataset generation for LLM fine-tuning.
Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.
Demonstration of a typical conversational dataset generation, where Afterimage simulates both sides of the conversation.
Generating a document-grounded Q&A dataset from BIS credit risk principles → ShareGPT format
News
April 23, 2026 — OpenSimula
OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.
This module is not affiliated with Google and is not a reference port of internal systems—it is an independent take on the published Simula recipe.
Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative + monitoring notes: OpenSimula · autodoc: Simula / OpenSimula API.
Table of Contents
- News
- Why AfterImage
- Features
- Installation
- Quickstart — CLI
- Quickstart — Python API
- Supported LLM Providers
- Export Formats
- How It Works
- Configuration Reference
- Repository Layout
- Contributing
- License
Why AfterImage
Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.
AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.
Your documents + LLM → Realistic, diverse, quality-filtered training data
What you get:
- Multi-turn conversations that read like real interactions — not templated Q&A pairs
- Document-grounded datasets tied to your corpus (RAG-style)
- DPO / RLHF preference pairs without a single manual label
- Data already formatted for the training framework you use
Features
| Category | What's included |
|---|---|
| Generation | Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling |
| Preference Data | DPO · RLHF · UltraFeedback · Anthropic HH · ORPO |
| Quality | LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring |
| Providers | Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp) |
| Export | ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw |
| Storage | JSONL (default) · SQLite · PostgreSQL · MySQL |
| Scale | Async-first · Concurrent generation · Smart API key rotation with rate limiting |
| Observability | Real-time metrics · Configurable alerts · HTML analytics reports |
| Interface | CLI · Python API · FastAPI REST server · Gradio demo UI |
Installation
If you want your agent to do it for you: Just copy and paste the following to your agent:
Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.
If you are doing it yourself:
pip install afterimage
# or with uv (recommended)
uv add afterimage
Requires Python 3.11+
Optional extras:
| Extra | What it adds |
|---|---|
embeddings-local |
Local embeddings via sentence-transformers for Qdrant workflows and embedding-based quality checks |
server |
FastAPI REST server (afterimage-server CLI entry point) |
training |
PyTorch / TRL stack, Gradio UI, and training scripts under examples/ |
pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"
Quickstart — CLI
Set your API key and run one command:
export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml
Preview the plan without spending any API credits:
afterimage generate -c examples/configs/basic.yaml --dry-run
Export to your training framework:
# List all available formats
afterimage export --list-formats
# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca
# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9
# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset
Generate DPO preference pairs:
afterimage preference -c your_config.yaml
Analyze your dataset:
afterimage analyze -i output/dataset.jsonl -o report.html
Quickstart — Python API
The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.
Minimal conversation generation:
import asyncio
import os
from afterimage import ConversationGenerator
async def main():
gen = ConversationGenerator(
respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
api_key=os.environ["GEMINI_API_KEY"],
model_name="gemini-2.5-flash",
)
await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
print(f"Generated {len(gen.load_conversations())} conversations.")
asyncio.run(main())
Document-grounded generation with personas:
import asyncio
import os
from afterimage import (
ConversationGenerator,
PersonaGenerator,
PersonaInstructionGeneratorCallback,
InMemoryDocumentProvider,
WithContextRespondentPromptModifier,
)
DOCUMENTS = [
"Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
"Key variables are grind size, water temperature (90–96 °C), and pour rate.",
"Espresso is brewed at 9 bar pressure through finely-ground beans. "
"It is the base for lattes, cappuccinos, and macchiatos.",
]
async def main():
api_key = os.environ["GEMINI_API_KEY"]
docs = InMemoryDocumentProvider(DOCUMENTS)
# Generate diverse user personas from your documents
persona_gen = PersonaGenerator(api_key=api_key)
await persona_gen.generate_from_documents(docs)
gen = ConversationGenerator(
respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
api_key=api_key,
model_name="gemini-2.5-flash",
instruction_generator_callback=PersonaInstructionGeneratorCallback(
api_key=api_key,
documents=docs,
num_random_contexts=1,
),
respondent_prompt_modifier=WithContextRespondentPromptModifier(),
)
await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)
asyncio.run(main())
Generate DPO preference pairs:
import asyncio
import os
from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge
async def main():
api_key = os.environ["GEMINI_API_KEY"]
base_gen = ConversationGenerator(
respondent_prompt="You are a helpful assistant.",
api_key=api_key,
model_name="gemini-2.5-flash",
)
judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")
pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
await pref_gen.generate(num_pairs=200, max_concurrency=4)
asyncio.run(main())
More complete examples live under examples/. Full API reference is at afterimage.altai.dev.
Supported LLM Providers
| Provider | provider key |
Model examples | Notes |
|---|---|---|---|
| Google Gemini | gemini |
gemini-2.5-flash, gemini-2.0-flash |
Default in CLI configs |
| OpenAI | openai |
gpt-4o, gpt-4o-mini |
Full API support |
| DeepSeek | deepseek |
deepseek-chat, deepseek-reasoner |
Captures chain-of-thought reasoning |
| OpenRouter | openrouter |
Any model via OpenRouter | Access 100+ models with one key |
| Local | local |
Any OpenAI-compatible server | vLLM, Ollama, llama.cpp — zero API cost |
Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).
Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:
from afterimage.key_management import SmartKeyPool
from afterimage import ConversationGenerator
pool = SmartKeyPool(["key_1", "key_2", "key_3"])
gen = ConversationGenerator(
respondent_prompt="You are a helpful assistant.",
api_key=pool,
model_name="gemini-2.5-flash",
)
Export Formats
One command converts your raw JSONL to any fine-tuning format:
| Format | --format flag |
Target framework |
|---|---|---|
| ShareGPT | sharegpt |
LLaMA Factory · FastChat · Axolotl |
| Alpaca | alpaca |
LLaMA Factory · many community trainers |
| HuggingFace Messages | messages |
TRL SFTTrainer · HuggingFace ecosystem |
| LLaMA Factory | llama_factory |
LLaMA Factory native format |
| Oumi | oumi |
Oumi training framework |
| OpenAI Fine-tune | openai_finetune |
OpenAI fine-tuning API |
| DPO | dpo |
TRL DPOTrainer · preference training |
| Raw | raw |
Custom pipelines — minimal processing |
# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9
How AfterImage Works
AfterImage runs a two-agent loop per dialog:
- Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
- Respondent answers — with optional RAG context injected per turn
- Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
- Storage writes each dialog incrementally — crash-safe, resumable
- Export converts the raw JSONL to any training format in a single CLI command
Configuration Reference
The fastest path to generation is a YAML config:
# examples/configs/basic.yaml
generation:
num_dialogs: 100
max_turns: 4
max_concurrency: 5
model:
provider: gemini # gemini | openai | deepseek | openrouter | local
model_name: gemini-2.5-flash
api_key_env: GEMINI_API_KEY # environment variable name
respondent:
system_prompt: |
You are an expert assistant. Answer clearly and concisely.
# Optional: document grounding (RAG)
# documents:
# provider: directory # directory | file | jsonl | memory | qdrant
# path: ./my_docs/
# Optional: persona diversity
# personas:
# enabled: true
# Optional: context-grounded instruction generation
# context:
# enabled: true
# num_random_contexts: 2
# Optional: quality gate
# quality:
# auto_improve: true
output:
path: ./output/dataset.jsonl
storage: jsonl # jsonl | sql
# Validate config before running
afterimage validate -c examples/configs/basic.yaml
# Run
afterimage generate -c examples/configs/basic.yaml
All YAML options and their defaults are documented at afterimage.altai.dev.
Repository Layout
afterimage/ Core library
├── providers/ LLM, document, and embedding providers
├── callbacks/ Instruction generators, stopping criteria, prompt modifiers
├── evaluation/ LLM-as-judge and embedding-based evaluators
├── preference/ DPO / RLHF preference pair generation
├── integrations/ Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/ Dataset analytics engine and HTML report generator
└── server/ FastAPI REST server with SSE progress streaming
examples/
├── configs/ Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/ Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/ Gradio web UI — interactive generation + fine-tuning
└── *.py Python API usage examples
docs/ Sphinx sources (hosted at afterimage.altai.dev)
tests/ pytest suite — Python 3.11, 3.12, 3.13
Contributing
Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.
# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras
# Run the test suite
pytest
# Check style
ruff check .
ruff format .
Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.
License
Built by Altai Dev · Documentation · PyPI · Blog
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file afterimage-0.15.1.tar.gz.
File metadata
- Download URL: afterimage-0.15.1.tar.gz
- Upload date:
- Size: 202.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a1c2059f0c5b41c4de022af31a9e3af1e3d6b13969ebb3492aa32ebf238a1bcf
|
|
| MD5 |
2dc8ee2e986bfa0884874d76c0da5a68
|
|
| BLAKE2b-256 |
f49f2254d4c219116a4e9de395520cae30382727e1ef7d6de59f2dfe807c3790
|
File details
Details for the file afterimage-0.15.1-py3-none-any.whl.
File metadata
- Download URL: afterimage-0.15.1-py3-none-any.whl
- Upload date:
- Size: 211.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eabbba6fb97fa9a39ae63255e10d3445476da65822e1d706c353f77bd6f328ce
|
|
| MD5 |
baa75750854e8354fb1f6b5893e790ab
|
|
| BLAKE2b-256 |
90369f8724be264496c70573d1c9af3411203a1b6608094e488734b08eada3a1
|