Skip to main content

Generate conversational, tool-calling, structured-output, and preference datasets — easily and at scale

Project description

AfterImage

Synthetic conversational dataset generation for LLM fine-tuning.

Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.

Tests Ruff format Ruff lint PyPI version PyPI downloads Python License Docs Medium


Demonstration of a typical conversational dataset generation, where Afterimage simulates both sides of the conversation.

AfterImage demo — Credit Risk Management Q&A Bot

Generating a document-grounded Q&A dataset from BIS credit risk principles → ShareGPT format


News

April 23, 2026 — OpenSimula

OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.

This module is not affiliated with Google and is not a reference port of internal systems—it is an independent take on the published Simula recipe.

Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative + monitoring notes: OpenSimula · autodoc: Simula / OpenSimula API.


Table of Contents


Why AfterImage

Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.

AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.

Your documents  +  LLM  →  Realistic, diverse, quality-filtered training data

What you get:

  • Multi-turn conversations that read like real interactions — not templated Q&A pairs
  • Document-grounded datasets tied to your corpus (RAG-style)
  • DPO / RLHF preference pairs without a single manual label
  • Data already formatted for the training framework you use

Features

Category What's included
Generation Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling
Preference Data DPO · RLHF · UltraFeedback · Anthropic HH · ORPO
Quality LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring
Providers Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp)
Export ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw
Storage JSONL (default) · SQLite · PostgreSQL · MySQL
Scale Async-first · Concurrent generation · Smart API key rotation with rate limiting
Observability Real-time metrics · Configurable alerts · HTML analytics reports
Interface CLI · Python API · FastAPI REST server · Gradio demo UI

Installation

If you want your agent to do it for you: Just copy and paste the following to your agent:

Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.

If you are doing it yourself:

pip install afterimage
# or with uv (recommended)

uv add afterimage

Requires Python 3.11+

Optional extras:

Extra What it adds
embeddings-local Local embeddings via sentence-transformers for Qdrant workflows and embedding-based quality checks
server FastAPI REST server (afterimage-server CLI entry point)
training PyTorch / TRL stack, Gradio UI, and training scripts under examples/
pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"

Quickstart — CLI

Set your API key and run one command:

export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml

Preview the plan without spending any API credits:

afterimage generate -c examples/configs/basic.yaml --dry-run

Export to your training framework:

# List all available formats
afterimage export --list-formats

# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca

# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9

# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset

Generate DPO preference pairs:

afterimage preference -c your_config.yaml

Analyze your dataset:

afterimage analyze -i output/dataset.jsonl -o report.html

Quickstart — Python API

The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.

Minimal conversation generation:

import asyncio
import os
from afterimage import ConversationGenerator

async def main():
    gen = ConversationGenerator(
        respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
        api_key=os.environ["GEMINI_API_KEY"],
        model_name="gemini-2.5-flash",
    )
    await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
    print(f"Generated {len(gen.load_conversations())} conversations.")

asyncio.run(main())

Document-grounded generation with personas:

import asyncio
import os
from afterimage import (
    ConversationGenerator,
    PersonaGenerator,
    PersonaInstructionGeneratorCallback,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

DOCUMENTS = [
    "Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
    "Key variables are grind size, water temperature (90–96 °C), and pour rate.",
    "Espresso is brewed at 9 bar pressure through finely-ground beans. "
    "It is the base for lattes, cappuccinos, and macchiatos.",
]

async def main():
    api_key = os.environ["GEMINI_API_KEY"]
    docs = InMemoryDocumentProvider(DOCUMENTS)

    # Generate diverse user personas from your documents
    persona_gen = PersonaGenerator(api_key=api_key)
    await persona_gen.generate_from_documents(docs)

    gen = ConversationGenerator(
        respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=PersonaInstructionGeneratorCallback(
            api_key=api_key,
            documents=docs,
            num_random_contexts=1,
        ),
        respondent_prompt_modifier=WithContextRespondentPromptModifier(),
    )

    await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)

asyncio.run(main())

Generate DPO preference pairs:

import asyncio
import os
from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge

async def main():
    api_key = os.environ["GEMINI_API_KEY"]

    base_gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
    )

    judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")

    pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
    await pref_gen.generate(num_pairs=200, max_concurrency=4)

asyncio.run(main())

More complete examples live under examples/. Full API reference is at afterimage.altai.dev.


Supported LLM Providers

Provider provider key Model examples Notes
Google Gemini gemini gemini-2.5-flash, gemini-2.0-flash Default in CLI configs
OpenAI openai gpt-4o, gpt-4o-mini Full API support
DeepSeek deepseek deepseek-chat, deepseek-reasoner Captures chain-of-thought reasoning
OpenRouter openrouter Any model via OpenRouter Access 100+ models with one key
Local local Any OpenAI-compatible server vLLM, Ollama, llama.cpp — zero API cost

Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).

Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:

from afterimage.key_management import SmartKeyPool
from afterimage import ConversationGenerator

pool = SmartKeyPool(["key_1", "key_2", "key_3"])

gen = ConversationGenerator(
    respondent_prompt="You are a helpful assistant.",
    api_key=pool,
    model_name="gemini-2.5-flash",
)

Export Formats

One command converts your raw JSONL to any fine-tuning format:

Format --format flag Target framework
ShareGPT sharegpt LLaMA Factory · FastChat · Axolotl
Alpaca alpaca LLaMA Factory · many community trainers
HuggingFace Messages messages TRL SFTTrainer · HuggingFace ecosystem
LLaMA Factory llama_factory LLaMA Factory native format
Oumi oumi Oumi training framework
OpenAI Fine-tune openai_finetune OpenAI fine-tuning API
DPO dpo TRL DPOTrainer · preference training
Raw raw Custom pipelines — minimal processing
# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9

How AfterImage Works

AfterImage runs a two-agent loop per dialog:

AfterImage Dialog-Level Workflow

  1. Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
  2. Respondent answers — with optional RAG context injected per turn
  3. Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
  4. Storage writes each dialog incrementally — crash-safe, resumable
  5. Export converts the raw JSONL to any training format in a single CLI command

Configuration Reference

The fastest path to generation is a YAML config:

# examples/configs/basic.yaml

generation:
  num_dialogs: 100
  max_turns: 4
  max_concurrency: 5

model:
  provider: gemini              # gemini | openai | deepseek | openrouter | local
  model_name: gemini-2.5-flash
  api_key_env: GEMINI_API_KEY   # environment variable name

respondent:
  system_prompt: |
    You are an expert assistant. Answer clearly and concisely.

# Optional: document grounding (RAG)
# documents:
#   provider: directory         # directory | file | jsonl | memory | qdrant
#   path: ./my_docs/

# Optional: persona diversity
# personas:
#   enabled: true

# Optional: context-grounded instruction generation
# context:
#   enabled: true
#   num_random_contexts: 2

# Optional: quality gate
# quality:
#   auto_improve: true

output:
  path: ./output/dataset.jsonl
  storage: jsonl                # jsonl | sql
# Validate config before running
afterimage validate -c examples/configs/basic.yaml

# Run
afterimage generate -c examples/configs/basic.yaml

All YAML options and their defaults are documented at afterimage.altai.dev.


Repository Layout

afterimage/              Core library
├── providers/           LLM, document, and embedding providers
├── callbacks/           Instruction generators, stopping criteria, prompt modifiers
├── evaluation/          LLM-as-judge and embedding-based evaluators
├── preference/          DPO / RLHF preference pair generation
├── integrations/        Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/           Dataset analytics engine and HTML report generator
└── server/              FastAPI REST server with SSE progress streaming

examples/
├── configs/             Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/       Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/             Gradio web UI — interactive generation + fine-tuning
└── *.py                 Python API usage examples

docs/                    Sphinx sources (hosted at afterimage.altai.dev)
tests/                   pytest suite — Python 3.11, 3.12, 3.13

Contributing

Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.

# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras

# Run the test suite
pytest

# Check style
ruff check .
ruff format .

Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.


License

Apache License 2.0


Built by Altai Dev · Documentation · PyPI · Blog

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

afterimage-0.15.0.tar.gz (201.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

afterimage-0.15.0-py3-none-any.whl (211.1 kB view details)

Uploaded Python 3

File details

Details for the file afterimage-0.15.0.tar.gz.

File metadata

  • Download URL: afterimage-0.15.0.tar.gz
  • Upload date:
  • Size: 201.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.19

File hashes

Hashes for afterimage-0.15.0.tar.gz
Algorithm Hash digest
SHA256 7849f666cf428302337b0498320f5a0f8ea217a35fa59ccaf836471cc93745e3
MD5 d08ad2127d94c4ed8653c8bb4d1e238a
BLAKE2b-256 c97f0ec291877c8dc9d8fea8db1d984d5ad966ec5c95e7835c4fddd9c04af4a5

See more details on using hashes here.

File details

Details for the file afterimage-0.15.0-py3-none-any.whl.

File metadata

File hashes

Hashes for afterimage-0.15.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9522ed942eeaaff8b28c8d8582208f33b36be9ab4e7aacc4dbc7eefda01ba027
MD5 7d2569b3b2b021a173b8884ac81ec02a
BLAKE2b-256 367c91973a3c96fa8cbc03d816047722ea7ac2f104e55d4390a669bdb385dda9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page