Generate conversational, tool-calling, structured-output, and preference datasets — easily and at scale

These details have not been verified by PyPI

Project links

Project description

AfterImage

Synthetic conversational dataset generation for LLM fine-tuning.

Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.

Demonstration of a typical conversational dataset generation, where Afterimage simulates both sides of the conversation.

AfterImage demo — Credit Risk Management Q&A Bot

Generating a document-grounded Q&A dataset from BIS credit risk principles → ShareGPT format

News

April 23, 2026 — OpenSimula

OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.

This module is not affiliated with Google and is not a reference port of internal systems—it is an independent take on the published Simula recipe.

Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative + monitoring notes: OpenSimula · autodoc: Simula / OpenSimula API.

News
Why AfterImage
Features
Installation
Quickstart — CLI
Quickstart — Python API
Supported LLM Providers
Export Formats
How It Works
Configuration Reference
Repository Layout
Contributing
License

Why AfterImage

Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.

AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.

Your documents  +  LLM  →  Realistic, diverse, quality-filtered training data

What you get:

Multi-turn conversations that read like real interactions — not templated Q&A pairs
Document-grounded datasets tied to your corpus (RAG-style)
DPO / RLHF preference pairs without a single manual label
Data already formatted for the training framework you use

Features

Category	What's included
Generation	Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling
Preference Data	DPO · RLHF · UltraFeedback · Anthropic HH · ORPO
Quality	LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring
Providers	Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp)
Export	ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw
Storage	JSONL (default) · SQLite · PostgreSQL · MySQL
Scale	Async-first · Concurrent generation · Smart API key rotation with rate limiting
Observability	Real-time metrics · Configurable alerts · HTML analytics reports
Interface	CLI · Python API · FastAPI REST server · Gradio demo UI

Installation

If you want your agent to do it for you: Just copy and paste the following to your agent:

Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.

If you are doing it yourself:

pip install afterimage
# or with uv (recommended)

uv add afterimage

Requires Python 3.11+

Optional extras:

Extra	What it adds
`embeddings-local`	Local embeddings via `sentence-transformers` for Qdrant workflows and embedding-based quality checks
`server`	FastAPI REST server (`afterimage-server` CLI entry point)
`training`	PyTorch / TRL stack, Gradio UI, and training scripts under `examples/`

pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"

Quickstart — CLI

Set your API key and run one command:

export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml

Preview the plan without spending any API credits:

afterimage generate -c examples/configs/basic.yaml --dry-run

Export to your training framework:

# List all available formats
afterimage export --list-formats

# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca

# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9

# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset

Generate DPO preference pairs:

afterimage preference -c your_config.yaml

Analyze your dataset:

afterimage analyze -i output/dataset.jsonl -o report.html

Quickstart — Python API

The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.

Minimal conversation generation:

import asyncio
import os
from afterimage import ConversationGenerator

async def main():
    gen = ConversationGenerator(
        respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
        api_key=os.environ["GEMINI_API_KEY"],
        model_name="gemini-2.5-flash",
    )
    await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
    print(f"Generated {len(gen.load_conversations())} conversations.")

asyncio.run(main())

Document-grounded generation with personas:

import asyncio
import os
from afterimage import (
    ConversationGenerator,
    PersonaGenerator,
    PersonaInstructionGeneratorCallback,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

DOCUMENTS = [
    "Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
    "Key variables are grind size, water temperature (90–96 °C), and pour rate.",
    "Espresso is brewed at 9 bar pressure through finely-ground beans. "
    "It is the base for lattes, cappuccinos, and macchiatos.",
]

async def main():
    api_key = os.environ["GEMINI_API_KEY"]
    docs = InMemoryDocumentProvider(DOCUMENTS)

    # Generate diverse user personas from your documents
    persona_gen = PersonaGenerator(api_key=api_key)
    await persona_gen.generate_from_documents(docs)

    gen = ConversationGenerator(
        respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=PersonaInstructionGeneratorCallback(
            api_key=api_key,
            documents=docs,
            num_random_contexts=1,
        ),
        respondent_prompt_modifier=WithContextRespondentPromptModifier(),
    )

    await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)

asyncio.run(main())

Generate DPO preference pairs:

import asyncio
import os
from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge

async def main():
    api_key = os.environ["GEMINI_API_KEY"]

    base_gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
    )

    judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")

    pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
    await pref_gen.generate(num_pairs=200, max_concurrency=4)

asyncio.run(main())

More complete examples live under examples/. Full API reference is at afterimage.altai.dev.

Supported LLM Providers

Provider	`provider` key	Model examples	Notes
Google Gemini	`gemini`	`gemini-2.5-flash`, `gemini-2.0-flash`	Default in CLI configs
OpenAI	`openai`	`gpt-4o`, `gpt-4o-mini`	Full API support
DeepSeek	`deepseek`	`deepseek-chat`, `deepseek-reasoner`	Captures chain-of-thought reasoning
OpenRouter	`openrouter`	Any model via OpenRouter	Access 100+ models with one key
Local	`local`	Any OpenAI-compatible server	vLLM, Ollama, llama.cpp — zero API cost

Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).

Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:

from afterimage.key_management import SmartKeyPool
from afterimage import ConversationGenerator

pool = SmartKeyPool(["key_1", "key_2", "key_3"])

gen = ConversationGenerator(
    respondent_prompt="You are a helpful assistant.",
    api_key=pool,
    model_name="gemini-2.5-flash",
)

Export Formats

One command converts your raw JSONL to any fine-tuning format:

Format	`--format` flag	Target framework
ShareGPT	`sharegpt`	LLaMA Factory · FastChat · Axolotl
Alpaca	`alpaca`	LLaMA Factory · many community trainers
HuggingFace Messages	`messages`	TRL `SFTTrainer` · HuggingFace ecosystem
LLaMA Factory	`llama_factory`	LLaMA Factory native format
Oumi	`oumi`	Oumi training framework
OpenAI Fine-tune	`openai_finetune`	OpenAI fine-tuning API
DPO	`dpo`	TRL `DPOTrainer` · preference training
Raw	`raw`	Custom pipelines — minimal processing

# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9

How AfterImage Works

AfterImage runs a two-agent loop per dialog:

AfterImage Dialog-Level Workflow

Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
Respondent answers — with optional RAG context injected per turn
Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
Storage writes each dialog incrementally — crash-safe, resumable
Export converts the raw JSONL to any training format in a single CLI command

Configuration Reference

The fastest path to generation is a YAML config:

# examples/configs/basic.yaml

generation:
  num_dialogs: 100
  max_turns: 4
  max_concurrency: 5

model:
  provider: gemini              # gemini | openai | deepseek | openrouter | local
  model_name: gemini-2.5-flash
  api_key_env: GEMINI_API_KEY   # environment variable name

respondent:
  system_prompt: |
    You are an expert assistant. Answer clearly and concisely.

# Optional: document grounding (RAG)
# documents:
#   provider: directory         # directory | file | jsonl | memory | qdrant
#   path: ./my_docs/

# Optional: persona diversity
# personas:
#   enabled: true

# Optional: context-grounded instruction generation
# context:
#   enabled: true
#   num_random_contexts: 2

# Optional: quality gate
# quality:
#   auto_improve: true

output:
  path: ./output/dataset.jsonl
  storage: jsonl                # jsonl | sql

# Validate config before running
afterimage validate -c examples/configs/basic.yaml

# Run
afterimage generate -c examples/configs/basic.yaml

All YAML options and their defaults are documented at afterimage.altai.dev.

Repository Layout

afterimage/              Core library
├── providers/           LLM, document, and embedding providers
├── callbacks/           Instruction generators, stopping criteria, prompt modifiers
├── evaluation/          LLM-as-judge and embedding-based evaluators
├── preference/          DPO / RLHF preference pair generation
├── integrations/        Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/           Dataset analytics engine and HTML report generator
└── server/              FastAPI REST server with SSE progress streaming

examples/
├── configs/             Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/       Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/             Gradio web UI — interactive generation + fine-tuning
└── *.py                 Python API usage examples

docs/                    Sphinx sources (hosted at afterimage.altai.dev)
tests/                   pytest suite — Python 3.11, 3.12, 3.13

Contributing

Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.

# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras

# Run the test suite
pytest

# Check style
ruff check .
ruff format .

Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.

License

Apache License 2.0

Built by Altai Dev · Documentation · PyPI · Blog

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.16.0

May 13, 2026

This version

0.15.1

Apr 29, 2026

0.15.0

Apr 23, 2026

0.14.4

Apr 13, 2026

0.14.3

Apr 13, 2026

0.14.2

Apr 12, 2026

0.14.1

Apr 12, 2026

0.13.0

Apr 9, 2026

0.1.1

Jan 7, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

afterimage-0.15.1.tar.gz (202.1 kB view details)

Uploaded Apr 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

afterimage-0.15.1-py3-none-any.whl (211.9 kB view details)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file afterimage-0.15.1.tar.gz.

File metadata

Download URL: afterimage-0.15.1.tar.gz
Upload date: Apr 29, 2026
Size: 202.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.19

File hashes

Hashes for afterimage-0.15.1.tar.gz
Algorithm	Hash digest
SHA256	`a1c2059f0c5b41c4de022af31a9e3af1e3d6b13969ebb3492aa32ebf238a1bcf`
MD5	`2dc8ee2e986bfa0884874d76c0da5a68`
BLAKE2b-256	`f49f2254d4c219116a4e9de395520cae30382727e1ef7d6de59f2dfe807c3790`

See more details on using hashes here.

File details

Details for the file afterimage-0.15.1-py3-none-any.whl.

File metadata

Download URL: afterimage-0.15.1-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 211.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.19

File hashes

Hashes for afterimage-0.15.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eabbba6fb97fa9a39ae63255e10d3445476da65822e1d706c353f77bd6f328ce`
MD5	`baa75750854e8354fb1f6b5893e790ab`
BLAKE2b-256	`90369f8724be264496c70573d1c9af3411203a1b6608094e488734b08eada3a1`

See more details on using hashes here.

afterimage 0.15.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

AfterImage

News

April 23, 2026 — OpenSimula

Table of Contents

Why AfterImage

Features

Installation

Quickstart — CLI

Quickstart — Python API

Supported LLM Providers

Export Formats

How AfterImage Works

Configuration Reference

Repository Layout

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes