Teich

Turn coding agent sessions into training data.
Run codex or pi to capture raw traces, or use chat mode to generate text-only training rows directly.
Easily format, filter, combine, and mask any supported dataset for supervised fine-tuning (SFT).
⚡ Quick Start
```bash
pip install teich
teich init my-project && cd my-project
teich generate -c config.yaml
```

Or use Astral's uv:

```bash
uvx teich init my-project && cd my-project
uvx teich generate -c config.yaml
```
Be sure to edit your `config.yaml` and `prompts.csv` files as needed.
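The prompts file is plain CSV, so it can also be generated programmatically. A minimal sketch with the standard library (the single `prompt` column name is an assumption — check the `prompts.csv` that `teich init` scaffolds for the real header):

```python
import csv

# Hypothetical tasks; the "prompt" column name is an assumption, not
# confirmed by Teich's docs — verify against the scaffolded prompts.csv.
rows = [
    {"prompt": "Add a --verbose flag to the CLI entry point"},
    {"prompt": "Write unit tests for the config loader"},
]

with open("prompts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt"])
    writer.writeheader()
    writer.writerows(rows)
```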
⭐ What Teich Does
- Trace-first data collection: Run real coding agents and keep the raw session traces when you want full fidelity
- Multi-agent support: Works with Codex, Pi, and a text-only chat mode
- Structured output: Converts traces into chat messages with tool calls, reasoning, and tool results, or emits ready-to-train chat rows directly
- SFT-ready formatting: Applies chat templates and creates assistant masks for supervised fine-tuning
- Hugging Face integration: Load raw traces or structured JSONL datasets from local folders, files, or dataset repos
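Conceptually, assistant masking means the training loss is computed only on the assistant's turns. A toy character-level illustration of the idea (not Teich's implementation — real pipelines mask at the token level after applying the model's chat template):

```python
def render_with_mask(messages):
    """Render messages to one string plus a 0/1 mask over its characters.

    Positions belonging to assistant turns get 1 (trainable); everything
    else gets 0. Toy illustration only.
    """
    text, mask = "", []
    for msg in messages:
        chunk = f"<|{msg['role']}|>{msg['content']}\n"
        text += chunk
        flag = 1 if msg["role"] == "assistant" else 0
        mask.extend([flag] * len(chunk))
    return text, mask

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]
text, mask = render_with_mask(messages)
# Only the assistant chunk contributes trainable positions.
```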
📥 Prerequisites
Requirements for agent trace generation:
- Docker
- OpenAI/OpenRouter API key (or local OpenAI-compatible endpoint)
`agent.provider: chat` does not require Docker. The Python utilities also work without Docker if you already have traces or structured JSONL datasets.
🚀 Usage
Generate traces from prompts
```bash
# Initialize project
teich init my-project
cd my-project

# Add prompts to prompts.csv, then:
export OPENAI_API_KEY=sk-...
teich generate -c config.yaml
```
Outputs:
- codex/pi: raw traces in `output/`, sandboxes in `sandbox/`, and a `README.md`
- chat: text-only JSONL training rows in `output/` and a dataset `README.md`
If `publish.repo_id` is configured, Teich also creates or updates the matching Hugging Face dataset repo and uploads the generated JSONL, README, and `tools.json` automatically.
Generate a text-only chat dataset
```yaml
agent:
  provider: chat
  model:
    model: gpt-4.1-mini
  api:
    provider: openai
    wire_api: responses
```
Each generated JSONL line will look like:
```json
{"messages":[{"role":"system","content":"You are a helpful assistant","thinking":null},{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"system":"You are a helpful assistant","prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}
```
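Since each line is standalone JSON, the output can be consumed with the standard library alone. A minimal sketch parsing one row in the format shown above:

```python
import json

# One JSONL row in the documented format.
line = '{"messages":[{"role":"system","content":"You are a helpful assistant","thinking":null},{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"system":"You are a helpful assistant","prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}'

row = json.loads(line)
# The messages list holds the full turn structure; convenience fields
# like "prompt" and "response" sit at the top level.
assistant_turns = [m for m in row["messages"] if m["role"] == "assistant"]
```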
Load and format for training
```python
from teich import load_traces, format_and_mask

# Load from local folder, local file, or HF dataset
tool_dataset = load_traces("badlogicgames/pi-mono", split="train")
chat_dataset = load_traces("./chat-output/chat.jsonl")

# Apply chat template and create masks across multiple datasets
training_data = format_and_mask(
    [tool_dataset, chat_dataset],
    tokenizer,
    chat_template_kwargs={"enable_thinking": True},
)

# Preview a formatted example
print(training_data.preview())
```
Manual tokenizer flow with load_traces
```python
from teich import load_traces

dataset = load_traces("./output")
example = dataset[0]

rendered = tokenizer.apply_chat_template(
    example["messages"],
    tools=example.get("tools") or [],
    tokenize=False,
    add_generation_prompt=False,
    enable_thinking=True,
)
tokenized = tokenizer(rendered, truncation=True, max_length=32768)
```
📋 Configuration
config.yaml:
```yaml
agent:
  provider: codex  # or pi or chat
  model:
    model: codex-mini-latest
  approval_policy: never
  sandbox: danger-full-access

prompts_file: prompts.csv

output:
  traces_dir: ./output
  sandbox_dir: ./sandbox
  pretty_name: "My Agent Traces"

publish:
  repo_id: armand0e/my-dataset
  hf_token: hf_xxx
  private: false
```
Dataset tags are auto-generated from the provider and model:
- codex/pi: `agent-traces`, `<provider>`, `distillation`, `<model>`, `teich`
- chat: `conversational`, `distillation`, `teich`, `<model>`
If `publish.hf_token` is omitted, Teich also accepts `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or `TEICH_HF_TOKEN` from the environment.
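That fallback order can be sketched as a simple environment lookup (illustrative only; the exact precedence inside Teich may differ):

```python
import os

def resolve_hf_token(config_token=None):
    """Return the first available Hugging Face token.

    Mirrors the documented fallback: an explicit config value first,
    then HF_TOKEN, HUGGINGFACE_HUB_TOKEN, and TEICH_HF_TOKEN.
    This is a sketch of the documented behavior, not Teich's code.
    """
    if config_token:
        return config_token
    for var in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN", "TEICH_HF_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```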
Local providers (LM Studio, Ollama)
```bash
export TEICH_PROVIDER=LMstudio
export TEICH_MODEL=gemma-4
export TEICH_BASE_URL=http://localhost:1234/v1
export TEICH_API_KEY=llm
teich generate -c config.yaml
```
🏗️ Data Structure
Training examples include:
- prompt: initial task description
- messages: chat history (system, user, assistant, tool)
- tools: tool schemas used in the session
- metadata: session info, model, timestamps, and usage when available
Structured chat datasets can also include convenience top-level fields like:
`system`, `thinking`, `response`, and `model`.
Assistant messages capture:
- content: text response
- reasoning_content: chain-of-thought traces
- tool_calls: function calls with arguments
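Put together, a structured example has roughly this shape. An illustrative sketch in plain dataclasses (not Teich's actual `TrainingExample` type — field names follow the lists above):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class AssistantMessageSketch:
    """Illustrative shape of an assistant turn in a structured example."""
    content: str
    reasoning_content: Optional[str] = None              # chain-of-thought trace
    tool_calls: list[dict[str, Any]] = field(default_factory=list)

@dataclass
class TrainingExampleSketch:
    """Illustrative shape of one training example."""
    prompt: str                                          # initial task description
    messages: list[dict[str, Any]]                       # full chat history
    tools: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```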
🔧 Python API
```python
from teich import (
    load_traces,      # Load from folder, file, or HF dataset
    format_and_mask,  # Apply chat template + assistant masks
    Config,           # Load config.yaml
    TrainingExample,  # Typed training example
)
```
📦 Trace-First Workflow
Teich preserves the raw agent session as the source of truth:
- Collect: Run agents on real tasks → raw `.jsonl` traces
- Inspect/Share: Traces are human-readable and uploadable
- Convert: Transform to structured examples when ready
- Format: Apply model-specific chat templates for training
If you choose agent.provider: chat, Teich skips the trace-preservation step and writes structured text-only JSONL rows directly.
This means you can:
- Re-convert with different logic later
- Share raw traces before releasing training data
- Train on the same sessions with different model templates
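For instance, the same raw messages can be rendered under two different template conventions without touching the trace. Toy renderers for illustration (not real model chat templates):

```python
def render_chatml(messages):
    """ChatML-style rendering (toy version of that convention)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def render_plain(messages):
    """Plain 'Role: text' rendering."""
    return "".join(f"{m['role'].title()}: {m['content']}\n" for m in messages)

# One trace, two template targets.
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]
```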
🛠️ Development
```bash
uv pip install -e ".[dev]"
pytest tests/test_formatter.py tests/test_loader.py -q
```
📌 Status
Teich is alpha. The core workflow is stable and usable. APIs may evolve as more agent types and training workflows are added.
📄 License
Apache-2.0