

Teich

Turn coding agent sessions into auditable supervised fine-tuning data.


Run codex or pi to capture raw coding-agent traces, or use chat mode to generate text-only training rows directly.

Load local folders, local files, or Hugging Face dataset repos; normalize them into messages/tools; and prepare pre-tokenized, audited SFT datasets with a Teich-owned data collator.

⚡ Quick Start

pip install teich
teich init my-project && cd my-project
teich generate -c config.yaml

Or use uvx (from the astral-uv package):

uvx teich init my-project && cd my-project
uvx teich generate -c config.yaml

Be sure to edit your config.yaml and prompts.csv files as needed.

⭐ What Teich Does

  • Trace-first data collection: Run real coding agents and keep raw session traces as the source of truth
  • Multi-agent support: Works with Codex, Pi, and a text-only chat mode
  • Structured conversion: Converts traces into chat messages with tool calls, reasoning, tool results, metadata, and configured tool snapshots
  • SFT-ready preparation: Applies tokenizer chat templates, masks labels, builds a Teich collator, and audits the dataset before training
  • Hugging Face integration: Publishes dataset cards plus tools.json, and loads local or Hub datasets through one API

📥 Prerequisites

Requirements for agent trace generation:

  • Docker
  • OpenAI/OpenRouter API key (or local OpenAI-compatible endpoint)

agent.provider: chat does not require Docker. The Python utilities also work without Docker if you already have traces or structured JSONL datasets.

Training examples use your existing finetuning stack. For the TRL example below, install compatible versions of transformers, trl, and your model-loading stack separately.

🚀 Usage

Generate traces from prompts

# Initialize project
teich init my-project
cd my-project

# Add prompts to prompts.csv, then:
export OPENAI_API_KEY=sk-...
teich generate -c config.yaml

Outputs:

  • codex / pi: raw traces in output/, sandboxes in sandbox/, and a README.md
  • chat: text-only JSONL training rows in output/ and a dataset README.md

If publish.repo_id is configured, Teich also creates or updates the matching Hugging Face dataset repo and uploads the generated JSONL, README, and tools.json automatically.

If a long run is interrupted, use:

teich generate -c config.yaml --resume

Teich scans the existing outputs and skips prompts that have already been converted into completed training examples.

Prompt files can be CSV, text, JSONL/NDJSON, or JSON. JSONL is recommended for very long or multiline prompts.
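To see why JSONL is the safer choice for multiline prompts, consider this standard-library sketch. The single prompt field per row is an assumption for illustration; check the template that teich init generates for the actual expected columns/fields.

```python
import json

# Hypothetical multiline prompts; the field name "prompt" is an assumption,
# not necessarily Teich's required schema.
prompts = [
    {"prompt": "Add a --verbose flag to the CLI.\nUpdate the README accordingly."},
    {"prompt": "Fix the failing test in tests/test_io.py"},
]

# One JSON object per line: newlines inside a prompt stay escaped, so
# multiline tasks survive round-tripping unambiguously (unlike naive CSV).
with open("prompts.jsonl", "w") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")

with open("prompts.jsonl") as f:
    loaded = [json.loads(line) for line in f]

assert loaded == prompts
```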

Generate a text-only chat dataset

agent:
  provider: chat

model:
  model: gpt-4.1-mini

api:
  provider: openai
  wire_api: responses

Each generated JSONL line will look like:

{"messages":[{"role":"system","content":"You are a helpful assistant","thinking":null},{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"system":"You are a helpful assistant","prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}
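Because each line is plain JSON, rows like the one above can be consumed with the standard library alone; a minimal sketch, independent of Teich's own loaders:

```python
import json

# The example row from above, verbatim.
line = '{"messages":[{"role":"system","content":"You are a helpful assistant","thinking":null},{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"system":"You are a helpful assistant","prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}'

row = json.loads(line)

# The convenience top-level fields mirror the messages list.
assert row["prompt"] == row["messages"][1]["content"]
assert row["response"] == row["messages"][-1]["content"]
```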

Prepare for training

from teich import prepare_data

train_dataset = prepare_data(
    ["username/chat-traces", "username/tool-traces"],
    tokenizer,
    max_length=32768,
    drop_oversized_examples=True,
    chat_template_kwargs={"enable_thinking": True},
)

prepare_data loads local folders, local files, Hugging Face datasets, or a list mixing any of those with already-loaded datasets.Dataset objects. It applies the tokenizer chat template, optionally tokenizes only to drop rows above max_length, and returns trainer-friendly text rows with Teich supervision metadata for multi-turn/tool-aware masking. Mixed chat-only and tool-call datasets are formatted separately before concatenation, so their schemas do not need to match beyond the normalized messages/tools fields.

Train with TRL SFTTrainer

Let SFTTrainer tokenize the text field, then call mask_data to apply Teich's response-only labels to the trainer dataset:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
from teich import mask_data, prepare_data

model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

train_dataset = prepare_data(
    "badlogicgames/pi-mono",
    tokenizer,
    max_length=32768,
    drop_oversized_examples=True,
    chat_template_kwargs={"enable_thinking": True},
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=SFTConfig(
        dataset_text_field="text",
        dataset_num_proc=1,
        max_length=32768,
        packing=False,
        output_dir="outputs",
        per_device_train_batch_size=1,
    ),
)
trainer = mask_data(trainer, tokenizer=tokenizer)
trainer.train()

mask_data follows the same trainer-first shape as Unsloth's response-only helper, but uses Teich's span metadata so multi-turn tool calls and tool responses are masked correctly. Keep packing=False for this flow because packed datasets merge row boundaries before masking.
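Conceptually, response-only masking replaces the labels of every token outside an assistant-response span with -100 (the index cross-entropy loss ignores in transformers), so prompts and tool results contribute no gradient. A minimal sketch of the idea, not Teich's actual implementation, which derives the spans from its stored metadata:

```python
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss

def mask_labels(input_ids, response_spans):
    """Keep labels only inside half-open (start, end) token spans; mask the rest."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy example: tokens 0-4 are prompt/template, tokens 5-8 the assistant reply.
input_ids = [101, 7, 8, 9, 102, 21, 22, 23, 24]
labels = mask_labels(input_ids, [(5, 9)])
assert labels == [-100] * 5 + [21, 22, 23, 24]
```

With packing enabled, several rows would share one sequence and the per-row span offsets would no longer line up, which is why packing=False matters here.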

Advanced load and format flow

from teich import format_and_mask, load_traces

tool_dataset = load_traces("badlogicgames/pi-mono", split="train")
chat_dataset = load_traces("./chat-output/chat.jsonl")

training_data = format_and_mask(
    [tool_dataset, chat_dataset],
    tokenizer,
    max_length=32768,
    chat_template_kwargs={"enable_thinking": True},
    strict=True,
)

Manual tokenizer flow with load_traces

from teich import load_traces

dataset = load_traces("./output")
example = dataset[0]

rendered = tokenizer.apply_chat_template(
    example["messages"],
    tools=example.get("tools") or [],
    tokenize=False,
    add_generation_prompt=False,
    enable_thinking=True,
)
tokenized = tokenizer(rendered, truncation=True, max_length=32768)

📋 Configuration

config.yaml:

agent:
  provider: codex  # or pi or chat

model:
  model: codex-mini-latest
  approval_policy: never
  sandbox: danger-full-access

prompts_file: prompts.csv

output:
  traces_dir: ./output
  sandbox_dir: ./sandbox
  pretty_name: "My Agent Traces"

publish:
  repo_id: armand0e/my-dataset
  hf_token: hf_xxx
  private: false

Dataset tags are auto-generated from the provider and model:

  • codex / pi: agent-traces, <provider>, distillation, <model>, teich
  • chat: conversational, distillation, teich, <model>

If publish.hf_token is omitted, Teich also accepts HF_TOKEN, HUGGINGFACE_HUB_TOKEN, or TEICH_HF_TOKEN from the environment.
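That fallback behaves like a simple environment lookup; here is a sketch of the behavior described above (not Teich's internals, and the relative precedence among the three variables shown is an assumption):

```python
import os

def resolve_hf_token(config_token=None):
    """Return the config token if set, else the first matching env var.

    The ordering of the env vars below is an assumption; the docs only say
    all three are accepted.
    """
    if config_token:
        return config_token
    for var in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN", "TEICH_HF_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```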

Local providers (LM Studio, Ollama)

export TEICH_PROVIDER=LMstudio
export TEICH_MODEL=gemma-4
export TEICH_BASE_URL=http://localhost:1234/v1
export TEICH_API_KEY=llm

teich generate -c config.yaml

🏗️ Data Structure

Training examples include:

  • prompt: initial task description
  • messages: chat history (system, user, assistant, tool)
  • tools: tool schemas used in the session
  • metadata: session info, model, timestamps, and usage when available

Structured chat datasets can also include convenience top-level fields like:

  • system
  • thinking
  • response
  • model

Assistant messages capture:

  • content: text response
  • reasoning_content: chain-of-thought traces
  • tool_calls: function calls with arguments
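Put together, a single assistant message in the structured format might look like the dict below. The values are illustrative, and the exact tool_calls layout is an assumption modeled on the common OpenAI-style schema (JSON-string arguments); verify it against your own converted traces.

```python
import json

# Illustrative assistant message; field layout of tool_calls is an assumption.
assistant_message = {
    "role": "assistant",
    "content": "I'll list the repository files first.",
    "reasoning_content": "Need to inspect the repo before editing anything.",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "shell", "arguments": '{"command": ["ls", "-la"]}'},
        }
    ],
}

# Arguments arrive as a JSON string and must be decoded before use.
args = json.loads(assistant_message["tool_calls"][0]["function"]["arguments"])
assert args["command"][0] == "ls"
```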

🔧 Python API

from teich import (
    prepare_sft_dataset, # Load, format, mask, collate, and audit for SFT
    TeichDataCollator,   # Collator for pre-tokenized Teich SFT data
    load_traces,         # Load from folder, file, or HF dataset
    format_and_mask,     # Apply chat template + assistant masks
    preview_sft_example, # Preview supervised vs masked tokens
    Config,              # Load config.yaml
    TrainingExample,     # Typed training example
)

This README.md is the package readme published to PyPI, so the examples above are the canonical public package docs.

📦 Trace-First Workflow

Teich preserves the raw agent session as the source of truth:

  1. Collect: Run agents on real tasks → raw .jsonl traces
  2. Inspect/Share: Traces are human-readable and uploadable
  3. Convert: Transform to structured examples when ready
  4. Prepare: Apply model-specific chat templates, mask labels, collate, and audit for training

If you choose agent.provider: chat, Teich skips the trace-preservation step and writes structured text-only JSONL rows directly.

This means you can:

  • Re-convert with different logic later
  • Share raw traces before releasing training data
  • Train on the same sessions with different model templates

🛠️ Development

uv pip install -e ".[dev]"
uv run pytest --ignore=tests/test_integration.py -q

📌 Status

Teich is alpha. The core workflow is stable and usable. APIs may evolve as more agent types and training workflows are added.

📄 License

Apache-2.0
