Replay-first tracing + deterministic replay + diffing for agentic workflows.

Forkline

Forkline is a local-first, replay-first tracing and diffing library for agentic AI workflows.

Its purpose is simple and strict:

Make agent runs reproducible, inspectable, and diffable.

Forkline treats nondeterminism as something to be controlled, not merely observed.


Why Forkline exists

Modern agentic systems fail in a frustrating way:

  • The same prompt behaves differently on different days
  • Tool calls change silently
  • Debugging becomes guesswork
  • CI becomes flaky or meaningless

Logs and dashboards tell you that something changed.
Forkline is built to tell you where, when, and why.


What Forkline does

Forkline allows you to:

  • Record an agent run as a deterministic, local artifact
  • Replay that run without re-invoking the LLM
  • Diff two runs and detect the first point of divergence
  • Capture tool calls safely with deterministic redaction
  • Gate CI builds on behavioral identity — no network, no flake, no ambiguity

This turns agent behavior into something you can reason about like code.


Replay (Deterministic)

Replay in Forkline means:

  • Offline execution — No network calls, no LLM invocations during replay
  • Artifact injection — Tool and LLM outputs come from recorded artifacts, not live calls
  • First-divergence detection — Comparison halts at the first observable difference
  • Read-only — Replay never mutates the original recording
  • Deterministic — Same inputs always produce identical comparison results

from forkline import SQLiteStore, ReplayEngine, ReplayStatus

store = SQLiteStore()
engine = ReplayEngine(store)

# Record a run (see docs/RECORDING_V0.md)
# ...

# Compare two recorded runs
result = engine.compare_runs("baseline-run", "current-run")

if result.status == ReplayStatus.MATCH:
    print("Runs are identical")
elif result.status == ReplayStatus.DIVERGED:
    print(f"Diverged at step {result.divergence.step_idx}: {result.divergence.divergence_type}")

See docs/REPLAY_ENGINE_V0.md for full replay documentation.


Quick Start

# Install (editable)
pip install -e .

# Run a script under forkline tracing
forkline run examples/minimal.py

# List recorded runs
forkline list

# Replay a run (prints summary)
forkline replay <run_id>

# Diff two runs
forkline diff <run_id_a> <run_id_b>

CLI Reference

# Run a script and capture metadata (timestamps, exit code, script path)
forkline run examples/minimal.py
# => run_id: 8a3f...

# Pass arguments to the script (use -- to separate)
forkline run examples/minimal.py -- --verbose --count 5

# List runs (newest first, table format)
forkline list
forkline list --limit 10
forkline list --json

# Replay a run (load and summarize events)
forkline replay <run_id>
forkline replay <run_id> --json

# Diff two runs (finds first divergence)
forkline diff <run_id_a> <run_id_b>
forkline diff <run_id_a> <run_id_b> --format json

# Use a custom database path
forkline run --db myproject.db examples/minimal.py
forkline list --db myproject.db

# CI integration (see "CI Integration" below)
forkline ci record --entrypoint examples/my_flow.py --out baseline.run.json
forkline ci check --entrypoint examples/my_flow.py --expected baseline.run.json

Example: catching LLM nondeterminism with Ollama Qwen3

examples/ollama_qwen3.py calls Ollama's Qwen3 model and records the input/output as forkline events. Run it twice — the LLM gives a different response each time, and forkline diff catches it.

# Prerequisites: ollama pull qwen3

$ forkline run examples/ollama_qwen3.py
Calling qwen3 ...
Response: A fork bomb is a denial-of-service attack that recursively spawns
an infinite number of processes to exhaust system resources, causing a crash
or severe performance degradation.
run_id: b015f49f45c04002a3c489fe84b45c5c

$ forkline run examples/ollama_qwen3.py
Calling qwen3 ...
Response: A fork bomb is a type of denial-of-service attack that recursively
spawns an infinite number of processes using the fork() system call, thereby
exhausting system resources and causing the system to crash or become
unresponsive.
run_id: 7b08ac5e533d456daa7a24921c0d1687

forkline list — both runs, newest first:

ID                                    Created               Script                          Status
------------------------------------------------------------------------------------------------------
7b08ac5e533d456daa7a24921c0d1687      2026-02-23 01:04:34   examples/ollama_qwen3.py        ok
b015f49f45c04002a3c489fe84b45c5c      2026-02-23 01:04:20   examples/ollama_qwen3.py        ok

forkline replay b015f4... — summary of the first run:

Run: b015f49f45c04002a3c489fe84b45c5c
Script: examples/ollama_qwen3.py
Status: ok
Duration: 10.74s
Total events: 2
Events by type:
  input: 1
  output: 1

forkline diff b015f4... 7b08ac... — nondeterminism caught:

Step 1 diverged:
  old.type: output
  old.payload: {"model": "qwen3", "response": "A fork bomb is a denial-of-service attack tha...
  new.type: output
  new.payload: {"model": "qwen3", "response": "A fork bomb is a type of denial-of-service at...

Same prompt, same model — different output. That's exactly the problem Forkline exists to surface.

Programmatic API

from forkline import ReplayEngine, SQLiteStore, ReplayStatus

engine = ReplayEngine(SQLiteStore())
result = engine.compare_runs("baseline-run", "new-run")

if result.is_match():
    print("No behavioral changes")
else:
    print(f"Diverged: {result.divergence.summary()}")

See QUICKSTART_RECORDING_V0.md for recording and REPLAY_ENGINE_V0.md for replay.


CI Integration

Forkline ships with a dedicated CI layer that turns agent behavior into a build gate. If behavior changes, the build fails — deterministically, offline, with a clear diff.

Record a baseline, check it in CI

# Record a baseline artifact (local dev)
forkline ci record --entrypoint examples/my_flow.py --out tests/testdata/my_flow.run.json

# Commit it to version control
git add tests/testdata/my_flow.run.json

# In CI: check that behavior hasn't changed
forkline ci check --entrypoint examples/my_flow.py --expected tests/testdata/my_flow.run.json
# Exit 0 = identical behavior, Exit 1 = behavior changed

CI commands

# Record a normalized, committable artifact
forkline ci record --entrypoint <script> --out <path> [--offline]

# Validate an artifact's schema and structure
forkline ci replay --artifact <path> [--strict]

# Diff two artifacts (exit 1 on divergence)
forkline ci diff --expected <path> --actual <path> [--format json|text]

# All-in-one: record actual, diff against expected
forkline ci check --entrypoint <script> --expected <path> [--offline]

# Normalize an artifact for stable diffs
forkline ci normalize <artifact> [--out <path>]

Offline mode

CI runs enforce a hard no-network guarantee. When --offline is set (or FORKLINE_OFFLINE=1), any network access raises ForklineOfflineError immediately — no hangs, no timeouts, deterministic failure.
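One way to picture that guarantee is a guard installed at the socket layer. The sketch below is illustrative only: the ForklineOfflineError name comes from the text above, but the mechanism shown (patching socket.socket.connect) is an assumption, not Forkline's actual implementation.

```python
import socket

class ForklineOfflineError(RuntimeError):
    """Raised on any network attempt while offline mode is active."""

def install_offline_guard():
    """Make every network connect fail fast instead of hanging.

    Illustrative sketch; Forkline's real guard may work differently.
    Returns the original connect so a caller can restore it.
    """
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        raise ForklineOfflineError(
            f"Network access to {address!r} blocked in offline mode"
        )

    socket.socket.connect = guarded_connect
    return original_connect
```

The key property is the same as described above: the failure is immediate and deterministic, with no timeouts involved.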

Exit codes

Forkline CI uses a strict, stable exit code contract:

Code  Meaning
----  ------------------------------
0     Success, no diff
1     Diff detected (fail the build)
2     Usage/config error
3     Replay failed
4     Offline violation
5     Artifact/schema error
6     Internal error
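In wrapper scripts it can help to map these codes to messages before failing the build. A small helper that mirrors the table above (the helper itself is hypothetical, not part of Forkline's API):

```python
# Exit code contract, copied from the table above.
FORKLINE_EXIT_CODES = {
    0: "Success, no diff",
    1: "Diff detected (fail the build)",
    2: "Usage/config error",
    3: "Replay failed",
    4: "Offline violation",
    5: "Artifact/schema error",
    6: "Internal error",
}

def describe_exit(code: int) -> str:
    """Human-readable meaning for a forkline ci exit code."""
    return FORKLINE_EXIT_CODES.get(code, f"Unknown exit code {code}")

# Typical use: run `forkline ci check ...` via subprocess, then
# log describe_exit(proc.returncode) before propagating the code.
```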

Python test helper

from forkline.testing import assert_no_diff

def test_my_flow():
    assert_no_diff(
        entrypoint="examples/my_flow.py",
        expected_artifact="tests/testdata/my_flow.run.json",
        offline=True,
    )

GitHub Actions

- name: Check for behavioral diffs
  run: |
    forkline ci check \
      --entrypoint examples/my_flow.py \
      --expected tests/testdata/my_flow.run.json \
      --offline

For the full CI guide — artifact normalization, re-recording baselines, repo layout, and more — see docs/ci.md.


Artifact Stability Guarantee

Forkline guarantees replay compatibility across minor versions. Breaking changes require a major version increment and migration support.

Every run artifact includes a mandatory schema_version field. Older artifacts are automatically migrated to the current schema via a deterministic, side-effect-free migration pipeline. Unknown fields are always ignored, never rejected — ensuring forward compatibility with newer artifact versions.
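The migration behavior described above can be sketched as a chain of pure functions keyed by schema_version. Only the schema_version field and the ignore-unknown-fields rule come from the text; the function names and version numbers here are hypothetical.

```python
def migrate_v1_to_v2(artifact: dict) -> dict:
    # Hypothetical step: a pure function that returns a new dict,
    # bumps the version, and never mutates its input.
    out = dict(artifact)
    out["schema_version"] = 2
    return out

MIGRATIONS = {1: migrate_v1_to_v2}  # version -> next migration step
CURRENT_VERSION = 2

def migrate(artifact: dict) -> dict:
    """Deterministically upgrade an artifact to the current schema.

    Unknown fields are carried along untouched, never rejected,
    matching the forward-compatibility rule above.
    """
    while artifact["schema_version"] < CURRENT_VERSION:
        step = MIGRATIONS[artifact["schema_version"]]
        artifact = step(artifact)
    return artifact
```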

For the full artifact schema specification, see docs/artifact_schema.md.


Design principles

Forkline is intentionally opinionated.

  • Replay-first, not dashboards-first
  • Determinism over probabilistic insight
  • Local-first artifacts
  • Diff over metrics
  • Explicit schemas over implicit behavior

If a feature does not help reproduce, replay, or diff an agent run, it does not belong in Forkline.


Security & Data Redaction

Forkline is designed to be safe by default when handling sensitive data.

Core invariant

By default, Forkline artifacts MUST NOT contain recoverable sensitive user, customer, or proprietary data.

This means:

  • No raw LLM prompts or responses are persisted by default
  • Secrets are NEVER written to disk in any mode
  • PII and customer data are redacted before persistence
  • Redaction happens at capture time, before any disk write

What IS recorded (SAFE mode)

Forkline preserves everything needed for replay and diffing:

  • Step ordering and control flow
  • Tool and model identifiers
  • Timestamps and execution metadata
  • Stable cryptographic hashes of redacted values
  • Structural shape of inputs/outputs

This enables deterministic replay, accurate diffing, and forensic debugging — without exposing sensitive data.
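Capture-time redaction with stable hashes might look like the following sketch. Hashing a canonical JSON encoding is an assumption about one reasonable design; Forkline's actual redaction mechanics are specified in docs/REDACTION_POLICY.md.

```python
import hashlib
import json

def redact(value) -> dict:
    """Replace a sensitive value with its type plus a stable hash.

    Canonical JSON (sorted keys, fixed separators) makes the digest
    deterministic, so identical values compare as identical across
    runs without the raw value ever being persisted.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"type": type(value).__name__, "sha256": digest}
```

Because the hash is stable, two runs that produced the same redacted value still diff as equal, which is what makes redaction compatible with deterministic replay.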

Escalation modes

For development and debugging, Forkline supports explicit opt-in modes:

  • SAFE (default): Production-safe, full redaction
  • DEBUG: Local development, raw values persisted
  • ENCRYPTED_DEBUG: Encrypted payloads for break-glass production debugging

Full policy

For the complete security design and redaction mechanisms, see:

👉 docs/REDACTION_POLICY.md


Why CLI-first

Forkline is CLI-first by design, not by convenience.

Agent debugging and reproducibility are developer workflows.
They live in terminals, CI pipelines, local machines, and code reviews — not dashboards.

Determinism and scriptability

CLI commands are composable, automatable, and repeatable.

This makes Forkline usable in:

  • CI pipelines (forkline ci check gates merges on behavioral identity)
  • test suites (assert_no_diff for snapshot-style testing)
  • local debugging loops
  • regression checks

If it can’t be scripted, it can’t be trusted as infrastructure.


Local-first by default

A CLI enforces Forkline’s local-first philosophy:

  • artifacts live on disk
  • runs replay offline
  • no hidden network dependencies
  • no opaque browser state

This keeps behavior inspectable and failure modes obvious.


Diff is terminal-native

Diffing is already how developers reason about change:

  • git diff
  • pytest failures
  • compiler diagnostics
  • performance regressions

Forkline extends this mental model to agent behavior.

A CLI makes Forkline additive to existing tooling, not a replacement.


Avoiding dashboard gravity

Dashboards optimize for:

  • aggregation over root cause
  • real-time metrics over replayability
  • visualization over determinism

Forkline explicitly avoids this gravity.

If a feature requires a UI to be understandable, it is usually hiding complexity rather than exposing truth.


UIs can come later — CLIs must come first

Forkline does not reject UIs.
It rejects UI-first design.

The CLI defines the real API surface and semantic contract. Any future UI must be a thin layer on top — never the other way around.

Forkline is CLI-first because reproducibility, diffing, and trust are terminal-native problems.


First-Divergence Diffing

Forkline can compare two recorded runs and identify the first point of divergence with deterministic classification, structured diffs, and a resync window that handles inserted/deleted steps.

CLI Usage

# Pretty diff (default)
forkline diff run_a_id run_b_id

# JSON diff
forkline diff run_a_id run_b_id --format json

# Custom database path
forkline diff run_a_id run_b_id --db myproject.db

Programmatic Usage

from forkline import SQLiteStore
from forkline.core.first_divergence import find_first_divergence, DivergenceType

store = SQLiteStore()
run_a = store.load_run("baseline")
run_b = store.load_run("current")

result = find_first_divergence(run_a, run_b)

if result.status == DivergenceType.EXACT_MATCH:
    print("Runs are identical")
else:
    print(f"Diverged: {result.explanation}")
    print(f"  Type: {result.status}")
    print(f"  At: step {result.idx_a} (run_a) / step {result.idx_b} (run_b)")
    if result.output_diff:
        for op in result.output_diff:
            print(f"  {op['op']} {op['path']}")

Sample Output

First divergence: output_divergence
  Step 2 'generate_response': output differs (same input)

  Run A step 2 'generate_response':
    input_hash:  a1b2c3d4e5f6a7b8...
    output_hash: 1234567890abcdef...
    events: 3
    has_error: False

  Run B step 2 'generate_response':
    input_hash:  a1b2c3d4e5f6a7b8...
    output_hash: fedcba0987654321...
    events: 3
    has_error: False

  Output diff:
    replace $.result.text: "Expected response" -> "Different response"

  Last equal: step 1
  Context A: [step 0 'init', step 1 'prepare', step 2 'generate_response']
  Context B: [step 0 'init', step 1 'prepare', step 2 'generate_response']

Divergence Types

Type               Meaning
-----------------  ------------------------------------------
exact_match        Runs are identical
input_divergence   Same step name, different input
output_divergence  Same step name and input, different output
op_divergence      Step names differ at same position
missing_steps      Steps in run_a not present in run_b
extra_steps        Steps in run_b not present in run_a
error_divergence   Error state differs between steps
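The table can be read as a decision order over a pair of steps at the same position. A simplified classifier, assuming each step carries 'name', 'input_hash', 'output_hash', and 'has_error' fields (field names and check order are assumptions, and the real engine also handles resync and missing/extra steps):

```python
def classify(step_a: dict, step_b: dict) -> str:
    """Classify divergence for two steps aligned at the same position.

    Simplified sketch: compares name, then error state, then input,
    then output. Forkline's actual precedence may differ.
    """
    if step_a["name"] != step_b["name"]:
        return "op_divergence"
    if step_a["has_error"] != step_b["has_error"]:
        return "error_divergence"
    if step_a["input_hash"] != step_b["input_hash"]:
        return "input_divergence"
    if step_a["output_hash"] != step_b["output_hash"]:
        return "output_divergence"
    return "exact_match"
```

Note how this matches the sample output above: same input_hash but different output_hash yields output_divergence.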

How Resync Works

When a mismatch is found, the engine searches within a configurable window (default 10 steps) for matching "soft signatures" (step_name, input_hash). This correctly identifies inserted or deleted steps rather than reporting every subsequent step as divergent.
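A minimal sketch of that resync search, where each step is reduced to its (step_name, input_hash) soft signature. The window default of 10 comes from the text; the search strategy shown here is illustrative, not Forkline's exact algorithm.

```python
def find_resync(steps_a, steps_b, i, j, window=10):
    """After a mismatch at positions (i, j), look ahead for re-alignment.

    Each step is a (step_name, input_hash) tuple. Returns offsets
    (da, db) of the first matching pair within the window, or None.
    A match at (0, db) suggests db steps were inserted in run B;
    (da, 0) suggests da steps were deleted from run A.
    """
    for da in range(window + 1):
        for db in range(window + 1):
            if da == 0 and db == 0:
                continue  # (i, j) itself is the known mismatch
            if i + da < len(steps_a) and j + db < len(steps_b):
                if steps_a[i + da] == steps_b[j + db]:
                    return (da, db)
    return None
```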


What Forkline is NOT

Forkline explicitly does not aim to be:

  • OpenTelemetry or distributed tracing — No spans, traces, or exporters
  • Production observability — Not for real-time monitoring or alerting
  • An evaluation or benchmarking framework — Not for scoring or ranking models
  • Prompt engineering tooling — Not for A/B testing or prompt optimization
  • A hosted SaaS or dashboard product — Local-first, no cloud dependencies

Forkline is offline forensic debugging infrastructure, not an analytics or observability platform.

For recording schema details, see docs/RECORDING_V0.md.


Roadmap

Forkline follows a disciplined, execution-first roadmap.

The v0 series focuses on correctness and determinism, not polish.

  1. ✅ Deterministic run recording
  2. ✅ Offline replay engine
  3. ✅ First-divergence diffing
  4. ✅ CLI (run, list, replay, diff)
  5. ✅ CI integration (ci record, ci replay, ci diff, ci check, offline enforcement, exit codes)

The canonical roadmap and design contract live here:

👉 docs/ROADMAP.md


Status

Forkline is early-stage and under active development.

APIs are expected to change until v1.0.
Feedback is welcome, especially around replay semantics and diffing behavior.


License

Forkline is licensed under the Apache 2.0 License.


Philosophy (one sentence)

Forkline exists because “it changed” is not a useful debugging answer.
