Engineering notebook for AI-assisted development

buildlog

The Only Agent Learning System You Can Prove Works

Falsifiable claims. Measurable outcomes. No vibes.

RE: The art — Yes, it's AI-generated. Yes, that's hypocritical for a project about rigor over vibes. Looking for an actual artist to pay for a real logo. If you know someone good, open an issue or DM me. Budget exists.

The Problem · The Claim · The Mechanism · Quick Start · Review Gauntlet


The Problem

Everyone's building "agent memory." Blog posts announce breakthroughs. Tweets show impressive demos. Products ship with "learning" in the tagline.

Ask them one question: How do you know it works?

You'll get:

  • "It feels smarter"
  • "Users report better results"
  • "The agent remembers things now"

That's not evidence. That's vibes.

Here's what a real answer looks like:

"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."

If you can't say something like that, you don't have agent learning. You have a demo.


The Claim

buildlog makes a falsifiable claim:

H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior, i.e. RMR with buildlog-learned rules equals baseline RMR.

H₁ (Alternative): Agents using buildlog-learned rules have lower Repeated Mistake Rate than baseline.

We provide the infrastructure to reject or fail to reject this hypothesis with your own data.

If buildlog doesn't work, the numbers will show it. That's the point.


The Metric: Repeated Mistake Rate (RMR)

RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)

A mistake "matches" if it has the same semantic signature—same error class, similar description, same root cause showing up again.
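As a rough illustration of the formula above, here is a minimal Python sketch. The `signature` helper is a deliberately crude stand-in (normalized text plus a hash), not buildlog's actual matching logic, and both function names are hypothetical.

```python
import hashlib
from typing import Iterable


def signature(error_class: str, description: str) -> str:
    """Crude stand-in for a semantic signature (assumption, not buildlog's API).

    Lowercases and collapses whitespace so trivially different wordings
    of the same mistake hash to the same value.
    """
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(f"{error_class}:{normalized}".encode()).hexdigest()


def repeated_mistake_rate(signatures: Iterable[str]) -> float:
    """RMR = mistakes matching an earlier mistake / total mistakes logged."""
    seen: set[str] = set()
    repeats = 0
    total = 0
    for sig in signatures:
        total += 1
        if sig in seen:
            repeats += 1
        else:
            seen.add(sig)
    return repeats / total if total else 0.0


mistakes = [
    ("type-errors", "Forgot to handle null case"),
    ("api-design", "Returned 200 for error case"),
    ("type-errors", "forgot to  handle NULL case"),  # repeat of the first
    ("type-errors", "Forgot to handle null case"),   # repeat again
]
rmr = repeated_mistake_rate(signature(c, d) for c, d in mistakes)
print(f"RMR = {rmr:.2f}")  # RMR = 0.50
```

Two of the four logged mistakes match an earlier one, so RMR is 0.50. A real signature would need to catch paraphrases that a text hash misses.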

Why RMR?

  • Observable: You can count it
  • Attributable: Lower RMR after rule injection = signal
  • Meaningful: Repeating mistakes is the actual pain point

RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.


The Mechanism

buildlog is building toward contextual bandits for automatic rule selection. Here's where we are:

What Exists Today (v0.7)

┌─────────────────────────────────────────────────────────────────┐
│                    CURRENT INFRASTRUCTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ✅ Rule extraction     From entries, reviews, curated seeds   │
│  ✅ Confidence scoring  Frequency + recency based              │
│  ✅ Reward logging      Accept/reject/revision signals         │
│  ✅ Experiment tracking Sessions, mistakes, RMR calculation    │
│  ✅ Review gauntlet     Curated persona-based code review      │
│  ⏳ Manual promotion    Human selects rules to surface         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
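The README does not spell out the confidence formula, so the following is only one plausible way to combine frequency and recency: a saturating count term multiplied by an exponential recency decay. The function name, the half-life, and the saturation constant are all assumptions for illustration.

```python
import math


def confidence(occurrences: int, days_since_last_seen: float,
               half_life_days: float = 30.0, saturation: float = 3.0) -> float:
    """Hypothetical confidence score in [0, 1); not buildlog's actual formula.

    frequency term: 1 - exp(-n / saturation) grows with repeated sightings
    recency term:   0.5 ** (age / half_life) halves every `half_life_days`
    """
    if occurrences <= 0:
        return 0.0
    frequency = 1.0 - math.exp(-occurrences / saturation)
    recency = 0.5 ** (max(days_since_last_seen, 0.0) / half_life_days)
    return frequency * recency


# A rule seen five times today outranks one seen five times a month ago.
print(confidence(5, 0) > confidence(5, 30))  # True
```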

What's Coming (v0.8+)

┌─────────────────────────────────────────────────────────────────┐
│                    CONTEXTUAL BANDIT (PLANNED)                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Context (c):     Error class, file type, task category        │
│  Arms (a):        Candidate rules to surface                   │
│  Reward (r):      Human feedback (👍 helpful / 👎 not helpful) │
│                                                                 │
│  Policy π(c) → a:  Thompson Sampling with Beta-Bernoulli       │
│                                                                 │
│  Objective:       Minimize regret = Σ(r* - r_chosen)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Arms = learned rules (from buildlog entries, code reviews, explicit teaching)

Context = what kind of problem you're working on

Reward = did surfacing this rule actually help?

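The reward infrastructure exists. The bandit policy is next. Thompson Sampling comes with theoretical guarantees: in the standard K-armed (non-contextual) setting with Beta priors, expected regret over T rounds is bounded by O(√(KT log T)); the O(√(KT log K)) form is the known bound for Gaussian priors.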
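To make the planned policy concrete, here is a minimal sketch of Thompson Sampling with a Beta-Bernoulli model. It is not buildlog's implementation (which is not yet written): the class name, the rule ID `arch-456`, and the simulated success rates are assumptions. Each (context, arm) pair simply gets its own independent posterior, which is the simplest way to make the policy context-dependent.

```python
import random
from collections import defaultdict


class BetaBernoulliThompson:
    """Thompson Sampling with one Beta(alpha, beta) posterior per (context, arm)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) is the uniform prior: no opinion about a new rule yet.
        self._posteriors = defaultdict(lambda: [1.0, 1.0])

    def select(self, context, arms):
        """Sample a plausible success rate for each arm; pick the largest."""
        def draw(arm):
            alpha, beta = self._posteriors[(context, arm)]
            return self._rng.betavariate(alpha, beta)
        return max(arms, key=draw)

    def update(self, context, arm, reward):
        """reward is 1 (helped) or 0 (did not help)."""
        posterior = self._posteriors[(context, arm)]
        posterior[0] += reward        # successes
        posterior[1] += 1 - reward    # failures


# Simulated feedback (made up): "arch-123" helps 80% of the time, "arch-456" 20%.
true_rates = {"arch-123": 0.8, "arch-456": 0.2}
env = random.Random(0)
policy = BetaBernoulliThompson(seed=0)
picks = []
for _ in range(500):
    rule = policy.select("type-errors", list(true_rates))
    reward = 1 if env.random() < true_rates[rule] else 0
    policy.update("type-errors", rule, reward)
    picks.append(rule)

print(picks[-100:].count("arch-123"), "of the last 100 picks were arch-123")
```

Early on both rules get tried because their posteriors overlap. As feedback accumulates, the posterior for the weaker rule concentrates on low values and it is sampled less and less often.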

We're building in public—the bandit implementation will be developed with full documentation of the process.


The Pipeline

buildlog captures signal at every stage:

flowchart LR
    A["Work Sessions"] --> B["Structured Entries"]
    B --> C["Extracted Rules"]
    C --> D["Manual Promotion"]
    D --> E["Rule Surfaced"]
    E --> F["Human Feedback"]
    F --> G["Reward Logged"]
    G -.-> H["Bandit Policy"]
    H -.-> D

    style F fill:#ff6b6b,color:#fff
    style G fill:#4ecdc4,color:#fff
    style H fill:#666,color:#fff,stroke-dasharray: 5 5

Dashed: Coming in v0.8 — automatic rule selection via Thompson Sampling

Stage 1: Capture

Document your work. Include the fuckups—they're the most valuable signal.

buildlog new auth-api
# Edit the markdown, document what happened

Stage 2: Extract

Pull structured rules from your entries.

buildlog distill    # Extract patterns
buildlog skills     # Deduplicate into rules

Stage 3: Promote

Surface rules to your agent via CLAUDE.md, settings.json, or Agent Skills.

buildlog promote --target skill

Stage 4: Measure

Track what happens when rules are active.

buildlog experiment start --error-class "type-errors"
# ... work session ...
buildlog experiment log-mistake --error-class "type-errors" \
  --description "Forgot to handle null case"
buildlog experiment end
buildlog experiment report

Stage 5: Learn

Log reward signals when rules help (or don't).

# Via MCP
buildlog_log_reward(
    skill_id="arch-123",
    reward=1,           # 1 = helped, 0 = didn't help
    context="type-errors",
    outcome="Caught the bug before commit"
)

Review Gauntlet

Run your code through ruthless reviewer personas, each with curated rules from authoritative sources.

# See available reviewers
buildlog gauntlet list

# Output:
# Review Gauntlet Personas
# ==================================================
#   security_karen
#     OWASP Top 10 security review
#     Rules: 13 (v1)
#
#   test_terrorist
#     Comprehensive testing coverage audit
#     Rules: 21 (v1)
#
# Total: 2 personas, 34 rules

Reviewer Personas

| Persona | Focus | Rules |
|---|---|---|
| Security Karen | OWASP Top 10, auth, injection, secrets | 13 |
| Test Terrorist | Coverage, property-based, metamorphic, contracts | 21 |
| Ruthless Reviewer | Code quality, FP principles | Coming soon |

Each rule includes:

  • Context: When to apply it
  • Antipattern: What violation looks like
  • Rationale: Why it matters (with citations)
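One way to picture a rule with these three parts is as a small record. The sketch below is hypothetical: the class, its field names, and the example rule are illustrations, not buildlog's storage format or one of the shipped rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRule:
    """Hypothetical shape of a gauntlet rule; field names are assumptions."""
    persona: str
    context: str       # when to apply the rule
    antipattern: str   # what a violation looks like
    rationale: str     # why it matters
    citations: tuple[str, ...] = ()

    def as_checklist_item(self) -> str:
        """Render the rule as one line of a manual review checklist."""
        return f"[ ] {self.context}: avoid {self.antipattern} ({self.rationale})"


# Illustrative rule only, written for this example.
rule = ReviewRule(
    persona="security_karen",
    context="Building SQL queries from user input",
    antipattern="string concatenation in query text",
    rationale="enables injection",
    citations=("OWASP Top 10",),
)
print(rule.as_checklist_item())
```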

Usage

# Generate a review prompt
buildlog gauntlet prompt src/api.py

# Export rules for manual review
buildlog gauntlet rules --format markdown -o review_checklist.md

# After running a review, persist learnings
buildlog gauntlet learn review_issues.json --source "PR#42"

Gauntlet Loop (Agent Integration)

For AI agents, the gauntlet loop automates the fix-rerun cycle:

buildlog gauntlet loop src/ --persona security_karen --persona test_terrorist

The loop provides structured checkpoints:

| Severity | Action | Human Needed? |
|---|---|---|
| Critical | Agent fixes, reruns | No |
| Major | Checkpoint: continue? | Yes |
| Minor | Accept risk or fix? | Yes |
| Clean | Done | No |
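The checkpoint table can be read as a small decision function. The sketch below assumes the worst remaining severity decides the next step, which the table does not state explicitly. The enum, the function, and the action labels are hypothetical names, not the MCP tool's actual return values.

```python
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def next_action(severities):
    """Return (action, human_needed) for the remaining issues.

    Assumption: the worst remaining severity determines the step.
    """
    if not severities:
        return ("done", False)                 # Clean
    worst = max(severities)
    if worst == Severity.CRITICAL:
        return ("fix_and_rerun", False)        # agent fixes, reruns
    if worst == Severity.MAJOR:
        return ("checkpoint_continue", True)   # ask the human whether to go on
    return ("accept_risk_or_fix", True)        # only minor issues remain


print(next_action([Severity.MINOR, Severity.CRITICAL]))  # ('fix_and_rerun', False)
```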

MCP tools for agent integration:

  • buildlog_gauntlet_issues — Report findings, get next action
  • buildlog_gauntlet_accept_risk — Accept remaining issues (optionally create GitHub issues)

The gauntlet integrates with the learning loop—issues found become rules that accumulate confidence.


Experiment Infrastructure

buildlog ships with infrastructure to run actual experiments:

# Start a tracked session
buildlog experiment start --error-class "api-design"

# Log mistakes as they happen
buildlog experiment log-mistake \
  --error-class "api-design" \
  --description "Returned 200 for error case"

# End session
buildlog experiment end

# Get metrics
buildlog experiment metrics

# Full report across all sessions
buildlog experiment report

The report includes:

  • Total sessions, total mistakes
  • Repeat rate (RMR)
  • Mistakes by error class
  • Rules that correlate with corrections

This is the data you need to make claims.


Quick Start

# Install
pip install buildlog

# Initialize
buildlog init

# Create your first entry
buildlog new my-feature

# After a few entries, extract rules
buildlog distill
buildlog skills

# Start measuring
buildlog experiment start
# ... work ...
buildlog experiment end
buildlog experiment report

MCP Server (Claude Code Integration)

pip install "buildlog[mcp]"   # quotes keep shells like zsh from expanding the brackets

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "buildlog": {
      "command": "buildlog-mcp"
    }
  }
}

Available tools:

| Tool | Purpose |
|---|---|
| buildlog_status | View rules by category and confidence |
| buildlog_promote | Surface rules to agent |
| buildlog_reject | Mark false positives |
| buildlog_diff | Rules pending review |
| buildlog_learn_from_review | Extract rules from code review |
| buildlog_log_reward | Record reward signal |
| buildlog_start_session | Begin tracked experiment |
| buildlog_log_mistake | Record mistake during session |
| buildlog_experiment_report | Full experiment report |
| buildlog_gauntlet_issues | Report gauntlet findings, get next action |
| buildlog_gauntlet_accept_risk | Accept remaining issues, optionally create GH issues |

CLI Commands

buildlog init                    # Initialize buildlog
buildlog new <slug>              # Create entry
buildlog list                    # List entries
buildlog distill                 # Extract patterns
buildlog skills                  # Generate rules
buildlog stats                   # Usage statistics
buildlog reward <outcome>        # Log reward signal

# Experiments
buildlog experiment start        # Begin tracked session
buildlog experiment log-mistake  # Record mistake
buildlog experiment end          # End session
buildlog experiment report       # Full report

# Review Gauntlet
buildlog gauntlet list           # Show reviewers
buildlog gauntlet rules          # Export rules
buildlog gauntlet prompt <path>  # Generate review prompt
buildlog gauntlet learn <file>   # Persist learnings
buildlog gauntlet loop <path>    # Auto-fix loop with HITL checkpoints

What This Is Not

This is not AGI. This is not "agents that truly learn." This is not a revolution.

This is:

  • A structured way to capture engineering knowledge
  • The reward-logging foundation for bandit-based rule selection (the bandit policy itself is planned for v0.8)
  • Infrastructure to measure whether it works

Boring? Maybe. But boring things that work beat exciting things that don't.


The Falsification Protocol

Want to test whether buildlog actually helps? Here's the protocol:

  1. Baseline: Run N sessions without buildlog rules active. Log mistakes.
  2. Treatment: Run N sessions with buildlog rules active. Log mistakes.
  3. Compare: Calculate RMR for both conditions.
  4. Statistical test: Two-proportion z-test or chi-squared.
  5. Report: Effect size, confidence interval, p-value.

If p ≥ 0.05, we fail to reject the null: the data give no evidence that buildlog helped. That's a valid outcome.

If p < 0.05, we have evidence of an effect. How big? Check the effect size.
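Steps 3 to 5 of the protocol can be sketched with nothing beyond the standard library. The function below is an illustration, not part of buildlog: its name is hypothetical, the counts are made up, and Cohen's h is just one reasonable effect-size choice for two proportions. The test treats each logged mistake as an independent observation, which is only an approximation when many mistakes come from the same session.

```python
import math


def two_proportion_z_test(repeats_a, total_a, repeats_b, total_b):
    """Compare RMR in baseline (a) against treatment (b).

    Returns (z, two_sided_p, cohens_h, ci95), where ci95 is a 95% confidence
    interval for the difference in proportions p_a - p_b.
    """
    p_a = repeats_a / total_a
    p_b = repeats_b / total_b
    pooled = (repeats_a + repeats_b) / (total_a + total_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se_pooled == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (p_a - p_b) / se_pooled
    # Standard normal CDF via erf, turned into a two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Cohen's h: effect size for a difference between two proportions.
    h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    # Unpooled standard error for the confidence interval.
    se_diff = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    diff = p_a - p_b
    ci95 = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return z, p_value, h, ci95


# Made-up counts for illustration only: 34/100 repeats vs 12/100 repeats.
z, p, h, ci = two_proportion_z_test(34, 100, 12, 100)
print(f"z={z:.2f}  p={p:.4f}  h={h:.2f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Because H₁ is directional (lower RMR with buildlog), a one-sided test is also defensible. The two-sided p-value shown here is the more conservative choice.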

This is how you know. Not vibes. Data.


Theoretical Foundations

For the technically curious:

| Concept | Application in buildlog | Status |
|---|---|---|
| Confidence scoring | Frequency + recency decay | ✅ Implemented |
| Semantic hashing | Mistake deduplication for RMR | ✅ Implemented |
| Reward signals | Binary feedback infrastructure | ✅ Implemented |
| Thompson Sampling | Rule selection under uncertainty | ⏳ Planned (v0.8) |
| Beta-Bernoulli model | Posterior updates from binary reward | ⏳ Planned (v0.8) |
| Contextual bandits | Context-dependent rule selection | ⏳ Planned (v0.8) |
| Regret bounds | O(√(KT log T)) theoretical guarantee (K-armed case, Beta priors) | ⏳ Planned (v0.8) |

We're not inventing new math. We're applying proven frameworks to a new domain. The infrastructure for reward collection is live; the bandit policy is the next milestone.


Honest Limitations

Not Yet Implemented

  • Automatic rule selection: Currently manual promotion; Thompson Sampling bandit planned for v0.8
  • Context-aware surfacing: Rules are surfaced globally, not based on task context

Hard Problems We're Working On

  • Credit assignment: When multiple rules are active, which one helped?
  • Non-stationarity: Developer skill changes over time
  • Cold start: New rules have high uncertainty
  • Context representation: What features actually matter?

These are hard problems. We have directional ideas, not solutions. If you're a researcher working on bandit algorithms or causal inference, we'd love to talk.


Philosophy

  1. Falsifiability over impressiveness - If you can't prove it wrong, it's not a claim
  2. Measurement over intuition - "Feels better" is not evidence
  3. Mechanisms over magic - Explain how it works or admit you don't know
  4. Boring over exciting - Proven frameworks beat novel demos
  5. Honesty over marketing - State limitations. Invite scrutiny.

Contributing

We're especially interested in:

  • Better context representations for the bandit
  • Credit assignment approaches
  • Statistical methodology improvements
  • Real-world experiment results (positive or negative)

git clone https://github.com/Peleke/buildlog-template
cd buildlog-template
pip install -e ".[dev]"
pytest

License

MIT License — see LICENSE


"Agent learning" without measurement is just prompt engineering with extra steps.

buildlog is measurement.
