Engineering notebook for AI-assisted development

NOTE: Forgive the agent dump; edit incoming.

RE: The art — Yes, it's AI-generated. Yes, that's hypocritical for a project about rigor over vibes. Now that this is no longer internal, looking for an actual artist to pay for a real logo. If you know someone good, open an issue or DM me. Budget exists.

buildlog

The Only Agent Learning System You Can Prove Works

Falsifiable claims. Measurable outcomes. No vibes.

The Problem · The Claim · The Mechanism · Quick Start


The Problem

Everyone's building "agent memory." Blog posts announce breakthroughs. Tweets show impressive demos. Products ship with "learning" in the tagline.

Ask them one question: How do you know it works?

You'll get:

  • "It feels smarter"
  • "Users report better results"
  • "The agent remembers things now"

That's not evidence. That's vibes.

Here's what a real answer looks like:

"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."

If you can't say something like that, you don't have agent learning. You have a demo.


The Claim

buildlog makes a falsifiable claim:

H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior.

H₁ (Alternative): Agents using buildlog-learned rules have a lower Repeated Mistake Rate than baseline agents.

We provide the infrastructure to reject or fail to reject this hypothesis with your own data.

If buildlog doesn't work, the numbers will show it. That's the point.


The Metric: Repeated Mistake Rate (RMR)

RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)

A mistake "matches" if it has the same semantic signature—same error class, similar description, same root cause showing up again.
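
As a rough illustration of the formula, here is a minimal sketch of an RMR calculation. The `Mistake` record and the `signature` function are hypothetical: a normalized (error class, description) pair stands in for a real semantic signature, and buildlog's own matching may work differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mistake:
    error_class: str
    description: str


def signature(m: Mistake) -> tuple[str, str]:
    """Hypothetical stand-in for a semantic signature: the error class plus
    a lowercased, punctuation-stripped description."""
    words = re.findall(r"[a-z0-9]+", m.description.lower())
    return (m.error_class, " ".join(words))


def repeated_mistake_rate(mistakes: list[Mistake]) -> float:
    """RMR = mistakes that match an earlier mistake / total mistakes logged."""
    if not mistakes:
        return 0.0
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for m in mistakes:  # assumed to be in chronological order
        sig = signature(m)
        if sig in seen:
            repeats += 1
        seen.add(sig)
    return repeats / len(mistakes)


# Illustrative log, not real data.
log = [
    Mistake("type-errors", "Forgot to handle null case"),
    Mistake("api-design", "Returned 200 for error case"),
    Mistake("type-errors", "forgot to handle NULL case."),
    Mistake("type-errors", "Wrong return annotation"),
]
rmr = repeated_mistake_rate(log)
print(f"RMR = {rmr:.2f}")  # one repeat out of four mistakes -> RMR = 0.25
```

Exact matching on normalized text will miss paraphrased repeats, so treat this as a lower bound on what a semantic matcher would count.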

Why RMR?

  • Observable: You can count it
  • Attributable: Lower RMR after rule injection = signal
  • Meaningful: Repeating mistakes is the actual pain point

RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.


The Mechanism

buildlog uses contextual bandits to select which rules to surface.

┌─────────────────────────────────────────────────────────────────┐
│                    CONTEXTUAL BANDIT SETUP                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Context (c):     Error class, file type, task category        │
│  Arms (a):        Candidate rules to surface                   │
│  Reward (r):      Human feedback (👍 helpful / 👎 not helpful) │
│                                                                 │
│  Policy π(c) → a:  Thompson Sampling with Beta-Bernoulli       │
│                                                                 │
│  Objective:       Minimize regret = Σ(r* - r_chosen)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Arms = learned rules (from buildlog entries, code reviews, explicit teaching)

Context = what kind of problem you're working on

Reward = did surfacing this rule actually help?

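The system explores (tries uncertain rules) and exploits (uses proven rules) based on accumulated evidence. Thompson Sampling comes with theoretical guarantees: regret bounds of O(√(KT log K)), where K is the number of arms (candidate rules) and T is the number of rounds (selections made).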

This isn't magic. It's a well-understood framework with decades of research. We're applying it to agent rule selection.
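
To make the setup concrete, here is a minimal sketch of Thompson Sampling with a Beta-Bernoulli model, keeping one posterior per (context, rule) pair. The class name, the uniform Beta(1, 1) prior, the rule ids, and the simulated help rates are all assumptions for illustration; this is not buildlog's actual implementation.

```python
import random
from collections import defaultdict


class BetaBernoulliBandit:
    """Thompson Sampling with one Beta posterior per (context, rule) pair."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # Assumed prior: Beta(1, 1), i.e. uniform over the helpfulness rate.
        self._posterior = defaultdict(lambda: [1.0, 1.0])

    def select(self, context, rule_ids):
        """Draw one plausible helpfulness rate per rule; surface the best draw."""
        def draw(rule_id):
            alpha, beta = self._posterior[(context, rule_id)]
            return self._rng.betavariate(alpha, beta)
        return max(rule_ids, key=draw)

    def update(self, context, rule_id, reward):
        """reward is 1 (helped) or 0 (did not help)."""
        params = self._posterior[(context, rule_id)]
        params[0] += reward
        params[1] += 1 - reward

    def mean(self, context, rule_id):
        """Posterior mean of the rule's helpfulness rate in this context."""
        alpha, beta = self._posterior[(context, rule_id)]
        return alpha / (alpha + beta)


bandit = BetaBernoulliBandit(seed=0)
rules = ["rule-a", "rule-b"]                     # hypothetical rule ids
true_help_rate = {"rule-a": 0.8, "rule-b": 0.3}  # simulated, not real data
sim = random.Random(1)

picks = {"rule-a": 0, "rule-b": 0}
for _ in range(500):
    chosen = bandit.select("type-errors", rules)
    reward = 1 if sim.random() < true_help_rate[chosen] else 0
    bandit.update("type-errors", chosen, reward)
    picks[chosen] += 1

print(picks)  # the more helpful rule should dominate as evidence accumulates
```

Sampling from the posterior, rather than always picking the highest mean, is what gives uncertain rules a chance to be tried: a rule with little feedback has a wide posterior and will occasionally produce the largest draw.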


The Pipeline

buildlog captures signal at every stage:

flowchart LR
    A["Work Sessions"] --> B["Structured Entries"]
    B --> C["Extracted Rules"]
    C --> D["Bandit Selection"]
    D --> E["Rule Surfaced"]
    E --> F["Human Feedback"]
    F --> G["Posterior Update"]
    G --> D

    style F fill:#ff6b6b,color:#fff
    style G fill:#4ecdc4,color:#fff

Stage 1: Capture

Document your work. Include the fuckups—they're the most valuable signal.

buildlog new auth-api
# Edit the markdown, document what happened

Stage 2: Extract

Pull structured rules from your entries.

buildlog distill    # Extract patterns
buildlog skills     # Deduplicate into rules

Stage 3: Promote

Surface rules to your agent via CLAUDE.md, settings.json, or Agent Skills.

buildlog promote --target skill

Stage 4: Measure

Track what happens when rules are active.

buildlog experiment start --error-class "type-errors"
# ... work session ...
buildlog experiment log-mistake --error-class "type-errors" \
  --description "Forgot to handle null case"
buildlog experiment end
buildlog experiment report

Stage 5: Learn

Log reward signals when rules help (or don't).

# Via MCP
buildlog_log_reward(
    skill_id="arch-123",
    reward=1,           # 1 = helped, 0 = didn't help
    context="type-errors",
    outcome="Caught the bug before commit"
)
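
For intuition about what a binary reward does under a Beta-Bernoulli model, the update can be written out as below. The Beta(1, 1) starting point in the worked line is an assumption; the source does not state which prior buildlog uses.

```latex
% Requires amsmath (\text, \xrightarrow).
\[
\text{Beta}(\alpha, \beta)
\;\xrightarrow{\; r \in \{0, 1\} \;}\;
\text{Beta}(\alpha + r,\; \beta + 1 - r),
\qquad
\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}
\]

% Worked example: three rewards of 1 and one reward of 0, starting from Beta(1, 1)
\[
\text{Beta}(1, 1) \to \text{Beta}(4, 2),
\qquad
\mathbb{E}[\theta] = \frac{4}{4 + 2} \approx 0.67
\]
```

Here θ is the unknown probability that surfacing the rule helps in a given context, and each logged reward moves exactly one unit of evidence into α or β.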

Experiment Infrastructure

buildlog ships with infrastructure to run actual experiments:

# Start a tracked session
buildlog experiment start --error-class "api-design"

# Log mistakes as they happen
buildlog experiment log-mistake \
  --error-class "api-design" \
  --description "Returned 200 for error case"

# End session
buildlog experiment end

# Get metrics
buildlog experiment metrics

# Full report across all sessions
buildlog experiment report

The report includes:

  • Total sessions, total mistakes
  • Repeat rate (RMR)
  • Mistakes by error class
  • Rules that correlate with corrections

This is the data you need to make claims.


Quick Start

# Install
pip install buildlog

# Initialize
buildlog init

# Create your first entry
buildlog new my-feature

# After a few entries, extract rules
buildlog distill
buildlog skills

# Start measuring
buildlog experiment start
# ... work ...
buildlog experiment end
buildlog experiment report

MCP Server (Claude Code Integration)

pip install "buildlog[mcp]"

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "buildlog": {
      "command": "buildlog-mcp"
    }
  }
}

Available tools:

| Tool | Purpose |
| --- | --- |
| buildlog_status | View rules by category and confidence |
| buildlog_promote | Surface rules to agent |
| buildlog_reject | Mark false positives |
| buildlog_diff | Rules pending review |
| buildlog_learn_from_review | Extract rules from code review |
| buildlog_log_reward | Record reward signal for bandit |
| buildlog_rewards | View reward history |
| buildlog_start_session | Begin tracked experiment session |
| buildlog_end_session | End session |
| buildlog_log_mistake | Record mistake during session |
| buildlog_session_metrics | Get session statistics |
| buildlog_experiment_report | Full experiment report |

What This Is Not

This is not AGI. This is not "agents that truly learn." This is not a revolution.

This is:

  • A structured way to capture engineering knowledge
  • A bandit framework for rule selection
  • Infrastructure to measure whether it works

Boring? Maybe. But boring things that work beat exciting things that don't.


The Falsification Protocol

Want to test whether buildlog actually helps? Here's the protocol:

  1. Baseline: Run N sessions without buildlog rules active. Log mistakes.
  2. Treatment: Run N sessions with buildlog rules active. Log mistakes.
  3. Compare: Calculate RMR for both conditions.
  4. Statistical test: Two-proportion z-test or chi-squared.
  5. Report: Effect size, confidence interval, p-value.

If p ≥ 0.05, we fail to reject the null: the data shows no detectable effect from buildlog (which is not the same as proving there is none). That's a valid outcome.

If p < 0.05, we have evidence of an effect. How big? Check the effect size.

This is how you know. Not vibes. Data.
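
Steps 3 to 5 of the protocol can be sketched with the standard library alone. The function name `compare_rmr` and the counts (34 repeats in 100 baseline mistakes, 12 in 100 treatment mistakes) are illustrative assumptions, not real results. The test is one-sided because H₁ predicts a lower RMR under treatment, and Cohen's h is used as one common effect size for a difference in proportions.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def compare_rmr(repeats_base, total_base, repeats_treat, total_treat):
    """One-sided two-proportion z-test for H1: treatment RMR < baseline RMR."""
    p_base = repeats_base / total_base
    p_treat = repeats_treat / total_treat

    # Pooled standard error under H0 (both conditions share one true rate)
    pooled = (repeats_base + repeats_treat) / (total_base + total_treat)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / total_base + 1 / total_treat))
    if se_pooled == 0:
        raise ValueError("No variation in the data; the test is undefined.")
    z = (p_base - p_treat) / se_pooled
    p_value = 1.0 - normal_cdf(z)  # small when treatment RMR is clearly lower

    # 95% Wald confidence interval for the difference (unpooled standard error)
    diff = p_base - p_treat
    se_diff = math.sqrt(p_base * (1 - p_base) / total_base
                        + p_treat * (1 - p_treat) / total_treat)
    ci95 = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

    # Cohen's h: effect size for a difference between two proportions
    h = 2 * math.asin(math.sqrt(p_base)) - 2 * math.asin(math.sqrt(p_treat))
    return {"z": z, "p_value": p_value, "diff": diff, "ci95": ci95, "cohens_h": h}


# Illustrative counts only, not measured data.
result = compare_rmr(repeats_base=34, total_base=100,
                     repeats_treat=12, total_treat=100)
for name, value in result.items():
    print(name, value)
```

One caveat with this sketch: it treats every logged mistake as an independent observation. Mistakes within the same session are likely correlated, so with few sessions a per-session analysis may be the safer choice.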


Theoretical Foundations

For the technically curious:

| Concept | Application in buildlog |
| --- | --- |
| Thompson Sampling | Rule selection under uncertainty |
| Beta-Bernoulli model | Posterior updates from binary reward |
| Contextual bandits | Context-dependent rule selection |
| Regret bounds | O(√(KT log K)) theoretical guarantee |
| Semantic hashing | Mistake deduplication for RMR |

We're not inventing new math. We're applying proven frameworks to a new domain.


Honest Limitations

Things we don't have figured out yet:

  • Credit assignment: When multiple rules are active, which one helped?
  • Non-stationarity: Developer skill changes over time
  • Cold start: New rules have high uncertainty
  • Context representation: What features actually matter?

These are hard problems. We have directional ideas, not solutions. If you're a researcher working on bandit algorithms or causal inference, we'd love to talk.


Philosophy

  1. Falsifiability over impressiveness - If you can't prove it wrong, it's not a claim
  2. Measurement over intuition - "Feels better" is not evidence
  3. Mechanisms over magic - Explain how it works or admit you don't know
  4. Boring over exciting - Proven frameworks beat novel demos
  5. Honesty over marketing - State limitations. Invite scrutiny.

Contributing

We're especially interested in:

  • Better context representations for the bandit
  • Credit assignment approaches
  • Statistical methodology improvements
  • Real-world experiment results (positive or negative)

git clone https://github.com/Peleke/buildlog-template
cd buildlog-template
pip install -e ".[dev]"
pytest

License

MIT License — see LICENSE


"Agent learning" without measurement is just prompt engineering with extra steps.

buildlog is measurement.
