Engineering notebook for AI-assisted development

buildlog

The Only Agent Learning System You Can Prove Works

Falsifiable claims. Measurable outcomes. No vibes.

RE: The art — Yes, it's AI-generated. Yes, that's hypocritical for a project about rigor over vibes. Looking for an actual artist to pay for a real logo. If you know someone good, open an issue or DM me. Budget exists.

The Problem · The Claim · The Mechanism · Quick Start · Review Gauntlet

The Problem

Everyone's building "agent memory." Blog posts announce breakthroughs. Tweets show impressive demos. Products ship with "learning" in the tagline.

Ask them one question: How do you know it works?

You'll get:

"It feels smarter"
"Users report better results"
"The agent remembers things now"

That's not evidence. That's vibes.

Here's what a real answer looks like:

"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."

If you can't say something like that, you don't have agent learning. You have a demo.

The Claim

buildlog makes a falsifiable claim:

H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior; agents using buildlog-learned rules have the same Repeated Mistake Rate as baseline.

H₁ (Alternative): Agents using buildlog-learned rules have lower Repeated Mistake Rate than baseline.

We provide the infrastructure to reject or fail to reject this hypothesis with your own data.

If buildlog doesn't work, the numbers will show it. That's the point.

The Metric: Repeated Mistake Rate (RMR)

RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)

A mistake "matches" if it has the same semantic signature—same error class, similar description, same root cause showing up again.
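
The formula above can be sketched in a few lines of Python. This is an illustration only, not buildlog's implementation: the `Mistake` class and the `signature` function are hypothetical, and the normalized `(error_class, description)` pair is a crude stand-in for real semantic matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mistake:
    error_class: str
    description: str


def signature(mistake: Mistake) -> tuple[str, str]:
    # Hypothetical stand-in for a semantic signature: same error class plus a
    # case- and whitespace-normalized description.
    return (mistake.error_class, " ".join(mistake.description.lower().split()))


def repeated_mistake_rate(mistakes: list[Mistake]) -> float:
    """Fraction of logged mistakes whose signature was already seen earlier."""
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for mistake in mistakes:
        sig = signature(mistake)
        if sig in seen:
            repeats += 1
        seen.add(sig)
    return repeats / len(mistakes) if mistakes else 0.0


# Made-up log: the third and fourth entries repeat the first.
log = [
    Mistake("type-errors", "Forgot to handle null case"),
    Mistake("api-design", "Returned 200 for error case"),
    Mistake("type-errors", "forgot to handle  null case"),
    Mistake("type-errors", "Forgot to handle null case"),
]
rmr = repeated_mistake_rate(log)
print(f"RMR = {rmr:.2f}")  # 2 repeats out of 4 logged mistakes
```

Exact string matching will undercount repeats that are worded differently, so a real signature needs something fuzzier than this.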

Why RMR?

Observable: You can count it
Attributable: Lower RMR after rule injection = signal
Meaningful: Repeating mistakes is the actual pain point

RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.

The Mechanism

buildlog is building toward contextual bandits for automatic rule selection. Here's where we are:

What Exists Today (v0.7)

┌─────────────────────────────────────────────────────────────────┐
│                    CURRENT INFRASTRUCTURE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ✅ Rule extraction     From entries, reviews, curated seeds   │
│  ✅ Confidence scoring  Frequency + recency based              │
│  ✅ Reward logging      Accept/reject/revision signals         │
│  ✅ Experiment tracking Sessions, mistakes, RMR calculation    │
│  ✅ Review gauntlet     Curated persona-based code review      │
│  ⏳ Manual promotion    Human selects rules to surface         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

What's Coming (v0.8+)

┌─────────────────────────────────────────────────────────────────┐
│                    CONTEXTUAL BANDIT (PLANNED)                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Context (c):     Error class, file type, task category        │
│  Arms (a):        Candidate rules to surface                   │
│  Reward (r):      Human feedback (👍 helpful / 👎 not helpful) │
│                                                                 │
│  Policy π(c) → a:  Thompson Sampling with Beta-Bernoulli       │
│                                                                 │
│  Objective:       Minimize regret = Σ(r* - r_chosen)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Arms = learned rules (from buildlog entries, code reviews, explicit teaching)

Context = what kind of problem you're working on

Reward = did surfacing this rule actually help?

The reward infrastructure exists. The bandit policy is next. Thompson Sampling is expected to bring theoretical guarantees: regret bounds on the order of O(√(KT log K)), where K is the number of candidate rules and T is the number of rounds.
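
For readers unfamiliar with the technique, here is a minimal sketch of Thompson Sampling with Beta-Bernoulli posteriors. It is not the planned buildlog policy. The class name, the uniform Beta(1, 1) prior, and the choice to keep one independent posterior per (context, rule) pair are all assumptions made for illustration, and the helpfulness rates in the simulation are invented.

```python
import random
from collections import defaultdict
from typing import Optional


class BetaBernoulliBandit:
    """Thompson Sampling with one Beta posterior per (context, rule) pair."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        # [alpha, beta] starts at [1, 1], i.e. a uniform prior (an assumption).
        self._params: dict = defaultdict(lambda: [1.0, 1.0])

    def select(self, context: str, rules: list[str]) -> str:
        # Draw one plausible helpfulness rate per rule and surface the best draw.
        def draw(rule: str) -> float:
            alpha, beta = self._params[(context, rule)]
            return self._rng.betavariate(alpha, beta)

        return max(rules, key=draw)

    def update(self, context: str, rule: str, reward: int) -> None:
        # reward=1 increments alpha (helped); reward=0 increments beta (did not).
        params = self._params[(context, rule)]
        params[0 if reward else 1] += 1.0

    def posterior_mean(self, context: str, rule: str) -> float:
        alpha, beta = self._params[(context, rule)]
        return alpha / (alpha + beta)


# Simulated feedback with made-up helpfulness rates (not real data).
true_rates = {"rule-a": 0.8, "rule-b": 0.2}
bandit = BetaBernoulliBandit(seed=0)
world = random.Random(1)
pulls = {rule: 0 for rule in true_rates}
for _ in range(500):
    rule = bandit.select("type-errors", list(true_rates))
    reward = 1 if world.random() < true_rates[rule] else 0
    bandit.update("type-errors", rule, reward)
    pulls[rule] += 1
print(pulls)
```

Keying posteriors by context string is the simplest way to make the choice context-dependent, but it shares nothing between contexts. A richer context representation would need a different model.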

We're building in public—the bandit implementation will be developed with full documentation of the process.

The Pipeline

buildlog captures signal at every stage:

flowchart LR
    A["Work Sessions"] --> B["Structured Entries"]
    B --> C["Extracted Rules"]
    C --> D["Manual Promotion"]
    D --> E["Rule Surfaced"]
    E --> F["Human Feedback"]
    F --> G["Reward Logged"]
    G -.-> H["Bandit Policy"]
    H -.-> D

    style F fill:#ff6b6b,color:#fff
    style G fill:#4ecdc4,color:#fff
    style H fill:#666,color:#fff,stroke-dasharray: 5 5

Dashed: Coming in v0.8 — automatic rule selection via Thompson Sampling

Stage 1: Capture

Document your work. Include the fuckups—they're the most valuable signal.

buildlog new auth-api
# Edit the markdown, document what happened

Stage 2: Extract

Pull structured rules from your entries.

buildlog distill    # Extract patterns
buildlog skills     # Deduplicate into rules

Stage 3: Promote

Surface rules to your agent via CLAUDE.md, settings.json, or Agent Skills.

buildlog promote --target skill

Stage 4: Measure

Track what happens when rules are active.

buildlog experiment start --error-class "type-errors"
# ... work session ...
buildlog experiment log-mistake --error-class "type-errors" \
  --description "Forgot to handle null case"
buildlog experiment end
buildlog experiment report

Stage 5: Learn

Log reward signals when rules help (or don't).

# Via MCP
buildlog_log_reward(
    skill_id="arch-123",
    reward=1,           # 1 = helped, 0 = didn't help
    context="type-errors",
    outcome="Caught the bug before commit"
)

Review Gauntlet

Run your code through ruthless reviewer personas, each with curated rules from authoritative sources.

# See available reviewers
buildlog gauntlet list

# Output:
# Review Gauntlet Personas
# ==================================================
#   security_karen
#     OWASP Top 10 security review
#     Rules: 13 (v1)
#
#   test_terrorist
#     Comprehensive testing coverage audit
#     Rules: 21 (v1)
#
# Total: 2 personas, 34 rules

Reviewer Personas

Persona	Focus	Rules
Security Karen	OWASP Top 10, auth, injection, secrets	13
Test Terrorist	Coverage, property-based, metamorphic, contracts	21
Ruthless Reviewer	Code quality, FP principles	Coming soon

Each rule includes:

Context: When to apply it
Antipattern: What violation looks like
Rationale: Why it matters (with citations)

Usage

# Generate a review prompt
buildlog gauntlet prompt src/api.py

# Export rules for manual review
buildlog gauntlet rules --format markdown -o review_checklist.md

# After running a review, persist learnings
buildlog gauntlet learn review_issues.json --source "PR#42"

Gauntlet Loop (Agent Integration)

For AI agents, the gauntlet loop automates the fix-rerun cycle:

buildlog gauntlet loop src/ --persona security_karen --persona test_terrorist

The loop provides structured checkpoints:

Severity	Action	Human Needed?
Critical	Agent fixes, reruns	No
Major	Checkpoint: continue?	Yes
Minor	Accept risk or fix?	Yes
Clean	Done	No
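
The checkpoint table can be read as a small decision function. The sketch below is one interpretation, not the gauntlet's actual code: it assumes the worst remaining severity decides the next action, which the table does not state, and the names `Severity`, `ACTIONS`, and `next_action` are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# (action, human_needed) per severity, copied from the table above.
ACTIONS = {
    Severity.CRITICAL: ("agent fixes, reruns", False),
    Severity.MAJOR: ("checkpoint: continue?", True),
    Severity.MINOR: ("accept risk or fix?", True),
}


def next_action(issues: list[Severity]) -> tuple[str, bool]:
    """Return (action, human_needed) for the worst remaining severity."""
    if not issues:
        return ("done", False)  # clean run
    # Assumption: the highest severity present drives the decision.
    return ACTIONS[max(issues)]


print(next_action([Severity.MINOR, Severity.CRITICAL]))
print(next_action([Severity.MINOR]))
print(next_action([]))
```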

MCP tools for agent integration:

buildlog_gauntlet_issues — Report findings, get next action
buildlog_gauntlet_accept_risk — Accept remaining issues (optionally create GitHub issues)

The gauntlet integrates with the learning loop—issues found become rules that accumulate confidence.

Experiment Infrastructure

buildlog ships with infrastructure to run actual experiments:

# Start a tracked session
buildlog experiment start --error-class "api-design"

# Log mistakes as they happen
buildlog experiment log-mistake \
  --error-class "api-design" \
  --description "Returned 200 for error case"

# End session
buildlog experiment end

# Get metrics
buildlog experiment metrics

# Full report across all sessions
buildlog experiment report

The report includes:

Total sessions, total mistakes
Repeat rate (RMR)
Mistakes by error class
Rules that correlate with corrections

This is the data you need to make claims.

Quick Start

# Install
pip install buildlog

# Initialize
buildlog init

# Create your first entry
buildlog new my-feature

# After a few entries, extract rules
buildlog distill
buildlog skills

# Start measuring
buildlog experiment start
# ... work ...
buildlog experiment end
buildlog experiment report

MCP Server (Claude Code Integration)

pip install "buildlog[mcp]"

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "buildlog": {
      "command": "buildlog-mcp"
    }
  }
}

Available tools:

Tool	Purpose
`buildlog_status`	View rules by category and confidence
`buildlog_promote`	Surface rules to agent
`buildlog_reject`	Mark false positives
`buildlog_diff`	Rules pending review
`buildlog_learn_from_review`	Extract rules from code review
`buildlog_log_reward`	Record reward signal
`buildlog_start_session`	Begin tracked experiment
`buildlog_log_mistake`	Record mistake during session
`buildlog_experiment_report`	Full experiment report
`buildlog_gauntlet_issues`	Report gauntlet findings, get next action
`buildlog_gauntlet_accept_risk`	Accept remaining issues, optionally create GH issues

CLI Commands

buildlog init                    # Initialize buildlog
buildlog new <slug>              # Create entry
buildlog list                    # List entries
buildlog distill                 # Extract patterns
buildlog skills                  # Generate rules
buildlog stats                   # Usage statistics
buildlog reward <outcome>        # Log reward signal

# Experiments
buildlog experiment start        # Begin tracked session
buildlog experiment log-mistake  # Record mistake
buildlog experiment end          # End session
buildlog experiment report       # Full report

# Review Gauntlet
buildlog gauntlet list           # Show reviewers
buildlog gauntlet rules          # Export rules
buildlog gauntlet prompt <path>  # Generate review prompt
buildlog gauntlet learn <file>   # Persist learnings
buildlog gauntlet loop <path>    # Auto-fix loop with HITL checkpoints

What This Is Not

This is not AGI. This is not "agents that truly learn." This is not a revolution.

This is:

A structured way to capture engineering knowledge
The groundwork for a bandit framework for rule selection (reward logging exists today; the policy is planned)
Infrastructure to measure whether it works

Boring? Maybe. But boring things that work beat exciting things that don't.

The Falsification Protocol

Want to test whether buildlog actually helps? Here's the protocol:

Baseline: Run N sessions without buildlog rules active. Log mistakes.
Treatment: Run N sessions with buildlog rules active. Log mistakes.
Compare: Calculate RMR for both conditions.
Statistical test: Two-proportion z-test or chi-squared.
Report: Effect size, confidence interval, p-value.

If p ≥ 0.05, we fail to reject the null: the data show no evidence that buildlog helped. That's a valid outcome.

If p < 0.05, we have evidence of an effect. How big? Check the effect size.
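
The compare, test, and report steps can be sketched with the standard library alone. This is an illustrative sketch, not part of buildlog: the function names are hypothetical, the counts are invented, the test is one-sided to match H₁, and Cohen's h is used as the effect size as one reasonable choice among several.

```python
import math


def two_proportion_z_test(repeats_base, total_base, repeats_treat, total_treat):
    """One-sided test of H1: treatment RMR < baseline RMR. Returns (z, p_value)."""
    p_base = repeats_base / total_base
    p_treat = repeats_treat / total_treat
    pooled = (repeats_base + repeats_treat) / (total_base + total_treat)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_base + 1 / total_treat))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (p_base - p_treat) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


def cohens_h(p_base, p_treat):
    """Effect size for a difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_base)) - 2 * math.asin(math.sqrt(p_treat))


def difference_ci(p_base, n_base, p_treat, n_treat, z_crit=1.96):
    """Approximate 95% Wald interval for (baseline RMR - treatment RMR)."""
    se = math.sqrt(p_base * (1 - p_base) / n_base + p_treat * (1 - p_treat) / n_treat)
    diff = p_base - p_treat
    return diff - z_crit * se, diff + z_crit * se


# Made-up counts: 34 repeats in 100 baseline mistakes, 12 in 100 with rules active.
z, p = two_proportion_z_test(34, 100, 12, 100)
h = cohens_h(0.34, 0.12)
low, high = difference_ci(0.34, 100, 0.12, 100)
print(f"z = {z:.2f}, p = {p:.5f}, h = {h:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The test treats every logged mistake as an independent trial. Mistakes within a session are probably correlated, so read the p-value as optimistic unless the analysis accounts for that.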

This is how you know. Not vibes. Data.

Theoretical Foundations

For the technically curious:

Concept	Application in buildlog	Status
Confidence scoring	Frequency + recency decay	✅ Implemented
Semantic hashing	Mistake deduplication for RMR	✅ Implemented
Reward signals	Binary feedback infrastructure	✅ Implemented
Thompson Sampling	Rule selection under uncertainty	⏳ Planned (v0.8)
Beta-Bernoulli model	Posterior updates from binary reward	⏳ Planned (v0.8)
Contextual bandits	Context-dependent rule selection	⏳ Planned (v0.8)
Regret bounds	O(√(KT log K)) theoretical guarantee	⏳ Planned (v0.8)

We're not inventing new math. We're applying proven frameworks to a new domain. The infrastructure for reward collection is live; the bandit policy is the next milestone.

Honest Limitations

Not Yet Implemented

Automatic rule selection: Currently manual promotion; Thompson Sampling bandit planned for v0.8
Context-aware surfacing: Rules are surfaced globally, not based on task context

Hard Problems We're Working On

Credit assignment: When multiple rules are active, which one helped?
Non-stationarity: Developer skill changes over time
Cold start: New rules have high uncertainty
Context representation: What features actually matter?

These are hard problems. We have directional ideas, not solutions. If you're a researcher working on bandit algorithms or causal inference, we'd love to talk.

Philosophy

Falsifiability over impressiveness - If you can't prove it wrong, it's not a claim
Measurement over intuition - "Feels better" is not evidence
Mechanisms over magic - Explain how it works or admit you don't know
Boring over exciting - Proven frameworks beat novel demos
Honesty over marketing - State limitations. Invite scrutiny.

Contributing

We're especially interested in:

Better context representations for the bandit
Credit assignment approaches
Statistical methodology improvements
Real-world experiment results (positive or negative)

git clone https://github.com/Peleke/buildlog-template
cd buildlog-template
pip install -e ".[dev]"
pytest

License

MIT License — see LICENSE

"Agent learning" without measurement is just prompt engineering with extra steps.

buildlog is measurement.

Release history

0.23.0

Mar 14, 2026

0.22.0

Mar 14, 2026

0.21.1

Mar 12, 2026

0.21.0

Mar 12, 2026

0.20.0

Mar 7, 2026

0.18.4

Feb 15, 2026

0.18.2

Feb 14, 2026

0.18.1

Feb 14, 2026

0.15.0

Feb 13, 2026

0.13.1

Feb 7, 2026

0.13.0

Feb 6, 2026

0.12.0

Feb 5, 2026

0.11.1

Feb 5, 2026

0.11.0

Feb 5, 2026

0.10.5

Feb 4, 2026

0.10.4

Feb 4, 2026

0.10.3

Feb 4, 2026

0.10.2

Feb 4, 2026

0.10.1

Feb 4, 2026

0.10.0

Feb 4, 2026

0.9.0

Feb 2, 2026

0.8.0

Jan 31, 2026

This version

0.7.0

Jan 22, 2026

0.6.1

Jan 22, 2026

0.6.0

Jan 22, 2026

0.5.0

Jan 22, 2026

0.4.0

Jan 17, 2026

0.3.0

Jan 17, 2026

0.2.0

Jan 16, 2026

0.1.0

Jan 16, 2026

Download files

Source Distribution

buildlog-0.7.0.tar.gz (82.8 kB view details)

Uploaded Jan 22, 2026 Source

Built Distribution

buildlog-0.7.0-py3-none-any.whl (100.2 kB view details)

Uploaded Jan 22, 2026 Python 3

File details

Details for the file buildlog-0.7.0.tar.gz.

File metadata

Download URL: buildlog-0.7.0.tar.gz
Upload date: Jan 22, 2026
Size: 82.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for buildlog-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`66fc411e4634913129165417fc89b2c609f7a011b63ec8d6971f109fb30d8ae3`
MD5	`3905e3c59e399312c1e0bc6b00c24221`
BLAKE2b-256	`52625bc6b1cb8e815b6eb9c4f76dcc268d475a2de7326b41853693e01f2be40c`

File details

Details for the file buildlog-0.7.0-py3-none-any.whl.

File metadata

Download URL: buildlog-0.7.0-py3-none-any.whl
Upload date: Jan 22, 2026
Size: 100.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for buildlog-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f1902f60c6ff2ba4e1b94a811b9233bf990315e64d01b1f2c4c62d536c285a6f`
MD5	`2ece20120e8775735505876daae5fa01`
BLAKE2b-256	`8846032eaf004a159f7140cf91870f0aa1265868d1d62b14dfd5bcd795767818`
