Engineering notebook for AI-assisted development
buildlog
The Only Agent Learning System You Can Prove Works
Falsifiable claims. Measurable outcomes. No vibes.
RE: The art — Yes, it's AI-generated. Yes, that's hypocritical for a project about rigor over vibes. Looking for an actual artist to pay for a real logo. If you know someone good, open an issue or DM me. Budget exists.
The Problem · The Claim · The Mechanism · Quick Start · Review Gauntlet
The Problem
Everyone's building "agent memory." Blog posts announce breakthroughs. Tweets show impressive demos. Products ship with "learning" in the tagline.
Ask them one question: How do you know it works?
You'll get:
- "It feels smarter"
- "Users report better results"
- "The agent remembers things now"
That's not evidence. That's vibes.
Here's what a real answer looks like:
"We track Repeated Mistake Rate (RMR) across sessions. Our null hypothesis is that the system makes no difference. After 50 sessions, RMR decreased from 34% to 12% (p < 0.01). The effect size is 0.65. Here's the data."
If you can't say something like that, you don't have agent learning. You have a demo.
The Claim
buildlog makes a falsifiable claim:
H₀ (Null Hypothesis): buildlog makes no measurable difference to agent behavior.
H₁ (Alternative): Agents using buildlog-learned rules have a lower Repeated Mistake Rate than the baseline.
We provide the infrastructure to reject or fail to reject this hypothesis with your own data.
If buildlog doesn't work, the numbers will show it. That's the point.
The Metric: Repeated Mistake Rate (RMR)
RMR = (Mistakes that match previous mistakes) / (Total mistakes logged)
A mistake "matches" if it has the same semantic signature—same error class, similar description, same root cause showing up again.
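A minimal sketch of the calculation, assuming each logged mistake has already been reduced to an error class plus a signature string. The `Mistake` record and `repeated_mistake_rate` function are hypothetical names for illustration, not buildlog's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Mistake:
    # Hypothetical record; buildlog's real schema may differ.
    error_class: str
    signature: str  # stand-in for the semantic signature


def repeated_mistake_rate(mistakes: list[Mistake]) -> float:
    """Fraction of logged mistakes whose signature was already seen earlier."""
    if not mistakes:
        return 0.0
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for m in mistakes:
        key = (m.error_class, m.signature)
        if key in seen:
            repeats += 1  # same error class and signature showed up before
        else:
            seen.add(key)
    return repeats / len(mistakes)


log = [
    Mistake("type-errors", "unhandled-null"),
    Mistake("api-design", "200-on-error"),
    Mistake("type-errors", "unhandled-null"),  # repeat of the first entry
    Mistake("type-errors", "wrong-generic"),
]
rmr = repeated_mistake_rate(log)
print(f"RMR = {rmr:.2f}")  # 1 repeat out of 4 logged mistakes
```

The first occurrence of a mistake is not counted as a repeat; only later matches are. That is one reasonable reading of the formula above, not the only one.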
Why RMR?
- Observable: You can count it
- Attributable: Lower RMR after rule injection = signal
- Meaningful: Repeating mistakes is the actual pain point
RMR is not the only metric that matters. But it's one we can measure, and measurement is where science starts.
The Mechanism
buildlog uses contextual bandits to select which rules to surface.
┌─────────────────────────────────────────────────────────────────┐
│ CONTEXTUAL BANDIT SETUP │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Context (c): Error class, file type, task category │
│ Arms (a): Candidate rules to surface │
│ Reward (r): Human feedback (👍 helpful / 👎 not helpful) │
│ │
│ Policy π(c) → a: Thompson Sampling with Beta-Bernoulli │
│ │
│ Objective: Minimize regret = Σ(r* - r_chosen) │
│ │
└─────────────────────────────────────────────────────────────────┘
Arms = learned rules (from buildlog entries, code reviews, explicit teaching)
Context = what kind of problem you're working on
Reward = did surfacing this rule actually help?
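The system explores (tries uncertain rules) and exploits (uses proven rules) based on accumulated evidence. Thompson Sampling comes with theoretical guarantees: for the classical K-armed problem, published analyses give regret bounds on the order of O(√(KT log K)), where K is the number of arms (rules) and T is the number of rounds.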
This isn't magic. It's a well-understood framework with decades of research. We're applying it to agent rule selection.
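To make the mechanism concrete, here is a minimal sketch of Thompson Sampling with Beta-Bernoulli posteriors. It assumes the context is a single discrete label and keeps one independent posterior per (context, rule) pair. The class name `BetaBernoulliBandit`, the rule IDs, and the simulated helpfulness rates are all made up for illustration; buildlog's own implementation may differ.

```python
from __future__ import annotations

import random
from collections import Counter, defaultdict


class BetaBernoulliBandit:
    """Thompson Sampling with one Beta posterior per (context, rule) pair."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # [alpha, beta] per pair; Beta(1, 1) is a uniform prior.
        self._posterior = defaultdict(lambda: [1.0, 1.0])

    def select(self, context: str, rules: list[str]) -> str:
        # Draw one sample from each rule's posterior and surface the largest.
        def draw(rule: str) -> float:
            alpha, beta = self._posterior[(context, rule)]
            return self._rng.betavariate(alpha, beta)

        return max(rules, key=draw)

    def update(self, context: str, rule: str, reward: int) -> None:
        # reward is 1 (helpful) or 0 (not helpful).
        params = self._posterior[(context, rule)]
        params[0] += reward
        params[1] += 1 - reward

    def mean(self, context: str, rule: str) -> float:
        alpha, beta = self._posterior[(context, rule)]
        return alpha / (alpha + beta)


# Simulated feedback: hypothetical "true" helpfulness rates per rule.
true_rates = {"rule-a": 0.8, "rule-b": 0.5, "rule-c": 0.2}
bandit = BetaBernoulliBandit(seed=0)
env = random.Random(1)
picks = Counter()
for _ in range(2000):
    rule = bandit.select("type-errors", list(true_rates))
    reward = 1 if env.random() < true_rates[rule] else 0
    bandit.update("type-errors", rule, reward)
    picks[rule] += 1

print(picks.most_common())
```

Early on, all three rules get surfaced because their posteriors are wide. As feedback accumulates, samples from the weaker rules rarely win, so the most helpful rule takes most of the traffic.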
The Pipeline
buildlog captures signal at every stage:
flowchart LR
A["Work Sessions"] --> B["Structured Entries"]
B --> C["Extracted Rules"]
C --> D["Bandit Selection"]
D --> E["Rule Surfaced"]
E --> F["Human Feedback"]
F --> G["Posterior Update"]
G --> D
style F fill:#ff6b6b,color:#fff
style G fill:#4ecdc4,color:#fff
Stage 1: Capture
Document your work. Include the fuckups—they're the most valuable signal.
buildlog new auth-api
# Edit the markdown, document what happened
Stage 2: Extract
Pull structured rules from your entries.
buildlog distill # Extract patterns
buildlog skills # Deduplicate into rules
Stage 3: Promote
Surface rules to your agent via CLAUDE.md, settings.json, or Agent Skills.
buildlog promote --target skill
Stage 4: Measure
Track what happens when rules are active.
buildlog experiment start --error-class "type-errors"
# ... work session ...
buildlog experiment log-mistake --error-class "type-errors" \
--description "Forgot to handle null case"
buildlog experiment end
buildlog experiment report
Stage 5: Learn
Log reward signals when rules help (or don't).
# Via MCP
buildlog_log_reward(
skill_id="arch-123",
reward=1, # 1 = helped, 0 = didn't help
context="type-errors",
outcome="Caught the bug before commit"
)
Review Gauntlet
Run your code through ruthless reviewer personas, each with curated rules from authoritative sources.
# See available reviewers
buildlog gauntlet list
# Output:
# Review Gauntlet Personas
# ==================================================
# security_karen
# OWASP Top 10 security review
# Rules: 13 (v1)
#
# test_terrorist
# Comprehensive testing coverage audit
# Rules: 21 (v1)
#
# Total: 2 personas, 34 rules
Reviewer Personas
| Persona | Focus | Rules |
|---|---|---|
| Security Karen | OWASP Top 10, auth, injection, secrets | 13 |
| Test Terrorist | Coverage, property-based, metamorphic, contracts | 21 |
| Ruthless Reviewer | Code quality, FP principles | Coming soon |
Each rule includes:
- Context: When to apply it
- Antipattern: What violation looks like
- Rationale: Why it matters (with citations)
Usage
# Generate a review prompt
buildlog gauntlet prompt src/api.py
# Export rules for manual review
buildlog gauntlet rules --format markdown -o review_checklist.md
# After running a review, persist learnings
buildlog gauntlet learn review_issues.json --source "PR#42"
The gauntlet integrates with the learning loop—issues found become rules that accumulate confidence.
Experiment Infrastructure
buildlog ships with infrastructure to run actual experiments:
# Start a tracked session
buildlog experiment start --error-class "api-design"
# Log mistakes as they happen
buildlog experiment log-mistake \
--error-class "api-design" \
--description "Returned 200 for error case"
# End session
buildlog experiment end
# Get metrics
buildlog experiment metrics
# Full report across all sessions
buildlog experiment report
The report includes:
- Total sessions, total mistakes
- Repeat rate (RMR)
- Mistakes by error class
- Rules that correlate with corrections
This is the data you need to make claims.
Quick Start
# Install
pip install buildlog
# Initialize
buildlog init
# Create your first entry
buildlog new my-feature
# After a few entries, extract rules
buildlog distill
buildlog skills
# Start measuring
buildlog experiment start
# ... work ...
buildlog experiment end
buildlog experiment report
MCP Server (Claude Code Integration)
pip install buildlog[mcp]
Add to ~/.claude/settings.json:
{
"mcpServers": {
"buildlog": {
"command": "buildlog-mcp"
}
}
}
Available tools:
| Tool | Purpose |
|---|---|
| buildlog_status | View rules by category and confidence |
| buildlog_promote | Surface rules to agent |
| buildlog_reject | Mark false positives |
| buildlog_diff | Rules pending review |
| buildlog_learn_from_review | Extract rules from code review |
| buildlog_log_reward | Record reward signal |
| buildlog_start_session | Begin tracked experiment |
| buildlog_log_mistake | Record mistake during session |
| buildlog_experiment_report | Full experiment report |
CLI Commands
buildlog init # Initialize buildlog
buildlog new <slug> # Create entry
buildlog list # List entries
buildlog distill # Extract patterns
buildlog skills # Generate rules
buildlog stats # Usage statistics
buildlog reward <outcome> # Log reward signal
# Experiments
buildlog experiment start # Begin tracked session
buildlog experiment log-mistake # Record mistake
buildlog experiment end # End session
buildlog experiment report # Full report
# Review Gauntlet
buildlog gauntlet list # Show reviewers
buildlog gauntlet rules # Export rules
buildlog gauntlet prompt <path> # Generate review prompt
buildlog gauntlet learn <file> # Persist learnings
What This Is Not
This is not AGI. This is not "agents that truly learn." This is not a revolution.
This is:
- A structured way to capture engineering knowledge
- A bandit framework for rule selection
- Infrastructure to measure whether it works
Boring? Maybe. But boring things that work beat exciting things that don't.
The Falsification Protocol
Want to test whether buildlog actually helps? Here's the protocol:
- Baseline: Run N sessions without buildlog rules active. Log mistakes.
- Treatment: Run N sessions with buildlog rules active. Log mistakes.
- Compare: Calculate RMR for both conditions.
- Statistical test: Two-proportion z-test or chi-squared.
- Report: Effect size, confidence interval, p-value.
If p ≥ 0.05, we fail to reject the null. The data don't show that buildlog helped. That's a valid outcome (though not proof that the effect is exactly zero).
If p < 0.05, we have evidence of an effect. How big? Check the effect size.
This is how you know. Not vibes. Data.
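The compare-and-test steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not buildlog's reporting code: the function name `two_proportion_z_test` is hypothetical, the counts are invented, the test is one-sided to match H₁, and the effect size shown is Cohen's h.

```python
import math


def two_proportion_z_test(repeats_base, total_base, repeats_treat, total_treat):
    """One-sided z-test of H1: treatment repeat rate < baseline repeat rate."""
    p_base = repeats_base / total_base
    p_treat = repeats_treat / total_treat
    diff = p_base - p_treat

    # Pooled standard error under H0 (both conditions share one rate).
    pooled = (repeats_base + repeats_treat) / (total_base + total_treat)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / total_base + 1 / total_treat))
    if se_pooled == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = diff / se_pooled
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

    # Effect size: Cohen's h for a difference between two proportions.
    effect_h = 2 * math.asin(math.sqrt(p_base)) - 2 * math.asin(math.sqrt(p_treat))

    # 95% Wald interval for the difference, using the unpooled standard error.
    se_diff = math.sqrt(
        p_base * (1 - p_base) / total_base + p_treat * (1 - p_treat) / total_treat
    )
    ci95 = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return {"z": z, "p_value": p_value, "effect_h": effect_h, "ci95": ci95}


# Invented counts: 34 repeats in 100 baseline mistakes, 12 in 100 treatment mistakes.
result = two_proportion_z_test(34, 100, 12, 100)
print(result)
```

One caveat: the z-test treats every logged mistake as an independent observation. Mistakes within one session probably aren't, so read the p-value as a first pass rather than a final verdict.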
Theoretical Foundations
For the technically curious:
| Concept | Application in buildlog |
|---|---|
| Thompson Sampling | Rule selection under uncertainty |
| Beta-Bernoulli model | Posterior updates from binary reward |
| Contextual bandits | Context-dependent rule selection |
| Regret bounds | O(√(KT log K)) theoretical guarantee |
| Semantic hashing | Mistake deduplication for RMR |
We're not inventing new math. We're applying proven frameworks to a new domain.
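As a rough picture of how deduplication feeds RMR, here is a deliberately crude signature function. It is lexical, not semantic: it lowercases the description, drops a few stopwords, and hashes the sorted token set together with the error class. The name `mistake_signature` and the stopword list are assumptions for illustration; buildlog's actual hashing may work quite differently.

```python
import hashlib
import re

# Tiny, hypothetical stopword list; a real system would need something better.
STOPWORDS = frozenset({"a", "an", "the", "to", "of", "for", "in", "on"})


def mistake_signature(error_class: str, description: str) -> str:
    """Hash the error class plus the normalized set of content words."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    content = sorted(set(tokens) - STOPWORDS)
    payload = f"{error_class.lower()}|{' '.join(content)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


a = mistake_signature("type-errors", "Forgot to handle null case")
b = mistake_signature("type-errors", "forgot to handle the NULL case.")
c = mistake_signature("type-errors", "Returned 200 for error case")
print(a == b, a == c)
```

Two phrasings of the same mistake collapse to one signature, while an unrelated mistake does not. Paraphrases with different vocabulary would still slip through, which is why a truly semantic signature is the harder problem.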
Honest Limitations
Things we don't have figured out yet:
- Credit assignment: When multiple rules are active, which one helped?
- Non-stationarity: Developer skill changes over time
- Cold start: New rules have high uncertainty
- Context representation: What features actually matter?
These are hard problems. We have directional ideas, not solutions. If you're a researcher working on bandit algorithms or causal inference, we'd love to talk.
Philosophy
- Falsifiability over impressiveness - If you can't prove it wrong, it's not a claim
- Measurement over intuition - "Feels better" is not evidence
- Mechanisms over magic - Explain how it works or admit you don't know
- Boring over exciting - Proven frameworks beat novel demos
- Honesty over marketing - State limitations. Invite scrutiny.
Contributing
We're especially interested in:
- Better context representations for the bandit
- Credit assignment approaches
- Statistical methodology improvements
- Real-world experiment results (positive or negative)
git clone https://github.com/Peleke/buildlog-template
cd buildlog-template
pip install -e ".[dev]"
pytest
License
MIT License — see LICENSE
"Agent learning" without measurement is just prompt engineering with extra steps.
buildlog is measurement.