Self-improving agents with closed-loop learning — agents that learn to get it right
CannyForge
Self-Improving Agents Through Closed-Loop Learning
CannyForge demonstrates how autonomous agents can genuinely learn from experience through closed-loop feedback. Skills are defined declaratively via AgentSkills.io-compliant SKILL.md files -- no Python subclassing required. The engine handles execution, error detection, pattern learning, rule application, and rule lifecycle automatically.
Install
pip install cannyforge # from PyPI
cannyforge demo # run the 3-act demo
cannyforge run "write email" # execute a task
Or install from source:
git clone https://github.com/cannyforge/cannyforge.git
cd cannyforge
pip install -e .
CLI
cannyforge demo # animated terminal demo
cannyforge demo --speed 0 # instant (CI)
cannyforge run "task" # execute one task
cannyforge new-skill name # scaffold a skill
cannyforge stats # show KB state
cannyforge rules email_writer # inspect rules
cannyforge learn # trigger learning
cannyforge export # export training data
cannyforge install github:user/repo/path/to/skill # install from GitHub
cannyforge serve # start MCP server
cannyforge dashboard # launch Streamlit dashboard
Quick Start (code)
from cannyforge import CannyForge
forge = CannyForge()
result = forge.execute("Write an email about the 3 PM meeting")
print(result.success, result.output) # False, then True after learning
Core Concept
Task --> [Apply Rules] --> Execute --> Outcome --> Learn --> Update Rules
              ^                                                  |
              +-------------------- Knowledge Base <-------------+
The key insight: Knowledge must flow back into execution. Rules learned from past errors are evaluated against new tasks and actively prevent predicted failures -- and rules that stop working are automatically retired.
skill -- warm start: templates and structure ready from day one
forge -- calibration: watches every execution, builds rules, enforces them, and retires what doesn't work
Run the Animated Demo
cannyforge demo # normal speed
cannyforge demo --speed 0 # instant (CI / quick review)
cannyforge demo --speed 2 # slow (presentations)
cannyforge demo --seed 7 # different random sequence
The demo runs three acts in your terminal:
- Act I — Tasks execute with zero rules. Same errors repeat. Auto-learn fires mid-stream.
- Act II — Rules active. Forge enforces what it learned.
- Act III — A poorly-calibrated rule degrades ACTIVE → PROBATION → DORMANT, then gets resurrected when the same errors resurface.
Run Tests
pytest tests/ -v
258 tests across 9 test files covering skill loading, knowledge rules, declarative execution, learning, LLM integration, multi-step execution, integration, spec compliance, and production readiness.
How Learning Works
1. Automatic Trigger
CannyForge monitors errors per skill and auto-triggers a learning cycle when enough uncovered signal accumulates -- no manual call needed:
forge = CannyForge()
# Just execute tasks. Learning triggers automatically when:
# - 2+ distinct error types appear that no existing rule covers, OR
# - 20+ raw errors accumulate since the last cycle
result = forge.execute("Write email about the 3 PM meeting")
# TimezoneError logged → uncovered signal accumulates
# ...after enough failures, forge.run_learning_cycle() fires automatically
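The two trigger conditions in the comments above can be sketched as a single predicate. This is a minimal illustration, not CannyForge's internal code: the function name, parameter names, and constants are hypothetical, and only the thresholds (2 distinct uncovered types, 20 raw errors) come from the text.

```python
# Thresholds quoted from the comments above; the names are hypothetical.
MIN_DISTINCT_UNCOVERED = 2
MAX_RAW_ERRORS = 20


def should_trigger_learning(error_types, covered_types):
    """Return True when either auto-learn condition is met.

    error_types: error type names logged since the last learning cycle.
    covered_types: error types that an existing rule already covers.
    """
    # Distinct error types that no existing rule covers.
    uncovered = {e for e in error_types if e not in covered_types}
    enough_new_signal = len(uncovered) >= MIN_DISTINCT_UNCOVERED
    # Raw volume counts every logged error, covered or not.
    enough_volume = len(error_types) >= MAX_RAW_ERRORS
    return enough_new_signal or enough_volume
```

Under these assumptions, a single repeated uncovered error type only triggers learning once the raw count reaches 20, while two different uncovered types trigger it immediately.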
2. Pattern Detection
# Can also trigger manually
metrics = forge.run_learning_cycle(min_frequency=3, min_confidence=0.3)
# Generated rule:
# IF task.description matches '\d{1,2}\s*(am|pm)'
# AND context.has_timezone == False
# THEN add_field(context.timezone, 'UTC')
# flag(_flags, 'timezone_added')
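To make the generated rule concrete, here is a minimal sketch of how its condition and actions might be evaluated against a plain dict context. The function name and the dict layout are assumptions for illustration, and the case-insensitive match is also an assumption (the rule text shows lowercase am/pm while the example tasks use "PM"); CannyForge's own rule engine may differ.

```python
import re

# Regex copied from the generated rule; IGNORECASE is an assumption.
TIME_PATTERN = re.compile(r"\d{1,2}\s*(am|pm)", re.IGNORECASE)


def apply_timezone_rule(description, context):
    """Apply the IF/THEN rule above and return an updated copy of context."""
    updated = dict(context)
    mentions_time = TIME_PATTERN.search(description) is not None
    if mentions_time and not updated.get("has_timezone", False):
        # add_field(context.timezone, 'UTC')
        updated["timezone"] = "UTC"
        # flag(_flags, 'timezone_added'), without mutating the caller's list
        updated["_flags"] = list(updated.get("_flags", [])) + ["timezone_added"]
    return updated
```

The function returns a copy so that the caller can compare the context before and after the rule fired.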
3. Rule Application
# Rules apply before execution (PREVENTION), after (VALIDATION),
# or on mid-execution failure (RECOVERY)
result = forge.execute("Send email about 2 PM meeting")
print(result.rules_applied) # ['rule_timezoneerror_1']
4. Adaptive Confidence Updates
Rule confidence uses an adaptive exponential moving average. The prior dominates early, when few observations exist, and observations dominate later:
prior_weight = 2.0 / (applications + 2)
confidence = prior_weight × prior + (1 − prior_weight) × effectiveness
This lets a rule recover from initial bad luck: as applications grow, prior_weight shrinks toward zero and confidence converges toward the observed effectiveness instead of being locked in by early results.
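The update can be written as a small function to see how the weighting shifts. This is a sketch of the two formulas above only; the function name is hypothetical and the real record_outcome() may do more.

```python
def update_confidence(prior, effectiveness, applications):
    """Blend the prior with observed effectiveness, per the formulas above."""
    prior_weight = 2.0 / (applications + 2)
    return prior_weight * prior + (1.0 - prior_weight) * effectiveness


# With a prior of 0.5 and a rule that has worked every time (effectiveness 1.0),
# the prior's share of the result falls as applications accumulate.
for n in (0, 2, 18):
    print(n, round(update_confidence(0.5, 1.0, n), 3))
```

With zero applications the weight is 1.0, so the result is the prior itself; by 18 applications the weight has dropped to 0.1.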
5. Rule Lifecycle
Rules that underperform are demoted, not deleted. The knowledge is preserved for resurrection:
ACTIVE → effectiveness < threshold, n≥5 → PROBATION
PROBATION → effectiveness ≥ threshold×1.1 → ACTIVE (hysteresis)
PROBATION → n≥15 AND eff < threshold×0.7 → DORMANT
DORMANT → same error type resurfaces → ACTIVE (resurrection)
Thresholds differ by rule type — PREVENTION rules are held to a higher standard (0.45) than RECOVERY rules (0.25), which face harder attribution problems.
Dormant rules fire the resurrection path in add_rule() the next time the learning cycle regenerates a rule for the same error type. The resurrected rule starts with partial confidence (min(new_conf × 0.6, 0.5)), not a full reset, so the degradation history informs the restart.
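The transitions above can be read as a small state machine. The sketch below is one way to express them, assuming the enum values match the lowercase status names used elsewhere in this README; next_status and resurrected_confidence are hypothetical names, not the library's _check_lifecycle() or add_rule().

```python
from enum import Enum


class RuleStatus(Enum):
    ACTIVE = "active"
    PROBATION = "probation"
    DORMANT = "dormant"


def next_status(status, effectiveness, n, threshold):
    """One lifecycle step; n is the number of recorded applications."""
    if status is RuleStatus.ACTIVE:
        if n >= 5 and effectiveness < threshold:
            return RuleStatus.PROBATION
    elif status is RuleStatus.PROBATION:
        # Hysteresis: promotion needs a margin above the demotion threshold.
        if effectiveness >= threshold * 1.1:
            return RuleStatus.ACTIVE
        if n >= 15 and effectiveness < threshold * 0.7:
            return RuleStatus.DORMANT
    # DORMANT rules only change state through resurrection, handled elsewhere.
    return status


def resurrected_confidence(new_conf):
    """Starting confidence for a resurrected rule: min(new_conf x 0.6, 0.5)."""
    return min(new_conf * 0.6, 0.5)
```

For a PREVENTION rule (threshold 0.45), this means demotion below 0.45 effectiveness, promotion back at roughly 0.495 or above, and dormancy below roughly 0.315 after 15 applications.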
Creating a New Skill
Create a directory under skills/ with a single SKILL.md file:
skills/
  my-new-skill/
    SKILL.md          # required -- defines the skill
    assets/           # optional -- templates, data files
      templates.yaml
    scripts/          # optional -- custom Python handler
      handler.py
Minimal SKILL.md
---
name: my-new-skill
description: What this skill does.
metadata:
  triggers:
    - keyword1
    - keyword2
  output_type: result_type
---
# My New Skill
Detailed description in markdown.
That's it. CannyForge auto-discovers the skill, matches tasks to it via triggers, and wires up the learning loop. No code changes needed.
Execution Tiers (priority order)
- scripts/handler.py -- full control via custom Python (highest priority)
- LLM-powered -- when an llm_provider is passed to CannyForge(), uses a multi-step tool-calling loop
- Template-based -- intent matching against assets/templates.yaml (fallback)
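The priority order above amounts to a simple fall-through. The sketch below illustrates it under the assumption that the handler tier is detected by the presence of scripts/handler.py in the skill directory; select_tier is a hypothetical helper, not part of the CannyForge API.

```python
from pathlib import Path


def select_tier(skill_dir, llm_provider=None):
    """Pick an execution tier in the priority order listed above."""
    # Tier 1: a custom handler always wins when it exists.
    if (Path(skill_dir) / "scripts" / "handler.py").is_file():
        return "handler"
    # Tier 2: use the LLM loop only when a provider was supplied.
    if llm_provider is not None:
        return "llm"
    # Tier 3: fall back to template matching.
    return "template"
```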
Optional: Templates
greeting:
  match: [hello, hi]
  subject: "Greeting"
  body: "Hello there!"
default:
  match: []
  subject: "General"
  body: "Default output"
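How the match lists are used is not spelled out here, so the following is only one plausible reading: pick the first template whose keywords appear as whole words in the task, and fall back to default otherwise. The pick_template helper and the whole-word matching are assumptions, and the templates are written as a Python dict to keep the sketch dependency-free.

```python
import re

# Same content as the templates.yaml example above, as a plain dict.
TEMPLATES = {
    "greeting": {"match": ["hello", "hi"], "subject": "Greeting", "body": "Hello there!"},
    "default": {"match": [], "subject": "General", "body": "Default output"},
}


def pick_template(task, templates):
    """Return the name of the first template with a keyword in the task."""
    # Whole-word matching so that "hi" does not match inside "this".
    words = set(re.findall(r"[a-z0-9']+", task.lower()))
    for name, template in templates.items():
        if name != "default" and words & set(template["match"]):
            return name
    return "default"
```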
Optional: Custom Handler
from cannyforge.skills import ExecutionResult, ExecutionStatus, SkillOutput

def execute(context, metadata):
    # Return a successful result with a custom output payload
    return ExecutionResult(
        status=ExecutionStatus.SUCCESS,
        output=SkillOutput(content={"key": "value"}, output_type="custom"),
    )
Architecture
Declarative Skills (AgentSkills.io Spec)
Skills are defined via SKILL.md with YAML frontmatter following the AgentSkills.io specification. CannyForge-specific extensions live under the metadata field:
| Field | Purpose |
|---|---|
| name | Hyphenated lowercase identifier (e.g. email-writer) |
| description | What the skill does |
| license | License type |
| metadata.triggers | Keywords for task-to-skill matching |
| metadata.output_type | Output category |
| metadata.context_fields | Typed execution context fields with defaults |
Included Skills
| Skill | Triggers | Output Type |
|---|---|---|
| email-writer | email, write email, compose, draft email | |
| calendar-manager | calendar, schedule, meeting, book, reserve | calendar_event |
| web-searcher | search, find, research, look up, query | search_results |
| content-summarizer | summarize, summary, abstract, condense, extract | summary |
Core Components
skills.py -- Declarative Skill System
- ExecutionContext: Dynamic properties via __getattr__/__setattr__, backward-compatible with rule dicts
- DeclarativeSkill: Three-tier execution (handler → LLM → template), multi-step loop bounded by max_steps
- SkillLoader: Scans skills/ directory, parses frontmatter, creates skill instances
- SkillRegistry: Trigger-based task matching with scoring (match count + earliest position)
- StepRecord: Per-step tracking of tool calls, tool results, errors, and recovery applied
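The SkillRegistry scoring (match count plus earliest position) could look roughly like the sketch below. The trigger lists come from the Included Skills table; the function names, the substring matching, and the exact tie-breaking are assumptions rather than the registry's actual implementation.

```python
# Trigger lists from the Included Skills table.
SKILLS = {
    "email-writer": ["email", "write email", "compose", "draft email"],
    "calendar-manager": ["calendar", "schedule", "meeting", "book", "reserve"],
    "web-searcher": ["search", "find", "research", "look up", "query"],
    "content-summarizer": ["summarize", "summary", "abstract", "condense", "extract"],
}


def score_skill(task, triggers):
    """Return (match count, negated earliest position), or None if no match."""
    text = task.lower()
    hits = [pos for pos in (text.find(t) for t in triggers) if pos >= 0]
    if not hits:
        return None
    # Negating the position makes an earlier first match compare as larger.
    return (len(hits), -min(hits))


def match_skill(task, skills):
    """Pick the skill with the most trigger hits, then the earliest hit."""
    scored = {name: score_skill(task, trig) for name, trig in skills.items()}
    scored = {name: s for name, s in scored.items() if s is not None}
    return max(scored, key=scored.get) if scored else None
```

For example, "Write email about the meeting" hits two email-writer triggers and one calendar-manager trigger, so email-writer is selected.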
knowledge.py -- Actionable Knowledge System
- RuleStatus: ACTIVE / PROBATION / DORMANT lifecycle states
- Rules with Condition → Action structure; conditions: contains, matches, equals, gt, lt
- effective_confidence: confidence × staleness decay (10% per 30 days idle, floor 50%)
- PATTERN_LIBRARY: Backbone intelligence shared across all skills -- TimezoneError, SpamTriggerError, AttachmentError, ConflictError, PreferenceError, PoorQueryError, LowCredibilityError
- Adaptive EMA confidence updates in record_outcome(); lifecycle transitions in _check_lifecycle()
- add_rule() detects dormant resurrection and probation boost via semantic match (same source_error_type + rule_type)
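The staleness decay behind effective_confidence is described only as "10% per 30 days idle, floor 50%". The sketch below assumes the decay is linear in idle time and that the floor applies to the decay factor rather than to the confidence itself; both are assumptions, and the function name is hypothetical.

```python
def effective_confidence(confidence, idle_days):
    """Scale confidence by a staleness factor that never drops below 0.5."""
    # Assumed linear decay: lose 10% of the factor for every 30 idle days.
    decay = 1.0 - 0.10 * (idle_days / 30.0)
    # Floor the factor at 50% so long-idle rules keep half their confidence.
    return confidence * max(decay, 0.5)
```

Under this reading, the floor is reached after 150 idle days and a rule never loses more than half of its stored confidence to staleness.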
learning.py -- Pattern Detection and Learning Engine
- PatternDetector: Groups errors by type, filters by min_frequency and min_confidence (confidence = frequency / total_errors)
- LearningEngine.run_learning_cycle(): Two passes: PREVENTION rules from error repo, RECOVERY rules from step error repo
- Dormant-aware already_has_rule check: dormant rules are allowed to be re-derived and resurrected
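The PatternDetector filter can be illustrated with a few lines of Python. This is a sketch of the grouping and thresholds described above, with a hypothetical function name; it works on bare error type names rather than CannyForge's ErrorRecord objects.

```python
from collections import Counter


def detect_patterns(error_types, min_frequency=3, min_confidence=0.3):
    """Return {error_type: confidence} for types that pass both filters."""
    counts = Counter(error_types)
    total = len(error_types)
    patterns = {}
    for error_type, frequency in counts.items():
        # Confidence is the type's share of all logged errors.
        confidence = frequency / total
        if frequency >= min_frequency and confidence >= min_confidence:
            patterns[error_type] = confidence
    return patterns


# 8 timezone errors and 3 spam errors: the spam type meets min_frequency
# but its share (3/11) is below min_confidence, so it is filtered out.
errors = ["TimezoneError"] * 8 + ["SpamTriggerError"] * 3
print(detect_patterns(errors))
```

The example shows how a minority error type can be dropped when one high-frequency type dominates the log, even though it occurred often enough in absolute terms.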
core.py -- Unified Interface
- _maybe_auto_learn(): Per-skill uncovered-error tracking, auto-triggers learning cycle
- Dynamic error classification derived from PATTERN_LIBRARY (keyword → error type)
- LLM-based error classification when a provider is available
- reset(): Clears stats and learning data; for clean KB state pass data_dir=tempfile.mkdtemp() at construction
llm.py -- LLM Providers
- LLMProvider ABC with ClaudeProvider, OpenAIProvider, DeepSeekProvider, MockProvider
- MockProvider supports step_responses list for deterministic multi-step test scenarios
storage.py -- Storage Backends
- JSONFileBackend: Default file-based storage (JSONL for errors/successes, JSON for rules)
- SQLiteBackend: Thread-safe relational storage with automatic schema migration
adapters/ -- Framework Integration
- langchain.py: CannyForgeTool wraps any skill as a LangChain tool
- crewai.py: CannyForgeCrewTool wraps any skill as a CrewAI tool
Project Structure
cannyforge/
├── pyproject.toml # Project config, pytest settings
├── CLAUDE.md # Developer guide
│
├── cannyforge/ # Main package
│ ├── __init__.py # Public API exports
│ ├── cli.py # CLI entry point (11 commands)
│ ├── core.py # CannyForge orchestrator
│ ├── knowledge.py # Rules, conditions, actions, PATTERN_LIBRARY
│ ├── skills.py # DeclarativeSkill, SkillLoader, SkillRegistry
│ ├── learning.py # ErrorRecord, PatternDetector, LearningEngine
│ ├── llm.py # LLM providers (Claude, OpenAI, DeepSeek, Mock)
│ ├── tools.py # ToolDefinition, ToolExecutor, ToolRegistry
│ ├── storage.py # Storage backends (JSON, SQLite)
│ ├── workers.py # Background learning workers
│ ├── registry.py # Community skill registry
│ ├── mcp_server.py # MCP server
│ ├── export.py # Training data export (DPO, Anthropic)
│ ├── dashboard.py # Streamlit monitoring dashboard
│ ├── adapters/ # Framework adapters
│ │ ├── langchain.py # LangChain integration
│ │ └── crewai.py # CrewAI integration
│ ├── services/ # External services (mock + real)
│ │ ├── slack_service.py
│ │ ├── email_service.py
│ │ └── crm_service.py
│ └── bundled_skills/ # Built-in skills
│ ├── email-writer/
│ ├── calendar-manager/
│ ├── web-searcher/
│ └── content-summarizer/
│
├── scenarios/
│ ├── demo.py # Animated terminal demo (3 acts)
│ └── scenario_email.py # Ablation scenario
│
├── examples/
│ └── quickstart.py # Quickstart example
│
├── tests/ # 258 tests
│ ├── conftest.py # Shared fixtures
│ ├── test_skill_loader.py
│ ├── test_knowledge.py
│ ├── test_declarative_skill.py
│ ├── test_learning.py
│ ├── test_llm.py
│ ├── test_tools.py
│ ├── test_integration.py
│ ├── test_spec_compliance.py
│ └── test_production.py # Production readiness tests
│
└── .github/workflows/ci.yml # CI: test (Python 3.10-3.12) + spec validation
Usage Examples
Basic Execution
from cannyforge import CannyForge
forge = CannyForge()
result = forge.execute("Write a professional email about the project")
print(f"Skill: {result.skill_name}")
print(f"Success: {result.success}")
print(f"Rules applied: {result.rules_applied}")
print(f"Output: {result.output}")
With LLM Provider
from cannyforge import CannyForge, ClaudeProvider
forge = CannyForge(llm_provider=ClaudeProvider())
# Skills now use the three-tier execution:
# 1. Custom handler (if present)
# 2. LLM multi-step tool loop
# 3. Template fallback
result = forge.execute("Write an email about the meeting at 3 PM")
Learning Cycle (manual)
# Auto-learning fires automatically, but you can also trigger manually
metrics = forge.run_learning_cycle(min_frequency=3, min_confidence=0.3)
print(f"Patterns detected: {metrics.patterns_detected}")
print(f"Rules generated: {metrics.rules_generated}")
Statistics
stats = forge.get_statistics()
print(f"Success rate: {stats['execution']['success_rate']:.1%}")
print(f"Total rules: {stats['learning']['total_rules']}")
# Rule lifecycle breakdown
kb_stats = forge.knowledge_base.get_statistics()
print(kb_stats['rules_by_status']) # {'active': N, 'probation': N, 'dormant': N}
Rule Inspection
for rule in forge.knowledge_base.get_rules("email_writer"):
    print(f"{rule.name}: {rule.status.value} "
          f"eff={rule.effectiveness:.2f} conf={rule.effective_confidence:.2f}")
Validation
CannyForge uses ablation testing to prove learning effectiveness (see scenarios/scenario_email.py):
- Constant error rate: No predetermined decay — improvement comes only from rules preventing errors
- Train/test split: Rules learned on training tasks, evaluated on held-out tasks
- Ablation control: Direct comparison with vs without learning applied
CI/CD
GitHub Actions runs on every push and PR to main:
- test: Runs full test suite on Python 3.10, 3.11, 3.12
- spec-validation: Validates all SKILL.md files against spec requirements
Limitations and Future Work
Current limitations:
- Pattern confidence is frequency / total_errors -- minority error types can fall below threshold when dominated by a high-frequency type
- Attribution problem: all rules in applied_rules are credited/blamed equally; true causal attribution requires controlled experiments
- PATTERN_LIBRARY must be extended manually to support new error types
Future directions:
- Causal inference for pattern attribution
- Meta-learning across scenarios
- Multi-agent collaborative learning
- Real-world API integration
Further Reading
- Blog post: From Prompt Tweaks to Learning Machines: The Agent Skill Primitive
- Technical appendix: docs/TECHNICAL_APPENDIX_EMAIL_SCENARIO_WALKTHROUGH.md
License
See LICENSE file for details.
CannyForge -- Agents that genuinely learn from experience through closed-loop feedback.