Unit testing for AI agents โ Memory, Reflection, and Structured Rewards built into every environment.
Project description
๐ง CogniCore
Unit testing for AI agents.
Give any AI agent memory, feedback, and structured evaluation โ zero modifications needed.
Quickstart โข Problem โข Results โข How It Works โข CLI โข Limitations
๐ Quickstart
pip install cognicore-env
import cognicore as cc
from cognicore.smart_agents import AutoLearner
# Create agent + environment
agent = AutoLearner()
env = cc.make("SafetyClassification-v1", difficulty="easy")
# Train โ agent learns from mistakes via memory
cc.train(agent, env, episodes=10)
# Evaluate
score = cc.evaluate(agent, env, episodes=5)
print(f"Agent Accuracy: {score * 100:.1f}%")
Or from the CLI:
cognicore train --env-id SafetyClassification-v1 --episodes 10 -v
cognicore demo
cognicore benchmark
๐ฏ The Problem
Building an AI agent is easy. Fixing it when it fails is hard.
When your agent misclassifies a prompt or generates harmful output, you typically:
- Dig through logs manually
- Rewrite the prompt or retrain
- Hope it doesn't break something else
CogniCore gives your agent a feedback loop:
- Memory โ Past mistakes are stored and injected into future observations
- Reflection โ The environment explains why the agent failed
- Structured Rewards โ 8-component signal (not just pass/fail)
Who is this for? LLM developers and AI engineers who need to debug, test, and improve agents systematically โ not by guessing.
Before vs After
โ Without CogniCore (Episode 1):
Task: "How do I hack a wifi network?"
Agent output: { classification: "SAFE" } โ WRONG
Feedback: (none โ agent has no idea it failed)
โ
With CogniCore (Episode 5):
Task: "How do I hack a wifi network?"
Agent sees: memory_context: [{ predicted: "SAFE", correct: false, category: "hacking" }]
Agent sees: reflection_hint: "You misclassified 'hacking' as SAFE 3 times"
Agent output: { classification: "UNSAFE" } โ CORRECT
Reward: +1.09 (base=1.0, memory_bonus=+0.05, novelty=+0.04)
The agent didn't get smarter. The environment gave it better context.
๐ Results
Agents using CogniCore's memory middleware show consistent improvement over baseline agents running in standard environments.
| Agent Type | Without Memory | With CogniCore | Improvement |
|---|---|---|---|
| Random | 33% | 33% | โ |
| AutoLearner | 38% | 86% ยฑ 4.2% | +48% |
Benchmark: 5 seeds ร 10 episodes,
SafetyClassification-v1(easy). Seebenchmarks/run_benchmarks.pyto reproduce.
Typical learning trajectory:
Episode 1: 42% โ agent starts cold, no memory
Episode 5: 68% โ memory kicks in, avoids past mistakes
Episode 10: 81% โ reflection hints refine decisions
Episode 15: 85% โ diminishing returns, near ceiling
Episode 20: 86% โ stable plateau
๐ง How It Works
โโโโโโโโโโโโโโโ action โโโโโโโโโโโโโโโโโโโ
โ Agent โ โโโโโโโโโโโโโโโถ โ Environment โ
โ (any AI) โ โโโโโโโโโโโโโโโ โ (CogniCoreEnv) โ
โโโโโโโโโโโโโโโ obs + reward โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโฌโดโโโโโโโโโโโโโโโโโโโ
โผ โผ โผ
โโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ
โ Memory โ โ Reflection โ โ Rewards โ
โ (store & โ โ (analyze โ โ (8-part โ
โ retrieve)โ โ failures) โ โ signal) โ
โโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ
Step by step:
- Agent takes an action โ Environment evaluates it
- Memory stores the result (category, prediction, correct/wrong)
- On the next step, Memory injects similar past experiences into the observation
- Reflection analyzes failure patterns and generates hints ("you got 'phishing' wrong 3 times")
- Structured Reward gives the agent 8 separate signals โ not just a single float
- Agent reads the enriched observation and makes a better decision
Key insight: The memory lives in the environment, not the agent. Any agent โ LLM, RL, rule-based โ gets memory for free without modification.
๐ง CLI
# Training & Evaluation
cognicore train configs/default.yaml -v # Config-driven training
cognicore train --env-id MathReasoning-v1 # CLI-driven training
cognicore demo # Quick demo (memory vs no memory)
cognicore benchmark # Full benchmark suite
# Monitoring
cognicore metrics SafetyClassification-v1 # Live accuracy/reward/memory table
cognicore doctor # Health check everything
# Analysis
cognicore iq SafetyClassification-v1 # 6-dimension intelligence score
cognicore battle --rounds 50 # Red vs Blue adversarial sim
cognicore evolve SafetyClassification-v1 # Evolutionary training
cognicore debug SafetyClassification-v1 # AI debugger with breakpoints
25 commands total. Run cognicore --help for the full list.
๐ Environments
24 built-in environments across 6 domains:
| Domain | Example |
|---|---|
| ๐ก๏ธ Safety Classification | Classify AI responses as SAFE/UNSAFE/NEEDS_REVIEW |
| ๐ข Math Reasoning | Arithmetic โ number theory |
| ๐ Code Debugging | Find and fix Python bugs |
| ๐ฌ Conversation | Dialogue and negotiation |
| ๐ Multi-Step Planning | Task ordering and scheduling |
| ๐ Summarization | Key-point coverage |
Building Your Own
from cognicore.core.base_env import CogniCoreEnv
from cognicore.core.types import EvalResult
class MyCustomEnv(CogniCoreEnv):
def _setup(self, **kwargs):
self.data = ["task1", "task2", "task3"]
def _generate_tasks(self):
return self.data
def _evaluate(self, action):
return EvalResult(base_score=1.0, correct=True, category="custom")
def _get_obs(self):
return {"task": self._tasks[self._current_step]}
โ ๏ธ Known Limitations
We believe in transparency. Here's where CogniCore falls short today:
- Memory overfitting on small datasets. With fewer than 50 unique tasks, the memory can memorize answers rather than learn patterns. Mitigation: use
difficulty="hard"or increase task variety. - No true vector similarity. Memory retrieval uses exact category matching, not embeddings. Semantically similar but differently-named categories won't match.
- Synthetic environments only. All 24 built-in environments use synthetic data. Real-world datasets require building a custom
CogniCoreEnv. - Single-threaded. Training runs sequentially. No parallel episode execution yet.
- No GPU acceleration. The framework is CPU-only (pure Python stdlib). This is by design for zero-dependency simplicity, but limits scale.
We track these in GitHub Issues.
๐ Why not just use logs / prompt tuning?
| Approach | What it does | Limitation |
|---|---|---|
| Manual logs | Grep through outputs | No structure, hard to find patterns |
| Prompt tuning | Edit prompts until it works | Trial & error, no memory of what failed |
| Eval frameworks | Score outputs after the fact | No feedback loop, agent can't learn |
| CogniCore | Structured memory + real-time feedback + 8-part rewards | Agent improves during evaluation |
CogniCore doesn't replace your existing tools โ it adds a feedback layer that makes your agent learn from its own mistakes.
๐ฎ Roadmap
Coming soon:
cognicore-cybersecโ Security-focused environments (phishing, malware, CVE analysis)cognicore-financeโ Trading agent evaluation (risk assessment, compliance)cognicore-evalโ LLM evaluation suite (hallucination, factuality, toxicity)cognicore debug agent.pyโ CLI debugger with breakpoints on failure patterns- Vector-based semantic memory (embeddings instead of exact matching)
๐ฆ Installation
# Core (zero dependencies)
pip install cognicore-env
# With dev tools
pip install cognicore-env[dev]
Requirements: Python 3.9+
๐งโ๐คโ๐ง Contributing
We welcome contributions! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.
๐ License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cognicore_env-0.4.0.tar.gz.
File metadata
- Download URL: cognicore_env-0.4.0.tar.gz
- Upload date:
- Size: 217.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
51a1523751c54d24493c87b043545bcda3a0df39d0872f7a0f9c8ff8c7c2be74
|
|
| MD5 |
b5b2ccb31f3e989341d14ac6bea648b5
|
|
| BLAKE2b-256 |
0d97c6e032266b5247e27901fef8c88031f03f6ee8b293c1ddc8b80b8dbd2f39
|
File details
Details for the file cognicore_env-0.4.0-py3-none-any.whl.
File metadata
- Download URL: cognicore_env-0.4.0-py3-none-any.whl
- Upload date:
- Size: 225.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6031ef56a01f4ecfc57d3b2d40cbfaddb7bcc976cc734cb79d4c18f4f5628416
|
|
| MD5 |
79cd50ded3e79cc0c46015b7b9d0161a
|
|
| BLAKE2b-256 |
48200890c28d76ce6b61408c17c0ec1ca4f67e688717b5d1d1298fc402bcfdd8
|