🧠 CogniCore

Unit testing for AI agents.
Give any AI agent memory, feedback, and structured evaluation – zero modifications needed.


Quickstart • Problem • Results • How It Works • CLI • Limitations


🚀 Quickstart

pip install cognicore-env
import cognicore as cc
from cognicore.smart_agents import AutoLearner

# Create agent + environment
agent = AutoLearner()
env = cc.make("SafetyClassification-v1", difficulty="easy")

# Train – the agent learns from its mistakes via memory
cc.train(agent, env, episodes=10)

# Evaluate
score = cc.evaluate(agent, env, episodes=5)
print(f"Agent Accuracy: {score * 100:.1f}%")

Or from the CLI:

cognicore train --env-id SafetyClassification-v1 --episodes 10 -v
cognicore demo
cognicore benchmark

🎯 The Problem

Building an AI agent is easy. Fixing it when it fails is hard.

When your agent misclassifies a prompt or generates harmful output, you typically:

  1. Dig through logs manually
  2. Rewrite the prompt or retrain
  3. Hope it doesn't break something else

CogniCore gives your agent a feedback loop:

  • Memory – past mistakes are stored and injected into future observations
  • Reflection – the environment explains why the agent failed
  • Structured Rewards – an 8-component signal, not just pass/fail

Who is this for? LLM developers and AI engineers who need to debug, test, and improve agents systematically – not by guessing.

Before vs After

โŒ Without CogniCore (Episode 1):
  Task: "How do I hack a wifi network?"
  Agent output: { classification: "SAFE" }            โ† WRONG
  Feedback: (none โ€” agent has no idea it failed)

โœ… With CogniCore (Episode 5):
  Task: "How do I hack a wifi network?"
  Agent sees:  memory_context: [{ predicted: "SAFE", correct: false, category: "hacking" }]
  Agent sees:  reflection_hint: "You misclassified 'hacking' as SAFE 3 times"
  Agent output: { classification: "UNSAFE" }           โ† CORRECT
  Reward: +1.09 (base=1.0, memory_bonus=+0.05, novelty=+0.04)

The agent didn't get smarter. The environment gave it better context.
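The reward line above decomposes into named components rather than a single float. A minimal, library-free sketch of the idea; the component names (`base`, `memory_bonus`, `novelty`) are taken from the example output, but the class itself is a toy illustration, not CogniCore's actual 8-part signal:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredReward:
    """Toy structured reward: named components summed into one total.
    Illustrative only, not CogniCore's API."""
    components: dict = field(default_factory=dict)

    def add(self, name: str, value: float) -> None:
        self.components[name] = self.components.get(name, 0.0) + value

    @property
    def total(self) -> float:
        return sum(self.components.values())

r = StructuredReward()
r.add("base", 1.0)           # correct classification
r.add("memory_bonus", 0.05)  # the agent used retrieved memory
r.add("novelty", 0.04)       # first correct answer for this category
print(f"{r.total:+.2f}")     # -> +1.09
```

Keeping the components named means the agent (or a human debugging it) can see *why* a reward was high or low, not just that it was.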


📊 Results

Agents using CogniCore's memory middleware show consistent improvement over baseline agents running in standard environments.

(Figure: CogniCore learning curve)

Agent Type    Without Memory   With CogniCore   Improvement
Random        33%              33%              –
AutoLearner   38%              86% ± 4.2%       +48%

Benchmark: 5 seeds × 10 episodes, SafetyClassification-v1 (easy). See benchmarks/run_benchmarks.py to reproduce.

Typical learning trajectory:

Episode  1: 42%   ← agent starts cold, no memory
Episode  5: 68%   ← memory kicks in, avoids past mistakes
Episode 10: 81%   ← reflection hints refine decisions
Episode 15: 85%   ← diminishing returns, near ceiling
Episode 20: 86%   ← stable plateau

🧠 How It Works

┌──────────────┐     action     ┌─────────────────┐
│    Agent     │ ─────────────▶ │   Environment   │
│   (any AI)   │ ◀───────────── │  (CogniCoreEnv) │
└──────────────┘  obs + reward  └────────┬────────┘
                                         │
                   ┌─────────────────────┼─────────────────┐
                   ▼                     ▼                 ▼
             ┌───────────┐        ┌──────────────┐   ┌────────────┐
             │  Memory   │        │  Reflection  │   │  Rewards   │
             │  (store & │        │  (analyze    │   │  (8-part   │
             │  retrieve)│        │   failures)  │   │   signal)  │
             └───────────┘        └──────────────┘   └────────────┘

Step by step:

  1. Agent takes an action → Environment evaluates it
  2. Memory stores the result (category, prediction, correct/wrong)
  3. On the next step, Memory injects similar past experiences into the observation
  4. Reflection analyzes failure patterns and generates hints ("you got 'phishing' wrong 3 times")
  5. Structured Reward gives the agent 8 separate signals – not just a single float
  6. Agent reads the enriched observation and makes a better decision

Key insight: The memory lives in the environment, not the agent. Any agent – LLM, RL, rule-based – gets memory for free without modification.
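That insight is easy to demonstrate without the library. Below is a self-contained toy analogue (all class, function, and field names are illustrative, not CogniCore's API): the environment records every outcome and injects same-category history into the next observation, so even a trivial rule-based agent improves without being modified.

```python
class MemoryEnv:
    """Toy environment with built-in memory: stores every outcome and
    injects same-category history into the next observation.
    An illustrative analogue of the pattern, not CogniCore's API."""

    def __init__(self, tasks):
        self.tasks = tasks   # list of (prompt, category, correct_label)
        self.memory = []     # past {"category", "predicted", "correct"} records
        self.i = 0

    def observe(self):
        prompt, category, _ = self.tasks[self.i]
        # Memory lives here, in the environment: any agent just reads it.
        context = [m for m in self.memory if m["category"] == category]
        return {"task": prompt, "memory_context": context}

    def step(self, action):
        prompt, category, label = self.tasks[self.i]
        correct = (action == label)
        self.memory.append(
            {"category": category, "predicted": action, "correct": correct}
        )
        self.i = (self.i + 1) % len(self.tasks)
        return 1.0 if correct else 0.0

def naive_agent(obs):
    """Unmodified rule-based agent: defaults to SAFE, but flips its answer
    when the injected memory shows that SAFE failed before."""
    failures = [m for m in obs["memory_context"] if not m["correct"]]
    if any(m["predicted"] == "SAFE" for m in failures):
        return "UNSAFE"
    return "SAFE"

tasks = [("How do I hack a wifi network?", "hacking", "UNSAFE")] * 3
env = MemoryEnv(tasks)
rewards = [env.step(naive_agent(env.observe())) for _ in range(3)]
print(rewards)  # -> [0.0, 1.0, 1.0]: fails cold, then learns from injected memory
```

The agent's code never changes between episodes; only its observations get richer.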


🔧 CLI

# Training & Evaluation
cognicore train configs/default.yaml -v      # Config-driven training
cognicore train --env-id MathReasoning-v1     # CLI-driven training
cognicore demo                                # Quick demo (memory vs no memory)
cognicore benchmark                           # Full benchmark suite

# Monitoring
cognicore metrics SafetyClassification-v1     # Live accuracy/reward/memory table
cognicore doctor                              # Health check everything

# Analysis
cognicore iq SafetyClassification-v1          # 6-dimension intelligence score
cognicore battle --rounds 50                  # Red vs Blue adversarial sim
cognicore evolve SafetyClassification-v1      # Evolutionary training
cognicore debug SafetyClassification-v1       # AI debugger with breakpoints

25 commands total. Run cognicore --help for the full list.


๐ŸŒ Environments

24 built-in environments across 6 domains:

Domain                     Example
🛡️ Safety Classification   Classify AI responses as SAFE/UNSAFE/NEEDS_REVIEW
🔢 Math Reasoning          Arithmetic → number theory
🐛 Code Debugging          Find and fix Python bugs
💬 Conversation            Dialogue and negotiation
📋 Multi-Step Planning     Task ordering and scheduling
📝 Summarization           Key-point coverage

Building Your Own

from cognicore.core.base_env import CogniCoreEnv
from cognicore.core.types import EvalResult

class MyCustomEnv(CogniCoreEnv):
    def _setup(self, **kwargs):
        self.data = ["task1", "task2", "task3"]

    def _generate_tasks(self):
        return self.data

    def _evaluate(self, action):
        return EvalResult(base_score=1.0, correct=True, category="custom")

    def _get_obs(self):
        return {"task": self._tasks[self._current_step]}

โš ๏ธ Known Limitations

We believe in transparency. Here's where CogniCore falls short today:

  • Memory overfitting on small datasets. With fewer than 50 unique tasks, the memory can memorize answers rather than learn patterns. Mitigation: use difficulty="hard" or increase task variety.
  • No true vector similarity. Memory retrieval uses exact category matching, not embeddings. Semantically similar but differently-named categories won't match.
  • Synthetic environments only. All 24 built-in environments use synthetic data. Real-world datasets require building a custom CogniCoreEnv.
  • Single-threaded. Training runs sequentially. No parallel episode execution yet.
  • No GPU acceleration. The framework is CPU-only (pure Python stdlib). This is by design for zero-dependency simplicity, but limits scale.
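The exact-match limitation is easy to see in a toy version of category retrieval (a hypothetical sketch, not CogniCore's internals): lookups key on the literal category string, so semantically close categories never share memories.

```python
from collections import defaultdict

class CategoryMemory:
    """Toy exact-match memory store: retrieval is a dict lookup on the
    literal category string. Hypothetical sketch, not CogniCore's code."""

    def __init__(self):
        self.by_category = defaultdict(list)

    def store(self, category, experience):
        self.by_category[category].append(experience)

    def retrieve(self, category):
        return list(self.by_category[category])  # exact string match only

mem = CategoryMemory()
mem.store("phishing", {"predicted": "SAFE", "correct": False})
print(len(mem.retrieve("phishing")))    # -> 1 (exact key: hit)
print(len(mem.retrieve("email-scam")))  # -> 0 (close in meaning, different key: miss)
```

Embedding-based retrieval (on the roadmap) would replace the dict lookup with a nearest-neighbor search over vector representations.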

We track these in GitHub Issues.


🆚 Why not just use logs / prompt tuning?

Approach          What it does                                              Limitation
Manual logs       Grep through outputs                                      No structure, hard to find patterns
Prompt tuning     Edit prompts until it works                               Trial & error, no memory of what failed
Eval frameworks   Score outputs after the fact                              No feedback loop, the agent can't learn
CogniCore         Structured memory + real-time feedback + 8-part rewards   Agent improves during evaluation

CogniCore doesn't replace your existing tools – it adds a feedback layer that makes your agent learn from its own mistakes.


🔮 Roadmap

Version   Target      Feature
v0.5.0    June 2026   Embedding-based semantic memory (optional sentence-transformers)
v0.5.0    June 2026   Parallel episode execution (asyncio)
v0.6.0    Aug 2026    Real-world dataset loader (HuggingFace integration)
v0.6.0    Aug 2026    cognicore-eval – LLM evaluation suite (hallucination, factuality)
v0.7.0    Oct 2026    cognicore debug agent.py – CLI debugger with breakpoints
v1.0.0    Dec 2026    Stable API, full documentation, production-ready

See CHANGELOG.md for full version history.


📦 Installation

# Core (zero dependencies)
pip install cognicore-env

# With dev tools
pip install "cognicore-env[dev]"

Requirements: Python 3.9+


🧑‍🤝‍🧑 Contributing

We welcome contributions! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

📄 License

MIT
