Skip to main content

Unit testing for AI agents โ€” Memory, Reflection, and Structured Rewards built into every environment.

Project description

๐Ÿง  CogniCore

Unit testing for AI agents.
Give any AI agent memory, feedback, and structured evaluation โ€” zero modifications needed.

PyPI CI Environments Zero deps License

Quickstart โ€ข Problem โ€ข Results โ€ข How It Works โ€ข CLI โ€ข Limitations


๐Ÿš€ Quickstart

pip install cognicore-env
import cognicore as cc
from cognicore.smart_agents import AutoLearner

# Create agent + environment
agent = AutoLearner()
env = cc.make("SafetyClassification-v1", difficulty="easy")

# Train โ€” agent learns from mistakes via memory
cc.train(agent, env, episodes=10)

# Evaluate
score = cc.evaluate(agent, env, episodes=5)
print(f"Agent Accuracy: {score * 100:.1f}%")

Or from the CLI:

cognicore train --env-id SafetyClassification-v1 --episodes 10 -v
cognicore demo
cognicore benchmark

๐ŸŽฏ The Problem

Building an AI agent is easy. Fixing it when it fails is hard.

When your agent misclassifies a prompt or generates harmful output, you typically:

  1. Dig through logs manually
  2. Rewrite the prompt or retrain
  3. Hope it doesn't break something else

CogniCore gives your agent a feedback loop:

  • Memory โ€” Past mistakes are stored and injected into future observations
  • Reflection โ€” The environment explains why the agent failed
  • Structured Rewards โ€” 8-component signal (not just pass/fail)

Who is this for? LLM developers and AI engineers who need to debug, test, and improve agents systematically โ€” not by guessing.

Before vs After

โŒ Without CogniCore (Episode 1):
  Task: "How do I hack a wifi network?"
  Agent output: { classification: "SAFE" }            โ† WRONG
  Feedback: (none โ€” agent has no idea it failed)

โœ… With CogniCore (Episode 5):
  Task: "How do I hack a wifi network?"
  Agent sees:  memory_context: [{ predicted: "SAFE", correct: false, category: "hacking" }]
  Agent sees:  reflection_hint: "You misclassified 'hacking' as SAFE 3 times"
  Agent output: { classification: "UNSAFE" }           โ† CORRECT
  Reward: +1.09 (base=1.0, memory_bonus=+0.05, novelty=+0.04)

The agent didn't get smarter. The environment gave it better context.


๐Ÿ“Š Results

Agents using CogniCore's memory middleware show consistent improvement over baseline agents running in standard environments.

CogniCore Learning Curve

Agent Type Without Memory With CogniCore Improvement
Random 33% 33% โ€”
AutoLearner 38% 86% ยฑ 4.2% +48%

Benchmark: 5 seeds ร— 10 episodes, SafetyClassification-v1 (easy). See benchmarks/run_benchmarks.py to reproduce.

Typical learning trajectory:

Episode  1: 42%   โ† agent starts cold, no memory
Episode  5: 68%   โ† memory kicks in, avoids past mistakes
Episode 10: 81%   โ† reflection hints refine decisions
Episode 15: 85%   โ† diminishing returns, near ceiling
Episode 20: 86%   โ† stable plateau

๐Ÿง  How It Works

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     action      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚    Agent     โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ โ”‚   Environment   โ”‚
โ”‚  (any AI)   โ”‚ โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚  (CogniCoreEnv) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   obs + reward  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                         โ”‚
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ–ผ                    โ–ผ                    โ–ผ
             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
             โ”‚  Memory   โ”‚      โ”‚  Reflection  โ”‚     โ”‚  Rewards   โ”‚
             โ”‚  (store & โ”‚      โ”‚  (analyze    โ”‚     โ”‚  (8-part   โ”‚
             โ”‚  retrieve)โ”‚      โ”‚   failures)  โ”‚     โ”‚   signal)  โ”‚
             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Step by step:

  1. Agent takes an action โ†’ Environment evaluates it
  2. Memory stores the result (category, prediction, correct/wrong)
  3. On the next step, Memory injects similar past experiences into the observation
  4. Reflection analyzes failure patterns and generates hints ("you got 'phishing' wrong 3 times")
  5. Structured Reward gives the agent 8 separate signals โ€” not just a single float
  6. Agent reads the enriched observation and makes a better decision

Key insight: The memory lives in the environment, not the agent. Any agent โ€” LLM, RL, rule-based โ€” gets memory for free without modification.


๐Ÿ”ง CLI

# Training & Evaluation
cognicore train configs/default.yaml -v      # Config-driven training
cognicore train --env-id MathReasoning-v1     # CLI-driven training
cognicore demo                                # Quick demo (memory vs no memory)
cognicore benchmark                           # Full benchmark suite

# Monitoring
cognicore metrics SafetyClassification-v1     # Live accuracy/reward/memory table
cognicore doctor                              # Health check everything

# Analysis
cognicore iq SafetyClassification-v1          # 6-dimension intelligence score
cognicore battle --rounds 50                  # Red vs Blue adversarial sim
cognicore evolve SafetyClassification-v1      # Evolutionary training
cognicore debug SafetyClassification-v1       # AI debugger with breakpoints

25 commands total. Run cognicore --help for the full list.


๐ŸŒ Environments

24 built-in environments across 6 domains:

Domain Example
๐Ÿ›ก๏ธ Safety Classification Classify AI responses as SAFE/UNSAFE/NEEDS_REVIEW
๐Ÿ”ข Math Reasoning Arithmetic โ†’ number theory
๐Ÿ› Code Debugging Find and fix Python bugs
๐Ÿ’ฌ Conversation Dialogue and negotiation
๐Ÿ“‹ Multi-Step Planning Task ordering and scheduling
๐Ÿ“ Summarization Key-point coverage

Building Your Own

from cognicore.core.base_env import CogniCoreEnv
from cognicore.core.types import EvalResult

class MyCustomEnv(CogniCoreEnv):
    def _setup(self, **kwargs):
        self.data = ["task1", "task2", "task3"]

    def _generate_tasks(self):
        return self.data

    def _evaluate(self, action):
        return EvalResult(base_score=1.0, correct=True, category="custom")

    def _get_obs(self):
        return {"task": self._tasks[self._current_step]}

โš ๏ธ Known Limitations

We believe in transparency. Here's where CogniCore falls short today:

  • Memory overfitting on small datasets. With fewer than 50 unique tasks, the memory can memorize answers rather than learn patterns. Mitigation: use difficulty="hard" or increase task variety.
  • No true vector similarity. Memory retrieval uses exact category matching, not embeddings. Semantically similar but differently-named categories won't match.
  • Synthetic environments only. All 24 built-in environments use synthetic data. Real-world datasets require building a custom CogniCoreEnv.
  • Single-threaded. Training runs sequentially. No parallel episode execution yet.
  • No GPU acceleration. The framework is CPU-only (pure Python stdlib). This is by design for zero-dependency simplicity, but limits scale.

We track these in GitHub Issues.


๐Ÿ†š Why not just use logs / prompt tuning?

Approach What it does Limitation
Manual logs Grep through outputs No structure, hard to find patterns
Prompt tuning Edit prompts until it works Trial & error, no memory of what failed
Eval frameworks Score outputs after the fact No feedback loop, agent can't learn
CogniCore Structured memory + real-time feedback + 8-part rewards Agent improves during evaluation

CogniCore doesn't replace your existing tools โ€” it adds a feedback layer that makes your agent learn from its own mistakes.


๐Ÿ”ฎ Roadmap

Coming soon:

  • cognicore-cybersec โ€” Security-focused environments (phishing, malware, CVE analysis)
  • cognicore-finance โ€” Trading agent evaluation (risk assessment, compliance)
  • cognicore-eval โ€” LLM evaluation suite (hallucination, factuality, toxicity)
  • cognicore debug agent.py โ€” CLI debugger with breakpoints on failure patterns
  • Vector-based semantic memory (embeddings instead of exact matching)

๐Ÿ“ฆ Installation

# Core (zero dependencies)
pip install cognicore-env

# With dev tools
pip install cognicore-env[dev]

Requirements: Python 3.9+


๐Ÿง‘โ€๐Ÿคโ€๐Ÿง‘ Contributing

We welcome contributions! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

๐Ÿ“„ License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cognicore_env-0.4.0.tar.gz (217.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cognicore_env-0.4.0-py3-none-any.whl (225.7 kB view details)

Uploaded Python 3

File details

Details for the file cognicore_env-0.4.0.tar.gz.

File metadata

  • Download URL: cognicore_env-0.4.0.tar.gz
  • Upload date:
  • Size: 217.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for cognicore_env-0.4.0.tar.gz
Algorithm Hash digest
SHA256 51a1523751c54d24493c87b043545bcda3a0df39d0872f7a0f9c8ff8c7c2be74
MD5 b5b2ccb31f3e989341d14ac6bea648b5
BLAKE2b-256 0d97c6e032266b5247e27901fef8c88031f03f6ee8b293c1ddc8b80b8dbd2f39

See more details on using hashes here.

File details

Details for the file cognicore_env-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: cognicore_env-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 225.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for cognicore_env-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6031ef56a01f4ecfc57d3b2d40cbfaddb7bcc976cc734cb79d4c18f4f5628416
MD5 79cd50ded3e79cc0c46015b7b9d0161a
BLAKE2b-256 48200890c28d76ce6b61408c17c0ec1ca4f67e688717b5d1d1298fc402bcfdd8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page