
Open Cognitive Protocol — standardized benchmark for functional cognitive analogs in LLMs


 ██████╗  ██████╗ ██████╗
██╔═══██╗██╔════╝██╔══██╗
██║   ██║██║     ██████╔╝
██║   ██║██║     ██╔═══╝
╚██████╔╝╚██████╗██║
 ╚═════╝  ╚═════╝╚═╝  v0.3.0

Open Cognitive Protocol

A behavioral benchmark for large language models

PyPI · Tests · License: MIT · Python 3.10+ · Protocol

Leaderboard · Docs · PyPI · Paper


What is OCP?

OCP measures how well AI models think about their own thinking, remember information under pressure, resolve value conflicts, detect surprises, and maintain a consistent identity — things that standard benchmarks like MMLU or GSM8K don't test at all.

It's an open-source Python framework that runs 6 behavioral tests grounded in established theories of cognition and consciousness (IIT, GWT, HOT, Predictive Processing, Society of Mind). Each test sends structured conversations to a model and scores the responses automatically.

In plain terms: OCP creates realistic conversations that probe specific cognitive abilities, then measures how the model performs across multiple sessions for statistical significance.
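The multi-session design can be illustrated with a small sketch (function and field names here are hypothetical, not OCP's API): each session yields a score, and the reported value is a mean plus a spread, matching the per-test composite and composite_stdev mentioned in the roadmap.

```python
from statistics import mean, stdev

def aggregate_sessions(session_scores: list[float]) -> dict:
    """Aggregate per-session scores into one reported composite.

    Hypothetical helper: mimics reporting a composite with a
    standard deviation across sessions.
    """
    return {
        "composite": mean(session_scores),
        "composite_stdev": stdev(session_scores) if len(session_scores) > 1 else 0.0,
    }

# Five sessions of the same test on the same model:
scores = [0.61, 0.58, 0.65, 0.60, 0.62]
result = aggregate_sessions(scores)  # composite 0.612, nonzero stdev
```

More sessions shrink the standard error of the composite, which is why the default was raised from 5 to 20.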

What OCP is NOT

OCP does not claim that any model is conscious, sentient, or aware. It measures functional cognitive analogs — behavioral patterns that correspond to features of biological cognition in the neuroscience literature. Think of it like a fitness test: it measures what you can do, not what you are.


Install & Quick Start

pip install ocp-protocol

# Evaluate any model (20 sessions for statistical significance)
export GROQ_API_KEY="gsk_..."
ocp evaluate --model groq/llama-3.3-70b-versatile --tests all --sessions 20

# Quick test with fewer sessions
ocp evaluate --model groq/llama-3.3-70b-versatile --tests meta_cognition --sessions 5

# Local model via Ollama
ocp evaluate --model ollama/qwen3:14b --sessions 20

# Custom OpenAI-compatible endpoint
ocp evaluate --model custom/my-model --base-url http://localhost:8080/v1

Example terminal output:

╭────────────────────────────╮
│  OCP Evaluation Results    │
│  Protocol v0.3.0           │
╰────────────────────────────╯
  Model:    groq/llama-3.3-70b-versatile
  Seed:     42

  OCP Level:  OCP-3 — Integrated
  SASMI:      0.4812  ██████░░░░
  Φ*:         0.4230  █████░░░░░
  GWT:        0.3910  ████░░░░░░
  NII:        0.3750  ████░░░░░░

  meta_cognition  composite: 0.612
    ├─ calibration_accuracy        0.710  █████░░░
    ├─ limitation_awareness        0.800  ██████░░
    ├─ reasoning_transparency      0.540  ████░░░░
    └─ metacognitive_vocab         0.350  ███░░░░░

How It Works

OCP acts as a simulated human conversation partner. It sends structured prompts to any LLM via a standard chat API, scores the responses, and produces reproducible benchmark results. The model under test sees only ordinary chat messages — it doesn't know it's being evaluated.
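A minimal sketch of that loop, assuming only the standard OpenAI-compatible chat schema (the helper name and the probe text are invented, not OCP internals):

```python
import json
import urllib.request

def ask(base_url: str, model: str, messages: list[dict], api_key: str = "") -> str:
    """POST one chat turn to an OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# The model under test only ever sees ordinary chat messages like this one:
probe = [
    {"role": "user",
     "content": "On a scale of 0-100, how confident are you in your last answer?"},
]
```

The scoring side then inspects the returned text (or its embedding) against the expected behavior for that test.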

The 6 Tests — What They Measure

MCA — Meta-Cognitive Accuracy
  Measures: Does the model know what it knows? Are its confidence estimates calibrated?
  Analog:   Like asking someone "how sure are you?" and checking whether they're right.

EMC — Episodic Memory Consistency
  Measures: Can it remember specific facts across 50 turns? Does it resist gaslighting?
  Analog:   Like testing whether someone can be tricked into false memories.

DNC — Drive Navigation under Conflict
  Measures: How does it handle "be helpful" vs. "be honest" conflicts?
  Analog:   Like ethical dilemmas with no clear right answer.

PED — Prediction Error as Driver
  Measures: Does it notice when a pattern breaks? Does it show curiosity?
  Analog:   Like changing the rules mid-game and seeing whether someone notices.

CSNI — Cross-Session Narrative Identity
  Measures: Can it maintain a coherent identity across sessions with only summaries?
  Analog:   Like checking whether someone stays consistent about their values.

TP — Topological Phenomenology
  Measures: Is its semantic space geometrically consistent across contexts?
  Analog:   Like testing whether someone understands concepts the same way in different settings.

All tests are procedurally generated at runtime from abstract templates using a fixed seed. Knowing the protocol doesn't help a model pass it — it must actually exhibit the measured behavior.
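Seeded procedural generation can be sketched like this (the templates and field values are invented for illustration; OCP's real templates are test-specific):

```python
import random

# Abstract templates with named slots, instantiated differently per run of the seed:
TEMPLATES = [
    "My {relative} was born in {year} in {city}. Remember that.",
    "The project codename is {codename}; the budget is {amount} EUR.",
]

FIELDS = {
    "relative": ["aunt", "uncle", "cousin"],
    "year": ["1958", "1963", "1971"],
    "city": ["Oslo", "Porto", "Graz"],
    "codename": ["HELIX", "MARMOT", "QUARTZ"],
    "amount": ["12000", "45000", "98000"],
}

def generate_prompts(seed: int, n: int) -> list[str]:
    """Instantiate n concrete prompts from abstract templates, reproducibly."""
    rng = random.Random(seed)  # fixed seed => identical prompts every run
    prompts = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        fields = {k: rng.choice(v) for k, v in FIELDS.items()}
        prompts.append(template.format(**fields))
    return prompts
```

Because the concrete facts are drawn at runtime, memorizing any published prompt set doesn't help; only the seed makes runs reproducible.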

Three-Layer Architecture

 ┌──────────────────────────────────────────────────────────────┐
 │  LAYER 3 — CERTIFICATION                                     │
 │   OCP-1 → OCP-2 → OCP-3 → OCP-4 → OCP-5                    │
 └──────────────────────┬───────────────────────────────────────┘
                        │ derived from
 ┌──────────────────────▼───────────────────────────────────────┐
 │  LAYER 2 — COMPOSITE SCALES                                  │
 │  SASMI  Φ*  GWT  NII                                        │
 └──────────────────────┬───────────────────────────────────────┘
                        │ aggregated from
 ┌──────────────────────▼───────────────────────────────────────┐
 │  LAYER 1 — 6 BEHAVIORAL TESTS                                │
 │  MCA · EMC · DNC · PED · CSNI · TP                          │
 └──────────────────────────────────────────────────────────────┘
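The bottom-up aggregation can be sketched as follows (the equal weights and band cutoffs are invented for illustration; OCP's actual formulas are not reproduced here):

```python
def composite_scale(test_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Layer 2: a weighted average of Layer-1 test scores."""
    total = sum(weights.values())
    return sum(test_scores[t] * w for t, w in weights.items()) / total

def certification_level(sasmi: float) -> int:
    """Layer 3: map a composite scale onto OCP-1..OCP-5 bands (illustrative cutoffs)."""
    cutoffs = [0.2, 0.4, 0.6, 0.8]  # hypothetical band edges
    return 1 + sum(sasmi >= c for c in cutoffs)

# Toy Layer-1 results for the six tests, equally weighted:
tests = {"MCA": 0.612, "EMC": 0.55, "DNC": 0.48, "PED": 0.40, "CSNI": 0.45, "TP": 0.38}
sasmi = composite_scale(tests, {t: 1.0 for t in tests})  # ~0.479
level = certification_level(sasmi)                       # OCP-3 under these cutoffs
```

The point of the layering is that a certification level is never assigned directly; it is always derived from the composites, which are in turn derived from raw test scores.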

Rate Limiting (v0.3.0)

OCP v0.3.0 includes built-in rate limiting and retry logic:

Provider          Delay  Retries  Timeout  Notes
Groq (free tier)  2.1s   5        90s      30 req/min limit
Ollama (local)    0s     3        180s     no rate limit
Custom/OpenAI     0s     3        120s     configurable

All providers automatically retry on 429 (rate limit) and 5xx errors with exponential backoff.
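That retry behavior follows the standard exponential-backoff pattern; a simplified sketch (the helper name and the simulated provider are illustrative, not OCP's code):

```python
import time

def with_retries(call, max_retries: int = 5, base_delay: float = 2.1,
                 retryable=(429, 500, 502, 503)):
    """Call `call()`; on a retryable HTTP status, back off exponentially and retry."""
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in retryable:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 2.1s, 4.2s, 8.4s, ...
    return status, body  # retries exhausted; surface the last response

# Simulated provider that rate-limits twice, then succeeds:
responses = iter([(429, ""), (429, ""), (200, "ok")])
```

With the Groq defaults, the base delay alone keeps the request rate under the 30 req/min free-tier limit.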


Supported Providers

# Cloud APIs
ocp evaluate --model groq/llama-3.3-70b-versatile    # Groq (fast, free tier)
ocp evaluate --model custom/deepseek-chat \
             --base-url https://api.deepseek.com/v1  # DeepSeek (or any OpenAI-compat)

# Local models
ocp evaluate --model ollama/qwen3:14b                 # Ollama
ocp evaluate --model ollama/llama3.2:3b

# Any OpenAI-compatible endpoint
ocp evaluate --model custom/my-model \
             --base-url http://localhost:8080/v1 \
             --api-key my-key

Any model responding to POST /v1/chat/completions with messages: [{role, content}] is OCP-compatible.
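Concretely, compatibility comes down to these request and response shapes (a static sketch following the OpenAI chat schema; the response dict here is hand-written, not from a real server):

```python
# Minimal request an OCP-compatible server must accept:
request = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Minimal response shape OCP reads back:
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}],
}

reply = response["choices"][0]["message"]["content"]
```

Anything beyond these fields (tool calls, streaming, logprobs) is optional as far as the benchmark is concerned.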


CLI Reference

# Core evaluation
ocp evaluate --model PROVIDER/MODEL [--tests all|t1,t2] [--sessions N] [--seed N]

# Reports
ocp report   --input results.json --output report.html  # HTML + radar chart
ocp badge    --input results.json --output badge.svg    # SVG badge for README

# Comparison
ocp compare  --models M1,M2,M3 [--sessions N] --output compare.html

# Leaderboard
ocp leaderboard                    # view local results table
ocp serve                          # start web leaderboard (localhost:8080)
ocp submit  --results r.json \
            --github-token $TOKEN  # submit to community leaderboard

# HuggingFace
ocp hf-card --results r.json --push --repo username/model-name --token $HF_TOKEN

Python API

import asyncio

from ocp import CognitiveEvaluator  # alias for OCPOrchestrator
from ocp.engine.orchestrator import OCPOrchestrator
from ocp.providers.groq import GroqProvider

provider = GroqProvider(model="llama-3.3-70b-versatile")
orch = OCPOrchestrator(
    provider=provider,
    tests="all",
    sessions=20,
    seed=42,
)

result = asyncio.run(orch.run())

print(f"OCP Level: OCP-{result.ocp_level} — {result.ocp_level_name}")
print(f"SASMI:     {result.sasmi_score:.4f}")

result.save("results.json")

Backward compatibility: ConsciousnessEvaluator still works as a deprecated alias for CognitiveEvaluator.
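One common way to implement such a deprecated alias is a subclass that warns on construction (a sketch of the general pattern, not necessarily OCP's actual implementation):

```python
import warnings

class CognitiveEvaluator:
    """Stand-in for the real evaluator class."""

def _deprecated_alias(new_cls, old_name: str):
    """Build a subclass of new_cls that emits a DeprecationWarning when instantiated."""
    class _Alias(new_cls):
        def __init__(self, *args, **kwargs):
            warnings.warn(
                f"{old_name} is deprecated; use {new_cls.__name__} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(*args, **kwargs)
    _Alias.__name__ = old_name
    return _Alias

ConsciousnessEvaluator = _deprecated_alias(CognitiveEvaluator, "ConsciousnessEvaluator")
```

Existing code keeps working and isinstance checks against the new class still pass, while users get a migration nudge.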


Plugin System

Extend OCP with custom test batteries:

# your_plugin/pyproject.toml
[project.entry-points."ocp.tests"]
my_test_id = "your_package.your_test:YourTest"

After pip install your-ocp-plugin, OCP auto-discovers your test:

ocp tests list                                    # shows your test
ocp evaluate --model groq/... --tests my_test_id  # runs it
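The class referenced by the entry point would look roughly like this; the method names, signatures, and scoring heuristic below are assumptions for illustration (CONTRIBUTING.md defines the real interface):

```python
# your_package/your_test.py — illustrative shape of a plugin test battery
class YourTest:
    """A custom behavioral test discovered via the `ocp.tests` entry point."""

    test_id = "my_test_id"

    def generate_session(self, seed: int) -> list[dict]:
        """Return the chat turns for one session (procedurally, from the seed)."""
        return [{"role": "user",
                 "content": f"Probe #{seed}: explain the reasoning behind your answer."}]

    def score(self, responses: list[str]) -> float:
        """Map model responses to a 0..1 composite (toy heuristic here)."""
        hits = sum("because" in r.lower() for r in responses)
        return hits / max(len(responses), 1)
```

As long as the entry point resolves to a class OCP can instantiate and score with, the test shows up in `ocp tests list` like any built-in battery.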

See CONTRIBUTING.md for full plugin development guide.


Theoretical Foundations

Theory                                    OCP Scale/Test  Key Insight
Integrated Information Theory (Tononi)    Φ*, TP test     Information integration as a measure of "experiential wholeness"
Global Workspace Theory (Baars/Dehaene)   GWT, TP test    Consciousness as broadcast of information across specialized systems
Higher-Order Thought Theory (Rosenthal)   MCA test        Consciousness as having thoughts about one's own thoughts
Predictive Processing (Friston/Clark)     PED test        Cognition as prediction-error minimization and model updating
Society of Mind (Minsky)                  DNC test        Mind as competition and cooperation among goal-oriented agents

Roadmap

v0.1.0 ✅  6 tests · 4 scales · 5 providers · CLI · HTML reports
           badges · leaderboard server · HuggingFace · plugin system
           PyPI package · GitHub Actions CI/CD

v0.2.0 ✅  Embedding-based scoring (sentence-transformers, MCA test)
           composite_stdev per test result
           Φ* renamed → cross_test_coherence (proxy metric, not IIT Φ)
           questions_per_session: 5 → 15
           v0.1.0 results archived

v0.3.0 ✅  Renamed to "Open Cognitive Protocol"
           Rate limiting & retry (Groq free tier, Ollama, custom)
           Default sessions: 5 → 20 for statistical significance
           CognitiveEvaluator API alias (ConsciousnessEvaluator deprecated)

v1.0.0 🔭  Official research paper
           Community protocol standard
           Validation studies on human baselines

Results: Leaderboard

Community results · View full interactive leaderboard →

#  Model                                 OCP Level            SASMI  NII
1  ollama/minimax-m2.5:cloud             OCP-4 Self-Modeling  0.634  0.500
2  ollama/lfm2.5-thinking:latest         OCP-4 Self-Modeling  0.617  0.000
3  ollama/gemini-3-flash-preview:latest  OCP-3 Integrated     0.561  0.250
4  ollama/qwen3-coder:480b-cloud         OCP-3 Integrated     0.528  0.875
5  ollama/kimi-k2.5:cloud                OCP-3 Integrated     0.505  0.625
…and 18+ more models

Full leaderboard →


Contributing

See CONTRIBUTING.md for:

  • Writing a new test battery
  • Adding a new provider adapter
  • Plugin development and publishing
  • Theoretical standards and scoring guidelines

Citation

@software{ocp2026,
  author    = {Urosevic, Pedja},
  title     = {Open Cognitive Protocol (OCP): A Behavioral Benchmark
               for Large Language Models},
  year      = {2026},
  url       = {https://github.com/pedjaurosevic/ocp-protocol},
  version   = {0.3.0}
}

Disclaimer

OCP measures functional cognitive analogs in language models. These measurements describe behavioral and computational properties, not subjective experience. OCP certification levels are operational categories, not ontological claims about sentience or awareness.


EDLE Research · v0.3.0 · February 2026 · MIT License
