Surface-Oracle-Ratchet MCP server for autonomous code optimization
sorkit — Surface-Oracle-Ratchet Toolkit
An MCP server that enables AI agents to autonomously iterate on code while a human-authored test suite acts as the objective function. The agent can only edit designated files ("surfaces"), is evaluated by frozen tests ("oracles"), and advances only when it improves ("ratchet").
pip install sorkit
Table of Contents
- The Pattern
- How It Works
- Installation
- Quick Start
- Configuration Reference
- MCP Tools Reference
- Writing Tests for the Oracle
- Stopping Conditions
- Notifications
- Running the Example
- Programmatic Usage
- Requirements
The Pattern
Human writes tests + golden data → frozen (agent can't touch)
Agent edits code → surface (agent's playground)
Tests run automatically → oracle (pass/fail + optional score)
Score improves? → git commit (ratchet forward)
Score doesn't? → git reset (try again)
Stopping condition? → notify human, stop
Layers are worked bottom-up. Each completed layer freezes before the next starts, so the agent can never regress previous work.
How It Works
sorkit implements autonomous code optimization in three interlocking parts:
Surface
The mutation surface is the set of files the agent is allowed to edit. Everything else is frozen. This constrains the agent's search space to only the code you want optimized.
Oracle
The oracle is your test suite. It comes in two flavors:
- Pass/fail: Contract tests that must all pass (e.g., "the API returns valid JSON"). The agent succeeds when all tests pass.
- Scored: Tests that print numeric metrics to stdout (e.g., ACCURACY: 0.8500). sorkit extracts these, computes a weighted composite score, and uses it to decide whether the agent improved.
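The extraction step for scored layers can be pictured as a small regular-expression scan over the test runner's stdout. The sketch below is illustrative only: `extract_metrics` is a hypothetical helper, not sorkit's actual parser, and it assumes each metric is printed as `NAME: <float>`.

```python
import re


def extract_metrics(stdout: str, names: list[str]) -> dict[str, float]:
    """Pull `NAME: <float>` pairs out of captured test output.

    Hypothetical helper for illustration; sorkit's real parser may differ.
    """
    found: dict[str, float] = {}
    for name in names:
        # \b keeps "RECALL" from matching inside "POS_RECALL"; the match is
        # not anchored to line start because a test runner may print its own
        # text on the same line before the metric.
        pattern = rf"\b{re.escape(name)}:\s*(-?\d+(?:\.\d+)?)"
        match = re.search(pattern, stdout)
        if match:
            found[name] = float(match.group(1))
    return found


sample_output = "collected 1 item\nACCURACY: 0.8500\nPOS_RECALL: 0.9000\n1 passed"
metrics = extract_metrics(sample_output, ["ACCURACY", "POS_RECALL", "MISSING"])
print(metrics)
```

Metrics that never appear in the output are simply absent from the result, which lets a caller decide whether a missing metric should count as a failure.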
Ratchet
The ratchet ensures monotonic progress:
- If the agent's change improves the score → git commit (lock in the gain)
- If the agent's change doesn't improve the score → git reset (revert and try again)
- If a stopping condition is hit → notify the human and stop
This makes the process safe: the agent can experiment freely because bad changes are automatically reverted.
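As a rough mental model, the ratchet's keep-or-revert rule can be sketched in a few lines. This is a simplified simulation, not sorkit's code: `ratchet_decision` and `run_ratchet` are hypothetical names, the git operations are replaced by comments, and "improves" is assumed to mean strictly greater than the best score so far.

```python
def ratchet_decision(score: float, best: float) -> str:
    """Decide one iteration. Assumes a strict improvement is required."""
    return "KEEP" if score > best else "DISCARD"


def run_ratchet(scores: list[float], baseline: float) -> list[tuple[float, str, float]]:
    """Replay a sequence of candidate scores against a running best."""
    best = baseline
    history = []
    for score in scores:
        decision = ratchet_decision(score, best)
        if decision == "KEEP":
            best = score  # stands in for `git commit`
        # a DISCARD stands in for `git reset`: the working tree is restored
        history.append((score, decision, best))
    return history


history = run_ratchet([0.70, 0.78, 0.78, 0.81], baseline=0.72)
for score, decision, best in history:
    print(f"{decision:<7} score={score:.4f} best={best:.4f}")
```

Because the running best only changes on a KEEP, it can never decrease, which is the monotonic-progress property described above.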
Layers
Projects are divided into layers, worked bottom-up:
- Complete Layer 1 (e.g., core algorithm) → it freezes
- Complete Layer 2 (e.g., API wrapper) → it freezes
- And so on...
Each layer has its own surface, oracle, and stopping criteria. Completed layers become read-only, so the agent can never break previous work.
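The interaction between surfaces, frozen paths, and the active layer can be illustrated with a small path check. The helper names `is_under` and `can_edit` are hypothetical, and the rule shown (editable only if inside the active layer's surface and not under an always-frozen path) is an assumption drawn from the description above rather than sorkit's actual implementation.

```python
from pathlib import PurePosixPath


def is_under(path: str, prefix: str) -> bool:
    """True if `path` equals `prefix` or lies inside the directory `prefix`."""
    p = PurePosixPath(path)
    q = PurePosixPath(prefix)
    return p == q or q in p.parents


def can_edit(path: str, layers: list[dict], active_index: int,
             always_frozen: list[str]) -> bool:
    """Editable only when inside the active surface and not always frozen."""
    if any(is_under(path, frozen) for frozen in always_frozen):
        return False
    surface = layers[active_index]["surface"]
    return any(is_under(path, entry) for entry in surface)


layers = [
    {"name": "indexer", "surface": ["src/indexer.py", "src/tokenizer.py"]},
    {"name": "api", "surface": ["src/api/routes.py"]},
]
always_frozen = ["fixtures/", "tests/", "sor.yaml"]

print(can_edit("src/indexer.py", layers, 0, always_frozen))     # layer 1 active
print(can_edit("src/indexer.py", layers, 1, always_frozen))     # layer 1 completed
print(can_edit("tests/test_api.py", layers, 1, always_frozen))  # always frozen
```

Under this model a completed layer needs no separate "frozen" flag: once the active index moves on, its files are no longer part of any editable surface.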
Installation
pip install sorkit
This installs the sorkit command-line tool and the MCP server.
Development Install
git clone https://github.com/2lines/sorkit.git
cd sorkit
pip install -e ".[dev]"
Quick Start
Step 1: Add sorkit to your MCP client
For Claude Code, add to your project's .claude/settings.json:
{
  "mcpServers": {
    "sorkit": {
      "command": "sorkit"
    }
  }
}
For Claude Desktop, add to claude_desktop_config.json:
{
  "mcpServers": {
    "sorkit": {
      "command": "sorkit"
    }
  }
}
Step 2: Initialize your project
Ask your agent to call sor_init with your project directory. It returns a config template. Fill in your layers, surfaces, and tests, then call sor_init again with the completed config.
Or create sor.yaml manually (see Configuration Reference).
Step 3: Write your tests
Tests are the oracle — the source of truth. The agent can never modify them.
For pass/fail layers, write standard tests:
def test_api_returns_dict():
    result = my_api.call("hello")
    assert isinstance(result, dict)
For scored layers, print metrics to stdout:
def test_golden_set_accuracy(golden_set):
    correct = sum(1 for item in golden_set if predict(item) == item["label"])
    accuracy = correct / len(golden_set)
    print(f"ACCURACY: {accuracy:.4f}")
    assert accuracy > 0.1  # floor assertion to catch catastrophic regression
The metric name in print() must match the extract field in your sor.yaml.
Step 4: Initialize git
sorkit uses git for the ratchet mechanism. Your project must be a git repository:
git init
git add -A
git commit -m "initial state"
Step 5: Generate artifacts
from pathlib import Path
from sorkit.config import load_config
from sorkit.init import generate_claude_md, generate_experiment_loop_skill, initialize_results_tsv
config = load_config(Path('.'))
generate_claude_md(config, Path('.'))
generate_experiment_loop_skill(config, Path('.'))
initialize_results_tsv(Path('.'))
This creates:
- CLAUDE.md: tells the agent what files are frozen, what it can edit, and what thresholds apply
- .claude/skills/experiment-loop.md: the experiment protocol the agent follows
- results.tsv: experiment history tracker
Step 6: Let the agent run
The agent follows the experiment loop:
- Read CLAUDE.md and test files to understand the problem
- Form a hypothesis ("adding negation handling should improve accuracy")
- Edit only surface files
- Call sor_ratchet with the layer name and hypothesis
- Parse the output: KEEP, DISCARD, or STOP
- Repeat until a stopping condition is hit
Configuration Reference
sor.yaml
# Project identity
project_name: "My Search Engine"

# Paths the agent must NEVER modify
always_frozen:
  - "fixtures/"
  - "tests/"
  - "sor.yaml"
  - "CLAUDE.md"
  - ".claude/"
  - "results.tsv"

# Global defaults (layers can override any of these)
defaults:
  test_runner: "python -m pytest"   # command to run tests
  max_attempts: 20                  # hard ceiling per layer
  consecutive_failure_limit: 5      # stop after N consecutive crashes
  plateau_limit: 5                  # stop after N consecutive non-improvements
  diminishing_threshold: 0.005      # min delta over window to continue
  diminishing_window: 5             # how many recent keeps to check

# Layers: worked bottom-up, each freezes when complete
layers:
  # Scored layer example
  - name: "indexer"
    surface:
      - "src/indexer.py"
      - "src/tokenizer.py"
    oracle:
      contracts: "tests/test_indexer_contract.py"    # must pass before scoring
      scored: true
      scored_tests: "tests/test_indexer_quality.py"  # prints metrics to stdout
      metrics:
        - name: "recall"
          extract: "RECALL_SCORE"   # matches "RECALL_SCORE: 0.8500" in stdout
          weight: 0.6               # contribution to composite score
        - name: "precision"
          extract: "PRECISION_SCORE"
          weight: 0.4
    thresholds:
      target_score: 0.85            # stop when composite >= this
      max_attempts: 25              # override default for this layer

  # Pass/fail layer example
  - name: "api"
    surface:
      - "src/api/routes.py"
    oracle:
      contracts: "tests/test_api_*.py"
      scored: false
    thresholds:
      max_attempts: 10
Configuration Fields
| Field | Required | Description |
|---|---|---|
| project_name | Yes | Human-readable project name |
| always_frozen | Yes | Paths the agent must never modify |
| defaults.test_runner | No | Test command (default: python -m pytest) |
| defaults.max_attempts | No | Max iterations per layer (default: 20) |
| defaults.consecutive_failure_limit | No | Stop after N consecutive crashes (default: 5) |
| defaults.plateau_limit | No | Stop after N consecutive non-improvements (default: 5) |
| defaults.diminishing_threshold | No | Min score delta over the window (default: 0.005) |
| defaults.diminishing_window | No | Number of recent keeps to check (default: 5) |
Layer Fields
| Field | Required | Description |
|---|---|---|
| name | Yes | Unique layer name |
| surface | Yes | List of mutable file paths |
| oracle.contracts | Yes | Test file/glob for contract tests |
| oracle.scored | Yes | true for metric-based, false for pass/fail |
| oracle.scored_tests | If scored | Test file that prints metrics |
| oracle.metrics | If scored | List of {name, extract, weight} |
| thresholds.target_score | No | Stop when composite >= this |
| thresholds.max_attempts | No | Override default max attempts |
Metric Weights
Metric weights must sum to 1.0. The composite score is:
composite = sum(metric_value * metric_weight for each metric)
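A short worked example of the formula, including the weight check. The function name `composite_score` is hypothetical and the metric values are made up for illustration; only the 0.6/0.4 weights come from the sample configuration above.

```python
import math


def composite_score(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of metric values; rejects weights that do not sum to 1.0."""
    total_weight = sum(weights.values())
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError(f"metric weights sum to {total_weight}, expected 1.0")
    return sum(values[name] * weight for name, weight in weights.items())


weights = {"recall": 0.6, "precision": 0.4}
values = {"recall": 0.80, "precision": 0.60}  # illustrative values

score = composite_score(values, weights)
# 0.80 * 0.6 + 0.60 * 0.4 = 0.48 + 0.24 = 0.72
print(f"COMPOSITE: {score:.4f}")
```

Comparing the weight total with a tolerance rather than `==` avoids rejecting valid configurations such as three weights of 0.3, 0.3, and 0.4 because of floating-point rounding.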
MCP Tools Reference
sor_init
Initialize SOR in a project directory.
sor_init(project_dir="/path/to/project")
→ Returns a config template (JSON)
sor_init(project_dir="/path/to/project", config={...filled template...})
→ Saves sor.yaml and generates CLAUDE.md, experiment-loop skill, results.tsv
sor_add_layer
Add a new layer to an existing config.
sor_add_layer(
    project_dir="/path/to/project",
    name="api",
    surface=["src/api.py"],
    contracts="tests/test_api_contract.py",
    scored=false
)
sor_run_oracle
Run the oracle without git side effects. Useful for checking current state.
sor_run_oracle(layer="indexer", project_dir=".")
→ "COMPOSITE: 0.7300 (indexer)\n\nMetrics:\n recall: 0.85\n precision: 0.55"
sor_ratchet
The core tool. One iteration: oracle → compare → commit/reset → check stops.
sor_ratchet(layer="indexer", hypothesis="add TF-IDF weighting", project_dir=".")
Returns one of:
- KEEP score=0.7800 prev=0.7200 (improvement, committed)
- DISCARD score=0.7000 best=0.7800 (no improvement, reverted)
- DISCARD FAIL (tests failed, reverted)
- STOP:TARGET_MET score=0.8600 attempts=12 kept=7 (done!)
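For a client that wants to branch on these return strings, a tolerant parser might look like the sketch below. `parse_ratchet_message` is a hypothetical helper written against the example messages shown in this README; the actual message grammar may include fields not covered here.

```python
def parse_ratchet_message(message: str) -> dict:
    """Split a sor_ratchet result line into outcome, reason, and key=value fields.

    Values are kept as strings because some are not numeric (e.g. score=PASS).
    """
    tokens = message.split()
    if not tokens:
        raise ValueError("empty ratchet message")
    head, rest = tokens[0], tokens[1:]
    # "STOP:TARGET_MET" -> outcome "STOP", reason "TARGET_MET"
    outcome, _, reason = head.partition(":")
    fields = dict(token.split("=", 1) for token in rest if "=" in token)
    flags = [token for token in rest if "=" not in token]  # e.g. "FAIL"
    return {"outcome": outcome, "reason": reason or None,
            "fields": fields, "flags": flags}


for line in [
    "KEEP score=0.7800 prev=0.7200",
    "DISCARD FAIL",
    "STOP:TARGET_MET score=0.8600 attempts=12 kept=7",
]:
    print(parse_ratchet_message(line))
```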
sor_status
Progress dashboard for one or all layers.
sor_status(project_dir=".") # all layers
sor_status(layer="indexer", project_dir=".") # one layer
Shows: attempt count, best score, keeps, last outcome, proximity warnings.
sor_results
Query experiment history from results.tsv.
sor_results(layer="indexer", last_n=10, project_dir=".")
sor_audit
Comprehensive audit report: summary, score progression, convergence analysis, improvement rate, estimated iterations to target, hypothesis breakdown.
sor_audit(layer="indexer", project_dir=".")
sor_score_history
Score progression with running best for each attempt.
sor_score_history(layer="indexer", project_dir=".")
sor_hypotheses
Which hypotheses worked and which didn't, grouped with keep rates.
sor_hypotheses(layer="indexer", project_dir=".")
Writing Tests for the Oracle
Contract Tests (Required for All Layers)
Contract tests enforce basic correctness. They run first — if any fail, scored tests are skipped.
"""Contract tests — FROZEN, do not modify."""
from src.my_module import my_function
class TestContract:
    def test_returns_correct_type(self):
        result = my_function("input")
        assert isinstance(result, dict)

    def test_handles_empty_input(self):
        result = my_function("")
        assert result is not None

    def test_handles_edge_cases(self):
        result = my_function("!@#$%")
        assert isinstance(result, dict)
Scored Tests (For Scored Layers)
Scored tests print metrics to stdout. The oracle extracts these using the extract pattern from your config.
"""Scored tests — FROZEN, do not modify."""
import json
from src.classifier import classify
def test_golden_set_accuracy(golden_set):
    correct = 0
    total = len(golden_set)
    # Per-class tracking
    class_correct = {"positive": 0, "negative": 0, "neutral": 0}
    class_total = {"positive": 0, "negative": 0, "neutral": 0}
    for item in golden_set:
        predicted = classify(item["text"])
        expected = item["label"]
        class_total[expected] += 1
        if predicted == expected:
            correct += 1
            class_correct[expected] += 1
    accuracy = correct / total
    pos_recall = class_correct["positive"] / class_total["positive"]
    neg_recall = class_correct["negative"] / class_total["negative"]
    neu_recall = class_correct["neutral"] / class_total["neutral"]
    # These lines are extracted by the oracle
    print(f"ACCURACY: {accuracy:.4f}")
    print(f"POS_RECALL: {pos_recall:.4f}")
    print(f"NEG_RECALL: {neg_recall:.4f}")
    print(f"NEU_RECALL: {neu_recall:.4f}")
    # Print misclassifications so the agent can learn
    for item in golden_set:
        predicted = classify(item["text"])
        if predicted != item["label"]:
            print(f" [{item['label']}→{predicted}] \"{item['text'][:60]}\"")
    # Floor assertion: prevent catastrophic regression
    assert accuracy > 0.2, f"Accuracy too low: {accuracy:.1%}"
Key rules for scored tests:
- Print metric lines as METRIC_NAME: <float>; the name must match extract in sor.yaml
- Include a floor assertion to catch catastrophic regressions
- Print misclassifications/errors so the agent can learn from mistakes
- The test file is frozen, so the agent never modifies it
Golden Sets
For scored layers, create a golden set of labeled examples:
[
{"text": "I love this product!", "label": "positive"},
{"text": "Terrible quality", "label": "negative"},
{"text": "It arrived on Tuesday", "label": "neutral"}
]
Use a conftest.py fixture to load it:
# tests/conftest.py — FROZEN
import json
from pathlib import Path
import pytest
@pytest.fixture
def golden_set():
    path = Path(__file__).parent.parent / "fixtures" / "golden_set.json"
    with open(path) as f:
        return json.load(f)
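Because every score depends on the golden set, it can be worth checking the file's shape before freezing it. The following is an optional sketch, not part of sorkit: `validate_golden_set` and `ALLOWED_LABELS` are hypothetical names, and the label set assumes the three-class sentiment example used in this README.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label set


def validate_golden_set(items: list) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object")
            continue
        text = item.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {index}: missing or empty 'text'")
        if item.get("label") not in ALLOWED_LABELS:
            problems.append(f"item {index}: unknown label {item.get('label')!r}")
    return problems


raw = """[
  {"text": "I love this product!", "label": "positive"},
  {"text": "Terrible quality", "label": "negativ"},
  {"text": "", "label": "neutral"}
]"""
problems = validate_golden_set(json.loads(raw))
for problem in problems:
    print(problem)
```

A mislabeled or misspelled entry would otherwise surface only as a KeyError or a silently unreachable score inside the scored test.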
Stopping Conditions
The ratchet checks 7 stopping conditions after each iteration:
| Condition | Trigger | Applies To |
|---|---|---|
| TARGET_MET | Composite score >= target_score | Scored layers |
| ALL_PASS | All contract tests pass | Pass/fail layers |
| PLATEAU | N consecutive non-improvements | Scored layers |
| DIMINISHING | Score delta below threshold over window | Scored layers |
| MAX_ATTEMPTS | Hit the max_attempts ceiling | All layers |
| CONSECUTIVE_FAILURES | N consecutive test crashes | All layers |
| ORACLE_ERROR | Oracle infrastructure is broken | All layers |
When a stopping condition fires, sorkit sends notifications and returns a STOP:{reason} message.
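To make the score-based conditions concrete, here is a simplified sketch of how four of them could be evaluated from an experiment history. It is not sorkit's implementation: `stop_reason` is a hypothetical function, the order of the checks is an assumption, and DIMINISHING is interpreted here as "the gain across the last `diminishing_window` kept scores is below `diminishing_threshold`".

```python
from typing import Optional, Sequence


def stop_reason(
    outcomes: Sequence[str],        # "KEEP" / "DISCARD" per attempt, oldest first
    kept_scores: Sequence[float],   # composite score of each KEEP, oldest first
    *,
    target_score: Optional[float] = None,
    max_attempts: int = 20,
    plateau_limit: int = 5,
    diminishing_threshold: float = 0.005,
    diminishing_window: int = 5,
) -> Optional[str]:
    """Return the name of the first stopping condition that fires, else None."""
    # The ratchet only keeps improvements, so the last kept score is the best.
    if target_score is not None and kept_scores and kept_scores[-1] >= target_score:
        return "TARGET_MET"

    # Count the run of non-improvements at the end of the history.
    trailing_discards = 0
    for outcome in reversed(outcomes):
        if outcome != "DISCARD":
            break
        trailing_discards += 1
    if trailing_discards >= plateau_limit:
        return "PLATEAU"

    # Total gain across the most recent window of kept scores.
    if len(kept_scores) >= diminishing_window:
        window = kept_scores[-diminishing_window:]
        if window[-1] - window[0] < diminishing_threshold:
            return "DIMINISHING"

    if len(outcomes) >= max_attempts:
        return "MAX_ATTEMPTS"
    return None


print(stop_reason(["KEEP"] + ["DISCARD"] * 5, [0.50]))
print(stop_reason(["KEEP"] * 5, [0.700, 0.701, 0.702, 0.703, 0.704]))
print(stop_reason(["KEEP", "DISCARD"], [0.50]))
```

The default arguments mirror the defaults listed in the configuration reference; CONSECUTIVE_FAILURES, ALL_PASS, and ORACLE_ERROR depend on test-runner state and are left out of this sketch.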
Notifications
sorkit notifies you when a layer completes or another stopping condition fires. Set environment variables to enable optional channels:
# Slack webhook
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."
# Email (requires sendmail configured)
export NOTIFY_EMAIL="you@example.com"
Always-on channels:
- File log: reports/notifications.log
- Desktop notification: osascript on macOS, notify-send on Linux
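The channel selection described above amounts to a simple environment check. The sketch below only illustrates that logic: `enabled_channels` and the channel labels are hypothetical, and nothing here sends a notification or reflects how `sorkit.notify` is actually structured.

```python
import os
from typing import Mapping


def enabled_channels(env: Mapping[str, str] = os.environ) -> list[str]:
    """List the notification channels that would be active for `env`."""
    channels = ["file_log", "desktop"]  # always-on channels
    if env.get("SLACK_WEBHOOK_URL"):
        channels.append("slack")
    if env.get("NOTIFY_EMAIL"):
        channels.append("email")
    return channels


print(enabled_channels({}))
print(enabled_channels({"NOTIFY_EMAIL": "you@example.com"}))
```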
Running the Example
The examples/sentiment/ directory contains a complete working example — a rule-based sentiment classifier that an agent can optimize from ~40% to 80-90% accuracy.
What the Example Demonstrates
- Layer 1 (scored): A naive sentiment classifier with only 12 words. The agent iteratively improves it, typically discovering negation handling, intensity modifiers, punctuation stripping, phrase matching, and more.
- Layer 2 (pass/fail): An API wrapper stub. After Layer 1 reaches its target, the agent implements the API to satisfy 7 contract tests.
Setup
cd examples/sentiment
# Initialize git (required for the ratchet)
git init
git add -A
git commit -m "initial"
# Install sorkit
pip install sorkit
# Generate CLAUDE.md and experiment-loop skill
python -c "
from pathlib import Path
from sorkit.config import load_config
from sorkit.init import generate_claude_md, generate_experiment_loop_skill, initialize_results_tsv
config = load_config(Path('.'))
generate_claude_md(config, Path('.'))
generate_experiment_loop_skill(config, Path('.'))
initialize_results_tsv(Path('.'))
print('Ready!')
"
Check the Baseline
# Run contract tests (should all pass)
python -m pytest tests/test_classifier_contract.py -v
# Run scored tests to see baseline accuracy
python -m pytest tests/test_classifier_accuracy.py -s
Expected output:
ACCURACY: 0.4000
POS_RECALL: 0.2381
NEG_RECALL: 0.2353
NEU_RECALL: 0.9167
Results: 20/50 correct (40.0%)
Run with the MCP Server
With sorkit added to your MCP client configuration:
Agent: sor_run_oracle(layer="classifier", project_dir="examples/sentiment")
→ COMPOSITE: 0.4000
Agent: [edits src/classifier.py — adds more positive/negative words]
Agent: sor_ratchet(layer="classifier", hypothesis="expand word lists", ...)
→ KEEP score=0.5200 prev=0.4000
Agent: [edits src/classifier.py — adds punctuation stripping]
Agent: sor_ratchet(layer="classifier", hypothesis="strip punctuation", ...)
→ KEEP score=0.6000 prev=0.5200
... (15-20 iterations later) ...
Agent: sor_ratchet(layer="classifier", hypothesis="tune neutral default", ...)
→ STOP:TARGET_MET score=0.8600 attempts=18 kept=11
Agent: [now implements src/api.py]
Agent: sor_ratchet(layer="api", hypothesis="implement analyze()", ...)
→ STOP:ALL_PASS score=PASS attempts=1 kept=1
What the Agent Typically Discovers
Through iterative optimization, agents typically find these improvements:
- More words: expanding positive/negative word lists
- Punctuation: stripping !, ?, . before matching
- Negation: "not good" should flip sentiment
- Intensity: "very", "extremely" as amplifiers
- Phrases: multi-word patterns like "waste of money"
- Default bias: tuning what to return when scores are tied
- Scoring refinements: weighting certain matches higher
The Golden Set
fixtures/golden_set.json contains 50 labeled examples:
- 20 positive (including tricks: "not bad", "despite negative reviews")
- 15 negative (including subtle: "not what I'd call good")
- 10 neutral (including mixed: "some features good, others lacking")
- 5 edge cases with negation and context
Monitoring Progress
Agent: sor_status(project_dir="examples/sentiment")
→ Layer 1: classifier (scored)
Attempts: 12/30
Keeps: 7
Best score: 0.7800 (target: 0.85)
...
Agent: sor_audit(layer="classifier", project_dir="examples/sentiment")
→ Full audit report with convergence analysis
Agent: sor_hypotheses(layer="classifier", project_dir="examples/sentiment")
→ Which approaches worked and which didn't
Programmatic Usage
You can use sorkit as a Python library without the MCP server:
import asyncio
from pathlib import Path
from sorkit.config import load_config
from sorkit.oracle import run_oracle
from sorkit.ratchet import ratchet_once
async def main():
    project = Path(".")
    config = load_config(project)
    # Check current score
    result = await run_oracle(config, layer_idx=0, project_dir=project)
    print(f"Composite: {result.composite}")
    print(f"Metrics: {result.metrics}")
    # Run one ratchet iteration
    ratchet_result = await ratchet_once(
        config, layer_idx=0,
        hypothesis="add TF-IDF weighting",
        project_dir=project,
    )
    print(ratchet_result.message)

asyncio.run(main())
Key Classes
from sorkit.config import load_config, SorConfig, validate_config
from sorkit.oracle import run_oracle, OracleResult
from sorkit.ratchet import ratchet_once, RatchetResult, RatchetOutcome, StopReason
from sorkit.results import ResultsStore
from sorkit.frozen import get_frozen_paths, is_path_frozen
from sorkit.audit import get_score_history, analyze_hypotheses, generate_audit_report
from sorkit.init import generate_config_template, validate_and_save_config
from sorkit.notify import send_notifications
Requirements
- Python 3.10+
- Git (for commit/reset ratchet)
- Your test runner (pytest by default, configurable)
Key Principles
- The golden set is sacred. Tests are frozen so the agent can't game the oracle.
- One idea per iteration. Atomic changes make the ratchet meaningful.
- Layers freeze bottom-up. Completed layers become read-only.
- Single composite score. One clear optimization target per scored layer.
- The agent stops itself. Plateau detection prevents infinite loops.
License
MIT