Reinforcement learning framework for prompt refinement with LLMs

RLPrompt

Online Reinforcement Learning library for system-prompt refinement with human feedback, backed by a local LLM Critic.

How it works

Every human interaction produces a Perception Cycle (stimulus → observation → verdict). The TwoStageCritic (1.1 Backward + 1.2 Optimizer) uses full conversation context to produce actionable feedback and a refined prompt. CriticValidationLoop validates proposals by re-asking the same question before accepting. A statistical Update Gate decides whether the proposed prompt replaces the live policy.

Human interaction
      │
      ▼
PerceptionCycle (system_prompt, user_query, bot_response, verdict, comment, observations)
      │
      ▼
TwoStageCritic  →  Critic 1.1 (Backward): actionable feedback with full context
      │           →  Critic 1.2 (Optimizer): new system prompt (BLIND to query/response)
      ▼
CriticValidationLoop  →  Re-asks the Actor using the proposed prompt
      │                →  Judge: was it fixed? If not: virtual cycle, another proposal
      ▼
HybridReward   R = λ_fb·H + λ_c·C − λ_ch·word_change_ratio
      │
      ▼
RewardHistory  (rolling window, convergence tracking)
      │
      ▼
UpdateGate     degradation OR forced correction?
      │
      ├── YES → ActivePolicy.write()  →  system_prompt.md updated
      └── NO  → "Politica estable — sin actualizacion" (policy stable, no update)

Project structure

RLprompt/
│
├── src/prompt_rl/               ← Library (importable package)
│   ├── core/
│   │   ├── cycle.py             # PerceptionCycle — fundamental data unit
│   │   └── policy.py            # ActivePolicy — manages system_prompt.md
│   ├── llm/
│   │   ├── base.py              # LLMBackend ABC + LLMResponse
│   │   └── local_backend.py     # LocalLLMBackend (Ollama/Gemma)
│   ├── critic/
│   │   ├── base.py              # PerceptionCritic protocol + CriticOutput
│   │   ├── two_stage_critic.py  # TwoStageCritic (Backward + Optimizer)
│   │   └── llm_critic.py        # LLMPerceptionCritic (blind, legacy)
│   ├── feedback/
│   │   └── signals.py           # thumbs_to_score, reading_time_to_score, FeedbackAggregator
│   ├── rl/
│   │   ├── reward.py            # HybridReward + word_change_ratio
│   │   ├── history.py           # RewardHistory (rolling window + convergence)
│   │   └── gate.py              # UpdateGate (degradation / forced)
│   ├── population/
│   │   ├── genome.py            # PromptGenome (modular prompt sections)
│   │   └── leaderboard.py       # Leaderboard (fitness-ranked candidates)
│   └── loop/
│       └── online.py            # OnlineCriticLoop — ties everything together
│
├── demos/human_watch/           ← Human-Watch demo (encapsulated)
│   ├── server.py                # FastAPI chat server + RAG + feedback UI
│   ├── monitor.py               # Playwright perception monitor
│   ├── evaluator.py             # Critic subprocess (thin wrapper)
│   ├── run_backend.py           # Launcher: server + monitor
│   ├── run_server.py            # Server-only launcher
│   ├── reset_to_state_zero.py   # Reset state files
│   └── tests/
│       ├── test_flow.py         # Simulated flow test
│       └── test_monitor.py      # E2E integration test (14 checks)
│
├── examples/                    ← Library examples
│   ├── two_stage_example.py
│   └── validation_loop_example.py
│
├── data/                        ← State files (.md, .json)
│   ├── system_prompt.md         ← Active Actor policy (hot-reloaded)
│   ├── interactions.md          ← Append-only perception cycle log
│   ├── reward_history.json      ← Rolling reward window + convergence state
│   ├── population.json          ← Fitness leaderboard (up to 20 entries)
│   ├── critic_memory.json
│   ├── critic_memory.md
│   ├── prompts/                 ← Versioned policy backups
│   └── logs/                    ← Archived interactions files
├── evaluator.log                ← Evaluator subprocess stdout/stderr
└── prompts/prompt_vN.md         ← Backups before each policy update

Library

Installation

pip install -e .                      # base (LocalLLMBackend for Ollama/Gemma)
pip install -e ".[dev]"               # + pytest, ruff

Key types

Type	Module	Description
`PerceptionCycle`	`core.cycle`	One feedback loop (system_prompt + verdict + comment + dwell)
`ActivePolicy`	`core.policy`	Reads / writes `system_prompt.md` with versioned backup
`PerceptionCritic`	`critic.base`	Protocol: `evaluate(cycle) -> CriticOutput`
`TwoStageCritic`	`critic.two_stage_critic`	Two-stage critic (Backward + Optimizer)
`CriticValidationLoop`	`validation.loop`	Validates proposals before accepting them
`CriticOutput`	`critic.base`	`(critic_score, proposed_prompt, reasoning)`
`HybridReward`	`rl.reward`	`R = λ_fb·H + λ_c·C − λ_ch·change_ratio`
`RewardHistory`	`rl.history`	Rolling window + convergence state + persistence
`UpdateGate`	`rl.gate`	Fires on degradation or forced human correction
`AlwaysUpdateGate`	`rl.gate`	Updates every cycle
`OnlineCriticLoop`	`loop.online`	Orchestrates one full RL step per `PerceptionCycle`
`PromptGenome`	`population.genome`	Modular prompt with named sections
`Leaderboard`	`population.leaderboard`	Top-N candidates ranked by fitness
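
As a rough illustration of the `PerceptionCritic` protocol listed in the table above, the sketch below defines stand-in types and a trivial rule-based critic. The three `CriticOutput` fields come from the table; the `Cycle` stand-in, the `KeywordCritic` class, and its scoring rule are hypothetical and are not part of the library.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Cycle:
    """Stand-in for PerceptionCycle (only the fields this sketch needs)."""
    system_prompt: str
    verdict: str
    comment: str = ""


@dataclass
class CriticOutput:
    """Mirrors the (critic_score, proposed_prompt, reasoning) triple."""
    critic_score: float
    proposed_prompt: str
    reasoning: str


class PerceptionCritic(Protocol):
    def evaluate(self, cycle: Cycle) -> CriticOutput: ...


class KeywordCritic:
    """Hypothetical critic: appends the human comment as a rule on INCORRECTO."""

    def evaluate(self, cycle: Cycle) -> CriticOutput:
        if cycle.verdict == "INCORRECTO" and cycle.comment:
            proposed = f"{cycle.system_prompt}\nRule: {cycle.comment}"
            return CriticOutput(0.0, proposed, "human correction appended")
        return CriticOutput(1.0, cycle.system_prompt, "no change needed")


# Structural typing: KeywordCritic satisfies the protocol without inheriting from it.
critic: PerceptionCritic = KeywordCritic()
out = critic.evaluate(Cycle("Be helpful.", "INCORRECTO", "Prices include 21% VAT."))
print(out.proposed_prompt)
```

Any object with a matching `evaluate` method could be passed where a critic is expected, which is presumably how `TwoStageCritic` and `CriticValidationLoop` can be swapped for one another in the Quickstart.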

Quickstart

from prompt_rl import (
    PerceptionCycle, ActivePolicy,
    TwoStageCritic, CriticValidationLoop, Actor, LLMValidationJudge,
    OnlineCriticLoop, RewardHistory, Leaderboard,
)
from prompt_rl.llm.local_backend import LocalLLMBackend

backend = LocalLLMBackend(model="gemma3:4b")
critic = TwoStageCritic(backend=backend)
validated_critic = CriticValidationLoop(
    critic=critic,
    actor=Actor(backend=backend),
    judge=LLMValidationJudge(backend=backend),
    max_iterations=3,
)
policy  = ActivePolicy(path="system_prompt.md")
history = RewardHistory.from_file("reward_history.json")
lb      = Leaderboard.from_file("population.json")

loop = OnlineCriticLoop(
    critic=validated_critic,
    policy=policy,
    history=history,
    leaderboard=lb,
)

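cycle = PerceptionCycle(
    system_prompt="Eres un asistente de negocio...",  # "You are a business assistant..."
    user_query="¿El plan Pro incluye IVA?",           # "Does the Pro plan include VAT?"
    bot_response="No lo sé.",                         # "I don't know."
    verdict="INCORRECTO",                             # human verdict: incorrect
    comment="Siempre incluye IVA del 21 %.",          # "It always includes 21% VAT."
    dwell_seconds=4.2,
)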

result = loop.process_cycle(cycle)
loop.save_state("reward_history.json", "population.json")

print(result.gate.reason)     # "forced"
print(result.converged)       # False

TwoStageCritic design

Critic 1.1 (Backward): Full context (conversation + human feedback + cursor trace) → actionable feedback.
Critic 1.2 (Optimizer): BLIND to user_query/bot_response. It sees only the system prompt and the feedback, and produces a new prompt. This avoids overfitting the prompt to a single exchange.
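
The separation between the two stages can be sketched as two prompt builders, where only the first one receives the conversation. This is a minimal illustration, not the library's implementation: the function names, the prompt wording, and the `llm` callable are all assumptions.

```python
def build_backward_prompt(system_prompt, user_query, bot_response, verdict, comment):
    """Stage 1.1: sees the full exchange and asks for actionable feedback."""
    return (
        "You are reviewing a chatbot exchange.\n"
        f"System prompt:\n{system_prompt}\n"
        f"User query: {user_query}\n"
        f"Bot response: {bot_response}\n"
        f"Human verdict: {verdict}. Comment: {comment}\n"
        "Explain what the system prompt should do differently."
    )


def build_optimizer_prompt(system_prompt, feedback):
    """Stage 1.2: sees only the prompt and the feedback, never the exchange."""
    return (
        "Rewrite the system prompt below so that it addresses the feedback.\n"
        f"System prompt:\n{system_prompt}\n"
        f"Feedback: {feedback}\n"
        "Return only the new system prompt."
    )


def two_stage(llm, cycle):
    """Run both stages; `llm` is any callable mapping a prompt string to a reply."""
    feedback = llm(build_backward_prompt(**cycle))
    new_prompt = llm(build_optimizer_prompt(cycle["system_prompt"], feedback))
    return feedback, new_prompt


# Usage with a fake LLM that records every prompt it receives.
seen = []


def fake_llm(prompt):
    seen.append(prompt)
    return "State that all prices include VAT."


cycle = {
    "system_prompt": "You are a business assistant.",
    "user_query": "Does the Pro plan include VAT?",
    "bot_response": "I don't know.",
    "verdict": "INCORRECTO",
    "comment": "It always includes 21% VAT.",
}
feedback, new_prompt = two_stage(fake_llm, cycle)
# The query reaches stage 1.1 but not stage 1.2.
print("Does the Pro plan" in seen[0], "Does the Pro plan" in seen[1])  # prints: True False
```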

Optimizer options (Critic 1.2):

gradient_memory: number of most recent feedback items to include, so the optimizer avoids repeating itself when the same feedback recurs.
constraints: natural-language constraints (e.g. ["Responde en español.", "Mantén el prompt conciso."]).
in_context_examples: before→after examples that guide the optimizer.

critic = TwoStageCritic(
    backend=backend,
    gradient_memory=3,
    constraints=["Responde siempre en español."],
    in_context_examples=["Antes: 'X' → Después: 'X mejorado porque...'"],
)

CursorTrace.from_observations(cycle.observations) parses [DWELL], [SELECT], [CLICK], [REVIEW_RAG].
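
The observation format itself is not documented here, so the following is only a sketch of tag-based parsing under one assumption: each observation is a string that starts with a bracketed tag followed by a free-text payload. `parse_observations` is a hypothetical helper, not the `CursorTrace` API.

```python
import re
from collections import Counter

# The four tags named in the text; anything else is ignored by this sketch.
KNOWN_TAGS = {"DWELL", "SELECT", "CLICK", "REVIEW_RAG"}
TAG_RE = re.compile(r"^\[([A-Z_]+)\]\s*(.*)$")


def parse_observations(observations):
    """Split each observation into (tag, payload); skip unknown or untagged lines."""
    events = []
    for line in observations:
        match = TAG_RE.match(line.strip())
        if match and match.group(1) in KNOWN_TAGS:
            events.append((match.group(1), match.group(2)))
    return events


observations = [
    "[DWELL] 4.2s on answer",
    "[SELECT] 'No lo sé.'",
    "[RAW] x=10 y=20",  # not one of the four parsed tags, so this sketch drops it
    "[REVIEW_RAG] opened source panel",
]
events = parse_observations(observations)
print(Counter(tag for tag, _ in events))
```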

Convergence criterion

After N consecutive stable cycles (default N=5) where:

verdict == CORRECTO, and
word_change_ratio(current, proposed) < ε (default ε=0.05)

RewardHistory.converged becomes True. The monitor then skips the evaluator subprocess, stopping further refinement automatically. Convergence resets whenever the gate fires and the policy is updated.
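
The criterion above amounts to a streak counter. The sketch below is a hypothetical stand-in for the convergence part of `RewardHistory`, using the default values from the text; the class and method names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceTracker:
    """Hypothetical stand-in for the convergence state kept by RewardHistory."""
    n_required: int = 5     # default N from the text
    epsilon: float = 0.05   # default ε from the text
    stable_streak: int = 0

    @property
    def converged(self) -> bool:
        return self.stable_streak >= self.n_required

    def record(self, verdict: str, change_ratio: float, gate_fired: bool) -> bool:
        """Update the streak for one cycle and return the converged flag."""
        stable = verdict == "CORRECTO" and change_ratio < self.epsilon
        if gate_fired or not stable:
            self.stable_streak = 0  # a policy update or an unstable cycle restarts the count
        else:
            self.stable_streak += 1
        return self.converged


tracker = ConvergenceTracker()
flags = [tracker.record("CORRECTO", 0.01, gate_fired=False) for _ in range(5)]
print(flags)  # [False, False, False, False, True]

# A forced correction fires the gate and resets convergence.
print(tracker.record("INCORRECTO", 0.40, gate_fired=True))  # False
```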

Human-Watch implementation

Human-Watch is the production runtime built on the library. It closes the feedback loop with a real human operator.

How to run

Prerequisites: Ollama with gemma3:1b + gemma3:4b pulled; pip install -e ".[human-watch]"; playwright install chromium once.

Option A — Single command (recommended):

# From project root: starts server + monitor
python -m demos.human_watch.run_backend
# or after pip install:
rlprompt-backend

Option B — Separate terminals:

# Terminal 1 — chat server
python -m demos.human_watch.run_server
# or: rlprompt-serve

# Terminal 2 — perception monitor (opens Chromium)
python -m demos.human_watch.monitor

Perception Cycle (4 phases)

① Predictive Model   — active system_prompt.md
② System Action      — user query + Gemma 3:1b response
③ Observation Phase  — [DWELL] [SELECT] [REVIEW_RAG]
④ ACC Signal         — CORRECTO / INCORRECTO + optional comment
[RAW]                — click/cursor telemetry

Dashboard

http://localhost:8000/dashboard — prompt version, reward history, accuracy, top-3 fitness, last 5 human corrections.

Flow tests

# From project root
# Test without Playwright: simulates events and verifies cycle + evaluator
python -m demos.human_watch.tests.test_flow   # requires the server on :8000

# E2E test with a browser (requires: playwright install)
python -m demos.human_watch.tests.test_monitor   # requires the server on :8000

test_flow.py: Verifies that the text → Incorrecto → evaluator flow runs. Does not require Playwright.
test_monitor.py: Full E2E test with Playwright; validates 14 assertions about the cycle in interactions.md.

Reward formula

R_total = λ_feedback · H + λ_critic · C − λ_change · word_change_ratio

H  = human_feedback  [0, 1]   thumbs + dwell (FeedbackAggregator)
C  = critic_score    [0, 1]   TwoStageCritic output
λ_change             penalises large rewrites (keeps changes minimal)

Default weights: λ_feedback=0.9, λ_critic=0.1, λ_change=0.3.
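
A worked example of the formula with the default weights follows. The definition of `word_change_ratio` is not given in this document, so the version below (one minus the similarity of the two word sequences, via `difflib`) is an assumption chosen for illustration; `hybrid_reward` is likewise a hypothetical helper rather than the `HybridReward` API.

```python
from difflib import SequenceMatcher


def word_change_ratio(current: str, proposed: str) -> float:
    """Assumed definition: 1 - similarity of the word sequences (0 = identical)."""
    a, b = current.split(), proposed.split()
    if not a and not b:
        return 0.0
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def hybrid_reward(h, c, change_ratio, lam_feedback=0.9, lam_critic=0.1, lam_change=0.3):
    """R_total = λ_feedback·H + λ_critic·C − λ_change·word_change_ratio."""
    return lam_feedback * h + lam_critic * c - lam_change * change_ratio


# 0.9*1.0 + 0.1*0.8 - 0.3*0.1 = 0.95
print(round(hybrid_reward(h=1.0, c=0.8, change_ratio=0.1), 4))

# One word out of four replaced: 3 matching words, similarity 2*3/8 = 0.75.
print(word_change_ratio("always answer in Spanish", "always answer in English"))
```

With these weights the human signal dominates: a thumbs-down (H near 0) cannot be offset by a perfect critic score, since the critic term contributes at most 0.1.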

Update triggers

Trigger	Condition
Degradation	`R_curr < R_avg × 0.8` (rolling window of 10)
Forced	`verdict == INCORRECTO` AND non-empty comment
Stable	Neither — policy unchanged
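
The table can be read as a small decision function. The sketch below makes two assumptions that the table does not settle: the forced check runs before the degradation check, and `R_avg` is the mean of the previous rewards in the window (excluding the current one). `gate_decision` is a hypothetical name, not the `UpdateGate` API.

```python
from collections import deque
from statistics import mean


def gate_decision(r_curr, past_rewards, verdict, comment, threshold=0.8):
    """Return "forced", "degradation", or "stable" following the table above."""
    if verdict == "INCORRECTO" and comment.strip():
        return "forced"
    if past_rewards and r_curr < mean(past_rewards) * threshold:
        return "degradation"
    return "stable"


window = deque(maxlen=10)  # rolling window of the last 10 rewards
for reward in [0.90, 0.92, 0.88, 0.91]:
    window.append(reward)

# Window mean is 0.9025, so the degradation threshold is 0.722.
print(gate_decision(0.60, window, "CORRECTO", ""))                   # degradation
print(gate_decision(0.80, window, "CORRECTO", ""))                   # stable
print(gate_decision(0.95, window, "INCORRECTO", "Includes 21% VAT")) # forced
```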

CriticValidationLoop (default)

When verdict is INCORRECTO, validate before accepting:

1. Critic proposes a new system prompt
2. Re-ask the same question to the actor with the new prompt
3. ValidationJudge evaluates (with the original feedback) whether the problem was fixed
4. If not: virtual cycle → Critic proposes again → repeat (up to max_iterations)
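
The steps above can be sketched as a bounded retry loop over three callables. Everything here is illustrative: the function name, the callable signatures, the way a failed attempt is folded into the feedback, and the choice to return `None` when no proposal passes are assumptions, not the behaviour of `CriticValidationLoop`.

```python
def validate_proposal(propose, ask_actor, judge, prompt, query, feedback, max_iterations=3):
    """Return (accepted_prompt, attempts), or (None, max_iterations) if nothing passes."""
    candidate = prompt
    for attempt in range(1, max_iterations + 1):
        candidate = propose(candidate, feedback)  # critic proposes a new prompt
        answer = ask_actor(candidate, query)      # re-ask the same question
        if judge(query, answer, feedback):        # was the problem fixed?
            return candidate, attempt
        # Virtual cycle: carry the failed answer into the next proposal round.
        feedback = f"{feedback} (previous attempt answered: {answer!r})"
    return None, max_iterations


# Toy stand-ins for the critic, the actor, and the judge.
def propose(prompt, feedback):
    return prompt + " Always state that prices include 21% VAT."


def ask_actor(prompt, query):
    return "Yes, 21% VAT is included." if "VAT" in prompt else "I don't know."


def judge(query, answer, feedback):
    return "VAT" in answer


accepted, attempts = validate_proposal(
    propose, ask_actor, judge,
    prompt="Be helpful.",
    query="Does the Pro plan include VAT?",
    feedback="It always includes 21% VAT.",
)
print(attempts, accepted)
```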

AlwaysUpdateGate

To update on every cycle (no degradation/forced checks):

from prompt_rl import OnlineCriticLoop, AlwaysUpdateGate  # plus the other imports from the Quickstart

loop = OnlineCriticLoop(
    critic=critic,
    policy=policy,
    gate=AlwaysUpdateGate(),
    # ... remaining arguments as in the Quickstart
)

Tests

# Library unit tests
pytest tests/ -v

# Human-Watch integration tests (requires server on :8000)
python -m demos.human_watch.tests.test_flow
python -m demos.human_watch.tests.test_monitor

License

MIT. See LICENSE.

Contributing

See CONTRIBUTING.md.

This version

1.0.0

Mar 25, 2026

Download files

Source Distribution

prompt_rl-1.0.0.tar.gz (78.6 kB)

Uploaded Mar 25, 2026 Source

Built Distribution

prompt_rl-1.0.0-py3-none-any.whl (88.2 kB)

Uploaded Mar 25, 2026 Python 3

File details

Details for the file prompt_rl-1.0.0.tar.gz.

File metadata

Download URL: prompt_rl-1.0.0.tar.gz
Upload date: Mar 25, 2026
Size: 78.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for prompt_rl-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e181962dafb7ddfc1d1562f1a182f70bbc721ef72a78e270215ed0afbd99dc91`
MD5	`e17ddd9d1a214c4bf2ab057763d28487`
BLAKE2b-256	`d00d82a30a5a5090cbc7502f4fafa9e323d8ced893b2cf98dfb4f567589bedd7`

File details

Details for the file prompt_rl-1.0.0-py3-none-any.whl.

File metadata

Download URL: prompt_rl-1.0.0-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 88.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for prompt_rl-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`882fb39e05b6d25b387ec6ddbbcc3aa29b699adcd781af58ab4bb997374e9eee`
MD5	`7e66be46f057ed362af4951f77245696`
BLAKE2b-256	`afec071bf1b7b41b5320d7b102433b412490f64c5aad259528fab99345cd9ebf`
