Research prototype and commercial runtime for safe continual test-time adaptation under distribution shift.
Adaptive Reliability Layer
An experimental research codebase for safe continual test-time adaptation under distribution shift.
Project Goal
The project explores whether a deployed ML model can be paired with an adaptive reliability layer that:
- detects distribution shift online
- estimates whether the model remains in its competence zone
- applies bounded, reversible adaptation when safe
- preserves source knowledge over long streams
- recalibrates uncertainty after adaptation
This is a separate project from Intelligent_NPCs.
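The "bounded, reversible adaptation" goal above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the package's implementation: the class name `BoundedAdapter`, the L2 drift bound `max_drift`, and the snapshot stack are all hypothetical. Every update is snapshotted so it can be undone, and the adapted parameters are projected back onto a ball around the frozen source parameters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundedAdapter:
    """Hypothetical adapter: parameters may drift at most `max_drift`
    (L2 norm) away from the frozen source parameters."""

    source_params: List[float]
    max_drift: float = 0.5
    params: List[float] = field(default_factory=list)
    _snapshots: List[List[float]] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        # Start from the source parameters unless explicit ones were given.
        if not self.params:
            self.params = list(self.source_params)

    def drift(self) -> float:
        """L2 distance between current and source parameters."""
        return sum(
            (p - s) ** 2 for p, s in zip(self.params, self.source_params)
        ) ** 0.5

    def apply_update(self, delta: List[float], safe: bool) -> bool:
        """Apply `delta` only when the safety gate allows it.

        Returns True if the update was applied."""
        if not safe:
            return False
        # Snapshot first so the step is reversible.
        self._snapshots.append(list(self.params))
        self.params = [p + d for p, d in zip(self.params, delta)]
        current = self.drift()
        if current > self.max_drift:
            # Project back onto the ball of radius max_drift around the source.
            scale = self.max_drift / current
            self.params = [
                s + (p - s) * scale
                for p, s in zip(self.params, self.source_params)
            ]
        return True

    def rollback(self) -> bool:
        """Undo the most recent applied update, if any."""
        if not self._snapshots:
            return False
        self.params = self._snapshots.pop()
        return True

    def reset(self) -> None:
        """Return to the source parameters and drop all snapshots."""
        self.params = list(self.source_params)
        self._snapshots.clear()
```

The projection keeps long streams from accumulating unbounded drift, while `rollback` and `reset` give two levels of reversibility: undo one step, or return to the source model entirely.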
Initial Scope
The first prototype focuses on a simulated streaming setting rather than production deployment:
- streaming batches from a synthetic nonstationary environment
- latent or feature-space shift monitoring
- simple adaptation policy with safety gating
- uncertainty-aware output surface
- evaluation against a frozen baseline
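As a rough picture of the simulated setting described above, the sketch below generates streaming batches whose inputs shift partway through the stream and scores a frozen threshold classifier on each batch. It is not the repository's simulator: `synthetic_stream`, `frozen_predict`, the one-dimensional Gaussian features, and all default values are assumptions chosen for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Batch = Tuple[int, List[float], List[int]]  # (step, features, labels)


def synthetic_stream(
    n_batches: int = 6,
    batch_size: int = 200,
    shift_at: int = 3,
    shift: float = 1.5,
    seed: int = 0,
) -> Iterator[Batch]:
    """Yield batches of 1-D points; from `shift_at` on, inputs are offset."""
    rng = random.Random(seed)
    for step in range(n_batches):
        offset = shift if step >= shift_at else 0.0
        xs: List[float] = []
        ys: List[int] = []
        for _ in range(batch_size):
            y = rng.randint(0, 1)
            # Class means are -1 and +1 before the shift; the covariate
            # offset moves both classes to the right afterwards.
            xs.append(rng.gauss(2.0 * y - 1.0 + offset, 1.0))
            ys.append(y)
        yield step, xs, ys


def frozen_predict(x: float) -> int:
    """Frozen source model: a threshold fitted to the unshifted regime."""
    return int(x > 0.0)


def accuracy_per_batch(stream: Iterable[Batch]) -> List[float]:
    """Accuracy of the frozen model on each batch of the stream."""
    return [
        sum(frozen_predict(x) == y for x, y in zip(xs, ys)) / len(ys)
        for _, xs, ys in stream
    ]


if __name__ == "__main__":
    for step, acc in enumerate(accuracy_per_batch(synthetic_stream())):
        print(f"step={step} frozen_accuracy={acc:.3f}")
```

Any adaptation policy can then be compared against the per-batch numbers from `accuracy_per_batch`, which play the role of the frozen baseline.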
Repo Layout
- `docs/`: project documents and research notes
- `src/adaptive_reliability_layer/`: core package
- `scripts/`: runnable entrypoints
Useful starting docs:
- `docs/status_paper_commercial_outreach.md`
- `docs/current_findings.md`
- `docs/next_step_decision_memo.md`
Quick Start
Create a virtual environment and install in editable mode:
python3 -m venv .venv
source .venv/bin/activate
# Commercial runtime (torch fraud pilots + sidecar):
pip install -e ".[torch,serving]"
# Full research bench (adds WILDS):
pip install -e ".[research,serving,dev]"
PyPI-style install (when published):
pip install "adaptive-reliability-layer[torch,serving]"
arl-customer-replay --input customer.csv --config configs/customer_shadow.yaml
See docs/customer_replay.md for the design-partner replay path.
Run the simple simulation:
python3 scripts/run_simulation.py
Run the baseline benchmark:
python3 scripts/run_benchmark.py
Run the real tabular benchmark with the PyTorch adapter model:
python3 scripts/run_tabular_benchmark.py
Run the harder digits-shift benchmark:
python3 scripts/run_digits_shift_benchmark.py
Run the real-image scale-up benchmark on Fashion-MNIST:
python3 scripts/run_fashion_mnist_shift_benchmark.py
Run the delayed-label temporal Fashion-MNIST benchmark:
python3 scripts/run_temporal_fashion_mnist_benchmark.py
Run the temporal delay/severity suite and save aggregated results:
python3 scripts/run_temporal_benchmark_suite.py
Run the broader temporal paper-style suite:
python3 scripts/run_temporal_paper_suite.py
Run the recurrence-first temporal benchmark for specialist reuse:
python3 scripts/run_recurrence_temporal_benchmark.py
Run the first public WILDS benchmark path on CivilComments:
python3 scripts/run_wilds_civilcomments_benchmark.py
Run the multi-seed WILDS CivilComments suite:
python3 scripts/run_wilds_civilcomments_suite.py
Run the graph-native shift benchmark:
python3 scripts/run_graph_shift_benchmark.py
Run the multi-seed image scale-up suite across backbones and shift severities:
python3 scripts/run_image_scaleup_suite.py
Run the multi-seed suite and save aggregated results:
python3 scripts/run_benchmark_suite.py
Run the ablation suite and save aggregated results:
python3 scripts/run_ablation_suite.py
Current Status
This repository is in the research scaffolding phase. The code currently provides:
- a stream simulator for synthetic regime shift
- a source-reference profile and shift monitor over features plus outputs
- a martingale-style sequential risk monitor
- baseline policies for frozen, naive, and safety-gated adaptation
- a multi-action controller over `bn_refresh`, `label_shift`, `adapter_update`, and `reset`
- experimental `bandit`, `specialist_memory`, and `hybrid` controllers for the next research phase
- a delayed-feedback bandit controller for temporal streams with label latency
- a delayed-feedback hybrid controller that combines specialist memory with delayed controller learning
- a regime-aware delayed bandit controller with short-horizon temporal state
- confidence-filtered pseudo-label adaptation with bounded parameter drift
- a small PyTorch encoder-plus-adapter model for adapter-only test-time updates
- a real tabular streaming benchmark with regime-based shifts
- a harder digits-shift benchmark that degrades the frozen source model much more sharply
- a real-image scale-up benchmark on a fine-grained Fashion-MNIST subset with a BN-heavy convolutional model
- a configurable image scale-up path with `convnet` and `resnet_small` backbones
- `standard` and `harsh` image shift profiles for stress-testing adaptation policies
- an `extreme` image shift profile for harder temporal stress tests
- a delayed-label temporal image benchmark for studying feedback lag
- a temporal benchmark suite over multiple delay and severity settings
- smoothed, trust-weighted retrospective reward updates for delayed-feedback controller learning
- a recurrence-first temporal benchmark for testing specialist reuse under returning regimes
- a broader paper-style temporal suite with win-count and delta summaries
- saved temporal paper-suite artifacts under `results/temporal_paper_suite.{md,json}`
- a graph-native benchmark with topology-aware shift monitoring
- an initial WILDS CivilComments benchmark path for testing the controller stack on a stronger public benchmark family
- a multi-seed WILDS CivilComments suite with compact and medium public-benchmark settings
- a multi-seed benchmark suite with JSON and Markdown outputs under `results/`
- a dedicated image scale-up suite that defaults to a fast `convnet` confirmation loop, with slower `resnet_small` confirmations available when needed
- an ablation suite for controller actions, reward shaping, and specialist-memory routing
- a lightweight uncertainty wrapper
- an evaluation loop for comparing outcomes over time
- per-step traces for inspecting failure modes
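One common way to build a martingale-style sequential risk monitor like the one listed above is a betting (test-martingale) construction. The sketch below is that generic construction, not the repository's monitor: the class name `BettingRiskMonitor`, the fixed bet size, and the use of per-step losses in `[0, 1]` are assumptions. Under the null hypothesis that the expected loss stays at or below `target_error`, the capital process is a nonnegative supermartingale, so by Ville's inequality it reaches `1 / delta` with probability at most `delta`.

```python
from dataclasses import dataclass


@dataclass
class BettingRiskMonitor:
    """Hypothetical betting-style monitor for H0: E[loss] <= target_error."""

    target_error: float = 0.1  # tolerated error rate under H0
    bet: float = 2.0           # bet size, must lie in [0, 1 / target_error]
    delta: float = 0.05        # false-alarm probability bound
    capital: float = 1.0       # accumulated evidence against H0

    def __post_init__(self) -> None:
        if not 0.0 < self.target_error < 1.0:
            raise ValueError("target_error must be in (0, 1)")
        if not 0.0 <= self.bet <= 1.0 / self.target_error:
            raise ValueError("bet must be in [0, 1 / target_error]")
        if not 0.0 < self.delta < 1.0:
            raise ValueError("delta must be in (0, 1)")

    @property
    def alarm(self) -> bool:
        """True once the capital crosses the Ville threshold 1 / delta."""
        return self.capital >= 1.0 / self.delta

    def update(self, loss: float) -> bool:
        """Bet against H0 with one observed loss and return the alarm state."""
        loss = min(max(loss, 0.0), 1.0)
        # The factor is nonnegative because bet <= 1 / target_error.
        self.capital *= 1.0 + self.bet * (loss - self.target_error)
        return self.alarm
```

Capital shrinks while losses stay below the target and grows multiplicatively once they exceed it, which is what lets a monitor of this kind flag sustained degradation quickly. In a deployed stream the losses would have to come from delayed labels or a proxy signal; that choice is outside this sketch.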
Current Read
The current results point to a clear architectural conclusion:
- naive continual adaptation is consistently brittle
- reset logic is one of the highest-leverage safety mechanisms
- on tabular data, `bandit` and `hybrid` currently have the best utility/risk tradeoff
- on the harder digits-shift benchmark, all serious controllers beat or match frozen while dramatically reducing risk capital
- on the new Fashion-MNIST benchmark, the controller family roughly matches frozen accuracy while cutting sequential risk by an order of magnitude
- the harsher image profile creates a clearer separation between frozen and controller-guided behavior
- the temporal and graph tracks are now in place, so the controller abstractions are being exercised beyond flat iid-style batch streams
- on the temporal suite, regime-aware delayed control helps most in some longer-delay settings, but it is still unstable across the full grid
- on the full temporal paper suite, `controller` currently has the strongest aggregate utility story, while `frozen` still wins the most accuracy settings and `delayed_bandit` is the strongest delayed learner
- after the specialist-quality upgrade, `delayed_hybrid` became more competitive on the full temporal paper suite, but `controller` and `hybrid` still have the strongest aggregate utility story
- on the recurrence-first temporal benchmark, `delayed_hybrid` now opens multiple specialists, but the delayed branch still trails on utility and needs better routing/credit assignment
- richer specialist signatures and support-state warm starts now produce much stronger focused long-delay results for the delayed-memory branch
- the newest result is that delayed specialist memory should likely be regime-selective: it helps on some recurring long-delay slices, but `regime_aware_delayed_bandit` is still the strongest general delayed learner on the mixed long-delay grid
- the new explicit regime encoder makes delayed control more selective and improves some long-delay slices, especially `standard / 12` and `extreme / 12`, but it has not yet made delayed memory the strongest overall temporal branch
- the temporal image benchmark now distinguishes between immediate-learning and true delayed-feedback bandit control
- the temporal track now uses retrospective rewards at label reveal time, not just delayed replay of immediate utility
- `delayed_hybrid` is now clearly a real branch rather than collapsing to plain delayed bandit, but it is still not the strongest temporal policy overall
- the graph benchmark now degrades frozen performance meaningfully on structural rewiring, which makes it a better structural-shift testbed even though it still mostly highlights safety over raw accuracy recovery
- the WILDS CivilComments path now has a real `easy / hard / recurring` split instead of recurring collapsing back onto the majority group
- on the multi-seed WILDS suite, `bandit` and `hybrid` currently have the best public-benchmark utility while raw accuracy remains roughly tied with frozen
- delayed specialist memory now forms more controlled specialist pools with richer reuse diagnostics, but it still trails non-delayed `hybrid` on the recurrence-first benchmark
- the most promising recent direction is specialist quality: better routing signatures plus specialist support-state warm starts improved delayed-memory performance far more than controller decoupling did
- those specialist-quality gains are real but mixed at full-suite scale: they improved the delayed branch’s competitiveness without yet making it the strongest overall temporal policy
- making memory more selective helped clarify the story more than it improved the top-line numbers: routing can stay fairly loose, but warm-start reuse needs to be selective under harsher shift
- the project looks strongest as a controller over bounded interventions, not as a single always-on adaptation rule
- the public temporal runtime story is now sharper:
  - `PaySim` remains the strongest fraud-style bounded-auto success story
  - `UCI Gas Sensor Drift` is a neutral-but-honest maintenance benchmark
  - `OpenML Electricity` now uses a conservative `sensor_safe` profile and stands down instead of harming accuracy
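Several of the points above hinge on rewards being assigned when labels are revealed rather than when an action is taken. A minimal sketch of that bookkeeping follows. It is not the repository's delayed bandit: `DelayedRewardBuffer` is a hypothetical name, its `reveal_labels(step, labels)` method only mirrors the shape of the runtime call mentioned elsewhere in this README, and plain batch accuracy stands in for the smoothed, trust-weighted reward the project describes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class DelayedRewardBuffer:
    """Hypothetical buffer that credits an action once its labels arrive."""

    def __init__(self) -> None:
        # step -> (action taken, predictions made at that step)
        self._pending: Dict[int, Tuple[str, List[int]]] = {}
        self._reward_sum: Dict[str, float] = defaultdict(float)
        self._count: Dict[str, int] = defaultdict(int)

    def record(self, step: int, action: str, predictions: Sequence[int]) -> None:
        """Remember which action produced which predictions at `step`."""
        self._pending[step] = (action, list(predictions))

    def reveal_labels(self, step: int, labels: Sequence[int]) -> float:
        """Score the stored predictions for `step` and credit the action."""
        action, preds = self._pending.pop(step)
        if not labels or len(labels) != len(preds):
            raise ValueError("labels must be non-empty and match predictions")
        # Retrospective reward: accuracy measured at label reveal time.
        reward = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        self._reward_sum[action] += reward
        self._count[action] += 1
        return reward

    def mean_reward(self, action: str) -> float:
        """Average revealed reward for `action`, or 0.0 if none revealed yet."""
        n = self._count[action]
        return self._reward_sum[action] / n if n else 0.0

    def pending_steps(self) -> List[int]:
        """Steps whose labels have not been revealed yet."""
        return sorted(self._pending)
```

Because labels may arrive out of order, the buffer is keyed by step; a controller would read `mean_reward` (or a smoothed variant) when choosing its next action.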
Commercial Deployment Runtime
The repo now includes a production-oriented runtime layer on top of the research benchmarks:
- `ReliabilityLayer`: stable deployment surface for every batch (predictions, shift/risk scores, recommended/taken actions, trust state, rollback metadata)
- Decision record schema: stable operator-facing payload with `regime_id`, `regime_confidence`, `risk_score`, `why_this_action`, `rollback_eligible`, and `retrain_recommended`
- Operating modes: `shadow`, `recommend`, `bounded_auto`
- Safety budgets: per-window caps on auto-actions / resets with downgrade-to-`recommend` when budgets are exhausted
- Model adapters: `torch_tabular`, `sklearn`, `black_box`
- Governance: SQLite audit log, versioned snapshots, one-click rollback
- Offline replay: canonical CSV/Parquet historical streams with label-delay simulation and `reveal_labels(step, labels)` for delayed supervision
- Runtime policies: `delayed_bandit`, `regime_aware_delayed_bandit` (ported from research for fraud pilots)
- Operator + buyer reports: `technical_report.md`, `operator_report.md`, `buyer_report.md`, and replay schema artifacts for every pilot / public-story run
- Dual-metric reports: shadow vs `bounded_auto` on the same stream (`dual_metric_report.md`)
- HTTP serving: FastAPI sidecar (`/v1/batch`, `/v1/batch/{step}/labels`, `/v1/approve`, `/v1/health`, `/v1/metrics`)
- Profile-aware runtime control: drift signatures plus bounded action profiles for `fraud`, `sensor`, and conservative `sensor_safe` streams
- Ingest contract: canonical replay schema (`timestamp`, `label`, `feature_*`, optional `regime`, optional `meta_*`) → replay stream
- Policy persistence: save/load bandit + regime encoder state across restarts
- Configurable KPIs: `kpi` block in runtime YAML for buyer-facing scores
- Pilot framework: fraud/risk-style case study with saved KPI report
- Public ops stories: one-command replay artifacts for public datasets (`scripts/run_public_ops_story.py`)
- Observability: optional Prometheus metrics endpoint
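The safety-budget item in the list above can be sketched as a small gate in front of the controller. The field names follow the `safety_budget` keys documented under Configuration, but the class itself is hypothetical and two behaviours are assumptions: the window is treated as a sliding window over the last `window_steps` steps, and an exhausted budget with downgrading disabled returns an illustrative `"blocked"` state.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class SafetyBudget:
    """Hypothetical per-window cap on automatic actions and resets."""

    window_steps: int = 100
    max_auto_actions_per_window: int = 5
    max_resets_per_window: int = 1
    downgrade_to_recommend: bool = True
    # (step, action) pairs for automatic actions still inside the window.
    _history: Deque[Tuple[int, str]] = field(default_factory=deque, init=False)

    def decide(self, step: int, action: str) -> str:
        """Return how `action` may run at `step`.

        'bounded_auto' means it runs automatically and is charged to the
        budget; 'recommend' means a human must approve it; 'blocked' is an
        assumed fallback when downgrading is disabled."""
        # Drop entries that have slid out of the window (assumed sliding).
        while self._history and self._history[0][0] <= step - self.window_steps:
            self._history.popleft()
        autos = len(self._history)
        resets = sum(1 for _, a in self._history if a == "reset")
        exhausted = autos >= self.max_auto_actions_per_window or (
            action == "reset" and resets >= self.max_resets_per_window
        )
        if exhausted:
            return "recommend" if self.downgrade_to_recommend else "blocked"
        self._history.append((step, action))
        return "bounded_auto"
```

For example, with `window_steps=10` and `max_auto_actions_per_window=2`, a third automatic action inside the same ten steps comes back as `"recommend"`, and automatic execution resumes once the earliest action leaves the window.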
Product milestone checklist (dual-metric pilots, verification, sidecar): docs/product_milestones.md. Run all five with:
python3 scripts/run_product_milestones.py
Quick start (commercial path)
pip install -e ".[dev,prometheus,serving]"
# Shadow-mode offline replay on synthetic fraud-like stream
python3 scripts/run_offline_replay.py --synthetic --config configs/default.yaml
# Pilot case study artifact (report + JSON KPIs; dual-metric when layer_builder is set)
python3 scripts/run_pilot_case_study.py --config configs/pilot_fraud_tabular.yaml
# PaySim torch pilot with regime-aware delayed bandit + dual-metric report
python3 scripts/run_pilot_torch.py
# Ingest CSV/JSONL and replay (optional --dual-mode)
python3 scripts/run_ingest_replay.py --input data/openml/credit_german.csv --dual-mode
# Canonical offline replay on a CSV or Parquet stream
python3 scripts/run_offline_replay.py --input data/openml/credit_german.csv
# Public ops story on a public dataset
python3 scripts/run_public_ops_story.py --source-id paysim_fraud --controller-name multi_action --stream-cycles 4
# Correction-centric parallel-path evaluation on the fraud SOTA suite
python3 scripts/run_correction_path_evaluation.py
# Focused decomposition of the flagship fraud SOTA lane
python3 scripts/run_production_failure_analysis.py --source ieee_cis_fraud_torch
# HTTP sidecar (production pilot)
```bash
pip install -e ".[serving,prometheus]"
python3 scripts/export_bundled_fraud_data.py
python3 scripts/run_serve.py --config configs/serving_pilot_fraud_torch.yaml --force-shadow
python3 scripts/run_serving_parity.py
```
Docs: docs/sidecar_production.md | Docker: docker compose up arl-sidecar
Prometheus metrics (optional)
python3 scripts/run_metrics_server.py --config configs/default.yaml
### Configuration
Default runtime config: `configs/default.yaml`
Pilot config: `configs/pilot_fraud_tabular.yaml`
Key fields:
- `operating_mode`: `shadow` | `recommend` | `bounded_auto`
- `bounded_auto_actions`: low-risk actions allowed in bounded auto mode
- `safety_budget.window_steps`: control horizon for bounded-auto budgets
- `safety_budget.max_auto_actions_per_window`: automatic intervention cap per horizon
- `safety_budget.max_resets_per_window`: reset cap per horizon
- `safety_budget.downgrade_to_recommend`: force human approval when budgets are exhausted
- `governance.audit_db_path` / `governance.snapshot_dir`: audit + rollback storage
- `replay.label_delay_steps`: delayed-label simulation for offline replay
- `replay_schema.md`: generated canonical input contract for customer logs
### Public ops story artifacts
Every public or pilot replay now writes:
- `technical_report.md`: replay summary and per-strategy metrics
- `operator_report.md`: intervention timeline and top drift episodes
- `buyer_report.md`: risk / accuracy / retrain-deferral summary
- `summary.json`: machine-readable KPI payload
- `replay_schema.md`: canonical log schema for ingestion
Current strongest public fraud/risk ops story:
- `results/ops_story_paysim_fraud_multi_action/`
- bounded auto improved accuracy from `87.0%` to `96.9%`
- intervention rate stayed at `4.4%` of batches
- retrain trigger was deferred by `1` step
- harmful drift events avoided vs frozen: `2`
### Real-data verification (before design partners)
Verify the commercial runtime across multiple public datasets:
```bash
python3 scripts/run_real_data_verification.py --config configs/real_data_verification.yaml
```
Sources included by default:
| Source | Type | Wedge |
|---|---|---|
| `breast_cancer` | sklearn UCI | general tabular |
| `digits` | sklearn | general tabular |
| `tabular_breast_cancer_shift` | in-repo shift stream | general tabular |
| `openml_credit_g` | OpenML German Credit | fraud/risk adjacent |
| `paysim_fraud` | PaySim-style synthetic mobile money | fraud ops proxy (time-ordered) |
| `ieee_cis_fraud` | IEEE-CIS sample or synthetic fallback | imbalanced fraud tabular |
| `openml_electricity` | OpenML Electricity | predictive maintenance proxy |
| `uci_gas_sensor_drift` | UCI Gas Sensor Array Drift | natural batch-chronological drift benchmark |
| `wilds_civilcomments_csv` | local WILDS CSV | public NLP / moderation |
Each source runs through all 8 commercial priorities: deployment surface, operating modes, offline replay, model adapters, engineering maturity, observability hooks, governance/audit, and real-data KPI evidence.
Results are saved under results/real_data_verification/.
Bundled offline fallbacks for OpenML-style datasets live in data/openml/ (UCI German Credit + Spambase). Regenerate with:
python3 scripts/export_bundled_real_data.py
Grafana dashboard template: observability/grafana/arl_dashboard.json
Docker
docker compose run --rm replay
docker compose up metrics