Research prototype and commercial runtime for safe continual test-time adaptation under distribution shift.

Adaptive Reliability Layer

An experimental research codebase for safe continual test-time adaptation under distribution shift.

Project Goal

The project explores whether a deployed ML model can be paired with an adaptive reliability layer that:

  • detects distribution shift online
  • estimates whether the model remains in its competence zone
  • applies bounded, reversible adaptation when safe
  • preserves source knowledge over long streams
  • recalibrates uncertainty after adaptation
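The loop described above can be sketched on a toy one-parameter model. This is only an illustration under stated assumptions: the class `GatedAdapter`, its thresholds, and the z-score shift test are hypothetical and are not taken from the package.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GatedAdapter:
    """Toy reliability layer around a model with a single bias parameter.

    All names and thresholds here are hypothetical, for illustration only.
    """

    source_mean: float               # reference feature mean from source data
    source_std: float                # reference feature std from source data
    bias: float = 0.0                # the only adaptable parameter
    shift_threshold: float = 3.0     # z-score above which a batch counts as shifted
    competence_limit: float = 8.0    # z-score beyond which adapting is judged unsafe
    max_drift: float = 1.0           # bound on how far bias may move from the source value
    _snapshot: float = field(default=0.0, init=False)

    def shift_score(self, batch):
        """z-score of the batch mean against the source reference."""
        batch_mean = sum(batch) / len(batch)
        stderr = self.source_std / math.sqrt(len(batch))
        return abs(batch_mean - self.source_mean) / stderr

    def step(self, batch):
        """Return the action taken for this batch: no_op, adapt, or reset."""
        score = self.shift_score(batch)
        if score < self.shift_threshold:
            return "no_op"                      # no detectable shift
        self._snapshot = self.bias              # keep the change reversible
        if score > self.competence_limit:
            self.bias = 0.0                     # outside competence zone: back to source
            return "reset"
        target = sum(batch) / len(batch) - self.source_mean
        # Bounded adaptation: clip the update to the allowed drift.
        self.bias = max(-self.max_drift, min(self.max_drift, target))
        return "adapt"

    def rollback(self):
        """Undo the most recent adapt or reset."""
        self.bias = self._snapshot
```

The point of the sketch is the ordering: detect, check competence, snapshot, then apply a clipped update that can be undone.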

This is a separate project from Intelligent_NPCs.

Initial Scope

The first prototype focuses on a simulated streaming setting rather than production deployment:

  • streaming batches from a synthetic nonstationary environment
  • latent or feature-space shift monitoring
  • simple adaptation policy with safety gating
  • uncertainty-aware output surface
  • evaluation against a frozen baseline
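As a rough sketch of what an uncertainty-aware output with a confidence filter might look like, the snippet below applies a temperature-scaled softmax and keeps only predictions above a confidence floor. The function names, the `min_confidence` default, and the use of temperature scaling are assumptions for illustration, not the package's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confident_pseudo_labels(batch_logits, min_confidence=0.9, temperature=1.0):
    """Return (index, label, confidence) for predictions that pass the filter."""
    kept = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits, temperature)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= min_confidence:
            kept.append((index, label, probs[label]))
    return kept


batch = [[5.0, 0.0], [0.1, 0.0]]
kept = confident_pseudo_labels(batch)       # only the first row is confident enough
```

Raising the temperature after a shift makes the filter stricter, which is one simple way to keep low-quality pseudo-labels out of an adaptation step.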

Repo Layout

  • docs/ project documents and research notes
  • src/adaptive_reliability_layer/ core package
  • scripts/ runnable entrypoints

Useful starting docs:

  • docs/status_paper_commercial_outreach.md
  • docs/current_findings.md
  • docs/next_step_decision_memo.md

Quick Start

Create a virtual environment and install in editable mode:

python3 -m venv .venv
source .venv/bin/activate
# Commercial runtime (torch fraud pilots + sidecar):
pip install -e ".[torch,serving]"
# Full research bench (adds WILDS):
pip install -e ".[research,serving,dev]"

PyPI-style install (when published):

pip install "adaptive-reliability-layer[torch,serving]"
arl-customer-replay --input customer.csv --config configs/customer_shadow.yaml

See docs/customer_replay.md for the design-partner replay path.

Run the simple simulation:

python3 scripts/run_simulation.py

Run the baseline benchmark:

python3 scripts/run_benchmark.py

Run the real tabular benchmark with the PyTorch adapter model:

python3 scripts/run_tabular_benchmark.py

Run the harder digits-shift benchmark:

python3 scripts/run_digits_shift_benchmark.py

Run the real-image scale-up benchmark on Fashion-MNIST:

python3 scripts/run_fashion_mnist_shift_benchmark.py

Run the delayed-label temporal Fashion-MNIST benchmark:

python3 scripts/run_temporal_fashion_mnist_benchmark.py

Run the temporal delay/severity suite and save aggregated results:

python3 scripts/run_temporal_benchmark_suite.py

Run the broader temporal paper-style suite:

python3 scripts/run_temporal_paper_suite.py

Run the recurrence-first temporal benchmark for specialist reuse:

python3 scripts/run_recurrence_temporal_benchmark.py

Run the first public WILDS benchmark path on CivilComments:

python3 scripts/run_wilds_civilcomments_benchmark.py

Run the multi-seed WILDS CivilComments suite:

python3 scripts/run_wilds_civilcomments_suite.py

Run the graph-native shift benchmark:

python3 scripts/run_graph_shift_benchmark.py

Run the multi-seed image scale-up suite across backbones and shift severities:

python3 scripts/run_image_scaleup_suite.py

Run the multi-seed suite and save aggregated results:

python3 scripts/run_benchmark_suite.py

Run the ablation suite and save aggregated results:

python3 scripts/run_ablation_suite.py

Current Status

This repository is in the research scaffolding phase. The code currently provides:

  • a stream simulator for synthetic regime shift
  • a source-reference profile and shift monitor over features plus outputs
  • a martingale-style sequential risk monitor
  • baseline policies for frozen, naive, and safety-gated adaptation
  • a multi-action controller over bn_refresh, label_shift, adapter_update, and reset
  • experimental bandit, specialist_memory, and hybrid controllers for the next research phase
  • a delayed-feedback bandit controller for temporal streams with label latency
  • a delayed-feedback hybrid controller that combines specialist memory with delayed controller learning
  • a regime-aware delayed bandit controller with short-horizon temporal state
  • confidence-filtered pseudo-label adaptation with bounded parameter drift
  • a small PyTorch encoder-plus-adapter model for adapter-only test-time updates
  • a real tabular streaming benchmark with regime-based shifts
  • a harder digits-shift benchmark that degrades the frozen source model much more sharply
  • a real-image scale-up benchmark on a fine-grained Fashion-MNIST subset with a BN-heavy convolutional model
  • a configurable image scale-up path with convnet and resnet_small backbones
  • standard and harsh image shift profiles for stress-testing adaptation policies
  • an extreme image shift profile for harder temporal stress tests
  • a delayed-label temporal image benchmark for studying feedback lag
  • a temporal benchmark suite over multiple delay and severity settings
  • smoothed, trust-weighted retrospective reward updates for delayed-feedback controller learning
  • a recurrence-first temporal benchmark for testing specialist reuse under returning regimes
  • a broader paper-style temporal suite with win-count and delta summaries
  • saved temporal paper-suite artifacts under results/temporal_paper_suite.{md,json}
  • a graph-native benchmark with topology-aware shift monitoring
  • an initial WILDS CivilComments benchmark path for testing the controller stack on a stronger public benchmark family
  • a multi-seed WILDS CivilComments suite with compact and medium public-benchmark settings
  • a multi-seed benchmark suite with JSON and Markdown outputs under results/
  • a dedicated image scale-up suite that defaults to a fast convnet confirmation loop, with slower resnet_small confirmations available when needed
  • an ablation suite for controller actions, reward shaping, and specialist-memory routing
  • a lightweight uncertainty wrapper
  • an evaluation loop for comparing outcomes over time
  • per-step traces for inspecting failure modes
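For readers unfamiliar with martingale-style sequential monitoring, here is a minimal sketch of one common construction, a betting test martingale over bounded losses. It is not the repository's monitor: the class name, the fixed bet size, and the alarm rule are assumptions. If the expected loss stays at or below `risk_budget`, the wealth process is a nonnegative supermartingale, so by Ville's inequality it reaches `1 / delta` with probability at most `delta`.

```python
class RiskMartingale:
    """Betting-style sequential monitor for losses in [0, 1] (hypothetical sketch)."""

    def __init__(self, risk_budget=0.1, bet=2.0, delta=0.05):
        if not 0.0 < risk_budget < 1.0:
            raise ValueError("risk_budget must be in (0, 1)")
        # The bet must keep every wealth factor nonnegative, even when loss == 0.
        if not 0.0 <= bet <= 1.0 / risk_budget:
            raise ValueError("bet must be in [0, 1 / risk_budget]")
        self.risk_budget = risk_budget
        self.bet = bet
        self.threshold = 1.0 / delta
        self.wealth = 1.0

    def update(self, loss):
        """Fold in one loss in [0, 1]; return True when the alarm threshold is crossed."""
        self.wealth *= 1.0 + self.bet * (loss - self.risk_budget)
        return self.wealth >= self.threshold

    def reset(self):
        """Restart the monitor, for example after a model reset."""
        self.wealth = 1.0


monitor = RiskMartingale(risk_budget=0.1, bet=2.0, delta=0.05)
quiet = [monitor.update(0.0) for _ in range(50)]   # losses under budget shrink wealth
monitor.reset()
alarms = [monitor.update(1.0) for _ in range(3)]   # sustained high loss grows wealth
```

With these settings each loss of 1.0 multiplies wealth by 2.8, so the third consecutive high loss crosses the threshold of 20.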

Current Read

The current results point to a clear architectural conclusion:

  • naive continual adaptation is consistently brittle
  • reset logic is one of the highest-leverage safety mechanisms
  • on tabular data, bandit and hybrid currently have the best utility/risk tradeoff
  • on the harder digits-shift benchmark, all serious controllers beat or match frozen while dramatically reducing risk capital
  • on the new Fashion-MNIST benchmark, the controller family roughly matches frozen accuracy while cutting sequential risk by an order of magnitude
  • the harsher image profile creates a clearer separation between frozen and controller-guided behavior
  • the temporal and graph tracks are now in place, so the controller abstractions are being exercised beyond flat iid-style batch streams
  • on the temporal suite, regime-aware delayed control helps most in some longer-delay settings, but it is still unstable across the full grid
  • on the full temporal paper suite, controller currently has the strongest aggregate utility story, while frozen still wins the most accuracy settings and delayed_bandit is the strongest delayed learner
  • after the specialist-quality upgrade, delayed_hybrid became more competitive on the full temporal paper suite, but controller and hybrid still have the strongest aggregate utility story
  • on the recurrence-first temporal benchmark, delayed_hybrid now opens multiple specialists, but the delayed branch still trails on utility and needs better routing/credit assignment
  • richer specialist signatures and support-state warm starts now produce much stronger focused long-delay results for the delayed-memory branch
  • the newest result is that delayed specialist memory should likely be regime-selective: it helps on some recurring long-delay slices, but regime_aware_delayed_bandit is still the strongest general delayed learner on the mixed long-delay grid
  • the new explicit regime encoder makes delayed control more selective and improves some long-delay slices, especially standard / 12 and extreme / 12, but it has not yet made delayed memory the strongest overall temporal branch
  • the temporal image benchmark now distinguishes between immediate-learning and true delayed-feedback bandit control
  • the temporal track now uses retrospective rewards at label reveal time, not just delayed replay of immediate utility
  • delayed_hybrid is now clearly a real branch rather than collapsing to plain delayed bandit, but it is still not the strongest temporal policy overall
  • the graph benchmark now degrades frozen performance meaningfully on structural rewiring, which makes it a better structural-shift testbed even though it still mostly highlights safety over raw accuracy recovery
  • the WILDS CivilComments path now has a real easy / hard / recurring split instead of recurring collapsing back onto the majority group
  • on the multi-seed WILDS suite, bandit and hybrid currently have the best public-benchmark utility while raw accuracy remains roughly tied with frozen
  • delayed specialist memory now forms more controlled specialist pools with richer reuse diagnostics, but it still trails non-delayed hybrid on the recurrence-first benchmark
  • the most promising recent direction is specialist quality: better routing signatures plus specialist support-state warm starts improved delayed-memory performance far more than controller decoupling did
  • those specialist-quality gains are real but mixed at full-suite scale: they improved the delayed branch’s competitiveness without yet making it the strongest overall temporal policy
  • making memory more selective helped clarify the story more than it improved the top-line numbers: routing can stay fairly loose, but warm-start reuse needs to be selective under harsher shift
  • the project looks strongest as a controller over bounded interventions, not as a single always-on adaptation rule
  • the public temporal runtime story is now sharper:
    • PaySim remains the strongest fraud-style bounded-auto success story
    • UCI Gas Sensor Drift is a neutral-but-honest maintenance benchmark
    • OpenML Electricity now uses a conservative sensor_safe profile and stands down instead of harming accuracy
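The idea of a controller over bounded interventions that learns from rewards credited at label reveal time can be sketched as an epsilon-greedy bandit with a pending-action table. This is a simplified illustration, not the `delayed_bandit` implementation: the class `DelayedBandit`, its `select` and `reveal` methods, and the running-mean update are assumptions. The action names are the ones listed in this README.

```python
import random


class DelayedBandit:
    """Epsilon-greedy bandit whose rewards arrive after a label delay (sketch)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}
        self.pending = {}                      # step -> action awaiting its reward

    def select(self, step):
        """Choose an action now; its reward is unknown until labels are revealed."""
        if self.rng.random() < self.epsilon:
            action = self.rng.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.values[a])
        self.pending[step] = action
        return action

    def reveal(self, step, reward):
        """Credit a retrospective reward to the action that was taken at `step`."""
        action = self.pending.pop(step)
        self.counts[action] += 1
        # Incremental running mean of observed rewards for this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]
        return action


bandit = DelayedBandit(["bn_refresh", "label_shift", "adapter_update", "reset"], epsilon=0.0)
first = bandit.select(step=0)
second = bandit.select(step=1)      # still no feedback, so the same greedy choice
bandit.reveal(step=0, reward=-1.0)  # labels for step 0 arrive late
third = bandit.select(step=2)       # the penalized action is no longer preferred
```

Until the first reveal the controller is acting blind, which is one reason budgets and reset logic matter more as the label delay grows.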

Commercial Deployment Runtime

The repo now includes a production-oriented runtime layer on top of the research benchmarks:

  • ReliabilityLayer: stable deployment surface for every batch (predictions, shift/risk scores, recommended/taken actions, trust state, rollback metadata)
  • Decision record schema: stable operator-facing payload with regime_id, regime_confidence, risk_score, why_this_action, rollback_eligible, and retrain_recommended
  • Operating modes: shadow, recommend, bounded_auto
  • Safety budgets: per-window caps on auto-actions / resets with downgrade-to-recommend when budgets are exhausted
  • Model adapters: torch_tabular, sklearn, black_box
  • Governance: SQLite audit log, versioned snapshots, one-click rollback
  • Offline replay: canonical CSV/Parquet historical streams with label-delay simulation and reveal_labels(step, labels) for delayed supervision
  • Runtime policies: delayed_bandit, regime_aware_delayed_bandit (ported from research for fraud pilots)
  • Operator + buyer reports: technical_report.md, operator_report.md, buyer_report.md, and replay schema artifacts for every pilot / public-story run
  • Dual-metric reports: shadow vs bounded_auto on the same stream (dual_metric_report.md)
  • HTTP serving: FastAPI sidecar (/v1/batch, /v1/batch/{step}/labels, /v1/approve, /v1/health, /v1/metrics)
  • Profile-aware runtime control: drift signatures plus bounded action profiles for fraud, sensor, and conservative sensor_safe streams
  • Ingest contract: canonical replay schema (timestamp, label, feature_*, optional regime, optional meta_*) → replay stream
  • Policy persistence: save/load bandit + regime encoder state across restarts
  • Configurable KPIs: kpi block in runtime YAML for buyer-facing scores
  • Pilot framework: fraud/risk-style case study with saved KPI report
  • Public ops stories: one-command replay artifacts for public datasets (scripts/run_public_ops_story.py)
  • Observability: optional Prometheus metrics endpoint
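To make the decision record concrete, here is a minimal sketch of such a payload as a frozen dataclass. The six field names come from the list above; the class name `DecisionRecord`, the field types, the example values, and the JSON serialization are assumptions, and the real schema may carry more fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    # Field names follow the README; the types are assumed for illustration.
    regime_id: str
    regime_confidence: float
    risk_score: float
    why_this_action: str
    rollback_eligible: bool
    retrain_recommended: bool

    def to_json(self) -> str:
        """Serialize with sorted keys so audit-log diffs stay stable."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values, not output from a real run.
record = DecisionRecord(
    regime_id="regime_3",
    regime_confidence=0.82,
    risk_score=0.41,
    why_this_action="shift score above threshold; budget available",
    rollback_eligible=True,
    retrain_recommended=False,
)
payload = json.loads(record.to_json())
```

Freezing the record keeps an audited decision from being mutated after it has been logged.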

Product milestone checklist (dual-metric pilots, verification, sidecar): docs/product_milestones.md. Run all five with:

python3 scripts/run_product_milestones.py

Quick start (commercial path)

pip install -e ".[dev,prometheus,serving]"

# Shadow-mode offline replay on synthetic fraud-like stream
python3 scripts/run_offline_replay.py --synthetic --config configs/default.yaml

# Pilot case study artifact (report + JSON KPIs; dual-metric when layer_builder is set)
python3 scripts/run_pilot_case_study.py --config configs/pilot_fraud_tabular.yaml

# PaySim torch pilot with regime-aware delayed bandit + dual-metric report
python3 scripts/run_pilot_torch.py

# Ingest CSV/JSONL and replay (optional --dual-mode)
python3 scripts/run_ingest_replay.py --input data/openml/credit_german.csv --dual-mode

# Canonical offline replay on a CSV or Parquet stream
python3 scripts/run_offline_replay.py --input data/openml/credit_german.csv

# Public ops story on a public dataset
python3 scripts/run_public_ops_story.py --source-id paysim_fraud --controller-name multi_action --stream-cycles 4

# Correction-centric parallel-path evaluation on the fraud SOTA suite
python3 scripts/run_correction_path_evaluation.py

# Focused decomposition of the flagship fraud SOTA lane
python3 scripts/run_production_failure_analysis.py --source ieee_cis_fraud_torch

# HTTP sidecar (production pilot)
pip install -e ".[serving,prometheus]"
python3 scripts/export_bundled_fraud_data.py
python3 scripts/run_serve.py --config configs/serving_pilot_fraud_torch.yaml --force-shadow
python3 scripts/run_serving_parity.py

Docs: docs/sidecar_production.md | Docker: docker compose up arl-sidecar

Prometheus metrics (optional)

python3 scripts/run_metrics_server.py --config configs/default.yaml


Configuration

Default runtime config: configs/default.yaml

Pilot config: configs/pilot_fraud_tabular.yaml

Key fields:

  • operating_mode: shadow | recommend | bounded_auto
  • bounded_auto_actions: low-risk actions allowed in bounded auto mode
  • safety_budget.window_steps: control horizon for bounded-auto budgets
  • safety_budget.max_auto_actions_per_window: automatic intervention cap per horizon
  • safety_budget.max_resets_per_window: reset cap per horizon
  • safety_budget.downgrade_to_recommend: force human approval when budgets are exhausted
  • governance.audit_db_path / governance.snapshot_dir: audit + rollback storage
  • replay.label_delay_steps: delayed-label simulation for offline replay
  • replay_schema.md: generated canonical input contract for customer logs
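One way the safety_budget fields could interact is sketched below. The field names mirror the config keys above, but the classes, the fixed (non-sliding) window, and the returned strings are assumptions rather than the runtime's actual behavior.

```python
from dataclasses import dataclass


@dataclass
class SafetyBudget:
    # Names mirror the safety_budget.* config keys; the defaults are made up.
    window_steps: int = 100
    max_auto_actions_per_window: int = 5
    max_resets_per_window: int = 1
    downgrade_to_recommend: bool = True


class BudgetTracker:
    """Tracks auto-actions per fixed window and decides how an action may run."""

    def __init__(self, budget):
        self.budget = budget
        self.window_index = 0
        self.auto_actions = 0
        self.resets = 0

    def authorize(self, step, action):
        """Return 'auto', 'recommend' (needs human approval), or 'blocked'."""
        window = step // self.budget.window_steps
        if window != self.window_index:         # a new window restores the budget
            self.window_index = window
            self.auto_actions = 0
            self.resets = 0
        over_actions = self.auto_actions >= self.budget.max_auto_actions_per_window
        over_resets = action == "reset" and self.resets >= self.budget.max_resets_per_window
        if over_actions or over_resets:
            return "recommend" if self.budget.downgrade_to_recommend else "blocked"
        self.auto_actions += 1
        if action == "reset":
            self.resets += 1
        return "auto"


tracker = BudgetTracker(SafetyBudget(window_steps=10, max_auto_actions_per_window=2))
decisions = [
    tracker.authorize(0, "bn_refresh"),   # within budget
    tracker.authorize(1, "reset"),        # within budget, uses the only reset
    tracker.authorize(2, "bn_refresh"),   # action budget exhausted
    tracker.authorize(10, "reset"),       # new window, budget restored
    tracker.authorize(11, "reset"),       # reset budget exhausted
]
```

A sliding window would smooth the budget across window boundaries at the cost of tracking per-step history; the fixed window here is the simpler of the two.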

Public ops story artifacts

Every public or pilot replay now writes:

  • technical_report.md: replay summary and per-strategy metrics
  • operator_report.md: intervention timeline and top drift episodes
  • buyer_report.md: risk / accuracy / retrain-deferral summary
  • summary.json: machine-readable KPI payload
  • replay_schema.md: canonical log schema for ingestion
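As an illustration of the canonical replay schema (timestamp, label, feature_*, optional regime, optional meta_*) and of label-delay simulation, the sketch below checks a header row and computes when each step's labels would be revealed. Both helper functions are hypothetical. In particular, rejecting unknown columns is an assumption; the generated replay_schema.md is the authoritative contract.

```python
def validate_replay_columns(columns):
    """Return a list of problems found in a replay file's column names."""
    cols = list(columns)
    problems = []
    for required in ("timestamp", "label"):
        if required not in cols:
            problems.append(f"missing required column: {required}")
    if not any(c.startswith("feature_") for c in cols):
        problems.append("no feature_* columns found")
    for c in cols:
        known = c in ("timestamp", "label", "regime") or c.startswith(("feature_", "meta_"))
        if not known:
            # Assumption: unknown columns are flagged rather than silently ignored.
            problems.append(f"unexpected column: {c}")
    return problems


def delayed_label_schedule(num_steps, label_delay_steps):
    """Yield (step, revealed_step); revealed_step is None until labels start arriving."""
    for step in range(num_steps):
        revealed = step - label_delay_steps
        yield step, (revealed if revealed >= 0 else None)


ok = validate_replay_columns(["timestamp", "label", "feature_amount", "meta_id"])
bad = validate_replay_columns(["timestamp", "feature_a", "user"])
schedule = list(delayed_label_schedule(num_steps=4, label_delay_steps=2))
```

With a delay of 2, the labels for step 0 become available while the model is already serving step 2, which is the gap the delayed-feedback controllers have to act across.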

Current strongest public fraud/risk ops story:

  • results/ops_story_paysim_fraud_multi_action/
  • bounded auto improved accuracy from 87.0% to 96.9%
  • intervention rate stayed at 4.4% of batches
  • retrain trigger was deferred by 1 step
  • harmful drift events avoided vs frozen: 2
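To put the accuracy figures above in perspective, the same two numbers can be read as an error-rate change. The arithmetic below uses only the 87.0% and 96.9% values quoted here; the variable names are illustrative.

```python
baseline_accuracy = 0.870       # accuracy before bounded auto, as quoted above
bounded_auto_accuracy = 0.969   # accuracy with bounded auto, as quoted above

# Absolute gain in percentage points.
gain_points = (bounded_auto_accuracy - baseline_accuracy) * 100

# Relative reduction in error rate (13.0% of batches wrong down to 3.1%).
baseline_error = 1.0 - baseline_accuracy
bounded_auto_error = 1.0 - bounded_auto_accuracy
relative_error_reduction = (baseline_error - bounded_auto_error) / baseline_error

print(f"accuracy gain: {gain_points:.1f} points")                   # 9.9 points
print(f"relative error reduction: {relative_error_reduction:.1%}")  # 76.2%
```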

Real-data verification (before design partners)

Verify the commercial runtime across multiple public datasets:

python3 scripts/run_real_data_verification.py --config configs/real_data_verification.yaml

Sources included by default:

| Source | Type | Wedge |
| --- | --- | --- |
| breast_cancer | sklearn UCI | general tabular |
| digits | sklearn | general tabular |
| tabular_breast_cancer_shift | in-repo shift stream | general tabular |
| openml_credit_g | OpenML German Credit | fraud/risk adjacent |
| paysim_fraud | PaySim-style synthetic mobile money | fraud ops proxy (time-ordered) |
| ieee_cis_fraud | IEEE-CIS sample or synthetic fallback | imbalanced fraud tabular |
| openml_electricity | OpenML Electricity | predictive maintenance proxy |
| uci_gas_sensor_drift | UCI Gas Sensor Array Drift | natural batch-chronological drift benchmark |
| wilds_civilcomments_csv | local WILDS CSV | public NLP / moderation |

Each source runs through all 8 commercial priorities: deployment surface, operating modes, offline replay, model adapters, engineering maturity, observability hooks, governance/audit, and real-data KPI evidence.

Results are saved under results/real_data_verification/.

Bundled offline fallbacks for OpenML-style datasets live in data/openml/ (UCI German Credit + Spambase). Regenerate with:

python3 scripts/export_bundled_real_data.py

Grafana dashboard template: observability/grafana/arl_dashboard.json

Docker

docker compose run --rm replay
docker compose up metrics

Download files

Source Distribution

adaptive_reliability_layer-0.3.1.tar.gz (201.6 kB)

Uploaded Source

Built Distribution

adaptive_reliability_layer-0.3.1-py3-none-any.whl (216.1 kB)

Uploaded Python 3

File details

Details for the file adaptive_reliability_layer-0.3.1.tar.gz.

File metadata

File hashes

Hashes for adaptive_reliability_layer-0.3.1.tar.gz
Algorithm Hash digest
SHA256 c36a43e65e839586e0a6b3360759bb48a516323805c46a53e0af642444646e0e
MD5 b446d5b8afdcdf8c74ff1a4861d9a4ff
BLAKE2b-256 6ecb57f7df9f12138265f084a40e2adfe564d1f690703b6d43f131b31fc10404

File details

Details for the file adaptive_reliability_layer-0.3.1-py3-none-any.whl.

File metadata

File hashes

Hashes for adaptive_reliability_layer-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cc18c0f88791071fb96ae518100ebf697339a20f49bd512519a62425be9d9828
MD5 17f620f99cba6ffcedc5d56517ae653f
BLAKE2b-256 5ac1b0ab2c6ebeb6ca67eabbf4e762d56ed8820ed05dc177b617a593a1e82288
