Research prototype and commercial runtime for safe continual test-time adaptation under distribution shift.

Adaptive Reliability Layer

An experimental research codebase for safe continual test-time adaptation under distribution shift.

Project Goal

The project explores whether a deployed ML model can be paired with an adaptive reliability layer that:

detects distribution shift online
estimates whether the model remains in its competence zone
applies bounded, reversible adaptation when safe
preserves source knowledge over long streams
recalibrates uncertainty after adaptation
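
The "bounded, reversible adaptation" goal above can be illustrated with a minimal sketch. This is not the project's implementation: the class name `BoundedAdapter`, the `max_drift` cap, and the per-coordinate clipping rule are all assumptions made for illustration. The idea is only that each update is clipped to a fixed distance from the source parameters and can be undone from a saved snapshot.

```python
class BoundedAdapter:
    """Toy parameter vector with clipped updates, one-step rollback, and reset.

    Hypothetical illustration only; names and the clipping rule are assumptions.
    """

    def __init__(self, params, max_drift=0.5):
        self.params = list(params)
        self.max_drift = max_drift      # assumed cap on |param - source| per coordinate
        self._source = list(params)     # frozen source reference
        self._snapshot = list(params)   # state saved before the latest update

    def update(self, step):
        """Apply a proposed step, then clip each coordinate to the drift bound."""
        self._snapshot = list(self.params)
        for i, delta in enumerate(step):
            low = self._source[i] - self.max_drift
            high = self._source[i] + self.max_drift
            self.params[i] = min(high, max(low, self.params[i] + delta))

    def rollback(self):
        """Restore the parameters saved before the most recent update."""
        self.params = list(self._snapshot)

    def reset(self):
        """Return to the source parameters."""
        self.params = list(self._source)


adapter = BoundedAdapter(params=[0.0, 1.0])
adapter.update([2.0, -0.25])      # the first coordinate is clipped to 0.5
clipped = list(adapter.params)
adapter.rollback()                # undo the update
restored = list(adapter.params)
```

Keeping the source parameters separate from the rollback snapshot means a one-step undo and a full reset to source stay distinct operations.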

This is a separate project from Intelligent_NPCs.

Initial Scope

The first prototype focuses on a simulated streaming setting rather than production deployment:

streaming batches from a synthetic nonstationary environment
latent or feature-space shift monitoring
simple adaptation policy with safety gating
uncertainty-aware output surface
evaluation against a frozen baseline
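
The streaming setup described above can be sketched in a few lines, under stated assumptions. The monitor here is a plain z-score of the batch mean against a source reference for a single feature, and the gate thresholds (`warn`, `block`) and action names returned by `gated_action` are hypothetical; the package's actual monitor and policy are not shown in this README.

```python
import random
import statistics


def shift_score(batch, ref_mean, ref_std):
    """Z-score of the batch mean against the source reference (one feature)."""
    standard_error = ref_std / len(batch) ** 0.5
    return abs(statistics.fmean(batch) - ref_mean) / standard_error


def gated_action(score, warn=3.0, block=8.0):
    """Assumed gate: adapt on moderate shift, stand down on extreme shift."""
    if score < warn:
        return "no_op"
    if score < block:
        return "adapt"
    return "stand_down"


rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ref_mean, ref_std = statistics.fmean(source), statistics.stdev(source)

# Synthetic nonstationary stream: the feature mean moves at step 5 and again at step 8.
regime_means = [0.0] * 5 + [0.5] * 3 + [3.0] * 2
actions = []
for mean in regime_means:
    batch = [rng.gauss(mean, 1.0) for _ in range(64)]
    actions.append(gated_action(shift_score(batch, ref_mean, ref_std)))
```

Comparing `actions` against a policy that always returns `"no_op"` gives the frozen-baseline comparison the scope list mentions.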

Repo Layout

docs/ project documents and research notes
src/adaptive_reliability_layer/ core package
scripts/ runnable entrypoints

Useful starting docs:

docs/status_paper_commercial_outreach.md
docs/current_findings.md
docs/next_step_decision_memo.md

Quick Start

Create a virtual environment and install in editable mode:

python3 -m venv .venv
source .venv/bin/activate
# Commercial runtime (torch fraud pilots + sidecar):
pip install -e ".[torch,serving]"
# Full research bench (adds WILDS):
pip install -e ".[research,serving,dev]"

Install from PyPI:

pip install "adaptive-reliability-layer[torch,serving]"
arl-customer-replay --input customer.csv --config configs/customer_shadow.yaml

See docs/customer_replay.md for the design-partner replay path.

Run the simple simulation:

python3 scripts/run_simulation.py

Run the baseline benchmark:

python3 scripts/run_benchmark.py

Run the real tabular benchmark with the PyTorch adapter model:

python3 scripts/run_tabular_benchmark.py

Run the harder digits-shift benchmark:

python3 scripts/run_digits_shift_benchmark.py

Run the real-image scale-up benchmark on Fashion-MNIST:

python3 scripts/run_fashion_mnist_shift_benchmark.py

Run the delayed-label temporal Fashion-MNIST benchmark:

python3 scripts/run_temporal_fashion_mnist_benchmark.py

Run the temporal delay/severity suite and save aggregated results:

python3 scripts/run_temporal_benchmark_suite.py

Run the broader temporal paper-style suite:

python3 scripts/run_temporal_paper_suite.py

Run the recurrence-first temporal benchmark for specialist reuse:

python3 scripts/run_recurrence_temporal_benchmark.py

Run the first public WILDS benchmark path on CivilComments:

python3 scripts/run_wilds_civilcomments_benchmark.py

Run the multi-seed WILDS CivilComments suite:

python3 scripts/run_wilds_civilcomments_suite.py

Run the graph-native shift benchmark:

python3 scripts/run_graph_shift_benchmark.py

Run the multi-seed image scale-up suite across backbones and shift severities:

python3 scripts/run_image_scaleup_suite.py

Run the multi-seed suite and save aggregated results:

python3 scripts/run_benchmark_suite.py

Run the ablation suite and save aggregated results:

python3 scripts/run_ablation_suite.py

Current Status

This repository is in the research scaffolding phase. The code currently provides:

a stream simulator for synthetic regime shift
a source-reference profile and shift monitor over features plus outputs
a martingale-style sequential risk monitor
baseline policies for frozen, naive, and safety-gated adaptation
a multi-action controller over bn_refresh, label_shift, adapter_update, and reset
experimental bandit, specialist_memory, and hybrid controllers for the next research phase
a delayed-feedback bandit controller for temporal streams with label latency
a delayed-feedback hybrid controller that combines specialist memory with delayed controller learning
a regime-aware delayed bandit controller with short-horizon temporal state
confidence-filtered pseudo-label adaptation with bounded parameter drift
a small PyTorch encoder-plus-adapter model for adapter-only test-time updates
a real tabular streaming benchmark with regime-based shifts
a harder digits-shift benchmark that degrades the frozen source model much more sharply
a real-image scale-up benchmark on a fine-grained Fashion-MNIST subset with a BN-heavy convolutional model
a configurable image scale-up path with convnet and resnet_small backbones
standard and harsh image shift profiles for stress-testing adaptation policies
an extreme image shift profile for harder temporal stress tests
a delayed-label temporal image benchmark for studying feedback lag
a temporal benchmark suite over multiple delay and severity settings
smoothed, trust-weighted retrospective reward updates for delayed-feedback controller learning
a recurrence-first temporal benchmark for testing specialist reuse under returning regimes
a broader paper-style temporal suite with win-count and delta summaries
saved temporal paper-suite artifacts under results/temporal_paper_suite.{md,json}
a graph-native benchmark with topology-aware shift monitoring
an initial WILDS CivilComments benchmark path for testing the controller stack on a stronger public benchmark family
a multi-seed WILDS CivilComments suite with compact and medium public-benchmark settings
a multi-seed benchmark suite with JSON and Markdown outputs under results/
a dedicated image scale-up suite that defaults to a fast convnet confirmation loop, with slower resnet_small confirmations available when needed
an ablation suite for controller actions, reward shaping, and specialist-memory routing
a lightweight uncertainty wrapper
an evaluation loop for comparing outcomes over time
per-step traces for inspecting failure modes
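
To make the "martingale-style sequential risk monitor" item more concrete, here is one common construction, offered as a sketch rather than as the monitor this package ships. It is a betting-style test supermartingale for the hypothesis that the error rate stays at or below a tolerated level. The class name `BettingRiskMonitor` and the parameters `tolerated_rate`, `bet`, and `alpha` are assumptions.

```python
class BettingRiskMonitor:
    """Test supermartingale for H0: error rate <= tolerated_rate.

    Hypothetical sketch; not the package's monitor.
    """

    def __init__(self, tolerated_rate=0.1, bet=2.0, alpha=0.05):
        if not 0.0 < tolerated_rate < 1.0:
            raise ValueError("tolerated_rate must lie in (0, 1)")
        if not 0.0 <= bet < 1.0 / tolerated_rate:
            raise ValueError("bet must lie in [0, 1/tolerated_rate) so wealth stays positive")
        self.tolerated_rate = tolerated_rate
        self.bet = bet
        # Ville's inequality: under H0, P(wealth ever reaches 1/alpha) <= alpha.
        self.threshold = 1.0 / alpha
        self.wealth = 1.0

    def update(self, error):
        """Fold in one 0/1 error indicator; return True once the alarm fires."""
        self.wealth *= 1.0 + self.bet * (float(error) - self.tolerated_rate)
        return self.wealth >= self.threshold


monitor = BettingRiskMonitor()
alarm_step = None
for step, error in enumerate([True, True, True, True], start=1):
    if monitor.update(error):
        alarm_step = step
        break
```

Each error multiplies the wealth by a factor above one and each correct prediction shrinks it, so the alarm only fires when errors arrive faster than the tolerated rate allows. A controller could treat the alarm as a trigger for `reset`, though that coupling is an assumption here.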

Current Read

The current results point to a clear architectural conclusion:

naive continual adaptation is consistently brittle
reset logic is one of the highest-leverage safety mechanisms
on tabular data, bandit and hybrid currently have the best utility/risk tradeoff
on the harder digits-shift benchmark, all serious controllers beat or match frozen while dramatically reducing risk capital
on the new Fashion-MNIST benchmark, the controller family roughly matches frozen accuracy while cutting sequential risk by an order of magnitude
the harsher image profile creates a clearer separation between frozen and controller-guided behavior
the temporal and graph tracks are now in place, so the controller abstractions are being exercised beyond flat i.i.d.-style batch streams
on the temporal suite, regime-aware delayed control helps most in some longer-delay settings, but it is still unstable across the full grid
on the full temporal paper suite, controller currently has the strongest aggregate utility story, while frozen still wins the most accuracy settings and delayed_bandit is the strongest delayed learner
after the specialist-quality upgrade, delayed_hybrid became more competitive on the full temporal paper suite, but controller and hybrid still have the strongest aggregate utility story
on the recurrence-first temporal benchmark, delayed_hybrid now opens multiple specialists, but the delayed branch still trails on utility and needs better routing/credit assignment
richer specialist signatures and support-state warm starts now produce much stronger focused long-delay results for the delayed-memory branch
the newest result is that delayed specialist memory should likely be regime-selective: it helps on some recurring long-delay slices, but regime_aware_delayed_bandit is still the strongest general delayed learner on the mixed long-delay grid
the new explicit regime encoder makes delayed control more selective and improves some long-delay slices, especially standard / 12 and extreme / 12, but it has not yet made delayed memory the strongest overall temporal branch
the temporal image benchmark now distinguishes between immediate-learning and true delayed-feedback bandit control
the temporal track now uses retrospective rewards at label reveal time, not just delayed replay of immediate utility
delayed_hybrid is now clearly a real branch rather than collapsing to plain delayed bandit, but it is still not the strongest temporal policy overall
the graph benchmark now degrades frozen performance meaningfully on structural rewiring, which makes it a better structural-shift testbed even though it still mostly highlights safety over raw accuracy recovery
the WILDS CivilComments path now has a real easy / hard / recurring split instead of recurring collapsing back onto the majority group
on the multi-seed WILDS suite, bandit and hybrid currently have the best public-benchmark utility while raw accuracy remains roughly tied with frozen
delayed specialist memory now forms more controlled specialist pools with richer reuse diagnostics, but it still trails non-delayed hybrid on the recurrence-first benchmark
the most promising recent direction is specialist quality: better routing signatures plus specialist support-state warm starts improved delayed-memory performance far more than controller decoupling did
those specialist-quality gains are real but mixed at full-suite scale: they improved the delayed branch’s competitiveness without yet making it the strongest overall temporal policy
making memory more selective helped clarify the story more than it improved the top-line numbers: routing can stay fairly loose, but warm-start reuse needs to be selective under harsher shift
the project looks strongest as a controller over bounded interventions, not as a single always-on adaptation rule
the public temporal runtime story is now sharper:
- PaySim remains the strongest fraud-style bounded-auto success story
- UCI Gas Sensor Drift is a neutral-but-honest maintenance benchmark
- OpenML Electricity now uses a conservative sensor_safe profile and stands down instead of harming accuracy
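
The distinction drawn above between immediate utility and retrospective rewards at label-reveal time can be sketched as follows. The `DelayedLabelBuffer` class is hypothetical and is not the package's API. Its `reveal_labels(step, labels)` method only borrows the name of the replay call mentioned in this README, and the reward (batch accuracy) is an assumed stand-in for whatever reward the controllers actually use.

```python
class DelayedLabelBuffer:
    """Holds predictions until their labels arrive, then credits the action taken."""

    def __init__(self):
        self._pending = {}   # step -> (predictions, action taken at that step)
        self.rewards = {}    # step -> reward credited when labels were revealed

    def record(self, step, predictions, action):
        self._pending[step] = (list(predictions), action)

    def reveal_labels(self, step, labels):
        """Score the stored predictions for `step`; return (action, reward)."""
        predictions, action = self._pending.pop(step)
        correct = sum(p == y for p, y in zip(predictions, labels))
        reward = correct / len(labels)
        self.rewards[step] = reward
        return action, reward


buffer = DelayedLabelBuffer()
label_delay_steps = 2
stream = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),   # (predictions, true labels) per step
    ([0, 0, 1, 1], [0, 0, 1, 1]),
    ([1, 1, 1, 1], [0, 1, 1, 0]),
]
truth = {}
credited = []
for step, (predictions, labels) in enumerate(stream):
    buffer.record(step, predictions, action="adapter_update")
    truth[step] = labels
    due = step - label_delay_steps          # labels for this earlier step arrive now
    if due in truth:
        credited.append((due, *buffer.reveal_labels(due, truth.pop(due))))
```

The reward is attached to the step where the action was taken, not to the step where the labels arrived, which is what separates retrospective credit from delayed replay of an immediate score.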

Commercial Deployment Runtime

The repo now includes a production-oriented runtime layer on top of the research benchmarks:

ReliabilityLayer — stable deployment surface for every batch (predictions, shift/risk scores, recommended/taken actions, trust state, rollback metadata)
Decision record schema — stable operator-facing payload with regime_id, regime_confidence, risk_score, why_this_action, rollback_eligible, and retrain_recommended
Operating modes — shadow, recommend, bounded_auto
Safety budgets — per-window caps on auto-actions / resets with downgrade-to-recommend when budgets are exhausted
Model adapters — torch_tabular, sklearn, black_box
Governance — SQLite audit log, versioned snapshots, one-click rollback
Offline replay — canonical CSV/Parquet historical streams with label-delay simulation and reveal_labels(step, labels) for delayed supervision
Runtime policies — delayed_bandit, regime_aware_delayed_bandit (ported from research for fraud pilots)
Operator + buyer reports — technical_report.md, operator_report.md, buyer_report.md, and replay schema artifacts for every pilot / public-story run
Dual-metric reports — shadow vs bounded_auto on the same stream (dual_metric_report.md)
HTTP serving — FastAPI sidecar (/v1/batch, /v1/batch/{step}/labels, /v1/approve, /v1/health, /v1/metrics)
Profile-aware runtime control — drift signatures plus bounded action profiles for fraud, sensor, and conservative sensor_safe streams
Ingest contract — canonical replay schema (timestamp, label, feature_*, optional regime, optional meta_*) → replay stream
Policy persistence — save/load bandit + regime encoder state across restarts
Configurable KPIs — kpi block in runtime YAML for buyer-facing scores
Pilot framework — fraud/risk-style case study with saved KPI report
Public ops stories — one-command replay artifacts for public datasets (scripts/run_public_ops_story.py)
Observability — optional Prometheus metrics endpoint
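
As a rough illustration of the decision record and operating modes listed above, the sketch below builds a record with the documented field names and resolves which action is actually taken. The field types, the example values, the `resolve_action` helper, and the exact semantics of each mode are assumptions; the real `ReliabilityLayer` surface may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    # Field names follow the list above; the types are assumptions.
    regime_id: str
    regime_confidence: float
    risk_score: float
    why_this_action: str
    rollback_eligible: bool
    retrain_recommended: bool


def resolve_action(mode, recommended, allowed_auto, approved=False):
    """Return the action taken under an assumed reading of the three modes."""
    if mode == "shadow":
        return "no_op"                      # observe and log only
    if mode == "recommend":
        return recommended if approved else "no_op"
    if mode == "bounded_auto":
        if recommended in allowed_auto or approved:
            return recommended
        return "no_op"                      # outside the allowed set, wait for approval
    raise ValueError(f"unknown operating mode: {mode}")


# Example values are made up for illustration.
record = DecisionRecord("regime_2", 0.81, 0.37, "BN statistics drifted", True, False)
payload = json.dumps(asdict(record), sort_keys=True)
taken = resolve_action("bounded_auto", "bn_refresh", allowed_auto={"bn_refresh"})
```

Separating the recommended action from the action taken is what lets shadow and bounded_auto runs be compared on the same stream.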

Product milestone checklist (dual-metric pilots, verification, sidecar): docs/product_milestones.md. Run all five with:

python3 scripts/run_product_milestones.py

Quick start (commercial path)

pip install -e ".[dev,prometheus,serving]"

# Shadow-mode offline replay on synthetic fraud-like stream
python3 scripts/run_offline_replay.py --synthetic --config configs/default.yaml

# Pilot case study artifact (report + JSON KPIs; dual-metric when layer_builder is set)
python3 scripts/run_pilot_case_study.py --config configs/pilot_fraud_tabular.yaml

# PaySim torch pilot with regime-aware delayed bandit + dual-metric report
python3 scripts/run_pilot_torch.py

# Ingest CSV/JSONL and replay (optional --dual-mode)
python3 scripts/run_ingest_replay.py --input data/openml/credit_german.csv --dual-mode

# Canonical offline replay on a CSV or Parquet stream
python3 scripts/run_offline_replay.py --input data/openml/credit_german.csv

# Public ops story on a public dataset
python3 scripts/run_public_ops_story.py --source-id paysim_fraud --controller-name multi_action --stream-cycles 4

# Correction-centric parallel-path evaluation on the fraud SOTA suite
python3 scripts/run_correction_path_evaluation.py

# Focused decomposition of the flagship fraud SOTA lane
python3 scripts/run_production_failure_analysis.py --source ieee_cis_fraud_torch

# HTTP sidecar (production pilot)

```bash
pip install -e ".[serving,prometheus]"
python3 scripts/export_bundled_fraud_data.py
python3 scripts/run_serve.py --config configs/serving_pilot_fraud_torch.yaml --force-shadow
python3 scripts/run_serving_parity.py
```

Docs: docs/sidecar_production.md | Docker: docker compose up arl-sidecar

Prometheus metrics (optional)

python3 scripts/run_metrics_server.py --config configs/default.yaml


### Configuration

Default runtime config: `configs/default.yaml`

Pilot config: `configs/pilot_fraud_tabular.yaml`

Key fields:

- `operating_mode`: `shadow` | `recommend` | `bounded_auto`
- `bounded_auto_actions`: low-risk actions allowed in bounded auto mode
- `safety_budget.window_steps`: control horizon for bounded-auto budgets
- `safety_budget.max_auto_actions_per_window`: automatic intervention cap per horizon
- `safety_budget.max_resets_per_window`: reset cap per horizon
- `safety_budget.downgrade_to_recommend`: force human approval when budgets are exhausted
- `governance.audit_db_path` / `governance.snapshot_dir`: audit + rollback storage
- `replay.label_delay_steps`: delayed-label simulation for offline replay
- `replay_schema.md`: generated canonical input contract for customer logs (an output artifact rather than a config field)
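
The `safety_budget` fields above suggest a per-window cap with a fallback to human approval. The following is a minimal sketch of that behavior, not the runtime's code. It assumes a sliding window, assumes that resets also count toward the auto-action cap, and uses a hypothetical `SafetyBudget` class whose constructor arguments mirror the config keys.

```python
from collections import deque


class SafetyBudget:
    """Sliding-window caps on automatic actions and resets (assumed semantics)."""

    def __init__(self, window_steps, max_auto_actions_per_window,
                 max_resets_per_window, downgrade_to_recommend=True):
        self.window_steps = window_steps
        self.max_auto = max_auto_actions_per_window
        self.max_resets = max_resets_per_window
        self.downgrade_to_recommend = downgrade_to_recommend
        self._auto_steps = deque()
        self._reset_steps = deque()

    def _expire(self, step):
        # Drop entries that have fallen out of the last `window_steps` steps.
        for history in (self._auto_steps, self._reset_steps):
            while history and history[0] <= step - self.window_steps:
                history.popleft()

    def authorize(self, step, action):
        """Return 'auto', 'recommend', or 'blocked' for a proposed action."""
        self._expire(step)
        over_budget = len(self._auto_steps) >= self.max_auto or (
            action == "reset" and len(self._reset_steps) >= self.max_resets
        )
        if over_budget:
            return "recommend" if self.downgrade_to_recommend else "blocked"
        self._auto_steps.append(step)
        if action == "reset":
            self._reset_steps.append(step)
        return "auto"


budget = SafetyBudget(window_steps=10, max_auto_actions_per_window=2,
                      max_resets_per_window=1)
decisions = [budget.authorize(step, "bn_refresh") for step in (0, 1, 2, 12)]
```

In this run the third request is downgraded because two automatic actions already sit inside the window, and the request at step 12 is automatic again once the earlier entries have expired.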

### Public ops story artifacts

Every public or pilot replay now writes:

- `technical_report.md`: replay summary and per-strategy metrics
- `operator_report.md`: intervention timeline and top drift episodes
- `buyer_report.md`: risk / accuracy / retrain-deferral summary
- `summary.json`: machine-readable KPI payload
- `replay_schema.md`: canonical log schema for ingestion
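
The canonical replay schema is described in this README as `timestamp`, `label`, `feature_*`, optional `regime`, and optional `meta_*`. A header check along those lines might look like the sketch below. The `validate_replay_header` function is hypothetical, and two of its rules are assumptions: that at least one `feature_*` column is required and that unrecognized columns are rejected.

```python
import csv
import io


def validate_replay_header(columns):
    """Check column names against the replay schema; return the feature columns."""
    missing = {"timestamp", "label"} - set(columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    features = [c for c in columns if c.startswith("feature_")]
    if not features:
        raise ValueError("at least one feature_* column is required")
    unexpected = [
        c for c in columns
        if c not in {"timestamp", "label", "regime"}
        and not c.startswith(("feature_", "meta_"))
    ]
    if unexpected:
        raise ValueError(f"unexpected columns: {unexpected}")
    return features


# Made-up sample log with two features and one metadata column.
sample = io.StringIO(
    "timestamp,label,feature_amount,feature_hour,meta_source\n"
    "2024-01-01T00:00:00,0,12.5,0,demo\n"
)
reader = csv.DictReader(sample)
feature_columns = validate_replay_header(reader.fieldnames)
rows = list(reader)
```

Running a check like this before replay surfaces schema problems in customer logs early, instead of partway through a long stream.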

Current strongest public fraud/risk ops story:

- `results/ops_story_paysim_fraud_multi_action/`
- bounded auto improved accuracy from `87.0%` to `96.9%`
- intervention rate stayed at `4.4%` of batches
- retrain trigger was deferred by `1` step
- harmful drift events avoided vs frozen: `2`

### Real-data verification (before design partners)

Verify the commercial runtime across multiple public datasets:

```bash
python3 scripts/run_real_data_verification.py --config configs/real_data_verification.yaml
```

Sources included by default:

| Source | Type | Wedge |
| --- | --- | --- |
| `breast_cancer` | sklearn UCI | general tabular |
| `digits` | sklearn | general tabular |
| `tabular_breast_cancer_shift` | in-repo shift stream | general tabular |
| `openml_credit_g` | OpenML German Credit | fraud/risk adjacent |
| `paysim_fraud` | PaySim-style synthetic mobile money | fraud ops proxy (time-ordered) |
| `ieee_cis_fraud` | IEEE-CIS sample or synthetic fallback | imbalanced fraud tabular |
| `openml_electricity` | OpenML Electricity | predictive maintenance proxy |
| `uci_gas_sensor_drift` | UCI Gas Sensor Array Drift | natural batch-chronological drift benchmark |
| `wilds_civilcomments_csv` | local WILDS CSV | public NLP / moderation |

Each source runs through all 8 commercial priorities: deployment surface, operating modes, offline replay, model adapters, engineering maturity, observability hooks, governance/audit, and real-data KPI evidence.

Results are saved under results/real_data_verification/.

Bundled offline fallbacks for OpenML-style datasets live in data/openml/ (UCI German Credit + Spambase). Regenerate with:

python3 scripts/export_bundled_real_data.py

Grafana dashboard template: observability/grafana/arl_dashboard.json

Docker

docker compose run --rm replay
docker compose up metrics

This version: 0.3.1 (uploaded Jun 4, 2026)

File details

Details for the file adaptive_reliability_layer-0.3.1.tar.gz.

File metadata

Download URL: adaptive_reliability_layer-0.3.1.tar.gz
Upload date: Jun 4, 2026
Size: 201.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for adaptive_reliability_layer-0.3.1.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c36a43e65e839586e0a6b3360759bb48a516323805c46a53e0af642444646e0e` |
| MD5 | `b446d5b8afdcdf8c74ff1a4861d9a4ff` |
| BLAKE2b-256 | `6ecb57f7df9f12138265f084a40e2adfe564d1f690703b6d43f131b31fc10404` |

File details

Details for the file adaptive_reliability_layer-0.3.1-py3-none-any.whl.

File metadata

Download URL: adaptive_reliability_layer-0.3.1-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 216.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for adaptive_reliability_layer-0.3.1-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `cc18c0f88791071fb96ae518100ebf697339a20f49bd512519a62425be9d9828` |
| MD5 | `17f620f99cba6ffcedc5d56517ae653f` |
| BLAKE2b-256 | `5ac1b0ab2c6ebeb6ca67eabbf4e762d56ed8820ed05dc177b617a593a1e82288` |
