
tensor-optix

Self-evolving autonomous learning loop for TensorFlow RL agents.


About

tensor-optix replaces the conventional reinforcement learning training loop with an autonomous, continuously-learning optimization system. You bring your TensorFlow model and Gymnasium environment — the library owns everything else: stepping, evaluation, hyperparameter tuning, checkpointing, and policy evolution.

The system runs as a continuous stream of steps with no fixed episode count. It detects performance plateaus through exponential backoff, tunes hyperparameters using finite difference estimation, and evolves policies by comparing live performance against its checkpoint history. When a plateau is detected, it can clone the best-known policy into a new variant, perturb its hyperparameters, and add the variant to the ensemble — autonomously generating new candidates instead of just reverting. Multiple agents run simultaneously as a weighted ensemble with self-adjusting weights based on rolling performance history, making the system particularly suited to non-stationary environments like financial markets where no single policy dominates all regimes.

Core philosophy: We own the loop. You own the model.


Install

pip install tensor-optix

Requirements: Python >= 3.11, TensorFlow >= 2.18, Gymnasium >= 1.0


Quick Start

import tensorflow as tf
import gymnasium as gym
from tensor_optix import RLOptimizer, TFAgent, BatchPipeline, HyperparamSet

# Build your model normally
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)

agent = TFAgent(
    model=model,
    optimizer=optimizer,
    hyperparams=HyperparamSet(
        params={"learning_rate": 3e-4, "gamma": 0.99},
        episode_id=0,
    ),
)

# Continuous stepping — windows of 200 steps, no forced resets
env = gym.make("CartPole-v1")
pipeline = BatchPipeline(env=env, agent=agent, window_size=200)

opt = RLOptimizer(agent=agent, pipeline=pipeline)
opt.run()  # runs until DORMANT (plateau) or max_episodes

How It Works

tensor-optix runs an autonomous improvement loop with four states:

ACTIVE   → aggressive tuning, evaluates every window
COOLING  → recent improvement, exponential backoff on eval frequency
DORMANT  → plateau reached — model is trained, minimal intervention
WATCHDOG → monitoring for degradation

DORMANT = trained: the backoff schedule, not a fixed episode count, determines when the model has stopped improving.

The loop:

  1. Steps continuously through the environment in fixed-size windows
  2. Evaluates each window via primary_score
  3. If improved: saves checkpoint, resets backoff
  4. If plateau: backs off evaluation, eventually reaches DORMANT
  5. If DORMANT: PolicyManager compares current score vs registry best and rolls back if needed
  6. If degraded: optionally rolls back to best checkpoint, re-activates
  7. Tunes hyperparameters using two-phase finite difference
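
In simplified form, the control flow looks roughly like the sketch below. This is for intuition only, not the library's actual implementation, and it ignores rollback and degradation handling; evaluate_window and save_checkpoint are hypothetical stand-ins for the pipeline and checkpoint registry.

# Simplified sketch of the improvement loop (intuition only, not the real code).
# evaluate_window() and save_checkpoint() are hypothetical stand-ins.
def run_loop(evaluate_window, save_checkpoint, base_interval=1,
             backoff_factor=2.0, max_interval=100, dormant_threshold=20):
    best_score = float("-inf")
    interval = base_interval        # windows between evaluations
    plateau_count = 0               # consecutive evaluations without improvement
    window = 0

    while plateau_count < dormant_threshold:
        window += int(interval)     # step `interval` windows before the next evaluation
        score = evaluate_window(window)

        if score > best_score:      # improvement: checkpoint and reset the backoff
            best_score = score
            save_checkpoint(window, score)
            interval = base_interval
            plateau_count = 0
        else:                       # plateau: evaluate less and less often
            interval = min(interval * backoff_factor, max_interval)
            plateau_count += 1

    return best_score               # DORMANT: improvement has stopped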

Optimizer — Two-Phase Finite Difference

BackoffOptimizer uses staggered two-phase finite difference per param:

Phase 1 (probe):  apply θᵢ + δᵢ, run one window
Phase 2 (commit): gradient = (score_after - score_before) / δᵢ
                  if gradient > 0: keep θᵢ + δᵢ
                  if gradient < 0: apply θᵢ - δᵢ  (reverse)

Params are cycled round-robin. Each param is probed and committed independently. Step size adapts on improvement and plateau.

from tensor_optix import BackoffOptimizer

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    optimizer=BackoffOptimizer(
        param_bounds={
            "learning_rate": (1e-5, 1e-2),
            "gamma": (0.9, 0.999),
        },
        perturbation_scale=0.05,
    ),
)
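
For intuition, the probe/commit cycle reduces to roughly the following sketch (not the library's code; run_window is a hypothetical stand-in that runs one window with the given hyperparameters and returns its score):

# Sketch of the staggered probe/commit cycle (intuition only).
# run_window(params) is a hypothetical stand-in returning the window score.
def tune_round_robin(params, deltas, run_window, n_cycles):
    names = list(params)
    for i in range(n_cycles):
        name = names[i % len(names)]            # params are cycled round-robin
        delta = deltas[name]

        score_before = run_window(params)       # baseline window
        params[name] += delta                   # Phase 1 (probe): theta_i + delta_i
        score_after = run_window(params)        # probe window

        gradient = (score_after - score_before) / delta    # Phase 2 (commit)
        if gradient < 0:
            params[name] -= 2 * delta           # reverse: apply theta_i - delta_i
    return params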

PBTOptimizer

Pseudo population-based training. Maintains a history of (hyperparams, score) pairs. When the current score falls in the bottom 20% of that history, it exploits by adopting a top performer's hyperparameters; otherwise it explores.

from tensor_optix import PBTOptimizer

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    optimizer=PBTOptimizer(
        param_bounds={"learning_rate": (1e-5, 1e-2)},
        history_size=50,
    ),
)
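
Conceptually, the exploit/explore decision looks something like the sketch below (generic population-based-training logic under the description above, not necessarily PBTOptimizer's exact behaviour):

import random

# Sketch: exploit a top performer when in the bottom 20%, otherwise explore.
def pbt_step(current_params, current_score, history, bounds):
    scores = sorted(score for _, score in history)
    in_bottom = bool(scores) and current_score <= scores[int(0.2 * len(scores))]

    if in_bottom:
        # exploit: adopt hyperparameters from one of the best recorded entries
        best = sorted(history, key=lambda entry: entry[1], reverse=True)
        return dict(random.choice(best[: max(1, len(best) // 5)])[0])

    # explore: perturb each parameter, clipped to its bounds
    perturbed = {}
    for name, value in current_params.items():
        lo, hi = bounds[name]
        perturbed[name] = min(hi, max(lo, value * random.uniform(0.8, 1.2)))
    return perturbed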

Custom Evaluator

from tensor_optix import BaseEvaluator, EpisodeData, EvalMetrics

class TotalRewardEvaluator(BaseEvaluator):
    def score(self, episode_data: EpisodeData, train_diagnostics: dict) -> EvalMetrics:
        total = sum(episode_data.rewards)
        return EvalMetrics(
            primary_score=total,
            metrics={"total_reward": total},
            episode_id=episode_data.episode_id,
        )

opt = RLOptimizer(agent=agent, pipeline=pipeline, evaluator=TotalRewardEvaluator())

Custom Agent (Algorithm-Specific Learning)

TFAgent provides a REINFORCE baseline. Subclass and override learn() for PPO, SAC, DQN, etc.:

from tensor_optix import TFAgent
from tensor_optix.core.types import EpisodeData
import tensorflow as tf

class PPOAgent(TFAgent):
    def learn(self, episode_data: EpisodeData) -> dict:
        clip_ratio = self._hyperparams.params.get("clip_ratio", 0.2)
        # ... PPO update logic ...
        return {"loss": loss_value, "entropy": entropy_value}

Live Pipeline

For real-time data sources (trading, robotics, online environments):

from tensor_optix import LivePipeline

class MyFeed:
    def stream(self):
        # Yield Gymnasium-style (obs, reward, terminated, truncated, info)
        # tuples from your real-time data source.
        while True:
            yield obs, reward, terminated, truncated, info

pipeline = LivePipeline(
    data_source=MyFeed(),
    agent=agent,
    episode_boundary_fn=LivePipeline.every_n_seconds(300),
)

Callbacks

from tensor_optix import LoopCallback

class MyLogger(LoopCallback):
    def on_improvement(self, snapshot):
        print(f"New best: {snapshot.eval_metrics.primary_score:.4f}")

    def on_dormant(self, window_id):
        print(f"Training complete at window {window_id}")

opt = RLOptimizer(agent=agent, pipeline=pipeline, callbacks=[MyLogger()])

Available hooks: on_loop_start, on_loop_stop, on_episode_end, on_improvement, on_plateau, on_dormant, on_degradation, on_hyperparam_update.


Policy Evolution

PolicyManager handles model evolution — separate from the hyperparameter optimizer.

Separation of concerns:

  • BackoffOptimizer / PBTOptimizer → tune hyperparameters
  • PolicyManager → evolve models (rollback, spawn, ensemble weights)

Automatic rollback on DORMANT

When the loop reaches DORMANT, PolicyManager compares the current score against the best checkpoint. If current < best, it loads the best known weights back into the agent.

from tensor_optix import PolicyManager, RLOptimizer
from tensor_optix.core.checkpoint_registry import CheckpointRegistry

registry = CheckpointRegistry("./checkpoints")
pm = PolicyManager(registry)

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    checkpoint_dir="./checkpoints",
    callbacks=[pm.as_callback(agent)],
)
opt.run()
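
Conceptually, the rollback check on DORMANT amounts to the following (a sketch; best_score and best_weights stand in for whatever the checkpoint registry records):

# Sketch of the DORMANT rollback decision (not the library's code).
def maybe_rollback(agent, current_score, best_score, best_weights):
    if current_score < best_score:
        agent.model.set_weights(best_weights)   # restore the best-known weights
        return True                             # rolled back
    return False                                # current policy is still the best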

Spawning policy variants

Clone the best checkpoint into a new agent shell and mutate its hyperparameters. The variant starts from the best-known weights with a small perturbation — a new candidate without starting from scratch.

from tensor_optix import PolicyManager, EnsembleAgent

pm = PolicyManager(registry)
pm.add_agent(primary_agent, weight=1.0)

# Clone best checkpoint into a new agent shell, perturb hyperparams by 5%
variant = pm.spawn_variant(MyAgent(...), noise_scale=0.05)
pm.add_agent(variant, weight=0.5)

ensemble = EnsembleAgent(pm, primary_agent=primary_agent)

For weight-space perturbation (e.g. adding noise to network parameters), supply a mutation_fn:

def perturb_weights(agent):
    for layer in agent.model.layers:
        for w in layer.trainable_weights:
            w.assign_add(tf.random.normal(w.shape, stddev=0.01))

variant = pm.spawn_variant(MyAgent(...), mutation_fn=perturb_weights)

Ensemble — multiple policies

Run multiple agents simultaneously. Actions are combined as a weighted average: action = Σ(wᵢ × aᵢ) / Σ(wᵢ).

pm = PolicyManager(registry)
pm.add_agent(agent_trending,  weight=1.0)
pm.add_agent(agent_ranging,   weight=1.0)
pm.add_agent(agent_volatile,  weight=1.0)

ensemble = EnsembleAgent(pm, primary_agent=agent_trending)

opt = RLOptimizer(
    agent=ensemble,
    pipeline=BatchPipeline(env=env, agent=ensemble, window_size=200),
    callbacks=[pm.as_callback(agent_trending)],
)
opt.run()
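
For intuition, the action combination is just a weighted mean over the per-agent actions, as in the sketch below (NumPy, continuous actions; the real EnsembleAgent may handle details such as discrete action spaces differently):

import numpy as np

# action = sum(w_i * a_i) / sum(w_i)
def combine_actions(actions, weights):
    actions = np.asarray(actions, dtype=float)     # shape: (n_agents, action_dim)
    weights = np.asarray(weights, dtype=float)     # shape: (n_agents,)
    return (weights[:, None] * actions).sum(axis=0) / weights.sum()

# e.g. actions 0.2, 0.8, 0.5 with weights 1.0, 1.0, 0.5
# -> (0.2 + 0.8 + 0.25) / 2.5 = 0.5
print(combine_actions([[0.2], [0.8], [0.5]], [1.0, 1.0, 0.5]))   # [0.5]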

Autonomous weight adjustment

Record per-agent scores and let the system rebalance weights automatically. auto_update_weights() fires on every DORMANT event via the callback — no manual wiring needed.

# Record scores after each evaluation window (e.g. rolling Sharpe per regime)
pm.record_agent_score(0, sharpe_trending)
pm.record_agent_score(1, sharpe_ranging)
pm.record_agent_score(2, sharpe_volatile)

# auto_update_weights() is called automatically on DORMANT
# or call it manually at any time:
pm.auto_update_weights()

Scores are tracked in a rolling window (score_window=10 by default). Agents with higher mean scores get proportionally higher weight.
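
Conceptually, the rebalancing is proportional weighting by each agent's rolling mean score, roughly as sketched below (assuming non-negative scores; the library's exact normalization may differ):

# Sketch: weights proportional to each agent's rolling mean score.
def rebalance(score_histories):
    means = [sum(h) / len(h) if h else 0.0 for h in score_histories]
    total = sum(means) or 1.0
    return [m / total for m in means]

# e.g. rolling mean scores 1.2, 0.6, 0.2 -> weights 0.6, 0.3, 0.1
print(rebalance([[1.2], [0.6], [0.2]]))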

Regime detection

RegimeDetector classifies the current performance regime from EvalMetrics history using coefficient of variation (volatility) and normalized linear slope (trend).

from tensor_optix import RegimeDetector

detector = RegimeDetector(
    volatility_threshold=0.2,   # CV above this → "volatile"
    trend_threshold=0.05,       # normalized slope above this → "trending"
    window=10,
)

regime = detector.detect(metrics_history)  # "trending" | "ranging" | "volatile"

# Use regime to boost the relevant agent's weight
regime_to_idx = {"trending": 0, "ranging": 1, "volatile": 2}
pm.record_agent_score(regime_to_idx[regime], latest_score)
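
Roughly, the built-in classification can be thought of as thresholding two statistics over the score window: coefficient of variation for volatility and a mean-normalized linear slope for trend. A sketch (not the library's exact implementation):

import numpy as np

# Sketch: classify a window of scores by volatility (CV) and trend (normalized slope).
def classify(scores, volatility_threshold=0.2, trend_threshold=0.05):
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    cv = scores.std() / abs(mean) if mean else float("inf")    # coefficient of variation
    if cv > volatility_threshold:
        return "volatile"
    slope = np.polyfit(np.arange(len(scores)), scores, 1)[0]   # linear trend
    if abs(slope) / (abs(mean) or 1.0) > trend_threshold:      # normalized by the mean
        return "trending"
    return "ranging"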

For domain-specific signals (Sharpe ratio, VIX, ATR), subclass and override detect():

class MarketRegimeDetector(RegimeDetector):
    def detect(self, metrics_history):
        # current_vix and atr_percentile are placeholders for your own market-data signals
        if current_vix > 30:
            return "volatile"
        if atr_percentile > 70:
            return "trending"
        return "ranging"

Full Configuration

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    evaluator=None,                     # default: TFEvaluator
    optimizer=None,                     # default: BackoffOptimizer
    checkpoint_dir="./checkpoints",
    max_snapshots=10,
    rollback_on_degradation=False,
    improvement_margin=0.0,
    max_episodes=None,                  # None = run until DORMANT
    base_interval=1,
    backoff_factor=2.0,
    max_interval_episodes=100,
    plateau_threshold=5,
    dormant_threshold=20,
    degradation_threshold=0.95,
    callbacks=[],
)

Architecture

tensor_optix/
├── core/
│   ├── types.py                # EpisodeData, EvalMetrics, HyperparamSet, LoopState
│   ├── base_agent.py           # BaseAgent — 6-method contract
│   ├── base_evaluator.py
│   ├── base_optimizer.py
│   ├── base_pipeline.py
│   ├── loop_controller.py      # State machine + main loop
│   ├── checkpoint_registry.py
│   ├── backoff_scheduler.py
│   ├── policy_manager.py       # PolicyManager + PolicyManagerCallback
│   ├── ensemble_agent.py       # EnsembleAgent — multi-policy BaseAgent wrapper
│   └── regime_detector.py      # RegimeDetector — score-based regime classification
├── adapters/tensorflow/
│   ├── tf_agent.py             # TFAgent — Keras model wrapper
│   └── tf_evaluator.py         # TFEvaluator — default scorer
├── pipeline/
│   ├── batch_pipeline.py       # Continuous stepping, fixed windows
│   └── live_pipeline.py        # Real-time streaming
└── optimizers/
    ├── backoff_optimizer.py    # Two-phase finite difference
    └── pbt_optimizer.py        # Pseudo population-based training

Component responsibilities

Component            Responsibility
LoopController       State machine, episode orchestration
BackoffScheduler     Adaptation interval + state transitions
CheckpointRegistry   Snapshot storage and manifest
BaseOptimizer        Hyperparameter tuning
PolicyManager        Model evolution (rollback, spawn variants, ensemble weights)
EnsembleAgent        Multi-policy action combining
RegimeDetector       Score-based regime classification (trending / ranging / volatile)

License

MIT — Copyright (c) 2026 sup3rus3r
