
tensor-optix

Autonomous training loop for any sequential learning model — built-in PPO, DQN, and SAC for TensorFlow and PyTorch.


About

tensor-optix is a framework-agnostic autonomous training loop. It owns evaluation, checkpointing, hyperparameter tuning, policy evolution, and ensemble management for any model that can act and learn from sequential data — reinforcement learning agents, online forecasters, trading systems, robotics controllers, or any custom architecture that fits the six-method BaseAgent interface.

The framework makes zero assumptions about your model, algorithm, or ML framework. No RL-specific logic lives in the core loop — it works equally well with PPO, a custom evolutionary strategy, a supervised sequence model, or anything else. For RL specifically, it ships with production-ready implementations of PPO, DQN, and SAC for both TensorFlow and PyTorch, so you can start training without writing a single line of algorithm code.

The system never stops at a fixed episode count. It detects convergence via an exponential-backoff evaluation schedule, spawns policy variants when training plateaus, weights an ensemble by rolling performance, and uses both training and validation signals to drive every decision — never training alone.

Core philosophy: We own the loop. You own the model.


Install

# TensorFlow (default — includes PPO, DQN, SAC for TF)
pip install tensor-optix

# PyTorch support
pip install tensor-optix[torch]

# Both + Atari + MuJoCo
pip install tensor-optix[all]

Requirements: Python >= 3.11, Gymnasium >= 1.0, NumPy >= 1.24. TensorFlow >= 2.18 is required for TF algorithms. PyTorch >= 2.0 is required for Torch algorithms. The core loop, PolicyManager, and all ensemble/evolution logic are framework-free.


Quick Start

TensorFlow PPO

import tensorflow as tf
import gymnasium as gym
from tensor_optix import RLOptimizer, TFPPOAgent, BatchPipeline, HyperparamSet

actor = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(2),                    # logits, one per action
])
critic = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(1),                    # scalar value estimate
])

agent = TFPPOAgent(
    actor=actor,
    critic=critic,
    optimizer=tf.keras.optimizers.Adam(3e-4),
    hyperparams=HyperparamSet(params={
        "learning_rate": 3e-4,
        "clip_ratio":    0.2,
        "entropy_coef":  0.01,
        "vf_coef":       0.5,
        "gamma":         0.99,
        "gae_lambda":    0.95,
        "n_epochs":      10,
        "minibatch_size": 64,
    }, episode_id=0),
)

env      = gym.make("CartPole-v1")
pipeline = BatchPipeline(env=env, agent=agent, window_size=2048)

opt = RLOptimizer(agent=agent, pipeline=pipeline)
opt.run()  # runs until convergence, auto-tunes hyperparams, saves best checkpoint

PyTorch PPO

import torch
import torch.nn as nn
import gymnasium as gym
from tensor_optix import RLOptimizer, BatchPipeline, HyperparamSet
from tensor_optix.algorithms.torch_ppo import TorchPPOAgent

obs_dim, n_actions = 4, 2

actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

agent = TorchPPOAgent(
    actor=actor,
    critic=critic,
    optimizer=torch.optim.Adam(
        list(actor.parameters()) + list(critic.parameters()), lr=3e-4
    ),
    hyperparams=HyperparamSet(params={
        "learning_rate": 3e-4, "clip_ratio": 0.2, "entropy_coef": 0.01,
        "gamma": 0.99, "gae_lambda": 0.95, "n_epochs": 10, "minibatch_size": 64,
    }, episode_id=0),
)

pipeline = BatchPipeline(env=gym.make("CartPole-v1"), agent=agent, window_size=2048)
opt = RLOptimizer(agent=agent, pipeline=pipeline)
opt.run()

Built-in Algorithms

PPO — Proximal Policy Optimization

Discrete action spaces. Actor + critic are separate models.

from tensor_optix.algorithms.tf_ppo import TFPPOAgent   # TensorFlow
from tensor_optix.algorithms.torch_ppo import TorchPPOAgent  # PyTorch

What it implements:

  • GAE-λ advantage estimation with episode-boundary handling
  • Clipped surrogate objective: L = min(r·A, clip(r, 1−ε, 1+ε)·A)
  • Value function loss (MSE) with configurable coefficient
  • Entropy bonus for exploration regularization
  • n epochs of shuffled minibatch gradient descent per rollout
  • Global gradient norm clipping

Hyperparams exposed to the tuner:

Key             Default  Description
learning_rate   3e-4     Adam learning rate
clip_ratio      0.2      PPO clipping epsilon (ε)
entropy_coef    0.01     Entropy bonus weight
vf_coef         0.5      Value loss weight
gamma           0.99     Discount factor
gae_lambda      0.95     GAE smoothing (0 = TD, 1 = MC)
n_epochs        10       Update epochs per rollout
minibatch_size  64       Minibatch size
max_grad_norm   0.5      Gradient clipping norm

DQN — Deep Q-Network

Discrete action spaces. Single Q-network; target network updated periodically.

from tensor_optix.algorithms.tf_dqn import TFDQNAgent     # TensorFlow
from tensor_optix.algorithms.torch_dqn import TorchDQNAgent  # PyTorch

What it implements:

  • Experience replay buffer (circular)
  • Target network with hard periodic updates
  • Epsilon-greedy exploration with multiplicative decay
  • TD loss: MSE(Q(s,a), r + γ max Q_target(s',·))

obs_dim, n_actions = 4, 2                # e.g. CartPole-v1

q_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(obs_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_actions),    # Q-values for all actions
])

agent = TFDQNAgent(
    q_network=q_net,
    n_actions=n_actions,
    optimizer=tf.keras.optimizers.Adam(1e-3),
    hyperparams=HyperparamSet(params={
        "learning_rate": 1e-3, "gamma": 0.99,
        "epsilon": 1.0, "epsilon_min": 0.05, "epsilon_decay": 0.995,
        "batch_size": 64, "target_update_freq": 100,
        "replay_capacity": 100_000,
    }, episode_id=0),
)

SAC — Soft Actor-Critic

Continuous action spaces. Squashed Gaussian actor, twin critics, auto-entropy tuning.

from tensor_optix.algorithms.tf_sac import TFSACAgent     # TensorFlow
from tensor_optix.algorithms.torch_sac import TorchSACAgent  # PyTorch

What it implements:

  • Reparameterized squashed Gaussian: a = tanh(μ + σε), actions ∈ (−1, 1)
  • Clipped double-Q (twin critics) to reduce overestimation bias
  • Soft Polyak target network updates: θ_tgt ← τθ + (1−τ)θ_tgt
  • Auto-entropy temperature: learnable log_α, target entropy = −dim(A)
  • Off-policy experience replay

obs_dim = 27     # set to your env's observation size
action_dim = 6   # e.g. MuJoCo Ant

actor   = tf.keras.Sequential([        # obs → [mean || log_std], shape [batch, 2*action_dim]
    tf.keras.layers.Input(shape=(obs_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(action_dim * 2),
])
critic1 = tf.keras.Sequential([        # [obs || action] → Q-value
    tf.keras.layers.Input(shape=(obs_dim + action_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
critic2 = tf.keras.Sequential([        # same architecture as critic1, independent weights (twin-Q)
    tf.keras.layers.Input(shape=(obs_dim + action_dim,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])

agent = TFSACAgent(
    actor=actor, critic1=critic1, critic2=critic2,
    action_dim=action_dim,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4),
    hyperparams=HyperparamSet(params={
        "learning_rate": 3e-4, "gamma": 0.99,
        "tau": 0.005, "batch_size": 256,
        "replay_capacity": 1_000_000,
    }, episode_id=0),
)

Parallel Environments — VectorBatchPipeline

Run N environments simultaneously, increasing sample throughput by up to N× over a single BatchPipeline.

import gymnasium as gym
from tensor_optix.pipeline.vector_pipeline import VectorBatchPipeline

env_fns  = [lambda: gym.make("CartPole-v1")] * 8
pipeline = VectorBatchPipeline(env_fns=env_fns, window_size=256)

# Async (subprocess-based) for CPU-bound envs:
pipeline = VectorBatchPipeline(env_fns=env_fns, window_size=256, async_envs=True)

Observation & Reward Normalization

from tensor_optix.core.normalizers import ObsNormalizer, RewardNormalizer

obs_norm    = ObsNormalizer(obs_shape=(obs_dim,), clip=10.0)
reward_norm = RewardNormalizer(gamma=0.99, clip=10.0)

# Update stats from a collected rollout, then normalize in act():
obs_norm.update(obs_batch)
normed_obs = obs_norm.normalize(raw_obs)

# Scale rewards using running return std:
reward_norm.step(raw_reward)
scaled_reward = reward_norm.normalize([raw_reward])[0]

Extend Any Algorithm

All algorithms implement BaseAgent. Subclass to add domain-specific logic without reimplementing PPO:

class TradingPPO(TFPPOAgent):
    """Adds action distribution logging to standard PPO."""

    def learn(self, episode_data):
        diag = super().learn(episode_data)           # full PPO update
        obs  = tf.cast(episode_data.observations[:256], tf.float32)
        probs = tf.nn.softmax(self._actor(obs, training=False))
        probs_mean = tf.reduce_mean(probs, axis=0).numpy()
        diag["p_flat"]  = float(probs_mean[0])
        diag["p_long"]  = float(probs_mean[1])
        diag["p_short"] = float(probs_mean[2])
        return diag

The Loop Interface

The core loop calls exactly six methods on any agent. Nothing else is assumed.

from abc import ABC
from typing import Any

class BaseAgent(ABC):
    def act(self, observation) -> Any: ...         # any action type
    def learn(self, episode_data) -> dict: ...     # any algorithm
    def get_hyperparams(self) -> HyperparamSet: ...
    def set_hyperparams(self, hp: HyperparamSet): ...
    def save_weights(self, path: str): ...
    def load_weights(self, path: str): ...

Any model — RL algorithms, evolutionary strategies, supervised sequence models, online forecasters — plugs in by implementing these six methods. No specific framework, action space, or learning paradigm is assumed.
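
For illustration, a minimal sketch of the contract — a random-action baseline that learns nothing — assuming BaseAgent is importable from the package root like the other names above:

import json
import random

from tensor_optix import BaseAgent, HyperparamSet   # assumed export path

class RandomBaselineAgent(BaseAgent):
    """Hypothetical do-nothing baseline satisfying the six-method contract."""

    def __init__(self, n_actions: int):
        self.n_actions = n_actions
        self._hp = HyperparamSet(params={}, episode_id=0)

    def act(self, observation):
        return random.randrange(self.n_actions)     # any action type works

    def learn(self, episode_data) -> dict:
        return {}                                   # diagnostics dict; nothing to learn

    def get_hyperparams(self) -> HyperparamSet:
        return self._hp

    def set_hyperparams(self, hp: HyperparamSet):
        self._hp = hp

    def save_weights(self, path: str):
        with open(path, "w") as f:
            json.dump({"n_actions": self.n_actions}, f)   # no weights; persist config

    def load_weights(self, path: str):
        with open(path) as f:
            self.n_actions = json.load(f)["n_actions"]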


How It Works

The Loop States

ACTIVE   → aggressive tuning, evaluates every window
COOLING  → recent improvement, exponential backoff on eval frequency
DORMANT  → plateau reached — model is trained, minimal intervention
WATCHDOG → monitoring for degradation

DORMANT = trained. There is no fixed episode count — evaluation intervals grow geometrically while no improvement is seen, and convergence is declared once the no-improvement streak crosses the dormant threshold.

Backoff Schedule

interval₀ = base_interval
intervalₙ = min(intervalₙ₋₁ × backoff_factor, max_interval_episodes)

Plateau detected when:  consecutive_no_improvement ≥ plateau_threshold
DORMANT declared when:  consecutive_no_improvement ≥ dormant_threshold

Every improvement resets the backoff counter. The system accelerates evaluation when learning is happening, backs off when it isn't.
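
As a concrete illustration of the schedule (a sketch using the defaults from the Full Configuration section below):

def backoff_schedule(base_interval=1, backoff_factor=2.0,
                     max_interval_episodes=100, n=10):
    """First n eval intervals: geometric growth, capped at the max."""
    interval = base_interval
    for _ in range(n):
        yield interval
        interval = min(interval * backoff_factor, max_interval_episodes)

print(list(backoff_schedule()))
# [1, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100, 100, 100]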

Hyperparameter Optimizer — Two-Phase Finite Difference

BackoffOptimizer cycles through hyperparameters using staggered two-phase finite difference:

For each param θᵢ:
  Phase 1 (probe):   apply θᵢ + δᵢ, run one window, record score s₊
  Phase 2 (commit):  gradient ĝᵢ = (s₊ − s₀) / δᵢ
                     if ĝᵢ > 0: keep θᵢ + δᵢ
                     if ĝᵢ ≤ 0: apply θᵢ − δᵢ  (reverse direction)

Step size δᵢ adapts: shrinks on improvement, grows on plateau. Params cycle round-robin — each is probed and committed independently.
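
A sketch of one probe/commit cycle on a single parameter (hypothetical helper, not the actual BackoffOptimizer API):

def probe_and_commit(theta, delta, run_window, s0):
    """One probe/commit cycle of two-phase finite difference (sketch).

    run_window(value) trains one window with the hyperparameter set to
    value and returns the score; s0 is the baseline score at theta.
    """
    s_plus = run_window(theta + delta)   # Phase 1: probe θ + δ
    g_hat  = (s_plus - s0) / delta       # Phase 2: finite-difference gradient estimate
    return theta + delta if g_hat > 0 else theta - delta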

PBTOptimizer maintains a history of (hyperparams, score) pairs. When the current score falls in the bottom 20% of that history, it exploits a top performer's hyperparameters; otherwise it explores via Gaussian perturbation.


The Science: Train + Val Together

Without validation, every decision — checkpoint saves, rollbacks, spawn triggers — is made on training data alone. That is overfitting disguised as improvement.

Validation Pipeline

opt = RLOptimizer(
    agent=agent,
    pipeline=train_pipeline,
    val_pipeline=val_pipeline,   # held-out — agent acts, never learns
)

On every eval window, the loop runs one val episode (act() only, no learn()), then calls evaluator.combine(train_metrics, val_metrics):

primary_score        = val_score          ← drives ALL checkpoint and rollback decisions
generalization_gap   = train_score − val  ← surfaced in every EvalMetrics

Every adaptation decision — rollback, spawn, noise scale, MetaController — is driven by out-of-sample performance, not training performance.
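
A plausible shape for combine, assuming the EvalMetrics constructor shown in the Custom Evaluator section below (the actual implementation may carry more metadata):

from tensor_optix import EvalMetrics

def combine(train_metrics, val_metrics):
    """Sketch: validation drives the primary score; the gap rides along."""
    gap = train_metrics.primary_score - val_metrics.primary_score
    return EvalMetrics(
        primary_score=val_metrics.primary_score,          # drives checkpoint/rollback
        metrics={**val_metrics.metrics, "generalization_gap": gap},
        episode_id=val_metrics.episode_id,
    )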

Three-Signal Adaptive Noise

When spawning a policy variant, the mutation intensity is computed from three signals:

Signal 1 — Val slope (improvement rate)

t = clip(slope(val_scores) / max_slope, 0, 1)

t → 1 when val is improving strongly. t → 0 on plateau.

Signal 2 — Generalization gap

gap_penalty = clip(mean(train − val) / |mean(val)|, 0, 1)

Large gap means the model fits training data but not held-out data — explore different solutions.

Signal 3 — Train/val correlation (Pearson)

corr_penalty = clip(1 − Pearson(train_scores, val_scores), 0, 1)

corr → 1 means train and val move together (healthy). corr → 0 means train is moving but val isn't — a signal to explore.

Combined formula:

effective_t = t × (1 − 0.5 × gap_penalty) × (1 − 0.5 × corr_penalty)
noise_scale = max_scale − effective_t × (max_scale − min_scale)
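
Putting the three signals together — a NumPy sketch of the formulas above, where max_slope, min_scale, and max_scale are illustrative constants:

import numpy as np

def noise_scale(train_scores, val_scores,
                max_slope=0.1, min_scale=0.01, max_scale=0.5):
    """Combine the three signals above into a mutation noise scale."""
    train = np.asarray(train_scores, dtype=float)
    val   = np.asarray(val_scores, dtype=float)

    # Signal 1 — val slope (improvement rate)
    slope = np.polyfit(np.arange(len(val)), val, 1)[0]
    t = np.clip(slope / max_slope, 0.0, 1.0)

    # Signal 2 — generalization gap
    gap_penalty = np.clip(np.mean(train - val) / (abs(val.mean()) + 1e-8), 0.0, 1.0)

    # Signal 3 — train/val Pearson correlation
    corr_penalty = np.clip(1.0 - np.corrcoef(train, val)[0, 1], 0.0, 1.0)

    effective_t = t * (1 - 0.5 * gap_penalty) * (1 - 0.5 * corr_penalty)
    return max_scale - effective_t * (max_scale - min_scale)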

Policy Evolution

Automatic Rollback

When the loop reaches DORMANT, PolicyManager compares the current score against the best checkpoint. If current < best, it loads the best known weights back into the agent automatically.

from tensor_optix import PolicyManager
from tensor_optix.core.checkpoint_registry import CheckpointRegistry

registry = CheckpointRegistry("./checkpoints")
pm = PolicyManager(registry)

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    callbacks=[pm.as_callback(agent)],
)
opt.run()

Spawn Budget — When Is Training Done?

pm = PolicyManager(registry, max_spawns=3)
cb = pm.as_callback(agent)
cb.set_stop_fn(opt.stop)

opt.run()  # returns cleanly when budget is exhausted

When budget is exhausted, a training report is printed automatically:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Training Complete
  Reason           : Spawn budget exhausted
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Best score       : 0.8732
  Val score        : 0.8612
  Generalization   : 0.0120  (train − val)
  Spawns used      : 3 / 3
  Pruned agents    : 1
  Ensemble size    : 3
  Regime           : trending
  Agents           :
    [0] weight=2.4100  mean_score=0.8710
    [1] weight=1.0300  mean_score=0.7240
    [2] weight=0.5600  mean_score=0.6120
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Autonomous Spawning

def make_agent():
    return TFPPOAgent(actor=build_actor(), critic=build_critic(),
                      optimizer=tf.keras.optimizers.Adam(3e-4),
                      hyperparams=initial_hp)

pm = PolicyManager(registry, max_spawns=5, max_ensemble_size=4)
cb = pm.as_callback(agent, agent_factory=make_agent)
cb.set_stop_fn(opt.stop)

On DORMANT: rebalance ensemble weights → rollback if degraded → clone best checkpoint → perturb hyperparams → add to ensemble → prune if over limit → stop when budget exhausted.

MetaController — Autonomous Decisions

from tensor_optix import MetaController

cb = pm.as_callback(
    agent,
    agent_factory=make_agent,
    meta_controller=MetaController(
        gap_threshold=0.3,
        gap_slope_threshold=0.02,
        improvement_threshold=0.05,
    ),
)

MetaController decides SPAWN / PRUNE / STOP / NO_OP on each DORMANT event based on generalization gap level, gap slope (overfitting progression), and val improvement rate.

Ensemble

from tensor_optix import PolicyManager, EnsembleAgent

pm = PolicyManager(registry)
pm.add_agent(agent_a, weight=1.0)
pm.add_agent(agent_b, weight=1.0)

ensemble = EnsembleAgent(pm, primary_agent=agent_a)
# Actions: a = Σ(wᵢ × aᵢ) / Σ(wᵢ)

opt = RLOptimizer(
    agent=ensemble,
    pipeline=BatchPipeline(env=env, agent=ensemble, window_size=2048),
    callbacks=[pm.as_callback(agent_a)],
)

Custom Evaluator

from tensor_optix import BaseEvaluator, EpisodeData, EvalMetrics
import numpy as np

class SharpeEvaluator(BaseEvaluator):
    def score(self, episode_data: EpisodeData, train_diagnostics: dict) -> EvalMetrics:
        rewards = np.array(episode_data.rewards)
        sharpe  = rewards.mean() / (rewards.std() + 1e-8)
        return EvalMetrics(
            primary_score=float(sharpe),
            metrics={"sharpe": float(sharpe), "mean_reward": float(rewards.mean())},
            episode_id=episode_data.episode_id,
        )

opt = RLOptimizer(agent=agent, pipeline=pipeline, evaluator=SharpeEvaluator())

Live Pipeline

from tensor_optix import LivePipeline

pipeline = LivePipeline(
    data_source=MarketFeed(),    # your streaming data source (user-defined)
    agent=agent,
    episode_boundary_fn=LivePipeline.every_n_seconds(300),
)

Callbacks

from tensor_optix import LoopCallback

class MyLogger(LoopCallback):
    def on_improvement(self, snapshot):
        print(f"New best: {snapshot.eval_metrics.primary_score:.4f}")

    def on_dormant(self, window_id):
        print(f"Converged at window {window_id}")

opt = RLOptimizer(agent=agent, pipeline=pipeline, callbacks=[MyLogger()])

Available hooks: on_loop_start, on_loop_stop, on_episode_end, on_improvement, on_plateau, on_dormant, on_degradation, on_hyperparam_update.


Full Configuration

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,
    val_pipeline=val_pipeline,              # optional held-out pipeline
    evaluator=None,                         # default: TFEvaluator
    optimizer=None,                         # default: BackoffOptimizer
    checkpoint_dir="./checkpoints",
    max_snapshots=10,
    rollback_on_degradation=False,
    improvement_margin=0.0,
    max_episodes=None,                      # None = run until DORMANT
    base_interval=1,
    backoff_factor=2.0,
    max_interval_episodes=100,
    plateau_threshold=5,
    dormant_threshold=20,
    degradation_threshold=0.95,
    callbacks=[],
)

Architecture

tensor_optix/
├── core/
│   ├── types.py                # EpisodeData, EvalMetrics, HyperparamSet, LoopState
│   ├── base_agent.py           # BaseAgent — 6-method contract
│   ├── base_evaluator.py       # BaseEvaluator — score, combine, compare
│   ├── base_optimizer.py       # BaseOptimizer — suggest, on_improvement, on_plateau
│   ├── base_pipeline.py        # BasePipeline — episodes() generator
│   ├── loop_controller.py      # State machine + main loop
│   ├── backoff_scheduler.py    # Convergence detection + state transitions
│   ├── checkpoint_registry.py  # Snapshot storage and manifest
│   ├── normalizers.py          # RunningMeanStd, ObsNormalizer, RewardNormalizer
│   ├── trajectory_buffer.py    # compute_gae(), make_minibatches()
│   ├── policy_manager.py       # PolicyManager + PolicyManagerCallback
│   ├── ensemble_agent.py       # EnsembleAgent — multi-policy BaseAgent wrapper
│   ├── regime_detector.py      # RegimeDetector — score-based regime classification
│   └── meta_controller.py      # MetaController — SPAWN/PRUNE/STOP/NO_OP decisions
├── adapters/
│   ├── tensorflow/
│   │   ├── tf_agent.py         # TFAgent — generic REINFORCE / A2C base
│   │   └── tf_evaluator.py     # TFEvaluator — default scorer
│   └── pytorch/
│       ├── torch_agent.py      # TorchAgent — generic REINFORCE / A2C base
│       └── torch_evaluator.py  # TorchEvaluator — default scorer (no TF dep)
├── algorithms/
│   ├── tf_ppo.py               # TFPPOAgent — clipped surrogate, GAE-λ, n-epoch minibatch
│   ├── tf_dqn.py               # TFDQNAgent — replay buffer, target net, ε-greedy
│   ├── tf_sac.py               # TFSACAgent — twin critics, squashed Gaussian, auto-α
│   ├── torch_ppo.py            # TorchPPOAgent
│   ├── torch_dqn.py            # TorchDQNAgent
│   └── torch_sac.py            # TorchSACAgent
└── pipeline/
    ├── batch_pipeline.py       # Continuous stepping, fixed windows
    ├── live_pipeline.py        # Real-time streaming with reconnect
    └── vector_pipeline.py      # Parallel envs via gymnasium.vector

Component                    Responsibility
TFPPOAgent / TorchPPOAgent   PPO with GAE, clipping, entropy, n-epoch minibatch
TFDQNAgent / TorchDQNAgent   DQN with replay buffer, target net, ε-greedy
TFSACAgent / TorchSACAgent   SAC with twin critics, auto-entropy, soft updates
LoopController               State machine, episode orchestration, eval, checkpoint
BackoffScheduler             Convergence detection via exponential backoff
CheckpointRegistry           Snapshot storage, best-checkpoint manifest
BackoffOptimizer             Two-phase finite difference hyperparameter tuning
PBTOptimizer                 Population-based exploit/explore hyperparameter tuning
PolicyManager                Rollback, spawn, prune, boost, ensemble weights, adaptive noise
MetaController               Rule-based SPAWN/PRUNE/STOP/NO_OP decisions
EnsembleAgent                Weighted-average action combining across multiple agents
RegimeDetector               Score-based regime classification (trending / ranging / volatile)
VectorBatchPipeline          Parallel environment rollouts via gymnasium.vector
ObsNormalizer                Online running mean/std observation normalization
RewardNormalizer             Return-std reward scaling

Math & Science Reference

GAE-λ (trajectory_buffer.compute_gae)

δₜ = rₜ + γ·V(sₜ₊₁)·(1−dₜ) − V(sₜ)
Aₜ = δₜ + γλ·(1−dₜ)·Aₜ₊₁

The (1−dₜ) mask zeros both the bootstrap from V(sₜ₊₁) and the GAE propagation from Aₜ₊₁ simultaneously, so a single window containing multiple episode fragments is handled correctly without splitting by episode.
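
A reference NumPy sketch of the recursion (the shipped compute_gae may differ in signature):

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward GAE-λ pass; (1 − done) masks both bootstrap and propagation."""
    values = np.asarray(values, dtype=float)
    advantages = np.zeros(len(rewards))
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        mask  = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_adv = delta + gamma * lam * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages, advantages + values   # (Aₜ, Rₜ = Aₜ + V(sₜ))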

PPO Clipped Surrogate (TFPPOAgent / TorchPPOAgent)

rₜ(θ) = π_θ(aₜ|sₜ) / π_θ_old(aₜ|sₜ)

L_clip = E[ min(rₜ·Âₜ, clip(rₜ, 1−ε, 1+ε)·Âₜ) ]
L_vf   = E[ (V_θ(sₜ) − Rₜ)² ]
L_ent  = E[ H(π_θ(·|sₜ)) ]

L = −L_clip + c₁·L_vf − c₂·L_ent
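
The same loss as a NumPy sketch (clip_ratio, vf_coef, entropy_coef correspond to ε, c₁, c₂):

import numpy as np

def ppo_loss(logp_new, logp_old, adv, values, returns, entropy,
             clip_ratio=0.2, vf_coef=0.5, entropy_coef=0.01):
    """Clipped surrogate + value MSE − entropy bonus (sketch)."""
    ratio   = np.exp(logp_new - logp_old)                    # rₜ(θ)
    clipped = np.clip(ratio, 1 - clip_ratio, 1 + clip_ratio)
    l_clip  = np.mean(np.minimum(ratio * adv, clipped * adv))
    l_vf    = np.mean((values - returns) ** 2)
    l_ent   = np.mean(entropy)
    return -l_clip + vf_coef * l_vf - entropy_coef * l_ent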

SAC Squashed Gaussian (TFSACAgent / TorchSACAgent)

u ~ N(μ_θ(s), σ_θ(s))
a = tanh(u)

log π(a|s) = log N(u; μ, σ) − Σᵢ log(1 − tanh²(uᵢ) + ε)

Auto-entropy: L_α = −α · (log π(aₜ|sₜ) + H_target) where H_target = −dim(A).
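
The log-density with its tanh correction, as a NumPy sketch for a diagonal Gaussian:

import numpy as np

def squashed_gaussian_logp(u, mean, std, eps=1e-6):
    """log π(a|s) for a = tanh(u), u ~ N(mean, std), shape [batch, dim]."""
    logp_u = -0.5 * (((u - mean) / std) ** 2 + 2 * np.log(std) + np.log(2 * np.pi))
    logp_u = logp_u.sum(axis=-1)                             # diagonal Gaussian
    correction = np.log(1 - np.tanh(u) ** 2 + eps).sum(axis=-1)
    return logp_u - correction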

A2C Advantage Baseline (TFAgent / TorchAgent)

Aₜ = Gₜ − V(sₜ)
∇J ≈ Σ ∇log π(aₜ|sₜ) · Âₜ

By the policy gradient theorem, subtracting V(sₜ) does not bias the gradient while reducing variance. The explained_variance diagnostic in learn() measures critic quality: 1.0 = perfect, 0.0 = useless, <0 = harmful.
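
The explained_variance diagnostic follows the standard definition (sketch):

import numpy as np

def explained_variance(values, returns):
    """1 − Var(returns − values) / Var(returns): 1 = perfect, ≤ 0 = useless/harmful."""
    var_returns = np.var(returns)
    return float("nan") if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns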

PBT Perturbation Modes (PBTOptimizer)

Log-scale for learning_rate, epsilon, weight_decay:

δ_log = scale × log(high / low)
θ' = clip(θ × exp(Uniform(−δ_log, +δ_log)), low, high)

Equal probability mass per decade, following Jaderberg et al. 2017.
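
A sketch of the log-scale perturbation (scale, low, high are illustrative):

import numpy as np

def perturb_log_scale(theta, low, high, scale=0.2, rng=None):
    """Multiplicative perturbation: equal probability mass per decade."""
    rng = rng or np.random.default_rng()
    delta_log = scale * np.log(high / low)
    factor = np.exp(rng.uniform(-delta_log, delta_log))
    return float(np.clip(theta * factor, low, high))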

Detrended Volatility (RegimeDetector)

residuals    = scores − linear_regression(scores)
CV_detrended = std(residuals) / (|mean(scores)| + ε)

Avoids conflating a steadily improving score with stability.
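
In code (NumPy sketch of the detrending above):

import numpy as np

def detrended_cv(scores, eps=1e-8):
    """Coefficient of variation after removing the linear trend."""
    scores = np.asarray(scores, dtype=float)
    x = np.arange(len(scores))
    slope, intercept = np.polyfit(x, scores, 1)
    residuals = scores - (slope * x + intercept)
    return residuals.std() / (abs(scores.mean()) + eps)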

Gap-Slope Overfitting Signal (MetaController)

gapₜ = (train_scoreₜ − val_scoreₜ) / |val_scoreₜ|
slope = linear_regression_slope(gap₀, ..., gapₙ)
slope > threshold → PRUNE

Detects active overfitting progression before gap level alone triggers.
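
And the gap-slope signal as a NumPy sketch (compare the result against gap_slope_threshold):

import numpy as np

def overfitting_slope(train_scores, val_scores, eps=1e-8):
    """Slope of the normalized generalization gap over recent windows."""
    train = np.asarray(train_scores, dtype=float)
    val   = np.asarray(val_scores, dtype=float)
    gaps  = (train - val) / (np.abs(val) + eps)
    return np.polyfit(np.arange(len(gaps)), gaps, 1)[0]   # > threshold → PRUNE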


License

MIT — Copyright (c) 2026 sup3rus3r
