
dial

Online weight optimization via Thompson Sampling. Learns optimal configurations from outcome feedback — no grid search, no manual tuning. Converges in ~50 observations. +41% NDCG@5 over fixed-weight baselines in controlled experiments.

Python 3.11+ · MIT License

pip install kusp-dial

Quick start

from thompson_bandits import ThompsonBandit, InMemoryStore

store = InMemoryStore(arm_ids=["relevance_heavy", "balanced", "recency_heavy"])
bandit = ThompsonBandit(store)

# Run the loop: select → observe → update
for query in queries:
    arm = bandit.select()
    reward = run_query(query, strategy=arm)
    bandit.update(arm, reward=reward)

print(bandit.get_summary())

After 50 iterations:

BanditSummary(
  best_arm='relevance_heavy',
  total_pulls=50,
  arms=[
    ArmSummary(arm_id='balanced',        mean=0.5765, pulls=11),
    ArmSummary(arm_id='recency_heavy',   mean=0.4210, pulls=8),
    ArmSummary(arm_id='relevance_heavy', mean=0.8903, pulls=31),
  ]
)

The bandit explores all three options early, then converges — 31 of 50 pulls on the winner, without you telling it which arm is best.

Why Dial?

vs. grid search / random search — Those require batch experiments upfront: grid search runs every combination, random search a large sample of them. Dial learns online, one observation at a time. No batch experiments needed.

vs. manual tuning — Manual weights are a guess that stays frozen. Dial adapts when the best option shifts — user behavior drifts, data distributions change, what worked in January fails in March.

vs. contextual bandits (LinUCB, neural) — Those need feature engineering and thousands of observations. Dial works with 50 observations and zero features. Start with Dial; graduate to contextual bandits when you have the data to justify them.

vs. Bayesian optimization (Optuna, Ax) — Those optimize over continuous parameter spaces. Dial optimizes over discrete options (strategies, presets, model choices). Different problem shape.

Use cases

  • Retrieval weight tuning — learn the optimal blend of relevance, recency, and importance for RAG systems
  • Model routing — discover which LLM performs best for different query types
  • Prompt selection — A/B test prompt variants with automatic convergence
  • Feature flag rollout — promote variants based on measured outcomes
  • Any multi-option decision where you can observe a reward signal
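
To make the model-routing case concrete, here is a standalone simulation of the same select → observe → update loop (plain Python, no dial dependency; the router targets and their success rates are invented for illustration):

```python
import random

random.seed(42)

# Hypothetical routing targets with latent success rates the bandit must discover.
true_rates = {"gpt_small": 0.55, "gpt_large": 0.80, "local_llm": 0.40}

# One Beta(1, 1) posterior per arm, stored as [alpha, beta] pseudo-counts.
posteriors = {arm: [1.0, 1.0] for arm in true_rates}
pulls = {arm: 0 for arm in true_rates}

for _ in range(300):
    # Select: draw one sample from each posterior, route to the argmax.
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    arm = max(samples, key=samples.get)
    # Observe: simulate a binary outcome for the routed query.
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    # Update: fold the outcome back into the pseudo-counts.
    posteriors[arm][0] += reward
    posteriors[arm][1] += 1.0 - reward
    pulls[arm] += 1

best = max(pulls, key=pulls.get)  # most-routed target so far
```

The pull counts concentrate on the highest-rate target for the same reason as in the quick-start run: arms that keep paying off produce higher posterior samples, so they win the argmax more often.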

Features

  • Beta posteriors — each arm maintains a Beta(alpha, beta) distribution updated with observed rewards
  • Discounted Thompson Sampling — optional decay factor for non-stationary environments where the best arm shifts over time
  • Cost-aware rewards — built-in cost_aware_reward() scales outcomes by resource efficiency
  • Pluggable storage — InMemoryStore for testing, SQLiteStore for persistence, or implement the ArmStore protocol for anything else
  • Zero SQLite dependency in core — bandit logic talks only to the ArmStore protocol
  • Type-safe — full annotations, runtime_checkable Protocol
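
The Beta-posterior bookkeeping is small enough to show directly. A standalone sketch of the statistics involved (the update rule shown — splitting a fractional reward into alpha/beta pseudo-counts — is one common convention, not necessarily dial's exact internals):

```python
# Beta(alpha, beta): alpha counts evidence for success, beta for failure.
alpha, beta = 1.0, 1.0  # uniform Beta(1, 1) prior, mean 0.5

for reward in [0.9, 0.8, 1.0, 0.7]:  # four mostly-good outcomes
    alpha += reward
    beta += 1.0 - reward

mean = alpha / (alpha + beta)  # posterior mean success rate
# Variance shrinks as evidence accumulates, so posterior samples concentrate.
variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
```

That shrinking variance is what turns early exploration into later exploitation: well-observed arms produce tightly clustered samples, so a confidently good arm stops losing the argmax to noise.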

Storage backends

In-memory (ephemeral)

from thompson_bandits import InMemoryStore

store = InMemoryStore(arm_ids=["a", "b", "c"], prior_alpha=1.0, prior_beta=1.0)

SQLite (persistent)

from thompson_bandits import SQLiteStore

# From a file path (store owns the connection)
store = SQLiteStore.from_path("bandits.db", arm_ids=["a", "b", "c"])

# From an existing connection (you own the connection)
import sqlite3
conn = sqlite3.connect("bandits.db")
store = SQLiteStore(conn, arm_ids=["a", "b", "c"])

Custom storage

Implement the ArmStore protocol — any class with the right methods works, no inheritance required:

from thompson_bandits import ArmStore, ArmStats

class RedisStore:
    def get_stats(self, arm_id: str) -> ArmStats | None: ...
    def update_stats(self, arm_id: str, alpha_delta: float, beta_delta: float, reward: float) -> None: ...
    def get_all_arms(self) -> list[ArmStats]: ...
    def decay(self, arm_id: str, factor: float) -> None: ...

Non-stationary environments

When the best option changes over time, enable discounting:

from thompson_bandits import ThompsonBandit, InMemoryStore, BanditConfig

config = BanditConfig(discount=0.95)  # decay factor in (0, 1)
bandit = ThompsonBandit(store, config=config)

Before each update, existing evidence is decayed by the discount factor. Recent observations carry more weight than old ones.
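
The arithmetic is worth seeing once. A standalone sketch of a discounted update (whether dial decays toward the prior or toward zero is an internal detail; decaying only the evidence above the Beta(1, 1) prior is one reasonable choice, shown here):

```python
def discounted_update(alpha: float, beta: float,
                      reward: float, discount: float) -> tuple[float, float]:
    # Decay the existing evidence (the counts above Beta(1, 1)) first...
    alpha = 1.0 + (alpha - 1.0) * discount
    beta = 1.0 + (beta - 1.0) * discount
    # ...then add the fresh observation at full weight.
    return alpha + reward, beta + (1.0 - reward)


alpha, beta = 9.0, 3.0  # accumulated evidence from past observations
alpha, beta = discounted_update(alpha, beta, reward=1.0, discount=0.95)
```

With discount = 0.95 the evidence total can never grow past roughly 1 / (1 - 0.95) = 20 observation-equivalents, which is what keeps the posterior responsive when the best arm shifts.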

Cost-aware optimization

When options have different costs (tokens, latency, dollars), scale rewards accordingly:

from thompson_bandits import cost_aware_reward

raw_reward = 0.9
token_cost = 1500
baseline_cost = 1000

adjusted = cost_aware_reward(raw_reward, cost=token_cost, baseline_cost=baseline_cost)
bandit.update(arm, reward=adjusted)
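
The exact formula cost_aware_reward uses is not shown here; a plausible version (an assumption for illustration, not dial's documented behavior) scales the raw reward by relative efficiency and clips to [0, 1]:

```python
def cost_aware_reward_sketch(raw_reward: float, cost: float,
                             baseline_cost: float) -> float:
    # Cheaper than baseline boosts the reward, costlier dampens it.
    efficiency = baseline_cost / cost
    return max(0.0, min(1.0, raw_reward * efficiency))


# 50% over budget drags a 0.9 outcome down to 0.6.
adjusted = cost_aware_reward_sketch(0.9, cost=1500, baseline_cost=1000)
```

The clip matters: rewards must stay in [0, 1] for the Beta pseudo-count update to remain well-formed.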

Inspecting state

summary = bandit.get_summary()
print(summary.best_arm)      # 'relevance_heavy'
print(summary.total_pulls)   # 50

for arm in summary.arms:
    print(f"{arm.arm_id}: mean={arm.mean:.3f}, pulls={arm.pulls}")
# balanced:        mean=0.577, pulls=11
# recency_heavy:   mean=0.421, pulls=8
# relevance_heavy: mean=0.890, pulls=31

Warm-start transfer

When you have prior knowledge (from a previous experiment, a related task, or domain expertise), encode it as informative priors instead of starting from uniform:

from thompson_bandits import ThompsonBandit, InMemoryStore, BanditConfig

# Previous experiment found relevance_heavy won ~63% of pulls.
# Encode that as Beta(6.3, 3.7) instead of the default Beta(1, 1).
config = BanditConfig(prior_alpha=1.0, prior_beta=1.0)  # uniform default for the other arms
store = InMemoryStore(arm_ids=["relevance_heavy", "balanced", "recency_heavy"])

# Override priors for the arm with known history
arm = store.get_stats("relevance_heavy")
arm.alpha = 6.3
arm.beta = 3.7

bandit = ThompsonBandit(store, config=config)

The bandit starts biased toward the prior winner but remains open to switching if the data disagrees. With shrinkage (e.g., scaling the prior by 0.15), the prior influence fades within ~20 observations.
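
The shrinkage mentioned above is just scaling the prior's pseudo-counts before installing them. A quick sketch of the arithmetic (the 0.15 factor and the Beta(6.3, 3.7) prior come from the example above):

```python
shrinkage = 0.15
prior_alpha, prior_beta = 6.3, 3.7  # full-strength transferred prior

# Shrink the evidence above Beta(1, 1) so the prior is a nudge, not a verdict.
alpha = 1.0 + (prior_alpha - 1.0) * shrinkage
beta = 1.0 + (prior_beta - 1.0) * shrinkage

# Extra evidence the shrunk prior contributes, in observation-equivalents.
extra = (alpha - 1.0) + (beta - 1.0)
```

The shrunk prior carries only about one observation's worth of extra evidence, so a couple of dozen real observations dominate it — consistent with the prior's influence fading within roughly 20 observations.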

Research

Dial extracts the Thompson Sampling engine from a research experiment on gradient-free retrieval weight learning. The experiment ran 1,200 episodes across 4 conditions on a $50/month API budget.

Citation (BibTeX)
@article{dirocco2026gradient,
  title   = {Gradient-Free Retrieval Weight Learning via Thompson Sampling
             with LLM Self-Assessment},
  author  = {DiRocco, Alfonso},
  year    = {2026},
  url     = {https://github.com/kusp-dev/retrieval-weight-experiment},
  note    = {1,200 episodes, 4 conditions, +41\% NDCG@5 over fixed baselines}
}

Development

git clone https://github.com/fonz-ai/dial.git
cd dial
uv sync --extra dev
uv run pytest tests/ -v
uv run ruff check src/ tests/

License

MIT
