Reward / RL
Added in v0.22.0 — the final research seam of the roadmap.
A RewardModel turns an outcome into a scalar in [0, 1], and a deterministic UCB bandit over
candidate strategies learns which one earns the most reward for a task — online policy improvement from a
reward signal (the substrate for reinforcement learning).
for each round:
  bandit.select() → strategy      (UCB: try each once, then exploit the best)
    ──▶ run it → answer
    ──▶ reward_model.reward(...)  (0..1 scalar)
    ──▶ bandit.update()           (running mean)
          └─ best() = argmax mean
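The loop above can be sketched in a few lines of plain Python. This is a minimal illustration of textbook UCB1, not the library's implementation: the names `UCB1Bandit` and `run_loop`, the exploration constant `c=2.0`, and tie-breaking by list order are all assumptions made for the example.

```python
import math


class UCB1Bandit:
    """Illustrative deterministic UCB1 bandit (hypothetical, not the library class)."""

    def __init__(self, arms, c=2.0):
        self.arms = list(arms)
        self.c = c  # exploration constant; 2.0 is the textbook UCB1 value
        self.pulls = {arm: 0 for arm in self.arms}
        self.mean = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Phase 1: try every arm once, in list order.
        for arm in self.arms:
            if self.pulls[arm] == 0:
                return arm
        # Phase 2: pick the arm with the highest mean plus exploration bonus.
        total = sum(self.pulls.values())

        def score(arm):
            bonus = math.sqrt(self.c * math.log(total) / self.pulls[arm])
            return self.mean[arm] + bonus

        # max() keeps the first maximal arm, so ties break by list order
        # and no randomness is involved.
        return max(self.arms, key=score)

    def update(self, arm, reward):
        # Incremental running mean of the rewards seen for this arm.
        self.pulls[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.pulls[arm]

    def best(self):
        # argmax over the running means.
        return max(self.arms, key=lambda arm: self.mean[arm])


def run_loop(arms, runner, reward_fn, rounds):
    """select -> run -> reward -> update, repeated for `rounds` rounds."""
    bandit = UCB1Bandit(arms)
    for _ in range(rounds):
        arm = bandit.select()
        answer = runner(arm)
        bandit.update(arm, reward_fn(answer))
    return bandit
```

Because selection depends only on the pull counts and running means, two runs with the same arms, runner and reward function visit the arms in the same order.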
CLI
riptide rl "summarize the water cycle" --rounds 8 --offline
# learned strategy values for: summarize the water cycle
# reward 0.71 stepwise (pulls 4)
# reward 0.55 direct (pulls 2)
# ...
# BEST STRATEGY: stepwise
The arms are the diverse reasoning styles from deliberation; the bandit learns which style maximizes reward for this task.
In code
from riptide_watergraph.service import optimize_strategy_for_task
report = optimize_strategy_for_task("explain reciprocal rank fusion", rounds=8)
print(report.best, [(a.arm, a.mean_reward) for a in report.arms])
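To make the shape of the report concrete, here is a hedged stand-in that mirrors the attributes used above (`report.best`, `report.arms`, `a.arm`, `a.mean_reward`) and formats them like the CLI output. The dataclasses `ArmStat` and `StrategyReport`, the `pulls` field, and the helper `format_report` are assumptions for illustration; the real report type may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArmStat:
    arm: str
    mean_reward: float
    pulls: int  # assumed field, inferred from the "(pulls N)" CLI output


@dataclass
class StrategyReport:
    best: str
    arms: List[ArmStat]


def format_report(report: StrategyReport) -> str:
    """Render arms from highest to lowest mean reward, then the winner."""
    ranked = sorted(report.arms, key=lambda a: a.mean_reward, reverse=True)
    lines = [f"reward {a.mean_reward:.2f} {a.arm} (pulls {a.pulls})" for a in ranked]
    lines.append(f"BEST STRATEGY: {report.best}")
    return "\n".join(lines)
```

For example, a report holding `direct` at 0.55 and `stepwise` at 0.71 would list `stepwise` first and end with `BEST STRATEGY: stepwise`.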
Or drive the primitive directly with your own arms / runner / reward:
from riptide_watergraph.rl import optimize_strategy, HeuristicRewardModel

# `task` and `my_model` are placeholders you supply; they are not defined here.
report = optimize_strategy(
    task,
    arms=["concise", "detailed", "stepwise"],
    runner=lambda arm, t: my_model(arm, t),  # run a strategy on the task
    reward_model=HeuristicRewardModel(),
    rounds=10,
)
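If you want to experiment before wiring in a real model, the runner and reward can both be offline stand-ins. The sketch below is an assumption-laden toy: `KeywordOverlapReward`, its `reward(task, answer)` signature, and `fake_runner` are hypothetical names, and the scoring rule (fraction of task words found in the answer) is not the logic of the library's `HeuristicRewardModel`.

```python
class KeywordOverlapReward:
    """Toy offline reward: fraction of task words that appear in the answer."""

    def reward(self, task: str, answer: str) -> float:
        task_words = {w.lower() for w in task.split()}
        if not task_words:
            return 0.0
        # Strip trailing punctuation so "cycle." still matches "cycle".
        answer_words = {w.lower().strip(".,;:!?") for w in answer.split()}
        # The overlap is a subset of task_words, so the result stays in [0, 1].
        return len(task_words & answer_words) / len(task_words)


def fake_runner(arm: str, task: str) -> str:
    """Stand-in for my_model(arm, task): returns a canned answer per strategy."""
    templates = {
        "concise": "Short answer.",
        "detailed": f"A detailed answer about {task}.",
        "stepwise": f"Step 1: restate {task}. Step 2: answer it.",
    }
    return templates[arm]
```

With these two pieces, every call is deterministic and needs no network access, which makes it easy to check that the bandit's choice follows the rewards you expect.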
The seam (swappable)
| Interface | Default | Purpose |
|---|---|---|
| RewardModel | HeuristicRewardModel (offline) / LLMRewardModel | score an outcome as a scalar reward |
| Bandit | UCB1 (deterministic) | balance exploring arms vs. exploiting the best |
| optimize_strategy(...) | - | run the reward-driven bandit loop → StrategyReport |
The Bandit is deterministic (UCB1, no randomness), so runs are reproducible, and the
HeuristicRewardModel is offline, so the whole RL loop runs at 100% coverage without an API key.
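One way to picture the swappable seam is as a pair of structural interfaces. The sketch below uses `typing.Protocol`; the method signatures are assumptions inferred from the calls named in the loop (`reward`, `select`, `update`, `best`), and `LengthReward` and `clamped_reward` are hypothetical helpers, not part of the library.

```python
from typing import Protocol


class RewardModel(Protocol):
    # Assumed signature: score an answer to a task as a scalar reward.
    def reward(self, task: str, answer: str) -> float: ...


class Bandit(Protocol):
    # Assumed signatures, mirroring the select/update/best calls in the loop.
    def select(self) -> str: ...
    def update(self, arm: str, reward: float) -> None: ...
    def best(self) -> str: ...


class LengthReward:
    """Offline stand-in: longer answers score higher, capped at a target length."""

    def __init__(self, target_chars: int = 200) -> None:
        self.target_chars = target_chars

    def reward(self, task: str, answer: str) -> float:
        return min(len(answer) / self.target_chars, 1.0)


def clamped_reward(model: RewardModel, task: str, answer: str) -> float:
    """Call any RewardModel and clamp the result into [0, 1]."""
    return max(0.0, min(1.0, model.reward(task, answer)))
```

Any object with a matching `reward` method satisfies `RewardModel` without inheriting from it, which is what lets an offline heuristic and an LLM-backed scorer be exchanged freely.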
Honest scope
This is not policy-gradient RL or weight updates — that's a model-training problem. It's a bandit over strategies with a reward signal: the framework-level slice of RL (online selection of the highest-reward approach). It completes the roadmap's research seams alongside multimodal perception and environments.