Self-improvement (measured prompt optimization)

Added in v0.18.0 — track 4 of the AGI-direction roadmap.

The agent can rewrite its own instructions, but it adopts a change only when the gain can be measured. Given a base prompt and a handful of labelled examples, the optimizer proposes variants, scores each on the examples, and keeps a variant only if it strictly beats the base. This is bounded, verified recursive self-improvement in the style of DSPy, TextGrad, and STOP: no unmeasured change ever ships.

base prompt ──▶ propose variants ──▶ score each on examples ──▶ keep the best (only if it beats base)
                (TemplateProposer /    (SubstringScorer /
                 LLMPromptProposer)      ExactMatchScorer)
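
The loop in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the riptide_watergraph implementation: `sketch_optimize`, `SketchExample`, `SketchResult`, `toy_runner`, `toy_propose`, and `substring_score` are all hypothetical names, and the toy runner is a stand-in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SketchExample:
    input: str
    expected: str


@dataclass
class SketchResult:
    base_score: float
    best_score: float
    best_prompt: str
    improved: bool
    candidate_scores: Dict[str, float] = field(default_factory=dict)


def mean_score(prompt, examples, runner, scorer) -> float:
    """Average the scorer's verdict over all labelled examples."""
    if not examples:
        return 0.0
    total = sum(scorer(runner(prompt, ex.input), ex.expected) for ex in examples)
    return total / len(examples)


def sketch_optimize(
    base_prompt: str,
    examples: List[SketchExample],
    runner: Callable[[str, str], str],
    propose: Callable[[str, int], List[str]],
    scorer: Callable[[str, str], float],
    candidates: int = 4,
) -> SketchResult:
    base_score = mean_score(base_prompt, examples, runner, scorer)
    best_prompt, best_score = base_prompt, base_score
    scores: Dict[str, float] = {}
    for variant in propose(base_prompt, candidates):
        score = mean_score(variant, examples, runner, scorer)
        scores[variant] = score
        if score > best_score:  # strict comparison: a tie keeps the current best
            best_prompt, best_score = variant, score
    return SketchResult(base_score, best_score, best_prompt,
                        best_score > base_score, scores)


def toy_runner(prompt: str, inp: str) -> str:
    # Stand-in "model": it only computes the sum when asked to reason step by step.
    if "step by step" in prompt:
        left, right = inp.split("+")
        return str(int(left) + int(right))
    return "unsure"


def toy_propose(prompt: str, n: int) -> List[str]:
    # Deterministic, template-style rewrites: append a fixed instruction.
    suffixes = [
        "Be concise.",
        "Think step by step before giving the final answer.",
        "Answer with a number only.",
    ]
    return [f"{prompt}\n\n{suffix}" for suffix in suffixes[:n]]


def substring_score(prediction: str, expected: str) -> float:
    return 1.0 if expected in prediction else 0.0


examples = [SketchExample("2 + 2", "4"), SketchExample("3 + 5", "8")]
result = sketch_optimize("Answer the arithmetic question.", examples,
                         toy_runner, toy_propose, substring_score)
print(result.base_score, "->", result.best_score,
      "improved" if result.improved else "kept base")
```

The strict `>` is the safety property: when no variant scores higher than the base, the base prompt is returned unchanged and `improved` is false.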

CLI

echo '[{"input": "2 + 2", "expected": "4"}, {"input": "3 + 5", "expected": "8"}]' > ex.json
riptide improve "Answer the arithmetic question." --examples ex.json --candidates 4 --out best.txt
#  base score: 0.50
#  best score: 1.00 (improved)
#  wrote improved prompt to best.txt
#
#  BEST PROMPT
#  Answer the arithmetic question.
#
#  Think step by step before giving the final answer.

In code

from riptide_watergraph.service import improve_prompt
from riptide_watergraph.optimize import Example

result = improve_prompt(
    "Answer the question.",
    [Example(input="capital of France?", expected="Paris")],
    candidates=4,
)
print(result.base_score, "->", result.best_score, "improved" if result.improved else "kept base")
print(result.best_prompt)

Or drive the primitive directly with your own runner/proposer/scorer:

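from riptide_watergraph.optimize import Example, optimize_prompt, TemplateProposer, SubstringScorer

base_prompt = "Answer the question."
examples = [Example(input="capital of France?", expected="Paris")]

result = optimize_prompt(
    base_prompt, examples,
    # runner: how a prompt is executed on an input; my_model is a placeholder for your own model call
    runner=lambda prompt, inp: my_model(prompt, inp),
    proposer=TemplateProposer(), scorer=SubstringScorer(), candidates=5,
)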
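
The runner in the call above looks like a plain callable from (prompt, input) to the model's output text. As a minimal sketch, assuming that contract holds, a deterministic stub can stand in for `my_model` while testing the plumbing offline. `Runner` and `make_stub_runner` are hypothetical names and are not part of riptide_watergraph.

```python
from typing import Callable, Dict

# Assumed contract, inferred from the lambda above: (prompt, input) -> output text.
Runner = Callable[[str, str], str]


def make_stub_runner(canned: Dict[str, str], default: str = "") -> Runner:
    """Build an offline runner that looks answers up in a dict."""
    def run(prompt: str, inp: str) -> str:
        # The prompt is ignored here; a real runner would send prompt + input to a model.
        return canned.get(inp, default)
    return run


stub = make_stub_runner({"capital of France?": "The capital is Paris."})
print(stub("Answer the question.", "capital of France?"))
```

Because this stub ignores the prompt, every variant scores the same as the base, so an optimizer that requires a strict improvement should keep the base prompt. That makes it a convenient check that nothing is adopted without a measured gain.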

The seam (swappable)

| Interface | Default | Purpose |
| --- | --- | --- |
| Proposer | TemplateProposer (offline) / LLMPromptProposer | propose rewrites of a prompt |
| Scorer | SubstringScorer / ExactMatchScorer | measure a prediction vs the expected answer |
| optimize_prompt(...) | | base vs candidates → adopt only a strictly-better variant |
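
The two scorer names suggest two levels of strictness. The sketch below illustrates the difference with hypothetical stand-alone functions. The real SubstringScorer and ExactMatchScorer may normalize text differently, and the whitespace stripping here is an assumption.

```python
def substring_score(prediction: str, expected: str) -> float:
    """Lenient: 1.0 if the expected answer appears anywhere in the prediction."""
    return 1.0 if expected.strip() in prediction else 0.0


def exact_match_score(prediction: str, expected: str) -> float:
    """Strict: 1.0 only if the prediction equals the expected answer (modulo outer whitespace)."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


verbose = "Let me think. 2 + 2 = 4"
terse = " 4 "
wrong = "14"

for name, fn in [("substring", substring_score), ("exact", exact_match_score)]:
    print(name, fn(verbose, "4"), fn(terse, "4"), fn(wrong, "4"))
```

Substring matching tolerates chain-of-thought style output but can give false positives (here "14" contains "4"), while exact matching rejects any extra text. Which one fits depends on how constrained the expected answers are.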

OptimizationResult reports base_score, best_score, the improved flag, and every candidate's score, so the gain (or lack of one) is always auditable. Offline, the deterministic TemplateProposer and SubstringScorer let the whole loop run without an API key.

Roadmap context

Track 4 of the AGI-direction roadmap (after SkillForge, cognitive memory, and deliberate reasoning). The same Scorer/eval seam feeds what's next: optimizing role prompts and the composer policy against the eval suite, and an autonomy loop with a self-generated curriculum.