Evaluation¶
The research consensus is to run your own evals rather than trust vendor benchmarks.
Offline (deterministic)¶
This runs a deterministic task suite through the full graph and scores:
- pass rate
- single-vs-swarm routing
- guardrail blocking
- tool-call validity
- a self-learning recall probe (does a repeated task recall a prior lesson?)
so behavior is measurable and regressions fail CI (4/4 = 100%).
Against a real model¶
Everything above runs offline with the deterministic DemoGateway. To run against a live LLM, install a
gateway extra, set a key, and pick a model:
pip install "riptide-watergraph[litellm]"
export OPENAI_API_KEY=sk-... # or ANTHROPIC_API_KEY
export RIPTIDE_WATERGRAPH_MODEL=gpt-4o-mini # any LiteLLM model string
riptide eval # score the suite against the live model
python examples/real_model_chat.py "What is the capital of France?"
python examples/real_model_eval.py
The live path swaps in LiteLLMGateway wrapped in ResilientGateway (timeouts + retries) — no code
change, just env.
In tests¶
tests/test_eval_real.pycovers the wiring offline (against a faked LiteLLM boundary) so the 100% coverage gate holds without a key.tests/test_eval_real_smoke.pyis skip-guarded: it runs the full suite end-to-end only when an API key is present, and is skipped in CI.