Why SuperDialog is easy to test
SuperDialog is text in, text out. There is no audio to mock, no WebRTC room to spin up, no telephony to stub. Every dialog is a Python function that takes a string and returns a string. This is the killer feature vs. voice-coupled frameworks where tests need audio fixtures.
Setup
anyio as specified in the project guidelines.
Offline tests with scripted LLMs
The Playbook engine separates the Talker (StreamsLLM) and Director
(CompletesLLM) seams, so you can run a conversation with no network by
constructing PlaybookAgent with stub LLMs and asserting on
agent.runtime.state - slots, checkpoint, ended/outcome.
agent.runtime.state is the folded ConversationState: slot_value(key),
confirmed(keys), checkpoint_id, ended, outcome.
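The offline pattern described above can be sketched without importing SuperDialog at all. Everything in this block is a hypothetical stand-in (ScriptedTalker, ScriptedDirector, ToyAgent, ToyState and their methods are illustrative names, not the library's API); the real PlaybookAgent constructor and state accessors may look different. The point is only the shape of the test: canned LLM responses in, assertions on folded state out.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


class ScriptedTalker:
    """Hypothetical streaming-LLM stub: yields one canned reply per turn, in chunks."""

    def __init__(self, replies: list) -> None:
        self._replies = iter(replies)

    def stream(self, prompt: str) -> Iterator[str]:
        # Emit the next canned reply word by word to mimic streaming.
        for word in next(self._replies).split(" "):
            yield word + " "


class ScriptedDirector:
    """Hypothetical completion-LLM stub: returns one canned decision per turn."""

    def __init__(self, decisions: list) -> None:
        self._decisions = iter(decisions)

    def complete(self, prompt: str) -> dict:
        return next(self._decisions)


@dataclass
class ToyState:
    slots: dict = field(default_factory=dict)
    checkpoint_id: str = "start"
    ended: bool = False
    outcome: Optional[str] = None


class ToyAgent:
    """Toy agent loop (an assumption, not PlaybookAgent): folds decisions into state."""

    def __init__(self, talker: ScriptedTalker, director: ScriptedDirector) -> None:
        self.talker, self.director, self.state = talker, director, ToyState()

    def respond(self, user_text: str) -> str:
        decision = self.director.complete(user_text)
        self.state.slots.update(decision.get("slots", {}))
        self.state.checkpoint_id = decision.get("checkpoint", self.state.checkpoint_id)
        self.state.ended = decision.get("ended", False)
        self.state.outcome = decision.get("outcome")
        return "".join(self.talker.stream(user_text)).strip()


def test_collects_name_offline() -> None:
    agent = ToyAgent(
        ScriptedTalker(["Thanks, Ada. You are all set."]),
        ScriptedDirector([{
            "slots": {"name": "Ada"}, "checkpoint": "done",
            "ended": True, "outcome": "completed",
        }]),
    )
    reply = agent.respond("My name is Ada")
    assert reply == "Thanks, Ada. You are all set."
    assert agent.state.slots["name"] == "Ada"
    assert agent.state.checkpoint_id == "done"
    assert agent.state.ended and agent.state.outcome == "completed"
```

Because both stubs are deterministic, the test needs no network and produces the same result on every run.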
Live smoke test through the entry point
For an end-to-end check against a real model, use the public entry point with a cheap, fast model:
Replay - regression without re-running models
Because the event log is the source of truth, you can re-run the Director over a recorded session under a (possibly edited) playbook and diff its decisions - LLM-free regression evidence for prompt or model changes:
Persona evals
Drive scripted personas through a fresh agent per run and score completion, slot accuracy, and turns-per-checkpoint:
superdialog optimize --playbook kyc.yaml runs paired evals and proposes prose-only improvements.
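The three persona metrics named above (completion, slot accuracy, turns-per-checkpoint) could be computed along these lines. This is a minimal sketch under stated assumptions: PersonaRun and score are hypothetical names introduced for illustration, and the exact metric definitions SuperDialog uses may differ.

```python
from dataclasses import dataclass


@dataclass
class PersonaRun:
    """Hypothetical record of one persona conversation (not a SuperDialog type)."""
    completed: bool
    expected_slots: dict
    actual_slots: dict
    turns: int
    checkpoints_reached: int


def score(runs: list) -> dict:
    """Aggregate completion rate, slot accuracy, and turns per checkpoint."""
    if not runs:
        raise ValueError("need at least one run to score")
    total_expected = sum(len(r.expected_slots) for r in runs)
    # A slot counts as correct only if the captured value matches exactly.
    correct = sum(
        1
        for r in runs
        for key, value in r.expected_slots.items()
        if r.actual_slots.get(key) == value
    )
    total_checkpoints = sum(r.checkpoints_reached for r in runs)
    total_turns = sum(r.turns for r in runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "slot_accuracy": correct / total_expected if total_expected else 1.0,
        "turns_per_checkpoint": (
            total_turns / total_checkpoints if total_checkpoints else float("inf")
        ),
    }


runs = [
    PersonaRun(True, {"name": "Ada", "dob": "1990-01-01"},
               {"name": "Ada", "dob": "1990-01-01"}, turns=6, checkpoints_reached=3),
    PersonaRun(False, {"name": "Bob", "dob": "1985-05-05"},
               {"name": "Bob"}, turns=4, checkpoints_reached=1),
]
report = score(runs)
```

Here one of two runs completed, three of four expected slots were captured, and ten turns covered four checkpoints, so the report holds 0.5, 0.75, and 2.5 respectively.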
Testing tools
Legacy graph engine
Same pattern with engine="flow", asserting on the machine’s state property
(which returns {"node_id": ..., "slots": ...} on the graph engine):
Set traversal_dir on a graph-engine machine to capture each completed session
as JSON for an eval corpus.
Tips
- Use a cheap model (claude-haiku-4-5) in live tests to keep costs and latency low
- Use scripted Talker/Director LLMs for deterministic, offline assertions
- Keep playbook YAML / flow JSON in version control so tests are reproducible
- Build a corpus from recorded event logs, then use replay / run_eval / superdialog eval for regression
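
The replay-and-diff idea behind the last tip can be illustrated with a toy comparison. The event shape, the diff_decisions helper, and the edited_rules function below are all assumptions made for this sketch; they are not SuperDialog's replay API or its event log format. The sketch only shows the general technique: feed recorded user turns to a deterministic decision function and report where its decisions diverge from the recorded ones.

```python
from typing import Callable

# Assumed event shape for illustration: {"turn": int, "user": str, "decision": str}.
Event = dict


def diff_decisions(log: list, decide: Callable[[str], str]) -> list:
    """Re-run `decide` over recorded user turns and collect changed decisions."""
    changes = []
    for event in log:
        replayed = decide(event["user"])
        if replayed != event["decision"]:
            changes.append({
                "turn": event["turn"],
                "recorded": event["decision"],
                "replayed": replayed,
            })
    return changes


recorded = [
    {"turn": 1, "user": "hi", "decision": "greet"},
    {"turn": 2, "user": "my name is Ada", "decision": "ask_name"},
]


def edited_rules(user_text: str) -> str:
    # Toy deterministic stand-in for decisions under an edited playbook.
    return "capture_name" if "name is" in user_text else "greet"


changes = diff_decisions(recorded, edited_rules)
```

An empty list of changes means the edit did not alter any recorded decision; a non-empty list pinpoints the turns to review.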