Evals — regression-test skills with deterministic LLM stubs
Tweaking a PromptSkill, adjusting agent instructions, or adding a tool can silently break previously-working cases. Evals freeze expected behaviour in a spec, run on every build, and catch regressions immediately.
Design
- defineEval(spec): declare an eval.
- runEvals(specs, { llm }): run a batch and return an EvalReport.
- MockLLM: deterministic LLM provider; give it a sequence of responses (or a function) and it returns them in order.
- mockToolCall(tool, args): produce a CompleteResult that emits a tool call.
Evals use a trace pattern: while running the skill, every LLM call, tool call, subtitle, surface, and error is recorded; assertions run against the trace.
EvalSpec
import { defineEval, runEvals, MockLLM, mockToolCall } from '@perhapxin/dddk';
import type { EvalSpec, EvalAssertion, EvalTrace } from '@perhapxin/dddk';
interface EvalSpec {
  name: string;
  skill: Skill;
  userInput: string;
  vars?: Record<string, string>;
  assertions: EvalAssertion[];
  skip?: string;
}
| Assertion kind | Passes when |
|---|---|
| { kind: 'includes', substring } | trace.text or any subtitle contains the substring |
| { kind: 'matches', pattern: RegExp } | Same, but with a regex |
| { kind: 'callsTool', tool } | A matching tool call appears in the trace |
| { kind: 'doesNotCallTool', tool } | No matching tool call appears |
| { kind: 'meetsCriteria', description, check(trace) => boolean } | Custom predicate |
EvalTrace exposes:
interface EvalTrace {
  text: string;
  toolCalls: Array<{ tool: string; args: Record<string, unknown> }>;
  subtitles: string[];
  surfaces: unknown[];
  errors: string[];
  llmCalls: number;
}
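To make the assertion table concrete, here is a minimal sketch of how each assertion kind could be checked against a trace. It is not the library's implementation: the types are local stand-ins that mirror the shapes documented above, checkAssertion is a hypothetical helper, and the sample trace is hand-written (in particular, it assumes the show_subtitle text also lands in trace.subtitles).

```typescript
// Local stand-ins mirroring the documented shapes; nothing here is imported from the library.
interface EvalTrace {
  text: string;
  toolCalls: Array<{ tool: string; args: Record<string, unknown> }>;
  subtitles: string[];
  surfaces: unknown[];
  errors: string[];
  llmCalls: number;
}

type EvalAssertion =
  | { kind: 'includes'; substring: string }
  | { kind: 'matches'; pattern: RegExp }
  | { kind: 'callsTool'; tool: string }
  | { kind: 'doesNotCallTool'; tool: string }
  | { kind: 'meetsCriteria'; description: string; check: (trace: EvalTrace) => boolean };

// Hypothetical helper: evaluates one assertion against a recorded trace.
function checkAssertion(a: EvalAssertion, trace: EvalTrace): boolean {
  // 'includes' and 'matches' look at trace.text and every subtitle.
  const haystacks = [trace.text, ...trace.subtitles];
  switch (a.kind) {
    case 'includes': {
      const needle = a.substring;
      return haystacks.some((s) => s.includes(needle));
    }
    case 'matches': {
      const re = a.pattern;
      return haystacks.some((s) => {
        re.lastIndex = 0; // guard against stateful /g or /y patterns
        return re.test(s);
      });
    }
    case 'callsTool': {
      const tool = a.tool;
      return trace.toolCalls.some((c) => c.tool === tool);
    }
    case 'doesNotCallTool': {
      const tool = a.tool;
      return trace.toolCalls.every((c) => c.tool !== tool);
    }
    case 'meetsCriteria':
      return a.check(trace);
  }
}

// Hand-written trace for illustration only.
const sampleTrace: EvalTrace = {
  text: 'Hello',
  toolCalls: [{ tool: 'show_subtitle', args: { text: 'Translation complete: Hello' } }],
  subtitles: ['Translation complete: Hello'],
  surfaces: [],
  errors: [],
  llmCalls: 1,
};

console.log(checkAssertion({ kind: 'includes', substring: 'complete' }, sampleTrace)); // true
console.log(checkAssertion({ kind: 'callsTool', tool: 'navigate' }, sampleTrace)); // false
```

A meetsCriteria check receives the same trace object, so a predicate such as (t) => t.llmCalls === 1 && t.errors.length === 0 can express conditions the built-in kinds do not cover.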
Example: three evals guarding one skill
Suppose we have a translate PromptSkill:
import type { PromptSkill } from '@perhapxin/dddk';
const translate: PromptSkill = {
  id: 'translate',
  type: 'prompt',
  name: 'Translate',
  prompt: 'Translate the user message into {{language}}. When done, call show_subtitle to inform the user that the translation is complete. If the user message is empty, call show_subtitle to ask what they want translated.',
};
Three evals to hold the behaviour:
const evals: EvalSpec[] = [
  defineEval({
    name: 'translate: empty input prompts the user',
    skill: translate,
    userInput: '',
    vars: { language: 'English' },
    assertions: [
      { kind: 'callsTool', tool: 'show_subtitle' },
      { kind: 'includes', substring: 'translate' },
    ],
  }),
  defineEval({
    name: 'translate: normal input emits "done" subtitle',
    skill: translate,
    userInput: '你好',
    vars: { language: 'English' },
    assertions: [
      { kind: 'callsTool', tool: 'show_subtitle' },
      { kind: 'includes', substring: 'complete' },
    ],
  }),
  defineEval({
    name: 'translate: should not invoke other tools',
    skill: translate,
    userInput: '你好',
    vars: { language: 'English' },
    assertions: [
      { kind: 'doesNotCallTool', tool: 'navigate' },
      { kind: 'doesNotCallTool', tool: 'agent' },
    ],
  }),
];
Run it:
const llm = new MockLLM({
  responses: [
    // eval #1: empty input → prompt
    { ...mockToolCall('show_subtitle', { text: 'Please type what to translate' }), content: '' },
    // eval #2: "你好" translated to English → done subtitle
    { ...mockToolCall('show_subtitle', { text: 'Translation complete: Hello' }), content: 'Hello' },
    // eval #3: same as #2
    { ...mockToolCall('show_subtitle', { text: 'Translation complete: Hello' }), content: 'Hello' },
  ],
});

const report = await runEvals(evals, { llm, verbose: true });
console.log(`${report.passed}/${report.total} passed`);
if (report.failures > 0) process.exit(1);
MockLLM
new MockLLM({
  responses: [
    'plain text', // → { content: 'plain text', finishReason: 'stop' }
    { content: 'hi', usage: { input: 10, output: 2 } }, // partial CompleteResult
    mockToolCall('search_catalog', { query: 'shirt' }), // a tool call
  ],
});
Functions are also supported for conditional responses:
new MockLLM({
  responses: (opts, callIndex) => {
    const lastUserMsg = opts.messages.findLast((m) => m.role === 'user')?.content ?? '';
    if (typeof lastUserMsg === 'string' && lastUserMsg.includes('refund')) {
      return mockToolCall('qa_lookup', { question: lastUserMsg });
    }
    return { content: `OK ${callIndex}` };
  },
});
In array mode the last entry repeats — calls beyond the array length all return the last item.
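The repeat-last rule amounts to clamping the call index to the final array position. A minimal sketch of that rule, assuming 0-based call indices; pickResponse is a hypothetical helper written for illustration rather than MockLLM's actual code, and throwing on an empty array is an assumption of this sketch.

```typescript
// Hypothetical helper, not part of the library: returns the scripted response for the
// Nth call (0-based), repeating the last entry once the array is exhausted.
function pickResponse<T>(responses: readonly T[], callIndex: number): T {
  if (responses.length === 0) {
    // Assumption: an empty script is treated as a configuration error.
    throw new Error('at least one scripted response is required');
  }
  const index = Math.min(callIndex, responses.length - 1);
  return responses[index];
}

const scripted = ['first', 'second', 'last'];
const served = [0, 1, 2, 3, 4].map((i) => pickResponse(scripted, i));
console.log(served.join(', ')); // first, second, last, last, last
```

One practical consequence: when a single MockLLM is shared across a batch, the array order has to line up with the order in which the evals make their LLM calls, and any extra call silently receives the final entry.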
Runner
const report = await runEvals(evals, {
  llm, // required
  toolMocks: { // optional: mock ctx.llm and other ambient helpers
    llm: (prompt) => `mocked llm: ${prompt}`,
  },
  verbose: true, // default true; prints each result
});
interface EvalReport {
  total: number;
  passed: number;
  failed: number;
  skipped: number;
  failures: number; // alias of failed; convenient for CI
  results: EvalResult[]; // each EvalResult carries trace + failures[]
}
The failures alias exists for CI ergonomics (if (report.failures > 0) process.exit(1) reads more naturally).
How each skill type is driven
| skill.type | What runEvals does |
|---|---|
| script | No LLM call; collects each step's subtitle into the trace. |
| prompt | Substitutes vars into prompt, then makes one llm.complete({ system, user: userInput }) call. |
| action | Calls handler(stubCtx); palette.replace / navigate / agent calls are captured as trace.toolCalls. |
| surface | Calls build(stubCtx); the result lands in trace.surfaces[]. |
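For prompt skills, the vars substitution step can be pictured as a plain placeholder replacement. A minimal sketch, assuming {{name}} placeholders as used in the translate skill; substituteVars is a hypothetical helper rather than the runner's actual code, and leaving unknown placeholders untouched is an assumption of this sketch.

```typescript
// Hypothetical helper: replaces each {{name}} placeholder with vars[name].
// Assumption: placeholders with no matching var are left as-is.
function substituteVars(prompt: string, vars: Record<string, string> = {}): string {
  return prompt.replace(/\{\{(\w+)\}\}/g, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : whole,
  );
}

const system = substituteVars('Translate the user message into {{language}}.', {
  language: 'English',
});
console.log(system); // Translate the user message into English.
```

The substituted string then plays the role of system in the single llm.complete call, with userInput passed as the user message.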
PanelSkill isn't driven directly by the runner — its lifecycle (onEnter → onInput → onAction) needs a host-supplied test harness.
Skip
defineEval({ name: 'not implemented yet', skill, userInput: '...', assertions: [], skip: 'API not deployed' });
Skipped evals don't run, don't fail, and still appear in the report.
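The three properties above (not run, not failed, still reported) can be sketched with a tiny stand-in runner. Everything here is hypothetical: MiniSpec, MiniResult, and runMini are illustration-only names, and counting skipped evals toward total is an assumption, not something the report shape above confirms.

```typescript
// Hypothetical mini-spec for illustration; the real EvalSpec carries a skill and assertions.
interface MiniSpec {
  name: string;
  skip?: string; // reason for skipping, as in EvalSpec.skip
  run: () => boolean; // stand-in for "drive the skill and check its assertions"
}

type MiniResult =
  | { name: string; status: 'passed' | 'failed' }
  | { name: string; status: 'skipped'; reason: string };

function runMini(specs: readonly MiniSpec[]) {
  const results: MiniResult[] = [];
  for (const spec of specs) {
    if (spec.skip !== undefined) {
      // Skipped: never executed, but still recorded so it shows up in the report.
      results.push({ name: spec.name, status: 'skipped', reason: spec.skip });
      continue;
    }
    results.push({ name: spec.name, status: spec.run() ? 'passed' : 'failed' });
  }
  const count = (s: MiniResult['status']) => results.filter((r) => r.status === s).length;
  const failed = count('failed');
  return {
    total: results.length, // assumption: skipped evals count toward total
    passed: count('passed'),
    failed,
    skipped: count('skipped'),
    failures: failed, // alias, mirroring EvalReport
    results,
  };
}

let ran = 0;
const miniReport = runMini([
  { name: 'ok', run: () => { ran += 1; return true; } },
  { name: 'not implemented yet', skip: 'API not deployed', run: () => { ran += 1; return false; } },
]);
console.log(`${miniReport.passed} passed, ${miniReport.skipped} skipped, ${miniReport.failures} failed`);
// 1 passed, 1 skipped, 0 failed
```

Because the skipped spec's run function is never invoked, a CI gate on failures stays green while the skip reason remains visible in the results.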
Eval vs integration test
| Use eval | Use integration test |
|---|---|
| Skill / prompt / tool behaviour | Full UI flow (palette open, key press, render) |
| Pure LLM-driven decisions | DOM interaction, CSS, routing |
| Want many cases in seconds | A few happy paths over minutes |
| Run without a real LLM key (CI) | Run end-to-end fixtures |
| Mock deterministic input → output | Real user agent |
You need both — evals catch "how the LLM uses tools," integration tests catch browser behaviour.
See also
- Tools registry — most tools you register should have an eval guarding "the agent does call this when expected".
- Skills overview — skill shape.