webagent — Actions Catalog

Every tool the agent can call. The LLM invokes them via function calling.

Design principles

  • Each action has an explicit input schema (Zod / JSON Schema).
  • Each action is either idempotent or clearly marked as a side-effect.
  • Failures return a normalised error — never throw into the agent loop.
  • DOM actions query first to confirm the target exists; if not, return { ok: false, reason: 'not_found' }.

Full list (12 built-ins)

Action Params Behaviour
navigate { path: string } SPA-friendly page change. Always gated with a Space confirmation (the runtime emits a confirm event before running).
scroll_to { selector } Smooth-scrolls to the element. Use before narrating about something below the fold.
wait { ms?, selector?, timeout? } Two modes — sleep ms milliseconds, or poll until a CSS selector appears (with timeout). Capped at 5 s.

DOM interaction

Action Params Behaviour
click { selector } Click the element. Use this for submit buttons too — there is no separate submit_form.
fill_input { selector, value } Fill an input or textarea (dispatches input + change).
select_option { selector, value } Pick a <select> option.
clear_input { selector } Clear a field.
press_key { key, selector? } Dispatch a keyboard event (keydown + keyup) on an element. key is the W3C key name ("Enter", "Escape", "ArrowDown", " ", single chars). selector is optional — omitted = document.activeElement. Use for Enter to submit, Escape to dismiss, arrow keys, etc.

Visual overlays (shown to the user)

Action Params Behaviour
border { selector, color?, label? } Draw a border around the element. Calling border or highlight auto-clears any prior overlay; there is no clear_overlays tool. In CoT mode the standalone border tool is HIDDEN from the model — framing is set via narrate.about instead (see below). The action stays registered for the legacy / classic loop and host customActions.
highlight { selector, color?, label? } Translucent fill — for inline text spans / paragraphs. Same auto-clear behaviour. Not in the default builtin set; opt in via customActions.

Talking to the user

Action Params Behaviour
pause { note? } Wait for the user to press Space before the next subject. Hidden from the CoT tool list — the runtime auto-pauses after every narrate, exposing pause would invite double-pauses. Stays available in classic mode.
ask_user { question } Ask a free-text question.
ask_user_choice { question, options[], allowFreeText? } Multi-choice picker (2–6 options, with optional free-text fallback).

ask_user vs ask_user_choice

If the answer space is 2–4 short options, prefer ask_user_choice — the host renders it as a clickable / number-keyed picker (we recommend Subtitle.showChoice) so the user doesn't have to type. Only fall back to ask_user when the answer genuinely needs free text (e.g. "describe the issue", "paste an email").

allowFreeText defaults to true: the picker's last row is a free-text input, letting the user type an answer that isn't in the list. On submit the typed string is delivered as-is — the agent receives that string with no special sentinel; the host distinguishes free-text vs a listed pick at the event layer via index === -1, then calls agent.respond(value).

Both actions behave the same way: after invocation the agent enters waiting and the loop blocks there until the host calls respond(value).

Selector format — stable [id] hashes from the DOM dump

Every selector parameter above accepts either a stable [id] hash from the indexed DOM dump ("a1b2", "[a1b2]", even "↓[a1b2]") OR a CSS selector string. The DOM reader emits each addressable element with a per-element hash — the LLM passes that hash back as the selector arg and the runtime resolves it via the per-turn index map. CSS selectors still work as a fallback, but hashes are the primary path and avoid all selector-guessing.

Never invent CSS selectors like #command-X based on guessed names. The page does not necessarily use those IDs. Always copy a hash verbatim from the current turn's DOM dump.

Termination (CoT mode)

CoT mode envelopes have explicit signals for ending the loop:

  • { task_finish: true } — put as the LAST item in actions[] when the user's original task is fully satisfied. Runtime ends the loop right after the prior actions in this turn dispatch. DO NOT use in the same turn as ask_user_choice, navigate, click, fill_input, or any tool whose result you have not yet observed — runtime drops mis-placed task_finish and logs a warning.
  • Empty / omitted actions — legacy end-of-loop path; runtime hides the subtitle when no actions are present.

Classic (non-CoT) mode ends naturally when the model emits a turn with text only and no tool call (finish_reason=stop).

Narrate envelope — narrate.about auto-borders

In CoT mode the agent_turn envelope's actions[] accepts narrate items shaped { narrate: string, about?: string }. When about is set to an element's [id] hash, the runtime AUTO-CALLS border on that element BEFORE streaming the narrate text. No separate border action needed — the framing is structurally tied to the narration. This is the reason border is hidden from the CoT tool list above.

Action result shape

type ActionResult =
  | { ok: true; data?: any }
  | { ok: false; reason: ActionFailureReason; message?: string };

type ActionFailureReason =
  | 'not_found'       // selector matched nothing
  | 'not_visible'     // element exists but isn't visible
  | 'not_interactive' // element is disabled / readonly
  | 'timeout'         // wait_for timed out
  | 'navigation'      // interrupted by an in-flight navigation
  | 'unknown';

The agent loop appends the result to session.steps, so the next LLM turn sees the failure reason and can self-correct.

Selector rules

LLM-returned selectors must be CSS selectors, with limits:

  • Only tag / id / class / [data-*] attributes / :nth-child() are allowed.
  • No :has(), :not(), or other selectors prone to NodeList blow-ups.
  • No * (universal selector).
  • When multiple elements match, the first visible one is used by default, and result.data carries { matched: N, used: 0 } as a hint.

For complex targeting use the data-webagent-id attribute — the DOM Reader auto-attaches fallback ids.

Custom action example

import { z } from 'zod';

agent.config.customActions = [
  {
    name: 'open_chat_panel',
    description: 'Open the chat panel on the right side',
    parameters: z.object({
      initialMessage: z.string().optional(),
    }),
    handler: async ({ initialMessage }, ctx) => {
      myUI.openChatPanel(initialMessage);
      return { ok: true };
    },
  },
];

Disabling built-ins you don't need

Every name in the table above is registered by default so any host works out of the box. Sites with a narrow surface (no <select> elements, no destructive operations, no questions to ask the user, etc.) can trim the list with disableBuiltinActions — the listed actions are removed from the agent's tool schema entirely, which shrinks per-turn token cost and removes "wrong tool" failure modes.

new DotDotDuck({
  // …
  webAgent: {
    disableBuiltinActions: [
      'pause',            // runtime auto-pauses; only needed for destructive moments
      'wait',             // no async UI to wait for
      'select_option',    // no <select> elements
      'clear_input',      // fill_input('') covers the rare case
      'ask_user',         // one-shot Q&A: agent shouldn't ask follow-ups
      'ask_user_choice',  // same as above
    ],
  },
});

The visual highlight action is not in the default set — re-add it via customActions if you want the translucent-fill style alongside border. present_surface is similarly opt-in via allowPresent: true. disableBuiltinActions applies to both, so you can disable them even after opting in.

Action vs direct API

Actions are what the LLM can invoke. When the host needs to trigger a behaviour directly (without the LLM), call agent.executeAction(name, params) — it runs the handler without appending to session.steps.

Actions we deliberately don't ship (so you don't ask)

  • eval_js — too dangerous; not exposed.
  • fetch — wrap with a custom action instead; don't let the LLM issue arbitrary requests.
  • localStorage_set/get — wrap with a custom action.

Note: there is no screenshot ACTION (the LLM can't request a capture), but the agent CAN see the page visually — enable WebAgentConfig.screenshot and a viewport / full-page image is attached to every turn alongside the DOM dump. See the screenshot guide.