webagent — Overview

What it is

webagent is a DOM-grounded agent toolkit.

It lets an AI agent work directly on the page the user is currently looking at: read the DOM, click buttons, fill forms, navigate, highlight or outline elements, persist state across pages, and pause to ask the user when needed.

Each turn the agent sees:

an indexed tabbed-tree dump of every actionable element (each gets a numeric [N] index; the LLM passes that index back instead of guessing CSS selectors),
viewport markers (↑ / ↓) telling it what the user can actually see vs. what's above / below the fold,
and optionally a screenshot (viewport or full-page with auto-split) — disabled by default; opt in for pages whose visual layout the DOM dump can't convey.
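
To make the indexed dump concrete, here is a minimal sketch of how such a listing could be built. The `PageNode` shape, the `dumpTree` name, and the exact line format are assumptions for illustration; the toolkit's own serializer walks live DOM elements and may format its output differently.

```typescript
// Hypothetical, simplified node shape standing in for a live DOM element.
interface PageNode {
  tag: string;
  text?: string;
  actionable?: boolean; // true for elements the agent may click or fill
  top: number;          // vertical offset in page coordinates
  children?: PageNode[];
}

interface Viewport { top: number; bottom: number; }

interface Dump {
  text: string;                  // what the LLM reads
  index: Map<number, PageNode>;  // how [N] is resolved back to a node
}

function dumpTree(root: PageNode, viewport: Viewport): Dump {
  const lines: string[] = [];
  const index = new Map<number, PageNode>();
  let next = 1;

  const visit = (node: PageNode, depth: number): void => {
    // Mark nodes above (↑) or below (↓) the visible part of the page.
    const marker =
      node.top < viewport.top ? '↑ ' : node.top >= viewport.bottom ? '↓ ' : '  ';
    let label = `<${node.tag}>`;
    if (node.actionable) {
      // Only actionable nodes receive a numeric index.
      index.set(next, node);
      label = `[${next}] ${label}`;
      next += 1;
    }
    if (node.text) label += ` ${node.text}`;
    lines.push(marker + '\t'.repeat(depth) + label);
    for (const child of node.children ?? []) visit(child, depth + 1);
  };

  visit(root, 0);
  return { text: lines.join('\n'), index };
}

// Usage: dump a tiny page and resolve an index the LLM handed back.
const examplePage: PageNode = {
  tag: 'main',
  top: 0,
  children: [
    { tag: 'button', text: 'Save', actionable: true, top: 100 },
    { tag: 'a', text: 'Terms', actionable: true, top: 2000 },
  ],
};
const exampleDump = dumpTree(examplePage, { top: 0, bottom: 800 });
console.log(exampleDump.text);
console.log(exampleDump.index.get(1)?.tag);
```

Because the model answers with an index rather than a selector, the host only has to look the number up in the map built during the same turn.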

webagent complements LangChain rather than replacing it. LangChain runs on the backend, where there is no concept of a "page"; webagent runs on the frontend, where the DOM, accessibility tree, and routes are first-class.

30-second try

npm install @perhapxin/dddk

import { WebAgent, OpenAIProvider } from '@perhapxin/dddk';

const agent = new WebAgent({
  llm: new OpenAIProvider({ apiKey: import.meta.env.VITE_OPENAI_KEY }),
  locale: 'en',
});

agent.on('subtitle', (text) => console.log('agent:', text));
agent.on('done', () => console.log('finished'));

await agent.run('Change the headline to "Annual Report 2026" and save.');

That's the whole integration. The agent reads the page, picks tools, narrates progress, and stops when the task is done — or pauses and asks a follow-up if it needs you.
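
If the host wants the narration as data rather than console output, the two events shown above can be collected into a transcript. This is a sketch only: `AgentLike` and `runWithTranscript` are hypothetical names, and the interface assumes `on` and `run` have the shapes used in the snippet above. Whether `done` fires before or after `run` resolves is not stated in this overview.

```typescript
// Assumed minimal surface of the agent, based on the calls in the snippet above.
type Handler = (text?: string) => void;

interface AgentLike {
  on(event: 'subtitle' | 'done', handler: Handler): void;
  run(task: string): Promise<unknown>;
}

interface RunResult {
  transcript: string[]; // every subtitle the agent emitted, in order
  finished: boolean;    // true once the 'done' event was seen
}

async function runWithTranscript(agent: AgentLike, task: string): Promise<RunResult> {
  const result: RunResult = { transcript: [], finished: false };
  agent.on('subtitle', (text) => {
    if (text !== undefined) result.transcript.push(text);
  });
  agent.on('done', () => {
    result.finished = true;
  });
  await agent.run(task);
  return result;
}
```

A host could then render `transcript` in its own UI or attach it to a support log.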

How it compares to other agent frameworks

Dimension	Backend orchestrators	Vendor-bundled agent SDKs	Headless-browser test agents	webagent
Runs where	server	server	server-driven browser	the browser itself (user's open tab)
Input	text + tool results	text + tool results	DOM via accessibility tree	DOM + user-visible visual overlays
Cross-page state	host-managed	host-managed	session-based	built-in sessionStorage + cross-tab sync
User-facing UI	none (bring your own)	none	developer trace UI	subtitle / overlay / ask_user / Surface, native
Deployment	server + frontend	server + frontend	server + headless browser	frontend-only SDK, zero server
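
The "Cross-page state" row can be illustrated with a small persistence sketch. The `AgentSession` shape, the storage key, and the function names are assumptions, not the toolkit's actual format. Note that sessionStorage is scoped to a single tab, so cross-tab sync needs an additional channel (BroadcastChannel is one option); that part is omitted here.

```typescript
// Subset of the Web Storage API used below; window.sessionStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical state shape and key, for illustration only.
interface AgentSession {
  task: string;
  completedSteps: string[];
}

const SESSION_KEY = 'webagent:session';

function saveSession(store: KeyValueStore, session: AgentSession): void {
  store.setItem(SESSION_KEY, JSON.stringify(session));
}

function loadSession(store: KeyValueStore): AgentSession | null {
  const raw = store.getItem(SESSION_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as AgentSession;
  } catch {
    return null; // Corrupt entry: treat it as no saved session.
  }
}

// In-memory stand-in so the sketch also runs outside a browser.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}
```

In a browser, passing `window.sessionStorage` as the store lets a task survive a same-tab navigation: save before navigating, then load on the next page and resume.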

The three core use cases

A. In-app assistant (the main one)

A user inside a SaaS app triggers dddk's Space gesture and says "please rewrite this report headline to sound more professional." webagent reads the current DOM, finds the headline element, rewrites it, and tells the user what changed via the subtitle bar.

B. Automated tour / onboarding

A first-time user lands on the site; webagent runs the /introduce skill, which walks them through the product, highlights key areas, narrates in the subtitle bar, and waits for the user to acknowledge each step before continuing.

C. Intent-driven task execution

The user says "book me next Wednesday's meeting room." webagent navigates to the booking page, fills the date, picks a room. If it lacks information it calls ask_user; if it needs structured input the host opens a Pieces-based Surface via dddk (the agent itself never renders UI). When the form is submitted the agent receives the data and continues.
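
The pause-and-resume behaviour described here can be modelled as a small state machine. The states, events, and `step` function below are hypothetical and only illustrate the flow; they are not the agent's internal loop.

```typescript
// Hypothetical states of one task run.
type AgentState =
  | { status: 'running'; collected: Record<string, string> }
  | { status: 'waiting_for_user'; field: string; question: string; collected: Record<string, string> }
  | { status: 'done'; collected: Record<string, string> };

// Hypothetical events that move the run forward.
type AgentEvent =
  | { type: 'need_info'; field: string; question: string } // agent calls ask_user
  | { type: 'user_answered'; value: string }               // host delivers the answer
  | { type: 'task_complete' };

function step(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case 'need_info':
      // Only a running agent can pause to ask.
      if (state.status !== 'running') return state;
      return {
        status: 'waiting_for_user',
        field: event.field,
        question: event.question,
        collected: state.collected,
      };
    case 'user_answered':
      // Answers are ignored unless the agent is actually waiting.
      if (state.status !== 'waiting_for_user') return state;
      return {
        status: 'running',
        collected: { ...state.collected, [state.field]: event.value },
      };
    case 'task_complete':
      // A paused run cannot complete until the user has answered.
      if (state.status !== 'running') return state;
      return { status: 'done', collected: state.collected };
  }
}
```

Keeping the transition pure makes the pause explicit: while the state is `waiting_for_user`, nothing advances the task until the host supplies an answer, whether it came from a free-text reply or a submitted form.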

What it does NOT do

webagent intentionally does not handle:

Cross-site crawling (it is not a browser automation tool)
Backend agent orchestration (that's LangGraph)
Model training or fine-tuning
Visual workflow editing (that's runboard)
Command palette / trigger system (that's dddk)

webagent does one thing: operate the DOM of the current page, driven by user intent through an LLM.

Package layout

@perhapxin/dddk/src
├── orchestrator.ts   Top-level DotDotDuck class — wires modules + triggers
├── agent/
│   ├── webagent/     Main page-driving agent loop (DOM tools, ask_user, navigate)
│   ├── inline/       InlineAgent — selection edits inside editables
│   ├── llm/          OpenAI + Google providers, LLMRouter, adapter registry
│   ├── sitemap/      Sitemap tree + navigation policy
│   └── memory/       Cross-tab session persistence
├── modules/          Voice, TTS, Subtitle, Dwell, ImmersiveTranslate, Palette, …
├── triggers/         Space gesture, hotkeys, selection-change observers
├── ui/               Subtitle bar, indicator, Surface renderer host
├── skills/           Built-in skills (introduce, …) + registration API
├── toolbox/          Reusable host helpers (selection, screenshot, dom serialize)
└── utils/            Shared low-level helpers

No framework dependencies. Pure DOM API + event emitter — works with React, Vue, Svelte, or vanilla HTML.
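
For readers unfamiliar with the pattern, a typed event emitter needs nothing beyond plain TypeScript, which is why it works under any UI framework. The `Emitter` class and `AgentEvents` map below are a generic sketch, not the toolkit's implementation; only the `subtitle` and `done` event names come from the example earlier in this overview.

```typescript
// Loosely typed listener used for internal storage.
type Listener = (...args: unknown[]) => void;

class Emitter<E extends Record<string, unknown[]>> {
  private listeners = new Map<keyof E, Listener[]>();

  // Registers a listener and returns a function that removes it again.
  on<K extends keyof E>(event: K, listener: (...args: E[K]) => void): () => void {
    const stored = listener as Listener;
    const list: Listener[] = this.listeners.get(event) ?? [];
    list.push(stored);
    this.listeners.set(event, list);
    return () => {
      const i = list.indexOf(stored);
      if (i !== -1) list.splice(i, 1);
    };
  }

  // Calls every listener registered for the event with the given arguments.
  emit<K extends keyof E>(event: K, ...args: E[K]): void {
    // Copy first so listeners can unsubscribe while being called.
    const current = (this.listeners.get(event) ?? []).slice();
    for (const listener of current) listener(...args);
  }
}

// Hypothetical event map: event name -> argument tuple.
type AgentEvents = { subtitle: [text: string]; done: [] };

// Usage: any framework can subscribe and push the text into its own state.
const bus = new Emitter<AgentEvents>();
const stop = bus.on('subtitle', (text) => console.log('agent:', text));
bus.emit('subtitle', 'Opening the booking page');
stop(); // Unsubscribe, for example when a component unmounts.
```

A React effect, a Vue `onMounted` hook, or a plain script can all call `on` and keep the returned function for cleanup.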