Overview
Page Agent is an open-source JavaScript-based GUI agent developed by Alibaba that allows AI models to interact with and control web interfaces using natural language. It enables a 'bring-your-own-LLM' approach to web automation, using a text-based representation of the DOM to navigate and manipulate web pages without requiring a heavy browser extension.
Features
- ✓ Extension-less operation via JS snippet or NPM
- ✓ Text-based DOM perception for LLM-friendly interaction
- ✓ BYO LLM support for provider flexibility
- ✓ Natural language control of browser interfaces
- ✓ Official MCP server for integration with other AI agents
Installation
npm install page-agent
Pros
- + Low-friction installation (no mandatory extension)
- + Flexible LLM provider support
- + Semantic understanding of dynamic web content
- + Native MCP integration
Cons
- − Potential CSP restrictions for in-page injection
- − Significant token usage for complex page structures
- − Higher latency compared to native automation scripts
Documentation
Page Agent
Overview
Page Agent is an open-source JavaScript-based GUI agent designed to allow AI models to interact with and control web interfaces using natural language. Developed by Alibaba, it enables a "bring-your-own-LLM" approach to web automation, allowing agents to perceive the DOM and perform actions without requiring a heavy browser extension for basic functionality.
Unlike traditional browser automation tools that rely on rigid selectors or complex RPA scripts, Page Agent uses a text-based representation of the DOM and an intelligent action system to navigate and manipulate web pages dynamically. It aims to bridge the gap between high-level intent and low-level browser API calls, making web-based agentic workflows more accessible and robust.
Features
- Extension-less Operation: Works directly in-page via a JavaScript snippet or NPM package, reducing the friction of installing browser extensions.
- Text-Based DOM Perception: Converts complex HTML structures into a simplified, LLM-friendly text format that preserves semantic meaning while reducing token usage.
- BYO LLM (Bring Your Own LLM): Compatible with various LLM providers, allowing developers to use the model that best fits their performance and cost requirements.
- Natural Language Control: Allows users and other agents to steer the browser using plain English instructions.
- MCP Server (Beta): Provides an official Model Context Protocol (MCP) server, enabling other AI agents (such as Claude Desktop) to control the browser through the Page Agent interface.
- High Compatibility: Built with TypeScript, which provides type safety and eases integration into modern web applications.
Installation
Via NPM
npm install page-agent
Via CDN
Add the following script tag to your HTML:
<script src="https://cdn.jsdelivr.net/npm/page-agent/dist/index.js"></script>
Quick Start
import { PageAgent } from 'page-agent';

// Initialize the agent with your LLM configuration
const agent = new PageAgent({
  llm: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o'
  }
});

// Start the agent and give it a command
await agent.run("Go to github.com, search for 'alibaba/page-agent', and tell me the number of stars.");
Core Concepts
DOM-to-Text Translation
Page Agent uses a translation layer that strips HTML noise from the active page and converts what remains into a structured text representation. This lets the LLM "see" the page layout, identify interactive elements (buttons, inputs, links), and understand the current state without processing thousands of lines of raw HTML.
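To make the idea concrete, here is a minimal sketch of a DOM-to-text pass. It is not Page Agent's actual implementation or output format: the node shape ({ tag, text, attrs, children }), the function name domToText, the tag lists, and the "[index] <tag> label" line format are all assumptions chosen for illustration. It works on plain objects so it can run outside a browser.

```javascript
// Hypothetical sketch: none of these names or formats come from Page Agent itself.
// Assumed node shape: { tag, text?, attrs?, children? }.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);
const SKIPPED_TAGS = new Set(["script", "style", "noscript", "svg"]);

function domToText(root) {
  const lines = [];
  let nextIndex = 0; // index the model can refer to when choosing an action

  function visit(node, depth) {
    if (SKIPPED_TAGS.has(node.tag)) return; // drop subtrees with no visible meaning
    const indent = "  ".repeat(depth);
    const label = (node.text || (node.attrs && node.attrs.placeholder) || "").trim();
    let childDepth = depth;

    if (INTERACTIVE_TAGS.has(node.tag)) {
      // Interactive elements get a numeric handle plus their tag name.
      lines.push(`${indent}[${nextIndex}] <${node.tag}> ${label}`.trimEnd());
      nextIndex += 1;
      childDepth = depth + 1;
    } else if (label) {
      // Non-interactive elements contribute only their text.
      lines.push(`${indent}${label}`);
      childDepth = depth + 1;
    }
    // Wrapper elements without text emit nothing and add no indentation.
    for (const child of node.children || []) visit(child, childDepth);
  }

  visit(root, 0);
  return lines.join("\n");
}

const page = {
  tag: "body",
  children: [
    { tag: "script", text: "trackPageView()" },
    { tag: "h1", text: "Sign in" },
    {
      tag: "div",
      children: [
        { tag: "input", attrs: { placeholder: "Email" } },
        { tag: "button", text: "Continue" },
      ],
    },
  ],
};

console.log(domToText(page));
// Sign in
// [0] <input> Email
// [1] <button> Continue
```

The script node and the empty wrapper div disappear from the output, which is where the token savings in a scheme like this would come from.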
Action Loop
The agent operates in a perceive-plan-act loop, with a verification step after each action:
- Perceive: Captures the current state of the DOM and translates it to text.
- Plan: The LLM analyzes the state and the user's goal to determine the next action (e.g., click, type, scroll).
- Act: The agent executes the action via the browser's JavaScript API.
- Verify: The agent checks the resulting DOM state to confirm the action succeeded before proceeding.
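The steps above can be sketched as a small control loop. This is an illustrative outline only, not Page Agent's source: runAgentLoop, createDemo, the action objects, and the "done" signal are hypothetical names, and the planner is a hard-coded stub standing in for an LLM call.

```javascript
// Hypothetical sketch of a perceive-plan-act-verify loop; all names are assumptions.
async function runAgentLoop({ perceive, plan, act, verify, goal, maxSteps = 10 }) {
  const history = [];
  for (let step = 0; step < maxSteps; step += 1) {
    const before = await perceive();                               // Perceive
    const action = await plan({ goal, state: before, history });   // Plan
    if (action.type === "done") return { done: true, steps: step, history };
    await act(action);                                             // Act
    const after = await perceive();
    const ok = await verify({ action, before, after });            // Verify
    history.push({ action, ok }); // the planner sees past outcomes on the next step
  }
  return { done: false, steps: maxSteps, history }; // step budget exhausted
}

// Stub environment: a fake "page" and a scripted planner instead of a real DOM and LLM.
function createDemo() {
  const page = { query: "", submitted: false };
  return {
    page,
    perceive: async () => JSON.stringify(page),
    plan: async ({ state }) => {
      const current = JSON.parse(state);
      if (current.query === "") return { type: "type", text: "page agent" };
      if (!current.submitted) return { type: "click", target: "search" };
      return { type: "done" };
    },
    act: async (action) => {
      if (action.type === "type") page.query = action.text;
      if (action.type === "click") page.submitted = true;
    },
    // Naive check: treat the action as successful if the perceived state changed.
    verify: async ({ before, after }) => before !== after,
  };
}

const demo = createDemo();
runAgentLoop({ ...demo, goal: "Search for page agent" }).then((result) => {
  console.log(result.done, result.steps); // true 2
});
```

The maxSteps cap is a design choice worth keeping in any real loop of this kind, since a planner that never signals completion would otherwise run (and spend tokens) indefinitely.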
Advanced Features
- Hybrid Mode: While it can run extension-less, it offers an optional Chrome extension for advanced capabilities like cross-tab navigation and deeper system-level browser control.
- Custom Action Definitions: Developers can define custom JavaScript functions that the agent can call as specialized tools.
- Observation Logs: Detailed tracing of the agent's thought process and actions for debugging and auditing.
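As a rough illustration of the custom action idea listed above, the sketch below shows one way a tool registry could work: functions are registered with a name and description, the descriptions are rendered as text for the model's prompt, and a structured call from the model is dispatched to the matching function. ActionRegistry, its methods, and the formatPrice action are hypothetical; Page Agent's own API for custom actions may look quite different.

```javascript
// Hypothetical registry; not taken from Page Agent's API.
class ActionRegistry {
  constructor() {
    this.actions = new Map();
  }

  // Register a named tool with a description the model can read.
  register(name, description, handler) {
    if (this.actions.has(name)) throw new Error(`Action already registered: ${name}`);
    this.actions.set(name, { description, handler });
  }

  // Render the tool list as plain text, e.g. for inclusion in a prompt.
  describe() {
    return [...this.actions]
      .map(([name, { description }]) => `${name}: ${description}`)
      .join("\n");
  }

  // Dispatch a structured call such as { name: "formatPrice", args: { cents: 1999 } }.
  invoke(call) {
    const entry = this.actions.get(call.name);
    if (!entry) throw new Error(`Unknown action: ${call.name}`);
    return entry.handler(call.args || {});
  }
}

const registry = new ActionRegistry();
registry.register(
  "formatPrice",
  "Format a number of cents as a dollar string",
  ({ cents }) => `$${(cents / 100).toFixed(2)}`
);

console.log(registry.describe());
// formatPrice: Format a number of cents as a dollar string
console.log(registry.invoke({ name: "formatPrice", args: { cents: 1999 } }));
// $19.99
```

Rejecting unknown action names explicitly matters here because the name comes from model output, which can contain tools that were never registered.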
Examples
- Automated Testing: Use Page Agent to write end-to-end tests in natural language rather than fragile CSS selectors.
- Web Data Extraction: Navigate complex multi-page sites to gather specific information without writing custom scrapers for every page.
- Accessibility Auditing: Use the agent to navigate a site and report on accessibility hurdles from a user's perspective.
Pros
- ✅ Low Friction: No mandatory extension means faster deployment and easier usage in hosted environments.
- ✅ LLM Flexible: Not locked into a single model provider.
- ✅ Semantic Understanding: Better at handling dynamic websites where IDs and classes change frequently.
- ✅ MCP Ready: Easy integration into the growing ecosystem of MCP-enabled agents.
Cons
- ❌ In-Page Requirement: Without the extension, it must be injected into the page, which a site's Content Security Policy (CSP) may block.
- ❌ Token Consumption: High-complexity pages still require significant token usage for DOM representation.
- ❌ Latency: The perceive-plan-act loop can be slower than hard-coded automation scripts.
When to Use
Use Page Agent when:
- You need to build a web-based agent that can be easily shared or embedded without requiring users to install extensions.
- You are dealing with highly dynamic web interfaces where traditional RPA tools fail.
- You want to expose your web application's GUI to other AI agents via the MCP protocol.
- You prefer a TypeScript-native approach to browser automation.
