
Peruse AI

alpha

Local-first autonomous web agent for exploration, testing, and UX review

3 progress reports
ETL · Research · Engineering · Modeling · Launch
Python 3.10+ · Playwright · asyncio · Ollama · LM Studio · OpenAI-compatible APIs · Jina · Click
Overview

What It Is

Peruse AI is a local-first, autonomous web agent that explores web applications without sending data off your machine. It combines browser automation (Playwright), dual-channel perception (DOM extraction plus visual screenshots), and local Vision-Language Models to autonomously navigate web apps and produce structured reports: data insights, UX/UI reviews, and bug reports.

The key value proposition is 100% local execution. It works with any open-source VLM via Ollama, LM Studio, or compatible APIs.


Architecture

The agent follows a six-stage perceive-plan-act loop:

  1. Entry: CLI commands or Python API initiate a session with a URL and task description.
  2. Init: Load configuration, start VLM backend, launch Playwright browser.
  3. Perceive: Capture screenshot, extract interactive DOM elements via JavaScript, monitor console errors and network failures.
  4. VLM Decide: Send the screenshot, DOM text, task context, and step history to the VLM. Parse a structured JSON action from the response.
  5. Act: Execute the Playwright browser action (click, type, scroll, navigate, select, wait).
  6. Reports: After exploration, generate markdown reports by sending collected screenshots and session data back through the VLM for analysis.
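
In sketch form, one iteration of this loop might look like the following; `extract_dom`, `ask_vlm`, and `execute_action` are placeholder names rather than the package's actual API.

```python
# Sketch of one perceive-plan-act iteration. extract_dom, ask_vlm, and
# execute_action are placeholder names, not the package's actual API.
from playwright.async_api import async_playwright


async def explore(url: str, task: str, max_steps: int = 20) -> list[dict]:
    history: list[dict] = []                                   # step history fed back to the VLM
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)      # Init: launch the browser
        page = await browser.new_page()
        await page.goto(url)                                   # Entry: session starts at the URL

        for _ in range(max_steps):
            screenshot = await page.screenshot()               # Perceive: visual channel
            elements = await extract_dom(page)                 # Perceive: structural channel
            action = await ask_vlm(screenshot, elements, task, history)  # VLM Decide
            await execute_action(page, action)                 # Act: click / type / scroll / ...
            history.append(action)
            if action.get("action") == "done":
                break

        await browser.close()
    return history                                             # Reports stage consumes this later
```
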
Key Features

Data Sources

  • Input: Any publicly accessible URL.
  • Web Data: DOM elements extracted via JavaScript (interactive elements only, filtered for visibility and viewport presence). Screenshots captured at each step.
  • Configuration: Environment variables with PERUSE_ prefix, .env files, or direct Python kwargs.
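
A minimal sketch of that configuration layer with Pydantic Settings; the field names here are illustrative, not the package's actual schema.

```python
# Hypothetical configuration sketch using pydantic-settings: values can come
# from PERUSE_-prefixed environment variables, a .env file, or direct kwargs.
from pydantic_settings import BaseSettings, SettingsConfigDict


class PeruseSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="PERUSE_", env_file=".env")

    backend: str = "ollama"          # assumed names: ollama | lmstudio | openai | jina
    model: str = "qwen3-vl:6b"
    max_steps: int = 20
    headless: bool = True


# All three sources resolve into the same object:
#   PERUSE_MODEL=llava              (environment variable)
#   PERUSE_BACKEND=lmstudio         (line in .env)
settings = PeruseSettings(max_steps=50)   # direct kwarg override
```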

Current State

Published to PyPI as v0.1.0 (Alpha). Core functionality is stable: autonomous exploration, persona support, focus groups, loop recovery, multi-output reports, and four VLM backend integrations. Development continues on stability improvements and report quality. Intel ARC GPU support is experimental and unstable.

Progress Reports
Report #3 of 3
Peruse AI: Architecture Documentation & Demo
Mar 3 - Mar 5, 2026

Devlog

With the package published on PyPI and the core features stable, I shifted to making the project presentable. A CLI tool that nobody can see in action does not get much traction, so the goal was demo materials and architecture documentation that let someone understand what Peruse AI does without installing it.

I ran a full demo session against the USDA QuickStats Analytics dashboard, which turned out to be a good test case: it has dropdowns, filters, data tables, and visualizations that require multi-step navigation. The agent explored the dashboard autonomously, interacting with commodity selectors, state filters, and chart views. I captured both a video recording and an animated GIF of the session. The GIF is 9.7 MB, which is large, but it shows the complete perceive-act cycle in action and embeds directly in the GitHub README.

The architecture diagram maps the six-stage pipeline visually: Entry (CLI or Python API), Init (config, VLM, browser), Perceive (screenshot + DOM + errors), VLM Decide (vision + DOM context to structured action), Act (Playwright execution), and Reports (VLM-powered analysis of collected data). I also built an interactive HTML flowchart that shows the same pipeline with expandable detail at each stage.

The sample UX review output from the USDA demo session shows what the agent actually produces: specific, actionable feedback on visual hierarchy, color contrast, WCAG accessibility, button sizes, information density, and design consistency. It is 150+ lines of structured critique, prioritized with accessibility issues first. Having a concrete example of the output quality was more convincing than any description I could write in the README.

This is not a flashy milestone, but it matters. The project went from "working package with a PyPI listing" to "something you can evaluate in 30 seconds by looking at the README." The demo GIF, architecture diagram, and sample output together tell the story faster than reading the code.

What's next: Monitoring community feedback, stability improvements based on real-world usage, and exploring whether the focus group output format could work as a structured input for design system audits.

Changelog

Added

  • Add demo video (peruse_ai_run.mp4, 7.75 MB) of USDA QuickStats exploration session
  • Add demo GIF (peruse_ai_run.gif, 9.72 MB) for README embedding
  • Add architecture overview diagram (Peruse_AI_Architecture_Overview.png)
  • Add interactive HTML flowchart (peruse_ai_flowchart_compact_mono.html)
  • Add sample UX review output from USDA QuickStats demo session

Changed

  • Update README with embedded demo GIF, architecture diagram, sample output links, and comprehensive feature documentation
documentation · demo · architecture-diagram · flowchart · usda-quickstats · ux-review · sample-output · readme · open-source
Report #2 of 3
Peruse AI: Personas, Focus Groups & PyPI Release
Feb 23 - Feb 27, 2026

Devlog

This was the feature expansion sprint that turned Peruse AI from a working prototype into a publishable package. Six commits over five days, culminating in the PyPI release.

The two biggest additions were personas and focus groups. Personas let you assign a role to the agent (senior UX designer, data analyst, accessibility auditor), which gets prepended to the system prompt, shaping how the VLM interprets what it sees without breaking the core action-selection logic. Focus groups take this further: you provide a list of personas and the agent runs all of them concurrently against the same URL, each with its own browser instance, VLM session, and output directory. The concurrent execution uses asyncio.gather, so a 5-persona focus group takes roughly the same wall-clock time as a single run. This turned out to be the feature people found most interesting when I described the project.
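
Roughly, the concurrency pattern looks like this; `run_persona` is a stand-in for the real per-persona run, not the actual API.

```python
# Sketch of concurrent multi-persona execution with asyncio.gather.
# run_persona is a stand-in: each call owns its own browser, VLM session,
# and output directory, so personas never share state.
import asyncio


async def run_persona(url: str, task: str, persona: str, output_dir: str) -> dict:
    # Placeholder for the real per-persona run: launch a browser, prepend the
    # persona to the VLM system prompt, explore, and write reports to output_dir.
    await asyncio.sleep(0)
    return {"persona": persona, "output_dir": output_dir}


async def run_focus_group(url: str, task: str, personas: list[str]) -> list[dict]:
    # All personas run concurrently, so wall-clock time stays close to a single run.
    results = await asyncio.gather(
        *(run_persona(url, task, persona=p, output_dir=f"runs/{p}") for p in personas)
    )
    return list(results)


personas = ["senior UX designer", "data analyst", "accessibility auditor"]
# asyncio.run(run_focus_group("https://example.com", "review the dashboard", personas))
```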

I also added a fourth VLM backend (Jina, cloud-based) and bumped the default context window to 32,768 tokens. The default model switched from Qwen 2.5 VL to Qwen 3 VL 6B, which handles structured JSON output more reliably. On the experimental side, I added Intel ARC GPU support via IPEX-LLM's Vulkan backend, though this turned out to be unstable: frequent model runner crashes, shader compilation delays on first run, and VRAM exhaustion on large context windows. I documented the workarounds but marked it as experimental.

The loop recovery system got a significant upgrade. The agent now detects two stuck patterns: identical consecutive actions (7+ repeats) and low-variety oscillation (2 unique actions over 12 steps). When stuck, it issues nudge messages suggesting alternative actions, and after a nudge the avoided elements are actually blocked from interaction rather than just flagged. I also added perceptual hashing for screenshot deduplication in reports, so repeated screenshots of the same page state do not inflate the report with redundant images.
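
The detection rules are simple to sketch (thresholds as described above; the real recovery logic tracks more state, including which elements to block after a nudge):

```python
# Simplified stuck-pattern detection over the action history (thresholds
# taken from the devlog; the real recovery logic tracks more state).
def is_stuck(history: list[dict]) -> str | None:
    """Return the detected pattern name, or None if exploration looks healthy."""
    def key(action: dict) -> tuple:
        return (action.get("action"), action.get("target"))

    # Pattern 1: the same action repeated 7+ times in a row
    if len(history) >= 7:
        tail = [key(a) for a in history[-7:]]
        if len(set(tail)) == 1:
            return "identical-repeat"

    # Pattern 2: low-variety oscillation, only 2 unique actions across 12 steps
    if len(history) >= 12:
        window = [key(a) for a in history[-12:]]
        if len(set(window)) <= 2:
            return "oscillation"

    return None
```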

A smaller but important fix: the select_option handler now uses multi-strategy matching (label, value, substring) with 3-second timeouts per strategy, which handles the messy dropdown implementations you find on real websites.
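
In sketch form, with Playwright, the fallback chain looks something like this; the actual handler is more involved.

```python
# Sketch of multi-strategy <select> matching with Playwright (assumed
# structure; the real select_option handler is more involved).
from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError


async def select_with_fallback(page: Page, selector: str, wanted: str) -> bool:
    # Strategy 1: match by visible label; Strategy 2: match by value attribute;
    # each waits at most 3 seconds before falling through to the next.
    for kwargs in ({"label": wanted}, {"value": wanted}):
        try:
            await page.select_option(selector, **kwargs, timeout=3000)
            return True
        except PlaywrightTimeoutError:
            continue

    # Strategy 3: substring match against the option labels, then select by value.
    options = await page.eval_on_selector_all(
        f"{selector} option",
        "opts => opts.map(o => ({label: o.label, value: o.value}))",
    )
    for opt in options:
        if wanted.lower() in opt["label"].lower():
            await page.select_option(selector, value=opt["value"], timeout=3000)
            return True
    return False
```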

By February 27, the package was stable enough to publish. I pushed v0.1.0 to PyPI with badges and an updated README.

What's next: Creating demo materials and architecture documentation to make the project presentable beyond just the README.

Changelog

Added

  • Add custom persona support prepended to VLM system prompt
  • Add focus group module for concurrent multi-persona execution
  • Add extra_instructions config field for domain-specific agent guidance
  • Add Jina VLM backend (cloud-based API)
  • Add Intel ARC GPU support via IPEX-LLM Vulkan (experimental)
  • Add nudge-based loop recovery: detects identical actions and low-variety oscillation
  • Add element blocking after nudge messages (not just advisory)
  • Add perceptual hashing for screenshot deduplication in reports
  • Add configurable max_report_screenshots setting
  • Add VLM retry logic with cooldown (vlm_retries, vlm_cooldown settings)
  • Add PyPI publication with version/Python/license badges

Changed

  • Update default model from qwen2.5-vl:7b to qwen3-vl:6b
  • Update default context window to 32,768 tokens
  • Update perception module for better filter and form element navigation

Fixed

  • Fix select_option with multi-strategy matching (label, value, substring) and 3s timeouts
personas · focus-groups · jina · intel-arc · ipex-llm · vulkan · nudge-recovery · loop-detection · screenshot-deduplication · perceptual-hashing · pypi · open-source · qwen3-vl · concurrent-execution · select-option
Report #1 of 3
Peruse AI: Initial Package & Core Agent Loop
Feb 18, 2026

Devlog

I built and shipped the first working version of Peruse AI in a single day: a local-first autonomous web agent that explores websites using vision-language models without sending any data off your machine.

The core idea is a perceive-plan-act loop. At each step, the agent captures a screenshot of the current page, extracts interactive DOM elements via JavaScript injection (filtering for visibility, viewport presence, and deduplicating by tag and position), and sends both the visual and structural context to a VLM along with the task description and step history. The VLM returns a structured JSON action (click, type, scroll, navigate, select, wait), which Playwright executes. After exploration completes, the collected screenshots and session data are sent back through the VLM to generate three report types: data insights, UX/UI review, and bug report.
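
The DOM side of the perceive step can be sketched as injected JavaScript evaluated through Playwright; the filters shown here are an approximation of what the extractor does, not its exact logic.

```python
# Condensed sketch of interactive-element extraction via injected JavaScript
# (the filters and dedup key are approximations of the description above).
EXTRACT_JS = """
() => {
  const selectors = 'a, button, input, select, textarea, [role="button"]';
  const seen = new Set();
  return [...document.querySelectorAll(selectors)].flatMap((el, i) => {
    const r = el.getBoundingClientRect();
    const visible = r.width > 0 && r.height > 0 &&
                    r.bottom > 0 && r.top < window.innerHeight;   // in viewport
    const key = `${el.tagName}@${Math.round(r.x)},${Math.round(r.y)}`;
    if (!visible || seen.has(key)) return [];                     // dedupe by tag + position
    seen.add(key);
    return [{ index: i, tag: el.tagName, text: (el.innerText || el.value || '').slice(0, 80) }];
  });
}
"""


async def extract_dom(page) -> list[dict]:
    # Playwright evaluates the snippet in the page context and returns JSON-able data.
    return await page.evaluate(EXTRACT_JS)
```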

I started with three VLM backends: Ollama for local inference, LM Studio for its OpenAI-compatible endpoint, and a generic OpenAI-compatible adapter for anything else. The architecture uses a factory pattern, so adding a new backend means writing a single class. The default model is Qwen 2.5 VL via Ollama.
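
The factory itself is small; a sketch with assumed class and registry names:

```python
# Minimal backend-factory sketch (class and registry names are assumptions;
# the actual adapters differ in detail).
from abc import ABC, abstractmethod


class VLMBackend(ABC):
    @abstractmethod
    async def decide(self, image: bytes, prompt: str) -> dict:
        """Return a structured action chosen by the model."""


class OllamaBackend(VLMBackend):
    async def decide(self, image: bytes, prompt: str) -> dict:
        ...  # POST to the local Ollama chat endpoint and parse the reply


_BACKENDS: dict[str, type[VLMBackend]] = {
    "ollama": OllamaBackend,
    # "lmstudio": LMStudioBackend, "openai": OpenAICompatibleBackend, ...
}


def create_backend(name: str) -> VLMBackend:
    # Adding a new backend means registering a single class here.
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown VLM backend: {name}") from None
```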

The hardest problem on day one was VLM output reliability. Local models frequently produce malformed JSON, extra markdown fences, or free-text responses when you need structured actions. I implemented a 5-strategy parsing fallback: strip markdown fences, direct JSON parse, brace-matching extraction, partial JSON extraction, and finally natural language keyword detection that recovers click/scroll/select actions from plain text. On parse failure, the agent falls back to scrolling rather than stopping, so exploration continues even when the VLM is being uncooperative.
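
A compressed sketch of that fallback chain; the real parser has five distinct strategies and richer keyword rules than shown here.

```python
# Compressed sketch of the parsing fallback chain (the real parser has five
# distinct strategies and more keyword rules than this).
import json
import re


def parse_action(raw: str) -> dict:
    text = re.sub(r"```(?:json)?", "", raw).strip()         # 1. strip markdown fences

    try:
        return json.loads(text)                             # 2. direct JSON parse
    except json.JSONDecodeError:
        pass

    match = re.search(r"\{.*\}", text, re.DOTALL)           # 3. brace-matching extraction
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass                                            # 4. (partial extraction omitted here)

    for verb in ("click", "scroll", "select"):              # 5. keyword recovery from free text
        if verb in text.lower():
            return {"action": verb}

    return {"action": "scroll"}                             # parse failure: keep exploring
```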

The package structure is clean: 9 Python modules, a Click-based CLI with Rich terminal formatting, Pydantic Settings for configuration, and a test suite. Everything runs async via Playwright's async API.
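
For illustration, a minimal entrypoint in that style might look like this; the command and option names are assumptions, not the actual CLI surface.

```python
# Minimal sketch of a Click entrypoint with Rich output (command and option
# names here are illustrative, not the actual Peruse AI CLI surface).
import click
from rich.console import Console

console = Console()


@click.command()
@click.argument("url")
@click.option("--task", default="Explore the site and report findings.", show_default=True)
@click.option("--max-steps", default=20, show_default=True)
def main(url: str, task: str, max_steps: int) -> None:
    """Run an autonomous exploration session against URL."""
    console.print(f"[bold]Peruse AI[/bold] exploring {url} ({max_steps} steps max)")
    # ...hand off to the async agent loop here, e.g. asyncio.run(explore(url, task, max_steps))


if __name__ == "__main__":
    main()
```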

What's next: Adding more VLM backends, improving navigation through complex form elements, and building custom personas so the agent can explore from different perspectives (UX designer, data analyst, accessibility auditor).

Changelog

Added

  • Add core perceive-plan-act agent loop with Playwright browser automation
  • Add dual-channel perception: DOM element extraction + screenshot capture
  • Add 3 VLM backends: Ollama, LM Studio, OpenAI-compatible
  • Add VLM factory pattern for extensible backend support
  • Add 5-strategy VLM response parsing fallback for malformed JSON
  • Add 3 report types: Data Insights, UX/UI Review, Bug Report
  • Add Click CLI with Rich terminal formatting and progress display
  • Add Pydantic Settings configuration (env vars, .env, kwargs)
  • Add error monitoring for console logs and network failures
  • Add test suite for config, outputs, and perception modules
  • Add MIT license and initial README documentation

Fixed

  • Fix premature agent termination by scrolling on parse failure instead of stopping
playwright · ollama · lm-studio · openai-compatible · vlm · web-automation · dom-extraction · screenshot · perception · pydantic · click · rich · python · open-source · mit-license