A verification layer for browser agents: Amazon case study

https://news.ycombinator.com/rss Hits: 2
Summary

This post is a technical report on four runs of the same Amazon shopping flow. The purpose is to isolate one claim: reliability comes from verification, not from giving the model more pixels or more parameters.

Sentience is used here as a verification layer: each step is gated by explicit assertions over structured snapshots. This makes it feasible to use small local models as executors, while reserving larger models for planning (reasoning) when needed. No vision models are required for the core loop in the local runs discussed below.

Key findings

Finding | Evidence (from logs / report)
A fully autonomous run can complete with local models when verification gates every step. | Demo 3 re-run: Steps passed: 7/7 and success: True
Token efficiency can be engineered by interface design (structure + filtering), not by model choice. | Demo 0 report: estimated ~35,000 → 19,956 tokens (~43% reduction)
Verification > intelligence is the practical lesson. | Planner drift is surfaced as explicit FAIL/mismatch rather than silent progress

Key datapoints:

Metric   | Demo 0 (cloud baseline)  | Demo 3 (local autonomy)
Success  | 1/1 run                  | 7/7 steps (re-run)
Duration | ~60,000 ms               | 405,740 ms
Tokens   | 19,956 (after filtering) | 11,114

Task (constant across runs): Amazon → Search “thinkpad” → Click first product → Add to cart → Proceed to checkout

First principles: structure > pixels

Screenshot-based agents use pixels as the control plane. That often fails in predictable ways: ambiguous click targets, undetected navigation failures, and “progress” without state change.

The alternative is to treat the page as a structured snapshot (roles, text, geometry, and a small amount of salience) and then require explicit pass/fail verification after each action.
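To make the structured-snapshot idea concrete, here is a minimal sketch of how a page could be reduced to a short, filtered element list before it reaches the model. Everything in it (`Element`, `filter_elements`, `render`, the role set, the salience threshold) is a hypothetical illustration, not the Sentience SDK's API. The only figures taken from the report are the token counts passed to `reduction_pct`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    """One entry in a structured snapshot (hypothetical shape)."""
    role: str                        # e.g. "button", "link", "searchbox"
    text: str                        # visible label
    bbox: Tuple[int, int, int, int]  # x, y, width, height in pixels
    salience: float                  # 0..1 heuristic importance (assumed)


# Assumed set of roles an executor can act on.
INTERACTIVE_ROLES = {"button", "link", "textbox", "searchbox"}


def filter_elements(elements: List[Element],
                    min_salience: float = 0.3,
                    limit: int = 50) -> List[Element]:
    """Keep labeled, interactive, sufficiently salient elements, most salient first."""
    kept = [
        e for e in elements
        if e.role in INTERACTIVE_ROLES
        and e.text.strip()
        and e.salience >= min_salience
    ]
    kept.sort(key=lambda e: e.salience, reverse=True)
    return kept[:limit]


def render(elements: List[Element]) -> str:
    """Serialize elements as compact 'index|role|text|x,y' lines for the prompt."""
    return "\n".join(
        f"{i}|{e.role}|{e.text}|{e.bbox[0]},{e.bbox[1]}"
        for i, e in enumerate(elements)
    )


def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens removed by filtering."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    page = [
        Element("button", "Add to Cart", (10, 20, 100, 30), 0.9),
        Element("img", "", (0, 0, 500, 500), 0.8),
        Element("link", "Footer legal", (0, 900, 80, 10), 0.1),
        Element("searchbox", "Search Amazon", (200, 10, 400, 30), 0.7),
    ]
    print(render(filter_elements(page)))
    # Figures from the Demo 0 report: ~35,000 estimated vs 19,956 after filtering.
    print(f"{reduction_pct(35_000, 19_956):.0f}% fewer tokens")
```

The point of the sketch is that the prompt size is set by what the interface chooses to serialize, which is consistent with the report's claim that token efficiency is a matter of interface design and not of model choice.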
This is the “Jest for agents” idea: a step does not succeed because the model says it did; it succeeds because an assertion over browser state passes.

The “impossible benchmark”

The target configuration is a strong planner paired with a small, local executor, still achieving reliable end-to-end...
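The “Jest for agents” gating described above can be sketched as a small step runner in which the executor's own claim is recorded but never trusted. This is an illustrative sketch only: `BrowserState`, `url_contains`, `text_visible`, and `run_step` are hypothetical names, and the example URLs are placeholders, not values taken from the report's logs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BrowserState:
    """Observed state after an action (hypothetical, simplified)."""
    url: str
    texts: List[str] = field(default_factory=list)


Assertion = Callable[[BrowserState], bool]


def url_contains(fragment: str) -> Assertion:
    """Assertion: the current URL contains the given fragment."""
    def check(state: BrowserState) -> bool:
        return fragment in state.url
    check.__name__ = f"url_contains({fragment!r})"
    return check


def text_visible(needle: str) -> Assertion:
    """Assertion: some visible text contains the needle (case-insensitive)."""
    def check(state: BrowserState) -> bool:
        return any(needle.lower() in t.lower() for t in state.texts)
    check.__name__ = f"text_visible({needle!r})"
    return check


@dataclass
class StepResult:
    name: str
    claimed: bool         # what the executor reported
    passed: bool          # what the assertions decided
    failures: List[str]   # names of failed assertions


def run_step(name: str,
             act: Callable[[], bool],
             observe: Callable[[], BrowserState],
             assertions: List[Assertion]) -> StepResult:
    """Run one action, then decide pass/fail only from observed state."""
    claimed = act()       # the model's claim is logged, not trusted
    state = observe()     # fresh snapshot taken after the action
    failures = [a.__name__ for a in assertions if not a(state)]
    return StepResult(name, claimed, not failures, failures)


if __name__ == "__main__":
    # The executor claims success, but the page never left the search results.
    result = run_step(
        "add_to_cart",
        act=lambda: True,
        observe=lambda: BrowserState("https://example.test/s?k=thinkpad",
                                     ["Results"]),
        assertions=[url_contains("/cart"), text_visible("Added to Cart")],
    )
    print("PASS" if result.passed else f"FAIL {result.failures}")
```

Keeping `claimed` next to `passed` in the result is one way to surface a mismatch between what the model reported and what the browser shows as an explicit FAIL, instead of letting the run continue silently.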

First seen: 2026-01-22 02:42

Last seen: 2026-01-22 03:42