Compare

Compare models, interfaces, families, and benchmark versions.

This page emphasizes representation dependence: the same task can shift dramatically between Browser-Use, AX-tree, SoM, and Pixel. Full/Core comparisons are shown only where real rows exist.

Pass Rate Heatmap

v1 model/mode matrix from `topline_model_mode.csv`.

Model
gemini-3-flash          95.2%    89.6%    87.1%    85.4%
gemini-3.1-flash-lite   87.4%    77.7%    73.5%    63.3%
gpt-5-mini              87.0%    83.1%    78.5%    49.0%
gpt-5.4-mini            85.8%    79.1%    74.7%    77.1%
qwen3-vl-235b-fp8       pending  77.0%    54.4%    50.5%
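
To make the representation dependence concrete, each row of the matrix can be reduced to a spread: the best pass rate minus the worst across the modes that have a value. The Python sketch below is a minimal illustration, not the dashboard's own method. The values are copied from the rows on this page in the order shown; the names `V1_ROWS` and `spread`, and the choice to treat the `pending` cell as missing, are assumptions of this example.

```python
from typing import Optional

# Pass rates (%) copied from the v1 matrix on this page, in the order shown.
# None stands in for the cell marked "pending" (an assumption of this sketch).
V1_ROWS: dict[str, list[Optional[float]]] = {
    "gemini-3-flash": [95.2, 89.6, 87.1, 85.4],
    "gemini-3.1-flash-lite": [87.4, 77.7, 73.5, 63.3],
    "gpt-5-mini": [87.0, 83.1, 78.5, 49.0],
    "gpt-5.4-mini": [85.8, 79.1, 74.7, 77.1],
    "qwen3-vl-235b-fp8": [None, 77.0, 54.4, 50.5],
}


def spread(rates: list[Optional[float]]) -> float:
    """Best minus worst pass rate over the cells that have a value."""
    known = [r for r in rates if r is not None]
    return round(max(known) - min(known), 1)


spreads = {model: spread(rates) for model, rates in V1_ROWS.items()}

# List models from most to least mode-dependent.
for model, gap in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap} points")
```

On these rows, gpt-5-mini has the widest spread (38.0 points) and gemini-3-flash the narrowest (9.8 points). The qwen3-vl-235b-fp8 figure covers only its three reported cells, so it may change once the pending result lands.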

Family Heatmap Summary

Hardest families by average v1 pass rate.

Drag/Drop & Workspace Interactions: 48.4%
Continuous & High-Precision Input: 58.1%
Advanced Editors: 60.1%
Date & Time: 69.9%
Disclosure & Progressive: 71.3%
Structured Data Display: 71.9%
List-based Selection (Flat): 74.3%
Combobox & Autocomplete: 79.7%
Hierarchical Selection & Navigation: 80.2%
Text Entry & Structured Field Input: 82.9%
Files, Clipboard, Downloads: 83.4%
Discrete Choice: 85.0%
Overlays & Transient UI: 88.2%
Command & Navigation: 90.7%

Full/Core Availability

Core rows are shown only when real v2 result rows exist.

gemini-3-flash, Browser-Use / Core: 76.9%
gemini-3-flash, Pixel / Core: 60.9%
gpt-5.4-mini, Browser-Use / Core: 57.8%
gpt-5.4-mini, Pixel / Core: 37.7%

Component Difficulty Heatmap

Components with the largest average absolute pass-rate delta across observation modes.

Mode sensitivity by component:

select native: 51.0%
kanban board drag drop: 49.3%
drag drop sortable list: 40.0%
rich text editor: 37.5%
drag drop between lists: 34.3%
context menu: 28.0%
window splitter: 27.5%
code editor: 26.7%
json editor: 26.0%
tree grid: 25.9%
rating: 24.8%
data table filterable: 24.7%
select custom multi: 24.7%
transfer list: 23.7%
toggle button group multi: 23.6%
data grid row selection: 22.8%
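
The page does not spell out how "average absolute pass-rate delta across observation modes" is computed. One plausible reading is the mean absolute difference over every pair of modes for a component. The Python sketch below implements that reading only; the function name `mode_sensitivity` and the input values are hypothetical and are not taken from the benchmark.

```python
from itertools import combinations


def mode_sensitivity(rates_by_mode: dict[str, float]) -> float:
    """Mean absolute difference over all pairs of modes (one possible reading)."""
    pairs = list(combinations(rates_by_mode.values(), 2))
    if not pairs:
        raise ValueError("need at least two modes")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical pass rates (%) for a single component; not benchmark data.
example = {"Browser-Use": 90.0, "AX-tree": 80.0, "SoM": 60.0, "Pixel": 40.0}
print(f"{mode_sensitivity(example):.1f}")
```

Under this reading, a component that every mode solves equally well scores 0, and the score grows as the modes diverge. Other definitions, such as the mean absolute deviation from the per-component average, would give different numbers and possibly a different ranking.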