Compare
Compare models, interfaces, families, and benchmark versions.
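This page emphasizes representation dependence: the same task can score very differently depending on whether the agent observes it through Browser-Use, AX-tree, SoM, or Pixel. In the v1 matrix, for example, gpt-5-mini ranges from 87.0% to 49.0% across the four modes. Full/Core comparisons are shown only where real rows exist.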
Pass Rate Heatmap
v1 model/mode matrix from `topline_model_mode.csv`.
| Model | ||||
|---|---|---|---|---|
| gemini-3-flash | 95.2% | 89.6% | 87.1% | 85.4% |
| gemini-3.1-flash-lite | 87.4% | 77.7% | 73.5% | 63.3% |
| gpt-5-mini | 87.0% | 83.1% | 78.5% | 49.0% |
| gpt-5.4-mini | 85.8% | 79.1% | 74.7% | 77.1% |
| qwen3-vl-235b-fp8 | pending | 77.0% | 54.4% | 50.5% |
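As a rough illustration of representation dependence, the spread between the best and worst mode can be computed per model from the rows above. This is only a sketch: the helper name `mode_spread` is hypothetical, the four values per model are taken in the order the columns appear, and the `pending` cell is treated as missing rather than as zero.

```python
from typing import Optional

# v1 pass rates (%) copied from the matrix above, one entry per mode column
# in the order the columns appear; None marks the "pending" cell.
V1_MATRIX: dict[str, list[Optional[float]]] = {
    "gemini-3-flash": [95.2, 89.6, 87.1, 85.4],
    "gemini-3.1-flash-lite": [87.4, 77.7, 73.5, 63.3],
    "gpt-5-mini": [87.0, 83.1, 78.5, 49.0],
    "gpt-5.4-mini": [85.8, 79.1, 74.7, 77.1],
    "qwen3-vl-235b-fp8": [None, 77.0, 54.4, 50.5],
}


def mode_spread(rates: list[Optional[float]]) -> float:
    """Best-mode minus worst-mode pass rate, ignoring missing cells."""
    known = [r for r in rates if r is not None]
    return round(max(known) - min(known), 1)


# Hypothetical summary: one spread (in percentage points) per model.
spreads = {model: mode_spread(rates) for model, rates in V1_MATRIX.items()}

# Print models from most to least mode-sensitive.
for model, spread in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {spread:.1f} points")
```

Under these assumptions, gpt-5-mini shows the widest gap between its best and worst mode. The qwen3-vl-235b-fp8 figure covers only three modes, so it is not directly comparable with the others.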
Family Heatmap Summary
Hardest families by average v1 pass rate.
| Family | Average v1 pass rate |
|---|---|
| Drag/Drop & Workspace Interactions | 48.4% |
| Continuous & High-Precision Input | 58.1% |
| Advanced Editors | 60.1% |
| Date & Time | 69.9% |
| Disclosure & Progressive | 71.3% |
| Structured Data Display | 71.9% |
| List-based Selection (Flat) | 74.3% |
| Combobox & Autocomplete | 79.7% |
| Hierarchical Selection & Navigation | 80.2% |
| Text Entry & Structured Field Input | 82.9% |
| Files, Clipboard, Downloads | 83.4% |
| Discrete Choice | 85.0% |
| Overlays & Transient UI | 88.2% |
| Command & Navigation | 90.7% |
Full/Core Availability
Core rows are shown only when real v2 result rows exist.
| Model | Mode / version | Pass rate |
|---|---|---|
| gemini-3-flash | Browser-Use / Core | 76.9% |
| gemini-3-flash | Pixel / Core | 60.9% |
| gpt-5.4-mini | Browser-Use / Core | 57.8% |
| gpt-5.4-mini | Pixel / Core | 37.7% |
Component Difficulty Heatmap
Components with the largest average absolute pass-rate delta across observation modes.
| Component | Mode sensitivity |
|---|---|
| select native | 51.0% |
| kanban board drag drop | 49.3% |
| drag drop sortable list | 40.0% |
| rich text editor | 37.5% |
| drag drop between lists | 34.3% |
| context menu | 28.0% |
| window splitter | 27.5% |
| code editor | 26.7% |
| json editor | 26.0% |
| tree grid | 25.9% |
| rating | 24.8% |
| data table filterable | 24.7% |
| select custom multi | 24.7% |
| transfer list | 23.7% |
| toggle button group multi | 23.6% |
| data grid row selection | 22.8% |
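The page does not spell out how "average absolute pass-rate delta across observation modes" is calculated. One plausible reading, shown here purely as an assumption, is the mean of the absolute differences over every pair of modes. The function name `mean_abs_pairwise_delta` is hypothetical, and because per-component, per-mode pass rates are not listed on this page, the usage line borrows the gpt-5-mini row from the v1 model/mode matrix just to show the arithmetic.

```python
from itertools import combinations
from statistics import mean


def mean_abs_pairwise_delta(rates: list[float]) -> float:
    """Mean absolute difference over all unordered pairs of mode pass rates.

    This is an assumed definition of "mode sensitivity", not one confirmed
    by the source page.
    """
    if len(rates) < 2:
        return 0.0  # a single mode has nothing to differ from
    return mean(abs(a - b) for a, b in combinations(rates, 2))


# Illustration only: the gpt-5-mini row of the v1 matrix (four modes, in %).
example = mean_abs_pairwise_delta([87.0, 83.1, 78.5, 49.0])
print(f"{example:.1f} points")
```

With four modes there are six pairs, so one outlier mode contributes to three of the six differences. That is why a single weak interface can dominate the score under this reading.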