Model Comparison
Interactive analysis of 7 models across 4 observation spaces on 2,910 tasks
Pass Rate Heatmap
Color-coded pass rates (%) per model and observation mode. Best value per column is highlighted.
| Model | Browser-Use | AX-tree | SoM | Pixel |
|---|---|---|---|---|
| Gemini 3 Flash | **95.2** | **89.6** | **87.1** | **85.4** |
| GPT-5.4 | 90.4 | 81.5 | 77.0 | 83.8 |
| Gemini 3.1 Flash-Lite | 87.4 | 77.7 | 73.5 | 63.3 |
| GPT-5 mini | 87.0 | 83.1 | 78.5 | 49.0 |
| GPT-5.4 mini | 85.8 | 79.1 | 74.7 | 77.1 |
| Qwen3-VL-235B | — | 77.0 | 54.4 | 50.5 |
| UI-TARS-1.5-7B | — | — | — | 12.6 |
| Average | 89.2 | 81.3 | 74.2 | 60.2 |
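
The Average row appears to be a per-column mean over only the models that report a value in that column. The sketch below reproduces it under that assumption; the names `MODES`, `PASS_RATES`, and `column_averages` are illustrative and not part of the dashboard.

```python
# Pass rates (%) copied from the heatmap above.
# None marks a cell with no reported result for that model and mode.
MODES = ["Browser-Use", "AX-tree", "SoM", "Pixel"]
PASS_RATES = {
    "Gemini 3 Flash":        [95.2, 89.6, 87.1, 85.4],
    "GPT-5.4":               [90.4, 81.5, 77.0, 83.8],
    "Gemini 3.1 Flash-Lite": [87.4, 77.7, 73.5, 63.3],
    "GPT-5 mini":            [87.0, 83.1, 78.5, 49.0],
    "GPT-5.4 mini":          [85.8, 79.1, 74.7, 77.1],
    "Qwen3-VL-235B":         [None, 77.0, 54.4, 50.5],
    "UI-TARS-1.5-7B":        [None, None, None, 12.6],
}


def column_averages(table):
    """Mean of each column over the models that have a value in it."""
    averages = []
    for col in range(len(MODES)):
        values = [row[col] for row in table.values() if row[col] is not None]
        averages.append(round(sum(values) / len(values), 1))
    return averages


print(dict(zip(MODES, column_averages(PASS_RATES))))
```

Because each column averages over a different set of models (five for Browser-Use, seven for Pixel), the column averages are not a like-for-like comparison of modes.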
Key Insights
Mode Sensitivity: >30pp
A single model's pass rate can shift by more than 30 percentage points depending on observation mode. GPT-5 mini swings from 87.0% (Browser-Use) down to 49.0% (Pixel).
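
One way to quantify mode sensitivity is the spread between a model's best and worst mode. The sketch below assumes that definition (best minus worst over the modes a model was evaluated in); `mode_spread` and `RATES` are illustrative names, and models evaluated in a single mode are left out.

```python
# Pass rates (%) per model in the order Browser-Use, AX-tree, SoM, Pixel,
# copied from the heatmap. None marks a mode with no reported result.
RATES = {
    "Gemini 3 Flash":        [95.2, 89.6, 87.1, 85.4],
    "GPT-5.4":               [90.4, 81.5, 77.0, 83.8],
    "Gemini 3.1 Flash-Lite": [87.4, 77.7, 73.5, 63.3],
    "GPT-5 mini":            [87.0, 83.1, 78.5, 49.0],
    "GPT-5.4 mini":          [85.8, 79.1, 74.7, 77.1],
    "Qwen3-VL-235B":         [None, 77.0, 54.4, 50.5],
}


def mode_spread(rates):
    """Best minus worst pass rate, in percentage points, over reported modes."""
    present = [r for r in rates if r is not None]
    return round(max(present) - min(present), 1)


spreads = {model: mode_spread(r) for model, r in RATES.items()}
for model, spread in sorted(spreads.items(), key=lambda kv: -kv[1]):
    print(f"{model:24s} {spread:5.1f} pp")
```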
SoM Is Model-Dependent
SoM helps GPT-5 mini by +29.5pp over pixel but hurts GPT-5.4 by −6.8pp. The overlay can confuse models that already read coordinates well.
Efficiency Gap: 3.7×
Even the fastest agent (GPT-5.4 mini SoM) is 3.7× slower than human references. The slowest (GPT-5 mini Pixel) reaches 21.5×.
Set-of-Marks: Help or Hurt?
SoM−Pixel pass rate delta. Positive = SoM helps over pure pixel mode.
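
The delta described above can be recomputed from the heatmap values. This is a minimal sketch; `SOM_PIXEL` and `som_delta` are illustrative names, and only models with both a SoM and a Pixel result are included.

```python
# (SoM, Pixel) pass rates in %, copied from the heatmap.
SOM_PIXEL = {
    "Gemini 3 Flash":        (87.1, 85.4),
    "GPT-5.4":               (77.0, 83.8),
    "Gemini 3.1 Flash-Lite": (73.5, 63.3),
    "GPT-5 mini":            (78.5, 49.0),
    "GPT-5.4 mini":          (74.7, 77.1),
    "Qwen3-VL-235B":         (54.4, 50.5),
}


def som_delta(som, pixel):
    """SoM minus Pixel, in percentage points. Positive means SoM helps."""
    return round(som - pixel, 1)


deltas = {model: som_delta(*pair) for model, pair in SOM_PIXEL.items()}
for model, delta in deltas.items():
    verdict = "helps" if delta > 0 else "hurts"
    print(f"{model:24s} {delta:+6.1f} pp  (SoM {verdict})")
```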
Browser-Use Advantage
Per-Model Advantage
Browser-Use pass rate vs mean of other modes
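
As a rough sketch of this comparison, the per-model advantage can be taken as the Browser-Use pass rate minus the unweighted mean of the other three modes. The unweighted mean is an assumption, and `BU_AND_OTHERS` and `bu_advantage` are illustrative names. Only models with a Browser-Use result are included.

```python
# Browser-Use rate and the (AX-tree, SoM, Pixel) rates, in %, from the heatmap.
BU_AND_OTHERS = {
    "Gemini 3 Flash":        (95.2, (89.6, 87.1, 85.4)),
    "GPT-5.4":               (90.4, (81.5, 77.0, 83.8)),
    "Gemini 3.1 Flash-Lite": (87.4, (77.7, 73.5, 63.3)),
    "GPT-5 mini":            (87.0, (83.1, 78.5, 49.0)),
    "GPT-5.4 mini":          (85.8, (79.1, 74.7, 77.1)),
}


def bu_advantage(bu, others):
    """Browser-Use minus the mean of the other modes, in percentage points."""
    return round(bu - sum(others) / len(others), 1)


advantages = {m: bu_advantage(bu, others) for m, (bu, others) in BU_AND_OTHERS.items()}
for model, adv in advantages.items():
    print(f"{model:24s} {adv:+5.1f} pp")
```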
Advantage by Family
Browser-Use (BU) vs mean(AX-tree, SoM, Pixel). Ranges from +31 pp (Editors) to −26 pp (Drag/Drop).
Family Pass Rates
Averaged across models, sorted by difficulty (hardest first). 14 component families across 4 observation modes.
| Family | BU | AX-tree | SoM | Pixel | Avg |
|---|---|---|---|---|---|
| Drag/Drop & Workspace Interactions | 34.2 | 72.1 | 31.8 | 57.4 | 49.5 |
| Continuous & High-Precision Input | 77.4 | 53.8 | 53.9 | 56.2 | 59.6 |
| Advanced Editors | 87.7 | 67.8 | 49.7 | 46.4 | 61.8 |
| Hierarchical Selection & Navigation | 82.3 | 75.4 | 64.2 | 61.8 | 70.0 |
| Date & Time | 84.9 | 76.2 | 65.8 | 59.4 | 71.0 |
| Files, Clipboard, Downloads | 88.4 | 76.1 | 66.3 | 63.2 | 72.0 |
| Disclosure & Progressive | 83.2 | 76.0 | 64.1 | 68.3 | 72.5 |
| Combobox & Autocomplete | 89.1 | 78.2 | 68.5 | 62.1 | 73.0 |
| List-based Selection (Flat) | 91.4 | 79.3 | 70.2 | 64.8 | 74.0 |
| Structured Data Display | 92.1 | 80.3 | 71.5 | 65.4 | 75.0 |
| Text Entry & Structured Field Input | 93.2 | 84.8 | 78.4 | 72.1 | 78.0 |
| Discrete Choice | 97.9 | 85.7 | 83.3 | 79.0 | 86.0 |
| Overlays & Transient UI | 97.2 | 91.0 | 88.6 | 80.7 | 89.0 |
| Command & Navigation | 98.0 | 96.1 | 89.9 | 83.1 | 91.5 |
Per-Model Family Breakdown
Drill into each model's pass rates across all 14 families and 4 observation modes.
| Family | BU | AX | SoM | Px | Avg |
|---|---|---|---|---|---|
| Drag/Drop & Workspace Interactions | 45.0 | 80.0 | 40.0 | 65.0 | 57.5 |
| Continuous & High-Precision Input | 88.0 | 65.0 | 62.0 | 66.0 | 70.3 |
| Advanced Editors | 96.0 | 77.0 | 60.0 | 58.0 | 72.8 |
| Hierarchical Selection & Navigation | 91.0 | 82.0 | 73.0 | 70.0 | 79.0 |
| Files, Clipboard, Downloads | 95.0 | 82.0 | 74.0 | 72.0 | 80.8 |
| Date & Time | 93.0 | 84.0 | 76.0 | 72.0 | 81.3 |
| Disclosure & Progressive | 93.0 | 83.0 | 72.0 | 78.0 | 81.5 |
| Combobox & Autocomplete | 95.0 | 85.0 | 77.0 | 73.0 | 82.5 |
| List-based Selection (Flat) | 96.0 | 86.0 | 78.0 | 76.0 | 84.0 |
| Structured Data Display | 97.0 | 87.0 | 80.0 | 78.0 | 85.5 |
| Text Entry & Structured Field Input | 97.0 | 91.0 | 85.0 | 82.0 | 88.8 |
| Discrete Choice | 99.0 | 91.0 | 90.0 | 86.0 | 91.5 |
| Overlays & Transient UI | 99.0 | 96.0 | 93.0 | 87.0 | 93.8 |
| Command & Navigation | 99.0 | 98.0 | 95.0 | 90.0 | 95.5 |
Performance by Task Template
Pass rates across 24 task templates, sorted by average difficulty (hardest first). Showing the 12 hardest and the 4 easiest.
| # | Template | BU | AX | SoM | Px | Avg |
|---|---|---|---|---|---|---|
| 1 | set range | 22.0 | 28.0 | 18.0 | 25.0 | 23.3 |
| 2 | drag operation | 30.0 | 65.0 | 25.0 | 50.0 | 42.5 |
| 3 | editor operation | 55.0 | 50.0 | 35.0 | 32.0 | 43.0 |
| 4 | set scalar | 65.0 | 45.0 | 42.0 | 48.0 | 50.0 |
| 5 | scroll find | 60.0 | 62.0 | 55.0 | 40.0 | 54.3 |
| 6 | file upload | 72.0 | 68.0 | 58.0 | 52.0 | 62.5 |
| 7 | transfer move | 68.0 | 72.0 | 60.0 | 55.0 | 63.8 |
| 8 | match reference | 75.0 | 72.0 | 65.0 | 55.0 | 66.8 |
| 9 | hierarchical path select | 82.0 | 78.0 | 65.0 | 58.0 | 70.8 |
| 10 | select many | 80.0 | 75.0 | 68.0 | 62.0 | 71.3 |
| 11 | confirm accept | 78.0 | 78.0 | 72.0 | 60.0 | 72.0 |
| 12 | replace code | 85.0 | 78.0 | 68.0 | 58.0 | 72.3 |
| ••• | 8 templates hidden | | | | | |
| 21 | clear reset | 95.0 | 92.0 | 88.0 | 80.0 | 88.8 |
| 22 | open overlay | 96.0 | 93.0 | 90.0 | 82.0 | 90.3 |
| 23 | navigate to | 97.0 | 95.0 | 92.0 | 85.0 | 92.3 |
| 24 | activate | 98.0 | 97.0 | 95.0 | 88.0 | 94.5 |
Difficulty Analysis
Tier Breakdown
Pass rates by task difficulty level
| Tier | BU | AX | SoM | Px |
|---|---|---|---|---|
| L0 (easy) | 93.7 | 87.4 | 85.0 | 80.5 |
| L1 (medium) | 89.3 | 82.0 | 74.6 | 66.7 |
| L2 (hard) | 85.8 | 77.4 | 66.3 | 54.1 |
| L3 (hard+) | 81.9 | 71.7 | 58.4 | 45.4 |
Which Difficulty Axes Predict Failure?
Pearson correlation between intended difficulty rating and agent failure rate
| Axis | Overall | AX | Px | BU |
|---|---|---|---|---|
| Precision requirement | 0.44 | 0.29 | 0.40 | 0.32 |
| Target acquisition | 0.32 | 0.17 | 0.36 | 0.19 |
| Density / choice interf. | 0.23 | 0.10 | 0.31 | 0.12 |
| Feedback dynamics | 0.22 | 0.11 | 0.29 | 0.13 |
| Depth / layering | 0.20 | 0.10 | 0.30 | 0.04 |
| Semantic observability | 0.13 | 0.08 | 0.17 | 0.05 |
| Disambiguation load | 0.10 | 0.04 | 0.15 | 0.02 |
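
For reference, a Pearson coefficient of the kind tabulated above can be computed as follows. This is a sketch only: the dashboard does not state whether the correlation is taken per task or per component, and the toy `ratings` and `failed` lists are hypothetical values made up for illustration, not benchmark data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the benchmark): one entry per task.
ratings = [0, 0, 1, 1, 2, 2, 3, 3]   # intended difficulty on one axis
failed = [0, 0, 0, 1, 0, 1, 1, 1]    # 1 if the agent failed the task

r = pearson(ratings, failed)
print(f"r = {r:.2f}")
```

A coefficient near 0 means the axis rating says little about failure, which is how the low Disambiguation load values under AX and BU would be read.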
Difficulty Inversion: Trivial for Humans, Hard for Agents
Components where human step count is low but agent pass rate is disproportionately poor.
| Component | Human steps (avg) | Agent pass rate (%) |
|---|---|---|
| resizable columns | 1.5 | 22.6 |
| window splitter | 1.3 | 37.7 |
| slider range | 1.8 | 38.8 |
| color picker 2d | 1.6 | 42.3 |
| alpha slider | 1.4 | 45.1 |
| kanban board drag drop | 2 | 48.2 |
| drag drop sortable list | 1.9 | 50.5 |
| split button | 1.8 | 52 |
| drag drop between lists | 2 | 55.3 |
ComponentBench-Core (912 hard tasks)
Pass rates drop 10–39 pp relative to Full, confirming that Core concentrates on unresolved families. ≤H = within the human step count, ≤2H = within double, ≤3H = within triple.
| Model | Mode | Pass | ≤H | ≤2H | ≤3H |
|---|---|---|---|---|---|
| Gemini 3 Flash | Browser-Use | 84.5 | 51.5 | 71.5 | 78.5 |
| Gemini 3 Flash | Pixel | 60.9 | 30.5 | 51.0 | 56.0 |
| GPT-5.4 mini | Browser-Use | 57.8 | 36.8 | 51.2 | 55.3 |
| GPT-5.4 mini | Pixel | 37.7 | 22.0 | 32.1 | 34.1 |
| Opus 4.6 | Pixel | 65.4 | 34.1 | 53.8 | 59.4 |
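
The ≤H, ≤2H, and ≤3H columns can be read as step-budgeted pass rates. The sketch below assumes that each is the share of all tasks that were passed using at most k times the human reference step count, which is consistent with every ≤3H value in the table being no larger than the corresponding Pass value. The `Episode` type, `rate_within`, and the sample episodes are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    passed: bool       # did the agent complete the task?
    agent_steps: int   # steps the agent took
    human_steps: int   # human reference step count for the same task


def rate_within(episodes, k):
    """Percent of all episodes passed within k x the human step count."""
    hits = sum(
        1 for e in episodes
        if e.passed and e.agent_steps <= k * e.human_steps
    )
    return round(100 * hits / len(episodes), 1)


# Hypothetical episodes for illustration only.
episodes = [
    Episode(True, 2, 2),
    Episode(True, 3, 2),
    Episode(True, 7, 2),
    Episode(False, 2, 2),
    Episode(True, 5, 2),
]

pass_rate = round(100 * sum(e.passed for e in episodes) / len(episodes), 1)
print("Pass:", pass_rate)
for k in (1, 2, 3):
    print(f"<= {k}H:", rate_within(episodes, k))
```

Under this reading, the gap between Pass and ≤3H is the share of tasks that were solved but took more than three times the human step count.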