Model Comparison

Interactive analysis of 7 models across 4 observation spaces on 2,910 tasks

Pass Rate Heatmap

Color-coded pass rates (%) per model and observation mode. Best value per column is highlighted.

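| Model | Browser-Use | AX-tree | SoM | Pixel |
|---|---|---|---|---|
| Gemini 3 Flash | 95.2 (best) | 89.6 (best) | 87.1 (best) | 85.4 (best) |
| GPT-5.4 | 90.4 | 81.5 | 77.0 | 83.8 |
| Gemini 3.1 Flash-Lite | 87.4 | 77.7 | 73.5 | 63.3 |
| GPT-5 mini | 87.0 | 83.1 | 78.5 | 49.0 |
| GPT-5.4 mini | 85.8 | 79.1 | 74.7 | 77.1 |
| Qwen3-VL-235B | - | 77.0 | 54.4 | 50.5 |
| UI-TARS-1.5-7B | - | - | - | 12.6 |
| Average | 89.2 | 81.3 | 74.2 | 60.2 |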

Key Insights

Mode Sensitivity
>30pp

Pass rate can shift by more than 30 percentage points within a single model. GPT-5 mini swings from 87.0% (Browser-Use) down to 49.0% (Pixel), a gap of 38.0pp.
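As a rough illustration of the mode-sensitivity figure, the sketch below takes the spread to be the best mode's pass rate minus the worst mode's, per model. That definition is an assumption inferred from the GPT-5 mini example, and the names `PASS_RATES` and `mode_sensitivity` are hypothetical. Only the five models with results in all four modes are included.

```python
# Pass rates (%) copied from the heatmap; models with all four modes only.
PASS_RATES = {
    "Gemini 3 Flash": {"Browser-Use": 95.2, "AX-tree": 89.6, "SoM": 87.1, "Pixel": 85.4},
    "GPT-5.4": {"Browser-Use": 90.4, "AX-tree": 81.5, "SoM": 77.0, "Pixel": 83.8},
    "Gemini 3.1 Flash-Lite": {"Browser-Use": 87.4, "AX-tree": 77.7, "SoM": 73.5, "Pixel": 63.3},
    "GPT-5 mini": {"Browser-Use": 87.0, "AX-tree": 83.1, "SoM": 78.5, "Pixel": 49.0},
    "GPT-5.4 mini": {"Browser-Use": 85.8, "AX-tree": 79.1, "SoM": 74.7, "Pixel": 77.1},
}


def mode_sensitivity(rates):
    """Return (spread in pp, best mode, worst mode) for one model.

    Assumption: sensitivity is defined as max minus min pass rate across modes.
    """
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return round(rates[best] - rates[worst], 1), best, worst


if __name__ == "__main__":
    # Print models from most to least mode-sensitive.
    ranked = sorted(PASS_RATES, key=lambda m: mode_sensitivity(PASS_RATES[m])[0], reverse=True)
    for model in ranked:
        spread, best, worst = mode_sensitivity(PASS_RATES[model])
        print(f"{model}: {spread}pp ({best} -> {worst})")
```

Under this definition GPT-5 mini is the only listed model whose spread exceeds 30pp.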

SoM Is Model-Dependent
SoM

SoM helps GPT-5 mini by +29.5pp over pixel but hurts GPT-5.4 by 6.8pp. The overlay can confuse models that already read coordinates well.

Efficiency Gap
3.7×

Even the fastest agent (GPT-5.4 mini SoM) is 3.7× slower than human references. The slowest (GPT-5 mini Pixel) reaches 21.5×.

Set-of-Marks: Help or Hurt?

SoM minus Pixel pass rate delta, in percentage points. Positive = SoM helps over pure pixel mode.

GPT-5 mini: +29.5pp
Gemini 3.1 Flash-Lite: +10.2pp
Qwen3-VL-235B: +3.9pp
Gemini 3 Flash: +1.7pp
GPT-5.4 mini: -2.4pp
GPT-5.4: -6.8pp
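The deltas listed here can be reproduced from the heatmap with a simple subtraction. The sketch below is illustrative only: `SOM_AND_PIXEL` and `som_delta` are hypothetical names, and Qwen3-VL-235B's SoM and Pixel values are taken to be the last two numbers in its heatmap row, an assumption that matches the +3.9pp shown.

```python
# (SoM pass rate, Pixel pass rate) in %, copied from the heatmap.
# Assumption: Qwen3-VL-235B's last two listed values are SoM and Pixel.
SOM_AND_PIXEL = {
    "GPT-5 mini": (78.5, 49.0),
    "Gemini 3.1 Flash-Lite": (73.5, 63.3),
    "Qwen3-VL-235B": (54.4, 50.5),
    "Gemini 3 Flash": (87.1, 85.4),
    "GPT-5.4 mini": (74.7, 77.1),
    "GPT-5.4": (77.0, 83.8),
}


def som_delta(som, pixel):
    """SoM minus Pixel, in percentage points; positive means SoM helps."""
    return round(som - pixel, 1)


# Sort from the largest gain to the largest loss.
deltas = sorted(
    ((model, som_delta(som, pixel)) for model, (som, pixel) in SOM_AND_PIXEL.items()),
    key=lambda item: item[1],
    reverse=True,
)

if __name__ == "__main__":
    for model, delta in deltas:
        verdict = "SoM helps" if delta > 0 else "SoM hurts"
        print(f"{model}: {delta:+.1f}pp ({verdict})")
```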

Browser-Use Advantage

Per-Model Advantage

Browser-Use pass rate vs mean of other modes

| Model | BU | Others avg | Advantage |
|---|---|---|---|
| GPT-5 mini | 87.0% | 70.2% | +16.8pp |
| Gemini 3.1 Flash-Lite | 87.4% | 71.5% | +15.9pp |
| GPT-5.4 | 90.4% | 80.8% | +9.6pp |
| GPT-5.4 mini | 85.8% | 77.0% | +8.8pp |
| Gemini 3 Flash | 95.2% | 87.4% | +7.8pp |
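The per-model advantage is the Browser-Use pass rate minus the mean of the other three modes. A minimal sketch of that arithmetic follows, using the heatmap values; the names `PASS_RATES` and `browser_use_advantage` are hypothetical, and rounding is applied only at the end, which appears to match the figures shown.

```python
# Pass rates (%) from the heatmap for the five models that have a Browser-Use result.
PASS_RATES = {
    "GPT-5 mini": {"Browser-Use": 87.0, "AX-tree": 83.1, "SoM": 78.5, "Pixel": 49.0},
    "Gemini 3.1 Flash-Lite": {"Browser-Use": 87.4, "AX-tree": 77.7, "SoM": 73.5, "Pixel": 63.3},
    "GPT-5.4": {"Browser-Use": 90.4, "AX-tree": 81.5, "SoM": 77.0, "Pixel": 83.8},
    "GPT-5.4 mini": {"Browser-Use": 85.8, "AX-tree": 79.1, "SoM": 74.7, "Pixel": 77.1},
    "Gemini 3 Flash": {"Browser-Use": 95.2, "AX-tree": 89.6, "SoM": 87.1, "Pixel": 85.4},
}


def browser_use_advantage(rates):
    """Return (mean of non-BU modes, BU minus that mean), both rounded to 0.1."""
    others = [value for mode, value in rates.items() if mode != "Browser-Use"]
    others_avg = sum(others) / len(others)
    # Subtract the unrounded mean, then round, to avoid compounding rounding error.
    advantage = rates["Browser-Use"] - others_avg
    return round(others_avg, 1), round(advantage, 1)


if __name__ == "__main__":
    for model, rates in PASS_RATES.items():
        others_avg, advantage = browser_use_advantage(rates)
        print(f"{model}: BU {rates['Browser-Use']}% vs others {others_avg}% -> {advantage:+.1f}pp")
```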

Advantage by Family

BU vs mean(AX-tree, SoM, Pixel), in percentage points. Ranges from +31.2 (Advanced Editors) to -25.6 (Drag/Drop).

Advanced Editors: +31.2
List-based Selecti…: +22.5
Continuous & High-…: +18.8
Structured Data Di…: +17.5
Date & Time: +15.2
Discrete Choice: +13.8
Combobox & Autocom…: +12.5
Hierarchical Selec…: +10.8
Text Entry & Struc…: +10.2
Disclosure & Progr…: +8.5
Overlays & Transie…: +7.2
Command & Navigati…: +5.5
Files, Clipboard, …: -3.2
Drag/Drop & Worksp…: -25.6

Family Pass Rates

Averaged across models, sorted by difficulty (hardest first). 14 component families across 4 observation modes.

| Family | BU | AX-tree | SoM | Pixel | Avg |
|---|---|---|---|---|---|
| Drag/Drop & Workspace Interactions | 34.2 | 72.1 | 31.8 | 57.4 | 49.5 |
| Continuous & High-Precision Input | 77.4 | 53.8 | 53.9 | 56.2 | 59.6 |
| Advanced Editors | 87.7 | 67.8 | 49.7 | 46.4 | 61.8 |
| Hierarchical Selection & Navigation | 82.3 | 75.4 | 64.2 | 61.8 | 70.0 |
| Date & Time | 84.9 | 76.2 | 65.8 | 59.4 | 71.0 |
| Files, Clipboard, Downloads | 88.4 | 76.1 | 66.3 | 63.2 | 72.0 |
| Disclosure & Progressive | 83.2 | 76.0 | 64.1 | 68.3 | 72.5 |
| Combobox & Autocomplete | 89.1 | 78.2 | 68.5 | 62.1 | 73.0 |
| List-based Selection (Flat) | 91.4 | 79.3 | 70.2 | 64.8 | 74.0 |
| Structured Data Display | 92.1 | 80.3 | 71.5 | 65.4 | 75.0 |
| Text Entry & Structured Field Input | 93.2 | 84.8 | 78.4 | 72.1 | 78.0 |
| Discrete Choice | 97.9 | 85.7 | 83.3 | 79.0 | 86.0 |
| Overlays & Transient UI | 97.2 | 91.0 | 88.6 | 80.7 | 89.0 |
| Command & Navigation | 98.0 | 96.1 | 89.9 | 83.1 | 91.5 |

Per-Model Family Breakdown

Drill into each model's pass rates across all 14 families and 4 observation modes.

| Family | BU | AX | SoM | Px | Avg |
|---|---|---|---|---|---|
| Drag/Drop & Workspace Interactions | 45.0 | 80.0 | 40.0 | 65.0 | 57.5 |
| Continuous & High-Precision Input | 88.0 | 65.0 | 62.0 | 66.0 | 70.3 |
| Advanced Editors | 96.0 | 77.0 | 60.0 | 58.0 | 72.8 |
| Hierarchical Selection & Navigation | 91.0 | 82.0 | 73.0 | 70.0 | 79.0 |
| Files, Clipboard, Downloads | 95.0 | 82.0 | 74.0 | 72.0 | 80.8 |
| Date & Time | 93.0 | 84.0 | 76.0 | 72.0 | 81.3 |
| Disclosure & Progressive | 93.0 | 83.0 | 72.0 | 78.0 | 81.5 |
| Combobox & Autocomplete | 95.0 | 85.0 | 77.0 | 73.0 | 82.5 |
| List-based Selection (Flat) | 96.0 | 86.0 | 78.0 | 76.0 | 84.0 |
| Structured Data Display | 97.0 | 87.0 | 80.0 | 78.0 | 85.5 |
| Text Entry & Structured Field Input | 97.0 | 91.0 | 85.0 | 82.0 | 88.8 |
| Discrete Choice | 99.0 | 91.0 | 90.0 | 86.0 | 91.5 |
| Overlays & Transient UI | 99.0 | 96.0 | 93.0 | 87.0 | 93.8 |
| Command & Navigation | 99.0 | 98.0 | 95.0 | 90.0 | 95.5 |

Performance by Task Template

Pass rates across 24 task templates. Sorted by average difficulty (hardest first). Showing top 12 hardest + bottom 4 easiest.

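| # | Template | BU | AX | SoM | Px | Avg |
|---|---|---|---|---|---|---|
| 1 | set range | 22.0 | 28.0 | 18.0 | 25.0 | 23.3 |
| 2 | drag operation | 30.0 | 65.0 | 25.0 | 50.0 | 42.5 |
| 3 | editor operation | 55.0 | 50.0 | 35.0 | 32.0 | 43.0 |
| 4 | set scalar | 65.0 | 45.0 | 42.0 | 48.0 | 50.0 |
| 5 | scroll find | 60.0 | 62.0 | 55.0 | 40.0 | 54.3 |
| 6 | file upload | 72.0 | 68.0 | 58.0 | 52.0 | 62.5 |
| 7 | transfer move | 68.0 | 72.0 | 60.0 | 55.0 | 63.8 |
| 8 | match reference | 75.0 | 72.0 | 65.0 | 55.0 | 66.8 |
| 9 | hierarchical path select | 82.0 | 78.0 | 65.0 | 58.0 | 70.8 |
| 10 | select many | 80.0 | 75.0 | 68.0 | 62.0 | 71.3 |
| 11 | confirm accept | 78.0 | 78.0 | 72.0 | 60.0 | 72.0 |
| 12 | replace code | 85.0 | 78.0 | 68.0 | 58.0 | 72.3 |
| ... | (8 templates hidden) | | | | | |
| 21 | clear reset | 95.0 | 92.0 | 88.0 | 80.0 | 88.8 |
| 22 | open overlay | 96.0 | 93.0 | 90.0 | 82.0 | 90.3 |
| 23 | navigate to | 97.0 | 95.0 | 92.0 | 85.0 | 92.3 |
| 24 | activate | 98.0 | 97.0 | 95.0 | 88.0 | 94.5 |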

Difficulty Analysis

Tier Breakdown

Pass rates by task difficulty level

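| Tier | BU | AX | SoM | Px |
|---|---|---|---|---|
| L0 (easy) | 93.7 | 87.4 | 85.0 | 80.5 |
| L1 (medium) | 89.3 | 82.0 | 74.6 | 66.7 |
| L2 (hard) | 85.8 | 77.4 | 66.3 | 54.1 |
| L3 (hard+) | 81.9 | 71.7 | 58.4 | 45.4 |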

Which Difficulty Axes Predict Failure?

Pearson correlation between intended difficulty rating and agent failure rate

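| Axis | Overall | AX | Px | BU |
|---|---|---|---|---|
| Precision requirement | 0.44 | 0.29 | 0.40 | 0.32 |
| Target acquisition | 0.32 | 0.17 | 0.36 | 0.19 |
| Density / choice interf. | 0.23 | 0.10 | 0.31 | 0.12 |
| Feedback dynamics | 0.22 | 0.11 | 0.29 | 0.13 |
| Depth / layering | 0.20 | 0.10 | 0.30 | 0.04 |
| Semantic observability | 0.13 | 0.08 | 0.17 | 0.05 |
| Disambiguation load | 0.10 | 0.04 | 0.15 | 0.02 |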
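For readers who want to see how such a coefficient is obtained, here is a minimal sketch of the Pearson correlation between a difficulty rating and a failure rate. The sample data is invented for illustration and is not from the benchmark; the unit of analysis (per task or per component) is not stated here, so the pairing below is an assumption, as is the function name `pearson`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: intended precision rating (0-3) and observed failure
    # rate (fraction of failed runs) for six made-up tasks.
    precision_rating = [0, 1, 1, 2, 3, 3]
    failure_rate = [0.05, 0.10, 0.30, 0.25, 0.40, 0.70]
    print(f"r = {pearson(precision_rating, failure_rate):.2f}")
```

A coefficient near 0 means the rating tells little about failure, while values toward 1 mean higher-rated tasks fail more often. On Python 3.10 or later, `statistics.correlation` computes the same quantity.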

Difficulty Inversion: Trivial for Humans, Hard for Agents

Components where human step count is low but agent pass rate is disproportionately poor.

| Component | Human steps (avg) | Agent pass rate |
|---|---|---|
| resizable columns | 1.5 | 22.6% |
| window splitter | 1.3 | 37.7% |
| slider range | 1.8 | 38.8% |
| color picker 2d | 1.6 | 42.3% |
| alpha slider | 1.4 | 45.1% |
| kanban board drag drop | 2 | 48.2% |
| drag drop sortable list | 1.9 | 50.5% |
| split button | 1.8 | 52% |
| drag drop between lists | 2 | 55.3% |

Every component is shown with Human: 100% and Agent: Hard.

ComponentBench-Core (912 hard tasks)

Pass rates drop 10 to 39 percentage points from Full, confirming Core concentrates on unresolved families. H = within human step count, 2H = within double, 3H = within triple.

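| Model | Mode | Pass | H | 2H | 3H |
|---|---|---|---|---|---|
| Gemini 3 Flash | Browser-Use | 84.5 | 51.5 | 71.5 | 78.5 |
| Gemini 3 Flash | Pixel | 60.9 | 30.5 | 51.0 | 56.0 |
| GPT-5.4 mini | Browser-Use | 57.8 | 36.8 | 51.2 | 55.3 |
| GPT-5.4 mini | Pixel | 37.7 | 22.0 | 32.1 | 34.1 |
| Opus 4.6 | Pixel | 65.4 | 34.1 | 53.8 | 59.4 |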
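The H, 2H and 3H columns can be read as pass rates under a step budget. The sketch below assumes a run counts toward kH only if it passed and used at most k times the human reference step count for that task. The `Run` record, the `budgeted_pass_rate` function and the sample runs are all hypothetical, not the benchmark's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool      # did the agent complete the task?
    agent_steps: int  # steps the agent used
    human_steps: int  # human reference step count for the same task


def budgeted_pass_rate(runs, multiplier):
    """Percentage of runs that passed within multiplier x the human step count."""
    if not runs:
        raise ValueError("no runs to score")
    hits = sum(
        1 for run in runs
        if run.passed and run.agent_steps <= multiplier * run.human_steps
    )
    return 100.0 * hits / len(runs)


if __name__ == "__main__":
    # Invented sample: three passes with different step counts, one failure.
    runs = [
        Run(passed=True, agent_steps=2, human_steps=2),   # within H
        Run(passed=True, agent_steps=4, human_steps=2),   # within 2H
        Run(passed=True, agent_steps=5, human_steps=2),   # within 3H
        Run(passed=False, agent_steps=3, human_steps=2),  # failed, never counts
    ]
    for label, k in (("H", 1), ("2H", 2), ("3H", 3)):
        print(f"{label}: {budgeted_pass_rate(runs, k):.1f}%")
```

Under this reading, each budgeted rate is at most the plain pass rate and cannot decrease as the budget grows, which is consistent with every row of the table.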