Model Comparison

Interactive analysis of 7 models across 4 observation spaces on 2,910 tasks

Pass Rate Heatmap

Color-coded pass rates (%) per model and observation mode. Best value per column is highlighted.

Model	Browser-Use	AX-tree	SoM	Pixel
Gemini 3 Flash	95.2 (best)	89.6 (best)	87.1 (best)	85.4 (best)
GPT-5.4	90.4	81.5	77.0	83.8
Gemini 3.1 Flash-Lite	87.4	77.7	73.5	63.3
GPT-5 mini	87.0	83.1	78.5	49.0
GPT-5.4 mini	85.8	79.1	74.7	77.1
Qwen3-VL-235B	—	77.0	54.4	50.5
UI-TARS-1.5-7B	—	—	—	12.6
Average	89.2	81.3	74.2	60.2
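The Average row can be reproduced from the table if each column is averaged only over the models that have a value in it. The sketch below assumes that is how the dashboard computes it; the names `MODES`, `PASS_RATES`, and `column_averages` are illustrative and not taken from the source.

```python
from statistics import mean

# Pass rates (%) copied from the heatmap. None marks a model that has
# no reported result for that observation mode.
MODES = ["Browser-Use", "AX-tree", "SoM", "Pixel"]
PASS_RATES = {
    "Gemini 3 Flash": [95.2, 89.6, 87.1, 85.4],
    "GPT-5.4": [90.4, 81.5, 77.0, 83.8],
    "Gemini 3.1 Flash-Lite": [87.4, 77.7, 73.5, 63.3],
    "GPT-5 mini": [87.0, 83.1, 78.5, 49.0],
    "GPT-5.4 mini": [85.8, 79.1, 74.7, 77.1],
    "Qwen3-VL-235B": [None, 77.0, 54.4, 50.5],
    "UI-TARS-1.5-7B": [None, None, None, 12.6],
}


def column_averages(table):
    """Average each mode over only the models that report a value for it."""
    averages = {}
    for i, mode in enumerate(MODES):
        values = [row[i] for row in table.values() if row[i] is not None]
        averages[mode] = round(mean(values), 1)
    return averages


print(column_averages(PASS_RATES))
```

Under this assumption the columns average over different numbers of models (5 for Browser-Use, 6 for AX-tree and SoM, 7 for Pixel), so the Pixel average is pulled down by the single UI-TARS-1.5-7B entry and is not directly comparable with the Browser-Use average.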

Key Insights

Mode Sensitivity
>30pp

Pass rate can shift by more than 30 percentage points within a single model. GPT-5 mini swings from 87.0% (Browser-Use) down to 49.0% (Pixel), a gap of 38.0 pp.
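As a rough check, mode sensitivity can be read as the gap between a model's best and worst observation mode. The sketch below uses the heatmap values for the five models evaluated in all four modes; `mode_spread` is an illustrative helper, not something defined by the source.

```python
# Pass rates (%) from the heatmap, in the order Browser-Use, AX-tree, SoM, Pixel.
# Only models with results in all four modes are included.
PASS_RATES = {
    "Gemini 3 Flash": [95.2, 89.6, 87.1, 85.4],
    "GPT-5.4": [90.4, 81.5, 77.0, 83.8],
    "Gemini 3.1 Flash-Lite": [87.4, 77.7, 73.5, 63.3],
    "GPT-5 mini": [87.0, 83.1, 78.5, 49.0],
    "GPT-5.4 mini": [85.8, 79.1, 74.7, 77.1],
}


def mode_spread(rates):
    """Gap in percentage points between the best and worst mode."""
    return round(max(rates) - min(rates), 1)


spreads = {model: mode_spread(rates) for model, rates in PASS_RATES.items()}
for model, spread in sorted(spreads.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {spread} pp")
```

On these figures only GPT-5 mini exceeds 30 pp; the next largest spread is Gemini 3.1 Flash-Lite at 24.1 pp.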

SoM Is Model-Dependent

SoM lifts GPT-5 mini by 29.5 pp over Pixel but costs GPT-5.4 6.8 pp. The overlay can confuse models that already read coordinates well.

Efficiency Gap
3.7×

Even the fastest agent (GPT-5.4 mini SoM) is 3.7× slower than human references. The slowest (GPT-5 mini Pixel) reaches 21.5×.

Set-of-Marks: Help or Hurt?

SoM−Pixel pass rate delta. Positive = SoM helps over pure pixel mode.

Model	SoM − Pixel (pp)
GPT-5 mini	+29.5
Gemini 3.1 Flash-Lite	+10.2
Qwen3-VL-235B	+3.9
Gemini 3 Flash	+1.7
GPT-5.4 mini	-2.4
GPT-5.4	-6.8
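The deltas follow directly from the SoM and Pixel columns of the heatmap. A minimal sketch, with `SOM_PIXEL` and `som_delta` as illustrative names:

```python
# (SoM, Pixel) pass rates in %, copied from the heatmap for models
# that have results in both modes.
SOM_PIXEL = {
    "Gemini 3 Flash": (87.1, 85.4),
    "GPT-5.4": (77.0, 83.8),
    "Gemini 3.1 Flash-Lite": (73.5, 63.3),
    "GPT-5 mini": (78.5, 49.0),
    "GPT-5.4 mini": (74.7, 77.1),
    "Qwen3-VL-235B": (54.4, 50.5),
}


def som_delta(som, pixel):
    """SoM minus Pixel in percentage points; positive means SoM helps."""
    return round(som - pixel, 1)


deltas = {model: som_delta(*pair) for model, pair in SOM_PIXEL.items()}
for model, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    verdict = "helps" if delta > 0 else "hurts"
    print(f"{model}: {delta:+.1f} pp (SoM {verdict})")
```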

Browser-Use Advantage

Per-Model Advantage

Browser-Use pass rate vs mean of other modes

Model	BU (%)	Others avg (%)	Advantage (pp)
GPT-5 mini	87.0	70.2	+16.8
Gemini 3.1 Flash-Lite	87.4	71.5	+15.9
GPT-5.4	90.4	80.8	+9.6
GPT-5.4 mini	85.8	77.0	+8.8
Gemini 3 Flash	95.2	87.4	+7.8
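The per-model advantage is the Browser-Use pass rate minus the unweighted mean of the other three modes. The sketch below recomputes it from the heatmap values; `bu_advantage` is an illustrative helper, and rounding the difference once at the end is an assumption that happens to reproduce the figures shown.

```python
from statistics import mean

# Pass rates (%) from the heatmap, in the order Browser-Use, AX-tree, SoM, Pixel.
PASS_RATES = {
    "GPT-5 mini": [87.0, 83.1, 78.5, 49.0],
    "Gemini 3.1 Flash-Lite": [87.4, 77.7, 73.5, 63.3],
    "GPT-5.4": [90.4, 81.5, 77.0, 83.8],
    "GPT-5.4 mini": [85.8, 79.1, 74.7, 77.1],
    "Gemini 3 Flash": [95.2, 89.6, 87.1, 85.4],
}


def bu_advantage(rates):
    """Return (mean of other modes, BU minus that mean), each rounded to 0.1."""
    bu, others = rates[0], rates[1:]
    others_avg = mean(others)
    return round(others_avg, 1), round(bu - others_avg, 1)


for model, rates in PASS_RATES.items():
    others_avg, advantage = bu_advantage(rates)
    print(f"{model}: others avg {others_avg}%, advantage {advantage:+.1f} pp")
```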

Advantage by Family

BU vs mean(AX-tree, SoM, Pixel), in percentage points. Ranges from +31.2 (Advanced Editors) to -25.6 (Drag/Drop).

Family	Advantage (pp)
Advanced Editors	+31.2
List-based Selecti…	+22.5
Continuous & High-…	+18.8
Structured Data Di…	+17.5
Date & Time	+15.2
Discrete Choice	+13.8
Combobox & Autocom…	+12.5
Hierarchical Selec…	+10.8
Text Entry & Struc…	+10.2
Disclosure & Progr…	+8.5
Overlays & Transie…	+7.2
Command & Navigati…	+5.5
Files, Clipboard, …	-3.2
Drag/Drop & Worksp…	-25.6

Family Pass Rates

Averaged across models, sorted by difficulty (hardest first). 14 component families across 4 observation modes.

Family	BU	AX-tree	SoM	Pixel	Avg
Drag/Drop & Workspace Interactions	34.2	72.1	31.8	57.4	49.5
Continuous & High-Precision Input	77.4	53.8	53.9	56.2	59.6
Advanced Editors	87.7	67.8	49.7	46.4	61.8
Hierarchical Selection & Navigation	82.3	75.4	64.2	61.8	70.0
Date & Time	84.9	76.2	65.8	59.4	71.0
Files, Clipboard, Downloads	88.4	76.1	66.3	63.2	72.0
Disclosure & Progressive	83.2	76.0	64.1	68.3	72.5
Combobox & Autocomplete	89.1	78.2	68.5	62.1	73.0
List-based Selection (Flat)	91.4	79.3	70.2	64.8	74.0
Structured Data Display	92.1	80.3	71.5	65.4	75.0
Text Entry & Structured Field Input	93.2	84.8	78.4	72.1	78.0
Discrete Choice	97.9	85.7	83.3	79.0	86.0
Overlays & Transient UI	97.2	91.0	88.6	80.7	89.0
Command & Navigation	98.0	96.1	89.9	83.1	91.5

Per-Model Family Breakdown

Drill into each model's pass rates across all 14 families and 4 observation modes.

Family	BU	AX	SoM	Px	Avg
Drag/Drop & Workspace Interactions	45.0	80.0	40.0	65.0	57.5
Continuous & High-Precision Input	88.0	65.0	62.0	66.0	70.3
Advanced Editors	96.0	77.0	60.0	58.0	72.8
Hierarchical Selection & Navigation	91.0	82.0	73.0	70.0	79.0
Files, Clipboard, Downloads	95.0	82.0	74.0	72.0	80.8
Date & Time	93.0	84.0	76.0	72.0	81.3
Disclosure & Progressive	93.0	83.0	72.0	78.0	81.5
Combobox & Autocomplete	95.0	85.0	77.0	73.0	82.5
List-based Selection (Flat)	96.0	86.0	78.0	76.0	84.0
Structured Data Display	97.0	87.0	80.0	78.0	85.5
Text Entry & Structured Field Input	97.0	91.0	85.0	82.0	88.8
Discrete Choice	99.0	91.0	90.0	86.0	91.5
Overlays & Transient UI	99.0	96.0	93.0	87.0	93.8
Command & Navigation	99.0	98.0	95.0	90.0	95.5

Performance by Task Template

Pass rates across 24 task templates, sorted by average pass rate (hardest first). Only the 12 hardest and the 4 easiest templates are shown.

#	Template	BU	AX	SoM	Px	Avg
1	set range	22.0	28.0	18.0	25.0	23.3
2	drag operation	30.0	65.0	25.0	50.0	42.5
3	editor operation	55.0	50.0	35.0	32.0	43.0
4	set scalar	65.0	45.0	42.0	48.0	50.0
5	scroll find	60.0	62.0	55.0	40.0	54.3
6	file upload	72.0	68.0	58.0	52.0	62.5
7	transfer move	68.0	72.0	60.0	55.0	63.8
8	match reference	75.0	72.0	65.0	55.0	66.8
9	hierarchical path select	82.0	78.0	65.0	58.0	70.8
10	select many	80.0	75.0	68.0	62.0	71.3
11	confirm accept	78.0	78.0	72.0	60.0	72.0
12	replace code	85.0	78.0	68.0	58.0	72.3
••• 8 templates hidden •••
21	clear reset	95.0	92.0	88.0	80.0	88.8
22	open overlay	96.0	93.0	90.0	82.0	90.3
23	navigate to	97.0	95.0	92.0	85.0	92.3
24	activate	98.0	97.0	95.0	88.0	94.5

Difficulty Analysis

Tier Breakdown

Pass rates by task difficulty level

Tier	BU	AX	SoM	Px
L0 (easy)	93.7	87.4	85.0	80.5
L1 (medium)	89.3	82.0	74.6	66.7
L2 (hard)	85.8	77.4	66.3	54.1
L3 (hard+)	81.9	71.7	58.4	45.4
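One way to read the tier table is to compare how much each observation mode loses between the easiest and hardest tier. The sketch below computes that drop from the values above; `TIERS` and `tier_drop` are illustrative names.

```python
# Pass rates (%) by difficulty tier, copied from the tier table.
TIERS = {
    "L0": {"BU": 93.7, "AX": 87.4, "SoM": 85.0, "Px": 80.5},
    "L1": {"BU": 89.3, "AX": 82.0, "SoM": 74.6, "Px": 66.7},
    "L2": {"BU": 85.8, "AX": 77.4, "SoM": 66.3, "Px": 54.1},
    "L3": {"BU": 81.9, "AX": 71.7, "SoM": 58.4, "Px": 45.4},
}


def tier_drop(tiers, easy="L0", hard="L3"):
    """Pass-rate loss in percentage points from the easy to the hard tier."""
    return {
        mode: round(tiers[easy][mode] - tiers[hard][mode], 1)
        for mode in tiers[easy]
    }


print(tier_drop(TIERS))
```

On these figures the loss from L0 to L3 grows from 11.8 pp in Browser-Use to 35.1 pp in Pixel, so the visual modes degrade roughly two to three times faster with difficulty.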

Which Difficulty Axes Predict Failure?

Pearson correlation between intended difficulty rating and agent failure rate

Axis	Overall	AX	Px	BU
Precision requirement	0.44	0.29	0.40	0.32
Target acquisition	0.32	0.17	0.36	0.19
Density / choice interf.	0.23	0.10	0.31	0.12
Feedback dynamics	0.22	0.11	0.29	0.13
Depth / layering	0.20	0.10	0.30	0.04
Semantic observability	0.13	0.08	0.17	0.05
Disambiguation load	0.10	0.04	0.15	0.02
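For reference, a Pearson coefficient of this kind can be computed as below. This is a minimal sketch only: the source does not say how ratings and failures are aggregated, so the per-group inputs here (one difficulty rating and one failure rate per task group) are hypothetical, as are the sample numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: intended precision rating per task group and the
# fraction of agent runs that failed in that group.
precision_rating = [0, 1, 1, 2, 3, 3]
failure_rate = [0.05, 0.10, 0.20, 0.15, 0.40, 0.35]
print(round(pearson(precision_rating, failure_rate), 2))
```

A coefficient near 0 means the axis says little about failure, while values such as the 0.44 reported for precision requirement indicate a moderate positive association, not a deterministic one.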

Difficulty Inversion: Trivial for Humans, Hard for Agents

Components where human step count is low but agent pass rate is disproportionately poor.

Component	Human steps (avg)	Agent pass (%)	Human	Agent
resizable columns	1.5	22.6	100%	Hard
window splitter	1.3	37.7	100%	Hard
slider range	1.8	38.8	100%	Hard
color picker 2d	1.6	42.3	100%	Hard
alpha slider	1.4	45.1	100%	Hard
kanban board drag drop	2	48.2	100%	Hard
drag drop sortable list	1.9	50.5	100%	Hard
split button	1.8	52	100%	Hard
drag drop between lists	2	55.3	100%	Hard

ComponentBench-Core (912 hard tasks)

Pass rates drop 10–39 pp from Full, confirming Core concentrates on unresolved families. ≤H = within human step count, ≤2H = within double, ≤3H = within triple.

Model	Mode	Pass	≤H	≤2H	≤3H
Gemini 3 Flash	Browser-Use	84.5	51.5	71.5	78.5
Gemini 3 Flash	Pixel	60.9	30.5	51.0	56.0
GPT-5.4 mini	Browser-Use	57.8	36.8	51.2	55.3
GPT-5.4 mini	Pixel	37.7	22.0	32.1	34.1
Opus 4.6	Pixel	65.4	34.1	53.8	59.4
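The step-budget columns can be sketched as follows. This assumes, without confirmation from the source, that a task counts toward ≤kH only if the agent passed it and used at most k times the human reference step count, with all tasks in the denominator; that reading is consistent with ≤3H never exceeding Pass in the table. The `Run` record and the sample runs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    passed: bool      # did the agent complete the task?
    agent_steps: int  # steps the agent took
    human_steps: int  # steps in the human reference


def within_budget_rate(runs, k):
    """Percent of all runs that passed within k times the human step count."""
    hits = sum(
        1 for r in runs if r.passed and r.agent_steps <= k * r.human_steps
    )
    return 100.0 * hits / len(runs)


# Hypothetical runs, for illustration only.
runs = [
    Run(passed=True, agent_steps=2, human_steps=2),   # within H
    Run(passed=True, agent_steps=4, human_steps=2),   # within 2H
    Run(passed=True, agent_steps=9, human_steps=2),   # passed, beyond 3H
    Run(passed=False, agent_steps=1, human_steps=2),  # failed
]

pass_rate = 100.0 * sum(r.passed for r in runs) / len(runs)
print(pass_rate, [within_budget_rate(runs, k) for k in (1, 2, 3)])
```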
