Compositional Generalization in Web Automation: Where LLM Agents Break Down
LLM agents hit 94% success on basic web tasks — but drop to 25% on compositional tasks that combine multiple steps. The CompWoB benchmark exposes exactly where and why this happens.
Language Model Agents (LMAs) look impressive on controlled benchmarks, but their performance tells a very different story once real-world tasks require chaining multiple sub-tasks together.
The Gap Nobody Talks About
Models like GPT-3.5-turbo and GPT-4 achieve up to 94% success on basic web automation tasks. That’s the number that gets cited.
The number that should concern you: on compositional tasks — where you combine several base tasks — success rates drop to around 25%.
That’s not a small gap. That’s a near-collapse of capability the moment complexity scales to anything resembling real-world usage.
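To put the two headline numbers side by side, here is a minimal arithmetic sketch. The function names are hypothetical, and the second function assumes that sub-tasks succeed independently of each other, which the source does not claim. It is only a naive baseline for comparison.

```python
def generalization_gap(base: float, compositional: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = base - compositional
    return absolute, absolute / base


def naive_chain_success(per_task: float, n_tasks: int) -> float:
    """Success rate of n chained sub-tasks IF each one succeeded independently.

    Independence is an illustrative assumption, not a finding from the paper.
    """
    return per_task ** n_tasks


absolute, relative = generalization_gap(94.0, 25.0)
print(f"absolute drop: {absolute:.0f} points, relative drop: {relative:.0%}")

# 2 and 8 are the smallest and largest composition sizes mentioned for CompWoB.
for n in (2, 8):
    print(f"naive estimate for {n} chained tasks: {naive_chain_success(0.94, n):.0%}")
```

Under that independence assumption, a 94% per-task rate would still leave roughly 88% for two chained tasks and 61% for eight. The reported 25% sits well below both, which suggests the drop is more than simple error compounding.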
What CompWoB Measures
To study this systematically, the paper introduces CompWoB: a benchmark of 50 compositional tasks built by combining 2–8 base tasks of varying difficulty.
Tasks are designed to mirror real web automation scenarios:
- Both single-page and multi-page environments
- Tasks linked with simple connectors (“and then”)
- Controlled difficulty gradient from base to compositional
The controlled design lets researchers isolate exactly where performance degrades — not just that it does, but why.
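As a rough illustration of what "combining base tasks with a simple connector" can look like, here is a minimal sketch. The `BaseTask` type, the `compose` helper, and the example instructions are hypothetical and are not taken from the CompWoB code. The all-or-nothing success rule is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseTask:
    name: str         # hypothetical task identifier
    instruction: str  # natural-language instruction shown to the agent


def compose(tasks: list[BaseTask], connector: str = "and then") -> str:
    """Join base-task instructions into one compositional instruction."""
    if len(tasks) < 2:
        raise ValueError("a compositional task needs at least two base tasks")
    parts = [t.instruction.strip().rstrip(".") for t in tasks]
    head, *tail = parts
    # Lowercase the first letter of later parts so the sentence reads naturally.
    tail = [p[:1].lower() + p[1:] for p in tail]
    return f" {connector} ".join([head, *tail]) + "."


def episode_succeeded(subtask_results: list[bool]) -> bool:
    """Assumed rule: the episode counts only if every sub-task succeeded."""
    return all(subtask_results)


tasks = [
    BaseTask("click-button", "Click the button labelled OK."),
    BaseTask("enter-text", "Type hello into the text field."),
]
print(compose(tasks))
print(episode_succeeded([True, False]))
```

Keeping the connector as a parameter makes it easy to vary how sub-tasks are linked while holding the base tasks fixed, which is the kind of controlled comparison the benchmark design is aiming for.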
Transferred vs. Prompted LMAs
The study compares two LMA approaches:
Prompted LMAs — the standard GPT-4 / GPT-3.5 approach: give the model instructions, let it reason through the task. Excellent on base tasks. Falls apart on compositional ones.
Transferred LMAs — small-scale models fine-tuned only on base tasks, then zero-shot transferred to compositional settings. Average success rate: 54.8% — more than double the prompted approach on compositional tasks.
This is counterintuitive. The smaller, fine-tuned models that can’t match GPT-4 on individual tasks outperform it significantly once tasks are composed. One working hypothesis: transferred models develop more generalizable action policies, while prompted models overfit to the explicit framing of each task.
HTML-T5++ and the Path Forward
The paper also introduces HTML-T5++, a new model trained with a novel data mixture strategy. Results:
- Human-level performance on MiniWoB
- 61.5% zero-shot transfer on CompWoB — best in the study
The architecture matters less than the training strategy here: balancing the training data distribution across base tasks, with no compositional tasks seen during training, produces more robust agents.
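The article does not spell out the mixture recipe, so the following is only a sketch of one generic way to control a data mixture: equalizing how many training episodes each task contributes. The function name, the `per_task` parameter, and the sampling scheme are all assumptions for illustration, not the method used for HTML-T5++.

```python
import random
from collections import defaultdict


def balance_by_task(episodes, per_task, seed=0):
    """Return a shuffled list with exactly `per_task` episodes for every task.

    `episodes` is an iterable of (task_name, episode) pairs. Over-represented
    tasks are downsampled; under-represented tasks are upsampled with
    replacement. This is an illustrative scheme, not the paper's recipe.
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, episode in episodes:
        by_task[task].append(episode)

    balanced = []
    for task in sorted(by_task):
        pool = by_task[task]
        if len(pool) >= per_task:
            chosen = rng.sample(pool, per_task)
        else:
            chosen = pool + rng.choices(pool, k=per_task - len(pool))
        balanced.extend((task, episode) for episode in chosen)

    rng.shuffle(balanced)
    return balanced


# Toy data: task "a" has ten episodes, task "b" has only two.
toy = [("a", i) for i in range(10)] + [("b", i) for i in range(2)]
mixed = balance_by_task(toy, per_task=4)
print(len(mixed))
```

With the toy input above, each task ends up with four episodes, so neither dominates a training batch.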
What This Means for Production LMA Deployments
If you’re building or evaluating LMA systems:
- Benchmark on compositional tasks, not just base tasks — otherwise you’re measuring the easy case
- Fine-tuned smaller models may outperform prompted frontier models on complex multi-step tasks
- Real-world web automation is inherently compositional — tasks are not atomic, and agents that can’t chain steps reliably will fail in deployment
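The first recommendation above can be sketched as a small reporting helper that keeps base and compositional results in separate buckets instead of averaging them together. The function name and bucket labels are hypothetical, and the sample results are made-up toy values, not benchmark data.

```python
from collections import defaultdict


def success_by_bucket(results):
    """Map each bucket label to its success rate.

    `results` is an iterable of (bucket, succeeded) pairs, one per episode.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [successes, episodes]
    for bucket, succeeded in results:
        totals[bucket][0] += int(succeeded)
        totals[bucket][1] += 1
    return {bucket: wins / n for bucket, (wins, n) in totals.items()}


# Toy episodes for illustration only.
toy_results = [
    ("base", True), ("base", True), ("base", True), ("base", False),
    ("compositional", True), ("compositional", False),
    ("compositional", False), ("compositional", False),
]
for bucket, rate in success_by_bucket(toy_results).items():
    print(f"{bucket}: {rate:.0%}")
```

Reporting the buckets separately makes a gap between easy and composed tasks visible at a glance, where a single blended average would hide it.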
The 94% headline number is real. The 25% compositional number is also real. The question is which one reflects your actual use case.
Originally published on LinkedIn.