Compositional Generalization in Web Automation: Where LLM Agents Break Down

Language model agents (LMAs) look impressive on controlled benchmarks, but their performance tells a very different story once a real-world task requires chaining several sub-tasks together.

The Gap Nobody Talks About

Prompted agents built on GPT-3.5-turbo and GPT-4 reach an average success rate of 94.0% on base web automation tasks. That’s the number that gets cited.

The number that should concern you: on compositional tasks, which chain several base tasks into a single instruction, the same prompted agents drop to around 25%.

That’s not a small gap. That’s a near-collapse of capability the moment complexity scales to anything resembling real-world usage.
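A quick back-of-the-envelope check suggests this gap is larger than simple error compounding would predict. The sketch below assumes, purely for illustration, that each sub-task succeeds independently with probability 0.94 and that a chain succeeds only if every sub-task does. The function name `chained_success` is hypothetical, and the independence assumption is mine, not a claim from the paper.

```python
def chained_success(per_task_success: float, num_tasks: int) -> float:
    """Success probability of a chain if sub-tasks succeed independently."""
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be in [0, 1]")
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return per_task_success ** num_tasks


# Illustrative assumption: 94% per-task success, chains of 2 to 8 tasks.
naive_estimates = {n: chained_success(0.94, n) for n in range(2, 9)}

if __name__ == "__main__":
    for n, estimate in naive_estimates.items():
        print(f"{n} tasks: {estimate:.1%}")
```

Under that assumption even an 8-task chain would stay above 60%, so a success rate near 25% points to something beyond per-step errors multiplying.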

What CompWoB Measures

To study this systematically, the paper behind these numbers (arXiv:2311.18751) introduces CompWoB: a benchmark of 50 compositional tasks, each built by combining 2–8 base tasks of varying difficulty.

Tasks are designed to mirror real web automation scenarios:

Both single-page and multi-page environments
Tasks linked with simple connectors (“and then”)
Controlled difficulty gradient from base to compositional

The controlled design lets researchers isolate exactly where performance degrades — not just that it does, but why.
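To make the idea of linking base tasks with a simple connector concrete, here is a minimal sketch of how a compositional instruction could be assembled. The task names, instruction strings, and helper functions are all hypothetical and chosen for illustration; they are not taken from the CompWoB code.

```python
from itertools import permutations

# Hypothetical base-task instructions, for illustration only.
BASE_TASKS = {
    "click-button": "click the button labeled Submit",
    "enter-text": "type hello into the text field",
    "click-checkbox": "select the checkbox labeled Agree",
}


def compose_instruction(task_names, connector=", and then "):
    """Join base-task instructions into one sequential instruction."""
    if len(task_names) < 2:
        raise ValueError("a compositional task needs at least two base tasks")
    steps = [BASE_TASKS[name] for name in task_names]
    sentence = connector.join(steps)
    return sentence[0].upper() + sentence[1:] + "."


def two_way_compositions(task_names):
    """Enumerate every ordered pair of distinct base tasks."""
    return [compose_instruction(pair) for pair in permutations(task_names, 2)]
```

Ordered pairs are used here because in a sequential task the order of the steps is part of the task itself: three base tasks already yield six distinct two-step instructions.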

Transferred vs. Prompted LMAs

The study compares two LMA approaches:

Prompted LMAs — the standard GPT-4 / GPT-3.5 approach: give the model instructions, let it reason through the task. Excellent on base tasks. Falls apart on compositional ones.

Transferred LMAs — small-scale models fine-tuned only on base tasks, then zero-shot transferred to compositional settings. Average success rate: 54.8% — more than double the prompted approach on compositional tasks.

This is counterintuitive. The smaller, fine-tuned models that can’t match GPT-4 on individual tasks outperform it by a wide margin once tasks are chained. One working hypothesis: transferred models develop more generalizable action policies, while prompted models overfit to the explicit framing of each task.

HTML-T5++ and the Path Forward

The paper also introduces HTML-T5++, a new model trained with a novel data mixture strategy. Results:

Above human-level performance on MiniWoB, the benchmark the base tasks come from
61.5% zero-shot transfer on CompWoB — best in the study

The architecture matters less than the training strategy here: rebalancing the training data across base tasks, with no compositional examples seen during training, produces more robust agents.
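As a rough illustration of what adjusting a training mixture can look like in code, the sketch below equalizes the number of episodes each task contributes. This is an assumed, generic balancing scheme with hypothetical names (`balance_by_task`, `per_task`), not the recipe used for HTML-T5++.

```python
import random
from collections import defaultdict


def balance_by_task(examples, per_task, seed=0):
    """Resample so every task contributes exactly `per_task` examples.

    `examples` is a list of (task_name, episode) pairs. Tasks with at
    least `per_task` examples are downsampled without replacement; tasks
    with fewer are upsampled with replacement.
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task_name, episode in examples:
        by_task[task_name].append(episode)

    balanced = []
    for task_name in sorted(by_task):
        episodes = by_task[task_name]
        if len(episodes) >= per_task:
            chosen = rng.sample(episodes, per_task)
        else:
            chosen = rng.choices(episodes, k=per_task)
        balanced.extend((task_name, episode) for episode in chosen)
    rng.shuffle(balanced)
    return balanced
```

The point of a cap like this is that tasks with many demonstrations no longer dominate the gradient signal, at the cost of repeating episodes for under-represented tasks.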

What This Means for Production LMA Deployments

If you’re building or evaluating LMA systems:

Benchmark on compositional tasks, not just base tasks — otherwise you’re measuring the easy case
Fine-tuned smaller models may outperform prompted frontier models on complex multi-step tasks
Real-world web automation is inherently compositional — tasks are not atomic, and agents that can’t chain steps reliably will fail in deployment
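The first recommendation above can be sketched as a small evaluation loop that reports base and compositional success rates separately instead of one blended number. Everything here is an assumption for illustration: the `Task` record, the `run_episode` callable (standing in for whatever drives your agent and checks success), and the default episode count.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    name: str
    instruction: str
    num_base_tasks: int  # 1 for a base task, 2 or more for a composition


def success_rates(
    run_episode: Callable[[Task], bool],
    tasks: Iterable[Task],
    episodes_per_task: int = 10,
) -> dict:
    """Return success rates split into 'base' and 'compositional' buckets."""
    totals = {"base": [0, 0], "compositional": [0, 0]}  # [successes, attempts]
    for task in tasks:
        bucket = "base" if task.num_base_tasks == 1 else "compositional"
        for _ in range(episodes_per_task):
            totals[bucket][0] += int(run_episode(task))
            totals[bucket][1] += 1
    return {
        bucket: (wins / attempts if attempts else None)
        for bucket, (wins, attempts) in totals.items()
    }
```

Keeping the two buckets apart makes a gap like the one described in this article visible at a glance; a bucket with no tasks reports `None` rather than a misleading zero.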

The 94% headline number is real. The 25% compositional number is also real. The question is which one reflects your actual use case.

References

arXiv:2311.18751

Originally published on LinkedIn.