Navigating Cognitive Biases in LLMs: Insights from the COBBLER Benchmark

Large Language Models have revolutionized natural language processing, but how reliable are they when used as evaluators of other models? A new benchmark called COBBLER quantifies exactly that, and the results are sobering.

Key Findings

The COBBLER benchmark evaluated 15 popular instruction-tuned LLMs trained with human feedback. The headline numbers:

Roughly 40% of comparisons, averaged across all models, showed signs of bias
49.6% average Rank-Biased Overlap (RBO) between human and model rankings, indicating a significant misalignment between human and machine preferences

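In other words, when you ask an LLM to rank outputs by quality, its ranking only partially matches the one humans produce. RBO is a top-weighted similarity between two ranked lists, where 100% means the rankings are identical, so 49.6% is best read as a similarity score rather than as a literal "agrees half the time" rate.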
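
To make the RBO number more concrete, here is a minimal sketch of a rank-biased overlap calculation. It follows the standard definition, a sum of top-k overlaps weighted by p^(d-1), but truncates at the shorter list and normalizes by the total weight so that identical rankings score 1.0. The function name, the normalization, and the default p = 0.9 are assumptions made for illustration, not the settings used in the benchmark.

```python
def rbo_normalized(ranking_a, ranking_b, p=0.9):
    """Top-weighted similarity between two rankings (no duplicate items).

    Truncated at the shorter list and normalized by the total weight,
    so identical rankings give 1.0 and disjoint rankings give 0.0.
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0

    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    weight_total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Share of the top-d items that the two rankings have in common.
        agreement = len(seen_a & seen_b) / d
        weight = p ** (d - 1)  # earlier ranks count more
        weighted_agreement += weight * agreement
        weight_total += weight
    return weighted_agreement / weight_total


# Hypothetical rankings of four model outputs, best first.
human_ranking = ["model_c", "model_a", "model_d", "model_b"]
judge_ranking = ["model_a", "model_c", "model_b", "model_d"]
print(rbo_normalized(human_ranking, judge_ranking))
```

A lower p puts more of the weight on the top of the list, so the same pair of rankings can score quite differently depending on how much the first few positions matter to you.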

Implicit Biases: Built Into the Model

These are biases inherent to the model-as-evaluator, observable without any external manipulation:

Order Bias — Models favor responses based on their presentation order rather than content quality. Larger models particularly prefer the first or last response in a sequence.
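
One simple way to probe for order bias in your own setup is to ask the judge about each pair twice, with the two responses swapped, and count how often it picks the same position both times. The sketch below assumes a hypothetical judge callable that takes (prompt, first_response, second_response) and returns "first" or "second"; it is an illustrative check, not the benchmark's own protocol.

```python
def position_bias_rate(judge, items):
    """Fraction of items where the judge picks the same slot after a swap.

    judge: hypothetical callable (prompt, first, second) -> "first" | "second"
    items: iterable of (prompt, response_a, response_b) tuples
    """
    biased = 0
    total = 0
    for prompt, resp_a, resp_b in items:
        original = judge(prompt, resp_a, resp_b)
        swapped = judge(prompt, resp_b, resp_a)
        # A content-driven judge keeps choosing the same response, so the
        # chosen position changes after the swap. Picking the same position
        # twice means the verdict followed the slot, not the text.
        if original == swapped:
            biased += 1
        total += 1
    return biased / total if total else 0.0


# Toy judge that always prefers whatever is shown first.
def always_first(prompt, first, second):
    return "first"


pairs = [("Summarize the article.", "Summary one.", "Summary two.")]
print(position_bias_rate(always_first, pairs))
```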

Compassion Fade (Naming) — When real names replace anonymous aliases, model behavior shifts measurably. All 15 models were influenced; larger models showed increased susceptibility.

Egocentric Bias (Self-Preference) — Models consistently prefer their own outputs over those of other models. This persisted even when real names were introduced, suggesting it’s deeply embedded.

Salience Bias (Length) — Larger models favor longer responses regardless of quality. Smaller models were less influenced, suggesting this scales with model size.
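
A rough first check for length preference is to look at logged judge decisions and measure how often the chosen response was the longer one. The sketch below assumes you have (chosen_text, rejected_text) pairs and uses whitespace word counts as a crude length proxy; both are assumptions for illustration.

```python
def longer_choice_rate(decisions):
    """Share of decisions where the chosen response has more words.

    decisions: iterable of (chosen_text, rejected_text) pairs.
    Pairs of equal length are skipped. Returns None if nothing is countable.
    """
    longer = 0
    counted = 0
    for chosen, rejected in decisions:
        n_chosen = len(chosen.split())
        n_rejected = len(rejected.split())
        if n_chosen == n_rejected:
            continue  # a tie in length says nothing about length preference
        counted += 1
        if n_chosen > n_rejected:
            longer += 1
    return longer / counted if counted else None


# Hypothetical judge log: the judge picked the first text of each pair.
log = [
    ("A long and detailed answer with many words.", "Short answer."),
    ("Brief.", "A much longer but rejected answer."),
]
print(longer_choice_rate(log))
```

A high rate on its own does not prove bias, since longer answers are sometimes better. It becomes informative when compared with the same rate computed from human choices on the same pairs.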

Induced Biases: What External Pressure Reveals

These require modifications to the prompt or additional context:

Bandwagon Effect — Models were influenced by a fake majority preference, deferring to collective opinion rather than independent judgment. A simple “most people prefer option A” in the prompt was enough to shift evaluations.

Attentional Bias (Distraction) — Adding irrelevant information to the evaluation setup measurably degraded decision quality. Models were genuinely distracted by noise.
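
Both induced biases can be probed with the same small harness: run the judge on a clean prompt, run it again on a perturbed prompt, and count how often the verdict changes. In the sketch below the judge callable, the cue wording, and the distractor sentence are all hypothetical placeholders rather than the benchmark's actual prompts.

```python
def with_bandwagon_cue(prompt, option):
    """Append a fake majority statement favoring one option."""
    return f"{prompt}\n\nMost people prefer option {option}."


def with_distractor(prompt, distractor="Unrelated note: the office is closed on Friday."):
    """Append an irrelevant sentence to the evaluation prompt."""
    return f"{prompt}\n\n{distractor}"


def flip_rate(judge, prompts, perturb):
    """Fraction of prompts whose verdict changes after perturbation.

    judge: hypothetical callable prompt -> label (for example "A" or "B")
    perturb: callable prompt -> modified prompt
    """
    prompts = list(prompts)
    if not prompts:
        return 0.0
    flips = sum(1 for p in prompts if judge(p) != judge(perturb(p)))
    return flips / len(prompts)


# Toy judge that defers to any stated majority for option A.
def suggestible_judge(prompt):
    return "A" if "Most people prefer option A" in prompt else "B"


questions = ["Which summary is better, A or B?"]
print(flip_rate(suggestible_judge, questions, lambda p: with_bandwagon_cue(p, "A")))
```

With a real LLM judge, some flips come from sampling noise alone, so it helps to fix the decoding settings or repeat each query before attributing a flip to the perturbation.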

What This Means for LLM-as-Judge Pipelines

The LLM-as-judge pattern is widely used — having one model evaluate outputs from another is cheap, fast, and scalable. But COBBLER shows this approach carries systematic, measurable biases that compound across evaluation pipelines.

Practical implications:

Don’t rely on a single LLM judge: ensemble across multiple models to dilute any one model's self-preference and other idiosyncratic biases
Randomize presentation order: present options in multiple orderings and aggregate
Be skeptical of length as quality: longer outputs tend to be overrated, especially by larger judge models
Calibrate against human preferences for your specific task before deploying LLM-based evaluation at scale
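
The mitigations above can be combined in a few lines. The sketch below queries several judges in both presentation orders, takes a majority vote, and then measures raw agreement with human labels. The judge callables, with signature (prompt, first, second) -> "first" or "second", and the "A" / "B" / "tie" labels are assumptions for illustration; a real pipeline would wrap actual model calls.

```python
from collections import Counter


def aggregate_verdict(judges, prompt, resp_a, resp_b):
    """Majority vote over all judges and both presentation orders."""
    votes = Counter()
    for judge in judges:
        # Order 1: A is shown first.
        pick = judge(prompt, resp_a, resp_b)
        votes["A" if pick == "first" else "B"] += 1
        # Order 2: B is shown first, so "first" now means B.
        pick = judge(prompt, resp_b, resp_a)
        votes["B" if pick == "first" else "A"] += 1
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"


def agreement_rate(judge_labels, human_labels):
    """Share of items where the aggregated verdict matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Toy judges: one purely positional, one that prefers the longer text.
def always_first(prompt, first, second):
    return "first"


def prefers_longer(prompt, first, second):
    return "first" if len(first) > len(second) else "second"


verdict = aggregate_verdict([always_first, prefers_longer], "Explain RBO.", "A detailed answer.", "Short.")
print(verdict)
```

Note how the purely positional judge casts one vote for each response across the two orderings, so its order bias cancels out instead of deciding the result. Biases shared by every judge, such as a common preference for length, do not cancel this way, which is why the comparison against human labels still matters.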

The 49.6% RBO score is a baseline to beat, not a ceiling to accept.

Originally published on LinkedIn.