Enhancing Answer Selection in LLMs with Aggregation of Reasoning

Chain-of-Thought prompting has significantly improved LLM reasoning, but the standard approach of sampling multiple chains and picking the most frequent answer has a critical flaw: it breaks down when the correct answer is in the minority.

A new framework called Aggregation of Reasoning (AoR) addresses this directly.

The Problem with Frequency-Based Ensembles

Current ensemble methods for LLM reasoning work by sampling multiple reasoning chains and selecting the answer that appears most often. This works well when the model is confident and correct answers dominate.

But when the correct answer is rare — when the model’s default “instinct” is wrong — frequency-based selection actively works against you. The most common answer wins regardless of quality.
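A minimal sketch of that failure mode, using plain majority voting over final answers. The sampled answers and the claim that "18" is the correct one are made up for illustration; they are not taken from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from five sampled chains.
# Suppose the correct answer is "18": it shows up, but only as a minority.
sampled_answers = ["24", "24", "18", "24", "18"]

selected = majority_vote(sampled_answers)
print(selected)  # the popular answer is selected, whatever the quality of each chain
```

Nothing in `majority_vote` ever looks at the reasoning text, so a careful chain and a sloppy one carry exactly the same weight.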

What AoR Does Differently

AoR introduces two key innovations:

Hierarchical Evaluation of Reasoning Chains — Instead of only looking at predicted answers, AoR evaluates the reasoning process itself. A well-reasoned chain that arrives at an uncommon answer can outrank poorly-reasoned chains that agree on a wrong answer.

Dynamic Sampling — AoR adjusts how many reasoning chains to sample based on the complexity of the task. Simple tasks require fewer chains; hard tasks get more. This avoids both under-sampling (missing the correct reasoning path) and over-sampling (unnecessary compute).
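One way to picture the first idea is a two-stage selection: score chains within each answer group, then compare the best representative of each group. This is only a sketch under assumptions. `Chain`, `select_by_reasoning`, and `score_chain` are hypothetical names, and the article does not say how reasoning quality is scored; in practice the scorer might be an LLM-based evaluator, while here it is a toy lookup.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Chain:
    reasoning: str  # the sampled reasoning text
    answer: str     # the final answer extracted from it

def select_by_reasoning(chains: Iterable[Chain],
                        score_chain: Callable[[Chain], float]) -> str:
    """Pick an answer by reasoning quality instead of answer frequency."""
    # Group chains by the answer they reach.
    groups = defaultdict(list)
    for chain in chains:
        groups[chain.answer].append(chain)

    # Stage 1 (within a group): keep the score of the best chain per answer.
    best_per_answer = {
        answer: max(score_chain(c) for c in group)
        for answer, group in groups.items()
    }

    # Stage 2 (across groups): the answer with the strongest representative wins.
    return max(best_per_answer, key=best_per_answer.get)

# Toy data: three weak chains agree on "24", one strong chain reaches "18".
toy_scores = {"weak-1": 0.3, "weak-2": 0.4, "weak-3": 0.2, "strong-1": 0.9}
chains = [
    Chain("weak-1", "24"),
    Chain("weak-2", "24"),
    Chain("weak-3", "24"),
    Chain("strong-1", "18"),
]
chosen = select_by_reasoning(chains, lambda c: toy_scores[c.reasoning])
print(chosen)
```

With these toy scores the minority answer is chosen, because group size never enters the comparison. Whether the real framework ignores frequency entirely is not stated in this article.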

Experimental Results

The study evaluated AoR on complex reasoning benchmarks, including math word problems and commonsense reasoning, using multiple LLMs.

Key findings:

Performance: AoR consistently outperformed traditional ensemble methods across all tested tasks
Adaptability: AoR maintained strong performance regardless of the underlying model
Scalability: Dynamic sampling scaled efficiently with task complexity, rather than spending the same fixed number of samples on every question

Why This Matters

The insight here is fundamental: the quality of a reasoning chain is not reducible to its conclusion. Two chains that reach the same answer can have very different levels of coherence — and that distinction matters, especially in edge cases.

This has practical implications for any pipeline that uses LLM self-consistency or majority voting:

Don’t treat all sampled chains as equal votes
Evaluate the structure of reasoning, not just the output
Let task complexity drive sampling depth
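The last recommendation can be sketched as a sampling loop that stops early when one answer is clearly ahead and keeps drawing chains when the top candidates are close. The stopping rule (a score margin between the two best answers), the parameter names, and `sample_chain` are assumptions made for illustration; the article only says that sampling depth adapts to task complexity.

```python
from typing import Callable, Dict, Tuple

def dynamic_sample(sample_chain: Callable[[], Tuple[str, float]],
                   batch_size: int = 4,
                   max_chains: int = 20,
                   margin: float = 0.2) -> Tuple[str, int]:
    """Draw chains in batches until the best answer leads by `margin`.

    `sample_chain` is a hypothetical callable returning (answer, quality_score)
    for one freshly sampled reasoning chain.
    Returns the selected answer and the number of chains drawn.
    """
    best: Dict[str, float] = {}  # best quality score seen per answer
    drawn = 0
    while drawn < max_chains:
        # Draw one batch, never exceeding the overall budget.
        for _ in range(min(batch_size, max_chains - drawn)):
            answer, score = sample_chain()
            best[answer] = max(score, best.get(answer, float("-inf")))
            drawn += 1
        ranked = sorted(best.values(), reverse=True)
        # Stop if there is a single candidate or a clear leader.
        if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
            break
    return max(best, key=best.get), drawn

# Easy question: every chain agrees, so one batch is enough.
easy_answer, easy_cost = dynamic_sample(lambda: ("7", 0.9))

# Harder question: the first batch is too close to call, the second is not.
hard_stream = iter([("a", 0.5), ("b", 0.5), ("a", 0.55), ("b", 0.5),
                    ("a", 0.9), ("b", 0.5), ("a", 0.6), ("b", 0.4)])
hard_answer, hard_cost = dynamic_sample(lambda: next(hard_stream))
print(easy_answer, easy_cost, hard_answer, hard_cost)
```

The budget cap `max_chains` keeps the loop bounded even when no answer ever pulls ahead; in that case the current leader is returned.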

AoR points toward a more principled approach to LLM ensembling — one that rewards good reasoning rather than just popular answers.

References

AoR Paper (ar5iv)

Originally published on LinkedIn.