xLSTM: Can Extended LSTMs Compete with Transformers at Scale?

LSTMs were the dominant architecture for sequential data before Transformers took over. The conventional wisdom since 2017 has been: Transformers won, LSTMs lost, move on.

The xLSTM paper (arXiv:2405.04517) challenges that assumption at scale.

Why LSTMs Lost (And Whether It Had To Be That Way)

The original LSTM solved the vanishing gradient problem with the constant error carousel and gating mechanisms. But three limitations held it back:

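1. Cannot revise storage decisions: once a value is stored, the sigmoid-bounded gates struggle to overwrite it when a better candidate arrives later (the paper illustrates this with a nearest-neighbor search task)
2. Limited storage capacity: information must be compressed into scalar cell states, which constrains how much can be retained
3. No parallelizability: memory mixing, the hidden-to-hidden connections between consecutive time steps, forces sequential processing, so training can't exploit GPU parallelism the way Transformers do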

These weren’t fundamental flaws in the LSTM concept — they were implementation constraints that, with hindsight, could be engineered around.
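To make the three limitations concrete, here is a minimal NumPy sketch of one classic LSTM step. The weight layout (`W`, `R`, `b` stacked as four gate blocks) and the helper names are assumptions chosen for illustration, not code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One classic LSTM step. W: (4d, d_in), R: (4d, d), b: (4d,)."""
    d = h_prev.shape[0]
    # R @ h_prev is the hidden-to-hidden path (memory mixing).
    pre = W @ x + R @ h_prev + b
    z = np.tanh(pre[0:d])            # candidate cell input
    i = sigmoid(pre[d:2 * d])        # input gate, bounded to (0, 1)
    f = sigmoid(pre[2 * d:3 * d])    # forget gate, bounded to (0, 1)
    o = sigmoid(pre[3 * d:4 * d])    # output gate
    c = f * c_prev + i * z           # additive update: the constant error carousel
    h = o * np.tanh(c)               # each cell state entry is a scalar
    return h, c

def run_lstm(xs, W, R, b):
    """Process a sequence xs of shape (T, d_in) one step at a time."""
    d = R.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    hs = []
    for x in xs:                     # step t needs h from step t-1: no parallelism over time
        h, c = lstm_step(x, h, c, W, R, b)
        hs.append(h)
    return np.stack(hs)
```

The `for` loop in `run_lstm` is limitation 3 in code form: because `R @ h_prev` feeds the previous hidden state into every gate, step t cannot start until step t-1 has finished.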

xLSTM: Two New Variants

The paper introduces xLSTM through two new LSTM variants built on exponential gating and modified memory structures:

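sLSTM (scalar LSTM): Keeps a scalar memory and scalar update but adds exponential gating and a new memory mixing technique. Memory mixing retains the hidden-to-hidden recurrence, which the paper argues helps with state tracking, at the cost of sequential processing. It sits inside post up-projection residual blocks (similar to Transformer blocks).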
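As a rough illustration of exponential gating, here is a sketch of a single scalar sLSTM-style cell with the normalizer state and the log-space stabilizer described in the paper. It takes the gate pre-activations as plain inputs, so the recurrent weights and memory mixing are left out; the function names, the exponential forget gate, and the choice to start the stabilizer at negative infinity are assumptions for illustration.

```python
import math

def slstm_step(z, i_pre, f_pre, o, c_prev, n_prev, m_prev):
    """One step of a scalar cell with exponential input and forget gates.

    i_pre and f_pre are log-space gate values: the unstabilized gates
    would be exp(i_pre) and exp(f_pre).
    """
    m = max(f_pre + m_prev, i_pre)        # stabilizer: running max in log space
    i = math.exp(i_pre - m)               # stabilized input gate, at most 1
    f = math.exp(f_pre + m_prev - m)      # stabilized forget gate, at most 1
    c = f * c_prev + i * z                # cell state
    n = f * n_prev + i                    # normalizer state
    h = o * c / n                         # normalized output
    return h, c, n, m

def run_slstm(zs, i_pres, f_pres, o=1.0):
    """Run the cell over a sequence and return the list of outputs."""
    c, n, m = 0.0, 0.0, float("-inf")     # assumed initial state for this sketch
    hs = []
    for z, i_pre, f_pre in zip(zs, i_pres, f_pres):
        h, c, n, m = slstm_step(z, i_pre, f_pre, o, c, n, m)
        hs.append(h)
    return hs

# A later input with a large input-gate pre-activation dominates the
# normalized state, even though the forget gate keeps everything (f_pre = 0).
outputs = run_slstm(zs=[1.0, 5.0], i_pres=[0.0, 6.0], f_pres=[0.0, 0.0])
```

In this toy run the second output lands close to 5.0 rather than near an average of the two inputs: the normalized ratio `c / n` lets a strongly gated new value displace the old one, which is the storage revision that sigmoid-bounded gates lack.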

mLSTM (matrix LSTM): Replaces the scalar cell with a matrix memory updated by a covariance update rule. The key innovation: by dropping memory mixing, mLSTM becomes fully parallelizable, addressing the core speed limitation of classical LSTMs. It sits inside pre up-projection residual blocks (similar to State Space Models like Mamba).
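The parallelizability claim is easier to see side by side. Below is a simplified NumPy sketch of a single mLSTM head, written once as a step-by-step recurrence and once as a masked matrix product over the whole sequence. It assumes a sigmoid forget gate, omits the output gate and the numerical stabilizer, and takes queries, keys, values, and gate pre-activations as given arrays; the function names are hypothetical.

```python
import numpy as np

def mlstm_recurrent(Q, K, V, i_pre, f_pre):
    """Q, K, V: (T, d); i_pre, f_pre: (T,). Returns outputs of shape (T, d)."""
    T, d = Q.shape
    C = np.zeros((d, d))                          # matrix memory
    n = np.zeros(d)                               # normalizer vector
    H = np.zeros((T, d))
    for t in range(T):
        i_t = np.exp(i_pre[t])                    # exponential input gate
        f_t = 1.0 / (1.0 + np.exp(-f_pre[t]))     # sigmoid forget gate (assumed)
        k = K[t] / np.sqrt(d)
        C = f_t * C + i_t * np.outer(V[t], k)     # covariance update rule
        n = f_t * n + i_t * k
        denom = max(abs(n @ Q[t]), 1.0)
        H[t] = (C @ Q[t]) / denom
    return H

def mlstm_parallel(Q, K, V, i_pre, f_pre):
    """Same computation for all time steps at once, with no loop over time."""
    T, d = Q.shape
    log_f = -np.log1p(np.exp(-f_pre))             # log of the sigmoid forget gate
    cum = np.cumsum(log_f)
    # log D[t, s] = i_pre[s] + sum of log_f over steps s+1..t, for s <= t
    log_D = cum[:, None] - cum[None, :] + i_pre[None, :]
    causal = np.tril(np.ones((T, T), dtype=bool))
    D = np.exp(np.where(causal, log_D, -np.inf))  # future positions get weight 0
    S = (Q @ K.T) / np.sqrt(d) * D
    denom = np.maximum(np.abs(S.sum(axis=1, keepdims=True)), 1.0)
    return (S / denom) @ V
```

Both functions return the same outputs up to floating-point error. The parallel form works because the gates depend only on the current input, not on the previous hidden state, so the decay between any two positions can be computed up front with a cumulative sum.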

The result: exponential gating lets xLSTM revise storage decisions (fixing limitation #1), matrix memory increases storage capacity (fixing limitation #2), and mLSTM's removal of memory mixing restores parallel training (fixing limitation #3).

Computational Efficiency

This is where xLSTM makes a compelling case against Transformers:

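Linear computation complexity with respect to sequence length (vs. quadratic for standard attention)
Constant memory complexity with respect to sequence length (the recurrent state has a fixed size, whereas a Transformer's key-value cache grows with every token)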

For long sequences and edge deployments where memory is constrained, this is a significant practical advantage. Transformers’ quadratic scaling is a known production headache — xLSTM doesn’t have it.
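A back-of-the-envelope comparison shows what constant memory means at inference time. The two helpers below are hypothetical, and the layer, head, and dimension counts are made-up numbers for illustration, not a configuration from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Attention keeps one key and one value vector per past token, layer, and head."""
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_value

def matrix_state_bytes(n_layers, n_heads, head_dim, bytes_per_value=2):
    """A matrix-memory head keeps a (head_dim x head_dim) matrix plus a normalizer vector."""
    return n_layers * n_heads * (head_dim * head_dim + head_dim) * bytes_per_value

# Hypothetical model shape, 16-bit values.
shape = dict(n_layers=24, n_heads=16, head_dim=64)

for seq_len in (1_024, 16_384, 131_072):
    cache = kv_cache_bytes(seq_len, **shape)
    state = matrix_state_bytes(**shape)
    print(f"{seq_len:>7} tokens: KV cache {cache / 2**20:8.1f} MiB, "
          f"recurrent state {state / 2**20:5.1f} MiB")
```

The cache column grows in proportion to the token count while the state column stays flat. The exact numbers depend entirely on the assumed shape; the scaling behavior is the point.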

Scale: 300B Tokens

The headline experiment: training xLSTM models from 125M to 1.3B parameters on 300B tokens of SlimPajama for large-scale language modeling.

Results showed:

Competitive or superior validation perplexity vs. Transformers (Llama), State Space Models (Mamba), and RWKV-4 at matched model sizes
Strong sequence length extrapolation (models trained at context length 2048 hold up on sequences as long as 16384 tokens)
Competitive downstream task performance
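For readers less familiar with the metric, validation perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, with a hypothetical helper name:

```python
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities the model assigned to each true token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that spreads its probability evenly over 4 choices at every step
# behaves like a uniform 4-way guess.
uniform_guess = perplexity([math.log(0.25)] * 10)
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens.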

The study also ran ablations that attribute the performance gains over the vanilla LSTM to both exponential gating and matrix memory.

Context: The Recurrent Architecture Renaissance

xLSTM isn’t alone. Mamba, RWKV, RetNet, and now xLSTM are all betting that recurrent architectures — updated with modern techniques — can match or exceed Transformers on certain task profiles.

The common thread: linear scaling in sequence length matters more as sequences get longer and deployments move to the edge. Transformers may still lead on many benchmarks, but the engineering tradeoffs are real.

xLSTM makes a credible case that LSTM-style architectures weren’t inherently inferior — just under-engineered relative to their potential.

References

Beck et al., "xLSTM: Extended Long Short-Term Memory", arXiv:2405.04517 (2024)

Originally published on LinkedIn.