Simplifying Continual Pre-training of Large Language Models
Re-training LLMs from scratch when new data arrives is prohibitively expensive. Three simple strategies — LR re-warming, LR re-decaying, and minimal data replay — match the performance of full re-training at a fraction of the cost.
Every time a large language model needs to incorporate new data, the default approach is to re-train from scratch. For frontier-scale models, this is wildly expensive and increasingly impractical. Continual pre-training is the obvious solution — but catastrophic forgetting and distribution shift make it hard to do well.
A new study shows that three surprisingly simple strategies are enough for continual pre-training to match full re-training.
The Core Problem
Continual pre-training — updating an existing model on new data without starting over — faces two failure modes:
- Poor adaptation: The model doesn’t sufficiently learn the new distribution
- Catastrophic forgetting: The model loses performance on the original distribution to accommodate new data
Standard fine-tuning handles neither well. The study tests methods to address both simultaneously.
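To make the two failure modes concrete, one simple way to quantify them is to compare validation losses before and after the continual run. This is only a sketch: the function names and the loss values in the usage lines are hypothetical and are not numbers from the study.

```python
def forgetting(old_loss_before: float, old_loss_after: float) -> float:
    """Increase in validation loss on the ORIGINAL distribution.

    Positive values mean the model got worse on the old data (forgetting).
    """
    return old_loss_after - old_loss_before


def adaptation_gain(new_loss_before: float, new_loss_after: float) -> float:
    """Decrease in validation loss on the NEW distribution.

    Positive values mean the model got better on the new data (adaptation).
    """
    return new_loss_before - new_loss_after


# Hypothetical losses, for illustration only.
print(forgetting(old_loss_before=2.50, old_loss_after=2.65))
print(adaptation_gain(new_loss_before=3.40, new_loss_after=2.90))
```

A good continual pre-training recipe keeps the first number close to zero while making the second as large as possible.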
Three Strategies That Work
Learning Rate Re-warming — Rather than continuing from the final low LR of the previous training run, re-increase the learning rate at the start of continual training. This allows the model to make larger parameter updates early on, improving adaptation to the new distribution.
Learning Rate Re-decaying — After re-warming, decay the LR back down over the new training run. This combination — re-warm then re-decay — produces better adaptation than either flat continuation or re-warming alone.
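A minimal sketch of what re-warming followed by re-decaying could look like as a function from training step to learning rate. The linear warm-up, the cosine decay shape, and every name and value here (`rewarm_redecay_lr`, `max_lr`, `min_lr`, `warmup_steps`) are illustrative assumptions, not settings taken from the study.

```python
import math


def rewarm_redecay_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Learning rate for one continual pre-training run.

    Phase 1 (re-warm): linear increase from min_lr, where the previous
    run is assumed to have ended, up to max_lr.
    Phase 2 (re-decay): cosine decay from max_lr back down to min_lr.
    """
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 1,000-step update, 100 steps of re-warming.
for step in (0, 50, 100, 550, 1000):
    print(step, rewarm_redecay_lr(step, 1000, 100, max_lr=3e-4, min_lr=3e-5))
```

The returned value would be assigned to the optimizer's learning rate at every step of the new run.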
Replay of Previous Data — Mix a small percentage of data from the original training distribution into the new training run. Even 1% replay substantially reduces forgetting. Higher percentages (up to 50%) provide further protection under strong distribution shifts.
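Replay can be sketched as a sampling rule: for each training example, draw from the old data with probability equal to the replay fraction, otherwise draw from the new data. The helper below is a toy illustration under that assumption; `mix_with_replay`, its parameters, and the placeholder documents are hypothetical, and a real pipeline would mix at the level of tokenized shards or batches.

```python
import random


def mix_with_replay(new_docs, old_docs, replay_fraction, num_samples, seed=0):
    """Build a training stream where roughly `replay_fraction` of the
    samples come from the original (old) distribution."""
    rng = random.Random(seed)  # seeded for reproducibility
    stream = []
    for _ in range(num_samples):
        if rng.random() < replay_fraction:
            stream.append(("old", rng.choice(old_docs)))  # replayed sample
        else:
            stream.append(("new", rng.choice(new_docs)))  # new-data sample
    return stream


# Placeholder corpora, for illustration only.
old_docs = ["old doc A", "old doc B"]
new_docs = ["new doc A", "new doc B", "new doc C"]

stream = mix_with_replay(new_docs, old_docs, replay_fraction=0.05, num_samples=10_000)
old_share = sum(1 for source, _ in stream if source == "old") / len(stream)
print(f"share of replayed samples: {old_share:.3f}")
```

Raising `replay_fraction` trades some new-data throughput for more protection against forgetting, which is the knob to turn under stronger distribution shifts.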
Experimental Setup
The study used GPT-NeoX models at two scales (405M and 10B parameters), trained on the SlimPajama, German Common Crawl, and Pile datasets.
Two distribution shift conditions:
- Weak shift: English → English (same language, different domain)
- Strong shift: English → German (language change)
The 10B scale experiments are particularly relevant — this is where re-training costs become genuinely prohibitive, and where these strategies most need to work.
Results
The key finding: with these strategies, continually pre-trained models match the performance of models re-trained from scratch on the combined old and new data.
- Re-warming + re-decaying improved adaptation under both weak and strong distribution shifts
- 1% replay meaningfully reduced forgetting in all conditions
- 50% replay essentially eliminated forgetting even under the English → German shift
- Final validation losses were comparable to full re-training baselines
The compute savings are substantial: a continual update trains only on the new data plus a small replay share, whereas re-training a 10B parameter model from scratch repeats the entire original run on top of that.
Practical Implications
For production LLM deployments that need to stay current:
- You don’t need to re-train from scratch when incorporating new data — these strategies are sufficient for most distribution shifts
- Start with 5–10% replay as a practical default; increase if you’re seeing forgetting
- Re-warming + re-decaying is cheap to implement and should be standard practice for any continual training run
- The infinite learning rate schedule proposed in the paper (an alternative to schedules tied to a fixed token budget, such as cosine decay) is worth exploring if you’re doing frequent updates
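One way to picture an infinite learning rate schedule is as four phases: a linear warm-up, a cooldown to a constant rate, a plateau that can run across any number of updates, and a short final anneal before a checkpoint is released. The sketch below assumes that shape. The function name, the parameters, the cosine cooldown, and the exponential form of the anneal are assumptions for illustration and should be checked against the paper before use.

```python
import math


def infinite_lr(step, warmup_steps, cooldown_steps, anneal_start, anneal_steps,
                max_lr, const_lr, min_lr):
    """Sketch of a four-phase schedule. Assumes
    anneal_start >= warmup_steps + cooldown_steps."""
    if step < warmup_steps:
        # Phase 1: linear warm-up from 0 to max_lr.
        return max_lr * step / warmup_steps
    cooldown_end = warmup_steps + cooldown_steps
    if step < cooldown_end:
        # Phase 2: cosine cooldown from max_lr to const_lr.
        progress = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1.0 + math.cos(math.pi * progress))
    if step < anneal_start:
        # Phase 3: constant plateau; new data can keep arriving here
        # without any re-warming.
        return const_lr
    # Phase 4: exponential anneal from const_lr down to min_lr.
    progress = min(1.0, (step - anneal_start) / anneal_steps)
    return const_lr * (min_lr / const_lr) ** progress


# Hypothetical settings, for illustration only.
cfg = dict(warmup_steps=100, cooldown_steps=400, anneal_start=2000,
           anneal_steps=200, max_lr=3e-4, const_lr=1.5e-4, min_lr=3e-5)
for step in (0, 100, 300, 1000, 2100, 2200):
    print(step, infinite_lr(step, **cfg))
```

The appeal for frequent updates is that training can resume from a plateau checkpoint at `const_lr`, so `anneal_start` does not have to be fixed in advance.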
This is the kind of applied ML work that has immediate practical value — not a new architecture, just better training hygiene that happens to dramatically reduce costs.
References
Originally published on LinkedIn.