Simplifying Continual Pre-training of Large Language Models

Every time a large language model needs to incorporate new data, the default approach is to re-train from scratch. For frontier-scale models, this is wildly expensive and increasingly impractical. Continual pre-training is the obvious solution — but catastrophic forgetting and distribution shift make it hard to do well.

A new study shows that three surprisingly simple strategies can close most of the gap between continual pre-training and re-training from scratch.

The Core Problem

Continual pre-training — updating an existing model on new data without starting over — faces two failure modes:

Poor adaptation: The model doesn’t sufficiently learn the new distribution
Catastrophic forgetting: The model loses performance on the original distribution to accommodate new data

Simply continuing to train on the new data handles neither well. The study tests methods that address both at once.
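
To make the two failure modes concrete, here is a minimal sketch of how you might track them from validation losses measured before and after the continual run. The function name shift_report and the loss values are hypothetical, chosen for illustration; they are not from the study.

```python
def shift_report(old_before, old_after, new_before, new_after):
    """Summarize forgetting and adaptation from validation losses (lower is better)."""
    return {
        # Positive value: the model got worse on the original distribution.
        "forgetting": old_after - old_before,
        # Positive value: the model got better on the new distribution.
        "adaptation": new_before - new_after,
    }


# Hypothetical losses, for illustration only.
report = shift_report(old_before=2.10, old_after=2.35, new_before=3.40, new_after=2.60)
print(report)
```

A good continual run keeps "forgetting" near zero while pushing "adaptation" as high as possible.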

Three Strategies That Work

Learning Rate Re-warming — Rather than continuing from the final low LR of the previous training run, re-increase the learning rate at the start of continual training. This allows the model to make larger parameter updates early on, improving adaptation to the new distribution.

Learning Rate Re-decaying — After re-warming, decay the LR back down over the new training run. This combination — re-warm then re-decay — produces better adaptation than either flat continuation or re-warming alone.
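
The re-warm then re-decay pattern can be sketched as a single schedule function. This is a minimal sketch that assumes linear warm-up and cosine decay; the function name rewarm_redecay_lr, the default learning rates, and the warm-up length are illustrative assumptions, not values taken from the study.

```python
import math


def rewarm_redecay_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Learning rate at `step` of one continual pre-training stage.

    Assumed shape: linear re-warming from min_lr up to max_lr,
    then cosine re-decay back down to min_lr.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Re-warming: climb from the low LR where the previous run ended.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine anneal from max_lr back to min_lr.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few points of a 10,000-step stage.
for s in (0, 500, 1000, 5500, 10000):
    print(s, rewarm_redecay_lr(s, total_steps=10000))
```

In a PyTorch training loop, a function like this could be wrapped in torch.optim.lr_scheduler.LambdaLR by returning the ratio to the optimizer's base LR instead of the absolute value.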

Replay of Previous Data — Mix a small percentage of data from the original training distribution into the new training run. Even 1% replay substantially reduces forgetting. Higher percentages (up to 50%) provide further protection under strong distribution shifts.
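
One simple way to implement replay is to pick the source of each training example at random. The sketch below assumes that "x% replay" means each example comes from the old data with probability x; the name replay_stream and the toy datasets are hypothetical, and a real pipeline would mix at the token or batch level inside its data loader.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def replay_stream(
    new_data: Sequence[T],
    old_data: Sequence[T],
    replay_fraction: float,
    num_samples: int,
    seed: int = 0,
) -> Iterator[T]:
    """Yield num_samples examples, drawing from old_data with probability replay_fraction."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    for _ in range(num_samples):
        source = old_data if rng.random() < replay_fraction else new_data
        yield rng.choice(source)


# Toy stand-ins for the new and original corpora.
new_docs = ["new_doc_a", "new_doc_b"]
old_docs = ["old_doc_a", "old_doc_b"]
batch = list(replay_stream(new_docs, old_docs, replay_fraction=0.05, num_samples=16))
print(batch)
```

Raising replay_fraction is the only change needed to move from a light replay setting to a heavier one for a strong shift.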

Experimental Setup

The study used GPT-NeoX models at two scales (405M and 10B parameters), trained on the Pile, SlimPajama, and German Common Crawl datasets.

Two distribution shift conditions:

Weak shift: English → English (same language, different domain)
Strong shift: English → German (language change)

The 10B scale experiments are particularly relevant — this is where re-training costs become genuinely prohibitive, and where these strategies most need to work.

Results

The key finding: continually pre-trained models using these strategies match the performance of models fully re-trained from scratch on the combined old and new data.

Re-warming + re-decaying improved adaptation under both weak and strong distribution shifts
1% replay meaningfully reduced forgetting in all conditions
50% replay essentially eliminated forgetting even under the English → German shift
Final validation losses were comparable to full re-training baselines

The compute savings are substantial — updating a 10B parameter model rather than re-training it from scratch is not a marginal optimization.

Practical Implications

For production LLM deployments that need to stay current:

You don’t need to re-train from scratch when incorporating new data — these strategies are sufficient for most distribution shifts
Start with 5–10% replay as a practical default; increase if you’re seeing forgetting
Re-warming + re-decaying is cheap to implement and should be standard practice for any continual training run
The infinite learning rate schedule proposed in the paper (an alternative to fixed schedules) is worth exploring if you’re doing frequent updates
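
For the infinite schedule, here is a rough sketch of one plausible shape: warm up, cool down to a constant learning rate, hold it there, then anneal at the very end. The phase lengths, the cosine cooldown, the exponential anneal, and every default value are assumptions for illustration; check the paper for the exact formulation before relying on it.

```python
import math


def infinite_lr(step, warmup_steps, cooldown_steps, anneal_start, total_steps,
                max_lr=3e-4, const_lr=1.5e-4, min_lr=3e-5):
    """Assumed four-phase schedule: warm-up, cooldown, constant, final anneal."""
    cooldown_end = warmup_steps + cooldown_steps
    if not (warmup_steps > 0 and cooldown_steps > 0
            and cooldown_end <= anneal_start < total_steps):
        raise ValueError("phases must be ordered: warm-up, cooldown, constant, anneal")
    if step < warmup_steps:
        # Phase 1: linear warm-up from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < cooldown_end:
        # Phase 2: cosine cooldown from max_lr to const_lr.
        progress = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1.0 + math.cos(math.pi * progress))
    if step < anneal_start:
        # Phase 3: hold const_lr; new data can be added here without re-warming.
        return const_lr
    # Phase 4: exponential anneal from const_lr to min_lr before releasing a checkpoint.
    progress = min(1.0, (step - anneal_start) / (total_steps - anneal_start))
    return const_lr * (min_lr / const_lr) ** progress


phases = dict(warmup_steps=100, cooldown_steps=400, anneal_start=800, total_steps=1000)
for s in (0, 100, 300, 600, 900, 1000):
    print(s, infinite_lr(s, **phases))
```

The appeal for frequent updates, as I read it, is that a checkpoint saved during the constant phase can keep training on the next dataset at the same learning rate, with the anneal applied only to the copy you ship.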

This is the kind of applied ML work that has immediate practical value — not a new architecture, just better training hygiene that happens to dramatically reduce costs.

References

arXiv:2403.08763

Originally published on LinkedIn.