Learning, Fast and Slow: LLMs That Adapt Continually
Adapting an LLM through parameter updates forces every improvement into a single persistent set of weights: task-specific tricks and general reasoning alike. This shrinks the model's distribution toward the trained task, eroding its capacity to learn new ones. Prompt optimization enables fast, task-specific adaptation and hence sidesteps this, but on its own it cannot match the performance ceiling of parameter updates.
We introduce Fast-Slow Training (FST), a paradigm for LLM training that optimizes the agent/context layer, including prompts, as "fast weights" and the network parameters as "slow weights", with the two updates interleaved during training. Fast weights encode task-dependent nuances, freeing slow weights to focus on general capabilities. Across math, code, and general reasoning benchmarks, FST beats weights-only training on every axis we measured. With one recipe, FST:
- Matches RL's performance with up to 3x fewer training steps and lifts the asymptotic ceiling under ScaleRL-style scaling-law fits.
- Reaches matched accuracy at ~70% lower KL divergence from the base model, preserving the model's ability to keep learning (plasticity).
- Improves continual learning, continuing to make progress when the task switches, where weights-only training stalls.
Motivation: The Quest for General-Purpose AI
A north star in AI research is building performant, scalable systems that can instantly adapt to a diverse range of general tasks. In the past five years, Large Language Models (LLMs) and in-context learning have revolutionized this pursuit, allowing models to solve problems they were never explicitly trained for.
It is easy to forget that until recently, even simple tasks like sentiment analysis required training bespoke classifiers from scratch.
While in-context learning as a paradigm has paid massive dividends in generality [22], training the model parameters for a given task typically yields a higher performance ceiling.
Compute costs aside, domain-specific finetuning imposes a set of restrictions on the model. For one, training a model on a narrow domain is known to degrade out-of-distribution performance. It can also reduce the model's ability to be finetuned on new tasks later.
Though current models are quite general, there seems to be a tradeoff between how adaptable and how performant they are. What can we do to close this gap?
Is Reinforcement Learning Enough?
The emerging paradigm of reinforcement learning for LLMs has shown great promise in making models more performant across a diverse set of tasks. To what degree RL causes specialization and degrades performance on future tasks and out-of-domain settings remains an open question. While recent work argues that the on-policy nature of the updates keeps the model distribution from shifting much and prevents forgetting of old tasks [1] [2], heavy RL on a given domain does drastically change the model distribution in practice, e.g., OpenAI's goblin incident [3].
Despite this line of work on LLMs suggesting on-policy learning is all we need, simply being on-policy has clearly not been sufficient in continual and episodic deep RL. For example, prior work has observed primacy bias [4], where data encountered early in agent training plays an outsized role in the final policy; loss of plasticity [5], where models trained on one domain become less able to learn new skills; and catastrophic forgetting [6], where performance on old domains collapses when learning a new one.
These obstacles have produced a rich literature of methods enabling learning across changing tasks [7] [8] [9] [10]. A common theme among work in this area is the adoption of fast and slow components into the model. The idea dates back to classic machine learning work by Schmidhuber [11] and Hinton [12]. The intuition is that fast components of the model can quickly learn task-specific information while the slow components build a general core of skills applicable across tasks.
Fast and slow learning has an even richer history in neuroscience, specifically related to complementary learning systems theory [13] [14]. The theory proposes that the neocortex learns slowly to discover structure across experiences while the hippocampus allows for quick adaptation to new situations without disrupting the existing structure. These new memories are then slowly ingrained into neocortical memory systems over time.
Inspired by this literature, we propose…
Fast-Slow Training for LLMs
Due to the strong in-context learning ability of LLMs, we represent the model context as fast weights [15] and the model parameters as slow weights. Fast-Slow Training (FST) in LLMs is a general blueprint: any context-optimization method can be used to update the context, adapting quickly to new settings, and any gradient-based learning method can be used to update the model parameters.

To instantiate this idea, we take a state-of-the-art RL algorithm, CISPO [16], and interleave its updates with a state-of-the-art prompt optimizer, GEPA [17], which can leverage rich text feedback. Every \(T\) RL steps, we run a light round of prompt optimization with GEPA. The prompt optimizer generates a set of prompts covering the Pareto front. For each problem during RL, we pull several of these into the rollout prompts and calculate the advantage once per problem.
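In pseudocode, the interleaving looks roughly like the sketch below. The helper callables and the default values of `T` and `prompts_per_problem` are hypothetical placeholders standing in for the actual CISPO and GEPA machinery, not our implementation.

```python
import random

def fast_slow_train(
    model,
    seed_prompts,        # initial system prompts (can be fully task-agnostic)
    sample_problems,     # callable: () -> list of training problems
    generate_rollouts,   # callable: (model, problem, prompts) -> rollouts
    compute_advantages,  # callable: per-problem rollouts -> advantages
    cispo_update,        # callable: slow-weight RL update (CISPO-style)
    gepa_optimize,       # callable: fast-weight prompt update (GEPA-style)
    total_steps=1000,
    T=50,                # RL steps between prompt-optimization rounds (placeholder)
    prompts_per_problem=4,
):
    """Interleave slow (parameter) updates with fast (prompt) updates."""
    pareto_prompts = list(seed_prompts)  # current Pareto front of candidate prompts
    for step in range(total_steps):
        rollouts = []
        for problem in sample_problems():
            # Fast weights: pair each problem with several Pareto-front prompts,
            # so different rollouts see different task-specific contexts.
            k = min(prompts_per_problem, len(pareto_prompts))
            prompts = random.sample(pareto_prompts, k)
            rollouts.append((problem, generate_rollouts(model, problem, prompts)))
        # Advantages are computed once per problem, pooling rollouts across prompts.
        advantages = compute_advantages(rollouts)
        # Slow weights: one CISPO-style policy-gradient update on the parameters.
        cispo_update(model, rollouts, advantages)
        if (step + 1) % T == 0:
            # Fast weights: a light GEPA round refreshes the prompt Pareto front,
            # using rich text feedback gathered from recent rollouts.
            pareto_prompts = gepa_optimize(model, pareto_prompts, rollouts)
    return model, pareto_prompts
```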
The intuition is as follows. Reinforcement learning does whatever it takes to maximize reward on the task being trained, including forcing the model to internalize task-specific information into its parameters in order to climb rewards. But our goal when doing RL for LLMs is a general-purpose reasoner: a model whose weights capture broadly useful reasoning strategies rather than memorizing domain-specific details for every possible setting. By introducing context optimization, declarative task-specific information can be absorbed quickly into the prompt, leaving the model weights to learn more general reasoning behavior.
FST has several benefits. We find FST:
- improves data efficiency and the performance ceiling,
- remains close to the base model and maintains plasticity,
- and improves continual learning.
We detail each experiment below.
Fast-Slow Training Improves Data Efficiency and Performance Ceiling


We find data efficiency in FST is much improved over RL, taking far fewer steps to achieve the same performance across tasks. This does not come at the cost of OOD performance or diversity: FST has equal or better performance across out-of-domain and Easy-to-Hard tasks. Next, following ScaleRL [18], we fit sigmoidal scaling curves to training runs on these tasks and find that FST lifts the performance ceiling.
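For readers who want to reproduce this kind of fit, the sketch below fits a sigmoidal compute-performance curve with SciPy. The parameterization is one common ScaleRL-style form, and the numbers are toy values for illustration only, not results from these runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(C, A, B, C_mid):
    """Sigmoidal compute-performance curve (a ScaleRL-style form): reward rises
    from ~0 toward an asymptotic ceiling A as training compute C grows."""
    return A / (1.0 + (C_mid / C) ** B)

# Toy illustrative numbers (not results from this post): training compute per
# checkpoint and the mean eval reward observed at that checkpoint.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
reward = np.array([0.05, 0.12, 0.28, 0.45, 0.55, 0.58])

(A, B, C_mid), _ = curve_fit(
    sigmoid_scaling, compute, reward,
    p0=[0.6, 1.0, 2e3], bounds=([0, 0, 0], [1.0, 10.0, 1e6]),
)
print(f"fitted ceiling A={A:.3f}, exponent B={B:.2f}, midpoint compute C_mid={C_mid:.0f}")
```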
Fast-Slow Training Remains Close to the Base Model and Maintains Plasticity


Since the prompt absorbs the brunt of the task-specific information, the model parameters remain much closer in distribution to the base model. This matters in light of past work [19] showing that KL divergence from the base model on a task is a strong proxy for catastrophic forgetting. We additionally probe the plasticity of models trained with RL vs. FST and find that FST checkpoints are much more amenable to future RL training.
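For reference, a common way to measure this distance is a per-token Monte Carlo estimate of KL(policy ‖ base) on sampled completions. The sketch below assumes Hugging Face-style causal LMs and hypothetical argument names; it is one standard estimator, not necessarily the exact one used here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_to_base(policy, base, input_ids, attention_mask, completion_mask):
    """Per-token Monte Carlo estimate of KL(policy || base) on sampled completions.

    Assumes `input_ids` were sampled from `policy`; `completion_mask` is 1 on
    generated tokens (where the KL is measured) and 0 on the prompt.
    """
    logp_policy = F.log_softmax(policy(input_ids, attention_mask=attention_mask).logits, dim=-1)
    logp_base = F.log_softmax(base(input_ids, attention_mask=attention_mask).logits, dim=-1)

    # Log-prob of the actually sampled next tokens under each model.
    next_tokens = input_ids[:, 1:].unsqueeze(-1)
    lp_policy = torch.gather(logp_policy[:, :-1], -1, next_tokens).squeeze(-1)
    lp_base = torch.gather(logp_base[:, :-1], -1, next_tokens).squeeze(-1)

    # E_{x ~ policy}[log policy(x) - log base(x)], averaged over completion tokens.
    mask = completion_mask[:, 1:].float()
    return ((lp_policy - lp_base) * mask).sum() / mask.sum()
```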
Fast-Slow Training Improves Continual Learning

We compare FST with RL in a task-stage continual learning setting, where a single training run continues across three tasks: HoVer (blue) → CodeIO (green) → Physics (red). Here, the FST seed prompts for GEPA are fully task-agnostic, and the prompt optimizer autonomously decides how and when to change the system prompt in response to changing data. We find FST quickly picks up tasks where RL stalls.
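Concretely, the stage schedule and seed prompt look roughly like the snippet below; the dataset paths, step counts, and seed prompt text are illustrative placeholders, not the exact configuration of this run.

```python
# Illustrative stage schedule for the continual-learning run.
TASK_SCHEDULE = [
    ("HoVer",   "hover_train.jsonl",   1_000),  # stage 1
    ("CodeIO",  "codeio_train.jsonl",  1_000),  # stage 2
    ("Physics", "physics_train.jsonl", 1_000),  # stage 3
]

# The GEPA seed prompt is fully task-agnostic; the optimizer alone decides how
# and when to rewrite the system prompt as the data distribution changes.
SEED_PROMPT = "You are a helpful assistant. Think carefully before answering."
```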
Why does FST work?
We wanted to better understand what exactly about FST enabled faster learning on new tasks and a higher performance ceiling. We conducted a few controlled experiments:
Fast Weights Acquire Task Signal Faster Than Slow Weights

We train models with RL and FST on a synthetic star-graph search task [20], where the goal is to find a path between two nodes in a large star-shaped graph. The base model obtains zero reward on this task. We find that adding context optimization drastically speeds up the rate at which FST obtains learning signal, in part because GEPA can learn from text feedback and incorporate general lessons into the context. While GEPA alone helps solve only a few problems, it provides enough gradient signal for FST to climb rewards.
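For concreteness, here is an illustrative generator for this kind of star-graph problem. The arm counts, node labels, prompt format, and reward function are assumptions for illustration, not the exact setup from [20] or from our experiments.

```python
import random

def make_star_graph_problem(num_arms=10, arm_length=5, rng=None):
    """Build one star-graph search problem in the spirit of [20]: a central node
    with `num_arms` disjoint paths attached; the task is to output the node
    sequence from the center to a goal node at the end of one arm."""
    rng = rng or random.Random()
    nodes = list(range(1, num_arms * arm_length + 1))
    rng.shuffle(nodes)
    center = 0
    edges, arms = [], []
    for a in range(num_arms):
        arm = [center] + nodes[a * arm_length:(a + 1) * arm_length]
        arms.append(arm)
        edges.extend(zip(arm, arm[1:]))  # consecutive nodes along the arm
    rng.shuffle(edges)  # shuffle so the answer path is not given in order
    goal_arm = rng.choice(arms)
    prompt = (
        "Edges: " + " ".join(f"{u}->{v}" for u, v in edges)
        + f"\nFind the path from {center} to {goal_arm[-1]}."
    )
    answer = " ".join(map(str, goal_arm))
    return prompt, answer

def reward(model_output, answer):
    """Binary reward: 1 only if the model outputs the exact correct path."""
    return float(model_output.strip() == answer)
```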
Fast and Slow Weights Both Optimizing for Reward Raise Performance Ceiling

Finally, to study the impact on the performance ceiling, we train models with RL, FST, and FST-distill, a variant that uses reverse KL to distill information from the FST prompt into the model weights, following recent work on self-distillation [21]. We find FST has the highest ceiling, while the other approaches rely on only the fast or only the slow weights to climb rewards. Additionally, we note an added diversity and exploration benefit of FST: since the prompt optimization process generates several candidate prompts covering the Pareto front, using one per rollout lets the model explore more during RL and typically maintains higher entropy over training.
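A minimal sketch of a reverse-KL distillation objective of this flavor is below. The tensor layout and the assumption that teacher and student logits are aligned on the same completion tokens are illustrative, not necessarily the exact FST-distill objective.

```python
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits, completion_mask):
    """Reverse-KL self-distillation: KL(student || teacher), summed over the
    vocabulary at each completion token and averaged over those tokens.

    In an FST-distill-style setup, the "teacher" is the model conditioned on the
    optimized FST prompt and the "student" sees no optimized prompt; the teacher
    logits should be detached by the caller so gradients flow only to the student.
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    # sum_v p_s(v) * (log p_s(v) - log p_t(v)) at each token position
    kl = (logp_s.exp() * (logp_s - logp_t)).sum(dim=-1)
    mask = completion_mask.float()
    return (kl * mask).sum() / mask.sum()
```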
The Future
More broadly, FST represents a paradigm for continual learning in LLMs in which the model context is optimized as "fast weights" (through any method), quickly picking up task-specific information, while the network parameters are updated as "slow weights" (e.g., via RL, SFT, OPD…), building a robust general reasoning core.