Context Management for LLM Agents: A Memory Hierarchy View

Context is what the model sees at each forward pass: system prompt, tool definitions, conversation history, tool call results, and any retrieved or injected information. As LLM agents take on longer, more complex tasks, context management becomes the central bottleneck — not because models lack context length, but because they lack strategies for using context well.

This post surveys how agents learn to manage their own context, from harness-driven compaction through memory tools to sub-agent delegation.

Why Context Management Matters

Four distinct pressures push agents toward context limits, each with different characteristics:

1. Deep reasoning exceeds context length. On truly difficult problems, test-time scaling means the model’s chain of thought can grow to tens or hundreds of thousands of tokens — potentially exceeding the context window during a single reasoning pass. The bottleneck is the model’s own generation, not external input.

2. Tool outputs are long and off-policy. Search agents ingest web pages, code execution outputs, and API responses that can be tens of thousands of tokens each. A single screenshot in a GUI agent can consume thousands of tokens. These tokens consume context budget and push the model into out-of-distribution conditioning states.

3. Long-horizon tasks accumulate history. Multi-session tasks like SWE agents or deep research systems run for hundreds of turns. Even with short individual turns, accumulated history grows linearly and eventually saturates the window. Information relevant to the current step may have been generated many turns ago — naive FIFO truncation loses it.

4. User personalization requires cross-session state. Agents that serve the same user repeatedly need to remember preferences and prior decisions across separate conversations. The context window resets each session; personalization requires persistent storage.

These pressures often co-occur: a research agent (pressure 3) ingests long web pages (pressure 2) while reasoning deeply about their content (pressure 1) for a user whose preferences shape the research direction (pressure 4).

Long Context vs. Long Horizon

Two capabilities, often conflated, are largely distinct in practice:

  1. Long-context capability: Can the model attend to and retrieve from information at large context lengths? This is an architecture + pretraining problem (RoPE scaling, context extension, attention patterns), and is typically measured by input-heavy retrieval benchmarks (RULER, needle-in-a-haystack).

  2. Long-horizon reasoning: Can the model make productive use of additional thinking tokens or interaction turns? This is a training distribution + reasoning strategy problem, and the bottleneck is on the output side — sustaining coherent generation over tens or hundreds of thousands of tokens.

These look independent partly because we measure them differently. Long-context benchmarks test retrieval from long inputs but rarely test long generation. A model that aces needle-in-a-haystack at 1M context may still degrade past ~128k of its own reasoning, filling the space with repetition and circular logic — but we wouldn’t know from the retrieval benchmark alone. The capabilities may share more infrastructure than the benchmarks suggest (attention quality at long range matters for both), but current evaluation treats them as separate.

What’s clearer is the asymmetry in the other direction: context management can provide long-horizon reasoning without long context. A model with a 32k window can sustain an effective reasoning depth of 500k tokens if it periodically compresses its own reasoning into summaries; context management can substitute for context length.

The LLM Memory Hierarchy

Just as computer architecture organizes storage into a hierarchy (registers → RAM → disk → network) trading speed for capacity, LLM agents operate across an analogous hierarchy:

| Level | Computer analogy | Speed | Capacity | Persistence | Management |
|---|---|---|---|---|---|
| KV cache | CPU cache/registers | Hardware-speed | Bounded by GPU memory | Per-generation | Hardware/framework |
| Context window | RAM | Single forward pass | 32k-1M tokens | Per-session | Harness or model |
| Files | Disk/SSD | Tool call round-trip | Large | Cross-session | Model (via tools) |
| Databases / vector stores | Network storage | Search + retrieval | Unlimited | Permanent | Model (via tools) |

The core design challenge at each boundary is the caching policy — what to keep at the fast level, when to evict to a slower level, and when to promote back. This analogy is not merely illustrative — it has already driven real system designs. PagedAttention (vLLM) directly implements virtual memory for KV cache: fixed-size KV blocks, block tables for non-contiguous allocation, copy-on-write for shared sequences. KV cache compression methods (H2O, SnapKV) are literally page eviction policies.
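To make the PagedAttention connection concrete, here is a toy block-table allocator. It is a deliberate simplification with invented names and a single-sequence interface; real vLLM block tables also handle copy-on-write sharing, GPU memory, and the attention kernels themselves.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative choice)

class BlockAllocator:
    """Toy PagedAttention-style allocator: logical token positions map to
    non-contiguous physical blocks through a per-sequence block table."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of free physical block ids
        self.tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Allocate a new physical block whenever a sequence crosses a block
        boundary; return (physical block id, offset within block)."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:             # first token of a new logical block
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                         # a 40-token sequence
    alloc.append_token("seq-0", pos)
print(len(alloc.tables["seq-0"]))             # -> 3 (three 16-token blocks cover 40 tokens)
```

Because blocks are allocated on demand and referenced indirectly, sequences of different lengths share one physical pool without fragmentation, exactly the virtual-memory trick.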

A notable disconnect: KV cache management is typically treated as a systems problem (optimize throughput, minimize memory) while context management is treated as an algorithm problem (optimize reasoning quality). But the two levels are tightly coupled — heuristic KV eviction can destroy reasoning by evicting semantically critical attention states, while algorithmic context editing operates without awareness of KV cache costs. The co-design opportunity: context editing decisions informed by KV cache pressure, and KV eviction informed by the model’s semantic understanding of what matters.

Who Acts as the “Operating System”?

Each pressure maps naturally to a boundary in the hierarchy:

  • Pressure 1 (deep reasoning) → KV cache ↔ context: the model’s own reasoning outgrows the window
  • Pressure 2 (tool outputs) → external data entering context: long, off-policy tokens injected from the environment
  • Pressure 3 (long-horizon accumulation) → context ↔ files: history that must survive beyond the session window
  • Pressure 4 (personalization) → files ↔ databases: cross-session state that must persist permanently

The question at each boundary is the same one an operating system faces: who decides the caching policy? The “LLM as OS” analogy is not new — Karpathy’s LLM OS vision, MemGPT [1], and AIOS [2] all explored it. MemGPT in particular proposed a two-tier virtual memory system (context window as RAM, external storage as disk) with interrupt-driven swapping. We extend this to a four-level hierarchy with explicit caching policies at each boundary, and connect each context pressure to a specific level transition.

In computer architecture, the OS has privileged access to cheap hardware signals (dirty bits, TLB misses, access counters). In LLM systems, no component has this privilege — the “hardware signals” (attention weights, token probabilities) are expensive to extract and noisy to interpret. This creates an evolving design spectrum for who makes memory management decisions — and this spectrum directly corresponds to the context management strategies we’ll survey:

| Phase | Who manages memory | Strategy |
|---|---|---|
| Phase 1 — Harness-as-system | External harness, using heuristics. Cheap but semantically blind. | Harness-driven compaction (Sec. 1) |
| Phase 2 — Model-as-system | Model manages memory via tool calls. More accurate but costly in tokens. | Memory as a tool (Sec. 2) |
| Phase 3 — Model-informed system | System uses signals from the model (e.g. Free()LM’s [3] LoRA-switched cleaning mode), but the model doesn’t bear the full cost. | Trained compaction (Sec. 1) |
| Phase 4 — Agent-as-OS | Main agent schedules sub-agents and allocates fresh contexts. Sub-agents return condensed results. | Sub-agents (Sec. 3) |

Context Management Strategies

The strategies below are ordered by increasing model autonomy — from the harness making all decisions, to the model managing its own memory, to the model delegating entire subtasks. They are not mutually exclusive; production systems typically combine all three.

1. Harness-Driven Compaction

The base case: an external harness monitors context length and triggers compression when it exceeds a threshold. The model has no control over when or what gets compressed.

History compaction: When context reaches N tokens, the harness passes the conversation history to a summarizer (often the same model) which compresses it. The agent continues with the summary plus a small window of recent state.

Tool output preprocessing: Rather than passing raw tool outputs into context, the harness intercepts them and runs a summarizer first — condensing a full web page into the relevant paragraphs before the agent ever sees it.
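A minimal sketch of such a harness loop. The threshold, the crude token counter, and the summarization prompt are illustrative assumptions, not any particular framework’s API; `llm` is a stand-in callable.

```python
COMPACT_THRESHOLD = 100_000   # compact once history exceeds this many tokens
KEEP_RECENT = 10              # recent messages kept verbatim

def count_tokens(messages):
    # crude proxy: roughly 4 characters per token
    return sum(len(str(m)) for m in messages) // 4

def maybe_compact(history, llm):
    """Harness-side compaction: the model has no say in when this fires."""
    if count_tokens(history) < COMPACT_THRESHOLD:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = llm(f"Summarize this conversation, preserving task state:\n{old}")
    # The agent continues from the summary plus recent turns; old turns are gone.
    return [{"role": "system", "content": f"[Compacted history]\n{summary}"}] + recent
```

The same shape works for tool output preprocessing: intercept the raw output, call the summarizer, and inject only the condensed version.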

Limitations: The model cannot protect information it knows will be needed later — the harness compresses uniformly. Critical details may be lost if they don’t appear “important” to the summarizer at compression time.

Training can improve harness-driven compaction along two dimensions: training the model to compress better (learned compaction), or training the model to reason better from compressed state (compaction-adapted training). These are complementary.

Learned Compaction

AgentFold [4] (SFT): Trains the model to maintain multi-scale state summaries — granular condensation of recent turns and deep consolidation of completed sub-tasks. Achieves 92% context reduction (7K vs 91K tokens after 100 turns). Key insight: delay consolidation until outcomes are clear.

MEM1 [5] (RL with sparse rewards): Trains the model to maintain a compact internal state (<IS>...</IS> tags) via PPO, learning purely from task success. At each turn, the model rewrites its internal state and discards all previous context — achieving constant memory usage regardless of task length. Results: 3.5x better performance than 2x larger models, 3.7x memory reduction. Exhibits emergent behaviors: concurrent question management, adaptive focus shifting, self-verification.

| | AgentFold | MEM1 |
|---|---|---|
| Training | SFT on synthetic data | RL (PPO) with sparse task rewards |
| Memory structure | Multi-scale summaries (granular + deep) | Single internal state block |
| Memory growth | Sub-linear (~7K at 100 turns) | Constant (~500-1K tokens) |
| Compression | 92% vs ReAct | 73% (27% of baseline) |
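The MEM1-style turn loop can be sketched as follows. The `model` call and the tag-parsing convention are stand-ins; in the actual system, producing a useful rewritten state is precisely what PPO trains.

```python
def mem1_turn(model, internal_state, observation):
    """One turn of a constant-memory agent loop: condition ONLY on the
    <IS>...</IS> state plus the new observation, emit a rewritten state and
    an action, and discard everything else. Memory is O(1) in turn count."""
    prompt = f"<IS>{internal_state}</IS>\nObservation: {observation}"
    output = model(prompt)  # expected shape: "<IS>new state</IS>\nAction: ..."
    new_state = output.split("<IS>")[1].split("</IS>")[0]
    action = output.split("Action:")[1].strip()
    return new_state, action
```

Everything the agent needs from the past must survive inside `new_state`; the reward signal alone teaches the model what is worth carrying forward.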

That MEM1 succeeds with sparse terminal rewards while more complex context behaviors (see ContextFold [6] below) require dense process rewards reflects a recurring theme in agent RL: the right reward density depends on the complexity of the behavior being shaped.

Compaction-Adapted Training

Rather than training the compression itself, these methods keep the compression mechanism fixed and train the model to reason effectively from compressed state. This directly addresses pressure 1 — deep reasoning exceeding context length.

Reasoning Cache (RC) [7]: An iterative decoding algorithm that alternates between reasoning and summarization. At each iteration: (1) generate a reasoning trace (budget $H_R$, e.g. 16k tokens) conditioned on the previous summary, (2) summarize into a compact summary (~2k tokens), (3) discard the raw trace. The effective reasoning horizon scales as $T \times H_R$ — at $T=12$ iterations with $H_R=16\text{k}$, the model reasons over 192k tokens while each individual generation stays within 16k.
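The iteration can be sketched directly. `generate`, `summarize`, and the "FINAL ANSWER" stop convention are illustrative stand-ins for calls to the same underlying model.

```python
def rc_decode(problem, generate, summarize, T=12, H_R=16_000):
    """Reasoning Cache-style decoding: alternate reasoning and summarization
    so no single generation exceeds H_R tokens."""
    summary = ""
    for _ in range(T):
        # (1) reason for up to H_R tokens, conditioned only on the running summary
        trace = generate(problem, summary, max_tokens=H_R)
        if "FINAL ANSWER" in trace:          # assumed stop convention
            return trace
        # (2) compress the trace into a compact summary, (3) discard the raw trace
        summary = summarize(problem, summary, trace)
    return summary  # effective horizon: up to T * H_R tokens of reasoning
```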

The critical design: only the reasoning step is trained with RL (GRPO), not the summarization. The RL objective trains summary-conditioned generation: given a problem and a summary of prior reasoning, produce reasoning that leads to a correct answer. This exploits a summarization-generation asymmetry — models are already better at reasoning from summaries than generating full solutions from scratch.

Results (RCT-4B, trained at 16k, evaluated at 192k):

| | HMMT 2025 | AIME 2025 | IMO-AnswerBench | FrontierScience |
|---|---|---|---|---|
| Base (16k, standard) | 39.8 | 46.0 | 40.9 | 23.3 |
| Base + RC (192k, no training) | 56.7 | 46.3 | 29.5 | — |
| RCT-4B + RC (192k, trained) | 66.3 | 74.9 | 58.0 | 34.1 |

A 4B model extrapolates from a 16k training budget to 192k at test time — a 12x extrapolation — outperforming specialized thinking models and competing with much larger models. The FrontierScience gains transfer despite training exclusively on math.

Free()LM [3]: Where RC summarizes reasoning into a new compact representation, Free()LM takes the complementary approach — surgically pruning redundant spans while preserving the remaining tokens exactly. The core metaphor: standard LLMs are “malloc-only” engines that continuously append tokens without any mechanism to discard obsolete information. Free()LM adds the missing free() operation.

A plug-and-play Free-Module (LoRA adapter) switches the model between two modes:

  • Reasoning mode (LoRA unmerged): Normal reasoning
  • Cleaning mode (LoRA merged): Scans context, identifies redundant chunks, outputs pruning commands as [{"prefix": "...", "suffix": "..."}] anchors defining spans to delete

The cycle triggers every ~5000 tokens. Training uses SFT on filtered synthetic data — rejection sampling retains only pruning operations where accuracy on the pruned context $\geq$ accuracy on the original.
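A sketch of how such anchor-based pruning commands might be applied. The exact semantics here (delete from the start of `prefix` through the end of `suffix`, skipping commands whose anchors are not found) are my reading of the format and may differ in detail from the paper.

```python
def apply_prunes(context: str, commands: list[dict]) -> str:
    """Delete the spans named by each {"prefix", "suffix"} anchor pair,
    leaving all surrounding tokens byte-for-byte intact."""
    for cmd in commands:
        start = context.find(cmd["prefix"])
        if start == -1:
            continue                              # anchor not found: skip safely
        end = context.find(cmd["suffix"], start)
        if end == -1:
            continue
        context = context[:start] + context[end + len(cmd["suffix"]):]
    return context

text = "keep A. redundant derivation ... dead end. keep B."
print(apply_prunes(text, [{"prefix": "redundant", "suffix": "dead end."}]))
# -> "keep A.  keep B."
```

Note the contrast with summarization: nothing outside the pruned spans is rewritten, so the surviving KV cache prefix remains valid up to the first deletion.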

The most striking finding: on HLE tasks requiring >80k thinking tokens, Qwen3-235B drops to 0% accuracy (total reasoning collapse), but Free()LM restores performance to ~50% by compressing 100k+ trajectories back to the 40-70k range where the model reasons comfortably. The Free-Module also exhibits cross-model generalization: the 8B module transfers to Qwen3-235B and even DeepSeek-V3.2.

CoMem [8] (ICLR 2026 Workshop): Decouples memory management from reasoning into two separate models — a small memory model (e.g. Qwen3-4B) that compresses long-term history into summaries, and a larger agent model (e.g. DeepSWE 32B) that reasons over the compressed state + recent raw turns. The memory model runs asynchronously, overlapping compression with the agent’s inference.

The primary motivation is latency: context length directly determines decoding latency (KV cache loading from HBM). The memory model is trained with GRPO on a functional equivalence reward — the summary should make the frozen agent produce the same actions it would have with full context. This means the memory model may learn to preserve information a generic summarizer would discard (exact file paths, specific error messages) because the agent’s policy depends on those details.
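The functional-equivalence idea can be sketched as a binary reward. This is illustrative only; the paper’s exact formulation (e.g. partial credit over action sequences) may differ.

```python
def functional_equivalence_reward(agent, summary, recent_turns, full_history):
    """Reward the memory model when the frozen agent, given summary + recent
    raw turns, takes the same action it would have taken with full context."""
    action_compressed = agent(summary + recent_turns)
    action_full = agent(full_history)      # reference behavior, full context
    return 1.0 if action_compressed == action_full else 0.0
```

Because the reward is defined through the frozen agent’s behavior, the memory model is pushed to preserve whatever details (file paths, error strings) that particular policy actually conditions on.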

Results on SWE-Bench: 1.4-2.1x latency reduction while maintaining competitive performance. On DeepSWE, it slightly exceeds full-context performance (41.0 vs 40.4) — suggesting compression filters noise that hurts reasoning.

RC vs CoMem: Training Opposite Sides

RC and CoMem both use summarization for context management, but they train different sides of the same interface:

compress: history -> summary    CoMem: trained (GRPO, functional equivalence)
                                RC:    not trained (base model ICL)

reason:   summary -> answer     RC:    trained (GRPO, summary-conditioned generation)
                                CoMem: not trained (frozen agent model)

Both demonstrate that training one side suffices for significant gains. The natural next step — training both sides jointly — remains unexplored.

| | RC | Free()LM | CoMem |
|---|---|---|---|
| Primary goal | Reasoning extrapolation | Reasoning quality | Inference latency |
| Mechanism | Summarize, reason from summary | Prune redundant spans | Summarize (separate model), reason |
| What’s trained | Reasoning from summaries | Redundancy identification | Summary production |
| Training method | RL (GRPO) | SFT on filtered operations | RL (GRPO, functional equivalence) |
| Architecture | Single model, two prompts | Single model + LoRA | Two separate models |
| Extrapolation | 12x beyond training horizon | Indirect | Not targeted |

These are complementary: Free()LM prunes noise first, then RC or CoMem summarizes the cleaned result. CoMem provides the latency reduction that RC and Free()LM lack.

Context Management and Test-Time Scaling

Without explicit optimization for long-horizon reasoning, models exhibit an inverse-U scaling curve: accuracy improves with more thinking tokens up to a peak, then degrades as the model falls into repetitive, unproductive reasoning. This is the fundamental ceiling on test-time scaling for standard autoregressive decoding.

Training for context management — whether RC-style [7] compaction adaptation, MEM1-style [5] internal state, or ContextFold-style [6] branching — should push this ceiling higher or eliminate it entirely. The bottleneck is not compute itself but the model’s inability to maintain productive reasoning as context grows. RC provides direct evidence: the inverse-U curve becomes monotonically increasing under RC decoding, with no sign of degradation out to 192k tokens.

This effect also operates across sessions. Databricks’ MemAlign demonstrates that agent performance scales monotonically with accumulated memory — accuracy rises from near-zero to ~70% as episodic and semantic memories grow, and average reasoning steps drop from ~20 to ~5. Memory converts past test-time compute into stored knowledge, making future sessions more efficient.

2. Memory as a Tool

The model gains autonomy by having explicit read/write tools for storing and retrieving information. Instead of the harness deciding when to compress, the model decides what to externalize and when to retrieve it — analogous to a person taking notes.

Memory storage can take several forms: in-context state (structured state in the model’s own output), persistent files (surviving across sessions, as in Claude Code’s memory system), and databases/vector stores (enabling semantic retrieval over large corpora).

Two production systems illustrate the design space:

Mem0 [9] implements a two-phase pipeline for fact-oriented persistent memory. An extraction phase processes message pairs using conversation summary + recent messages. An update phase evaluates extracted facts against existing memories using semantic similarity search, determining operations: ADD, UPDATE, DELETE, or NOOP. The graph-enhanced variant (Mem0g) adds directed labeled graphs for temporal reasoning. Results on LOCOMO: 26% improvement over OpenAI’s memory system, 91% lower p95 latency.
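A toy version of the update-phase decision. The similarity thresholds and the `contradicts` judge are invented placeholders; in Mem0 itself the operation choice is made by the model over retrieved candidates, not by fixed cutoffs.

```python
def decide_op(fact, memories, similar, contradicts,
              dup_thresh=0.9, related_thresh=0.6):
    """Pick ADD / UPDATE / DELETE / NOOP for one extracted fact."""
    if not memories:
        return ("ADD", None)
    best = max(memories, key=lambda m: similar(fact, m))
    score = similar(fact, best)
    if score >= dup_thresh:
        return ("NOOP", best)            # effectively already stored
    if score >= related_thresh:
        if contradicts(fact, best):
            return ("DELETE", best)      # new fact invalidates a stale memory
        return ("UPDATE", best)          # refine the related memory
    return ("ADD", None)                 # genuinely new information
```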

Agent Workflow Memory (AWM) [10] stores reusable task workflows rather than raw facts — abstracted action sequences with variables. After each successful task, it induces a workflow and adds it to memory, creating a snowball effect where simple workflows become building blocks for complex ones. Results on WebArena/Mind2Web: 51.1% relative improvement over state-of-the-art, 40% fewer steps per task.

| | Mem0 | AWM |
|---|---|---|
| What’s stored | Facts about the world/user | Abstracted action sequences |
| When to store | Every conversation turn (pipeline) | After successful task completion |
| Retrieval | Semantic similarity search | Goal-matching against descriptions |
| Best for | Personalization, factual recall | Procedural tasks, skill transfer |

Training the Model to Use Memory

Mem0 and AWM are inference-only pipelines — memory operations happen around the model, not as model-initiated actions. The harder problem: training the model to proactively call memory read/write tools during its own reasoning.

The most effective lever is selecting tasks where memory use is required by construction: multi-session tasks where information from session 1 is needed in session 3, post-compaction retrieval where facts have been evicted from context, or cross-task transfer where stored workflows accelerate new tasks. On these tasks, outcome reward alone naturally reinforces memory use because the model can’t succeed without it.

The reward design challenge is that memory actions have delayed and indirect payoff — writing costs tokens now but only pays off later. Possible approaches include counterfactual rewards for reads (did retrieval actually help?), future-use rewards for writes (was this memory ever retrieved?), and format/process rewards to bootstrap usage (must be annealed to avoid reward hacking).
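The counterfactual read reward can be made concrete with a sketch. This is a speculative design, not taken from any published system, and the paired rollout it requires is itself a nontrivial compute cost.

```python
def counterfactual_read_reward(run_episode, task, memory):
    """Did retrieving this memory actually change the outcome? Reward only
    when the retrieval flipped a failure into a success."""
    success_with = run_episode(task, memory=memory)
    success_without = run_episode(task, memory=None)   # counterfactual rollout
    return 1.0 if success_with and not success_without else 0.0
```

A symmetric future-use reward for writes would log every retrieval of a stored memory and credit the write retroactively.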

No published work has systematically studied reward design for training memory tool use. The combination of memory-dependent data with outcome-only reward is the most practical starting point — it sidesteps the credit assignment problem by making the task structure do the work.

3. Sub-Agents

The most autonomous strategy: the model delegates entire subtasks to sub-agents that run in their own fresh context windows. Each sub-agent explores extensively but returns only a condensed summary. The main agent’s context stays clean and focused on coordination.

ContextFold [6] demonstrates that sub-agent delegation can be trained end-to-end with RL. It introduces two special actions:

  • branch(description, prompt): Creates a sub-agent with a scoped task and its own context
  • return(message): Summarizes outcomes and folds intermediate steps out of the main context

ContextFold trains with FoldGRPO — GRPO adapted to train on folded (compressed) contexts rather than full histories. This train-test consistency proved critical: FoldGRPO outperformed standard GRPO by +7.7% on BrowseComp-Plus. Training on unfolded contexts while deploying with folded ones is itself a form of distribution mismatch.

Dense process rewards shape the branching behavior: an Unfolded Token Penalty discourages bloating the main context, an Out-of-Scope Penalty (judged by an LLM) keeps branches focused (improving scope adherence from 47.3% to 75.4%), and a Failure Penalty punishes broken tool calls.
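One way these penalties could combine into a scalar process reward (the weights here are made-up placeholders, not the paper’s values):

```python
def fold_process_reward(task_reward, unfolded_tokens, out_of_scope, tool_failures,
                        w_tok=1e-5, w_scope=0.1, w_fail=0.1):
    """Toy combination of ContextFold-style dense penalties: discourage main-
    context bloat, off-scope branches, and broken tool calls."""
    return (task_reward
            - w_tok * unfolded_tokens    # Unfolded Token Penalty
            - w_scope * out_of_scope     # Out-of-Scope Penalty (LLM-judged count)
            - w_fail * tool_failures)    # Failure Penalty
```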

Results: 62% pass@1 on BrowseComp-Plus (+14.2% over baseline) and 58% on SWE-Bench using only 32K active context with max 10 branches — matching models using 10x larger contexts.

Recursive Language Models (RLM) [11]: The most expressive form of sub-agent context management. Instead of fixed branch/return operations, the model writes arbitrary code in a Python REPL to decompose, examine, and recursively process its input — including calling itself on sub-prompts.

The key design: the input prompt is stored as a variable in the REPL, never loaded into the model’s context window. The model sees only constant-size metadata and writes code to manipulate the prompt programmatically — print(prompt[:100]), prompt.split("Chapter"), or sub_RLM("In Chapter 1, find all items..."). Each recursive sub-call gets its own fresh context.

state = InitREPL(prompt=P)              # P stored as a REPL variable, not in context
state = AddFunction(state, sub_RLM)     # the model can call itself recursively
hist = [Metadata(state)]                # only metadata enters context, never P itself
while True:
    code = LLM(hist)                    # model writes code
    state, stdout = REPL(state, code)   # execute in the REPL
    hist += [code, Metadata(stdout)]
    if state["final"] is not None:
        return state["final"]

Three design choices distinguish RLMs:

  1. Symbolic handle: The prompt is manipulated by reference (code), not by value (reading it into context). Context stays constant-size regardless of input length.
  2. Code output: The model generates programs that produce answers, not the answers themselves.
  3. Symbolic recursion: Code can invoke sub_RLM() inside loops — enabling $\Omega(|P|)$ or $\Omega(|P|^2)$ processing over the input.

Results: RLMs scale to 10M+ tokens and outperform base models by up to 2x. On OOLONG-Pairs (quadratic-complexity), GPT-5 and Qwen3-Coder both score ~0% while RLM achieves 58% — the task is structurally impossible without recursive decomposition.

In our memory hierarchy, RLM operates at the file level: the model reads from and writes to REPL variables (analogous to disk), never loading the full data into context (RAM). It effectively writes its own page fault handler.

| | ContextFold | RLM |
|---|---|---|
| Operations | Fixed: branch/return | Unbounded: arbitrary code |
| Scaling | Tree-structured (max 10 branches) | Arbitrary recursion depth |
| Training | RL with process rewards (FoldGRPO) | SFT on 1K trajectories |
| Best for | Multi-part tasks within context window | Inputs far exceeding context window |

These are complementary: ContextFold manages the model’s own reasoning context during multi-part tasks. RLM manages external input that doesn’t fit in context at all. A system could use RLM to process a 10M-token corpus and ContextFold to manage the reasoning context while doing so.

Comparing Across the Spectrum

| Aspect | Harness Compaction | Memory as Tool | Sub-Agents |
|---|---|---|---|
| Model autonomy | None | Decides what/when to store | Delegates entire subtasks |
| Context growth | Bounded by compression ratio | Bounded by retrieval budget | Bounded by summary size (ContextFold) or constant (RLM) |
| Information loss risk | Harness may discard critical info | Model may forget to save | Summary may omit key details |
| Training approach | SFT (AgentFold), RL (MEM1), or compaction-adapted RL (RC) | Mostly inference-only pipelines | RL with process rewards (ContextFold) or SFT (RLM) |
| Best for | Deep reasoning (RC, Free()LM), simple multi-turn | Personalization, skill accumulation | Complex multi-part tasks (ContextFold), inputs exceeding context (RLM) |

References

[1] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.

[2] Mei, K., Li, Z., Xu, S., Ye, R., Ge, Y., & Zhang, Y. (2024). AIOS: LLM Agent Operating System. arXiv:2403.16971.

[3] Zheng, Y., Ma, D., Liang, T., Xu, J., Huang, X., Chen, L., Mi, H., & Wang, Y. (2026). Free(): Learning to Forget in Malloc-Only Reasoning Models. arXiv:2602.08030.

[4] Ye, R., et al. (2025). AgentFold: Long-Horizon Web Agents with Proactive Context Management. arXiv:2510.24699.

[5] Zhou, Z., et al. (2025). MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv:2506.15841.

[6] Sun, W., Lu, M., Ling, Z., Liu, K., Yao, X., Yang, Y., & Chen, J. (2025). Scaling Long-Horizon LLM Agent via Context-Folding. arXiv:2510.11967.

[7] Wu, I., Qu, Y., Setlur, A., & Kumar, A. (2026). Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL. arXiv:2602.03773.

[8] Zhang, Y., et al. (2026). CoMem: Context Management with A Decoupled Long-Context Model. ICLR 2026 Workshop. OpenReview.

[9] Chhikara, P., et al. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.

[10] Wang, Z. Z., Mao, J., Fried, D., & Neubig, G. (2025). Agent Workflow Memory. ICML 2025. GitHub.

[11] Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. arXiv:2512.24601.