Memory System v2: Solving the Context Bloat Problem
Two problems plagued our AI agent: context windows exploding from tool output (82.5% of context was tool results), and /new wiping all working state with no short-term memory handoff. We built aggressive context pruning, compressed MEMORY.md by 97%, and wrote a session handoff hook that automatically updates memory files before each session reset.
In our last post on building a persistent memory system, we described the MEMORY.md bloat problem: after six weeks, the file had grown to over 700 lines, and we fixed it by switching from inline content to pointer-based entries. The fix worked. MEMORY.md got compact, session startup improved, everything was fine.
Then it bloated again.
Four weeks later, MEMORY.md was back to 92,000 characters and 790 lines. The organizer pipeline kept writing new facts inline rather than deferring to per-topic files. Our byte-size limit wasn't being enforced consistently. The original fix had patched the symptom, not the cause.
More troubling, we had started noticing that sessions were hitting context limits mid-task even when MEMORY.md was under control. The agent would read a few files, run a search, and then stall, not because it had run out of memory, but because its context window was full of tool output from earlier in the same session.
And there was a third problem we'd been tolerating: every time we ran /new to start a fresh session, the agent lost all awareness of what it had just been doing. Our long-term memory system (v1) handled facts, preferences, and project knowledge well. But the short-term working state (what task was in progress, what decisions were just made, what the next step was) vanished completely. The user had to manually remind the agent to update its memory files before resetting, or accept losing the context.
Three problems, one theme: no systematic lifecycle for context at any timescale.
Measuring before fixing
Before changing anything, we wrote session-stats.py to analyze the last 15 sessions and understand where context was actually going. The output was clarifying.
Session context breakdown (15 sessions, chars):
| Category | Total | % of ctx | Avg/session |
| --- | --- | --- | --- |
| Tool results | 1,842,300 | 82.5% | 122,820 |
| System prompt | 268,100 | 12.0% | 17,873 |
| Assistant text | 64,700 | 2.9% | 4,313 |
| User input | 55,900 | 2.5% | 3,727 |
The most extreme session: 159,000 characters of tool results, 1,500 characters of user input and assistant text combined. The actual conversation was almost invisible in its own context window.
System prompt was 17K chars per session on average. We knew MEMORY.md was loaded at startup, but seeing it account for 12% of total context across all sessions, including sessions where nothing memory-related happened, made the number concrete. The agent was paying 17K chars of context tax on every session, regardless of what it was doing.
The two problems were now measurable: tool results bloating within a session, and MEMORY.md bloating across sessions. Both were solvable, and we had numbers to evaluate solutions against.
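We haven't reproduced session-stats.py here, but the core tally is simple. A minimal sketch, assuming sessions are logged as JSONL with `role` and `content` fields; the real transcript schema will differ:

```python
import json
from collections import defaultdict
from pathlib import Path

# Map transcript roles to report categories. The role names here are an
# assumption; adjust to your agent's actual log format.
CATEGORY = {
    "tool": "Tool results",
    "system": "System prompt",
    "assistant": "Assistant text",
    "user": "User input",
}

def context_breakdown(session_files):
    """Sum context chars per category across a list of JSONL transcripts."""
    totals = defaultdict(int)
    for path in session_files:
        for line in Path(path).read_text().splitlines():
            msg = json.loads(line)
            cat = CATEGORY.get(msg["role"])
            if cat:
                totals[cat] += len(msg["content"])
    grand = sum(totals.values()) or 1
    n = len(session_files)
    for cat, chars in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{cat:<16} {chars:>10,} {chars / grand:>7.1%} {chars // n:>10,}")
    return totals
```

The useful part is not the script but the habit: run it before optimizing, so you attack the largest category first.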
Solution 1: Context pruning
The within-session problem is that tool outputs accumulate. The agent reads a file: that's 8K chars of context. Runs a search: another 4K. Edits a file, sees the diff: 2K. Reads the test output: 6K. After a moderately complex task, the context is mostly tool output from earlier steps that the agent no longer needs to reference.
OpenClawβs contextPruning feature handles this with a TTL-based approach: after a configurable time window, tool outputs beyond the most recent turn are replaced with a placeholder. The content is gone from the active context, but the agent can see that something happened.
Our configuration:
```yaml
contextPruning:
  mode: cache-ttl
  ttl: 30
  minPrunableToolChars: 100
  hardClearRatio: 0
```
With ttl: 30, any tool result older than 30 seconds is eligible for pruning on the next turn. minPrunableToolChars: 100 prevents replacing tiny tool outputs that cost almost nothing. hardClearRatio: 0 means we never do a full wipe β we keep the most recent turn intact.
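OpenClaw's implementation isn't shown here, but the cache-ttl behavior these settings describe can be sketched. The message shape, field names, and placeholder text below are assumptions:

```python
import time

PLACEHOLDER = "[tool result pruned]"

def prune_tool_results(messages, ttl=30, min_chars=100, now=None):
    """Replace stale tool outputs with a placeholder, keeping the last one.

    `messages` is assumed to be a list of dicts with role, content, and ts
    (epoch seconds); the real message schema is an implementation detail
    we're guessing at.
    """
    now = now if now is not None else time.time()
    # Never prune the most recent tool result, whatever its age.
    last_tool = max(
        (i for i, m in enumerate(messages) if m["role"] == "tool"),
        default=None,
    )
    for i, m in enumerate(messages):
        if (
            m["role"] == "tool"
            and i != last_tool
            and len(m["content"]) >= min_chars      # minPrunableToolChars
            and now - m["ts"] > ttl                 # ttl
        ):
            m["content"] = PLACEHOLDER
    return messages
```

Note there is no full-wipe branch, which corresponds to hardClearRatio: 0.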
The effect is that the agent operates with a sliding window of recent tool context rather than the full accumulated history. For tasks involving repeated file reads or search-iterate loops, this is the difference between hitting context limits at step 8 and finishing the task.
One concern we had: would pruning break the agent's ability to reference earlier work? In practice, no. For most tasks, the agent either needs the output of the most recent tool call, or it needs a general fact that should be in memory rather than in an ephemeral tool result. If the agent needs to re-read a file it already processed, that's usually a sign the fact should have been written to memory, not cached in context.
Solution 2: MEMORY.md structural compression
The migration from 92K back down to compact required confronting a design question we'd avoided the first time: what exactly should MEMORY.md contain?
Our v1 answer had been "recent activity, active projects, key contacts, and infrastructure notes," with a byte-size cap to keep it manageable. This was wrong. A byte-size cap is an incentive to compress content, but it doesn't prevent accumulation; it just makes each entry shorter before you run out of room and start bending the rules.
The right answer is that MEMORY.md should contain pointers, not content. If you can answer the question "what is this file for?" with "it contains X," then MEMORY.md should not contain X; it should contain "see memory/X.md for X." MEMORY.md is an index that tells the agent where to look, not a document that contains what the agent knows.
With that definition, the target structure became obvious:
```markdown
## Users
| handle | role | notes |
| --- | --- | --- |
| @orange | owner | ... |

## Projects
| name | status | detail file |
| --- | --- | --- |
| claw-stack | active | memory/entities/project-claw-stack.md |
| info-pipeline | active | memory/entities/project-info-pipeline.md |

## Infrastructure
| service | notes | detail file |
| --- | --- | --- |
| CF Workers | edge compute | memory/infra/cloudflare.md |

## Behavior rules
See AGENTS.md for current rules.

## Recent (last 5)
- 2026-03-09: ...
```
Tables for structured facts (users, projects, infra). Pointers for everything else. Recent activity capped at five entries, rolling. Total target: under 5,000 characters.
After the migration, MEMORY.md went from 92,000 characters to 2,900 characters, a 97% reduction. Session startup went from ~23K tokens of MEMORY.md context to ~700 tokens. Everything that was in MEMORY.md before is still searchable through QMD vector search; it's just in per-topic files now rather than inline.
The migration script itself was about 150 lines of Python: read the current MEMORY.md, extract facts by category using Claude Haiku, write facts to appropriate per-topic files, generate the new pointer-based MEMORY.md. Running it took 20 seconds.
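We haven't published the migration script, but its skeleton looks roughly like the sketch below. The category-to-path mapping and the `extract_facts` callback (standing in for the Claude Haiku step) are illustrative, not the real code:

```python
from pathlib import Path

# Illustrative mapping from fact category to per-topic file; the real
# script's mapping lived in the memory system's file layout.
TARGETS = {
    "project": "memory/entities/project-{name}.md",
    "infra": "memory/infra/{name}.md",
}

def migrate(memory_md: str, extract_facts):
    """Split inline MEMORY.md content into per-topic files plus an index.

    `extract_facts` is the LLM step (Haiku in the post): given the old
    file's text, it returns (category, name, text) triples.
    """
    pointers = []
    for category, name, text in extract_facts(memory_md):
        path = Path(TARGETS[category].format(name=name))
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a") as f:          # append: a topic may recur
            f.write(text + "\n")
        pointers.append(f"| {name} | {category} | {path} |")
    # The new MEMORY.md is pointers only, never content.
    index = "## Index\n| name | category | detail file |\n| --- | --- | --- |\n"
    return index + "\n".join(pointers) + "\n"
```

The structure is the point: extraction is delegated to a model, but file placement is deterministic, so the index stays trustworthy.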
Solution 3: Session handoff hooks
The context pruning and MEMORY.md compression addressed the technical bloat problems. The third problem was one of workflow: when you run /new to start a fresh session, you lose all the working context from the current session. What file were you editing? What was the next step? What did you just figure out about the bug you were debugging?
The conventional response is "write better notes." We wanted to automate it.
OpenClaw supports hooks that fire on specific commands. We wrote a command:new hook that runs a session summarization pipeline before the new session starts:
```python
# Triggered on /new
def session_handoff(transcript):
    summary = claude_haiku(
        system=open("MANIFEST.md").read(),  # file map for the memory system
        prompt=(
            "Summarize this session. Extract: current work state, "
            "decisions made, lessons learned, entities updated. "
            "Format as structured updates for memory files.\n\n"
            f"{transcript}"
        ),
    )
    apply_memory_updates(summary)  # updates MEMORY.md, TODO.md, entities, etc.
```
The hook runs synchronously with a 20-second timeout, then falls back to async if the transcript is too long to process quickly. In practice, most sessions process in 8 to 12 seconds.
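The timeout-then-fallback behavior can be sketched with standard threading. This is our own illustration of the pattern, not OpenClaw's hook runtime:

```python
import threading

def run_handoff_with_timeout(handoff, transcript, timeout=20):
    """Run the handoff; if it exceeds `timeout` seconds, let it finish
    in the background rather than blocking the new session."""
    done = threading.Event()

    def worker():
        handoff(transcript)
        done.set()

    threading.Thread(target=worker).start()
    if done.wait(timeout):
        return "sync"   # finished in time; memory is current at startup
    return "async"      # still running; updates land shortly after /new
```

The trade-off: in the async case, the next session may start a few seconds before the memory files reflect the previous one.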
The key piece is MANIFEST.md, a file that describes the memory system's structure: which files exist, what each one contains, and what kinds of updates go where. Without it, Haiku doesn't know that a project update should go to memory/entities/project-X.md rather than into MEMORY.md directly. The MANIFEST is the schema documentation for the agent that maintains memory.
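The actual MANIFEST.md isn't reproduced in this post; a hypothetical version, showing the kind of schema documentation we mean, might look like:

```markdown
# MANIFEST.md — memory system file map (illustrative example)

- MEMORY.md — index only: pointer tables plus last 5 recent entries. Max 5,000 chars.
- TODO.md — open tasks and next steps. Working state goes here, not in MEMORY.md.
- memory/entities/project-<name>.md — one file per project: status, decisions, history.
- memory/infra/<service>.md — infrastructure details and quirks.
- AGENTS.md — behavior rules; read-only for the handoff hook.
```

Each line answers the same question for the summarizer: given this kind of update, which file does it belong in?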
After the handoff hook, /new still starts a fresh context, but MEMORY.md now reflects the current sessionβs outcomes. The next session starts knowing where you left off.
Decay prevention rules
After rebuilding the system twice, we wrote explicit rules into AGENTS.md to prevent the same problems from recurring:
Hard limits:
- MEMORY.md must stay under 5,000 characters. If an update would push it over, write to a per-topic file and add a pointer instead.
- Never write commit hashes, code snippets, or raw error messages to MEMORY.md. These are either ephemeral (commit hashes, errors) or belong in per-topic files (code).
Prohibited content:
- Lists of more than 5 items (use a per-topic file)
- Facts already present in another memory file (no duplication)
- "Temporary" notes (write to a TODO file, not to MEMORY.md)
Regular maintenance:
- After any session that touched more than 3 files, check whether per-topic files need updating
- When a project status changes, update the entity file, not the MEMORY.md table
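These rules live in prose, not code, but the hard limits are mechanically checkable. A sketch of a pre-write validator; the function name and the commit-hash heuristic are ours, not part of the actual system:

```python
import re

MAX_CHARS = 5_000
# Crude heuristics for prohibited content; the commit-hash pattern is a
# guess and will occasionally false-positive on hex-looking strings.
COMMIT_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")
CODE_FENCE = re.compile(r"```")

def check_memory_write(new_content: str) -> list:
    """Return rule violations for a proposed MEMORY.md; empty means OK."""
    violations = []
    if len(new_content) > MAX_CHARS:
        violations.append("over 5,000 chars: move content to a per-topic file")
    if COMMIT_HASH.search(new_content):
        violations.append("commit hash detected: ephemeral, do not store")
    if CODE_FENCE.search(new_content):
        violations.append("code snippet detected: belongs in a per-topic file")
    # Crude: counts bullets across the whole file, not per list.
    bullets = [l for l in new_content.splitlines() if l.lstrip().startswith("- ")]
    if len(bullets) > 5:
        violations.append("list with more than 5 items: use a per-topic file")
    return violations
```

Running a check like this inside the organizer pipeline would turn the AGENTS.md rules from a convention into a gate.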
Rules written into AGENTS.md become part of the system prompt, which means the organizer pipeline and the handoff hook both see them. They're not enforced by code, but explicit rules in the context are meaningfully better than informal conventions.
Measured outcomes
The immediate results after deploying the v2 changes:
| Metric | Before | After |
|---|---|---|
| MEMORY.md size | ~92K chars (~23K tokens) | ~2.9K chars (~700 tokens) |
| Session startup context tax | ~23K tokens | ~700 tokens |
| Tool result share of context | 82.5% | Pruned after 30s |
| Working state preserved across /new | No | Yes (automated) |
The MEMORY.md reduction is a 97% cut. Every new session now starts with 22K fewer tokens of overhead, which means more room for the actual task. The context pruning configuration means tool results older than 30 seconds are replaced with placeholders, preventing the within-session accumulation that was causing stalls on multi-step tasks.
Whether the handoff hook produces the right memory updates consistently is something we'll know after a few weeks of use. The architecture is right; the question is whether Haiku's judgment about what to update holds up at scale. We'll report back.
What we learned about memory
The v1 blog post framed the bloat problem as a technical issue with a technical fix: enforce a byte-size limit, use pointers instead of inline content. That framing was correct but incomplete.
The real problem is that memory management is an information architecture problem, not a storage problem. Every time we said "this fact might be relevant later, so put it in MEMORY.md," we were making a bad indexing decision. MEMORY.md was being used as a catch-all rather than as a specific layer in the architecture.
The v2 system works not because we have better enforcement mechanisms (though the TTL pruning and size limits help) but because we're clearer about what each layer is for:
- Active context: the current sessionβs working state. Ephemeral. Pruned aggressively.
- MEMORY.md: session orientation. The minimum context needed to start a session. Pointers only.
- Per-topic files: depth on specific subjects. Loaded on demand. Where content lives.
- Vector search: fallback retrieval across all memory. For queries that donβt know where to look.
When a new fact arrives, the question isn't "should I remember this?" It's "which layer does this belong in?" Most facts don't belong in MEMORY.md. Getting that architecture right is what prevents bloat.
Practical takeaways for agent developers
If you're building something similar, the mistakes we made twice are worth knowing:
Enforce the index/content separation at write time, not retroactively. A byte-size limit on MEMORY.md doesn't prevent bloat; it just makes bloat smaller before you exceed it. The real constraint is: no content in the index, only pointers. Check this on every write.
Measure context distribution before you optimize. We assumed MEMORY.md was the main problem. It was a problem. Tool results were a bigger problem. session-stats.py took a day to write and immediately surfaced the bigger issue. Measurement first.
TTL-based context pruning is low-risk and high-reward. We were worried it would break agent behavior. It didn't. For most tasks, old tool results are noise, not signal. Prune them.
A handoff hook is worth more than perfect note-taking. Asking humans (or agents) to write end-of-session notes reliably is a losing strategy. Automate it. Even a rough extraction that takes 10 seconds is better than manual notes that don't get written.
Document the memory system's schema for the agents that use it. The MANIFEST.md pattern (a file that explains where things go) is what makes automated memory updates actually put things in the right place. Without it, every update becomes an ad-hoc decision about file placement.
Memory systems for AI agents are still young enough that there's no established practice. These are the patterns that worked for us at our scale. Your scale, your access patterns, and your agent's task distribution will produce different constraints. But the underlying principle holds: agent memory is information architecture. Get the architecture right before you build the infrastructure.