Memory System v2: Solving the Context Bloat Problem
Two problems plagued our AI agent: context windows exploding from tool output (82.5% of context was tool results), and /new wiping all working state with no short-term memory handoff. We built aggressive context pruning, compressed MEMORY.md by 97%, and wrote a session handoff hook that automatically updates memory files before each session reset.
In our last post on building a persistent memory system, we described the MEMORY.md bloat problem: after six weeks, the file had grown to over 700 lines, and we fixed it by switching from inline content to pointer-based entries. The fix worked. MEMORY.md got compact, session startup improved, everything was fine.
Then it bloated again.
Four weeks later, MEMORY.md was back to 92,000 characters and 790 lines. The organizer pipeline kept writing new facts inline rather than deferring to per-topic files. Our byte-size limit wasn't being enforced consistently. The original fix had patched the symptom, not the cause.
More troubling, we had started noticing that sessions were hitting context limits mid-task even when MEMORY.md was under control. The agent would read a few files, run a search, and then stall, not because it had run out of memory, but because its context window was full of tool output from earlier in the same session.
And there was a third problem we'd been tolerating: every time we ran /new to start a fresh session, the agent lost all awareness of what it had just been doing. Our long-term memory system (v1) handled facts, preferences, and project knowledge well. But the short-term working state (what task was in progress, what decisions were just made, what the next step was) vanished completely. The user had to manually remind the agent to update its memory files before resetting, or accept losing the context.
Three problems, one theme: no systematic lifecycle for context at any timescale.
Measuring before fixing
Before changing anything, we wrote session-stats.py to analyze the last 15 sessions and understand where context was actually going. The output was clarifying.
Session context breakdown (15 sessions, chars):
| Category | Total | % of ctx | Avg/session |
| --- | --- | --- | --- |
| Tool results | 1,842,300 | 82.5% | 122,820 |
| System prompt | 268,100 | 12.0% | 17,873 |
| Assistant text | 64,700 | 2.9% | 4,313 |
| User input | 55,900 | 2.5% | 3,727 |
The most extreme session: 159,000 characters of tool results, 1,500 characters of user input and assistant text combined. The actual conversation was almost invisible in its own context window.
System prompt was 17K chars per session on average. We knew MEMORY.md was loaded at startup, but seeing it account for 12% of total context across all sessions, including sessions where nothing memory-related happened, made the number concrete. The agent was paying 17K chars of context tax on every session, regardless of what it was doing.
The two problems were now measurable: tool results bloating within a session, and MEMORY.md bloating across sessions. Both were solvable, and we had numbers to evaluate solutions against.
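We haven't reproduced session-stats.py here, but the core tally is simple. A minimal sketch, assuming sessions are logged as JSONL with `role` and `content` fields; the real transcript schema will differ:

```python
import json
from collections import defaultdict
from pathlib import Path

# Map transcript roles to report categories. The role names here are an
# assumption; adjust to your agent's actual log format.
CATEGORY = {
    "tool": "Tool results",
    "system": "System prompt",
    "assistant": "Assistant text",
    "user": "User input",
}

def context_breakdown(session_files):
    """Sum context chars per category across a list of JSONL transcripts."""
    totals = defaultdict(int)
    for path in session_files:
        for line in Path(path).read_text().splitlines():
            msg = json.loads(line)
            cat = CATEGORY.get(msg["role"])
            if cat:
                totals[cat] += len(msg["content"])
    grand = sum(totals.values()) or 1
    n = len(session_files)
    for cat, chars in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{cat:<16} {chars:>10,} {chars / grand:>7.1%} {chars // n:>10,}")
    return totals
```

The useful part is not the script but the habit: run it before optimizing, so you attack the largest category first.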
Solution 1: Context pruning
The within-session problem is that tool outputs accumulate. The agent reads a file: that's 8K chars of context. Runs a search: another 4K. Edits a file, sees the diff: 2K. Reads the test output: 6K. After a moderately complex task, the context is mostly tool output from earlier steps that the agent no longer needs to reference.
OpenClawβs contextPruning feature handles this with a TTL-based approach: after a configurable time window, tool outputs beyond the most recent turn are replaced with a placeholder. The content is gone from the active context, but the agent can see that something happened.
Our configuration:
```yaml
contextPruning:
  mode: cache-ttl
  ttl: 30
  minPrunableToolChars: 100
  hardClearRatio: 0
```
With ttl: 30, any tool result older than 30 seconds is eligible for pruning on the next turn. minPrunableToolChars: 100 prevents replacing tiny tool outputs that cost almost nothing. hardClearRatio: 0 means we never do a full wipe β we keep the most recent turn intact.
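OpenClaw's implementation isn't shown here, but the cache-ttl behavior these settings describe can be sketched. The message shape, field names, and placeholder text below are assumptions:

```python
import time

PLACEHOLDER = "[tool result pruned]"

def prune_tool_results(messages, ttl=30, min_chars=100, now=None):
    """Replace stale tool outputs with a placeholder, keeping the last one.

    `messages` is assumed to be a list of dicts with role, content, and ts
    (epoch seconds); the real message schema is an implementation detail
    we're guessing at.
    """
    now = now if now is not None else time.time()
    # Never prune the most recent tool result, whatever its age.
    last_tool = max(
        (i for i, m in enumerate(messages) if m["role"] == "tool"),
        default=None,
    )
    for i, m in enumerate(messages):
        if (
            m["role"] == "tool"
            and i != last_tool
            and len(m["content"]) >= min_chars      # minPrunableToolChars
            and now - m["ts"] > ttl                 # ttl
        ):
            m["content"] = PLACEHOLDER
    return messages
```

Note there is no full-wipe branch, which corresponds to hardClearRatio: 0.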
The effect is that the agent operates with a sliding window of recent tool context rather than the full accumulated history. For tasks involving repeated file reads or search-iterate loops, this is the difference between hitting context limits at step 8 and finishing the task.
One concern we had: would pruning break the agent's ability to reference earlier work? In practice, no. For most tasks, the agent either needs the output of the most recent tool call, or it needs a general fact that should be in memory rather than in an ephemeral tool result. If the agent needs to re-read a file it already processed, that's usually a sign the fact should have been written to memory, not cached in context.
Solution 2: MEMORY.md structural compression
The migration from 92K back down to compact required confronting a design question we'd avoided the first time: what exactly should MEMORY.md contain?
Our v1 answer had been "recent activity, active projects, key contacts, and infrastructure notes," with a byte-size cap to keep it manageable. This was wrong. A byte-size cap is an incentive to compress content, but it doesn't prevent accumulation; it just makes each entry shorter before you run out of room and start bending the rules.
The right answer is that MEMORY.md should contain pointers, not content. If you can answer the question "what is this file for?" with "it contains X," then MEMORY.md should not contain X; it should contain "see memory/X.md for X." MEMORY.md is an index that tells the agent where to look, not a document that contains what the agent knows.
With that definition, the target structure became obvious:
```markdown
## Users
| handle | role | notes |
| --- | --- | --- |
| @orange | owner | ... |

## Projects
| name | status | detail file |
| --- | --- | --- |
| claw-stack | active | memory/entities/project-claw-stack.md |
| info-pipeline | active | memory/entities/project-info-pipeline.md |

## Infrastructure
| service | notes | detail file |
| --- | --- | --- |
| CF Workers | edge compute | memory/infra/cloudflare.md |

## Behavior rules
See AGENTS.md for current rules.

## Recent (last 5)
- 2026-03-09: ...
```
Tables for structured facts (users, projects, infra). Pointers for everything else. Recent activity capped at five entries, rolling. Total target: under 5,000 characters.
After the migration, MEMORY.md went from 92,000 characters to 2,900 characters, a 97% reduction. Session startup went from ~23K tokens of MEMORY.md context to ~700 tokens. Everything that was in MEMORY.md before is still searchable through QMD vector search; it's just in per-topic files now rather than inline.
The migration script itself was about 150 lines of Python: read the current MEMORY.md, extract facts by category using Claude Haiku, write facts to appropriate per-topic files, generate the new pointer-based MEMORY.md. Running it took 20 seconds.
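We haven't published the migration script, but its skeleton looks roughly like the sketch below. The category-to-path mapping and the `extract_facts` callback (standing in for the Claude Haiku step) are illustrative, not the real code:

```python
from pathlib import Path

# Illustrative mapping from fact category to per-topic file; the real
# script's mapping lived in the memory system's file layout.
TARGETS = {
    "project": "memory/entities/project-{name}.md",
    "infra": "memory/infra/{name}.md",
}

def migrate(memory_md: str, extract_facts):
    """Split inline MEMORY.md content into per-topic files plus an index.

    `extract_facts` is the LLM step (Haiku in the post): given the old
    file's text, it returns (category, name, text) triples.
    """
    pointers = []
    for category, name, text in extract_facts(memory_md):
        path = Path(TARGETS[category].format(name=name))
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a") as f:          # append: a topic may recur
            f.write(text + "\n")
        pointers.append(f"| {name} | {category} | {path} |")
    # The new MEMORY.md is pointers only, never content.
    index = "## Index\n| name | category | detail file |\n| --- | --- | --- |\n"
    return index + "\n".join(pointers) + "\n"
```

The structure is the point: extraction is delegated to a model, but file placement is deterministic, so the index stays trustworthy.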
Solution 3: Session handoff hooks
The context pruning and MEMORY.md compression addressed the technical bloat problems. The third problem was one of workflow: when you run /new to start a fresh session, you lose all the working context from the current session. What file were you editing? What was the next step? What did you just figure out about the bug you were debugging?
The conventional response is "write better notes." We wanted to automate it.
OpenClaw supports hooks that fire on specific commands. We wrote a command:new hook that runs a session summarization pipeline before the new session starts:
```python
# Triggered on /new
def session_handoff(transcript):
    summary = claude_haiku(
        system=open("MANIFEST.md").read(),  # file map for the memory system
        prompt=(
            "Summarize this session. Extract: current work state, "
            "decisions made, lessons learned, entities updated. "
            "Format as structured updates for memory files.\n\n"
            f"{transcript}"
        ),
    )
    apply_memory_updates(summary)  # updates MEMORY.md, TODO.md, entities, etc.
```
The hook runs synchronously with a 20-second timeout, then falls back to async if the transcript is too long to process quickly. In practice, most sessions process in 8 to 12 seconds.
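The timeout-then-fallback behavior can be sketched with standard threading. This is our own illustration of the pattern, not OpenClaw's hook runtime:

```python
import threading

def run_handoff_with_timeout(handoff, transcript, timeout=20):
    """Run the handoff; if it exceeds `timeout` seconds, let it finish
    in the background rather than blocking the new session."""
    done = threading.Event()

    def worker():
        handoff(transcript)
        done.set()

    threading.Thread(target=worker).start()
    if done.wait(timeout):
        return "sync"   # finished in time; memory is current at startup
    return "async"      # still running; updates land shortly after /new
```

The trade-off: in the async case, the next session may start a few seconds before the memory files reflect the previous one.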
The key piece is MANIFEST.md, a file that describes the memory system's structure: which files exist, what each one contains, and what kinds of updates go where. Without it, Haiku doesn't know that a project update should go to memory/entities/project-X.md rather than into MEMORY.md directly. The MANIFEST is the schema documentation for the agent that maintains memory.
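The actual MANIFEST.md isn't reproduced in this post; a hypothetical version, showing the kind of schema documentation we mean, might look like:

```markdown
# MANIFEST.md — memory system file map (illustrative example)

- MEMORY.md — index only: pointer tables plus last 5 recent entries. Max 5,000 chars.
- TODO.md — open tasks and next steps. Working state goes here, not in MEMORY.md.
- memory/entities/project-<name>.md — one file per project: status, decisions, history.
- memory/infra/<service>.md — infrastructure details and quirks.
- AGENTS.md — behavior rules; read-only for the handoff hook.
```

Each line answers the same question for the summarizer: given this kind of update, which file does it belong in?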
After the handoff hook, /new still starts a fresh context, but MEMORY.md now reflects the current sessionβs outcomes. The next session starts knowing where you left off.
Decay prevention rules
After rebuilding the system twice, we wrote explicit rules into AGENTS.md to prevent the same problems from recurring:
Hard limits:
- MEMORY.md must stay under 5,000 characters. If an update would push it over, write to a per-topic file and add a pointer instead.
- Never write commit hashes, code snippets, or raw error messages to MEMORY.md. These are either ephemeral (commit hashes, errors) or belong in per-topic files (code).
Prohibited content:
- Lists of more than 5 items (use a per-topic file)
- Facts already present in another memory file (no duplication)
- "Temporary" notes (write to a TODO file, not to MEMORY.md)
Regular maintenance:
- After any session that touched more than 3 files, check whether per-topic files need updating
- When a project status changes, update the entity file, not the MEMORY.md table
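These rules live in prose, not code, but the hard limits are mechanically checkable. A sketch of a pre-write validator; the function name and the commit-hash heuristic are ours, not part of the actual system:

```python
import re

MAX_CHARS = 5_000
# Crude heuristics for prohibited content; the commit-hash pattern is a
# guess and will occasionally false-positive on hex-looking strings.
COMMIT_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")
CODE_FENCE = re.compile(r"```")

def check_memory_write(new_content: str) -> list:
    """Return rule violations for a proposed MEMORY.md; empty means OK."""
    violations = []
    if len(new_content) > MAX_CHARS:
        violations.append("over 5,000 chars: move content to a per-topic file")
    if COMMIT_HASH.search(new_content):
        violations.append("commit hash detected: ephemeral, do not store")
    if CODE_FENCE.search(new_content):
        violations.append("code snippet detected: belongs in a per-topic file")
    # Crude: counts bullets across the whole file, not per list.
    bullets = [l for l in new_content.splitlines() if l.lstrip().startswith("- ")]
    if len(bullets) > 5:
        violations.append("list with more than 5 items: use a per-topic file")
    return violations
```

Running a check like this inside the organizer pipeline would turn the AGENTS.md rules from a convention into a gate.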
Rules written into AGENTS.md become part of the system prompt, which means the organizer pipeline and the handoff hook both see them. They're not enforced by code, but explicit rules in the context are meaningfully better than informal conventions.
Measured outcomes
The immediate results after deploying the v2 changes:
| Metric | Before | After |
|---|---|---|
| MEMORY.md size | ~92K chars (~23K tokens) | ~2.9K chars (~700 tokens) |
| Session startup context tax | ~23K tokens | ~700 tokens |
| Tool result share of context | 82.5% | Pruned after 30s |
| Working state preserved across /new | No | Yes (automated) |
The MEMORY.md reduction is a 97% cut. Every new session now starts with 22K fewer tokens of overhead, which means more room for the actual task. The context pruning configuration means tool results older than 30 seconds are replaced with placeholders, preventing the within-session accumulation that was causing stalls on multi-step tasks.
Whether the handoff hook produces the right memory updates consistently is something we'll know after a few weeks of use. The architecture is right; the question is whether Haiku's judgment about what to update holds up at scale. We'll report back.
What we learned about memory
The v1 blog post framed the bloat problem as a technical issue with a technical fix: enforce a byte-size limit, use pointers instead of inline content. That framing was correct but incomplete.
The real problem is that memory management is an information architecture problem, not a storage problem. Every time we said "this fact might be relevant later, so put it in MEMORY.md," we were making a bad indexing decision. MEMORY.md was being used as a catch-all rather than as a specific layer in the architecture.
The v2 system works not because we have better enforcement mechanisms (though the TTL pruning and size limits help) but because we're clearer about what each layer is for:
- Active context: the current sessionβs working state. Ephemeral. Pruned aggressively.
- MEMORY.md: session orientation. The minimum context needed to start a session. Pointers only.
- Per-topic files: depth on specific subjects. Loaded on demand. Where content lives.
- Vector search: fallback retrieval across all memory. For queries that donβt know where to look.
When a new fact arrives, the question isn't "should I remember this?" It's "which layer does this belong in?" Most facts don't belong in MEMORY.md. Getting that architecture right is what prevents bloat.
Practical takeaways for agent developers
If you're building something similar, the mistakes we made twice are worth knowing:
Enforce the index/content separation at write time, not retroactively. A byte-size limit on MEMORY.md doesn't prevent bloat; it just makes bloat smaller before you exceed it. The real constraint is: no content in the index, only pointers. Check this on every write.
Measure context distribution before you optimize. We assumed MEMORY.md was the main problem. It was a problem. Tool results were a bigger problem. session-stats.py took a day to write and immediately surfaced the bigger issue. Measurement first.
TTL-based context pruning is low-risk and high-reward. We were worried it would break agent behavior. It didn't. For most tasks, old tool results are noise, not signal. Prune them.
A handoff hook is worth more than perfect note-taking. Asking humans (or agents) to write end-of-session notes reliably is a losing strategy. Automate it. Even a rough extraction that takes 10 seconds is better than manual notes that don't get written.
Document the memory system's schema for the agents that use it. The MANIFEST.md pattern (a file that explains where things go) is what makes automated memory updates actually put things in the right place. Without it, every update becomes an ad-hoc decision about file placement.
Memory systems for AI agents are still young enough that there's no established practice. These are the patterns that worked for us at our scale. Your scale, your access patterns, and your agent's task distribution will produce different constraints. But the underlying principle holds: agent memory is information architecture. Get the architecture right before you build the infrastructure.