A workflow agent that runs for 120 iterations accumulates a conversation history that grows with every turn. Each MCP tool result can be 500–2,000 tokens. After 40 tool calls, the history alone can exceed 80,000 tokens — before the system prompt, tool definitions, or any new content. Without a strategy for managing this growth, the most complex workflows would fail mid-execution with context window errors, or cost far more than necessary.
The History Loss Bug
Before describing the solution, it's worth describing the bug it replaced. In an early version of the message builder function, the conversation history was never actually attached to the messages sent to the model:
# Buggy version of _build_messages()
def _build_messages(state: GoalAgentState) -> list:
    messages = [SystemMessage(content=system_prompt)]
    # ... add file context, filled slots ...
    conversation = state.get("messages", [])
    # BUG: conversation was fetched but never appended
    # messages += conversation  ← this line was missing
    return messages
The effect was subtle: the agent would work correctly on turn 1, because there was no prior history to miss. On turn 2, it would receive the system prompt and file context but no memory of what it had already done. It would sometimes restart from the beginning, repeat tool calls it had already made, or produce contradictory results.
The fix was one line: messages += conversation. But finding it required systematically tracing why agents on long workflows were producing results that ignored their own earlier output. The bug only manifested on turn 2 and beyond — single-turn workflows passed all tests.
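A regression test would have caught this on turn 2. The sketch below is hypothetical (the real code uses langchain_core message types; the stand-in dataclasses and the state shape are assumptions for illustration), but it pins down the exact failure mode: a builder that drops history returns only the system prompt on later turns.

```python
# Hypothetical turn-2 regression test with stand-in message classes.
from dataclasses import dataclass

@dataclass
class SystemMessage:
    content: str

@dataclass
class HumanMessage:
    content: str

def _build_messages(state: dict) -> list:
    """Fixed version: prior history is appended after the system prompt."""
    messages = [SystemMessage(content="system prompt")]
    messages += state.get("messages", [])  # the one-line fix
    return messages

# Turn 2: state already holds turn-1 history
state = {"messages": [HumanMessage(content="turn 1 request")]}
built = _build_messages(state)
assert len(built) == 2  # system prompt + prior history, not the prompt alone
```

With the buggy version, `len(built)` would be 1 on every turn, which is exactly why single-turn tests passed.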
The Trimming Strategy
With conversation history correctly included, the growth problem became real. The solution is selective trimming: keep recent turns verbatim, compress earlier turns to summaries.
from langchain_core.messages import AIMessage, ToolMessage

_KEEP_RECENT_TURNS = 2              # AI→tool pairs kept intact
_TRIMMED_TOOL_MSG_MAX_CHARS = 150   # max chars for trimmed tool results

def _trim_old_tool_results(messages: list) -> list:
    """Trim ToolMessage content from older turns.

    Keeps the most recent _KEEP_RECENT_TURNS AI→Tool pairs intact.
    For older turns, replaces ToolMessage content with a short summary
    preserving ref_id and success/error status.
    """
    if len(messages) <= _KEEP_RECENT_TURNS * 3:
        return messages  # too short to be worth trimming

    # Find cut point: walk backwards, count AI messages
    ai_count = 0
    cut_index = len(messages)
    for i in range(len(messages) - 1, -1, -1):
        if isinstance(messages[i], AIMessage):
            ai_count += 1
            if ai_count >= _KEEP_RECENT_TURNS:
                cut_index = i
                break

    # Everything before cut_index: trim ToolMessages to short summary
    trimmed = []
    for i, msg in enumerate(messages):
        if i < cut_index and isinstance(msg, ToolMessage):
            short = _summarize_tool_result(msg.content, tool_name=...)
            trimmed.append(ToolMessage(content=short,
                                       tool_call_id=msg.tool_call_id))
        else:
            trimmed.append(msg)  # keep intact
    return trimmed
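To make the cut-point logic concrete, here is a self-contained sketch using minimal stand-in dataclasses in place of the real langchain_core message types, with the summarizer stubbed out. Given four AI→tool turns, the two oldest tool results are replaced and the two most recent survive verbatim:

```python
# Simplified stand-ins for AIMessage/ToolMessage; not the real types.
from dataclasses import dataclass

@dataclass
class AIMessage:
    content: str

@dataclass
class ToolMessage:
    content: str
    tool_call_id: str = ""

_KEEP_RECENT_TURNS = 2

def _trim_old_tool_results(messages: list) -> list:
    if len(messages) <= _KEEP_RECENT_TURNS * 3:
        return messages
    # Walk backwards until the Nth-most-recent AIMessage is found
    ai_count, cut_index = 0, len(messages)
    for i in range(len(messages) - 1, -1, -1):
        if isinstance(messages[i], AIMessage):
            ai_count += 1
            if ai_count >= _KEEP_RECENT_TURNS:
                cut_index = i
                break
    # Replace older ToolMessage bodies; keep everything else intact
    return [
        ToolMessage(content="[trimmed]", tool_call_id=m.tool_call_id)
        if i < cut_index and isinstance(m, ToolMessage) else m
        for i, m in enumerate(messages)
    ]

# Four AI→tool turns: only the two oldest tool results get trimmed
history = []
for n in range(4):
    history += [AIMessage(f"call {n}"), ToolMessage(f"result {n} " * 50, f"id{n}")]
out = _trim_old_tool_results(history)
```

After running, `out[1]` and `out[3]` hold the `[trimmed]` placeholder while `out[5]` and `out[7]` still carry their full results, and every `tool_call_id` is preserved so the AI→tool pairing stays valid.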
What the Summary Preserves
The trimmed summary is not arbitrary truncation. It extracts the highest-value information from a tool result:
import re

def _summarize_tool_result(content: str, tool_name: str = "") -> str:
    if len(content) <= _TRIMMED_TOOL_MSG_MAX_CHARS:
        return content  # short enough, keep as-is

    # 1. Preserve ref_id — used by downstream tools (store_artifact, etc.)
    ref_match = re.search(r"ref_id[:\s]+([a-zA-Z0-9_-]+)", content)
    ref_note = f" [ref_id: {ref_match.group(1)}]" if ref_match else ""

    # 2. Preserve success/error status from the first 100 chars
    if "error" in content.lower()[:100]:
        status = "[ERROR] "
    elif "success" in content.lower()[:100]:
        status = "[OK] "
    else:
        status = ""

    # 3. Keep the leading chars + append ref note + trimmed marker
    tool_prefix = f"[{tool_name}] " if tool_name else ""
    truncated = content[:_TRIMMED_TOOL_MSG_MAX_CHARS].rstrip()
    return f"{tool_prefix}{status}{truncated}...{ref_note} [trimmed — already processed]"
Three things are preserved: the tool name (for context on what was called), the success/error status (the agent needs to know if the call failed), and the ref_id (a reference ID for binary outputs that downstream tools need to chain calls). A 2,000-token MCP result becomes a 30-word summary — sufficient for the agent to know what happened and what reference IDs are available, without the cost of keeping the full output.
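Because the summarizer is pure string manipulation, it is easy to exercise in isolation. The snippet below reproduces the same logic so it runs standalone; the sample tool result and the `generate_pdf` tool name are invented for illustration:

```python
import re

_TRIMMED_TOOL_MSG_MAX_CHARS = 150

def _summarize_tool_result(content: str, tool_name: str = "") -> str:
    # Same logic as above, repeated here so the example is self-contained
    if len(content) <= _TRIMMED_TOOL_MSG_MAX_CHARS:
        return content
    ref_match = re.search(r"ref_id[:\s]+([a-zA-Z0-9_-]+)", content)
    ref_note = f" [ref_id: {ref_match.group(1)}]" if ref_match else ""
    if "error" in content.lower()[:100]:
        status = "[ERROR] "
    elif "success" in content.lower()[:100]:
        status = "[OK] "
    else:
        status = ""
    tool_prefix = f"[{tool_name}] " if tool_name else ""
    truncated = content[:_TRIMMED_TOOL_MSG_MAX_CHARS].rstrip()
    return f"{tool_prefix}{status}{truncated}...{ref_note} [trimmed — already processed]"

# A long (hypothetical) MCP result with an embedded ref_id
raw = "Success: generated document. ref_id: doc_abc123. " + "payload " * 300
short = _summarize_tool_result(raw, tool_name="generate_pdf")
# short begins "[generate_pdf] [OK] Success: ..." and keeps "[ref_id: doc_abc123]"
```

Note that the status scan only looks at the first 100 characters, so a tool result that buries the word "error" deep in its payload is still labeled by its opening line.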
Why Keep Recent Turns Intact
The _KEEP_RECENT_TURNS = 2 value is a deliberate tradeoff. The agent needs full, verbatim results from its most recent actions because:
- The current reasoning turn is directly consuming the results of the previous tool call
- Partial or compressed results would cause the agent to reason incorrectly about what it just received
- Reference IDs alone are sufficient for older turns — the binary data they point to is in object storage, not in the context window
Two turns is a conservative choice. It covers the "call tool → reason about result → call another tool based on that result" pattern that dominates most workflows. Extending to 3 or 4 would increase safety margin but also increase costs on workflows where turns are verbose.
Interaction with Prompt Caching
Message trimming and prompt caching serve complementary roles. Prompt caching reduces the cost of the system prompt portion — which is identical across all turns — by paying approximately 10% of normal input cost after the first turn:
messages = [
    SystemMessage(
        content=[
            {"text": full_system_prompt},
            {"cachePoint": {"type": "default"}},
        ]
    )
]
# system prompt + tool definitions are identical every turn
# → cachePoint marks the boundary; everything before it is cached
# → subsequent turns pay ~10% of normal cost for the system prefix
Message trimming reduces the cost of the conversation history portion — which grows with every turn. Together, they address the two largest contributors to input token cost in long-running workflows. For a 40-turn workflow with a 4,000-token system prompt and an average of 800 tokens per tool result:
# Without optimizations:
#   Turn 40 input = 4,000 (system) + 40 × 800 (history) = 36,000 tokens
#   Total across 40 turns ≈ 40 × 20,000 average = 800,000 input tokens

# With caching + trimming:
#   Turn 40: 400 (cached system) + 2 × 800 (recent full) + 38 × 30 (summaries) ≈ 3,140 tokens
#   Savings: ~91% reduction on long turns
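The arithmetic can be checked directly. This sketch uses the same assumed figures (4,000-token system prompt, 800 tokens per tool result, 30-token summaries, cached reads at ~10% of normal input cost):

```python
# Assumed per-turn token figures from the worked example above
SYSTEM, RESULT, SUMMARY, CACHE_RATE = 4_000, 800, 30, 0.10
KEEP = 2   # recent AI→tool pairs kept verbatim
turn = 40

untrimmed = SYSTEM + turn * RESULT  # full history, no caching
optimized = int(SYSTEM * CACHE_RATE) + KEEP * RESULT + (turn - KEEP) * SUMMARY
savings = 1 - optimized / untrimmed
print(untrimmed, optimized, f"{savings:.0%}")  # → 36000 3140 91%
```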
The savings compound on high-iteration workflows. A 120-iteration workflow for resume generation sees the greatest benefit — by turn 80, the untrimmed history would exceed the context window entirely, making trimming not just an optimization but a functional requirement.
Message Deduplication in the APPEND Reducer
The conversation history uses an APPEND reducer (the add_messages function) that deduplicates by message ID. This matters when a workflow is restored from a checkpoint:
from typing import Annotated, TypedDict
from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages

class GoalAgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]
    # add_messages: appends new messages, deduplicates by message.id
    # → safe to replay checkpoints without doubling history
When a workflow resumes from a checkpoint after a SLOT_NEEDED pause, the messages from before the pause are already in state. The user's slot response is injected as a new HumanMessage with a fresh ID. The deduplicating reducer ensures the history before the pause isn't replicated — the agent sees a clean, continuous conversation thread.
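The dedup-on-append behavior can be illustrated with a minimal stand-in reducer (the real implementation is langgraph's `add_messages`, which also handles updates; the `Msg` dataclass and this simplified reducer are assumptions for the sketch):

```python
# Minimal stand-in for an ID-deduplicating append reducer.
from dataclasses import dataclass, field
import uuid

@dataclass
class Msg:
    content: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

def add_messages(existing: list, new: list) -> list:
    """Append new messages, skipping any whose id is already present."""
    seen = {m.id for m in existing}
    return existing + [m for m in new if m.id not in seen]

pre_pause = [Msg("turn 1"), Msg("turn 2")]
# Checkpoint replay re-delivers the old messages plus one fresh slot response
restored = add_messages(pre_pause, pre_pause + [Msg("slot response")])
assert len(restored) == 3  # pre-pause history is not doubled
```

Without the ID check, every checkpoint replay would append a second copy of the pre-pause history, doubling it on each resume.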
The history loss bug taught us that conversation history is the agent's memory. Lose it and the agent loses coherence. Keep all of it verbatim and the agent loses its ability to reason within a finite context. The trimming strategy is a deliberate compression: discard verbosity, preserve meaning.
