Workflow agents can get stuck. A loop where the agent calls the same tool repeatedly, never advancing, would run indefinitely without a hard stop. Iteration limits provide one layer of protection — but they're counted in turns, not cost. A workflow that calls a tool once per turn and a workflow that calls five tools per turn have the same iteration count at completion, but wildly different input token consumption. Token budgets address what iteration limits cannot.

The Budget Formula

The token budget for each workflow execution is computed once at initialization from the workflow's max_iterations value:

# Computed at job start, checked every turn
_TOKEN_BUDGET_INPUT = max(100_000, max_iterations * 10_000)

# Examples:
# Simple workflow:  max_iterations=15  → budget = max(100K, 150K) = 150K
# Standard:         max_iterations=25  → budget = max(100K, 250K) = 250K
# Complex:          max_iterations=60  → budget = max(100K, 600K) = 600K
# Resume engine:    max_iterations=120 → budget = max(100K, 1.2M) = 1.2M

The formula has two components: a multiplier (10,000 tokens per iteration) and a floor (100,000 tokens). Each addresses a distinct failure mode.
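The formula is small enough to state as a standalone helper — a direct transcription of the expression above, with the example values from the table as sanity checks:

```python
def token_budget(max_iterations: int) -> int:
    """Input-token budget: 10K tokens per iteration, floored at 100K."""
    return max(100_000, max_iterations * 10_000)

# The floor dominates below 10 iterations; the multiplier takes over above it.
assert token_budget(5) == 100_000      # floor wins
assert token_budget(10) == 100_000     # break-even point
assert token_budget(15) == 150_000     # multiplier wins
assert token_budget(120) == 1_200_000  # resume engine
```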

Why 10,000 Tokens Per Iteration

Measured token usage across production workflows shows an average of 3,000–5,000 input tokens per agent turn, including system prompt amortization, conversation history, and tool result injection. The 10,000-token-per-iteration multiplier gives a 2–3× headroom above this average:

# Observed usage breakdown per turn (approximate):
# System prompt:       ~4,000 tokens (cached after turn 1: ~400 effective)
# Conversation history: ~2,000 tokens (grows, but trimmed by strategy)
# Tool definitions:     ~1,500 tokens (constant)
# New tool result:        ~500 tokens (varies widely by tool)
# ─────────────────────────────────────
# Typical turn total:   ~4,400 tokens (after caching + trimming)

# 10,000 tokens/iter × 25 iterations = 250K budget
# Expected actual usage: ~110K for same workflow
# Headroom ratio: ~2.3×

The headroom accounts for outlier turns. A turn where the agent receives a large document extraction result might consume 15,000 tokens; a turn where the agent requests user input after sending a short message might consume 2,000. Budgeting at 2–3× the observed average covers this worst-case spread without aborting normal runs.
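How much outlier tolerance does that headroom actually buy? A quick deterministic check, using the approximate per-turn figures from the breakdown above (4,400 typical, 15,000 for a large-extraction outlier):

```python
max_iter = 25
budget = max(100_000, max_iter * 10_000)   # 250_000
typical, outlier = 4_400, 15_000

# Even if a fifth of all turns hit the 15K outlier cost, the run stays in budget:
heavy = max_iter // 5                       # 5 outlier turns
usage = heavy * outlier + (max_iter - heavy) * typical
assert usage == 163_000 and usage < budget

# Break-even: how many of the 25 turns would need to be outliers to exhaust the budget?
n = 0
while n * outlier + (max_iter - n) * typical <= budget:
    n += 1
assert n == 14   # more than half the turns must be outliers before the rail fires
```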

Why a 100,000-Token Floor

Without the floor, low-iteration workflows would have extremely tight budgets:

# Without floor:
# max_iterations=5  → budget = 5 × 10,000 = 50,000 tokens
# A single large tool result (e.g., extracted text from a 50-page PDF)
# could consume 30,000 tokens — leaving only 20,000 for all other turns.
# This would falsely abort a perfectly normal short workflow.

# With floor:
# max_iterations=5  → budget = max(100,000, 50,000) = 100,000 tokens
# The short workflow has 100K tokens regardless of the multiplier result.
# This is generous for 5 iterations — but prevents false positives.

The floor is set at 100,000 to match the typical single-turn worst-case for workflows that process large inputs (scanned PDFs, lengthy documents, or image OCR outputs). A workflow that legitimately processes one large document in one tool call should never hit the budget abort — the floor ensures it won't.

Budget Enforcement

The budget is checked at the start of the reason node, before the language model is invoked for that turn:

import logging

logger = logging.getLogger(__name__)

async def reason(state: GoalAgentState) -> dict:
    max_iter = state["max_iterations"]
    _TOKEN_BUDGET_INPUT = max(100_000, max_iter * 10_000)
    job_id = state["job_id"]  # job identifier carried in workflow state

    # Check cumulative input tokens from all prior turns
    prev_usage = state.get("token_usage", {})
    cumulative_input = prev_usage.get("total_input", 0)

    if cumulative_input > _TOKEN_BUDGET_INPUT:
        logger.warning(
            "Token budget exceeded (%d > %d) for job %s",
            cumulative_input, _TOKEN_BUDGET_INPUT, job_id,
        )
        return {
            "is_complete": True,
            "error": f"Token budget exceeded ({cumulative_input:,} input tokens)",
            "current_step": "token_budget_exceeded",
        }

    # Budget check passed — proceed with LLM invocation
    ...

Token usage is accumulated in the token_usage state field using a merge reducer that sums values across turns:

def merge_token_usage(left: dict, right: dict) -> dict:
    """Accumulate token usage across turns (immutable merge)."""
    return {
        "total_input":  left.get("total_input",  0) + right.get("total_input",  0),
        "total_output": left.get("total_output", 0) + right.get("total_output", 0),
        "per_turn":     left.get("per_turn", []) + right.get("per_turn", []),
    }

# After each LLM call, node returns:
# {"token_usage": {"total_input": turn_input, "total_output": turn_output,
#                  "per_turn": [{"turn": N, "input": X, "output": Y}]}}

The per_turn list is kept for observability — each workflow job record includes a full per-turn token breakdown. This data informs budget calibration: if a workflow consistently uses only 40% of its budget, its max_iterations spec value could be reduced.
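That calibration step can be sketched directly from a job record's per_turn breakdown. The field names follow the state shape above; the records themselves are made up for illustration:

```python
def budget_utilization(per_turn: list[dict], max_iterations: int) -> float:
    """Fraction of the input-token budget a completed job actually used."""
    budget = max(100_000, max_iterations * 10_000)
    used = sum(turn["input"] for turn in per_turn)
    return used / budget

# 18 synthetic turns with slowly growing input cost (history accumulation)
per_turn = [{"turn": n, "input": 4_000 + 200 * n, "output": 300} for n in range(1, 19)]
util = budget_utilization(per_turn, max_iterations=25)
print(f"{util:.0%}")  # → 42% — consistently this far under budget suggests max_iterations can shrink
```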

Two Safety Rails, One Job

The iteration limit and token budget are independent safety rails that can fire at different points:

# Scenario A: Fast but expensive workflow
# Agent calls 8 tools per turn via parallel tool calls
# Turn 8: iteration_count = 8 (well under limit)
# Turn 8: cumulative_input = 450K (exceeds 250K budget for max_iter=25)
# → Token budget fires first

# Scenario B: Slow but efficient workflow
# Agent makes minimal tool calls, lots of text generation
# Turn 26: iteration_count = 26 (exceeds max_iterations=25)
# Turn 26: cumulative_input = 85K (well under budget)
# → Iteration limit fires first

# Scenario C: Normal workflow
# Turn 18: iteration_count = 18, cumulative_input = 70K
# complete_workflow called → finish node fires
# → Neither rail fires

In practice, the token budget fires rarely. The iteration limit is the primary protective rail for most workflows — the token budget exists for the unusual case where a workflow uses far more tokens per turn than its iteration count would suggest. Seeing a token budget abort in logs is a diagnostic signal: something unusual happened in that workflow's execution, worth investigating.

A fixed token budget set high enough for complex workflows is meaningless for simple ones, and a budget sized for simple workflows aborts complex ones prematurely. Deriving the budget from max_iterations scales protection to the actual risk each workflow carries.