Token Optimization in Code Agents: Beyond CLAUDE.md
Introduction
The obsession with context quantity is a recurring problem in AI agent adoption for development. Developers often feel compelled to push as much information as possible into models, believing more data will yield better responses. The intuition is appealing but empirically false: researchers and production practitioners alike have found that context quality vastly outperforms context quantity.
The reality is more nuanced: LLMs function as stateless functions with finite context windows. Every token is precious, and how you structure and deliver information determines whether the agent stays focused on the task or gets "lost" in irrelevant context. This article goes beyond simple .cursorrules and CLAUDE.md configurations, exploring context engineering strategies that professionals use daily with Cursor, Windsurf, Zed, and Claude Code.
1. The Real Problem: Attention Degradation, Not Data Scarcity
The Empirical Reality
Recent research in LLM-based code generation reveals a counterintuitive pattern: past a certain threshold, adding tokens makes the model perform worse, not better.
A Refine AI case study illustrates this clearly. When attempting to improve dashboard and data analysis component generation:
- Iteration 1: Include entire library documentation, style guides, and examples. Result: 10% success rate, ~100k input tokens.
- Iteration 2: Attempt to structure this in a three-stage prompt. Result: 20% success rate, but with paradoxical token growth to ~500k.
- Iteration 3: Isolated multi-agent architecture. Result: 80% success rate, with 90% token reduction compared to the previous iteration.
The critical insight: the model wasn't failing due to lack of information. It was failing because model attention was diluted between relevant context and noise.
Why This Happens
LLMs allocate "attention power" non-uniformly. When you provide 500k tokens of context, the model:
- Cannot effectively process all the context — attention degrades non-uniformly as token quantity increases.
- Prioritizes peripheries — the model tends to pay more attention to the beginning (system prompt, CLAUDE.md) and end (your current query) than the middle.
- Suffers from "path-of-least-resistance" — it prefers generating generic responses using common patterns instead of exploring the provided context.
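The primacy/recency bias above suggests a practical ordering rule: put rules at the start, bulky reference material in the middle, and restate the concrete task at the end. A minimal sketch (the function name and section layout are illustrative assumptions, not any tool's API):

```python
def assemble_prompt(system_rules: str, reference_context: list[str], task: str) -> str:
    """Order prompt sections to exploit primacy/recency bias:
    rules first, bulky reference material in the middle,
    the concrete task restated last."""
    parts = [system_rules]                           # beginning: high attention
    parts.extend(reference_context)                  # middle: lowest attention; keep it short
    parts.append(f"TASK (read carefully): {task}")   # end: high attention
    return "\n\n".join(parts)

prompt = assemble_prompt(
    "You are a focused refactoring agent.",
    ["## auth/login.py excerpt\n...only the relevant function..."],
    "Refactor login() to use the new AuthService.",
)
```

The same ordering is why CLAUDE.md (loaded first) and your current query (last) get disproportionate attention, while attached files in between get the least.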
2. Multi-Agent Architecture: Divide and Conquer
The Context Window Paradox
Waiting for larger context windows is a flawed strategy. Even with 200k tokens (Claude Opus), 128k-200k (Cursor Max Mode), or 2M (experimental models), the problem persists: information is not the bottleneck, focus is.
The solution established by production communities is separation of concerns through multiple specialized agents.
Pattern: Research-Plan-Implement-Review
Instead of a single agent doing everything (the problem CLAUDE.md attempts to solve), distribute the task:
┌──────────────────────┐
│ Research Agent │ (Identifies patterns and relevant references)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Planning Agent │ (Structures approach and roadmap)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Implementation Agent │ (Executes with focused context)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Review Agent │ (Validates and refines iteratively)
└──────────────────────┘
Why it works:
- Each agent inherits only context relevant to its phase
- The Research Agent doesn't carry entire conversation history
- The Implementation Agent receives clear specifications, not a braindump
- The overall result is better (80% vs. 10% success) while using 90% fewer tokens
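The handoff discipline described above can be sketched in a few lines. The `Stage` class below is a hypothetical stand-in for a real agent call; the point is that each stage receives only the artifact produced by the previous stage, never the accumulated conversation history:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """Hypothetical stand-in for one specialized agent in the pipeline."""
    name: str
    instructions: str

    def run(self, handoff: str) -> str:
        # In a real system this would call an LLM with `instructions`
        # plus `handoff` as its *entire* context.
        return f"[{self.name}] output based on: {handoff[:60]}"

def run_pipeline(task: str) -> str:
    stages = [
        Stage("research", "Find relevant files and patterns."),
        Stage("plan", "Turn research notes into a step-by-step plan."),
        Stage("implement", "Execute the plan; touch nothing else."),
        Stage("review", "Validate the diff against the plan."),
    ]
    handoff = task
    for stage in stages:
        handoff = stage.run(handoff)  # only the latest artifact flows forward
    return handoff
```

Because `handoff` is the only state passed along, each agent's context window starts nearly empty — which is exactly where the 90% token reduction comes from.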
Implementation in Cursor 2.0 / Windsurf / Claude Code
Cursor 2.0 (released October 2025) offers up to 8 parallel agents via git worktrees or remote machines. Each agent works independently without conflicts:
# Cursor 2.0 automatically creates isolated worktrees
cursor-agent-1: feature-backend (Composer model)
cursor-agent-2: feature-frontend (Composer model)
cursor-agent-3: tests-e2e (GPT-4 Codex)
# You compare results and choose the best
Windsurf implements Cascade 2.0 (in development) with a "one agent writes, another reviews and annotates" pattern. Each agent has a separate context and can execute up to 20 tool calls sequentially without manual intervention.
Claude Code offers Sub-agents (native since August 2025) — specializations with their own system prompts, tool permissions, and isolated context window. Example structure:
# .claude/agents/code-reviewer.md
---
name: code-reviewer
description: Analyzes code for patterns, security, and performance
tools:
- grep
- file-read
model: haiku # Use faster/cheaper model for focused tasks
---
You are a senior code reviewer specializing in...
3. Intentional Context Compression
The Problem with Auto-Compaction
Claude Code automatically summarizes at 95% of the window. Cursor truncates internally to "maintain performance". No tool offers explicit control over what is preserved and why.
Strategy: Structured Notes with Explicit Decisions
Instead of letting the agent auto-summarize, you define what's important upfront:
# Project Status - Sprint 24
## Architectural Decisions (change only with explicit justification)
- **Use FastAPI + SQLAlchemy** — reason: native async support, active community
- **Database: PostgreSQL 14+** — reason: JSONB, full-text search, advanced constraints
- **Authentication: JWT with refresh tokens** — reason: stateless, ideal for horizontal scaling
## Resolved Issues
- ✓ N+1 queries in /users endpoint (eager loading added)
- ✓ WebSocket cleanup memory leak (implemented context manager)
## In-Progress Issues
- 🔴 CRITICAL: Timeout on reports with >100k rows (optimize query, add pagination)
- 🟡 MEDIUM: Insufficient validation on file uploads (add sanitization)
## Recent Technical Context
- Modified `profiles` schema to include `last_sync_at` (migration #47)
- Implemented pagination in GET /reports endpoint (20-100 items/page)
This approach is superior to auto-summaries because:
- Preserves intent — the agent knows why each decision was made
- Enables search — finding "why we use FastAPI" is trivial
- Reduces re-explanation — new agents understand context without ambiguity
- Supports versioning — commit this to git, track evolution of decisions
Practical Implementation in Your Projects
Create a .agent-state/progress.md file in your repository:
# .agent-state/progress.md
## Last Agent Session
- **Previous Agent**: ImplementationAgent (FastAPI endpoints)
- **Time Spent**: 2.5 hours
- **Completed**: 3 endpoints GET/POST/PATCH implemented + unit tests
- **Status**: 🟢 Ready for review
## Next Steps
1. ReviewAgent: Validate data types, edge cases, security
2. TestAgent: Add integration tests (database)
3. DocsAgent: Update Swagger/OpenAPI spec
## Known Issues for Next Session
- In-memory cache causing inconsistency with database data (pending debugging)
- POST /items endpoint not yet paginated
Instruct your agent: "Read .agent-state/progress.md before starting any new task and update it before finishing."
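You can also automate the "update it before finishing" step with a small helper. This is a sketch under stated assumptions — `append_session` and the section layout are hypothetical, chosen to mirror the progress.md format above:

```python
import tempfile
from datetime import date
from pathlib import Path

def append_session(state_file: Path, agent: str, completed: str,
                   next_steps: list[str]) -> None:
    """Append a session summary so the next agent can pick up
    without re-reading the whole conversation history."""
    lines = [
        f"\n## Session {date.today().isoformat()}",
        f"- **Previous Agent**: {agent}",
        f"- **Completed**: {completed}",
        "## Next Steps",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, 1)]
    state_file.parent.mkdir(parents=True, exist_ok=True)
    with state_file.open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Demo against a temporary directory standing in for your repo root
demo = Path(tempfile.mkdtemp()) / ".agent-state" / "progress.md"
append_session(demo, "ImplementationAgent", "3 endpoints + unit tests",
               ["ReviewAgent: validate edge cases"])
content = demo.read_text(encoding="utf-8")
```

Run it from a post-task hook so the state file stays current even when you forget to prompt for it.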
4. Intelligent Search and Contextual Retrieval (Enhanced RAG)
The Anti-Pattern: "Attach Everything"
Many developers attach 50 files hoping the agent will "find" the relevant ones. Result: wasted tokens, diluted attention, increased latency.
Solution: Iterative On-Demand Retrieval
Windsurf uses semantic indexing on Abstract Syntax Tree (AST) — not entire files, but semantic blocks (functions, classes, etc). Result: 3x better retrieval accuracy compared to simple methods like naive grep.
Cursor 2.0 with Composer model uses three-layer search:
- Semantic embeddings — "which file probably contains email validation?"
- Exact-match grep — search for `email_validator`
- AST parsing — understand structure and dependencies
How to Implement With Your Agents
Instruct the agent to explore proactively:
"Explore the repository proactively:
1. List structure with depth=2 (tree -L 2)
2. Identify critical files by size/frequency
3. Search for specific patterns: /^\s*class\s+\w+.*Exception/
4. Read only what's necessary for the task
5. When in doubt about patterns, search for examples (grep -r 'pattern' --include='*.py')"
This keeps the context window clean and focused. The agent doesn't load 50 files — it loads 3-4 directly relevant ones.
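The narrowing step can be sketched as a small grep-style filter. This is a minimal illustration, not any tool's retrieval engine; the function name and the 4-file cap are assumptions matching the "3-4 directly relevant files" claim above:

```python
import re
import tempfile
from pathlib import Path

def find_relevant_files(root: Path, pattern: str, max_files: int = 4) -> list[Path]:
    """Grep-style narrowing: scan the tree, keep only files whose
    contents match the pattern, and cap how many enter the context."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            if re.search(pattern, path.read_text(encoding="utf-8")):
                hits.append(path)
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        if len(hits) >= max_files:
            break
    return hits

# Demo against a temporary two-file repo
root = Path(tempfile.mkdtemp())
(root / "auth.py").write_text("class AuthService:\n    pass\n", encoding="utf-8")
(root / "utils.py").write_text("def helper():\n    pass\n", encoding="utf-8")
hits = find_relevant_files(root, r"class\s+\w*Auth")
```

Only the matching file enters the context window; the rest of the tree costs zero tokens.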
5. Sub-Agents vs. Slash Commands vs. Skills
Confusion about "which to use when" is common. Here's the decision matrix (based on Claude Code stack):
| Resource | When to Use | Isolation | Automatic? | Example |
|---|---|---|---|---|
| Slash Command | Manual, repeatable workflow | ❌ No | ❌ Manual | /run-tests |
| Skill | Auto-activate by context | ❌ No | ✅ Automatic | Detect package.json → activate Node.js skill |
| Sub-agent | Isolated task, own context | ✅ Yes (isolated) | ✅ Automatic | Code reviewer, test generator |
| Hook | React to lifecycle events | ❌ No | ✅ Automatic | On save file, run linter |
"Divide & Conquer" paradigm: Use sub-agents for true parallelization, slash commands for frequent manual workflows, skills for automatic context detection.
6. State Management with MCPs (Model Context Protocol)
The Problem with Ephemeral Context
Each new session/conversation is a reset. You lose:
- Previous architectural decisions
- Patterns that worked
- Known bugs
- Accumulated project context
Solution: MCPs as Persistent Memory
MCPs (standardized protocol for tool integration) allow agents to access structured data without loading it into context:
{
"name": "project-state-mcp",
"description": "Access to persistent project state",
"tools": [
{
"name": "get_architecture_decisions",
"description": "Returns architectural decisions with justifications"
},
{
"name": "get_known_issues",
"description": "Known bugs and previous resolutions"
},
{
"name": "get_approved_patterns",
"description": "Approved code patterns in the project"
}
]
}
Production MCP Examples (December 2025):
- Serena MCP — comprehensive toolkit for code agents (semantic search, intelligent editing). Integrates with Claude Code and works with sub-agents
- Google Cloud Security MCP — connects SOC personas with security runbooks
- Shortcut MCP — integrates project management with agent context
Benefit: The agent doesn't need to re-read everything — it accesses via tool call, saving 60-80% of tokens.
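The access pattern can be illustrated without the real MCP SDK. The dispatcher below is a hypothetical in-process stand-in: the agent asks a named tool for one slice of state and gets a small JSON payload back, instead of the whole state file being pasted into the prompt:

```python
import json

# Hypothetical project state; in a real MCP server this would live
# in a database or on disk, never in the agent's context window.
PROJECT_STATE = {
    "architecture_decisions": [
        {"decision": "FastAPI + SQLAlchemy", "reason": "native async support"},
    ],
    "known_issues": [
        {"issue": "timeout on >100k-row reports", "status": "open"},
    ],
}

TOOLS = {
    "get_architecture_decisions": lambda: PROJECT_STATE["architecture_decisions"],
    "get_known_issues": lambda: PROJECT_STATE["known_issues"],
}

def handle_tool_call(name: str) -> str:
    """Return only the requested slice of state as JSON."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool {name!r}"})
    return json.dumps(TOOLS[name]())
```

Each call costs tokens proportional to the answer, not to the total accumulated state — which is the source of the 60-80% savings.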
7. New: Parallel Agents in Cursor 2.0
What Changed (October 2025)
Cursor 2.0 introduced up to 8 simultaneous agents using git worktrees or remote machines:
Agent 1 (Composer model, worktree-1): Backend refactoring
Agent 2 (Composer model, worktree-2): Frontend components
Agent 3 (GPT-4 Codex, worktree-3): E2E tests
Each worktree is an isolated clone of the same repo — changes don't conflict. You compare the 3 results and choose the best.
Practical implication:
- Complex tasks solved 2-3x faster
- Result comparison for hard problems (multiple models tackling the same issue)
- Git worktrees automatically managed by Cursor
8. Windsurf Cascade: Semantic Indexing
How It Works (December 2025)
Windsurf pre-scans your entire repository and creates a semantic index on AST:
Your code:
├── auth/
│ ├── login.py → [login function, AuthService class, imports]
│ ├── jwt.py → [encode_token function, decode_token function]
├── models/
│ ├── user.py → [User class, validators]
│ ├── schema.py → [Pydantic BaseModel subclasses]
Windsurf's semantic index:
- Entity: "login" (function) → file: auth/login.py:15
- Entity: "AuthService" (class) → file: auth/login.py:42
- Dependency: AuthService → jwt.encode_token
- Dependency: User model → schema.py
When you ask "refactor authentication to use JWT", Cascade understands structure without reading entire files.
Advantage over grep: Structural context, not just strings.
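The core of such an index can be sketched with Python's standard-library `ast` module. This is a toy version of the idea, not Windsurf's implementation: one entry per function or class, with name, kind, and line number — the entity map shown above:

```python
import ast

def index_module(source: str, filename: str) -> list[dict]:
    """Build a tiny semantic index: one entry per function/class,
    with name, kind, and line number."""
    entries = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append({"entity": node.name, "kind": "function",
                            "file": filename, "line": node.lineno})
        elif isinstance(node, ast.ClassDef):
            entries.append({"entity": node.name, "kind": "class",
                            "file": filename, "line": node.lineno})
    return entries

entries = index_module(
    "class AuthService:\n    def login(self):\n        pass\n",
    "auth/login.py",
)
```

An agent querying this index learns that `login` lives at `auth/login.py` without ever loading the file's body into context.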
9. Intentional Compression vs. Auto-Compaction
Auto-Compression Failure Patterns
Stanford research (2024) identifies 3 patterns where automatic compression fails:
- "Lost by the boundary" — critical information at chunk boundaries is lost
- "Lost if surprise" — unexpected/outlier information is discarded
- "Lost along the way" — data in the middle of context suffers degradation
Strategy: Controlled Compression with Explicit Ranks
# Manual Compression - Session #47
## Preserve (Critical - never discard)
- Decision: Use Row-Level Security in PostgreSQL
- Reason: Multi-tenant data security requirement
- Reference file: `docs/security-design.md`
- Priority: 🔴 CRITICAL
## Preserve (Important - if space available)
- Exponential backoff with jitter retry pattern
- Reason: Prevent thundering herd in rate limiting
## Discard (Already solved, no future relevance)
- Discussions on ORM selection (resolved with SQLAlchemy, no alternatives)
- Python syntax debugging (isolated problem)
## New Context Window
- Start new session with CLAUDE.md + progress.md + decisions file
- "Implementation" sub-agent will receive only focused spec
10. Model Selection: The Cheapest Token is the One You Don't Use
Selection Matrix (December 2025)
| Task | Model | Tokens | Example |
|---|---|---|---|
| Log analysis | Haiku / Sonnet 3.5 | 10k | Find error pattern |
| Module refactoring | Sonnet 4.0 | 50-100k | Reorganize auth service |
| Architecture design | Opus / Claude Pro | 150k+ | Redesign scalability |
| Performance optimization | Opus + thinking mode | 300k+ | Profile and optimize bottleneck |
Anti-Pattern
❌ Use Opus for EVERYTHING because "it's the best"
❌ Use Haiku for complex refactoring
❌ Enable "thinking mode" by default
Cursor 2.0 allows automatic model selection per task — configure "Auto" for simple tasks, Sonnet for moderate, Opus rarely.
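A routing rule like the matrix above is a few lines of code. The task-type prefixes and model names below are illustrative assumptions following the table, not an official API:

```python
# Hypothetical task-complexity router following the selection matrix.
ROUTES = [
    ("log_analysis", "haiku"),
    ("refactor", "sonnet"),
    ("architecture", "opus"),
]

def pick_model(task_type: str, default: str = "sonnet") -> str:
    """Return the cheapest model adequate for the task type."""
    for prefix, model in ROUTES:
        if task_type.startswith(prefix):
            return model
    return default
```

The default deliberately falls through to a mid-tier model: unknown tasks should not silently escalate to the most expensive option.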
11. The Role of Thinking Mode
Models with explicit reasoning ("thinking") can solve complex problems, but consume tokens exponentially.
When to Use Thinking
✅ Critical architectural decisions (e.g., database migration)
✅ Complex performance optimizations (e.g., distributed sorting algorithm)
✅ Deep security analysis (e.g., cryptography audit)
When to Avoid
❌ Simple refactoring
❌ Typo fixes
❌ Boilerplate generation
Golden Rule: Thinking mode = 3-5x more tokens. Use only when quality justifies it.
12. Technical Terminology and Best Practices
To keep terminology consistent throughout the article:

| Technical Term | Preferred Terminology |
|---|---|
| Token dilution | Attention degradation, context pollution |
| Context window | Context window / context limit |
| System prompt | System prompt / system instruction |
| Tool calling | Tool invocation / function calling |
| Stateless function | Function without persistent memory |
| Auto-compact | Automatic compaction / auto-compression |
| Context pollution | Context contamination |
| Embedding | Semantic vector / vector representation |
| Retrieval-Augmented Generation (RAG) | Retrieval + generation |
| Gist token compression | Compressed context synthesis |
13. Continuous Monitoring and Optimization
Metrics That Matter
- Success Rate — % of tasks completed on first attempt
- Tokens per Success — cost of each successful task
- Response Time — total latency (not linear with tokens)
- Context Utilization — % of window actually used
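The second metric is the one that catches false economies: cutting tokens is pointless if success drops faster. A minimal sketch, assuming per-run logs shaped as `{"tokens": int, "success": bool}` (a hypothetical format, not any tool's export):

```python
def tokens_per_success(runs: list[dict]) -> float:
    """Cost of each *successful* task: total tokens spent across all
    runs (including failures) divided by the number of successes."""
    successes = sum(1 for r in runs if r["success"])
    total_tokens = sum(r["tokens"] for r in runs)
    return total_tokens / successes if successes else float("inf")

runs = [
    {"tokens": 100, "success": True},
    {"tokens": 50, "success": False},   # failed runs still count as cost
    {"tokens": 150, "success": True},
]
avg = tokens_per_success(runs)
```

Note that failed runs are charged to the successes — a 60% success rate inflates the effective cost of every completed task.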
Observability Tools (December 2025)
- Cursor: View tokens/stats in UI per operation
- Windsurf: Detailed retrieval + token logs in `~/.windsurf/logs`
- Claude Code: Retrospective trace analysis via Claude Dashboard
- Qodo: Metrics per executed agent
Realistic Optimization Pipeline
Week 1: Measure baseline
Metric: 150k tokens/task, 60% success
Week 2: Implement intelligent RAG
Metric: 80k tokens, 75% success
Week 3: Add MCPs
Metric: 50k tokens, 85% success
Week 4: Multi-agent (research-plan-implement)
Metric: 40k tokens, 90% success
Week 5+: Incremental refinements
Metric: 30k tokens, 92% success
14. Common Anti-Patterns
❌ Sending Everything At Once
# BAD
agent.run("""
TODO: Refactor the authentication module.
Here is the ENTIRE codebase:
[500 files attached]
Here is the documentation:
[50 PDFs]
""")
✅ Iterative Retrieval
# GOOD
agent.run("""
TODO: Refactor the authentication module.
Explore:
1. Search for classes with 'Auth' in the name
2. Identify the entry point function
3. Map dependencies
4. Then review structure
""")
❌ One Agent Doing Everything
# BAD - polluted context
agent = create_agent(
instructions="""
You are a full-stack developer who:
- Writes FastAPI backends
- Writes React frontends
- Deploys with Docker
- Writes tests
- Documents everything
- Optimizes performance
"""
)
✅ Specialized Agents
# GOOD - focused context
research_agent = SubAgent(instructions="Analyze codebase and propose architecture")
backend_agent = SubAgent(instructions="Implement FastAPI endpoints")
test_agent = SubAgent(instructions="Write pytest tests")
review_agent = SubAgent(instructions="Review code for patterns and security")
# Orchestrate manually or let Claude Code coordinate
Conclusion: Tokens as a Finite Resource
Token optimization is fundamentally about treating tokens as a finite and valuable resource, similar to RAM or CPU in traditional systems.
Principles applied by production professionals in December 2025:
- Quality > Quantity — focused context outperforms abundant context
- Multi-Agent Architecture — split complex problems into specialized tasks
- Intentional Compression — control what's memorized, don't rely on auto-summaries
- Iterative Search — load context on-demand, not all at once
- Appropriate Model Selection — use the right model for each task
- Semantic Indexing — leverage AST and embeddings, not naive grep
- MCPs for Memory — access persistent state via tools, not context
- Observability — measure and optimize continuously
- Sub-agents for Isolation — each agent specialized, own context
- Parallel Agents — Cursor 2.0 allows 8 simultaneous agents via git worktrees
Applying these techniques isn't just about saving tokens — it's about building AI agents that reason better, deliver superior code quality, and cost less in the process.
Technical References
- Anthropic Engineering: "Effective context engineering for AI agents" (2025)
- Refine AI Case Study: "Quality Code Generation: Multi-Agent Systems and Token Dilution" (90% token reduction with multi-agent)
- Cursor 2.0 Release: "Introducing Cursor 2.0 and Composer" (October 2025) — git worktrees, 8 parallel agents
- Windsurf: AST-based semantic indexing, Cascade 2.0 roadmap (code review + benchmarking)
- Claude Code Documentation: Sub-agents, MCP protocol extensions, context isolation patterns
- Serena MCP: Semantic code understanding toolkit for code agents (GitHub: oraios/serena)
- HumanLayer Blog: "Writing a good CLAUDE.md" (instruction-following research)
- Research Papers:
- "Self-Organized Agents" (SoA framework)
- "Chain of Agents" (CoA framework)
- "Gist Token-based Context Compression" (Stanford, failure patterns)
- "LLMLingua: Compressing Prompts for Accelerated Inference"