Token Optimization in Code Agents: Beyond CLAUDE.md
Introduction
The obsession with context quantity is a recurring problem in AI agent adoption for development. Developers often feel compelled to push as much information as possible into models, believing more data will yield better responses. The intuition is appealing but empirically false: researchers and production practitioners alike have found that context quality vastly outperforms context quantity.
The reality is more nuanced: LLMs function as stateless functions with finite context windows. Every token is precious, and how you structure and deliver information determines whether the agent stays focused on the task or gets "lost" in irrelevant context. This article goes beyond simple .cursorrules and CLAUDE.md configurations, exploring context engineering strategies that professionals use daily with Cursor, Windsurf, Zed, and Claude Code.
1. The Real Problem: Attention Degradation, Not Data Scarcity
The Empirical Reality
Recent research in LLM-based code generation reveals a counterintuitive pattern: past a certain threshold, adding tokens makes the model perform worse, not better.
A Refine AI case study illustrates this clearly. When attempting to improve dashboard and data analysis component generation:
- Iteration 1: Include entire library documentation, style guides, and examples. Result: 10% success rate, ~100k input tokens.
- Iteration 2: Attempt to structure this in a three-stage prompt. Result: 20% success rate, but with paradoxical token growth to ~500k.
- Iteration 3: Isolated multi-agent architecture. Result: 80% success rate, with 90% token reduction compared to the previous iteration.
The critical insight: the model wasn't failing due to lack of information. It was failing because model attention was diluted between relevant context and noise.
Why This Happens
LLMs allocate "attention power" non-uniformly. When you provide 500k tokens of context, the model:
- Cannot effectively process all the context — attention degrades non-uniformly as token quantity increases.
- Prioritizes peripheries — the model tends to pay more attention to the beginning (system prompt, CLAUDE.md) and end (your current query) than the middle.
- Suffers from "path-of-least-resistance" — it prefers generating generic responses using common patterns instead of exploring the provided context.
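The primacy/recency bias above suggests a practical ordering rule: put rules at the start, bulky reference material in the middle, and restate the concrete task at the end. A minimal sketch (the function name and section layout are illustrative assumptions, not any tool's API):

```python
def assemble_prompt(system_rules: str, reference_context: list[str], task: str) -> str:
    """Order prompt sections to exploit primacy/recency bias:
    rules first, bulky reference material in the middle,
    the concrete task restated last."""
    parts = [system_rules]                           # beginning: high attention
    parts.extend(reference_context)                  # middle: lowest attention; keep it short
    parts.append(f"TASK (read carefully): {task}")   # end: high attention
    return "\n\n".join(parts)

prompt = assemble_prompt(
    "You are a focused refactoring agent.",
    ["## auth/login.py excerpt\n...only the relevant function..."],
    "Refactor login() to use the new AuthService.",
)
```

The same ordering is why CLAUDE.md (loaded first) and your current query (last) get disproportionate attention, while attached files in between get the least.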
2. Multi-Agent Architecture: Divide and Conquer
The Context Window Paradox
Waiting for larger context windows is a flawed strategy. Even with 200k tokens (Claude Opus), 128k-200k (Cursor Max Mode), or 2M (experimental models), the problem persists: information is not the bottleneck, focus is.
The solution established by production communities is separation of concerns through multiple specialized agents.
Pattern: Research-Plan-Implement-Review
Instead of a single agent doing everything (the problem CLAUDE.md attempts to solve), distribute the task:
┌──────────────────────┐
│ Research Agent │ (Identifies patterns and relevant references)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Planning Agent │ (Structures approach and roadmap)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Implementation Agent │ (Executes with focused context)
└──────────┬───────────┘
│
↓
┌──────────────────────┐
│ Review Agent │ (Validates and refines iteratively)
└──────────────────────┘
Why it works:
- Each agent inherits only context relevant to its phase
- The Research Agent doesn't carry entire conversation history
- The Implementation Agent receives clear specifications, not a braindump
- The overall result is better (80% vs. 10% success) while using 90% fewer tokens
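The handoff discipline described above can be sketched in a few lines. The `Stage` class below is a hypothetical stand-in for a real agent call; the point is that each stage receives only the artifact produced by the previous stage, never the accumulated conversation history:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """Hypothetical stand-in for one specialized agent in the pipeline."""
    name: str
    instructions: str

    def run(self, handoff: str) -> str:
        # In a real system this would call an LLM with `instructions`
        # plus `handoff` as its *entire* context.
        return f"[{self.name}] output based on: {handoff[:60]}"

def run_pipeline(task: str) -> str:
    stages = [
        Stage("research", "Find relevant files and patterns."),
        Stage("plan", "Turn research notes into a step-by-step plan."),
        Stage("implement", "Execute the plan; touch nothing else."),
        Stage("review", "Validate the diff against the plan."),
    ]
    handoff = task
    for stage in stages:
        handoff = stage.run(handoff)  # only the latest artifact flows forward
    return handoff
```

Because `handoff` is the only state passed along, each agent's context window starts nearly empty — which is exactly where the 90% token reduction comes from.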
Implementation in Cursor 2.0 / Windsurf / Claude Code
Cursor 2.0 (released October 2025) offers up to 8 parallel agents via git worktrees or remote machines. Each agent works independently without conflicts:
# Cursor 2.0 automatically creates isolated worktrees
cursor-agent-1: feature-backend (Composer model)
cursor-agent-2: feature-frontend (Composer model)
cursor-agent-3: tests-e2e (GPT-4 Codex)
# You compare results and choose the best
Windsurf implements Cascade 2.0 (in development) with a "one agent writes, another reviews and annotates" pattern. Each agent has a separate context and can execute up to 20 tool calls sequentially without manual intervention.
Claude Code offers Sub-agents (native since August 2025) — specializations with their own system prompts, tool permissions, and isolated context window. Example structure:
# .claude/agents/code-reviewer.md
---
name: code-reviewer
description: Analyzes code for patterns, security, and performance
tools:
- grep
- file-read
model: haiku # Use faster/cheaper model for focused tasks
---
You are a senior code reviewer specializing in...
3. Intentional Context Compression
The Problem with Auto-Compaction
Claude Code automatically summarizes at 95% of the window. Cursor truncates internally to "maintain performance". No tool offers explicit control over what is preserved and why.
Strategy: Structured Notes with Explicit Decisions
Instead of letting the agent auto-summarize, you define what's important upfront:
# Project Status - Sprint 24
## Architectural Decisions (change only with explicit justification)
- **Use FastAPI + SQLAlchemy** — reason: native async support, active community
- **Database: PostgreSQL 14+** — reason: JSONB, full-text search, advanced constraints
- **Authentication: JWT with refresh tokens** — reason: stateless, ideal for horizontal scaling
## Resolved Issues
- ✓ N+1 queries in /users endpoint (eager loading added)
- ✓ WebSocket cleanup memory leak (implemented context manager)
## In-Progress Issues
- 🔴 CRITICAL: Timeout on reports with >100k rows (optimize query, add pagination)
- 🟡 MEDIUM: Insufficient validation on file uploads (add sanitization)
## Recent Technical Context
- Modified `profiles` schema to include `last_sync_at` (migration #47)
- Implemented pagination in GET /reports endpoint (20-100 items/page)
This approach is superior to auto-summaries because:
- Preserves intent — the agent knows why each decision was made
- Enables search — finding "why we use FastAPI" is trivial
- Reduces re-explanation — new agents understand context without ambiguity
- Supports versioning — commit this to git, track evolution of decisions
Practical Implementation in Your Projects
Create a .agent-state/progress.md file in your repository:
# .agent-state/progress.md
## Last Agent Session
- **Previous Agent**: ImplementationAgent (FastAPI endpoints)
- **Time Spent**: 2.5 hours
- **Completed**: 3 endpoints GET/POST/PATCH implemented + unit tests
- **Status**: 🟢 Ready for review
## Next Steps
1. ReviewAgent: Validate data types, edge cases, security
2. TestAgent: Add integration tests (database)
3. DocsAgent: Update Swagger/OpenAPI spec
## Known Issues for Next Session
- In-memory cache causing inconsistency with database data (pending debugging)
- POST /items endpoint not yet paginated
Instruct your agent: "Read .agent-state/progress.md before starting any new task and update it before finishing."
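You can also automate the "update it before finishing" step with a small helper. This is a sketch under stated assumptions — `append_session` and the section layout are hypothetical, chosen to mirror the progress.md format above:

```python
import tempfile
from datetime import date
from pathlib import Path

def append_session(state_file: Path, agent: str, completed: str,
                   next_steps: list[str]) -> None:
    """Append a session summary so the next agent can pick up
    without re-reading the whole conversation history."""
    lines = [
        f"\n## Session {date.today().isoformat()}",
        f"- **Previous Agent**: {agent}",
        f"- **Completed**: {completed}",
        "## Next Steps",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, 1)]
    state_file.parent.mkdir(parents=True, exist_ok=True)
    with state_file.open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

# Demo against a temporary directory standing in for your repo root
demo = Path(tempfile.mkdtemp()) / ".agent-state" / "progress.md"
append_session(demo, "ImplementationAgent", "3 endpoints + unit tests",
               ["ReviewAgent: validate edge cases"])
content = demo.read_text(encoding="utf-8")
```

Run it from a post-task hook so the state file stays current even when you forget to prompt for it.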
4. Intelligent Search and Contextual Retrieval (Enhanced RAG)
The Anti-Pattern: "Attach Everything"
Many developers attach 50 files hoping the agent will "find" the relevant ones. Result: wasted tokens, diluted attention, increased latency.
Solution: Iterative On-Demand Retrieval
Windsurf uses semantic indexing on Abstract Syntax Tree (AST) — not entire files, but semantic blocks (functions, classes, etc). Result: 3x better retrieval accuracy compared to simple methods like naive grep.
Cursor 2.0 with Composer model uses three-layer search:
- Semantic embeddings — "which file probably contains email validation?"
- Exact-match grep — search for `email_validator`
- AST parsing — understand structure and dependencies
How to Implement With Your Agents
Instruct the agent to explore proactively:
"Explore the repository proactively:
1. List structure with depth=2 (tree -L 2)
2. Identify critical files by size/frequency
3. Search for specific patterns: /^\s*class\s+\w+.*Exception/
4. Read only what's necessary for the task
5. When in doubt about patterns, search for examples (grep -r 'pattern' --include='*.py')"
This keeps the context window clean and focused. The agent doesn't load 50 files — it loads 3-4 directly relevant ones.
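The narrowing step can be sketched as a small grep-style filter. This is a minimal illustration, not any tool's retrieval engine; the function name and the 4-file cap are assumptions matching the "3-4 directly relevant files" claim above:

```python
import re
import tempfile
from pathlib import Path

def find_relevant_files(root: Path, pattern: str, max_files: int = 4) -> list[Path]:
    """Grep-style narrowing: scan the tree, keep only files whose
    contents match the pattern, and cap how many enter the context."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            if re.search(pattern, path.read_text(encoding="utf-8")):
                hits.append(path)
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        if len(hits) >= max_files:
            break
    return hits

# Demo against a temporary two-file repo
root = Path(tempfile.mkdtemp())
(root / "auth.py").write_text("class AuthService:\n    pass\n", encoding="utf-8")
(root / "utils.py").write_text("def helper():\n    pass\n", encoding="utf-8")
hits = find_relevant_files(root, r"class\s+\w*Auth")
```

Only the matching file enters the context window; the rest of the tree costs zero tokens.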
5. Sub-Agents vs. Slash Commands vs. Skills
Confusion about "which to use when" is common. Here's the decision matrix (based on Claude Code stack):
| Resource | When to Use | Isolation | Automatic? | Example |
|---|---|---|---|---|
| Slash Command | Manual, repeatable workflow | ❌ No | ❌ Manual | /run-tests |
| Skill | Auto-activate by context | ❌ No | ✅ Automatic | Detect package.json → activate Node.js skill |
| Sub-agent | Isolated task, own context | ✅ Yes (isolated) | ✅ Automatic | Code reviewer, test generator |
| Hook | React to lifecycle events | ❌ No | ✅ Automatic | On save file, run linter |
"Divide & Conquer" paradigm: Use sub-agents for true parallelization, slash commands for frequent manual workflows, skills for automatic context detection.
6. State Management with MCPs (Model Context Protocol)
The Problem with Ephemeral Context
Each new session/conversation is a reset. You lose:
- Previous architectural decisions
- Patterns that worked
- Known bugs
- Accumulated project context
Solution: MCPs as Persistent Memory
MCPs (standardized protocol for tool integration) allow agents to access structured data without loading it into context:
{
"name": "project-state-mcp",
"description": "Access to persistent project state",
"tools": [
{
"name": "get_architecture_decisions",
"description": "Returns architectural decisions with justifications"
},
{
"name": "get_known_issues",
"description": "Known bugs and previous resolutions"
},
{
"name": "get_approved_patterns",
"description": "Approved code patterns in the project"
}
]
}
Production MCP Examples (December 2025):
- Serena MCP — comprehensive toolkit for code agents (semantic search, intelligent editing). Integrates with Claude Code and works with sub-agents
- Google Cloud Security MCP — connects SOC personas with security runbooks
- Shortcut MCP — integrates project management with agent context
Benefit: The agent doesn't need to re-read everything — it accesses via tool call, saving 60-80% of tokens.
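The access pattern can be illustrated without the real MCP SDK. The dispatcher below is a hypothetical in-process stand-in: the agent asks a named tool for one slice of state and gets a small JSON payload back, instead of the whole state file being pasted into the prompt:

```python
import json

# Hypothetical project state; in a real MCP server this would live
# in a database or on disk, never in the agent's context window.
PROJECT_STATE = {
    "architecture_decisions": [
        {"decision": "FastAPI + SQLAlchemy", "reason": "native async support"},
    ],
    "known_issues": [
        {"issue": "timeout on >100k-row reports", "status": "open"},
    ],
}

TOOLS = {
    "get_architecture_decisions": lambda: PROJECT_STATE["architecture_decisions"],
    "get_known_issues": lambda: PROJECT_STATE["known_issues"],
}

def handle_tool_call(name: str) -> str:
    """Return only the requested slice of state as JSON."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool {name!r}"})
    return json.dumps(TOOLS[name]())
```

Each call costs tokens proportional to the answer, not to the total accumulated state — which is the source of the 60-80% savings.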
7. New: Parallel Agents in Cursor 2.0
What Changed (October 2025)
Cursor 2.0 introduced up to 8 simultaneous agents using git worktrees or remote machines:
Agent 1 (Composer model, worktree-1): Backend refactoring
Agent 2 (Composer model, worktree-2): Frontend components
Agent 3 (GPT-4 Codex, worktree-3): E2E tests
Each worktree is an isolated clone of the same repo — changes don't conflict. You compare the 3 results and choose the best.
Practical implication:
- Complex tasks solved 2-3x faster
- Result comparison for hard problems (multiple models tackling the same issue)
- Git worktrees automatically managed by Cursor
8. Windsurf Cascade: Semantic Indexing
How It Works (December 2025)
Windsurf pre-scans your entire repository and creates a semantic index on AST:
Your code:
├── auth/
│ ├── login.py → [login function, AuthService class, imports]
│ ├── jwt.py → [encode_token function, decode_token function]
├── models/
│ ├── user.py → [User class, validators]
│ ├── schema.py → [Pydantic BaseModel subclasses]
Windsurf's semantic index:
- Entity: "login" (function) → file: auth/login.py:15
- Entity: "AuthService" (class) → file: auth/login.py:42
- Dependency: AuthService → jwt.encode_token
- Dependency: User model → schema.py
When you ask "refactor authentication to use JWT", Cascade understands structure without reading entire files.
Advantage over grep: Structural context, not just strings.
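The core of such an index can be sketched with Python's standard-library `ast` module. This is a toy version of the idea, not Windsurf's implementation: one entry per function or class, with name, kind, and line number — the entity map shown above:

```python
import ast

def index_module(source: str, filename: str) -> list[dict]:
    """Build a tiny semantic index: one entry per function/class,
    with name, kind, and line number."""
    entries = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append({"entity": node.name, "kind": "function",
                            "file": filename, "line": node.lineno})
        elif isinstance(node, ast.ClassDef):
            entries.append({"entity": node.name, "kind": "class",
                            "file": filename, "line": node.lineno})
    return entries

entries = index_module(
    "class AuthService:\n    def login(self):\n        pass\n",
    "auth/login.py",
)
```

An agent querying this index learns that `login` lives at `auth/login.py` without ever loading the file's body into context.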
9. Intentional Compression vs. Auto-Compaction
Auto-Compression Failure Patterns
Stanford research (2024) identifies 3 patterns where automatic compression fails:
- "Lost by the boundary" — critical information at chunk boundaries is lost
- "Lost if surprise" — unexpected/outlier information is discarded
- "Lost along the way" — data in the middle of context suffers degradation
Strategy: Controlled Compression with Explicit Ranks
# Manual Compression - Session #47
## Preserve (Critical - never discard)
- Decision: Use Row-Level Security in PostgreSQL
- Reason: Multi-tenant data security requirement
- Reference file: `docs/security-design.md`
- Priority: 🔴 CRITICAL
## Preserve (Important - if space available)
- Exponential backoff with jitter retry pattern
- Reason: Prevent thundering herd in rate limiting
## Discard (Already solved, no future relevance)
- Discussions on ORM selection (resolved with SQLAlchemy, no alternatives)
- Python syntax debugging (isolated problem)
## New Context Window
- Start new session with CLAUDE.md + progress.md + decisions file
- "Implementation" sub-agent will receive only focused spec
10. Model Selection: The Cheapest Token is the One You Don't Use
Selection Matrix (December 2025)
| Task | Model | Tokens | Example |
|---|---|---|---|
| Log analysis | Haiku / Sonnet 3.5 | 10k | Find error pattern |
| Module refactoring | Sonnet 4.0 | 50-100k | Reorganize auth service |
| Architecture design | Opus / Claude Pro | 150k+ | Redesign scalability |
| Performance optimization | Opus + thinking mode | 300k+ | Profile and optimize bottleneck |
Anti-Pattern
❌ Use Opus for EVERYTHING because "it's the best"
❌ Use Haiku for complex refactoring
❌ Enable "thinking mode" by default
Cursor 2.0 allows automatic model selection per task — configure "Auto" for simple tasks, Sonnet for moderate, Opus rarely.
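A routing rule like the matrix above is a few lines of code. The task-type prefixes and model names below are illustrative assumptions following the table, not an official API:

```python
# Hypothetical task-complexity router following the selection matrix.
ROUTES = [
    ("log_analysis", "haiku"),
    ("refactor", "sonnet"),
    ("architecture", "opus"),
]

def pick_model(task_type: str, default: str = "sonnet") -> str:
    """Return the cheapest model adequate for the task type."""
    for prefix, model in ROUTES:
        if task_type.startswith(prefix):
            return model
    return default
```

The default deliberately falls through to a mid-tier model: unknown tasks should not silently escalate to the most expensive option.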
11. The Role of Thinking Mode
Models with explicit reasoning ("thinking") can solve complex problems, but consume tokens exponentially.
When to Use Thinking
✅ Critical architectural decisions (e.g., database migration)
✅ Complex performance optimizations (e.g., distributed sorting algorithm)
✅ Deep security analysis (e.g., cryptography audit)
When to Avoid
❌ Simple refactoring
❌ Typo fixes
❌ Boilerplate generation
Golden Rule: Thinking mode = 3-5x more tokens. Use only when quality justifies it.
12. Technical Terminology and Best Practices
To keep terminology consistent throughout the article:

| Technical Term | Preferred Terminology |
|---|---|
| Token dilution | Attention degradation, context pollution |
| Context window | Context window / context limit |
| System prompt | System prompt / system instruction |
| Tool calling | Tool invocation / function calling |
| Stateless function | Function without persistent memory |
| Auto-compact | Automatic compaction / auto-compression |
| Context pollution | Context contamination |
| Embedding | Semantic vector / vector representation |
| Retrieval-Augmented Generation (RAG) | Retrieval + generation |
| Gist token compression | Compressed context synthesis |
13. Continuous Monitoring and Optimization
Metrics That Matter
- Success Rate — % of tasks completed on first attempt
- Tokens per Success — cost of each successful task
- Response Time — total latency (not linear with tokens)
- Context Utilization — % of window actually used
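The second metric is the one that catches false economies: cutting tokens is pointless if success drops faster. A minimal sketch, assuming per-run logs shaped as `{"tokens": int, "success": bool}` (a hypothetical format, not any tool's export):

```python
def tokens_per_success(runs: list[dict]) -> float:
    """Cost of each *successful* task: total tokens spent across all
    runs (including failures) divided by the number of successes."""
    successes = sum(1 for r in runs if r["success"])
    total_tokens = sum(r["tokens"] for r in runs)
    return total_tokens / successes if successes else float("inf")

runs = [
    {"tokens": 100, "success": True},
    {"tokens": 50, "success": False},   # failed runs still count as cost
    {"tokens": 150, "success": True},
]
avg = tokens_per_success(runs)
```

Note that failed runs are charged to the successes — a 60% success rate inflates the effective cost of every completed task.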
Observability Tools (December 2025)
- Cursor: View tokens/stats in UI per operation
- Windsurf: Detailed retrieval + token logs in `~/.windsurf/logs`
- Claude Code: Retrospective trace analysis via Claude Dashboard
- Qodo: Metrics per executed agent
Realistic Optimization Pipeline
Week 1: Measure baseline
Metric: 150k tokens/task, 60% success
Week 2: Implement intelligent RAG
Metric: 80k tokens, 75% success
Week 3: Add MCPs
Metric: 50k tokens, 85% success
Week 4: Multi-agent (research-plan-implement)
Metric: 40k tokens, 90% success
Week 5+: Incremental refinements
Metric: 30k tokens, 92% success
14. Common Anti-Patterns
❌ Sending Everything At Once
# BAD
agent.run("""
TODO: Refactor the authentication module.
Here is the ENTIRE codebase:
[500 files attached]
Here is the documentation:
[50 PDFs]
""")
✅ Iterative Retrieval
# GOOD
agent.run("""
TODO: Refactor the authentication module.
Explore:
1. Search for classes with 'Auth' in the name
2. Identify the entry point function
3. Map dependencies
4. Then review structure
""")
❌ One Agent Doing Everything
# BAD - polluted context
agent = create_agent(
instructions="""
You are a full-stack developer who:
- Writes FastAPI backends
- Writes React frontends
- Deploys with Docker
- Writes tests
- Documents everything
- Optimizes performance
"""
)
✅ Specialized Agents
# GOOD - focused context
research_agent = SubAgent(instructions="Analyze codebase and propose architecture")
backend_agent = SubAgent(instructions="Implement FastAPI endpoints")
test_agent = SubAgent(instructions="Write pytest tests")
review_agent = SubAgent(instructions="Review code for patterns and security")
# Orchestrate manually or let Claude Code coordinate
Conclusion: Tokens as a Finite Resource
Token optimization is fundamentally about treating tokens as a finite and valuable resource, similar to RAM or CPU in traditional systems.
Principles applied by production professionals in December 2025:
- Quality > Quantity — focused context outperforms abundant context
- Multi-Agent Architecture — split complex problems into specialized tasks
- Intentional Compression — control what's memorized, don't rely on auto-summaries
- Iterative Search — load context on-demand, not all at once
- Appropriate Model Selection — use the right model for each task
- Semantic Indexing — leverage AST and embeddings, not naive grep
- MCPs for Memory — access persistent state via tools, not context
- Observability — measure and optimize continuously
- Sub-agents for Isolation — each agent specialized, own context
- Parallel Agents — Cursor 2.0 allows 8 simultaneous agents via git worktrees
Applying these techniques isn't just about saving tokens — it's about building AI agents that reason better, deliver superior code quality, and cost less in the process.
Technical References
- Anthropic Engineering: "Effective context engineering for AI agents" (2025)
- Refine AI Case Study: "Quality Code Generation: Multi-Agent Systems and Token Dilution" (90% token reduction with multi-agent)
- Cursor 2.0 Release: "Introducing Cursor 2.0 and Composer" (October 2025) — git worktrees, 8 parallel agents
- Windsurf: AST-based semantic indexing, Cascade 2.0 roadmap (code review + benchmarking)
- Claude Code Documentation: Sub-agents, MCP protocol extensions, context isolation patterns
- Serena MCP: Semantic code understanding toolkit for code agents (GitHub: oraios/serena)
- HumanLayer Blog: "Writing a good CLAUDE.md" (instruction-following research)
- Research Papers:
- "Self-Organized Agents" (SoA framework)
- "Chain of Agents" (CoA framework)
- "Gist Token-based Context Compression" (Stanford, failure patterns)
- "LLMLingua: Compressing Prompts for Accelerated Inference"