Most developers using AI tools focus on model pricing — $3/M input tokens vs $15/M output tokens. But the real cost story isn't about price per token. It's about how many tokens you waste without realizing it. After analyzing real usage data from Claude Code, API workflows, and agentic loops, a clear picture emerges: more than half of the tokens you pay for are completely unnecessary.
The Five Ways You're Wasting Tokens
1. Tool Schema Bloat
Every AI tool sends its full schema — descriptions, parameter types, enums — with every single API call. For a typical agentic setup with file operations, bash access, web search, and code editing, that's roughly 25-30% of your context window consumed before you've even asked a question. A single Claude Code turn loads ~45,000 tokens of tool definitions alone. The fix? Enable tool search so only relevant tools are loaded per turn, or trim your tool definitions aggressively if you're building custom agents.
# Claude Code: cut startup context from 45k → 20k tokens
ENABLE_TOOL_SEARCH=true

This single setting saves ~14,000 tokens per turn. At scale, that's the difference between a $50/month bill and a $200/month bill.
2. Cache Expiry — The Silent Killer
Prompt caching saves 90% on repeated tokens — but only if the cache hasn't expired. Claude's cache TTL is 5 minutes (Pro) to 60 minutes (Max tier). Here's the problem: if you step away for 6 minutes and come back, the entire conversation history, all tool schemas, and all system prompts get re-processed at full price. In real-world data, 54% of turns hit an expired cache, causing a 10x cost spike on those turns.
The solution isn't to type faster — it's to structure your workflow around cache-friendly patterns. Batch related questions together. Use session-based tools instead of restarting. If you're building an agent, implement checkpointing so it can resume without re-sending the full history.
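If you're calling the Anthropic Messages API directly, the cache-friendly pattern looks roughly like this: mark the stable prefix (tool definitions and system prompt) with a `cache_control` block so every turn inside the TTL reuses it at the cached-read rate. A minimal sketch; the model name, prompt text, and tool list below are placeholders, not a real configuration:

```python
# Sketch: placing cache_control on the last stable block caches the whole
# prefix up to that point (tools + system prompt), so repeated turns within
# the cache TTL are billed at the much cheaper cached-read rate.

def build_request(history, new_message, system_text, tools):
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # everything before and including this block gets cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history + [{"role": "user", "content": new_message}],
    }

req = build_request(
    history=[],
    new_message="Batch: review foo.py, then summarize the diff.",
    system_text="You are a concise code reviewer.",
    tools=[],
)
```

Batching related questions into `new_message`, as above, keeps you inside one TTL window instead of paying full price for the prefix on every scattered turn.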
3. Redundant File Reads
AI agents read files. A lot. In a typical coding session, the same file gets read 3-7 times across different turns — once for context, again for a diff, again to verify changes, again after a failed edit. Each read sends the full file contents as input tokens. For a 500-line file, that's 2,000-3,000 tokens per read, repeated unnecessarily.
Better approaches: keep a working memory of already-read files, use diff-only operations instead of full reads, and implement file caching in your agent loop. If you're an API user, track what's in context and avoid re-sending static content.
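A working memory for file reads can be as simple as a cache keyed on path and modification time, so unchanged files are never re-read (or re-sent). A minimal sketch; the class and its API are illustrative, not from any particular agent framework:

```python
import os
import tempfile

class FileReadCache:
    """Serve repeat file reads from memory until the file's mtime changes."""

    def __init__(self):
        self._cache = {}  # path -> (mtime, contents)
        self.hits = 0
        self.misses = 0

    def read(self, path):
        mtime = os.path.getmtime(path)
        entry = self._cache.get(path)
        if entry is not None and entry[0] == mtime:
            self.hits += 1       # unchanged file: no disk read, no re-send
            return entry[1]
        self.misses += 1         # new or modified file: read and cache it
        with open(path) as f:
            contents = f.read()
        self._cache[path] = (mtime, contents)
        return contents

# Demo: the second read of an unchanged file is a cache hit.
cache = FileReadCache()
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('hello')\n")
    demo_path = f.name

first = cache.read(demo_path)    # miss: reads from disk
second = cache.read(demo_path)   # hit: served from memory
os.unlink(demo_path)
```

In an agent loop, a hit means you can skip re-injecting the file into context entirely, saving the 2,000-3,000 tokens per redundant read mentioned above.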
4. Stateless Conversation Rebuild
LLM conversations are stateless: every turn rebuilds the full history from scratch. Turn 10 of a conversation doesn't just send message 10; it sends messages 1-9 plus the new one. Because sending turn T replays all T-1 earlier messages, cumulative token usage grows quadratically with the number of turns. A 20-turn session might consume 500k+ tokens just in history replay.
Mitigation: use conversation summarization at checkpoints (compress turns 1-8 into a summary before sending turn 9). Structure your prompts so earlier context can be safely dropped. For agentic loops, implement a sliding window that keeps only the most relevant N turns in full.
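The sliding-window idea can be sketched in a few lines. Here `summarize()` is a stub that just concatenates snippets; a real agent would call the model (or a cheaper one) to produce the digest:

```python
# Sketch: keep the last N turns verbatim and collapse everything older into
# a single summary message. Message roles are simplified for the demo; a
# real API payload must alternate user/assistant roles.

def summarize(messages):
    # Placeholder: a real implementation would ask an LLM for a digest.
    snippets = "; ".join(str(m["content"])[:40] for m in messages[:3])
    return f"{len(messages)} earlier turns, starting with: {snippets}"

def sliding_window(history, keep_last=6):
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    summary_msg = {
        "role": "user",
        "content": f"[Conversation summary: {summarize(older)}]",
    }
    return [summary_msg] + recent

history = [{"role": "user", "content": f"question {i}"} for i in range(20)]
window = sliding_window(history, keep_last=6)  # 1 summary + 6 full turns
```

With this shape, a 20-turn session replays 7 messages instead of 20, and the replayed prefix stops growing once the window is full.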
5. Over-Engineering the System Prompt
System prompts are sent with every API call. A verbose 4,000-token system prompt costs the same as 4,000 input tokens every single turn. Over a 50-turn session, that's 200,000 tokens just on instructions. Audit your system prompts ruthlessly — remove examples that the model already understands, compress multi-paragraph instructions into concise rules, and move rarely-needed instructions into a separate lookup mechanism.
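The arithmetic behind that 200,000-token figure is easy to check with a rough chars-per-token heuristic (English prose averages about 4 characters per token; use a real tokenizer for billing-grade counts):

```python
def rough_tokens(text, chars_per_token=4):
    # Heuristic only: ~4 characters per token for English prose.
    # Use a real tokenizer for exact counts before relying on the numbers.
    return len(text) // chars_per_token

# Stand-in for a verbose ~16,000-character (~4,000-token) system prompt.
system_prompt = "x" * 16_000

per_turn = rough_tokens(system_prompt)   # ~4,000 tokens, every single turn
session_total = per_turn * 50            # resent on each of 50 turns
```

Even a heuristic like this is useful as a pre-flight check: run it over your system prompt before a long session and you'll know what each turn costs before you send anything.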
This Isn't Just a Claude Code Problem
These patterns apply everywhere: OpenAI Agents, LangChain workflows, custom API integrations, Cursor, Copilot, any tool that maintains conversation state. The underlying economics are the same — you pay for every token sent, and wasted tokens compound over session length.
If you're building with the Anthropic API directly, the same principles apply: cache your system prompts, minimize tool definitions per call, implement conversation compression, and batch operations to maximize cache hits. The API gives you more control than any wrapper tool — use it.
Quick Wins Checklist
Enable tool search — cut 14k tokens per turn in Claude Code
Batch questions in one turn instead of spreading across many
Use shorter system prompts — compress instructions, drop examples
Implement conversation summarization at checkpoints
Track file reads — cache what's already in context
Match your cache TTL to your workflow cadence
Use diff operations instead of full file re-reads
For API users: implement sliding window context management
The Math That Matters
Let's say you do 100 turns per day across your AI tools. Without optimization, each turn averages ~50k input tokens. With the fixes above — tool search, cache-friendly batching, compressed prompts, file read caching — you can bring that down to ~20k per turn. That's 3M tokens saved per day. At the $3/M input rate quoted in the intro, that's about $9 a day, or roughly $270 a month, back in your pocket.
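The same back-of-envelope math in code, using the $3/M input-token rate from the intro (actual rates vary by model and provider):

```python
# Assumed figures from the worked example above; adjust for your own usage.
PRICE_PER_MTOK = 3.00            # USD per million input tokens (assumed rate)
TURNS_PER_DAY = 100
BEFORE, AFTER = 50_000, 20_000   # avg input tokens per turn, pre/post fixes

saved_per_day = TURNS_PER_DAY * (BEFORE - AFTER)           # tokens saved daily
dollars_per_day = saved_per_day / 1_000_000 * PRICE_PER_MTOK
dollars_per_month = dollars_per_day * 30
```

Swap in your own turn count and per-turn averages to see what the optimizations are worth for your workload.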
The biggest insight isn't about using a cheaper model — it's about not paying for tokens you don't need. Optimize the waste first, then evaluate whether you even need to switch models.
Resources
Anthropic Prompt Caching — how cache TTL works and how to maximize hits
Claude Code Settings — ENABLE_TOOL_SEARCH and other token-saving options
Tiktoken and Anthropic Tokenizer — count tokens before you send them
LangChain Conversation Summary — conversation compression patterns
Reddit discussion — the original thread that inspired this post