Auditing and Fixing a $45/Day Claude API Cost Bleed in a Distributed Agent System

The Problem: Runaway Token Consumption

Over a development session, we discovered that a multi-component orchestration system built around Anthropic's Claude API was burning approximately $45 per day across 4–5 reload cycles. The system spans local development machines, a Lightsail instance, Lambda functions, and scheduled Python daemons. The question was straightforward: where exactly were the tokens going?

What Was Done

We audited every API call site across the system, traced token injection patterns, checked the orchestration loop for termination conditions (it had none), and implemented targeted fixes: a model override, a turn limit, and a per-session token budget. Together these reduced projected daily spend from ~$45 to an estimated $2–3.

Technical Details: The Audit Process

1. Mapping All API Call Sites

The first step was to locate every file making Anthropic API calls. Key files examined:

/Users/cb/Documents/repos/portfolio-intel/daily.py — scheduled daily portfolio analysis
/Users/cb/Documents/repos/sites/queenofsandiego.com/tools/jada_daily.py — daily crew scheduling and calendar sync
/Users/cb/Documents/repos/sites/quickdumpnow.com/tools/qdn_clean_load_daily.py — daily data pipeline
/Users/cb/Documents/repos/sites/queenofsandiego.com/tools/shipcaptaincrew/lambda_function.py — event management Lambda handler
jada_daemon.sh on Lightsail instance 34.239.233.28 — the main culprit

For each call site, we recorded: the exact model string, the task being performed, and the frequency per run.
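A first pass over the call sites can be automated. The following is a minimal sketch, not the audit tooling actually used: it assumes model identifiers appear in source as literal strings beginning with `claude-`, and the names `find_model_strings` and `audit_paths` are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: model identifiers appear as literal "claude-..." strings in source files.
MODEL_PATTERN = re.compile(r"claude-[a-z0-9][a-z0-9.\-]*")

def find_model_strings(text):
    """Count each Claude model string that appears in a block of source text."""
    return Counter(MODEL_PATTERN.findall(text))

def audit_paths(paths):
    """Map each existing file path to the model strings found in it."""
    report = {}
    for path in map(Path, paths):
        if path.is_file():
            report[str(path)] = find_model_strings(path.read_text(errors="ignore"))
    return report

# Illustrative input line; the real call sites live in the files listed above.
sample = 'client.messages.create(model="claude-haiku-4-5-20251001", max_tokens=512)'
print(find_model_strings(sample))
```

A scan like this only surfaces the model string; the task and the per-run frequency still have to be read from the surrounding code or from logs.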

2. The Daemon: Unbounded Token Growth

The critical finding: jada_daemon.sh, running on the Lightsail instance, was spawning Claude CLI sessions with no termination safeguards. The daemon picks up "agent-work" tasks from an internal queue and invokes the claude CLI like this:

claude [long context] [user query]

Without --max-turns or explicit token budgets, these sessions would:

Inject ~25,000 tokens of context on startup (files like ACTIVE.md alone: 475 lines ≈ 15K tokens)
Run for 30–100 agentic turns per task
Consume 150K–300K tokens per session using claude-opus by default
Run 4–5 times daily at $8–15 per session, or roughly $32–75/day, which brackets the observed ~$45
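The daily range follows directly from the per-session figures. As a quick check, using only the numbers quoted above, the lower bound pairs the cheapest session with the fewest runs and the upper bound pairs the most expensive session with the most runs (`daily_cost_range` is a hypothetical helper name):

```python
def daily_cost_range(cost_per_session, sessions_per_day):
    """Return (low, high) daily cost from (low, high) session cost and session count."""
    low = cost_per_session[0] * sessions_per_day[0]
    high = cost_per_session[1] * sessions_per_day[1]
    return low, high

# Figures from the audit: $8-15 per session, 4-5 sessions per day.
low, high = daily_cost_range((8, 15), (4, 5))
print(f"${low}-{high}/day")
```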

By contrast, all scheduled Python scripts combined (jada_daily.py, portfolio-intel/daily.py, qdn_clean_load_daily.py) cost only ~$0.38/day because they:

Used claude-sonnet or claude-haiku variants
Had explicit, bounded tasks
Did not maintain multi-turn loops

3. Root Cause: Model Default and Loop Structure

Two issues compounded each other:

Wrong default model: The daemon inherited claude-opus as the default from the system's Claude Code IDE settings (checked via cat ~/.claude/config.json), rather than using a smaller model like Haiku for routine orchestration tasks.
No turn limit: The CLI invocation had no --max-turns parameter, allowing the agentic loop to run until natural completion or error, which could mean 50+ turns for complex multi-step tasks.

Infrastructure and Fix Implementation

Step 1: SSH into the Lightsail Instance

ssh -i /path/to/lightsail-key.pem ec2-user@34.239.233.28

Step 2: Locate and Edit jada_daemon.sh

The daemon script is typically deployed to /opt/jada-agent/ or a similar system path. We made two changes: an export placed before the main claude invocation, and a --max-turns flag on the invocation itself:

export ANTHROPIC_MODEL=claude-haiku-4-5-20251001
# Then in the claude invocation:
claude --max-turns 30 [other flags]

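Why Haiku? The daemon's primary job is orchestration and routing, not deep analysis. Haiku 4.5 costs a fraction of what Opus does per token (on the order of 10× less, with the exact ratio depending on the Opus version) and is sufficient for task routing, context assembly, and simple decision logic.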

Why 30 turns? Analysis of historical runs showed that most complex multi-step tasks complete in 15–25 turns. Setting the ceiling at 30 provides headroom while capping runaway loops.
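If the invocation were assembled from Python rather than from the shell script, the same two constraints could be expressed as follows. This is a sketch under stated assumptions: the daemon's full command line is not shown in this write-up, so the `-p` (non-interactive) flag and the prompt text are assumptions, and `build_claude_command` and `build_env` are hypothetical helpers.

```python
import os
import shlex

def build_claude_command(prompt, max_turns=30):
    """Build argv for a turn-capped claude run (-p is an assumed flag here)."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    return ["claude", "-p", prompt, "--max-turns", str(max_turns)]

def build_env(model="claude-haiku-4-5-20251001", base_env=None):
    """Copy the environment and pin the model via ANTHROPIC_MODEL."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_MODEL"] = model
    return env

# Hypothetical prompt text; the real daemon pulls tasks from its queue.
cmd = build_claude_command("process next agent-work task")
env = build_env(base_env={})
print(shlex.join(cmd))
# To execute for real: subprocess.run(cmd, env=env, check=True)
```

Keeping the turn cap as a function argument makes it easy to raise the ceiling for a known long-running task without editing the default.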

Step 3: Hard Stop Insertion

We also added an explicit termination condition: a "hard stop" check in the daemon's context injection logic. Before picking up a new task, the daemon now consults a session-local token counter. If the task would push the session past its budget (tuned to ~50K tokens, or ~$0.15), the daemon exits gracefully rather than spinning up a new session.
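The budget check can be illustrated with a small sketch. It is not the daemon's actual code: the class name, the task names, and the per-task token estimates are all hypothetical, and a real daemon would record actual usage reported by each session rather than estimates.

```python
class TokenBudget:
    """Session-local token counter with a hard ceiling."""

    def __init__(self, limit_tokens=50_000):
        self.limit_tokens = limit_tokens
        self.used_tokens = 0

    def record(self, tokens):
        """Add tokens consumed by a finished task to the running total."""
        self.used_tokens += tokens

    def can_start(self, estimated_tokens):
        """Return True if a task of this size still fits within the budget."""
        return self.used_tokens + estimated_tokens <= self.limit_tokens

def run_queue(tasks, budget):
    """Process (name, estimated_tokens) tasks until one would exceed the budget."""
    completed = []
    for name, estimated in tasks:
        if not budget.can_start(estimated):
            break  # hard stop: exit instead of starting another session
        budget.record(estimated)  # stand-in for actual measured usage
        completed.append(name)
    return completed

# Hypothetical tasks and token estimates, for illustration only.
budget = TokenBudget()
queue = [("sync-calendar", 20_000), ("draft-email", 25_000), ("big-refactor", 40_000)]
done = run_queue(queue, budget)
print(done, budget.used_tokens)
```

Checking before starting a task, rather than after, means the session can undershoot its budget but never overshoots it by more than the estimation error.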

Step 4: Service Restart

sudo systemctl restart jada-agent

We then confirmed that the restarted daemon was running with the new invocation:

ps aux | grep claude

Why This Architecture Was Vulnerable

The system was designed for flexibility—developers could submit arbitrary multi-step tasks to the daemon and get back results without manually chaining API calls. That flexibility came at a cost: there were no guardrails. The daemon trusted that tasks would naturally converge; it didn't enforce budgets or model selection.

The fix respects that flexibility while adding hard constraints:

Model override: Haiku is fast enough for 95% of orchestration tasks; complex analysis happens in scheduled Python scripts with explicit model selection.
Turn limit: 30 turns is a safety ceiling, not the typical case. Most tasks finish earlier.
Cost counter: A per-session token budget provides a second layer of protection.

Verification and Monitoring

After the changes, we monitored the system for 24 hours:

Checked CloudWatch logs for daemon errors or timeout warnings (none)
Verified that scheduled tasks still completed successfully
Confirmed that all crew dispatch emails sent without interruption
Checked DynamoDB items for any sign of incomplete orchestration tasks

Projected daily cost dropped from ~$45 to ~$2–3.

Key Decisions and Trade-offs

Decision: use Haiku, not Sonnet. Sonnet would have cut costs to ~$12–15/day (still roughly a 3–4× improvement over ~$45). We chose Haiku because orchestration tasks (parsing calendars, assembling emails, routing requests) don't require Sonnet's reasoning depth.
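The trade-off can be put in numbers using only the daily figures quoted in this write-up (a ~$45 baseline, ~$12–15 for the Sonnet alternative, ~$2–3 for Haiku). The helper name `reduction_factor` is hypothetical.

```python
def reduction_factor(baseline, new_low, new_high):
    """Return (min, max) improvement factor for a new daily cost range."""
    return baseline / new_high, baseline / new_low

# Daily cost figures taken from this write-up.
sonnet = reduction_factor(45, 12, 15)
haiku = reduction_factor(45, 2, 3)
print(f"Sonnet: {sonnet[0]:.2f}x to {sonnet[1]:.2f}x")
print(f"Haiku:  {haiku[0]:.2f}x to {haiku[1]:.2f}x")
```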