Diagnosing and Resolving OAuth Token Failures in Distributed Agent Systems


During a routine health check of our jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28), we found that the port sheet synchronization pipeline had been failing silently for approximately 12 hours because of an OAuth token error. This post details the diagnosis methodology, the root cause analysis, and the architectural lessons learned about token lifecycle management in distributed agent systems.

What Was Done

We performed a comprehensive health audit of the jada-agent.service daemon, including service status verification, log analysis, CPU/memory profiling, and task execution tracking. The investigation revealed:

The jada-agent.service has maintained 3 days of continuous uptime with healthy resource utilization
Three agent sessions executed in a 5-minute window, with two hitting the 30-turn Claude API limit
The port_sheet_sync.py script has been failing every 30-minute sync cycle with "HTTP Error 400: Bad Request"
Google OAuth credentials for port sheet synchronization appear to have expired or been revoked

Technical Details: Daemon Architecture and Session Management

The jada-agent daemon operates as a task queue consumer that polls the progress dashboard at regular intervals. The architecture is polling-based rather than event-driven, which provides natural rate limiting but leaves a blind spot: nothing in the loop notices when a credential expires.

Session tracking: Each Claude API session is bounded by a 30-turn limit and tracked in the daemon logs with timestamps and exit codes. Today's session pattern shows how often tasks run up against that limit:

Session 1 (00:00 UTC): max_turns_reached → exit code 1
Session 2 (00:02 UTC): completed_successfully → exit code 0
Session 3 (00:05 UTC): max_turns_reached → exit code 1

Session 2 successfully created a needs-you task for blocking issues in the e-signature and crew page generator code, demonstrating that even constrained sessions produce actionable output. However, the fact that Sessions 1 and 3 hit the turn limit suggests that complex tasks may need to be decomposed, or that the turn budget may need adjusting for certain workload classes.
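The session pattern above can be tallied automatically instead of being read by eye. The following is a minimal sketch that assumes the log lines look exactly like the three summary lines shown in this post; the daemon's real log format may differ, and `summarize_sessions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line format, copied from the session summary in this post.
# The daemon's actual log format may differ.
SESSION_RE = re.compile(
    r"Session (?P<id>\d+) \((?P<time>\d{2}:\d{2}) UTC\): "
    r"(?P<outcome>\w+) → exit code (?P<code>\d+)"
)


def summarize_sessions(lines):
    """Count session outcomes and the share that hit the turn limit."""
    outcomes = Counter()
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            outcomes[match.group("outcome")] += 1
    total = sum(outcomes.values())
    limited = outcomes.get("max_turns_reached", 0)
    return {
        "total": total,
        "outcomes": dict(outcomes),
        "max_turns_ratio": limited / total if total else 0.0,
    }


sample = [
    "Session 1 (00:00 UTC): max_turns_reached → exit code 1",
    "Session 2 (00:02 UTC): completed_successfully → exit code 0",
    "Session 3 (00:05 UTC): max_turns_reached → exit code 1",
]
summary = summarize_sessions(sample)
```

Tracking `max_turns_ratio` over days rather than a single 5-minute window would give a sounder basis for deciding whether tasks need to be split.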

Resource utilization: The Lightsail instance (1 GB class: 914MB usable RAM, 2 vCPU) showed excellent health metrics:

CPU: 0.65% average with zero spikes over the 2-hour observation window
Memory: 144MB / 914MB (15.8% utilization)
Disk: 6.2GB / 39GB (17% used)
Status checks: 0 failures
Load average: 0.00 (essentially idle between task cycles)

This indicates the daemon's polling loop and idle state are CPU-efficient, but also that there are significant idle periods where credential refresh operations could be more aggressively scheduled.

The OAuth Token Failure: Root Cause and Impact

The port_sheet_sync.py script executes on a 30-minute schedule via cron. Every execution since approximately 2026-05-13 13:00 UTC has failed with identical error signatures:

[port-sheet] token error: HTTP Error 400: Bad Request

This error pattern suggests that the Google OAuth token stored for the sync script (a user credential obtained through a consent flow, not a service account key) has:

Expired without automatic refresh (access token TTL is typically 1 hour for user-facing OAuth flows)
Been revoked at the Google API level (user changed password, revoked permissions, or admin policy change)
Lost validity due to scope mismatch or credential rotation

The impact is that Google Sheets synced via port_sheet_sync.py have not been updated for ~12 hours, creating data staleness for any downstream consumers of that sheet data.
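The logged message has the same shape as the string form of Python's `urllib.error.HTTPError`, which hints (as an assumption, since the script source is not shown here) that the sync script uses `urllib` and logs only the status line. The status line hides the JSON body of the token endpoint's response, and that body usually names the cause; RFC 6749 defines codes such as `invalid_grant` for an expired or revoked refresh token and `invalid_client` for bad client credentials. A minimal sketch of extracting it, with `describe_token_error` as a hypothetical helper:

```python
import json
import urllib.error


def describe_token_error(err):
    """Return (error, description) from an OAuth token-endpoint HTTPError.

    The response body is expected to be JSON such as
    {"error": "invalid_grant", "error_description": "..."}.
    """
    try:
        payload = json.loads(err.read().decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return ("unparseable_response", "")
    if not isinstance(payload, dict):
        return ("unparseable_response", "")
    return (
        payload.get("error", "unknown"),
        payload.get("error_description", ""),
    )
```

Logging the returned pair next to the existing `[port-sheet] token error` line would make it possible to tell the three candidate causes apart without re-running the script by hand.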

Infrastructure and Authentication Architecture

Current token storage pattern: OAuth credentials are stored in a secrets directory referenced by repos.env. The auth flow uses google-auth-oauthlib for user-facing authentication, with tokens persisted locally for service use.

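Why this failed: Google OAuth 2.0 user-facing flows issue access tokens with a 1-hour TTL. Refresh tokens have no fixed lifetime for apps published to production, but Google invalidates them in several cases: after 7 days when the OAuth consent screen is still in Testing status with an external user type, after six months without use, or when the user revokes access. Once the refresh token is invalid, no new access token can be obtained and the entire credential chain fails. A script that relied only on a cached access token would have broken within an hour of authentication, so the more likely cause here is that the refresh token itself expired or was revoked. port_sheet_sync.py then logged the failed refresh on every run, but nothing surfaced that log line, so the failure stayed silent.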
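To make the refresh step concrete, here is a minimal standard-library sketch of the refresh-token exchange against Google's token endpoint. It illustrates the protocol and is not the code in port_sheet_sync.py; with google-auth, calling `refresh()` on a credentials object performs the equivalent exchange. The function names and the injectable `opener` parameter are assumptions made for illustration and testing.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the form-encoded POST that trades a refresh token for an access token."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def refresh_access_token(client_id, client_secret, refresh_token,
                         opener=urllib.request.urlopen):
    """Return (access_token, expires_in_seconds).

    Raises urllib.error.HTTPError (typically 400) when the refresh token
    has expired or been revoked.
    """
    request = build_refresh_request(client_id, client_secret, refresh_token)
    with opener(request, timeout=10) as response:
        payload = json.load(response)
    return payload["access_token"], int(payload.get("expires_in", 0))
```

A failure in this exchange is the point at which re-authentication by a human becomes unavoidable, which is why it deserves an alert rather than a log line.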

The detection gap: Because the instance sat at an idle load average (0.00), no resource-based alert fired, and nothing monitored the sync script's own output. The error was visible only by connecting directly to the instance and inspecting the port_sheet_sync cron logs, a manual process that would not scale across multiple daemons.
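One way to close this gap is to count consecutive failures at the tail of the cron log and alert once a threshold is crossed. The sketch below assumes one log line per sync run and that successful runs also write a line; the success line in the example is hypothetical, and only the failure marker is taken from the log shown earlier.

```python
FAILURE_MARKER = "[port-sheet] token error"


def consecutive_failures(lines, marker=FAILURE_MARKER):
    """Count failing runs at the end of the log (newest line last)."""
    count = 0
    for line in reversed(lines):
        if marker in line:
            count += 1
        elif line.strip():
            # First non-blank line that is not a failure ends the streak.
            break
    return count


def should_alert(lines, threshold=2):
    """True once the most recent `threshold` runs have all failed."""
    return consecutive_failures(lines) >= threshold


log_tail = [
    "[port-sheet] sync ok",  # hypothetical success line
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "[port-sheet] token error: HTTP Error 400: Bad Request",
]
alert = should_alert(log_tail)
```

With a 30-minute cycle, a threshold of 2 would raise the alarm about an hour after the first failure instead of after roughly 24 failed runs.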

Key Decisions and Why They Matter

Decision 1: Manual SSH access via Lightsail API temporary credentials

We initially searched for a persistent jada-key SSH private key that should have been available. When the key wasn't found in standard locations (~/.ssh/jada-key or repos.env references), we pivoted to AWS Lightsail's temporary SSH credential API rather than requesting a new key pair. This decision:

Avoided introducing new long-lived SSH credentials that would require secure storage and rotation
Leveraged existing AWS IAM permissions to generate ephemeral, time-limited credentials
Reduced attack surface by leaving no long-lived key material on the local machine
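For reference, the temporary-credential approach can be sketched with boto3's Lightsail client, whose `get_instance_access_details` call returns a short-lived private key and certificate. This is an illustrative sketch, not the exact commands we ran; the helper name, the key file name, and the instance name in the usage note are assumptions. The client is passed in so the function can be exercised without AWS access.

```python
import os


def write_temporary_ssh_key(lightsail_client, instance_name, key_dir):
    """Fetch temporary SSH credentials and write them where ssh can use them."""
    response = lightsail_client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    details = response["accessDetails"]

    key_path = os.path.join(key_dir, "lightsail-temp-key")
    # Create the private key with owner-only permissions; ssh rejects
    # key files that are readable by other users.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(details["privateKey"])
    # ssh picks up a certificate stored at "<identity file>-cert.pub".
    with open(key_path + "-cert.pub", "w") as handle:
        handle.write(details["certKey"])

    return {
        "key_path": key_path,
        "username": details["username"],
        "host": details["ipAddress"],
        "expires_at": details["expiresAt"],
    }
```

In use, the client would come from `boto3.client("lightsail")`, and the returned values feed a command of the form `ssh -i <key_path> <username>@<host>`. The key directory should be a temporary one that is deleted after the session.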

Decision 2: Simultaneous metrics collection via Lightsail API

Rather than relying solely on SSH session output, we queried instance metrics such as CPU utilization, network traffic, and status checks through the Lightsail API in parallel. This provided independent verification of instance health and prevented a single SSH session or log inspection from giving a false picture of daemon health.
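A metrics query of this kind can be sketched with boto3's `get_instance_metric_data`. The helper below is illustrative: the function name, the 5-minute period, and the 2-hour default window are assumptions chosen to match the observation window described earlier, and the client is injected so the logic can be tested offline.

```python
from datetime import datetime, timedelta, timezone


def average_cpu(lightsail_client, instance_name, hours=2, now=None):
    """Average CPUUtilization (percent) over the last `hours` hours."""
    end = now or datetime.now(timezone.utc)
    response = lightsail_client.get_instance_metric_data(
        instanceName=instance_name,
        metricName="CPUUtilization",
        period=300,  # seconds per data point
        startTime=end - timedelta(hours=hours),
        endTime=end,
        unit="Percent",
        statistics=["Average"],
    )
    points = [p["average"] for p in response["metricData"] if "average" in p]
    return sum(points) / len(points) if points else None
```

Returning `None` when no data points come back keeps "no data" distinct from "0% CPU", which matters when the goal is to cross-check what an SSH session reports.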

Decision 3: Task decomposition for max-turns exits

Sessions 1 and 3 hit the 30-turn limit. These exits superficially look like failures, but the limit is a designed escape valve. Rather than implementing immediate fixes, we noted this as a pattern to monitor. If tasks consistently exceed 30 turns, the solution is to break them into smaller units in the queue rather than to raise the turn limit, which would increase latency and cost.

What's Next

Immediate actions required:

Re-authenticate port_sheet_sync OAuth token: Run the auth_ga.py script (located at /Users/cb/Documents/repos/tools/auth_ga.py) with the dangerouscentaur@gmail.com account to generate fresh Google OAuth credentials. This issues a new access token and a new refresh token.
Implement token refresh monitoring: Add logging to port_sheet_sync.py to detect when refresh tokens are used and to alert when approaching the 7-day expiration window.
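Add credential health checks to daemon polling loop: Implement a periodic credential validation check in the jada-agent daemon. The credentials.valid flag in google-auth reports only local state (a token is present and has not passed its expiry), so it costs no API call but cannot detect a revoked refresh token; pair it with a periodic refresh attempt to confirm that the refresh token still works.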
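The monitoring and health-check items above could be combined into one function along these lines. This is a sketch under stated assumptions: `creds` is any object exposing `refresh_token` and `valid` the way google-auth credentials do, `issued_at` is a timestamp we would have to record ourselves at authentication time (google-auth does not store it), and the 7-day `max_age` default reflects the expiration window discussed in this post rather than a universal rule. `check_credentials` and `refresher` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def check_credentials(creds, issued_at, now=None,
                      max_age=timedelta(days=7),
                      warn_before=timedelta(days=1),
                      refresher=None):
    """Return a list of problems found; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    problems = []

    if not creds.refresh_token:
        problems.append("no refresh token stored")

    age = now - issued_at
    if age >= max_age:
        problems.append("refresh token older than max_age")
    elif age >= max_age - warn_before:
        problems.append("refresh token nearing max_age")

    # `valid` is a local check only, so attempt a real refresh when the
    # access token is stale. With google-auth the failure would be a
    # RefreshError; Exception is caught here to keep the sketch free of
    # third-party imports.
    if not creds.valid and refresher is not None:
        try:
            refresher(creds)
        except Exception as exc:
            problems.append(f"refresh failed: {exc}")

    return problems
```

With google-auth, `refresher` could be a small wrapper that calls `creds.refresh(...)` with a transport request object. Running the check once per polling cycle adds at most one token request per hour, since a refresh is attempted only when the cached access token is no longer valid.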

Long-term architectural improvements:

Migrate OAuth credentials to AWS Secrets Manager with automatic rotation policies rather than file-based storage
Implement a credential health dashboard that surfaces expiration dates and usage patterns across all service accounts
Add structured logging for all third-party API calls so token errors surface in CloudWatch Logs rather than requiring SSH inspection
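As an illustration of the structured-logging item, the sketch below emits one JSON object per log line using only the standard library. The field names and the `log_api_failure` helper are assumptions; shipping the lines to CloudWatch Logs would additionally require a log agent or handler on the instance, which is not shown.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"api": {...}}` are merged into the entry.
        entry.update(getattr(record, "api", {}))
        return json.dumps(entry)


def log_api_failure(logger, service, status, error):
    """Log a failed third-party API call with machine-readable fields."""
    logger.error(
        "third-party API call failed",
        extra={"api": {"service": service, "status": status, "error": error}},
    )
```

Because `status` and `error` are separate fields, a log filter can match token failures directly instead of pattern-matching free text such as "HTTP Error 400: Bad Request".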

The jada-agent daemon itself is performing well with healthy resource utilization and successful task completion rates. The OAuth token issue is a credential lifecycle management problem, not a