Diagnosing and Remediating the JADA Agent Daemon: OAuth Token Failures and Turn-Limit Patterns
Over the past 24 hours, the JADA orchestrator daemon running on AWS Lightsail instance 34.239.233.28 exhibited mixed health signals: the core agent service remained stable with 3 days of uptime, but a critical OAuth token degradation in the port sheet sync subprocess and recurring Claude API turn-limit exits required investigation and remediation. This post details the diagnostic approach, findings, and remediation strategy.
Service Health Baseline
The jada-agent.service systemd unit has been running continuously since May 10 with no crashes or restarts:
- Uptime: 3 days (11 days for the instance itself)
- Load average: 0.00 (idle between scheduled tasks)
- CPU utilization: 0.65% average with no anomalous spikes over the past 2 hours
- Memory footprint: 144 MB / 914 MB (15.7% utilization)
- Disk usage: 6.2 GB / 39 GB (17% used)
- AWS status checks: 0 failures in the last 2 hours
The daemon's idle CPU and flat memory profile indicate nominal operation between task executions. The 60-second polling loop on the progress dashboard was functioning as designed.
Session Activity Pattern and Turn-Limit Exits
The daemon executed three sessions within a 5-minute window on May 13 (UTC 00:00–00:05):
- Session 1 (00:00 UTC): Hit the 30-turn Claude API limit, exited with code 1
- Session 2 (00:02 UTC): Completed successfully; processed e-signature and crew page generation blockers, created a needs-you task for manual review
- Session 3 (00:05 UTC): Hit the 30-turn Claude API limit, exited with code 1
After Session 3, the daemon found no pending tasks and returned to idle polling. systemd did not treat the code 1 exits as fatal, and the daemon continued to monitor for and accept new tasks.
Why this matters: Two out of three runs hit the 30-turn conversation limit, indicating that task complexity or multi-step branching logic used up the turn budget before the work was complete. (The turn limit is a cap on conversation rounds, separate from the model's context window.) Session 2's successful run demonstrates that when tasks fit within the turn budget, the agent executes cleanly and produces actionable output (the needs-you task). The pattern suggests that complex, branching tasks should be decomposed into smaller sub-tasks or the turn limit should be increased in tandem with more detailed system prompts.
Critical Issue: Port Sheet OAuth Token Failure
The most actionable finding was the recurring failure in the port_sheet_sync.py subprocess. Every 30-minute sync cycle since at least the afternoon of May 13 has logged:
[port-sheet] token error: HTTP Error 400: Bad Request
This error indicates that the Google OAuth 2.0 refresh token stored for the port sheet sync has expired, been revoked, or is no longer valid. The subprocess attempts to refresh the token via the Google OAuth endpoint and receives a 400 Bad Request response, which typically means the refresh token is stale or the client credentials no longer match.
Likely root cause: Google invalidates a refresh token that has gone unused for six months, so a token that was issued long ago and never exercised would fail in exactly this way. A token that the account owner has revoked behaves the same. Alternatively, if the client_id or client_secret originally used to authenticate has been changed or rotated, the refresh attempt fails the credential validation check on Google's servers. The logged message alone does not distinguish between these cases; the error code in the response body does.
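To tell these causes apart, the refresh call can surface the error code from the response body instead of only the HTTP status line. The following is a minimal sketch, not the actual port_sheet_sync.py code: it assumes the script uses urllib (the logged "HTTP Error 400: Bad Request" matches urllib's error string), and the function names and the cause table are illustrative.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint

# Standard OAuth 2.0 error codes (RFC 6749, section 5.2) mapped to likely causes.
LIKELY_CAUSES = {
    "invalid_grant": "refresh token expired, revoked, or issued to a different client",
    "invalid_client": "client_id or client_secret not recognized",
    "invalid_request": "malformed refresh request (missing or duplicated parameter)",
}


def classify_token_error(status: int, body: bytes) -> str:
    """Turn a token-endpoint error response into a one-line diagnosis."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:  # covers both decode errors and invalid JSON
        return f"HTTP {status}: unparseable error body"
    if not isinstance(payload, dict):
        return f"HTTP {status}: unexpected error body"
    code = str(payload.get("error", "unknown"))
    cause = LIKELY_CAUSES.get(code, "see error_description")
    return f"HTTP {status} {code}: {cause}"


def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """Hypothetical refresh helper: POST the refresh grant, raise a diagnosed error on failure."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode("ascii")
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        raise RuntimeError(classify_token_error(err.code, err.read())) from err
```

With this in place, an invalid_grant result would point toward re-authentication, while invalid_client would point toward a credential mismatch.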
Impact: Port sheet syncs have been silently failing for an unknown duration. Any manual updates to the port booking sheet in Google Sheets have not been propagated to downstream systems that depend on the synced data (booking automation, availability calendars, crew assignments, etc.).
Technical Diagnostic Approach
The investigation followed this sequence:
- SSH Access via AWS SSM: Since the local SSH private key was not available in ~/.ssh/, we used AWS Systems Manager Session Manager as an intermediary to obtain temporary SSH credentials via the Lightsail API endpoint. This avoided the need to share or regenerate long-lived keys.
- Service Status Verification: Checked systemctl status jada-agent.service and reviewed 2 hours of CPU/memory metrics via CloudWatch on the Lightsail instance to rule out resource exhaustion or recent crashes.
- Log Analysis: Extracted systemd journal entries and parsed application-level logs from jada-agent's stderr/stdout to identify the port sheet OAuth error pattern and the turn-limit exits.
- Task Queue Inspection: Polled the progress dashboard backend (not specified in this post but accessible via the daemon's internal monitoring loop) to confirm the current task queue state and verify that no tasks were stuck in an intermediate state.
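The log analysis step can be approximated with a small tally over journal output. This is a sketch under assumptions: the regular expression matches the "[port-sheet] token error: ..." line quoted earlier, while the function name and the idea of piping in journalctl output are illustrative rather than the tooling actually used.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the error line format quoted from the daemon's logs.
TOKEN_ERROR = re.compile(r"\[port-sheet\] token error: (?P<detail>.+)$")


def count_token_errors(lines: Iterable[str]) -> Counter:
    """Count port sheet token errors, grouped by the error detail text."""
    counts: Counter = Counter()
    for line in lines:
        match = TOKEN_ERROR.search(line.rstrip("\n"))
        if match:
            counts[match.group("detail").strip()] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: journalctl -u jada-agent.service --since "24 hours ago" --no-pager | python tally.py
    for detail, total in count_token_errors(sys.stdin).most_common():
        print(f"{total:4d}  {detail}")
```

A steady count of one error per 30-minute cycle would confirm that every sync attempt is failing, not just some of them.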
Infrastructure and Secrets Management
The diagnostic process required careful handling of credentials:
- SSH key storage: Private keys for the Lightsail instance are managed via AWS Lightsail's key pair system and never stored in the local filesystem. Temporary keys are requested on-demand via the Lightsail API and written to a temporary file with restricted permissions (chmod 600), then deleted immediately after the session ends.
- OAuth token storage: Google OAuth tokens are stored in a dedicated secrets directory (referenced in repos.env) with file-level access controls. Token refresh happens server-side; the client-side code never handles the refresh flow directly.
- Daemon configuration: The jada-agent.service unit file specifies environment variables from a restricted dotenv file. The daemon reads these at startup and does not log sensitive values.
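The temporary key handling described above (write with restricted permissions, delete when the session ends) maps naturally onto a context manager. This is a minimal sketch assuming a POSIX host; the function name and file prefix are hypothetical, and fetching the key material from the Lightsail API is left out.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_key_file(key_material: str) -> Iterator[str]:
    """Write key material to an owner-only temp file and remove it on exit."""
    # mkstemp creates the file readable and writable only by the owner.
    fd, path = tempfile.mkstemp(prefix="lightsail-key-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(key_material)
        os.chmod(path, 0o600)  # explicit, equivalent to chmod 600
        yield path
    finally:
        # Runs even if the SSH session raises, so the key never lingers.
        if os.path.exists(path):
            os.remove(path)
```

A caller would wrap the SSH invocation in `with temporary_key_file(key) as path:` and pass `path` to the client, so cleanup does not depend on the session ending normally.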
Key Decisions and Next Steps
Decision 1: Keep the daemon running despite the port sheet failures. The port sheet sync subprocess is isolated; its failures do not crash the main daemon or block other agent tasks. However, the failure must be remediated urgently to prevent data staleness.
Remediation for port sheet OAuth: The Google OAuth token for the port_sheet_sync.py script must be re-authenticated. This involves running the auth flow (likely auth_ga.py or a similar OAuth helper script) with the Google account that owns the port sheet, obtaining a fresh token, and storing it in the secrets directory. The token's client_id and client_secret should be verified against the current Google Cloud project credentials to ensure they match.
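The credential check in the last sentence can be done before re-running the auth flow. The sketch below assumes two things that the logs do not confirm: that the stored token JSON carries a client_id field, and that the project credentials are in the client secrets layout that Google Cloud Console exports (a top-level "installed" or "web" object). The function name is hypothetical.

```python
def client_ids_match(token_info: dict, client_config: dict) -> bool:
    """Return True if the stored token was issued to the currently configured client."""
    # Downloaded client secrets nest the values under "installed" or "web".
    section = client_config.get("installed") or client_config.get("web") or {}
    expected = section.get("client_id")
    actual = token_info.get("client_id")
    # A missing value on either side counts as a mismatch.
    return bool(expected) and expected == actual
```

If this returns False, a fresh token alone will not help until the script points at the matching client credentials.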
Decision 2: Investigate the turn-limit pattern. Two out of three sessions hitting the 30-turn limit warrants a review of the task decomposition strategy. Tasks that exceed 30 turns should either be split into smaller, independently executable subtasks or the system prompt should be compressed to reduce verbosity and preserve turn budget for reasoning steps.
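One way to make the splitting option concrete is to group a task's steps into batches that each fit under the turn limit with some headroom. This is an illustrative sketch only: it assumes each step comes with a rough turn estimate, which the daemon does not currently record, and the function name and the 5-turn reserve are hypothetical.

```python
from typing import List, Tuple

TURN_LIMIT = 30  # per-session limit observed in the logs


def split_into_subtasks(
    steps: List[Tuple[str, int]], budget: int = TURN_LIMIT, reserve: int = 5
) -> List[List[str]]:
    """Greedily group (step_name, estimated_turns) pairs into batches under the budget."""
    usable = budget - reserve  # keep headroom for retries and the final summary turn
    if usable <= 0:
        raise ValueError("reserve must be smaller than budget")
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, cost in steps:
        if cost > usable:
            raise ValueError(f"step {name!r} needs {cost} turns; split it further")
        if current and used + cost > usable:
            # Close the current batch and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch would then be queued as its own session, so a single overrun costs one sub-task, not the whole job.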
Decision 3: Implement token refresh monitoring. A lightweight monitoring check should be added to the daemon's main loop to verify that the port sheet OAuth token can be refreshed successfully every time the sync subprocess runs. If a refresh fails, the daemon should alert (via log or webhook) rather than silently skipping the sync.
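A minimal sketch of such a check, assuming the sync entry point can be split into a refresh step and a sync step: the callables passed in (refresh, sync, alert) are hypothetical stand-ins for whatever port_sheet_sync.py and the daemon actually expose.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("port-sheet")


def sync_with_token_check(
    refresh: Callable[[], str],
    sync: Callable[[str], None],
    alert: Optional[Callable[[str], None]] = None,
) -> bool:
    """Refresh the token, then sync; on refresh failure, log and alert instead of staying silent."""
    try:
        token = refresh()
    except Exception as exc:  # any refresh failure should be surfaced
        message = f"[port-sheet] token refresh failed, sync skipped: {exc}"
        logger.error(message)
        if alert is not None:
            alert(message)  # e.g. post to a webhook
        return False
    sync(token)
    return True
```

Returning a boolean lets the main loop track consecutive failures and escalate if the token stays broken across several cycles.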