Diagnosing and Stabilizing the JADA Agent Daemon: OAuth Token Recovery and Turn-Limit Analysis
During a routine health check of the jada-agent orchestrator daemon running on AWS Lightsail instance 34.239.233.28, we discovered a persistent OAuth authentication failure in the port sheet sync service alongside recurring agent turn-limit exits. This post documents the diagnostics, root causes, and remediation strategy.
What Was Done
We performed a comprehensive health audit of the jada-agent.service daemon, including:
- Verified service uptime and system resource utilization via AWS Lightsail metrics API
- Established SSH access using temporary credentials from the Lightsail API (avoiding local key storage)
- Collected daemon logs, service status, and session activity for the past 24 hours
- Identified a broken Google OAuth token in the port sheet sync process
- Analyzed the pattern of agent session exits hitting the 30-turn limit configured per Claude session
- Confirmed no data loss or critical service degradation
Technical Details: Service Health Assessment
Overall Status: Healthy with Known Issues
The jada-agent.service systemd unit has been active and running continuously since May 10, giving it 3 days of uptime on an 11-day-old instance. System metrics show normal idle behavior:
- CPU utilization: 0.65% average (expected for a 60-second polling loop between task checks)
- Memory consumption: 144MB / 914MB available
- Disk usage: 6.2GB / 39GB (17% utilized; ample headroom for logs and state)
- AWS status checks: 0 failures in the last 2 hours
- Load average: 0.00 during idle periods
The daemon behaves as a well-mannered polling service: it sleeps between task queue polls rather than consuming CPU in tight loops.
Session Activity Analysis (May 13, UTC)
The daemon executed three separate agent sessions within a 5-minute window:
Session 1 (00:00 UTC): Max turns (30) reached → exit code 1
Session 2 (00:02 UTC): Completed successfully → processed e-sig and crew page blockers
Session 3 (00:05 UTC): Max turns (30) reached → exit code 1
Subsequent: No new tasks picked up; daemon idle (expected behavior)
Sessions 1 and 3 terminated because they hit the 30-turn cap that applies to each Claude session. The daemon logs these as exit code 1, but the runs are not crashes: each session stopped at its turn budget rather than failing unexpectedly. Session 2, which finished within the turn budget, produced output (a "needs-you" task queued for manual review). This pattern suggests that complex, multi-step tasks consume more turns than simpler ones and occasionally exceed the 30-turn ceiling.
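To put a number on that pattern, a small script can count how many sessions ended at the turn cap. The sketch below is illustrative only: it assumes session outcomes are available as one summary line each, in the same wording as the listing above, and the `summarize` helper is a hypothetical name rather than part of the daemon.

```python
import re

# Matches the "Max turns (30) reached" wording used in the session summary above.
MAX_TURNS_RE = re.compile(r"Max turns \((\d+)\) reached")


def summarize(lines):
    """Count sessions and the share of them that stopped at the turn cap."""
    total = len(lines)
    capped = sum(1 for line in lines if MAX_TURNS_RE.search(line))
    rate = capped / total if total else 0.0
    return {"sessions": total, "max_turn_exits": capped, "rate": rate}


# The three sessions documented for May 13 (UTC).
sessions = [
    "Session 1 (00:00 UTC): Max turns (30) reached -> exit code 1",
    "Session 2 (00:02 UTC): Completed successfully",
    "Session 3 (00:05 UTC): Max turns (30) reached -> exit code 1",
]

summary = summarize(sessions)
print(f"{summary['max_turn_exits']} of {summary['sessions']} sessions hit the turn cap")
```

Three sessions are too few to draw conclusions from, but the same count over a longer log window would show whether max-turn exits are the exception or the norm.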
Critical Issue: Google OAuth Token Expiration in port_sheet_sync
The most significant finding is a persistent authentication failure in the port sheet sync service. Every 30-minute sync attempt since at least the afternoon of May 13 has failed with:
[port-sheet] token error: HTTP Error 400: Bad Request
This most likely means that the Google OAuth token stored for the port_sheet_sync.py script has expired or been revoked. The script uses OAuth 2.0 with a refresh token, so a failure at the token step points to one of the following:
- The refresh token has been revoked (e.g., user changed password, revoked app access in Google Account settings)
- The refresh token has expired (for example, Google expires refresh tokens that go unused for six months, and after 7 days for apps whose OAuth consent screen is still in Testing status)
- The credential file at the expected path has become corrupted or unreachable
Impact: Port sheet syncs are not running. Any downstream processes or dashboards depending on up-to-date port sheet data will be stale.
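The logged message matches the default string form of Python's `urllib.error.HTTPError`, which drops the response body where Google states the actual reason (an invalid refresh token is normally reported as `invalid_grant`). The sketch below shows one way to surface that body. It assumes the script talks to Google's token endpoint through `urllib`, which we have not confirmed; `describe_token_error` and `refresh_access_token` are hypothetical names, not functions from port_sheet_sync.py.

```python
import io
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint


def describe_token_error(err):
    """Return (error, description) parsed from an OAuth error response body."""
    raw = err.read().decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return ("unparseable", raw[:200])
    if not isinstance(body, dict):
        return ("unparseable", raw[:200])
    return (body.get("error", "unknown"), body.get("error_description", ""))


def refresh_access_token(client_id, client_secret, refresh_token):
    """Exchange a refresh token for an access token, reporting the real cause on failure."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        code, detail = describe_token_error(err)
        raise RuntimeError(
            f"token refresh failed: HTTP {err.code} {code}: {detail}"
        ) from err
```

With the body logged, an `invalid_grant` response would confirm that the refresh token itself is dead and a fresh consent flow is required, while an `invalid_client` response would point at the client ID or secret instead.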
Infrastructure and Architecture Decisions
SSH Access Pattern: Temporary Credentials via Lightsail API
Rather than storing a long-lived SSH private key locally, we used the AWS Lightsail API to generate temporary SSH access credentials for this session. The process:
- Called the Lightsail GetInstanceAccessDetails API endpoint
- Received temporary access details: a short-lived certificate, its matching private key, and the protocol (OpenSSH format)
- Wrote the key material to temporary files with restricted permissions (mode 0600)
- Established an SSH connection using the temporary private key together with the certificate
- Deleted the temporary files immediately after disconnection
This approach avoids the operational burden of rotating a long-lived key and reduces the risk of key compromise from local storage, since the credentials expire shortly after they are issued.
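The flow can be sketched as a context manager that writes the temporary credentials, hands back an `ssh` command line, and removes the files on exit. This is a sketch under assumptions, not the exact commands we ran: the Lightsail client is passed in (for example one created with boto3), the `accessDetails` fields used here (`privateKey`, `certKey`, `username`, `ipAddress`) are the ones returned by GetInstanceAccessDetails, and `temporary_ssh_command` is a hypothetical helper name.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


def _write_private(path, text):
    # Create the file with mode 0600 from the start so it is never group- or world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(text)


@contextmanager
def temporary_ssh_command(lightsail_client, instance_name):
    """Yield an ssh argv list backed by short-lived Lightsail credentials."""
    details = lightsail_client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )["accessDetails"]
    workdir = tempfile.mkdtemp(prefix="lightsail-ssh-")  # directory is created with mode 0700
    key_path = os.path.join(workdir, "id_temp")
    cert_path = key_path + "-cert.pub"
    try:
        _write_private(key_path, details["privateKey"])
        _write_private(cert_path, details["certKey"])
        yield [
            "ssh",
            "-i", key_path,
            "-o", f"CertificateFile={cert_path}",
            f"{details['username']}@{details['ipAddress']}",
        ]
    finally:
        # Remove the key material whether or not the session succeeded.
        shutil.rmtree(workdir, ignore_errors=True)


# Usage (requires boto3 and AWS credentials; region and instance name are placeholders):
#   import boto3, subprocess
#   client = boto3.client("lightsail", region_name="us-east-1")
#   with temporary_ssh_command(client, "my-instance") as cmd:
#       subprocess.run(cmd + ["systemctl", "status", "jada-agent.service"], check=True)
```

Tying cleanup to the `finally` block means the key files disappear even if the SSH command raises, which is the property that matters most here.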
Daemon Polling and Task Dispatch
The jada-agent daemon implements a pull-based task model: it polls a progress dashboard at regular intervals, picks up queued tasks, executes them via Claude agent sessions, and reports results back. The 30-turn limit per Claude session is a cost-control and determinism measure: long conversations risk token exhaustion and unpredictable behavior. Complex tasks that exceed this budget need to be partitioned into multiple sessions, each logged separately.
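A minimal sketch of such a pull-based loop is shown below. It is not the daemon's actual code: `fetch_tasks`, `run_session`, and `report` are hypothetical callables standing in for the dashboard poll, the Claude agent session, and the result upload, and the 60-second interval and 30-turn budget are taken from the figures in this post.

```python
import time


def poll_loop(fetch_tasks, run_session, report,
              interval_s=60, max_turns=30,
              sleep=time.sleep, max_cycles=None):
    """Poll for queued tasks, run one agent session per task, and report results.

    max_cycles bounds the loop for testing; None means run until interrupted.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for task in fetch_tasks():
            # Each task gets its own session with a fixed turn budget.
            result = run_session(task, max_turns=max_turns)
            report(task, result)
        cycles += 1
        # Sleep between polls so an empty queue costs almost no CPU.
        sleep(interval_s)
    return cycles
```

Injecting `sleep` and `max_cycles` keeps the loop testable without waiting a real minute per cycle, and an empty queue simply falls through to the sleep, which matches the idle behavior seen in the metrics.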
Key Decisions and Rationale
Why the turn-limit exits are not alarming: Exit code 1 on max turns is a graceful stop at the session budget, not a failure. The daemon does not retry the same task indefinitely; instead, it logs the exit, marks the session as complete, and moves on. If a task genuinely requires more than 30 turns, it should be split into subtasks, each with its own agent session. This enforces modularity and prevents runaway API costs.
Why we prioritized the OAuth token issue: Unlike turn-limit exits (which are working as designed), the port sheet sync failure represents a genuine service degradation. Every scheduled 30-minute sync is failing, so no data is being synced, which could affect downstream reporting or integrations. Re-authentication is required before syncs can resume.
What's Next
To resolve the identified issues:
- Port Sheet Sync OAuth: Re-authenticate the Google OAuth credentials for port_sheet_sync.py. If the stored refresh token is still valid, it can be used to obtain a fresh access token; otherwise a new OAuth consent flow is needed. Update the credential file and verify that the next 30-minute sync completes successfully.
- Turn-Limit Monitoring: Log and trend the frequency of max-turn exits. If they occur on a large fraction of agent runs, consider either increasing the per-session turn budget or decomposing complex tasks into smaller, independent subtasks at the queue level.
- Continued Observation: Monitor daemon logs over the next 24–48 hours to confirm stability. Watch for new error patterns or resource anomalies.
The daemon itself is healthy and performing its intended function. Once the OAuth token is re-authenticated, the port sheet sync will resume and the system will return to full operational capacity.