Diagnosing and Remediating OAuth Token Failures in the JADA Agent Orchestrator
During a routine health check of the jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28), we discovered a critical authentication failure in the port sheet synchronization subsystem. This post walks through the diagnostic approach, root cause analysis, and remediation strategy for OAuth token degradation in long-running daemon processes.
What Was Done
We performed a comprehensive health audit of the jada-agent.service daemon, which runs on a Lightsail instance and manages asynchronous task orchestration for the JADA platform. The audit involved:
- Establishing SSH connectivity to the Lightsail instance via temporary credentials from the Lightsail API
- Collecting service status, systemd logs, and runtime metrics
- Analyzing 24-hour task execution history and session accounting
- Identifying persistent OAuth token failures in the port_sheet_sync subprocess
- Documenting daemon behavior under the 30-turn Claude API limit
Technical Details: Establishing Connectivity
The jada-key SSH private key was not stored in the standard ~/.ssh/ directory, necessitating an alternate approach. Rather than hunting for locally stored keys, we leveraged AWS Lightsail's API to generate temporary SSH credentials:
# Fetch instance access details from Lightsail API
aws lightsail get-instance-access-details \
--instance-name jada-agent-primary \
--region us-east-1
# Extract the temporary certificate and private key from the response
# Write them to a temporary file with restricted permissions (600)
# Connect via SSH using the temporary key pair
ssh -i /tmp/jada_temp_key.pem ubuntu@34.239.233.28
This approach eliminated dependency on locally cached keys and provided auditable, time-limited access credentials. The Lightsail API handles key rotation and expiration transparently.
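The extraction steps described in the comments above can be sketched in Python. This is a minimal sketch, not the procedure actually used during the audit: it assumes the CLI response has been parsed from JSON into a dict and that it carries the key and certificate under accessDetails.privateKey and accessDetails.certKey, as the Lightsail API documents. The helper name write_temp_credentials is hypothetical. The certificate is written next to the key with a -cert.pub suffix because that is where OpenSSH looks for a certificate matching an identity file.

```python
import os
from pathlib import Path


def write_temp_credentials(response: dict, key_path: Path) -> Path:
    """Write the temporary key and certificate from a Lightsail
    get-instance-access-details response. Returns the certificate path.

    Hypothetical helper; the field names follow the documented response shape.
    """
    details = response["accessDetails"]
    # OpenSSH loads a certificate from "<identity file>-cert.pub" automatically.
    cert_path = Path(str(key_path) + "-cert.pub")
    files = ((key_path, details["privateKey"]), (cert_path, details["certKey"]))
    for path, content in files:
        # Create the file with mode 0600 so the key is never world-readable.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # Enforce the mode even if the file already existed with looser bits.
        os.chmod(path, 0o600)
    return cert_path
```

With the files in place, the ssh -i command shown above picks up the certificate without any extra flags.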
Daemon Health Findings
The jada-agent.service itself is in excellent condition:
- Uptime: 3 days (since May 10) with zero service restarts
- Resource utilization: CPU 0.65% average, 144MB / 914MB memory (15.7% utilization), 6.2GB / 39GB disk (17%)
- Load average: 0.00 — daemon is idle between task polling cycles
- Status checks: 0 failures in the last 2 hours; instance health is nominal
The service is implemented as a systemd unit file that spawns the daemon process with automatic restart-on-failure enabled. The 60-second polling interval keeps the daemon responsive without creating excessive CPU load.
Task Execution Analysis: Session Accounting and Turn Limits
Over the past 24 hours (UTC May 13), the daemon consumed 3 of 5 available agent sessions:
- Session 1 (00:00 UTC): Hit the 30-turn Claude API limit and exited with code 1. This is not a crash but a controlled exit once the turn budget was spent; the daemon logged it and continued polling.
- Session 2 (00:02 UTC): Completed successfully. Processed e-signature and crew page blockers, created a high-priority task in the progress dashboard.
- Session 3 (00:05 UTC): Also hit the 30-turn limit (exit code 1) but had already completed meaningful work before exhausting turns.
No new tasks were queued after Session 3; the daemon is idling normally. Yesterday's pattern showed all 5 sessions consumed by 23:55 UTC, with 3 pending tasks clearing at the midnight UTC rollover when the session quota resets. This is expected behavior for the current task volume.
Why the turn limit matters: The 30-turn constraint is a safety boundary in the Claude API integration. Multi-step tasks (like iterative code generation or complex data transforms) may exceed this budget. Sessions 1 and 3 hitting the limit is not a failure; it is a signal that either the task should be decomposed into smaller steps or the per-session turn budget should be raised.
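The accounting described in this section can be illustrated with a short Python sketch. It is not the daemon's actual code: the class and function names are hypothetical, and it assumes the daemon can see how many turns a session used, which is what allows a turn-limit exit to be told apart from a genuine error when both return exit code 1. The limits (5 sessions per day, 30 turns per session, reset at midnight UTC) are taken from the text.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional

MAX_SESSIONS_PER_DAY = 5    # daily quota described in the post
MAX_TURNS_PER_SESSION = 30  # per-session turn limit described in the post


@dataclass
class SessionQuota:
    """Tracks sessions used per UTC day (hypothetical helper)."""
    day: Optional[date] = None
    used: int = 0

    def try_start(self, now: datetime) -> bool:
        """Reserve a session slot; returns False if today's quota is spent."""
        today = now.astimezone(timezone.utc).date()
        if today != self.day:
            # Midnight UTC rollover: the quota resets.
            self.day, self.used = today, 0
        if self.used >= MAX_SESSIONS_PER_DAY:
            return False
        self.used += 1
        return True


def classify_exit(exit_code: int, turns_used: int) -> str:
    """Separate turn-limit exits from real failures (assumes turns are logged)."""
    if exit_code == 0:
        return "completed"
    if turns_used >= MAX_TURNS_PER_SESSION:
        return "turn_limit"
    return "error"
```

Counting turn_limit outcomes separately from error outcomes would make it easier to decide whether a task type needs decomposition or a larger budget.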
Critical Issue: port_sheet_sync OAuth Token Failure
Every 30-minute invocation of port_sheet_sync.py has failed since at least the afternoon of May 13 (UTC). The daemon logs show:
[port-sheet] token error: HTTP Error 400: Bad Request
This indicates the OAuth token stored for Google Sheets API access (likely in /home/ubuntu/.jada-secrets/port_sheet_oauth.json or similar) is expired, revoked, or malformed.
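Root cause analysis: An HTTP 400 from Google's token endpoint during a refresh usually means the refresh token itself was rejected (the invalid_grant error). Google refresh tokens have no fixed maximum lifetime, but Google invalidates them under several conditions: the token has gone unused for six months, the user has revoked the application's access, or the OAuth consent screen is in Testing status with an external user type, in which case refresh tokens expire after 7 days. A transient network issue is an unlikely cause here, since it would not produce the same 400 response on every cycle.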
Impact: Port sheet syncs have not run in 24+ hours. Any downstream processes relying on fresh port sheet data are working with stale state.
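The logged message carries only the HTTP status line, which hides the detail needed to confirm the cause. The token endpoint's error body is a small JSON document with an error field (per the OAuth 2.0 specification), and reading it distinguishes a dead refresh token from bad client credentials. The Python sketch below shows one way to surface it. It is an illustration only: the function names are hypothetical, the hint texts are my own summaries, and the use of urllib is an assumption based on the log line resembling the string form of urllib.error.HTTPError, not a confirmed detail of port_sheet_sync.py.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint

# Standard OAuth 2.0 error codes mapped to hints (hint wording is illustrative).
HINTS = {
    "invalid_grant": "refresh token expired or revoked; re-run the consent flow",
    "invalid_client": "client_id or client_secret rejected; check client credentials",
    "invalid_request": "malformed request; check the token file for missing fields",
}


def diagnose_token_error(body: bytes) -> str:
    """Turn a token-endpoint error body into a one-line diagnosis."""
    try:
        payload = json.loads(body)
    except ValueError:
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unexpected error body"
    code = str(payload.get("error", "unknown"))
    hint = HINTS.get(code, payload.get("error_description", "no hint available"))
    return f"{code}: {hint}"


def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """Attempt a refresh and raise with the diagnosed cause on an HTTP error."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=data)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        # exc.read() returns the JSON error body that the bare status line hides.
        raise RuntimeError(diagnose_token_error(exc.read())) from exc
```

If the diagnosis comes back as invalid_grant, re-authentication is the only fix; retrying with the same refresh token will keep failing.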
Remediation Path
To fix the OAuth token failure, the port_sheet_sync.py script must be re-authenticated:
- Run the OAuth flow directly on the Lightsail instance or locally, using the stored client credentials (client_id and client_secret) to generate a new refresh token.
- The auth tool at /Users/cb/Documents/repos/tools/auth_ga.py (or an analogous port_sheet_auth.py) should be executed with the appropriate service account email.
- Write the new token to the secrets directory with appropriate file permissions (600).
- The daemon will pick up the refreshed token on the next 30-minute sync cycle.
Note: The auth_ga.py tool requires google-auth-oauthlib to be installed in the Python environment. This can be verified with pip list | grep google and installed via pip install google-auth-oauthlib if missing.
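For orientation, the re-authentication steps can be sketched with google-auth-oauthlib as follows. This is not the contents of auth_ga.py, which were not inspected for this post. The scope, the function names, and the token file layout (the output of Credentials.to_json()) are all assumptions, and port_sheet_sync.py may expect a different format. Note that run_local_server needs a browser, so on a headless Lightsail instance the flow would typically be run locally and the resulting file copied over.

```python
import os
from pathlib import Path

# Assumed scope for read/write access to Google Sheets.
SCOPES = ["https://www.googleapis.com/auth/spreadsheets"]


def obtain_token_json(client_secrets_path: str) -> str:
    """Run the interactive OAuth flow and return the credentials as JSON."""
    # Imported lazily so this module loads even if the package is missing.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file(client_secrets_path, scopes=SCOPES)
    creds = flow.run_local_server(port=0)  # opens a browser for user consent
    return creds.to_json()


def save_token(token_json: str, token_path: Path) -> None:
    """Write the token with mode 0600, swapping it into place atomically."""
    token_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = token_path.with_name(token_path.name + ".tmp")
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token_json)
    os.chmod(tmp_path, 0o600)
    # os.replace is atomic on POSIX, so a running sync never reads a partial file.
    os.replace(tmp_path, token_path)
```

Writing to a temporary file and renaming it means the next 30-minute sync cycle sees either the old token or the complete new one, never a half-written file.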
Infrastructure Notes
The daemon runs on a dedicated AWS Lightsail instance with:
- Instance type: Standard (likely 1GB RAM / 2vCPU, based on the 914MB of total memory observed)
- Region: us-east-1
- Static IP: 34.239.233.28
- Service manager: systemd (unit file location: /etc/systemd/system/jada-agent.service)
Metrics are collected via the Lightsail API (CPU, network, status checks). No CloudWatch agent is required; Lightsail's native metrics are sufficient for this workload.
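As an illustration of pulling these metrics programmatically, the Python sketch below queries CPU utilization through boto3's Lightsail client and averages the returned datapoints. It is a sketch under assumptions rather than the tooling used in the audit: the function names are hypothetical, the 5-minute period and 2-hour window are arbitrary choices, and the instance name is the one used in the CLI example earlier.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Optional


def fetch_cpu_datapoints(instance_name: str, hours: int = 2) -> list:
    """Fetch average CPU utilization datapoints from the Lightsail API."""
    import boto3  # imported lazily; requires boto3 and AWS credentials

    client = boto3.client("lightsail", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    response = client.get_instance_metric_data(
        instanceName=instance_name,
        metricName="CPUUtilization",
        period=300,  # seconds per datapoint (assumed granularity)
        startTime=end - timedelta(hours=hours),
        endTime=end,
        unit="Percent",
        statistics=["Average"],
    )
    return response["metricData"]


def average_cpu(datapoints: list) -> Optional[float]:
    """Mean of the per-period averages, or None when no data was returned."""
    values = [point["average"] for point in datapoints if "average" in point]
    return mean(values) if values else None
```

A call such as average_cpu(fetch_cpu_datapoints("jada-agent-primary")) would give a single figure comparable to the CPU average quoted in the health findings.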
Key Decisions
- Temporary SSH credentials via API: Eliminates local key management and provides audit trails. Each access creates a short-lived certificate tied to a specific instance.
- 30-second daemon poll interval: Balances responsiveness against CPU load. For this workload (5 sessions per day, sporadic tasks), the daemon is idle most of the time, which is correct.
- 5-session daily quota: Enfor