Diagnosing and Remediating the JADA Agent Daemon: OAuth Token Failures and Session Management
During a routine health check of the orchestrator daemon running on our primary Lightsail instance (34.239.233.28), we discovered a critical OAuth token failure in the port sheet synchronization pipeline and identified patterns in agent session management that require architectural attention. This post documents the diagnostic process, findings, and remediation strategy.
What Was Done
We performed a comprehensive health audit of the jada-agent.service systemd unit by:
- Establishing SSH access via AWS Lightsail temporary credentials (since the local private key was unavailable)
- Collecting service status, recent logs, and resource utilization metrics
- Pulling CloudWatch metrics from the Lightsail API for CPU, memory, network, and status checks
- Analyzing daemon activity logs and session exit codes from the past 24 hours
- Identifying the root cause of recurring port_sheet_sync failures
Technical Details: Daemon Health Findings
Service Status and Resource Utilization
The daemon has been running continuously since May 10, maintaining 11 days of uptime with healthy resource metrics:
- Service state: Active (running)
- CPU utilization: 0.65% average across a 60-second polling cycle, with no observable spikes
- Memory footprint: 144MB / 914MB allocated — well within safe thresholds
- Disk usage: 6.2GB / 39GB (17% used) — adequate headroom for logs and data
- Load average: 0.00 — essentially idle between task pickups
- AWS status checks: Zero failures in the last two hours
The daemon's low CPU footprint reflects its poll-and-sleep design: it polls the progress dashboard for pending tasks, picks them up in sequence, and sleeps between completions. This is the intended behavior.
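The poll, process, sleep cycle described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: `fetch_pending` and `handle` are hypothetical callables standing in for the dashboard query and the task runner, and the 60-second default is taken from the polling cycle mentioned in the metrics.

```python
import time
from typing import Callable, Iterable


def run_poll_cycle(
    fetch_pending: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = 60.0,
) -> int:
    """Run one polling cycle and return the number of tasks processed.

    fetch_pending and handle are hypothetical stand-ins for the dashboard
    query and the task runner; they are not the daemon's real function names.
    """
    tasks = list(fetch_pending())
    for task in tasks:  # tasks are picked up in sequence, one at a time
        handle(task)
    sleep(interval_s)  # idle until the next poll
    return len(tasks)
```

Because the process spends almost all of its time inside `sleep`, a near-zero load average between task pickups is what this kind of loop would be expected to produce.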
Session Activity and Exit Codes
Over the past 24 hours (UTC), the daemon consumed 3 of its 5 available daily sessions:
- Session 1 (00:00 UTC): Exited with code 1 after hitting the 30-turn limit; the agent reached the maximum number of Claude turns mid-task
- Session 2 (00:02 UTC): Completed successfully; processed e-signature page blockers and crew page generator code, and created a needs-you task for manual review
- Session 3 (00:05 UTC): Exited with code 1 after hitting the 30-turn limit; the same max-turn constraint triggered
- After 00:05 UTC: No additional pending tasks detected; daemon idling normally
The two "max turns" exits are not crashes — the daemon correctly logs them as non-fatal errors (exit code 1) and continues operation. However, this pattern suggests that complex tasks are consuming the full 30-turn budget before completion. This is a recurring architectural constraint worth monitoring.
Critical Issue: Broken OAuth Token in Port Sheet Sync
The most significant finding is a persistent failure in the port_sheet_sync.py script, which runs every 30 minutes as a scheduled task. All syncs since at least this afternoon have been failing with:
[port-sheet] token error: HTTP Error 400: Bad Request
Root cause: The Google OAuth token stored for the port sheet service account (used to authenticate against the Google Sheets API) has expired or been revoked. Without a valid token, the sync cannot write updated port assignments back to the canonical sheet.
Impact: Port sheet synchronization has been non-functional for an unknown duration. Changes to port assignments, whether made through the agent or by hand, are not being written back to the source sheet.
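The logged message carries only the HTTP status line, which is not enough to tell an expired or revoked token apart from a malformed request. OAuth 2.0 token endpoints return a JSON body with an `error` field on failure, so logging that body would narrow the diagnosis. The helper below is a sketch under assumptions: `explain_token_error` is a hypothetical name, and it presumes the sync script can get at the status code and raw response body (for example from a caught `urllib.error.HTTPError`, whose string form matches the log line above).

```python
import json


def explain_token_error(status: int, body: bytes) -> str:
    """Turn a token-endpoint error response into a one-line diagnosis.

    Hypothetical helper; the real sync script may structure this differently.
    """
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and invalid JSON
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    error = payload.get("error", "unknown_error")
    detail = payload.get("error_description", "")
    if status == 400 and error == "invalid_grant":
        hint = "refresh token expired or revoked; re-authentication needed"
    elif error == "invalid_client":
        hint = "client ID or secret rejected; check the stored credentials"
    else:
        hint = "see the response body for details"
    return f"[port-sheet] token error: HTTP {status} {error} ({detail}): {hint}"
```

In an `except urllib.error.HTTPError as exc:` handler this would be called as `explain_token_error(exc.code, exc.read())`. An `invalid_grant` result would support the expired-or-revoked hypothesis; any other value would point elsewhere.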
Infrastructure and Architecture
Lightsail Instance Configuration
The orchestrator runs on an AWS Lightsail instance at 34.239.233.28. Key characteristics:
- SSH key pair: jada-key (Lightsail key pair)
- Operating system: Linux-based (Amazon Linux or Ubuntu)
- Primary service: jada-agent.service (systemd unit)
- Storage: 39GB root volume, currently 17% utilized
- Access method: Temporary SSH credentials via Lightsail API (preferred over stored keys for security)
Agent Session Management
The daemon implements a daily session quota (5 sessions per day) and a per-session turn limit (30 Claude turns per session). This design prevents runaway API costs and bounds how long any single task can run:
- Sessions roll over at UTC midnight
- Each session is a fresh Claude conversation context
- The 30-turn limit forces task prioritization and scope reduction
- Exit code 1 on max turns is logged but non-fatal; the daemon continues polling
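The quota and turn-limit rules listed above can be illustrated with a small sketch. It is not the daemon's implementation: `SessionBudget`, `run_session`, and `take_turn` are hypothetical names, and only the limits (5 sessions, 30 turns, UTC rollover, exit code 1 on max turns) come from the observed behavior.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Callable, Optional


@dataclass
class SessionBudget:
    """Daily session quota that resets at UTC midnight (hypothetical sketch)."""

    max_sessions_per_day: int = 5
    used_today: int = 0
    current_day: Optional[date] = None

    def try_start_session(self, now: datetime) -> bool:
        """Consume one session if quota remains; now must be timezone-aware."""
        today = now.astimezone(timezone.utc).date()
        if today != self.current_day:  # first call after UTC midnight
            self.current_day = today
            self.used_today = 0
        if self.used_today >= self.max_sessions_per_day:
            return False
        self.used_today += 1
        return True


def run_session(take_turn: Callable[[], bool], max_turns: int = 30) -> int:
    """Run turns until take_turn() reports completion or the budget runs out.

    Returns 0 on completion and 1 when the turn limit is hit, mirroring the
    non-fatal exit code described above.
    """
    for _ in range(max_turns):
        if take_turn():
            return 0
    return 1
```

Keeping the quota check separate from the turn loop means a max-turn exit still consumes a session, which matches the 3-of-5 count reported for the day.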
OAuth Token Management
The port sheet sync uses a stored Google OAuth token (likely in a credentials file, possibly repos.env or a dedicated secrets directory). The token must be refreshed periodically or reissued through re-authentication. Current state: the token is stale, so every write to the port assignment sheet fails.
Key Decisions and Reasoning
Why We Used Lightsail Temporary Credentials Instead of Stored Keys
The local private key for the jada-key pair was not present in the expected location (~/.ssh/jada-key). Rather than attempt key recovery, we used the AWS Lightsail API endpoint to generate temporary SSH credentials. This approach:
- Eliminates dependency on locally stored private keys
- Provides time-limited access (credentials expire automatically)
- Requires no key material in the development environment
- Integrates with AWS IAM permissions and audit trails
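For reference, the temporary-credential flow can be sketched with boto3's Lightsail client, whose `get_instance_access_details` call returns a short-lived private key and signed certificate. The function names, the file name `lightsail_temp_key`, and the instance name passed in are assumptions for illustration; this is not the exact procedure we ran.

```python
import os
from typing import Any, Dict, Tuple


def fetch_access_details(instance_name: str) -> Dict[str, Any]:
    """Request short-lived SSH credentials from the Lightsail API."""
    import boto3  # imported lazily so the helper below works without boto3

    client = boto3.client("lightsail")
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_ssh_material(details: Dict[str, Any], directory: str) -> Tuple[str, str]:
    """Write the temporary key and certificate; return (key_path, ssh_target)."""
    key_path = os.path.join(directory, "lightsail_temp_key")  # hypothetical name
    cert_path = key_path + "-cert.pub"  # OpenSSH looks for <key>-cert.pub
    for path, content in (
        (key_path, details["privateKey"]),
        (cert_path, details["certKey"]),
    ):
        # Create with owner-only permissions; ssh rejects world-readable keys.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
    return key_path, f"{details['username']}@{details['ipAddress']}"
```

With the files written, connecting is `ssh -i <key_path> <ssh_target>`. The response also includes an `expiresAt` timestamp, after which the certificate is no longer accepted.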
Why Port Sheet Sync Failure Is Critical
The port sheet is the canonical source of truth for port assignments across the fleet. If the sync is broken, manual edits made through the agent dashboard or crew-facing interfaces will not propagate back to the source sheet, creating data inconsistency and confusion downstream. This requires immediate remediation.
Why Max-Turn Exits Are Not Fatal But Worth Monitoring
Exit code 1 on max turns is a design feature, not a bug: it forces task creators to scope work appropriately. However, two such exits in one day suggest one of the following:
- Task complexity is increasing and requiring more agent reasoning steps
- Prompt engineering could be improved to reduce turn count
- The 30-turn limit may need to be raised if tasks are legitimately complex
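One lightweight way to monitor this is to track the share of sessions that end on the turn limit. The sketch below uses the three sessions reported above as sample data; `SessionResult`, `max_turn_ratio`, the exit code 0 for the successful session, and the 0.5 threshold are all assumptions for illustration rather than anything the daemon currently records.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SessionResult:
    """One finished session (hypothetical record shape)."""

    label: str
    exit_code: int
    hit_max_turns: bool


def max_turn_ratio(results: Iterable[SessionResult]) -> float:
    """Fraction of sessions that ended by exhausting the turn budget."""
    items = list(results)
    if not items:
        return 0.0
    return sum(1 for r in items if r.hit_max_turns) / len(items)


# The three sessions from the 24-hour window; exit code 0 is assumed for
# the session that completed successfully.
today = [
    SessionResult("00:00 UTC", 1, True),
    SessionResult("00:02 UTC", 0, False),
    SessionResult("00:05 UTC", 1, True),
]

# 0.5 is an arbitrary example threshold, not a tuned value.
needs_attention = max_turn_ratio(today) > 0.5
```

Tracked over several days, a ratio that stays high would argue for raising the limit or splitting tasks, while an isolated spike would point at one unusually large task.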
What's Next
- Re-authenticate the Google OAuth token for port_sheet_sync.py by running the auth script (likely auth_ga.py or a port-sheet-specific equivalent) with the