```html

Diagnosing and Remediating OAuth Token Failures in the JADA Agent Orchestrator

During a routine health check of the jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28), we found a critical authentication failure in the port sheet synchronization subsystem. This post walks through the diagnosis, the likely root causes, and the remediation plan for OAuth token failures in long-running daemon processes.

What Was Done

We performed a comprehensive health audit of the jada-agent.service daemon, which runs on a Lightsail instance and manages asynchronous task orchestration for the JADA platform. The audit involved:

  • Establishing SSH connectivity to the Lightsail instance via temporary credentials from the Lightsail API
  • Collecting service status, systemd logs, and runtime metrics
  • Analyzing 24-hour task execution history and session accounting
  • Identifying persistent OAuth token failures in the port_sheet_sync subprocess
  • Documenting daemon behavior under the 30-turn Claude API limit

Technical Details: Establishing Connectivity

The jada-key SSH private key was not stored in the standard ~/.ssh/ directory, necessitating an alternate approach. Rather than hunting for locally stored keys, we leveraged AWS Lightsail's API to generate temporary SSH credentials:

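# Fetch temporary access details from the Lightsail API (jq is assumed to be installed)
umask 077  # files created below are readable by the owner only (600)
aws lightsail get-instance-access-details \
  --instance-name jada-agent-primary \
  --protocol ssh \
  --region us-east-1 \
  --output json > /tmp/jada_access.json

# Extract the temporary private key and its signed certificate, then drop the raw response
jq -r '.accessDetails.privateKey' /tmp/jada_access.json > /tmp/jada_temp_key.pem
jq -r '.accessDetails.certKey' /tmp/jada_access.json > /tmp/jada_temp_key.pem-cert.pub
rm /tmp/jada_access.json

# Connect via SSH using the temporary key pair and its certificate
ssh -i /tmp/jada_temp_key.pem \
  -o CertificateFile=/tmp/jada_temp_key.pem-cert.pub \
  ubuntu@34.239.233.28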

This approach eliminated dependency on locally cached keys and provided auditable, time-limited access credentials. The Lightsail API handles key rotation and expiration transparently.
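
The extraction step described in the comments above could also be scripted in Python. The following is a sketch, not the procedure used during the audit: it assumes the CLI response was captured as JSON with the key material under `accessDetails`, and the helper name `write_temp_credentials` and the file names are hypothetical.

```python
import json
import os


def write_temp_credentials(response_json: str, directory: str) -> list[str]:
    """Write the temporary key pair to disk (mode 600) and return an ssh argv.

    Assumes the JSON shape printed by `aws lightsail get-instance-access-details`:
    {"accessDetails": {"privateKey": ..., "certKey": ..., "username": ..., "ipAddress": ...}}
    """
    details = json.loads(response_json)["accessDetails"]
    key_path = os.path.join(directory, "jada_temp_key.pem")
    cert_path = key_path + "-cert.pub"  # ssh's conventional certificate file name

    for path, content in ((key_path, details["privateKey"]), (cert_path, details["certKey"])):
        # O_EXCL refuses to reuse an existing file; 0o600 keeps it owner-only.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content.rstrip("\n") + "\n")

    return [
        "ssh", "-i", key_path,
        "-o", f"CertificateFile={cert_path}",
        f"{details['username']}@{details['ipAddress']}",
    ]
```

The returned list can be passed to `subprocess.run`. Creating the files with restrictive permissions at open time, rather than calling chmod afterwards, avoids a short window in which the private key is world-readable.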

Daemon Health Findings

The jada-agent.service itself is in excellent condition:

  • Uptime: 3 days (since May 10) with zero service restarts
  • Resource utilization: CPU 0.65% average, 144MB / 914MB memory (15.7% utilization), 6.2GB / 39GB disk (17%)
  • Load average: 0.00; the daemon is idle between task polling cycles
  • Status checks: 0 failures in the last 2 hours; instance health is nominal

The service is defined by a systemd unit file that starts the daemon process with automatic restart on failure enabled. The 60-second polling interval keeps the daemon responsive without creating meaningful CPU load.
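
The zero-restart observation can be checked mechanically rather than read off the logs. A minimal sketch, assuming the output of `systemctl show jada-agent.service -p ActiveState -p NRestarts` is captured as text; the function names are hypothetical and the definition of "healthy" is ours, not systemd's.

```python
def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show <unit> -p ...`."""
    properties = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            properties[key.strip()] = value.strip()
    return properties


def is_healthy(properties: dict[str, str]) -> bool:
    """Healthy here means: currently active and never automatically restarted."""
    return properties.get("ActiveState") == "active" and properties.get("NRestarts") == "0"
```

A non-zero `NRestarts` on an otherwise active unit would indicate that restart-on-failure has been masking crashes, which a plain status check does not reveal.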

Task Execution Analysis: Session Accounting and Turn Limits

Over the past 24 hours (UTC May 13), the daemon consumed 3 of 5 available agent sessions:

  • Session 1 (00:00 UTC): Hit the 30-turn Claude API limit and exited with code 1. This is not a crash but a graceful timeout; the daemon logged it and continued polling.
  • Session 2 (00:02 UTC): Completed successfully. Processed e-signature and crew page blockers, created a high-priority task in the progress dashboard.
  • Session 3 (00:05 UTC): Also hit the 30-turn limit (exit code 1) but had already completed meaningful work before exhausting turns.

No new tasks were queued after Session 3; the daemon is idling normally. Yesterday's pattern showed all 5 sessions consumed by 23:55 UTC, with 3 pending tasks clearing at the midnight UTC rollover when the session quota resets. This is expected behavior for the current task volume.
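
The quota behavior described above (five sessions per UTC day, pending tasks clearing at the midnight rollover) can be modeled in a few lines. This is an illustrative sketch only; we have not reproduced the daemon's actual accounting code, and the class name `DailySessionQuota` is hypothetical.

```python
from datetime import datetime, timezone


class DailySessionQuota:
    """Counts sessions per UTC calendar day; the count resets at midnight UTC."""

    def __init__(self, limit: int = 5):
        self.limit = limit
        self._day = None
        self._used = 0

    def try_start(self, now: datetime) -> bool:
        """Consume one session if quota remains for now's UTC day.

        `now` must be timezone-aware so the UTC conversion is unambiguous.
        """
        today = now.astimezone(timezone.utc).date()
        if today != self._day:  # first call of a new UTC day: reset the counter
            self._day, self._used = today, 0
        if self._used >= self.limit:
            return False  # the task stays pending until the next rollover
        self._used += 1
        return True
```

Under this model a task arriving at 23:55 UTC with the quota exhausted is refused, and the same task is accepted on the first poll after 00:00 UTC, which matches the pattern observed in the logs.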

Why the turn limit matters: The 30-turn constraint is a safety boundary in the Claude API integration. Multi-step tasks (such as iterative code generation or complex data transforms) can exceed this budget. Sessions 1 and 3 hitting the limit is not a failure in itself; it is a signal that either the tasks should be decomposed into smaller units or the per-session turn budget should be raised.
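
To make the exit-code semantics concrete, here is a sketch of a turn-budgeted loop that returns 1 when the budget runs out, mirroring what the logs show. How the daemon actually enforces the limit is not something we inspected; `run_session` and `take_turn` are hypothetical names.

```python
from typing import Callable

MAX_TURNS = 30  # per-session budget described in this post


def run_session(take_turn: Callable[[int], bool], max_turns: int = MAX_TURNS) -> int:
    """Call take_turn(turn_number) until it reports completion or the budget runs out.

    Returns a process-style exit code: 0 when the task finished,
    1 when the turn budget was exhausted first.
    """
    for turn in range(1, max_turns + 1):
        if take_turn(turn):  # True means the task is done
            return 0
    return 1
```

Because work done in earlier turns is not rolled back, an exit code of 1 says only that the session stopped at the boundary, not that nothing useful happened, which is the situation Session 3 was in.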

Critical Issue: port_sheet_sync OAuth Token Failure

Every 30-minute invocation of port_sheet_sync.py has failed since at least the afternoon of May 13 (UTC). The daemon logs show:

[port-sheet] token error: HTTP Error 400: Bad Request

This indicates the OAuth token stored for Google Sheets API access (likely in /home/ubuntu/.jada-secrets/port_sheet_oauth.json or similar) is expired, revoked, or malformed.

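Root cause analysis: Google OAuth 2.0 refresh tokens have no fixed expiry, but Google invalidates them under several conditions: the token goes unused for six months, the user or an administrator revokes access, or the OAuth client belongs to an external app whose consent screen is still in "Testing" status, in which case refresh tokens expire after 7 days. An HTTP 400 from the token endpoint is consistent with any of these. The response body normally carries a specific error code such as invalid_grant, which the current log line does not capture. A transient network problem is a less likely explanation, since it would not produce the same 400 on every attempt.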
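
The logged message has the same form as the string representation of Python's `urllib.error.HTTPError`, which omits the response body. Assuming the sync script refreshes its token with plain `urllib` (an inference from the log format, not something we confirmed), a refresh routine along the following lines would log the actual reason. The function names are hypothetical; the endpoint and form fields are Google's standard refresh-token exchange.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def explain_token_error(body: bytes) -> str:
    """Turn a token-endpoint error body into a one-line, loggable reason."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unparseable error body"
    error = payload.get("error", "unknown_error")
    description = payload.get("error_description", "")
    return f"{error}: {description}" if description else error


def refresh_access_token(client_id: str, client_secret: str, refresh_token: str) -> dict:
    """Exchange a refresh token for an access token, surfacing the failure reason."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    request = urllib.request.Request(TOKEN_URL, data=form, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        # str(exc) is only "HTTP Error 400: Bad Request"; the body says why.
        raise RuntimeError(f"token error: {explain_token_error(exc.read())}") from exc
```

With this in place, an `invalid_grant` reason in the log would point to an expired or revoked refresh token, while `invalid_client` would point to a problem with the client_id or client_secret instead.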

Impact: Port sheet syncs have not run in 24+ hours. Any downstream processes relying on fresh port sheet data are working with stale state.

Remediation Path

To fix the OAuth token failure, port_sheet_sync.py must be re-authenticated:

  • Run the OAuth flow directly on the Lightsail instance or locally, using the stored client credentials (client_id and client_secret) to generate a new refresh token.
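  • The auth tool at /Users/cb/Documents/repos/tools/auth_ga.py (or an analogous port_sheet_auth.py) should be executed while signed in as the Google account that has access to the port sheet. If that identity is a true Google Cloud service account, it authenticates with a key file rather than a refresh token, and this flow does not apply.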
  • Write the new token to the secrets directory with appropriate file permissions (600).
  • The daemon will pick up the refreshed token on the next 30-minute sync cycle.

Note: The auth_ga.py tool requires google-auth-oauthlib to be installed in the Python environment. This can be verified with pip list | grep google and installed via pip install google-auth-oauthlib if missing.
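
For reference, the re-authentication steps above could be sketched as follows with google-auth-oauthlib. We did not inspect auth_ga.py, so this is not its implementation. The scope, the function names, and the assumption that port_sheet_sync.py reads the "authorized user" JSON produced by `Credentials.to_json()` are all unverified.

```python
import os

# Assumed scope: full read/write access to Google Sheets.
SCOPES = ["https://www.googleapis.com/auth/spreadsheets"]


def save_token(token_json: str, path: str) -> None:
    """Write the token file so that only the owner can read it (mode 600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token_json)
    os.chmod(path, 0o600)  # also tightens the mode of a pre-existing file


def reauthorize(client_secrets_path: str, token_path: str) -> None:
    """Run the browser-based consent flow and store the resulting credentials."""
    # Imported lazily so this module loads even where the package is missing.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file(client_secrets_path, SCOPES)
    credentials = flow.run_local_server(port=0)  # opens a browser on this machine
    save_token(credentials.to_json(), token_path)
```

Because `run_local_server` needs a browser, the practical route is probably to run this on a workstation and copy the resulting file to the Lightsail instance, rather than running it on the headless server.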

Infrastructure Notes

The daemon runs on a dedicated AWS Lightsail instance with:

  • Instance type: Standard (likely the 1GB RAM bundle, given the 914MB of total memory observed; vCPU count not verified)
  • Region: us-east-1
  • Static IP: 34.239.233.28
  • Service manager: systemd (unit file location: /etc/systemd/system/jada-agent.service)

Metrics are collected via the Lightsail API (CPU, network, status checks). No CloudWatch agent is required; Lightsail's native metrics are sufficient for this workload.
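
As an illustration of how the Lightsail figures can be reduced to the numbers quoted in this post, the sketch below summarizes a response from `aws lightsail get-instance-metric-data`. It assumes the response was saved as JSON and parsed into a dict with a `metricData` list of datapoints; the helper names are hypothetical.

```python
from typing import Optional


def average_of(response: dict) -> Optional[float]:
    """Mean of the `average` statistic across datapoints, or None if there are none."""
    values = [p["average"] for p in response.get("metricData", []) if "average" in p]
    return sum(values) / len(values) if values else None


def total_of(response: dict) -> float:
    """Total of the `sum` statistic, e.g. failed status checks in the window."""
    return sum(p.get("sum", 0.0) for p in response.get("metricData", []))
```

Querying `CPUUtilization` with the Average statistic and `StatusCheckFailed` with the Sum statistic, then passing each response to the matching helper, would yield an average CPU figure and a status-check failure count for the chosen window.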

Key Decisions

  • Temporary SSH credentials via API: Eliminates local key management and provides audit trails. Each access creates a short-lived certificate tied to a specific instance.
  • 60-second daemon poll interval: Balances responsiveness against CPU load. For this workload (5 sessions per day, sporadic tasks), the daemon is idle most of the time, which is correct.
  • 5-session daily quota: Enfor