```html

Diagnosing and Remediating the JADA Agent Daemon: OAuth Token Expiry in Port Sheet Sync

During a routine health check of the jada-agent orchestrator running on Lightsail instance 34.239.233.28, we discovered a critical but contained issue: the Google OAuth token powering the port_sheet_sync.py script had expired, causing all scheduled syncs to fail silently. This post documents the diagnostic methodology, root cause analysis, and the remediation strategy we implemented.

What Was Done

We performed a comprehensive health audit of the JADA daemon infrastructure by:

  • Establishing SSH connectivity to the Lightsail instance via temporary credential provisioning from the AWS Lightsail API
  • Validating service health via systemd status checks and process inspection
  • Aggregating CPU, memory, disk, and network metrics from CloudWatch Lightsail metrics
  • Analyzing daemon logs for error patterns and task completion rates over the last 24 hours
  • Identifying the broken OAuth token in the port_sheet_sync subprocess
  • Documenting the session turn-limit exits (non-critical) and their impact on task completion

Technical Details: The Diagnostic Approach

SSH Access Without Stored Keys

The private key jada-key was not stored in the standard ~/.ssh directory. Rather than manually managing keys in version control or environment files (a security anti-pattern), we leveraged the AWS Lightsail API to generate temporary SSH credentials on-demand:

aws lightsail get-instance-access-details \
  --instance-name jada-agent-prod \
  --region us-east-1

This API call returns a temporary SSH certificate and a matching private key that are valid only for a short window. We wrote the private key to a temporary file, used it for SSH authentication, and deleted it as soon as the session ended, so no long-lived credentials had to be stored locally.
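As a rough illustration of the write-use-delete pattern described above, here is a minimal Python sketch. It assumes the input is the "accessDetails" object from the API response (with privateKey, certKey, username, and ipAddress fields); the helper names temporary_ssh_identity and ssh_command are ours for illustration and are not part of the daemon.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_ssh_identity(access_details):
    """Write the short-lived key pair to a private temp dir; remove it on exit.

    `access_details` is assumed to be the "accessDetails" object returned by
    `aws lightsail get-instance-access-details`.
    """
    workdir = tempfile.mkdtemp(prefix="jada-ssh-")  # created with mode 0700
    key_path = os.path.join(workdir, "id_tmp")
    cert_path = key_path + "-cert.pub"  # the name ssh looks for next to the key
    try:
        # Create the key file with 0600 from the start so it is never readable
        # by other users, even briefly.
        fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "w") as fh:
            fh.write(access_details["privateKey"])
        with open(cert_path, "w") as fh:
            fh.write(access_details["certKey"])
        yield key_path
    finally:
        # Always clean up, even if the SSH session raised.
        for path in (key_path, cert_path):
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(workdir)


def ssh_command(access_details, key_path):
    """Build the ssh argv for the instance described by access_details."""
    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    return ["ssh", "-i", key_path, "-o", "IdentitiesOnly=yes", target]
```

A caller would wrap the SSH invocation in `with temporary_ssh_identity(details) as key_path:` and pass `ssh_command(details, key_path)` to `subprocess.run`, so the key is removed whether or not the session succeeds.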

Service Status and Process Health

Once connected, we inspected the systemd service directly:

systemctl status jada-agent.service
journalctl -u jada-agent.service -n 100 --no-pager

The output confirmed:

  • Uptime: 3 days (active since May 10)
  • Resource usage: 0.65% CPU (baseline), 144MB / 914MB memory
  • Disk: 6.2GB / 39GB (17% utilization)
  • Load average: 0.00 (the daemon idles between task picks)
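For repeat checks, the same status information can be read in machine-friendly form. The sketch below is one possible approach, not what we ran during the audit: it assumes `systemctl show` with `--property`, which prints KEY=VALUE lines, and the function names are illustrative.

```python
import subprocess


def parse_systemctl_show(text):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def is_healthy(props):
    """A unit counts as healthy here only when it is active and running."""
    return props.get("ActiveState") == "active" and props.get("SubState") == "running"


def read_service_props(unit="jada-agent.service"):
    """Query systemd for the unit's state (must run on the instance itself)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,ActiveEnterTimestamp"],
        check=True, capture_output=True, text=True,
    )
    return parse_systemctl_show(result.stdout)
```

Keeping the parser separate from the subprocess call makes it easy to test against captured output without touching systemd.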

Metrics Aggregation via CloudWatch

We pulled the last 2 hours of CPU, network, and status check metrics from CloudWatch Lightsail metrics:

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lightsail \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceName,Value=jada-agent-prod \
  --start-time 2026-05-13T15:00:00Z \
  --end-time 2026-05-13T17:00:00Z \
  --period 300 \
  --statistics Average,Maximum

No spikes, no failures. The instance is stable.
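To make the "no spikes" judgment repeatable, the returned datapoints can be summarized in a few lines. This is a sketch under assumptions: it expects the CLI's JSON output with a "Datapoints" list whose entries carry Timestamp, Average, and Maximum, and the 80% spike threshold is an arbitrary illustrative value rather than one we have tuned.

```python
import json


def load_datapoints(cli_json):
    """Extract the Datapoints list from get-metric-statistics JSON output."""
    return json.loads(cli_json)["Datapoints"]


def summarize_cpu(datapoints, spike_threshold=80.0):
    """Return (mean of Average, peak Maximum, timestamps at/above threshold)."""
    if not datapoints:
        raise ValueError("no datapoints returned; check the time window")
    # CloudWatch does not guarantee ordering, so sort by timestamp first.
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    mean_avg = sum(d["Average"] for d in ordered) / len(ordered)
    peak = max(d["Maximum"] for d in ordered)
    spikes = [d["Timestamp"] for d in ordered if d["Maximum"] >= spike_threshold]
    return mean_avg, peak, spikes
```

An empty `spikes` list together with a low `peak` corresponds to the stable reading described above.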

Daemon Log Analysis

We examined the daemon's activity log to understand task flow and error patterns:

  • Session 1 (00:00 UTC): Hit the Claude max-turns limit (30 turns). Exit code 1. Task left incomplete.
  • Session 2 (00:02 UTC): Completed successfully. Processed e-signature page blockers and crew page generator code. Created a needs-you task for manual intervention.
  • Session 3 (00:05 UTC): Hit max-turns limit again. Exit code 1.
  • Post-session 3: Daemon polled the progress dashboard and found no pending tasks. Returned to idle loop (60s poll cycle).

Root Cause: Expired Google OAuth Token in Port Sheet Sync

The critical finding emerged from the port_sheet_sync.py subprocess logs. Every scheduled run (one every 30 minutes) since at least that afternoon had failed with:

[port-sheet] token error: HTTP Error 400: Bad Request

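The script authenticates to the Google Sheets API using OAuth 2.0 credentials stored at /home/jada/.jada/secrets/ga_token.json. Access tokens are short-lived (typically one hour) and are normally renewed automatically with the stored refresh token, so routine access-token expiry alone would not break the sync. The "token error" prefix and the HTTP 400 response suggest that the refresh request itself was being rejected, which usually means the refresh token had expired or been revoked on Google's side.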

Why this matters: The port sheet is a critical synchronization layer. When port_sheet_sync.py fails, downstream tasks that depend on fresh port sheet data either don't execute or operate on stale data. In this case, the sync had been broken for several hours without alerting.
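One way to close the alerting gap is to watch the log for a run of consecutive token errors. The sketch below is illustrative only: it matches the "[port-sheet] token error:" line quoted above, assumes that any other "[port-sheet]" line indicates a successful run (we have not verified what the script logs on success), and uses a threshold of two failures as a placeholder.

```python
import re

# Matches the failure line quoted above.
TOKEN_ERROR = re.compile(r"\[port-sheet\] token error: (?P<detail>.+)")


def trailing_token_failures(log_lines):
    """Count consecutive token errors at the end of the port-sheet log lines.

    Lines without the "[port-sheet]" tag are ignored. Any tagged line that is
    not a token error is assumed to be a success and resets the streak.
    """
    streak = 0
    for line in log_lines:
        if "[port-sheet]" not in line:
            continue
        streak = streak + 1 if TOKEN_ERROR.search(line) else 0
    return streak


def should_alert(log_lines, max_failures=2):
    """Alert once the current failure streak reaches max_failures."""
    return trailing_token_failures(log_lines) >= max_failures
```

With a 30-minute sync interval, a threshold of two would flag the problem after roughly an hour instead of leaving it unnoticed for several.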

Infrastructure and Architecture Context

JADA Agent Architecture

The daemon is structured as a poll-based task orchestrator:

  • Main loop: Polls the progress dashboard (a shared database table or API endpoint) every 60 seconds for pending tasks
  • Session management: Spawns Claude API sessions (via anthropic SDK) with a hard limit of 30 turns per session to control costs and latency
  • Subprocess integration: Periodically spawns port_sheet_sync.py (30-minute interval) to hydrate task metadata from Google Sheets
  • Service registration: Managed as a systemd service on the Lightsail instance, with automatic restart on failure

Google OAuth Token Lifecycle

The daemon uses the google-auth-oauthlib library to authenticate. The OAuth 2.0 flow stores:

  • A refresh token (long-lived, stored in credentials)
  • An access token (short-lived, typically 1 hour)

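If the refresh token expires or is revoked (e.g., the user revokes the app's permissions, the token goes unused for six months, or the user changes their password while the token carries Gmail scopes), the daemon cannot obtain a new access token. The refresh request is rejected with HTTP 400 (typically an invalid_grant error), and every subsequent API call fails for lack of a valid access token.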
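The refresh exchange and its failure mode can be sketched with the standard library alone. This is not the code in port_sheet_sync.py: it assumes the credentials file holds client_id, client_secret, and refresh_token fields (the layout google-auth uses for authorized-user files), and the function names are illustrative.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URI = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint


def build_refresh_request(creds):
    """Build the POST that exchanges a refresh token for a new access token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": creds["client_id"],
        "client_secret": creds["client_secret"],
        "refresh_token": creds["refresh_token"],
    }).encode("ascii")
    return urllib.request.Request(creds.get("token_uri", TOKEN_URI),
                                  data=form, method="POST")


def classify_refresh_failure(status, body):
    """Map a token-endpoint error to an action: 'reauthorize' or 'retry'."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, AttributeError):  # body is not a JSON object
        error = ""
    if status == 400 and error == "invalid_grant":
        return "reauthorize"  # refresh token expired or revoked
    return "retry"


def refresh_access_token(creds):
    """Return the token response dict, or raise RuntimeError naming the action."""
    try:
        with urllib.request.urlopen(build_refresh_request(creds), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as exc:
        body = exc.read().decode("utf-8", "replace")
        action = classify_refresh_failure(exc.code, body)
        raise RuntimeError(f"token refresh failed (HTTP {exc.code}): {action}") from exc
```

Separating "reauthorize" from "retry" matters because no amount of retrying recovers from a revoked refresh token; only a fresh consent flow does.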

Key Decisions and Non-Issues

Session Max-Turns Exits (Non-Critical)

The two sessions that hit the 30-turn Claude limit exited with code 1. This is not a crash; it is a cost and latency control mechanism. The daemon logs these as errors but continues running. Session 2 completed successfully, demonstrating that the daemon itself is healthy. If complex tasks regularly exhaust the 30-turn limit, we should consider:

  • Increasing the turn limit (cost trade-off)
  • Breaking tasks into smaller scopes (architectural change)
  • Implementing a continuation mechanism (session handoff)

For now, this is acceptable: the daemon is functioning as designed.
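If we do pursue the continuation option later, it could look roughly like the sketch below. Everything here is hypothetical: run_session is an assumed callable that returns whether the task finished plus a progress summary to carry forward, and the cap of three sessions is a placeholder.

```python
def run_with_handoff(task, run_session, max_turns=30, max_sessions=3):
    """Run a task across several sessions, passing a summary between them.

    `run_session(task, carry, max_turns)` is a hypothetical callable that
    returns (done, carry): whether the task finished, and a summary of
    progress for the next session to pick up.
    """
    carry = None
    for attempt in range(1, max_sessions + 1):
        done, carry = run_session(task, carry, max_turns)
        if done:
            return {"status": "completed", "sessions": attempt}
    # Still unfinished after the cap: escalate to a human, as the daemon
    # already does with needs-you tasks.
    return {"status": "needs-you", "sessions": max_sessions, "carry": carry}
```

Capping the number of handoffs keeps the cost control that the turn limit provides while giving long tasks a bounded chance to finish.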

Temporary SSH Credentials Over Stored Keys