
Diagnosing and Remediating OAuth Token Degradation in the JADA Agent Orchestrator

During a routine health audit of the jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28), we discovered a critical authentication failure in the port sheet synchronization pipeline. This post details the diagnostic methodology, root cause analysis, and remediation strategy for OAuth token expiration in long-running background tasks.

What Was Done

We performed a comprehensive health check of the jada-agent.service running on a Lightsail instance, including service status verification, log analysis, resource utilization metrics, and task execution tracing. The investigation revealed that while the orchestrator daemon itself was healthy and actively processing tasks, the Google OAuth token used by the port_sheet_sync.py script had expired, breaking the 30-minute synchronization cycle for the port sheet data pipeline.

Technical Details: Daemon Health Assessment

Service Status and Uptime

  • jada-agent.service: Active and running for 3 days (since May 10, 2026) with zero service restarts
  • Instance uptime: 11 days with load average 0.00 between task executions
  • CPU utilization: 0.65% average across last 2 hours, with no spike events
  • Memory footprint: 144 MB / 914 MB available (15.8% utilization)
  • Disk usage: 6.2 GB / 39 GB (17% utilization), no capacity concerns
  • AWS health checks: 0 failures in the last 2 hours via Lightsail metrics API
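
The CPU figure above can be reproduced from the Lightsail metrics API. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the function names, the 5-minute period, and the 50% spike threshold are illustrative choices, not values taken from the audit.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def fetch_cpu_datapoints(instance_name, hours=2, period_seconds=300):
    """Fetch average CPU utilization datapoints for a Lightsail instance."""
    import boto3  # imported lazily so summarize_cpu works without AWS access

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    client = boto3.client("lightsail")
    response = client.get_instance_metric_data(
        instanceName=instance_name,  # the Lightsail instance name, not its IP
        metricName="CPUUtilization",
        period=period_seconds,
        startTime=start,
        endTime=end,
        unit="Percent",
        statistics=["Average"],
    )
    return response["metricData"]


def summarize_cpu(datapoints, spike_threshold=50.0):
    """Reduce metric datapoints to a mean, a peak, and a spike count."""
    averages = [p["average"] for p in datapoints if "average" in p]
    if not averages:
        return {"mean": None, "peak": None, "spikes": 0}
    return {
        "mean": round(mean(averages), 2),
        "peak": max(averages),
        # spike_threshold is a hypothetical cutoff for what counts as a spike
        "spikes": sum(1 for value in averages if value >= spike_threshold),
    }
```

Calling summarize_cpu(fetch_cpu_datapoints("<instance-name>")) would return the mean and peak for the last two hours, with a spike count of zero when nothing crossed the threshold.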

Task Execution Log (May 13, UTC)

The daemon maintains a session counter with a hard limit of 5 sessions per 24-hour period. Today's execution pattern showed three sessions, one of which completed successfully and two of which exited at the turn limit:

  • Session 1 (00:00 UTC): Exited with code 1 after reaching the 30-turn Claude API limit. This is not a crash but expected behavior when a task needs more turns than the per-session budget allows.
  • Session 2 (00:02 UTC): Completed successfully. Processed electronic signature and crew page generator blockers, created downstream needs-you task for manual review.
  • Session 3 (00:05 UTC): Exited with code 1 after hitting 30-turn limit again during complex task processing.
  • Post-Session 3: Daemon entered idle state; no additional tasks queued in the progress dashboard.

The 5/5 session hard stop observed yesterday at midnight UTC cleared at the 24-hour rollover, confirming the session limit enforcement is working as designed.
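
The session limit behavior described above can be pictured with a small counter that resets at the UTC day boundary. This is only a sketch of the idea, not the daemon's actual code: the class name SessionBudget and the assumption that the 24-hour window is the UTC calendar day are ours.

```python
from datetime import datetime, timezone


class SessionBudget:
    """Allow at most `limit` sessions per UTC calendar day (assumed window)."""

    def __init__(self, limit=5):
        self.limit = limit
        self._day = None   # UTC date the current count belongs to
        self._used = 0     # sessions started on that date

    def try_start(self, now=None):
        """Return True and consume one session if the budget allows it."""
        now = now or datetime.now(timezone.utc)
        today = now.astimezone(timezone.utc).date()
        if today != self._day:
            # Rollover: a new UTC day clears the previous day's hard stop.
            self._day = today
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

With a limit of 5, the sixth call on the same UTC day returns False, and the first call after midnight UTC returns True again, which matches the 5/5 hard stop clearing at rollover.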

Critical Finding: Google OAuth Token Failure in Port Sheet Sync

Log analysis revealed a persistent authentication error in the port sheet synchronization subprocess:

[port-sheet] token error: HTTP Error 400: Bad Request

This error has recurred every 30 minutes since at least the afternoon of May 13 (UTC). An HTTP 400 at the token step is consistent with a Google OAuth token for port_sheet_sync.py that has expired or been revoked: the script attempts to authenticate with the Google Sheets API, but the stored refresh token or access token is no longer valid.
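
The logged message carries only the status line, so it cannot distinguish an expired or revoked token from a malformed request. OAuth 2.0 token endpoints return a JSON body with an `error` code alongside the 400, and logging that code would narrow the cause. The helper below is a hedged sketch: the function name and hint wording are ours, and we have not confirmed how port_sheet_sync.py issues its token request.

```python
import json

# Standard OAuth 2.0 token-endpoint error codes (RFC 6749, section 5.2)
# mapped to hints written for this pipeline.
HINTS = {
    "invalid_grant": "refresh token expired or revoked; re-run the consent flow",
    "invalid_client": "client_id or client_secret rejected; check the credentials file",
    "invalid_request": "malformed token request; check the request parameters",
}


def describe_token_error(status, body):
    """Turn a token endpoint error response into a one-line diagnosis."""
    try:
        payload = json.loads(body.decode("utf-8") if isinstance(body, bytes) else body)
    except ValueError:
        return f"HTTP {status}: response body was not JSON"
    if not isinstance(payload, dict):
        return f"HTTP {status}: unexpected JSON response"
    error = payload.get("error", "unknown")
    detail = payload.get("error_description", "")
    hint = HINTS.get(error, "unrecognized error code")
    message = f"HTTP {status} {error}: {hint}"
    return f"{message} ({detail})" if detail else message
```

If the script raises urllib.error.HTTPError (the logged text matches that exception's string form, though that is not proof), the call would be describe_token_error(exc.code, exc.read()) inside the except block.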

Why This Matters

The port sheet synchronization pipeline is a critical data integration component. When the 30-minute sync fails, port sheet updates don't propagate to downstream systems, leaving stale data and risking failures in downstream tasks that depend on current port sheet state.

Infrastructure and Authentication Flow

The JADA agent uses a shared Google OAuth token model for multiple data integration tasks. The authentication credentials are stored in a secrets directory on the Lightsail instance and loaded during script initialization. The port_sheet_sync.py script authenticates through google-auth-oauthlib and then uses the Google Sheets API to read and write port sheet data.

Current Token Management Architecture

  • OAuth credentials stored at a location referenced in repos.env
  • Credentials file includes both client_id and client_secret, which points to an OAuth app registration rather than a Google service account key (service account keys carry a private key, not a client secret)
  • Refresh token flow relies on the google-auth-oauthlib library (confirmed installed on instance)
  • No automated refresh or expiry monitoring is wired into the sync script; tokens are assumed to be manually maintained

We verified that the Google auth library is properly installed (google-auth-oauthlib) and confirmed the credentials file structure contains both client ID and client secret needed for OAuth refresh operations.
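
A structural check like the one described can be scripted without printing any secret values. The sketch below assumes the token file is JSON in the "authorized user" layout that google-auth writes, with client_id, client_secret, and refresh_token keys; the actual file layout on the instance may differ, and the function names are illustrative.

```python
import json
from pathlib import Path

# Fields an OAuth refresh needs (assumed key names, see lead-in).
REQUIRED_FIELDS = ("client_id", "client_secret", "refresh_token")


def missing_oauth_fields(data):
    """Return the required field names that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not data.get(field)]


def check_token_file(token_path):
    """Load a token file and report missing fields without echoing values."""
    data = json.loads(Path(token_path).read_text(encoding="utf-8"))
    return missing_oauth_fields(data)
```

An empty list means the file has everything a refresh needs; it does not mean the refresh token is still accepted by Google, which only a live refresh attempt can show.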

Diagnosis Method: Remote Access via Lightsail API

Since the SSH private key was not available locally, we used AWS Lightsail's temporary key generation API to establish SSH access. The process involved:

  1. Calling the Lightsail GetInstanceAccessDetails API endpoint for the instance at 34.239.233.28 (the call itself is keyed by instance name)
  2. Parsing the temporary SSH certificate response (without the --protocol flag to receive raw certificate data)
  3. Writing the temporary key to a local file and executing remote commands via OpenSSH
  4. Collecting service status via systemctl status jada-agent.service
  5. Extracting recent logs from the daemon and port sheet sync subprocess
  6. Pulling CPU, memory, network, and status check metrics via Lightsail metrics API
  7. Cleaning up temporary key files immediately after SSH session termination

This approach avoided storing permanent SSH keys locally while maintaining audit trail compliance through AWS API logging.
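
The steps above can be sketched in Python with boto3 instead of the AWS CLI. This is an illustration under assumptions, not the exact commands we ran: the function names are ours, boto3 and an OpenSSH client are assumed to be installed, and host key verification is left at the SSH client's defaults.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def fetch_access_details(instance_name):
    """Request temporary SSH access details from Lightsail."""
    import boto3  # imported lazily so the helpers below work without AWS access

    client = boto3.client("lightsail")
    response = client.get_instance_access_details(instanceName=instance_name)
    return response["accessDetails"]


def write_temp_key(details, directory):
    """Write the private key and certificate with owner-only permissions."""
    key_path = Path(directory) / "lightsail_temp_key"
    # OpenSSH picks up "<identity>-cert.pub" automatically for a given -i file.
    cert_path = Path(directory) / "lightsail_temp_key-cert.pub"
    for path, content in ((key_path, details["privateKey"]), (cert_path, details["certKey"])):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content if content.endswith("\n") else content + "\n")
    return key_path, cert_path


def build_ssh_command(details, key_path, remote_command):
    """Build the ssh argument list for one remote command."""
    target = f"{details['username']}@{details['ipAddress']}"
    return ["ssh", "-i", str(key_path), "-o", "IdentitiesOnly=yes", target, remote_command]


def run_remote(instance_name, remote_command):
    """Run one command over SSH; the key files vanish with the temp directory."""
    details = fetch_access_details(instance_name)
    with tempfile.TemporaryDirectory() as tmp:
        key_path, _ = write_temp_key(details, tmp)
        command = build_ssh_command(details, key_path, remote_command)
        return subprocess.run(command, capture_output=True, text=True, timeout=60)
```

Keeping the key inside a TemporaryDirectory covers the cleanup step even when the SSH command fails or times out.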

Key Decisions and Trade-offs

Why Not Restart the Service?

The daemon is functioning correctly: it's processing tasks and logging appropriately. Restarting would only paper over the problem. The root issue (expired OAuth token) requires re-authentication, not a service restart.

Why The 30-Turn Limit Exits Aren't Critical Yet

Sessions 1 and 3 hitting the 30-turn Claude API limit is expected behavior for complex tasks. Session 2 completed successfully and delivered value. If turn limits consistently block task completion, the remediation is to either increase the limit in the daemon configuration or break larger tasks into smaller subtasks. This is an optimization, not a blocking issue at this moment.

Token Refresh Strategy

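Rather than relying only on manual credential rotation, we should add automated token refresh to the script wrapper. The Google auth libraries (google-auth, which google-auth-oauthlib builds on) can refresh an expired access token automatically when the stored credentials include a refresh token along with the client ID and client secret. Automatic refresh cannot recover from a refresh token that has itself expired or been revoked, so the manual re-authentication described below is still needed first.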
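
A refresh step in the wrapper might look like the sketch below. It assumes the token is stored in google-auth's authorized-user JSON format; load_credentials, refresh_if_needed, and the save callback are hypothetical names, and nothing here is taken from the current port_sheet_sync.py.

```python
def load_credentials(token_path, scopes):
    """Load stored user credentials (requires the google-auth package)."""
    from google.oauth2.credentials import Credentials

    return Credentials.from_authorized_user_file(str(token_path), scopes)


def refresh_if_needed(creds, request, save):
    """Refresh invalid credentials and persist them.

    `creds` is a google.oauth2.credentials.Credentials-like object,
    `request` is a google.auth.transport.requests.Request instance, and
    `save` is a callable that stores the serialized credentials.
    Returns True if a refresh was performed.
    """
    if creds.valid:
        return False
    if not creds.refresh_token:
        # Nothing to refresh with: only a new consent flow can fix this.
        raise RuntimeError("no refresh token stored; re-run the consent flow")
    # Raises google.auth.exceptions.RefreshError if Google rejects the token.
    creds.refresh(request)
    save(creds.to_json())
    return True
```

In the wrapper, save could be a small function that writes the JSON back to the token file in the secrets directory with owner-only permissions. A RefreshError from this step is the signal that manual re-authentication is required, so it should be logged distinctly rather than swallowed.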

What's Next

The immediate action item is to re-authenticate the Google OAuth token for the port sheet sync pipeline. This involves:

  1. Running the auth_ga.py script (or a new auth_port_sheet.py variant) with the account that has access to the target Google Sheet
  2. Following the OAuth consent flow to generate a fresh token with appropriate scopes
  3. Storing the refreshed credentials in the secrets directory
  4. Verifying that the next 30-minute sync cycle executes without the 400 error
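
The verification in step 4 can be reduced to scanning the log lines from the most recent sync window for the error marker seen earlier. This is a minimal sketch; the function name is ours, and how the lines are collected depends on where the daemon writes its logs.

```python
# Marker copied from the error line observed in the daemon logs.
TOKEN_ERROR_MARKER = "[port-sheet] token error"


def count_token_errors(lines):
    """Count log lines that contain the port sheet token error marker."""
    return sum(1 for line in lines if TOKEN_ERROR_MARKER in line)
```

Assuming the daemon logs to the systemd journal, the lines could come from journalctl -u jada-agent.service --since "-35min"; a count of zero after at least one full 30-minute cycle indicates the re-authentication took effect.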

Secondary improvements for future work:

  • Implement automatic token refresh in the port_sheet_sync.py script to prevent future expiration