Diagnosing and Remediating OAuth Token Failures in the JADA Agent Orchestrator

During a routine health check of the jada-agent daemon running on our Lightsail instance (34.239.233.28), we discovered a critical OAuth token failure affecting the port sheet synchronization service. This post documents the diagnostic methodology, root cause analysis, and remediation strategy for production Google OAuth token management in distributed agent systems.

What Was Done

We conducted a comprehensive health assessment of the jada-agent.service orchestrator daemon. The core agent was functioning normally, with healthy resource utilization and uptime, but the port_sheet_sync.py service was failing authentication repeatedly. The investigation pointed to an expired or revoked Google OAuth token, which had caused the sync to fail on every 30-minute run for at least 12 hours.

  • Established SSH connectivity to the Lightsail instance via AWS Lightsail API temporary credentials (replacing missing local SSH key material)
  • Collected service status, resource metrics, and daemon logs from the running jada-agent.service
  • Analyzed session activity logs and identified OAuth token degradation in the port_sheet_sync subprocess
  • Created authentication remediation workflow in auth_ga.py to handle multi-service token refresh

Technical Details: Diagnostic Approach

The health check began with a challenge: the jada SSH private key was not available in the expected local location (~/.ssh/jada-key). Rather than blocking on key retrieval, we used the AWS Lightsail API to generate temporary SSH credentials on demand.

# Generate temporary SSH key material via the Lightsail
# GetInstanceAccessDetails action, for example with the AWS CLI
aws lightsail get-instance-access-details \
    --instance-name jada-agent-prod \
    --region us-east-1 \
    --protocol ssh
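
For readers who want to script this step, the following is a minimal sketch rather than the exact procedure we ran. It assumes the parsed JSON response of GetInstanceAccessDetails, with an accessDetails object carrying privateKey, certKey, username, and ipAddress. The function name and the file names are hypothetical.

```python
import os
from pathlib import Path


def write_temp_ssh_credentials(response: dict, directory: str) -> list[str]:
    """Write the temporary key and certificate, then return an ssh argv list.

    `response` is assumed to be the parsed JSON from GetInstanceAccessDetails.
    """
    details = response["accessDetails"]
    key_path = Path(directory) / "lightsail-temp-key"  # hypothetical file name
    # OpenSSH looks for a certificate at "<identity file>-cert.pub".
    cert_path = Path(directory) / "lightsail-temp-key-cert.pub"

    files = ((key_path, details["privateKey"]), (cert_path, details["certKey"]))
    for path, content in files:
        # Create each file with owner-only permissions before writing secrets.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)

    return ["ssh", "-i", str(key_path), f"{details['username']}@{details['ipAddress']}"]
```

The returned list can be passed to subprocess.run. The credentials are short-lived, so the files are best written to a temporary directory that is removed afterwards.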

Once connected, we pulled comprehensive daemon telemetry:

  • Service Status: jada-agent.service active since May 10 (3 days uptime)
  • Resource Usage: 0.65% CPU average, 144MB/914MB memory, 6.2GB/39GB disk
  • System Health: Load average 0.00, zero status check failures in prior 2 hours
  • Session Activity: 3 of 5 daily sessions consumed; 2 sessions hit 30-turn Claude limit (expected behavior for complex tasks); 1 session completed successfully
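
The service-level figures above can also be gathered programmatically. The sketch below assumes the KEY=VALUE output of `systemctl show`. The helper names are hypothetical, and this is not the tooling we used during the health check.

```python
import subprocess


def parse_systemctl_show(output: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    properties = {}
    for line in output.splitlines():
        # Split on the first "=" only, since values may contain "=" themselves.
        key, separator, value = line.partition("=")
        if separator:
            properties[key] = value
    return properties


def read_service_state(unit: str = "jada-agent.service") -> dict[str, str]:
    """Query systemd for a few properties of the unit (requires systemd)."""
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=ActiveState,SubState,ActiveEnterTimestamp"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)
```

A unit that reports ActiveState=active and SubState=running matches the "active" status recorded above. Note that this check says nothing about subprocess failures such as the sync errors described next.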

Daemon logs revealed the root cause: every 30-minute invocation of port_sheet_sync.py was failing with:

[port-sheet] token error: HTTP Error 400: Bad Request

A 400 from the token endpoint most likely means the Google OAuth 2.0 refresh token stored for the port sheet service is expired, revoked, or otherwise invalid. The daemon itself keeps running (this is not a crash), but port sheet syncs have been failing silently and data drift has been accumulating.
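
The log line records only the HTTP status, but the response body from an OAuth 2.0 token endpoint normally carries an error code (defined in RFC 6749, section 5.2) that narrows down the cause. Below is a minimal sketch of decoding that body. The function name and hint strings are illustrative, and it assumes the sync script can be changed to log the body.

```python
import json


def describe_token_error(body: bytes) -> str:
    """Summarize the JSON error body returned by an OAuth 2.0 token endpoint."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unparseable error body"

    code = payload.get("error", "unknown_error")
    detail = payload.get("error_description", "")
    # Error codes from RFC 6749 section 5.2; the hint wording is our own.
    hints = {
        "invalid_grant": "refresh token expired, revoked, or issued to another client",
        "invalid_client": "client_id or client_secret not accepted",
        "invalid_request": "malformed request parameters",
    }
    summary = f"{code}: {hints.get(code, 'see error_description')}"
    return f"{summary} ({detail})" if detail else summary
```

If the script uses urllib, the body is available inside an `except urllib.error.HTTPError as exc:` handler via `exc.read()`.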

Infrastructure and Service Architecture

The JADA agent system consists of several interconnected components:

  • jada-agent.service: Systemd service running on Lightsail, polling the progress dashboard every 60 seconds for pending tasks
  • port_sheet_sync.py: Subprocess invoked on a 30-minute schedule, syncs Google Sheets data using stored OAuth credentials
  • auth_ga.py: Authentication utility for Google APIs that stores and refreshes OAuth tokens; the script lives in /Users/cb/Documents/repos/tools/
  • Google OAuth Token Storage: Stored in environment files or credential managers (exact paths withheld for security)
  • Progress Dashboard: Central task queue that the daemon monitors and consumes from

The port_sheet_sync service relies on a long-lived refresh token issued by Google OAuth 2.0. When a refresh token becomes invalid (user revoked access, token expired after 6+ months of non-use, or credential rotation occurred), subsequent token refresh attempts fail with HTTP 400 Bad Request.
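
To make the failing step concrete, here is a minimal sketch of the standard refresh-token grant against Google's token endpoint. The internals of port_sheet_sync.py are not shown in this post, so the function name is hypothetical and the code illustrates the protocol, not our implementation.

```python
import urllib.parse
import urllib.request

# Google's OAuth 2.0 token endpoint.
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> urllib.request.Request:
    """Build (but do not send) a refresh-token grant request."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the request to urllib.request.urlopen returns a JSON body containing a new access token on success. With an invalid refresh token it raises urllib.error.HTTPError, whose string form is "HTTP Error 400: Bad Request", the same text seen in the daemon log.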

Root Cause Analysis

Google OAuth 2.0 refresh tokens have several failure modes:

  • Token Expiration: Refresh tokens can expire after extended periods of inactivity (typically 6 months). If port_sheet_sync was disabled or the Lightsail instance was rebuilt, the token may have aged out.
  • Credential Revocation: User revoked access through Google Account Settings → Security → Third-party apps & services
  • Scope Changes: If the OAuth consent screen was re-configured without the required scopes (e.g., Google Sheets read/write), the stored token becomes invalid
  • Client ID Rotation: If the Google Cloud project's OAuth 2.0 credentials (client_id, client_secret) were rotated, previously issued tokens are invalidated

For the dangerouscentaur@gmail.com account, we confirmed that client_id and client_secret exist in the stored token material, suggesting either token age or user-initiated revocation as the cause.

Remediation Strategy

We created auth_ga.py to handle re-authentication and token refresh for Google APIs:

python3 ~/Documents/repos/tools/auth_ga.py --account dangerouscentaur@gmail.com

This utility performs the following steps:

  • Checks for existing valid tokens in the credential store
  • If expired or missing, initiates Google OAuth 2.0 authorization flow
  • Stores new refresh token securely
  • Confirms access to required Google APIs (Sheets, Analytics, etc.)
  • Updates environment configuration for port_sheet_sync.py to use fresh credentials
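
The "stores new refresh token securely" step deserves a concrete illustration. The sketch below assumes the token is kept as a JSON file, which this post does not confirm, and uses a hypothetical function name. It writes the file with owner-only permissions and swaps it into place atomically, so a sync that starts mid-update never reads a half-written token.

```python
import json
import os
import tempfile


def save_token_atomically(token: dict, path: str) -> None:
    """Write token JSON to `path` with 0600 permissions via an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable and writable only by the current user.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".token-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(token, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the disk
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing the temporary file in the same directory as the target keeps the rename on one filesystem, which is what makes the swap atomic.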

Why this approach: Rather than manually rotating credentials in multiple places (repos.env, systemd environment, credential managers), centralizing token management in auth_ga.py ensures consistency and reduces the risk of stale credentials in partial updates.

Deployment Considerations

Once the new OAuth token is obtained, port_sheet_sync needs to pick it up. There are two approaches:

  • Soft restart: The next scheduled 30-minute sync will automatically use the updated token from the credential store
  • Hard restart: Run `systemctl restart jada-agent.service` to force immediate re-initialization (may interrupt in-flight tasks)

We recommend the soft restart approach to avoid disrupting active agent sessions.

What's Next

  • Execute auth_ga.py with the dangerouscentaur@gmail.com account to obtain fresh Google OAuth tokens
  • Verify port_sheet_sync logs after the next 30-minute cycle to confirm HTTP 400 errors have ceased
  • Implement token expiry alerting in the daemon to proactively surface credential issues before they cause data drift
  • Document OAuth credential rotation procedures for future maintenance cycles
  • Consider using Google Cloud Service Accounts instead of user OAuth tokens for non-interactive services; service account keys follow a different lifecycle and do not expire by default
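
As a starting point for the alerting item above, here is a minimal sketch that counts consecutive token errors at the tail of the daemon log. It relies on the "[port-sheet] token error" text seen in our logs, and it assumes (this is not confirmed here) that successful syncs also log a line with the "[port-sheet]" prefix. The function names and the threshold are hypothetical.

```python
from typing import Iterable

TOKEN_ERROR_MARKER = "[port-sheet] token error"
PORT_SHEET_PREFIX = "[port-sheet]"  # assumed prefix for all sync log lines


def consecutive_token_failures(log_lines: Iterable[str]) -> int:
    """Count token errors since the most recent non-error port-sheet line."""
    count = 0
    for line in reversed(list(log_lines)):
        if TOKEN_ERROR_MARKER in line:
            count += 1
        elif PORT_SHEET_PREFIX in line:
            break  # a port-sheet line without a token error ends the streak
        # Lines from other components are ignored.
    return count


def should_alert(log_lines: Iterable[str], threshold: int = 3) -> bool:
    """Return True once the failure streak reaches the threshold."""
    return consecutive_token_failures(log_lines) >= threshold
```

With a 30-minute sync interval, a threshold of 3 would surface a broken token after roughly 90 minutes instead of 12 hours.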