Diagnosing and Remediating a Production Daemon's OAuth Token Expiration in AWS Lightsail
During a routine health check of our jada-agent orchestrator daemon running on AWS Lightsail (34.239.233.28), we discovered a critical service degradation: the port_sheet_sync.py script was failing every 30 minutes with HTTP 400 OAuth token errors, while the main agent service remained healthy. This post details how we diagnosed the issue, why the problem emerged, and the remediation path forward.
What Was Done
We performed a comprehensive health audit of the jada-agent daemon by:
- Establishing SSH access to the Lightsail instance via temporary credentials from the AWS Lightsail API (since the persistent jada-key was not stored locally)
- Collecting service status from the jada-agent.service systemd unit
- Parsing 7+ days of daemon logs to identify recurring failure patterns
- Cross-referencing CloudWatch metrics (CPU, memory, network, status checks) with log timestamps
- Isolating the port_sheet_sync OAuth credential failure as a distinct, persistent issue
- Confirming the main agent service was otherwise healthy with 3-day uptime and normal resource utilization
Technical Details: The OAuth Token Failure
The port_sheet_sync.py script runs every 30 minutes as a scheduled task on the Lightsail instance. Its job is to synchronize sheet metadata and session counts to a Google Sheet via the Google Sheets API. The daemon logs showed:
[port-sheet] token error: HTTP Error 400: Bad Request
Sync failed at 2026-05-13 14:30:00 UTC
Sync failed at 2026-05-13 15:00:00 UTC
Sync failed at 2026-05-13 15:30:00 UTC
... (pattern continues every 30 minutes)
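The 30-minute cadence can be confirmed mechanically rather than by eye. The following is a minimal sketch, assuming the log lines have the shape shown in the excerpt above; the regular expression, function names, and tolerance are illustrative and would need adjusting to the daemon's real log format.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape, taken from the excerpt above; the real format may differ.
FAILURE_RE = re.compile(r"Sync failed at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC")


def failure_times(lines):
    """Return the timestamps of all 'Sync failed' lines, in log order."""
    times = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def is_periodic(times, interval=timedelta(minutes=30), tolerance=timedelta(seconds=60)):
    """True if there are at least two failures and every gap is close to the interval."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return bool(gaps) and all(abs(gap - interval) <= tolerance for gap in gaps)


sample = [
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "Sync failed at 2026-05-13 14:30:00 UTC",
    "Sync failed at 2026-05-13 15:00:00 UTC",
    "Sync failed at 2026-05-13 15:30:00 UTC",
]
times = failure_times(sample)
print(len(times), is_periodic(times))
```

A failure on every scheduled run, with no intermittent successes, points to a persistent credential problem rather than a transient network or quota error.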
This error indicates that the OAuth 2.0 refresh token stored on disk (likely in /var/lib/jada-agent/ or a secrets directory) has been revoked or is otherwise no longer valid. The Google OAuth library in use (google-auth-oauthlib) refreshes the short-lived access token automatically, but it cannot recover when the refresh token itself is invalid: Google's token endpoint rejects the refresh request with HTTP 400 Bad Request, typically with an invalid_grant error in the response body.
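The 400 status alone does not say which failure occurred; the JSON body of the token endpoint's response carries an error code (the codes are defined in RFC 6749, section 5.2). The sketch below shows one way to turn that body into a diagnosis. It is not the code in port_sheet_sync.py: the function names and diagnosis strings are hypothetical, and try_refresh makes a live network call to Google's token endpoint.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint


def classify_token_error(body: str) -> str:
    """Map a token-endpoint error body to a short diagnosis (wording is illustrative)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unparseable error body"
    if not isinstance(payload, dict):
        return "unparseable error body"
    code = payload.get("error", "")
    if code == "invalid_grant":
        return "refresh token revoked or expired: re-authentication required"
    if code == "invalid_client":
        return "client ID or secret rejected: check the OAuth client configuration"
    if code == "invalid_scope":
        return "requested scope not allowed for this token"
    return f"unclassified token error: {code or 'none given'}"


def try_refresh(client_id: str, client_secret: str, refresh_token: str) -> str:
    """Attempt one refresh and return 'ok' or a diagnosis. Makes a network call."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode()
    request = urllib.request.Request(TOKEN_URL, data=form)
    try:
        with urllib.request.urlopen(request, timeout=10):
            return "ok"
    except urllib.error.HTTPError as err:
        return classify_token_error(err.read().decode("utf-8", "replace"))
```

Logging the classified result instead of the bare status line would let the next audit tell a revoked token apart from a misconfigured client at a glance.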
Root cause analysis suggests one of three scenarios:
- Token revocation: The Google account owner (dangerouscentaur@gmail.com) revoked OAuth access from the browser settings, invalidating all issued tokens.
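- Token age: The original OAuth flow was completed more than 6 months ago. Google can invalidate a refresh token that has gone unused for six months, and refresh tokens issued to an OAuth client whose consent screen is still in testing status expire after seven days.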
- Scope mismatch: The token was re-issued with insufficient scopes (missing Sheets or Drive API scopes), and the daemon is now trying to use a scope that wasn't originally granted.
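The scope scenario can be checked offline if the stored token file records the scopes it was granted. A minimal sketch, assuming the file uses the authorized-user JSON layout with a scopes field (the layout written by Credentials.to_json() in google-auth); the required scope set is an assumption based on the Sheets and Drive APIs named above, and the script may need only one of them.

```python
import json

# Assumed requirement; the scopes port_sheet_sync.py actually needs were not confirmed.
REQUIRED_SCOPES = {
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
}


def missing_scopes(token_json: str, required=REQUIRED_SCOPES):
    """Return the required scopes that the stored token does not list, sorted."""
    data = json.loads(token_json)
    granted = data.get("scopes") or []
    if isinstance(granted, str):  # tolerate a space-separated string
        granted = granted.split()
    return sorted(set(required) - set(granted))


# Placeholder token contents for illustration only.
example = json.dumps({
    "refresh_token": "placeholder",
    "scopes": ["https://www.googleapis.com/auth/spreadsheets"],
})
print(missing_scopes(example))
```

An empty result rules out the scope scenario and leaves revocation or expiry as the likelier causes.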
Infrastructure and Service Architecture
The jada-agent daemon runs as a systemd service on a single AWS Lightsail instance (34.239.233.28) with the following observed configuration:
- Instance specs: 1GB RAM, ~40GB storage, uptime 11 days at time of audit
- Service: jada-agent.service, active and running since May 10, 2026
- Resource utilization: CPU ~0.65% (normal polling loop), Memory 144MB / 914MB, Disk 6.2GB / 39GB
- Session management: 3 of 5 daily sessions used as of May 13; daemon respects Claude API turn limits and rolls over at midnight UTC
- Dependent scripts: port_sheet_sync.py (30-min interval), main agent loop (event-driven from task queue)
The daemon uses a progress dashboard (likely a DynamoDB table or similar queue) to pick up tasks. As of the audit, the daemon was idling normally between task executions, confirming that the core orchestration logic was healthy.
Key Decisions and Why
Decision: Use Lightsail API for temporary SSH credentials instead of searching for persistent keys.
The jada-key private key was not found in standard locations (~/.ssh/) or referenced in repos.env. Rather than delay the audit further, we used the AWS Lightsail API's get_instance_access_details operation to retrieve a temporary private key and a short-lived OpenSSH certificate for the instance. This approach:
- Avoided storing long-lived SSH keys in source control or on developer machines
- Provided audit trail via AWS CloudTrail
- Enabled rapid access without waiting for key distribution
- Allowed cleanup by discarding temporary credentials immediately after use
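The temporary-credential approach can be sketched as follows. This is an illustration under assumptions, not the exact commands used in the audit: the region default and the file name are hypothetical, and the instance name must be supplied by the caller. The boto3 call and the accessDetails fields (privateKey, certKey, username, ipAddress) follow the Lightsail API.

```python
import os


def fetch_access_details(instance_name: str, region: str = "us-east-1") -> dict:
    """Request temporary SSH credentials from Lightsail. Region default is an assumption."""
    import boto3  # imported lazily so the helpers below work without the AWS SDK

    client = boto3.client("lightsail", region_name=region)
    response = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return response["accessDetails"]


def write_temp_credentials(details: dict, directory: str) -> str:
    """Write the temporary private key and certificate; return the key path."""
    key_path = os.path.join(directory, "lightsail_temp_key")  # hypothetical file name
    cert_path = key_path + "-cert.pub"  # ssh loads <identity>-cert.pub automatically
    for path, content in ((key_path, details["privateKey"]),
                          (cert_path, details["certKey"])):
        # Create with owner-only permissions; ssh refuses keys readable by others.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
    return key_path


def ssh_command(details: dict, key_path: str) -> list:
    """Build the ssh argument list for the temporary credentials."""
    return ["ssh", "-i", key_path, f"{details['username']}@{details['ipAddress']}"]
```

Writing the files inside a tempfile.TemporaryDirectory() context and running the ssh command from within it gives the cleanup step for free: the key and certificate are deleted when the context exits.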
Decision: Separate core daemon health from subsidiary script failures.
The main jada-agent service is functioning correctly. The OAuth token failure is isolated to a dependent synchronization script. This distinction matters: the daemon can continue processing tasks while we remediate the port_sheet_sync issue independently, minimizing blast radius.
Observed Agent Activity
On May 13, 2026, the daemon executed three sessions within its daily allotment:
- Session 1 (00:00 UTC): Hit the 30-turn Claude API limit — exited with code 1. This is not a crash; the daemon logs it and continues.
- Session 2 (00:02 UTC): Completed successfully. Processed e-signature page blockers and crew page generator tasks, created a needs-you task in the progress dashboard.
- Session 3 (00:05 UTC): Hit the 30-turn limit again — exited with code 1.
After Session 3, no additional tasks were queued, and the daemon idled normally. The pattern suggests that complex tasks are consuming the full turn budget; this is worth monitoring but not an immediate failure.
What's Next
To remediate the OAuth token failure, the following steps are required:
- Re-authenticate Google OAuth: Execute the Google OAuth flow for the dangerouscentaur@gmail.com account with the required scopes (Google Sheets API, Google Drive API). This will issue new access and refresh tokens.
- Update stored credentials: Replace the expired token in the daemon's secrets directory (path TBD; likely /var/lib/jada-agent/secrets/ or an AWS Secrets Manager secret) with the new token.
- Verify port_sheet_sync recovery: Monitor the next 30-minute sync execution to confirm HTTP 200 response from the Google Sheets API.
- Audit token storage: Ensure the token file has restrictive permissions (0600) and is not world-readable.
- Consider token rotation policy: Implement automatic token refresh or scheduled re-authentication every 3–6 months to prevent future expiration.
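The re-authentication, storage, and permission steps above can be sketched as follows. This assumes a desktop-type OAuth client and the google-auth-oauthlib package; the scope list and all function names are assumptions, and the token path remains to be determined. The browser-based flow would most likely run on a workstation, with the resulting token copied to the instance, since the Lightsail host has no browser.

```python
import os
import stat
import tempfile

# Assumed scopes, based on the APIs named in the steps above.
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]


def run_oauth_flow(client_secrets_path: str) -> str:
    """Run the interactive consent flow and return the new token as JSON."""
    from google_auth_oauthlib.flow import InstalledAppFlow  # needs google-auth-oauthlib

    flow = InstalledAppFlow.from_client_secrets_file(client_secrets_path, scopes=SCOPES)
    creds = flow.run_local_server(port=0)  # opens a browser for consent
    return creds.to_json()


def store_token(token_json: str, path: str) -> None:
    """Atomically replace the token file, leaving it readable by the owner only."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # created with mode 0600 on POSIX
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token_json)
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def is_owner_only(path: str) -> bool:
    """True if neither group nor others have any permission bits on the file."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o077 == 0
```

Replacing the file atomically means a sync run that starts mid-update reads either the old token or the new one, never a partial write.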
The core agent daemon requires no immediate action and should continue processing tasks normally while this remediation proceeds in parallel.