```html

Diagnosing jada-agent Orchestrator Health: SSH Access, Daemon Metrics, and OAuth Token Failures

On May 13, 2026, we conducted a comprehensive health check of the jada-agent orchestrator daemon running on Lightsail instance 34.239.233.28. The goal was to verify service status, review recent task execution logs, identify performance bottlenecks, and confirm active task pickup from the progress dashboard. This post details the diagnostic approach, infrastructure patterns used, and a critical OAuth token issue discovered during the audit.

Challenge: Private Key Not Locally Available

The initial challenge was that the jada-key private SSH key was not stored in the standard ~/.ssh directory. Rather than delay the diagnostic, we employed two parallel strategies:

  • AWS Lightsail Temporary Credentials API: We called the Lightsail API to generate temporary SSH credentials for the instance, avoiding dependency on locally cached keys.
  • AWS Systems Manager Session Manager: We verified SSM connectivity as a fallback, though the Lightsail temporary credentials proved faster.

This approach reflects a best practice: long-lived infrastructure access keys should not be copied onto workstations or committed to version control. Temporary, time-bound credentials (TTL typically 1 hour) reduce the blast radius if they leak.

Service Health: Overall Status and Uptime

The jada-agent daemon is healthy and stable:

  • Service State: jada-agent.service active and running since May 10 (3 days continuous uptime)
  • System Uptime: 11 days; load average 0.00 between task cycles
  • CPU Utilization: 0.65% average over 60-second poll intervals—normal for an idle-waiting agent
  • Memory: 144 MB / 914 MB in use (about 16% by these figures), no pressure
  • Disk: 6.2 GB / 39 GB (17% used)—ample headroom
  • AWS Health Checks: Zero failures in the past 2 hours

The daemon's architecture (polling for tasks on a fixed interval and sleeping between cycles) explains the idle load. This is intentional: the orchestrator is designed to be lightweight, doing work only when the progress dashboard task queue has something for it.

Session Execution: Daily Quota and Max-Turn Limits

The daemon tracks session usage against a daily quota (5 sessions per day, UTC-based). On May 13, three sessions were executed:

  • Session 1 (00:00 UTC): Hit 30-turn Claude API limit; exit code 1
  • Session 2 (00:02 UTC): Completed successfully; processed e-signature and crew page blockers; created a needs-you task
  • Session 3 (00:05 UTC): Hit 30-turn limit again; exit code 1
  • Post-Session 3: No new tasks in queue; daemon idling normally

The max-turn exits are logged as errors but do not crash the daemon. This is expected behavior when a complex task exhausts its 30-turn budget. Session 2's success shows the daemon is picking up tasks and executing them; the two turn-limit exits suggest that either task complexity is increasing or the turn budget should be revisited for certain task types.

Critical Issue: Google OAuth Token Expiration in port_sheet_sync

During log analysis, we identified a persistent failure in the port_sheet_sync.py script:

[port-sheet] token error: HTTP Error 400: Bad Request

This error has recurred every 30 minutes since at least the afternoon of May 13. The most likely cause is that the Google OAuth 2.0 refresh token used for port sheet synchronization has expired or been revoked, so each scheduled token request is rejected. The token file is likely stored under the agent's credentials path (typically /opt/jada-agent/secrets/google_oauth.json or similar), and the refresh token flow is either not configured or has failed.

Impact: Port sheet syncs have halted. Any downstream process that depends on port sheet data is now working from stale data.

Resolution: The Google OAuth token must be re-authenticated. This typically involves:

  • Re-running the Google OAuth 2.0 consent flow for the user account that authorizes sheet access (a service account has no consent flow and would need a new JSON key instead)
  • Updating the token in the service's credential store
  • Verifying the refresh token flow is working by running a manual sync test
  • Monitoring the next 30-minute sync cycle for success

Infrastructure Pattern: Lightsail Metrics API + SSH Certificates + System Logs

The diagnostic workflow leveraged several AWS infrastructure components:

  • Lightsail Instance Metrics API: We retrieved CPU, network, and status check metrics via the Lightsail API without SSH, providing a fast, read-only health snapshot.
  • OpenSSH Certificates: The temporary SSH credentials returned by the Lightsail API consist of a short-lived private key and a matching OpenSSH certificate; used together, they allow passwordless, time-bound access without the jada-key private key.
  • System Logs via SSH: Once connected, we pulled daemon logs from /var/log/jada-agent/ (typical path) and systemd journal via journalctl -u jada-agent.service.

This multi-layered approach—metrics API for passive health checks, SSH for deep diagnostics—is preferred over always maintaining local copies of private keys.

Task Queue Behavior and Yesterday's Session Hard Stop

Logs from May 12 show the daemon hit a hard stop at the 5/5 daily session limit shortly before midnight UTC. Three tasks remained in the pending queue and were automatically cleared at the midnight rollover (when the daily session counter resets). This is documented expected behavior and indicates the rate limiter is working as designed.

Key Decisions and Trade-offs

  • Temporary Credentials Over Static Keys: We opted for Lightsail API temporary credentials rather than searching for cached private keys, reducing security risk and improving auditability.
  • Parallel Diagnostics: Pulling metrics via the API simultaneously with SSH connection setup reduced total diagnostic time.
  • Error Tolerance: The daemon correctly logs max-turn exits as errors but continues operation; this design choice prevents a single task from breaking the entire orchestrator.

What's Next

  • Immediate: Re-authenticate the Google OAuth token for port_sheet_sync.py and verify the next 30-minute sync succeeds.
  • Short-term: Investigate whether the recent max-turn exits are a symptom of increasing task complexity. Consider whether the 30-turn budget should be increased or task decomposition improved.
  • Medium-term: Implement CloudWatch alarms for OAuth token expiration errors in port_sheet_sync to alert before syncs silently fail for hours.
  • Ongoing: Continue daily task queue monitoring; the daemon's health is strong and requires minimal intervention.

Overall, the jada-agent orchestrator is functioning as designed. One OAuth token issue requires immediate attention; everything else is nominal.

```
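
The temporary-credential step described under "Challenge: Private Key Not Locally Available" can be sketched as follows. This is a minimal illustration, not the exact commands used during the audit. It assumes boto3 is installed and that the caller has permission for the Lightsail `GetInstanceAccessDetails` action. The instance name and region in the usage note are hypothetical placeholders, since the post identifies the instance only by IP. The helper writes the returned private key and certificate using the `tempkey` / `tempkey-cert.pub` naming, which lets OpenSSH find the certificate automatically next to the identity file.

```python
import os


def fetch_access_details(instance_name, region):
    """Ask Lightsail for temporary SSH credentials (makes a network call)."""
    import boto3  # third-party; imported lazily so the helpers below work without it

    client = boto3.client("lightsail", region_name=region)
    resp = client.get_instance_access_details(
        instanceName=instance_name, protocol="ssh"
    )
    return resp["accessDetails"]


def write_temp_ssh_files(access_details, directory):
    """Write the temporary private key and its certificate into `directory`."""
    key_path = os.path.join(directory, "tempkey")
    # ssh looks for a certificate at "<identity file>-cert.pub".
    cert_path = key_path + "-cert.pub"

    # Create the key file with mode 0600 from the start; ssh refuses
    # private keys that are readable by other users.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(access_details["privateKey"])
    with open(cert_path, "w") as f:
        f.write(access_details["certKey"])
    return key_path, cert_path


def ssh_command(access_details, key_path):
    """Build the ssh argv for the temporary credentials."""
    target = f'{access_details["username"]}@{access_details["ipAddress"]}'
    return ["ssh", "-i", key_path, target]
```

A typical call would be `details = fetch_access_details("my-instance", "us-east-1")` (both arguments hypothetical), followed by `write_temp_ssh_files(details, some_private_dir)` and running the list from `ssh_command(...)` with `subprocess.run`. The response also carries an `expiresAt` timestamp, so the files can be deleted once the credentials lapse.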
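
The daily session quota described under "Session Execution" and the May 12 hard stop can be modeled with a small counter keyed on the UTC calendar date. This is a sketch under stated assumptions: the class and method names are hypothetical, and the daemon's real implementation was not inspected. Only the limit of 5 and the midnight UTC reset come from the post.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional


@dataclass
class DailySessionQuota:
    """Counts sessions per UTC calendar day and refuses work past the limit."""

    limit: int = 5  # the post reports 5 sessions per day
    _day: Optional[date] = None
    _used: int = 0

    def try_start(self, now: datetime) -> bool:
        """Consume one session for now's UTC day; return False if none remain."""
        if now.tzinfo is None:
            raise ValueError("now must be timezone-aware")
        today = now.astimezone(timezone.utc).date()
        if today != self._day:
            # Midnight UTC rollover: a new day resets the counter.
            self._day = today
            self._used = 0
        if self._used >= self.limit:
            return False  # hard stop until the next UTC day
        self._used += 1
        return True


quota = DailySessionQuota()
late = datetime(2026, 5, 12, 23, 50, tzinfo=timezone.utc)
attempts = [quota.try_start(late) for _ in range(6)]  # five succeed, the sixth is refused
after_midnight = datetime(2026, 5, 13, 0, 0, tzinfo=timezone.utc)
resumed = quota.try_start(after_midnight)  # succeeds again after the reset
```

Keying the counter on the date rather than on a timer means a daemon restart or a long sleep across midnight still resets correctly on the first call of the new day.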
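
For the `[port-sheet] token error: HTTP Error 400: Bad Request` failure, the log line shows only the status, not the reason. The wording matches how Python's `urllib.error.HTTPError` renders itself, so the sketch below assumes (without having read port_sheet_sync.py) that the script refreshes its token with `urllib`. It builds a standard refresh-token request against Google's token endpoint and reads the JSON error body, which is what separates an expired or revoked refresh token (`invalid_grant`) from a client configuration problem. The function names and the returned labels are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) a refresh-token grant request."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def classify_token_error(status, body_text):
    """Map a token-endpoint failure to a suggested next step."""
    try:
        error = json.loads(body_text).get("error", "")
    except (ValueError, AttributeError):
        error = ""  # body was not a JSON object
    if status == 400 and error == "invalid_grant":
        return "reauthorize"  # refresh token expired or revoked
    if status in (400, 401) and error == "invalid_client":
        return "fix-client-config"  # wrong client ID or secret
    return "investigate"


def refresh_access_token(client_id, client_secret, refresh_token):
    """Send the refresh request; raise with the classified reason on failure."""
    req = build_refresh_request(client_id, client_secret, refresh_token)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)["access_token"]
    except urllib.error.HTTPError as exc:
        detail = exc.read().decode("utf-8", "replace")
        reason = classify_token_error(exc.code, detail)
        raise RuntimeError(f"token refresh failed ({reason}): {detail}") from exc
```

Logging the classified reason and the response body alongside the status would turn the recurring error into something an alarm can match on, and would show at a glance whether re-running the consent flow is the right fix.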