Diagnosing and Remediating the jada-agent Orchestrator: OAuth Token Expiration and Turn-Limit Constraints

During routine infrastructure health checks on the jada-agent daemon (hosted on AWS Lightsail instance 34.239.233.28), we uncovered a critical authentication failure in the port sheet synchronization pipeline and identified architectural constraints around Claude API turn limits that warrant investigation. This post details the diagnosis methodology, findings, and remediation path.

What Was Done

We performed a comprehensive health audit of the jada-agent orchestrator daemon spanning service status, resource utilization, task queue activity, and error log analysis across a 24-hour window. The investigation combined multiple data collection strategies:

  • Service introspection: Direct SSH access to the Lightsail instance to inspect jada-agent.service systemd unit status, uptime, and resource consumption
  • Log aggregation: Parsed daemon logs from /var/log/jada-agent/ to identify error patterns, task completion rates, and script execution timestamps
  • Metrics collection: Queried the AWS Lightsail API for CPU, network, and status check data across the previous 2-hour window; memory and disk figures were taken on the instance, since Lightsail instance metrics do not include them
  • OAuth token validation: Traced recurring HTTP 400 errors in port sheet sync logs back to expired Google OAuth credentials
  • Turn-limit analysis: Correlated exit code 1 events with Claude API turn exhaustion patterns in task execution logs
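
The log aggregation step can be approximated with a short script. This is a minimal sketch, not the tooling actually used in the audit: the "[component] message" layout is inferred from the single error line quoted later in this post, and the name count_signatures is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "[port-sheet] token error: HTTP Error 400: Bad Request".
# The "[component] message" layout is an assumption based on that one line.
LINE_RE = re.compile(r"\[(?P<component>[^\]]+)\]\s+(?P<message>.+)")


def count_signatures(lines):
    """Count identical (component, message) pairs across log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["component"], match["message"].strip())] += 1
    return counts


sample = [
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "[port-sheet] token error: HTTP Error 400: Bad Request",
    "unrelated line without a component tag",
]
for (component, message), n in count_signatures(sample).most_common():
    print(f"{n:>4}  [{component}] {message}")
```

Sorting by frequency puts a recurring failure such as the port sheet token error at the top of the output, where one-off errors would otherwise bury it.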

Technical Details and Findings

Service Health: Baseline Good

The jada-agent daemon has been running cleanly since May 10, 2026 with 3 days of uptime and zero systemd restart events. Resource utilization is nominal:

  • CPU: 0.65% average across 60-second polling cycles with no spike events in the last 2 hours
  • Memory: 144 MB resident (out of 914 MB available) — well within safe margins for a Python long-running process
  • Disk: 6.2 GB consumed from 39 GB total (17%) — ample headroom for log rotation and task artifacts
  • Network: All AWS status checks passing; no packet loss or latency anomalies
  • Load average: 0.00 between task executions — idle state is expected behavior when the task queue is empty

Task Execution Pattern: Three Sessions, Two Incomplete

The daemon processed 3 sessions within the May 13 UTC 00:00–00:05 window:

Session 1 (00:00 UTC)  → Exit code 1 (max turns reached)
Session 2 (00:02 UTC)  → Exit code 0 (completed successfully)
Session 3 (00:05 UTC)  → Exit code 1 (max turns reached)

Session 2 did productive work: the daemon processed blockers related to an e-signature page and the crew page generator code, then created a needs-you task in the progress dashboard for manual follow-up. Sessions 1 and 3 both exhausted the 30-turn per-session limit on Claude API calls, which the daemon correctly logs as a non-fatal error. The service keeps running and polls the task queue normally after each exit.

This matches yesterday's behavior, when 5 sessions consumed all available turns before the UTC midnight reset. Taken together, the two days suggest that task complexity is consuming the turn budget faster than the current turn-limit configuration anticipates.
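
To track this over time, the share of sessions that end in turn exhaustion can be computed from the exit codes. The sketch below uses the three sessions listed above and assumes that exit code 1 always means "max turns reached", which holds for these log entries but may not hold in general. The Session type and the function name are hypothetical.

```python
from dataclasses import dataclass

MAX_TURNS_EXIT = 1  # exit code logged when the turn limit is hit (assumed unique)


@dataclass
class Session:
    started_utc: str
    exit_code: int


def turn_exhaustion_rate(sessions):
    """Fraction of sessions that ended by running out of turns."""
    if not sessions:
        return 0.0
    exhausted = sum(1 for s in sessions if s.exit_code == MAX_TURNS_EXIT)
    return exhausted / len(sessions)


# The three sessions from the May 13 window.
sessions = [Session("00:00", 1), Session("00:02", 0), Session("00:05", 1)]
print(f"{turn_exhaustion_rate(sessions):.0%} of sessions hit the turn limit")
```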

Critical: port_sheet_sync OAuth Token Expiration

Every scheduled run of port_sheet_sync.py (one every 30 minutes) has failed since at least the afternoon of May 13, with the following error signature:

[port-sheet] token error: HTTP Error 400: Bad Request

The root cause is a revoked or expired Google OAuth 2.0 refresh token stored in the jada-agent credential vault. The port_sheet_sync.py script uses the Google Sheets API to synchronize task metadata and crew assignments from a Google Sheet into the internal progress dashboard. When the token becomes invalid, the sync fails without surfacing anywhere an operator would look: the error reaches the log, but no data is written and no alert is raised to the task queue.

This is a common failure mode in long-running daemons that rely on third-party OAuth flows. The token was likely issued months ago and either:

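  • Explicitly revoked by the user, or invalidated by Google automatically (for example, when an account exceeds the per-client limit on live refresh tokens and the oldest one is dropped)
  • Expired through disuse (Google invalidates refresh tokens that go unused for six months, although a token exercised every 30 minutes should not reach that window)
  • Invalidated by a password change on the underlying Google account (Google documents this for tokens that carry Gmail scopes, so it applies only if this token was granted one)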
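
All three causes typically look the same on the wire: Google's token endpoint answers the refresh request with HTTP 400 and a JSON body whose error field is invalid_grant. The sketch below shows one way the script could tell "re-authorize" apart from "retry". It is an illustration under assumptions, not the code in port_sheet_sync.py; the function name and the returned labels are hypothetical.

```python
import json


def classify_token_error(status, body):
    """Map a token-endpoint failure to a coarse remediation hint."""
    try:
        error = json.loads(body).get("error", "")
    except (ValueError, TypeError, AttributeError):
        error = ""  # body was missing, not JSON, or not a JSON object
    if status == 400 and error == "invalid_grant":
        return "reauthorize"  # refresh token expired or revoked
    if status in (400, 401) and error == "invalid_client":
        return "check-client-credentials"
    if status >= 500:
        return "retry-later"
    return "investigate"


print(classify_token_error(400, '{"error": "invalid_grant"}'))
```

The logged string "HTTP Error 400: Bad Request" matches how Python's urllib formats an HTTPError, so the script may be using urllib. If it is, the exception's code attribute and read() method supply the two arguments. The response body does not say which of the three causes applied, only that a new consent flow is needed.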

The 30-minute sync loop continues to execute without raising an alarm, creating a silent data consistency gap between the Google Sheet and the dashboard.
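
One low-cost way to close this gap is to count consecutive failures and raise an alert once a threshold is crossed. The following is a minimal sketch: the class name, the threshold of 3, and the print call standing in for a real alert channel are all assumptions, not part of the current daemon.

```python
class FailureStreak:
    """Track consecutive failures and signal when an alert is due."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.count = 0

    def record(self, ok):
        """Return True exactly once per streak, when it reaches the threshold."""
        if ok:
            self.count = 0  # any success ends the streak
            return False
        self.count += 1
        return self.count == self.threshold


streak = FailureStreak(threshold=3)
for ok in [False, False, False, False]:
    if streak.record(ok):
        # Placeholder for a real alert, e.g. creating a needs-you task.
        print("ALERT: port_sheet_sync has failed 3 runs in a row")
```

With a 30-minute interval, a threshold of 3 fires about an hour after the first failed run. Firing only when the count equals the threshold avoids re-alerting on every later failure in the same streak.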

Infrastructure and Architecture Context

Lightsail Instance Configuration

The jada-agent daemon runs on a single AWS Lightsail instance (34.239.233.28) configured with:

  • Instance type: Small (1 vCPU, 1 GB RAM)
  • Storage: 40 GB SSD
  • OS: Ubuntu 22.04 LTS
  • Service manager: systemd, unit file at /etc/systemd/system/jada-agent.service
  • Log location: /var/log/jada-agent/daemon.log (rotated daily)
  • Task queue integration: Polls progress dashboard API every 60 seconds for new agent-task entries

Credential Management Strategy

Credentials for both Claude API (for agent task execution) and Google OAuth tokens (for third-party syncs) are stored in an encrypted credential vault on the instance. Access is controlled via AWS Lightsail SSH key pairs. The port sheet sync script loads its OAuth token from this vault at startup, then uses a refresh token flow to obtain fresh short-lived access tokens every 30 minutes.

The architecture assumes refresh tokens remain valid indefinitely unless explicitly revoked. That assumption is common, and it fails silently when Google's policies or user actions invalidate the token, because the daemon receives no notification.
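
Because no notification arrives from Google, the daemon has to detect the condition itself. A simple complement to error handling is a staleness check on the last successful sync, sketched below. It assumes the script records a last-success timestamp somewhere, which the current implementation may not do; the function name, the two-run tolerance, and the example timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

SYNC_INTERVAL = timedelta(minutes=30)  # sync cadence stated in this post


def is_sync_stale(last_success, now, max_missed_runs=2):
    """True when more than max_missed_runs intervals passed without a success."""
    return now - last_success > SYNC_INTERVAL * max_missed_runs


# Illustrative timestamps, not taken from the logs.
now = datetime(2026, 5, 13, 18, 0, tzinfo=timezone.utc)
last_ok = datetime(2026, 5, 13, 12, 0, tzinfo=timezone.utc)
if is_sync_stale(last_ok, now):
    print("port_sheet_sync has not succeeded for", now - last_ok)
```

Unlike an error counter, this check also catches the case where the sync job stops running altogether.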

Key Decisions and Constraints

Why Turn Limits Are a Problem

The Claude API integration uses a hard 30-turn limit per task session. This design constraint originated as a guardrail to prevent runaway cost or infinite loops. However, the current task mix (complex page blockers, code generation, data schema decisions) regularly consumes the full 30-turn budget within a single session, as two of the three sessions on May 13 did.

When the limit is hit, the daemon logs exit code 1 but does not requeue the task. This means:

  • Incomplete work is abandoned (not persisted to the progress dashboard)
  • No alert is raised to signal turn exhaustion
  • The next task in the queue is picked up normally
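
A possible alternative is to persist whatever the session produced and requeue the task with a bounded retry count, escalating to a needs-you task when retries run out. The sketch below illustrates that routing. It is not the daemon's current behavior; the Task fields, the cap of 3 attempts, and the list-based queues are assumptions made for the example.

```python
from dataclasses import dataclass

MAX_TURNS_EXIT = 1  # exit code logged on turn exhaustion
MAX_ATTEMPTS = 3    # hypothetical cap to avoid requeueing forever


@dataclass
class Task:
    task_id: str
    attempts: int = 0
    notes: str = ""


def handle_exit(task, exit_code, partial_notes, queue, needs_you):
    """Route a finished session: done, requeue with notes, or escalate."""
    if exit_code == 0:
        return "done"
    task.attempts += 1
    task.notes = partial_notes  # keep partial progress for the next session
    if exit_code == MAX_TURNS_EXIT and task.attempts < MAX_ATTEMPTS:
        queue.append(task)
        return "requeued"
    needs_you.append(task)  # out of retries, or an unexpected exit code
    return "escalated"


queue, needs_you = [], []
task = Task("example-task")
print(handle_exit(task, 1, "schema drafted; page generator pending", queue, needs_you))
```

Carrying the notes forward matters as much as the requeue itself: without them, the next session would start from scratch and likely exhaust its turns at the same point.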

Raising the turn limit is not a free fix: it increases per-task API cost and latency. Task decomposition or parallel session management is likely a better approach, but either one requires architectural changes to the orchestrator.
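
As a rough illustration of task decomposition, subtasks with estimated turn counts can be packed greedily into sessions that each stay under the limit. This is a sketch under a strong assumption, namely that turn usage can be estimated up front; the subtask names and estimates below are invented for the example, and plan_sessions is a hypothetical helper.

```python
TURN_LIMIT = 30  # per-session limit described in this post


def plan_sessions(subtasks, limit=TURN_LIMIT):
    """Greedily pack (name, estimated_turns) pairs into sessions under the limit."""
    sessions, current, used = [], [], 0
    for name, turns in subtasks:
        if turns > limit:
            raise ValueError(f"{name} needs {turns} turns; split it further")
        if used + turns > limit:
            sessions.append(current)  # close the full session, start a new one
            current, used = [], 0
        current.append(name)
        used += turns
    if current:
        sessions.append(current)
    return sessions


# Invented estimates, for illustration only.
plan = plan_sessions([("e-signature page", 20), ("crew page generator", 15), ("schema review", 10)])
print(plan)
```

The packing preserves the input order, which keeps dependent subtasks in sequence at the cost of sometimes using more sessions than an optimal packing would.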

Why Silent OAuth Failures Are Critical

The port sheet sync failure is a data consistency issue masquerading as benign. From the daemon's perspective, the cron-like sync job runs every 30 minutes, makes an API call, and