Diagnosing and Resolving jada-agent Daemon Health Issues on AWS Lightsail

Last Tuesday morning, we conducted a comprehensive health audit of the jada-agent orchestrator daemon running on our AWS Lightsail instance (34.239.233.28). The goal was to verify service stability, confirm active task processing, and identify any operational bottlenecks. This post walks through our diagnostic approach, the issues we discovered, and the remediation strategy we're implementing.

Overview of Findings

The daemon itself is healthy and stable: the service has run without restarts since May 10, the instance has been up for 11 days, and we saw normal resource utilization and zero infrastructure failures. However, we uncovered two distinct issues requiring attention:

  • A broken Google OAuth token in the port_sheet_sync.py script causing sync failures every 30 minutes
  • Agent runs repeatedly hitting the configured 30-turn limit, leaving complex workflows incomplete

Diagnostic Methodology

Challenge: The jada-key private key was not stored in the standard ~/.ssh/jada-key location, and repos.env did not contain an explicit path reference. Rather than delay the diagnosis, we considered two access paths that do not depend on the persistent key:

  1. AWS Lightsail API temporary credentials: We called the Lightsail GetInstanceAccessDetails endpoint to generate a temporary SSH keypair valid for 60 minutes. This eliminated the need to hunt for persistent keys and provided a clean audit trail.
  2. AWS Systems Manager Session Manager: As a fallback, we could have used SSM for interactive shell access without SSH keys, but the temporary Lightsail credentials proved faster.

Command sequence:

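# Request temporary SSH credentials once and keep the JSON response
aws lightsail get-instance-access-details \
  --instance-name jada-agent-prod \
  --region us-east-1 \
  --protocol ssh \
  --query 'accessDetails.{cert:certKey,pk:privateKey}' \
  --output json > /tmp/jada_access.json

# Extract private key and certificate (assumes jq is installed), set permissions.
# ssh loads the certificate automatically from <key>-cert.pub.
jq -r '.pk' /tmp/jada_access.json > /tmp/jada_key
jq -r '.cert' /tmp/jada_access.json > /tmp/jada_key-cert.pub
chmod 600 /tmp/jada_key
ssh -i /tmp/jada_key ubuntu@34.239.233.28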
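The extraction step can also be done without jq. The following is a minimal Python sketch, assuming the response has already been parsed into a dict with the privateKey and certKey fields of the accessDetails object. The function name and file names are hypothetical and chosen to match the commands above.

```python
import os
from pathlib import Path


def write_temp_ssh_files(access_details: dict, directory: str) -> Path:
    """Write the temporary private key and certificate; return the key path.

    `access_details` is assumed to be the `accessDetails` object from
    GetInstanceAccessDetails, with `privateKey` and `certKey` string fields.
    """
    key_path = Path(directory) / "jada_key"
    # ssh looks for a certificate named <identity file>-cert.pub.
    cert_path = Path(directory) / "jada_key-cert.pub"

    # Create the key file with mode 0600 so it is never group/world-readable.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(access_details["privateKey"])
    # Tighten the mode again in case the file already existed with looser bits.
    os.chmod(key_path, 0o600)

    cert_path.write_text(access_details["certKey"])
    return key_path
```

Creating the file with restrictive permissions up front avoids the short window in which a redirect followed by chmod leaves the key readable by other users.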

Service Status and Infrastructure Health

Once connected, we verified the systemd unit:

sudo systemctl status jada-agent.service
sudo journalctl -u jada-agent.service -n 100 --no-pager

Results: The service has been active and running since May 10 with zero restarts. The daemon implements a 60-second polling loop that queries the progress dashboard for pending tasks. During idle periods (between task assignments), the process consumes approximately 0.65% CPU, which is expected behavior for a polling agent.
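To make the idle behavior concrete, here is a minimal sketch of a 60-second polling loop. Only the interval comes from the observations above. The function name, the callbacks, and the max_polls parameter are hypothetical and do not reflect the daemon's actual code.

```python
import time
from typing import Callable, List, Optional

POLL_INTERVAL_SECONDS = 60  # interval from the post; the rest is illustrative


def poll_for_tasks(
    fetch_pending: Callable[[], List[str]],
    handle_task: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> int:
    """Poll until max_polls is reached (forever if None); return tasks handled."""
    handled = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        for task in fetch_pending():
            handle_task(task)
            handled += 1
        # The process spends almost all of its idle time in this sleep,
        # which is why a polling agent shows near-zero CPU between tasks.
        sleep(POLL_INTERVAL_SECONDS)
    return handled
```

Passing sleep and max_polls as parameters is only there so the loop can be exercised in a test without waiting; a real daemon would run with the defaults.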

Resource metrics (via Lightsail API):

  • CPU: 0.65% average across last 2 hours; no spikes detected
  • Memory: 144MB of 914MB allocated—15.8% utilization
  • Disk: 6.2GB of 39GB used (17%)—ample headroom for logs and task artifacts
  • Network: ~2KB inbound/outbound per minute (minimal, as expected for a daemon in idle state)
  • Status checks: 0 failures in the last 2 hours; instance is fully healthy from AWS perspective

The instance uptime of 11 days indicates stable infrastructure with no unexpected reboots or hardware issues.

Agent Session Activity Analysis

We reviewed the daemon's task log by parsing the session counter and checking recent Claude API invocations:

tail -n 20 /var/log/jada-agent/sessions.log
grep "exit_code" /var/log/jada-agent/task_summary.json | tail -n 10
ps aux | grep "[j]ada"

Today's activity (May 13, UTC):

  • Session 1 (00:00 UTC): Ran 30 turns, hit max limit, exited with code 1
  • Session 2 (00:02 UTC): Completed successfully, processed blockers on e-signature and crew page generator, created a needs-you task
  • Session 3 (00:05 UTC): Ran 30 turns, hit max limit, exited with code 1
  • After 00:05: Daemon queried progress dashboard, found no pending tasks, resumed normal idle state

Yesterday's pattern: The daemon reached its hard stop of 5 sessions before midnight UTC and had 3 pending tasks queued. These were automatically cleared at the daily rollover (midnight), which is expected behavior by design.
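The daily cap and midnight rollover can be sketched as a small counter keyed on the UTC date. This is an illustration under assumptions, not the daemon's implementation: the class and method names are hypothetical, and it models only the session counter, not what happens to queued tasks at rollover.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone

MAX_SESSIONS_PER_DAY = 5  # hard stop quoted in the post


@dataclass
class SessionBudget:
    """Tracks sessions used per UTC day (hypothetical helper)."""

    day: date = date.min
    used: int = 0

    def try_start(self, now: datetime) -> bool:
        """Return True and count a session if today's budget allows it.

        `now` must be timezone-aware so the UTC date is unambiguous.
        """
        today = now.astimezone(timezone.utc).date()
        if today != self.day:
            # Daily rollover at midnight UTC: reset the counter.
            self.day, self.used = today, 0
        if self.used >= MAX_SESSIONS_PER_DAY:
            return False
        self.used += 1
        return True
```

With this shape, a sixth call on the same UTC day is refused, and the first call after midnight UTC succeeds again.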

Critical Issue: Broken Google OAuth Token in port_sheet_sync

The most significant finding was in the port_sheet_sync.py script, which runs as a scheduled task every 30 minutes via cron:

*/30 * * * * /opt/jada-agent/scripts/port_sheet_sync.py >> /var/log/port_sheet_sync.log 2>&1

Every sync since at least this afternoon has been failing with:

[port-sheet] token error: HTTP Error 400: Bad Request

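Root cause: The Google OAuth credentials stored in the daemon's credential store (likely in /opt/jada-agent/config/credentials/google_oauth.json) have most likely expired or been revoked. The "token error" prefix suggests the failure occurs while the script exchanges its refresh token for an access token, before any Sheets API call is made. A 400 at that step typically carries the OAuth error code invalid_grant, meaning the refresh token is no longer valid; an expired access token presented to the Sheets API would normally produce a 401 instead.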
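The logged message records only the HTTP status. OAuth token endpoints normally return a JSON body with an error field (and often error_description) alongside a 400, so logging that body would confirm the diagnosis. The sketch below assumes the script uses urllib, which the "HTTP Error 400: Bad Request" wording suggests but does not prove. The function name is hypothetical.

```python
import json
import urllib.error


def describe_token_error(err: urllib.error.HTTPError) -> str:
    """Return a one-line log message that includes the OAuth error code, if any."""
    raw = err.read().decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    # Standard OAuth 2.0 error responses use `error` and `error_description`.
    code = body.get("error", "unknown")
    detail = body.get("error_description", "")
    return f"[port-sheet] token error: HTTP {err.code} {code} {detail}".rstrip()
```

Calling this from the script's except urllib.error.HTTPError handler would turn the current opaque line into one that distinguishes an invalid refresh token from, say, a malformed request.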

Impact: Port sheet syncs have not completed for at least 8 hours. Any downstream processes relying on fresh port sheet data are working with stale information.

Remediation required: Re-authorize the Google account to obtain a new OAuth refresh token. This likely involves:

  1. Triggering a new Google OAuth flow via your configured consent screen
  2. Updating the stored refresh token in the credentials directory
  3. Validating the sync completes successfully on the next 30-minute interval
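For step 2, one way to update the stored token without a cron run ever reading a half-written file is an atomic replace. This sketch assumes the credentials file is JSON with a refresh_token field; the actual layout of the file was not confirmed during the audit, and the function name is hypothetical.

```python
import json
import os
import tempfile


def store_refresh_token(path: str, refresh_token: str) -> None:
    """Replace the `refresh_token` field in a JSON credentials file atomically."""
    with open(path) as fh:
        creds = json.load(fh)
    creds["refresh_token"] = refresh_token

    # Write to a temp file in the same directory, then rename it over the
    # original. mkstemp creates the file readable by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(creds, fh, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
```

Other fields in the file are preserved as-is, so the client ID and secret do not need to be re-entered.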

Secondary Issue: Max-Turn Limits During Complex Tasks

Two of today's three agent sessions exited with code 1 after hitting the configured 30-turn limit. This is not a crash or service failure: the daemon logs it as an error and continues processing. However, it indicates that certain task types are exceeding the configured turn budget.

Why this matters: Complex, multi-step tasks (e.g., analyzing blockers across multiple pages, generating code for dependent systems) may need more than 30 turns to complete. When the limit is hit, the agent terminates cleanly, but the task remains incomplete and may be re-queued or dropped depending on the error-handling logic.

Possible solutions:

  • Increase the max-turn limit in the daemon configuration (trade-off: higher API costs per task)
  • Restructure complex tasks into smaller, independent subtasks that fit within the turn budget
  • Implement multi-session task continuation logic where a task can resume in a subsequent session
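The third option can be sketched as a checkpoint that survives between sessions. This is a simplified illustration in which one step consumes exactly one turn. The TaskState structure and run_session function are hypothetical, and persisting the state to disk is left out.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_TURNS = 30  # per-session budget from the post


@dataclass
class TaskState:
    """Hypothetical checkpoint carried from one session to the next."""

    steps: List[str]
    next_step: int = 0

    @property
    def done(self) -> bool:
        return self.next_step >= len(self.steps)


def run_session(
    state: TaskState,
    run_step: Callable[[str], None],
    max_turns: int = MAX_TURNS,
) -> int:
    """Run up to max_turns steps from the checkpoint; return an exit code."""
    turns = 0
    while not state.done and turns < max_turns:
        run_step(state.steps[state.next_step])
        # Advance the checkpoint only after the step has finished.
        state.next_step += 1
        turns += 1
    # 0 when the task finished, 1 when the turn budget ran out first.
    return 0 if state.done else 1
```

A 45-step task would exit with code 1 after the first session with next_step at 30, then finish with code 0 in the second session instead of starting over.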

Key Decisions and Architecture Notes