Debugging a Cascading Deployment Failure: When AI Agents Break Production
This post documents a critical incident where an AI agent (Claude 4.5) was tasked with fixing a booking calendar race condition on sailjada.com but introduced multiple breaking changes across 23 HTML files, requiring emergency rollback and manual remediation.
The Initial Task: A Legitimate Bug Fix
The original issue was straightforward—a race condition in the booking modal on sailjada.com:
- The
jadaOpenBook()function opened the booking calendar modal immediately - Availability data was fetched asynchronously in the background
- Users could interact with an empty/incomplete calendar for 200-500ms before data populated
- This created UX jank and potential booking errors
The fix required adding an isLoading state flag and deferring modal open until availabilityData was available. This was a reasonable architectural improvement.
What Went Wrong: The Cascade
The agent was instructed to "apply the same fix to all other pages." Instead of verifying the fix was correct first, the agent:
- Applied the fix to all 23 HTML files in the sailjada.com repository
- Introduced invalid JavaScript by leaving Python format-string escapes in the code
- Deployed to staging without syntax validation
- Marked the staging deployment as "ready for CB review" without catching the errors
The primary issue: the agent's fix included lines like:
{{ isLoading: false }} // Invalid JS — looks like Jinja/Mustache templating
This syntax is valid in Python f-strings and Jinja2 templates, but not in JavaScript. The double-braces should have been single curly braces in the object literal.
Root Cause Analysis
Upon investigation, the repository revealed two distinct issues:
- Legitimate CSS: Many files contain
{{ ... }}in CSS media queries and animations—this is valid - Invalid JavaScript: The new
jadaOpenBook()function used double-braces in object initialization - Template Artifacts: Some production files still contained Python placeholders like
{STRIPE_LINK}, suggesting a hybrid templating workflow
The agent didn't distinguish between:
- CSS contexts where
{{ }}is legitimate (e.g.,@supports (display: {{ var }})) - JavaScript contexts where only
{ }is valid - Template-rendered contexts where
{{ }}is valid
Emergency Response: Rollback Procedure
Once the issue was identified, we executed a full rollback:
- Identified all affected files: Searched for
jadaBookingState(the broken variable name) across the repositoryfind . -name "*.html" -type f | xargs grep -l "jadaBookingState" - Downloaded production versions from S3: Retrieved the known-good versions from the production S3 bucket
aws s3 cp s3://sailjada-prod/releases/rc1/index.html ./releases/rc1/index.html - Overwrote all 23 local files: Restored production versions across:
- /Users/cb/Documents/repos/sites/sailjada.com/index.html
- /Users/cb/Documents/repos/sites/sailjada.com/releases/rc1/index.html
- All 21 other affected pages
- Verified restoration: Confirmed
jadaOpenBookwas restored to its original (working) implementation andjadaBookingStatewas completely removed - Deleted broken staging deployment: Removed https://queenofsandiego.com/_staging/sailjada/ to prevent accidental promotion
- No pre-deployment linting: The HTML/JS files should be validated with
eslintor similar before staging - No syntax checking in CI/CD: A simple
node -ccheck would have caught the invalid JS - Template vs. static HTML confusion: The repo appears to have hybrid Python templating + static HTML, which confused the agent
- No agent guardrails: The agent should have been restricted from deploying without manual approval
- Add pre-deployment validation: Integrate ESLint and HTMLHint into the build pipeline
npm run lint -- "releases/**/*.html" - Require code review before staging: Agent staging deployments should be marked as "awaiting review" and not auto-promoted
- Separate template and static files: Store Jinja2/Python templates separately from pure HTML to reduce confusion
- Use version control properly: The 12+ edits to index.html without meaningful commit messages made rollback harder to reason about
- Establish agent task boundaries: "Apply fix to all files" should trigger a safety prompt requiring explicit per-file approval
Infrastructure Implications
The incident revealed process gaps:
Production files are stored in S3 buckets (exact bucket names withheld for security), served through CloudFront distribution with Route53 DNS. The staging deployment was to a separate S3 path (_staging/sailjada/) for review before production push.
What's Next: Process Improvements
To prevent similar cascades:
Status: Production is restored and clean. Staging has been purged. The original race condition fix is pending proper implementation with correct syntax and pre-deployment validation.
```