Preventing S3 Deployment Regressions: A Case Study in State Management and Runbook Design

During a recent development session on the queenofsandiego.com property, a deployment regression undid three previously shipped changes in production: it removed a hero image crossfade animation, removed a Stripe embedded checkout flow, and resurrected marketing copy that had been deliberately deleted. The root cause was a stale local index.html file deployed over a newer S3 production version with no diff or staging validation first. This post covers the technical safeguards we implemented to prevent similar regressions, the architectural decisions behind them, and patterns applicable to any S3-backed static site.

What Went Wrong: The Regression Chain

The incident unfolded in three steps:

  • Stale local state: A developer's local copy of /Users/cb/Documents/repos/sites/queenofsandiego.com/index.html was several commits behind the current S3 production version in the queenofsandiego-prod bucket.
  • Silent overwrite: A single aws s3 cp command deployed the local file directly to s3://queenofsandiego-prod/index.html without prior inspection or staging validation.
  • Immediate cache invalidation: The CloudFront distribution (E1ABC2DEFG3HI, fronting the prod bucket) was invalidated within seconds, so edge locations began fetching and serving the stale version before the error was detected.

The three changes lost were:

  • A CSS keyframe fade between two hero images (JADA splash → BOOK NOW overlay)
  • A Stripe embedded checkout session mounted in the booking modal
  • Removal of deprecated "For Ranch & Coast readers..." marketing text that had been intentionally deleted two sessions prior

A prior session summary had explicitly warned about stale local files. The warning was recorded in the Claude memory system, but nothing enforced it at execution time.

Infrastructure Context

Understanding the deployment topology is essential to the fix:

  • S3 buckets: queenofsandiego-prod (live site) and queenofsandiego-staging (pre-prod validation)
  • CloudFront distributions:
    • Production: E1ABC2DEFG3HI (apex + www subdomains, 300-sec default TTL)
    • Staging: E2XYZ9UVWXYZ1 (staging.queenofsandiego.com, 60-sec TTL)
  • Route53 zones: Both domains managed in the primary hosted zone; MX, SPF, DMARC records for transactional email via SES
  • No S3 versioning enabled: This was the critical gap — once a file is overwritten, the prior version is lost unless manually snapshotted
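The versioning gap described above can be checked from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function name bucket_is_versioned is hypothetical and not part of the original runbook.

```shell
# Sketch only: bucket_is_versioned is a hypothetical helper name.
bucket_is_versioned() {
    # $1 = bucket name
    versioning=$(aws s3api get-bucket-versioning --bucket "$1" \
        --query Status --output text) || return 2
    # Status is "Enabled" or "Suspended"; a bucket that never had
    # versioning turned on returns no Status field at all.
    [ "$versioning" = "Enabled" ]
}
```

A possible use is as a guard before any overwrite, for example: bucket_is_versioned queenofsandiego-prod || echo "no versioning: snapshot before overwriting".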

The Eight Hard Rules: Enforcement via Runbook

To prevent future regressions, we encoded eight mandatory checks into a project-specific Claude instruction file that auto-loads on every session. These are stored in /Users/cb/Documents/repos/sites/queenofsandiego.com/CLAUDE.md and referenced from the workspace-wide memory system:

D1 — Pull S3 and diff before any edit: Before modifying index.html locally, always run:

aws s3 cp s3://queenofsandiego-prod/index.html ./index.html.s3-prod
diff -u index.html.s3-prod index.html | head -50

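A diff cannot show which file is older, only how they differ. With this argument order, lines prefixed with - exist only in the S3 copy. If the S3 copy contains content the local file lacks, treat the local file as stale: stop and ask CB before proceeding. This catches stale local state upfront.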
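The D1 check can also be wrapped so that a mismatch produces a non-zero exit status instead of relying on someone reading the diff. This is a sketch under assumptions: the function name pull_and_diff and its exit codes are hypothetical, and it treats any difference as a reason to pause.

```shell
# Sketch of rule D1. pull_and_diff and its exit codes are hypothetical.
pull_and_diff() {
    # $1 = S3 URI of the live object, $2 = local file to compare against
    prod_copy="$2.s3-prod"
    aws s3 cp "$1" "$prod_copy" --only-show-errors || return 2
    # cmp -s is silent and succeeds only when the files are byte-identical.
    if cmp -s "$prod_copy" "$2"; then
        echo "local matches S3: safe to edit"
        return 0
    fi
    echo "local differs from S3: review before editing"
    diff -u "$prod_copy" "$2" | head -50
    return 1
}
```

Example call: pull_and_diff s3://queenofsandiego-prod/index.html ./index.html. A return value of 1 means the two copies differ and the diff needs a human decision; 2 means the download itself failed.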

D2 — Staging-only, single-target deploys: Never deploy to prod and staging in the same command. Always deploy to staging first, validate, then promote:

# Correct
aws s3 cp index.html s3://queenofsandiego-staging/index.html
# (CB reviews on staging.queenofsandiego.com)
aws s3 cp index.html s3://queenofsandiego-prod/index.html

# Incorrect — never do this
aws s3 sync . s3://queenofsandiego-prod/ && aws s3 sync . s3://queenofsandiego-staging/
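One way to make the single-target rule harder to break is a wrapper that accepts exactly one target per call and refuses prod until staging has been signed off. This is a sketch, not the project's actual tooling: the function name deploy_to and the STAGING_APPROVED variable are assumptions introduced for illustration.

```shell
# Sketch of rule D2. deploy_to and STAGING_APPROVED are hypothetical names.
deploy_to() {
    # $1 = local file, $2 = target name ("staging" or "prod")
    case "$2" in
        staging)
            bucket="queenofsandiego-staging" ;;
        prod)
            # Prod requires an explicit marker that staging was reviewed.
            if [ "${STAGING_APPROVED:-no}" != "yes" ]; then
                echo "refusing prod deploy: staging not approved" >&2
                return 1
            fi
            bucket="queenofsandiego-prod" ;;
        *)
            echo "unknown target: $2" >&2
            return 2 ;;
    esac
    # Exactly one bucket is written per call.
    aws s3 cp "$1" "s3://$bucket/$(basename "$1")"
}
```

With this shape, deploy_to index.html staging always works, while deploy_to index.html prod fails until someone sets STAGING_APPROVED=yes after reviewing staging.queenofsandiego.com.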

D3 — One logical change per deployment: Group related CSS, JS, and HTML changes in a single commit and deployment. Do not batch unrelated feature work (e.g., hero fade + booking modal + email copy) into one deploy.

D4 — Obey prior session warnings: If a prior Claude session logged a concern in CLAUDE.md or MEMORY.md (e.g., "local index.html is stale"), treat it as blocking until resolved.

D5 — Snapshot prod before overwriting: Since S3 versioning is not enabled, manually create a timestamped backup:

aws s3 cp s3://queenofsandiego-prod/index.html ./backups/index.html.prod.$(date +%s)
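A snapshot is only useful if it actually contains the file. The sketch below wraps the D5 command with a check that the download is non-empty and prints the snapshot path so it can be reused for a rollback. The function name snapshot_prod is hypothetical.

```shell
# Sketch of rule D5. snapshot_prod is a hypothetical helper name.
snapshot_prod() {
    # $1 = S3 URI to back up, $2 = local backup directory
    mkdir -p "$2" || return 1
    snapshot="$2/$(basename "$1").prod.$(date +%s)"
    aws s3 cp "$1" "$snapshot" --only-show-errors || return 1
    # Refuse to treat an empty download as a valid backup.
    [ -s "$snapshot" ] || { echo "empty snapshot: $snapshot" >&2; return 1; }
    # Print the path so the caller can keep it for a rollback.
    echo "$snapshot"
}
```

A rollback under this scheme would be a copy in the opposite direction, for example: snap=$(snapshot_prod s3://queenofsandiego-prod/index.html ./backups) before the deploy, then aws s3 cp "$snap" s3://queenofsandiego-prod/index.html if the deploy goes wrong.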

D6 — Six-line proof block: Before any aws s3 cp command, print a validation block to chat showing:

  • Source file path and last-modified time
  • Target S3 URI
  • CloudFront distribution ID and invalidation scope
  • Line count or hash of the file being deployed
  • Explicit confirmation from CB (or escalation if uncertain)
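The proof block can be generated rather than typed by hand. The following sketch prints six lines by splitting the first item into path and timestamp, which is one plausible reading of the rule rather than the project's confirmed format. The function name print_proof_block and the label wording are hypothetical, and the timestamp is taken from ls -l output, whose column layout is assumed to follow the usual default format.

```shell
# Sketch of rule D6. print_proof_block and its labels are hypothetical.
print_proof_block() {
    # $1 = local source file, $2 = target S3 URI,
    # $3 = CloudFront distribution ID, $4 = invalidation path
    [ -f "$1" ] || { echo "missing source file: $1" >&2; return 1; }
    echo "Source:        $1"
    # Columns 6-8 of ls -l hold the modification date in the default format.
    echo "Modified:      $(ls -l "$1" | awk '{print $6, $7, $8}')"
    echo "Target:        $2"
    echo "Distribution:  $3 (invalidate $4)"
    echo "Lines / cksum: $(wc -l < "$1" | tr -d ' ') / $(cksum < "$1" | awk '{print $1}')"
    echo "Confirmed:     PENDING (wait for CB before running aws s3 cp)"
}
```

Example call: print_proof_block index.html s3://queenofsandiego-staging/index.html E2XYZ9UVWXYZ1 /index.html. The checksum uses POSIX cksum so the sketch runs unchanged on macOS and Linux; a stronger hash could be substituted where available.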

D7 — Feature token registry: Maintain a grep-able list of key HTML/CSS tokens in the file for each major feature. Before and after deploy, verify tokens are present in S3-current:

# Each of these should print a count of 1 or more (search the whole file)
aws s3 cp s3://queenofsandiego-prod/index.html - | grep -c "jada-to-book-fade"
aws s3 cp s3://queenofsandiego-prod/index.html - | grep -c "stripe-embedded-checkout"
# This should print 0: the deleted copy must stay deleted
aws s3 cp s3://queenofsandiego-prod/index.html - | grep -c "For Ranch & Coast"
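The registry idea extends naturally to a single pass/fail check over a downloaded copy of the page. This is a sketch under assumptions: the function name check_tokens is hypothetical, and the token list is limited to the three strings named in this post.

```shell
# Sketch of rule D7. check_tokens is a hypothetical helper name.
check_tokens() {
    # $1 = path to an HTML file (download the S3 object to a file first)
    status=0
    # Tokens that must be present.
    for token in "jada-to-book-fade" "stripe-embedded-checkout"; do
        if ! grep -q -- "$token" "$1"; then
            echo "MISSING required token: $token"
            status=1
        fi
    done
    # Text that must stay deleted.
    if grep -q -- "For Ranch & Coast" "$1"; then
        echo "FOUND forbidden text: For Ranch & Coast"
        status=1
    fi
    return "$status"
}
```

Example: aws s3 cp s3://queenofsandiego-prod/index.html /tmp/prod.html && check_tokens /tmp/prod.html. Running it once before and once after a deploy gives a direct comparison, and the non-zero exit status can stop a script from continuing.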

D8 — Escalate if S3 ahead of local: S3 objects carry no commit history, so "ahead" here means the S3 version contains changes or features not present in the local file. In that case, do not deploy. Pull the S3 version first and escalate to CB with a detailed diff.

Key Architectural Decisions

  • Why no S3 versioning? Versioning has no per-version fee, but every retained version is billed as ordinary storage (S3 Standard is roughly $0.023 per GB-month in us-east-1), and it adds recovery complexity. For a low-churn static site, we judged manual snapshots + CloudFront invalidation to provide sufficient safety without that overhead.
  • Why staging-first? The staging distribution has a 60-sec TTL vs. prod's 300-sec, so changes show up quickly during review, and a mistake is caught on staging before it can be served to real users.
  • Why feature tokens? grep is faster and more reliable than visual inspection for detecting whether a feature is present.