Implementing Comprehensive Infrastructure Snapshots: A Multi-Service Backup Strategy for JADA Production Systems
What Was Done
We executed a full infrastructure snapshot across all JADA-related systems, creating a v1.0 baseline backup covering 46 S3 buckets, 66 CloudFront distributions, 21 Lambda functions, 16 Route53 hosted zones, and associated configuration data. This snapshot captures the complete state of three production sites: queenofsandiego.com, sailjada.com, and salejada.com, plus all supporting infrastructure and code.
The Problem This Solves
Without a comprehensive snapshot strategy, recovery from accidental changes or deployment errors requires reconstructing infrastructure from memory or partial logs. The event pages reversal incident highlighted the critical need for point-in-time recovery across all layers: infrastructure-as-code, application code, database state, and content.
Technical Architecture
Parallel Snapshot Strategy
Rather than sequentially backing up each system, we deployed four concurrent agents to maximize throughput and minimize total runtime:
- Agent 1 (S3 Data Sync): syncing all 46 S3 buckets in parallel to snapshot directories using aws s3 sync, with a --region flag matching each bucket's origin region
- Agent 2 (Lambda Export): pulling function code via aws lambda get-function, extracting environment variables, and capturing function configuration (timeout, memory, VPC settings, IAM role ARNs)
- Agent 3 (AWS Configuration Export): exporting CloudFront distributions via aws cloudfront get-distribution-config, Route53 zones via aws route53 list-resource-record-sets, DynamoDB schemas via aws dynamodb describe-table, and IAM policies attached to Lambda execution roles
- Agent 4 (Local Codebase & Google Apps Script): pulling Git repositories, pulling four separate GAS projects (main JADA, Rady Shell replacement, Rady Shell legacy, EYD) via clasp pull, and copying local tools/dashboards
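The concurrent-agent pattern can be illustrated with a minimal sketch. It assumes each agent is wrapped as a plain Python callable; the function names below are hypothetical stand-ins, not the tooling actually used, and real agents would shell out to the AWS CLI, clasp, and git. The point it shows is failure isolation: one agent raising does not prevent the others from finishing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_agents(agents: Dict[str, Callable[[], str]]) -> Dict[str, dict]:
    """Run all agents concurrently and record each outcome separately."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        futures = {pool.submit(fn): name for name, fn in agents.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = {"ok": True, "detail": future.result()}
            except Exception as exc:
                # A failing agent is recorded, not re-raised, so the others finish.
                results[name] = {"ok": False, "detail": repr(exc)}
    return results


# Hypothetical stand-ins for the four agents described above.
def s3_sync() -> str:
    return "s3 done"


def lambda_export() -> str:
    raise RuntimeError("simulated export failure")


def aws_config_export() -> str:
    return "config done"


def local_and_gas() -> str:
    return "local done"


outcome = run_agents({
    "s3_sync": s3_sync,
    "lambda_export": lambda_export,
    "aws_config_export": aws_config_export,
    "local_and_gas": local_and_gas,
})
```

Threads are a reasonable fit here because the work is dominated by waiting on network and subprocess I/O rather than CPU.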
Snapshot Directory Structure
v1.0/
├── s3_buckets/
│ ├── queenofsandiego-prod/
│ ├── qos-staging/
│ ├── sailjada-prod/
│ ├── salejada-prod/
│ └── [42 additional buckets]
├── lambda_functions/
│ ├── function_name/
│ │ ├── code.zip
│ │ ├── config.json
│ │ └── environment_variables.json
│ └── [20 additional functions]
├── cloudfront/
│ ├── distribution_configs/
│ └── origin_mappings.json
├── route53/
│ ├── zones/
│ └── records.json
├── dynamodb/
│ ├── table_schemas/
│ └── backup_exports/
├── gas_projects/
│ ├── jada-main/
│ ├── rady-shell-replacement/
│ ├── rady-shell-legacy/
│ └── eyd/
├── local_repos/
│ ├── tools/
│ └── sites/
└── MANIFEST.md
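As a rough sketch, the fixed parts of this layout could be created up front so every agent writes into a known location. The helper name, the list constant, and the placeholder MANIFEST.md content are assumptions for illustration; per-bucket, per-function, and per-project subdirectories are left for the agents to create.

```python
import tempfile
from pathlib import Path

# Fixed directories taken from the tree above.
SNAPSHOT_DIRS = [
    "s3_buckets",
    "lambda_functions",
    "cloudfront/distribution_configs",
    "route53/zones",
    "dynamodb/table_schemas",
    "dynamodb/backup_exports",
    "gas_projects",
    "local_repos/tools",
    "local_repos/sites",
]


def create_snapshot_skeleton(parent: Path, version: str = "v1.0") -> Path:
    """Create the empty snapshot tree under parent/version and return its root."""
    root = parent / version
    for rel in SNAPSHOT_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    manifest = root / "MANIFEST.md"
    if not manifest.exists():
        manifest.write_text(f"Snapshot {version}\n")  # placeholder content
    return root


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = create_snapshot_skeleton(Path(tmp))
    created = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )
    has_manifest = (root / "MANIFEST.md").is_file()
```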
Infrastructure Details
AWS Services Captured
- S3: All 46 buckets including production content, staging mirrors, backup buckets, and Lambda deployment packages
- CloudFront: 66 distributions serving sailjada.com, queenofsandiego.com, salejada.com, staging variants, and staging QOS paths
- Lambda: 21 functions including the update_dashboard.py handler, webhook processors, and automation functions
- Route53: 16 hosted zones managing DNS for production domains and subdomains
- DynamoDB: 14 tables with application state and configuration data
- RDS/Aurora: Database connection strings and security group configurations
- Lightsail: Instance snapshot jada-agent-v1.0-20260509 capturing the agent/automation server itself
- IAM: Execution roles and policies for all Lambda functions and service accounts
- API Gateway: REST API definitions and stage configurations
- SES: Verified sender identities and email configuration
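For the services exported through the CLI calls named earlier (CloudFront, Route53, DynamoDB, Lambda), a small helper can build the read-only command for a given resource. This is a sketch only: the function name and the example identifiers are hypothetical, and the commands are returned as argument lists rather than executed.

```python
from typing import Dict, List

# Subcommand and identifier flag for each export call mentioned in this post.
EXPORT_TEMPLATES: Dict[str, List[str]] = {
    "cloudfront": ["aws", "cloudfront", "get-distribution-config", "--id"],
    "route53": ["aws", "route53", "list-resource-record-sets", "--hosted-zone-id"],
    "dynamodb": ["aws", "dynamodb", "describe-table", "--table-name"],
    "lambda": ["aws", "lambda", "get-function", "--function-name"],
}


def export_command(service: str, identifier: str) -> List[str]:
    """Return the argv list for exporting one resource as JSON."""
    if service not in EXPORT_TEMPLATES:
        raise ValueError(f"no export template for service: {service}")
    return EXPORT_TEMPLATES[service] + [identifier, "--output", "json"]


# Hypothetical identifiers, for illustration only.
table_cmd = export_command("dynamodb", "example-table")
zone_cmd = export_command("route53", "ZEXAMPLE123")
```

Returning a list instead of a shell string keeps the command safe to pass to subprocess.run without shell quoting.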
Google Apps Script Export
Captured four independent GAS projects via clasp pull commands, preserving all .gs code files, .html templates, and appsscript.json manifests. These handle critical business logic including inventory management, order processing, and reporting for the JADA ecosystem.
Key Decisions & Rationale
Why Parallel Agents Instead of Sequential Backup
Sequential backup would require 4+ hours for all components. Parallel agents reduce total runtime to ~30 minutes while keeping the components consistent, because every agent captures its layer at the same timestamp. Each agent is isolated, so a failure in one doesn't block the others.
Why Include Environment Variables Separately
Lambda functions without their environment variables are incomplete. We export the aws lambda get-function-configuration output separately and keep the environment variable names but not their values. The raw output also contains the values, so they must be stripped before the file is written; the secrets themselves stay in AWS Secrets Manager. This preserves deployment configuration that the function code alone does not capture.
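A minimal sketch of keeping names without values, assuming the function configuration has already been loaded as a dictionary with the Environment.Variables map that the Lambda API returns. The helper name and the sample configuration are made up for illustration.

```python
from typing import Any, Dict, List


def env_var_names(function_config: Dict[str, Any]) -> List[str]:
    """Return only the environment variable names, sorted, never the values."""
    variables = (function_config.get("Environment") or {}).get("Variables") or {}
    return sorted(variables)


# Hypothetical configuration shaped like a get-function-configuration response.
sample_config = {
    "FunctionName": "example-function",
    "Timeout": 30,
    "MemorySize": 256,
    "Environment": {
        "Variables": {
            "TABLE_NAME": "example-table",
            "API_TOKEN": "not-a-real-secret",
        }
    },
}

names = env_var_names(sample_config)
```

Writing only this list to environment_variables.json keeps secret values out of the snapshot directory.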
Why Snapshot Local Tools & GAS
Infrastructure-as-code and automation logic live in three places: Lambda, local Python tools, and Google Apps Scripts. Snapshotting only S3/CloudFront/Lambda misses critical deployment scripts (update_dashboard.py, release.py) and business logic (GAS projects). All three are essential for recovery.
Why Staging Buckets in v1.0
Production often diverges from staging during QA cycles. By capturing both, we can compare and debug discrepancies (like the Bob Dylan page $225 price issue or James Taylor events rendering). This dual capture proved invaluable during recent debugging.
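One way to compare a production bucket snapshot with its staging counterpart is a byte-level directory diff. The sketch below assumes both buckets have already been synced to local directories; the function names and the sample files are hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path
from typing import Dict, List, Set


def relative_files(root: Path) -> Set[str]:
    """Collect every file under root as a POSIX-style relative path."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_bucket_snapshots(prod: Path, staging: Path) -> Dict[str, List[str]]:
    """Report files unique to either side and files whose bytes differ."""
    prod_files, staging_files = relative_files(prod), relative_files(staging)
    common = sorted(prod_files & staging_files)
    # shallow=False forces a content comparison instead of trusting size/mtime.
    _, mismatch, errors = filecmp.cmpfiles(prod, staging, common, shallow=False)
    return {
        "only_in_prod": sorted(prod_files - staging_files),
        "only_in_staging": sorted(staging_files - prod_files),
        "differs": sorted(mismatch),
        "unreadable": sorted(errors),
    }


# Demonstration with throwaway directories and made-up page content.
with tempfile.TemporaryDirectory() as tmp:
    prod, staging = Path(tmp, "prod"), Path(tmp, "staging")
    (prod / "events").mkdir(parents=True)
    (staging / "events").mkdir(parents=True)
    (prod / "events" / "page.html").write_text("price: 100")
    (staging / "events" / "page.html").write_text("price: 200")
    (prod / "index.html").write_text("home")
    (staging / "index.html").write_text("home")
    (staging / "draft.html").write_text("draft")
    report = compare_bucket_snapshots(prod, staging)
```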
What's Next
- Version Control: Commit v1.0 manifest to a private Git repository with tamper-evident hashing
- Incremental Snapshots: Establish weekly v1.1, v1.2 snapshots tracking changes to code, configuration, and data
- Recovery Runbooks: Document step-by-step recovery procedures for each system layer
- Validation Script: Create checksums for all snapshot components to detect corruption or tampering
- Automated Restoration Testing: Quarterly dry-runs spinning up snapshots in a staging environment to verify recoverability
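The planned validation script could look roughly like the sketch below, which writes a SHA-256 checksum file for a snapshot directory and later reports any file that is missing or altered. The file name SHA256SUMS and the function names are assumptions, not an existing tool in this project.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large snapshot files do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(root: Path, out_name: str = "SHA256SUMS") -> Path:
    """Write one '<digest>  <relative path>' line per file under root."""
    lines = [
        f"{sha256_of(p)}  {p.relative_to(root).as_posix()}"
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != out_name
    ]
    out = root / out_name
    out.write_text("\n".join(lines) + "\n")
    return out


def verify_checksums(root: Path, out_name: str = "SHA256SUMS") -> List[str]:
    """Return relative paths that are missing or whose digest changed."""
    failed = []
    for line in (root / out_name).read_text().splitlines():
        if not line.strip():
            continue
        digest, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or sha256_of(target) != digest:
            failed.append(rel)
    return failed


# Demonstration: verify a clean snapshot, then after a simulated change.
with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp)
    (snap / "route53").mkdir()
    (snap / "route53" / "records.json").write_text("[]")
    write_checksums(snap)
    clean = verify_checksums(snap)
    (snap / "route53" / "records.json").write_text("[tampered]")
    tampered = verify_checksums(snap)
```

Committing the checksum file alongside the manifest would let a later run detect both corruption and unintended edits.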
Commands Reference
# List all S3 buckets
aws s3api list-buckets --query 'Buckets[].Name'
# Sync a single bucket into the snapshot (object metadata is not stored in the local copy)
aws s3 sync s3://bucket-name ./v1.0/s3_buckets/bucket-name/
# Export Lambda function code and config
aws lambda get-function --function-name function-name --region us-west-2
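# Clone a Google Apps Script project into the snapshot (SCRIPT_ID is a placeholder; clasp pull reads rootDir from .clasp.json)
clasp clone SCRIPT_ID --rootDir ./v1.0/gas_projects/project-name/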
# Export CloudFront distribution
aws cloudfront