Building a Multi-Carrier SMS Relay for QuickDumpNow: Twilio Integration & Cascading Failover
What Was Done
We integrated Twilio into the QuickDumpNow (QDN) infrastructure to solve a carrier-level routing constraint: the primary dispatch line (owned by QDN) couldn't cascade to secondary handlers at the telecom layer. Instead of upgrading carrier service tiers (expensive, slow), we built a software relay using Twilio's programmable SMS API to implement cascading failover: inbound messages route first to the primary handler (QDN), then to a secondary handler (Sergio), then to a backup (Sergio's designated 858-335-4807). This pattern decouples message routing logic from carrier limitations and gives us granular control over retry policies and state tracking.
Technical Details: The Relay Architecture
Credential Management
Twilio credentials were persisted in `/Users/cb/Documents/repos/.secrets/repos.env` with mode 600 (read/write owner only):
- TWILIO_ACCOUNT_SID: stored for API authentication across all Twilio SDK calls
- TWILIO_AUTH_TOKEN: stored for admin operations (phone number provisioning, webhook config)
- API Key + Secret pair: stored for runtime SDK initialization (preferred over the account auth token in production)
A reference memory file (reference_twilio_credentials.md) records where the credentials live and which auth strategy applies, so future sessions can locate them quickly without re-reading chat history.
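As a rough illustration of how a build script might read that file safely, the sketch below parses KEY=VALUE lines and refuses to load a file whose permissions are looser than 600. The function name `load_env_file` and the assumption that repos.env uses plain KEY=VALUE lines (with optional quotes and `#` comments) are illustrative; the actual file format is not shown in this write-up.

```python
import stat
from pathlib import Path
from typing import Dict


def load_env_file(path: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, rejecting files readable by group or other.

    Hypothetical helper: assumes a simple dotenv-style layout.
    """
    p = Path(path)
    mode = stat.S_IMODE(p.stat().st_mode)
    if mode & 0o077:  # any group/other permission bit is set
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")

    values: Dict[str, str] = {}
    for raw in p.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Checking the mode at load time means an accidental `chmod 644` fails loudly instead of silently leaving the auth token readable by other users.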
Message Flow State Machine
The relay implements a three-tier failover:
- Tier 1 (Primary): Inbound SMS arrives at QDN's Twilio-provisioned number. A webhook fires to Lambda, which attempts to forward the message to the primary handler's phone number (QDN internal dispatch).
- Tier 2 (Secondary): If Tier 1 delivery fails (timeout, carrier reject, handler unavailable), the Lambda catches the exception and retries via Twilio to Sergio's personal number.
- Tier 3 (Backup): If Tier 2 also fails, the relay escalates to Sergio's designated backup number (858-335-4807).
Each tier includes exponential backoff (2s, 4s, 8s) to avoid thundering herd on temporary outages. The relay stores state in a DynamoDB table (qdn-message-relay-state) keyed by message_id + timestamp, tracking tier attempts and final delivery confirmation.
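The tier cascade can be sketched as a small loop. This is one possible reading of the flow, assuming each tier gets up to three attempts with the 2s, 4s, 8s waits applied after each failed attempt; the function name `relay_with_failover` and the injected `send` and `sleep` callables are hypothetical, and the deployed relay defers waits to SQS rather than sleeping in-process.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

BACKOFF_SECONDS = (2, 4, 8)  # waits after the 1st, 2nd, and 3rd failed attempt


def relay_with_failover(
    message: str,
    tiers: Sequence[Tuple[str, str]],
    send: Callable[[str, str], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each (tier_name, phone_number) in order.

    Returns the name of the tier that accepted the message, or None if
    every tier exhausted its attempts.
    """
    for tier_name, phone_number in tiers:
        for delay in BACKOFF_SECONDS:
            try:
                if send(phone_number, message):
                    return tier_name
            except Exception:
                pass  # any send error counts as a failed attempt
            sleep(delay)
    return None
```

Passing `send` and `sleep` in as parameters keeps the cascade logic testable without touching Twilio or actually waiting.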
Lambda Function: Message Relay Logic
The core relay logic lives in `/Users/cb/Documents/repos/sites/dashboard.quickdumpnow.com/lambda/lambda_function.py`. Key functions:
- handle_inbound_webhook(event): parses Twilio webhook payload (sender, message body, timestamp) and initiates relay
- attempt_forward(tier, phone_number, message): calls Twilio API to send SMS; returns success/failure status
- log_relay_attempt(message_id, tier, status, error): writes state to DynamoDB for audit trail
- escalate_on_failure(message_id, current_tier): determines next tier and schedules SQS message for retry
The function is deployed as an AWS Lambda with environment variables for:
- TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN (injected from repos.env at build time)
- PRIMARY_HANDLER_PHONE: QDN internal dispatch
- SECONDARY_HANDLER_PHONE: Sergio's personal number
- BACKUP_HANDLER_PHONE: Sergio's backup number
- DYNAMODB_TABLE: qdn-message-relay-state
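To make the parsing step of handle_inbound_webhook concrete, here is a minimal sketch of extracting the sender and body from a Twilio webhook delivered through an API Gateway proxy event. Twilio posts inbound SMS as form-encoded fields such as `MessageSid`, `From`, and `Body`. The helper name `parse_inbound_sms` is hypothetical, and using the receipt time as the timestamp is an assumption, since the actual function body is not shown here.

```python
import base64
from datetime import datetime, timezone
from urllib.parse import parse_qs


def parse_inbound_sms(event: dict) -> dict:
    """Extract relay fields from an API Gateway proxy event.

    Hypothetical helper; the real handle_inbound_webhook may differ.
    """
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")

    # parse_qs returns lists of values; keep the first value for each field.
    fields = {k: v[0] for k, v in parse_qs(body, keep_blank_values=True).items()}

    return {
        "message_id": fields["MessageSid"],
        "sender_phone": fields["From"],
        "message_body": fields.get("Body", ""),
        # Assumption: record the time the webhook was received, in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

The handler phone numbers would then come from the environment, for example `os.environ["PRIMARY_HANDLER_PHONE"]`, rather than being hard-coded.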
Infrastructure Changes
API Gateway Routes
Four new routes were added to the QDN API Gateway (card reference: integration with existing dashboard.quickdumpnow.com API):
- POST /sms/inbound: Twilio webhook target (receives inbound SMS from QDN phone number)
- POST /sms/delivery-status: Twilio status callback (delivery confirmation/failure notifications)
- POST /sms/relay-state: internal endpoint to query relay attempt history
- OPTIONS /sms/*: CORS preflight for cross-origin calls
All endpoints require either Twilio signature validation (for webhook security) or AWS IAM auth (for internal calls).
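For reference, Twilio's webhook signature is an HMAC-SHA1 over the full request URL followed by each POST parameter name and value (sorted by name, concatenated without separators), keyed with the account auth token and base64-encoded; the result is compared against the `X-Twilio-Signature` header. The sketch below implements that check with the standard library only. The function names are illustrative; in practice the `RequestValidator` class in the Twilio Python helper library performs the same check and is likely the better choice.

```python
import base64
import hashlib
import hmac
from typing import Dict


def compute_twilio_signature(auth_token: str, url: str, params: Dict[str, str]) -> str:
    """Build the expected signature for a form-encoded Twilio webhook."""
    # URL, then key+value for every POST parameter in sorted key order.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(
    auth_token: str, url: str, params: Dict[str, str], signature: str
) -> bool:
    """Compare the header value against the expected signature in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, signature)
```

The URL must match exactly what Twilio called, including scheme and any query string, so behind API Gateway it usually has to be reconstructed from the configured public endpoint rather than from internal request fields. The signature is keyed with the auth token, not the API Key secret.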
DynamoDB Table Schema
qdn-message-relay-state (on-demand billing):
- Partition key: message_id (Twilio-assigned SID)
- Sort key: timestamp (ISO 8601 creation time)
- Attributes: sender_phone, message_body, tier_1_status, tier_2_status, tier_3_status, final_status, error_log, delivery_confirmed_at
- TTL: 90 days (auto-cleanup via DynamoDB TTL attribute)
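A minimal sketch of building an item for this schema follows. DynamoDB TTL requires a Number attribute holding an expiry time in Unix epoch seconds, so the 90-day window is computed at write time. The TTL attribute name `expires_at` and the initial `"pending"` status values are assumptions for illustration; the write-up does not name them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TTL_DAYS = 90


def build_relay_item(
    message_id: str,
    sender_phone: str,
    message_body: str,
    created_at: Optional[datetime] = None,
) -> dict:
    """Return a new relay-state item with an epoch-seconds TTL."""
    created_at = created_at or datetime.now(timezone.utc)
    expires_at = created_at + timedelta(days=TTL_DAYS)
    return {
        "message_id": message_id,              # partition key
        "timestamp": created_at.isoformat(),   # sort key, ISO 8601
        "sender_phone": sender_phone,
        "message_body": message_body,
        "tier_1_status": "pending",            # hypothetical initial value
        "tier_2_status": "pending",
        "tier_3_status": "pending",
        "final_status": "pending",
        "error_log": [],
        "expires_at": int(expires_at.timestamp()),  # hypothetical TTL attribute name
    }
```

With boto3, the resulting dictionary could be written via `boto3.resource("dynamodb").Table("qdn-message-relay-state").put_item(Item=item)`, provided TTL is enabled on the table for the chosen attribute.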
SQS Queue for Async Retries
A standard SQS queue (qdn-sms-relay-retries) decouples the inbound webhook handler from retry scheduling. When Tier 1 fails, the Lambda posts a message to the queue with exponential backoff, and a separate Lambda (triggered by SQS) processes retries. This prevents webhook response timeouts and keeps the primary handler responsive.
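One way to express the backoff on a standard queue is the per-message `DelaySeconds` parameter, which SQS caps at 900 seconds. The sketch below builds the keyword arguments for a `send_message` call; the helper names and the JSON payload shape are assumptions rather than the deployed message format.

```python
import json

MAX_SQS_DELAY_SECONDS = 900  # SQS upper bound for DelaySeconds


def backoff_delay(attempt: int, base_seconds: int = 2) -> int:
    """Exponential backoff: 2s, 4s, 8s for attempts 0, 1, 2, capped at the SQS limit."""
    return min(base_seconds * 2 ** attempt, MAX_SQS_DELAY_SECONDS)


def build_retry_message(queue_url: str, message_id: str, next_tier: int, attempt: int) -> dict:
    """Build kwargs for an SQS send_message call (hypothetical payload shape)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(
            {"message_id": message_id, "tier": next_tier, "attempt": attempt}
        ),
        "DelaySeconds": backoff_delay(attempt),
    }
```

The webhook Lambda would then call something like `boto3.client("sqs").send_message(**build_retry_message(url, sid, 2, 0))` and return immediately, leaving the SQS-triggered Lambda to perform the delayed retry.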
Key Decisions
Why Twilio Over Native Carrier APIs?
Twilio abstracts carrier fragmentation (different carriers have different routing rules and retry semantics). By centralizing message logic in Twilio, we avoid maintaining custom integrations for each carrier. Twilio also provides webhook signature validation out-of-the-box, reducing our auth surface area.
Why DynamoDB Over RDS?
The relay state doesn't require transactions or complex joins. DynamoDB's on-demand billing and built-in TTL (for auto-expiring stale records) lower operational overhead. Scan queries (e.g., "show me all failed relay attempts in the last hour") are acceptable for debugging; if performance becomes an issue, we can add CloudWatch metrics instead.
Why SQS for Retries?
The inbound webhook must return quickly (Twilio expects a 2xx response within ~5 seconds). By offloading retries to an async queue, we keep the critical path short and allow time for exponential back