Building a Multi-Carrier SMS Relay for QuickDumpNow: Twilio Integration & Cascading Failover

What Was Done

We integrated Twilio into the QuickDumpNow (QDN) infrastructure to work around a carrier-level routing constraint: the primary dispatch line (owned by QDN) couldn't cascade to secondary handlers at the telecom layer. Rather than upgrade carrier service tiers, which would have been expensive and slow, we built a software relay on Twilio's programmable SMS API that implements cascading failover. Inbound messages route first to the primary handler (QDN), then to a secondary handler (Sergio), then to a backup (Sergio's designated backup number, 858-335-4807). This pattern decouples message routing logic from carrier limitations and gives us granular control over retry policies and state tracking.

Technical Details: The Relay Architecture

Credential Management

Twilio credentials were persisted in `/Users/cb/Documents/repos/.secrets/repos.env` with mode 600 (read/write owner only):

TWILIO_ACCOUNT_SID — stored for API authentication across all Twilio SDK calls
TWILIO_AUTH_TOKEN — stored for admin operations (phone number provisioning, webhook config)
API Key + Secret pair — stored for runtime SDK initialization (preferred over account auth token in production)

A reference memory file (reference_twilio_credentials.md) records where the credentials live and which auth strategy is in use, so future sessions can locate them without re-reading chat history.
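
As a minimal sketch of how a build script might read that env file while enforcing the mode 600 requirement, assuming the file holds plain `KEY=VALUE` lines. The helper name `load_env_file` is hypothetical and not part of the QDN codebase:

```python
import stat
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from an env file, refusing permissive file modes."""
    env_path = Path(path)
    mode = stat.S_IMODE(env_path.stat().st_mode)
    # Any group/other permission bit means the file is looser than mode 600.
    if mode & 0o077:
        raise PermissionError(f"{env_path} has mode {oct(mode)}; expected 0o600")

    values = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Failing loudly on a permissive mode keeps an accidental `chmod` from going unnoticed until the secrets have already leaked.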

Message Flow State Machine

The relay implements a three-tier failover:

Tier 1 (Primary): Inbound SMS arrives at QDN's Twilio-provisioned number. A webhook fires to Lambda, which attempts to forward the message to the primary handler's phone number (QDN internal dispatch).
Tier 2 (Secondary): If Tier 1 delivery fails (timeout, carrier reject, handler unavailable), the Lambda catches the exception and retries via Twilio to Sergio's personal number.
Tier 3 (Backup): If Tier 2 fails, escalate to Sergio's designated backup number (858-335-4807).

Each tier retries with exponential backoff (2s, 4s, 8s) to avoid a thundering-herd effect during temporary outages. The relay stores state in a DynamoDB table (qdn-message-relay-state) keyed by message_id + timestamp, tracking the attempts made at each tier and the final delivery confirmation.
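
The cascade can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the deployed code: it assumes each tier gets one initial attempt plus three retries at the 2s, 4s, 8s delays, it sleeps in-process rather than scheduling through a queue, and `relay_with_failover`, `RelayResult`, and the injected `send` callable are hypothetical names:

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

BACKOFF_SECONDS = (2, 4, 8)  # retry delays quoted in the text


@dataclass
class RelayResult:
    delivered_tier: Optional[int]  # 1, 2, 3, or None if every tier failed
    attempts: List[Tuple[int, str, Optional[str]]] = field(default_factory=list)


def relay_with_failover(
    message: str,
    tiers: Sequence[str],
    send: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> RelayResult:
    """Try each tier in order; retry with backoff before escalating."""
    attempts = []
    for tier_number, phone_number in enumerate(tiers, start=1):
        # Delay 0 is the initial attempt; the rest are backoff retries.
        for delay in (0, *BACKOFF_SECONDS):
            if delay:
                sleep(delay)
            try:
                send(phone_number, message)
            except Exception as exc:  # any send failure counts against the tier
                attempts.append((tier_number, "failed", str(exc)))
            else:
                attempts.append((tier_number, "delivered", None))
                return RelayResult(tier_number, attempts)
    return RelayResult(None, attempts)
```

Passing `send` and `sleep` as parameters keeps the state machine testable without touching Twilio or waiting on real timers.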

Lambda Function: Message Relay Logic

The core relay logic lives in `/Users/cb/Documents/repos/sites/dashboard.quickdumpnow.com/lambda/lambda_function.py`. Key functions:

handle_inbound_webhook(event) — parses Twilio webhook payload (sender, message body, timestamp) and initiates relay
attempt_forward(tier, phone_number, message) — calls Twilio API to send SMS; returns success/failure status
log_relay_attempt(message_id, tier, status, error) — writes state to DynamoDB for audit trail
escalate_on_failure(message_id, current_tier) — determines next tier and schedules SQS message for retry
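
To make the webhook parsing step concrete, here is a hedged sketch of what it could look like behind an API Gateway proxy integration. Twilio posts inbound messages as form-encoded parameters (`MessageSid`, `From`, `Body`). The helper names `parse_twilio_webhook` and `empty_twiml_response` are hypothetical and are not the functions in lambda_function.py:

```python
import base64
from urllib.parse import parse_qs


def parse_twilio_webhook(event):
    """Extract the message SID, sender, and body from an API Gateway proxy event."""
    raw_body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        raw_body = base64.b64decode(raw_body).decode("utf-8")
    # Twilio sends application/x-www-form-urlencoded; keep the first value per key.
    parsed = parse_qs(raw_body, keep_blank_values=True)
    params = {key: values[0] for key, values in parsed.items()}
    return {
        "message_id": params.get("MessageSid", ""),
        "sender_phone": params.get("From", ""),
        "message_body": params.get("Body", ""),
    }


def empty_twiml_response():
    """Acknowledge the webhook with empty TwiML so Twilio sends no auto-reply."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/xml"},
        "body": "<Response></Response>",
    }
```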

The function is deployed as an AWS Lambda with environment variables for:

TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN (injected from repos.env at build time)
PRIMARY_HANDLER_PHONE — QDN internal dispatch
SECONDARY_HANDLER_PHONE — Sergio's personal number
BACKUP_HANDLER_PHONE — Sergio's backup number
DYNAMODB_TABLE — qdn-message-relay-state
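
One way the function might read these variables at cold start is sketched below. The `RelayConfig` class and `load_config` helper are hypothetical. Only the environment variable names come from the list above:

```python
import os
from dataclasses import dataclass
from typing import Tuple

REQUIRED_VARS = (
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "PRIMARY_HANDLER_PHONE",
    "SECONDARY_HANDLER_PHONE",
    "BACKUP_HANDLER_PHONE",
    "DYNAMODB_TABLE",
)


@dataclass(frozen=True)
class RelayConfig:
    account_sid: str
    auth_token: str
    tier_phones: Tuple[str, str, str]  # ordered: primary, secondary, backup
    table_name: str


def load_config(environ=None):
    """Build the relay config, failing fast if any variable is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return RelayConfig(
        account_sid=environ["TWILIO_ACCOUNT_SID"],
        auth_token=environ["TWILIO_AUTH_TOKEN"],
        tier_phones=(
            environ["PRIMARY_HANDLER_PHONE"],
            environ["SECONDARY_HANDLER_PHONE"],
            environ["BACKUP_HANDLER_PHONE"],
        ),
        table_name=environ["DYNAMODB_TABLE"],
    )
```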

Infrastructure Changes

API Gateway Routes

Four new routes were added to the QDN API Gateway, alongside the existing dashboard.quickdumpnow.com API:

POST /sms/inbound — Twilio webhook target (receives inbound SMS from QDN phone number)
POST /sms/delivery-status — Twilio status callback (delivery confirmation/failure notifications)
POST /sms/relay-state — internal endpoint to query relay attempt history
OPTIONS /sms/* — CORS preflight for cross-origin calls

Each endpoint requires one of two checks: the Twilio-facing webhooks validate the request signature that Twilio sends in the X-Twilio-Signature header, and internal calls use AWS IAM auth.
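
For reference, Twilio's signature scheme for form-encoded webhooks is an HMAC-SHA1 over the full request URL followed by each POST parameter's name and value, sorted by name, then base64-encoded. The sketch below reimplements that with the standard library for illustration. The function names are hypothetical, and in practice the official helper library's `RequestValidator` (in `twilio.request_validator`) does the same job:

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token, url, params):
    """Return the base64 HMAC-SHA1 signature Twilio would send for this request."""
    # Concatenate the URL with each parameter name and value, sorted by name.
    payload = url + "".join(f"{key}{params[key]}" for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token, url, params, header_signature):
    """Compare the expected and received signatures in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature or "")
```

The `url` argument has to match the exact URL Twilio called, including scheme, host, stage path, and any query string. Behind API Gateway that usually means rebuilding it from the request context rather than trusting a hard-coded value.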

DynamoDB Table Schema

qdn-message-relay-state (on-demand billing):

Partition key: message_id (Twilio-assigned SID)
Sort key: timestamp (ISO 8601 creation time)
Attributes: sender_phone, message_body, tier_1_status, tier_2_status, tier_3_status, final_status, error_log, delivery_confirmed_at
TTL: 90 days (auto-cleanup via DynamoDB TTL attribute)
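
DynamoDB's TTL feature expects a Number attribute holding an expiry time in Unix epoch seconds, so the 90-day window has to be computed when the item is written. A sketch of building such an item follows. The TTL attribute name `expires_at`, the `"pending"` status values, and the helper `build_relay_item` are assumptions for illustration, since the schema above does not name them:

```python
import time
from datetime import datetime, timezone

TTL_DAYS = 90
SECONDS_PER_DAY = 86400


def build_relay_item(message_id, sender_phone, message_body, now=None):
    """Build the initial state record for a newly received message."""
    now = time.time() if now is None else now
    created_at = datetime.fromtimestamp(now, tz=timezone.utc)
    return {
        "message_id": message_id,                # partition key (Twilio SID)
        "timestamp": created_at.isoformat(),     # sort key, ISO 8601
        "sender_phone": sender_phone,
        "message_body": message_body,
        "tier_1_status": "pending",              # assumed initial status value
        "final_status": "pending",
        # TTL attribute (hypothetical name): epoch seconds, 90 days out.
        "expires_at": int(now) + TTL_DAYS * SECONDS_PER_DAY,
    }
```

The resulting dict could be passed as the `Item` argument to a boto3 `Table.put_item` call, with TTL enabled on the table for whichever attribute name is actually used.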

SQS Queue for Async Retries

A standard SQS queue (qdn-sms-relay-retries) decouples the inbound webhook handler from retry scheduling. When Tier 1 fails, the Lambda posts a message to the queue with a delivery delay that follows the exponential backoff schedule, and a separate Lambda (triggered by SQS) processes the retries. This prevents webhook response timeouts and keeps the inbound webhook handler responsive.
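
A rough sketch of how the delayed retry could be enqueued, using the `DelaySeconds` parameter of SQS `send_message` (0 to 900 seconds on standard queues). The message body shape and the helper names `build_retry_message` and `enqueue_retry` are assumptions, and the client is passed in so the logic can be exercised without AWS:

```python
import json

BACKOFF_SECONDS = (2, 4, 8)  # schedule quoted earlier in the text


def build_retry_message(queue_url, message_id, next_tier, attempt):
    """Build send_message arguments with a delay taken from the backoff schedule."""
    # attempt 0 -> 2s, attempt 1 -> 4s, attempt 2 and later -> 8s
    delay = BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS) - 1)]
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(
            {"message_id": message_id, "next_tier": next_tier, "attempt": attempt}
        ),
        "DelaySeconds": delay,
    }


def enqueue_retry(sqs_client, queue_url, message_id, next_tier, attempt):
    """Post the retry to SQS; sqs_client is e.g. boto3.client('sqs')."""
    kwargs = build_retry_message(queue_url, message_id, next_tier, attempt)
    return sqs_client.send_message(**kwargs)
```

Because the delay lives on the queue message, the webhook Lambda can return immediately instead of sleeping through the backoff itself.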

Key Decisions

Why Twilio Over Native Carrier APIs?

Twilio abstracts carrier fragmentation (different carriers have different routing rules and retry semantics). By centralizing message logic in Twilio, we avoid maintaining custom integrations for each carrier. Twilio also provides webhook signature validation out-of-the-box, reducing our auth surface area.

Why DynamoDB Over RDS?

The relay state doesn't require transactions or complex joins. DynamoDB's on-demand billing and built-in TTL (for auto-expiring stale records) lower operational overhead. Scan queries (e.g., "show me all failed relay attempts in the last hour") are acceptable for debugging; if performance becomes an issue, we can add CloudWatch metrics instead.

Why SQS for Retries?

The inbound webhook must return quickly (Twilio expects a 2xx response before its webhook timeout, which is 15 seconds by default). By offloading retries to an async queue, we keep the critical path short and allow time for exponential back