
Building a Multi-Carrier SMS Relay with Twilio: Cascading Forward Logic for QuickDumpNow Fleet Dispatch

What Was Done

QuickDumpNow (QDN) operates a distributed fleet dispatch system where job notifications must reach drivers across multiple carriers. The baseline system relied on direct carrier connectivity, but carrier-level forwarding could not express the cascading logic we needed: incoming SMS on the primary line must be forwarded to a secondary dispatcher when the primary is unavailable, with fallback to a tertiary contact.

This post documents the infrastructure and architectural decisions made to implement a Twilio-based SMS relay layer that sits between the public-facing QDN phone number and the internal dispatch team, enabling stateful message routing without carrier configuration friction.

Technical Details: The Relay Architecture

The relay operates in two phases:

  • Inbound Phase: Twilio receives SMS at the provisioned QDN number, invokes a webhook endpoint, evaluates dispatcher availability (via DynamoDB state), and either delivers to primary or cascades to secondary/tertiary.
  • Outbound Phase: Dispatchers send confirmation/job assignment messages back through Twilio, which stamps metadata (timestamp, dispatcher ID, job reference) and delivers to the driver.
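
To make the outbound stamping step concrete, here is a minimal sketch. The function name stamp_outbound, the bracketed footer layout, and the ISO-8601 UTC timestamp are assumptions for illustration; only the three metadata fields (timestamp, dispatcher ID, job reference) come from the description above.

```python
from datetime import datetime, timezone


def stamp_outbound(body: str, dispatcher_id: str, job_ref: str, sent_at: datetime) -> str:
    """Append a one-line metadata footer to an outbound message body.

    The footer layout is hypothetical. `sent_at` should be timezone-aware;
    it is converted to UTC before formatting.
    """
    ts = sent_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{body.rstrip()}\n[{ts} | {dispatcher_id} | job {job_ref}]"


if __name__ == "__main__":
    when = datetime(2024, 4, 5, 19, 34, 38, tzinfo=timezone.utc)
    print(stamp_outbound("Pickup confirmed", "sergio-main", "J-1042", when))
```

Keeping the stamp in a single trailing line leaves the dispatcher's own text untouched, which matters if drivers reply by quoting it.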

The implementation spans three key Lambda functions:

  • qdn-sms-inbound: Receives Twilio webhook payloads, queries dispatcher state, and routes the message
  • qdn-sms-outbound: Validates outbound requests, appends metadata, and forwards via the Twilio API
  • qdn-dispatcher-heartbeat: Invoked every 5 minutes by a scheduled CloudWatch Events (EventBridge) rule; updates dispatcher availability flags

Dispatcher state is stored in DynamoDB table qdn-dispatcher-state with schema:

{
  "dispatcher_id": "sergio-main",           // partition key
  "available": true,                         // boolean, updated by heartbeat
  "last_heartbeat": 1712345678,             // Unix timestamp (seconds)
  "backup_dispatcher": "858-335-4807",      // fallback phone number (normalize to E.164 before sending)
  "tertiary_contact": "+1-619-555-0199",    // ultimate fallback
  "message_log_ttl": 86400                  // log retention in seconds (24h); DynamoDB TTL itself needs an absolute epoch expiry
}
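
The two phone numbers in this record use different formats, while the Twilio API expects E.164 (for example +18583354807). A minimal normalization sketch follows; the helper name to_e164 and the assumption that bare 10-digit numbers are US/Canada (country code 1) are ours, not part of the deployed system.

```python
import re


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Normalize a loosely formatted phone number to E.164.

    Assumption: numbers without a leading '+' are 10-digit NANP numbers,
    optionally prefixed with the default country code.
    """
    digits = re.sub(r"\D", "", raw)  # drop dashes, spaces, parentheses, '+'
    if raw.strip().startswith("+"):
        candidate = digits
    elif len(digits) == 10:
        candidate = default_country_code + digits
    elif len(digits) == 11 and digits.startswith(default_country_code):
        candidate = digits
    else:
        raise ValueError(f"cannot normalize phone number: {raw!r}")
    if not candidate or len(candidate) > 15:  # E.164 allows at most 15 digits
        raise ValueError(f"not a valid E.164 number: {raw!r}")
    return "+" + candidate


if __name__ == "__main__":
    print(to_e164("858-335-4807"))
    print(to_e164("+1-619-555-0199"))
```

Normalizing once, when the record is written, avoids repeating the check on every inbound message.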

The inbound handler pseudocode:

On Twilio webhook:
  1. Extract sender phone, message body, Twilio message SID
  2. Query qdn-dispatcher-state for "sergio-main" availability
  3. If available AND last_heartbeat is less than 5 minutes old:
       → Forward to primary number via Twilio API
       → Log routing decision to CloudWatch
  4. Else if backup_dispatcher set:
       → Forward to backup number
       → Mark primary as unavailable in DynamoDB
  5. Else:
       → Forward to tertiary_contact
  6. Return an empty TwiML response (<Response/>), so no SMS reply goes to the sender
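
The routing decision in steps 3 to 5 can be sketched as a pure function, which keeps it testable without Twilio or DynamoDB. The names choose_target and HEARTBEAT_MAX_AGE_S, the reason strings, and the primary_number parameter (the schema above does not show where the primary's own number is stored) are assumptions; side effects such as sending, logging, and marking the primary unavailable are left to the caller.

```python
from typing import Tuple

HEARTBEAT_MAX_AGE_S = 300  # 5 minutes, matching the heartbeat schedule


def choose_target(state: dict, primary_number: str, now: int) -> Tuple[str, str]:
    """Return (destination_number, reason) for one inbound message.

    `state` is one qdn-dispatcher-state item; `now` is a Unix timestamp.
    """
    heartbeat_age = now - state.get("last_heartbeat", 0)
    if state.get("available") and heartbeat_age < HEARTBEAT_MAX_AGE_S:
        return primary_number, "primary"
    if state.get("backup_dispatcher"):
        # Caller should also mark the primary unavailable in DynamoDB.
        return state["backup_dispatcher"], "cascade_backup"
    return state["tertiary_contact"], "cascade_tertiary"


if __name__ == "__main__":
    item = {
        "available": True,
        "last_heartbeat": 1712345678,
        "backup_dispatcher": "858-335-4807",
        "tertiary_contact": "+1-619-555-0199",
    }
    # Heartbeat is 10 minutes old, so the message cascades to the backup.
    print(choose_target(item, "+15550100000", now=1712345678 + 600))
```

Note that a stale heartbeat overrides available: true, so a crashed heartbeat job fails toward the backup rather than toward a possibly unreachable primary.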

Infrastructure: AWS + Twilio Integration

AWS Resources:

  • API Gateway: /api/v1/sms/inbound endpoint (POST, no API Gateway authorizer; requests are authenticated by Twilio signature verification in the handler), /api/v1/sms/outbound (POST, signed JWT validation)
  • Lambda execution role: qdn-sms-relay-role with policies for DynamoDB read/write, CloudWatch Logs, and SNS publish (for cascade alerts); calls to the Twilio API are authenticated with Twilio account credentials, not IAM
  • DynamoDB table: qdn-dispatcher-state, on-demand billing (spiky traffic pattern during rush hours), point-in-time recovery enabled
  • CloudWatch Events (EventBridge) rule: qdn-heartbeat-schedule, triggers qdn-dispatcher-heartbeat every 5 minutes
  • SNS Topic: qdn-dispatch-alerts publishes alerts when dispatcher goes unavailable or cascade occurs

Twilio Resources (via API):

  • Provisioned phone number for QDN (primary inbound DID)
  • Webhook URL pointing to API Gateway inbound endpoint
  • Messaging Service with fallback carrier pool (to improve deliverability on outbound)

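API Gateway integrates with Lambda through a custom (non-proxy) integration, which allows fine-grained error mapping. Twilio webhook calls are authenticated by request signature verification: the handler computes an HMAC-SHA1, keyed with the account auth token, over the full request URL followed by the POST parameters sorted by name (each name immediately followed by its value), Base64-encodes the digest, and compares it with the X-Twilio-Signature header. Because the integration is non-proxy, the mapping template must pass the original form parameters and that header through to Lambda unchanged.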
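
For reference, Twilio's documented webhook signature scheme (HMAC-SHA1 over the request URL plus the name-sorted POST parameters, keyed with the auth token and Base64-encoded) can be sketched with the standard library alone. The function names are ours, and the token, URL, and parameters in the usage line are placeholders. In production, the RequestValidator class in Twilio's Python helper library is likely the safer choice.

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Compute the expected X-Twilio-Signature for a form-encoded POST."""
    # Full URL, then each parameter name immediately followed by its value,
    # with parameters sorted by name and no separators.
    payload = url + "".join(f"{name}{params[name]}" for name in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token: str, url: str, params: dict, header_signature: str) -> bool:
    """Compare the computed signature with the header in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)


if __name__ == "__main__":
    demo_params = {"From": "+15550001111", "Body": "Need a pickup"}
    demo_url = "https://example.com/api/v1/sms/inbound"  # placeholder URL
    sig = compute_twilio_signature("placeholder-token", demo_url, demo_params)
    print(is_valid_twilio_request("placeholder-token", demo_url, demo_params, sig))
```

The URL must be exactly the one configured in Twilio, including scheme and any query string; a mismatch between the public hostname and what Lambda sees is a common cause of failed validation.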

Key Architectural Decisions

Why DynamoDB instead of in-memory state? Dispatcher availability must persist across Lambda container recycling and support multi-region failover if QDN scales. DynamoDB's single-digit-millisecond read latency (eventually consistent reads are sufficient for heartbeat checks) keeps the inbound path within its SLA (<2s end-to-end).

Why heartbeat via CloudWatch instead of Twilio webhooks? Twilio webhooks only fire on message events; we need periodic availability checks regardless of message volume. A scheduled CloudWatch rule decouples dispatcher health from message throughput and enables us to detect silent failures (dispatcher phone unreachable).

Why cascade to phone numbers instead of Lambda-to-Lambda? Dispatchers use personal or shared cell phones, not email or IAM identities. SMS is the dispatcher's native interface; cascading to backup phones keeps the UX consistent and doesn't require new tooling.

Why SNS alerts for cascade events? When the primary dispatcher goes unavailable and we cascade to backup, the ops team needs low-latency notification. SNS to an ops Slack channel (via Lambda connector function) closes the visibility loop faster than CloudWatch Logs alone.

Deployment and Testing

Lambda functions are deployed via SAM CLI. The inbound handler emits structured (JSON) logs so CloudWatch Logs Insights can filter by routing decision, dispatcher ID, and cascade reason. Outbound is gated by JWT validation using a shared secret (stored in Secrets Manager, rotated quarterly).
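
A minimal sketch of what one structured log line could look like is shown here. The field names (ts, message_sid, dispatcher_id, decision, cascade_reason) and the helper routing_log_line are assumptions, chosen to match the three filter dimensions mentioned above.

```python
import json
import time
from typing import Optional


def routing_log_line(message_sid: str, dispatcher_id: str, decision: str,
                     cascade_reason: Optional[str] = None,
                     now: Optional[float] = None) -> str:
    """Serialize one routing decision as a single-line JSON record."""
    record = {
        "ts": int(now if now is not None else time.time()),
        "message_sid": message_sid,
        "dispatcher_id": dispatcher_id,
        "decision": decision,
        "cascade_reason": cascade_reason,  # None when routed to primary
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # In Lambda, anything written to stdout ends up in CloudWatch Logs.
    print(routing_log_line("SM0000", "sergio-main", "cascade_backup", "stale_heartbeat"))
```

With one JSON object per line, Logs Insights discovers the fields automatically, so a query along the lines of filter decision = "cascade_backup" | stats count() by dispatcher_id should work without a parse step.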

Smoke tests:

  • Send SMS to QDN number, verify primary dispatcher receives it within 3 seconds
  • Simulate primary unavailability (set available: false in DynamoDB), resend, verify backup gets message
  • Verify tertiary fallback with the primary unavailable and no backup_dispatcher set (the inbound handler has no separate availability check for the backup)
  • Confirm outbound message from dispatcher includes the metadata stamp (timestamp, dispatcher ID, job reference)

What's Next

Phase 2 will add:

  • Dispatcher acknowledgment protocol: Inbound handler expects a keyword (e.g., "ACK") from dispatcher within 30s; if no ACK, automatically cascade
  • Message deduplication: Twilio can redeliver a webhook after a failed attempt, so routing needs to be idempotent, using idempotency keys stored in DynamoDB (the Twilio message SID is a natural candidate)
  • Metrics dashboard: CloudWatch dashboard showing cascade frequency, average delivery time, and dispatcher availability trend
  • Multi-region support: If QDN expands to a second region, replicate dispatcher state with DynamoDB global tables (cross-region replication built on DynamoDB Streams)
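
As an illustration of the planned deduplication, the sketch below routes each message SID at most once. Everything here is hypothetical: InMemoryDedupStore stands in for a DynamoDB table, and route_once is an invented helper. With DynamoDB, claim() would typically be a put_item call with ConditionExpression="attribute_not_exists(message_sid)", treating ConditionalCheckFailedException as "already seen" (the attribute name is an assumption).

```python
from typing import Callable, Set


class InMemoryDedupStore:
    """Stand-in for a DynamoDB table keyed on the Twilio message SID."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def claim(self, message_sid: str) -> bool:
        """Return True only for the first caller to claim this SID."""
        if message_sid in self._seen:
            return False
        self._seen.add(message_sid)
        return True


def route_once(message_sid: str, store: InMemoryDedupStore, route: Callable[[], None]) -> bool:
    """Run `route` unless this SID was already handled; report whether it ran."""
    if not store.claim(message_sid):
        return False  # duplicate webhook delivery: skip forwarding
    route()
    return True


if __name__ == "__main__":
    store = InMemoryDedupStore()
    forwarded = []
    for sid in ["SM1", "SM1", "SM2"]:  # SM1 is delivered twice
        route_once(sid, store, lambda s=sid: forwarded.append(s))
    print(forwarded)
```

Claiming the SID before forwarding means a crash between the two steps drops the message instead of duplicating it; whether that trade-off is acceptable for dispatch traffic is a design decision the Phase 2 work would need to settle.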

The relay layer is production-ready and live at dashboard.quickdumpnow.com/api/v1/sms/*