When you're building on WhatsApp's Cloud API, the first version is easy. A webhook receives a message, calls an LLM, sends a reply. Works great at 10 messages a day.
At 50,000 messages a day across hundreds of business accounts, everything breaks if you haven't thought about architecture.
Here's what we built and why.
The naive architecture (and why it fails)
Webhook receives message → calls Claude API (2-3 seconds) → sends reply.
The problem: Meta expects your webhook to return a 200 response within 5 seconds. If your LLM call is slow or the API is having a bad day, you time out. Meta retries. You process the same message twice. The customer gets two identical AI replies.
We hit this in week 2 of production. One business's customers received 4 copies of the same reply. Not a great look.
The fix: decouple receipt from processing
Webhook receives message → immediately returns 200 → drops message onto a queue → worker picks it up and processes it.
The webhook's only job is to receive and acknowledge. It does nothing else. All the actual work happens asynchronously in a worker process.
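We use BullMQ (Redis-backed) for the queue. It supports retries with backoff, job priorities, and deduplication by job ID (a job added with an ID that already exists is ignored), so we didn't have to build any of that ourselves.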
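To make the split concrete, here is a minimal sketch of a webhook handler that only extracts messages, enqueues them, and acknowledges. It is not our production code: `MessageQueue`, `InMemoryQueue`, `extractMessages`, and `handleWebhook` are hypothetical names, the in-memory queue stands in for BullMQ, and the payload type covers only the fields the sketch reads.

```typescript
// Only the parts of a Cloud API webhook payload that this sketch reads.
interface InboundMessage {
  id: string;
  from: string;
  text?: { body: string };
}

interface WebhookPayload {
  entry?: { changes?: { value?: { messages?: InboundMessage[] } }[] }[];
}

// Hypothetical queue interface. With BullMQ, enqueue() would wrap
// queue.add(name, data, { jobId: message.id }).
interface MessageQueue {
  enqueue(message: InboundMessage): void;
}

// In-memory stand-in so the sketch runs without Redis.
class InMemoryQueue implements MessageQueue {
  readonly jobs: InboundMessage[] = [];

  enqueue(message: InboundMessage): void {
    this.jobs.push(message);
  }
}

// Collect every inbound message from the nested payload.
// Payloads without messages (for example status updates) yield an empty list.
function extractMessages(payload: WebhookPayload): InboundMessage[] {
  const messages: InboundMessage[] = [];
  for (const entry of payload.entry ?? []) {
    for (const change of entry.changes ?? []) {
      messages.push(...(change.value?.messages ?? []));
    }
  }
  return messages;
}

// The handler never calls the LLM. It enqueues and returns the status code
// the HTTP layer should send back.
function handleWebhook(payload: WebhookPayload, queue: MessageQueue): number {
  for (const message of extractMessages(payload)) {
    queue.enqueue(message);
  }
  return 200;
}
```

In this sketch the enqueue happens just before the acknowledgement rather than after it. That is fine as long as enqueueing takes milliseconds; the slow part, the LLM call, never runs inside the request.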
Deduplication: solving the double-reply problem
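Meta will retry webhook delivery if it doesn't get a 200 quickly, so you'll sometimes receive the same message twice. You need to deduplicate by message_id before processing.
We store processed message IDs in Redis with a 24-hour TTL. Before processing any message, check whether its ID already exists. If it does, skip it silently and return 200. Make the check and the write a single atomic operation (for example, Redis SET with the NX option); otherwise two concurrent deliveries of the same message can both pass the check.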
Simple. But you would be surprised how many WhatsApp bots in production don't do this.
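Here is a minimal sketch of the check, using an in-memory map as a stand-in for Redis so it runs on its own. `SeenStore` and `markIfNew` are hypothetical names. The important property is that "have we seen this ID?" and "record this ID" happen in one step.

```typescript
// In-memory stand-in for a Redis set-if-absent with expiry
// (roughly: SET <id> 1 NX EX <ttl>). Not suitable for multiple processes.
class SeenStore {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns true if the ID was newly recorded, false if it was already
  // seen within the TTL. Check and write happen in the same call.
  markIfNew(messageId: string): boolean {
    const t = this.now();
    const expiry = this.expiresAt.get(messageId);
    if (expiry !== undefined && expiry > t) {
      return false; // duplicate delivery: the caller should skip it
    }
    // Redis would expire old keys on its own; this sketch never purges them,
    // it only overwrites an expired entry when the same ID shows up again.
    this.expiresAt.set(messageId, t + this.ttlMs);
    return true;
  }
}

const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;
const seen = new SeenStore(TWENTY_FOUR_HOURS_MS);

// Usage: only the first delivery of a given ID gets processed.
function shouldProcess(messageId: string): boolean {
  return seen.markIfNew(messageId);
}
```

A caller would run `shouldProcess(message.id)` before doing any work and drop the message when it returns false.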
Handling the AI being slow
The Claude API is fast but not instant. Sometimes a call takes 3 seconds, sometimes 8. We set a per-message timeout of 10 seconds. If the AI call times out, we send a fallback message: "We received your message and will get back to you shortly." The lead is flagged for human follow-up.
Better to acknowledge than to go silent.
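One way to express that rule is to race the AI call against a timer. The sketch below assumes the model call is passed in as a function returning a promise. `replyWithTimeout`, `ReplyResult`, and `needsHuman` are hypothetical names, not our actual code.

```typescript
const FALLBACK_REPLY =
  "We received your message and will get back to you shortly.";

interface ReplyResult {
  text: string;
  needsHuman: boolean; // true when the fallback was sent
}

// Race the AI call against a timer. If the timer wins, return the fallback
// and flag the conversation for human follow-up.
async function replyWithTimeout(
  generate: () => Promise<string>,
  timeoutMs: number,
): Promise<ReplyResult> {
  let cancelTimer: () => void = () => {};
  const timeout = new Promise<null>((resolve) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    cancelTimer = () => clearTimeout(timer);
  });

  try {
    const text = await Promise.race([generate(), timeout]);
    if (text === null) {
      return { text: FALLBACK_REPLY, needsHuman: true };
    }
    return { text, needsHuman: false };
  } finally {
    // Stop the timer so it doesn't keep the process alive after a fast reply.
    cancelTimer();
  }
}
```

Two caveats with this sketch: the underlying AI request is not cancelled when the timer wins, and an error thrown by `generate` propagates to the caller (for example, so the queue can retry) instead of triggering the fallback.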
Rate limiting per business account
Meta has rate limits per phone number. If a business gets a sudden spike (say, a marketing campaign drives 200 messages in 5 minutes), you can't blast replies at full speed.
We implement a token bucket per WABA (WhatsApp Business Account): each business gets a maximum send rate, and outgoing replies queue behind it. The customer experience is slightly delayed but never broken.
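A token bucket is only a few lines. The sketch below keeps one bucket per account ID in memory. `TokenBucket`, `PerAccountLimiter`, and the capacity and refill numbers are illustrative assumptions, not Meta's actual limits or our production values.

```typescript
// Classic token bucket: holds up to `capacity` tokens and refills at
// `refillPerSecond`. Each send costs one token.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so a short burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Take one token if available. False means the caller should keep the
  // message queued and try again later.
  tryTake(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond,
    );
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per business account, created lazily on first send.
class PerAccountLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allowSend(accountId: string, nowMs: number): boolean {
    let bucket = this.buckets.get(accountId);
    if (bucket === undefined) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, nowMs);
      this.buckets.set(accountId, bucket);
    }
    return bucket.tryTake(nowMs);
  }
}
```

A worker would call `allowSend(accountId, Date.now())` before each reply and, when it returns false, delay the job rather than drop it. With several worker processes the bucket state would have to live somewhere shared, such as Redis, rather than in process memory.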
The number that matters
Current p99 response time (99% of responses are faster than this): 8.4 seconds. Median: 1.8 seconds. Zero messages dropped in the last 30 days.
That's the architecture working.