Transient network errors, timeouts, downstream issues — things fail more often than expected.
I’m curious how others are handling this in production.
Are you building custom retry logic?
Using a queue?
Relying on provider retries?
Just logging and manually checking failures?
Do you monitor webhook delivery rates or alert on repeated failures?
Would love to hear what setups people are using and what’s worked (or not worked) for you.
A few rules that have saved us:
- Persist before responding. Never process inline: write the payload to the DB, return 200 fast.
- Idempotency key required. Use the provider's event ID, or hash the payload if there isn't one.
- Async worker processes from queue. Exponential backoff + max attempts.
- Dead letter queue + dashboard. Humans need visibility.
- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.
- Relying on provider retries alone has bitten us more than once.
The tradeoff with owning the queue: idempotency becomes your responsibility, since messages can be delivered more than once.
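The first two rules (persist, then ack; derive an idempotency key from the provider's event ID or a payload hash) might look roughly like this. This is a minimal sketch: `handle_webhook`, the `events` table, and the in-memory sqlite store are stand-ins for whatever framework and database you actually run.

```python
import hashlib
import json
import sqlite3

# Stand-in for the real events table (assumption: sqlite in place of your DB).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (idempotency_key TEXT PRIMARY KEY, payload TEXT, status TEXT)"
)

def idempotency_key(payload: dict) -> str:
    # Prefer the provider's event ID; otherwise hash the canonical payload.
    if "event_id" in payload:
        return str(payload["event_id"])
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def handle_webhook(payload: dict) -> int:
    """Persist the raw payload and ack immediately; no inline processing."""
    key = idempotency_key(payload)
    try:
        db.execute(
            "INSERT INTO events (idempotency_key, payload, status) "
            "VALUES (?, ?, 'pending')",
            (key, json.dumps(payload)),
        )
        db.commit()
    except sqlite3.IntegrityError:
        pass  # Duplicate delivery: row already persisted, still ack.
    return 200  # Always ack fast; a worker processes the 'pending' row later.
```

The primary-key constraint makes duplicate deliveries a no-op, which is what lets you safely return 200 even when the provider redelivers.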
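The worker side (exponential backoff, max attempts, dead-letter on exhaustion) could be sketched like this. `MAX_ATTEMPTS`, the full-jitter backoff, and the in-memory `dead_letter` list are illustrative assumptions, not anyone's production setup.

```python
import random
import time

MAX_ATTEMPTS = 5    # assumption: tune per endpoint
BASE_DELAY_S = 1.0  # first retry window; doubles each attempt

dead_letter = []    # stand-in for a real DLQ table or queue

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter: uniform in [0, base * 2**attempt)."""
    return random.uniform(0, BASE_DELAY_S * (2 ** attempt))

def process_with_retries(event: dict, handler, sleep=time.sleep) -> bool:
    """Run handler(event); retry with backoff, dead-letter after MAX_ATTEMPTS."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == MAX_ATTEMPTS - 1:
                # Exhausted: park it where a human (and dashboard) can see it.
                dead_letter.append({"event": event, "error": str(exc)})
                return False
            sleep(backoff_delay(attempt))
    return False
```

Injecting `sleep` keeps the retry logic testable without waiting out real backoff delays.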
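For "alert on backlog growth, not single failures," one cheap sketch is to sample retry-queue depth and fire only when it grows across several consecutive samples. The `BacklogMonitor` class and `WINDOW` value here are made up for illustration; in practice this is usually a metric plus an alerting rule, not hand-rolled code.

```python
from collections import deque

WINDOW = 3  # consecutive growing steps before alerting (assumption: tune)

class BacklogMonitor:
    """Fire only when queue depth grows monotonically across WINDOW steps;
    a single failed delivery bumping the queue is treated as noise."""

    def __init__(self, window: int = WINDOW):
        # Keep window + 1 samples so we can see `window` growth steps.
        self.samples = deque(maxlen=window + 1)
        self.window = window

    def record(self, depth: int) -> bool:
        """Record a depth sample; return True if an alert should fire."""
        self.samples.append(depth)
        if len(self.samples) <= self.window:
            return False  # not enough history yet
        vals = list(self.samples)
        return all(b > a for a, b in zip(vals, vals[1:]))
```

The same idea expressed in a metrics system would be a rate-of-change alert on queue depth rather than a threshold on error count.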