HACKER Q&A
📣 GoatPerfect

How do you monitor and retry failed webhooks in production?


I’ve been working on a project where webhooks are a core part of the system, and I realized how fragile they can be in practice.

Transient network errors, timeouts, downstream issues: things fail more often than expected.

I’m curious how others are handling this in production.

Are you building custom retry logic?

Using a queue?

Relying on provider retries?

Just logging and manually checking failures?

Do you monitor webhook delivery rates or alert on repeated failures?

Would love to hear what setups people are using and what’s worked (or not worked) for you.


👤 blundergoat Accepted Answer ✓
We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.

A few rules that have saved us:

- Persist before responding. Never process inline. Write payload to DB, return 200 fast.

- Idempotency key required. Use either the provider's event ID or a hash of the payload.

- Async worker processes from queue. Exponential backoff + max attempts.

- Dead letter queue + dashboard. Humans need visibility.

- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.

- Relying on provider retries alone has bitten us more than once.
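
For anyone who wants to see the first few rules in code, here is a rough sketch, not our actual implementation. It assumes SQLite as the store and uses made-up names (`inbox`, `receive`, `backoff_delay`); the base delay is a placeholder, and real setups usually add jitter to the backoff, which is left out here.

```python
import hashlib
import sqlite3

BASE_DELAY_SECONDS = 2.0  # assumed base for exponential backoff, tune per system


def open_store(path=":memory:"):
    """Create the (hypothetical) inbox table; event_key is the idempotency key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " event_key TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def idempotency_key(payload, event_id=None):
    """Prefer the provider's event ID; otherwise hash the raw payload bytes."""
    if event_id:
        return event_id
    return hashlib.sha256(payload).hexdigest()


def receive(conn, payload, event_id=None):
    """Persist first, then acknowledge. Returns (http_status, is_new)."""
    key = idempotency_key(payload, event_id)
    cur = conn.execute(
        "INSERT OR IGNORE INTO inbox (event_key, payload) VALUES (?, ?)",
        (key, payload.decode("utf-8")),
    )
    conn.commit()
    # rowcount is 0 when the key already existed, i.e. a duplicate delivery.
    return 200, cur.rowcount == 1


def backoff_delay(attempt):
    """Delay before retry number `attempt` (1-based): 2s, 4s, 8s, ..."""
    return BASE_DELAY_SECONDS * (2 ** (attempt - 1))
```

The primary key does the deduplication: a redelivered event still gets a 200, but no second row is written, so the async worker only ever sees it once.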


👤 toomuchtodo
Have you checked out https://svix.com? No affiliation, I just like the product. Might also check out https://www.standardwebhooks.com/

👤 JacobArthurs
We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and get alerted on repeated failures automatically through the DLQ.

Idempotency becomes your responsibility, though, since messages can be delivered more than once.
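
A minimal sketch of that consumer side, assuming an in-memory `deque` as a stand-in for the real message queue. The message shape (`id`, `body`, `attempts`), the `drain` function, and the attempt cap are all hypothetical; a real broker would also delay the requeue instead of retrying straight away.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed cap before a message is dead-lettered


def drain(queue, handler, processed_ids, dead_letters):
    """Process messages until the queue is empty.

    Each message is a dict with hypothetical keys "id", "body", "attempts".
    """
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: already handled, skip it
        try:
            handler(msg["body"])
        except Exception:
            msg["attempts"] += 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)  # park it for a human to inspect
            else:
                queue.append(msg)  # requeue; a real broker would delay this
            continue
        processed_ids.add(msg["id"])  # mark done only after success
```

In production `processed_ids` would live in a durable store rather than a set, otherwise a worker restart forgets what it has already handled.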