The problem I had with implementing subscriptions with Stripe is that the batches of messages from the webhook could come back in clumps, in random order. So they would reference objects whose creation I hadn't been notified about yet.
Theoretically and rarely a webhook could fail and be retransmitted arbitrarily later due to bad weather on the internet, so you have to be able to tolerate that, but practically and often it sent bunches of messages all at effectively the same time, which caused them to be processed by my web server in random order.
I finally gave up trying to structure the code so it could create objects in any order, and deal with objects it hadn't heard about yet, and just treated the webhook callback as a sign that I should soon make a request back to Stripe and ask them for ALL the events that had transpired.
So I'd log all the information in the webhook just for chuckles, then schedule a task that polled Stripe for batches of events, and dealt with them all at once without anything slipping through the cracks because of random reordering in transit.
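A minimal sketch of that "webhook is just a wake-up call" approach, in Python. It assumes a hypothetical `fetch_page(cursor)` callable that wraps whatever event-listing call you use and returns one page of events after the cursor (an empty list when there is nothing more). The event shape (`id`, `created`) and every name here are assumptions for illustration.

```python
from typing import Callable, List, Optional


def drain_events(
    fetch_page: Callable[[Optional[str]], List[dict]],
    last_seen_id: Optional[str],
) -> List[dict]:
    """Fetch every event after last_seen_id and return them oldest first.

    fetch_page is a hypothetical wrapper around the provider's event list
    endpoint; it is not a real library call.
    """
    events: List[dict] = []
    cursor = last_seen_id
    while True:
        page = fetch_page(cursor)
        if not page:
            break
        events.extend(page)
        cursor = page[-1]["id"]
    # Sort by creation time so the processing order no longer depends on
    # the order in which the webhooks happened to be delivered.
    events.sort(key=lambda e: (e["created"], e["id"]))
    return events
```

The scheduled task calls `drain_events` with the last event id it processed, so a webhook that arrives late or twice costs nothing more than an extra wake-up.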
I have run into this also using subscriptions + webhooks. For example, I am using the "charge.succeeded" webhook event to send a custom email receipt on my backend. But for a few new customers there seems to be a race condition where the "charge.succeeded" webhook event arrives before the API call in my code returns a success, so there is now an event but I have no idea how to tie it back to a user. I am using the API to create a new customer and updating their user DB record with the customer token from Stripe. So I get into a situation where I do not know which customer to send the email to (because I don't have the customer token yet). I ended up just returning an HTTP 503 error (Service Unavailable) for this specific webhook event when I cannot find the user, so Stripe will retry that event. This is a hack but it works. This just started happening a few weeks ago. There are a few of these things popping up here and there that I need to deal with, but generally it works really well.
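Roughly what that 503 workaround looks like, as a sketch rather than the actual handler. The `users_by_customer` lookup and the `send_receipt` callback are hypothetical stand-ins for the user table and the mailer; the event is assumed to carry the customer id at `data.object.customer`.

```python
from http import HTTPStatus


def handle_charge_succeeded(event: dict, users_by_customer: dict, send_receipt) -> int:
    """Return the HTTP status to send back for a charge.succeeded webhook."""
    customer_id = event["data"]["object"]["customer"]
    user = users_by_customer.get(customer_id)
    if user is None:
        # The customer token has not been saved on our side yet, so we
        # cannot tell whose receipt this is. Answer 503 so the sender
        # retries the event later.
        return HTTPStatus.SERVICE_UNAVAILABLE
    send_receipt(user, event)
    return HTTPStatus.OK
```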
You don't actually want to do anything substantial in response to the webhook, anyway. Just log it to a queue and return immediately.
Then some other process can come along and process the queue, taking as long as it needs to interrogate stripe, create users and objects in the database, and send email.
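A bare-bones sketch of that split, using an in-process queue and a thread so it runs on its own. In production the queue would more likely be a durable one (a database table or a message broker); the names `receive_webhook` and `worker` are made up for the example.

```python
import json
import queue
import threading

webhook_queue = queue.Queue()


def receive_webhook(raw_body: str) -> int:
    """HTTP-handler side: enqueue the raw payload and acknowledge at once."""
    webhook_queue.put(raw_body)
    return 200


def worker(process) -> None:
    """Background side: free to take as long as each event needs."""
    while True:
        raw_body = webhook_queue.get()
        if raw_body is None:  # sentinel used to stop the worker
            break
        process(json.loads(raw_body))
```

The handler never touches the database or calls another API, so the sender always gets a fast success response.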
At first that process was trying to build up my model piece by piece in response to each webhook callback, but they weren't being delivered in the right order all the time.
Cache and log everything for auditing and debugging, but never use or depend on anything you have cached or that was delivered to you in a webhook event. Because of random ordering, it may already be stale by the time you get it. Always do a full refresh of all user and subscription data.
In the end, that process got more robust and didn't actually give a shit about the particular events that were queued from the webhook. The fact that there was something in the queue just woke it up, then it figured out the transaction id of each event (where that lived varied from event to event, but it ignored most of the rest of the event data), and marked those transactions as needing to be refreshed.
Then for each refreshable transaction, it would pull down fresh user and subscription data, and go from there. So it didn't matter what order events were pushed in, or how big a flurry of events you got on the same transaction, because it processed all the affected transactions just once, instead of responding to each event.
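The "mark dirty, then refresh once" idea can be sketched like this. `transaction_id_of` is a hypothetical function that digs the transaction id out of whatever event type it is given, and `refresh` stands in for the code that pulls fresh user and subscription data.

```python
def transactions_to_refresh(events, transaction_id_of):
    """Collapse a flurry of events into the distinct transactions they touch."""
    dirty = []
    seen = set()
    for event in events:
        txn_id = transaction_id_of(event)
        if txn_id is not None and txn_id not in seen:
            seen.add(txn_id)
            dirty.append(txn_id)
    return dirty


def refresh_all(events, transaction_id_of, refresh):
    """Refresh each affected transaction exactly once, whatever the event order."""
    for txn_id in transactions_to_refresh(events, transaction_id_of):
        refresh(txn_id)
```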
We just use a database lock for this. Lock on the charge/order/user/whatever, then the stripe webhook delivery is blocked until the transaction commits.
This might seem like a pain, but Stripe have no idea how fast their clients are. There’s not much point in waiting, say, 1 second minimum before delivering the webhook, as that only alleviates the problem for clients fast enough to “finish” within 1 second, whatever their definition of finish is.
For this reason, I think Stripe are doing the right thing, and I’m not sure there’s much they can do to make it much easier. Once you know it needs to be done, a solution is pretty straightforward.
You really should respond to the webhook handler immediately, by queuing it, and handling it asynchronously in another process.
It's bad enough to perform slow operations like creating lots of objects in the database, calling other web apis, sending email, before you return a success to stripe's webhook.
But for one webhook to lock out all other webhooks while it did all that work would only compound the problem.
Stripe tends to send a whole flurry of events related to the same transaction, when there's actually only one thing for you to do in one swoop (create a user and start their subscription), so processing them one by one is very inefficient, especially if you have to call back to Stripe for each event.
Sorry, maybe should have clarified, webhooks are typically processed in a queue, we reply to Stripe immediately, but we block the processing of the webhook on that lock. As we're processing in a queue with multiple workers this typically doesn't block much work from happening.
There are some cases where we specifically want to propagate the error to a webhook provider (not in the case of Stripe, so we work inline, but that's rare). There are also some cases where we want to process webhooks on a serial queue, one at a time, to ensure in-order delivery (again not in the case of Stripe).
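To illustrate the locking pattern without needing a database, here is a sketch that uses per-key in-process locks as a stand-in. With several worker processes the real thing would be a row lock held inside a transaction (for example `SELECT ... FOR UPDATE` on the order row); `locked`, `create_charge` and `process_webhook` are hypothetical names, and `db` is just a dict.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


@contextmanager
def locked(key: str):
    """Stand-in for a database row lock on the order/user identified by key."""
    with _locks_guard:
        lock = _locks[key]
    with lock:
        yield


def create_charge(order_id, db, call_api):
    with locked(order_id):
        charge_id = call_api()    # the webhook may already fire here
        db[charge_id] = order_id  # the "commit": the record becomes visible


def process_webhook(order_id, charge_id, db):
    # Blocks until create_charge has released the lock, so the record the
    # webhook refers to is guaranteed to be there.
    with locked(order_id):
        return db.get(charge_id)
```

The key has to be something both sides already know, such as an order or user id, which is why the lock is taken on that rather than on the charge being created.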
These posts make me cringe, as someone who deals with enterprise SaaS integrations that publish to webhooks. There are a lot of bad actors, and I've seen many interesting issues pop up because of the awesome stuff subscribers try to do.
Gotta say, that doesn't sound like a hack, it sounds like a perfectly normal response to the imperfect nature of async processes and the internet itself. TCP traffic itself behaves in the same way. If a packet arrives out of order, you discard it and notify the other end that the packet was not delivered, allowing it to retry. This is just the same behavior. Web services at their finest.
This is my experience with most systems that send webhooks, in particular payments, and subscription management systems. As you've elaborated on in other comments, queuing and periodic retries are generally the best way to handle interacting with what is effectively, an eventually consistent API/system.
This is a problem of API async operations, either by nature or by implementation: for either reason the process at hand (payment whatever) relies on a webhook to continue.
Here's a list of some problems it can bring:
## Synchronous context:
- Network errors whose call can be retried. For whatever reason you know the message did not reach the remote API. Exponentially/randomly increase timeout between each call. Have multiple levels. Immediate retries vs retries delayed in a queue or stored for later inspection.
- Network calls that fail and that you hesitate to retry. Typically timeouts. Was the query received on the other end of the tunnel? You don't know. Ensure idempotency via an idempotency key or via a query to check if the entity has actually been created/updated. Both solutions require support from the API.
This concerns both calls made by the client as well as those made by the API (to send notifications to the hooks).
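The two points above can be combined in one small sketch: retry with exponentially growing, randomized delays, and send the same idempotency key on every attempt so a call that did get through is not applied twice. `TransientError` and the `send(idempotency_key)` callable are assumptions for the example, and the remote API has to support idempotency keys for this to be safe.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical: raised when we cannot tell whether the call went through."""


def call_with_retries(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    # One key for the whole logical operation, reused on every attempt.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(idempotency_key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up; the caller can queue it for later inspection
            # Exponential backoff with random jitter between attempts.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```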
All of the above applies in addition to the following:
## Asynchronous context:
### Response/Webhook race condition:
- Receiving the response as well as a 'created' event. You have to decide where to put what comes next after the API call in the process logic. Keep your code synchronous by placing the continuing code under the call. Or make it asynchronous by registering it to the event.
- Receiving another event after the response. For instance, you make an API call, receive a 'created' event (ignored) followed by a 'successful' event (oops) before even receiving the response. You have a state to update, but can't find the right record because you haven't received the id from the response (and decided to ignore the 'created' event for the point above).
### Event ordering (quite the same thing as above):
- Receiving events for the same entity in the wrong order. For instance: 'succeeded' before 'created'.
- Receiving events for different entities in the wrong order. 'avatar uploaded' before 'account created'.
### Event number
- None (add to that the absence of retries for notifications and you have a good part of maintenance cost of a system whose backend relies on web APIs).
- Multiple. A typical solution is to store the received event with its id or any other way to ensure it is unique.
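For the "multiple" case, one way to store received event ids is a table with a uniqueness constraint. This sketch uses SQLite so it runs anywhere; the table and function names are made up.

```python
import sqlite3


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def record_if_new(conn, event_id: str) -> bool:
    """Return True the first time an event id is seen, False for duplicates."""
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)",
            (event_id,),
        )
    # rowcount is 1 when the row was inserted, 0 when the id already existed.
    return cur.rowcount == 1
```

Only process an event when `record_if_new` returns `True`; a redelivered event then becomes a no-op.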
I did a subscription service with Stripe too several years ago. I think I just ended up failing those webhooks, and then letting Stripe automatically retry in a bit.
Hmm, interesting. Have not bumped into that one yet. This might be due to the fact I only use one webhook or that I'm a small time player. The volume of hooks is pretty manageable right now.
Going from straightforward one-time purchases to maintaining subscriptions for an online service is a huge jump in complexity! Then throw in free trials, prorating subscriptions, upgrading monthly to yearly, etc. There are so many issues and edge cases with recurring charges on credit cards that can blindside you.
One thing about the Stripe API that I loved was that I could intertwingle it with my own admin interface, by pushing descriptions and metadata properties into stripe objects that had url links back into corresponding objects in my admin interface (like users, products, coupons, subscriptions, transactions, etc).
Though it may not be explicitly documented, Stripe's web site is smart enough to make the urls in metadata be clickable links, which is a godsend.
So I didn't have to duplicate stuff you could do on stripe's site in my own admin interface, I could just link back and forth between them!
For goodness sakes, log webhooks to a queue and rebuild the object graph later by probing their API. I would be absolutely blown away if Stripe recommended that folks attempt to process these inline (simple tutorial-level example code notwithstanding).
EDIT: And for your own sanity, assume 50% of the webhooks you expected to arrive didn't. Schedule a periodic task to scrape transactions from their API.
> EDIT: And for your own sanity, assume 50% of the webhooks you expected to arrive didn't. Schedule a periodic task to scrape transactions from their API.
I'm pretty sure the Webhooks keep retrying until they get a success response back from your server.
They do it for 3 days though which seems more than enough... and you can even manually trigger them from the dashboard. So I'm not really seeing why you can't just rely on Webhooks?
Cool post! I don't have a ton of experience using Stripe, but shouldn't you at least be handling some sort of payment_failed webhook?
It looks like you call _createSubscription and set the initial value for currentPeriodEnds before you know the payment actually succeeded, and since you don't ever check or listen for failed payments, anyone could get a free month (or year) of Checkly by using a bad card, or if the payment just randomly fails.
Maybe this isn't a huge deal in the early days, but you and your customer might not even notice the failed payment for quite a while unless you happen to check your Stripe dashboard!
This is a great comment. And it's stupid I left this out of the post, as I made a conscious decision to not deal with that now. I should at some stage.
I actually had one failing credit card already, but my customer base is in the 30-to-100 range, so I easily caught it. Also, it was totally benign, from an early customer that just had a bodged renewal for their card.
OP here. I'm super curious about other folks using Stripe Billing and their experiences. My SaaS is fairly young, so I probably have missed some things.
Great post! The number of webhook events that Stripe sends out is really overwhelming at first, and especially the order they can be received in. I ended up coming to the same conclusion that you did, and only end up listening to the invoice payment succeeded and invoice payment failed webhooks.
Doing so greatly simplified the process, rather than listening for customer or subscription updates and changes.
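A sketch of what "only listen to those two" can look like. The event type strings are Stripe's `invoice.payment_succeeded` and `invoice.payment_failed`; the local status labels, the `subscriptions` dict and the function name are assumptions for the example.

```python
# Map the two event types we care about to a local subscription status.
# The status labels are our own choice for this sketch.
HANDLED = {
    "invoice.payment_succeeded": "active",
    "invoice.payment_failed": "past_due",
}


def apply_invoice_event(event: dict, subscriptions: dict) -> bool:
    """Update local state for the two invoice events; ignore everything else."""
    new_status = HANDLED.get(event["type"])
    if new_status is None:
        return False  # customer.*, subscription updates and so on are skipped
    customer_id = event["data"]["object"]["customer"]
    subscriptions[customer_id] = new_status
    return True
```

Handling the failed case here also covers the free-month problem raised earlier in the thread, since a bad card flips the customer out of the active state.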
Great to read others are taking a similar approach!