> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.
If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.
The orgs are not ruthless like that: anything below a certain % of org revenue is not worth bothering with unless the problem creates _more_ work for the person responsible than fixing it would.
Add some % if the person who gets the extra work from the problem is not the same as the person who needs to fix it. People will happily leave things in a broken state if no one calls them out on it.
In my opinion, if something isn’t actually an error, you modify your logging to not log it as an error. Your error logging/alerting pipeline should always stay clean.
If something shows up in there, there should only be 2 options: 1) it's an actual error, so you fix it and make sure it never happens again, or 2) it's not an error, so you fix it by adjusting the log level so it no longer shows up as one.
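The "option 2" fix above — demoting a known non-error so it stops polluting the error pipeline — is a one-line log-level change in most frameworks. A minimal sketch in Python's stdlib `logging` (the logger name and retry scenario are made up for illustration):

```python
import logging

# Hypothetical service logger; the name is illustrative.
logger = logging.getLogger("payments")
logging.basicConfig(level=logging.WARNING)

def fetch_with_retry(attempt: int, max_attempts: int = 3) -> None:
    if attempt < max_attempts:
        # Option 2: an expected, recoverable condition.
        # Log at WARNING, not ERROR, so the error pipeline stays clean.
        logger.warning("fetch failed, retrying (%d/%d)", attempt, max_attempts)
    else:
        # Option 1: a real failure someone has to fix.
        logger.error("fetch failed after %d attempts", max_attempts)
```

With this split, anything at ERROR level in the pipeline is an actual error by construction, which is exactly what keeps alerting trustworthy.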
If someone suggests an “error budget” on my watch they get the door. You can have a warning budget (and the resources to adjust the log levels or remediation protocols to fix said “errors”) but actual errors should remain errors - otherwise they’re delivering broken software and that’s not what I’m paying them for.
Of course, companies that have the common sense to do this already do it, and nobody in their right mind would suggest an "error budget" — but those that don't have a serious problem that needs to be rectified.
The danger otherwise is that you're making your observability pipeline useless, because "errors" no longer actually mean errors. That's really bad: it opens the door to actual errors being ignored until it's too late, at which point remediation is far more costly.
At Facebook a full outage is accompanied by "first time?" memes. Unless you are on the specific team responsible, you would indeed not really have any reason to care.
I'm in my 3rd year in enterprise now and have learned that many engineers will purposefully not fix/improve their problematic applications as a weird sort of job security. It kind of blew up in their faces last year when we moved most of the affected on-premise applications to the cloud. Seems like when you introduce tons of friction on-premise, it makes the cloud look even better to the suits.
It's not a matter of "can't be bothered." Engineers are constantly fixing things and rolling out new features. "Error budgets" are an acknowledgement of the tradeoff between these two things, and making a conscious choice about the balance between them, according to the business requirements of the application in question.
Keep in mind that "fixing things" is essentially a Sisyphean task - no matter how much you do there's always more you can do. Just like adding features. You have to have some kind of guideline on when enough is enough.