Hacker News

This is pretty common actually

At Facebook too there was no staging environment. Engineers had their dev VMs, and after PR review things went straight into prod.

That said, features and bug fixes were often gated by feature flags and rolled out slowly to understand the product/perf impact better.

This is how we do it on my current team too… for all the same reasons that the OP states.



>That said features and bug fixes were often times gated by feature flags

Sorry for maybe a silly question, but how do feature flags work with migrations? If your migrations run automatically on deploy, then feature flags can't prevent badly tested migrations from corrupting the DB, locking tables and other sorts of regressions. If you run your migrations manually each time, then there's a chance that someone enables a feature toggle without running the required migrations, which can result in all sorts of downtime.

Another concern I have is that if a feature toggle isn't enabled in production for a long time (for us, several days is already a long time due to a tight release schedule) new changes to the codebase by another team can conflict with the disabled feature and, since it's disabled, you probably won't know there's a problem until it's too late?


> Sorry for maybe a silly question, but how do feature flags work with migrations? If your migrations run automatically on deploy

Basically they don't. Database migration based on frontend deploy doesn't really make sense at facebook scale, because deploy is nowhere close to synchronous; even feature flag changes aren't synchronous. I didn't work on FB databases while I was employed by them, but when you've got a lot of frontends and a lot of sharded databases, you don't have much choice; if your schema is changing, you've got to have a multiphased push:

a) push frontend that can deal with either schema

b) migrate schema

c) push frontend that uses new schema for new feature (with the understanding that the old frontend code will be running on some nodes) --- this part could be feature flagged

d) data cleanup if necessary

e) push code that can safely assume all frontends are new feature aware and all rows are new feature ready

IMHO, this multiphase push is really needed regardless of scale, but if you're small, you can cross your fingers and hope. Or if you're willing to take downtime, you can bring down the service, make the database changes without concurrent access, and bring the service back with code assuming the changes; most people don't like downtime though.
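
To make the multiphase push concrete, here's a toy sketch in Python with an in-memory SQLite database. The table and column names are invented for illustration; real systems run this across many shards, but the compatibility idea is the same: the new column is nullable-with-default, so phase-a frontends that don't know about it keep working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# Phase b: migrate schema. The new column has a default, so frontends
# that are unaware of it (phase a) keep working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN locale TEXT DEFAULT 'en'")

# Old frontend code: still writes only the columns it knows about.
conn.execute("INSERT INTO users (name) VALUES ('bob')")

# Phases c/d: new code writes the column; cleanup backfills any old rows.
conn.execute("UPDATE users SET locale = 'en' WHERE locale IS NULL")
conn.execute("INSERT INTO users (name, locale) VALUES ('carol', 'de')")

# Phase e: every row now has a value, so code may assume the column exists.
rows = conn.execute("SELECT name, locale FROM users ORDER BY id").fetchall()
print(rows)  # [('alice', 'en'), ('bob', 'en'), ('carol', 'de')]
```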


>Basically they don't. Database migration based on frontend deploy doesn't really make sense at facebook scale, because deploy is nowhere close to synchronous; even feature flag changes aren't synchronous.

Our deployments aren't strictly "synchronous" either. We have thousands of database shards which are all migrated one by one (with some degree of parallelism), and new code is deployed only after all the shards have migrated. So there's a large window (sometimes up to an hour) when some shards see the new schema and others see the old schema (while still running old code). It's one click of a button, however, and one logical release; we don't split it into separate releases (so I view them as "automatic"). The problem remains, though, that you can only guard code with feature flags; migrations can't be conditionally disabled. With this setup, if a poorly tested migration goes awry, it's even more difficult to roll back, because it will take another hour to roll back all the shards.


Serious question: are you going to catch "corrupt data"-style migrations in staging in general?

There are of course "locks up the DB"-style migrations where you can then go in and fix it, so staging helps with that. But "oh this data is now wrong"-style errors seem to not really bubble up when you are just working off of test data.

Not to dismiss staging testing that much, but it feels like a tricky class of error where the answer is "be careful and don't delete data if you can avoid it"...


Even the "locks up the DB" migration behavior tends to depend on the load & size of tables, which staging might not recreate.


We don't have a staging environment (for the backend) at work either. However, depending on the size of the tables in question, a migration might take days. Thus, we usually ask DBAs for a migration days/weeks before any code goes live. There's usually quite a bit of discussion, and sometimes suggestions for an entirely different table with a join and/or an application-side (in code, multiple-query) join.


Sorry for the silly question, perhaps, but what is the purpose of a db migration? Do schemas in production change that often?

For context, the last couple of services I wrote all have a fixed but implicit schema (built on key-value stores). That is, the DB has no types. Instead, the type system is enforced by the API layer. Any field changes so far are gated via API access, and APIs have backwards-compatibility contracts with API callers.

I’m not saying that the way I do it currently is “correct” - far from it. I strongly suspect it’s influenced by my lack of familiarity with relational databases.


There is a lot to be said about enforcing the schema in the database vs doing it in application code, but not doing migrations comes with an additional tradeoff.

If you never change the shape of existing data, you are accumulating obsolete data representations that you have to code around for all eternity. The latest version of your API has to know about every single ancient data model going back years. And any analytics related code that may bypass the API for performance reasons has to do the same.

So I think never migrating data accumulates too much technical debt. An approach that many take in order to get the operational benefits of schemaless without incurring technical debt is to have migrations lag by one version. The API only has to deal with the latest two schema versions rather than every old data representation since the beginning of time.

Variations of this approach can be used regardless of whether or not the schema is enforced by the database or in application code.
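
As a rough sketch of the "lag by one version" idea (the field names here are hypothetical, not from any particular system): the API normalizes whichever of the latest two schema versions it encounters, and everything older has already been migrated away.

```python
def read_user(record: dict) -> dict:
    """Normalize a record that may be in schema v1 or v2.

    Only the latest two versions need handling, because migrations
    lag the code by exactly one version.
    """
    version = record.get("schema_version", 1)
    if version == 1:
        # v1 stored a single "name" field; v2 splits it into two.
        first, _, last = record["name"].partition(" ")
        return {"first_name": first, "last_name": last, "schema_version": 2}
    return record

print(read_user({"name": "Ada Lovelace", "schema_version": 1}))
# {'first_name': 'Ada', 'last_name': 'Lovelace', 'schema_version': 2}
```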


Relational databases can be very strict; for example, if you use foreign key references then the database enforces that a row exists in the referenced table for every foreign key in the referring table. This strict enforcement makes it difficult to change the schema.

The way you handle things with API level enforcement is actually a good architecture and it would probably make schema changes easier to deal with even on a relational database backend.


A fairly recent example is a couple of tables for users who are “tagged” for marketing purposes (such as we sent them an email and want to display the same messaging in the app). These tags have an expiration date at the tag level but we wanted the expiration date per-user too. This enables marketing to create static tags. This requires a migration to the data so this can be supported.

Schemas don’t change that often, in my experience.


For minor changes there's a simpler path, where you add a new field to the database, default the value to some reasonable value, then add it to the workflow in stages.

Depending on how your database feels about new columns and default values, there may be additional intermediate steps to keep it happy.


> how do feature flags work with migrations?

The idea is to have migrations that are backward compatible so that the current version of your code can use the db and so can the new version. Part of the reason people started breaking up monoliths is that continuous deployment with a db-backed monolith can be brittle. And making it work well requires a whole bunch of brain power that could go into things like making the product better for customers.

> another concern

Avoiding "feature flag hell" is a valid concern. It has to be managed. The big problem with conflict is underlying tightly coupled code, though. That should be fixed. Note this is also solved by breaking up monoliths.

> tight release schedule

If a release in this sense is something product-led, then feature flags almost create an API boundary (a good thing!) between product and dev. Product can determine when their release (meaning set of feature flags to be flipped) is ready and ideally toggle themselves instead of roping devs into release management roles.


>The idea is to have migrations that are backward compatible so that the current version of your code can use the db and so can the new version

Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.

As an example of what I'm talking about: a few months ago we had a migration that passed all code reviews and worked great in the dev environment, but in production it led to request timeouts for the duration of the migration for large clients (our application is sharded per tenant), because the table was very large for some of them and the migration locked it. The staging environment helped us find the problem before hitting production, because we routinely clone production data (deanonymized) of the largest tenants to find problems like this. It's not practical (and maybe not even legal) to force every developer to have an up-to-date copy of that database on every VM/laptop, and load tests in an environment very similar to production show more meaningful results overall. And feature flags wouldn't help either, because they only guard code. So far I'm unconvinced; it sounds pretty risky to me to go straight to prod.

I agree however that the concern about conflicts between feature toggles is largely a monolith problem, it's a communication problem when many teams make changes to the same codebase and are unaware of what the other teams are doing.


> Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.

This is definitely best practice, but it's not strictly necessary if a small amount of downtime is acceptable. We only have customers in one timezone and minimal traffic overnight, so we have quite a lot of leeway with this. Frankly even during business hours small amounts of downtime (e.g. 5 minutes) would be well tolerated: it's a lot better than most of the other services they are used to using anyway.


> Well, any migration has to be backward-compatible with the old code because old code is still running when a migration is taking place.

This doesn't have to be true. You can create an entirely separate table with the new data. New code knows how to join on this table, old code doesn't and thus ignores the new data. It doesn't work for every kind of migration, but in my experience, it's preferred by some DBAs if you have billions and billions of rows.

Example: `select user_id, coalesce(new_col2, old_col2) as maybe_new_data, new_col3 as new_data from old_table left join new_table using (user_id) limit 1`


For us we use pt-online-schema-change which copies the table first, sets up some triggers to keep things synced, then renames the tables at the end.


> because we routinely clone production data (deanonymized)

Are you using an external service or in-house tool to perform this operation?


I think their question was more "if I wrote a migration that accidentally drops the users table, how does your system prevent that from running on production"? That's a pretty extreme case, but the tldr is how are you testing migrations if you don't have a staging environment.


Put the DB in Docker (or provide some other one-touch way to install a clean database). Run all migration scripts to get a current schema, insert sample data, and do your testing. Then make this part of the build process, and make sure that regression tests failing after the migrations prevent merge. The key is having a DB that can be recreated as a nearly atomic operation.
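
A minimal sketch of that build step, using Python with an in-memory SQLite database (the migration scripts are invented for illustration; a real setup would run against the same engine as production):

```python
import sqlite3

# Hypothetical migration scripts, applied in order to a clean database.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN created_at TEXT",
    "CREATE INDEX idx_users_email ON users (email)",
]

def fresh_db():
    """Recreate the schema from scratch -- the nearly-atomic reset step."""
    conn = sqlite3.connect(":memory:")
    for script in MIGRATIONS:
        conn.execute(script)
    return conn

# Regression check run as part of the build: insert sample data and
# verify the migrated schema still supports the queries the app relies on.
conn = fresh_db()
conn.execute("INSERT INTO users (email, created_at) VALUES (?, ?)",
             ("a@example.com", "2021-11-01"))
row = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()
print(row[0])  # a@example.com
```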


I'd think they create "append-only" migrations, that can only add columns or tables. Otherwise it wouldn't be possible to have migrations that work with both old and new code.


> Otherwise it wouldn't be possible to have migrations that work with both old and new code.

Sure you can. Say that you've changed the type of a column in an incompatible way. You can, within a migration that executes as an SQL transaction:

1. rename the original table "out of the way" of the old code

2. add a new column of the new type

3. run an "INSERT ... SELECT ..." to populate the new column from a transformation of existing data

4. drop the old column of the old type

5. rename the new column to the old column's name

6. define a view with the name of the original table, that just queries through to the new (original + renamed + modified) table for most of the original columns, but which continues to serve the no-longer-existing column with its previous value, by computing its old-type value from its new-type value (+ data in other columns, if necessary.)

Then either make sure that the new code is reading directly from the new table; or create a trivial passthrough view for the new version to use as well.

(IMHO, as long as you've got writable-view support, every application-visible "table" should really just be a view, with its name suffixed with the ABI-major-compatibility-version of the application using it. Then the infrastructure team — and more specifically, a DBA, if you've got one — can do whatever they like with the underlying tables: refactoring them, partitioning them, moving them to other shards and forwarding them, etc. As long as all the views still work, and still produce the same query results, it doesn't matter what's underneath them.)
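
A rough illustration of the view trick, sketched in Python with SQLite (table and column names invented; the drop-old-column step is skipped since older SQLite versions lack DROP COLUMN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Original schema: old code reads "events" with a textual timestamp.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.execute("INSERT INTO events (ts) VALUES ('1636000000')")

# 1. Rename the real table out of the old code's way.
conn.execute("ALTER TABLE events RENAME TO events_v2")
# 2-3. Change the column type (sketched here as add + backfill).
conn.execute("ALTER TABLE events_v2 ADD COLUMN ts_epoch INTEGER")
conn.execute("UPDATE events_v2 SET ts_epoch = CAST(ts AS INTEGER)")
# 6. A view with the original name keeps serving the old representation,
#    computed from the new-type column.
conn.execute("""CREATE VIEW events AS
                SELECT id, CAST(ts_epoch AS TEXT) AS ts FROM events_v2""")

# Old code still works, unaware the underlying table changed.
print(conn.execute("SELECT ts FROM events WHERE id = 1").fetchone()[0])
```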


Not to be rude but this isn't how this works at all. Things like 'run an "INSERT ... SELECT ..."' can't happen at scale due to locking. How they actually do it is super rad:

https://www.percona.com/doc/percona-toolkit/3.0/pt-online-sc...

tl;dr: They set up a system of triggers (updates, inserts, etc.), copy the data over, then run through all the data in the trigger system. Percona developed all these fancy features as well to monitor replica data etc. Another way, with cloud VMs (terabyte+ tables): you image a replica, do the alter, let the replica catch up, image it, build replicas off that, and promote this to master.
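
A heavily simplified sketch of the trigger-based idea, in Python with SQLite (pt-online-schema-change itself does much more: chunked copies, throttling, replica monitoring — this just shows the shadow-table + trigger + swap shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [("a",), ("b",)])

# New "shadow" table with the altered schema (an extra column).
conn.execute("CREATE TABLE t_new (id INTEGER PRIMARY KEY, v TEXT, extra TEXT)")
# A trigger keeps the shadow table in sync with writes arriving meanwhile.
conn.execute("""CREATE TRIGGER t_ins AFTER INSERT ON t BEGIN
                  INSERT INTO t_new (id, v) VALUES (NEW.id, NEW.v);
                END""")

# Backfill existing rows (real tools do this in throttled chunks).
conn.execute("INSERT INTO t_new (id, v) SELECT id, v FROM t")
# A concurrent write lands in both tables via the trigger.
conn.execute("INSERT INTO t (v) VALUES ('c')")

# Finally, drop the trigger and swap the tables.
conn.execute("DROP TRIGGER t_ins")
conn.execute("ALTER TABLE t RENAME TO t_old")
conn.execute("ALTER TABLE t_new RENAME TO t")
print(conn.execute("SELECT count(*) FROM t").fetchone()[0])  # 3
```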

Facebook's internal one I hear is close to this.


"This isn't how things work at all" implies that more than 0.0000001% of DBs are running "at scale." Most DBs people will ever deal with — including within big enterprises! — are things like customer databases that have maybe 100k records in them. The tables lock waiting for lower-xact-ID read traffic to clear, yes — for about 400ms. Because queries running on the DB are all point queries that don't run any longer than that. Or they're scheduled batch jobs that happen during a maintenance window.

At scale, you're hopefully not using an RDBMS as a source of truth in the first place; but rather, using it as a CQRS/ES aggregate, downstream of a primary event store (itself likely some combination of a durable message-queue, and archival object storage for compacted event segments.) In that kind of setup, data migrations aren't the job of the RDBMS itself, but rather the job of the CQRS/ES framework — which can simply be taught a new aggregate that computes a single new column, and will start catching that aggregate up and making the data from it available. If your RDBMS is columnar (again, hopefully), you can just begin loading that new column in, no locking on the rest of the table required.

IMHO, the trigger-based approach is a weak approximation of having a pre-normalization primary source. It's fine if you want to avoid rearchitecture, but in a HOLAP system (which is usually inevitable for these sorts of systems when their data architects have tried to focus on simplicity, as this leads to eschewing denormalized secondary representations in OLAP-focused stores) it will cause your write perf to degrade + bottleneck, which is actually the worst thing for locking.

(I should know; I'm dealing with a HOLAP system right now with large numbers of computed indices, append-only OLTP inserts, and random OLAP reads; where the reads hold hundreds of locks each due to partitioning. Any time a write tx stays open for more than a few hundred milliseconds, the whole system degrades due to read locks piling up to the point that the DB begins to choke just on allocating and synchronizing them. The DB is only a few TB large, and the instance it's on has 1TB of memory and as many cores as one can get... but locking is locking.)


I wrote a blog about this for anyone who would like to learn more.

The query strings get you around the paywall if it comes up:

https://freedomben.medium.com/the-rules-of-clean-and-mostly-...

If anyone doesn't know what migrations are:

https://freedomben.medium.com/what-are-database-migrations-5...


My impression is that once you're at Facebook scale, most of your migrations are massive undertakings that need to take into account things like "How many terabytes more space do we need?", "How do we control the load on our DB nodes while the migration is going on?", "How do we get data from cluster A to cluster B?", "Is adding this index going to take hours and break everything?", and so on. Some of the time you'll be spinning up an entirely new cluster rather than changing the schema of an old one, and when you do migrate an existing cluster there's some five-page document specifying a week-long plan with different phases.

Then internally, they work around the inflexible db schemas by using offline batch processing tools or generic systems that can handle arbitrary data, for tasks that would be handled by the one DB in smaller systems.


That is largely the case.

For other, more complex cases where that is not possible, you migrate a portion of the userbase to a new db schema and codepath at the same time.


> Another concern I have is that if a feature toggle isn't enabled in production for a long time (for us, several days is already a long time due to a tight release schedule) new changes to the codebase by another team can conflict with the disabled feature and, since it's disabled, you probably won't know there's a problem until it's too late?

Reminiscent of Knight Capital losing $440 million in 45 minutes via feature flags: https://dougseven.com/2014/04/17/knightmare-a-devops-caution...


How DB migrations work in particular (other things, like business logic, work similarly): https://news.ycombinator.com/item?id=29046303


What if you insisted that database downtime for migrations was not acceptable, and that your code needs to work with different versions of the database (this can be done with adapters)?


It's common, but it's more like "this is common at young companies where the cost of maintaining staging can't pay for itself in improved productivity, because there aren't enough engineers for a 5% productivity improvement to be worth hiring multiple engineers for."

I'm sure that at FB at one point there wasn't a staging env. Today at FB there are multiple layers of staging, checked automatically as part of the deployment process. I'm ex-FB as well, and we definitely used staging environments every single day as part of the ordinary deploy pipeline. You probably worked there when it was younger, and smaller, and the tooling was less advanced.

Large tech companies have advanced dev tooling; eventually, the cost of paying people to make the tools is paid for in productivity gained per engineer, with large enough eng team sizes.


I think it's a difference in relationship with your users. For software I'm currently working on, we require verification that the changes we made were correct before they make it out to production, so there's a requirement for a staging environment they can access. That's a software > business relationship, where there's a known good outcome. This was also true in the ecommerce agency environments I worked in, the business owners want the opportunity to verify and correct things before they go out to production.

If it were a product > user relationship, where you're the product owner and you are trying to improve your product without an explicit request from your users, I can see how no staging environment makes sense. You have no responsibility of proof of correctness to your users, what you put out is what they get, and breakages can be handled as fixes after the fact.


This is how my current place does it. The only issue we are having is library / dependency updates have a tendency to work perfectly fine locally and then fail in production due to either some minor difference in environment or scale.

It's a problem to the point that we have 5 year old ruby gems which have no listed breaking changes because no one is brave enough to bump them. I had a go at it and caused a major production incident because the datadog gem decided to kill Kubernetes with too many processes.


Do you have a replica of the production environment codified somehow, like in a VM? It's rarely perfect, but I usually try to develop locally on the same stack I deploy to, which can help with the environment differences.

It's also why I think it's smart to rebuild the environment on deploy if it makes sense for your pipeline, so that you wipe any minor differences that have been accruing over time. Working on a long running product, you quickly find yourself with disparities building up, and they're not codified, so they're essentially unknown to the team until they cause an issue.


Yes, there is a docker compose file which is usually sufficient. The issue in this case was the problem only shows up if there is sufficient load on the background workers.


Facebook can completely break the user experience for 4.3 million different users each day and each user would only experience one breakage per year.

This is pretty common, but not because most of those employing it have 1.6bn users and 10k engineers (essentially enough scale to throw bodies at problems).


Why would a mistake only affect 4M users and not 400M?


Look at it the other way around - you could have a different outage every single day and as long as that outage only impacted 4.3m users and they were different users each day, it would look like a once-a-year event to the average user.

They’re saying there’s a lot of leeway to break things (in a small way) at scale.


No guarantees, but a feature flag or canary deploy can significantly increase likelihood of impacting a targeted subset of users.


Was this true for the systems that related to revenue and ad sales as well? While I can believe that a lot of code at Facebook goes into production without first going through a staging environment, I would be extremely surprised if the same were true for their ads systems or anything that dealt with payment flows.


I worked on ad systems at Facebook, and yes it's (approximately) true for those as well.

The thing to realize is that "in production" almost never means "rolled out to 100% of users from 0%". Instead you'd do very slow rollouts to, say, 1% of people in Hungary (or whatever) and use a ton of automatic measurements over time, as well as lots and lots of tests, to validate that things were working as expected before rolling out a little more, then a little more after that. By the time the code is actually being hit by the majority of users, it's often been run billions of times already.
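
Gradual rollouts like that are commonly implemented by hashing user IDs into stable buckets; here's a generic sketch (not Facebook's actual gating system — names and the hashing scheme are just illustrative):

```python
import hashlib

def in_rollout(user_id: str, feature: str, pct: float) -> bool:
    """Deterministically bucket a user into a gradual rollout.

    The same user always gets the same answer for a given feature, so
    ramping 1% -> 5% -> 100% only ever adds users, never flip-flops them.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < pct

# Ramp a hypothetical feature: at 1% only a small slice of users sees it.
enabled = [u for u in ("u1", "u2", "u3") if in_rollout(u, "new_ads_ui", 0.01)]
print(in_rollout("u1", "new_ads_ui", 1.0))
```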


I don't know about Facebook, but at other companies without one, each git branch gets deployed to its own subdomain, so manual testing etc. can happen prior to a merge. Dangerous changes are feature flagged or gated as much as possible to allow prod feedback after merge before enabling the changes for everyone.


Problem I always seem to run into is that these optional features always seem to be added at a rate that's a bit higher than the rate at which the flags are retired. It doesn't take much of a multiplier for the number to become untenable pretty quickly.


Facebook also managed to lock themselves out of their own network for how many hours again? :P


I am leading a small team working on a social audio product. We follow the same process. The vast majority of our content is live audio conversations and so we need live/production data. If our stakeholders have to test the product it means they have to join those conversations, or setup live conversations in a parallel universe. Feature flags in production are the simplest way forward, but they carry a fair amount of risk. This is offset by automated tests wherever possible.


The book Software Engineering at Google or something akin to that mentions the same kind of thing.


This one: "Site Reliability Engineering: How Google Runs Production Systems"?



What about third party integrations? Don’t you need some non-production environment to test them in until both parties are satisfied with the integration and it’s impact on users?


Also facebook chats were known to take from days to infinity to deliver messages, so it's not like anyone really expected anything from facebook.


That would be controlling a lot of feature flags given how many can be switched on at once. How do you control them?


It's 7 years old by now, but there's some literature:

https://research.facebook.com/publications/holistic-configur...

You can see that there's a common backend ("configerator") that a lot of other systems ("sitevars", "gatekeeper", ...) build on top of.

Just imagine that these systems have been further developed over the last decade :)

In general, there's 'configuration change at runtime' systems that the deployed code usually has access to and that can switch things on and off in very short time (or slowly roll it out). Most of these are coupled with a variety of health checks.


flag = true

More seriously, at my old company they just never got removed. So it wasn't really about control. You just forgot about the ones that didn't matter after a while.

If that sounds horrible, that’s probably the correct reaction. But it’s also common.

Namespacing helps too. It’s easier to forget a bunch of flags when they all start with foofeature-.


It can become a code maintenance issue, though, when you revisit the code. You need to maintain both paths when you never know if they are being used.

Also, where flags interact, you can get a combinatorial explosion of cases to consider.


I’ve seen those old flags come in handy once. Someone accidentally deleted a production database (typo) and we needed to stop all writes to restore from a backup. For most of it, it was just turning off the original feature flag, even though the feature was several years old.


> the ones that didn’t matter after awhile.

Ideally you have metrics for all flags and their values, so you can easily tell if one becomes redundant and safe to remove entirely after a while.

I've also seen making it a requirement to remove a flag N days after the feature is completely rolled out.


At a previous workplace we managed flags with Launch Darkly. We asked developers not to create flags in LD directly but used Jira web hooks to generate flags from any Jira issues of type Feature Flag. This issue type had a workflow that ensured you couldn't close off an epic without having rolled out and then removed every feature flag. Flags should not significantly outlast their 100% rollout.


I work at a different company. Typically feature flags are short-lived (on the order of days or weeks), and only control one feature. When I deploy, I only care about my one feature flag because that is the only thing gating the new functionality being deployed.

There may be other feature flags, owned by other teams, but it's rare to have flags that cross team/service boundaries in a fashion that they need to be coordinated for rollout.


You have automated tools that yell at you to clean up feature flags, and you force people to include sensible expiration dates as part of your PR process. Flags past their date result in increased yelling. If your team has too much crap in the codebase, eventually someone politely tells you to clean it up.

You also have tooling that measures how many times a flag was encountered vs. how many times it actually triggered etc. Once it looks like it's at 100% of traffic, again you have automations that tell people to clean up their crap.



