Postgres Job Queues and Failure by MVCC

rosser · on July 9, 2017

Disclaimer: IAAPDBA (I am a PostgreSQL DBA) I have been employed thusly for a decade now, and have worked in environments ranging from tens of concurrent users to hundreds of thousands. Previously, I was a developer.

A few jobs ago, I encountered a similar situation. That stack was Python-based. The engineers had apparently not realized that the DBAPI's default is to silently wrap all database interactions in an explicit transaction. (That is, before it submits your query to the db, it transmits a "BEGIN", and expects you to "COMMIT" or "ROLLBACK" as appropriate. I, personally, think that's a terrible choice, but whatever.)

Everywhere else in the stack, this wasn't an issue — or at least not a particularly troublesome one. In the job queue, it was. The reason is simple: the job workers were long-lived, holding those (implicitly opened) explicit transactions for the lifetime of the connection. In the rest of the stack, the app code would connect, interact, and disconnect cleanly, including COMMIT-ing. How folks realized the need to do that in one place, but not the other, no-one could adequately explain to me.

When I started, they were uncomplainingly habituated to shutting down their entire jobs processing infrastructure (which had customer-visible impact) for hours, every week, to do a "VACUUM FULL"/"REINDEX" cycle on the queue table. That table's disk footprint followed a perfect "sawtooth" pattern, growing linearly "up and to the right", well into the gibibytes (for, on average, a few hundred to a few thousand relatively narrow, "live" rows), until the maintenance event, where it would drop off vertically. Lather, rinse, repeat. They had been running this way for years.

It took a one-line code change to disable that behavior, and hours of downtime per week were eliminated.

IMO, this even less PostgreSQL's "fault" than the "write amplification" issue that Uber flounced over last year. They Did It Wrong, and it bit them.

TL;DR, it's the poor craftsman who blames his tools, and when you use a tool incorrectly, that tends to be consequential.

EDIT: phrasing.

inopinatus · on July 10, 2017

> How folks realized the need to do that in one place, but not the other, no-one could adequately explain to me

I have seen this pattern before in job queueing code that used a RDBMS backend, and it was due to a conflation in the developer's understanding of database semantics, between an open transaction and a lock. Essentially the long-running transaction was being abused in the hope of achieving exactly-once semantics.

Facepalm, I know; one could list many failure modes resulting from such a naive approach, but that was the rationale nevertheless. As you said in another remark, these are teachable moments.

dkersten · on July 10, 2017

> How folks realized the need to do that in one place, but not the other, no-one could adequately explain to me

Chances are that whatever web framework they were using was doing the commits/rollbacks per-request automatically but when they were doing queries outside of a request context, they had to do it manually (and didn't). That is, its possible they just didn't know/realise what was happening.

k__ · on July 9, 2017

Would you say the DBA job is easy mostly because non-DBAs simply don't bother with getting to know their DB?

rosser · on July 9, 2017

No, I would say that makes it harder.

People have all kinds of crazy, naïve assumptions about how databases work. I often don't see the result of those assumptions until late in the process, when correcting them is significantly more difficult.

EDIT: It does create a lot of teachable moments, however. Over time, many of those assumptions are mitigated.

lmm · on July 10, 2017

I'd say that's the reason it's a job at all, at any but the largest companies. 90% of a DBA's time is spent preventing or solving problems that wouldn't be there if every developer had a level of basic competence with their DB.

smilliken · on July 9, 2017

Bloat due to long running transactions and batch writes is a significant operational pain in PostgreSQL. For extreme cases where vacuum isn't doing its job, check out pg_repack.

There is another approach to queuing in postgresql that doesn't suffer this problem though:

* Give each of your workers an id from 0..n.

* Open a connection to read from the queue, ordered by your priority index, and partition for your worker id (eg primary key % n = worker id). Make it a cursor.

* Read from the cursor in batches. You'll never see the same dead row twice because the cursor is stateful and is doing a single scan. You generally won't see dead rows anyway since this worker is the only one assigned to update these queue items.

* Perform your writes in a separate postgresql connection to avoid invalidating the cursor. Think about it like how a Unix process has a stdin and stdout pipe, except your pipes are postgresql connections.

This approach is great for throughout because you request queued items in batch, and there's no contention between workers. It's admittedly not as good for latency because a new queue item won't be dequeued until the worker it's assigned to finishes its run and restarts with a new cursor. For low latency queues, use FOR UPDATE SKIP LOCKED.

evadne · on July 9, 2017

I noticed that this article was written in 2015 and I wonder whether SKIP LOCKED in Postgres 9.5 could have changed the entire situation,

https://blog.2ndquadrant.com/what-is-select-skip-locked-for-...

http://www.cybertec.at/skip-locked-one-of-my-favorite-9-5-fe...

Also, duplicate of: https://news.ycombinator.com/item?id=9576864

rpedela · on July 9, 2017

I came to say the very same thing. There may still be problems with using SKIP LOCKED but it appears the core problem mentioned in the blog post would be solved by it.

ruslan_talpa · on July 9, 2017

certainly, would have helped but then we would not have had such great article about inner workings of PostgreSQL (especially since it applies to more the queues). If this problem would have been encountered by an average team (like me) it would have been impossible to fully understand what's going on so I am glad this article exists.

rosser · on July 9, 2017

Pre-9.5, you could use "SELECT ... FOR UPDATE NOWAIT" to achieve the same end.

malisper · on July 9, 2017

Uh... NOWAIT will error if the query encounters a locked row and rollback the transaction. SKIP LOCKED will make locked rows invisible to the current transaction, but will not cause the transaction to be rolled back. SKIP LOCKED effectively provides equivalent behavior to the recursive CTE given in the post

rosser · on July 9, 2017

> Pre-9.5...

malisper · on July 9, 2017

I don't see how NOWAIT gives you what you want here. NOWAIT and SKIPPED LOCKED are completely different.

rosser · on July 10, 2017

The clause "SKIP LOCKED" did not exist until 9.5.

Before that version, the behavior of silently skipping rows that had outstanding locks against them was specified by using "NOWAIT".

The decision of which specific clause to use is driven by which version of PostgreSQL you're using.

How much more plainly can I put that?

malisper · on July 10, 2017

AFAICT the behavior of NOWAIT has always been the same. The earliest mention of NOWAIT I could find was in Postgres 8.1[0]:

> To prevent the operation from waiting for other transactions to commit, use the NOWAIT option. SELECT FOR UPDATE NOWAIT reports an error, rather than waiting, if a selected row cannot be locked immediately.

[0] https://www.postgresql.org/docs/8.1/static/sql-select.html

rosser · on July 10, 2017

You're right. With a super-contrived test case, I got the described behavior.

Somehow, in all the the hundreds of millions (guesstimate) of rows I've used SELECT FOR UPDATE NOWAIT to process without lock conflicts, I've never actually gotten the error. At a guess, because I've generally always also used a LIMIT clause, and there have always been enough unlocked rows available to satisfy that.

Apologies for the noise.

shawn-butler · on July 10, 2017

This seems quite false. NOWAIT would error in the face of locked rows. SKIP LOCKED skips without error.

"With NOWAIT, the statement reports an error, rather than waiting, if a selected row cannot be locked immediately. Note that NOWAIT applies only to the row-level lock(s) — the required ROW SHARE table-level lock is still taken in the ordinary way"

random_comment · on July 9, 2017

I misunderstood what the heck the article was about at first. I hope this quirky analogy may help:

"I am implementing a database for a bank.

I open a transaction in a special active_customers table when the customer first joins the bank, and commit the transaction when they close their account or die. In this way, I use postgres transactions to keep track of which customers are active customers.

I am having problems with long-living customers forcing us to wait up to 120 years between autovacuums of the DB."

Clearly, that's not going to be a good way to do things.

Rossers comment (https://news.ycombinator.com/reply?id=14731214&goto=threads%...) is very helpful.

random_comment · on July 9, 2017

> As dead tuples continue to accumulate in the index, Postgres enters a hot loop as it searches the B-tree, comes up with an invisible tuple, and repeats the process again and again, surfacing empty-handed every time. By the end of the experiment illustrated in the charts above, every worker trying to lock a job would cycle through this loop 100,000 times.

1. I am curious if the author tried setting the autovacuum more appropriately for a table that is in continuous use at 50 new rows/second.

This article (http://okigiveup.net/postgresql-vacuuming-an-introduction-fo...) suggests:

ALTER TABLE person SET (autovacuum_vacuum_scale_factor = 0.0); ALTER TABLE person SET (autovacuum_vacuum_threshold = 4000);

This should result in entries being removed from the index and table regularly.

2. If I understand correctly, it seems the other part of the problem is that there are lots of rows to be individually tested to see if they can be locked, based on the way the author has implemented their queue. While lock attempts are being made on a set of rows, they aren't being deleted, since they're 'in use'. This non-deletion causes things to get out of control.

As far as I know, 9.5's SKIP LOCKED is meant to help address that (https://blog.2ndquadrant.com/what-is-select-skip-locked-for-...). Though it appears to be doing something kinda similar under the hood.

If the author of the original article (@ brandur.org) is following this discussion I would be grateful if they could evaluate these two options in the context of their test implementation.

rosser · on July 9, 2017

The problem with the first point is that autovacuum can't mark "dead" tuples as "dead" if there are open transactions that are "older" (in MVCC terms) than those tuples, no matter how aggressively it's tuned.

"SKIP LOCKED" (or, pre-9.5, "NOWAIT") will definitely help with the locking issue, but not with the bloat.

random_comment · on July 9, 2017

Thank you for a good summary.

The real problem in my understanding was that I wasn't seeing that the article author REALLY IS using long-running PG transactions to as a way to enforce external queue items running to completion (or not). I know they say it in the article, indeed it's the point of the article, but it seems so strange.

Hence he's getting these long-running transactions in PG in the first place. The phrase 'sledgehammer to crack a marshmallow' comes to mind.

At 50 items/second I guess I'm totally baffled over why you wouldn't simply use e.g. an exclusive table lock each time you connect to the DB to add/take/remove a task from a task table, with a timestamp to allow aborting and rescheduling of tasks.

I just ran a test to check the performance of exclusive table locking for this purpose, and despite the slowness of having everything dumping to the console while running, and the slowness of setting up a completely new psql session for each connection (i.e. no pgpool etc), using a BASH script, and running on a crappy 6-year-old mac, I got over 100TPS.

craigkerstiens · on July 9, 2017

Dropped the original author a note, unfortunately he's not near a machine at the moment, but does plan to check back in when he can.

random_comment · on July 9, 2017

Thanks craig

kevinburke · on July 9, 2017

This is a really great summary of the problems with queues in Postgres.

I ended up working around this by moving jobs to an "archived" table when they were complete, so the "hot" table never gets too large, and by avoiding long running transactions - we update jobs to in-progress, and then the worker is responsible for marking them as "complete" or "failed". If the worker crashes, a "stuck jobs" background worker will mark jobs as failed after a timeout.

https://github.com/Shyp/rickover

codesnik · on July 9, 2017

In one of my projects I was going to use something like that: "update ... skip locked" was used in isolated short transaction, which would just update that job with "taken" flag and timestamp and commit that. "stuck jobs" worker was meant to fire as rare as possible for cases of unhandled failures, but was still there.

elmigranto · on July 9, 2017

Here's the one using similar approach, but for Node.js.

https://github.com/timgit/pg-boss

atombender · on July 9, 2017

Whenever this comes up, I argue for moving the queue orchestration out of Postgres. Everyone will be happier that way.

By this I mean that you keep your jobs in Postgres, but instead move the ordering -- the physical queue data structure -- out into something like RabbitMQ or Redis.

To schedule a job, create a job row, then push its ID to Rabbit or whatever. To process the next job, grab the next ID, look up the job row, then do the processing.

That gives you the best of both worlds, without ever running into issues with locking and dead tuples.

(There is of course a race condition here where you may create a job but not publish it (due to Rabbit temporarily failing, for example). But there are ways around that.)

andrewvc · on July 9, 2017

I would argue that is not a pragmatic approach. The article makes a good point about the huge benefits of this approach for relatively small workloads. Transactional guarantees are hugely useful! Being able to rollback something like a user create including a queued job to send out a welcome email is incredibly nice.

IMHO HN conversations tend too much toward solving problems at google scale. We often forget here that there are tons of high-value business critical systems that generate mere 10s, hundreds, or a few thousand background jobs a day. Given the numbers quoted in the article, DB backed queue systems scale quite far. I've successfully used them to handle tens of thousands of events per day with no problem over a timespan of years. It's clear that this is a great approach for some situations, and that one size fits all doesn't make sense here.

joncrocks · on July 10, 2017

As a counterpoint, I would say "is it pragmatic to solve complex database issues in the middle of the night?" or "Do you prioritise fire safety or fire fighting"

Using the wrong tool for the job will sometimes be pragmatic. But sometimes it will burn you, burn you hard, and it'll always be in the middle of the night (note the article's graph starts at midnight :-)). As long as you accept that, all is good in the world!

What often happens though is that people set out with a simple aim, and rather than spending a litle bit of time honesting looking at what's out there, they re-invent the wheel. Because "How hard can it be?"

lozenge · on July 9, 2017

> Transactional guarantees are hugely useful! Being able to rollback something like a user create including a queued job to send out a welcome email is incredibly nice.

That would work in this case - the job is added to the postgres table in a transaction, which eventually rolls back. The ID of the job was put into a separate queue. When the queue consumer tries to get the job from the database by ID, it doesn't see it because the transaction never committed. It can discard that job ID from its queue.

andrewvc · on July 9, 2017

Yes but then you need to support a whole additional piece of infrastructure and you have to write and support more code. For a lot of use cases that's a waste of time and money. This is a case of YAGNI for the use cases I mentioned.

ruslan_talpa · on July 9, 2017

one option to push events (jobs) into RabbitMQ using NOTIFY is this little thing https://github.com/subzerocloud/pg-amqp-bridge

memracom · on July 10, 2017

When will people stop trying to build queuing systems on top of a database and use a queuing service. Amazon's SQS or RabbitMQ or one of dozens of other tools that are specialized for message queuing. A job queue is just a message queue in which the messages hold a job.

ZeroMQ is the thing to use to roll your own queuing service, not a database server whether SQL or NoSQL. Databases are not queueus and database servers are not designed with queues as any kind of important priority when features are implemented.

Traubenfuchs · on July 10, 2017

At the exact point where you don't need to learn another technology/framework and don't need to install yet another server side program (broker) anymore.

It's another independent part that needs to be installed, configured and maintained. There are several "database as queue" approaches that work perfectly fine and none of them requires more than knowledge in your programming language of choice and basic SQL.

hvidgaard · on July 10, 2017

Until you hit the roadblock that is operations. Some times you cannot get to install new tools willy nilly, and using an existing tool is prefered. There is nothing wrong with using a database for this, and under the hood, that is more or less what a message queue with persistence does anyway.

ris · on July 9, 2017

This turns into a very interesting dissection of postgres' internal workings, but it's hard to get over the beginning's massive caveat of using a RDBMS as a queue.

CodeWriter23 · on July 9, 2017

> storing jobs in a database allows a program to take advantage of its transactional consistency; when an operation fails and rolls back, an injected job rolls back with it

This is a false premise for using a database as a queue. The same goal can be simply achieved with a different queuing system, say SQS, by only enqueuing jobs AFTER successful transaction commit.

IgorPartola · on July 9, 2017

What happens when the enqueuing fails after a successful commit?

CodeWriter23 · on July 10, 2017

Retries.

IgorPartola · on July 10, 2017

So you retry indefinitely? How do you keep track of your retries? Do you have a queue to get the items into the real queue?

One solution to this is two phase commits, but I am not aware of any message broker that supports that. So basically you are stuck assuming that your message broker is never going to fail, or at least not going to fail for long enough for the users to notice.

elmigranto · on July 9, 2017

IIRC, with Postgres, you can write to Redis directly from SQL and as part of transaction. (See `redis fdw` on google.) Perhaps, there is writable FDW driver for your queue system.

(Or maybe you can just dumpt stuff to sockets by using Postgres function written in Python, JS, Perl or w/ever you prefer.)

ruslan_talpa · on July 9, 2017

How do you queue that job only after the transaction committed?

sorkin2 · on July 9, 2017

The same way the new built-in logical replication works: logical decoding.

ruslan_talpa · on July 9, 2017

this means it does not work on RDS

sorkin2 · on July 10, 2017

Tough.

rosser · on July 10, 2017

This is the kind of comment that makes me wish I could burn some of my own karma to downvote more than once.

CodeWriter23 · on July 10, 2017

Check the return code from libpq.