Hacker Newsnew | past | comments | ask | show | jobs | submit | ahachete's commentslogin

I have mentioned this before, but here it goes again:

I'm really happy that there's more options for Postgres sharding and I applaud Pgdog and the team's efforts and energy.

Having said that, this makes it a no-go for me:

> shard_number = hash(data) % num_shards

https://docs.pgdog.dev/features/sharding/basics/#terminology

Most sharding solutions distribute the hash value over linear ranges, that then split across "virtual shards", that are then placed on the physical shards or worker. This allows for shard replacement when needed. For example, Citus works this way, and even adds convenience functions for shard migration (using logical migration) in an automated way. That's all I'd need.

Operationally, it's worlds apart. With modulo distribution the only way to replace data is to reshard everything --something you don't want to do however fast the operation may be.


Yeah good callout. We'll add rendezvous soon enough. Until then, being compatible with Postgres partitions has been advantageous -- while we build everything out, people were able to migrate to PgDog for the query routing layer while doing the resharding in Postgres.

Adding a sharding function in our architecture is relatively straightforward. We also support plugins which can control the flow (and direction) for queries, so our users can add their own (and they do!).


TBH I don't think it's that straightforward, I see it more of a notable architectural change. At a very high level, this means:

* Adding a sharding function, as you say.

* Developing an external service for metadata (shard placement) or alternatively have that metadata in one place and replicate (consistently!) to every query router.

* Implementing functions/catalogs for the users to understand the placement and configure/alter it.

* Implementing shard migration / rebalancing capabilities, possibly using Postgres logical replication (plus notable automation).

Here's one idea if you follow this path, something that Citus doesn't have: make the sharding function pluggable and pick one by default which is well-known and available in many languages (e.g. xxhash). If you do so, and guarantee stability of those functions, they could be used externally (applications) to route queries / inserts especially to the appropriate shard. While it makes application more complex, it may allow (combined with access to the metadata service) for faster ingestion paths (this is often known as application assisted sharding), and its not exclusive of the query routers.

Edit: formatting


> having the ability to scale the coordinator past a single machine

Since Citus v11 (released 4 years ago), any worker node can also work as a "query router" (a node that you can query against [1], and works from this perspective as a pure coordinator:

> for very demanding applications, you now have the option to load balance distributed queries across the workers

You can also setup such query routers as dedicated nodes by setting the `shouldhaveshards` to `false`, becoming an effective coordinator (for querying; not for metadata operations).

So with Citus you can absolutely have as many query routers (coordinators if you wish) as you want.

[1]: https://www.citusdata.com/updates/v11-0/#metadata-sync

Edit: formatting, typo


> we find a Postgres server can handle up to 144K of these writes per second. That’s a lot, equivalent to 12 billion writes per day.

Based on the shown graph, this is misleading at best, essentially false. After 120K writes/s p50 spikes from 10ms to 1s (1 second for a write!!!!). That's two orders of magnitude latency spike, and an unacceptable one for an OLTP workload. It clearly shows the server is completely saturated, which is clearly a non operational regime. Quoting 144K is equivalent to quoting the throughput of a highway at the moment traffic comes to a standstill.

Based on this graph the highest number I'd quote is 120K. And probably you want to keep operating the server within a safe margin below peak, but since this is a benchmark, let's call 120K the peak. Because actually p50 is not even the clear-cut. It should be a higher percentile (say p95) at which latency is within reasonable bounds. But for the shake of not over complicating, it could be taken as a reference.

> We found that the bottleneck was in flushing the Postgres write-ahead log (WAL) to disk

Therefore, you are not measuring Postgres peak performance, but rather Postgres performance under the IO constraints of this particular system. Certainly, 120K IOPS is the maximum that this particular instance can have. But it doesn't show if Postgres could do better under a more performant IO disk. Actually, a good test would have been to try the next instance (db.m7i.48xlarge) with 240K IOPS and see if performance doubles (within the same envelop of p50 latency) or not. And afterwards to test on an instance with local NVMe (you won't find this in RDS).

> From [1]: > Postgres insert throughput > uv run python benchmarks/postgres_insert.py --rps 1000 --duration 300

300 seconds test duration?? This is not operational. You are not accounting for checkpoints, background writer, and especially autovacuum. Given that workflow pattern includes UPDATEs, you must validate bloat generation (or, equivalently, bloat removal) by a) observing much longer periods of time (e.g. 1h) and b) making sure the autovacuum configuration (and/or individual table vacuum configuration if required) makes bloat contained in a stable way. Otherwise, shown performance numbers will degrade over time, making them not realistic.

Only after proper autovacuum tuning and under the effects of bgwriter (also tuned!) and checkpoints (all required, especially for write intensive workloads), over much larger periods of time, the benchmark could be considered meaningful.

> We next measure the scalability of Postgres-backed queues.

I'd recommend benchmarking the recently announced PgQue project, that operates bloat-free (one of the largest operational hurdles in queue-like Postgres workloads). See [2] for a previous discussion in HN.

[1]: https://github.com/dbos-inc/dbos-postgres-benchmark [2]: https://news.ycombinator.com/item?id=47817349


> Based on the shown graph, this is misleading at best, essentially false. After 120K writes/s p50 spikes from 10ms to 1s (1 second for a write!!!!). That's two orders of magnitude latency spike, and an unacceptable one for an OLTP workload. It clearly shows the server is completely saturated, which is clearly a non operational regime. Quoting 144K is equivalent to quoting the throughput of a highway at the moment traffic comes to a standstill.

> Based on this graph the highest number I'd quote is 120K. And probably you want to keep operating the server within a safe margin below peak, but since this is a benchmark, let's call 120K the peak. Because actually p50 is not even the clear-cut. It should be a higher percentile (say p95) at which latency is within reasonable bounds. But for the shake of not over complicating, it could be taken as a reference.

You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.

> Therefore, you are not measuring Postgres peak performance, but rather Postgres performance under the IO constraints of this particular system. Certainly, 120K IOPS is the maximum that this particular instance can have. But it doesn't show if Postgres could do better under a more performant IO disk. Actually, a good test would have been to try the next instance (db.m7i.48xlarge) with 240K IOPS and see if performance doubles (within the same envelop of p50 latency) or not. And afterwards to test on an instance with local NVMe (you won't find this in RDS).

I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.

> 300 seconds test duration?? This is not operational. You are not accounting for checkpoints, background writer, and especially autovacuum. Given that workflow pattern includes UPDATEs, you must validate bloat generation (or, equivalently, bloat removal) by a) observing much longer periods of time (e.g. 1h) and b) making sure the autovacuum configuration (and/or individual table vacuum configuration if required) makes bloat contained in a stable way. Otherwise, shown performance numbers will degrade over time, making them not realistic.

Those are usage examples (notice the 1000 rps)--actual benchmarks were run at and were stable at much longer duration.


> You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.

I disagree. It's worthless a number at saturation. Because "a slightly lower throughput" is at best an unqualified hand-waving. Real numbers can be quite far from that saturation point.

Quote instead real production numbers. You can define them clearly, it's not that hard. E.g.: p95 below 10ms latency. That's it. Measure and report that number.

> I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.

But those would be interesting numbers to share! "Doesn't improve performance significantly" --sorry, I'm not big friend of unqualified data points. Is it 10%, 20%, 50%? And definitely, when measured at saturation, surely you don't see improvements. But if measured at an operational regime, you should probably see notable improvements (unless other scaling factors start to dominate, in which case your benchmark becomes much more meaningful because then you are finding Postgres scaling limits and not just the limits of the disk on which it's running). Changes the picture dramatically.

> Those are usage examples (notice the 1000 rps)--actual benchmarks were run at and were stable at much longer duration.

Sorry, but if you use that as an example, gives me little confidence about the real intent. But glad to hear you run at longer duration --add that information to the post! But again, that's not enough. Show the bloat and demonstrate how stable it is, given the tuning required to keep it contained, of course. Also show the tps over time --I'm sure it drops notably in the presence of checkpoints-- and then the "under 10ms latency at p95" will become dominated by write performance during checkpoints.

Because when you determine your SLOs, it's not at the happy path, but the opposite. And saying "Postgres can do 144K writes/sec on this machine" is beyond the happy path, so it's not meaningful for me.


Yeah, I know. Welcome to the club :(

https://x.com/ahachete/status/2035783292549755228


IMO transparent proxies for observability are not the best pattern. And I speak from experience, we developed the Postgres plugin for Envoy [1], [2] and we use it in StackGres [3], among others, for this very same reason, observability.

There's two main problems with said proxies:

* Latency. Yes, yes, yes, they add "microseconds" vs "milliseconds for queries", and that's true, but just part of the story. There's an extra hop. There's two extra sets of TCP layers being traversed. If the hop is local (say a sidecar, as we do in StackGres) it adds complexity in its deployment and management (something we solved by automation, but was an extra problem to solve) and consumes resources. If it's a network hop, then adds milliseconds, and not microseconds.

* Performance. It's not that hard to write a functioning PG wire proxy (it's not trivial either). But it is extremely hard to make it perform well under high load scenarios. Most of the proxies I have seen crack down under moderate to high performance.

What's the solution then? The Postgres extension model to capture the metrics (we also experimented with eBPF, but it causes too many kernel-user space context switches when you can do the same in an extension without them), and a small sidecar to push the metrics out via a standardized protocol like OTEL.

[1]: https://www.envoyproxy.io/docs/envoy/latest/configuration/li...

[2]: https://www.cncf.io/blog/2020/08/13/envoy-1-15-introduces-a-...

[3]: https://stackgres.io

Edit: formatting


I get what you're saying about a proxy like this, latency & performance would suffer, however minor, and in production DB land this really matters.

I've just not sure it is much of a slight on such proxies.

You don't need to run this always inline in production to get amazingly useful results. Yes, there are lots of production insight solutions out there, but lots of modern stacks can be complex enough that just getting a quick handle on how the page you're debugging talks to your DBs can be incredibly useful, which is where I love the idea of a solution like this.

Sure, it is mytop / pgtop, but trying to offering it at a different layer & with a modern interface. Seems useful to me.


> The Postgres extension model to capture the metrics (we also experimented with eBPF, but it causes too many kernel-user space context switches when you can do the same in an extension without them), and a small sidecar to push the metrics out via a standardized protocol like OTEL.

The extension model is great, but it doesn't work with existing postgres providers (RDS, Aurora, etc.). Unless one such extension becomes standard enough that all providers will support it. That would be ideal, IMO.

To be clear, I don't mean pg_stat_statements, that is standard enough, but an extension that pushes the actual queries in real-time.

> If it's a network hop, then adds milliseconds, and not microseconds.

Are you talking about connection establishing time or for query delay? I think it should normally be under a millisecond for the later.


> The extension model is great, but it doesn't work with existing postgres providers (RDS, Aurora, etc.). Unless one such extension becomes standard enough that all providers will support it. That would be ideal, IMO.

That's true, but that's a problem of the PGaaS providers for them to fix by providing the best functionality available. I'm planning on following this route in a pure OSS basis.

> Are you talking about connection establishing time or for query delay? I think it should normally be under a millisecond for the later.

Network trip. If the proxy is not co-located with the database but rather a network hop away, that's usually adding at least 1ms there, could be more.


Postgres can really scale well vertically (and horizontally for read-only workloads) as the post shows.

However, I'm still surprised about the reasons for not sharding. They have been mentioned before, but I haven't seen a substantial rationale.

Sharding is almost only analyzed from the perspective of write scaling. But sharding may not only be about write scaling, but a path to reducing blast radius. And this is something that triggers much earlier than write scaling needs (especially given how well Postgres scales vertically and reads).

When you shard your database, you end up having N clusters (for HA purposes, each "shard" must be a primary-replica(s) cluster itself), each holding 1/Nth of the data.

There are certain scenarios which, while unlikely, may hit you: data corruption in the WAL replication stream, a problem during a major upgrade, a situation that requires a whole cluster restore from a backup, you name it. For those cases, the whole cluster may experience notable downtime.

If you have a single cluster, 100% of your users experience downtime. If you sharded into N clusters, only 1/Nth of your users experience downtime. For a company servicing 800M users the difference from both scenarios is dramatically different. Even for much much smaller companies.

I'm puzzled why this is not perceived as a risk, and if it is not, how it is mitigated.

While I wouldn't advocate to shard "too early", given that it comes with notable caveats, I believe more and more in sharding your workloads when possible more earlier than later. Way before truly needing it from a write scaling perspective. Because apart from reducing the blast radius, it applies implicitly the principle of "divide-and-conquer", and your problems become much more manageable (your number of connections per cluster decreases at will, backup restore times can be a fraction of the time, logical replication can be considered as a true option for replication/upgrades/etc if you keep shards relatively small and many other operational procedures are greatly simplified if now you have much smaller databases, even if you have many more of them).


Microsoft originally bought CitusData and rebadged it as Azure CosmosDb for Postgres Cluster. Microsoft have been recommending partners to now avoid that product. It does not and will not support Entra federated workload identities (passwordless).

The replacement will be Azure Database for Postgres with Elastic Clusters. I think it is still in preview.

Again it’s Citus based, but without the CosmosDb badge and it will support federated workload identities.

https://techcommunity.microsoft.com/blog/adforpostgresql/pos...

https://learn.microsoft.com/en-us/azure/postgresql/elastic-c...


I'd also try it, but I'm Linux only ^_^


(thinking about cross-platform version, recalling my Java past)


Java (modern Java) would be awesome!


There's a fundamental disconnect: OP refers to senior engineers being replaced with AI, whereas the evidence and logical reasoning points much more to junior engineers being replaced by AI. And that premise seems like a quite plausible one...


>OP refers to senior engineers being replaced with AI, whereas the evidence and logical reasoning points much more to junior engineers being replaced by AI.

If industry cared about future seniors, they'd invest in juniors. But that's not what's happening. AI will effectively replace seniors in 20 years with the current trajectory. Whether or not that replacement is adequate or not is the bigger question.


I think the junior thing started ~24, early ~25. Because back then the level of the current models was at or above that level, with somewhat flaky reliability. In the past year that's changed. We are now at "mostly reliable" in any junior-level stuff, and "surprisingly capable, maybe still needs some hand-holding" at advanced / senior-level stuff. And somewhat super-human if the problem is easily verifiable in a feedback loop (see the atcoder stuff).


Not my experience.

It's junior level coding and maybe senior level advising, but even then only when clearly directed with the right questions and guardrails.

As an autonomous thing? Junior at best.

That's why I think it's extremely helpful for seniors: with proper guidance, it really boosts your productivity, writing notable parts of the code.


> "As part of the acquisition, we made an important decision: not only to keep PeerDB open source"

> "Here goes the Open Source reference to our validation logic."

> "PeerDB Open Source Repository"

I hate to be that guy, but PeerDB seems to be governed by the Elastic License [1] which makes it NOT open source.

The difference is not small, but significant for many. For one, it won't get integrated into other OSS projects.

In my particular case, we have integrated Debezium Embedded into StackGres, as a high level object (CRD) named SGStream [2]. It allows Postgres logical replication from Postgres and exports to another Postgres and/or CloudEvents. No Kafka required. We'd love to consider other alternatives like PeerDB, but not being OSS is a red line we can't cross (having said that, we're really happy with Debezium in general, but having competition and alternatives it's always great).

[1]: https://github.com/PeerDB-io/peerdb/blob/main/LICENSE.md

[2]: https://stackgres.io/doc/latest/reference/crd/sgstream/#sgst...

[edit: formatting]


(this is Sai, the author of the post and also PeerDB co-founder)

The wording in that post was an unintentional miss on my part. Apologies for that. We’ve just fixed it. Thanks for flagging it!

To add some context, PeerDB was originally released under ELv2 well before the acquisition. During the acquisition we made a choice to keep the project as-is rather than change its license, so this wasn’t a new decision made at that time — just continuity with how the project already existed.

We appreciate the feedback, around integration and downstream OSS adoption. That overall makes sense. We’ll take it into account as we think about licensing going forward.

Separately, I really wish you tried PeerDB out. The ease-of-use and performance around larger Postgres datasets (TBs to 10s of TB) would’ve been something you would have probably appreciated. That is something we optimized a lot on over that last few years. May be sometime in the future! :)


Thank you for acknowledging this and updating the blog post correspondingly.

I'd love to test and compare PeerDB with Debezium (Embedded), and even SynchDB. But as said, the licensing is a blocker for us. And given the focus and bandwidth we currently have, we won't have the chance to deeply look at it unless there's a high chance we could integrate it into StackGres.

Anyway feel free to DM me if you'd like to talk more.


This is a fair point. ELv2 is source-available and doesn’t meet the OSI definition of open source.


> I hate to be that guy

I think it is more ClickHouse Marketing being that guy; they have a vaguely aggressive feel to them and slightly-questionable claims like that seem on-vibe to me. Although it is tolerable. Selling databases is hard, the specialists who actually understand the trade-offs are so specialised they usually aren't the person who makes the call on what to use. At least they're selling an interesting [0] DB, Clickhouse has a fun design. They don't mislead anyone who is interested in the details and their documentation is in the end rather detailed.

[0] https://clickhouse.com/docs/academic_overview


Appreciate the feedback here. This wasn’t marketing — just an unintentional miss on my part. https://news.ycombinator.com/item?id=46392372


I have two Framework 13 DIY, top of the line (Ryzen AI 9 HX 370, 96 GB of RAM, 2TB WD BLACK SN850X). Well, TBH just one, the other one was stolen with less than one month, but that's a different story.

Can't be happier with this choice.

First of all, now I feel confident I can easily tinker with my laptop, open it and replace any components. No, I'm not repeating their Marketing, I'm truly confident now. Some months ago my previous Lenovo X1 Carbon stopped charging and I was scared as hell. Sure, I have everything in version control and cloud drives blah blah blah but if it doesn't charge, it's dead. Sure, I can extract the NVMe (is it easy at all?) and rsync the data (for a faster recovery) but is that easy, feasible, can I do it on the go (X1 stopped charging while I was on a trip).

The Framework 13 DIY I built it in less than 10 minutes. I can easily disassemble again in a heartbeat. This is priceless.

Other than this, build quality is higher than expected, and several other people said exactly the same when I showed my unit. Powerful? Most than any desktop out there. Perfectly portable. Works well OOB with Linux. What else do I need? Nothing, it's my laptop and will be the next one.

(Actually the next one may be cheaper, as I may only need to replace the motherboard, we will see)


Writing this on a first-gen 13 DIY, so it's already a bit dated. I've already added additional RAM and swapped in a much bigger SSD (really underestimated how much space you need if you heavily use docker and VMs).

The 1st year was a bit bumpy with 4k monitors over a USB-C hub being somewhat flaky. Ever since a clean Ubuntu reinstall, I'm very happy, no complaints whatsoever.

Sure, it costs more, but the combo of perfectly running Linux, giving me the piece of mind of repairability and easy upgrades for me justifies a higher price.

On the other hand, I'm not willing to pay the kind of premium you have with Apple products, where for incremental steps in more RAM or SSD you pay a multiple of the off-the-shelf price of the added space.


You do realise you can upgrade RAM and SSDs, as well as replace most components on ThinkPads too, right ?


"Most components" seems a bit of an understatement when compared with the Framework.

Sure, some components can be replaced. And not at the same cost (opening and manipulating the Framework vs the ThinkPad). But not all, like the motherboard.

I may have built multiple dozens of computers in my life, so it's not that I'm new to this world.


Replacing the motherboard is just not an important thing, and that's demonstrated by how Framework has remained a very niche product.


I've been a very happy FW16 user, so much so that I got my partner an FW13 to replace her very old and very bulky gaming laptop. Similar specs to yours + AMD dGPU module. Recently cleaned the innards and upgraded my mainboard to the HX 370 in ~20min with zero issues. Compared to what I've been reading on various forums/threads, my experience as a whole has been virtually free of problems! FW + NixOS is a potent combo that squarely satisfies the kind of configurability/overall experience I need/want in the tech I use. (And perhaps more importantly, it handily assuages my paranoid mind.)


What's the battery life you're getting out of the Ryzen AI 9?


TBH I haven't deeply tested it yet. Laptop is a few days old and I'm still "automating the install" (Ubuntu autoinstall + Ansible tasks for post-installation including BIOS upgrade and install Nix + Nix and home manager to install everything else; post(s) upcoming), so I haven't done "production use" yet.

Anyway, and while I love long battery life, it's not my main concern. Most of the time I have a power socket available and/or a nice portable battery pack that does the job. Laptop feels so much faster than my X1 Carbon that everything else seems to be a distant second feature.

P.S. Hi, Yorick, again, not the first time we cross paths ;)


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: