
This Squid vs. Varnish comparison is quite similar to the sync vs. async debate for network programming. In both cases, the question is: do I use some OS abstraction (Virtual Memory or threads, respectively) as my application's primary scheduling mechanism, or do I handle scheduling more explicitly at the application level?

Of course the OS guys like PHK or Linus think you should use the OS mechanisms. Linus hates O_DIRECT (http://kerneltrap.org/node/7563) and PHK is taking a similar tack with this article. Just let the OS handle it.

But there are real downsides to this approach. One is that it makes you far more dependent on the quality of the OS implementation. I'm sure PHK trusts FreeBSD and Linus trusts Linux, but if you're writing for cross-platform you might end up on a bad VM implementation. The last thing you want to tell your customers is that they have to upgrade or change their OS to get decent performance.

Also, the OS is by design a more static and less flexible piece of software than anything you put in user-space. What if you need something that your VM system doesn't currently provide? For example, how are you going to measure (from user-space) the percentage of requests that incurred a disk read? Disk reads are invisible with mmap'd VM. What if you need to prioritize your I/O so that some requests get their disk reads serviced even if there are lots of low-priority reads already queued? If you've bought whole-hog into an OS-based approach and your OS doesn't support features like this, you don't have a lot of options.

And while it's great in lots of cases that the page cache can be shared across processes, OS's don't have great isolation between processes using the page cache. If you run some giant "cp" and completely trash the page cache, your Varnish process is likely to take a latency hit. In a shared server environment, you want to be able to draw walls of isolation so that each user gets the resources that he/she was promised. A shared page cache is hard to fit within an isolation model like this, whereas an explicit cache in user-space works fine.

Think about the microkernel vs. monolithic kernel debate. Maybe monolithic kernels won, but it's still a good principle that if it can be left out of the kernel without loss of performance, it should. Why is it better to use an interface like VM than to use some user-space library that manages disk I/O? The kernel's one advantage is that it can handle page faults (and so can make a memory reference into an I/O operation), but that's also the property that makes it difficult to do good accounting of when you're actually incurring I/O operations.

One final thing to mention: if you're using VM in this way, things degenerate badly in low-memory situations. Since the pages of data are competing with pages of the program itself, the program can get swapped out to service data I/O. If you've ever seen a Linux box thrash with its HDD light flashing like mad, you know how bad things can get when memory is temporarily too scarce to even let programs stay resident. Using vast amounts of VM exacerbates this, because it makes your programs and your data compete for the same RAM.



No matter what the quality of your VM, in a scenario like the one phk describes, several I/O hits will always lose to the single hit where the VM (even a crappy one) pages the data in or out just once.

Doubling or even quadrupling your IO operations is very expensive.

In a situation such as the one for which this article is meant, you set things up in advance so you never get into a state where you start thrashing your disk: programs are allocated a fixed amount of memory, and a program that does not abide by that is considered faulty.

The thrashing situation you describe can happen on machines run with less rigid setups, but on a production server that you count on to serve a few billion files every day you can't afford the luxury of random scripts firing from cron and other niceties like that.

Custom kernel, very limited set of processes that you know are 'well behaved', as predictable as possible.


> No matter what the quality of your VM, if you're going to have several IO hits versus only the one where the VM (even a crappy one) pages the data in or out just once you will always be faster in a scenario like phk describes.

You can keep your own cache explicitly in user-space, and get multiple hits with a single load into RAM.

> but on a production server that you count on serving up a few billion files every day you can't afford the luxury of random scripts firing off CRON and other niceties like that.

In a data center where you have tens of thousands of heterogeneous jobs competing for thousands of machines, you can't afford the luxury of giving out exclusive access to a machine. You have to have good enough isolation that multiple jobs can run on the same machine without impacting each other negatively. As CPUs get more cores this will become even more important.


The whole point of this article - and it is a very good point - is that keeping your cache in user-space is not the right way to approach the problem. And you can get multiple hits anyway if you make sure that data that will expire together will end up in the same page.

Your other description does not match the use case of a production web server running varnish instances as the front-end.


The whole point of my comment is that PHK's analysis leaves out many downsides of leaving it all to the kernel.

His main argument against doing it in user-space is that you will "fight with the kernel's elaborate memory management." But if you just turn off swap completely and read files with read/write instead of mmap(), there is no fight. Everything happens in user-space.

Leaving it all to the kernel has many disadvantages as I spent many paragraphs explaining.


I missed the 'if you just turn off swap completely' bit in the paragraphs above.

Edit: even on re-reading it all I can't find it.


> do I use some OS abstraction (Virtual Memory or threads, respectively)

Bzzt! You got your analogy backwards.

Using a thread pool is doing it yourself from cross-platform primitives that work on even the shittiest UNIX-wars-era platform, and an event loop using epoll/kqueue is the modern pure OS abstraction.


Read the second-half of the sentence you quoted: "as my application's primary scheduling mechanism." Using O(requests) threads leaves the OS in charge of scheduling CPU tasks, just as using VM leaves the OS in charge of scheduling I/O.

> Using a thread pool is doing it yourself from cross-platform primitives that work on even the shittiest UNIX-wars-era platform, and an event loop using epoll/kqueue is the modern pure OS abstraction.

You are very confused. First of all, epoll/kqueue are just optimizations of select(2), which first appeared in 4.2BSD (released in 1983). No standard interface for threading on UNIX appeared until pthreads was standardized in 1995.

But all of this assumes that my argument has anything at all to do with history. It does not. The question is whether you are leaving the OS in charge of scheduling decisions or not.

With select/poll/epoll/kqueue/etc, the OS wakes you up and says "here is a set of events that are ready." It does not actually schedule any work. The application gets to decide, at that moment, what work to do next.

Contrast this with O(requests) threads or VM. If several threads are available to run, the OS chooses which one will run based on its internal scheduling policy. Likewise with VM, the OS is responsible for scheduling pages of RAM and when they will be evicted, based on its own internal logic and policy. This is what makes them higher-level primitives.


> the program can get swapped out to service data I/O

mlock() and friends can help for server applications.



