I've worked on moderately busy backend platforms (~10K-20k rps handled on a ~4 e5-2650 and aiming for 5ms 95p response times).
It greatly depends on what you're doing, but for the majority of systems which are read heavy (and that most certainly includes "dynamic" sites like Amazon or Wikipedia), I hold to two major beliefs:
1 - Have very long TTLs on your internal cache servers with a way to proactively purge (message queues) and refresh in the background. Caching shouldn't be a compromise between freshness and performance. Have both!
2 - Generate message/payloads/views asynchronously in background workers and have your synchronous path as streamlined as possible (select 1 column from 1 table with indexes filters). Avoid serialization. Precaculate and denormalize. Any personalization or truly dynamic content can be done:
1 - By having the client make separate requests for that data
2 - Merging the data into the payload with some composition.
3 - Glueing bytes together (easier/safer with protocol buffers than json)
Do things asynchronously. Use message queues / streams.
Beyond that, GC becomes noticeable. For example, Go's net/http used to (might still) allocate much more than other 3rd party options.
It’s susceptible to Thundering Herd whereby more requests come in for the same cache key before the initial computation is finished, and so you end up with lots of cache misses. The fix is usually to lock the cache key and have subsequent requests wait on the original computation but it’s a bit more complex to code.
Just be careful to lock the cache key, and not the cache. I've seen the latter done often, which of course kills performance if there are cache misses for multiple keys at the same time.
It depends on the backing, if you're using just memcached there's no way to completely lock the key. None of the popular rails caching frameworks I've seen handle this either.
By default it handles the case of concurrent retrieval on the same key (the second one will just wait for the first one to finish and use that value rather than starting a duplicate computation). It also lets you configure more interesting things like eviction strategies, removal notifications, and statistics.
Last year was a lesson for me in why caches are a hard problem as I had to debug many cache issues from other people not thinking things through... (At least one of the issues was my own fault. :)) Since then whenever someone suggests we use a cache I instinctively pull out a set of questions[0] to ask. The three questions Guava has you consider can also lead you to using memcached or the like instead but my set tries to answer the question "Do you even need a cache?" and if so, generating helpful design documentation.
Is the code path as fast as it can possibly be without the cache? Do you have numbers?
Will the cache have a maximum size, or could it grow without bound?
If it grows without bound, either because of unbounded entries or because of unbounded memory for any particular entry, under what conditions will it consume all system memory?
If it has a maximum size, how does it evict things when it reaches that size?
Are you trying to cache something that could change?
If so, how is the cache invalidated?
How can it be invalidated / evicted manually by another thread or signal? (Debuggability, testability, inspectability/monitorability, hit rate and other statistics?)
Is there a race condition possibility for concurrent stores, retrieves, and evicts?
How constrained are your cache keys, that is, what does it need to know about to create one?
Do they need to take into account global info?
Do they need to take into account contextual information (like organization ID if your application server runs in a multi-tenant system, or user ID, or browser type, or requested-language)?
Or do they only depend on direct inputs to that code path?
To frame it another way, what happens if 100 requests come in at the same time for that expensive value when it isn’t in the cache yet? The expensive calculation will be run 100 times at the same time.
Ideally, you’d rather refresh the cache value in the background once and never allow duplicate requests for it from the web.
If you’re running a language that makes it easier to deduplicate requests for certain data, the original approach will last longer. The CacheEx library in Elixir, for example, will only run the expensive calculation once, set the cache and then send the value back to everything that requested it while it was loading.
CacheEx sounds interesting. Basically a debounce. Pretty sure you could solve this outside of the application layer with Varnish but that depends on how the view is composed. I prefer using a grace / stale period but that only works if it's acceptable to return stale data during computation instead of queuing it up.
Well CacheEx is just using functionality that comes naturally on the BEAM here. This is one of the reasons that a lot of CDN's like Cloudfront are written in Erlang.
CacheEx gets the ability to check for the presence of the cache key in Erlang Term Storage (ETS) which is basically an in-memory cache. If the key is present, it just returns the value.
If it's not, it sends checks to see if a process exists with the cache key name. If there isn't one, it creates one to request the resource.
For any other requests that come in until the value has been created, they will be directed to the process that is getting the value.
When the process that was calculating things comes back, it will save the value to ETS and then also send it back to all of the queued processes that have been waiting for it.
In the case of Varnish, you'd be expecting it to send back the entire completed view...HTML and all. This isn't something that you need to worry about with Elixir because the view is never actually rendered in the application. It's broken down into pieces that are never duplicated in memory and then replayed directly to the socket...meaning you really only ever need to cache expensive data and not what it's transformed into.
Here's a good read on why this view layer is so fast, if you're curious. Most people report shock that their uncached performance with Elixir and Phoenix is on par with statically cached HTML. I didn't believe it until I saw it myself.
The easiest workaround is to acquire a lock (e.g. redis redlock) before starting to refresh the cache. That way only one process will perform expensive_calculation(). A good caching library will generally do some variation on this for you.
For larger, more complex scenarios the techniques mentioned by GGP above work well, i.e. not doing expensive_calculation() in your app process at all.
The issue has a bunch of names, I know at least three: cache stampede, thundering herd and dogpile effect.
there's a race condition between checking if something is in the cache and actually putting it in the cache. A more correct solution would have all threads wait while one does the actual work.
This gets more complicated if you have a distributed cache and/or distributed application servers. The typical solution there is to allow at most one computation per process/device.
The amazing thing about 1 is that modern immutable database patterns get it for free. When things like https://www.datomic.com/ get mainstream the world is gonna be in a very different place. Not as in today's apps get faster, but as in they are so much faster that new kinds of apps become possible that couldn't have been dreamed of before. My startup http://www.hyperfiddle.net is an experiment in this space
I hope you can hire a designer and frontend engineer soon! I understand it's an experiment but the site looks 15 years outdated. In any case it seems interesting, good luck!
Or it could mean that the cost of cache invalidation is greater than the cost of allowing stale data to be read for the remainder of the TTL, or any other number of justifications for finite TTL. The point is, it depends on your application - while it's true that shorter TTLs can mask faulty invalidation logic, I don't think it's correct to say that "TTL should be infinite" with no qualifiers whatsoever.
If that's the desired behavior then cache layer should allow one to use stale (invalidated ) content. Cache layer (or application layers above it) should be aware of the status of cached data.
It's fine if stale data is used deliberately, problems arise when stale data is assumed to be 'fresh'.
True, caching even simple data dependant on 2-3 variables may require writing dozens of test cases - in practice, it often turns out that the data doesn't rely on 2-3 but 5-10+ variables.
However, just because it's hard it doesn't mean we shouldn't do it the right way.
Of course it does - negotiating trade-offs and accepting the set that best achieves your objectives is a large part of software design (and also undermines the notion of doing things a "right way").
Wouldn't deploying a less-than-precise TTLs be an appropriate trade-off for non-transactional caches, caches subject to network partitions, caches that can't enroll in invalidation messages, caches that can't poll for change sets, caches that can't implement eviction policy, etc?
Certainly the cache should be transparent about the trade-offs it has made (such as not promising authoritative data if it's accepting eventual consistency via TTL).
It's impossible to create perfect systems, so it always makes sense to give yourself an extra out to protect you from unanticipated defects (otherwise known as a "belt and suspenders" approach).
TTL in this context is "time-to-live" - "a mechanism that limits the lifespan or lifetime of data in a computer or network" (https://en.wikipedia.org/wiki/Time_to_live). E.g., the duration of a cache entry.
It greatly depends on what you're doing, but for the majority of systems which are read heavy (and that most certainly includes "dynamic" sites like Amazon or Wikipedia), I hold to two major beliefs:
1 - Have very long TTLs on your internal cache servers with a way to proactively purge (message queues) and refresh in the background. Caching shouldn't be a compromise between freshness and performance. Have both!
2 - Generate message/payloads/views asynchronously in background workers and have your synchronous path as streamlined as possible (select 1 column from 1 table with indexes filters). Avoid serialization. Precaculate and denormalize. Any personalization or truly dynamic content can be done: 1 - By having the client make separate requests for that data 2 - Merging the data into the payload with some composition. 3 - Glueing bytes together (easier/safer with protocol buffers than json)
Do things asynchronously. Use message queues / streams.
Beyond that, GC becomes noticeable. For example, Go's net/http used to (might still) allocate much more than other 3rd party options.