It’s usually not the right call to use a hash table like you’re describing. You’re not saving any work (all keys have to be migrated sooner or later) and by doing it like this you incur a non-trivial amount of overhead on most operations.
Hash tables generally work on the assumption that operations are O(1) in the amortized sense: individual inserts might be expensive because of rehashing, but in aggregate they are very cheap. In practice, this is almost always a fine assumption, and it is how the vast majority of hash tables in all programming languages work.
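For concreteness, a minimal sketch of that standard approach (separate chaining plus a doubling resize once a load factor is crossed; all names here are made up for illustration, not taken from any particular library):

```python
# Minimal separate-chaining hash table with a doubling resize.
# Purely illustrative; names are not from any particular library.

class ChainedHashTable:
    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # update in place
                return
        bucket.append((key, value))
        self.size += 1
        # The occasional O(n) rehash is paid for by the many cheap inserts
        # before it, which is what "O(1) amortized" refers to.
        if self.size / len(self.buckets) > self.max_load:
            self._rehash(2 * len(self.buckets))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self, new_capacity):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(new_capacity)]
        for k, v in old_items:
            self._bucket(k).append((k, v))
```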
The kinds of hash tables you are talking about have their uses in places where you have soft real-time constraints, but that is honestly a pretty niche use case.
iirc memcached either uses or used to use https://en.wikipedia.org/wiki/Linear_hashing to avoid rehashing induced pauses. it definitely makes sense for that use case but you're correct that it's not the right call for a general purpose hash table implementation.
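For anyone curious, a rough sketch of the linear hashing idea from that article (not memcached's actual code, and all names here are illustrative): the table grows one bucket at a time, and each growth step redistributes only the single bucket being split, so no insert ever pays for a full rehash.

```python
# Rough sketch of linear hashing per the linked article; not memcached's
# actual implementation. The table grows one bucket at a time, and each
# growth step redistributes only the one bucket being split.

class LinearHashTable:
    def __init__(self, initial_buckets=4, max_load=0.75):
        self.n0 = initial_buckets      # bucket count at level 0
        self.level = 0                 # completed doubling rounds
        self.split = 0                 # next bucket index to split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.size = 0
        self.max_load = max_load

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 * (1 << self.level))
        if i < self.split:
            # Buckets before the split pointer have already been split this
            # round, so they are addressed with the next level's hash.
            i = h % (self.n0 * (1 << (self.level + 1)))
        return i

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for j, (k, _) in enumerate(bucket):
            if k == key:
                bucket[j] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._split_one_bucket()   # incremental growth: one bucket per step

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _split_one_bucket(self):
        # Add one new bucket and redistribute only the entries of the bucket
        # currently pointed at by the split pointer.
        self.buckets.append([])
        old_entries = self.buckets[self.split]
        self.buckets[self.split] = []
        self.split += 1
        if self.split == self.n0 * (1 << self.level):
            self.split = 0
            self.level += 1
        for k, v in old_entries:
            self.buckets[self._index(k)].append((k, v))
```

The trade-off is a slightly more involved addressing function, and mid-round the already-split buckets are spread over twice as many slots as the not-yet-split ones.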
Programs have different phases and jitter might be relevant only in some of them.
Imagine hash tables used to store configuration data. You don't care about the performance of the "reload config" operation, but once the config is loaded and the worker is handling traffic, read operations must be fast.
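And if you control the phases like that, you don't necessarily need a special hash table at all. A sketch of the pattern, assuming a plain dict and made-up names: do all the building (and therefore all the rehashing) during reload, then swap the finished table in, so the read path never sees a resize.

```python
# Illustrative sketch (names made up): pay all the hashing and resizing
# during the reload phase, then publish the finished table with a single
# reference swap, so the hot read path never hits a resize.

class ConfigStore:
    def __init__(self):
        self._config = {}                      # current, fully built table

    def reload(self, raw_items):
        # Slow phase: building the new dict does all the growing/rehashing.
        new_config = dict(raw_items)
        self._config = new_config              # swap the finished table in

    def get(self, key, default=None):
        # Fast phase: reads hit an already-sized table.
        return self._config.get(key, default)
```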
I agree. I’m rarely using hash tables for growing data sets. In my experience, I’ve been more inclined to redesign the algorithm or use a different data structure at the point where a migrating hash table makes sense.
> Hash tables generally work on the assumption that operations are O(1) in the amortized sense: individual inserts might be expensive because of rehashing, but in aggregate they are very cheap.
Hash tables aren't O(1) in an amortized sense; they are O(1) in a statistical sense. The O(1) complexity of hash tables depends entirely on the size of the underlying array, the method of collision resolution, and the quality of the (pre)hash function in relation to the distribution of data you're storing in the table. Whether you insert 1 item or 1 million items into the table, you're still getting O(1) computational complexity if the table is designed well. If not, you could be getting O(n) complexity in the worst case.
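A toy way to see that dependence on the hash function (BadKey and the timings here are purely illustrative): with a well-distributed hash the dict behaves as O(1) per operation, while with a constant hash every key collides and the same operations degrade toward O(n).

```python
import time

# BadKey is a deliberately terrible key type: a constant hash puts every
# entry in the same probe chain, so the same dict that is O(1) per
# operation with well-distributed keys degrades toward O(n) per operation.

class BadKey:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                      # constant hash: every key collides
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

def build_and_probe(make_key, n=1000):
    start = time.perf_counter()
    table = {make_key(i): i for i in range(n)}   # n inserts
    for i in range(n):
        assert table[make_key(i)] == i           # n lookups
    return time.perf_counter() - start

print("well-distributed keys:", build_and_probe(int))     # ~linear total time
print("all-colliding keys:   ", build_and_probe(BadKey))  # ~quadratic total time
```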
While the fixed cost of hashing is irrelevant to the O(1) nature of hash table operations, it does mean that a hash table in practice could be slower (in terms of wall time) than a linked list for insert/find/remove operations. Despite a linked list being O(n) for those operations, it doesn't have to pay the fixed cost of hashing, so for low values of n the linked list might be a better choice. I think this is what you mean by "O(1) in the amortized sense", but it's not what we mean when we say the computational complexity of these operations on a hash table is O(1).
What the OP is talking about is an example of amortization of resize(), essentially adding a fixed cost to insert() and remove() so you don't have to pay a huge sum when you call resize(). But that doesn't mean insert() and remove() are no longer O(1).
If you use the terminology precisely, the worst-case cost of a regular hash table insertion is O(n) because of the potential resize, but its amortized cost is O(1) because that's what it "averages out to".
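For completeness, here is a sketch of the kind of incremental resize the OP is describing, with illustrative names and details that real implementations handle differently: keep the old and new arrays side by side and migrate a few entries on every operation, so the resize cost is spread out instead of landing on a single insert.

```python
# Sketch of an incremental resize, with illustrative names; real
# implementations differ in plenty of details. Old and new arrays live side
# by side, and every operation migrates a little of the old one, so the
# resize cost is spread out instead of hitting a single insert.

class IncrementalHashTable:
    MIGRATE_PER_OP = 4                 # minimum entries to move per operation

    def __init__(self, capacity=8, max_load=0.75):
        self.old = None                # table being drained, or None
        self.old_cursor = 0            # next old bucket to migrate
        self.new = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load

    def _slot(self, table, key):
        return table[hash(key) % len(table)]

    def _step_migration(self):
        if self.old is None:
            return
        moved = 0
        # Move whole buckets until at least MIGRATE_PER_OP entries have been
        # relocated or the old table is fully drained.
        while self.old_cursor < len(self.old) and moved < self.MIGRATE_PER_OP:
            for k, v in self.old[self.old_cursor]:
                self._slot(self.new, k).append((k, v))
                moved += 1
            self.old[self.old_cursor] = []
            self.old_cursor += 1
        if self.old_cursor == len(self.old):
            self.old = None            # migration finished

    def insert(self, key, value):
        self.remove(key)               # advances migration; drops any old entry
        self._slot(self.new, key).append((key, value))
        self.size += 1
        if self.old is None and self.size / len(self.new) > self.max_load:
            # Start (not finish) a resize: allocate the bigger array and let
            # subsequent operations drain the old one gradually.
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.old_cursor = 0

    def get(self, key):
        # While a migration is in flight, a key may still live in the old array.
        tables = (self.new, self.old) if self.old is not None else (self.new,)
        for table in tables:
            for k, v in self._slot(table, key):
                if k == key:
                    return v
        raise KeyError(key)

    def remove(self, key):
        self._step_migration()
        tables = (self.new, self.old) if self.old is not None else (self.new,)
        for table in tables:
            bucket = self._slot(table, key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    self.size -= 1
                    return True
        return False
```

Note that lookups have to check both arrays while a migration is in flight, which is exactly the per-operation overhead mentioned upthread.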