Breaking the Hash Table in Python

Our language of choice for bringing the worst out of the hash table is Python.

Let's start with the hash function and why we've chosen Python for this. The hash function for integers in Python is simply the identity (in CPython this holds for non-negative values below `sys.hash_info.modulus`, which is 2⁶¹ − 1 on 64-bit builds), so, as you might've guessed, there's no avalanche effect. Another thing that helps us is that integers in Python are technically BigInts¹. This allows us to put a bit more pressure on the hash function.
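
To make the "identity" claim concrete, here is a minimal sketch that simply asks the interpreter. It assumes CPython; the variable names `small_hashes` and `wrapped` are made up for this example.

```python
import sys

# Small non-negative integers hash to themselves: no bit mixing at all.
small_hashes = [hash(x) for x in (0, 1, 42, 1 << 20)]

# Assumption: CPython. Larger integers are reduced modulo a fixed prime,
# exposed as sys.hash_info.modulus (2**61 - 1 on 64-bit builds).
modulus = sys.hash_info.modulus
wrapped = hash(modulus + 5)  # wraps around to 5

print(small_hashes, wrapped)
```

All the numbers used later in this post stay well below that modulus, so for our purposes the hash really is the number itself.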

From the perspective of the implementation, it is a hash table that uses probing to resolve conflicts. This also means that it occupies a contiguous block of memory. Indexing works like in the example provided above. When the hash table reaches a breaking point (a load threshold defined somewhere in the C code), it reallocates the table and rehashes everything.

tip

Resizing and rehashing can reduce the conflicts. This follows from the fact that the position in the table is determined by both the hash and the size of the table itself.

Preparing the attack

Knowing the things above, it is not that hard to construct a method that causes as many conflicts as possible. Let's go over it:

We know that integers are hashed to themselves.
We also know that only the lower bits of that hash are used as the index.
We also know that the table is rehashed on resize, which could possibly fix the conflicts.
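
The three observations above can be sketched with a deliberately simplified model. The `slot` helper below is hypothetical: it only computes the initial slot as the low bits of the hash in a power-of-two table and ignores whatever probing the real implementation does afterwards.

```python
def slot(key: int, table_size: int) -> int:
    """Initial slot in a power-of-two table: the low bits of the hash."""
    return hash(key) & (table_size - 1)


keys = [8, 16, 24]

# Table with 8 slots: only the lowest 3 bits matter, and they are all zero.
before_resize = [slot(k, 8) for k in keys]

# After growing to 32 slots, 5 bits are used and the keys spread out.
after_resize = [slot(k, 32) for k in keys]

print(before_resize, after_resize)
```

Under this model, all three keys fight over slot 0 in the small table and get separate slots once the table grows, which is exactly the "healing" effect we try to delay or avoid in the attack.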

We will test with different sequences:

ordered one, numbers from 0 to N − 1
ordered one in reversed order, numbers from N − 1 back to 0
numbers that are shifted to the left, so they create conflicts until a resize
numbers that are shifted to the left, but resizing helps only at the end
numbers that are shifted to the left so far that their set bits are not taken into account even after the final resize

For each of these sequences, we will insert 10⁷ elements and then look each of them up 10 times in a row.

As a base of our benchmark, we will use a Strategy class and then for each strategy we will just implement the sequence of numbers that it uses:

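from functools import cached_property
from time import monotonic_ns

# Benchmark parameters as described above: 10⁷ elements, each looked up 10 times
N_ELEMENTS = 10_000_000
LOOPS = 10


class Strategy: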
    def __init__(self, data_structure=set):
        self._table = data_structure()

    @cached_property
    def elements(self):
        raise NotImplementedError("Implement for each strategy")

    @property
    def name(self):
        raise NotImplementedError("Implement for each strategy")

    def run(self):
        print(f"\nBenchmarking:\t\t{self.name}")

        # Extract the elements here, so that the evaluation of them does not
        # slow down the relevant part of benchmark
        elements = self.elements

        # Insertion phase
        start = monotonic_ns()
        for x in elements:
            self._table.add(x)
        after_insertion = monotonic_ns()

        print(f"Insertion phase:\t{(after_insertion - start) / 1000000:.2f}ms")

        # Lookup phase
        start = monotonic_ns()
        for _ in range(LOOPS):
            for x in elements:
                assert x in self._table
        after_lookups = monotonic_ns()

        print(f"Lookup phase:\t\t{(after_lookups - start) / 1000000:.2f}ms")
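
To show how a concrete strategy might plug into this, here is a small self-contained sketch. The base class is a condensed stand-in for the one above (it returns the timings instead of printing them), `AscendingStrategy` is a hypothetical name, and `N_ELEMENTS` is kept tiny on purpose so the sketch runs instantly.

```python
from functools import cached_property
from time import monotonic_ns

N_ELEMENTS = 1_000  # deliberately tiny; the real benchmark uses 10**7
LOOPS = 10


class Strategy:
    """Condensed stand-in for the base class above (timing only)."""

    def __init__(self, data_structure=set):
        self._table = data_structure()

    def run(self):
        elements = self.elements
        start = monotonic_ns()
        for x in elements:
            self._table.add(x)
        inserted = monotonic_ns()
        for _ in range(LOOPS):
            for x in elements:
                assert x in self._table
        return inserted - start, monotonic_ns() - inserted


class AscendingStrategy(Strategy):
    """Hypothetical subclass: the ordered (ascending) sequence."""

    name = "ordered sequence (ascending)"

    @cached_property
    def elements(self):
        # Materialise the list once, so generating it is not timed.
        return [x for x in range(N_ELEMENTS)]


insertion_ns, lookup_ns = AscendingStrategy().run()
print(f"{insertion_ns / 1_000_000:.2f}ms, {lookup_ns / 1_000_000:.2f}ms")
```

Every other strategy would differ only in `name` and in the expression used to build `elements`.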

Sequences

Let's have a look at how we generate the numbers to be inserted:

ordered sequence (ascending)
```
x for x in range(N_ELEMENTS)
```
ordered sequence (descending)
```
x for x in reversed(range(N_ELEMENTS))
```

progressive sequence that “heals” on resize

(x << max(5, x.bit_length())) for x in range(N_ELEMENTS)

progressive sequence that “heals” in the end

(x << max(5, x.bit_length())) for x in reversed(range(N_ELEMENTS))

conflicts everywhere
```
x << 32 for x in range(N_ELEMENTS)
```
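
To get a feeling for how differently these sequences behave, we can count how many distinct initial slots each of them would occupy. This is only a rough sketch under assumptions: `N`, `TABLE_SIZE` and `distinct_slots` are made up for the illustration, the table size is fixed instead of growing, and the slot is modelled as just the low bits of the hash, which ignores the real probing.

```python
N = 1024           # small stand-in for N_ELEMENTS
TABLE_SIZE = 2048  # hypothetical power-of-two table size


def distinct_slots(values, table_size=TABLE_SIZE):
    """Number of different initial slots (low bits of the hash) in use."""
    return len({hash(v) & (table_size - 1) for v in values})


sequences = {
    "ascending": (x for x in range(N)),
    "progressive": ((x << max(5, x.bit_length())) for x in range(N)),
    "conflicts": (x << 32 for x in range(N)),
}

slot_counts = {name: distinct_slots(seq) for name, seq in sequences.items()}
print(slot_counts)
```

In this model the ascending sequence gives every element its own slot, the progressive one squeezes the elements into a small fraction of the table, and the last one sends every single element to slot 0.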

Results

Let's have a look at the obtained results after running the code:

Technique	Insertion phase	Lookup phase
ordered sequence (ascending)	`558.60ms`	`3304.26ms`
ordered sequence (descending)	`554.08ms`	`3365.84ms`
progressive sequence that “heals” on resize	`3781.30ms`	`28565.71ms`
progressive sequence that “heals” in the end	`3280.38ms`	`26494.61ms`
conflicts everywhere	`4027.54ms`	`29132.92ms`

You can see a noticeable “jump” in the time after switching to the “progressive” sequence. The last sequence, which conflicts all the time, is the slowest overall, even though its insertion phase is rather comparable with that of the first progressive sequence.

If we were to compare the always conflicting one with the first one, we can see that insertion took over 7× longer and lookups almost 9× longer.
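
If you want to check the ratios yourself, they come straight from the table above. The dictionary names below are made up for this snippet; the numbers are the measured times in milliseconds.

```python
# Times in milliseconds, copied from the results table.
ascending = {"insertion": 558.60, "lookup": 3304.26}
conflicts = {"insertion": 4027.54, "lookup": 29132.92}

slowdown = {phase: conflicts[phase] / ascending[phase] for phase in ascending}

for phase, factor in slowdown.items():
    print(f"{phase}: {factor:.2f}x slower")
```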

You can have a look at the code here.

Comparing with the tree

danger

Source code can be found here.

Viewer discretion advised.

Python doesn't ship a tree-based implementation of sets/maps, so for comparison we will run a similar benchmark in C++. Running the same sequences on both a hash table and a tree (RB-tree) gives the following results:

Technique	Insertion (hash)	Lookup (hash)	Insertion (tree)	Lookup (tree)
ordered (ascending)	`316ms`	`298ms`	`2098ms`	`5914ms`
ordered (descending)	`259ms`	`315ms`	`1958ms`	`14747ms`
progressive a)	`1152ms`	`6021ms`	`2581ms`	`16074ms`
progressive b)	`1041ms`	`6096ms`	`2770ms`	`15986ms`
conflicts	`964ms`	`1633ms`	`2559ms`	`13285ms`

note

We can't forget that implementation details may be involved. The hash function is still the identity, to my knowledge.

One interesting thing to notice is that the progressive sequences took the most time in lookups (which differs from what we saw in Python).

Now, if we have a look at the tree implementation, we can notice two very distinctive things:

Tree implementations are not affected by the input, therefore (except for the first sequence) we can see very consistent times.
Compared to the hash table, the times are much higher.

The reason for the 2nd point may not be very obvious. From the technical perspective it makes some sense. Let's dive into it!

If we take a hash table, it is an array, and therefore a contiguous piece of memory. (For more information I'd suggest looking into the blog post by Bjarne Stroustrup listed in the references section below.)

On the other hand, if we take a look at the tree, each node holds some attributes and pointers to the left and right descendants of itself. Even if we maintain a reasonable height of the tree (keep the tree balanced), we still need to follow the pointers which point to the nodes somewhere on the heap. When traversing the tree, we get a consistent time complexity, but at the expense of jumping between the nodes on the heap which takes some time.
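
The pointer chasing described above can be made visible with a small sketch. Everything in it (`Node`, `build_balanced`, `lookup`) is hypothetical and only illustrates the structure: Python cannot show the cache effects themselves, but it can count how many separate node objects a single lookup has to visit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced(keys, lo=0, hi=None):
    """Build a balanced search tree from a sorted list of keys."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(
        keys[mid],
        build_balanced(keys, lo, mid),
        build_balanced(keys, mid + 1, hi),
    )


def lookup(root, key):
    """Return (found, hops), where hops counts the nodes that were visited."""
    hops = 0
    node = root
    while node is not None:
        hops += 1  # every hop follows a reference to another heap object
        if key == node.key:
            return True, hops
        node = node.left if key < node.key else node.right
    return False, hops


root = build_balanced(list(range(1023)))
found, hops = lookup(root, 0)
print(found, hops)
```

Even in a perfectly balanced tree, each lookup walks through a chain of nodes that may live anywhere on the heap, whereas the hash table usually lands in the right region of one array straight away.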

danger

This is not meant to promote the hash table or to persuade people not to use tree representations. Each data structure has its own benefits, even if its time is not the best.

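Overall, if we compare the worst-case time complexities of the tree and the hash table, the tree representation comes off better: a balanced tree guarantees O(log n) per operation, while a hash table can degrade to O(n) when the keys collide.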

Challenge

Try to benchmark with a similar approach in Rust. Since Rust uses a different hash function, it would be best to just override the hash; this way you can also avoid the hard part of this attack (making up the numbers that will collide).
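
To illustrate what "overriding the hash" buys you, here is the same idea sketched in Python rather than Rust, so it is not a solution to the challenge. `CollidingKey` is a hypothetical class whose hash is constant, which means every key collides without any clever choice of numbers.

```python
class CollidingKey:
    """Hypothetical key type: every instance reports the same hash."""

    eq_calls = 0  # counts how often the set has to compare two keys

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0  # constant hash: all keys collide

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        return isinstance(other, CollidingKey) and self.value == other.value


N = 200
table = {CollidingKey(i) for i in range(N)}

print(len(table), CollidingKey.eq_calls)
```

The counter shows how much extra comparison work the collisions cause: each newly inserted key has to be compared against keys that are already in the table.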

References

Bjarne Stroustrup. Are lists evil?

Arbitrary-sized integers, they can get as big as your memory allows. ↩
