Additional space complexity isn't always a useful metric

post by Brendan Long (korin43) · 2023-01-04T21:53:05.049Z · LW · GW · 3 comments

This is a link post for

I've been subscribed to Interview Cake for years, and today they had a really interesting question: Given a list of n + 1 integers in the range 1...n, find one of the duplicates (there is guaranteed to be at least one) in O(n) time and O(1) additional space. The answer is really interesting[1], and I recommend trying it, but I don't think it makes sense to care about additional space rather than total space, and I still think using a set to keep track of numbers (the most obvious solution) is the best solution in practice.

The reason I thought this was worth writing about is that the "O(1)" linked list algorithm can't be streamed, so it doesn't make any sense to treat the memory the input uses and the additional memory as fundamentally different (they both need to be allocated on the same machine). Since the algorithm requires that you hold the input in memory, the only sane way to treat the space complexity to consider the total, i.e. the linked list solution is O(n) for the input + O(1) for the rest of the algorithm, which is still O(n).

In this specific case, the constant factors are also huge. The numbers are guaranteed to be densely packed and in the range 1...n, so our set can be a bit array of length n (i.e. to mark "2" as being in our set, we set bit 2 of our array). This means that our additional space is actually going to be smaller than our input by a factor of the size of our integer type (i.e. with 32-bit integers, the input will be 32x larger than our bit array). With any reasonable integer size, the size of the input is going to be massively larger than the additional memory for our set.

In a previous post on a similar question, I pointed out that a hash table with a million 32-bit entries will only use 8 MB of memory, but this case is even better. If you want to find the duplicate in a list of a million 32-bit integers, your list will take 4 MB of memory but the set of numbers you've seen will only take 125 KB of memory.

Finally, remember how I said that the linked list solution can't be streamed? Well, the set solution can, so taking that and the bit array optimization into account, the set solution can use substantially less memory in-practice because it only has to store the extremely-efficient bit set and doesn't have to hold the entire input in memory!

For reference, here's a Python solution that uses substantially less memory than the linked-list solution when streaming the input:

from typing import Any, Iterable

# python3 -m pip install bitarray
from bitarray import bitarray

# We need a resizable bit set for streaming since we don't know the largest value we'll see
# If we do know the largest value, we can replace this entire class with bitarray(max_value)
class BitSet:
    A set wrapper around bitarray

    Note that the set can only contain integers, and this set only makes sense if the
    integers it contains are in a predictable, densely packed range (specifically,
    when max_value - min_value / 64 > 1).

    def __init__(self, min_value: int = 0, initial_size: int = 256) -> None:
        Initialize a bit set

        min_value is the smallest value that can be contained in the set (i.e. it will
        be mapped to index 0)
        self.min_value = min_value
        self._set = bitarray(initial_size)

    def __contains__(self, value: Any) -> bool:
        return (
            isinstance(value, int)
            and 0 <= value - self.min_value < len(self._set)
            and self._set[value - self.min_value] == 1

    def ensure_size(self, min_length: int) -> None:
        Ensure the underlying bit array is at least as large the given min_length
        (resizing if necessary).
        starting_size = len(self._set)
        if min_length > starting_size:
            # Don't do a bunch of tiny resizes
            min_length = max(min_length, starting_size * 2)

            # Create a new array, initialize it, and copy the old data if applicable
            old_set = self._set
            self._set = bitarray(min_length)
            self._set[starting_size:] = 0
            if old_set:
                self._set[:starting_size] = old_set

    def add(self, value: int) -> None:
        assert value >= self.min_value
        mapped_value = value - self.min_value
        self.ensure_size(mapped_value + 1)
        self._set[mapped_value] = 1

def find_first_duplicate(int_list: Iterable[int]) -> int:
    Find one of the duplicates in a list of n + 1 integers in the range 1...n
    seen = BitSet(min_value=1)
    for value in int_list:
        if value in seen:
            return value
    assert False, "No duplicates found, which is impossible with conforming inputs"

  1. If you found the Interview Cake question interesting, you might also like this Veritasium video explaining a related riddle.


Comments sorted by top scores.

comment by LawrenceC (LawChan) · 2023-01-04T21:57:30.351Z · LW(p) · GW(p)

Famously, this is the question that appears in "If Programming Was An Anime":

comment by JBlack · 2023-01-04T23:14:58.733Z · LW(p) · GW(p)

I don't think that anyone is claiming that it's a particularly useful metric, just that it's an interesting constraint for programming puzzles.

There are sometimes corresponding constraints in reality, such as systems with comparatively low amounts of RAM where data can be requested (e.g. embedded systems with I/O, servers with network APIs, etc) but data can not or should not be written via those channels. It doesn't make them common, but it is useful to be able to devise algorithms under these or other constraints.

Replies from: korin43
comment by Brendan Long (korin43) · 2023-01-04T23:42:39.376Z · LW(p) · GW(p)

I agree that the math puzzle is interesting.

I'm still skeptical that this algorithm is useful any real-world situation, although I was hoping I might get comments with counter-examples. Even in the examples you gave, you already have another machine that clearly has far more memory than you need to implement the set algorithm but for some reason you have to write this algorithm to run on a toaster and talk to your dramatically more powerful server over the network? I'm not saying it's impossible, but I hope you can see why I'm skeptical.