You Don't Always Need Indexes

post by jefftk (jkaufman) · 2023-05-25T14:20:07.901Z · LW · GW · 6 comments

Sometimes you have a lot of data, and one approach to support quick searches is pre-processing it to build an index so a search can involve only looking at a small fraction of the total data. The threshold at which it's worth switching to indexing, though, might be higher than you'd guess. Here are some cases I've worked on where full scans were better engineering choices:

Unless you know from the start that you'll be searching hundreds of millions of records, consider starting with simple scans and only add indexing if you can't get acceptable performance. And even then, if queries are rare and highly varied you may still do better to do the work at query time instead of ingestion time.


Comments sorted by top scores.

comment by Dagon · 2023-05-25T16:03:46.773Z · LW(p) · GW(p)

I've done SO MANY design reviews where my first questions were "how big is the data, what are the query rates and patterns, and what is the worst-acceptible-performance", immediately followed by a reschedule to review a much simpler design.

It's worth mentioning that indexing is work done at insertion time (or at least in advance, outside of the query path), and it's VERY often worth it to save resources at query time.  This can be true EVEN IF it's a bit more resources overall.

comment by Sinclair Chen (sinclair-chen) · 2023-05-25T21:14:37.682Z · LW(p) · GW(p)

At Manifold I find myself adding indexes for smaller collections. Like, in our postgres db we have a table for mana transactions called txns that has only 1 million rows, and I recently added an index so that a particular query would take 0.6 ms instead of 400ms. This is a query that has to run anytime a user loads their feed, and even if we never100x the size of txns we shouldn't delay the feed by almost half a second.

(I am generally for rapid, hacky prototyping)

comment by Adam Zerner (adamzerner) · 2023-05-25T17:51:27.602Z · LW(p) · GW(p)

There seems to be an assumption that the burden of proof is on showing that the index adds enough value to be worth it. My understanding is that the cost of adding an index is quite small though - it's easy, battle tested, doesn't lead to bugs, and doesn't have much impact on performance at insertion time - so it at least seems worth discussing the question of burden of proof instead of assuming an answer to it.

Replies from: jkaufman
comment by jefftk (jkaufman) · 2023-05-25T20:41:21.628Z · LW(p) · GW(p)

If you are already using a database and think you might want a simple index (ex: on an ID) then sure, just add it. But if feeling like you should have an index pushes you to start using a database, or if you want the support something complicated like full text search, then I don't think it's so clear.

(This post is not anti-index, it is anti-"you should never be doing full table scans in production")

Replies from: adamzerner
comment by Adam Zerner (adamzerner) · 2023-05-26T19:00:46.401Z · LW(p) · GW(p)

I see. That makes sense.

comment by Ron Michael Zettlemoyer (ron-michael-zettlemoyer) · 2023-05-28T00:42:33.831Z · LW(p) · GW(p)

Maybe this supports or complements your argument: platforms like Azure SQL will tell you when you need an index and even automatically create it for you if you wish. (And automatically undo it if the index doesn’t help or hurts performance) So, other than a primary key index, you could do nothing and let the database decide what’s best.

How about foreign key indexes? :)