Level up your spreadsheeting

post by angelinahli · 2024-05-25T14:57:19.730Z · LW · GW · 11 comments

This is a link post for https://docs.google.com/document/d/1cIHfQcRIV_wNF6IlqIoqY83NRDl4US012v1PtsTPo4s/edit

Contents

  Who am I?
  Who should read this?
  Principles of good spreadsheets
    It should be easy for you to extract insights from your data
    Your spreadsheet should be beautiful and easy to read
    There should be one source of truth for your data
    Your spreadsheet should be easy to audit
    Your spreadsheet should be hard to break
None
11 comments

Epistemic status: Passion project / domain I’m pretty opinionated about, just for fun.

In this post, I walk through some principles I think good spreadsheets abide by, and then in the companion piece, I walk through a whole bunch of tricks I've found valuable.

Image of a spreadsheet by GPT-4o
Illustrated by GPT-4o

Who am I?

I’ve spent a big chunk of my (short) professional career so far getting good at Excel and Google Sheets.[1] As such, I’ve accumulated a bunch of opinions on this topic.

Who should read this?

This is not a guide to learning how to start using spreadsheets at all. I think you will get more out of this post if you use spreadsheets at least somewhat frequently, e.g.

Principles of good spreadsheets

Broadly speaking, I think good spreadsheets follow some core principles (non-exhaustive list).

I think the below is a combination of good data visualization (or just communication) advice, systems design, and programming design (spreadsheets combine the code and the output).

It should be easy for you to extract insights from your data

  1. A core goal you might have with spreadsheets is quickly calculating something based on your data. A bunch of tools below are aimed at improving functionality, allowing you to more quickly grab the data you want.

Your spreadsheet should be beautiful and easy to read

  1. Sometimes, spreadsheets look like the following example.
  2. I claim that this is not beautiful or easy for your users to follow what is going on. I think there are cheap techniques you can use to improve the readability of your data.

There should be one source of truth for your data

  1. One common pitfall when designing spreadsheet-based trackers is hard copy and pasting data from one sheet to another, such that when your source data changes, the sheets you use for analyses no longer reflect “fresh” data. This is a big way in which your spreadsheet systems can break down.
  2. A bunch of tools below are designed to improve data portability — i.e. remove the need for copy and pasting.

Your spreadsheet should be easy to audit

  1. One major downside of spreadsheets as compared to most coding languages, is that it’s often easy for relatively simple spreadsheets to contain silent bugs in them.[2]
  2. Some features of spreadsheets that contribute to this problem:
    1. Spreadsheets hide the code and show you only the output by default.
      1. When you use formulas, once you hit enter, the user doesn’t by default get to read what’s going on. So if the output looks plausible, you might not notice your formula has a bug in it.
    2. It’s harder to break up your work into chunks.
      1. When you’re coding, most people will break up a complicated formula into several lines of code, using intermediate variables and comments to make things more readable. E.g.:
      2. By default, some Sheets formulas get really unwieldy, and you need to work a bit harder to recover readability.
    3. Spreadsheets contain more individual calculations.
      1. When you’re coding and you want to perform the same calculation on 100 rows of data, you’d probably use a single line of code to iterate over your data (e.g. a for loop).
      2. In Google Sheets, you’re more likely to drag your formula down across all of your rows. But this means that if you accidentally change the formula for one cell and not the others, or if your data has now changed and it turns out you need to drag your formulas down more, things can break in annoying ways.
  3. Because of this, I consider auditability one of the key qualities of a well designed spreadsheet. Some of the tools below will recover coding best practices.
  4. I also consider principles (2)-(3) above pretty related to principle (4).

Your spreadsheet should be hard to break

  1. Not all spreadsheets are meant as living documents; sometimes you’ll create a spreadsheet to conduct a specific analysis and then discard it.
  2. But sometimes, you’ll use a spreadsheet as a management tool to keep track of a bunch of moving pieces. In this case, you might care that your system isn’t going to break after a few weeks of use.[3]

Much more in the companion piece!

  1. ^

     I’m using the term ‘Google Sheets’ in this doc, but almost all of the tricks mentioned here work for Excel as well.

  2. ^

     My favorite Excel bug story: I used to work in litigation consulting, where I’d sometimes audit spreadsheets sent to us from the opposing side of a legal case.

    In one case, an expert witness for the opposing side sent over a spreadsheet with columns similar to the following: year, online sales, in-person sales, total sales. The expert was saying that total sales had almost doubled from ~3,000 → ~5,000 for this particular product in 2019.

    We eventually discovered that for the 2019 row, the expert had entered the formula =sum(A4:C4) instead of sum(B4:C4), and so had accidentally added the value ‘2019’ to the total sum. Here’s a recreation. (I’ve obfuscated the details a bit here but the core mistake was the same.)

  3. ^

     As an aside, spreadsheets have a lot of use cases, which makes giving generalizable advice a bit trickier. For instance, some common use cases for spreadsheets:

    - A database which you query whenever needed;

    - A data visualization tool meant to present some interesting findings from existing data;

    - A management tracker that you use to schedule emails and keep tabs on your tasks;

    - To model some interesting phenomenon and keep track of your assumptions

    Depending on what you’re using a spreadsheet for, you might prioritize some of these principles more or less highly. For instance, making something easy to read is probably more valuable when you’re creating a data visualization versus a database.

    Of course, lots of spreadsheets combine lots of different use cases — e.g. you might have one tab with your source of truth data, and another for random analytics.

11 comments

Comments sorted by top scores.

comment by bhauth · 2024-05-25T23:31:35.085Z · LW(p) · GW(p)

Do you have any thoughts on why hierarchical spreadsheets like TreeSheets never became popular?

comment by Mo Putera (Mo Nastri) · 2024-05-26T15:17:35.010Z · LW(p) · GW(p)

Great post, especially the companion piece :)

I'm tangentially reminded of professional modeler & health economist froolow's refactoring of GiveWell's cost-effectiveness models in his A critical review of GiveWell's 2022 cost-effectiveness model [EA · GW] (sections 3 and 4), which I think of as complementary to your post in that it teaches-via-case-study how to level up your spreadsheet modeling. 

Here's GiveWell's model architecture:

And here's froolow's refactoring: 

The difference in micro-level architecture is also quite large:

As someone who's spent a lot of his (short) career building dashboards and models in Google Sheets, and having seen GiveWell's CEAs, I empathized with froolow's remarks here:

After the issue of uncertainty analysis, I’d say the model architecture is the second biggest issue I have with the GiveWell model, and really the closest thing to a genuine ‘error’ rather than a conceptual step which could be improved. Model architecture is how different elements of your model interact with each other, and how they are laid out to a user. 

It is fairly clear that the GiveWell team are not professional modellers, in the same way it would be obvious to a professional programmer that I am not a coder (this will be obvious as soon as you check the code in my Refactored model!). That is to say, there’s a lot of wasted effort in the GiveWell model which is typical when intelligent people are concentrating on making something functional rather than using slick technique. A very common manifestation of the ‘intelligent people thinking very hard about things’ school of model design is extremely cramped and confusing model architecture. This is because you have to be a straight up genius to try and design a model as complex as the GiveWell model without using modern model planning methods, and people at that level of genius don’t need crutches the rest of us rely on like clear and straightforward model layout. However, bad architecture is technical debt that you are eventually going to have to service on your model; when you hand it over to a new member of staff it takes longer to get that member of staff up to speed and increases the probability of someone making an error when they update the model.

Replies from: angelinahli
comment by angelinahli · 2024-05-27T19:39:20.087Z · LW(p) · GW(p)

Thanks, I found this interesting! I remember reading that piece by Froolow but I didn't realize the refactoring was such a big part of it (and that the GiveWell CEA was formatted in such a dense way, wow).

This resonates a lot with my experience auditing sprawling, messy Excel models back in my last job (my god are there so many shitty Excel models in the world writ large).

FWIW if I were building a model this complex, I'd personally pop it into Squiggle [EA · GW] / Squigglehub — if only because at that point, properly multiplying probabilities together and keeping track of my confidence interval starts to really matter to me :)

comment by Ponder Stibbons · 2024-05-26T15:26:59.940Z · LW(p) · GW(p)

So after tearing my hair out trying to generate increasingly complex statistical analyses of scientific data in Excel, my world changed completely when I started using KNIME to process and transform data tables. It is perfect for a non-programmer such as myself, allowing the creation of complex yet easily broken-down workflows, that use spreadsheet input and output. Specialist domain tools are easily accessible (e.g chemical structure handling and access to the RDKit toolkit for my own speciality) and there is a thriving community generating free-to-use functionality. Best of all it is free to the single desk-top user. 

comment by Aprillion · 2024-05-26T15:26:27.945Z · LW(p) · GW(p)

Know some fancier formulas like left/mid/right, concatenate, hyperlink

Wait, I thought basic fancier formulas are like =index(.., match(.., .., 0)) 

I guess https://dev.to/aprillion/self-join-in-sheets-sql-python-and-javascript-2km4 might be a nice toy example if someone wants to practice the lessons from the companion piece 😹

comment by Malentropic Gizmo (malentropicgizmo) · 2024-05-26T07:38:55.689Z · LW(p) · GW(p)

Thank you for this! Your companion piece instantly solved a problem I was having with my diet spreadsheet!

comment by Archimedes · 2024-05-26T05:09:02.157Z · LW(p) · GW(p)

I'm surprised to see no mention of the LET function for making formulas more readable.

Another glaring omission is Power Query in Excel. I find it incredibly useful for connecting to data sources and transforming data. It's easily auditable as it produces steps of code for table transformations rather than working with thousands of cells with individual formulas.

When it comes to writing about spreadsheets, it's just about impossible to even skim the surface without missing anything, especially considering many aspects like array formulas, VBA macros, pivot tables with DAX measures, and Power Query can go super deep. I own a 758-page textbook just on Power Query and multiple books on Power Pivot / DAX.

Replies from: Bohaska
comment by Bohaska · 2024-05-26T11:37:11.021Z · LW(p) · GW(p)

They're mentioned in the companion piece (Google Docs) linked at the bottom of this post. This isn't the full post.

Replies from: Archimedes
comment by Archimedes · 2024-05-26T17:11:50.906Z · LW(p) · GW(p)

Ah, I do see the LET function now but I still can't find a reference to the Power Query editor.

Replies from: Bohaska
comment by Bohaska · 2024-05-27T07:48:39.569Z · LW(p) · GW(p)

True, but that's because the author is writing about working with Google Sheets, not Excel. 

Replies from: Archimedes
comment by Archimedes · 2024-05-29T00:30:15.139Z · LW(p) · GW(p)

Ah, I was under the impression that OP was covering both, not only things relevant to Google Sheets.