Making 2023 ACX Prediction Results Public

post by Legionnaire · 2024-03-05T17:56:18.437Z · LW · GW

This is a question post.

They say you shouldn't roll your own encryption, which is why I'm posting this here, so it can be unrolled if it's too unsafe.

Problem: Astral Codex Ten has finished scoring the 2023 prediction results, but the primary identifier for each person's score is their email address. Since people wouldn't want those published, what's an easy way to get people their scores?

You could email everyone, but then you have to interact with an email server, and nobody can do cool analysis of the scores and whatever other data is in the document.

My proposal: 

  1. There are ~10,000 email addresses. Hash the email addresses using a hash that only maps to ~10 million values.
  2. Replace the emails in the document with the hashes. Writing a Python script to do this could be done in a few minutes (see the sketch after this list).
  3. Give everyone access to that file.
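
A minimal sketch of steps 1 and 2, assuming the scores live in a CSV with an "email" column (the filenames and column name are made up; the truncated SHA-256 is one possible choice, discussed under "Which hash?" below):

    import csv
    import hashlib

    def short_hash(email: str) -> str:
        # One possible choice (see "Which hash?" below): SHA-256 truncated to
        # the first 6 hex characters, i.e. 16**6 ~= 16.8 million values.
        # Stripping/lowercasing is an assumption about normalizing addresses.
        return hashlib.sha256(email.strip().lower().encode()).hexdigest()[:6]

    # Hypothetical filenames and column name, for illustration only.
    with open("scores.csv", newline="") as f_in, \
         open("scores_public.csv", "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        fields = ["hash" if c == "email" else c for c in reader.fieldnames]
        writer = csv.DictWriter(f_out, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            row["hash"] = short_hash(row.pop("email"))
            writer.writerow(row)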

If you know the email address of a participant, it's trivial to check their score. And if you forgot which email address you used, just try each one! Odds are you will not have had a collision.

But at the same time, with ~8 billion email addresses worldwide, any given hash in the document should collide with roughly a thousand other real addresses (8 billion addresses spread over 10 million hash values is ~800 per value), meaning you can't just brute-force and figure out each person's address. Of the 8 billion real addresses you try, ~8 million will hash to a value that appears in the document (each address matches one of the 10,000 entries with probability 0.1%), but only 10,000 of those (~0.1%) will be the originals. So an address whose hash appears in the document is highly unlikely to be the actual address of a participant.
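
The arithmetic behind those numbers, spelled out:

    WORLD = 8_000_000_000   # rough count of email addresses worldwide
    SPACE = 10_000_000      # size of the hash output space
    REAL = 10_000           # participants in the document

    print(WORLD // SPACE)                 # 800: real addresses per hash value
    print(WORLD * REAL // SPACE)          # 8,000,000: addresses whose hash appears in the document
    print(REAL / (WORLD * REAL / SPACE))  # 0.00125: chance a matching address is an original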

If there are a few victims of the birthday paradox, they could probably just email to request their line number in the document. It may be better to use a larger hash space to avoid internal (within the data set) collisions, but that lowers the number of external collisions. My back-of-the-envelope estimate expects several internal collisions with a 10 million output space; 100 million makes it 0 or 1.
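
For reference, the back-of-the-envelope here is the birthday problem: hashing n addresses into a space of m values produces about n^2/(2m) expected collisions.

    n = 10_000
    for m in (10_000_000, 100_000_000):
        print(f"m = {m:>11,}: ~{n**2 / (2 * m):.1f} expected internal collisions")
    # m =  10,000,000: ~5.0 expected internal collisions
    # m = 100,000,000: ~0.5 expected internal collisions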

Which hash? Not sure. Maybe SHA-256, then just delete N characters off the end of the hex digest until the space is ~10,000,000?
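
Each hex character of the digest carries 4 bits, so keeping the first k characters gives 16^k possible values, and k = 6 lands nearest the target:

    import hashlib

    for k in (5, 6, 7):
        print(k, f"{16**k:,}")
    # 5 1,048,576
    # 6 16,777,216   <- closest to ~10 million
    # 7 268,435,456

    # e.g. the truncated hash of a made-up address:
    print(hashlib.sha256(b"someone@example.com").hexdigest()[:6])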

Please discuss how safe/unsafe this is. Thanks for your time.

Answers

answer by dreeves · 2024-03-05T20:08:29.961Z · LW(p) · GW(p)

This should be fine. In past years, Scott has had an interface where you could enter your email address and get your score. So the ability to find out other people's scores by knowing their email address is apparently not an issue. And it makes sense to me that one's score in this contest isn't particularly sensitive private information.

Source: Comment from Scott on the ACX post announcing the results

comment by Legionnaire · 2024-03-05T21:31:30.182Z · LW(p) · GW(p)

Good to know. In that case the above solution is actually even safer than that interface was.

answer by Brendan Long · 2024-03-05T18:19:57.010Z · LW(p) · GW(p)

I think this would provide security against people casually accessing each other's scores but wouldn't provide much protection against a determined attacker.

Some problems:

  • There's no protection at all for someone's scores if the attacker knows their email address (and email addresses aren't secret)
  • It's probably not that hard to build or acquire a list of LessWrong users' email addresses
  • Even if you just brute-force this, there are probably patterns in LessWrong users' email addresses that make them distinguishable from random email addresses (more likely to be @somerationalistgroup.com, @gmail, recognizably American, nerdy, etc.).

A better solution:

  1. Generate a random ID for each user and add it to your data
  2. Email users their random ID
  3. Publish the data with emails removed

(And remove anything else that could be used to reconstruct users, like jobs/locations/etc. if relevant)
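
A minimal sketch of steps 1 and 3, reusing the hypothetical CSV layout from the sketch in the post (actually emailing the IDs in step 2 is omitted):

    import csv
    import secrets

    # Hypothetical filenames and column name, for illustration only.
    with open("scores.csv", newline="") as f_in, \
         open("scores_public.csv", "w", newline="") as f_out, \
         open("id_mailing_list.csv", "w", newline="") as f_ids:
        reader = csv.DictReader(f_in)
        fields = ["id" if c == "email" else c for c in reader.fieldnames]
        writer = csv.DictWriter(f_out, fieldnames=fields)
        writer.writeheader()
        ids = csv.writer(f_ids)
        ids.writerow(["email", "id"])  # private file, used only for the mail-out
        for row in reader:
            row["id"] = secrets.token_hex(8)  # 64 random bits, unrelated to the email
            ids.writerow([row.pop("email"), row["id"]])
            writer.writerow(row)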

comment by Brendan Long (korin43) · 2024-03-05T18:29:20.465Z · LW(p) · GW(p)

I realized after writing this that you meant that people's email addresses are private, but their scores are public if you know their email. I'd default to not exposing people's participation and scores unless they expected that to happen, but maybe that's less of an issue than I was thinking. The predictability of LessWrong emails would still expose a lot of email addresses.

I'd still recommend the random ID solution though since it's trivial to reason about (it's basically a one-time-pad).

comment by Legionnaire · 2024-03-05T18:48:38.059Z · LW(p) · GW(p)

Thanks for your input. Ideally we wouldn't have to go through an email server, but that may just be required at some level of security.

As for the patterns, the nice thing is that with a small output space in the millions, there are tons of overlapping plausible addresses even if you pin it down to a domain. The number of English first-and-last-name combos, even without any digits, is already a lot larger than 10 million, meaning even targeted domains should have plenty of collisions (rough numbers below).
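
Ballpark arithmetic (the name counts are rough assumptions):

    FIRST = 5_000     # rough count of English first names in common use
    LAST = 100_000    # rough count of English surnames in common use
    SPACE = 10_000_000

    combos = FIRST * LAST
    print(f"{combos:,} name combos -> ~{combos // SPACE} colliding with each hash value")
    # 500,000,000 name combos -> ~50 colliding with each hash value,
    # even within a single domain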

comment by Brendan Long (korin43) · 2024-03-06T02:29:30.341Z · LW(p) · GW(p)

There's an idea in security that you should avoid weak security because it lets you trick yourself into thinking you're doing something. For example, if you're not going to protect passwords, in some sense it's better to leave them completely plaintext instead of hashing them with MD5. At least in the plaintext case you know you're not protecting them (and won't accidentally do something unsafe with them on the assumption that they're already protected by being hashed).

I feel like this is a case like that:

  • If you don't care if these become public, consider just making it public.
  • If you don't think they should be public, use something that guarantees that they're not (like the random ID solution)

The solution you proposed is better than nothing and might protect some email addresses in some cases, but it raises the questions: if you need to protect these sometimes, why not all the time? And if not protecting them sometimes is OK, why bother at all?

(I should say though that there are benefits to making data annoying to access, like that your scheme will protect the data from casual snoopers, and prevent it from being crawled by search engines unless someone goes to the trouble of de-anonymizing and reposting it. My point is mostly just that you should ask if you're ok with it becoming entirely public or not)

answer by dreeves · 2024-03-05T20:35:27.041Z · LW(p) · GW(p)

To make sure I understand this concern:

It may be better to use a larger hash space to avoid internal (within the data set) collisions, but that lowers the number of external collisions.

Are you thinking someone may want plausible deniability? "Yes, my email hashes to this entry with a terrible Brier score but that could've been anyone!"

comment by Legionnaire · 2024-03-05T21:28:29.011Z · LW(p) · GW(p)

Plausible deniability, yes; reason agnostic. It's hard to know why someone might not want to be known to have their address here, but with my numbers above they would have the statistical backing that 1/1000 addresses will appear in the set by chance. Someone who wants to deny it could say, "For every address actually in the set, ~1000 will appear to be, so there's only a 1/1000 chance I actually took the survey!" (Naively, of course; rest in peace, rationalist@lesswrong.com.)

comment by dreeves · 2024-03-05T22:22:44.466Z · LW(p) · GW(p)

I guess in practice it'd be the tiniest shred of plausible deniability. If your prior is that alice@example.com almost surely didn't enter the contest (a 1% chance she did) but her hash is in the table (which happens by chance with p = 1/1000), then you Bayesian-update to a 91% chance that she did in fact enter the contest. If you think she had even a 10% chance on priors, then her hash being in the table makes you 99% sure it's her.
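
Spelling out the update with the comment's numbers:

    def p_entered(prior: float, chance_match: float = 1 / 1000) -> float:
        # P(entered | hash in table): a participant's hash is always in the
        # table; a non-participant's matches by chance with p = 1/1000.
        return prior / (prior + (1 - prior) * chance_match)

    print(round(p_entered(0.01), 2))   # 0.91
    print(round(p_entered(0.10), 3))   # 0.991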
