publishing alignment research and exfohazards
post by Tamsin Leake (carado-1) · 2022-10-31T18:02:14.047Z · LW · GW · 12 comments
This is a link post for https://carado.moe/publishing-infohazards.html
(edit: i mean exfohazard, not infohazard [LW · GW])
to me, turning my thoughts into posts that i then publish on my blog and sometimes lesswrong [LW · GW] serves the following purposes:
- in conversations, i can easily link to a post of mine rather than explaining myself again (the original primary purpose of this blog!)
- having a more formally written-down version of my thoughts helps me think about them more clearly
- future posts — whether written by me or others — can link to my posts, contributing to a web of related ideas
- i can get feedback on my ideas, whether it be through comments on lesswrong or responses on discord
however, i've come to increasingly want to write and publish posts which i've determined — either on my own or with the advice of trusted peers — to be potentially infohazardous [? · GW], notably with regards to potentially helping AI capability progress.
on one hand, there is no post of mine i wouldn't trust, say, yudkowsky reading; on the other i can't just, like, DM him and everyone else i trust a link to an unlisted post every time i make one.
it would be nice to have a platform — or maybe a lesswrong feature — which lets me choose which persons or groups can read a post, with maybe a little ⚠ sign next to its title.
note that such a platform/feature would need something more complex than just a binary "trusted" flag: just because i can make a post that the Important People can read, doesn't mean i should be trusted to read everything else that they can read; and there might be people whom i trust to read some posts of mine but not others.
maybe trusted recipients could be grouped by orgs — such as "i trust MIRI" or "i trust The Standard List Of Trusted Persons". maybe something like the ability to post on the alignment forum is a reasonable proxy for "trustable person"?
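as a rough sketch of what per-post trust groups could look like as a data model (everything here is illustrative: these aren't real lesswrong groups or APIs, just the shape of the permission check):

```python
# minimal sketch of group-based read permissions; names are purely illustrative
from dataclasses import dataclass, field

@dataclass
class Post:
    title: str
    author: str
    allowed_groups: set[str] = field(default_factory=set)  # empty set = public post

@dataclass
class Reader:
    name: str
    groups: set[str] = field(default_factory=set)  # e.g. {"MIRI"}

def can_read(reader: Reader, post: Post) -> bool:
    # a public post is readable by anyone; a restricted post requires
    # membership in at least one of the groups the author allowed for it
    return not post.allowed_groups or bool(reader.groups & post.allowed_groups)

# usage: the same reader can be allowed on one post and not another, which
# captures the "trusted for some posts but not others" asymmetry
risky = Post("capability-relevant idea", "carado",
             allowed_groups={"MIRI", "Standard List Of Trusted Persons"})
benign = Post("web of related ideas", "carado")  # no restriction

alice = Reader("alice", groups={"MIRI"})
bob = Reader("bob")

print(can_read(alice, risky), can_read(bob, risky))    # True False
print(can_read(alice, benign), can_read(bob, benign))  # True True
```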
i am aware that this seems hard to figure out, let alone implement. perhaps there is a much easier alternative i'm not thinking about; for the moment, i'll just stick to making unlisted posts and sending them to the very small intersection of people i trust with infohazards and people for whom it's socially acceptable for me to DM links to new posts of mine.
12 comments, sorted by top scores.
comment by geoffreymiller · 2022-10-31T21:12:39.993Z · LW(p) · GW(p)
If we're dead-serious about infohazards, we can't just be thinking in terms of 'information that might accidentally become known to others through naive LessWrong newbies sharing it on Twitter'.
Rather, we need to be thinking in terms of 'how could we actually prevent the military intelligence analysts of rival superpowers from being able to access this information'?
My personal hunch is that there are very few ways we could set up sites, security protocols, and vetting methods that would be sufficient to prevent access by a determined government. Which would mean, in practice, that we'd be sharing our infohazards only with the most intelligent, capable, and dangerous agents and organizations out there.
Which is not to say we shouldn't try to be very cautious about this issue. Just that we shouldn't be naive about what the American NSA, Russian GRU, or Chinese MSS would be capable of.
↑ comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2022-11-01T05:05:21.221Z · LW(p) · GW(p)
Bluntly: if you write it on Lesswrong or the Alignment Forum, or send it to a particular known person, governments will get a copy if they care to. Cybersecurity against state actors is really, really, really hard. Lesswrong is not capable of state-level cyberdefense.
If you must write it at all: do so with hardware which has been rendered physically unable to connect to the internet, and distribute only on paper, discussing only in areas without microphones. Consider authoring only on paper in the first place. Note that physical compromise of your home, workplace, and hardware is also a threat in this scenario.
(I doubt they care much, but this is basically what it takes if they do. Fortunately I think LW posters are very unlikely to be working with such high-grade secrets.)
↑ comment by habryka (habryka4) · 2022-11-01T08:41:11.556Z · LW(p) · GW(p)
Yep, we are definitely not capable of state-level or even "determined individual" level of cyberdefense.
↑ comment by Emrik (Emrik North) · 2022-11-01T15:08:44.426Z · LW(p) · GW(p)
When walls don't work, can use ofbucsation? I have no clue about this, but wouldn't it be much easier to use pbqrjbeqf for central wurds necessary for sensicle discussion so that it wouldn't be sreachalbe, and then have your talkings with people on fb or something?
Would be easily found if written on same devices or accounts used for LW, but that sounds easier to work around than literally only using paper?
↑ comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2022-11-01T17:44:34.097Z · LW(p) · GW(p)
No, this is also easy to work around; language models are good at deobfuscation and you could probably even do it with edit-distance techniques. Nor do you have enough volume of discussion to hide from humans literally just reading all of it; nor is Facebook secure against state actors, nor is your computer secure. See also Security Mindset and Ordinary Paranoia [LW · GW].
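For a concrete sense of how little it takes, here is a minimal sketch of the edit-distance point (Python's difflib similarity ratio stands in for true edit distance; the watchlist and sample text are made up):

```python
# minimal sketch: fuzzy-matching lightly misspelled terms against a watchlist.
# difflib's ratio is not literal edit distance, but it serves the same purpose here.
import difflib

WATCHLIST = ["obfuscation", "searchable", "codewords"]  # illustrative terms only

def flag_near_matches(text: str, cutoff: float = 0.75) -> dict[str, list[str]]:
    hits = {}
    for word in text.lower().split():
        matches = difflib.get_close_matches(word, WATCHLIST, n=1, cutoff=cutoff)
        if matches:
            hits[word] = matches
    return hits

sample = "can use ofbucsation so it is not sreachalbe"
print(flag_near_matches(sample))
# light misspellings still land close enough to their watchlist entries to be flagged;
# rot13-style codewords would need a different (also trivial) transform check
```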
comment by Emrik (Emrik North) · 2022-10-31T18:42:06.961Z · LW(p) · GW(p)
Yes! The way I'd like it is if LW had a "research group" feature [LW(p) · GW(p)] that anyone could start, and you could post privately to your research group.
comment by Gunnar_Zarncke · 2022-10-31T21:38:15.796Z · LW(p) · GW(p)
I like it. This is another example of AI Alignment projects needing more shared infrastructure.
comment by No77e (no77e-noi) · 2022-10-31T19:24:28.236Z · LW(p) · GW(p)
This looks like something that would also be useful for alignment orgs if they want to organize their research in silos, as Yudkowsky often suggests (assuming they haven't already implemented systems like this).
comment by Elias Schmied (EliasSchmied) · 2022-11-01T13:17:47.163Z · LW(p) · GW(p)
I've been thinking along similar lines, but instinctively, without a lot of reflection, I'm concerned about negative social effects of having an explicit community-wide list of "trusted people".
comment by Nicholas / Heather Kross (NicholasKross) · 2024-01-09T00:19:34.499Z · LW(p) · GW(p)
"Exfohazard" is a quicker way to say "information that should not be leaked". AI capabilities has progressed on seemingly-trivial breakthroughs, and now we have shorter timelines.
The more people who know and understand the "exfohazard" concept, the safer we are from AI risk.
comment by Roman Leventov · 2023-01-23T07:41:28.975Z · LW(p) · GW(p)
I have the same sentiment as you. I wrote about this here: Has private AGI research made independent safety research ineffective already? What should we do about this? [LW · GW]