ELK shaving

post by Miss Aligned AI · 2022-05-01T21:05:38.080Z · LW · GW · 1 comments

Contents

1 comment
> Paul Christiano's incredibly complicated schemes have no chance of working in real life before DeepMind destroys the world.
> Eliezer in Death With Dignity

 

Eliciting Latent Knowledge reads to me as an incredibly narrow slice in reasoning space, a hyperbolically branching philosophical rabbit hole of caveats.

For example, this paragraph on page 7translates as:

If you can ask whether AI is saying the truth and it answers "no" then you know it is lying.

But how is trusting this answer different from just trusting AI to not deceive you as a whole?


A hundred pages of an elaborate system with competing actors playing games of causal diagrams trying to solve for the worst case is exciting ✨ precisely because it allows one to "make progress" and have incredibly nuanced discussions (ELK shaving [1]) while failing to address the core AI safety concern:

if AI is sufficiently smart, it can do absolutely whatever

– fundamentally ignoring whatever clever constraints one might come up with.

I am confused why people might be "very optimistic" [LW · GW] about ELK, I hope I am wrong.

  1. ^“Yak shaving” means performing a seemingly endless series of small tasks that must be completed before the next step in the project can move forward. Elks are kinda like yaks

1 comments

Comments sorted by top scores.

comment by leogao · 2022-05-02T04:52:43.850Z · LW(p) · GW(p)

I disagree about ELK being useless/yak shaving/too narrow. I think it's pointing at something important and far more general than the exposition might lead one to believe. In particular, some observations:

  • The causal diagram stuff is just for concreteness and not actually a core part of the idea
  • Much of the length is laying out a series of naive strategies and explaining why they don't work. The actual "system" isn't really that elaborate
  • ELK is in large part about trying to overcome the limitations of IDA/Debate/etc
  • ELK provides an upper bound on how powerful interpretability can possibly be in theory