An embedding decoder model, trained with a different objective on a different dataset, can decode another model's embeddings surprisingly accurately

post by Logan Zoellner (logan-zoellner) · 2023-09-03T11:34:20.226Z · LW · GW · 1 comments

(via twitter)

Seems pretty relevant to the Natural Categories [LW · GW] hypothesis.
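Since the linked tweet isn't reproduced here, below is a minimal numpy sketch of the general shape of the claim: a decoder fit on one simulated embedding space is reused, unchanged, on a second, differently-constructed space after a simple linear alignment. Everything in it (the random projections standing in for the two embedding models, the ridge decoder, the least-squares alignment, the dimensions) is an illustrative assumption, not the method behind the actual result.

```python
# Toy, self-contained sketch (numpy only) of the flavor of the claim.
# Assumption: both "models" embed the same latent content; nothing here
# reproduces the linked experiment or its embedding models.
import numpy as np

rng = np.random.default_rng(0)
n, d_latent, d_emb = 2000, 16, 64

# Shared "semantic content" behind the inputs.
z = rng.normal(size=(n, d_latent))

# Two different "embedding models": different linear maps of z plus noise.
emb_a = z @ rng.normal(size=(d_latent, d_emb)) + 0.05 * rng.normal(size=(n, d_emb))
emb_b = z @ rng.normal(size=(d_latent, d_emb)) + 0.05 * rng.normal(size=(n, d_emb))

# "Decoder" trained only on model A's embeddings: ridge regression back to z.
lam = 1e-2
decoder = np.linalg.solve(emb_a.T @ emb_a + lam * np.eye(d_emb), emb_a.T @ z)

# Align model B's space to model A's with ordinary least squares on a small
# set of paired examples, then reuse model A's decoder unchanged.
pairs = 200
align, *_ = np.linalg.lstsq(emb_b[:pairs], emb_a[:pairs], rcond=None)

def r2(pred, target):
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

test = slice(pairs, None)
print("decode A with A-decoder:         R^2 =", round(r2(emb_a[test] @ decoder, z[test]), 3))
print("decode aligned B with A-decoder: R^2 =", round(r2(emb_b[test] @ align @ decoder, z[test]), 3))
```

In this toy setup the transfer works because both spaces are functions of the same underlying content, which is roughly what a natural-categories story would predict about real embedding models trained on different data.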

P.S.

My current favorite story for "how we solve alignment" is

  1. Solve the natural categories hypothesis
  2. Add corrigibility [LW · GW]
  3. Combine these to build an AI that "does what I meant, not what I said"
  4. Distribute the code/a foundation model for such an AI as widely as possible so it becomes the default whenever anyone is building an AI
  5. Build some kind of "coalition of the willing" to make sure that human-compatible AI always has a big margin of advantage in terms of computation

1 comment


comment by Charlie Steiner · 2023-09-03T22:06:37.702Z · LW(p) · GW(p)

Could you explain a bit more how this is relevant to building a DWIM AI?