Posts

JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
Discussion: Challenges with Unsupervised LLM Knowledge Discovery 2023-12-18T11:58:39.379Z
Explaining grokking through circuit efficiency 2023-09-08T14:39:23.910Z
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z
Threat Model Literature Review 2022-11-01T11:03:22.610Z
Clarifying AI X-risk 2022-11-01T11:03:01.144Z
More examples of goal misgeneralization 2022-10-07T14:38:00.288Z
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms 2022-08-12T15:17:38.304Z
ELK contest submission: route understanding through the human ontology 2022-03-14T21:42:26.952Z

Comments

Comment by Vikrant Varma (amrav) on Mechanistic anomaly detection and ELK · 2022-11-28T16:18:42.804Z · LW · GW

To add some more concrete counter-examples:

  • Deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so it is included in π.
  • Alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn't agree with, which the (AGI) robber exploits.

Comment by Vikrant Varma (amrav) on Knowledge is not just mutual information · 2022-05-10T16:36:36.414Z · LW · GW

Thanks for this sequence!

I don't understand why the computer case is a counterexample for mutual information. Doesn't it depend on your priors (which don't know anything about the other background noise interacting with photons)?

Take the example of a one-time pad: given two random bit strings A and B with C = A ⊕ B, learning C tells you nothing about A unless you already have some information about B. So I(C; A) = 0 when B is uniform and independent of A.
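To make the one-time pad point concrete, here is a minimal sketch for the single-bit case. The helper names (`mutual_information`, `xor_joint`) and the specific pad probabilities are my own illustrative assumptions, not anything from the post. It builds the joint distribution of (C, A) under different priors over B and computes I(C; A) in bits.

```python
from itertools import product
from math import log2


def mutual_information(joint):
    """Mutual information I(X; Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Zero-probability cells contribute nothing to the sum.
    return sum(
        p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def xor_joint(p_b_one, p_a_one=0.5):
    """Joint distribution of (C, A) with C = A XOR B, for independent bits A and B.

    p_b_one is the prior probability that the pad bit B equals 1 (hypothetical parameter).
    """
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p = (p_a_one if a else 1 - p_a_one) * (p_b_one if b else 1 - p_b_one)
        key = (a ^ b, a)
        joint[key] = joint.get(key, 0.0) + p
    return joint


mi_uniform_pad = mutual_information(xor_joint(0.5))  # prior knows nothing about B
mi_biased_pad = mutual_information(xor_joint(0.9))   # prior is fairly sure B = 1
mi_known_pad = mutual_information(xor_joint(0.0))    # prior is certain B = 0

print(mi_uniform_pad, mi_biased_pad, mi_known_pad)
```

With a uniform prior over B the mutual information is zero, and it rises toward one full bit as the prior over B becomes more certain, which is the dependence on priors I have in mind.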

Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought.

If our prior were very certain about every factor that could interact with the photons, then the resulting imprints would indeed have high mutual information. But it seems like you can rescue mutual information here by saying that our prior is uncertain about these other factors, so the resulting imprints are noisy as well.

On the other hand, it seems correct that an entity that did have a more certain prior over interacting factors would see photon imprints as accumulating knowledge (for example photographic film).