amrav

Posts

MONA: Three Month Later - Updates and Steganography Without Optimization Pressure 2025-04-12T23:15:07.964Z

JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z

Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z

[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z

[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z

Discussion: Challenges with Unsupervised LLM Knowledge Discovery 2023-12-18T11:58:39.379Z

Explaining grokking through circuit efficiency 2023-09-08T14:39:23.910Z

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z

Threat Model Literature Review 2022-11-01T11:03:22.610Z

Clarifying AI X-risk 2022-11-01T11:03:01.144Z

More examples of goal misgeneralization 2022-10-07T14:38:00.288Z

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms 2022-08-12T15:17:38.304Z

ELK contest submission: route understanding through the human ontology 2022-03-14T21:42:26.952Z

Comments

Comment by Vikrant Varma (amrav) on MONA: Managed Myopia with Approval Feedback · 2025-01-27T10:34:22.994Z · LW · GW

We won't be able to release the dataset directly but can make it easy to reproduce, and are looking into options now. Ping me in a week if I haven’t commented!

Comment by Vikrant Varma (amrav) on Mechanistic anomaly detection and ELK · 2022-11-28T16:18:42.804Z · LW · GW

To add some more concrete counter-examples:

Deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so it is included in π.
Alien philosophy explains train output variance; unfortunately, it also has a notion of object permanence we wouldn't agree with, which the (AGI) robber exploits.

Comment by Vikrant Varma (amrav) on Knowledge is not just mutual information · 2022-05-10T16:36:36.414Z · LW · GW

Thanks for this sequence!

I don't understand why the computer case is a counterexample for mutual information. Doesn't it depend on your priors (which don't know anything about the other background noise interacting with photons)?

Taking the example of a one-time pad, given two random bit strings A and B, if C = A ⊕ B, learning C doesn't tell you anything about A unless you already have some information about B. So I(C; A) = 0 when B is uniform and independent of A.
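The one-time pad claim can be checked numerically. Below is a minimal sketch, assuming single independent bits rather than whole bit strings (for independent positions the mutual information simply adds up across bits). The function name `mutual_information_xor` and the parameters `p_a` and `p_b` are hypothetical, chosen for illustration: they are the probabilities that A and B equal 1, so `p_b` stands in for how certain the prior is about B.

```python
from itertools import product
from math import log2


def mutual_information_xor(p_a: float, p_b: float) -> float:
    """Return I(C; A) in bits for C = A XOR B.

    Assumes A ~ Bernoulli(p_a) and B ~ Bernoulli(p_b) are independent bits.
    """
    # Build the joint distribution P(A=a, C=c) by enumerating (a, b).
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        c = a ^ b
        joint[(a, c)] = joint.get((a, c), 0.0) + p

    # Marginals P(A=a) and P(C=c).
    marg_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
    marg_c = {c: sum(p for (_, y), p in joint.items() if y == c) for c in (0, 1)}

    # I(C; A) = sum over (a, c) of P(a, c) * log2(P(a, c) / (P(a) * P(c))),
    # skipping zero-probability cells.
    return sum(
        p * log2(p / (marg_a[a] * marg_c[c]))
        for (a, c), p in joint.items()
        if p > 0
    )


if __name__ == "__main__":
    for p_b in (0.5, 0.9, 1.0):
        print(f"P(B=1) = {p_b}: I(C; A) = {mutual_information_xor(0.5, p_b):.3f} bits")
```

With a uniform B the result is zero bits, and it rises toward the full one bit of A as the prior over B becomes more certain, which is the dependence on priors described above.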

"Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought."

If our prior were very certain about every factor that could interact with photons, then the resulting imprints would indeed have high mutual information. But it seems like you can rescue mutual information here by saying that our prior is uncertain about these other factors, so the resulting imprints are noisy as well.

On the other hand, it seems correct that an entity with a more certain prior over the interacting factors would see photon imprints as accumulating knowledge (for example, photographic film).
