Noa Nabeshima's Shortform

post by Noa Nabeshima (noa-nabeshima) · 2025-01-15T21:18:01.102Z · LW · GW · 3 comments

Comments sorted by top scores.

comment by Noa Nabeshima (noa-nabeshima) · 2025-01-15T21:18:01.247Z · LW(p) · GW(p)

TinyModel SAEs have these first entity and second entity latents.

E.g. if the story is 'Once upon a time Tim met his friend Sally.', Tim is the first entity and Sally is the second entity. The latents fire on all instances of first|second entity after the first introduction of that entity.
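As a rough illustration of what these latents seem to track, here is a toy labeler that assigns each entity mention its introduction order. It is only a sketch: `entity_pointer_labels` and the hand-supplied `entity_names` set are made up for this example, and it labels the introducing mention too, which may or may not match where the real latents start firing.

```python
def entity_pointer_labels(tokens, entity_names):
    """Label each token with the introduction order of the entity it names.

    Returns 1 for mentions of the first-introduced entity, 2 for the second,
    and so on; None for tokens that are not entity mentions.
    """
    order = {}  # entity name -> introduction rank (1-based)
    labels = []
    for tok in tokens:
        if tok in entity_names:
            if tok not in order:
                order[tok] = len(order) + 1  # first time we see this entity
            labels.append(order[tok])
        else:
            labels.append(None)
    return labels


# Hypothetical whitespace-tokenized story; a real tokenizer would differ.
tokens = "Once upon a time Tim met his friend Sally . Sally gave Tim a hammer .".split()
labels = entity_pointer_labels(tokens, {"Tim", "Sally"})
```

Here both mentions of Tim get label 1 and both mentions of Sally get label 2, regardless of which name is involved.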

I think I at one point found an 'object owned by second entity' latent but have had trouble finding it again.

I wonder if LMs are generating these reusable 'pointers' and then doing computation with the pointers. For example, to track that an object is owned by the first entity, you just need to calculate which entity mentions are instances of the first entity, calculate when the first entity is shown to own an object and write 'owned by first entity' to the object token, and then broadcast that forward to other instances of the object. Then, if you have the tokens Tim|'s (and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.

This means that the exact identity information of the object (e.g. ' hammer') and the exact identity information of the first entity (' Tim') don't need to be passed around in computations; you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.
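The bookkeeping described above can be sketched in plain Python. This is a toy simulation of the hypothesized steps, not a claim about what the model actually computes: the function names, the `<entity> 's <object>` ownership pattern, and the hand-supplied name sets are all assumptions for illustration.

```python
def annotate_ownership(tokens, entity_names, object_names):
    """Return (order, tags): entity pointers and per-token 'owned by entity k' tags."""
    order = {}     # entity name -> pointer (1 = first entity, 2 = second, ...)
    owner_of = {}  # object name -> pointer of its owner, once ownership is shown
    tags = []
    for i, tok in enumerate(tokens):
        if tok in entity_names:
            order.setdefault(tok, len(order) + 1)
        tag = None
        if tok in object_names:
            # Assumed ownership cue: <entity> 's <object>
            if i >= 2 and tokens[i - 1] == "'s" and tokens[i - 2] in order:
                owner_of[tok] = order[tokens[i - 2]]
            # Broadcast the tag to this and later mentions of the object.
            tag = owner_of.get(tok)
        tags.append(tag)
    return order, tags


def lookup_owned(tags, pointer):
    """Mimic the single-head lookup: match positions on the pointer alone."""
    return [i for i, t in enumerate(tags) if t == pointer]


story = "Tim met Sally . Tim 's hammer was red . Sally 's ball was blue . Sally liked the hammer . Tim 's".split()
order, tags = annotate_ownership(story, {"Tim", "Sally"}, {"hammer", "ball"})
positions = lookup_owned(tags, order["Tim"])  # query issued at the final 's
names = [story[i] for i in positions]         # identity is read out only here
```

The lookup compares small integer pointers only; the strings 'Tim' and 'hammer' are touched just once, when the final answer is read out.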

This suggests a more fine-grained story for what duplicate name heads are doing in IOI.

comment by Noa Nabeshima (noa-nabeshima) · 2025-01-15T21:27:37.786Z · LW(p) · GW(p)

Auto-interp is currently really, really bad.

I think o1 is the only model that seems to perform decently at auto-interp, but it's very expensive, i.e. $1 per latent label. This is frustrating to me.

comment by Noa Nabeshima (noa-nabeshima) · 2025-01-15T21:32:20.248Z · LW(p) · GW(p)

One barrier to SAE circuits is that it's currently hard to understand how attention-out SAE latents are calculated. Even if you do IG attribution patching to try to understand which earlier latents are relevant to the attention-out SAE latents, it doesn't tell you how these latents interact inside the attention layer at all.
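A toy version of the problem, assuming 'IG' here means integrated gradients: take a made-up 'attention score' that is just the product of one query-side latent and one key-side latent. The `integrated_gradients` helper below is a hand-rolled midpoint approximation written for this example, not any library's API, and the product is a stand-in for a real attention layer.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    attributions = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(len(x)):
            attributions[i] += (x[i] - baseline[i]) * grad[i] / steps
    return attributions


def score(z):
    """Toy score: query-side latent z[0] times key-side latent z[1]."""
    return z[0] * z[1]


def score_grad(z):
    """Analytic gradient of the toy score."""
    return [z[1], z[0]]


x = [2.0, 3.0]
attr = integrated_gradients(score_grad, x, baseline=[0.0, 0.0])
```

Both latents end up with the same attribution (half the score each), so the numbers say that both matter but not that one acted on the query side, the other on the key side, or that they combined multiplicatively.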
