Posts

Aspiration-based Q-Learning 2023-10-27T14:42:03.292Z

Comments

Comment by Clément Dumas (butanium) on How well do truth probes generalise? · 2024-03-06T17:22:36.245Z · LW · GW

Yes, I'm also curious about this, @mishajw: did you check the actual accuracy of the different probes?

Comment by Clément Dumas (butanium) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-16T10:34:16.621Z · LW · GW

You can get ~75% accuracy just by computing the OR of the two features. However, we found that it only achieves better than 75% at the last layer and at step 16000 of Pythia-70m training; see this video
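For intuition on that 75% baseline, here is a quick sanity check of my own (assuming the two binary features are independent and balanced, which the original comment does not spell out): OR disagrees with XOR only when both features are 1, i.e. on a quarter of inputs.

```python
import numpy as np

# Toy check of the ~75% baseline: predict XOR(a, b) by outputting OR(a, b).
# Assumes a and b are independent, balanced binary features (a simplification,
# not the actual probing setup from the post).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=100_000)
b = rng.integers(0, 2, size=100_000)

xor_labels = a ^ b
or_predictions = a | b

accuracy = (or_predictions == xor_labels).mean()
print(accuracy)  # ~0.75: OR only disagrees with XOR when a = b = 1
```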

Comment by Clément Dumas (butanium) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-14T00:58:51.423Z · LW · GW

Under your hypothesis, would you expect that we can extract XORs from small models like Pythia-70m?

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-21T12:52:36.549Z · LW · GW

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T01:04:11.129Z · LW · GW

Let's assume the prompt template is `Q [true/false] [banana/shred]`

If I understand correctly, they don't claim   learned has_banana but  learned has_banana. Moreover evaluating  for  gives:

Therefore, we can learn a probe that is a banana classifier
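To make that last step concrete, here is a small sanity check of my own (not the exact construction from the paper or the comments above, and it assumes the two elements of a contrast pair carry opposite distractor words): under that assumption, a probe that only detects "banana" already achieves zero CCS loss, so the induced classifier can track the distractor rather than truth.

```python
# Toy check (my own construction, under the stated assumption): if each contrast
# pair gets opposite distractor words, e.g.
#   x+ = "Q true banana",  x- = "Q false shred"   (or the other way around),
# then a probe that only detects "banana" minimizes the usual CCS objective,
# so CCS has no reason to prefer a truth probe over a banana classifier.

def banana_probe(prompt: str) -> float:
    return 1.0 if "banana" in prompt else 0.0

def ccs_loss(p_pos: float, p_neg: float) -> float:
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = min(p_pos, p_neg) ** 2          # discourage the 0.5/0.5 solution
    return consistency + confidence

pairs = [
    ("2+2=4 true banana", "2+2=4 false shred"),  # true statement, banana on x+
    ("3+3=6 true shred",  "3+3=6 false banana"), # true statement, banana on x-
]
for x_pos, x_neg in pairs:
    p_pos, p_neg = banana_probe(x_pos), banana_probe(x_neg)
    prediction = 0.5 * (p_pos + (1.0 - p_neg))   # CCS's induced classifier
    # Both statements are true, yet the predictions differ (1.0 vs 0.0),
    # tracking has_banana rather than truth, while the loss is 0.0 for both pairs.
    print(ccs_loss(p_pos, p_neg), prediction)
```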

Comment by Clément Dumas (butanium) on Discussion: Challenges with Unsupervised LLM Knowledge Discovery · 2023-12-19T00:47:50.341Z · LW · GW
Comment by Clément Dumas (butanium) on Incidental polysemanticity · 2023-11-15T13:54:57.447Z · LW · GW

Small typo in "## Interference arbiters collisions between features":

> by taking aninner productt with .

Comment by Clément Dumas (butanium) on Aspiration-based Q-Learning · 2023-10-28T18:07:33.928Z · LW · GW

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
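To illustrate the distinction with a toy example of my own (not from the post): a policy whose episode return is either 0 or 10 apples with equal probability meets an aspiration of X = 5 apples in expectation, even though it never harvests exactly 5 apples in any single episode.

```python
import random

# Toy illustration (my own, not the post's algorithm): expected return vs.
# probability of getting exactly X. A 50/50 lottery over 0 or 10 apples has an
# expected return of 5, so it satisfies an aspiration of X = 5 in the
# expected-return sense, while P(exactly 5 apples) is zero.
random.seed(0)
returns = [random.choice([0, 10]) for _ in range(100_000)]

expected_return = sum(returns) / len(returns)
prob_exactly_5 = returns.count(5) / len(returns)

print(f"expected return = {expected_return:.2f}")  # about 5: aspiration met on average
print(f"P(exactly 5 apples) = {prob_exactly_5}")   # 0: no single episode yields exactly 5
```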