
Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z


Comment by Maximilian Kaufmann (max-kaufmann) on William_S's Shortform · 2025-02-17T13:51:42.042Z · LW · GW

From your description I'd like to use this, but it's tough to trust or understand without a README!

Comment by Maximilian Kaufmann (max-kaufmann) on Proposal: we should start referring to the risk from unaligned AI as a type of *accident risk* · 2023-05-16T22:15:09.318Z · LW · GW

A point against that particular terminology which you might find interesting. 

Comment by Maximilian Kaufmann (max-kaufmann) on Basic Question about LLMs: how do they know what task to perform · 2023-01-14T22:15:11.247Z · LW · GW

To partially answer your question (I think the answer to "What is happening inside the LLM when it 'switches' to one task or another?" is pretty much "We don't know"), techniques such as RLHF (which nowadays are applied to pretty much any public-facing model you are likely to interact with) cause the model to act less like something searching for the most likely completion of this sentence on the internet, and more like something which is trying to answer your questions. These models would favor the "question" interpretation over the "autocomplete" one.

A purely pretrained model might be more likely to output a probability distribution over the first token that is split across both possible continuations, and would then stick with whichever interpretation it happened to sample as it autoregresses on its own output.
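
As a rough illustration of that last point, here is a toy sketch in Python. It is not a real language model: the tokens, the 50/50 probabilities, and the names `FIRST_TOKEN_PROBS`, `CONTINUATIONS`, and `generate` are all made up for the example. It only shows the shape of the behavior: the first token is sampled from a split distribution, and everything after is conditioned on that choice.

```python
import random

# Hypothetical first-token distribution, split evenly between an
# "answer the question" reading and an "autocomplete the text" reading.
# The tokens and numbers are invented for illustration.
FIRST_TOKEN_PROBS = {"Answer:": 0.5, "Another": 0.5}

# Hypothetical continuations: once the first token is fixed, the toy
# model deterministically stays in the mode that token implies.
CONTINUATIONS = {
    "Answer:": ["Paris", "."],
    "Another": ["question", "follows", "."],
}


def generate(rng):
    """Sample a first token, then continue conditioned on it."""
    tokens = list(FIRST_TOKEN_PROBS)
    weights = [FIRST_TOKEN_PROBS[t] for t in tokens]
    first = rng.choices(tokens, weights=weights, k=1)[0]
    # The sampled token is now part of the context, so the rest of the
    # output follows that interpretation rather than mixing the two.
    return [first] + CONTINUATIONS[first]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(" ".join(generate(rng)))
```

Across many samples both interpretations show up, but any single output commits to one of them from the first token onward.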

Comment by Maximilian Kaufmann (max-kaufmann) on More examples of goal misgeneralization · 2022-10-07T23:38:02.294Z · LW · GW

How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?