Comments
A point against that particular terminology that you might find interesting: https://www.lesswrong.com/posts/6bpW2kyeKaBtuJuEk/why-i-hate-the-accident-vs-misuse-ai-x-risk-dichotomy-quick
To partially answer your question (I think the answer to "What is happening inside the LLM when it 'switches' to one task or another?" is pretty much "We don't know"): techniques such as RLHF, which nowadays are applied to pretty much any public-facing model you are likely to interact with, cause the model to act less like something searching for the most likely completion of the sentence on the internet, and more like something trying to answer your questions. These models will tend to take the "question" interpretation over the "autocomplete" one.
A purely pretrained model, by contrast, would be more likely to output a probability distribution over the first token that is split between both possible continuations, and would then stick with whichever interpretation it happened to sample as it autoregresses on its own output.
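To make that concrete, here is a minimal sketch of what I mean, assuming the Hugging Face transformers library and a base (pretrained-only) checkpoint such as gpt2; the prompt and the specific tokens that come out are just illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base (pretrained-only) model, no RLHF / instruction tuning.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Write a short poem about the sea"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Distribution over the *first* generated token.
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

# A base model often splits probability mass between tokens that merely
# continue the prompt as more text and tokens that start doing the task.
top = torch.topk(probs, k=10)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}\t{p.item():.3f}")

# Whichever first token gets sampled, the model then conditions on its own
# output, so it tends to stick with that interpretation as it autoregresses.
out = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0]))
```

Running the same printout against an RLHF-tuned chat model would, I'd expect, show much more of the first-token mass on tokens that start an answer rather than a continuation of the prompt.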
How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?