Posts

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks. 2023-11-10T15:23:07.418Z
Introduction 2023-06-30T20:45:10.229Z
Inherently Interpretable Architectures 2023-06-30T20:43:23.176Z
Positive Attractors 2023-06-30T20:43:19.416Z
AISC 2023, Progress Report for March: Team Interpretable Architectures 2023-04-02T16:19:07.857Z

Comments

Comment by Sohaib Imran (sohaib-imran) on A thought experiment to help persuade skeptics that power-seeking AI is plausible · 2023-11-28T14:56:29.921Z · LW · GW

Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM's advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.

I found this interesting so just leaving some thoughts here: 

- The agent has learnt an instrumentally convergent goal: ask LLM when uncertain.

-The LLM is exhibiting power-seeking behaviour, due to one (or more) of the below:

  1. Goal misspecification: instead of being helpful and harmless to humans, it is trained to be (as every LLM today) helpful to everything that uses it. 
  2. being pre-trained on text consistent with power-seeking behaviour.
  3. having learnt power-seeking during fine-tuning.

I think "1. " is the key dynamic at play here. Without this, the extra agent is not required at all.  This may be a fundamental problem with how we are training LLMs. If we want LLMs to be especially subservient to humans, we might want to change this. I don't see any easy way of accomplishing this without introducing many more failure modes though.

For "2." I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs which can instantiate an LLM in one of these failure modes should diminish with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: Imitation is the wrong goal to be training the AI system with if we want an AI system that is helpful and harmless to humans.)  

I think "3. " is what most people focus on when they are asking "Why would a model power-seek upon deployment if it never had the opportunity to do so during training?". Especially since almost all discourse in power-seeking has been in the RL context.