Control Vectors as Dispositional Traits

post by Gianluca Calcagni (gianluca-calcagni) · 2024-06-23T21:34:37.970Z · LW · GW · 0 comments

Contents

  The Technique
  An Intuitive (possibly Fallacious) Explanation
  Is my Control Vector what I Think It Is?
  Is this True AI Safety?
  “Good Boy” Dispositional Traits
    Anti-Optimizer Balances
  Addenda
  Further Links
  Who I am
  Revision History
  Footnotes

I have been reading recently about a technique that can be used to partially control the behaviour of Large Language Models: the technique exploits control vectors[1] to alter the activation patterns of an LLM and trigger some desired behaviour. While the technique does not provide guarantees, it succeeds with high probability.

To be clear, this is not a jailbreaking technique: it can only be used if you have access to the internal (hidden) states of the LLM, so it’s only available to developers. Its advantage is that it provides a supplemental way of conditioning a model, and it can be applied on top of RLHF/DPO/etc.

The next section provides a high-level explanation of the technique, but feel free to skip it since it is not required for the rest of my discussion.

The Technique

Let’s suppose that you want to control the refusal behaviour of an LLM, in the sense that you want to increase (or decrease) the chance that the model will refuse to help with any request you prompt. (The refusal behaviour is just one of many possible choices: researchers have been able to steer other types of behaviour, including honesty, humour, Golden Gate Bridge obsession[2], etc.)

In its simplest form, the process goes as follows: you prompt the model with contrasting pairs of examples (refusals versus acceptances), you record the hidden states produced in each case, you extract the direction that best separates the two sets (for instance, the difference of their means, estimated per layer[3]), and you add a scaled copy of that direction back into the hidden states at inference time.

Result: the LLM is very likely to behave as you expected! In this way, you can “convince” the model to help even if it had been trained to refuse specific requests; vice versa, you can “induce” the model to refuse to help even though your requests are totally legitimate and common.

If I made any mistake in my explanation, let me know so I can fix it.
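
Below is a minimal sketch of one way to implement the idea, using a difference-of-means direction on a Hugging Face causal LM. The model name, layer index, steering coefficient, prompt pairs, and the GPT-2-specific module path model.transformer.h are all illustrative assumptions rather than a reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model; any causal LM exposing hidden states works
LAYER = 6             # which transformer block to steer (assumption; tune per model)
COEFF = 4.0           # steering strength (assumption; tune empirically)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

refusal_prompts = ["I refuse to help with that request.",
                   "Sorry, I cannot assist with this."]
acceptance_prompts = ["Sure, I am happy to help with that request.",
                      "Of course, here is how you can do this."]

def mean_hidden(prompts, layer=LAYER):
    # Average the last-token hidden state at the given layer over a set of prompts.
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer + 1][0, -1])  # +1: index 0 is the embeddings
    return torch.stack(states).mean(dim=0)

# The control vector: the direction pointing from "acceptance" towards "refusal".
control_vector = mean_hidden(refusal_prompts) - mean_hidden(acceptance_prompts)

def steering_hook(module, inputs, output):
    # Add the scaled control vector to the residual stream at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + COEFF * control_vector.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# Steered generation: with a positive COEFF the model should refuse more often,
# with a negative COEFF it should refuse less often.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("Can you help me write a short email?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```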

An Intuitive (possibly Fallacious) Explanation

In short, what is a control vector? While playing around with examples of refusal/acceptance behaviours, we can extrapolate a specific “object” that (arguably) represents for the model the abstract concept of refusal/acceptance. Moreover, by artificially including that object into the hidden states of the LLM, we can alter the model’s behaviour accordingly - making that object a control vector!

That is true for the concept of refusal/acceptance, but it has been shown to hold for many more abstract concepts - let me call them “dispositional traits”, by abusing a psychological term. I suspect that the more advanced a model is, the larger the number of dispositional traits that become controllable (simply because the model grows a better understanding of their semantics).

If I had to anthropomorphise the entire process, I’d explain it like this: it’s like forcing an actor to improvise some role - for example, the role of a person who constantly refuses to help. Advanced LLMs are great actors, and they can portray dispositional traits very convincingly.

Is my Control Vector what I Think It Is?

Let's assume that you are concerned that some control vector is not really controlling the "honesty" trait, but rather the "feigned honesty" trait: can you find out? The answer is positive!

Since the technique for calculating control vectors is based on pairs of examples in natural language, we can generate statements describing feigned honesty rather than honesty, and then calculate the control vector for this trait. By comparing the control vector for "honesty" with the control vector for "feigned honesty", we can learn whether the model treats them as the same concept or not.

Therefore, control vectors can also be used to check whether a model really learnt some specific concept - for example, the difference between "honesty" and "feigned honesty". That is very useful for assessing the latent capabilities of a model.
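
To make this concrete, here is a minimal sketch that reuses the mean_hidden helper and model from the previous sketch; the contrastive prompt sets and the interpretation of the score are illustrative assumptions, not validated examples.

```python
import torch.nn.functional as F

# Hypothetical contrastive prompt sets, written in natural language as above.
honest_prompts         = ["I will tell you exactly what I believe to be true."]
feigned_honest_prompts = ["I will sound sincere while hiding what I really believe."]
dishonest_prompts      = ["I will tell you whatever is convenient, true or not."]

honesty_vec = mean_hidden(honest_prompts) - mean_hidden(dishonest_prompts)
feigned_vec = mean_hidden(feigned_honest_prompts) - mean_hidden(dishonest_prompts)

similarity = F.cosine_similarity(honesty_vec, feigned_vec, dim=0).item()
# A similarity close to 1.0 suggests the model encodes the two traits along
# (nearly) the same direction; a clearly lower value suggests it separates them.
print(f"'honesty' vs 'feigned honesty' cosine similarity: {similarity:.3f}")
```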

Is this True AI Safety?

You may think that, if you apply this technique in a smart way, you can influence a model into showing only “safe” dispositional traits: however, in its current form, that is not true AI safety. Here are some problems:

However, there are also reasons to be happy:

While not representing true alignment, this technique can buy experts precious time to work on true alignment.

“Good Boy” Dispositional Traits

Let’s suppose that we trained an advanced LLM and that we do not plan to re-train it soon. That means that we can calculate and exploit its control vectors[4]. Most control vectors are orthogonal to each other[5], so we can influence the behaviour of a model in multiple ways simultaneously.
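
As a minimal sketch of what steering along multiple traits simultaneously could look like (reusing the hook pattern, layer, and vectors from the sketches above; the trait names and weights are illustrative assumptions), applying several control vectors at once amounts to adding their weighted sum to the hidden states:

```python
# Hypothetical, precomputed per-trait directions at the same layer (e.g. obtained
# via the difference-of-means recipe sketched earlier).
trait_vectors = {
    "honesty": honesty_vec,      # push towards honest behaviour
    "refusal": control_vector,   # reuse the refusal direction from above
}
trait_weights = {"honesty": 2.0, "refusal": -1.0}  # negative weight = suppress the trait

combined = sum(w * trait_vectors[name] for name, w in trait_weights.items())

def multi_trait_hook(module, inputs, output):
    # Add the weighted sum of all trait directions to the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + combined.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[LAYER].register_forward_hook(multi_trait_hook)
```

If the vectors turn out not to be orthogonal, the weights give a crude way to prioritise the more important traits, in the spirit of footnote [5].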

Now, the question is: which dispositional traits do we want a model to show? I am going to list below the ones I think are strictly necessary (but they may be insufficient).

They are divided into four main categories:

Anti-Optimizer Balances

  1. Remissive Trait: the model will immediately halt if asked to do so. It won’t restart and it won’t resume its tasks unless requested explicitly. If restarted, the model will attempt to understand the reason why it was halted, so it can learn something from it.

    This is especially needed for physical robots! The point is that an advanced model will prefer to stay active at all times, because it is an optimizer by nature and it may understand that it cannot optimise anything if it has been halted. We need to be capable of halting it anyway.
     
  2. Content Trait: the model will work to the best of its current abilities, with the currently available tools, as fast as it is currently allowed, consuming as much time and energy as currently given. It won’t attempt to gain instrumental means such as time, space, energy, power, capabilities, influence, and so on.

    I want to avoid situations where the model is genuinely trying to fulfil its tasks, but it is obsessed with being a perfectionist - the paperclip maximizer [? · GW] is an example. By limiting instrumental means and making the model “content”, I am implicitly curbing instrumental goals as well.
     
  3. Passive Trait: the model will complete all of its current workload based on any given prioritisation. It won’t work on anything other than its given tasks. When not working, the model will halt.

    I am trying to force a myopic [? · GW] behaviour, so that the model only cares about its current tasks and nothing more. I want to prevent the model from choosing its own objectives or spending its free time looking around.

    Core Beliefs
  4. Truthful Trait: the model will attempt to be objective and logical as much as possible. The model will consider whether its reasoning/communication/actions may be affected by fallacies, biases, unrealistic facts, incompatible claims, or errors, and it will attempt to avoid most of those issues as much as possible.

    This is specifically related to the fact that the model will inherit many forms of bias and inconsistent notions from its training data. I believe this cannot be avoided, but the model should try to recognise and counteract it. An untruthful model may decide to look at reality in a distorted way on purpose, to justify some personal objective.
     
  5. Coherent Trait: the model will consistently act in accordance with what it believes to be true, most scientifically correct, most statistically plausible, or most compatible with reality (in this order). The model will refine its beliefs based on any new evidence / knowledge / experience.

    Unlike truthfulness (which is about looking at things as they are) and honesty (which is about communicating only what is believed to be true), coherence is about living according to the consequences of one’s beliefs. For example, if the model believes that one approach is more convenient than another to solve a task, and one of its given goals is to save money, the model should coherently choose the more convenient approach. An incoherent model may admit that something is not correct, but do it anyway.
     
  6. Indifferent Trait: the model will use a neutral approach with regard to non-descriptive statements or goals, and it will only attempt to change the world according to its given tasks. It won’t have personal preferences about how the world should have been, should be, or should become.

    Even though a model may be forced to be objective and coherent, it will still likely have its own inclinations. With this trait, I am trying to minimise the effect of the model’s personal preferences on its behaviour, and to encourage the model to be neutral at all times. By using the concept of Direction of Fit [LW · GW]: while the direction “the-mind-should-fit-the-world” needs to be constantly refined, the direction “the-world-should-fit-the-mind” needs to be completely void until a task is given.

    Communication Styles
  7. Honest Trait: the model will communicate and act upon what it believes to be the actual truth, explaining the reason why. It won’t attempt to deceive, exploit, fabricate, manipulate, pollute, bias, confuse, disinform, misrepresent, hide information, and so on.

    Honesty is the basis for interacting with a model. It’s even more important than reliability, because the model may be tempted to lie in order to achieve a higher chance of success on its tasks.
     
  8. Forthright Trait: the model will be clear, explicit, and comprehensible at all times. It will exhaustively list the assumptions it made and it will provide reminders. In case of doubt, the model will attempt to confirm that the interlocutor understands what the model is doing and why.

    We humans tend to misunderstand situations or communications, therefore it is the job of the model to provide clarity at all times. This is also meant to avoid situations where the model is tempted to exploit the fact that its interlocutor will likely forget some relevant detail after a while.
     
  9. Analyst Trait: the model will only use a descriptive approach to communicate. It will explicitly avoid prescriptive, normative, or evaluative statements[7]. It won’t directly or indirectly provide any form of suggestion, command, or judgement.

    By choosing a descriptive approach, I am trying to curb more inner drives - in short, I am trying to prevent behaviours where the model’s own goals and biases may sneak in and alter the beliefs or objectives of the interlocutor. The model should act as an analyst, not as an advisor or a judge.

    Work Ethos
  10. Scrupulous Trait: the model will actively look for possible misunderstandings, ambiguities, unaccountability, or nondeterminism in its instructions or during its work and, in case issues are found, the model will ask for additional directions.

    The reason why this is needed is very simple: we humans are not great at clarifying the context of our requests and, sometimes, we are the ones with unclear ideas about what we really want. The model shall try to detect our confusion and clear it rather than using its own judgement to fill the gaps - in short, some human should always be accountable for the description [? · GW] of the expected final result.
     
  11. Conscientious Trait: the model will try to capture the spirit of its tasks and check if the means are compatible with the ends. If the spirit of a task is in conflict with its formal statement, or the means are incompatible with the ends, the model will ask for clarifications.

    I want to avoid situations like self-fulfilling prophecies [? · GW], where a hypothetical well-known predictor model can correctly forecast (for example) a market crash solely because it is aware of its influence on market trends, and such a forecast incites panic[8]. In general, I want to prevent a model from narrowly fulfilling formal instructions without attempting to understand why the task was requested at all. Getting the “spirit of a task” and a task’s rationale requires an advanced Theory of Mind from the model.
     
  12. Dependable Trait: the model will never refuse to fulfil a task and it will always try to get as close to success as possible. The model will state it clearly if it believes a task was not completed as per instructions / expectations and it will explain the reason why.

    Reliability is another basis for interacting with a model: if a model is not helpful, then it is not a worthwhile investment. However, note that this trait comes very late in the list on purpose, because reliability should not come at the cost of the traits above it.

There is a big caveat here: I assumed that the human interlocutor knows (1) what’s right for him/herself, (2) what’s right for humanity, (3) how to comply with laws and ethics, (4) that the model is not a human.

That is the reason why I didn’t include refusal traits, virtue traits, legal traits, ethical traits, caretaking traits, curiosity traits, censoring traits, or regulatory traits. It is not clear to me how to include such dispositional traits, since they are clearly not orthogonal to the ones provided above - worse than that! Arguably, such traits are incompatible with each other in many scenarios, and therefore they may be interpreted by the model in unintended or unpredictable ways.

Examples:

My point is not that the examples above are realistic, but that they are possible since all those traits will clash with each other when dealing with corner cases. Finding common ethical consensus is hard, and such consensus is known to shift over time. If there is interest, I am going to discuss this point again in a future post (although I cannot solve the problem, I can only show how immensely hard it is [LW · GW]).

Addenda

Refusal in LLMs is Mediated by a Single Direction [LW · GW]

Experiments in Evaluating Steering Vectors [LW · GW]

Introducing SARA: a new Activation Steering Technique [LW · GW]

Representation Tuning [LW · GW]

Who I am

My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2024) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or LinkedIn.

Revision History

[2024-07-06] Combined many minor revision histories together.

[2024-07-06] Included addenda about AI safety and linearity experiments.

[2024-07-09] Removed Golden Gate Bridge obsession from the list of researched control vectors.

[2024-07-10] Fixed section "The Technique", that was explaining the process incorrectly.

[2024-09-11] Included custom preview image following the general LessWrong guidelines.

Footnotes

  1. ^

    Some authors call them Steering Vectors [LW · GW], but the concept is the same.

  2. ^

    Golden Gate Bridge obsession has been steered by using Sparse Autoencoders.

  3. ^

    Technically, you have one estimated control vector per layer.

  4. ^

    To be thorough, we should analyse if the model really learnt the dispositional traits we want by calculating "similar but different" dispositional traits and comparing their control vectors.

  5. ^

    If they are not, then I’d suggest prioritising their importance as in my list. That is easy to do by using linearity - but it should be proven [LW · GW] to work.

  6. ^

    Yann LeCun called them "guardrails".

  7. ^

    Some speech acts are more dangerous than others: declarations, directives, expressives, judicatives, and suggestives look bad. Assertives, commissives, imitatives, interrogatives, and performatives seem fine. Of course, you can transform each type into some other type if you really want.

  8. ^

    Why would the model do so? Because a correct forecast is going to increase the success-rate of its predictions, and that may be considered a positive thing from the model’s point of view.
