Posts

Linkpost: ‘Dissolving’ AI Risk – Parameter Uncertainty in AI Future Forecasting 2023-03-13T16:52:19.599Z
Linkpost: A Contra AI FOOM Reading List 2023-03-13T14:45:57.695Z
Linkpost: A tale of 2.5 orthogonality theses 2023-03-13T14:19:16.688Z
Counterarguments to Core AI X-Risk Stories? 2023-03-11T17:55:19.309Z
Deceptive Alignment is <1% Likely by Default 2023-02-21T15:09:27.920Z
Order Matters for Deceptive Alignment 2023-02-15T19:56:07.358Z

Comments

Comment by DavidW (david-wheaton) on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-21T01:48:22.478Z · LW · GW

Nate, please correct me if I'm wrong, but it looks like you: 

  1. Skimmed, but did not read, a 3,000-word essay
  2. Posted a 1,200-word response that clearly stated that you hadn't read it properly
  3. Ignored a comment by one of the post's authors saying you thoroughly misunderstood their post and a comment by the other author offering to have a conversation with you about it
  4. Found a different person to talk to about their views (Ronny), who also had not read their post
  5. Participated in a 7,500-word dialogue with Ronny in which you speculated about what the core arguments of the original post might be and your disagreements

You've clearly put a lot of time into this. If you want to understand the argument, why not just read the original post and talk to the authors directly? It's very well-written. 

Comment by DavidW (david-wheaton) on A rough model for P(AI doom) · 2023-05-31T13:38:48.510Z · LW · GW

If you want counterarguments, here's one good place to look: Object-Level AI Risk Skepticism - LessWrong

I expect we'll get more today, as it's the deadline for the Open Philanthropy AI Worldview Contest.

Comment by DavidW (david-wheaton) on Order Matters for Deceptive Alignment · 2023-04-24T14:50:39.464Z · LW · GW

In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A; because overseers wanted it to take action B, doing so would identify it as a misaligned model. That's the definition of a differential adversarial example. 
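
To make the definition concrete, here is a minimal sketch (my own illustrative framing with hypothetical function names, not anything from the post itself): an input counts as a differential adversarial example exactly when the model's internal goal and the overseers' intended goal favor different actions on it.

```python
def is_differential_adversarial_example(x, internal_goal_policy, intended_policy):
    # Hypothetical helper: x is a training input, and the two arguments are
    # functions mapping an input to the action favored by the model's internal
    # goal and by the overseers' intended goal, respectively. The input is a
    # differential adversarial example when those actions differ, because
    # acting on the internal goal would expose the misalignment.
    return internal_goal_policy(x) != intended_policy(x)
```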

If there were an unaligned model with no differential adversarial examples in training, that would be an example of a perfect proxy, not deceptive alignment. That's outside the scope of this post. But also, if the goal were to follow directions subject to ethical constraints, what would that perfect proxy be? What would result in the same actions across a diverse training set? It seems unlikely that you'd get even a near-perfect proxy here. And even if you did get something fairly close, the model would understand the necessary concepts for the base goal at the beginning of reinforcement learning, so why wouldn't it just learn to care about that? Setting up a diverse training environment seems likely to be the default training strategy.

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-04-24T14:29:52.165Z · LW · GW

I have a whole section on the key assumptions about the training process and why I expect them to be the default. It's all in line with what's already happening, and the labs don't have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-04-19T15:24:47.972Z · LW · GW

Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I'm explicitly not addressing other failure modes in this post. 

What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-04-19T14:30:28.677Z · LW · GW

Which assumptions are wrong? Why?

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-04-19T14:18:14.601Z · LW · GW

I don't think the specific ways people give feedback are very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I'm assuming that this is a process that enables TAI to emerge, especially the first time, and asking people who don't know about a topic to give feedback probably won't be the strategy that gets us there. Does that answer your question?

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-04-18T11:59:51.948Z · LW · GW

From Ajeya Cotra's post that I linked to: 

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.

It's not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions. 

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-03-21T20:09:10.655Z · LW · GW

Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn't directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, so it should not emerge. The one exception would be if it were already deceptively aligned, but this is a discussion of how deceptive alignment might emerge, so we are assuming that the model isn't (yet) deceptively aligned. 
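
To make the shape of the training signal concrete, here is a minimal sketch of a standard teacher-forced next-token objective (generic PyTorch-style code of my own, not any particular lab's setup): each position's loss term compares that position's prediction against the actual next token, so nothing beyond next-token performance enters the objective.

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer token ids
    logits = model(tokens[:, :-1])   # predicted distributions for each next token
    targets = tokens[:, 1:]          # the tokens that actually come next
    # The loss is an average of per-position cross-entropy terms; each term
    # depends only on how well that position's next token was predicted, so
    # the gradient rewards nothing beyond next-token accuracy.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```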

I expect pre-training to create something like a myopic prediction goal. Accomplishing this goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When the training mechanism switches to reinforcement learning, the model will not be deceptively aligned, and its goals will therefore evolve; the goals acquired in pre-training won't be dangerous and should shift under the new objective. 

This model would understand consequentialism, as do non-consequentialist humans, without having a consequentialist goal. 

Comment by DavidW (david-wheaton) on Linkpost: A tale of 2.5 orthogonality theses · 2023-03-14T17:03:23.843Z · LW · GW

I'd be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results. 

Comment by DavidW (david-wheaton) on Why deceptive alignment matters for AGI safety · 2023-03-13T13:49:47.407Z · LW · GW

Thanks for sharing your perspective! I've written up detailed arguments that deceptive alignment is unlikely by default. I'd love to hear what you think of them and how they fit into your view of the alignment landscape. 

Comment by DavidW (david-wheaton) on Deceptive Alignment · 2023-03-13T13:32:02.349Z · LW · GW

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” 

Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that. 

However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

A model needs situational awareness, a long-term goal, a way to tell whether it's in training, and a way to identify the base goal as something distinct from its internal goal in order to become deceptively aligned. To become corrigibly aligned, all the model has to do is infer the training objective and then point at it. The latter scenario seems much more likely.

Because we will likely start with something that includes a pre-trained language model, the research process will almost certainly include a direct description of the base goal. It would be weird for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for that. The key concepts should already exist from pre-training. 

Comment by DavidW (david-wheaton) on Distillation of "How Likely Is Deceptive Alignment?" · 2023-03-13T12:54:53.884Z · LW · GW

Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I'd be interested to hear what you think of it!

Comment by DavidW (david-wheaton) on How likely is deceptive alignment? · 2023-03-13T12:52:57.124Z · LW · GW

This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I'd love to hear what you think of it and discuss further!

Comment by DavidW (david-wheaton) on Robin Hanson’s latest AI risk position statement · 2023-03-03T18:48:35.531Z · LW · GW

I recently made an inside view argument that deceptive alignment is unlikely. It doesn't cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I'd love to hear what you think of it!

Comment by DavidW (david-wheaton) on A case for capabilities work on AI as net positive · 2023-03-02T16:43:08.073Z · LW · GW

This is an interesting point, but it doesn't undermine the case that deceptive alignment is unlikely. Suppose that a model doesn't have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn't understand the correct abstraction, it can't instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can't be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction. The model's goal will still point to that, and its alignment will improve. This should continue to happen until the base abstraction is correct. For more details, see my comment here

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-03-02T15:54:14.175Z · LW · GW

1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren't the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like "maximize reward in the next hour or so." Or maaaaaaybe "Do what humans watching you and rating your actions would rate highly," though that's a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.

I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”

With the level of LLM progress we already have, I think it's time to move away from talking about this in terms of traditional RL, where you can't give the model instructions and just have to hope it learns from the feedback signal alone. Realistic training scenarios should include directional prompts. Do you agree?

I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology? 

Doesn't this prove too much though? Doesn't it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?

Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals. 

I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn't require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation. 
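
Here is a toy, one-dimensional sketch of the contrast I have in mind (purely illustrative; it is not a serious model of either process): a gradient step is a single deterministic local move, while an evolution-like step mutates a whole population at random and keeps survivors by a fitness-weighted lottery.

```python
import random

def gradient_step(theta, grad, lr=0.1):
    # One deterministic, local move downhill from the current point.
    return theta - lr * grad(theta)

def evolution_step(population, fitness, mutation_scale=0.1):
    # Every point spawns a randomly mutated offspring, and survival is a
    # weighted lottery based on fitness (assumed non-negative here), so the
    # population drifts stochastically rather than marching straight downhill.
    offspring = [p + random.gauss(0, mutation_scale) for p in population]
    weights = [fitness(p) for p in offspring]
    return random.choices(offspring, weights=weights, k=len(population))
```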

As mentioned in my response to your other comment, humans seem to decide their values in a way that's complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups, which seems clearly beneficial for the survival of genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?

Your definition of deception-relevant situational awareness doesn't seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?

Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?

Comment by DavidW (david-wheaton) on Order Matters for Deceptive Alignment · 2023-03-02T14:48:36.287Z · LW · GW

Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment. 

#4:

What amount of understanding of the base goal is sufficient? What if the answer is "It has to be quite a lot, otherwise it's really just a proxy that appears superficially similar to the base goal?" In that case the classic arguments for deceptive alignment would work fine.

TL;DR: the model doesn't have to explicitly represent "the training goal, whatever that turns out to mean"; it just has to point at its best estimate of the training goal, and that estimate will update over time because the model doesn't know there's a difference. 

I propose that the relevant factor here is whether the model's internal goal is the closest thing it has to a representation of the training goal. I am assuming that models will store their goal information and decision parameters in the later layers, with world modeling happening overwhelmingly before decision-making, because it doesn't make much sense for a model to spend capacity on world modeling (or anything else) after it has already made its decision. I expect the proxy to be calculated from high-level concepts in the world model, not separately from the world model.

Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X`). This model can’t tell the difference between the training goal and the internal proxy goal, so it can’t be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:

  1. Update the existing proxy goal to be closer to the training goal, or 
  2. Create a new concept for the training goal and become deceptively aligned.

Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. That new concept would have to be a better representation of the training goal on the current training example than the existing concept in order to enable deceptive alignment, and therefore better training performance. So I wouldn't expect a hyper-local update to be sufficient to make the model deceptively aligned on its own. The gradients therefore have no reason to point toward creating this new concept. On the other hand, tweaks to improve the existing understanding of the training goal would improve performance immediately. 

For example, let’s say we are training this model to follow directions from the prompt, unless they violate ethical norms (X`). The model understands that the training goal is to follow directions subject to some ethical norms but does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn’t understand the difference, it can’t act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X`). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and will therefore not be learned by gradient descent. 

Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned. 

A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and key ethical concepts. Early, short-term reward training should be able to point at those and refine the resulting concept. This should be the closest concept to the training goal, so it should fit better with my first example than my second. This would make deceptive alignment very unlikely. 

Other than direct reward optimizers, I have trouble imagining what alternate proxy concept would be correlated enough with following directions subject to ethical considerations that it would be the internal goal late enough in the process for the model to have a long-term goal and situational awareness. Can you think of one? Having a more realistic idea for a proxy goal might make this discussion more concrete. 

1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don't like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3's are chill about value drift and not particularly interested in taking over the world. Maybe I'm reading too much between the lines, but I'd say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble. 

Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.

#2: 

I guess this isn't an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn't count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn't.

Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk. 

#3: 

Presumably the brain has some sort of SGD-like process for updating the synapses over time, that's how we learn. It's probably not exactly the same but still, couldn't you run the same argument, and get a prediction that e.g., if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they'd grow up and go to college and live the rest of their life all for the sake of pursuing reward?

We know how the gradient descent mechanism works, because we wrote the code for that. 

We don’t know how the mechanism for human value learning works. The idea that observed human value learning doesn’t match up with how gradient descent works is evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism for gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and how reward updates. But accurately predicting human behavior is much harder than that. If you try to change another person’s mind about their values, they will often resist your attempts openly and stick to their guns. Persuasion is generally difficult and not straightforward. 

In a comment on my other post, you make an analogy of gradient descent for evolution. Evolution and individual human learning are extremely different processes. How could they both be relevant analogies? For what it’s worth, I think they’re both poor analogies. 

If the analogy between gradient descent and human learning were useful, I’d expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For hypothetical TAI in fine-tuning, here’s the training setup: 

  1. Training goal: following directions subject to ethical considerations. 
  2. Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates on this in a roughly deterministic way. 
  3. Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information. 

But I find the training set-up for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What’s the training goal? What’s the reward? It’s my impression that when people change their minds about things, it’s often mediated by key factors like persuasive argument, personality traits, and social proof. Reward circuitry probably is involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined. 

Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others. 

If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the reward signal parallel would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective. 

It seems like this discussion would benefit if we consulted someone with expertise on human value formation. 

5. 

I'm not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I'd expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.

Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced "fuzzy targets" with "non-maximization targets". 

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-02-28T21:50:44.113Z · LW · GW

Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced "deception" with "deceptive alignment" in both posts. Thanks for pointing that out! 

I'm intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven't thought about them nearly as much, and I don't have strong intuition for how likely they are yet, so I'm choosing to stay focused on deceptive alignment for this sequence. 

Comment by DavidW (david-wheaton) on A case for capabilities work on AI as net positive · 2023-02-27T22:43:07.114Z · LW · GW

That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That's definitely my mistake, sorry!

I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That's important and hard to do! 

Comment by DavidW (david-wheaton) on A case for capabilities work on AI as net positive · 2023-02-27T22:10:58.116Z · LW · GW

Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I'm glad you found it valuable!

For what it's worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don't address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important. 

Comment by DavidW (david-wheaton) on Does SGD Produce Deceptive Alignment? · 2023-02-27T16:09:13.147Z · LW · GW

However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.

If your training set has a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn't start out with an unaligned goal; it starts out with no goal. 

To make the deception argument work, you need to describe why deception would emerge in the first place. Assuming a model is deceptive to show a model could become deceptive is not persuasive. 

One reason to think that models will often care about cross episode rewards is that caring about the future is a natural generalization. In order for a reward function to be myopic, it must contain machinery that does something like "care about X in situations C and not situations D", which is more complicated than "care about X".

Models don't start out with long-term goals. To care about long-term goals, they would need to reason about and predict future outcomes. That's pretty sophisticated reasoning to emerge without a training incentive. Why would they learn to do that if they are trained on myopic reward? Unless a model is already deceptive, caring about future reward will have a neutral or harmful effect on training performance. And we can't assume the model is deceptive, because we're trying to describe how deceptive alignment (and long-term goals, which are necessary for deceptive alignment) would emerge in the first place. I think long-term goal development is unlikely to emerge by accident. 
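
As a concrete illustration of what I mean by training on myopic reward, here is a toy return calculation (my own simplification, not something from the original discussion): the return credited to each action comes from rewards inside the same episode only, so payoffs that only show up in later episodes contribute nothing to the objective.

```python
def within_episode_returns(rewards, gamma=0.99):
    # rewards: list of rewards from a single episode.
    # Each action is credited only with the (discounted) rewards that follow it
    # within this episode; effects on later episodes never enter the objective,
    # so there is no gradient toward cross-episode planning.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```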
 

Comment by DavidW (david-wheaton) on Does SGD Produce Deceptive Alignment? · 2023-02-27T15:31:53.741Z · LW · GW

Thanks for writing this up clearly! I don't agree that gradient descent favors deception. In fact, I've made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to have a long-run goal and situational awareness before or around the same time as it understands the base goal. I argue that this makes deceptive alignment much harder to achieve and much less likely to come from gradient descent. I'd love to hear what you think of my arguments!

Comment by DavidW (david-wheaton) on Normie response to Normie AI Safety Skepticism · 2023-02-27T14:45:19.162Z · LW · GW

Thanks for writing this! If you're interested in a detailed, object-level argument that a core AI risk story is unlikely, feel free to check out my Deceptive Alignment Skepticism sequence. It explicitly doesn't cover other risk scenarios, but I would love to hear what you think!

Comment by DavidW (david-wheaton) on Incentives and Selection: A Missing Frame From AI Threat Discussions? · 2023-02-26T14:40:17.391Z · LW · GW

Thanks for posting this! Not only does a model have to develop complex situational awareness and have a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!

Comment by DavidW (david-wheaton) on Order Matters for Deceptive Alignment · 2023-02-24T17:00:44.518Z · LW · GW

The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal

Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, the model will calculate the goal based on high-level world-model neurons. If the world model improves, those high-level abstractions will also improve. The model doesn’t have to build a representation of the goal from scratch, because the goal is computed from the world model. 
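
A minimal architectural sketch of what I am picturing (a generic PyTorch-style toy of my own, not a claim about how real models are actually organized): the decision head reads high-level features produced by the earlier "world model" layers, so any improvement to those features automatically improves the inputs the goal is computed from.

```python
import torch.nn as nn

class ToyAgent(nn.Module):
    def __init__(self, obs_dim=64, hidden_dim=128, n_actions=10):
        super().__init__()
        # Earlier layers: the "world model" that builds high-level abstractions.
        self.world_model = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Final layers: goal/decision parameters that read those abstractions,
        # rather than rebuilding a representation of the goal from scratch.
        self.decision_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs):
        features = self.world_model(obs)
        return self.decision_head(features)
```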

When the learner has a really excellent world model that can make long range predictions and so forth - good enough that it can reason itself into playing the training game for a wide class of long-term goals - then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal

Even if the model is sophisticated enough to make long-range predictions, it still has to care about the long-run for it to have an incentive to play the training game. Long-term goals are addressed extensively in this post and the next

When this happens, gradients derived from regularisation and/or loss may push the learner's objective towards one of these problematic alternatives. 

Suppose we have a model with a sufficiently aligned goal A. I will also denote an unaligned goal as U, and instrumental training-reward optimization as S. It sounds like your idea is that S gets better training performance than directly pursuing A, so the model should switch its goal to U so it can play the training game and get better performance. But if S gets better training performance than A, then the model doesn’t need to switch its goal to play the training game; S is already instrumentally valuable. Why would it switch?

Also, because the initial "pretty good" goal is not a long-range one (because it developed when the world model was not so good), it doesn't necessarily steer the learner away from possibilities like this

Wouldn’t the initial goal continue to update over time? Why would it build a second goal instead of making improvements to the original? 

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-02-22T17:09:20.205Z · LW · GW

Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later. 

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-02-22T12:40:37.501Z · LW · GW

Where do you see weak points in the argument?

Comment by DavidW (david-wheaton) on Deceptive Alignment is <1% Likely by Default · 2023-02-21T21:11:54.280Z · LW · GW

Do you think language models already exhibit deceptive alignment as defined in this post?

I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these alternative deceptive models as direct reward optimizers. Direct reward optimizers are outside of the scope of this post. 

If so, I'd be very interested to see examples of it! 

Comment by DavidW (david-wheaton) on How seriously should we take the hypothesis that LW is just wrong on how AI will impact the 21st century? · 2023-02-21T16:06:29.244Z · LW · GW

I just posted a detailed explanation of why I am very skeptical of the traditional deceptive alignment story. I'd love to hear what you think of it! 

Deceptive Alignment Skepticism - LessWrong

Comment by DavidW (david-wheaton) on Order Matters for Deceptive Alignment · 2023-02-17T15:31:44.427Z · LW · GW

If the model is sufficiently good at deception, there will be few to no differential adversarial examples.

We're talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned. 

Also, at this stage of the process, the model doesn't have goals yet, so the number of differential adversarial examples differs for each potential proxy goal. 

the vastly larger number of misaligned goals

I agree that there’s a vastly larger number of possible misaligned goals, but because we are talking about a model that is not yet deceptive, the vast majority of those misaligned goals would have a huge number of differential adversarial examples. If training involved a general goal, then I wouldn’t expect many, if any, proxies to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?

Comment by DavidW (david-wheaton) on Order Matters for Deceptive Alignment · 2023-02-16T00:04:04.767Z · LW · GW

Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge: 

  1. Goal directedness coming before an understanding of the base goal
  2. Long-term goals coming before or around the same time as an understanding of the base goal
  3. Situational awareness coming before or around the same time as an understanding of the base goal

The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without discussion of sequencing. Long-term goals (#2) are described as happening as a result of an inductive bias toward deceptive alignment, and sequencing is not highlighted for that property. Please let me know if I missed anything in your post, and apologies in advance if that’s the case. 

Do you agree that these three development orderings are necessary for deceptive alignment?