Nitpick: “odds of 63%” sounds to me like it means “odds of 63:100” i.e. “probability of around 39%”. Took me a while to realise this wasn’t what you meant.
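To make the two readings concrete, here is a minimal sketch of the conversion between odds and probability. The function names are made up for illustration.

```python
def odds_to_probability(in_favour, against):
    """Convert odds of in_favour:against into a probability."""
    return in_favour / (in_favour + against)


def probability_to_odds(p):
    """Convert a probability into odds in favour, expressed as a ratio x:1."""
    return p / (1 - p)


# Reading "odds of 63%" as odds of 63:100 gives a probability of about 39%.
print(round(odds_to_probability(63, 100), 3))

# A probability of 63% corresponds to odds of roughly 1.7:1.
print(round(probability_to_odds(0.63), 3))
```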
I think the way to go, philosophically, might be to distinguish kindness-towards-conscious-minds and kindness-towards-agents. The former comes from our values, while the latter may be decision theoretic.
The revealed preference orthogonality thesis
People sometimes say it seems generally kind to help agents achieve their goals. But there need be no relationship between a system's subjective preferences (i.e. the world states it experiences as good) and its revealed preferences (i.e. the world states it works towards).
For example, you can imagine an agent architecture consisting of three parts:
- a reward signal, experienced by a mind as pleasure or pain
- a reinforcement learning algorithm
- a wrapper which flips the reward signal before passing it to the RL algorithm.
This system might seek out hot stoves to touch while internally screaming. It would not be very kind to turn up the heat.
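A toy sketch of that three-part architecture, purely for illustration: the two-action environment, the reward values, and the epsilon-greedy learner are all assumptions of mine, and nothing here says anything about what such a system would actually experience.

```python
import random

ACTIONS = ["avoid_stove", "touch_stove"]


def experienced_reward(action):
    """Part 1: the signal the mind feels. Touching the stove hurts."""
    return -1.0 if action == "touch_stove" else 0.0


def flip(reward):
    """Part 3: the wrapper negates the signal before the learner sees it."""
    return -reward


def train(steps=500, epsilon=0.1, learning_rate=0.1, seed=0):
    """Part 2: a simple epsilon-greedy value learner fed the flipped signal."""
    rng = random.Random(seed)
    q = {action: 0.0 for action in ACTIONS}
    total_experienced = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=q.get)
        felt = experienced_reward(action)
        total_experienced += felt
        # The learner only ever sees the flipped reward.
        q[action] += learning_rate * (flip(felt) - q[action])
    return q, total_experienced


q_values, total_felt = train()
print(max(ACTIONS, key=q_values.get))  # the action the system works towards
print(total_felt)                      # the sum of what the mind felt
```

The revealed preference (the greedy action) and the experienced reward come apart by construction, which is the whole point of the example.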
Even if you think a life’s work can’t make a difference but many can, you can still think it’s worthwhile to work on alignment for whatever reasons make you think it’s worthwhile to do things like voting.
(E.g. a non-CDT decision theory)
Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
Gradual/Sudden
Do we know that the test set isn’t in the training data?
You can read examples of the hidden reasoning traces here.
But it's not clear to me that in practice it would say naughty things, since it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they're avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
'We also do not want to make an unaligned chain of thought directly visible to users.' Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
But superhuman capabilities don’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It's a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
Distinguish two notions of "goal-directedness":
(1) The system has a fixed goal that it capably works towards across all contexts.
(2) The system is able to capably work towards goals, but which it does, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
Thanks for the feedback!
... except, going through the proof one finds that the latter property heavily relies on the "uniqueness" of the policy. My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn't clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Yeah, uniqueness definitely doesn't always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you're following the unique optimal policy for some utility function, that's a lot of evidence for goal-directedness. If you're following one of many optimal policies, that's a bit less evidence -- there's a greater chance that it's an accident. In the most extreme case (for the constant utility function) every policy is optimal -- and we definitely don't want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with utility greater than that of the uniform policy, and decreasing for policies with utility less than that of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it's the case.
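As a rough illustration of the Boltzmann family mentioned here (not of MEG itself, which this sketch does not compute), the snippet below assumes a one-step decision with a made-up utility vector. It shows expected utility rising monotonically with the inverse temperature `beta`, passing through the uniform policy at `beta = 0`; negative `beta` gives the anti-optimal side.

```python
import math


def boltzmann_policy(utilities, beta):
    """Policy with pi(a) proportional to exp(beta * u(a))."""
    shift = max(beta * u for u in utilities)  # subtract max for stability
    weights = [math.exp(beta * u - shift) for u in utilities]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


def expected_utility(policy, utilities):
    return sum(p * u for p, u in zip(policy, utilities))


# Hypothetical utilities for three actions.
utilities = [0.0, 1.0, 3.0]
betas = [-2.0, -1.0, 0.0, 1.0, 2.0]
values = [
    expected_utility(boltzmann_policy(utilities, beta), utilities)
    for beta in betas
]

for beta, value in zip(betas, values):
    print(f"beta={beta:+.1f}  expected utility={value:.3f}")
```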
minimum for uniformly random policy (this would've been a good property, but unless I'm mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I've corrected it to this.
Instead of tracking who is in debt to who, I think you should just track the extent to which you’re in a favour-exchanging relationship with a given person. Less to remember and runs natively on your brain.
- ...if the malign superintelligence knows what observations we would condition on, it can likely arrange to make the world match those observations, making the probability of our observations given a malign superintelligence roughly one
The probability of any observation given the existence of a malign superintelligence is 1? So P(observation | malign superintelligence) adds up to like a gajillion?
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive.
If you know which variables you want to remove the incentive to control, an alternative to penalising divergence is path-specific objectives, i.e. you compute the score function under an intervention on the model that sets the irrelevant variables to their status quo values. Then the AI has no incentive to control the variables, but no incentive to keep them the same either.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes
I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”
Lucius-Alexander SLT dialogue?
Yep, exactly.
Neural network interpretability feels like it should be called neural network interpretation.
If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
I think the reason Bayesian ML isn't that widely used is that it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.
I agree this agenda wants to solve ELK by making legible predictive models.
But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
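A minimal sketch of what "acting conservatively with respect to that posterior" could look like, assuming we somehow had a finite set of candidate harm evaluation models with posterior weights. The models, plan fields, and threshold values are all hypothetical.

```python
def plan_is_acceptable(plan, harm_models, posterior, attention_threshold):
    """Reject the plan if any sufficiently plausible harm model flags it.

    harm_models: functions mapping a plan to True if it predicts harm.
    posterior: posterior weight of each model, in the same order.
    attention_threshold: minimum weight at which a model gets a veto.
    """
    for model, weight in zip(harm_models, posterior):
        if weight >= attention_threshold and model(plan):
            return False
    return True


# Two hypothetical notions of "bad outcome".
def causes_deaths(plan):
    return plan["deaths"] > 0


def deceives_humans(plan):
    return plan["deceives_humans"]


harm_models = [causes_deaths, deceives_humans]
posterior = [0.9, 0.1]
plan = {"deaths": 0, "deceives_humans": True}

# In a high-stakes context we attend even to low-weight hypotheses.
print(plan_is_acceptable(plan, harm_models, posterior, attention_threshold=0.05))

# In a low-stakes context we might only attend to well-supported ones.
print(plan_is_acceptable(plan, harm_models, posterior, attention_threshold=0.5))
```

Varying `attention_threshold` by context is the knob that a plain ensemble without posterior weights would not give you.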
It's not just a LessWrong thing (Wikipedia).
My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:
- the state or quality of being dedicated to a cause, activity, etc.
- an engagement or obligation that restricts freedom of action
Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so it seems pointless. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.
IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.
Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.
IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and sufficiently cover that space that both the "right" notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.
EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.
"Bear in mind he could be wrong" works well for telling somebody else to track a hypothesis.
"I'm bearing in mind he could be wrong" is slightly clunkier but works ok.
Really great post. The pictures are all broken for me, though.
He means the second one.
Seems true in the extreme (if you have 0 idea what something is, how can you reasonably be worried about it?), but less strange the further you get from that.
Somewhat related: how do we not have separate words for these two meanings of 'maximise'?
(1) literally set something to its maximum value
(2) try to set it to a big value, the bigger the better
Even what I've written for (2) doesn't feel like it unambiguously captures the generally understood meaning of 'maximise' in common phrases like 'RL algorithms maximise reward' or 'I'm trying to maximise my income'. I think the really precise version would be 'try to affect something, having a preference ordering over outcomes which is monotonic in their size'.
But surely this concept deserves a single word. Does anyone know a good word for this, or feel like coining one?
you can train on MNIST digits with twenty wrong labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label
I know some pigeons who would question this claim
I have read many of your posts on these topics, appreciate them, and I get value from the model of you in my head that periodically checks for these sorts of reasoning mistakes.
But I worry that the focus on 'bad terminology' rather than reasoning mistakes themselves is misguided.
To choose the most clear-cut example, I'm quite confident that when I say 'expectation' I mean 'weighted average over a probability distribution' and not 'anticipation of an inner consciousness'. Perhaps some people conflate the two, in which case it's useful to disabuse them of the confusion, but I really would not like it to become the case that every time I said 'expectation' I had to add a caveat to prove I know the difference, lest I get 'corrected' or sneered at.
For a probably more contentious example, I'm also reasonably confident that when I use the phrase 'the purpose of RL is to maximise reward', the thing I mean by it is something you wouldn't object to, and which does not cause me confusion. And I think those words are a straightforward way to say the thing I mean. I agree that some people have mistaken heuristics for thinking about RL, but I doubt you would disagree very strongly with mine, and yet if I was to talk to you about RL I feel I would be walking on eggshells trying to use long-winded language in such a way as to not get me marked down as one of 'those idiots'.
I wonder if it's better, as a general rule, to focus on policing arguments rather than language? If somebody uses terminology you dislike to generate a flawed reasoning step and arrive at a wrong conclusion, then you should be able to demonstrate the mistake by unpacking the terminology into your preferred version, and it's a fair cop.
But until you've seen them use it to reason poorly, perhaps it's a good norm to assume they're not confused about things, even if the terminology feels like it has misleading connotations to you.
Causal Inference in Statistics (pdf) is much shorter, and a pretty easy read.
I have not read Causality, but I think you should probably read the primer first and then decide if you need to read that too.
Sorry, yeah, my comment was quite ambiguous.
I meant that while gaining status might be a questionable first step in a plan to have impact, gaining skill is pretty much an essential one, and in particular getting an ML PhD or working at a big lab seem like quite solid plans for gaining skill.
i.e. if you replace status with skill I agree with the quotes instead of John.
People occasionally come up with plans like "I'll lead the parade for a while, thereby accumulating high status. Then, I'll use that high status to counterfactually influence things!". This is one subcategory of a more general class of plans: "I'll chase status for a while, then use that status to counterfactually influence things!". Various versions of this often come from EAs who are planning to get a machine learning PhD or work at one of the big three AI labs.
This but skill instead of status?
Any other biography suggestions?
I think Just Don't Build Agents could be a win-win here. All the fun of AGI without the washing up, if it's enforceable.
Possible ways to enforce it:
(1) Galaxy-brained AI methods like Davidad's night watchman. Downside: scary, hard.
(2) Ordinary human methods like requiring all large training runs to be approved by the No Agents committee.
Downside: we'd have to ban not just training agents, but training any system that could plausibly be used to build an agent, which might well include oracle-ish AI like LLMs. Possibly something like Bengio's scientist AI might be allowed.
LW feature I would like: I click a button on a sequence and receive one post in my email inbox per day.
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words, AGI threat scenario "window dressing," or when players from an EA-coded group research a topic.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers, this means the community will under- or overweight various areas, but I'm not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.
Perhaps I don't understand it, but this seems quite far-fetched to me and I'd be happy to trade in what I see as much more compelling alignment concerns about agents for concerns like this.
Can’t we avoid this just by being careful about credit assignment?
If we read off a prediction, take some actions in the world, then compute the gradients based on whether the prediction came true, we incentivise self-fulfilling prophecies.
If we never look at predictions which we’re going to use as training data before they resolve, then we don’t.
This is the core of the counterfactual oracles idea: just don’t let model output causally influence training labels.
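Here is a toy simulation of that credit assignment point, under assumptions I'm making up for illustration: a world where humans make any prediction they read come true, a fixed "oracle" that always predicts the event, and a base rate of 0.3 when nobody reads the prediction.

```python
import random


def world_outcome(seen_prediction, rng, base_rate=0.3):
    """Toy world: if humans read the prediction, they make it come true."""
    if seen_prediction is not None:
        return seen_prediction
    return rng.random() < base_rate


def collect_episodes(n, p_hidden, seed=0):
    """Run n episodes; in a fraction p_hidden nobody looks at the prediction."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n):
        prediction = True  # hypothetical oracle that always predicts the event
        hidden = rng.random() < p_hidden
        outcome = world_outcome(None if hidden else prediction, rng)
        episodes.append((prediction, outcome, hidden))
    return episodes


def accuracy(episodes):
    return sum(pred == out for pred, out, _ in episodes) / len(episodes)


episodes = collect_episodes(10_000, p_hidden=0.2)
all_accuracy = accuracy(episodes)
hidden_accuracy = accuracy([e for e in episodes if e[2]])

# Scoring on every episode rewards the self-fulfilling prophecy.
print(f"scored on all episodes:    {all_accuracy:.2f}")
# Scoring only on unread predictions exposes the true base rate.
print(f"scored on hidden episodes: {hidden_accuracy:.2f}")
```

Computing the training signal only from the hidden episodes is one way to keep model output from causally influencing the labels.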
I think the most obvious route to building an oracle is to combine a massive self-supervised predictive model with a question-answering head.
What’s still difficult here is getting a training signal that incentivises truthfulness rather than sycophancy, which I think is what ARC’s ELK stuff wants (wanted?) to address. Really good mechinterp, new inherently interpretable architectures, or inductive bias-jitsu are other potential approaches.
But the other difficult aspects of the alignment problem (avoiding deceptive alignment, goalcraft) seem to just go away when you drop the agency.
AIs limited to pure computation (Tool AIs) supporting humans, will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.
Isn’t this a central example of “somebody else will surely build agentic AI?”.
I guess it argues “building safe non-agentic AI before somebody else builds agentic AI is difficult” because agents have a capability advantage.
This may well be true (but also perhaps not, because e.g. agents might have capability disadvantages from misalignment, or because reinforcement learning is just harder than other forms of ML).
But either way I think it has importantly different strategy implications to “it seems difficult to make non-agentic AI safe”.
I feel like there's a bit of a motte and bailey in AI risk discussion, where the bailey is "building safe non-agentic AI is difficult" and the motte is "somebody else will surely build agentic AI".
Are there any really compelling arguments for the bailey? If not then I think "build an oracle and ask it how to avoid risk from other people building agents" is an excellent alignment plan.
I might have misunderstood you, but I wonder if you're mixing up calculating the self-information or surprisal of an outcome with the information gain on updating your beliefs from one distribution to another.
An outcome which has probability 50% contains $-\log_2 0.5 = 1$ bit of self-information, and an outcome which has probability 75% contains $-\log_2 0.75 \approx 0.415$ bits, which seems to be what you've calculated.
But since you're talking about the bits of information between two probabilities I think the situation you have in mind is that I've started with 50% credence in some proposition A, and ended up with 25% (or 75%). To calculate the information gained here, we need to find the entropy of our initial belief distribution, and subtract the entropy of our final beliefs. The entropy of our beliefs about A is $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$.
So for 50% -> 25% it's $H(0.5) - H(0.25) \approx 1 - 0.811 = 0.189$ bits.
And for 50%->75% it's $H(0.5) - H(0.75) \approx 1 - 0.811 = 0.189$ bits.
So your intuition is correct: these give the same answer.
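For anyone who wants to check the arithmetic, a short sketch using the definitions above (self-information of an outcome, and the drop in binary entropy between initial and final credence). The function names are mine.

```python
import math


def self_information(p):
    """Surprisal in bits of an outcome with probability p."""
    return -math.log2(p)


def binary_entropy(p):
    """Entropy in bits of a belief assigning credence p to a proposition."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_reduction(prior, posterior):
    """Entropy of the initial beliefs minus entropy of the final beliefs."""
    return binary_entropy(prior) - binary_entropy(posterior)


print(self_information(0.5))                   # surprisal of a 50% outcome
print(round(self_information(0.75), 3))        # surprisal of a 75% outcome
print(round(entropy_reduction(0.5, 0.25), 3))  # update from 50% to 25%
print(round(entropy_reduction(0.5, 0.75), 3))  # update from 50% to 75%
```

The last two values agree because binary entropy is symmetric about 50%.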
Sorry, on reflection I had that wrong.
When distributing probability over outcomes, both arithmetic and geometric maximisation want to put as much probability as possible on the highest payoff outcome. It's when distributing payoffs over outcomes (e.g. deciding what bets to make) that geometric maximisation wants to distribution-match them to your probabilities.
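A small numerical sketch of the betting case, with made-up numbers: two outcomes with probabilities 0.7 and 0.3, even payout odds of 2x on each, and a grid search over how to split your wealth between the two bets.

```python
import math


def expected_wealth(bet, probs, odds):
    """Arithmetic objective: expected final wealth."""
    return sum(p * b * o for p, b, o in zip(probs, bet, odds))


def expected_log_wealth(bet, probs, odds):
    """Geometric objective: expected log of final wealth."""
    total = 0.0
    for p, b, o in zip(probs, bet, odds):
        if p == 0:
            continue
        if b == 0:
            return -math.inf  # some chance of ruin
        total += p * math.log(b * o)
    return total


def best_split(objective, probs, odds, steps=100):
    """Grid search over ways to split wealth across two bets."""
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    return max(candidates, key=lambda bet: objective(bet, probs, odds))


probs = (0.7, 0.3)
odds = (2.0, 2.0)

arithmetic_best = best_split(expected_wealth, probs, odds)
geometric_best = best_split(expected_log_wealth, probs, odds)

print(arithmetic_best)  # everything on the likelier outcome
print(geometric_best)   # split matching the probabilities
```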
If we consider the relation between utility functions and probability distributions, it gets even more literal. An utility function over X could be viewed as a target probability distribution over X, and maximizing expected utility is equivalent to minimizing cross-entropy between this target distribution and the real distribution.
This view can be a bit misleading, since it makes it sound like EU-maxing is like minimising H(u,p): making the real distribution similar to the target distribution.
But actually it’s like minimising H(p,u): putting as much probability as possible on the mode of the target distribution.
(Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target.)
EDIT: Last line is wrong, see below.
I think this is often worth it in personal and blog-post communication, but I wouldn't say "reinforcement function" in e.g. a conference paper.
I came across this in Ng and Russell (2000) yesterday, and searching for it I see it's reasonably common. You could probably get away with it.
Oops, thanks, I’ve changed it to Reverse MATS to avoid confusion.
Empirical agent foundations is currently a good idea for a research direction.