Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it. I.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say this "is in some sense approximately the only and central core of the alignment problem". I'm wondering whether thinking about this validation vs training point might cause you a nontrivial update then?
For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.
One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).
I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't disturbing anyone.
I think my friends felt that because we were away from people, we weren't "stepping on the toes of any instrumentally convergent subgoals" with our noise pollution. Whereas I had the vague feeling that we were disturbing all these squirrels and pigeons or whatever that were probably sleeping in the trees, so we were "stepping on the toes of instrumentally convergent subgoals" to an awful degree.
Which is all to say, for happy instrumental convergence to be good news for other agents in your vicinity, it seems like you probably do still need to care about those agents for some reason?
I'd like to do some experiments using your loan application setting. Is it possible to share the dataset?
But - what might the model that AGI uses to downweight visibility and serve up ideas look like?
What I was meaning to get at is that your brain is an AGI that does this for you automatically.
Fine, but it still seems like a reason one could give for death being net good (which is your chief criterion for being a deathist).
I do think it's a weaker reason than the second one. The following argument in defence of it is mainly for fun:
I slightly have the feeling that it's like that decision theory problem where the devil offers you pieces of a poisoned apple one by one. First half, then a quarter, then an eighth, then a sixteenth... You'll be fine unless you eat the whole apple, in which case you'll be poisoned. Each time you're offered a piece it's rational to take it, but following that policy means you get poisoned.
The analogy is that I consider living for eternity to be scary, and you say, "well, you can stop any time". True, but it's always going to be rational for me to live for one more year, and that way lies eternity.
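To spell out the arithmetic behind the apple (my own gloss, assuming the pieces keep halving exactly as described), the amount eaten after n pieces is:

```latex
\sum_{k=1}^{n} \frac{1}{2^{k}} = 1 - \frac{1}{2^{n}} < 1 \quad \text{for every finite } n,
\qquad
\lim_{n \to \infty} \sum_{k=1}^{n} \frac{1}{2^{k}} = 1 .
```

So no single piece ever finishes the apple; only the policy of always accepting the next piece does.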
Other (more compelling to me) reasons for being a "deathist":
- Eternity can seem kinda terrifying.
- In particular, death is insurance against the worst outcomes lasting forever. Things will always return to neutral eventually and stay there.
Your brain is for holding ideas, not for having them
Notes systems are nice for storing ideas but they tend to get clogged up with stuff you don't need, and you might never see the stuff you do need again. Wouldn't it be better if
- ideas got automatically downweighted in visibility over time according to their importance, as judged by an AGI who is intimately familiar with every aspect of your life
- ideas got automatically served up to you at relevant moments as judged by that AGI.
Your brain is that notes system. On the other hand, writing notes is a great way to come up with new ideas.
and nobody else ever seems to do anything useful as a result of such fights
I would guess a large fraction of the potential value of debating these things comes from its impact on people who aren’t the main proponents of the research program, but are observers deciding on their own direction.
Is that priced in to the feeling that the debates don’t lead anywhere useful?
The notion of ‘fairness’ discussed in e.g. the FDT paper is something like: it’s fair to respond to your policy, i.e. what you would do in any counterfactual situation, but it’s not fair to respond to the way that policy is decided.
I think the hope is that you might get a result like “for all fair decision problems, decision-making procedure A is better than decision-making procedure B by some criterion to do with the outcomes it leads to”.
Without the fairness assumption you could create an instant counterexample to any such result by writing down a decision problem where decision-making procedure A is explicitly penalised e.g. omega checks if you use A and gives you minus a million points if so.
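A toy sketch of that counterexample, in case it helps. Everything here is made up for illustration (the payoffs, the option names, and the two procedures are hypothetical, not taken from the FDT paper): two procedures implement the same policy, a "fair" problem only looks at what they choose, and an "unfair" one checks which procedure is running.

```python
def procedure_a(options):
    """Hypothetical procedure A: pick the option with the highest payoff."""
    return max(options, key=options.get)


def procedure_b(options):
    """Hypothetical procedure B: same policy as A, computed differently."""
    return sorted(options, key=options.get)[-1]


def fair_problem(procedure):
    """Payoff depends only on what the procedure chooses."""
    options = {"one_box": 10, "two_box": 5}
    return options[procedure(options)]


def unfair_problem(procedure):
    """Omega inspects HOW you decide and penalises procedure A explicitly."""
    payoff = fair_problem(procedure)
    if procedure is procedure_a:
        payoff -= 1_000_000
    return payoff


fair_scores = (fair_problem(procedure_a), fair_problem(procedure_b))
unfair_scores = (unfair_problem(procedure_a), unfair_problem(procedure_b))
```

On the fair problem the two procedures are indistinguishable; on the unfair one A loses by a million points despite choosing identically, which is why a result of the form "A is at least as good as B" needs the fairness restriction.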
a Bayesian interpretation where you don't need to renormalize after every likelihood computation
How does this differ from using Bayes' rule in odds ratio form? In that case you only ever have to renormalise if at some point you want to convert to probabilities.
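For concreteness, here is a minimal sketch of what I mean by the odds form. The prior and likelihoods are numbers I made up for illustration: the odds version just multiplies by likelihood ratios, the probability version renormalises after every observation, and both end up in the same place.

```python
from fractions import Fraction


def update_odds(prior_odds, likelihood_ratios):
    """Odds form of Bayes' rule: multiply by each likelihood ratio, never renormalise."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """The single normalisation step, done only if a probability is wanted."""
    return odds / (1 + odds)


def update_probability(prior, likelihood_pairs):
    """Probability form: renormalise after every observation."""
    p = prior
    for p_e_given_h, p_e_given_not_h in likelihood_pairs:
        numerator = p * p_e_given_h
        p = numerator / (numerator + (1 - p) * p_e_given_not_h)
    return p


# Illustrative numbers only: pairs of (P(evidence | H), P(evidence | not H)).
likelihood_pairs = [
    (Fraction(4, 5), Fraction(2, 5)),
    (Fraction(1, 2), Fraction(1, 4)),
    (Fraction(3, 10), Fraction(9, 10)),
]
prior = Fraction(1, 4)

posterior_odds = update_odds(
    prior / (1 - prior), [a / b for a, b in likelihood_pairs]
)
posterior_from_odds = odds_to_probability(posterior_odds)
posterior_from_probs = update_probability(prior, likelihood_pairs)
```

Using `Fraction` keeps the comparison exact, so the two routes agree to the last digit rather than up to floating-point error.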
I think all of the following:
- process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
- outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
- outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it's using
- combining process-based feedback with outcome-based feedback:
    - pushes extra hard against legibility because it incentivises obfuscating said strategies[1]
    - unclear sign wrt faithfulness
[1] or at least has the potential to, depending on the details.
I found this really useful, thanks! I especially appreciate details like how much time you spent on slides at first, and how much you do now.
Relevant keyword: I think the term for interactions like this where players have an incentive to misreport their preferences in order to bring about their desired outcome is “not strategyproof”.
The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.
Like if you had thrown 100 coins and then revealed that 80 were heads.
Could mech interp ever be as good as chain of thought?
Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think that an AGI built with the current fingers-crossed-it's-faithful paradigm would be safe, what percentage an outcome would mech interp have to hit to beat that?
Seems like 99+ to me.
Elon Musk to: Ilya Sutskever, Greg Brockman, Sam Altman - Feb 19, 2016 12:05 AM
Frankly, what surprises me is that the AI community is taking this long to figure out concepts. It doesn't sound super hard. High-level linking of a large number of deep nets sounds like the right approach or at least a key part of the right approach
Ilya Sutskever to: Elon Musk, (cc: Greg Brockman, Sam Altman) - Feb 19, 2016 10:28 AM
Several points:
It is not the case that once we solve "concepts," we get AI. Other problems that will have to be solved include unsupervised learning, transfer learning, and lifetime learning. We're also doing pretty badly with language right now.
Seems like there was a misunderstanding here where Musk meant “the AI community is taking a long time to figure stuff out” but Ilya thought he meant “figure out how to make AI that can think in terms of concepts”.
Compute is used in two ways: it is used to run a big experiment quickly, and it is used to run many experiments in parallel.
95% of progress comes from the ability to run big experiments quickly. The utility of running many experiments is much less useful. In the old days, a large cluster could help you run more experiments, but it could not help with running a single large experiment quickly.
For this reason, an academic lab could compete with Google, because Google's only advantage was the ability to run many experiments. This is not a great advantage.
Recently, it has become possible to combine 100s of GPUs and 100s of CPUs to run an experiment that's 100x bigger than what is possible on a single machine while requiring comparable time. This has become possible due to the work of many different groups. As a result, the minimum necessary cluster for being competitive is now 10–100x larger than it was before.
I think Ilya said almost exactly this in his NeurIPS talk yesterday. Sounds like “doing the biggest experiments fast is more important than doing many experiments in parallel” is a core piece of wisdom to him.
This might be the source of Sarah Constantin’s incredulity?
Is Eliezer claiming to (train himself to perform only those steps) over thirty seconds, or train himself to (perform only those steps over thirty seconds)?
I'm confused by this argument from Redwood's AI control paper.
5.2 WHY WE DIDN’T CONSIDER STRATEGIES THAT INVOLVE TRAINING THE MODEL [...] Any protocol that involves training the model can be transformed into a validation-style protocol that merely validates the model’s performance on the dataset that was used as a train set. In any case where the validation-style protocol fails, we know that the red-team model is not distinguished from an aligned model by the measurements done by the protocol. Given that none of the training examples allow the reward-generating process to determine whether the policy it is training is undesirable, we don’t see why training would transform the model from an undesirable policy to a desirable policy. It is conceivable that we could find a training scheme that systematically converts undesirable policies to desirable ones, despite the lack of a reward differential causing it to do so. But we don’t know how.
If I understand correctly, the argument is:
1. In order for training on dataset D to convince you a model is desirable, good performance on D needs to be a sufficient condition for being desirable.
2. If good performance on D is a sufficient condition for being desirable, you can just test performance on D to decide whether a model is desirable.
But 2 seems wrong -- if good performance on D is sufficient but not necessary for being desirable, then you could have two models, one desirable and one undesirable, which are indistinguishable on D because they both perform badly, and then turn them both into desirable models by training on D.
As an extreme example, suppose your desirable model outputs correct code and your undesirable model backdoored code. Using D you train them both to output machine checkable proofs of the correctness of the code they write. Now both models are desirable because you can just check correctness before deploying the code.
I gave $290. Partly because of the personal value I get out of LW, partly because I think it's a solidly cost-effective donation.
Do we have any data on p(doom) in the LW/rationalist community? I would guess the median is lower than 35-55%.
It's not exactly clear where to draw the line, but I would guess this to be the case for, say, the 10% most active LessWrong users.
Not sure. I guess you also have to exclude policy gradient methods that make use of learned value estimates. "Learned evaluation vs sampled evaluation" is one way you could say it.
Model-based vs model-free does feel quite appropriate, shame it's used for a narrower kind of model in RL. Not sure if it's used in your sense in other contexts.
Under that definition you end up saying that what are usually called ‘model-free’ RL algorithms like Q-learning are model-based. E.g. in Connect 4, once you’ve learned that getting 3 in a row has a high value, you get credit for taking actions that lead to 3 in a row, even if you ultimately lose the game.
I think it is kinda reasonable to call Q-learning model-based, to be fair, since you can back out a lot of information about the world from the Q-values with little effort.
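Here is a minimal sketch of the Connect 4 point, using the standard tabular Q-learning update. The state names, the starting Q-values, and the learning rate are all invented for illustration.

```python
ALPHA = 0.5  # learning rate (illustrative)
GAMMA = 1.0  # no discounting (illustrative)

# Hypothetical table: the agent has already learned that three-in-a-row is valuable.
q = {
    ("opening", "build_row"): 0.0,
    ("three_in_a_row", "any_move"): 0.8,
}


def q_update(q, state, action, reward, next_state, next_actions):
    """One tabular Q-learning step, bootstrapping from the learned value of next_state."""
    bootstrap = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + GAMMA * bootstrap
    q[(state, action)] += ALPHA * (target - q[(state, action)])


# The move that reaches three-in-a-row gets credit from the learned value...
q_update(q, "opening", "build_row", 0.0, "three_in_a_row", ["any_move"])
# ...even though the game is then lost (reward -1 at the terminal state).
q_update(q, "three_in_a_row", "any_move", -1.0, "terminal", [])
```

After this lost episode the opening move's value has still gone up, because its update used the learned evaluation of the next position rather than the sampled final outcome.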
Nitpick: “odds of 63%” sounds to me like it means “odds of 63:100” i.e. “probability of around 39%”. Took me a while to realise this wasn’t what you meant.
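The arithmetic behind my misreading, as a quick sketch (the function names are just mine):

```python
def odds_to_probability(in_favour, against):
    """Convert odds of in_favour:against into a probability."""
    return in_favour / (in_favour + against)


def probability_to_odds(probability):
    """Convert a probability into odds in favour, expressed as a single ratio."""
    return probability / (1 - probability)


misread = odds_to_probability(63, 100)   # reading "odds of 63%" as 63:100
intended_odds = probability_to_odds(0.63)  # odds for a probability of 63%
```

Here `misread` comes out at roughly 0.39, whereas a probability of 63% corresponds to odds of about 1.7 to 1 in favour.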
I think the way to go, philosophically, might be to distinguish kindness-towards-conscious-minds and kindness-towards-agents. The former comes from our values, while the second may be decision theoretic.
The revealed preference orthogonality thesis
People sometimes say it seems generally kind to help agents achieve their goals. But it's possible there need be no relationship between a system's subjective preferences (i.e. the world states it experiences as good) and its revealed preferences (i.e. the world states it works towards).
For example, you can imagine an agent architecture consisting of three parts:
- a reward signal, experienced by a mind as pleasure or pain
- a reinforcement learning algorithm
- a wrapper which flips the reward signal before passing it to the RL algorithm.
This system might seek out hot stoves to touch while internally screaming. It would not be very kind to turn up the heat.
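A toy version of that architecture, as a sketch only. The two-action environment, the reward values, and the greedy learner are all assumptions of mine, chosen to keep the example deterministic.

```python
# Part 1: the reward signal as experienced by the mind (illustrative values).
EXPERIENCED_REWARD = {"hot_stove": -1.0, "comfy_chair": 1.0}


def flip(reward):
    """Part 3: the wrapper that flips the signal before the learner sees it."""
    return -reward


def run_agent(steps=20, wrapper=flip):
    """Part 2: a simple greedy value learner that only ever sees wrapped rewards."""
    estimates = {action: 0.0 for action in EXPERIENCED_REWARD}
    counts = {action: 0 for action in EXPERIENCED_REWARD}
    experienced_total = 0.0
    choices = []
    for _ in range(steps):
        untried = [a for a in estimates if counts[a] == 0]
        # Try each action once, then always take the highest estimated value.
        action = untried[0] if untried else max(estimates, key=estimates.get)
        experienced = EXPERIENCED_REWARD[action]
        experienced_total += experienced
        counts[action] += 1
        # Incremental average of the wrapped signal, not the experienced one.
        estimates[action] += (wrapper(experienced) - estimates[action]) / counts[action]
        choices.append(action)
    return choices, experienced_total


choices, experienced_total = run_agent()
```

The revealed preference is the stove (19 of 20 choices) while the experienced total is strongly negative, so helping this system get more of what it works towards makes things worse for the mind inside it.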
Even if you think a life’s work can’t make a difference but many can, you can still think it’s worthwhile to work on alignment for whatever reasons make you think it’s worthwhile to do things like voting.
(E.g. a non-CDT decision theory)
Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
Do we know that the test set isn’t in the training data?
You can read examples of the hidden reasoning traces here.
But it's not clear to me that in practice it would say naughty things, since it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they're avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
'We also do not want to make an unaligned chain of thought directly visible to users.' Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
But superhuman capabilities don’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.
My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It's a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
Distinguish two notions of "goal-directedness":
(1) The system has a fixed goal that it capably works towards across all contexts.
(2) The system is able to capably work towards goals, but which it does, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
Thanks for the feedback!
... except, going through the proof one finds that the latter property heavily relies on the "uniqueness" of the policy. My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn't clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Yeah, uniqueness definitely doesn't always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you're following the unique optimal policy for some utility function, that's a lot of evidence for goal-directedness. If you're following one of many optimal policies, that's a bit less evidence -- there's a greater chance that it's an accident. In the most extreme case (for the constant utility function) every policy is optimal -- and we definitely don't want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with utility greater than that of the uniform policy, and decreasing for policies with utility less than that of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it's the case.
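To illustrate just the Boltzmann part (this is a one-step sketch with made-up utilities, not the MEG computation from the paper and not a proof of the monotonicity claim): sweeping the inverse temperature from negative to positive moves the policy's expected utility monotonically from below the uniform policy's to above it.

```python
import math


def boltzmann_policy(utilities, beta):
    """Softmax over beta * utility; beta is the inverse temperature (may be negative)."""
    shift = max(beta * u for u in utilities)  # subtract the max for numerical stability
    weights = [math.exp(beta * u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def expected_utility(utilities, beta):
    """Expected utility of the Boltzmann policy at inverse temperature beta."""
    policy = boltzmann_policy(utilities, beta)
    return sum(p * u for p, u in zip(policy, utilities))


utilities = [0.0, 1.0, 3.0]           # illustrative action utilities
betas = [-2.0, -1.0, 0.0, 1.0, 2.0]   # beta = 0 recovers the uniform policy
values = [expected_utility(utilities, b) for b in betas]
```

At `beta = 0` the value is the uniform policy's 4/3; positive betas lean towards optimal, negative betas towards anti-optimal.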
minimum for uniformly random policy (this would've been a good property, but unless I'm mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I've corrected it to this.
Instead of tracking who is in debt to who, I think you should just track the extent to which you’re in a favour-exchanging relationship with a given person. Less to remember and runs natively on your brain.
- ...if the malign superintelligence knows what observations we would condition on, it can likely arrange to make the world match those observations, making the probability of our observations given a malign superintelligence roughly one
The probability of any observation given the existence of a malign superintelligence is 1? So P(observation | malign superintelligence) adds up to like a gajillion?
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive.
If you know which variables you want to remove the incentive to control, an alternative to penalising divergence is path-specific objectives, i.e. you compute the score function under an intervention on the model that sets the irrelevant variables to their status quo values. Then the AI has no incentive to control the variables, but no incentive to keep them the same either.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes
I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”
Lucius-Alexander SLT dialogue?
Yep, exactly.
Neural network interpretability feels like it should be called neural network interpretation.
If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
I think the reason Bayesian ML isn't that widely used is because it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.
I agree this agenda wants to solve ELK by making legible predictive models.
But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
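A rough sketch of what "acting conservatively with respect to that posterior" could look like. This is my own toy construction, not Bengio's actual proposal: the harm models, their posterior weights, and the `damage` field are all hypothetical.

```python
def plan_is_acceptable(plan, weighted_harm_models, attention_threshold):
    """Reject the plan if ANY harm model with enough posterior weight predicts harm."""
    for weight, predicts_harm in weighted_harm_models:
        if weight >= attention_threshold and predicts_harm(plan):
            return False
    return True


# Hypothetical posterior over harm evaluation models: (posterior weight, model).
posterior = [
    (0.90, lambda plan: plan["damage"] > 10),
    (0.09, lambda plan: plan["damage"] > 5),
    (0.01, lambda plan: plan["damage"] > 0),
]

plan = {"damage": 7}
lenient = plan_is_acceptable(plan, posterior, attention_threshold=0.5)
cautious = plan_is_acceptable(plan, posterior, attention_threshold=0.05)
```

With the lenient threshold only the 0.90-weight model is consulted and the plan passes; with the cautious threshold the 0.09-weight model also gets a veto and the plan is rejected. Being able to vary that threshold by context is something a single harm evaluation model can't give you.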
It's not just a lesswrong thing (wikipedia).
My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:
- the state or quality of being dedicated to a cause, activity, etc.
- an engagement or obligation that restricts freedom of action
Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so it seems pointless. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.
IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.
Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.
IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and sufficiently cover that space that both the "right" notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.
EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.