Posts

ELK First Round Contest Winners 2022-01-26T02:56:56.089Z
Apply for research internships at ARC! 2022-01-03T20:26:18.269Z
Prizes for ELK proposals 2022-01-03T20:23:25.867Z
Counterexamples to some ELK proposals 2021-12-31T17:05:10.515Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
ARC is hiring! 2021-12-14T20:09:33.977Z
Why I'm excited about Redwood Research's current project 2021-11-12T19:26:26.159Z
Comments on OpenPhil's Interpretability RFP 2021-11-05T22:36:04.733Z
EDT with updating double counts 2021-10-12T04:40:02.158Z
Secure homes for digital people 2021-10-10T15:50:02.697Z
Improving capital gains taxes 2021-07-09T05:20:05.294Z
How much chess engine progress is about adapting to bigger computers? 2021-07-07T22:35:29.245Z
Experimentally evaluating whether honesty generalizes 2021-07-01T17:47:57.847Z
paulfchristiano's Shortform 2021-06-29T01:33:14.099Z
Avoiding the instrumental policy by hiding information about humans 2021-06-13T20:00:51.597Z
Answering questions honestly given world-model mismatches 2021-06-13T18:00:08.396Z
A naive alignment strategy and optimism about generalization 2021-06-10T00:10:02.184Z
Teaching ML to answer questions honestly instead of predicting human answers 2021-05-28T17:30:03.304Z
Decoupling deliberation from competition 2021-05-25T18:50:03.879Z
Mundane solutions to exotic problems 2021-05-04T18:20:05.331Z
Low-stakes alignment 2021-04-30T00:10:06.163Z
AMA: Paul Christiano, alignment researcher 2021-04-28T18:55:39.707Z
Announcing the Alignment Research Center 2021-04-26T23:30:02.685Z
Another (outer) alignment failure story 2021-04-07T20:12:32.043Z
My research methodology 2021-03-22T21:20:07.046Z
Demand offsetting 2021-03-21T18:20:05.090Z
It’s not economically inefficient for a UBI to reduce recipient’s employment 2020-11-22T16:40:05.531Z
Hiring engineers and researchers to help align GPT-3 2020-10-01T18:54:23.551Z
“Unsupervised” translation as an (intent) alignment problem 2020-09-30T00:50:06.077Z
Distributed public goods provision 2020-09-26T21:20:05.352Z
Better priors as a safety problem 2020-07-05T21:20:02.851Z
Learning the prior 2020-07-05T21:00:01.192Z
Inaccessible information 2020-06-03T05:10:02.844Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z
Hedonic asymmetries 2020-01-26T02:10:01.323Z
Moral public goods 2020-01-26T00:10:01.803Z
Of arguments and wagers 2020-01-10T22:20:02.213Z
Prediction markets for internet points? 2019-10-27T19:30:00.898Z
AI alignment landscape 2019-10-13T02:10:01.135Z
Taxing investment income is complicated 2019-09-22T01:30:01.242Z
The strategy-stealing assumption 2019-09-16T15:23:25.339Z
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z
What failure looks like 2019-03-17T20:18:59.800Z
Security amplification 2019-02-06T17:28:19.995Z
Reliability amplification 2019-01-31T21:12:18.591Z
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z
Thoughts on reward engineering 2019-01-24T20:15:05.251Z
Learning with catastrophes 2019-01-23T03:01:26.397Z

Comments

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-25T17:18:12.956Z · LW · GW
  1. "Bad reporter" = any reporter that gives unambiguously bad answers in some situations (in the ontology identification case, basically anything other than a direct translator)
  2. "use knowledge of direct translation" = it may be hard to learn direct translation because you need a bunch of parameters to specify how to do it, but these "bad" reporters may also need the same bunch of parameters (because they do direct translation in some situations)
  3. In the "upstream" counterexample, the bad reporter does direct translation under many circumstances but then sometimes uses a different heuristic that generates a bad answer. So the model needs all the same parameters used for direct translation, as mentioned in the last point. (I think your understanding of this was roughly right.)
  4. More like: now we've learned a reporter which contains what we want and also some bad stuff, you could imagine doing something like imitative generalization (or e.g. a different regularization scheme that jointly learned multiple reporters) in order to get just what we wanted.
Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-25T03:40:22.257Z · LW · GW

I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:

  • Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).
  • If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q.
  • In the case of amplification/debate, Q2 = "Does a human with AI assistants believe a diamond is in the room?" and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren't optimistic about this.)
Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-23T21:35:29.643Z · LW · GW

In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is because it's likely to be too big to make much sense of.

We don't quite have access to the AI Bayes net---we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as "inference in a Bayes net."

So ideally a solution would use neither the human Bayes net nor the AI Bayes net.

But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we'd ultimately need to refine them into approaches that don't depend on having an explicit Bayes net).

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-21T05:41:33.950Z · LW · GW

We're going to accept submissions through February 10.

(We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)

Comment by paulfchristiano on Alex Ray's Shortform · 2022-01-19T19:38:14.269Z · LW · GW

"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?

My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.

I'm curious how this is connected to "doesn't write fiction where a human is harmed".

"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).

Comment by paulfchristiano on Alex Ray's Shortform · 2022-01-19T16:56:23.833Z · LW · GW

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-17T05:07:14.903Z · LW · GW

I'd be fine with a proposal that flips coins and fails with small probability (in every possible world).

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-17T05:05:23.614Z · LW · GW

Yes, thanks!

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-11T16:33:31.464Z · LW · GW

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-10T21:16:54.631Z · LW · GW

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of the probability, so you need to ensure that the benign hypothesis's description is within about 30 bits of the malign ones), and embeddedness alone isn't enough to get you there.
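A quick arithmetic check of how the probability figure above translates into bits, assuming the usual correspondence where each extra bit of description length costs a factor of 2 in prior probability:

```python
import math

# If malign hypotheses hold all but ~1e-9 of the posterior...
benign_share = 1 - 0.999999999  # ~1e-9

# ...then the benign hypothesis can only be recovered by confidence-threshold-style
# methods if its prior penalty relative to the malign ones is at most
# log2(1e9) ~ 30 bits, since each bit costs a factor of 2 in prior probability.
bits = math.log2(1 / benign_share)
print(f"{bits:.1f} bits")  # ~29.9, i.e. the "within 30 bits" figure
```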

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it seems vulnerable to attack again, bridging hypotheses or no).

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2022-01-10T16:33:59.885Z · LW · GW

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which John is using "embeddedness"; for example, it seems orthogonal to the way in which the situation violates the conditions for Solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a different tack.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-10T16:17:03.943Z · LW · GW

My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder.

It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.

It seems to me like the core troubles with this point are:

  • You still have finite training data, and we don't have a scheme for collecting it. This can result in inner alignment problems (and it's not clear those can be distinguished from other problems, e.g. you can't avoid them with a low-stakes assumption).
  • It's not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where the "have AI help humans improve our understanding" is to some extent just punting to the humans+AI to figure out something).
  • Even if HCH were to work well, it would probably be overtaken by internal consequentialists, and I'm not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)

I'm generally interested in crisper counterexamples since those are a bit of a mess.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-10T04:57:31.493Z · LW · GW

I think this is an important family of counterexamples not really addressed in our report (and which we're not really asking contest participants to handle, i.e. if you come up with a proposal for which this is the only counterexample we could come up with then we'd definitely give a $50k prize).

Some thoughts:

  • As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix.
  • Most importantly, we can use SGD to optimize "What is the probability I assign to answers that are consistent with other copies' answers?" instead of "What score do I get in the consistency game?". If this was the only difficulty, we'd want to be really careful about what SGD optimized. That means that we are basically left with inner alignment problems that could kill us in much more direct ways (e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I'm mostly interested in is trying to get enough of a handle on ELK that inner alignment becomes the dominant concern / we know enough about ELK to be able to work productively on inner alignment.
  • If your model is doing optimization in order to break out of the sandbox, searching over cognitive actions to find those that break out of the sandbox, what we really want to do is apply ELK at an earlier stage to the model that was evaluating cognitive actions. This is a pretty messy situation, but it looks promising enough that I'm inclined to focus on other problems first and then see how this counterexample looks later. (Similarly, you'll always have similar pathologies with gradient hacking etc. and so you need to use a messier mechanism like this to handle those in the worst case.)
  • On the other hand, if your model isn't doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it's not merely better in practice, it also seems like it gives you a large leg up with which to beat the model. In general, defending against a SmartVault which controls a ton of actuators seems harder than defending against an adversary who can only use side channels.
  • I'm normally pretty opposed to sandboxing, but I do also think that if this was all you had to do, then it is actually plausible that you could literally win the game. E.g. if the overhead on homomorphic encryption was low enough (say 1000%) that you could afford to run each reporter training datapoint homomorphically encrypted, then it seems like you could win against any adversary. I still wouldn't bet on it without thinking a lot, but I think it makes the point that there is a big quantitative gap here.
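A rough illustration of the quantitative gap mentioned in the last bullet, with made-up numbers: here "1000% overhead" is read as encrypted computation costing about 11x the plaintext computation, and the share of total training compute spent on reporter training datapoints is a pure placeholder.

```python
# Made-up numbers, purely to illustrate why encrypting only the reporter
# training datapoints might be affordable even at large overheads.
overhead = 10.0           # "1000% overhead": encryption adds 10x the plaintext cost
reporter_fraction = 0.05  # assumed (hypothetical) share of compute spent on reporter training

total_multiplier = (1 - reporter_fraction) + reporter_fraction * (1 + overhead)
print(f"total training compute multiplier: {total_multiplier:.2f}x")  # 1.50x with these numbers
```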

Overall I agree that in the worst case this kind of counterexample applies to all of the proposals in this document. Moreover, I agree that unfortunately it is (i) a complicated aesthetic judgment about what kind of counterexample to focus on first / what is a prerequisite to what, and (ii) not obvious that you can define the problem narrowly enough to avoid this counterexample without also defining-away other difficulties we are trying to address. But I'm still willing to bet on the aesthetic judgment and it does look like we can avoid this counterexample without accidentally defining away the whole problem.

(I don't think this comment covers the entire issue, I hope we'll write more about this in another report.)

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-09T19:35:24.264Z · LW · GW

As described in the report it would say "I'm not sure" when the human wasn't sure (unless you penalized that).

That said, often a human who looks at a sequence of actions would say "almost certainly the diamond is there." They might change their answer if you also told them "by the way these actions came from a powerful adversary trying to get you to think the diamond is there." What exactly the reporter says will depend on some details of e.g. how the reporter reasons about provenance.

But the main point is that in no case do you get useful information about examples where a human (with AI assistants) couldn't have figured out what was happening on their own.

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-09T17:04:41.177Z · LW · GW

I haven't written any such articles. I definitely think it's promising.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-09T17:03:30.804Z · LW · GW

In all of the counterexamples the reporter starts from the video, the actions, and the video predicted by the predictor. In order to answer questions it needs to infer the latent variables in the human's model.

Originally we described a counterexample where it copied the human inference process.

The improved counterexample is to instead use lots of computation to do the best inference it can, rather than copying the human's mediocre inference. To make the counterexample fully precise we'd need to specify an inference algorithm and other details.

We still can't do perfect inference though---there are some inference problems that just aren't computationally feasible.

(That means there's hope for creating data where the new human simulator does badly because of inference mistakes. And maybe if you are careful it will also be the case that the direct translator does better, because it effectively reuses the inference work done in the predictor? To get a proposal along these lines we'd need to describe a way to produce data that involves arbitrarily hard inference problems.)

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-07T17:08:33.533Z · LW · GW

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones that can talk, including about counterfactuals/hypotheticals, as discussed in this appendix). This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

I think that it's more complicated to talk about what models "really know" as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to very tiny models, and those are likely interesting.

(More precisely: it's always extremely subtle to talk about what models "know," but as models get smarter there are many more things that they definitely know, so it's easier to notice if you are definitely failing. And the ELK problem statement in this doc is really focused on this kind of unambiguous failure, mostly as a methodological point but also partly because the cases where AI murders you also seem to involve "definitely knowing" in the same sense.)

I think my take is that for linear/logistic regression there is no latent knowledge, but even for a fully linear 3 layer neural network, or a 2 layer network solving many related problems, there is latent knowledge and an important conceptual question about what it means to "know what they know."

Comment by paulfchristiano on Prizes for ELK proposals · 2022-01-07T06:28:07.350Z · LW · GW

The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that people doing 'digital neuroscience' would have an easier time, and find more interesting things, probing its 'digital brain.'" So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can't easily be seen to correspond to any human-familiar concepts.

I think that a lot depends on what kind of term you include.

If you just say "find more interesting things" then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don't really have any candidates for defining that in a way that does what you want.

In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like "they don't think of anything clever." In general I'd be happy to talk about concrete proposals along these lines.

(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an "interpreted" version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you'll be able to argue that is to show that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-04T15:58:50.182Z · LW · GW

Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it.

There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples.

For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent avoiding subtle manipulation which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-03T02:59:44.690Z · LW · GW

In some sense this is exactly what we want to do, and this is why we are happy with a very "narrow" version of ELK (see the appendices on narrow elicitation and why it might be enough, indirect normativity, and avoiding subtle manipulation).

But you still need to care about some sensor tampering. In particular, you need to make sure that there are actually happy humans deliberating about what to do (under local conditions that they believe are conducive to figuring out the answer), rather than merely cameras showing happy humans deliberating about what to do.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-02T17:54:04.900Z · LW · GW

If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.

In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seems to be similar to the "human operates SmartVault and knows if tampering occurred" proposal we discussed here.

I do think that excising knowledge is a substantive change; I feel like it's effectively banking on "if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything." But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it's possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge still isn't in-distribution with the hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, the possibility of invisible and untouchable cameras isn't), so conservative bounds on what humans can know will still hit the one but not the other.

Comment by paulfchristiano on Counterexamples to some ELK proposals · 2022-01-02T16:29:33.287Z · LW · GW

I would describe the overall question as "Is there a situation where an AI trained using this approach deliberately murders us?" and for ELK more specifically as "Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?" I generally don't think that much about the complexity of human values.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-01T23:29:04.494Z · LW · GW

Thanks for the kind words (and proposal)!

There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty -- as applied to the SmartVault scenario, this might look like: "train lots of diverse mappings between the AI's ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human's ontology, try to figure out what's going on". IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts

I broadly agree with "train a bunch of models and panic if any of them say something is wrong." The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix.
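A minimal sketch of that check; the names and the 0.5 threshold are made up, and it deliberately ignores the catch above (reporters that are optimized to say something scary or contrarian):

```python
def any_reporter_alarmed(reporters, situation, question="Is the diamond still in the room?"):
    """Panic if even one reporter in the ensemble says something is wrong.

    `reporters` is a list of callables mapping (situation, question) -> P("yes").
    """
    probs = [reporter(situation, question) for reporter in reporters]
    return any(p < 0.5 for p in probs)  # at least one mapping says the diamond is gone
```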

Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn't seem all that common to do this, and it's not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.

We're imagining the case where the predictor internally performs inference in a learned model, i.e. we're not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don't think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases.

I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

(We're actually planning to do a narrower contest focused on ELK proposals.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2022-01-01T23:04:16.757Z · LW · GW

I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation.

In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail.

(That methodological point isn't obvious though---it may be that precise definitions are very useful for solving the problem even if you don't need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don't recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)

Comment by paulfchristiano on Eliciting Latent Knowledge Via Hypothetical Sensors · 2021-12-31T20:48:41.765Z · LW · GW

I think this is a good approach to consider, though I'm currently skeptical this kind of thing can resolve the worst case problem.

My main concern is that models won't behave well by default when we give them hypothetical sensors that they know don't exist (especially relevant for idea #3). But on the other hand, if we need to get good signal from the sensors that actually exist then it seems like we are back in the "build a bunch of sensors and hope it's hard to tamper with them all" regime. I wrote up more detailed thoughts in Counterexamples to some ELK proposals.

Other random thoughts:

  • Agreed you shouldn't just use cameras and should include all kinds of sensors (basically everything that humans can understand, including with the help of powerful AI assistants).
  • I basically think that "facts of the matter" are all we need, and if we have truthfulness about them then we are in business (because we can defer to future humans who are better equipped to evaluate hard moral claims).
  • I think "pivotal new sensors" is very similar to the core idea in Ramana and Davidad's proposals, so I addressed them all as a group in Counterexamples to some ELK proposals.
Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-30T21:17:04.621Z · LW · GW

I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer.
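A hedged sketch of what that baseline might look like concretely. The module name, head layout, and loss weighting are invented for illustration; a real version would sit inside an actual AlphaZero-style self-play loop, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AZELKNet(nn.Module):
    """Shared trunk with AlphaZero-style policy/value heads plus a question-answering head."""

    def __init__(self, board_dim: int, question_dim: int, hidden: int = 256, n_moves: int = 362):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(board_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_moves)   # 361 points + pass for 19x19 Go
        self.value_head = nn.Linear(hidden, 1)
        # The reporter conditions on the trunk's representation plus an encoded question.
        self.reporter_head = nn.Sequential(nn.Linear(hidden + question_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 1))

    def forward(self, board, question):
        h = self.trunk(board)
        policy_logits = self.policy_head(h)
        value = torch.tanh(self.value_head(h))
        answer_logit = self.reporter_head(torch.cat([h, question], dim=-1))
        return policy_logits, value, answer_logit

def joint_loss(net, board, question, move_target, value_target, human_answer):
    """Usual Go loss plus a supervised loss on human yes/no answers about the board state."""
    policy_logits, value, answer_logit = net(board, question)
    go_loss = F.cross_entropy(policy_logits, move_target) + F.mse_loss(value.squeeze(-1), value_target)
    qa_loss = F.binary_cross_entropy_with_logits(answer_logit.squeeze(-1), human_answer.float())
    return go_loss + qa_loss  # equal weighting is arbitrary
```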

Some thoughts on this setup:

  • I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
  • There is some subtlety where AZ invokes the policy/value a bunch of times in order to make a single move. I don't think this is a fundamental complication, so from here on out I'll just talk about ELK for a single value function invocation. I don't think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
  • Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I'm most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
  • If a human doesn't know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren't expecting/requiring ELK to do anything at all. So we'd like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they'd have a rich web of concepts so that the question feels analogous to the real world case, but I think it's interesting as long as they have anything.) We never expect it to walk us through pedagogy, and we're trying to define a utility function that also doesn't require pedagogy in the real world, i.e. that is defined in terms of familiar concepts. I think it would make sense to study how to get AZ to explain basic principles of Go to someone who lacks any relevant concepts, but I don't think it would be analogous to ELK in particular.
  • I think it's important that you have access to human explanations, or answers to questions, or discussions about what concepts mean. This is the only way you're anchoring the meaning of terms, and generally important for most of the approaches. This is a central part of why we're only aiming at training the system to talk about human concepts.
  • I think it's important that AZELK is trained by humans who are much worse at Go than AZ. Otherwise it doesn't seem helpfully analogous to long-run problems. And I don't see much upside to doing such a project with experts rather than amateurs. I think that most likely you'd want to do it with Go amateurs (e.g. 10k or even weaker). It's possible that you need fairly weak humans before AZ actually has intuitions that the human couldn't arbitrate a debate about, but that would already be interesting to learn and so I wouldn't stress about it at first (and I would consider debate and amplification as "in bounds" until we could find some hard case where they failed; initial steps might not be analogous to the hardest parts of ELK but that's fine).
  • I don't expect AZELK to ever talk about why it chose a move or "what it's thinking" or so on---just to explain what it knows about the state of the board (and the states of the board it considered in its search and so on). I don't think it would be possible to detect a sabotaged version of the model.
  • You could imagine eliciting knowledge from a human expert. I think that most of the mechanisms would amount to clever incentives for compensating them. Again, I don't think the interesting part is understanding why they are making moves per se, it's just getting them to explain important facts about particular board states that you couldn't have figured out on your own. I think that many possible approaches to ELK won't be applicable to humans, e.g. you can't do regularization based on the structure of the model. Basically all you can do are behavioral incentives + applying time pressure, and that doesn't look like enough to solve the problem.

I think it's also reasonable to talk about ELK in various synthetic settings, or in the case of generative modeling (probably in domains where humans have a weak understanding). Board games seem useful because your AI can so easily be superhuman, but they can have problems because there isn't necessarily that much latent structure.

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2021-12-29T02:59:45.292Z · LW · GW

We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can't do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)

Comment by paulfchristiano on Worst-case thinking in AI alignment · 2021-12-28T16:29:22.146Z · LW · GW

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it's feasible.

Comment by paulfchristiano on The Solomonoff Prior is Malign · 2021-12-28T05:31:14.319Z · LW · GW

Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much

It seems like you can get malign behavior if you assume:

  1. There are some important decisions on which you can't get feedback.
  2. There are malign agents in the prior who can recognize those decisions.

In that case the malign agents can always defect only on important decisions where you can't get feedback.

I agree that if you can get feedback on all important decisions (and actually have time to recover from a catastrophe after getting the feedback) then malignness of the universal prior isn't important.

I don't have a clear picture of how handling embededness or reflection would make this problem go away, though I haven't thought about it carefully. For example, if you replace Solomonoff induction with a reflective oracle it seems like you have an identical problem, does that seem right to you? And similarly it seems like a creature who uses mathematical reasoning to estimate features of the universal prior would be vulnerable to similar pathologies even in a universe that is computable.

ETA: that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-28T05:23:08.653Z · LW · GW

I'm surprised to see this bullet being bitten. I can easily think of trivial examples against the claim, where we know the minimum complexity of simple things versus their naïve implementations, but I'm not sure what arguments there are for it. It sounds pretty wild to me honestly, I have no intuition algorithmic complexity works anything like that.

I don't know what you mean by an "example against the claim." I certainly agree that there is often other evidence that will improve your bet. Perhaps this is a disagreement about the term "prima facie"? 

Learning that there is a very slow algorithm for a problem is often a very important indicator that a problem is solvable, and savings like  to  seem routine (and often have very strong similarities between the algorithms). And very often the running time of one algorithm is indeed a useful indicator for the running time of a very different approach. It's possible we are thinking about different domains here, I'm mostly thinking of traditional algorithms (like graph problems, approximation algorithms, CSPs, etc.) scaled to input sizes where the computation cost is in this regime. Though it seems like the same is also true for ML (though I have much less data there and moreover all the examples are radically less crisp).

The chance that a  parameter model unlocks AGI given a  parameter model doesn't is much larger than the chance that a  parameter model unlocks AGI given a  parameter model doesn't.

This seems wrong but maybe for reasons unrelated to the matter at hand. (In general an unknown number is much more likely to lie between  and  than between  and , just as an unknown number is more likely to lie between 11 and 16 than between 26 and 31.)

I'm also unclear whether you consider this a general rule of thumb for probabilities in general, or something specific to algorithms. Would you for instance say that if there was a weak proof that we could travel interstellar with Y times better fuel energy density, then there's a 50% chance that there's a method derived from that method for interstellar travel with just  times better energy density?

I think it's a good rule of thumb for estimating numbers in general.  If you know a number is between A and B (and nothing else), where A and B are on the order of , then a log-uniform distribution between A and B is a reasonable prima facie guess.

This holds whether the number is "The best you can do on the task using method X" or "The best you can do on the task using any method we can discover in 100 years" or "The best we could do on this task with a week and some duct tape" or "The mass of a random object in the universe."
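A small sketch of that rule of thumb under a log-uniform prior; the interval endpoints below are arbitrary, chosen only to echo the 11-16 vs 26-31 example above.

```python
import math

def log_uniform_prob(lo, hi, a, b):
    """P(a <= X <= b) when X is log-uniform on [lo, hi] (assumes lo <= a <= b <= hi)."""
    return (math.log(b) - math.log(a)) / (math.log(hi) - math.log(lo))

# With a log-uniform prior on [10, 100], a number is roughly twice as likely
# to land in [11, 16] as in [26, 31].
print(f"{log_uniform_prob(10, 100, 11, 16):.3f}")  # ~0.163
print(f"{log_uniform_prob(10, 100, 26, 31):.3f}")  # ~0.076
```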

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-28T05:10:48.375Z · LW · GW

If you think the probability derived from the upper limit set by evolutionary brute force should be spread out uniformly over the next 20 orders of magnitude, then I assume you think that if we bought 4 orders of magnitude today, there is a 20% chance that a method derived from evolutionary brute force will give us AGI? Whereas I would put that probability much lower, since brute force evolution is not nearly powerful enough at those scales.

I don't know what "derived from evolutionary brute force" means (I don't think anyone has said those words anywhere in this thread other than you?)

But in terms of P(AGI), I think that "20% for next 4 orders of magnitude" is a fine prima facie estimate if you bring in this single consideration and nothing else. Of course I don't think anyone would ever do that, but frankly I still think "20% for the next 4 orders of magnitude" is still better than most communities' estimates.

Comment by paulfchristiano on Should we rely on the speed prior for safety? · 2021-12-26T23:43:19.414Z · LW · GW

Minimal circuits are not quite the same as fastest programs---they have no adaptive computation, so you can't e.g. memorize a big table but only use part of it. In some sense it's just one more (relatively extreme) way of making a complexity-speed tradeoff. I basically agree that a GLUT is always faster than meta-learning if you have arbitrary adaptive computation.

That said, I don't think it's totally right to call a GLUT constant complexity---if you have an  bit input and  bit output, then it takes at least  operations to compute the GLUT (in any reasonable low-level model of computation). 
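A toy illustration of that point; the function and size below are arbitrary stand-ins. The table has 2^n entries for an n-bit input, and even the "constant-time" lookup has to read all n input bits to form the key.

```python
from itertools import product

def build_glut(f, n_bits):
    """Tabulate an arbitrary function of an n-bit input: 2**n_bits entries."""
    return {bits: f(bits) for bits in product((0, 1), repeat=n_bits)}

def glut_lookup(table, bits):
    # Reading the input to form the key already takes n operations.
    return table[tuple(bits)]

n = 16
table = build_glut(lambda bits: sum(bits) % 2, n)  # parity as a stand-in function
print(len(table))                      # 65536 == 2**16 entries
print(glut_lookup(table, [1, 0] * 8))  # 0
```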

There are even more speed-focused methods than minimal circuits. I think the most extreme versions are more like query complexity or communication complexity, which in some sense are just asking how fast you can make your GLUT---can you get away with reading only a small set of input bits? But being totally precise about "fastest" requires being a lot more principled about the model of computation.

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-26T09:08:35.619Z · LW · GW

I think I endorse the general schema. Namely: if I believe that we can achieve X with  flops but not  flops (holding everything else constant), then I think that gives a prima facie reason to guess a 50% chance that we could achieve it with  flops.

(This isn't fully general, like if you told me that we could achieve something with  flops but not  I'd be more inclined to guess a median of  than .)
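One way to make the schema concrete (reading it as a log-uniform guess over the unknown threshold): if the requirement is known to lie between a lower bound L and an upper bound U, the median lands at the geometric mean sqrt(L*U), i.e. the midpoint in orders of magnitude. The bounds below are made up.

```python
import math

L, U = 1e20, 1e40          # hypothetical lower/upper bounds on the required flops
median = math.sqrt(L * U)  # geometric mean = midpoint in log space
print(f"median guess: 1e{math.log10(median):.0f}")  # 1e30
```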

Comment by paulfchristiano on My Overview of the AI Alignment Landscape: Threat Models · 2021-12-26T07:08:18.090Z · LW · GW

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model)

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

To a lesser extent, "you get what you measure" may also be an obstacle to using AI systems to help us navigate complex challenges without quick feedback, like improving governance. But I don't think that's an x-risk in itself, more like a missed opportunity to do better. This is in the same category as e.g. failures of the education system, though it's plausibly better-leveraged if you have EA attitudes about AI being extremely important/leveraged. (ETA: I also view AI coordination, and differential capability progress, in a similar way.)

Comment by paulfchristiano on Reply to Eliezer on Biological Anchors · 2021-12-26T07:00:11.041Z · LW · GW

Hypothetically, if the Bio Anchors paper had claimed that strict brute-force evolution would take  ops instead of , what about your argument would actually change? It seems to me that none of it would, to any meaningful degree.

If there are 20 vs 40 orders of magnitude between "here" and "upper limit," then you end up with ~5% vs ~2.5% on the typical order of magnitude. A factor of 2 in probability seems like a large change, though I'm not sure what you mean by "to any meaningful degree."

It looks like the plausible ML extrapolations span much of the range from here to .  If we were in a different world where the upper bound was much larger, it would be more plausible for someone to think that the ML-based estimates are too low.
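The arithmetic behind those figures, spelled out (a log-uniform spread over the remaining orders of magnitude):

```python
for remaining_ooms in (20, 40):
    per_oom = 1 / remaining_ooms
    print(f"{remaining_ooms} OOMs remaining -> ~{per_oom:.1%} per order of magnitude")
# 20 OOMs -> 5.0% per OOM (so 4 OOMs get ~20%); 40 OOMs -> 2.5% per OOM.
```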

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-24T05:58:25.092Z · LW · GW

I didn't follow some parts of the new algorithm. Probably most centrally: what is Dist(S)? Is this the type of distributions over real states of the world, and if so how do we have access to the true map Camera: S --> video? Based on that I likely have some other confusions, e.g. where are the camera_sequences and action_sequences coming from in the definition of Recognizer_M, what is the prior being used to define , and don't Recognizer_M and Recognizer_H effectively advance time a lot under some kind of arbitrary sequences of actions (making them unsuitable for exactly matching up states)?

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-24T05:39:38.758Z · LW · GW

Echoing Mark and Ajeya:

I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don't feel like it's quite right to frame it as "states" that the human does or doesn't understand. Instead we're thinking about properties of the world as being ambiguous or not in a given state.

As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that.

To give some more detail on my thoughts on state:

  • Obviously the human never knows the "real" state, which has a totally different type signature than their beliefs.
  • So it's natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it's ~never the case that the human's predictions about what will happen next are nearly as good as the predictor's.
  • We could try to say "you can make good predictions about what happens next for typical actions" or something, but even for typical actions the human predictions are bad relative to the predictor, and it's not clear in what sense they are "good" other than some kind of calibration condition.
  • If we imagine an intuitive translation between two models of reality, most "weird" states aren't outside of the domain of the translation, it's just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).

Despite all of that, it seems like it really is sometimes unambiguous to say "You know that thing out there in the world that you would usually refer to by saying 'the diamond is sitting there and nothing weird happened to it'? That thing which would lead you to predict that the camera will show a still frame of a diamond? That thing definitely happened, and is why the camera is showing a still frame of a diamond, it's not for some other reason."

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-20T16:23:32.470Z · LW · GW

Suppose the value head learns to predict "Will the human be confidently wrong about the outcome of this experiment," where an 'experiment' is a natural language description of a sequence of actions that the human could execute.  And then the experiment head produces natural language descriptions of actions that a human could take for which they'd be confidently wrong.

What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?

ETA: here's my best guess after reading the other comment---after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from "an experiment that the human would be confidently wrong about," since a human who doesn't understand the environment dynamics will always have tons of experiments they are confidently wrong about; instead we want to find an experiment that causes them to update strongly toward believing that tampering occurred. Is that right?

If so it seems like there are a few problems:

  • The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).
  • If you tamper with the mechanism by which the human "executes" the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.
  • Like Mark I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.
Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T22:17:13.264Z · LW · GW

That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to ), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here.

Even for "homeostatic" tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don't know whether that's true or whether it would be higher or lower than the difficulty of sensor tampering.

While some s may indeed predict this via reasoning, not all s that behave this way would, for example an  that internally modeled the tampering sequence of actions incorrectly as actually leading to  (and didn't even model a distinct ).

I agree that some M's would mistakenly expect a sequence of actions to lead to good outcomes, when they really lead to tampering. But planning against such M's couldn't lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes) and in some sense such M's also can't be expected to defend against tampering. So I don't currently think this is a big problem though I might well be missing something.

 would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that  predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in , or null my-actuators-are-disconnected actions) with higher probability in the future of  than in the future of  (making those states separable in ), or

I generally agree with this---in some sense this kind of "definitely no signals ever" tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.

If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then  does get identified with , but some other action sequence (of similar length) would lead from  to , a state in which bizarre observations appear forever that would be extremely unlikely at any state in .

Agree with this (modulo new counterexamples), but not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to  and hence we have no idea what is going on?)

One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. the AI first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable.

This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).

My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)

I'm less convinced by this. For example, suppose that my agent first builds a machine that sits between itself and its actuators, and then filters out any actions that don't have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash of my actions, and so the check can't easily be circumvented in the same way.
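To make the structure of that counterexample concrete, here is a minimal sketch (illustrative only; the names are made up and the "actions" are just strings, so this is not code from the report or from the comment above):

```python
# Hypothetical sketch of "commit to the hash of my actions".
import hashlib

def digest(actions: list[str]) -> str:
    """Hash an entire action sequence."""
    return hashlib.sha256("\n".join(actions).encode()).hexdigest()

class ActuatorFilter:
    """A machine sitting between the agent and its actuators: it only passes an
    action sequence whose hash matches the digest committed when it was built."""

    def __init__(self, committed_digest: str):
        self.committed_digest = committed_digest

    def execute(self, actions: list[str]) -> bool:
        if digest(actions) != self.committed_digest:
            return False  # any modified plan no longer matches the commitment
        # ... here the real machine would forward the actions to the actuators ...
        return True

# The agent commits to its future action sequence, builds the filter, and only
# then carries the plan out; editing the plan later would invalidate the hash.
planned = ["step 1", "step 2", "construct successor"]
box = ActuatorFilter(digest(planned))
assert box.execute(planned)                 # the committed plan goes through
assert not box.execute(["different plan"])  # anything else is blocked
```

The commitment is to the actions themselves rather than to a successor, so the earlier trick of editing the not-yet-deployed successor has nothing to latch onto.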

I agree that regions of M that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to H's ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but

I generally agree that if we weren't worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and would be worth thinking about how to do so).

I generally think the family of approaches "This action is similar to something that demonstrably tampers" is very important to consider in practice (it has come up a number of times recently in discussions I've had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that's fair game if it fixes the problems.

I'm a bit scared that every part of M is "close" to something that is not compatible with any real-world trajectory according to H.

(a) much less conservative than typical impact-measure penalties

Definitely agree with this.

(b) if the human can learn what's going on with these regions of M and develop corresponding regions of H, then the distance penalty would be replaced by the human's informed evaluation of them.

I'm not sure I understand this 100%, but I'm interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can't understand what's going on at all even with AI assistance (and regularization doesn't work &c). But by that point maybe the counterexamples are rare enough that it's OK to just steer clear of them.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T22:05:12.069Z · LW · GW

This isn't clear to me, because "human imitation" here refers (I think) to "imitation of a human that has learned as much as possible (on the compute budget we have) from AI helpers." So as we pour more compute into the predictor, that also increases (right?) the budget for the AI helpers, which I'd think would make the imitator have to become more complex.

In the following section, you say something similar to what I say above about the "computation time" penalty ... I'm not clear on why this applies to the "computation time" penalty and not the complexity penalty

Yes, I agree that something similar applies to complexity as well as computation time. There are two big reasons I talk more about computation time:

  • It seems plausible we could generate a scalable source of computational difficulty, but it's less clear that there exists a scalable source of description complexity (rather than having some fixed upper bound on the complexity of "the best thing a human can figure out by doing science.")
  • I often imagine the assistants all sharing parameters with the predictor, or at least having a single set of parameters. If you have lots of assistant parameters that aren't shared with the predictor, then it looks like it will generally increase the training time a lot. But without doing that, it seems like there's not necessarily that much complexity the predictor doesn't already know about.
    (In contrast, we can afford to spend a ton of compute for each example at training time since we don't need that many high-quality reporter datapoints to rule out the bad reporters. So we can really have giant ratios between our compute and the compute of the model.)

But I don't think these are differences in kind and I don't have super strong views on this.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T21:59:55.525Z · LW · GW

For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.

This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply.

(Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)

This is also true for data---I'd be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons).
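As a toy illustration of the ensembling logic (hypothetical interface; nothing here is from the report): train one reporter per technique or dataset variant, and treat a case as tampering if any member of the ensemble says so.

```python
# Conservative aggregation over an ensemble of reporters (hypothetical sketch).
from typing import Callable, Sequence

Reporter = Callable[[dict], bool]  # maps a trajectory (or its activations) to "tampering?"

def tampering_flagged(reporters: Sequence[Reporter], trajectory: dict) -> bool:
    """A single positive report is enough to flag the trajectory."""
    return any(reporter(trajectory) for reporter in reporters)
```

As long as false positives are affordable, adding one more technique or dataset can only increase the chance that at least one reporter catches a given case.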

It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.

I believe this in a few cases (especially combining "compress the predictor," imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better.

Right now, the writeup talks about possible worlds in which a given regularizer could be helpful, and possible worlds in which it could be unhelpful. I'd value more discussion of the intuition for whether each one is likely to be helpful, and in particular, whether it's likely to be helpful in worlds where the previous ones are turning out unhelpful.

I don't think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I'm more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some stuff in advance, but doing so requires getting really serious about it (and even then it really wants to learn from early experiments and mostly focus on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.

I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we'd then want to give those as one input into that much messier process in order to think about what's really going to happen.

In this case, it seems like penalizing complexity, computation time, and 'downstream variables' (via rewarding reporters for requesting access to limited activations) probably make things worse. (I think this applies less to the last two regularizers listed.)

I think this is plausible for complexity and to a lesser extent for computation time. I don't think it's very plausible for the most exciting regularizers, e.g. a good version of penalizing dependence on upstream nodes or the versions of computation time that scale best (and are really trying to incentivize the model to "reuse" inference that was done in the AI model). I think I do basically believe the arguments given in those cases, e.g. I can't easily see how translation into the human ontology can be more downstream than "use the stuff to generate observations then parse those observations."

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T18:54:43.399Z · LW · GW

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI's perspective) features that make it hard for the AI to predict correctly.

I'm imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I'm scared about very alien humans for a bunch of reasons---hard for the AI to reason about, may behave strangely, and makes it harder to use "corrigible" strategies to easily satisfy their preferences. (Though that said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.)

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit's mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H's situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they're not living in the real world.

Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world.

Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about.

The reason this doesn't work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn't acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into "expand while protecting deliberation" + "execute payload.")

However, when this happens it is only because the humans along the way couldn't tell that things were going badly---they couldn't understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn't. If they had understood along the way, then they would never have gone down this route.

So at the point when the humans are thinking about this question, you may hope that they are actually ignorant about whether their AI has put them in a good situation. They are providing their views about what they want to happen in the world, hoping that their AI can achieve those outcomes in the world. The AI will only "back up" and explore a different possible future instead if it turns out that it isn't able to get the humans what they want as effectively as it would have been in some other world. But in this case the humans don't even know that this backing up is about to occur. They never evaluate the full quality of their situation; they just say "In this world, the AI fails to do what they want" (and it becomes clear the situation is bad when in every world the AI fails to do what they want).

I don't really think the strong form of this can work out, since the humans may e.g. become wiser and realize that something in their past was bad. And if they are just thinking about their own lives they may not want to report that fact since it will clearly cause them not to exist. I think it's not really clear how to handle that.

(If the problem they notice was a fact about their early deliberation that they now regret then I think this is basically a problem for any approach. If they notice a fact about the AI's early behavior that they don't like, but they are too selfish to want to "unwind" it and therefore claim to be happy with what their AI does for them, then that seems like a more distinctive problem for this approach. More generally, there is a risk that people will be looking for any signs that a possible future is "their" future and preferring it, and that this effectively removes the ability to unwind and therefore eliminates the AI's incentive to acquire resources, and that we couldn't reintroduce it without giving up on the decoupling that lets us avoid incentives for manipulation.)

(I do think that issues like this are even more severe for many other approaches people are imagining to defining values, e.g. in any version of decoupled RL you could have a problem where overseers rate their own world much better than alternatives. You could imagine approaches that avoid this by avoiding indirect normativity, but currently it looks to me like they avoid problems only by being very vague about what "values" means.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T16:23:50.242Z · LW · GW

In general we don't have an explicit representation of the human's beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing "change to Bayes net" is as "information you can give to a human that would lead them to change their predictions."

That said, we also haven't described any inference algorithm other than "ask the human." In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it.

(Though if that was the only problem then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T16:20:42.269Z · LW · GW

It will depend on how much high-quality data you need to train the reporter. Probably it's a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more data than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.
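A back-of-the-envelope version of this point (all numbers made up for illustration): if the reporter needs far fewer labeled datapoints than the predictor needs training datapoints, even a modest labeling budget translates into a lot of compute per label.

```python
# Illustrative arithmetic only; none of these figures come from the report.
predictor_datapoints = 1_000_000_000  # datapoints used to train the predictor
reporter_datapoints = 100_000         # high-quality datapoints needed for the reporter
label_budget_fraction = 0.01          # fraction of predictor training compute spent on labels

compute_per_label = label_budget_fraction * predictor_datapoints / reporter_datapoints
print(compute_per_label)  # 100.0 -> each label can cost ~100x a predictor training step
```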

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-18T23:02:48.428Z · LW · GW

I'd be scared that the "Am I tricking you?" head just works by:

  1. Predicting what the human will predict
  2. Predicting what will actually happen
  3. Outputting a high value iff the human's prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.
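A minimal sketch of that failure mode (hypothetical code; the inputs and thresholds are made up), just to show why such a head goes silent exactly when tampering fools the human's prediction:

```python
# Hypothetical sketch: a head that only compares the predicted human belief with
# the predicted outcome can flag only *detectable* tampering.

def am_i_tricking_you(human_prediction: float,
                      human_confidence: float,
                      predicted_outcome: float,
                      confidence_threshold: float = 0.9,
                      disagreement_threshold: float = 0.5) -> bool:
    """True iff the human is confident but their prediction differs from reality."""
    confident = human_confidence >= confidence_threshold
    wrong = abs(human_prediction - predicted_outcome) >= disagreement_threshold
    return confident and wrong

# Undetectable tampering is exactly the case where the human's predicted sensor
# readings match what the sensors will actually show, so this head returns False
# there even though tampering occurred.
```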

To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren't, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there's a big genre of proposals that try to leverage that kind of structure, which might be promising (though it's not the kind of thing I'm thinking about right now).

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-18T16:08:47.128Z · LW · GW

Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there's no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler's Bayes net?

Yes, that's the main way this could work. The question is whether an AI understands things that humans can't understand even by doing amplification/debate/RRM; our guess is yes, and the argument is mostly "until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses" (and we can make that more concrete by fleshing out what the world may be like and what the AI learns by gradient descent). But it seemed worth raising because this does appear to make the bad reporter's job much harder and greatly restrict the space of cases where it fails to report tampering.

Methodologically, the way I think about this kind of thing is: (i) we had a counterexample, (ii) after making this change that particular counterexample no longer works, (iii) now we want to think through whether the counterexample can be adapted.

This is also legitimately less obvious. An AI can't simulate (human+AI helpers), since each AI helper is as smart as the AI itself and so simulating (human+AI helpers) clearly requires more compute than the AI has. The counterexample is that the AI should just try its best to do inference in the Bayes net that includes "everything the human could understand with the amount of science they have time to do."

But that does still leave the builder with avenues to try to strengthen the algorithm and win. One way is discussed in the section on speed regularization: if the AI is "trying its best" to do inference in the human Bayes net then there might always be returns to having more time to think (and so it might be able to benefit by transferring over its understanding of what was happening in the AI Bayes net rather than recomputing from the observations). The next step for a builder who wanted to take this approach would be to argue that they can reliably construct a complex enough dataset that this advantage is relevant.

My guess is that this doesn't work on its own, but if you could scalably construct more complex data then it might work when combined with imitative generalization, as discussed here.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T16:58:47.735Z · LW · GW

I'm thinking of this as part of a family of proposals like:

  • Some kinds of tampering can be easily detected (and so should get identified with states in the human's model where tampering has occurred)
  • Some other tampering can't be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
  • In this case, we're going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (such that later states reached in that sequence have a much higher probability of detectable tampering, as well as undetectable tampering, and can be flagged as suspicious); see the sketch just after this list.
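A minimal sketch of that third bullet (all names hypothetical; in practice the probability estimate would have to come from the predictor itself):

```python
# Hypothetical "shared prefix" heuristic: treat a partial trajectory as suspicious
# if, according to the predictor, continuing from this prefix leads to *detectable*
# tampering unusually often -- hoping that undetectable tampering shares those steps.
from typing import Callable, Sequence

# p_detectable(prefix) -> predictor's probability that some continuation of this
# action prefix ends in tampering the human can recognize after the fact.
PDetectable = Callable[[Sequence[str]], float]

def is_suspicious(prefix: Sequence[str],
                  p_detectable: PDetectable,
                  baseline: float = 1e-4,
                  factor: float = 100.0) -> bool:
    """Flag the prefix if detectable tampering is far more likely than usual."""
    return p_detectable(prefix) >= baseline * factor
```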

I currently don't believe this kind of thing can work (at least in the worst case), but I'm not at all confident.

One thing worth mentioning (which makes me much more optimistic about this class of proposals): if your AI delegates to another AI who helps actually implement the tampering, that makes things harder, but in some sense it also provides extra structure you can work with in a solution. Your AI needs to reason about that other AI's reasoning, and so you could hope that (i) if your AI is reasoning about the other AI's reasoning in detail, then we can apply exactly the same machinery to that AI's predictive model in order to elicit its beliefs (which your AI obviously trusts) about what is going on, and (ii) if your AI reasons abstractly about the other AI's reasoning, you could hope to treat that as a shortcut to reach the same conclusions and incentivize the model to give you the same answers that it would if it were reasoning explicitly (modulo its ability to figure it out).

(I also think the single-AI case basically needs to be treated in the same way---I don't think in general your AI is going to be reasoning explicitly about its own future actions; it just needs to reason abstractly about its future self in a qualitatively similar way to how it reasons about other AIs. Or at least in the worst case its reasoning about itself is just as messy and complex as its reasoning about other AIs.)

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T16:13:39.920Z · LW · GW

The sense in which the model knows about the corruption is that it brought the corruption about and reasoned about the nature of the sensor tampering in order to predict the transition to the tampered state.

The reason I'm concerned that it brings about this state is that the actual good state is much harder to access than the tampered state (e.g. because it requires achieving hard real-world goals). The intuition is that tampering has constant difficulty while the good state gets harder and harder to reach as we make the tasks more sophisticated.

I'm OK having the human look at entire trajectories, so it's provisionally OK (though not the most satisfying) if the tampered state gets identified with the good state, as long as the preceding states look bad to the human and they can then do some kind of inference on top to figure out what happened. My main question is why those preceding states get mapped correctly.

If the preceding state gets corrupted via a mechanism that isn't obvious to the human, and otherwise looks like a predecessor to the good state, then it seems like it should get identified with a predecessor of the good state.

The only way I see this working is if the AI can "abort" its plan at the last minute and demonstrate a hack.  But my main concerns are that (i) that doesn't seem true in general---the AI can tamper with sensors in ways humans don't understand starting from many initial states, and you can't distinguish the "almost-tampered" state naturally since it's just one more state that has this property, (ii) even if that happens it doesn't look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human's model, though if I wasn't worried about (i) I'd want to think about (ii) much more carefully and actually construct a counterexample.

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T05:14:02.207Z · LW · GW

Consider a state where the sensors have been tampered with in order to "look like" some particular human state, i.e. we've connected the actuators and camera to a box which just simulates the human model (starting from that human state) and then feeds the predicted outputs of the human model to the camera.

It seems to me like this tampered state would have zero distance from the human state it imitates under all of these proposals. Does that seem right? (I didn't follow all of the details of the example, and definitely not the more general idea.)

(I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don't currently think you can make that work and it doesn't look like your solution is trying to capture that intuition.)
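To spell out why this is a counterexample to purely observational notions of distance (toy code, not any of the proposals under discussion): once the delusion box replays exactly what the human model predicts, any distance computed from predicted observations comes out to zero.

```python
# Toy illustration: identical predicted observation streams give zero distance.

def observation_distance(obs_a, obs_b) -> float:
    """Fraction of timesteps on which the two predicted observation streams differ."""
    diffs = sum(a != b for a, b in zip(obs_a, obs_b))
    return diffs / max(len(obs_a), 1)

honest_obs = ["diamond visible", "door closed", "diamond visible"]
tampered_obs = list(honest_obs)  # the box replays the human model's predictions

assert observation_distance(honest_obs, tampered_obs) == 0.0
```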

Comment by paulfchristiano on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-17T05:06:29.281Z · LW · GW

The previous definition was aiming to define a utility function "precisely," in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.

One basic concern with this is (as you pointed out at the time) that it's not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions or at least conclusions different from ours).

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I'd even like all of the humans involved in the process to be indistinguishable from the "real" humans, so that no human ever looks at their situation and thinks "I guess I'm one of the humans responsible for figuring out the utility function, since this isn't the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically."

More structurally, the goal is to define the utility function in terms of the kinds of question-answers that realistic approaches to ELK could elicit, which doesn't seem to include facts about mathematics that are much too complex for humans to derive directly and where they need to rely on correlations between mathematics and the physical world---in those cases we are essentially just delegating all the reasoning about how to couple them (e.g. how to infer that hypothetical humans will behave like real humans) to some amplified humans, and then we might as well go one level further and actually talk about how those humans reason.

The point of doing this exercise now is mostly to clarify what kind of answers we need to get out of ELK, and especially to better understand whether it's worth exploring "narrow" approaches (methodologically it may make sense anyway because they may be a stepping stone to more ambitious approaches, but it would be more satisfying if they could be used directly as a building block in an alignment scheme). We looked into it enough to feel more confident about exploring narrow approaches.