Some thoughts:
- The correct answer is clearly (c) - it depends on a bunch of factors.
- My current guess is that it would make things worse (given likely values for the bunch of other factors) - basically for Richard's reasons.
- Given [new potential-to-shift-motivation information/understanding], I expect there's a much higher chance that this substantially changes the direction of a not-yet-formed project, than a project already in motion.
- Specifically:
- Who gets picked to run such a project? If it's primarily a [let's beat China!] project, are the key people cautious and highly adaptable when it comes to top-level goals? Do they appoint deputies who're cautious and highly adaptable?
- Here I note that the kind of 'caution' we'd need is [people who push effectively for the system to operate with caution]. Most people who want caution are more cautious.
- How is the project structured? Will the structure be optimized for adaptability? For red-teaming of top-level goals?
- Suppose that a mid-to-high-level participant receives information making the current top-level goals questionable - is the setup likely to reward them for pushing for changes? (noting that these are the kind of changes that were not expected to be needed when the project launched)
- Which external advisors do leaders of the project develop relationships with? What would trigger these to change?
- ...
- I do think that it makes sense to aim for some centralized project - but only if it's the right kind.
- I expect that almost all the directional influence is in [influence the initial conditions].
- For this reason, I expect [push for some kind of centralized project, and hope it changes later] is a bad idea.
- I think [devote great effort to influencing the likely initial direction of any such future project] seems a great idea (so long as you're sufficiently enlightened about desirable initial directions, of course :))
- I'd note that [initial conditions] needn't only be internal to the project - in principle we could have reason to believe that various external mechanisms would be likely to shift the project's motivation sufficiently over time. (I don't know of any such reasons)
- I think the question becomes significantly harder once the primary motivation behind a project isn't [let's beat China!], but also isn't [your ideal project motivation (with your ideal initial conditions)].
- I note that my p(doom) doesn't change much if we eliminate racing but don't slow down until it's clear to most decision makers that it's necessary.
- Likewise, I don't expect that [focus on avoiding the earliest disasters] is likely to be the best strategy. So e.g. getting into a good position on security seems great, all else equal - but I wouldn't sacrifice much in terms of [odds of getting to a sufficiently cautious overall strategy] to achieve better short-term security outcomes.
First some points of agreement:
- I like that you're focusing on neglected approaches. Not much on the technical side seems promising to me, so I like to see exploration.
- Skimming through your suggestions, I think I'm most keen on human augmentation related approaches - hopefully the kind that focuses on higher quality decision-making and direction finding, rather than simply faster throughput.
- I think outreach to Republicans / conservatives, and working across political lines is important, and I'm glad that people are actively thinking about this.
- I do buy the [Trump's high variance is helpful here] argument. It's far from a principled analysis, but I can more easily imagine [Trump does correct thing] than [Harris does correct thing]. (mainly since I expect the bar on "correct thing" to be high so that it needs variance)
- I'm certainly making no implicit "...but the Democrats would have been great..." claim below.
That said, various of the ideas you outline above seem to be founded on likely-to-be-false assumptions.
Insofar as you're aiming for a strategy that provides broadly correct information to policymakers, this seems undesirable - particularly where you may be setting up unrealistic expectations.
Highlights of the below:
- Telling policymakers that we don't need to slow down seems negative.
- I don't think you've made any valid argument that not needing to slow down is likely. (of course it'd be convenient)
- A negative [alignment-in-the-required-sense tax] seems implausible. (see below)
- (I don't think it even makes sense in the sense that "alignment tax" was originally meant[1], but if "negative tax" gets conservatives listening, I'm all for it!)
- I think it's great for people to consider convenient possibilities (e.g. those where economic incentives work for us) in some detail, even where they're highly unlikely. Whether they're actually 0.25% or 25% likely isn't too important here.
- Once we're talking about policy advocacy, their probability is important.
More details:
A conservative approach to AI alignment doesn’t require slowing progress, avoiding open sourcing etc. Alignment and innovation are mutually necessary, not mutually exclusive: if alignment R&D indeed makes systems more useful and capable, then investing in alignment is investing in US tech leadership.
Here and in the case for a negative alignment tax, I think you're:
- Using a too-low-resolution picture of "alignment" and "alignment research".
- This makes it too easy to slip between ideas like:
- (i) Some alignment research has property x
- (ii) All alignment research has property x
- (iii) A [sufficient for scalable alignment solution] subset of alignment research has property x
- (iv) A [sufficient for scalable alignment solution] subset of alignment research that we're likely to complete has property x
- An argument that requires (iv) but only justifies (i) doesn't accomplish much. (we need something like (iv) for alignment tax arguments)
- Failing to distinguish between:
- Alignment := Behaves acceptably for now, as far as we can see.
- Alignment := [some mildly stronger version of 'alignment']
- Alignment := notkilleveryoneism
In particular, there'll naturally be some crossover between [set of research that's helpful for alignment] and [set of research that leads to innovation and capability advances] - but alone this says very little.
What we'd need is something like:
- Optimizing efficiently for innovation in a way that incorporates various alignment-flavored lines of research gets us sufficient notkilleveryoneism progress before any unrecoverable catastrophe with high probability.
It'd be lovely if something like this were true - it'd be great if we could leverage economic incentives to push towards sufficient-for-long-term-safety research progress. However, the above statement seems near-certainly false to me. I'd be (genuinely!) interested in a version of that statement you'd endorse at >5% probability.
The rest of that paragraph seems broadly reasonable, but I don't see how you get to "doesn't require slowing progress".
On "negative alignment taxes":
First, a point that relates to the 'alignment' disambiguation above.
In the case for a negative alignment tax, you offer the following quote as support for alignment/capability synergy:
...Behaving in an aligned fashion is just another capability... (Anthropic quote from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback)
However, the capability is [ability to behave in an aligned fashion], and not [tendency to actually behave in an aligned fashion] (granted, Anthropic didn't word things precisely here). The latter is a propensity, not a capability.
What we need for scalable alignment is the propensity part: no-one sensible is suggesting that superintelligences wouldn't have the ability to behave in an aligned fashion. The [behavior-consistent-with-alignment]-capability synergy exists while a major challenge is for systems to be able to behave desirably.
Once capabilities are autonomous-x-risk-level, the major challenge will be to get them to actually exhibit robustly aligned behavior. At that point there'll be no reason to expect the synergy - and so no basis to expect a negative or low alignment tax where it matters.
On things like "Cooperative/prosocial AI systems", I'd note that hits-based exploration is great - but please don't expect it to work (and that "if implemented into AI systems in the right ways" is almost all of the problem).
On this basis, it seems to me that the conservative-friendly case you've presented doesn't stand up at all (to be clear, I'm not critiquing the broader claim that outreach and cooperation are desirable):
- We don't have a basis to expect negative (or even low) alignment tax.
- (unclear so far that we'll achieve non-infinite alignment tax for autonomous x-risk relevant cases)
- It's highly likely that we do need to slow advancement, and will need serious regulation.
Given our lack of precise understanding of the risks, we'll likely have to choose between [overly restrictive regulation] and [dangerously lax regulation] - we don't have the understanding to draw the line in precisely the right place. (completely agree that for non-frontier systems, it's best to go with little regulation)
I'd prefer a strategy that includes [policymakers are made aware of hard truths] somewhere.
I don't think we're in a world where sufficient measures are convenient.
It's unsurprising that conservatives are receptive to quite a bit "when coupled with ideas around negative alignment taxes and increased economic competitiveness" - but this just seems like wishful thinking and poor expectation management to me.
Similarly, I don't see a compelling case for:
that is, where alignment techniques are discovered that render systems more capable by virtue of their alignment properties. It seems quite safe to bet that significant positive alignment taxes simply will not be tolerated by the incoming federal Republican-led government—the attractor state of more capable AI will simply be too strong.
Of course this is true by default - in worlds where decision-makers continue not to appreciate the scale of the problem, they'll stick to their standard approaches. However, conditional on their understanding the situation, and understanding that at least so far we have not discovered techniques through which some alignment/capability synergy keeps us safe, this is much less obvious.
I have to imagine that there is some level of perceived x-risk that snaps politicians out of their default mode.
I'd bet on [Republicans tolerate significant positive alignment taxes] over [alignment research leads to a negative alignment tax on autonomous-x-risk-capable systems] at odds of at least ten to one (though I'm not clear how to operationalize the latter).
Republicans are more flexible than reality :).
[1] As I understand the term, alignment tax compares [lowest cost for us to train a system with some capability level] against [lowest cost for us to train an aligned system with some capability level]. Systems in the second category are also in the first category, so zero tax is the lower bound.
This seems a better definition, since it focuses on the outputs, and there's no need to handwave about what counts as an alignment-flavored training technique: it's just [...any system...] vs [...aligned system...].
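A rough way to write this down, with notation introduced here purely for illustration (the symbols $S_k$, $A_k$ and $\mathrm{cost}$ are assumptions, not standard terms): let $S_k$ be the set of systems we could train with capability level $k$, and $A_k \subseteq S_k$ the aligned ones.

```latex
\[
C_{\text{any}}(k) = \min_{s \in S_k} \mathrm{cost}(s),
\qquad
C_{\text{aligned}}(k) = \min_{s \in A_k} \mathrm{cost}(s)
\]
\[
A_k \subseteq S_k
\;\Longrightarrow\;
C_{\text{aligned}}(k) \ge C_{\text{any}}(k)
\;\Longrightarrow\;
\text{tax}(k) = C_{\text{aligned}}(k) - C_{\text{any}}(k) \ge 0
\]
```

Under the usual convention that a minimum over an empty set is $+\infty$, the case where we can't train any aligned system at capability $k$ (i.e. $A_k = \emptyset$) corresponds to an infinite tax.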
Separately, I'm not crazy about the term: it can suggest to new people that we know how to scalably align systems at all. Talking about "lowering the alignment tax" from infinity strikes me as an odd picture.
I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.
I don't think this is quite right.
Two major objections to the bio-anchors 30-year-median conclusion might be:
- (1) The whole thing is laundering vibes into credible-sounding headline numbers.
- (2) Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.
To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence for it, since (2) alone would account for the updates.
I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1) - but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].
It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations [to → that] help orchestrate the actions required to do it. This is true even if they've been selected for passively.
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)
I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.
It seems to me that there's no fixed notion of "active" that works for both paragraphs here.
If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent's actions], then the "But some methods can only be active" doesn't hold.
There are two dimensions here:
- Whether the circumvention is implemented passively/actively.
- Whether the circumvention is selected for passively/actively.
In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
- To stand x chance of property p applying to system s, we'd need to apply resources r.
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".
Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?
[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]
I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't emphasize this so much] would have much of an impact.
I expect that anything less than a position of [open-source will be fine forever] would have had much the same impact - though perhaps a little slower. (granted, there's potential for hindsight bias here, so I shouldn't say "I'm confident that this was inevitable", but it's not at all clear to me that it wasn't highly likely)
It's also not clear to me that any narrow definition of [AI safety community] was in a position to prevent some claims that open-source will be unacceptably dangerous at some point. E.g. IIRC Geoffrey Hinton rhetorically compared it to giving everyone nukes quite a while ago.
Reducing focus on [desirable, but controversial, short-term wins] seems important to consider where non-adversarial groups are concerned. It's less clear that it helps against (proto-)adversarial groups - unless you're proposing some kind of widespread, strict message discipline (I assume that you're not).
[EDIT for useful replies to this, see Richard's replies to Akash above]
On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].
Constrained-power-seeking still seems necessary to me. (unfortunately)
A few clarifications:
- I guess most technical AIS work is net negative in expectation. My ask there is that people work on clearer cases for their work being positive.
- I don't think my (or Eliezer's) conclusions on strategy are downstream of [likelihood of doom]. I've formed some model of the situation. One output of the model is [likelihood of doom]. Another is [seemingly least bad strategies]. The strategies are based around why doom seems likely, not (primarily) that doom seems likely.
- It doesn't feel like "I am responding to the situation with the appropriate level of power-seeking given how extreme the circumstances are".
- It feels like the level of power-seeking I'm suggesting seems necessary is appropriate.
- My cognitive biases push me away from enacting power-seeking strategies.
- Biases aside, confidence in [power seems necessary] doesn't imply confidence that I know what constraints I'd want applied to the application of that power.
- In strategies I'd like, [constraints on future use of power] would go hand in hand with any [accrual of power].
- It's non-obvious that there are good strategies with this property, but the unconstrained version feels both suspicious and icky to me.
- Suspicious, since [I don't have a clue how this power will need to be directed now, but trust me - it'll be clear later (and the right people will remain in control until then)] does not justify confidence.
- To me, you seem to be over-rating the applicability of various reference classes in assessing [(inputs to) likelihood of doom]. As I think I've said before, it seems absolutely the correct strategy to look for evidence based on all the relevant reference classes we can find.
- However, all else equal, I'd expect:
- Spending a long time looking for x, makes x feel more important.
- [Wanting to find useful x] tends to shade into [expecting to find useful x] and [perceiving xs as more useful than they are].
- Particularly so when [absent x, we'll have no clear path to resolving hugely important uncertainties].
- The world doesn't owe us convenient reference classes. I don't think there's any way around inside-view analysis here - in particular, [how relevant/significant is this reference class to this situation?] is an inside-view question.
- That doesn't make my (or Eliezer's, or ...'s) analysis correct, but there's no escaping that you're relying on inside-view too. Our disagreement only escapes [inside-view dependence on your side] once we broadly agree on [the influence of inside-view properties on the relevance/significance of your reference classes]. I assume that we'd have significant disagreements there.
- Though it still seems useful to figure out where. I expect that there are reference classes that we'd agree could clarify various sub-questions.
- In many non-AI-x-risk situations, we would agree - some modest level of inside-view agreement would be sufficient to broadly agree about the relevance/significance of various reference classes.
E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.
That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.
I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadening!) AIS community] help us much at all in achieving our goals.
To achieve our goals, I expect we'll need something much closer to 'our' people in power (where 'our' means [people with a pretty rare combination of properties, conducive to furthering our goals]), and increased legitimacy for [narrow part of the community I think is correct].
I think we'd need to go with [aim for a relatively narrow form of power], since I don't think accumulating less power will work. (though it's a good plan, to the extent that it's possible)
First, I think that thinking about and highlighting these kinds of dynamics is important.
I expect that, by default, too few people will focus on analyzing such dynamics from a truth-seeking and/or instrumentally-useful-for-safety perspective.
That said:
- It seems to me you're painting with too broad a brush throughout.
- At the least, I think you should give some examples that lie just outside the boundary of what you'd want to call [structural power-seeking].
- Structural power-seeking in some sense seems unavoidable. (AI is increasingly powerful; influencing it implies power)
- It's not clear to me that you're sticking to a consistent sense throughout.
- E.g. "That makes AI safety strategies which require power-seeking more difficult to carry out successfully." seems false in general, unless you mean something fairly narrow by power-seeking.
- An important aspect is the (perceived) versatility of power:
- To the extent that it's [general power that could be efficiently applied to any goal], it's suspicious.
- To the extent that it's [specialized power that's only helpful in pursuing a narrow range of goals] it's less suspicious.
- Similarly, it's important under what circumstances the power would become general: if I take actions that can only give me power by routing through [develops principled alignment solution], that would make a stated goal of [develop principled alignment solution] believable; it doesn't necessarily make some other goal believable - e.g. [...and we'll use it to create this kind of utopia].
- Increasing legitimacy is power-seeking - unless it's done in such a way that it implies constraints.
- That said, you may be right that it's somewhat less likely to be perceived as such.
- Aiming for [people will tend to believe whatever I say about x] is textbook power-seeking wherever [influence on x] implies power.
- We'd want something more like [people will tend to believe things that I say about x, so long as their generating process was subject to [constraints]].
- Here it's preferable for [constraints] to be highly limiting and clear (all else equal).
- I'd say that "prioritizing competence" begs the question.
- What is the required sense of "competence"?
- For the most important AI-based decision-making, I doubt that "...broadly competent, and capable of responding sensibly..." is a high enough bar.
- In particular, "...because they don't yet take AGI very seriously" is not the only reason people are making predictable mistakes.
- "...as AGI capabilities and risks become less speculative..."
- Again, this seems too coarse-grained:
- Some risks becoming (much) clearer does not entail all risks becoming (much) clearer.
- Understanding some risks well while remaining blind to others, does not clearly imply safer decision-making, since "responding sensibly" will tend to be judged based on [risks we've noticed].
That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models.
Presumably this will take months rather than hours, but it seems worth it (whether or not I'm correct - I expect that [the understanding required to clearly demonstrate to me that I'm wrong] would be useful in a bunch of other ways).
"[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...
I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom".
I'll clarify more below.
Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
Sure, but I mostly don't buy p(doom) reduction here, other than through [highlight near-misses] - so that an approach that hides symptoms of fundamental problems is probably net negative.
In the free-for-all world, I think doom is overdetermined, absent miracles[1] - and [significantly improved debate setup] does not strike me as a likely miracle, even after I condition on [a miracle occurred].
Factors that push in the other direction:
- I can imagine techniques that reduce near-term widespread low-stakes failures.
- This may be instrumentally positive if e.g. AI is much better for collective sensemaking than otherwise it would be (even if that's only [the negative impact isn't as severe]).
- Similarly, I can imagine such techniques mitigating the near-term impact of [we get what we measure] failures. This too seems instrumentally useful.
- I do accept that technical work I'm not too keen on may avoid some early foolish/embarrassing ways to fail catastrophically.
- I mostly don't think this helps significantly, since we'll consistently hit doom later without a change in strategy.
- Nonetheless, [don't be dead yet] is instrumentally useful if we want more time to change strategy, so avoiding early catastrophe is a plus.
- [probably other things along similar lines that I'm missing]
But I suppose that on the [usefulness of debate (/scalable oversight techniques generally) research], I'm mainly thinking: [more clearly understanding how and when this may fail catastrophically, and how we'd robustly predict this] seems positive, whereas [show that versions of this technique get higher scores on some benchmarks] probably doesn't.
Even if I'm wrong about the latter, the former seems more important.
Granted, it also seems harder - but I think that having a bunch of researchers focus on it and fail to come up with any principled case is useful too (at least for them).
If we ignore the misunderstanding part then I'm at << 1% probability on "we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future".
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Agreed. This is why my main hope on this routes through [work on level 6/7 specifications clarifies the depth and severity of the problem] and [more-formally-specified 6/7 specifications give us something to point to in regulation].
(on the level 7, I'm assuming "in all contexts" must be an overstatement; in particular, we only need something like "...in all contexts plausibly reachable from the current state, given that all powerful AIs developed by us or our AIs follow this specification or this-specification-endorsed specifications")
Clarifications I'd make on my [doom seems likely, but not inevitable; some technical work seems net negative] position:
- If I expected that we had 25 years to get things right, I think I'd be pretty keen on most hands-on technical approaches (debate included).
- Quite a bit depends on the type of technical work. I like the kind of work that plausibly has the property [if we iterate on this we'll probably notice all catastrophic problems before triggering them].
- I do think there's a low-but-non-zero chance of breakthroughs in pretty general technical work. I can't rule out that ARC theory comes up with something transformational in the next few years (or that it comes from some group that's outside my current awareness).
- I'm not ruling out an [AI assistants help us make meaningful alignment progress] path - I currently think it's unlikely, not impossible.
- However, here I note that there's a big difference between:
- The odds that [solve alignment with AI assistants] would work if optimally managed.
- The odds that it works in practice.
- I worry that researchers doing technical research tend to have the former in mind (implicitly, subconsciously) - i.e. the (implicit) argument is something like "Our work stands a good chance to unlock a winning strategy here".
- But this is not the question - the question is how likely it is to work in practice.
- (even conditioning on not-obviously-reckless people being in charge)
- It's guesswork, but on [does a low-risk winning strategy of this form exist (without a huge slowdown)?] I'm perhaps 25%. On [will we actually find and implement such a strategy, even assuming the most reckless people aren't a factor], I become quite a bit more pessimistic - if I start to say "10%", I recoil at the implied [40% shot at finding and following a good enough path if one exists].
- Of course a lot here depends on whether we can do well enough to fail safely. Even a 5% shot is obviously great if the other 95% is [we realize it's not working, and pivot].
- However, since I don't see debate-like approaches as plausible in any direct-path-to-alignment sense, I'd like to see a much clearer plan for using such methods as stepping-stones to (stepping stones to...) a solution.
- In particular, I'm interested in the case for [if this doesn't work, we have principled reasons to believe it'll fail safely] (as an overall process, that is - not on each individual experiment).
- When I look at e.g. Buck/Ryan's outlined iteration process here,[2] I'm not comforted on this point: this has the same structure as [run SGD on passing our evals], only it's [run researcher iteration on passing our evals]. This is less bad, but still entirely loses the [evals are an independent check on an approach we have principled reasons to think will work] property.
- On some level this kind of loop is unavoidable - but having the "core workflow" of alignment researchers be [tweak the approach, then test it against evals] seems a bit nuts.
- Most of the hope here seems to come from [the problem is surprisingly (to me) easy] or [catastrophic failure modes are surprisingly (to me) sparse].
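Spelling out the arithmetic behind the implied 40% in the guesswork bullet above (the labels $E$ and $F$ are shorthand introduced here, and this reads the 10% as the overall chance that such a strategy both exists and gets found and implemented):

$$P(F \mid E) = \frac{P(E \wedge F)}{P(E)} = \frac{0.10}{0.25} = 0.40$$

where $E$ = [a low-risk winning strategy of this form exists (without a huge slowdown)] and $F$ = [we actually find and implement such a strategy].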
Not going to respond to everything
No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
Agreed, but I think this is too coarse-grained a view.
I expect that, absent impressive levels of international coordination, we're screwed. I'm not expecting [higher perceived risk] --> [actions that decrease risk] to operate successfully on the "move fast and break things" crowd.
I'm considering:
- What kinds of people are making/influencing key decisions in worlds where we're likely to survive?
- How do we get those people this influence? (or get influential people to acquire these qualities)
- What kinds of situation / process increase the probability that these people make risk-reducing decisions?
I think some kind of analysis along these lines makes sense - though clearly it's hard to know where to draw the line between [it's unrealistic to expect decision-makers/influencers this principled] and [it's unrealistic to think things may go well with decision-makers this poor].
I don't think conditioning on the status-quo free-for-all makes sense, since I don't think that's a world where our actions have much influence on our odds of success.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
Agreed (I think your links make good points). However, I'd point out that it can be true both that:
- Most first-principles reasoning about x is terrible.
- First-principles reasoning is required in order to make any useful prediction of x. (for most x, I don't think this holds)
You've listed a bunch of claims about AI, but haven't spelled out why they should make us expect large risk compensation effects
I think almost everything comes down to [perceived level of risk] sometimes dropping hugely more than [actual risk] in the case of AI. So it's about the magnitude of the input.
- We understand AI much less well.
- We'll underestimate a bunch of risks, due to lack of understanding.
- We may also over-estimate a bunch, but the errors don't cancel: being over-cautious around fire doesn't stop us from drowning.
- Certain types of research will address [some risks we understand], but fail to address [some risks we don't see / underestimate].
- They'll then have a much larger impact on [our perception of risk] than on [actual risk].
- Drop in perceived risk is much larger than the drop in actual risk.
- In most other situations, this isn't the case, since we have better understanding and/or adaptive feedback loops to correct risk estimates.
It depends hugely on the specific stronger safety measure you talk about. E.g. I'd be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I'm hesitant around such small probabilities on any social claim.
That's useful, thanks. (these numbers don't seem foolish to me - I think we disagree mainly on [how necessary are the stronger measures] rather than [how likely are they])
Hmm, then I don't understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper).
Oh sorry, I should have been more specific - I'm only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I'm only keen on the framework conditional on meeting an extremely high bar for the specification.
If that part gets ignored on the basis that it's hard (which it obviously is), then it's not clear to me that the framework is worth much.
I suppose I'm also influenced by the way some of the researchers talk about it - I'm not clear how much focus Davidad is currently putting on level 6/7 specifications, but he seems clear that they'll be necessary.
[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]
Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?
Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).
[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]
It seems like your argument here, and in other parts of your comment, is something like "we could do this more costly thing that increases safety even more". This seems like a pretty different argument; it's not about risk compensation (i.e. when you introduce safety measures, people do more risky things), but rather about opportunity cost (i.e. when you introduce weak safety measures, you reduce the will to have stronger safety measures). This is fine, but I want to note the explicit change in argument; my earlier comment and the discussion above was not trying to address this argument.
Perhaps we've been somewhat talking at cross purposes then, since I certainly consider [not acting to bring about stronger safety measures] to be within the category of [doing more risky things].
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
For clarity, risky actions I'm thinking about would include:
- Not pushing for stronger safety measures.
- Not pushing for international coordination.
- Pushing forward with capabilities more quickly.
- Placing too much trust in systems off distribution.
- Placing too much trust in the output of [researchers using systems to help solve alignment].
- Not spending enough on more principled (by my lights) alignment research.
Some of these might be both [opportunity cost] and [risk compensation].
I.e.:
- We did x.
- x reduced perceived risk.
- A precondition of y was greater-than-current perceived risk.
- Now we can't do y.
- [Not doing y] is risky.
If you do agree with that, what makes AI different from those cases? (The arguments you give seem like very general considerations that apply to other fields as well.)
Mostly for other readers, I'm going to spoiler my answer to this: my primary claim is that more people need to think about and answer these questions themselves, so I'd suggest that readers take a little time to do this. I'm significantly more confident that the question is important, than that my answer is correct or near-complete.
I agree that the considerations I'm pointing to are general.
I think the conclusions differ, since AI is different.
First a few clarifications:
- I'm largely thinking of x-risk, not moderate disasters.
- On this basis, I'm thinking about loss-of-control, rather than misuse.
- I think if we're talking about e.g. [misuse leads to moderate disaster], then other fields and prior AI experience become much more reasonable reference classes - and I'd expect things like debate to be net positive here.
- My current guess/estimate that debate research is net negative is largely based on x-risk via loss-of-control. (again, I may be wrong - but I'd like researchers to do serious work on answering this question)
How I view AI risk being different:
- First another clarification: the risk-increasing actions I'm worried about are not [individual user uses AI incautiously] but things like [AI lab develops/deploys models incautiously], and [governments develop policy incautiously] - so we shouldn't be thinking of e.g. [people driving faster with seatbelts], but e.g. [car designers / regulators acting differently once seatbelts are a thing].
- To be clear, I don't claim that seatbelts were negative on this basis.
- Key differences:
- We can build AI without understanding it.
- People and systems are not used to this.
- Plausible subconscious heuristics that break here:
- [I can build x] and [almost no-one understands x better than me], [therefore, I understand x pretty well].
- [I can build x] and [I've worked with x for years], [therefore I have a pretty good understanding of the potential failure modes].
- [We can build x], [therefore we should expect to be able to describe any likely failure modes pretty concretely/rigorously]
- [I've discussed the risks of x with many experts who build x], [therefore I have a decent high-level understanding of potential failures]
- [I'm not afraid], [therefore a deadly threat we're not ready to deal with must be unlikely]. (I agree it's unlikely now)
- ...
- One failure of AI can be unrecoverable.
- In non-AI cases there tend to be feedback loops that allow adjustments both in the design, and in people's risk estimates.
- Of course such loops exist for some AI failure modes.
- This makes the unilateralist's curse aspect more significant: it's not the average understanding or caution that matters. High variance is a problem.
- Knowing how this impacts risk isn't straightforward, since it depends a lot on [the levels of coordination we expect] and [the levels of coordination that seem necessary].
- My expectation is that we're just screwed if the most reckless organizations are able to continue without constraint.
- Therefore my concern is focused on worlds where we achieve sufficient coordination to make clearly reckless orgs not relevant. (and on making such worlds more likely)
- Debate-style approaches seem a good fit for [this won't actually work, but intelligent, well-informed, non-reckless engineers/decision-makers might think that it will]. I'm concerned about those who are somewhat overconfident.
- The most concerning AI is very general.
- This leads to a large space of potential failure modes. (both in terms of mechanism, and in terms of behaviour/outcome)
- It is hard to make a principled case that we've covered all serious failure modes (absent a constructive argument based on various worst-case assumptions).
- The most concerning AI will be doing some kind of learned optimization. (in a broad, John Wentworthy sense)
- Pathogens may have the [we don't really understand them] property, but an individual pathogen isn't going to be general, or to be doing significant learned optimization. (if and when this isn't true, then I'd be similarly wary of ad-hoc pathogen safety work)
- New generations of AI can often solve qualitatively different problems from previous generations. With each new generation, we're off distribution in a much more significant sense than with new generations of e.g. car design.
- Unknown unknowns may arise internally due to the complex systems nature of AI.
- In cases we're used to, the unknown unknowns tend to come from outside: radically unexpected things may happen to a car, but probably not directly due to the car.
I'm sure I'm missing various relevant factors.
Overall, I think straightforward reference classes for [safety impact of risk compensation] tell us approximately nothing - too much is different.
I'm all for looking for a variety of reference classes: we need all the evidence we can get.
However, I think people are too ready to fall back on the best reference classes they can find - even when they're terrible.
Briefly on opportunity cost arguments, the key factors are (a) how much will is there to pay large costs for safety, (b) how much time remains to do the necessary research and implement it, and (c) how feasible is the stronger safety measure.
This seems to miss the obvious: (d) how much safety do the weak vs strong measures get us?
I expect that you believe work on debate (and similar weak measures) gets us significantly more than I do. (and hopefully you're correct!)
Anyway for now let's just say that I've thought about these three factors and think it isn't especially realistic to expect that we can get stronger safety measures, and as a result I don't see opportunity cost as a big reason not to do the safety work we currently do.
Given that this seems highly significant, can you:
- Quantify "it isn't especially realistic" - are we talking [15% chance with great effort], or [1% chance with great effort]?
- Give a sense of reasons you expect this.
- Is [because we have a bunch of work on weak measures] not a big factor in your view?
- Or is [isn't especially realistic] overdetermined, with [less work on weak measures] only helping conditional on removal of other obstacles?
I'd also note that it's not only [more progress on weak measures] that matters here, but also the signal sent by pursuing these research directions.
If various people with influence on government see [a bunch of lab safety teams pursuing x safety directions] I expect that most will conclude: "There seems to be a decent consensus within the field that x will get us acceptable safety levels", rather than "Probably x is inadequate, but these researchers don't see much hope in getting stronger-than-x measures adopted".
I assume that you personally must have some constraints on the communication strategy you're able to pursue. However, it does seem highly important that if the safety teams at labs are pursuing agendas based on [this isn't great, but it's probably the best we can get], this is clearly and loudly communicated.
Similarly, the theory of change you cite for your examples seems to be "discovers or clarifies problems that shows that we don't have a solution" (including for Guaranteed Safe AI and ARC theory, even though in principle those could be about building safe AI systems). So as far as I can tell, the disagreement is really that you think current work that tries to provide a specific recipe for building safe AI systems is net negative, and I think it is net positive.
This characterization is a little confusing to me: all of these approaches (ARC / Guaranteed Safe AI / Debate) involve identifying problems, and, if possible, solving/mitigating them.
To the extent that the problems can be solved, then the approach contributes to [building safe AI systems]; to the extent that they cannot be solved, the approach contributes to [clarifying that we don't have a solution].
The reason I prefer GSA / ARC is that I expect these approaches to notice more fundamental problems. I then largely expect them to contribute to [clarifying that we don't have a solution], since solving the problems probably won't be practical, and they'll realize that the AI systems they could build won't be safe-with-high-probability.
I expect scalable oversight (alone) to notice a smaller set of less fundamental problems - which I expect to be easier to fix/mitigate. I expect the impact to be [building plausibly-safe AI systems that aren't safe (in any robustly scalable sense)].
Of course I may be wrong, if the fundamental problems I believe exist are mirages (very surprising to me) - or if the indirect [help with alignment research] approach turns out to be more effective than I expect (fairly surprising, but I certainly don't want to dismiss this).
I do still agree that there's a significant difference in that GSA/ARC are taking a worst-case-assumptions approach - so in principle they could be too conservative. In practice, I think the worst-case-assumption approach is the principled way to do things given our lack of understanding.
I think [the existence of this uncertainty is] an important issue to notice when considering research directions
Why? It doesn't seem especially action guiding, if we've agreed that it's not high value to try to resolve the uncertainty (which is what I take away from your (1)).
(it's not obvious to me that it's not high value to try to resolve this uncertainty, but it's plausible based on your prior comments on this issue, and I'm stipulating that here)
Here I meant that it's important to notice that:
- The uncertainty exists and is important. (even though resolving it may be impractical)
- The fact that you've been paying little attention to it does not imply that either:
- Your current estimate is accurate.
- If your estimate changed, that wouldn't be highly significant.
I'm saying that, unless people think carefully and deliberately about this, the mind will tend to conclude [my current estimate is pretty accurate] and/or [this variable changing wouldn't be highly significant] - both of which may be false.
To the extent that these are believed, they're likely to impact other beliefs that are decision-relevant. (e.g. believing that my estimate of x is accurate tells me something about the character of x and my understanding of it)
Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]
I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
(Thanks for this too. I don't endorse that description, but it's genuinely useful to know what impression I'm creating - particularly when it's not what I intended)
I'd rephrase my position as:
- There'll always be some risk of existential failure. We want to reduce the total risk. Each step we take forward is a tradeoff: we accept some risk on that step, and (hopefully) reduce future risk by a greater amount.
- Risk might be reduced through:
- Increased understanding.
- Clear evidence that our current understanding isn't sufficient. (helpful for political will, coordination...)
- [various other things]
- I am saying "we might get doom".
- I think the odds are high primarily because I don't expect we'll get a [mostly safe setup that greatly reduces our odds of [trying something plausible that causes catastrophe]]. (I may be wrong here - hopefully so).
- I am not saying "we should not do safety work"; I'm saying "risk compensation needs to be a large factor in deciding which safety work to do", and "I think few researchers take risk compensation sufficiently seriously".
- To a first approximation, I'd say the odds of [safety work on x turns out negative due to risk compensation] depend on how well decision-makers (or 'experts' they're listening to) are tracking what we don't understand. Specifically, what we might expect to be different from previous experience and/or other domains, and how significant these differences are to outcomes.
- Risk is highest when we believe that we understand more than we actually do.
- There's also a unilateralist's curse issue here: it matters if there are any dangerously overconfident actors in a position to take action that'll increase risk. (noting that [slightly overconfident] may be [dangerously overconfident] in this context)
- This matters in considering the downstream impact of research. I'd be quite a bit less worried about debate research if I only had to consider [what will researchers at least as cautious as Rohin do with this?]. (though still somewhat worried :))
- See also Critch's thoughts on the need for social models when estimating impact.
- We tend to have a distorted picture of our own understanding, since there's a strong correlation between [we understand x well] and [we're able to notice x and express it clearly].
- There's a kind of bias/variance tradeoff here: if we set our evidential standards such that we don't focus on vague/speculative/indirect/conceptual arguments, we'll reduce variance, but risk significant sampling bias.
- Similarly, conclusions downstream of availability-heuristic-triggered thoughts will tend to be disproportionately influenced by the parts of the problem we understand (at least well enough to formulate clear questions).
- I expect that some researchers actively compensate for this in their conscious deliberation, but that it's very hard to get our intuitions to compensate appropriately.
- [[EDIT: Oh, and when considering downstream impact, risk compensation etc, I think it's hugely important that most decision-makers and decision-making systems have adapted to a world where [those who can build x understand x] holds. A natural corollary here is [the list of known problems that concern [operation of the system itself] is complete].
- This implicit assumption underlies risk management approaches, governance structures and individuals' decision-making processes.
- That it's implicit makes it much harder to address, since there aren't usually "and here we assume that those who can build x understand x" signposts.
- It might be viable to re-imagine risk-management such that this is handled.
- It's much less likely that we get to re-imagine governance structures, or the cognition of individual decision-makers.]]
Again, I don't think it's implausible that debate-like approaches come out better than various alternatives after carefully considering risk compensation. I do think it's a serious error not to spend quite a bit of time and effort understanding and reducing uncertainty on this.
Possibly many researchers do this, but don't have any clean, legible way to express their process/conclusions. I don't get that impression: my impression is that arguments along the lines I'm making tend to be perceived as fully general counter-arguments and dismissed (whether they come from outside, or from the researchers themselves).
What's an example of alignment work that you think is net positive with the theory of change "this is a better way to build future powerful AI systems"?
I'm not sure how directly you intend "this is a better way..." to be read (vs e.g. "this will build the foundation of better ways..."). I guess I'd want to reframe it as "This is a better process by which to build future powerful AI systems", so as to avoid baking in a level of concreteness before looking at the problem.
That said, the following seem good to me:
- Davidad's stuff - e.g. Towards Guaranteed Safe AI.
- I'm not without worries here, and I think "Guaranteed safe", "Provably safe" are silly, unhelpful overstatements - but the broad direction seems useful.
- That said, the part I'm most keen on would be the "safety specification", and what I'd expect here is something like [more work on this clarifies that it's a very hard problem we're not close to solving]. (noting that [this doesn't directly cause catastrophe] is an empty 'guarantee')
- ARC theory's stuff.
- I'm not sure you'd want to include this??
- I'm keen on it because:
- It seems likely to expand understanding - to uncover new problems.
- It may plausibly be sufficiently general - closer to the [fundamental fix] than the [patch this noticeably undesirable behaviour] end of the spectrum.
- Evan's red-teaming stuff (e.g. model organisms of misalignment).
- This seems net positive, and I think [finding a thousand ways not to build an AI] is part of a good process.
- I do still worry that there's a path of [find concrete problem] -> [make concrete problem a target for research] -> [fix concrete problems we've found, without fixing the underlying issues].
- The ideal situation from a [mitigate risk compensation] perspective is:
1. Have as much understanding and clarity of fundamental problems as possible.
2. Have enough clear, concrete examples to make the need for caution clear to decision-makers.
- To the extent that Evan et al achieve (1), I'm unreservedly in favour.
- On (2) the situation is a bit less clear: we're in more danger when there are serious problems for which we're unable to make a clear/concrete case.
- And, again, my conclusion is not "never expose new concrete problems", but rather "consider the downsides of doing so when picking a particular line of research".
- A bunch of foundational stuff you probably also wouldn't want to include under this heading (singular learning theory, natural abstractions, various agent-foundations-shaped things, perhaps computational mechanics (??)).
(I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.)
Not a problem. However, I'd want to highlight the distinction between:
1. Substantial efforts to resolve this uncertainty haven't worked out.
2. Resolving this uncertainty isn't important.
Both are reasonable explanations for not putting much further effort into such discussions.
However, I worry that the system-1s of researchers don't distinguish between (1) and (2) here - so that the impact of concluding (1) will often be to act as if the uncertainty isn't important (and as if the implicit corollaries hold).
I don't see an easy way to fix this - but I think it's an important issue to notice when considering research directions. Not only [consider this uncertainty when picking a research direction], but also [consider that your system 1 is probably underestimating the importance of this uncertainty (and corollaries), since you haven't been paying much attention to it (for understandable reasons)].
Sure, linking to that seems useful, thanks.
That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?].
For something like the latter, it's not clear to me that it's not useful.
Mainly my pessimism is about:
- Debate seeming not to address the failure modes I'm worried about - e.g. scheming.
- Expecting [systems insufficiently capable to cause catastrophe] not to radically (>10x) boost the most important research on alignment. (hopefully I'm wrong!)
- As a result, expecting continued strong pressure to make systems more capable, making [understand when a given oversight approach will fail catastrophically] very important.
- No research I'm aware of seeming likely to tell us when debate would fail catastrophically. (I don't think the Future work here seems likely to tell us much about catastrophic failure)
- No research I'm aware of making a principled case for [it's very unlikely that any dangerous capability could be acquired suddenly]. (I expect such thresholds to be uncommon, but to exist)
- Seeing no arguments along the lines of [We expect debate to give us clearer red flags than other approaches, and here's why...] or [We expect debate-derived red flags are more likely to lead to a safe response, rather than an insufficiently general fix that leaves core problems unaddressed].
- This is not to say that no such arguments could exist.
- I'm very interested in the case that could be made here.
Of course little of this is specific to debate. Nor is it clear to me that debate is worse than alternatives in these respects - I just haven't seen an argument that it's better (on what assumptions; in which contexts).
I understand that it's hard to answer the questions I'd want answered.
I also expect that working on debate isn't the way to answer them - so I think it's fine to say [I currently expect debate to be a safer approach than most because ... and hope that research directions x and y will shed more light on this]. But I'm not clear on people's rationale for the first part - why does it seem safer?
Do you have a [link to] / [summary of] your argument/intuitions for [this kind of research on debate makes us safer in expectation]? (e.g. is Geoffrey Irving's AXRP appearance a good summary of the rationale?)
To me it seems likely to lead to [approach that appears to work to many, but fails catastrophically] before it leads to [approach that works]. (This needn't be direct)
I.e. currently I'd expect this direction to make things worse both for:
1. We're aiming for an oversight protocol that's directly scalable to superintelligence.
2. We're aiming for e.g. a control setup, where debate enables us e.g.:
- Access to sufficiently high capability for meaningful alignment progress without taking unacceptable risk.
- Clearer-than-we'd-otherwise-get signals that we can't access such capability without unacceptable risk.
- [some other (incremental/indirect) safety property]
I tend to assume the idea is more (2) than (1). (presumably [debate fails in the limit] isn't controversial)
I can see a case that it's possible debate helps with some (2).
I don't currently see the case for [net positive in expectation].
More specifically, the upside from [direction/protocol x eliminates various failure modes, so reducing risk] needs to be balanced against the downside from [our (over?)confidence in direction/protocol x leads us to take risks we otherwise wouldn't].
I note here that this isn't a fully-general counterargument, but rather a general consideration.
When I consider this, I currently think "On balance, this seems to make things worse in worlds where we might plausibly have succeeded" (of course I hope to be wrong somewhere! - and, if I'm wrong, I'd like to know).
This may be downstream of different threat models.
Alternatively, it may be that you have in mind some strategy I haven't considered (at some level of abstraction).
It'd be useful to have more clarity here.
What strikes you as a plausible [debate turns out to be helpful in reducing x-risk] story? (concreteness would be nice, but mostly I'm interested in whatever story you have)
I'd enjoy being more constructive and getting into specifics on the paper/direction.
However, from my point of view that'd make things worse (at 'best' I'd be thinning out the red flags).
Broadly I agree.
I'm not sure about:
but the team has not cohered around a leadership structure or agenda yet. I'm hopeful that this will come together
I don't expect the most effective strategy at present to be [(try hard to) cohere around an agenda]. An umbrella org hosting individual researchers seems the right starting point. Beyond that, I'd expect [structures and support to facilitate collaboration and self-organization] to be ideal.
If things naturally coalesce that's probably a good sign - but I'd prefer that to be a downstream consequence of exploration, not something to aim for in itself.
To be clear, this is all on the research side - on the operations side organization is clearly good.
Ah okay, that's clarifying. Thanks.
It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment.
In either case the model gets [much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe].
Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]
Rather than [compromised safety measure(s)] -> [catastrophe].
However, supposing we had a robust means to avoid rogue deployment, I don't think it's clear that we'd be left with good odds to catch sketchy behavior.
That may be true if we had an independent chance to notice sketchiness for each action - but I'd expect there to be strategies where [our failure to notice sketchiness] correlates strongly over many actions. A rogue deployment is one such strategy, but I'd be surprised if it's the only one.
This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.
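To make the independence point above concrete, here is a minimal sketch. The per-action detection probability `p`, the number of actions `n`, and the single "shared blind spot" probability `q` are all illustrative assumptions, and the one-blind-spot model is only one simple way to represent correlated failures.

```python
def p_catch_independent(p: float, n: int) -> float:
    """Chance of noticing at least one of n sketchy actions,
    if each action is noticed independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def p_catch_correlated(p: float, n: int, q: float) -> float:
    """Same quantity, but with probability q a shared blind spot
    (hypothetical) makes every check fail at once; otherwise the
    checks behave independently as above."""
    return (1.0 - q) * p_catch_independent(p, n)


if __name__ == "__main__":
    # Illustrative numbers only: 5% per-action detection, 50% blind spot.
    for n in (1, 10, 100):
        print(n, p_catch_independent(0.05, n), p_catch_correlated(0.05, n, 0.5))
```

Under independence the chance of catching something approaches 1 as the number of required actions grows; with a shared blind spot it is capped at `1 - q` however many actions are needed.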
Thoughts:
- It's not clear to me whether the following counts as a rogue deployment (I'm assuming so):
- [un-noticed failure of one safety measure, in a context where all other safety measures are operational]
- For this kind of case:
- The name "rogue deployment" doesn't seem a great fit.
- In general, it's not clear to me how to draw the line between:
- Safety measure x didn't achieve what we wanted, because it wasn't specified/implemented sufficiently well. (not a rogue deployment)
- Safety measure x was subverted. (rogue deployment)
- For example, I think it'd be reasonable to think of [Amazing, long-term jailbreaks] as rogue deployments on this basis: the jailbreak is subverting a safety measure, so that "the safety measures are absent" is true in some sense.
- It seems important to distinguish things like:
- This safety measure appears to be in effect.
- This safety measure is running as designed.
- We're getting the safety-improving-property we wanted from this safety measure.
- When considering the [Employees of the AI company might run the model in an unauthorized way] case,
- I think one central example to consider is of an employee who:
- Thinks this is a good idea for the world.
- Can make a pretty compelling case to others that it's a good idea.
- The examples in the post seem to focus on [bad intent and/or incompetence], which seems important, but too limited.
(Egan's Incandescence is relevant and worth checking out - though it's not exactly thrilling :))
I'm not crazy about the terminology here:
- Unfalsifiable-in-principle doesn't imply false. It implies that there's a sense in which the claim is empty. This tends to imply [it will not be accepted as science], but not [it is false].
- Where something is practically unfalsifiable (but falsifiable in principle), that doesn't suggest it's false either. It suggests it's hard to check.
- It seems to me that the thing you'd want to point to as potentially suspicious is [practically unfalsifiable claim made with high confidence].
- The fact that it's unusual and inconvenient for something predictable to be practically unfalsifiable does not inherently make such prediction unsound.
- I don't think it's foolish to look for analogous examples here, but I guess it'd make more sense to make the case directly:
- No, a hypothesis does not always need to make advance predictions (though it's convenient when it does!).
- Claims predicting AI disaster are based on our not understanding how things will work concretely. Being unable to make many good predictions in this context is not strange.
- Various AI x-risk claims concern patterns with no precedents we'd observe significantly before the end. This, again, is inconvenient - but not strange: they're dangerous in large part because they're patterns without predictable early warning signs.
Here and above, I'm unclear what "getting to 7..." means.
With x = "always reliably determines worst-case properties about a model and what happened to it during training even if that model is deceptive and actively trying to evade detection".
Which of the following do you mean (if either)?:
1. We have a method that x.
2. We have a method that x, and we have justified >80% confidence that the method x.
I don't see how model organisms of deceptive alignment (MODA) get us (2).
This would seem to require some theoretical reason to believe our MODA in some sense covered the space of (early) deception.
I note that for some future time t, I'd expect both [our MODA at t] and [our transparency and interpretability understanding at t] to be downstream of [our understanding at t] - so that there's quite likely to be a correlation between [failure modes our interpretability tools miss] and [failure modes not covered by our MODA].
I agree with this.
Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).
I don't think there's a way around this. Aspects of this situation are fundamentally different from those we're used to. [Is different from] is not a useful relation - we can't get far by saying "We've seen [fundamentally different] situations before - what happened there?". It'll all come back to how they were fundamentally different.
To say something mildly more constructive, I do still think we should be considering and evaluating other frames, based on our own inside-view model (with appropriate error bars on that model).
A place I'd start here would be:
- Attempt to understand another frame.
- See how far I need to zoom out before that frame's models become a reasonable abstraction for the problem-as-I-understand-it.
- Find the smallest changes to my models that'd allow me to stick with this frame without zooming out so far. Assess the probability that these adjusted models are correct/useful.
For most frames, I end up needing to zoom out too far for them to say much of relevance - so this doesn't much change my p(doom) assessment.
It seems more useful to apply other frames to evaluate smaller parts of our models. I'm sure there are a bunch of places where intuitions and models from e.g. economics or physics do apply to safety-related subproblems.
then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information
I think the same reasoning applies if they hack us: they'll assume that the stuff they were able to hack was the part we left suspiciously vulnerable, and the really important information is behind more serious security.
I expect they'll assume we're in control either way - once the stakes are really high.
It seems preferable to actually be in control.
I'll grant that it's far from clear that the best strategy would be used.
(apologies if I misinterpreted your assumptions in my previous reply)
Working on this seems good insofar as greater control implies more options. With good security, it's still possible to opt in to whatever weight-sharing / transparency mechanisms seem net positive - including with adversaries. Without security there's no option.
Granted, the [more options are likely better] conclusion is clearer if we condition on wise strategy.
However, [we have great security, therefore we're sharing nothing with adversaries] is clearly not a valid inference in general.
I think this is great overall.
One area I'd ideally prefer a clearer presentation/framing is "Safety/performance trade-offs".
I agree that it's better than "alignment tax", but I think it shares one of the core downsides:
- If we say "alignment tax" many people will conclude ["we can pay the tax and achieve alignment" and "the alignment tax isn't infinite"].
- If we say "Safety/performance trade-offs" many people will conclude ["we know how to make systems safe, so long as we're willing to sacrifice performance" and "performance sacrifice won't imply any hard limit on capability"].
I'm not claiming that this is logically implied by "Safety/performance trade-offs".
I am claiming it's what most people will imagine by default.
I don't think this is a problem for near-term LLM safety.
I do think it's a problem if this way of thinking gets ingrained in those thinking about governance (most of whom won't be reading the papers that contain all the caveats, details and clarifications).
I don't have a pithy description that captures the same idea without being misleading.
What I'd want to convey is something like "[lower bound on risk] / performance trade-offs".
I think the DSA framing is in keeping with the spirit of "first critical try" discourse.
(With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".)
However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-grained a framing.
I'd say that the key properties of "first critical try" are:
- We have the option to trigger some novel process.
- We're unlikely to stop the process once it starts, even if it's not going well.
- Includes both [we can't stop it] and [we won't stop it].
- If the process goes badly, the odds of doom greatly increase.
- There's a significant chance the process goes badly.
My guess is that the most likely near-term failure mode doesn't start out as [some set of AIs gets a DSA], but rather [AI capability increase selects against meaningful human control] - and the DSA stuff is downstream of that.
This is a possibility with the [individually controllable powerful AI assistants] approach - whether or not this immediately takes things to transformational AI territory. Suppose we get the hoped-for >10x research speedup. Do we have a principled strategy for controlling the collective system this produces? I haven't heard one. I wouldn't say we're doing a good job of controlling the current collective system.
I've heard cases for [this will speed things up], and [here are some good things this would make easier] but not for [overall, such a process should be expected to take things in a less doomy direction].
For such cases "you can’t learn enough from analogous but lower-stakes contexts" ought not to apply. However, I'd certainly expect "we won’t learn enough from analogous but lower-stakes contexts" (without huge efforts to avoid this).
On your (2), I think you're ignoring an understanding-related asymmetry:
- Without clear models describing (a path to) a solution, it is highly unlikely we have a workable solution to a deep and complex problem:
- Absence of concrete [we have (a path to) a solution] is pretty strong evidence of absence.
[EDIT for clarity, by "we have" I mean "we know of", not "there exists"; I'm not claiming there's strong evidence that no path to a solution exists]
- Whether or not we have clear models of a problem, it is entirely possible for it to exist and to kill us:
- Absence of concrete [there-is-a-problem] evidence is weak evidence of absence.
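The asymmetry in the two cases above can be sketched with Bayes' rule. This is only an illustration: the priors and likelihoods below are made-up assumptions chosen to show the shape of the argument, not estimates.

```python
def posterior_given_no_evidence(prior: float,
                                p_silent_if_true: float,
                                p_silent_if_false: float) -> float:
    """P(H | we see no concrete evidence for H), by Bayes' rule.

    p_silent_if_true:  chance we'd see no concrete evidence even though H holds.
    p_silent_if_false: chance we'd see no concrete evidence when H is false.
    """
    numerator = prior * p_silent_if_true
    return numerator / (numerator + (1.0 - prior) * p_silent_if_false)


# H1: "we know of (a path to) a solution". If that were true we would
# almost certainly have clear models, so silence is unlikely under H1.
solution = posterior_given_no_evidence(0.5, 0.05, 1.0)

# H2: "a lethal problem exists". A problem can exist without producing
# legible evidence in advance, so silence is quite likely even under H2.
problem = posterior_given_no_evidence(0.5, 0.8, 1.0)

if __name__ == "__main__":
    print(solution, problem)
```

With these assumed numbers, lack of evidence pushes the probability of a known solution far below its prior, while it barely moves the probability that a problem exists.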
A problem doesn't have to wait until we have formal arguments or strong, concrete empirical evidence for its existence before killing us. To claim that it's "premature" to shut down the field before we have [evidence of type x], you'd need to make a case that [doom before we have evidence of type x] is highly unlikely.
A large part of the MIRI case is that there is much we don't understand, and that parts of the problem we don't understand are likely to be hugely important. An evidential standard that greatly down-weights any but the most rigorous, legible evidence is liable to lead to death-by-sampling-bias.
Of course it remains desirable for MIRI arguments to be as legible and rigorous as possible. Empiricism would be nice too (e.g. if someone could come up with concrete problems whose solution would be significant evidence for understanding something important-according-to-MIRI about alignment).
But ignoring the asymmetry here is a serious problem.
On your (3), it seems to me that you want "skeptical" to do more work than is reasonable. I agree that we "should be skeptical of purely theoretical arguments for doom" - but initial skepticism does not imply [do not update much on this]. It implies [consider this very carefully before updating]. It's perfectly reasonable to be initially skeptical but to make large updates once convinced.
I do not think [the arguments are purely theoretical] is one of your true objections - rather it's that you don't find these particular theoretical arguments convincing. That's fine, but no argument against theoretical arguments.
This makes it even clearer that Altman’s claims of ignorance were lies – he cannot possibly have believed that former employees unanimously signed non-disparagements for free!
This is still quoting Neel, right? Presumably you intended to indent it.
Have you looked through the FLI faculty listed there?
How many seem useful supervisors for this kind of thing? Why?
If we're sticking to the [generate new approaches to core problems] aim, I can see three or four I'd be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and that publication would not be necessary (or a very low concrete number agreed upon).
There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basically aren't working on them).
The majority don't write anything that suggests they know what the core problems are.
For almost all of these supervisors, doing a PhD would seem to provide quite a few constraints, undesirable incentives, and an environment that's poor.
From an individual's point of view this can still make sense, if it's one of the only ways to get stable medium-term funding.
From a funder's point of view, it seems nuts.
(again, less nuts if the goal were [incremental progress on prosaic approaches, and generation of a respectable publication record])
A few points here (all with respect to a target of "find new approaches to core problems in AGI alignment"):
It's not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that's more easily achieved by not doing a PhD. (to the extent that development of research 'taste'/skill acts to service a publish-or-perish constraint, that's likely to be harmful)
This is not to say that there's nothing useful about an academic context - only that the sensible approach seems to be [create environments with some of the same upsides, but fewer downsides].
I can see a more persuasive upside where the PhD environment gives:
- Access to deep expertise in some relevant field.
- The freedom to explore openly (without any "publish or perish" constraint).
This seems likely to be both rare, and more likely for professors not doing ML. I note here that ML professors are currently not solving fundamental alignment problems - we're not in a [Newtonian physics looking for Einstein] situation; more [Aristotelian physics looking for Einstein]. I can more easily imagine a mathematics PhD environment being useful than an ML one (though I'd expect this to be rare too).
This is also not to say that a PhD environment might not be useful in various other ways. For example, I think David Krueger's lab has done and is doing a bunch of useful stuff - but it's highly unlikely to uncover new approaches to core problems.
For example, of the 213 concrete problems posed here how many would lead us to think [it's plausible that a good answer to this question leads to meaningful progress on core AGI alignment problems]? 5? 10? (many more can be a bit helpful for short-term safety)
There are a few where sufficiently general answers would be useful, but I don't expect such generality - both since it's hard, and because incentives constantly push towards [publish something on this local pattern], rather than [don't waste time running and writing up experiments on this local pattern, but instead investigate underlying structure].
I note that David's probably at the top of my list for [would be a good supervisor for this kind of thing, conditional on having agreed the exploratory aims at the outset], but the environment still seems likely to be not-close-to-optimal, since you'd be surrounded by people not doing such exploratory work.
RFPs seem a good tool here for sure. Other coordination mechanisms too.
(And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you'd like to see])
Oh and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it's non-obvious what conclusions to draw, but more data is a good starting point. It's on my to-do list to read it carefully and share some thoughts.
I agree with Tsvi here (as I'm sure will shock you :)).
I'd make a few points:
- "our revealed preferences largely disagree with point 1" - this isn't clear at all. We know MATS' [preferences, given the incentives and constraints under which MATS operates]. We don't know what you'd do absent such incentives and constraints.
- I note also that "but we aren't Refine" has the form [but we're not doing x], rather than [but we have good reasons not to do x]. (I don't think MATS should be Refine, but "we're not currently 20% Refine-on-ramp" is no argument that it wouldn't be a good idea)
- MATS is in a stronger position than most to exert influence on the funding landscape. Sure, others should make this case too, but MATS should be actively making a case for what seems most important (to you, that is), not only catering to the current market.
- Granted, this is complicated by MATS' own funding constraints - you have more to lose too (and I do think this is a serious factor, undesirable as it might be).
- If you believe that the current direction of the field isn't great, then "ensure that our program continues to meet the talent needs of safety teams" is simply the wrong goal.
- Of course the right goal isn't diametrically opposed to that - but still, not that.
- There's little reason to expect the current direction of the field to be close to ideal:
- At best, the accuracy of the field's collective direction will tend to correspond to its collective understanding - which is low.
- There are huge commercial incentives exerting influence.
- There's no clarity on what constitutes (progress towards) genuine impact.
- There are many incentives to work on what's already not neglected (e.g. things with easily located "tight empirical feedback loops"). The desirable properties of the non-neglected directions are a large part of the reason they're not neglected.
- Similar arguments apply to [field-level self-correction mechanisms].
- Given (4), there's an inherent sampling bias in taking [needs of current field] as [what MATS should provide]. Of course there's still an efficiency upside in catering to [needs of current field] to a large extent - but efficiently heading in a poor direction still sucks.
- I think it's instructive to consider extreme-field-composition thought experiments: suppose the field were composed of [10,000 researchers doing mech interp] and [10 researchers doing agent foundations].
- Where would there be most jobs? Most funding? Most concrete ideas for further work? Does it follow that MATS would focus almost entirely on meeting the needs of all the mech interp orgs? (I expect that almost all the researchers in that scenario would claim mech interp is the most promising direction)
- If you think that feedback loops along the lines of [[fast legible work on x] --> [x seems productive] --> [more people fund and work on x]] lead to desirable field dynamics in an AIS context, then it may make sense to cater to the current market. (personally, I expect this to give a systematically poor signal, but it's not as though it's easy to find good signals)
- If you don't expect such dynamics to end well, it's worth considering to what extent MATS can be a field-level self-correction mechanism, rather than a contributor to predictably undesirable dynamics.
- I'm not claiming this is easy!!
- I'm claiming that it should be tried.
Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.
Understandable, but do you know anyone who's considering this? As the core of their job, I mean - not on a [something they occasionally think/talk about for a couple of hours] level. It's non-obvious to me that anyone at OpenPhil has time for this.
It seems to me that the collective 'decision' we've made here is something like:
- Any person/team doing this job would need:
- Extremely good AIS understanding.
- To be broadly respected.
- To have a lot of time.
- Nobody like this exists.
- We'll just hope things work out okay using a passive distributed approach.
To my eye this leads to a load of narrow optimization according to often-not-particularly-enlightened metrics - lots of common incentives, common metrics, and correlated failure.
Oh and I still think MATS is great :) - and that most of these issues are only solvable with appropriate downstream funding landscape alterations. That said, I remain hopeful that MATS can nudge things in a helpful direction.
For reference there's this: What I learned running Refine
When I talked to Adam about this (over 12 months ago), he didn't think there was much to say beyond what's in that post. Perhaps he's updated since.
My sense is that I view it as more of a success than Adam does. In particular, I think it's a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Agreed that Refine's timescale is clearly too short.
However, a much longer program would set a high bar for whoever's running it.
Personally, I'd only be comfortable doing so if the setup were flexible enough that it didn't seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).
(understood that you'd want to avoid the below by construction through the specification)
I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes.
It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn't capture everything that we care about.
I haven't thought about it in any detail, but doesn't using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?
[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]
not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.
Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".
It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details
This is one mechanism by which such a system could cause great downstream harm.
Suppose that we have a process to avoid this. What assurance do we have that there aren't other mechanisms to cause harm?
I don't yet buy the description complexity penalty argument (as I currently understand it - but quite possibly I'm missing something). It's possible to manipulate by strategically omitting information. Perhaps the "penalise heavily biased sampling" is intended to avoid this (??). If so, I'm not sure how this gets us more than a hand-waving argument.
I imagine it's very hard to do indirect manipulation without adding much complexity.
I imagine that ASL-4+ systems are capable of many very hard things.
Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer - which I expect is untrue for any simple x.
I can buy that there are simple properties whose reduction guarantees safety if it's done to an extreme degree - but then I'm back to expecting the system to do nothing useful.
As an aside, I'd note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That's not a criticism of the overall approach - I just want to highlight that I don't think we get to have both [system provides helpful-in-ways-we-hadn't-considered output] and [system can't produce harmful output]. Allowing the former seems to allow the latter.
I would like to fund a sleeper-agents-style experiment on this by the end of 2025
That's probably a good idea, but this kind of approach doesn't seem in keeping with a "Guaranteed safe" label. More of a "We haven't yet found a way in which this is unsafe".
This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])
The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing], or [AI system keeps the world locked into some least-harmful path], depending on the setup - and here of course "least harmful" isn't a utopia, since it's a distribution of safety specifications, not desirability specifications).
Am I mistaken about this?
I'm very pleased that people are thinking about this, but I fail to understand the optimism - hopefully I'm confused somewhere!
Is anyone working on toy examples as proof of concept?
I worry that there's so much deeply technical work here that not enough time is being spent to check that the concept is workable (is anyone focusing on this?). I'd suggest focusing on mental influences: what kind of specification would allow me to radically change my ideas, but not to be driven insane? What's the basis to think we can find such a specification?
It seems to me that finding a fit-for-purpose safety/acceptability specification won't be significantly easier than finding a specification for ambitious value alignment.
So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.
I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).
In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirable incentives (mostly).
Even here it's hard, since there'd always need to be a [gain more influence] mechanism to balance the possibility of losing your influence.
In practice, most of the implicit bets made through inaction go unnoticed - even where they're high-stakes (arguably especially when they're high-stakes: most counterfactual value lies in the actions that won't get done by someone else; you won't be punished for being late to the party when the party never happens).
That leaves the explicit bets. To look like a good decision-maker the incentive is then to make low-variance explicit positive EV bets, and rely on the fact that most of the high-variance, high-EV opportunities you're not taking will go unnoticed.
From my by-no-means-fully-informed perspective, the failure mode at OpenPhil in recent years seems not to be [too many explicit bets that don't turn out well], but rather [too many failures to make unclear bets, so that most EV is left on the table]. I don't see support for hits-based research. I don't see serious attempts to shape the incentive landscape to encourage sufficient exploration. It's not clear that things are structurally set up so anyone at OP has time to do such things well (my impression is that they don't have time, and that thinking about such things is no-one's job (?? am I wrong ??)).
It's not obvious to me whether the OpenAI grant was a bad idea ex-ante. (though probably not something I'd have done)
However, I think that another incentive towards middle-of-the-road, risk-averse grant-making is the last thing OP needs.
That said, I suppose much of the downside might be mitigated by making a distinction between [you wasted a lot of money in ways you can't legibly justify] and [you funded a process with (clear, ex-ante) high negative impact].
If anyone's proposing punishing the latter, I'd want it made very clear that this doesn't imply punishing the former. I expect that the best policies do involve wasting a bunch of money in ways that can't be legibly justified on the individual-funding-decision level.
Some thoughts:
- Necessary conditions aren't sufficient conditions. Lists of necessary conditions can leave out the hard parts of the problem.
- The hard part of the problem is in getting a system to robustly behave according to some desirable pattern (not simply to have it know and correctly interpret some specification of the pattern).
- I don't see any reason to think that prompting would achieve this robustly.
- As an attempt at a robust solution, without some other strong guarantee of safety, this is indeed a terrible idea.
- I note that I don't expect trying it empirically to produce catastrophe in the immediate term (though I can't rule it out).
- I also don't expect it to produce useful understanding of what would give a robust generalization guarantee.
- With a lot of effort we might achieve [we no longer notice any problems]. This is not a generalization guarantee. It is an outcome I consider plausible after putting huge effort into eliminating all noticeable problems.
- The "capabilities are very important [for safety]" point seems misleading:
- Capabilities create the severe risks in the first place.
- We can't create a safe AGI without advanced capabilities, but we may be able to understand how to make an AGI safe without advanced capabilities.
- There's no "...so it makes sense that we're working on capabilities" corollary here.
- The correct global action would be to try gaining theoretical understanding for a few decades before pushing the cutting edge on capabilities. (clearly this requires non-trivial coordination!)
I think it's important to distinguish between:
1. Has understood a load of work in the field.
2. Has understood all known fundamental difficulties.
It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.
Relevant here is Geoffrey Irving's AXRP podcast appearance. (if anyone already linked this, I missed it)
I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.
Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.
[emphasis mine]
This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".
A more reasonable-to-my-mind behavioral reductionist perspective might look like this.
Ruling out goal realism as a good way to think does not leave us with [the particular type of reductionist perspective you're highlighting].
In practice, I think the reductionist perspective you point at is:
- Useful, insofar as it answers some significant questions.
- Highly misleading if we ever forget that [this perspective doesn't show us that x is a problem] doesn't tell us [x is not a problem].
Sure, understood.
However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion." If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion.
I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.
Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).
Models don't need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.
Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.
It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.
If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help. E.g. giving huge amounts of context, since [inferences about author given input ()] are not a subset of [inferences about author given input ( ... )].
Thanks for the thoughtful response.
A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.
I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.
I agree that this isn't a huge deal in general - however, I do think it's usually easy to fix: either a [name a process, not a result] or a [say what happened, not what you guess it implies] approach is pretty general.
Also agreed that improving summaries is more important. Quite hard to achieve given the selection effects: [x writes a summary on y] tends to select for [x is enthusiastic about y] and [x has time to write a summary]. [x is enthusiastic about y] in turn selects for [x misunderstands y to be more significant than it is].
Improving this situation deserves thought and effort, but seems hard. Great communication from the primary source is clearly a big plus (not without significant time cost, I'm sure). I think your/Buck's posts on the control stuff are commendably clear and thorough.
I expect the paper itself is useful (I've still not read it). In general I'd like the focus to be on understanding where/how/why debate fails - both in the near-term cases, and the more exotic cases (though I expect the latter not to look like debate-specific research). It's unsurprising that it'll work most of the time in some contexts. Completely fine for [show a setup that works] to be the first step, of course - it's just not the interesting bit.
I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)
I'm not clear whether the idea is that:
1. The title isn't an overstatement.
2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
3. The title will not mislead significant numbers of people in important ways. It's marginally negative, but not worth time/attention.
4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
6. Something else.
I'm not claiming that this is unusual, or a huge issue on its own.
I am claiming that the norms here seem systematically unhelpful.
I'm more interested in the general practice than this paper specifically (though I think it's negative here).
I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)
Interesting - I look forward to reading the paper.
However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results? I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answers".
The title matters to those who won't read the paper, and can't easily guess at the generality of what you'll have shown (e.g. that your paper doesn't include theoretical results suggesting that we should expect this pattern to apply robustly or in general). Again, I know this is a general issue - this just happens to be a context where I can point this out with high confidence without having read the paper :).
Thanks for the link.
I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)
I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.
Great post (I've not yet read it thoroughly, or thought for long).
The first concern that springs to mind:
- I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don't tend to think of [subversive strategy x]], and [humans don't tend to notice [subversive strategy x]].
- My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at. It seems plausible for us to be entirely unaware of the existence of some subversive strategies - since they've never been human skills, and so don't appear on our map.
- It seems to me that the near-term danger is better described as [subversive strategies that don't occur to humans], rather than [superhuman subversive strategies]. The latter will put people in mind of [strategies humans use, only better] - genius hacking/persuasion etc. I also want people considering [strategies that don't occur to humans at all].
- On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies". There's no necessity for a subversive strategy to map nicely into something I have a label for - e.g. "hacking", "persuasion".
- If we need to rule out all problematic strategies, I want to avoid language/thinking that may stop us considering problems outside our usual categories.
- (I don't think you're saying anything substantively imprecise - here I'm only concerned about language and blind-spots)
One last thing that caught my eye:
...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...
This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always. E.g. being persuaded of x when it's likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.
I note also that there's a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].