Posts

Truthfulness, standards and credibility 2022-04-07T10:31:53.103Z
Review of "Learning Normativity: A Research Agenda" 2021-06-06T13:33:28.371Z
Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
A Critique of Non-Obstruction 2021-02-03T08:45:42.228Z
Optimal play in human-judged Debate usually won't answer your question 2021-01-27T07:34:18.958Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z

Comments

Comment by Joe_Collman on What is the purpose and application of AI Debate? · 2024-04-05T23:33:33.800Z · LW · GW

Relevant here is Geoffrey Irving's AXRP podcast appearance. (if anyone already linked this, I missed it)

I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.

Comment by Joe_Collman on Counting arguments provide no evidence for AI doom · 2024-02-28T19:20:07.914Z · LW · GW

Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.

[emphasis mine]

This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".

A more reasonable-to-my-mind behavioral reductionist perspective might look like this.

Ruling out goal realism as a good way to think does not leave us with [the particular type of reductionist perspective you're highlighting].
In practice, I think the reductionist perspective you point at is:

  • Useful, insofar as it answers some significant questions.
  • Highly misleading if we ever forget that [this perspective doesn't show us that x is a problem] doesn't tell us [x is not a problem].
Comment by Joe_Collman on Critiques of the AI control agenda · 2024-02-17T07:05:21.962Z · LW · GW

Sure, understood.

However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion." If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this?), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in this context without strong assumptions on the correlation of capability across different kinds of understanding/persuasion.

I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)

Comment by Joe_Collman on Critiques of the AI control agenda · 2024-02-16T19:56:04.542Z · LW · GW

Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).

Models don't need to be capable of a Pareto improvement on human persuasion strategies in order to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.

Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.

It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.

If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help. E.g. giving huge amounts of context, since [inferences about the author given some particular input] are not a subset of [inferences about the author given that input plus a large amount of additional context].

Comment by Joe_Collman on Debating with More Persuasive LLMs Leads to More Truthful Answers · 2024-02-08T07:46:30.823Z · LW · GW

Thanks for the thoughtful response.

A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.

I don't have an issue with titles like "...Improving safety..." since they have a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.

I agree that this isn't a huge deal in general - however, I do think it's usually easy to fix: either a [name a process, not a result] or a [say what happened, not what you guess it implies] approach is pretty general.

Also agreed that improving summaries is more important. Quite hard to achieve given the selection effects: [x writes a summary on y] tends to select for [x is enthusiastic about y] and [x has time to write a summary]. [x is enthusiastic about y] in turn selects for [x misunderstands y to be more significant than it is].

Improving this situation deserves thought and effort, but seems hard. Great communication from the primary source is clearly a big plus (not without significant time cost, I'm sure). I think your/Buck's posts on the control stuff are commendably clear and thorough.

I expect the paper itself is useful (I've still not read it). In general I'd like the focus to be on understanding where/how/why debate fails - both in the near-term cases, and the more exotic cases (though I expect the latter not to look like debate-specific research). It's unsurprising that it'll work most of the time in some contexts. Completely fine for [show a setup that works] to be the first step, of course - it's just not the interesting bit.

Comment by Joe_Collman on Debating with More Persuasive LLMs Leads to More Truthful Answers · 2024-02-08T03:14:22.172Z · LW · GW

I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)

I'm not clear whether the idea is that:

  1. The title isn't an overstatement.
  2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
  3. The title will not mislead significant amounts of people in important ways. It's marginally negative, but not worth time/attention.
  4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
  5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
  6. Something else.

I'm not claiming that this is unusual, or a huge issue on its own.
I am claiming that the norms here seem systematically unhelpful.
I'm more interested in the general practice than this paper specifically (though I think it's negative here).

I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)

Comment by Joe_Collman on Debating with More Persuasive LLMs Leads to More Truthful Answers · 2024-02-07T22:04:43.031Z · LW · GW

Interesting - I look forward to reading the paper.

However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results? I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answers".

The title matters to those who won't read the paper, and can't easily guess at the generality of what you'll have shown (e.g. that your paper doesn't include theoretical results suggesting that we should expect this pattern to apply robustly or in general). Again, I know this is a general issue - this just happens to be a context where I can point this out with high confidence without having read the paper :).

Comment by Joe_Collman on The case for ensuring that powerful AIs are controlled · 2024-01-28T07:13:15.744Z · LW · GW

Thanks for the link.

I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)

I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.

Comment by Joe_Collman on The case for ensuring that powerful AIs are controlled · 2024-01-26T07:27:21.919Z · LW · GW

Great post (I've not yet read it thoroughly, or thought for long).

The first concern that springs to mind:

  • I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don't tend to think of [subversive strategy x]], and [humans don't tend to notice [subversive strategy x]].
    • My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at. It seems plausible for us to be entirely unaware of the existence of some subversive strategies - since they've never been human skills, and so don't appear on our map.
    • It seems to me that the near-term danger is better described as [subversive strategies that don't occur to humans], rather than [superhuman subversive strategies]. The latter will put people in mind of [strategies humans use, only better] - genius hacking/persuasion etc. I also want people considering [strategies that don't occur to humans at all].
  • On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies". There's no necessity for a subversive strategy to map nicely into something I have a label for - e.g. "hacking", "persuasion".
    • If we need to rule out all problematic strategies, I want to avoid language/thinking that may stop us considering problems outside our usual categories.
    • (I don't think you're saying anything substantively imprecise - here I'm only concerned about language and blind-spots)

One last thing that caught my eye:

...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...

This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always. E.g. being persuaded of x when it's likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.

I note also that there's a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].

Comment by Joe_Collman on What rationality failure modes are there? · 2024-01-20T22:36:08.196Z · LW · GW

Some examples here: mistakes with conservation of expected evidence
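
For reference, the principle in question (the standard statement, not anything from the linked post):

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \lnot E)\,P(\lnot E)$$

i.e. your current credence must equal the expectation of your post-update credence, so you can't expect evidence to predictably shift you in one direction.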

Comment by Joe_Collman on Against most, but not all, AI risk analogies · 2024-01-15T01:15:17.333Z · LW · GW

For crying out loud, LLMs are already considered "AIs" by most people! How could they be a worse analogy for AI, across the board, than extraterrestrial beings that we have never come in contact with?

By tending to lead to overconfidence.
An aliens analogy is explicitly relying on [we have no idea what this will do]. It's easy to imagine friendly aliens, just as it's easy to imagine unfriendly ones, or entirely disinterested ones. The analogy is unlikely to lead to a highly specific, incorrect model.

This is not true for LLMs. It's easy to assume that particular patterns will continue to hold - e.g. that it'll be reasonably safe to train systems with something like our current degree of understanding.

To be clear, I'm not saying they're worse in terms of information content: I'm saying they can be worse in the terms you're using to object to analogies: "routinely conveying the false impression of a specific, credible model of AI".

I think it's correct that we should be very wary of the use of analogies (though they're likely unavoidable).
However, the cases where we need to be the most wary are those that seem most naturally applicable - these are the cases that are most likely to lead to overconfidence. LLMs, [current NNs], or [current AI systems generally] are central examples here.

 

On asymmetric pushback, I think you're correct, but that you'll tend to get an asymmetry everywhere between [bad argument for conclusion most people agree with] and [bad argument for conclusion most people disagree with].
People have limited time. They'll tend to put a higher value on critiquing invalid-in-their-opinion arguments when those lead to incorrect-in-their-opinion conclusions (at least unless they're deeply involved in the discussion).

There's also an asymmetry in terms of consequences-of-mistakes here: if we think that AI will be catastrophic, and are wrong, this causes a delay, a large loss of value, and a small-but-significant increase in x-risk; if we think that AI will be non-catastrophic, and are wrong, we're dead.

Lack of pushback shouldn't be taken as a strong indication that people agree with the argumentation used.

Clearly this isn't ideal.
I do think it's worth thinking about mechanisms to increase the quality of argument.
E.g. I think the ability to emoji react to particular comment sections is helpful here - though I don't think there's one that's great for [analogy seems misleading] as things stand. Perhaps there should be a [seems misleading] react?? (I don't think "locally invalid" covers this)

Comment by Joe_Collman on OpenAI's Preparedness Framework: Praise & Recommendations · 2024-01-14T06:21:52.280Z · LW · GW

Concrete suggestion: OpenAI should allow the Safety Advisory Group Chair and the head of the Preparedness Team to have “veto power” on model development and deployment decisions

Quite possibly a good idea, but I think it's less obvious than it seems at first glance:
Remember that a position's having veto power will tend to have a large impact on selection for that position.

The comparison isn't [x with veto power] vs [x without veto power].
It's [x with veto power] vs [y without veto power].
If y would tend to have deeper understanding, more independence or more caution than x, it's not obvious that giving the position veto power helps. Better to have someone who'll spot problems and need to use persuasion, than someone who can veto but spots no problems.

Comment by Joe_Collman on Simulators · 2024-01-12T04:53:40.087Z · LW · GW

I'm using 'simulation' as it's used in the post [the imitation of the operation of a real-world process or system over time]. The real-world process is the production of the string of tokens.

I still think that referring to what the LLM does in one step as "a simulation" is at best misleading. "a prediction" seems accurate and not to mislead in the same way.

Comment by Joe_Collman on Simulators · 2024-01-12T04:42:02.739Z · LW · GW

The point I'm making here is that in the terms of this post the LLM defines the transition function of a simulation.

I.e. the LLM acts on [string of tokens], to produce [extended string of tokens].
The simulation is the entire thing: the string of tokens changing over time according to the action of the LLM.
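
To make that distinction concrete, here's a minimal sketch (my own toy illustration - nothing here is from the post, and it says nothing about real LLM internals; llm_step is a hypothetical stand-in for one forward pass):

```python
import random

def llm_step(tokens):
    """Stand-in for the LLM: one forward pass mapping the current token string to a next token.
    (A real model would sample from its predicted next-token distribution.)"""
    return random.choice(["the", "cat", "sat", "."])  # dummy 'physics'

def run_simulation(tokens, n_steps):
    """The simulation: the whole trajectory of token-string states over time."""
    trajectory = [list(tokens)]
    for _ in range(n_steps):
        tokens = tokens + [llm_step(tokens)]  # one atomic step: append a single token
        trajectory.append(list(tokens))
    return trajectory

# llm_step defines the transition function; run_simulation(["once", "upon"], 5)
# is the simulation - a trajectory of six successive world states.
```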

Saying "the LLM is a simulation" strongly suggests that a simulation process (i.e. "the imitation of the operation of a real-world process or system over time") is occurring within the LLM internals.

Saying "GPT is a simulator" isn't too bad - it's like saying "The laws of physics are a simulator". Loosely correct.
Saying "GPT is a simulation" is like saying "The laws of physics are a simulation", which is at least misleading - I'd say wrong.

In another context it might not be too bad. In this post simulation has been specifically described as "the imitation of the operation of a real-world process or system over time". There's no basis to think that the LLM is doing this internally.

Unless we're claiming that it's doing something like that internally, we can reasonably say "The LLM produces a simulation", but not "The LLM is a simulation".

(oh and FYI, Janus is "they" - in the sense of actually being two people: Kyle and Laria)

Comment by Joe_Collman on Simulators · 2024-01-12T04:21:10.523Z · LW · GW

Yeah - I just noticed this "...is the mechanics underlying our world." on the tag page.
Agreed that it's inaccurate and misleading.

I hadn't realized it was being read this way.

Comment by Joe_Collman on Saving the world sucks · 2024-01-12T03:37:33.262Z · LW · GW

Some relevant thoughts are in Nate's Replacing Guilt sequence.

I can understand the sentiments expressed here - particularly in terms of dealing with these things at the age of 12.

However, I'd draw a distinction between:

  1. Making the world better according to values that aren't your own, from a sense of obligation.
  2. Doing something to prevent the literal end of the world.

And I'd note that (2) does not rest on (1), on altruism (effective or otherwise), or on any particularly narrow moral view. Wanting some kind of non-paperclip-like world to continue existing isn't a niche value.

Nietzsche or Ayn Rand would be among the last people to be guilted into saving the world, but may well do it anyway. This is not because they cared deeply about shrimp! (not to my knowledge, that is)

But of course there's a lot to be said for understanding your values, and following a path you endorse on your own terms.

Comment by Joe_Collman on Simulators · 2024-01-12T02:37:25.510Z · LW · GW

I don't disagree, but I don't think that describing the process an LLM uses to generate a single token as a simulation is clarifying in this context.

I'm fairly sure the post is making no such claim, and I think it becomes a lot more likely that readers will have habryka's interpretation if the word "simulation" is applied to LLM internals (and correctly conclude that this interpretation entails implausible claims).
I think "predictor" or the like is much better here.

Unless I'm badly misunderstanding, the post is taking a time-evolution-of-a-system view of the string of tokens - not of LLM internals.
I don't think it's claiming anything about what the internal LLM mechanism looks like.

Comment by Joe_Collman on An even deeper atheism · 2024-01-12T02:17:13.114Z · LW · GW

The main case for optimism on human-human alignment under extreme optimization seems to be indirection: not that [what I want] and [what you want] happen to be sufficiently similar, but that there's a [what you want] pointer within [what I want].

Value fragility doesn't argue strongly against the pointer-based version. The tails don't come apart when they're tied together.

It's not obvious that the values-on-reflection of an individual human would robustly maintain the necessary pointers (to other humans, to past selves, to alternative selves/others...), but it is at least plausible - if you pick the right human.

More generally, an argument along the lines of [the default outcome with AI doesn't look too different from the default outcome without AI, for most people] suggests that we need to do better than the default, with or without AI. (I'm not particularly optimistic about human-human alignment without serious and principled efforts)

Comment by Joe_Collman on Simulators · 2024-01-11T19:10:23.597Z · LW · GW

Sure, but I don't think anyone is claiming that there's a similarity between a brain stepping forward in physical time and transformer internals. (perhaps my wording was clumsy earlier)

IIUC, the single timestep in the 'physics' of the post is the generation and addition of one new token.
I.e. GPT uses [some internal process] to generate a token.
Adding the new token is a single atomic update to the "world state" of the simulation.
The [some internal process] defines GPT's "laws of physics".

The post isn't claiming that GPT is doing some generalized physics internally.
It's saying that [GPT(input_states) --> (output_states)] can be seen as defining the physical laws by which a simulation evolves.

As I understand it, it's making almost no claim about internal mechanism.

Though I think "GPT is a simulator" is only intended to apply if its simulator-like behaviour robustly generalizes - i.e. if it's always producing output according to the "laws of physics" of the training distribution (this is imprecise, at least in my head - I'm unclear whether Janus have any more precise criterion).

I don't think the post is making substantive claims that disagree with [your model as I understand it]. It's only saying: here's a useful way to think about the behaviour of GPT.

Comment by Joe_Collman on Simulators · 2024-01-11T07:15:40.073Z · LW · GW

Oh, hang on - are you thinking that Janus is claiming that GPT works by learning some approximation to physics, rather than 'physics'?

IIUC, the physics being referred to is either through analogy (when it refers to real-world physics), or as a generalized 'physics' of [stepwise addition of tokens]. There's no presumption of a simulation of physics (at any granularity).

E.g.:

Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples.

Apologies if I'm the one who's confused :).
This just seemed like a natural explanation for your seeming to think the post is claiming a lot more mechanistically. (I think it's claiming almost nothing)

Comment by Joe_Collman on Simulators · 2024-01-11T06:44:42.310Z · LW · GW

Perhaps we're talking past each other to a degree. I don't disagree with what you're saying.
I think I've been unclear - or perhaps just saying almost vacuous things. I'm attempting to make a very weak claim (I think the post is also making no strong claim - not about internal mechanism, at least).

I only mean that the output can often be efficiently understood in terms of human characters (among other things). I.e. that the output is a simulation, and that human-like minds will be an efficient abstraction for us to use when thinking about such a simulation. Privileging hypotheses involving the dynamics of the outputs of human-like minds will tend to usefully constrain expectations.

Again, I'm saying something obvious here - perhaps it's too obvious to you. The only real content is something like [thinking of the output as being a simulation including various simulacra, is likely to be less misleading than thinking of it as the response of an agent].

I do not mean to imply that the internal cognition of the model necessarily has anything simulation-like about it. I do not mean that individual outputs are produced by simulation. I think you're correct that this is highly unlikely to be the most efficient internal mechanism to predict text.

Overall, I think the word "simulation" invites confusion, since it's forever unclear whether we're pointing at the output of a simulation process, or the internal structure of that process.
Generally I'm saying:
[adding a single token] : single simulation step - using the training distribution's 'physics'.
[long string of tokens] : a simulation
[process of generating a single token] : [highly unlikely to be a simulation]

Comment by Joe_Collman on Simulators · 2024-01-11T05:02:08.594Z · LW · GW

To add to Charlie's point (which seems right to me):

As I understand things, I think we are talking about a simulation of something somewhat close to human minds - e.g. text behaviour of humanlike simulacra (made of tokens - but humans are made of atoms). There's just no claim of an internal simulation.

I'd guess a common upside is to avoid constraining expectations unhelpfully in ways that [GPT as agent] might.

However, I do still worry about saying "GPT is a simulator" rather than something like "GPT currently produces simulations".
I think the former suggests too strongly that we understand something about what it's doing internally - e.g. at least that it's not inner misaligned, and won't stop acting like a simulator at some future time (and can easily be taken to mean that it's doing simulation internally).

If the aim is to get people thinking more clearly, I'd want it to be clearer that this is a characterization of [what GPTs currently output], not [what GPTs fundamentally are].

Comment by Joe_Collman on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-09T18:12:02.870Z · LW · GW

I think the broader use is sensible - e.g. to include post-training.

However, I'm not sure how narrow you'd want [training hacking] to be.
Do you want to call it training only if NN internals get updated by default? Or just that it's training hacking if it occurs during the period we consider training? (otherwise, [deceptive alignment of a ...selection... process that could be ongoing], seems to cover all deceptive alignment - potential deletion/adjustment being a selection process).

Fine if there's no bright line - I'd just be curious to know your criteria.

Comment by Joe_Collman on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-09T17:12:17.044Z · LW · GW

By contrast, deception is much broader—it’s any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.

This description allows us to classify every output of a highly capable AI as deceptive:
For any AI output, it's essentially guaranteed that a human will update away from the truth about something. A highly capable AI will be able to predict some of these updates - thus it will be "knowingly providing ... misleading information".

Conversely, we can't require that a human be misled about everything in order to classify something as deceptive - nothing would then qualify as deceptive.

There's no obvious fix here.

Our common-sense notion of deception is fundamentally tied to motivation:

  • A teacher says X in order to give a student a basic-but-flawed model that's misleading in various ways, but is a step towards deeper understanding.
    • Not deception.
  • A teacher says X to a student in order to give them a basic-but-flawed model that's misleading in various ways, to manipulate them.
    • Deception.

The student's updates in these cases can be identical. Whether we want to call the statement deceptive comes down to the motivation of the speaker (perhaps as inferred from subsequent actions).

In a real world context, it is not possible to rule out misleading behavior: all behavior misleads about something.
We can only hope to rule out malign misleading behavior. This gets us into questions around motivation, values etc (or at least into much broader considerations involving patterns of behavior and long-term consequences).


(I note also that requiring "knowingly" is an obvious loophole - allowing self-deception, willful ignorance or negligence to lead to bad outcomes; this is why some are focusing on truthfulness rather than honesty)

Comment by Joe_Collman on Deceptive AI ≠ Deceptively-aligned AI · 2024-01-09T16:23:01.164Z · LW · GW

…but the AI is actually emitting those outputs in order to create that impression—more specifically, the AI has situational awareness

I think it's best to avoid going beyond the RFLO description.

In particular, it is not strictly required that the AI be aiming to "create that impression", or that it has "situational awareness" in any strong/general sense.

Per footnote 26 in RFLO (footnote 7 in the post):
"Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to."

It needs to be:

  • Modeling the optimization pressure.
  • Adapting its responses to that optimization pressure.

Saying more than that risks confusion and overly narrow approaches.
By all means use things like "in order to create that impression" in an example. It shouldn't be in the definition.

Comment by Joe_Collman on Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”) · 2023-12-15T07:17:43.378Z · LW · GW

Thanks for writing the report - I think it’s an important issue, and you’ve clearly gone to a lot of effort. Overall, I think it’s good. 

However, it seems to me that the "incentivized episode" concept is confused, and that some conclusions over the probability of beyond-episode goals are invalid on this basis. I'm fairly sure I'm not confused here (though I'm somewhat confused that no-one picked this up in a draft, so who knows?!).

I'm not sure to what extent the below will change your broader conclusions - if it only moves you from [this can't happen, by definition], to [I expect this to be rare], the implications may be slight. It does seem to necessitate further thought - and different assumptions in empirical investigations.

The below can all be considered as [this is how things seem to me], so I’ll drop caveats along those lines. I’m largely making the same point throughout - I’m hoping the redundancy will increase clarity.

  1. "The unit of time such that we can know by definition that training is not directly pressuring the model to care about consequences beyond that time." - this is not a useful definition, since there is no unit of time where we know this by definition. We know only that the model is not pressured by consequences beyond training.
    1. For this reason, in what follows I'll largely assume that [episode = all of training]. Sticking with the above definition gives unbounded episodes, and that doesn't seem in the spirit of things.
  2. Replacing "incentivized episode" with "incentivizing episode" would help quite a bit. This clarifies the causal direction we know by definition. It’s a mistake to think that we can know what’s incentivized "by definition": we know what incentivizes.
    1. In particular, only the episode incentivizes; we expect [caring about the episode] to be incentivized; we shouldn't expect [caring about the episode] to be the only thing incentivized (at least not by definition).
  3. For instance, for any t, we can put our best attempt at [care about consequences beyond t] in the reward function (perhaps we reward behaviour most likely to indicate post-t caring; perhaps we use interpretability on weights and activations, and we reward patterns that indicate post-t caring - suppose for now that we have superb interpretability tools).
    1. The interpretability case is more direct than rewarding future consequences: it rewards the mechanisms that constitute caring/goals directly.
    2. If a process we've designed explicitly to train for [caring about consequences beyond t] is described as "not directly pressuring the model to [care about consequences beyond t]" by definition, then our definition is at least misleading - I'd say broken.
    3. Of course this example is contrived - but it illustrates the problem. There may be analogous implicit (but direct!) effects that are harder to notice. We can't rule these out with a definition.
      1. These need not be down to inductive bias: as here, they can be favoured as a consequence of the reward function (though usually not so explicitly and extremely as in this example).
  4. What matters is the robustness of correlations, not directness.
    1. Most directness is in our map, not in the territory. I.e. it's largely saying [this incentive is obvious to us], [this consequence is obvious to us] etc.
    2. The episode is the direct cause of gradient updates.
    3. As a consequence of gradient updates, behaviour on the episode is no more/less direct than behaviour outside that interval.
  5. While a strict constraint on beyond-episode goals may be rare, pressure will not be.
    1. The former would require that some beyond-episode goal is entailed by high performance in-episode.
    2. The latter only requires that some beyond-episode goal is more likely given high performance in-episode.
      1. This is the kind of pressure that might, in principle, be overcome by adversarial training - but it remains direct pressure for beyond-episode goals resulting from in-episode consequences.

 

The most significant invalid conclusion (Page 52):

  • "Why might you not expect naturally-arising beyond-episode goals? The most salient reason, to me, is that by definition, the gradients given in training (a) do not directly pressure the model to have them…"
    • As a general "by definition" claim, this is false (or just silly, if we say that the incentivized episode is unbounded, and there is no "beyond-episode").
    • It's less clear to me how often I'd expect active pressure towards beyond-episode-goals over strictly-in-episode-goals (other than through the former being more common/simple). It is clear that such pressure is possible.
    • Directness is not the issue: if I keep f(x) small, I keep x small. That I happen to be manipulating f(x) directly, rather than x directly, is immaterial.
      • Again, the out-of-episode consequences of gradient updates are just as direct as the in-episode consequences - they’re simply harder for us to predict.
Comment by Joe_Collman on AI #39: The Week of OpenAI · 2023-11-23T20:56:05.134Z · LW · GW

Is that ‘deceptive alignment’? You tell me.

I don't think it makes sense to classify every instance of this as deceptive alignment - and I don't think this is the usual use of the term.

I think that to say "this is deceptive alignment" is generally to say something like "there's a sense in which this system has a goal different from ours, is modeling the selection pressure it's under, anticipating that this selection pressure may not exist in the future, and adapting its behaviour accordingly".

That still leaves things underdefined, e.g. since this can all happen implicitly and/or without the system knowing this mechanism exists.
However, if you're not suggesting in any sense that [anticipation of potential future removal of selection pressure] is a big factor, then it's strange to call it deceptive alignment.

I assume Wiblin means it in this sense - not that this is the chance we get catastrophically bad generalization, but rather that it happens via a mechanism he'd characterize this way.

[I'm now less clear that this is generally agreed, since e.g. Apollo seem to be using a foolish-to-my-mind definition here: When an AI has Misaligned goals and uses Strategic Deception to achieve them (see "Appendix C - Alternative definitions we considered", for clarification).
This is not close to the RFLO definition, so I really wish they wouldn't use the same name. Things are confusing enough without our help.]

All that said, it's not clear to me that [deceptive alignment] is a helpful term or target, given that there isn't a crisp boundary, and that there'll be a tendency to tackle an artificially narrow version of the problem.
The rationale for solving it usually seems to be [if we can solve/avoid this subproblem, we'd have instrumentally useful guarantees in solving the more general generalization problem] - but I haven't seen a good case made that we get the kind of guarantees we'd need (e.g. knowing only that we avoid explicit/intentional/strategic... deception of the oversight process is not enough).
It's easy to motte-and-bailey ourselves into trouble.

Comment by Joe_Collman on AI Safety Research Organization Incubation Program - Expression of Interest · 2023-11-23T06:29:50.262Z · LW · GW

This seems great in principle.
The below is meant in the spirit of [please consider these things while moving forward with this], and not [please don't move forward until you have good answers on everything].

That said:

First, I think it's important to clearly distinguish:

  1. A great world would have a lot more AI safety orgs. (yes)
  2. Conditional on many new AI safety orgs starting, the world is in a better place. (maybe)
  3. Intervening to facilitate the creation of new AI safety orgs makes the world better. (also maybe)

This program would be doing (3), so it's important to be aware that (1) is not in itself much of an argument. I expect that it's very hard to do (3) well, and that even a perfect version doesn't allow us to jump to the (1) of our dreams. But I still think it's a good idea!

Some thoughts that might be worth considering (very incomplete, I'm sure):

  1. Impact of potential orgs will vary hugely.
    1. Your impact will largely come down to [how much you increase (/reduce) the chance that positive (/negative) high-impact orgs get created].
    2. This may be best achieved by aiming to create many orgs. It may not.
      1. Of course [our default should be zero new orgs] is silly, but so would be [we're aiming to create as many new orgs as possible].
      2. You'll have a bunch of information, time and levers that funders don't have, so I don't think such considerations can be left to funders.
    3. In the below I'll be mostly assuming that you're not agnostic to the kind of orgs you're facilitating (since this would be foolish :)). However, I note that even if you were agnostic, you'd inevitably make choices that imply significant tradeoffs.
  2. Consider the incentive landscape created by current funding sources.
    1. Consider how this compares to a highly-improved-by-your-lights incentive landscape.
    2. Consider what you can do to change things for the better in this regard.
      1. If anything seems clearly suboptimal as things stand, consider spending significant effort making this case to funders as soon as possible.
      2. Consider what data you could gather on potential failure modes, or simply on dynamics that are non-obvious at the outset. (anonymized appropriately)
        Gather as much data as possible.
        1. If you don't have the resources to do a good job at experimentation, data gathering etc., make this case to funders and get those resources. Make the case that the cost of this is trivial relative to the opportunity cost of failing to gather the information.
  3. The most positive-for-the-world orgs are likely among the hardest to create.
    1. By default, orgs created are likely to be doing not-particularly-neglected things (similar selection pressures that created the current field act on new orgs; non-neglected areas of the field correlate positively with available jobs and in-demand skills...).
    2. By default, it's much more likely to select for [org that moves efficiently in some direction] than [org that picks a high-EV-given-what's-currently-known direction].
      1. Given that impact can easily vary by a couple of orders of magnitude (and can be negative), direction is important.
      2. It's long-term direction that's important. In principle, an org that moves efficiently in some direction could radically alter that direction later. In practice, that's uncommon - unless this mindset existed at the outset.
        1. Perhaps facilitating this is another worthwhile intervention?? - i.e. ensuring that safety orgs have an incentive to pivot to higher-EV approaches, rather than to continue with a [low EV-relative-to-counterfactual, but high comparative advantage] approach.
    3. Making it easier to create any kind of safety org doesn't change the selection pressures much (though I do think it's a modest improvement). If all the branches are a little lower, it's still the low-hanging-fruit that tends to be picked first. It may often be easier to lower the low branches too.
      1. If possible, you'd want to disproportionately lower the highest branches. Clearly this is easier said than done. (e.g. spending a lot of resources on helping those with hard-to-make-legible ideas achieve legibility, [on a process level, if necessary], so that there's not strong selection for [easy to make legible])
  4. Ground truth feedback on the most important kinds of progress is sparse-to-non-existent.
    1. You'll be using proxies (for [what seems important], [what impact we'd expect org x to have], [what impact direction y has had], [what impact new org z has had] etc. etc.).
      1. Most proxies aren't great.
      2. The most natural proxies and metrics will tend to be the same ones others are using. This may help to get a project funded. It tends to act against neglectedness.
      3. Using multiple, non-obvious proxies is worth a thought.
        1. However, note that you don't have the True Name of [AI safety], [alignment]... in your head: you have a vague, confused proxy.
        2. One person coming up with multiple proxies, will tend to mean creating various proxies to their own internal proxy. That's still a single point of failure.
        3. If you all clearly understand the importance of all the proxies you're using, that's probably a bad sign.
  5. It's much better to create a great org slowly, than a mediocre org quickly. This can easily happen with (some of) the same people, entailing a high opportunity cost.
    • I think one of the most harmful dynamics at present is the expectation that people/orgs should have a concretely mapped out agenda/path-to-impact within a few months. This strongly selects against neglectedness.
    • Even Marius' response to this seems to have the wrong emphasis:
      "Second, a great agenda just doesn't seem like a necessary requirement. It seems totally fine for me to replicate other people’s work, extend existing agendas, or ask other orgs if they have projects to outsource (usually they do) for a year or so and build skills during that time. After a while, people naturally develop their own new ideas and then start developing their own agendas."
      I.e. that the options are:
      1. Have a great agenda.
      2. Replicate existing work, extend existing agenda, grab existing ideas to work on.
    • Where is the [spend time focusing on understanding the problem more deeply, and forming new ideas / approaches]? Of course this may sometimes entail some replication/extension, but that shouldn't be the motivation.
    • Financial pressures and incentives are important here: [We'll fund you for six months to focus on coming up with new approaches] amounts to [if you pick a high-variance approach, your org may well cease to exist in six months]. If the aim is to get an org to focus on exploration for six months, guaranteed funding for two years is a more realistic minimum.
      • Of course this isn't directly within your control - but it's the kind of thing you might want to make a case for to funders.
      • Again, the more you're able to shape the incentive landscape for future orgs, the more you'll be able to avoid unhelpful instrumental constraints, and focus on facilitating the creation of the kind of orgs that should exist.
      • Also worth considering that the requirement for this kind of freedom is closer to [the people need near-guaranteed financial support for 2+ years]. Where an org is uncertain/experimental, it may still make sense to give the org short-term funding, but all the people involved medium-term funding.
Comment by Joe_Collman on The other side of the tidal wave · 2023-11-10T18:19:11.343Z · LW · GW

That's my guess too, but I'm not highly confident in the [no attractors between those two] part.

It seems conceivable to have a not-quite-perfect alignment solution with a not-quite-perfect self-correction mechanism that ends up orbiting utopia, but neither getting there, nor being flung off into oblivion.

It's not obvious that this is an unstable, knife-edge configuration. It seems possible to have correction/improvement be easier at a greater distance from utopia. (whether that correction/improvement is triggered by our own agency, or other systems)

If stable orbits exist, it's not obvious that they'd be configurations we'd endorse (or that the things we'd become would endorse them).

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-30T21:54:35.308Z · LW · GW

Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.

Oh, I'm certainly not claiming that no-one should attempt to make the estimates.

I'm claiming that, conditional on such estimation teams being enshrined in official regulation, I'd expect their results to get misused. Therefore, I'd rather that we didn't have official regulation set up this way.

The kind of risk assessments I think I would advocate would be based on the overall risk of a lab's policy, rather than their immediate actions. I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]. (importantly, it won't always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)

Comment by Joe_Collman on Linkpost: Rishi Sunak's Speech on AI (26th October) · 2023-10-28T21:54:51.426Z · LW · GW

It's important not to ignore that this speech is to the general public.
While I agree that "in the most unlikely but extreme cases" is not accurate, it's not clear that this reflects the views of the PM / government, rather than what they think it's expedient to say.

Even if they took the risk fully seriously, and had doom at 60%, I don't think he'd say that in a speech.

The speech is consistent with [not quite getting it yet], but also consistent with [getting it, but not thinking it's helpful to say it in a public speech]. I'm glad Eliezer's out there saying the unvarnished truth - but it's less clear that this would be helpful from the prime minister.

It's worth considering the current political situation: the Conservatives are very likely to lose the next election (no later than Jan 2025 - but it often happens early [this lets the governing party pick their moment, have the element of surprise, and look like calling the election was a positive choice]).
Being fully clear about the threat in public could be perceived as political desperation. So far, the issue hasn't been politicized. If not coming out with the brutal truth helps with that, it's likely a price worth paying. In particular, it doesn't help if the UK government commits to things that Labour will scrap as soon as they get in.

Perhaps more importantly from his point of view, he'll need support from within his own party over the next year - if he's seen as sabotaging the Conservatives' chances in the next election by saying anything too weird / alarmist-seeming / not-playing-to-their-base, he may lose that.

Again, it's also consistent with not quite getting it, but that's far from the only explanation.

We could do a lot worse than Rishi Sunak followed by Keir Starmer.
Relative to most plausible counterfactuals, we seem to have gotten very lucky here.

Comment by Joe_Collman on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-28T08:20:46.490Z · LW · GW

Thanks for clarifying your views. I think it's important.
 

...build consensus around conditional pauses...

My issue with this is that it's empty unless the conditions commit labs to taking actions they otherwise wouldn't. Anthropic's RSP isn't terrible, but I think a reasonable summary is "Anthropic will plan ahead a bit, take the precautions they think make sense, and pause when they think it's a good idea".

It's a commitment to take some actions that aren't pausing - defining ASL4 measures, implementing ASL3 measures that they know are possible. That's nice as far as it goes. However, there's nothing yet in there that commits them to pause when they don't think it's a good idea.

They could have included such conditions, even if they weren't concrete, and wouldn't come in to play until ASL4 (e.g. requiring that particular specifications or evals be approved by an external board before they could move forward). That would have signaled something. They chose not to.

That might be perfectly reasonable, given that it's unilateral. But if (even) Anthropic aren't going to commit to anything with a realistic chance of requiring a lengthy pause, that doesn't say much for RSPs as conditional pause mechanisms.

The transparency probably does help to a degree. I can imagine situations where greater clarity in labs' future actions might help a little with coordination, even if they're only doing what they'd do without the commitment.

Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

This seems a reasonable criticism only if it's a question of [improvement with downside] vs [status-quo]. I don't think the RSP critics around here are suggesting that we throw out RSPs in favor of the status-quo, but that we do something different.

It may be important to solve x, but also that it's not prematurely believed we've solved x. This applies to technical alignment, and to alignment regulation.

Things being "confused for sufficient progress" isn't a small problem: this is precisely what makes misalignment an x-risk.

Initially, communication around RSPs was doing a bad job of making their insufficiency clear.
Evan's, Paul's and your posts are welcome clarifications - but such clarifications should be in the RSPs too (not as vague, easy-enough-to-miss caveats).

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-27T20:20:17.368Z · LW · GW

That's reasonable, but most of my worry comes back to:

  1. If the team of experts is sufficiently cautious, then it's a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths" doesn't seem to matter a whole lot)
    1. I note that 8 billion deaths seems much more likely than 100 million, so the expectation of "1% chance of over 100 million deaths" is much more than 1 million (the arithmetic is spelled out just after this list).
  2. If the team of experts is not sufficiently cautious, and comes up with "1% chance of OpenAI's next model causing over 100 million deaths" given [not-great methodology x], my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.
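
Spelling out the arithmetic behind the note in 1.1 (my own illustrative numbers - 8 billion is just the roughly-everyone case):

$$0.01 \times 10^{8} = 10^{6} \qquad\text{vs}\qquad 0.01 \times (8 \times 10^{9}) = 8 \times 10^{7}$$

i.e. taking "over 100 million" literally gives a floor of about 1 million expected deaths, while the same 1% applied to an 8-billion-death outcome gives about 80 million.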

In part, I'm worried that the argument for (1) is too simple - so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.

I'd prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.

The only case I can see against this is [there's a version of using AI assistants for alignment work that reduces overall risk]. Here I'd like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it's more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).

However, I don't think Eliezer's critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he's likely to be essentially correct - that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can't do this, we'll be accelerating a vehicle that can't navigate.

[EDIT: oh and of course there's the [if we really suck at navigation, then it's not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there's a decent case that improving our ability to navigate might be something that it's hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]

But this seems to be the only reasonable crux. This aside, we don't need complex analyses.

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T20:50:09.281Z · LW · GW

it relies on evals that we do not have

I agree that this is a problem, but it strikes me that we wouldn't necessarily need a concrete eval - i.e. we wouldn't need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].

We could have [here is a precise description of what we mean by "understanding a model", such that we could, in principle, create an evaluation process that answers this question].

We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it's not obvious to me that defining the right question isn't already most of the work)

Personally, I'd prefer that this were done already - i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved - e.g. deferring to an external board. [Anthropic's Long Term Benefit Trust doesn't seem great for this, since it's essentially just Paul who'd have relevant expertise (?? I'm not sure about this - it's just unclear that any of the others would)]

I do think it's reasonable for labs to say that they wouldn't do this kind of thing unilaterally - but I would want them to push for a more comprehensive setup when it comes to policy.

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T14:11:19.836Z · LW · GW

Oh I didn't mean only to do it afterwards. I think before is definitely required to know the experiment is worth doing with a given setup/people. Afterwards is nice-to-have for Science. (even a few blitz games is better than nothing)

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T14:06:13.490Z · LW · GW

Oh that's cool - nice that someone's run the numbers on this.
I'm actually surprised quite how close-to-50% both backgammon and poker are.

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T13:55:43.872Z · LW · GW

tl;dr:
Dario's statement seems likely to reduce overconfidence.
Risk-management-style policy seems likely to increase it.
Overconfidence gets us killed.


I think Dario's public estimate of 10-25% is useful in large part because:

  1. It makes it more likely that the risks are taken seriously.
  2. It's clearly very rough and unprincipled.

Conditional on regulators adopting a serious risk-management-style approach, I expect that we've already achieved (1).

The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators, I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
Further, I think that for AI risk that's not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it'll tend to lead to overconfidence.

I don't think I'm particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses.

That's not what I expect would happen (if it's part of an official regulatory system, that is).
Two cases that spring to mind are:

  1. The people involved are sufficiently cautious, and produce estimates/recommendations to the effect that we obviously need to stop. (e.g. this might be because the AI people are MIRI-level cautious, and/or the forecasters correctly assess that there's no reason to believe they can make accurate AI x-risk predictions)
  2. The people involved aren't sufficiently cautious, and publish their estimates in a form you'd expect of Very Serious People, in a Very Serious Organization - with many numbers, charts and trends, and no "We basically have no idea what we're doing - these are wild guesses!" warning in huge red letters at the top of every page.

The first makes this kind of approach unnecessary - better to get the cautious people to make the case that we have no way to make these assessments that isn't a wild guess.

The second seems likely to lead to overconfidence. If there's an officially sanctioned team of "experts" making "expert" assessments for an international(?) regulator, I don't expect this to be treated like the guess that it is in practice.

Comment by Joe_Collman on Thoughts on responsible scaling policies and regulation · 2023-10-26T13:27:53.855Z · LW · GW

The parallel to the nuclear case doesn't work:
Successfully building nuclear weapons is to China's advantage.
Successfully building a dangerously misaligned AI is not. (not in national, party, nor personal interest)

The clear path to regulation working with China is to get them to realize the scale of the risk - and that the risk applies even if only they continue rushing forward.

It's not an easy path, but it's not obvious that convincing China that going forward is foolish is any harder than convincing the US, UK....

Conditional on international buy-in on the risk, the game theory looks very different from the nuclear case.
(granted, it's also worse in some ways, since the upsides of [defecting-and-getting-lucky] are much higher) 

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T13:04:23.790Z · LW · GW

As a general point, I agree that your suggestion is likely to seem better than RSPs.
I'm claiming that this is a bad thing.

To the extent that an approach is inadequate, it's hugely preferable for it to be clearly inadequate.
Having respectable-looking numbers is not helpful.
Having a respectable-looking chain of mostly-correct predictions is not helpful where we have little reason to expect the process used to generate them will work for the x-risk case.

But I think that the AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all.

The fact that you think that this is a solid baseline (and that others may agree), is much of the problem.

What we'd need would be:
[people who deeply understand AI x-risk] x [forecasters well-calibrated on AI x-risk] x [risk management experts capable of adapting to this context]

We don't have the first, and have no basis to expect the second (the third should be doable, conditional on having the others).

I do expect the first few shots of risk estimate to be overconfident

Ok, so now assume that [AI non-existential risk] and [AI existential risk] are very different categories in terms of what's necessary in order to understand/predict them (e.g. via the ability to use plans/concepts that no human can understand, and/or to exert power in unexpected ways that don't trigger red flags).

We'd then get: "I expect the first few shots of AI x-risk estimate to be overconfident, and that after many failures the field would be red-pilled, but for the inconvenient detail that they'll all be dead".

Feedback loops on non-existential incidents will allow useful updates on a lower bound for x-risk estimates.
A lower bound is no good.

...not only do that but also deterministic safety analysis and scenario based risk analysis...

Doing either of these effectively for powerful systems is downstream of understanding we lack.
This again gives us a lower bound at best - we can rule out all the concrete failure modes we think of.

However, saying "we're doing deterministic safety analysis and scenario based risk analysis" seems highly likely to lead to overconfidence, because it'll seem to people like the kind of thing that should work.

However, all three aspects fail for the same reason: we don't have the necessary understanding.

I think that one core feature you might miss here is that uncertainty should be reflected in quantified estimates if we get forecasters into it

This requires [forecasters well-calibrated on AI x-risk]. That's not something we have. Nor is it something we can have. (we can believe we have it if we think calibration on other things necessarily transfers to calibration on AI x-risk - but this would be foolish)

The best case is that forecasters say precisely that: that there's no basis to think they can do this.
I imagine that some will say this - but I'd rather not rely on that.
It's not obvious all will realize they can't do it, nor is it obvious that regulators won't just keep asking people until they find those who claim they can do it.

Better not to create the unrealistic expectation that this is a thing that can be done. (absent deep understanding)

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T05:45:08.879Z · LW · GW

I'm not sure about poker, but I think for backgammon it'd be harder to get three levels where C beats B beats A reliably. I'm not a backgammon expert, but I could win games against experts - it's enough to be competent and lucky. A may also learn too fast - becoming competent is much faster for backgammon than for chess. (needing a larger sample size due to randomness makes A learning more of a problem - this may apply with poker too??)

I have a lot more experience and skill at chess, but it's still pretty simple to find players who'll beat me 90% of the time.

Comment by Joe_Collman on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T05:31:55.913Z · LW · GW

Thanks for your effort in writing this.
I'm very glad people are taking this seriously and exploring various approaches.
However, I happen to think your policy suggestion would be a big mistake.

On your recommendations:
1) Entirely agree. The name sucks, for the reasons you state.
2) Agreed. Much more explicit clarity on this would be great.
3) No.

I'll elaborate on (3):

“measure the risks, deal with them, and make the residual level of risks and the methodology public”.

I'll agree that it would be nice if we knew how to do this, but we do not.
With our current level of understanding, we fall at the first hurdle (we can only measure some of the risks).

“Inability to show that risks are below acceptable levels is a failure. Hence, the less we understand a system, the harder it is to claim safety.”

This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don't understand. We cannot demonstrate risks are below acceptable levels.

Assemble a representative group of risk management experts, AI risk experts...

The issue here is that AI risk "experts" in the relevant sense do not exist.
We have "experts" (those who understand more than almost anyone else).
We have no experts (those who understand well).

For a standard risk management approach, we'd need people who understand well.
Given our current levels of understanding, all a team of "experts" could do would be to figure out a lower bound on risk. I.e. "here are all the ways we understand that the system could go wrong, making the risk at least ...".

We don't know how to estimate an upper bound in any way that doesn't amount to a wild guess.


Why is pushing for risk quantification in policy a bad idea?

Because, logically, it should amount to an immediate stop on all development.
However, since "We should stop immediately because we don't understand" can be said in under ten words, if a much lengthier risk-management approach is proposed instead, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.

Quantified risk estimates that are wrong are much worse than underdefined statements.
I'd note here that I do not expect [ability to calculate risk for low-stakes failures] to translate into [ability to calculate risk for catastrophic failures] - many are likely to be different in kind.
Quantified risk estimates that are correct for low-stakes failures, but not for catastrophic failures are worse still.

One of the things Anthropic's RSP does right is not to quantify things that we don't have the understanding to quantify.

There is no principled way to quantify catastrophic risk without a huge increase in understanding.
The dangerous corollary is that there's no principled way to expose dubious quantifications as dubious (unless they exhibit obvious failures we concretely understand).

Once a lab has accounted for all the risks that are well understood, they'd be able to say "we think the remaining risk is very low because [many soothing words and much hand-waving]", and there'll be no solid basis to critique this - because we lack the understanding.

I think locking in the idea that AI risk can be quantified in a principled way would be a huge error.
If we think that standard risk-management approaches are best, then the thing to propose would be:

1) Stop now.
2) Gain sufficient understanding to quantify risk. (this may take decades)
3) Apply risk management techniques.

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T04:16:55.190Z · LW · GW

That said, I don't expect the setup to be particularly sensitive to the control games or time controls.

If you have something like:
A: novice
B: ~1700
C: ~2200
Then A is going to robustly lose to B and B to C.
An extra couple of minutes either way isn't going to matter. (thinking for longer might get you 100 Elo, but nowhere close to 500)
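(For intuition, a minimal sketch using the standard Elo expected-score formula - the 100/500 gaps are just the illustrative numbers above, and expected score counts draws as half-points rather than being a pure win probability:)

```python
# Standard Elo expected-score formula: E = 1 / (1 + 10 ** (-diff / 400)).
# Note: this is expected score (draws count as half), not strictly win probability.

def expected_score(rating_diff: float) -> float:
    """Expected score for the higher-rated player, given their rating advantage."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

print(f"100 Elo advantage: {expected_score(100):.2f}")  # ~0.64
print(f"500 Elo advantage: {expected_score(500):.2f}")  # ~0.95
```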

If this reliably holds - e.g. B beats A 9-0 in blitz games, and the same for C vs B, then it doesn't seem worth the time to do more careful controls. (or at least the primary reason to do more careful controls at that point would be a worry that the results wouldn't otherwise be taken seriously by some because you weren't doing Proper Science)

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T04:03:14.426Z · LW · GW

I think it'd make sense to give C at least as long as B. B doesn't need to do any explaining.
I think giving A significantly longer than B is fine, so long as the players have enough time to stick around for that.
I think it's a more interesting experiment if A has ample time to figure things out to the best of their ability. A failing because they weren't able to understand quickly seems less interesting.

The best way to handle this seems to be to play A-vs-B and B-vs-C control games with something as close to the final setup as possible.

So e.g. you could have B-vs-C games to check that C really is significantly better, but require C to write explanations for their move and why they didn't make a couple of other moves. Essentially C imagines they're playing the final setup, except their move is always picked.
And you can do A-vs-B games where A has a significant advantage in time over B (though I think blitz games are still the most efficient way to gain information in the given time).

This way it doesn't matter much whether the setup is 'fair' to A/B/C, so long as it's unfair to a similar level in the control 1-v-1 games as in the advisor-based games.

Comment by Joe_Collman on Lying to chess players for alignment · 2023-10-26T03:49:53.784Z · LW · GW

I'm happy to be B if it'd be useful - mainly because I expect that to require the least time, and I do play chess to relax anyway. Pretty flexible on times/days. I don't think I'd have time for A/C. (unless the whole thing is quite quick - I'd be ok spending an afternoon or two, so long as it's not in the next two weeks; currently very busy)

I've not been rated recently. IIRC I was about 1900 in blitz on chess.com when playing for fun.
I'd guess that I could be ~1900 on longer controls if I spent quite a bit of effort on the games.
I'd prefer to participate with more of a ~1700 expectation, since I can do that quickly.

So long as I'm B, I'm fine with multi-week or multi-month 1-move-per-day games - but clearly the limiting factor is that this is much more demanding on A and C.

Some thoughts on the setup:
It'd make sense to have at least a few fastish games between B and C, so that it's pretty clear there is the expected skill disparity. Blitz games are likely to be the most efficient here - I'd suggest an increment of at least 5 seconds per move, to avoid the incentive to win on time. But ~3 minutes on the clock may be enough. (9 games of ~10 minutes each will tell you a lot more than 1 game of ~90 minutes)

Similarly between A and B.
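(As a rough illustration of why a few blitz games are so informative, here's a quick binomial check under a purely assumed 90% per-game win rate for the stronger player:)

```python
# Rough binomial check: if the stronger player really wins ~90% of games,
# a short blitz match makes the skill gap obvious, while a single long game
# leaves a 10% chance the weaker player "looks" at least as good.
from math import comb

p = 0.9   # assumed per-game win probability for the stronger player (illustrative)
n = 9     # number of blitz games

p_at_least_8 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in (8, 9))
print(f"P(stronger player scores 8+ out of 9): {p_at_least_8:.2f}")  # ~0.77
print(f"P(weaker player wins a single game):   {1 - p:.2f}")          # 0.10
```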

This should ideally be done at the end of the experiment too, in particular to guard against A being a very fast learner.
B improving a lot seems less likely (though possible, if they started out rusty).
I don't think Cs improving should be an issue.
But it's plausible that both the A-B and B-C gaps shrink during the experiment.

A control that's probably useful is to have A play some games against B with entirely honest advisors.
The point here being that it can impose some penalty to have three suggestions rather than one - e.g. if the advisors know different opening lines, A might pick an inconsistent combination: advisor 1 makes a suggestion that goes down a path advisor 2 doesn't know well; A picks advisor 1's move, then advisor 2's follow-up, resulting in an incoherent strategy.
I don't expect this would have a large effect, but it seems sensible to do if there's time. (if time's a big constraint, it might not be worth it)

It's worth considering what norms make sense for the C role.
For instance, if C is giving explanations, does that extend to giving complex arguments against other plausible moves? Is C aiming to play fully to win given the constraints, or is there an in-the-spirit-of-things norm?

E.g. if C had a character limit on the advice they could give, the most efficient approach might be to give various lines in chess notation, without any explanation. Is this desirable?

Would it make sense to limit the move depth that C can talk about in concrete terms? E.g. to say that you can give a concrete line up to 6 plies, but beyond that point you can only talk in generalities (more space; pressure on dark squares; more active pieces; will win material...).

I expect that prototyping this will make sense - come up with something vaguely plausible, then just try it and adjust.

I'd be interested to give feedback on the setup you're planning, if that'd be useful.

Comment by Joe_Collman on Thoughts on responsible scaling policies and regulation · 2023-10-26T02:02:49.589Z · LW · GW

Hmm. Perhaps the thing I'd endorse is more [include this in every detailed statement about policy/regulation], rather than [shout it from the rooftops].

So, for example, if the authors agree with the statement, I think this should be in:

  • ARC Evals' RSP post.
  • Every RSP.
  • Proposals for regulation.
  • ...

I'm fine if we don't start printing it on bumper stickers.

The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whether it's achievable in the right form.

If ARC Evals aren't saying this, RSPs don't include it, and many policy proposals don't include it..., then I don't expect this to become common knowledge.
We're much less likely to get a stop if most people with influence don't even realize it's the thing that we'd ideally get.

Comment by Joe_Collman on Thoughts on responsible scaling policies and regulation · 2023-10-26T01:36:40.581Z · LW · GW

I think forcing people to publicly endorse policies

Saying "If this happened, it would solve the problem" is not to endorse a policy. (though perhaps literally shouting from the rooftops might be less than ideal)

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

It's reasonable to say that this might not be the most effective communication strategy. (though I think on balance I'd disagree)
It's not reasonable to say that this amounts to publicly endorsing a policy.

...if we ended capitalism, it would solve climate change. Though true...

This seems an unhelpful parallel, first because it's not clearly true. (In particular "ended capitalism" isn't "ended capitalism, and replaced it with communism", nor "ended capitalism overnight without a plan to replace it").
Second, because the proposal in this case is to not actively enact a radically disruptive change to society.

The logic of the point you're making is reasonable, but the parallel has a bunch of baggage that reduces overall clarity IMO.

...because there are much more effective ways of addressing climate change than starting a communist revolution...

This part isn't even a parallel: even if successful, the communist revolution wouldn't be the most effective approach; whereas a sufficiently good pause, if successful, would be.

Comment by Joe_Collman on Thoughts on responsible scaling policies and regulation · 2023-10-25T05:42:58.297Z · LW · GW

The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1]
Our confidence levels should reflect that - but I expect overconfidence.

It's plausible that RSPs could be net positive, but I think that, given successful coordination, [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things will go badly.

RSPs seem likely to increase the odds we get some international coordination and regulation. But to get sufficient regulation, we'd need the unknown unknowns issue to be covered at some point. To me this seems simplest to add clearly and explicitly from the beginning. Otherwise I expect regulation to adapt to issues for which we have concrete new evidence, and to fail to adapt beyond that.

Granted that you're not the median voter/decision-maker - but you're certainly one of the most, if not the most, influential voice on the issue. It seems important not to underestimate your capacity to change people's views before figuring out a compromise to aim for (I'm primarily thinking of government people, who seem more likely to have views that might change radically based on a few conversations). But I'm certainly no expert on this kind of thing.

  1. ^ I do wonder whether it might be helpful not to share all known problems publicly on this basis - I'd have somewhat more confidence in safety measures that succeeded in solving some problems of a type the designers didn't know about.

Comment by Joe_Collman on Thoughts on responsible scaling policies and regulation · 2023-10-25T03:47:57.842Z · LW · GW

Thanks for writing this.

I'd be interested in your view on the comments made on Evan's RSP post w.r.t unknown unknowns. I think aysja put it best in this comment. It seems important to move the burden of proof.

Would you consider "an unknown unknown causes a catastrophe" to be a "concrete way in which they fail to manage risk"? Concrete or not, this seems sufficient grounds to stop, unless there's a clear argument that a bit more scaling actually helps for safety. (I'd be interested in your take on that - e.g. on what speed boost you might expect with your own research, given AI assistants of some level)

By default, I don't expect the "affirmative case for safety that will require novel science" to be sufficient if it ends up looking like "We developed state of the art tools that address all known problems, and we don't expect others".

On the name, it's not 'responsible' that bothers me, but rather 'scaling'.
"Responsible x-ing policy" gives the strong impression that x-ing can be done responsibly, and that x-ing will continue. I'd prefer e.g. "Responsible training and deployment policy". That way scaling isn't a baked in presumption, and we're naming things that we know can be done responsibly.

Comment by Joe_Collman on Lying is Cowardice, not Strategy · 2023-10-24T19:34:53.784Z · LW · GW

This is not what most people mean by "for personal gain". (I'm not disputing that Alice gets personal gain)

Insofar as the influence is required for altruistic ends, aiming for it doesn't imply aiming for personal gain.
Insofar as the influence is not required for altruistic ends, we have no basis to believe Alice was aiming for it.

"You're just doing that for personal gain!" is not generally taken to mean that you may be genuinely doing your best to create a better world for everyone, as you see it, in a way that many would broadly endorse.

In this context, an appropriate standard is the post's own:
Does this "predictably lead people to believe false things"?
Yes, it does. (if they believe it)

"Lying for personal gain" is a predictably misleading description, unless much stronger claims are being made about motivation (and I don't think there's sufficient evidence to back those up).

The "lying" part I can mostly go along with. (though based on a contextual 'duty' to speak out when it's unusually important; and I think I'd still want to label the two situations differently: [not speaking out] and [explicitly lying] may both be undesirable, but they're not the same thing)
(I don't really think in terms of duties, but it's a reasonable shorthand here)

Comment by Joe_Collman on Lying is Cowardice, not Strategy · 2023-10-24T19:10:55.449Z · LW · GW

Ah okay - thanks. That's clarifying.

Agreed that the post is at the very least not clear.
In particular, it's obviously not true that [if we don't stop today, there's more than a 10% chance we all die], and I don't think [if we never stop, under any circumstances...] is a case many people would be considering at all.

It'd make sense to be much clearer on the 'this' that "many people believe".

(and I hope you're correct on P(doom)!)