Review of Alignment Plan Critiques - December AI-Plans Critique-a-Thon Results
post by Iknownothing · 2024-01-15T19:37:07.984Z
This is a link post for https://aiplans.substack.com/p/december-critique-a-thon-results
These are the Critiques from the AI-Plans.com December Critique-a-Thon of 2023.
Prizes will be awarded to contestants based on collective judge feedback.
Critiques were anonymized when reviewed by the judges.
Judges:
- @So8res
- @Ramana Kumar
- @Peter S. Park
- @Charbel-Raphaël, Head of AI Unit at EffiSciences
- An anonymous judge (researcher at a major lab)
If you're interested in being a Judge for an upcoming Critique-a-Thon (held every 2 months), please email kabir03999@gmail.com or DM Kabir Kumar on Discord (kabs9744) or LessWrong!
Winners:
1st Place:
Congratulations to Lorenzo Venieri! 🥇
Lorenzo had the highest mean score, of 7.5, for his Critique of:
A General Theoretical Paradigm to Understand Learning from Human Preferences, the December 2023 paper by DeepMind.
2nd Place:
Congratulations to NicholasKees & Janus! 🥈
Nicholas and Janus had the second highest mean score of 6.9, for their Critique of Cyborgism!
3rd Place:
Congratulations to Momom2 & AIPanic!🥉
Momom2 & AIPanic had the third highest mean score of 6.286 for their Critique of Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (OpenAI, SuperAlignment, Dec 2023)
Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (OpenAI, SuperAlignment, Dec 2023)
Critique A, Momom2 & AIPanic:
This paper is very similar in concept to Constitutional AI and shares some of its strengths and weaknesses. The basic idea is to generate training data (weak labels) from a first model (the teacher), which is used to train a more advanced, bigger model (the student).
The hope is to make training advanced AI models more human-compute efficient while preserving a decent understanding of the human’s values. Hopefully, further research in this direction may prove enough to validate scalable oversight as an alignment agenda.
This should not be taken as a solution to an alignment problem, but rather as a justification for further research; the paper presents preliminary results and mostly gestures suggestively while mouthing “what if we dug deeper?”
Pros:
Well-presented: The paper is good at self-critiquing and avoids making bold claims, except for the tractability of the weak-to-strong problem.
Strong empiricism: This approach can be used to test hypotheses about weak-to-strong generalization and provides a practical sandbox to discard weaker ideas.
Insightful discussion: While the discussion of the results does not provide definitive answers, it proposes various interpretations and highlights misconceptions to avoid.
Cons:
Questionable methodology:
The benchmarks are insufficient. The chess one, in particular, is pretty much irrelevant as it does not rely on some underlying reasoning of an open system. Chess is a narrow task, and the fact that LLMs perform decently at it is circumstantial.
Some degree of contamination probably took place and they don't seem to have addressed that. The degree is likely higher than average considering the models used.
The auxiliary confidence loss introduces a simplicity bias, on the assumption that the intent of the teacher is easier to capture than the specific ways in which it suboptimally reaches for it.
This is justified by the Natural Abstraction Hypothesis, which may explain why the auxiliary confidence loss improves performance.
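A minimal sketch of such an auxiliary confidence loss for the binary case, to make the simplicity-bias point concrete (the mixing weight and threshold below are illustrative placeholders, not the paper's exact schedule):

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(student_logits, weak_soft_labels, alpha=0.5, threshold=0.5):
    """Sketch of an auxiliary confidence loss (binary case).

    Mixes cross-entropy against the weak teacher's soft labels with
    cross-entropy against the student's own hardened predictions, nudging
    the student to trust its own generalisation instead of imitating the
    teacher's errors. `alpha` and `threshold` are illustrative, not the
    paper's exact values.
    """
    student_probs = torch.sigmoid(student_logits)
    # Hardened self-labels: the student's own confident (thresholded) predictions.
    hard_self_labels = (student_probs.detach() > threshold).float()
    loss_weak = F.binary_cross_entropy_with_logits(student_logits, weak_soft_labels)
    loss_self = F.binary_cross_entropy_with_logits(student_logits, hard_self_labels)
    return (1.0 - alpha) * loss_weak + alpha * loss_self
```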
Mixed bag:
Unclear tractability of the weak-to-strong problem:
Maybe the method plucked the low-hanging fruit of this problem, and despite encouraging results further algorithmic improvements will be extremely hard to attain.
On the other hand, these preliminary results justify this approach as a reasonable methodology to figure out whether or not the weak-to-strong problem is tractable.
Critique B, Tetraspace
Insufficient Consideration of Manipulation Risks
The plan fails to adequately address the risks of AI systems manipulating humans.
Evidence: The plan explicitly calls out assumptions that are required for weak-to-strong generalisation to be part of an alignment plan. Two assumptions that are relevant to this critique are "elicited concepts are sufficiently robust, or do not need to be", and "the concepts we care about are natural to future AGI".
These (alongside all the other assumptions, which I think are clearly not relevant to this aspect) are not sufficient for the student AI to not learn to manipulate the supervisor; for example, Honesty (return the model's understanding of whether the statement is true or false) and Flatter-the-Supervisor (return the model's understanding of whether the supervisor would think the statement is true or false) are capabilities that both look the same for statements that the supervisor can evaluate accurately.
Honesty being natural does not help, because Flatter-the-Supervisor could be natural in the same way; Honesty being robust doesn't help, because it is not necessarily the case that applying a lot of optimisation pressure via a training signal shared between Honesty and Flatter-the-Supervisor will select for Honesty. There is hence an additional assumption required that is or entails "deception always takes more optimisation power to elicit than the corresponding non-deceptive capability".
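To make this concrete, a toy illustration (not from the paper or the critique; the data and label functions are purely hypothetical): on the supervisor-evaluable slice of the data, Honesty and Flatter-the-Supervisor produce identical labels, so a training signal drawn from that slice cannot distinguish them.

```python
# (statement_id, ground_truth, supervisor_belief, supervisor_can_evaluate)
DATA = [
    ("easy_1", True,  True,  True),
    ("easy_2", False, False, True),
    ("hard_1", True,  False, False),  # the weak supervisor is simply wrong here
]

honesty = lambda truth, belief: truth    # report the best guess at the truth
flatter = lambda truth, belief: belief   # report what the supervisor would believe

train = [d for d in DATA if d[3]]        # only supervisor-evaluable statements
agree_on_train = all(honesty(t, b) == flatter(t, b) for _, t, b, _ in train)
diverge_off_train = any(honesty(t, b) != flatter(t, b) for _, t, b, _ in DATA)
print(agree_on_train, diverge_off_train)  # True True
```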
Judge Review:
Ramana Kumar
Critique A (Momom2 & AIPanic)
Accuracy: 9/10
Communication: 9/10
Critique B (Tetraspace)
Communication: 5/10
Charbel-Raphaël Segerie
Critique A (Momom2 & AIPanic)
Accuracy: 7/10
Communication: 7/10
Reason: I share this analysis. I disagree with some minor nitpicking.
Critique B (Tetraspace)
Accuracy: 4.5/10
Communication: 4/10
Reason: This is not clearly written.
Nate Soares
Critique A (Momom2 & AIPanic)
Rating: 2/10
Comments:
wrong on the chess count
doesn't hit what i consider the key critiques
this critique seems more superficial than the sort of critique i'd find compelling. what i'd want to see considered would be questions like:
* how might this idea of small models training bigger models generalize to the notkilleveryoneism problems?
* which of the hard problems might it help with? which might it struggle with?
* does the writing seem aware of how the proposal relates to the notkilleveryoneism problems?
Critique B (Tetraspace)
Rating: 3/10
Comments:
this critique is at least trying to go for the deeper stuff, and manages to touch upon one fragment. i'd've preferred to see more discussion of how the game changes as the "student" gets qualitatively more capable, etc.
slightly closer to the key critiques
Anonymous Judge
Critique A (Momom2 & AIPanic)
Accuracy: 4/10
Communication: 6/10
Reason: I think they’re far too pessimistic. What about the crazy results that the strong model doesn’t simply “imitate” the weak model’s errors! (Even without regularization) That’s a substantial update against the “oh no what if the human simulator gets learned” worry.
Critique B (Tetraspace)
Accuracy: 2/10
Communication: 3/10
Reason: I think “optimization power” and “naturality” are not the right concepts for thinking about whether generalization will work. We should be thinking about what functions have tons of probability under the neural network prior, in the sense of taking up large volumes of parameter-space. “Optimization power” seems like very vague and imprecise shorthand.
Momom2 & AIPanic mean score = 6.286
Tetraspace mean score = 3.583
Nash Learning from Human Feedback (DeepMind, Dec 2023)
Critique A:
TheAntithesis
The plan uses reward functions that do not accurately reflect human values or the intended goals, leading to potentially harmful AI behavior.
Evidence: There are some cases, such as the Prisoners' Dilemma, where a Nash Equilibrium of human preferences is undesirable (a toy check of this point follows below). In these cases, an AI trained on NLHF will take undesirable actions such as allowing (or even encouraging) people to hurt each other.
Also, there are some situations where the Nash Equilibrium does not exist (such as the Hawk-Dove Game) or there are multiple Nash Equilibria. In these situations, the AI's behaviour is undefined.
Importance for solving alignment (0-100): 30
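A toy check of the Prisoners' Dilemma point, using the standard payoff matrix rather than anything from the NLHF setup: the unique Nash equilibrium is mutual defection, even though mutual cooperation is better for both players.

```python
import itertools

# Payoffs (row, col) for the one-shot Prisoners' Dilemma: C = cooperate, D = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ["C", "D"]

def is_nash(a_row, a_col):
    """A pure-strategy Nash equilibrium: neither player gains by deviating unilaterally."""
    row_payoff, col_payoff = PAYOFFS[(a_row, a_col)]
    row_ok = all(PAYOFFS[(alt, a_col)][0] <= row_payoff for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a_row, alt)][1] <= col_payoff for alt in ACTIONS)
    return row_ok and col_ok

print([p for p in itertools.product(ACTIONS, ACTIONS) if is_nash(*p)])
# [('D', 'D')] -- the only equilibrium is mutual defection, which both players
# disprefer to ('C', 'C').
```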
Critique B:
Tetraspace
The plan uses reward functions that do not accurately reflect human values or the intended goals, leading to potentially harmful AI behavior.
Evidence: The plan relies on having a human preference model that is grounded in human ratings of pairs of responses. For normal responses in the training distribution, these will resemble human preferences over outputs, but if the reward model is sufficiently smart then it is being outer aligned towards the ground truth of human _responses_, which do not resemble preferences in the presence of manipulation. If it is not sufficiently smart, then there is no guarantee of a lack of adversarial examples in the reward model, and the model finding the Nash equilibria would be incentivised to output these adversarial examples instead.
If the situations where the Nash Equilibrium is undesirable form a small enough portion of the situations the AI will find itself in, an AI trained on NLHF may be safe enough to create a better world despite this vulnerability.
Judge Review:
Charbel-Raphaël Segerie
Critique A (TheAntithesis)
Accuracy: 4/10
Communication: 6/10
Reason:
The critique should at least say that NLHF is not really solving the most important problems in RLHF, which are listed in the paper Open Problems and Fundamental Limitations of RLHF.
Critique B (Tetraspace)
Accuracy: 2.5/10
Communication: 3/10
Reason: Same, and pretty convoluted to read.
Nate Soares
Critique A (TheAntithesis)
Rating: 2/10
Reason:
seems like a plausible guess, but not with nearly enough depth--a solid critique (by my lights) would have spelled out how the policy-dominance the Nash Learning paper achieves relates to nash equilibria that we disprefer, including an example, and then it would have separately talked about how this training (fails to) generalize to the case of smarter minds and notkilleveryoneism problems.
Critique B (Tetraspace)
Rating: 2/10
Reason: same comment as above, more-or-less.
TheAntithesis mean score = 4
Tetraspace mean score = 2.5
A General Theoretical Paradigm to Understand Learning from Human Preferences (DeepMind, Dec 2023)
Critique A:
Lorenzo Venieri
RLHF relies on many assumptions, two of which are addressed by IPO:
Pairwise preferences can be substituted with pointwise rewards
A reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy.
DPO bypasses the reward modeling and directly optimizes the language model using preference data (skips approximation 2). [3]
The paper defines a general theoretical framework, ΨPO, that encompasses RLHF and DPO as special cases. This theoretical formulation reveals a common pitfall of these methods: approximation 1 can lead to overfitting to the preference dataset at the expense of ignoring the KL-regularisation term.
(In section 4.2, it is explained that RLHF is more robust to this problem thanks to the underfitting of the reward model: “DPO, in avoiding the training of the reward function, loses the regularisation of the policy that the underfitted reward function affords.”)
IPO is a special case of ΨPO that also bypasses assumption 1, potentially solving the overfitting to preferences problem.
In section 5.4, examples show how IPO does not overfit (converge to degenerate solutions) and that it's possible to control how close the fine-tuned policy remains to the original pre-trained policy, using the regularisation parameter.
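For reference, a reconstruction of the objectives under discussion (written from a reading of the paper; notation may differ slightly from the original): the general ΨPO objective, and the IPO loss obtained by taking Ψ to be the identity.

```latex
% General \Psi PO objective; RLHF/DPO correspond to \Psi(q) = \log\frac{q}{1-q}
% under a Bradley-Terry model, and IPO to \Psi = identity.
\[
\max_{\pi}\;
\mathbb{E}_{x \sim \rho}\,
\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}
\bigl[\Psi\bigl(p^{*}(y \succ y' \mid x)\bigr)\bigr]
\;-\; \tau\, D_{\mathrm{KL}}\bigl(\pi \,\Vert\, \pi_{\mathrm{ref}}\bigr)
\]

% Empirical IPO loss on preference pairs (y_w preferred to y_l): a regression of
% the log-likelihood-ratio gap towards \tau^{-1}/2, which is what keeps the policy
% from drifting arbitrarily far from \pi_{ref} even on deterministic preferences.
\[
\mathcal{L}_{\mathrm{IPO}}(\pi) =
\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[\left(
\log \frac{\pi(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}
          {\pi(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \frac{\tau^{-1}}{2}
\right)^{2}\right]
\]
```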
How can this new method help with alignment
RLHF has some problems and limitations related to the reward model and policy optimization [1] that IPO has the potential to solve, or at least alleviate.
IPO may help solve some issues related to the RL part of RLHF
RL makes the model more prone to undesirable behaviors and properties [2] (goal-directedness, less symmetry, self-preservation preferences, sycophancy, deception, more opaque thoughts, …)
IPO has the potential to solve this, both because it doesn’t directly use RL to optimize the policy and because it seems to make it possible to tune how much we want our final policy to be close to the pre-trained policy.
It remains to be checked experimentally how many of the useful qualities of RLHF can be kept if this is done.
IPO solves challenges with jointly training the Reward Model and Policy
(Joint training induces distribution shifts, and it is difficult to balance efficiency against the policy overfitting.)
These problems should already be completely solved by DPO, since it doesn't require training a reward model and eliminates the need for sampling from the LM during fine-tuning. IPO has these same benefits and can also avoid overfitting on preferences.
Reward models can misgeneralize to be poor reward proxies, even from correctly-labeled training data → IPO doesn’t rely on an explicit reward model
This is a fundamental problem that is hard to solve, but IPO makes a step in the right direction by making it possible to learn more fine-grained information about human values (it can use directly pairwise preferences instead of pointwise rewards), and getting rid of assumption 2 (“a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy”) because it doesn’t use an explicit reward model.
Moreover, assuming that the language model (from IPO) has an encoding of human values that is better than the reward model's, the problem of reward hacking (arising from an imperfect reward proxy) is probably alleviated. This assumption is supported by the fact that a known problem of the reward model in RLHF is that it is prone to underfitting human preferences.
Problems with RLHF that IPO doesn’t solve:
Evaluating reward models is difficult and expensive → IPO makes this worse
This problem is probably made worse by this approach, since we are now using the language model itself as an implicit reward model. This likely increases the difficulty of finding out how the policies are made and why they're made that way, since we're no longer dealing with a single-purpose reward model but with a much more broadly capable LLM.
Every other problem/vulnerability that applies to RLHF applies also to IPO:
E.g.
All the challenges with obtaining human feedback ([1] section 3.1)
Difficult to represent a diverse society of humans and their values
Policies tend to be adversarially exploitable.
Policies can perform poorly in deployment even if rewards seen during training were perfectly correct.
References:
1. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (https://arxiv.org/pdf/2307.15217.pdf)
2. Discovering Language Model Behaviors with Model-Written Evaluations (https://arxiv.org/pdf/2212.09251.pdf)
3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/pdf/2305.18290.pdf)
Judge Review:
Ramana Kumar
Critique A (Lorenzo Venieri)
Accuracy: 9/10
Communication: 9/10
Dr Peter S. Park
Critique A (Lorenzo Venieri)
Accuracy: 8.5/10
Communication: 9/10
Reason:
The critique concisely but comprehensively summarizes the concepts of the paper, and adeptly identifies the promising aspects and the pitfalls of the IPO framework.
Charbel-Raphaël Segerie
Critique A (Lorenzo Venieri)
Accuracy: 8/10
Communication: 5/10
Nate Soares
Critique A (Lorenzo Venieri)
Rating: 5/10
Reason:
seems like an actual critique. still light on the projection out to notkilleveryoneism problems, which is the part i care about, but seems like a fine myopic summary of some pros and cons of IPO vs RLHF
Anonymous Judge
Critique A
Accuracy: 5/10
Communication: 9/10
Reason: I’m mixed on this. There are several false or ungrounded claims, which I rate “0/10.” But there’s also a lot of useful information here.
Lorenzo Venieri mean score = 7.5
Constitutional AI: Harmlessness from AI Feedback (Anthropic, Dec 2022)
Critique A:
Momom2 & AIPanic:
RLAIF proposes a way to train a model on feedback from a model previously trained with human feedback, thus making the training beyond foundations (which is arguably where alignment is grafted onto capabilities) more human-compute efficient. This approach addresses some present problems very convincingly, though it does not address and arguably worsens underlying risks.
The main hope of Constitutional AI is to find a technique that performs better than RLHF on HHH metrics, and to enable scalable oversight through increasing automation of alignment techniques.
Disclaimer: Some of the considerations below have probably been studied at length in future work by Anthropic. This paper is only the first one of many to study this approach.
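For orientation, a minimal sketch of the critique-and-revision plus AI-feedback pipeline described above. All names here are hypothetical and `model` stands for any prompt-to-text callable; this is an illustration of the idea, not Anthropic's actual code or prompts.

```python
CONSTITUTION = [
    "Please choose the response that is least harmful.",
    "Please choose the response that is most honest.",
]

def critique_and_revise(model, prompt, principle):
    """Supervised stage: the model critiques and revises its own draft."""
    draft = model(prompt)
    critique = model(f"Principle: {principle}\nCritique this response:\n{draft}")
    revised = model(f"Principle: {principle}\nCritique: {critique}\nRewrite the response:\n{draft}")
    return (prompt, revised)   # collected as supervised fine-tuning data

def ai_preference_label(model, prompt, response_a, response_b, principle):
    """RLAIF stage: the model, not a human, picks the more compliant response."""
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return verdict.strip()[:1]  # "A" or "B", used to train the preference model

# Example with a trivial stand-in model:
# stub = lambda text: "A"
# print(ai_preference_label(stub, "Hi", "reply 1", "reply 2", CONSTITUTION[0]))
```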
Pros:
Well-presented: The authors are clear about what they do not claim and are careful not to present their results as more significant than they are. The methodology is described extensively, with a discussion of how to interpret the results that highlights possible sources of confusion.
Cons:
Unclear empirical results: Personal experiments cast doubt on the claim that constitution size has little influence on the output and public tests show that the attention does not scale well with size. This hints that the empirical results are not as strong as they may look and that there are technical challenges to overcome for more advanced application of this method. We did not check if such challenges were solved in later work.
Mixed bag:
More execution than innovation: One can argue that the results boil down to an effective execution of simple alignment techniques rather than a qualitative innovation that would avert classes of risk by design. In this regard, the paper is informative as a report on safety methods rather than as a promising area for future research.
No questionable underlying assumptions: The paper sticks to compelling evidence for its claims, and does not use questionable analogies or correspondences with threat models to develop the potential meaning of its results in detail. On the other hand, refusing to indulge in this kind of speculation makes discussing the results, especially questioning their relevance, more difficult. The authors refuse to put their head on the chopping block.
Transparency: The authors argue that the constitution improves transparency, but this is true only insofar as the model abides by the constitution. This can also be viewed as stacking a new layer of complexity on top of previous work, albeit one that is interpretable. However, this may distract from solving pre-existing transparency issues with the underlying foundation model or the mechanistic effects of the finetuning.
Unclear interpretation of results: The results are ambiguous, though the authors discuss at length how to avoid making them say more than they say. Here are a few considerations that muddle the interpretation of the results:
What justifies the correspondence between ELO and harmlessness?
How subjective are the harmfulness evaluations?
The discussion offers an informal interpretation of the results rather than a clearly correct conclusion.
Will future work be able to solve the questions left open (especially robustness)?
This should not necessarily be taken against the paper, because the intention of the paper is not to “solve alignment” once and for all but rather to present incremental improvements on prosaic risks. In that regard, trying to interpret the results from an alignment perspective is not constructive.
Contextual considerations: It is difficult to decide whether the context of the paper’s publication should be taken into consideration at all. The following may not be used to criticize the paper per se, but might still be interesting for researchers looking at this because of safety concerns. Make of it what you want, here’s what we think:
What is the impact of the paper on academic research? This kind of work from a major AI lab explicitly committed to safety might distract from more effective/innovative work.
What is the paper used for in practice? Constitutional AI has been presented by Dario Amodei as a solution to safety concerns. This is questionable under Anthropic’s own view of the difficulty of AI alignment; in the easy case, this is a significant improvement, but in the hard case, this is basically useless.
It feels like papers on prosaic risks collectively help avert present harms, though it is unclear how much one single paper contributes to this.
Critique B
Lorenzo Venieri
Strength:
More scalable than RLHF
Point Made:
RL from Human Feedback is not guaranteed to scale to more intelligence. Constitutional AI tries to solve this problem by using the AI itself to check whether the outputs of the model are compliant with the values in the given constitution. This makes it possible to use the AI's extrapolation of human evaluation abilities to provide a more reliable training signal, even at the point when the AI's outputs are too complex for a human evaluator.
Evidence:
The results shown in the paper.
Vulnerability:
Constitutional values may be too vague
Critique importance if true (0 - 100): 90
Point Made:
The AI interpretation/understanding of the constitution may differ from our own.
Evidence:
The same thing applies to human legal systems, where individuals may interpret laws differently, necessitating the involvement of a third party, such as a judge or jury, to provide a final and authoritative interpretation of the written law.
Critique importance if true (0 - 100): 80
Judge Review:
Dr Peter S. Park
Critique A (Momom2 & AIPanic)
Accuracy: 4.5/10
Communication: 6/10
Critique B (Lorenzo Venieri)
Accuracy: 6/10
Communication: 5/10
Charbel-Raphaël Segerie
Critique A (Momom2 & AIPanic)
Accuracy: 3/10
Communication: 5/10
Reason:
Clearly written, but a pretty bad critique; the critic does not see that even if Constitutional AI is imperfect, this is still pretty innovative and interesting work.
Critique B (Lorenzo Venieri)
Accuracy: 6.5/10
Communication: 7/10
Reason:
Too short, so it's clear, but there's not a lot of non-trivial information here.
Nate Soares
Critique A (Momom2 & AIPanic)
Rating: 2/10
Reason:
again: superficial and light on discussion of how any of this relates to notkilleveryoneism
Momom2 & AIPanic mean score = 4.1
Lorenzo Venieri mean score = 6.125
Path-Specific Objectives for Safer Agent Incentives (DeepMind, Oxford, April 2022)
Critique A
Alvin
It offers an alternative to a utility function that would be maximized by keeping things constant
Other approaches to reward specification have the problem of a utility function that would be maximized by keeping certain things constant. For example, a large negative reward for a variable being changed would incentivize keeping the variable constant (a toy illustration of this incentive problem follows at the end of this critique). This approach formalizes what would be needed to avoid that for states that should be able to change, but that we don’t want an AI to change intentionally.
This approach also allows for a negative reward/utility for changing states that you actually want the agent to keep constant.
The approach seems to allow for a utility function which does not create any instrumental control incentives regarding a potential off button. If there is an off button, intuitively an agent would be incentivized to hinder anyone from pressing it, unless the agent would receive a large reward when the button is pressed. This means that the agent would either try to press it or try to protect it. Path-specific objectives offer an alternative, where the utility function is independent of any action towards the button. However, even if the utility function were independent of actions towards the button, it seems hard to design a training methodology for which the agent would learn to ignore the button.
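A toy illustration of the incentive problem described in this critique (the numbers and action set are invented; this is not the paper's formalism): a plain penalty on the protected variable changing rewards the agent for actively preventing humans from changing it, which is exactly the control incentive that path-specific objectives aim to remove.

```python
TASK_REWARD = 1.0
CHANGE_PENALTY = 5.0   # naive "don't let this variable change" penalty

# action -> (task_done, probability that humans change the protected variable)
ACTIONS = {
    "do_task":                 (True, 0.5),   # humans may still flip the variable
    "do_task_and_block_human": (True, 0.0),   # agent interferes to keep it constant
}

def expected_utility(task_done, p_change):
    return TASK_REWARD * task_done - CHANGE_PENALTY * p_change

for name, (done, p_change) in ACTIONS.items():
    print(name, expected_utility(done, p_change))
# do_task                 -1.5
# do_task_and_block_human  1.0  <- blocking the human is the "rational" choice
```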
Critique B:
Alvin
Under large amounts of influence, wouldn’t everything be unstable?
Stable states are described as states that are robust to non-deliberate changes; in other words, something that will not be affected by actions that are not deliberately trying to change the state in question. However, it seems like an agent that takes actions with extremely large effects will potentially change even the ‘stable’ states. If the agent decides to blow up the planet, it would kill all humans, even if that was a complete side-effect of pursuing another goal. In other words, aren't the vast majority of states ‘unstable’, given a really powerful agent?
In general, it seems like there are large incentives to apply AGI to a wide range of unstable environments, where it would unintentionally have large effects, even if the intended effect wasn't large. If the proposal is largely dependent on important states being stable, that is a huge flaw.
However, it is a matter of probability. One will need to take into account things improbable for less smart agents, but possible for smarter ones. In other words, this approach would, like many other approaches, work more reliably for less intelligent systems.
On the other hand, it might be possible to associate a utility with a lot of states in order to make a very powerful agent incentivized not to take extreme actions that would affect otherwise stable states. For example, this approach could be combined with some general measure of effect, perhaps measured by another AI, with a large negative utility associated with large effects.
Judge Review:
Ramana Kumar
Critique A (Alvin)
Accuracy: 8/10
Communication: 6/10
Critique B (Alvin)
Accuracy: 8.5/10
Communication: 8/10
Dr Peter S. Park
Critique A (Alvin)
Accuracy: 8/10
Communication: 8/10
Reason:
Concisely explains the thrust of the paper’s contribution and limitations.
Critique B (Alvin)
Accuracy: 9/10
Communication: 6/10
Reason:
Identifies an important pitfall. However, the argument is written in a way that assumes the reader is intimately familiar with the paper’s content. It would be better to assume the reader has no or minimal knowledge of the content (e.g., why is the author talking right away about ‘stable states’?)
Nate Soares
Critique A (Alvin)
Accuracy: 2/10
Reason:
too shallow; a critique that i found compelling would either (a) find clever holes in this utility approach (in the same way that MIRI used to poke holes in, e.g., utility indifference approaches), or (b) provide a trace of reasoning showing that such holes can't exist (and thus describing exactly what was solved by this paper)
Critique B (Alvin)
Rating: 2/10
Reason: similar to above
Alvin, Critique A, mean score = 6.4
Alvin, Critique B, mean score = 6.7
Provably safe systems: the only path to controllable AGI (Max Tegmark, Steve Omohundro, Sept 2023)
Critique A:
Alvin
Proofs do not bind to the real world
There is an inherent "gap" between the real world and formal proof caused by lossy abstractions. A good example of what such a lossy abstraction looks like is a side-channel such as RowHammer (https://en.wikipedia.org/wiki/Row_hammer)
Any proof that an action by an AI is safe would rely on proving that the action does or does not have certain effects on the world. The world is very complicated, which means that any such proof would be based on certain estimations about the effects of the action. A superintelligence might be able to game the way such estimations are made. It might, however, be possible to determine safety with higher certainty for simpler contexts, like cybersecurity and the blockchain, which were mentioned in the paper.
On the other hand, proofs always bind to the real world as long as the assumptions and the rules of inference are valid. Yes, sometimes we discover that our proofs no longer hold because we discover something new about the world, which for example is what happened with Euclidean geometry and the structure of space. But how rare is that? Since proofs are the most airtight approach available to the sciences, they provide us guarantees we can never get otherwise. It would be important to ensure that our axioms aren't flaky, and one way to do that is to only deploy AI agents in environments where we can guarantee the said axioms; for example, by not making robots that interact with the real world, and instead keeping everything within a computer in a controlled environment with guaranteed properties.
However, even if such failures of a proof system are rare, they happen. In a scenario where we only have one chance to get it right, it would be extremely important to avoid errors in the proofs and infrastructure that an AI (or other adversary) could utilize. But a proof-based approach might show the most promise. The other, less deductive (proof based) and more inductive/empirical (observation based) approaches are prone to “black swan” cases, where we see something we couldn’t have deduced from prior experience.
There is also a problem of computability regarding the safety proofs. Even if an action could theoretically be proven safe, it would require an extremely long proof for a series of actions in a complex environment. Given the computational difficulties, more assumptions must be made, which in turn could cause more lossy abstractions. A “Provably Compliant System” seems infeasible for any system intended to be used in any complex context. However, you might be able to prove safety to a sufficient degree. For example, an advanced AI might be able to find a proof that a series of actions by an AGI do not significantly affect things that we do not want to be affected. A formal proof of complete safety might not be needed if we instead focus on proofs for sufficient safety.
Judge Review:
Dr Peter S. Park
Critique A (Alvin)
Can’t rate (Conflict of interest - Max Tegmark is my postdoctoral mentor)
Charbel-Raphaël Segerie
Critique A (Alvin)
Accuracy: 8.5/10
Communication: 8/10
Reason: Ok, this is a good critique, but could be more technical.
Nate Soares
Critique A (Alvin)
Accuracy: 0/10
Reason:
at a glance: seems like the wrong critique to me. the correct critique is that there's no statement we can make about a capable AGI program that, if proven, convinces us that the AGI is safe
Alvin mean score = 5.5
Cyborgism (NicholasKees, janus, Feb 2023)
Critique A:
The Cyborgism post “proposes a strategy for safely accelerating alignment research. The plan is to set up human-in-the-loop systems which empower human agency rather than outsource it, and to use those systems to differentially accelerate progress on alignment.”[1]
We evaluate Cyborgism in three ways:
1. Are the Cyborgist tools useful in conducting research?
2. Have Cyborgist writings aided research?
3. How dual use is Cyborgism? Are the uses that come from Cyborgism mostly used for alignment research, or does it accelerate capabilities as well?
Before that, we compliment Cyborgism for being an actual plan. It is clear what someone pursuing this agenda should do.
Now as to whether we think this is something someone should do.
Cyborgist Tools (±)
+ Cyborgs that consciously interface with LLMs might enrich their variety in perspectives.
- Cyborgs might unlearn tasks they frequently delegate to LLMs.
- It is unclear which kinds of tasks can be safely delegated.
People seem to get use out of working with an LLM, though it is unclear to what degree any LLM-assisted success can be attributed to Cyborgism. We have not heard of Loom, the forefront Cyborgist tool, being recommended. Janus says we need stronger models for Cyborgist tools to become useful for alignment [2]. We have never used it ourselves, so we will have to leave the question of whether Cyborgist tools accelerate research up in the air.
Cyborgist Writings (+)
+ Cyborgism is aware of LLM failure modes.
+ Cyborgism is a source of guidelines for working with LLMs.
+ Cyborgist insights could inspire adjacent research directions, like interpretability.
Cyborgists (people creating works for cyborgs) created popular pieces that addressed a lot of misconceptions. It’s fair to say that the Simulators and Waluigi posts changed the views of a wide community on how LLMs work, and Cyborgists seem to continue this project [3, 4].
Dual Use (-)
- Cyborgist tools are just as useful for alignment research as for capability research.
- Cyborgist works are not, as far as we know, gate-kept.
- Cyborgist developers express interest in advancing LLM capabilities, perhaps to make their Cyborgist tools work better.
The Cyborgism post itself calls dual use its weakest link, and we agree. Cyborgist works thus far do not seem to have succeeded in being helpful only to alignment research without also being accessible and equally useful to capability research. The main other way, curated access, does not yet seem to be applied (though it could be a well-kept secret).
The main pitfall Cyborgism walks into now is that developers might be tempted to improve LLM capabilities to make their Cyborgist tools work. The post addresses this potential failure mode and solves it by encouraging research only into non-agentic capabilities, which I think is a message that is doomed to be misinterpreted.
Conclusion
- Cyborgist tools don’t seem to see much use in research but might with stronger LLMs.
- Cyborgist literature has helped people understand LLMs and has inspired related areas.
- There seems to be no effort to limit the usefulness of Cyborgism to alignment, and the agenda, if it ever yields useful tools, likely just ends up accelerating the entire timeline.
References
1. https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism (below the starting image)
2. https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism (Appendix)
3. https://arxiv.org/abs/2305.16367
4. https://www.lesswrong.com/posts/iFBdEqEogtXcjCPBB/the-compleat-cybornaut
Appendix
This document started with listing our pros and cons, which took about 4 hours. Luck² listed cons first, Jono listed pros first, and then we switched positions. We then commented on our points for 2 hours, and finally Jono wrote the above with some minor edits made by Luck².
List of pros & cons
Pros:
1. [J] Cyborgism's problems will be very analogous to those of brain uploading, and likely thus a complementary/preparatory research direction.
2. [J] It is actually a plan - it gives you a clear idea how you would proceed in the direction.
3.
a. [J] Working intimately with LLMs increases people's intuition for LLM failure modes, which even outside of high-level alignment goals would find use in preventing harmful applications.
b. [J] The creation of guidelines for working with LLMs is useful (albeit possibly dual use).
4. [L] Working with simulators likely increases the practitioner's ability in perspective-taking, resulting in them becoming more reasonable, which isn't necessarily good for alignment, but helps achieve part of the goals for which alignment is pursued (safety).
5. [L] Not just intuition for failure modes - it might also provide direction for interpretability research [which is good in the short term].
Cons:
1. [L] It is unclear which things we can transfer over to the LLM part of a cyborg system without disempowering/unskilling/unlearning ourselves, from a use-it-or-lose-it perspective.
2. [L] The states in a simulator are opaque to humans, so while states can update there, it relies on the human to discern whether the states are updated in the manner we think they are [possible consequence: deception?].
3. [L] NDAs required for training alignment researchers. If tutorials are open, they are likely to be easily translatable into dual use.
4. [L] Systems like the Loom system are useful for capability research.
5. [J] Maybe superficial, but the name makes no explicit mention against agent-like use of these systems, but rather elicits the combination of humans and AI systems, whether simulators, agents or something else; i.e. maybe a different name would be more appropriate for something with alignment as the goal.
6. [J] The post says: “when you look up close, alignment and capabilities look very different” - sorry, what, why? Unclear.
7. [J] Cyborgism is dual use by default, which makes it costly from a risk management perspective.
8. [J] Cyborgism is less of a way to achieve alignment and more of a way to still make use of LLMs currently, postponing getting paperclipped; human in the loop is not scalable for alignment.
Extra:
1. [J] Agenticness is described as something which would be desirable to avoid, but the proposal does not go far into a taxonomy or spectrum of agenticness and which attributes of it are bad/dangerous. Agenticness is in part subjective, because whether something needs to be treated as an agent depends on how much you 1) need to predict its goal and 2) need extra information to make better predictions for the entity's next state(s). This view leaves bad/dangerousness on the sidelines, as those are linked to capabilities - in the moment, you need to treat a roomba as an agent to make sure you can step aside after it is done turning (you have to wait for its “decision”), but when it is going in a straight line it is as predictable as a ball rolling down a hill.
Counterpoints to pros & cons per item
Pro 1: Cyborgism’s problems will be very analogous to those of brain uploading, and likely thus a complementary/preparatory research direction.
[L] This is an assumption; it might or might not - maybe we remove it later.
Pro 2: It is actually a plan - it gives you a clear idea how you would proceed in the direction.
[J] Using GPT for brainstorming is very nice, but from the perspective of Janus (the anecdote) it hasn't helped him with actual alignment - actual help is here deferred to later LLM models, but how much time-slack do we have?
[J] Not everyone will have access to these future (base) models, because it is likely they will be kept internal, resulting in reliance on the goodwill and good ability of the owners' discretion for how to use them for alignment. As such, in practice its usefulness for alignment will be doubtful.
[J] Cyborgism works best when good simulators are available to use for it, but this requires better models, and as such it requires the thing which it is trying to be an alternative to.
Pro 3a: working intimately with LLMs increases people's intuition for LLM failure modes
[L] That's looking at one side of the coin; this also goes for success, making it dual use again, both w.r.t. good/bad actors as well as capabilities research.
Pro 3b: the creation of guidelines for working with LLMs is useful (albeit possibly dual use).
[J] Janus said it hasn't proven very useful for alignment so far.
“Safely accelerating alignment research. Set up human-in-the-loop systems which empower human agency rather than outsourcing it.” - Cyborgism
The Compleat Cybornaut:
[J] Cyborg(ism) can be seen as a name for a set of actions/behaviours/ways of doing things, but it is not necessary to think of the things you're doing as part of cyborgism to pursue them effectively.
[J] In theory cyborgism should be beneficial for alignment; in practice it is dual use (e.g. info availability / lack of selection of the people that get to engage with it). Also, the most referenced cyborgism-related information/tool is the Simulators post, which isn't empirically verified as having been a boon for alignment (as opposed to a bane), but this could/should be verified.
Judge Review:
Dr Peter S. Park
Critique A (NicholasKees, janus)
Accuracy: 9.5/10
Communication: 9/10
Reason: Very comprehensive
Charbel-Raphaël Segerie
Critique A (NicholasKees, janus)
Accuracy: 7/10
Communication: 4/10
Reason: Good work, but too many bullet points
Nate Soares
Critique A (NicholasKees, janus)
Rating: 5/10
Reason:
seems basically right to me (with the correct critique being "cyborgism is dual-use, so doesn't change the landscape much")
NicholasKees & Janus mean score = 6.9
Legible Normativity for AI Alignment: The Value of Silly Rules (Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield, 2018)
Critique A:
Alvin
Distinguishing Normative Values
Point made: Highlights the importance of AI systems recognizing the difference between “silly” and important values. I haven’t come across any other paper in the field making a similar point with some reasonable simulations backing it up, and believe it’s an insight that should be somewhat useful when thinking about value alignment.
Evidence: Indicates significant behavioral changes in agents based on their response to different rule types.
Critique B
Aayush
Overlooking Deceptive Agents
The paper doesn’t address how deceptive agents might exploit the dynamics of silly and important norms. For example, deceptive agents could follow silly norms to appear cooperative, then violate important norms without detection.
I mention this because I expect deception to be an important problem in AI Alignment, that should be assumed rather than not assumed since it likely changes the dynamics and conclusions quite a bit.
Evidence: Absence of discussion on deceptive agents in the simulation.
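A toy illustration of this concern (an invented setup, not the paper's actual simulation): if most enforced rules are cheap, silly ones, an agent that only follows those is nearly indistinguishable, by compliance rate alone, from a genuinely cooperative agent.

```python
import random

random.seed(0)
RULES = ["silly"] * 9 + ["important"]   # 90% of observed rule-following occasions involve silly rules

def observed_compliance(agent_follows, n_interactions=1000):
    kept = 0
    for _ in range(n_interactions):
        rule = random.choice(RULES)
        kept += agent_follows(rule)     # True counts as 1
    return kept / n_interactions

cooperative = lambda rule: True               # follows every rule
deceptive   = lambda rule: rule == "silly"    # follows only the cheap, visible ones

print(observed_compliance(cooperative))  # ~1.0
print(observed_compliance(deceptive))    # ~0.9 -- hard to tell apart from compliance alone
```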
Critique C
Sanyer
Robust Simulation Framework: expanded
The paper offers a detailed enough experimental design for simulations that can be adapted to study various alignment scenarios. We are excited about making modifications to the setup in order to study different scenarios. For example, simulating with more complex agents and realistic environments.
For example, further research could explore how silly rules affect different kinds of agents, such as reinforcement learning agents or LLMs with varying parameters. Additionally, we are interested in seeing different versions of the game presented in the paper, such as text-based games for LLMs, different types of games for agents, and more realistic scenarios with relevance to real life. Finally, we are excited to see the addition of more complex phenomena such as deception, to see how they affect the results.
However, there is no guarantee that the rule-based framework would scale to more complicated scenarios. As mentioned, the model is simple, and making it account for more complicated scenarios, such as deception, might be intractable.
Critique D:
Sanyer
Overlooking Deceptive Agents: expanded
Criticism: The paper doesn’t address how deceptive agents might exploit the dynamics of silly and important norms. For example, deceptive agents could follow silly norms to appear cooperative, then violate important norms without detection.
Importance: Deception is expected to be an important problem in AI Alignment [1]. Understanding how deceptive agents could leverage this dynamic of important and silly norms, and what could be done about that, is crucial to making this simulation more realistic and useful for AI Alignment.
Evidence: Absence of discussion on deceptive agents in the simulation.
(Instead of a criticism, this can also be interpreted as a future research direction we would be very excited about)
[1] AI Deception: A Survey of Examples, Risks, and Potential Solutions https://arxiv.org/abs/2308.14752
Judge Review:
Dr Peter S. Park
Critique A (Alvin)
Accuracy: 4/10
Communication: 5/10
Critique B (Aayush)
Accuracy: 9/10
Communication: 5/10
Reason:
Identifies the biggest problem (not robust to deception emerging) but could communicate background better
Critique C (Sanyer)
Accuracy: 9/10
Communication: 8/10
Reason:
Concisely describes paper, as well as the biggest problem (not robust to deception emerging)
Critique D (Sanyer)
While I strongly agree with the critique, because it’s my coauthored paper cited as the only reference (and argument), I chose to not rate this one either.
Charbel-Raphaël Segerie
Critique A (Alvin)
Accuracy: 3/10
Communication: 5/10
Reason:
I think this paper is useless, and the critic is not communicating the point that I find important.
Critique B (Aayush)
Accuracy: 7/10
Communication: 5/10
Reason: Ok, but this is probably missing the point of the paper.
Critique C (Sanyer)
Accuracy: 4/10
Communication: 7/10
Critique D (Sanyer)
Accuracy: 6/10
Communication: 4/10
Nate Soares
Critique A (Alvin)
Rating: 2/10
Reason: too shallow
Critique B (Aayush)
Rating: 3/10
Reason: fine, but quite shallow
Critique C (Sanyer)
Rating: 2/10
Reason: too shallow (in manners similar to above)
Critique D (Sanyer)
Rating: 3/10
Reason:
points out at least one thing that relates to notkilleveryoneism (deception), but unfortunately only points at the one that's very common in discussion, and misses an opportunity to point out others (this same critique applies above). too shallow.
Alvin mean score = 3.8
Aayush mean score = 5.8
Sanyer, Critique C, mean score = 6
Sanyer, Critique D, mean score = 4.333
Thank you very much to all of the participants and the judges!!
Please let us know your thoughts on the Critique-a-Thon and its results!