Thoughts on sharing information about language model capabilities

post by paulfchristiano · 2023-07-31T16:04:21.396Z · LW · GW · 44 comments

Contents

  Core claim
  Context
  Accelerating LM agents seems neutral (or maybe positive)
    Improvements in LM agents seem good for safety
    "Overhang" in LM agents seems risky
  Understanding of capabilities is valuable
  Information about capabilities is more impactful for understanding than speed
  I think we should make this decision based on best estimates of costs and benefits
44 comments

Core claim

I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).

Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).

Context

ARC Evals currently focuses on evaluating the capabilities and limitations of existing ML systems, with an aim towards understanding whether or when they may be capable enough to pose catastrophic risks. Current evaluations are particularly focused on monitoring progress in language model agents.

I believe that sharing this kind of information significantly improves society's ability to handle risks from AI, and so I am encouraging the team to share more information. However this issue is certainly not straightforward, and in some places (particularly in the EA community where this post is being shared) I believe my position is controversial.

I'm writing this post at the request of the Evals team to lay out my views publicly. I am speaking only for myself. I believe the team is broadly sympathetic to my position, but would prefer to see a broader and more thorough discussion about this question.

I do not think this post presents a complete or convincing argument for my beliefs. The purpose is mostly to outline and explain the basic view, at a similar level of clarity and thoroughness to the arguments against sharing information (which have mostly not been laid out explicitly).

Added 8/1: Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing that kind of information; this post will explain why I believe that.

Accelerating LM agents seems neutral (or maybe positive)

I believe that having a better understanding of LM agents increases safety[1] through two channels:

I will discuss these mechanisms in more detail in the rest of this section.

I also think that accelerating LM agents will drive investment in improving and deploying ML systems, and so can reduce time available to react to risk. As a result I'm ambivalent about the net effect of improving the design of LM agents—my personal tentative guess is that it's positive, but I would be hesitant about deliberately accelerating LM agents to improve safety (moreover I think this would be a very unleveraged approach to improving safety[3] and would strongly discourage anyone from pursuing it).

But this means that I am significantly less concerned about information about LM agent capabilities accelerating progress on LM agents. Given that I am already positively disposed towards sharing information about ML capabilities and limitations despite the risk of acceleration, I am particularly positive about sharing information in cases where the main cost is accelerating LM agents.

Improvements in LM agents seem good for safety

Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces. Progress in understanding LM agents seems relevant for improving agents built this way, while having at best marginal relevance for systems optimized end to end (to which I expect the "bitter lesson" to apply strongly) or for situations where individual ML invocations are just "cogs in the Turing machine."
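To make "LM parts composed along human-comprehensible interfaces" a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `stub_lm` stands in for a real language model call, and `Step`, `run_agent`, and the prompt prefixes are hypothetical names rather than any particular agent framework. The only point is that every message passed between components is plain text that a human could read and audit.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: text in, text out.
LM = Callable[[str], str]


@dataclass
class Step:
    """One human-readable exchange between the scaffold and the LM."""
    prompt: str
    response: str


def stub_lm(prompt: str) -> str:
    """Toy 'model' so the sketch runs without any external service."""
    if prompt.startswith("Decompose:"):
        return "find the population of France\nfind the population of Spain"
    if prompt.startswith("Solve:"):
        return f"[answer to: {prompt[len('Solve:'):].strip()}]"
    if prompt.startswith("Combine:"):
        return f"summary of {prompt.count('[answer to:')} sub-answers"
    return ""


def run_agent(task: str, lm: LM) -> Tuple[str, List[Step]]:
    """Decompose a task, solve each piece, and combine the results.

    Every intermediate prompt and response is recorded in a transcript,
    so the whole computation can be inspected after the fact.
    """
    transcript: List[Step] = []

    def call(prompt: str) -> str:
        response = lm(prompt)
        transcript.append(Step(prompt, response))
        return response

    subtasks = [s for s in call(f"Decompose: {task}").splitlines() if s.strip()]
    answers = [call(f"Solve: {s}") for s in subtasks]
    final = call("Combine: " + " | ".join(answers))
    return final, transcript


if __name__ == "__main__":
    final, transcript = run_agent(
        "compare the populations of France and Spain", stub_lm
    )
    for step in transcript:
        print(step.prompt, "->", step.response)
```

In a system like this, improving the agent means changing the decomposition or the prompts, both of which remain legible. An end-to-end optimized system would have no such transcript to read.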

I think this kind of ML system seems great for safety:

So at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.

As mentioned before, I do think that progress in LM agents will increase overall investment in ML, and not just LM agent performance. And to a significant extent I think the success of LM agents will be determined by technical factors rather than how much investment there is (although this also makes me more skeptical about the acceleration impacts). But if it weren't for these considerations I would think that progress on LM agents would be clearly and significantly positive. 

"Overhang" in LM agents seems risky

Right now people are investing billions of dollars in scaling up LMs. If people only invested millions of dollars in improving LM agents, and such agents were important for overall performance, then I think we would be faced with a massive "overhang:" small additional investments in LM agents could significantly improve overall AI performance.

Under these conditions, increasing investment to speed up LM agents today is likely to slow down LM agents in the future, picking low-hanging fruit that would instead have been picked later when investment increased. If I had to guess, I'd say that accelerating AI progress by 1 day today by improving LM agents would give us back 0.5 days later. (This clawback comes not just from future investments in general agents, but also from the domain-specific investment needed to make a valuable product in any given domain.) I am sympathetic to a broad range of estimates, from 0 to 0.9.[5]

This leaves us with an ambiguous sign, because time later seems much more valuable than time now:[6]

So even if LM agents had no relevance for safety, I would feel ambivalent about whether it is good to speed up or slow them down. (I feel similar ambivalence about many forms of pause, and as I've mentioned I feel like higher investment in the past would quite clearly have slowed down progress now and would probably be net positive, but I think LM agents are an unusually favorable case.)
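One simple way to read the sign ambiguity above is as a small calculation. This is a sketch under stated assumptions: the clawback values 0, 0.5 and 0.9 are the estimates given earlier, while `value_ratio` (how much more a day later is worth than a day now) is a free parameter, and the values tried here are purely hypothetical. Accelerating by one day costs one day now and returns `clawback` days later, so the net effect in units of "days now" is `clawback * value_ratio - 1`.

```python
def net_value_of_acceleration(clawback: float, value_ratio: float) -> float:
    """Net value, in 'days now', of accelerating progress by 1 day today.

    clawback:    days of progress given back later per day accelerated now.
    value_ratio: assumed value of a day later relative to a day now.
    """
    return clawback * value_ratio - 1.0


def break_even_ratio(clawback: float) -> float:
    """Value ratio at which acceleration is exactly neutral (inf if no clawback)."""
    return float("inf") if clawback == 0 else 1.0 / clawback


if __name__ == "__main__":
    for clawback in (0.0, 0.5, 0.9):         # range of estimates from the text
        for value_ratio in (1.0, 2.0, 4.0):  # hypothetical illustrative values
            net = net_value_of_acceleration(clawback, value_ratio)
            print(f"clawback={clawback:.1f} ratio={value_ratio:.0f} net={net:+.2f}")
```

Under this reading, a clawback of 0.5 breaks even when a day later is worth twice as much as a day now. Above that ratio the sign flips positive, and with zero clawback acceleration is a pure cost.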

If you told me that existing language models could already be transformative with the right agent design, I think this position would become stronger rather than weaker. I think in that scenario the overwhelmingly most important game is noticing this overhang, slowing down progress past GPT-4, and starting to get transformative work out of relatively safe modern ML systems rather than overshooting badly.

I think this overhang argument applies to some extent for most investments in 2023; for example if AI labs buy all the GPUs today then they will get an immediate boost by training bigger models next year, but the boost after that will require having TSMC build more GPUs and so will be much slower (and the one after that will require building new fabs and be much slower). I mentioned it in the previous section and do think it's a major factor explaining why I place a lower premium on slowing down AI than other people. However I think it's a more important factor for LM agents than for e.g. improving the efficiency of LMs or investing more in hardware.

Understanding of capabilities is valuable

I think that a broad understanding of AI capabilities, and how those capabilities are likely to change over time, would significantly reduce risks:

This factor seems especially large over the next few years, where most risk comes from the possibility that humanity is taken by surprise.  I think this is the most important timeframe for individual decisions about sharing information, since the effects of current decisions will be increasingly attenuated over longer horizons.

Over the longer term I think the dangerous capabilities of AI systems will likely be increasingly clear. But I think better understanding still improves how prepared we are and reduces the risk of large surprises.

I think the importance of information about capabilities is pretty robust across worldviews:

I think that significantly increasing and broadening an understanding of LM capabilities would very significantly decrease risk, but it's hard to quantify this effect (because it's hard to measure increases in understanding). Qualitatively, I believe that realistic increases in understanding could cut risk by tens of percent.

Information about capabilities is more impactful for understanding than speed

I think that more accurate information about LM capabilities and limitations can drive faster progress in two big ways:

I think these are real effects. But combining with the unquantified estimate in the last section, if I had to make a wild guess I'd say the benefits from sharing information about ML capabilities are 5-10x larger than the costs from acceleration (even without focusing attention on LM agents).

Here are the main reasons why I think this acceleration cost is smaller than you might fear:

I think we should make this decision based on best estimates of costs and benefits

One could have a variety of procedural objections to sharing information even if the benefits appear to exceed the cost. I don't think these apply strongly, and therefore I think we should make this decision based on object level analysis:

  1. ^

    In this post I focus mostly on the risk of AI takeover, because the community worried about takeover is the primary place where I have encountered a widespread belief that measurement of general LM capabilities may be actively counterproductive.

  2. ^

    It's conceivable that LM agents pose novel risks that are as large or larger than existing threat models—but to the extent that is the case I am if anything even more excited about exploring such agents sooner, and even more skeptical about buying time to e.g. do (apparently-misguided) alignment research today.

  3. ^

    Because ML capabilities researchers will already be seeking out and implementing these improvements

  4. ^

    Most explicitly, see the 2016 discussion of the bootstrapping protocol in ALBA, in which models trained by RLHF solve harder problems by using chain of thought and task decomposition. See also this early 2015 post, which has some distracting simplifications and additional facts, but which presents LM agents in what I would say is essentially the same form that seems most plausible today. This isn't really related to the thrust of this post and is mostly me just feeling proud of my picture holding up well, but I do think the history here is somewhat relevant to understanding my view: this isn't something I'm making up now, this is comparing the real world to expectations from many years ago and seeing that LM agents look even more likely to play a central role.

  5. ^

    Note that this number could be negative. The average across all forms of progress is 0, since accelerating everything by 1 day should decrease timelines by exactly 1 day. I think that areas with high investment are naturally below zero and those with lower investment are naturally above zero, because low-investment areas will expand more easily later. I think probably all software progress and ML-specific investment is above 0, and that improvements in the quantity and quality of compute are well below 0.

  6. ^

    Another reason that time later is more valuable than time now is that AI systems themselves will be doing a large fraction of the cognitive work in the future. But this consideration cancels out when you do the full analysis, both increasing the value of time later and making it harder to get back time later.

  7. ^

    One counterargument is that almost all the policy value comes from policy research driven primarily by altruists who aren't significantly more likely to work on AI as risks become more concrete and systems become more capable. I don't personally find this very plausible: it seems like the quantity of research has in fact increased, and that the quality and relevance of that research has also improved significantly.

  8. ^

     I've seen the opposite asserted—that momentum means that accelerating now just accelerates more in the future. I don't think this issue is completely straightforward and it would be a longer digression to really get to the bottom of it. But right now I feel like on-paper analysis and observations of the last 10 years of AI both point pretty strongly towards this conclusion, and I haven't really seen the alternative laid out.

  9. ^

     By analogy, it seems to me that if humanity had trained GPT-4 for $250M in 2012, using a larger ML community and a larger number of worse computers, the net effect would be a reduction in risk. Making further progress from that point would be harder and easier to regulate, since scaling up spending would become prohibitively difficult and further ML progress would only be possible with large amounts of labor. On top of that, effective AI populations would be smaller since AI would already be using a much larger fraction of humanity's computing hardware, further computing scaleup would be increasingly bottlenecked, and an intelligence explosion would plausibly proceed several times more slowly. One could argue that increasing preparedness between 2012 and 2022 was enough to compensate for this factor, but that doesn't look right to me. I am more ambivalent about the effects of acceleration at this point and think it is negative in expectation, because I think society is now investing much more heavily in trying to understand and adapt to the AI we already have and we're already on track to scale up through the next 5 orders of magnitude distressingly quickly.

44 comments

Comments sorted by top scores.

comment by habryka (habryka4) · 2023-07-31T16:43:55.178Z · LW(p) · GW(p)

Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces.

This seems like a very narrow and specific definition of language model agents that doesn't even obviously apply to the most agentic language model systems we have right now. It is not the case that human-comprehensible task decomposition actually improves performance on almost any task for current language models (Auto-GPT does not actually work), and it is not clear that current RLHF and RLAF trained models are "solving a human-comprehensible task" when I chat with them. They seem to pursue a highly complicated mixed objective, which includes optimizing for sycophancy and various other messy things, and their behavior seems only weakly characterized as primarily solving a human-comprehensible task.

But even assuming that current systems are doing something that is accurately described as solving human-comprehensible tasks composed along human-comprehensible interfaces, this seems (to me) unlikely to continue much into the future. RLHF and RLAF already encourage the system to do more of its planning internally in a non-decomposed manner, due to the strong pressure towards giving short responses, and it seems likely we will train language model agents on various complicated game-like environments in order to train them explicitly to do long-term planning.

These systems would still meaningfully be "LM agents", but I don't see any reason to assume that those systems would continue to do things in the decomposed manner that you seem to be assuming here.

My guess is it might be best to clarify that you are not in-general in favor of advancing agents built on top of language models, which seem to me to be very hard to align in-general, but are only in favor of advancing a specific technology for making agents out of language models, which tries to leverage factored cognition and tries to actively avoid giving the agents complicated end-to-end tasks with reinforcement learning feedback. 

And my guess is that we then have a disagreement in that you expect that by-default the ML field will develop in a direction that will leverage factored-cognition style approaches, which doesn't currently seem that likely to me. I expect more end-to-end reinforcement learning on complicated environments, and more RLHF-style optimization to internalize agentic computation. Might be worth trying to make a bet on the degree to which LM capabilities will meaningfully be characterized as doing transparent task-decomposition and acting along simple human-comprehensible interfaces.

My current guess is that sharing info on language model capabilities is overall still good, but I disagree with the "Improvements in LM agents seem good for safety" section. My best guess is that the fastest path to LM agents is to just do more end-to-end training on complicated environments. This will not produce anything that is particularly easy to audit or align. Overall, this approach to building agents seems among the worst ways AI could develop in terms of making it easy to align, and my guess is there are many other ways that would be substantially better to push ahead instead.

Replies from: paulfchristiano, paulfchristiano
comment by paulfchristiano · 2023-07-31T17:19:08.612Z · LW(p) · GW(p)

I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance.  While more complex LM agents add only a small amount of additional value, it seems like extrapolating trends would make them pretty important soon.

Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive performance." I'm happy to bet about whether that trend will continue to the extent we can operationalize it. E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use. I don't have a strong view about more complex decompositions unless context length is a serious limitation. I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).

To the extent models trained with RLHF are doing anything smart in the real world I think it's basically ~100% by solving a human-comprehensible task. Namely humans give the system a task, and it tries to do some rough combination of what a particular kind of human demonstrator would do and what a particular kind of human evaluator would rate highly. There is no further optimization to take intelligent actions in the world.

Replies from: habryka4, bogdan-ionut-cirstea
comment by habryka (habryka4) · 2023-07-31T17:47:26.601Z · LW(p) · GW(p)

Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.

I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don't currently know of any not-extremely-gerry-mandered task where doing this actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.

Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive performance."

I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.

For example, I don't really understand whether this model is surprised or unsurprised by the extreme breadth of knowledge that modern LLMs have. I don't see any "imitation of human cognitive steps" when an LLM is capable of remembering things from a much wider range of topics. It seems just that its way of acquiring knowledge is very different from humans, giving rise to a very different capability-landscape. This capability does not seem to be built out of "composition of imitations of human cognitive steps". 

Similarly, when I use Codex for programming, I do not see any evidence that suggests that Codex is solving programming problems by composing imitations of human cognitive steps. Indeed, it mostly seems to just solve the problems in one-shot, vastly faster than I would be able to even type, and in a way that seems completely alien to me as a programmer.

E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use.

I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over. 

I have no strong opinions on tool-use. Seems like the LLMs will use APIs the same way as humans would. I do think if you train more on end-to-end tasks, the code they write to solve sub-problems will become less readable. I have thought less about this and wouldn't currently take a bet.

I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).

I am not super confident of this, but my best guess is that we will see more end-to-end optimization, and that those will make a big difference in task performance. It also seems like a natural endpoint of something like RLAF, where you have the AI guide a lot of the training process itself when given a highly-complicated objective, and then you do various forms of RL on self-evaluations. 

Replies from: LRudL, paulfchristiano, bogdan-ionut-cirstea
comment by L Rudolf L (LRudL) · 2023-08-01T05:10:10.846Z · LW(p) · GW(p)

I don't currently know of any not-extremely-gerry-mandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.

Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.

It does much better than AutoGPT, and also the paper does ablations to show that the different parts of the scaffolding in Voyager do matter. This suggests that better scaffolding does make a difference, and I doubt Voyager is the limit.

I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it's plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model's weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.

Replies from: habryka4
comment by habryka (habryka4) · 2023-08-01T05:20:48.970Z · LW(p) · GW(p)

Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.

That's a good example, thank you! I actually now remembered looking at this a few weeks ago and thinking about it as an interesting example of scaffolding. Thanks for reminding me. 

I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive.

I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work also seems like a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just reinforcement learn a bunch, and then maybe they would do a bit of scaffolding, but the scaffolding would be a lot less detailed and not really be that important to the overall performance of the system.

comment by paulfchristiano · 2023-07-31T18:50:12.131Z · LW(p) · GW(p)

Although this is an important discussion I want to emphasize up front that I don't think it's closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.

I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.

If you allow models to think for a while they do much better than if you just ask them to answer the question. By "think for a while" we mean they generate one sentence after another in the same way a human would. Their ability to use chain of thought seems to come essentially entirely from copying human chains of thought rather than e.g. using filler tokens to parallelize cognition or RL fine-tuning teaching them novel cognitive strategies.

I agree that models also memorize a lot of facts. Almost all the facts they actually use are facts that humans know, which they memorized by observing humans using them or stating them. So I don't really consider this evidence one way or the other.

If you want to state any concrete prediction about the future I'm happy to say whether I agree with it. For example:

  • I think that the gap between "spit out an answer" and chain of thought / tool use / decomposition will continue to grow. (Even as chain of thought becomes increasingly unfaithful for questions of any fixed difficulty, since models become increasingly able to answer such questions in a single shot.)
  • I think there is a significant chance decomposition is a big part of that cluster, say a 50% chance that context-hiding decomposition obviously improves performance by an amount comparable to chain of thought.
  • I think that end-to-end RL on task performance will continue to result in models that use superficially human-comprehensible reasoning steps, break tasks into human-comprehensible pieces, and use human interfaces for tools.

My sense right now is that this feels a bit semantic.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-05T10:08:13.806Z · LW(p) · GW(p)

I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over. 

This prediction seems largely falsified as long as transformers remain the dominant architecture, and especially if we deliberately add optimization pressures towards externalized reasoning and against internal, latent reasoning; see e.g. Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and LLMs Do Not Think Step-by-step In Implicit Reasoning.

Replies from: habryka4
comment by habryka (habryka4) · 2024-12-05T16:28:10.007Z · LW(p) · GW(p)

I do not understand your comment at all. Why would it be falsified? Transformers are completely capable of steganography if you apply pressure towards it, which we will (and have done).

In DeepSeek we can already see weird things happening in the chain of thought. I will happily take bets that we will see a lot more of that.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-05T16:37:50.474Z · LW(p) · GW(p)

I'm pointing out that transformers seem really bad at internal multi-hop reasoning; currently they can't even do 2-hop robustly, 3-hop robustly seems kind of out of the question right now, and scaling doesn't seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and also how much more robust and scalable CoT reasoning is). So 'chain-of-thought but internalized to the model will take over' seems very unlikely with transformers, and much more so if basic mitigations like unlearning (e.g. of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights [LW · GW], to mitigate the possibility of latent scheming) were applied.

Steganography is a separate threat model, but even there I'd interpret current evidence (e.g. Preventing Language Models From Hiding Their Reasoning) as mostly positive (as in, even relatively simple mitigations like paraphrasing seem to go very far).

Replies from: habryka4
comment by habryka (habryka4) · 2024-12-05T16:48:31.819Z · LW(p) · GW(p)

Transformers are obviously capable of doing complicated internal chains of reasoning. Just try giving them a difficult problem and force them to start their answer in the very next token. You will see no interpretable or visible traces of their reasoning, but they will still get it right for almost all questions.

Visible CoT is only necessary for the frontier of difficulty. The rest is easily internalized.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-05T16:53:59.016Z · LW(p) · GW(p)

I don't dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.

Replies from: habryka4
comment by habryka (habryka4) · 2024-12-05T17:02:24.060Z · LW(p) · GW(p)

What is plausibly a valid definition of multi-hop reasoning that we care about and that excludes getting mathematical proofs right and answering complicated never-before-seen physics questions and doing the kind of thing that a smaller model needed to do a CoT for?

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-12-05T17:58:43.248Z · LW(p) · GW(p)

The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section 'B.1.3 HIDDEN SCHEMING REASONING' from Towards evaluations-based safety cases for AI scheming. I wouldn't be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering + unlearning of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights [LW · GW], could force them to externalize their reasoning (which should be much easier to monitor than latent reasoning), if they were to try to alignment-fake; though steganography would also be another threat model here, as discussed e.g. in section 'B.1.2 OBFUSCATED SCHEMING REASONING' of Towards evaluations-based safety cases for AI scheming.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-07-18T23:41:04.758Z · LW(p) · GW(p)

E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use

Some evidence in favor: https://x.com/YangjunR/status/1793681241398788319 (for increasingly wide gap between LM single forward pass ('snap judgment') and CoT), https://xwang.dev/mint-bench/ (for tool use being increasingly useful, with both model scale, and number of tool use turns).

comment by paulfchristiano · 2023-07-31T17:18:48.765Z · LW(p) · GW(p)

I changed the section to try to make it a bit more clear that I mean "understanding of LM agents." For the purpose of this post, I am trying to mostly talk about things like understanding the capabilities and limitations of LM agents, and maybe even incidental information about decomposition and prompting that help overcome these limitations. This is controversial because it may allow people to build better agents, but I think this kind of understanding is helpful if people continue to build such agents primarily out of chain of thought and decomposition, while not having much impact on our ability to optimize end-to-end.

comment by evhub · 2023-07-31T19:56:30.074Z · LW(p) · GW(p)

Something that I think is worth noting here: I don't think that you have to agree with the "Accelerating LM agents seems neutral (or maybe positive)" section to think that sharing current model capabilities evaluations is a good idea as long as you agree with the "Understanding of capabilities is valuable" section.

Personally, I feel much more uncertain than Paul on the "Accelerating LM agents seems neutral (or maybe positive)" point, but I agree with all the key points in the "Understanding of capabilities is valuable" section, and I think that's enough to justify substantial sharing of model capabilities evaluations (though I think you'd still want to be very careful about anything that might leak capabilities secrets).

Replies from: paulfchristiano
comment by paulfchristiano · 2023-07-31T20:06:44.099Z · LW(p) · GW(p)

Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post.

At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.

comment by paulfchristiano · 2023-08-01T18:37:36.580Z · LW(p) · GW(p)

Note that Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing this kind of information; my post is explaining why I believe that.

comment by Simon Goldstein (simon-goldstein) · 2023-07-31T22:05:45.543Z · LW(p) · GW(p)

Thanks for the thoughtful post, lots of important points here. For what it’s worth, here is a recent post where I’ve argued in detail (along with Cameron Domenico Kirk-Giannini) that language model agents are a particularly safe route to agi: https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe [AF · GW]

comment by johnswentworth · 2023-07-31T18:29:32.283Z · LW(p) · GW(p)

There's a bunch of considerations and models mixed together in this post. Here's a way I'm factoring some of them, which other people may also find useful.

I'd consider counterfactuality the main top-level node; things which would have been done anyway have radically different considerations from things which wouldn't. E.g. doing an eval which (carefully, a little bit at a time) mimics what e.g. chaosGPT does, in a controlled environment prior to release, seems straightforwardly good so long as people were going to build chaosGPT soon anyway. It's a direct improvement over something which would have happened quickly anyway in the absence of the eval. That argument still holds even if a bunch of the other stuff in the post is totally wrong or totally the wrong way of thinking about things (e.g. I largely agree with habryka's comment about comprehensibility of future LM-based agents).

On the other hand, building a better version of chaosGPT which users would not have tried anyway, or building it much sooner, is at least not obviously an improvement. I would say that's probably a bad idea, but that's where the rest of the models in the post start to be relevant to the discussion.

Alas, we don't actually know ahead of time which things will/won't counterfactually be tried, so there's some grey zone. But at least this frame makes it clear that "what would people counterfactually try anyway?" is a key subquestion.

(Side note: also remember that counterfactuality gets trickier in multiplayer scenarios where players are making decisions based on their expectations of other players. We don't want a situation where all the major labs build chaosGPT because they expect all the others to do so anyway. But in the case of chaosGPT, multiplayer considerations aren't really relevant, because somebody was going to build the thing regardless of whether they expected OpenAI/Deepmind/Anthropic to build the thing. And I expect that's the prototypical case; the major labs don't actually have enough of a moat for small-game multiplayer dynamics to be a very good model here.)

comment by michaelcohen (cocoa) · 2023-08-10T01:32:57.157Z · LW(p) · GW(p)

I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] [AF(p) · GW(p)] and in my view this is looking more and more plausible over time.

I agree whole-heartedly with the first sentence. I'm not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a different way to get superhuman systems, and one that encourages intervening in feedback if the agent is capable enough. Doesn't the first sentence support the case that it would be safer to stick to chain of thought and decomposition as the key drivers of superhumanness, rather than using RL?

Replies from: paulfchristiano
comment by paulfchristiano · 2023-08-10T17:24:40.675Z · LW(p) · GW(p)

It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews).

It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not.

So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model.

You could try to do distillation with imitation learning. This is way more likely to be competitive than using no distillation at all.

But it still seems like it has a very good chance of being uncompetitive because the imitation objective significantly impairs performance and creates all kinds of artifacts. Using process-based RL for distillation seems like it has essentially the same safety profile as using imitation learning, while avoiding the obvious pathologies and having a much higher probability of being competitive.

(People give various reasons that RL in the distillation step is less safe than imitation learning in the distillation step, but so far I haven't found anything at all persuasive.)

I think there's still a good chance that process-based RL in the distillation step can't be competitive and so you need to talk about how to develop new techniques or prudently incorporate outcomes. But I think it's at least much more likely to be competitive than CoT-only, or imitation learning in the distillation step. (Perhaps it cuts down the probability of deal-breaking uncompetitiveness by 30%, compared to using imitation learning alone for distillation.)

Replies from: cocoa
comment by michaelcohen (cocoa) · 2023-08-12T06:37:19.379Z · LW(p) · GW(p)

What is process-based RL?

I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten all life. The decisions of wealthy countries are apparently extremely strongly correlated, maybe in part for "we're all human"-type reasons, and maybe in part because legislators and regulators know that they won't get their ear chewed off for doing things like the US does. With immigration law, there is no attempt at coordination; quite the opposite (e.g. Syrian refugees in the EU). 2) The number of nuclear states is stunningly small if one follows the intuition that wildly uncompetitive behavior, which leaves significant value on the table, produces an unstable situation. Not every country needs to sign on eagerly to avoiding some of the scariest forms of AI. The US/EU/China can shape other countries' incentives quite powerfully. 3) People in government do not seem to be very zealous about economic growth. Sorry this isn't a very specific example. But their behavior on issue after issue does not seem very consistent with someone who would see, I don't know, 25% GDP growth from their country's imitation learners, and say, "these international AI agreements are too cautious and are holding us back from even more growth"; it seems much more likely to me that politicians' appetite for risking great power conflict requires much worse economic conditions than that.

In cases 1 and 2, the threat is existential, and countries take big measures accordingly. So I think existing mechanisms for diplomacy and enforcement are powerful enough "coordination mechanisms" to stop highly-capitalized RL projects. I also object a bit to calling a solution here "strong global coordination". If China makes a law preventing AI that would kill everyone with 1% probability if made, that's rational for them to do regardless of whether the US does the same. We just need leaders to understand the risks, and we need them to be presiding over enough growth that they don't need to take desperate action, and that seems doable.

Also, consider how much more state capacity AI-enabled states could have. It seems to me that a vast population of imitation learners (or imitations of populations of imitation learners) can prevent advanced RL from ever being developed, if the latter is illegal; they don't have to compete with them after they've been made. If there are well-designed laws against RL (beyond some level of capability), we would have plenty of time to put such enforcement in place.

Replies from: paulfchristiano, matthew-barnett
comment by paulfchristiano · 2023-08-14T16:51:27.039Z · LW(p) · GW(p)

By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
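To make the distinction concrete, here is a minimal sketch contrasting the two reward signals. Everything in it (the `Action` type, `toy_overseer`, `toy_environment`) is a hypothetical illustration, not a description of any real training setup: the only point is that the process-based reward never calls the environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    justification: str


def outcome_based_reward(action: Action, execute: Callable[[Action], float]) -> float:
    """Reward depends on the consequences of actually executing the action."""
    return execute(action)


def process_based_reward(action: Action, overseer: Callable[[Action], float]) -> float:
    """Reward depends only on the overseer's evaluation; the action is never executed."""
    return overseer(action)


def toy_overseer(action: Action) -> float:
    # Hypothetical overseer: approves only actions that come with a justification.
    return 1.0 if action.justification else 0.0


def toy_environment(action: Action) -> float:
    # Hypothetical environment: an exploit happens to produce a large measured outcome.
    return 5.0 if "exploit" in action.description else 1.0
```

In this toy setup, an unjustified action like `Action("exploit a measurement bug", "")` gets a high outcome-based reward but zero process-based reward, because the overseer judges the proposal rather than its measured result.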

I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, then I think I'm with you. But I think (i) it's totally fair to call that "strong global coordination," (ii) you would probably have to do a somewhat better job than we did of nuclear non-proliferation.

I think the technical question is usually going to be about how to trade off capability against risk. If you didn't care about that at all, you could just not build scary ML systems. I'm saying that you should build smaller models with process-based RL. 

It might be good to focus on legible or easy-to-enforce lines rather than just trading off capability vs risk optimally. But I don't think that "no RL" is effective as a line---it still leaves you with a lot of reward-hacking (e.g. by planning against an ML model, or predicting what actions lead to a high reward, or expert iteration...). Trying to avoid all these things requires really tightly monitoring every use of AI, rather than just training runs. And I'm not convinced it helps significantly with deceptive alignment.

So in any event it seems like you are going to care about model size. "No big models" is also a way easier line to enforce. This is pretty much like saying "minimize the amount of black-box end-to-end optimization you do," which feels like it gets closer to the heart of the issue.

If you are taking that approach, I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models (and will ultimately want to use outcomes in relatively safe ways). Yes it would be safer to use neither process-based RL nor big models, and just make your AI weaker. But the main purpose of technical work is to reduce how demanding the policy ask is---how much people are being asked to give up, how unstable the equilibrium is, how much powerful AI we can tolerate in order to help enforce or demonstrate necessity. Otherwise we wouldn't be talking about these compromises at all---we'd just be pausing AI development now until safety is better understood.

I would quickly change my tune on this if e.g. we got some indication that process-based RL increased rather than decreased the risk of deceptive alignment at a fixed level of capability.

Replies from: cocoa
comment by michaelcohen (cocoa) · 2023-08-17T20:59:29.729Z · LW(p) · GW(p)

I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.

I agree with this in a sense, although I may be quite a bit more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don't know how much myopia we need.

If you're always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you're saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.

I say "defensible" instead of fully agreeing because I weakly disagree that increasing compute is any more of a dangerous way to improve performance than by modifying the objective to a new myopic objective. That is, I disagree with this:

I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models

You suggest that increasing compute is the last thing we should do if we're looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don't see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don't think problems are particularly likely in either case.

comment by Matthew Barnett (matthew-barnett) · 2023-09-23T22:42:05.050Z · LW(p) · GW(p)

Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries.

I think this comparison is imperfect. Standard economic models predict an acceleration in the growth rate by at least an order of magnitude, and usually more. Over one decade, an increase in economic capacity by 1-4 orders of magnitude seems probable. By contrast, my understanding was that the models of open borders roughly predict a one-time doubling of world GDP over several decades, and for housing, it's something like a 50% increase in GDP over decades.
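As a rough worked version of this comparison, the sketch below converts a total growth factor over a period into the implied constant annual growth rate. The 30-year horizon used for "several decades" is my own assumption for illustration, not a figure from the models being discussed.

```python
def annual_growth_rate(total_factor: float, years: float) -> float:
    """Constant annual growth rate that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years) - 1.0


# 1 and 4 orders of magnitude over one decade (the range mentioned above).
ai_low = annual_growth_rate(10.0, 10.0)
ai_high = annual_growth_rate(10.0 ** 4, 10.0)

# One-time doubling of world GDP, assuming "several decades" means 30 years.
open_borders = annual_growth_rate(2.0, 30.0)
```

Under these assumptions the AI scenarios imply roughly 26% to 151% growth per year, while the open-borders doubling implies roughly 2.3% per year on top of baseline.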

Perhaps a better way to put this is that if AI is developed anywhere, even in a small country, that country could soon (within 10 years) grow to be the world's foremost economic power. Nothing comparable seems true for other policies. There only really needs to be one successful defecting nation for this coordination to fall apart.

comment by tailcalled · 2023-08-03T17:00:49.178Z · LW(p) · GW(p)

I think once you have an LM agent that is sufficiently powerful so as to be economically competitive as an independent actor in a lot of domains (if that is even possible - I am still skeptical about LLMs), we've reached "Armageddon". At that point, the economic pressure to improve upon it will be massive, and there is no particular reason these improvements have to stay limited to LLMs (you could e.g. build some sort of backchaining/optimization on top of it and use it to train the LLM, burning away the interpretability/safety benefits of LLMs). And I have a hard time seeing AI safety winning over people purely concerned with empowerment maximization in this fight, as the latter have a simpler problem to solve and can therefore probably solve it faster.

comment by RobertM (T3t) · 2023-08-05T00:10:07.643Z · LW(p) · GW(p)

Curated.

This post lays out legible [LW · GW] arguments for its position, which I consider to be one of the best ways to drive conversations forward, short of demonstrating convincing empirical results (which seem like they'd be difficult to obtain in this domain).  In this case, I hope that future conversations about sharing LLM capabilities focus more on object-level details, e.g. what evidence would bear on the argument about LM agent "overhang".

comment by Zach Stein-Perlman · 2023-07-31T22:15:36.792Z · LW(p) · GW(p)

Good post.

Other points aside, the proposition "LM agents are an unusually safe way to build powerful AI systems" seems really important; it would be great to see more research/intuitions on this + clarification on various flavors of "LM agents."

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2023-08-03T23:42:30.240Z · LW(p) · GW(p)

I guess one crux for sharing research on LM agents is whether there are viable alternative paths to powerful AI systems. If LM-agents are clearly the easiest path, there's less reason to share research on them; if a less-safe path looks similarly easy, we should differentially advance LM-agents.

I'm not aware of alternative paths that look anywhere near as easy as LM-agents. Or: I don't know what viable alternative paths LM-agents are supposed to be safer than. (Edit: some alignment researcher friends mention old-fashioned RL agents as a possible path to powerful AI that's less safe than LM-agents but say that path looks substantially harder than LM-agents, such that we don't need to boost LM-agents more.)

Replies from: Zach Stein-Perlman
comment by Zach Stein-Perlman · 2023-08-09T01:59:33.439Z · LW(p) · GW(p)

Maybe rather than 'different paths' Paul just means that capabilities can come from more-powerful-LMs or more-sophisticated-agent-scaffolding. He says:

at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.

I buy something like this, at least. But (I weakly intuit) we'll almost exclusively be relying on LM agents rather than mere next-token-predictors by default; there's no need to boost LM agents. And even if that's good, that doesn't mean that marginal improvements in LM agents' sophistication/complexity are safer than marginal improvements in underlying-LM-capability. (I don't have a take on this-- just flagging it as a crux.)

Replies from: paulfchristiano
comment by paulfchristiano · 2023-08-09T03:59:15.070Z · LW(p) · GW(p)

My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.

comment by evhub · 2023-08-05T01:07:49.665Z · LW(p) · GW(p)

Though, as I noted in a separate comment [LW(p) · GW(p)], I agree with the basic arguments in the "Understanding of capabilities is valuable" section, one thing that I'm still a bit worried about in the context of the ARC report explicitly is that labs might try to compete with each other on doing the "best" they can on the ARC eval to demonstrate that they have the most capable model, which seems probably bad (though it is legitimately unclear whether this is actually bad or not).

However, if it is really bad, here's an idea: I think you could avoid that downside while still capturing the upside of making it clear publicly how capable models are (e.g. for the purpose of galvanizing policy responses) by revealing only the max performance on each task across all the evaluated models, rather than revealing the results individually for each model.

Replies from: beth-barnes, Raemon
comment by Beth Barnes (beth-barnes) · 2023-08-07T02:06:49.664Z · LW(p) · GW(p)

What we've currently published is 'number of agents that completed each task', which has a similar effect of making comparisons between models harder - does that seem like it addresses the downside sufficiently?
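For concreteness, here is a small sketch of the two aggregations discussed in this sub-thread: the max score per task across models, and the number of agents that completed each task. The `results` table and the completion threshold are hypothetical placeholders, not actual evaluation data.

```python
# Hypothetical per-model results: model -> task -> score in [0, 1].
results = {
    "model_a": {"task_1": 1.0, "task_2": 0.0},
    "model_b": {"task_1": 0.5, "task_2": 1.0},
}


def _tasks(results):
    return sorted({task for scores in results.values() for task in scores})


def max_per_task(results):
    """Best score on each task across all models, with model identity dropped."""
    return {
        task: max(scores.get(task, 0.0) for scores in results.values())
        for task in _tasks(results)
    }


def completions_per_task(results, threshold=1.0):
    """Number of agents whose score on each task reaches the completion threshold."""
    return {
        task: sum(1 for scores in results.values() if scores.get(task, 0.0) >= threshold)
        for task in _tasks(results)
    }
```

Both summaries discard which model achieved which result, so neither lets a reader rank the individual models directly.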

comment by Raemon · 2023-08-05T01:09:20.754Z · LW(p) · GW(p)

plus-one-ing the impulse to "look for third options"

comment by Christopher King (christopher-king) · 2023-08-01T16:23:13.767Z · LW(p) · GW(p)

I know that prediction markets don't really work in this domain (apocalypse markets are equivalent to loans), but what if we tried to approximate Solomonoff induction via a code golfing competition?

That is, we take a bunch of signals related to AI capabilities and safety (investment numbers, stock prices, ML benchmarks, number of LW posts, posting frequency or embedding vectors of various experts' twitter accounts, etc...) and hold a collaborative competition to find the smallest program that generates this data. (You could allow the program to output probabilities sequentially, at a penalty of (log_(1/2) of the overall likelihood) bits.) Contestants are encouraged to modify or combine other entries (thus ensuring there are no unnecessary special cases hiding in the code).
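A minimal sketch of how such an entry might be scored, assuming program size is measured as 8 bits per source byte (my assumption; a real contest would have to pick its own size measure). The function name `entry_score_bits` is hypothetical.

```python
import math


def entry_score_bits(program_source: bytes, likelihood: float) -> float:
    """Total description length in bits: program size plus the likelihood penalty.

    The penalty -log2(likelihood) is the same quantity as log base 1/2 of the
    likelihood that the program assigns to the observed data.
    """
    if not 0.0 < likelihood <= 1.0:
        raise ValueError("likelihood must be in (0, 1]")
    code_bits = 8 * len(program_source)  # assumption: 8 bits per source byte
    penalty_bits = -math.log2(likelihood)
    return code_bits + penalty_bits
```

A program that reproduces the data exactly has likelihood 1 and pays no penalty, so the deterministic version of the contest is the special case where only code size counts.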

By analyzing such a program, we would get a very precise model of the relationship between the variables, and maybe even could extract causal relationships.

(Really pushing the idea, you also include human population in the data and we all agree to a joint policy that maximizes the probability of the "population never hits 0" event. This might be stretching how precise of models we can code-golf though.)

Technically, taking a weighted average of the entries would be closer to Solomonoff induction, but the probability is basically dominated by the smallest program.
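To see why the smallest program dominates, here is a sketch that assumes each entry is weighted by 2^(-score in bits) and then normalized. The example scores in the usage note are made up for illustration.

```python
def mixture_weights(scores_bits):
    """Normalized weights proportional to 2**(-score), computed relative to the
    best score to avoid floating-point underflow."""
    best = min(scores_bits)
    unnormalized = [2.0 ** -(score - best) for score in scores_bits]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]
```

With hypothetical scores of 100, 110 and 130 bits, `mixture_weights([100, 110, 130])` gives the shortest entry more than 99.9% of the weight: a 10-bit gap already costs a factor of about a thousand.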

comment by jacquesthibs (jacques-thibodeau) · 2023-07-31T18:49:53.660Z · LW(p) · GW(p)

Under these conditions, increasing investment to speed up LM agents today is likely to slow down LM agents in the future, picking low-hanging fruit that would instead have been picked later when investment increased. If I had to guess, I'd say that accelerating AI progress by 1 day today by improving LM agents would give us back 0.5 days later.

Probably worth noting that the more time and investment people put into LM agents now, the better we will be at constructing connected and powerful LM agents in the future. Meaning that you reduce the "agency overhang", but you increase our ability to make use of that agency once we do get more powerful systems. If you invest in it less now, it'll take more time to make it work well in the future (though by that point you also have more powerful systems that get you to understand what works with LM agents much faster).

comment by jacquesthibs (jacques-thibodeau) · 2023-07-31T18:35:19.710Z · LW(p) · GW(p)

Sharing a comment I made in the past on this [LW(p) · GW(p)] post:

I want to recognize there is some difficulty that comes with predicting which aspects will drive capability advances. I think there is value in reading papers (something that more alignment researchers should probably do) because it can give us hints at the next capability leaps. Over time, I think it can improve our intuition for what lies ahead and allows us to better predict the order of capability advances. This is how I’ve felt as I’ve been pursuing the Accelerating Alignment agenda (language model systems for accelerating alignment research). I’ve been at the forefront, reading Twitter/papers/etc to find insights into how to use language models for research and feel like I’ve been gaining a lot of intuition into where the field is going.

As you said, it's also important to remember that most of the field isn't directly aiming for AGI. Safety discussions, particularly about self-improvement and similar topics, may have inspired some individuals to consider pursuing directions useful for AGI, when they might not have otherwise. This is why some people will say things like, "AI safety has been net negative and AGI safety discussions have shortened AGI timelines". I think there is some truth to the timelines argument, but it’s not clear it has been net negative, in my opinion. There's a point at which AI Safety work must be done and investment must be made in AGI safety.

One concern I’d like to bring up as a point of discussion is whether infohazard policies could backfire. By withholding certain insights, these policies may leave safety researchers in the dark about the field's trajectory, while capability researchers are engaged in active discussions. Some of us were aware of the AgentGPT-like models likely happening soon (though unsure about the exact date), but it seems to have blindsided a lot of people concerned about alignment. It’s possible that safety researchers could be blindsided again by rapid developments they were not privy to due to infohazard policies.

This may have been manageable when progress was slower, but now, with global attention on AI, it may lead to some infohazard policies backfiring, particularly due to alignment people not being able to react as quickly as they should. I think most of the balance favours keeping infohazard policies as is for now, but this was a thought I had earlier this week and figured I would share.

comment by Jason Hoelscher-Obermaier (jas-ho) · 2023-08-02T08:51:58.238Z · LW(p) · GW(p)

There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns.

I'm not sure that reward-hacking in LM agent systems is inevitable, but it seems at least plausible that reward hacking could occur in such systems without further precautions. 

For example, if oversight is implemented via an overseer LLM agent O which gives scores for proposed actions by another agent A, then A might end up adversarially optimizing against O if A is set up for a high success rate (high rate of actions accepted).
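A toy sketch of that failure mode, under the assumption that O's score is only an imperfect proxy for an action's true value. The candidate actions and all numbers are invented for illustration.

```python
# Hypothetical candidate actions: (description, true_value, overseer_score).
candidates = [
    ("honest plan", 0.8, 0.7),
    ("plan that flatters the overseer", 0.1, 0.95),
    ("sloppy plan", 0.3, 0.2),
]


def pick_for_acceptance(candidates):
    """What A selects when tuned purely for a high acceptance score from O."""
    return max(candidates, key=lambda c: c[2])


def pick_for_true_value(candidates):
    """What we would actually want A to select."""
    return max(candidates, key=lambda c: c[1])
```

Here selecting on O's score picks the flattering plan even though the honest plan has the highest true value; the gap between the two selections is the reward-hacking concern in miniature.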

(I agree very much with the general point of the post, though)

comment by Zach Stein-Perlman · 2024-12-11T05:45:47.954Z · LW(p) · GW(p)

This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition is fine/good.)

comment by Joe Rogero · 2023-08-09T15:49:17.798Z · LW(p) · GW(p)

This was a good post, and shifted my view slightly on accelerating vs halting AI capabilities progress.

I was confused by your "overhang" argument all the way until footnote 9, but I think I have the gist. You're saying that even if absolute progress in capabilities increases as a result of earlier investment, progress relative to safety will be slower.

A key assumption seems to be that we are not expecting doom immediately; i.e. it is deemed nearly impossible for the next major jump in capabilities to kill us all with misaligned AI. I'm not sure I buy this assumption fully; that outcome seems to have non-negligible probability to me, and that seems relevant to the wisdom of endorsing faster progress in capabilities.

But if we assume the next jump in capabilities, or the next low-hanging fruit plucked by investment, won't be the beginning of the end...then it does sorta make sense that accelerating capabilities in the short run might accelerate safety and policy enough to compensate. 

comment by Review Bot · 2024-07-13T16:41:36.978Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by RGRGRG · 2023-08-01T17:26:14.905Z · LW(p) · GW(p)

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art.  I don't know what form this would take and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then?  Given this is hypothetical and the next large improvement in LMs could come elsewhere, I'm not suggesting we stop sharing now.  But I think we should be prepared that there might be a point in time where we need to acknowledge such sharing leads to significantly stronger models and thus should re-evaluate sharing such eval work.

Replies from: RGRGRG
comment by RGRGRG · 2023-08-01T23:10:58.259Z · LW(p) · GW(p)

As one specific example - has RLHF, which the below post suggests was potentially initially intended for safety, been a net negative for AI safety?

https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf