fake alignment solutions????

post by KvmanThinking (avery-liu) · 2024-12-11T03:31:53.740Z · LW · GW · 3 comments

This is a question post.

Suppose the long-term risk center's researchers, or a random group of teenage nerd hackers, or whatever, come up with what they call an "alignment solution". A really complicated and esoteric, yet somehow elegant, way of describing what we really, really value, and cramming it into a big mess of virtual neurons. Suppose Eliezer and Tammy and Hanson and Wentworth and everyone else all go and look at the "alignment solution" very carefully for a very long time, and do not find any flaws in it. Lastly, suppose they test it on a weak AI and the AI immediately stops producing strange outputs/deceiving supervisors/specification gaming, and starts acting super nice and reasonable.

Great, right? Awesome, right? We won eternal eutopia, right? Our hard work finally paid off, right?

Even if this were to happen, I would still be hesitant to plug our new, shiny solution into a superintelligence and hit run. I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot. That for every true alignment solution, there are dozens of fake ones.

 

Is this something that I should be seriously concerned about?

Answers

Comments sorted by top scores.

comment by johnswentworth · 2024-12-11T13:53:16.816Z · LW(p) · GW(p)

Note that Eliezer and I and probably Tammy would all tell you that "can't see how it fails" is not a very useful bar to aim for. At the bare minimum, we should have both (1) a compelling positive case that it in fact works, and (2) a compelling positive case that even if we missed something crucial, failure is likely to be recoverable rather than catastrophic.

comment by quila · 2024-12-11T04:34:19.764Z · LW(p) · GW(p)

That for every true alignment solution, there are dozens of fake ones.

Is this something that I should be seriously concerned about?

if you truly believe in a 1-to-dozens ratio between[1] real and 'fake' (endorsed by eliezer and others but unnoticedly flawed) solutions, then yes. in that case, you would naturally favor something like human intelligence augmentation, at least if you thought it had a chance of succeeding greater than p(chance of a solution being proposed which eliezer and others deem correct) × 1/24
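A rough restatement of that comparison (treating 1/24 as the stand-in for "dozens" of flawed-but-endorsed solutions per real one):

$$
P(\text{augmentation succeeds}) \;>\; P(\text{a solution is proposed that Eliezer and others deem correct}) \times \frac{1}{24}
$$

That is, augmentation is favored whenever its chance of success exceeds the chance of ending up with an endorsed solution, discounted by the assumed fraction of endorsed solutions that are actually correct.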

I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot

i suggest writing why you believe that. in particular, how do you estimate the prevalence of 'unnoticeably flawed' alignment solutions given they are unnoticeable (to any human)?[2] where does "for every true alignment solution, there are dozens of fake ones" come from?

A really complicated and esoteric, yet somehow elegant

why does the proposed alignment solution have to be really complicated? overlooked mistakes become more likely as complexity increases, so this premise favors your conclusion.

  1. ^

    (out of the ones which might be proposed, to avoid technicalities about infinite or implausible-to-be-thought-of proposals)

  2. ^

    (there are ways you could estimate this in principle, for example if there were a pattern of the researchers continually noticing they made increasingly unintuitive errors up until the 'final' version (where they no longer notice, but presumably this pattern would be noticed); or extrapolation from general principles about {some class of programming that includes alignment} (?) being easy to make hard-to-notice unintuitive mistakes in)

Replies from: avery-liu
comment by KvmanThinking (avery-liu) · 2024-12-11T12:27:58.240Z · LW(p) · GW(p)
  1. Yes, human intelligence augmentation sounds like a good idea.
  2. There are all sorts of "strategies" (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they're new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable to people who specifically know how AI behaves. Suppose there are strategies whose ineffectiveness is only obvious and explainable to people who know way more about decisions and agents and optimal strategies and stuff than humanity has figured out thus far. (Analogy: a society that knows only basic arithmetic could reasonably stumble upon and understand the Collatz conjecture; and yet, with all our mathematical development, we can't do anything to prove it. Just as we could reasonably stumble upon an "alignment solution" that we can't show would fail, because showing that would take a much deeper understanding of these kinds of situations.)
  3. If the solution to alignment were simple, we would have found it by now. Humans are far from simple, human brains are far from simple, human behavior is far from simple. That there is one simple thing from which all of our values come, or a simple way to derive such a thing, just seems unlikely.