fake alignment solutions????

post by KvmanThinking (avery-liu) · 2024-12-11T03:31:53.740Z · LW · GW · 3 comments

This is a question post.

Suppose the long-term risk center's researchers, or a random group of teenage nerd hackers, or whatever, come up with what they call an "alignment solution". A really complicated and esoteric, yet somehow elegant, way of describing what we really, really value, and cramming it into a big mess of virtual neurons. Suppose Eliezer and Tammy and Hanson and Wentworth and everyone else all go and look at the "alignment solution" very carefully for a very long time, and do not find any flaws in it. Lastly, suppose they test it on a weak AI and the AI immediately stops producing strange outputs/deceiving supervisors/specification gaming, and starts acting super nice and reasonable.

Great, right? Awesome, right? We won eternal eutopia, right? Our hard work finally paid off, right?

Even if this were to happen, I would still be hesitant to plug our new, shiny solution into a superintelligence and hit run. I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot. That for every true alignment solution, there are dozens of fake ones.

 

Is this something that I should be seriously concerned about?

Answers

Comments sorted by top scores.

comment by johnswentworth · 2024-12-11T13:53:16.816Z · LW(p) · GW(p)

Note that Eliezer and I and probably Tammy would all tell you that "can't see how it fails" is not a very useful bar to aim for. At the bare minimum, we should have both (1) a compelling positive case that it in fact works, and (2) a compelling positive case that even if we missed something crucial, failure is likely to be recoverable rather than catastrophic.

comment by quila · 2024-12-11T04:34:19.764Z · LW(p) · GW(p)

That for every true alignment solution, there are dozens of fake ones.

Is this something that I should be seriously concerned about?

if you truly believe in a 1-to-dozens ratio between[1] real and 'fake' (endorsed by eliezer and others but unnoticedly flawed) solutions, then yes. in that case, you would naturally favor something like human intelligence augmentation, at least if you thought it had a chance of succeeding greater than p(chance of a solution being proposed which eliezer and others deem correct) × 1/24
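A rough restatement of that comparison (treating 1/24 as the stand-in for "dozens" of flawed-but-endorsed solutions per real one):

$$
P(\text{augmentation succeeds}) \;>\; P(\text{a solution is proposed that Eliezer and others deem correct}) \times \frac{1}{24}
$$

That is, augmentation is favored whenever its chance of success exceeds the chance of ending up with an endorsed solution, discounted by the assumed fraction of endorsed solutions that are actually correct.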

I believe that before we stumble on an alignment solution, we will stumble upon an "alignment solution" - something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot

i suggest writing why you believe that. in particular, how do you estimate the prevalence of 'unnoticeably flawed' alignment solutions given they are unnoticeable (to any human)?[2] where does "for every true alignment solution, there are dozens of fake ones" come from?

A really complicated and esoteric, yet somehow elegant

why does the proposed alignment solution have to be really complicated? overlooked mistakes become more likely as complexity increases, so this premise favors your conclusion.

  1. ^

    (out of the ones which might be proposed, to avoid technicalities about infinite or implausible-to-be-thought-of proposals)

  2. ^

    (there are ways you could estimate this in principle, for example if there were a pattern of the researchers continually noticing they made increasingly unintuitive errors up until the 'final' version (where they no longer notice, but presumably this pattern would be noticed); or extrapolation from general principles about {some class of programming that includes alignment} (?) being easy to make hard-to-notice unintuitive mistakes in)

Replies from: avery-liu
comment by KvmanThinking (avery-liu) · 2024-12-11T12:27:58.240Z · LW(p) · GW(p)
  1. Yes, human intelligence augmentation sounds like a good idea.
  2. There are all sorts of "strategies" (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they're new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable to people who specifically know how AI behaves. Suppose there are strategies whose ineffectiveness is only obvious and explainable to people who know way more about decisions and agents and optimal strategies and stuff than humanity has figured out thus far. (Analogy: a society that knows only basic arithmetic could reasonably stumble upon and understand the Collatz conjecture; and yet, with all our mathematical development, we can't do anything to prove it. Just as we could reasonably stumble upon an "alignment solution" that we can't show would fail, because showing that would take a much deeper understanding of these kinds of situations.)
  3. If the solution to alignment were simple, we would have found it by now. Humans are far from simple, human brains are far from simple, human behavior is far from simple. That there is one simple thing from which all of our values come, or a simple way to derive such a thing, just seems unlikely.