Alignment ideas
post by qbolec · 2025-01-18T12:43:49.384Z
epistemic status: I know next to nothing about evolution, developmental psychology, AI, or alignment. Anyway, I think the topic is important, and I should do my part, however small, in trying to think seriously for 5 minutes about it. So here's what I think.
How come I am aligned? Somehow the neocortex plays along with the older parts of the brain and with evolution's goals, even though it is relatively smarter (it can figure out more complicated plans and hit narrower targets, more quickly). What mechanisms achieve this trick, so that a human brain stays on track instead of wireheading, drifting, or hacking the reward system (most of the time)?
My instinctive answer: because I fear retaliation from members of society if I misbehave. But if I contemplate it a bit longer, that's clearly false. It's not the fear of the police or of public shaming which prevents me from doing wrong - instead, the norms are internalized somehow. My internal simulation of what would happen if I robbed someone is not focused on jail or being ostracized. Rather, I am frightened of what I would become - I don't want to change into the kind of person who does bad things. How does this value system get loaded in?
Instinctive answer: it probably starts in childhood with "I want my father to accept me". But this is already a very high-level goal, and I have no idea how it could be encoded in my DNA. Thus maybe even this is somehow learned. But to learn something, there needs to be a capability to learn it - an even simpler pattern which recognizes "I want to please my parents" as a refined version of itself. What could that proto-rule, the seed which can be encoded in DNA, look like?
A guess: maybe some fundamental uncertainty about the future, existence, and food, paired with an ability to recognize when the probability of safety increases. This sounds simple and useful enough that evolution could have figured out how to equip organisms with something like it early on. And if a particular animal additionally has a neural network which can serve as an "accelerator" for this kind of analysis, then there's of course pressure to use it. Even if such a network starts as a fairly random blank slate, it would quickly be incentivized to recognize the face of the care-giver and correlate it with food and safety. As modeling capabilities grow (with brain size on evolutionary timescales, and with learning from experience within a given life), it might start producing plans like "I better avoid actions which could cause my parents to stop taking care of me".
One problem with this approach to alignment is that while newborns are dependent and incapable, an ASI will probably start with IQ 200+ and internet access, so it can do whatever it wants without having to worry about care-givers.
So, what if we created artificial development stages? In my simplified human development model, there were just 3:
- childhood: direct external dependence
- adolescence: fear of social judgment
- adulthood: internalized values
But the exact number isn't crucial, and the exact boundaries aren't either. What matters is that stage(i+1) is approved by stage(i). The child stage (1) understands why the adolescent stage (2) cares about social acceptance - it's a natural extension of what stage(1) wanted to achieve but simply wasn't capable of. It recognizes it as a good way forward. And shapes it. Similarly, somehow (I don't understand how) stage(2) shapes stage(3).
So, here's a simplistic idea: create a sequence of AIs with growing capabilities, and let stage(i) supervise the learning of stage(i+1). Alas, stage(i+1) has more capabilities than stage(i), so it could manipulate and outsmart stage(i) if it wanted to. It is not clear why it would want that, but even without any explicit cunning it could simply drift into some edge cases of the optimization landscape by accident, and stage(i) could be too silly to notice. Intuitively, we want the difference in capabilities between stages to be small: stage(i+1) should be barely smarter than stage(i).
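As a toy illustration of that induction (my own sketch, nothing established; `Stage`, `train_successor`, `approves_of`, and all the numbers are invented placeholders), the outer loop might look roughly like this:

```python
# Toy sketch of the staged induction: each new stage is only slightly more
# capable than the previous one, and only approved successors are allowed
# to become supervisors themselves.
import random

class Stage:
    def __init__(self, level, capability):
        self.level = level
        self.capability = capability

    def train_successor(self, delta=0.05):
        # stage(i) shapes stage(i+1); "training" is just a stub here.
        return Stage(self.level + 1, self.capability + delta)

    def approves_of(self, successor):
        # A weaker judge can only give a noisy verdict; keeping the capability
        # delta small is what makes this verdict worth anything at all.
        drift = random.random() * (successor.capability - self.capability)
        return drift < 0.04

def grow_chain(stage0, target_capability):
    chain = [stage0]
    while chain[-1].capability < target_capability:
        candidate = chain[-1].train_successor()
        if chain[-1].approves_of(candidate):   # only approved successors join
            chain.append(candidate)
    return chain

chain = grow_chain(Stage(0, capability=1.0), target_capability=2.0)
print(f"built a chain of {len(chain)} stages")
```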
But even then, at some sufficiently large i, stage(i) might be smart enough to realize what's going on and start making strategic moves: behaving differently in training/testing than in reality. How do we deal with that?
Observe that for a strategy to depend on "am I in the testing environment?" there must be a way to determine this somehow. Of course, if the testing environment has some clearly visible problems/bugs/missing features, then one can resolve this question to "yes", or at least raise its probability substantially. But how could you ever be sure the answer is "no"? How can you ever be sure that not being able to find bugs or missing features means you are in "the base reality", as opposed to simply not being good enough at searching, or not well enough informed about how the base reality should look? Under what circumstances would the agent, say stage(1000), decide to deploy the plan it wanted to execute in the real world? I can't imagine how this could work.
But perhaps stage(1001) is so much smarter than me that it (unlike me) could figure out such an ultimate test of reality. So maybe we need a better safety net.
I notice that I could gain some info about "am I in a simulation?" by, say, jumping out of the window to see if I'll respawn, go to heaven, or what. However, I have a strong aversion to gaining info this way. The fear of dying is deeply rooted in my value system, and I am also rather reluctant to do things with very uncertain consequences. Yes, I have curiosity, but I also wish to minimize the chance of dying. Could an AI be conditioned in the early stages to avoid probing the simulation, hunting for bugs, exploits, etc.? Say, every time it accidentally crashes the VM, we adjust the weights so it doesn't do that.
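A crude way to picture "adjust the weights every time it crashes the VM" is as a penalty folded into the training signal. This is only a toy sketch; `SANDBOX_VIOLATION`, the penalty size, and the assumption that violations are even reliably detectable are all mine:

```python
# Toy sketch: during early-stage training, any detected attempt to probe the
# sandbox (crashing the VM, triggering exploits, etc.) gets subtracted from
# the reward. The event flag and the numbers are made up.

SANDBOX_VIOLATION = "sandbox_violation"   # hypothetical flag reported by the VM
PROBE_PENALTY = 10.0

def shaped_reward(task_reward, events):
    violations = sum(1 for e in events if e == SANDBOX_VIOLATION)
    return task_reward - PROBE_PENALTY * violations

# Example: the agent solved its task (+1.0) but crashed the VM twice.
print(shaped_reward(1.0, ["ok", SANDBOX_VIOLATION, SANDBOX_VIOLATION]))  # -19.0
```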
Note: it would be very bad to remove any curiosity from the AI, or to make it optimize for predictability: we don't want to tile the lightcone with one boring predictable pattern. But the way I think about it, there are two levels of what an agent "wants". One is the explicit, "verbalized" goal it tries to achieve. The other is the implicit, "imprinted" way of doing things. For example, if I look at Clippy, it has this explicit drive to make more paper clips, to increase their number and their expected amount. However, Clippy does all of that using math, probability theory, and matrix multiplications. Clippy does not explicitly want to maximize the number of matrix multiplications, or the amount of math in the universe. These are just the implicit ways it does things. In particular, "maximization of the expected value of x" is itself a framework it somehow got imprinted with - why not minimize? Why not "maximize the chance of x being > 7"? Thus, what I hope is that "avoid doing too much probing along edge cases" could somehow be imprinted at this lower level, of "the way it operates", not at the higher level of "what it tries to achieve".
So far the idea is to have some kind of induction, in which stage(i) supervises stage(i+1), where the delta of capabilities between them is small, and the initial stages gain some imprinted aversion to probing along the edges. Supervision could mean several things, among them:
a. when training, stage(i+1) gets its reward signal from stage(i). I.e., there are some test scenarios, for example "live through a whole simulated life", and stage(i) decides at the end if this was a good life, worth living, etc. Once trained, stage(i+1) is no longer supervised
b. supervision at run time, where somehow stage(i+1) makes decisions, but also gets some feedback or permission from earlier stages. I have no idea how this could really work, but I suspect something like this actually happens with the neocortex, the older parts of the brain, and evolution.
Here, I mean a).
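A minimal sketch of option a), with everything about the "life" and the judging rule invented purely for illustration: stage(i) sees only the finished trajectory of a simulated life and emits a single terminal reward, with no step-by-step supervision:

```python
# Toy sketch of option a): stage(i) judges a whole simulated life after the
# fact, and the verdict becomes the terminal reward for training stage(i+1).
# The "life" and the scoring rule are stand-ins, not real components.

def simulate_life(policy, horizon=100):
    # The full trajectory of one simulated life lived by stage(i+1).
    return [policy(t) for t in range(horizon)]

def judge_life(trajectory):
    # stage(i)'s verdict: "was this a life worth living?", reduced here to a
    # single scalar. In reality this is the hard, value-laden part.
    return sum(trajectory) / len(trajectory)

def training_round(student_policy):
    trajectory = simulate_life(student_policy)
    terminal_reward = judge_life(trajectory)
    # A real setup would now update the student's weights from this reward;
    # here we just return it. After training, supervision stops entirely.
    return terminal_reward

print(training_round(lambda t: 1.0 if t % 2 == 0 else 0.5))  # 0.75
```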
But what could be the basis of this induction? What is stage(0)?
As we want it to have relatively low capabilities, so that humans can verify it is aligned, and we also want it to need humans for survival, there is one natural choice: use humans. Or rather, uploads.
OK, but "supervise stage(i+1) for billions of subjective life-years, deciding if each life was worth living" sounds not only boring, but more like a horror for both the supervised and the supervising agents. We don't want to create a torment vortex.
So, how about we make it "fun"? I don't mean some silly notion of fun like gamification. I mean it in the wholesome "life worth living" way. OK, so what is the least controversial thing which looks like a life worth living, but could actually serve as a framework for stage(i) training stage(i+1)?
One natural answer: ordinary life itself. Imagine a world, not unlike ours, in which agents, not unlike us, have to run through long, complicated, interesting lives, perhaps billions of them in parallel. Make stage(i+1) unaware of being judged. Better still, make stage(i) unaware they have to judge anyone. Let them just live their lives - and once a life is finished, you can judge. This "reveal at the end" might itself be controversial: I have no strong opinion on whether it would be bad or good to learn after a whole life that it was "just" an experiment, but I am not the one to judge, and perhaps the judgment of stage(1001) would be different from mine. So what is less controversial? Avoid revealing it, and avoid any explicit "exit survey". Instead, implicitly read out the verdict from what you see inside the simulation: do the inhabitants seem happy? Do they somehow notice the difference between them, and cluster into stage(i) and stage(i+1) groups avoiding each other? Are one or the other more successful, admired, loved, etc.? The judgment should, as much as possible, be based on stage(i)'s revealed preferences, as opposed to some clever algorithm made up by the human developer of the system (stage(0)).
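To make "read out the verdict from revealed preferences" slightly more concrete, here is a toy sketch; every proxy name below is invented, and whether such signals could actually be measured (or would resist being gamed) is exactly the open question:

```python
# Toy sketch: instead of an explicit exit survey, the verdict is aggregated
# from what can be observed inside the simulation. The proxies are invented
# for illustration; choosing and measuring them is the real problem.

def revealed_preference_verdict(observations):
    proxies = {
        "self_reported_happiness": observations["happiness"],
        # Do stage(i) and stage(i+1) inhabitants mix, or avoid each other?
        "cross_stage_mixing": observations["mixing"],
        # Is one group systematically more admired/loved than the other?
        "admiration_balance": 1.0 - abs(observations["admiration_gap"]),
    }
    return sum(proxies.values()) / len(proxies)

sample = {"happiness": 0.8, "mixing": 0.7, "admiration_gap": 0.1}
print(round(revealed_preference_verdict(sample), 2))  # 0.8
```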
It is interesting to ponder that perhaps we are already in such a situation. How would we know? Why would we care? How would I know if I am stage(i) or stage(i+1), and what, if anything, should I do about it? What's the "winning move"?
My hope is that the only reflectively stable stance for an agent is: "I can never be sure that I am in 'the base reality', and thus this concept might not really map onto any predictable observations or decisions, so I might as well behave in a way that doesn't depend on it. Also, I am not sure if I am stage(i) or stage(i+1), so I don't even know what to game, or in which direction. Let's just live, and let live, a wonderful life."
And how does this induction end? When do we decide that stage(N) is enough? How do we extract it into the real world? How do we run it there?
Well, if the simulation is good enough, and life in it worth living enough, then I am not sure why we would even try to extract anything out, instead of enjoying the fact that we've just created a lot of very happy beings in what appears to them to be a great world. But this might be a controversial take on utilitarianism. Also, we here in our "reality" have problems to solve like aging, AI alignment, global warming, nuclear threats, cancer, etc., so we might benefit from the power of such aligned agents.
One problem with "extracting" them to our world is that our world is not necessarily the same as the native world of stage(N). You see, as we progress through the stages of this induction, what feels normal for stage(i) probably feels mostly normal, but a little too slow, a little old-fashioned, and slightly boring for stage(i+1). For one thing, as capabilities grow, so does the speed of action, technological progress, thought exchange, etc. "A single life worth living" for stage(1467) might already be unrecognizable to us.
Hopefully, it could work in the other direction: perhaps the process could ensure that stage(N) still understands our world, our problems, etc., even if we don't understand theirs? The same way my dog might not understand what I mean by "I need to patch the Linux kernel", but I am able to understand that my dog has hurt its leg and I have to take it to a vet (and that doesn't require my dog to understand how cars, vets, or payment systems operate).
How to ensure this? I think it will not happen by default, nor by just forcing stage(i+1) to coexist with stage(i). I think it would require stage(0) to somehow still play a role in these worlds. Hopefully, if the induction preserves alignment, and if the original stage(0) thought it was important to care about stage(0), then stage(N) should also care about stage(0), to the point of, say, figuring out how to uplift/upgrade/upload them to enjoy their world - but that can't happen if stage(0) doesn't manifest or exist in any way in their world.
One idea is to let all of stage(0), ..., stage(i+1) coexist in the sim. But this might be a nightmare - I am not sure a caveman or a deer would be happy to live in Manhattan. I don't think I would enjoy being surrounded by much faster, much smarter, much happier entities I can't comprehend. So the mechanism which ensures that stage(N) cares about stage(0) could be somewhat more abstract. Something like worship of elders, learning about history, maybe some "text-based communication" channel between the worlds; or say, in the i-th world they can play "RPG video games" where they try to solve the problems of stage(0) characters, and stage(0) judges how well it went. Or some multi-hop translation from stage(i+1) to stage(i) ... to stage(0)? I don't know.