Naive Hypotheses on AI Alignment

post by Shoshannah Tekofsky (DarkSym) · 2022-07-02T19:03:49.458Z · LW · GW · 29 comments

Contents

  H1 - Emotional Empathy
  H2 - Silo AI
  H3 - Kill Switch 
  H4 - Human Alignment
  Thoughts on Corrigibility
  Side Thoughts - Researcher Bias

Apparently doominess works for my brain, cause Eliezer Yudkowsky's AGI Ruin: A List of Lethalities [LW · GW] convinced me to look into AI safety. Either I'd find out he's wrong, and there is no problem. Or he's right, and I need to reevaluate my life priorities.


After a month of sporadic reading, I've learned the field is considered to be in a state of preparadigmicity [LW · GW]. In other words, we don't know *how* to think about the problem yet, and thus novelty comes at a premium. The best way to generate novel ideas is to pull in people from other disciplines. In my case that's computational psychology: modeling people as agents. And I've mostly applied this to video games. My Pareto frontier [LW · GW] is "modeling people as agents based on their behavior logs in games constructed to trigger reward signals + ITT'ing the hell out of all the new people I love to constantly meet". I have no idea if this background makes me more or less likely to generate a new idea that's useful for solving AI alignment, but the way I understand the problem now: everyone should at least try.

So I started studying AI alignment, but quickly realized there is a trade-off: the more I learn, the harder it is to think of anything new. At first I had a lot of naive ideas on how to solve the alignment problem. As I learned more about the field, my ideas all crumbled. At the same time, I can't really assess yet whether there is a useful level of novelty in my naive hypotheses. I'm currently generating ideas low on "contamination" by existing thought (cause I'm new), but also low on quality (cause I'm new). As I learn more, I'll start generating higher-quality hypotheses, but these are likely to become increasingly constrained to the existing schools of thought, because of cognitive contamination from everyone reading the same material and thinking in similar ways. Which is exactly the thing we want to avoid at this stage.

Therefore, to get the best of both worlds, I figured I'd write down my naive hypotheses as I have them, and keep studying at the same time. Maybe an ostensibly "stupid" idea on my end inspires someone with more experience to a workable idea on their end. Even if the probability of that is <0.1%, it's still worth it. Cause, you know... I prefer we don't all die.

So here goes:
 

H1 - Emotional Empathy

If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone's life better according to their own wishes. Only a subset of humans has this trait. What is it, and can we integrate it into the reward function of an AGI?
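To make the hypothesis slightly more concrete, here is a minimal toy sketch of what "integrating empathy into the reward function" could look like. Everything in it (the weighting scheme, the welfare estimates) is a hypothetical illustration, not a proposal for how an actual AGI's reward would be specified:

```python
# Toy sketch: an "empathic" reward that mixes the agent's own task reward with
# its estimate of other agents' welfare. All names, numbers, and the averaging
# scheme are hypothetical placeholders, not a concrete alignment proposal.

def empathic_reward(own_reward: float,
                    estimated_welfare_of_others: list[float],
                    empathy_weight: float = 0.8) -> float:
    """Blend the agent's own reward with the (estimated) welfare of others.

    empathy_weight = 0.0 recovers a purely selfish agent; higher values make
    other agents' welfare dominate the reward signal.
    """
    if not estimated_welfare_of_others:
        return own_reward
    others = sum(estimated_welfare_of_others) / len(estimated_welfare_of_others)
    return (1 - empathy_weight) * own_reward + empathy_weight * others


# An action that benefits the agent slightly but harms two others:
print(empathic_reward(1.0, [-2.0, -1.0]))  # negative, so the action is discouraged
```

Of course, the sketch hides the hard parts: where the welfare estimates come from, and whether they track what the other agents actually want rather than what the AGI models them as wanting.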


 

H2 - Silo AI

Silo the hardware and functionality of AGI to particular tasks. Like governments are split into separate branches of power to avoid corruption. Like humans need to collaborate to make things greater than themselves. Similarly, limit AGI to functions and physical capabilities that force it to work together with multiple other, independent AGIs to achieve any change in the world.
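As a toy illustration of the siloing idea, here is a sketch in which no action executes unless a quorum of independent overseeing AIs approves it. The overseers, the threshold, and the veto rules are all made up for illustration; the sketch only shows the mechanical shape of "no single AGI can act alone":

```python
# Toy sketch: an action only goes through if enough independent AIs sign off on it.
# The overseers, the approval threshold, and the veto logic are hypothetical.

from typing import Callable, Sequence

def quorum_execute(action: str,
                   approvers: Sequence[Callable[[str], bool]],
                   threshold: int) -> bool:
    """Execute `action` only if at least `threshold` independent approvers agree."""
    votes = sum(1 for approve in approvers if approve(action))
    approved = votes >= threshold
    status = "Executing" if approved else "Blocked"
    print(f"{status}: {action} ({votes}/{len(approvers)} approvals)")
    return approved

# Three dummy overseers, each vetoing a different class of action:
overseers = [
    lambda a: "disable oversight" not in a,   # vetoes tampering with oversight
    lambda a: "self-modify" not in a,         # vetoes self-modification
    lambda a: True,                           # approves everything
]

quorum_execute("write quarterly report", overseers, threshold=3)       # executes
quorum_execute("self-modify reward function", overseers, threshold=3)  # blocked
```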


 

H3 - Kill Switch 

Kill switch! Treat AGI like the next cold war. Make a perfect kill switch, where any massive failure state according to humans would blow up the entire sphere of existence of humans and AGI.
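Purely to give the idea a mechanical shape, a toy dead-man's-switch sketch might look like the following. The failure conditions and the trigger are obviously hypothetical placeholders; the hard part the sketch ignores is making the switch something the AGI cannot route around:

```python
# Toy sketch: a tripwire that fires the moment any human-defined failure
# condition is observed. All conditions and the "trigger" are hypothetical
# placeholders for illustration only.

FAILURE_CONDITIONS = {
    "oversight disabled": lambda state: state.get("oversight_disabled", False),
    "self-replication detected": lambda state: state.get("self_replication", False),
}

def kill_switch(state: dict) -> bool:
    """Return True (i.e. trigger the switch) if any failure condition fires."""
    for name, check in FAILURE_CONDITIONS.items():
        if check(state):
            print(f"Failure state '{name}' detected: triggering kill switch.")
            return True
    return False

kill_switch({"oversight_disabled": False})   # nothing happens
kill_switch({"self_replication": True})      # tripwire fires
```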


 

H4 - Human Alignment

AI alignment currently seems intractable because any alignment formula we come up with is inherently inconsistent, because humans are inconsistent. We can solve AI alignment by first solving what humanity's alignment actually is.

Thoughts on Corrigibility

Still learning about it at the moment, but my limited understanding so far is: 

How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.  

In other words, how do we constrain an AI with respect to its cognition about its goals?
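To make that question a bit more concrete, here is a minimal toy sketch (my own illustration, not an established formalism) of an agent that is deliberately kept uncertain about which goal it should be optimizing, and therefore treats a human correction as information rather than as an obstacle:

```python
# Toy sketch: an agent that stays uncertain about which goal it is meant to
# pursue, and updates toward the human's stated preference instead of
# overriding it. Goals, probabilities, and the update rule are hypothetical.

def act(goal_beliefs: dict[str, float]) -> str:
    """Pursue the goal the agent currently believes is most likely intended."""
    return max(goal_beliefs, key=goal_beliefs.get)

def human_correction(goal_beliefs: dict[str, float],
                     corrected_goal: str,
                     strength: float = 0.9) -> dict[str, float]:
    """A correction shifts probability mass toward the human-endorsed goal."""
    updated = {goal: p * (1 - strength) for goal, p in goal_beliefs.items()}
    updated[corrected_goal] = updated.get(corrected_goal, 0.0) + strength
    total = sum(updated.values())
    return {goal: p / total for goal, p in updated.items()}

beliefs = {"maximize_paperclips": 0.5, "do_what_the_human_meant": 0.5}
print(act(beliefs))   # at 50/50 it simply picks one of the candidate goals
beliefs = human_correction(beliefs, "do_what_the_human_meant")
print(act(beliefs))   # now defers to the corrected goal
```

This is roughly the flavor of the CIRL-style framings mentioned in the comments below; other framings, like myopia, take a different route.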


 

Side Thoughts - Researcher Bias

Do AGI optimists and pessimists differ in some dimension of personality or cognitive traits? It's well established that political and ideological voting behavior correlates with personality. So if the same is true for AI risk stance, then this might point to a potential confounder in AI risk predictions.

 


My thanks go out to Leon Lang [LW · GW] and Jan Kirchner [LW · GW] for encouraging my beginner theorizing, discussing the details of each idea, and pointing me toward related essays and papers.

29 comments

Comments sorted by top scores.

comment by Rob Bensinger (RobbBB) · 2022-07-02T20:24:01.977Z · LW(p) · GW(p)

Therefore, to get the best of both worlds, I figured I'd write down my naive hypotheses as I have them, and keep studying at the same time.

I quite like this strategy!

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-07-02T20:26:47.674Z · LW(p) · GW(p)

I would also echo the advice in the Alignment Research Field Guide [LW · GW]:

We sometimes hear questions of the form “Even a summer internship feels too short to make meaningful progress on real problems. How can anyone expect to meet and do real research in a single afternoon?”

There’s a Zeno-esque sense in which you can’t make research progress in a million years if you can’t also do it in five minutes. It’s easy to fall into a trap of (either implicitly or explicitly) conceptualizing “research” as “first studying and learning what’s already been figured out, and then attempting to push the boundaries and contribute new content.”

The problem with this frame (according to us) is that it leads people to optimize for absorbing information, rather than seeking it instrumentally, as a precursor to understanding. (Be mindful of what you’re optimizing in your research!)

There’s always going to be more pre-existing, learnable content out there. It’s hard to predict, in advance, how much you need to know before you’re qualified to do your own original thinking and seeing, and it’s easy to Dunning-Kruger or impostor-syndrome yourself into endless hesitation or an over-reliance on existing authority.

Instead, we recommend throwing out the whole question of authority. Just follow the threads that feel alive and interesting. Don’t think of research as “study, then contribute.” Focus on your own understanding, and let the questions themselves determine how often you need to go back and read papers or study proofs.

Approaching research with that attitude makes the question “How can meaningful research be done in an afternoon?” dissolve. Meaningful progress seems very difficult if you try to measure yourself by objective external metrics. It is much easier when your own taste drives you forward.

Replies from: DarkSym
comment by Shoshannah Tekofsky (DarkSym) · 2022-07-02T20:47:34.843Z · LW(p) · GW(p)

Thank you! And adding that to my reading list :D

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-07-02T23:44:35.810Z · LW(p) · GW(p)

Yeah, I actually think Alignment Research Field Guide [LW · GW] is one of the best resources for EAs and rationalists to read regardless of what they're doing in life. :)

comment by Evan R. Murphy · 2022-07-03T02:31:31.613Z · LW(p) · GW(p)

I do think there's value in beginner's mind, glad you're putting your ideas on alignment out there :)

How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.

This interpretation of corrigibility seems too narrow to me. Some framings of corrigibility, like Stuart Russell's CIRL-based one, are like this, where the AI is trying to understand human goals but has uncertainty about them. But there are other framings, for example myopia, where the AI's goal is such that it would never sacrifice reward now for reward later, so it would never be motivated to pursue an instrumental goal like disabling its own off-switch.

When you're looking to further contaminate your thoughts and want more on this topic, there's a recent thread where different folks are trying to define corrigibility in the comments: https://www.lesswrong.com/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag#comments [LW · GW]

Replies from: DarkSym
comment by Shoshannah Tekofsky (DarkSym) · 2022-07-03T08:19:23.951Z · LW(p) · GW(p)

Thank you! I'll definitely read that :)

comment by jacopo · 2022-07-03T15:43:00.656Z · LW(p) · GW(p)

I like the idea! Just a minor issue with the premise:

"Either I’d find out he’s wrong, and there is no problem. Or he’s right, and I need to reevaluate my life priorities."

There is a wide range of opinions, and EY's is one of the most pessimistic. It may be the case that he's wrong on several points, and we are way less doomed than he thinks, but that the problem is still there, and a big one as well.

(In fact, if EY is correct we might as well ignore the problem, as we are doomed anyway. I know this is not what he thinks, but it's the conclusion I would draw from his predictions.)

Replies from: DarkSym
comment by Shoshannah Tekofsky (DarkSym) · 2022-07-03T19:13:52.371Z · LW(p) · GW(p)

The premise was intended to contextualize my personal experience of the issue. I did not intend to make a case that everyone should weigh their priorities in the same manner. For my brain specifically, a "hopeless" scenario registers as a Call to Arms where you simply need to drop what else you're doing and get to work. In this case, I mapped my children's ages onto all the timelines. I realized either my kids or my grandkids will die from AGI if Eliezer is in any way right. Even a 10% chance of that happening is too high for me, so I'll pivot to whatever work needs to get done to avoid that. Even if the chance of my work making a difference is very slim, there isn't anything else worth doing.

Replies from: jacopo
comment by jacopo · 2022-07-04T11:23:53.526Z · LW(p) · GW(p)

I agree with you actually. My point is that in fact you are implicitly discounting EY's pessimism - for example, he didn't release a timeline but often said "my timeline is way shorter than that" with respect to 30-year timelines, and I think 20-year ones as well. The way I read him, he thinks we personally are going to die from AGI, and our grandkids will never be born, with 90+% probability, and that the only chances to avoid this are either that someone already had a plan three years ago which has been implemented in secret and will come to fruition next year, or that some large out-of-context event happens (say, nuclear or biological war brings us back to the stone age).

My no-more-informed-than-yours opinion is that he's wrong on several points, but correct on others. From this I deduce that the risk of very bad outcomes is real and not negligible, but that the situation is not as desperate, and there are probably actions that will improve our outlook significantly. Note that in the framework "either EY is right, or he's wrong and there's nothing to worry about" there's no useful action, only hope that he's wrong, because if he's right we're screwed anyway.

Implicitly, this is your world model as well, from what you say. Discussing this may then look like nitpicking, but whether Yudkowsky or Ngo or Christiano is correct about possible scenarios changes a lot about which actions are plausibly helpful. Should we look for something that has a good chance of helping in an "easier" scenario, rather than concentrate efforts on looking for solutions that work in the hardest scenario, given that the chance of finding one is negligible? Or would that be like looking for the keys under the streetlight?

Replies from: DarkSym
comment by Shoshannah Tekofsky (DarkSym) · 2022-07-04T18:16:57.090Z · LW(p) · GW(p)

I think we're reflecting on the material at different depths. I can't say I'm far enough along to assess who might be right about our prospects. My point was simply that telling someone with my type of brain "it's hopeless, we're all going to die" actually has the effect of me dropping whatever I'm doing and applying myself to finding a solution anyway.

comment by Joe Kwon · 2022-07-02T21:37:24.389Z · LW(p) · GW(p)

This is a really cool idea and I'm glad you made the post! Here are a few comments/thoughts:

H1: "If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes"

How confident are you in this premise? Power and one's sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren't). Also, I feel a little skeptical about the usefulness of thinking about how much more or less the trait shows up in various intelligence strata within humans. Seems like what we're worried about is in a different reference class. Not sure.

 

H4 is something I'm super interested in and would be happy to talk about it in conversations/calls if you want to : )

Replies from: Ericf, DarkSym
comment by Ericf · 2022-07-02T22:01:20.999Z · LW(p) · GW(p)

I saw this note in another thread, but the gist of it is that power doesn't corrupt. Rather,

  1. Evil people seek power, and are willing to be corrupt (shared cause correlation)
  2. Being corrupt helps to get more power - in the extreme statement of this, maintaining power requires corruption
  3. The process of gaining power creates murder-Gandhis.
  4. People with power attract and/or need advice on how and for what goal to wield it, and that leads to misalignment with the agent's pre-power values.
Replies from: Gunnar_Zarncke, DarkSym
comment by Gunnar_Zarncke · 2022-07-03T19:56:10.613Z · LW(p) · GW(p)

Can you add a link to the other thread please?

Replies from: Ericf
comment by Ericf · 2022-07-04T00:52:07.762Z · LW(p) · GW(p)

No, I don't remember exactly where on LW I saw it - just wanted to acknowledge that I was amplifying someone else's thoughts.

My college writing instructor was taken aback when I asked her how to cite something I could quote but didn't recall from where, and her answer was "then you can't use it," which seemed harsh. There should be a way to acknowledge plagiarism without knowing or stating who is being plagiarized - and if the original author shows up, you've basically pre-conceded any question of originality to them.

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2022-07-04T01:35:46.363Z · LW(p) · GW(p)

Thx for being clear about it.

comment by Shoshannah Tekofsky (DarkSym) · 2022-07-03T08:18:43.406Z · LW(p) · GW(p)

Are you aware of any research into this? I struggle to think of any research designs that would make it through an ethics board.

Replies from: Ericf
comment by Ericf · 2022-07-03T15:15:16.885Z · LW(p) · GW(p)

I don't know that anyone has done the studies, but you could look at how winners of large lotteries behave. That is a natural example of someone suddenly gaining a lot of money (and therefore power). Do they tend to keep their previous goals and just scale up their efforts, or do they start doing power-retaining things? I have no idea what the data would show - thought experiments and anecdotes could go either way.

comment by Shoshannah Tekofsky (DarkSym) · 2022-07-03T08:16:53.228Z · LW(p) · GW(p)

Thank you!

If they are not orthogonal then presumably prosociality and power are inversely related, which is worse?

In this case, I'm hoping intelligence and prosociality-that-is-robust-to-absolute-power are positively correlated. However, I struggle to think how this might actually be tested... My intuitions may be born from the Stanford prison experiment, which I think has been refuted since. So maybe we don't actually have as much data on prosociality in extreme circumstances as I initially intuited. I'm mostly reasoning this out on the fly now by zooming in on where my thoughts may have originally come from.

That said, it doesn't very much matter how frequent robust prosociality traits are, as long as they do exist and can be recreated in AGI.

I'll DM you my discord :)

comment by Gunnar_Zarncke · 2022-07-03T19:59:08.710Z · LW(p) · GW(p)

First candidate for this trait is “emotional empathy”, a trait that hitches one’s reward system to that of another organism.

It would be interesting to hear what cognitive neuroscientists know about how empathy is implemented in the brain.

The H1 point sounds close to Steven Byrnes' brain-like AGI.

Replies from: PeterC
comment by PeterC · 2022-11-03T14:08:44.902Z · LW(p) · GW(p)

I believe that cognitive neuroscience has nothing much to say about how any experience at all is implemented in the brain - but I just read this book which has some interesting ideas: https://lisafeldmanbarrett.com/books/how-emotions-are-made/ 

comment by MSRayne · 2022-07-03T13:29:38.923Z · LW(p) · GW(p)

My personal opinion is that empathy is the one most likely to work. Most proposed alignment solutions feel to me like patches rather than solutions to the problem, which is AI not actually caring about the welfare of other beings intrinsically. If it did, it would figure out how to align itself. So that's the one I'm most interested in. I think Steven Byrnes has some interest in it as well - he thinks we ought to figure out how human social instincts are coded in the brain.

Replies from: DarkSym
comment by Shoshannah Tekofsky (DarkSym) · 2022-07-03T14:47:41.791Z · LW(p) · GW(p)

Hmmm, yes and no?

E.g., many people who care about animal welfare differ on the decisions they would make for those animals. What if the AGI ends up a negative utilitarian and sterilizes us all to save humanity from all future suffering? The missing element would again be to have the AGI aligned with humanity, which brings us back to H4: What's humanity's alignment anyway?

Replies from: MSRayne
comment by MSRayne · 2022-07-03T16:48:57.841Z · LW(p) · GW(p)

I think "humanity's alignment" is a strange term to use. Perhaps you mean "humanity's values" or even "humanity's collective utility function."

I'll clarify what I mean by empathy here. I think the ideal form of empathy is wanting others to get what they themselves want. Given that entities are competing for scarce resources and tend to interfere with one another's desires, this leads to the necessity of making tradeoffs about how much you help each desire, but in principle this seems like the ideal to me.

So negative utilitarianism is not actually reasonably empathic, since it is not concerned with the rights of the entities in question to decide about their own futures. In fact I think it's one of the most dangerous and harmful philosophies I've ever seen, and an AI such as I would like to see made would reject it altogether.

comment by PeterC · 2022-11-03T14:16:33.706Z · LW(p) · GW(p)

Enjoyed this. 

Overall, I think that framing AI alignment as a problem is ... erm .. problematic. The best parts of my existence as a human do not feel like the constant framing and resolution of problems. Rather they are filled with flow, curiosity, wonder, love.

I think we have to look in another direction, than trying to formulate and solve the "problems" of flow, curiosity, wonder, love. I have no simple answer - and stating a simple answer in language would reveal that there was a problem, a category, that could "solve" AI and human alignment problems. 

I keep looking for interesting ideas - and find yours among the most fascinating to date.

comment by kimsolez · 2022-07-10T23:05:18.932Z · LW(p) · GW(p)

My take on this: countering Eliezer Yudkowsky 

Replies from: Benito, tidikanji
comment by Ben Pace (Benito) · 2022-07-10T23:32:46.559Z · LW(p) · GW(p)

You're right that an AGI being vastly smarter than humans is consistent with both good and bad outcomes for humanity. This video does not address any of the arguments that have been presented about why an AGI would by default have unaligned values with humanity, which I'd encourage you to engage with. It's mentioned in bullet -3 in the list, under the names instrumental convergence and orthogonality thesis, with the former being probably what I'd recommend reading about first.

Replies from: kimsolez
comment by kimsolez · 2022-07-11T01:43:37.499Z · LW(p) · GW(p)

Hi Ben, thanks for this. We are not passive victims of the future, waiting trembling to see a future we cannot escape because of rigid features such as you mention. We can help shape and literally make the future. I have 1500+ videos so you will run out of material before I do! What do you think of the idea of machines suggesting better ways to cooperate which humans could never attain themselves? Do you listen to the news? If you don’t listen to the news isn’t that because you are disappointed with how humans cooperate left to their own devices? They need better ideas from machines! See: 

comment by tidikanji · 2022-08-06T02:50:54.098Z · LW(p) · GW(p)

Kim, you're not addressing the points in the post. You can't repeat catchphrases like "passive victims of the future" and expect them to gain ground here. MIRI created a well-funded research institution devoted to positively shaping the future, while you make silly YouTube videos with platitudes. This interest in AI seems like recreation to you.