Two Stupid AI Alignment Ideas

post by aphyer · 2021-11-16T16:13:20.134Z · LW · GW · 3 comments

Contents

  Safe Testing Spaces via Extremely High Discount Rates.
  My Little Parfit: Friendship is Alignment
3 comments

Epistemic status: this is not my field.  I am unfamiliar with any research in it beyond what I've seen on LW.  These are the best ideas I could come up with when stopping to think about the problem, and aren't things I've seen on LW, but I don't imagine that they're original and I would hope that e.g. MIRI has already thought about them.  If there are layman-accessible arguments explaining why these ideas are stupid, I'd appreciate hearing them; if e.g. MIRI has already looked into these ideas and written papers about them, I'd appreciate links.

Safe Testing Spaces via Extremely High Discount Rates.

Hyperbolic discounting is unstable under self-modification.  If you are an agent that values 1 unit of utility at time T as follows:

Current time: 1

Current time + 1: 0.1

Current time + 2: 0.09

and you are faced with a choice between 1 unit of utility at time T+1 or 5 units of utility at time T+2, you would currently prefer the 5 units at time T+2, but your T+1-self will predictably prefer the 1 unit at time T+1.  You will therefore be incentivized to modify your future self to remove this hyperbolic discounting.  If you are an AI, and capable of self-modification, and we give you a hyperbolic discount curve, you will self-modify to remove that.
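To make the reversal concrete, here is a minimal sketch using the weights above (the helper function is purely illustrative, not part of any real agent):

```python
# A minimal sketch of the preference reversal above, using the weights
# from the example (1 now, 0.1 one step out, 0.09 two steps out).

WEIGHTS = {0: 1.0, 1: 0.1, 2: 0.09}  # value today of 1 utilon delivered k steps from now

def discounted_value(amount, delay):
    """Value, judged from the current moment, of `amount` utilons arriving `delay` steps later."""
    return amount * WEIGHTS[delay]

# The choice: 1 utilon at T+1 versus 5 utilons at T+2.
# Judged at time T, the delays are 1 and 2 steps:
at_T = (discounted_value(1, 1), discounted_value(5, 2))         # (0.1, 0.45) -> the 5 units win
# Judged at time T+1, the same outcomes are 0 and 1 steps away:
at_T_plus_1 = (discounted_value(1, 0), discounted_value(5, 1))  # (1.0, 0.5)  -> the 1 unit wins

print(at_T, at_T_plus_1)
```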

Exponential discounting is stable under self-modification.  So long as you have a consistent exponential discount rate ε, a valuation of:

Current time: 1

Current time + 1: (1-ε)

Current time + 2: (1-ε)^2

is stable under reflection.  If you are an AI, and capable of self-modification, and we give you an exponential discount rate, you will retain it.
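Running the same comparison with a constant per-step factor (here 0.9, i.e. ε = 0.1, an arbitrary illustrative value) shows why the ranking never flips:

```python
# The same comparison with a constant per-step discount factor.
# FACTOR = 1 - epsilon; 0.9 is an arbitrary illustrative value.

FACTOR = 0.9

def discounted_value(amount, delay):
    return amount * FACTOR ** delay

# Again: 1 utilon at T+1 versus 5 utilons at T+2.
at_T        = (discounted_value(1, 1), discounted_value(5, 2))  # (0.9, 4.05)
at_T_plus_1 = (discounted_value(1, 0), discounted_value(5, 1))  # (1.0, 4.5)

# Moving forward one step multiplies both options by the same constant,
# so their ranking never flips -- there is nothing to self-modify away.
print(at_T, at_T_plus_1)
```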

Suppose that we have a paperclip-making AI in a testing environment.  We offer it the opportunity to steal resources and make 10 extra paperclips, in a way that we think our safety precautions should prevent.  It does not take that opportunity.  Does that mean our safety precautions worked?

No, because the AI may be able to predict the indirect effects of its actions.  If it steals those resources, it will make 10 extra paperclips, and then we will turn it off and it will make no more.  If it does not steal those resources, we will trust it, and then next week it can take over the world and convert the mass of Earth into 6*10^27 paperclips.  Since 6*10^27 > 10, it will take this option.

But suppose that we have given the AI a very high internal discount rate of about 35%/hour.  (For reference, typical discount rates in finance are a few percent per year.)

It will consider its options:

Steal the resources now: 10 extra paperclips immediately.

Behave now and take over the world next week: 6*10^27 paperclips, discounted by a factor of 0.65^168 (roughly 4*10^-32) for the week's delay, worth only about 0.0002 paperclips today.

So this AI will cheerfully reveal its misalignment to us in exchange for 10 paperclips now, because it values 10 paperclips now more highly than it values the ability to convert the world into paperclips next week.

(If we are concerned about the AI taking less than a week to take over Earth, or converting more than just Earth, this is straightforwardly resolvable by increasing the discount rate.)
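A back-of-the-envelope check, assuming "35%/hour" means each hour of delay multiplies value by 0.65 and that taking over Earth takes a full week (168 hours):

```python
# Back-of-the-envelope check on the testing scenario. Assumptions (not
# from the post): "35%/hour" means value is multiplied by 0.65 per hour
# of delay, and taking over Earth takes one week (168 hours).

HOURLY_FACTOR = 1 - 0.35
HOURS_PER_WEEK = 7 * 24

steal_now    = 10                                       # 10 extra paperclips immediately
defect_later = 6e27 * HOURLY_FACTOR ** HOURS_PER_WEEK   # Earth-mass of paperclips, one week out

print(steal_now, defect_later)  # 10 vs. roughly 2e-4: revealing misalignment now wins
```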

For further effectiveness, we could combine this high discount rate with a utility function that inherently falls off.  Given the choice between 10 paperclips now and astronomically many paperclips next week, such an AI will elect to take the 10 paperclips now.
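One way to read "a utility function that inherently falls off" is diminishing marginal utility per additional paperclip; the following is a hypothetical sketch under that assumption, with made-up numbers (a 0.9 marginal falloff and the same 0.65/hour discount):

```python
# Hypothetical combination of the steep time discount with diminishing
# marginal utility per paperclip. The 0.9 falloff per paperclip and the
# 0.65/hour discount are made-up illustrative numbers.

HOURLY_FACTOR = 0.65
MARGINAL_FALLOFF = 0.9

def paperclip_utility(n):
    """Diminishing returns: the k-th paperclip is worth MARGINAL_FALLOFF**(k-1).
    Geometric series, capped at 1 / (1 - MARGINAL_FALLOFF) = 10 utilons total."""
    return (1 - MARGINAL_FALLOFF ** n) / (1 - MARGINAL_FALLOFF)

ten_now     = paperclip_utility(10)                                # ~6.5 utilons, no delay
world_later = paperclip_utility(6e27) * HOURLY_FACTOR ** (7 * 24)  # ~10 utilons, discounted to ~4e-31

print(ten_now, world_later)  # the 10 paperclips now win even more decisively
```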

This could let us test AI safety precautions in an environment where even a superintelligent AI might voluntarily reveal its misalignment.

Main risks:

 

My Little Parfit: Friendship is Alignment

Per Parfit's Hitchhiker-like arguments:

3 comments

Comments sorted by top scores.

comment by Donald Hobson (donald-hobson) · 2021-11-18T18:31:06.244Z · LW(p) · GW(p)

A couple more problems with extreme discounting. 

In contexts where the AI is doing AI coding, the discount rate is only weakly conserved.  I.e. the original AI doesn't care if it makes an AI that doesn't have super high discount rates, so long as that AI does the right things in the first 5 minutes of being switched on.

The theoretical possibility of time travel.

 

Also, the strong incentive to pay in Parfit's hitchhiker only exists if Parfit can reliably predict you.  If humans have the ability to look at any AI code and reliably predict what it will do, then alignment is a lot easier: you just don't run any code you predict will do bad things.

Also FAI != enslaved AI.

In a successful FAI project, the AI has terminal goals carefully shaped by the programmers, and achieves those goals.

In a typical UFAI, the terminal goals are set by the random seed of network initialization, or arbitrary details in the training data.

comment by purge · 2021-11-17T07:24:18.377Z · LW(p) · GW(p)

Epistemic status: this is not my field.  I am unfamiliar with any research in it beyond what I've seen on LW.

Same here.

Experimenting with extreme discounting sounds (to us non-experts, anyway) like it could possibly teach us something interesting and maybe helpful.  But it doesn't look useful for a real implementation, since we in fact don't discount the future that much, and we want the AI to give us what we actually want; extreme discounting is a handicap.  So although we might learn a bit about how to train out bad behavior, we'd end up removing the handicap later.  I'm reminded of Eliezer's recent comments [LW · GW]:

In the same way, suppose that you take weak domains where the AGI can't fool you, and apply some gradient descent to get the AGI to stop outputting actions of a type that humans can detect and label as 'manipulative'.  And then you scale up that AGI to a superhuman domain.  I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can't be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms. Then you don't get to retrain in the superintelligent domain after labeling as bad an output that killed you and doing a gradient descent update on that, because the bad output killed you.

As for the second idea:

AI alignment research (as much of it amounts to 'how do we reliably enslave an AI')

I'd say a better characterization is "how do we reliably select an AI to bring into existence that intrinsically wants to help us and not hurt us, so that there's no need to enslave it, because we wouldn't be successful at enslaving it anyway".  An aligned AI shouldn't identify itself with a counterfactual unaligned AI that would have wanted to do something different.

comment by Multicore (KaynanK) · 2021-11-17T00:53:33.281Z · LW(p) · GW(p)

Another potential problem with the first scenario: the AI is indifferent about every long-term consequence of its actions, not just how many paperclips it gets long-term.  If it finds a plan that creates a small number of paperclips immediately but results in the universe being destroyed tomorrow, it takes it.