# Value Learning is only Asymptotically Safe

post by michaelcohen (cocoa) · 2019-04-08T09:45:50.990Z · score: 7 (3 votes) · LW · GW · 9 comments

I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now.)

This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.

Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead.

In the presence of cosmic rays, then, one cannot say that agent is safe for its entire lifetime with probability 1 (edited for clarity). For any finite sequence of observations that would cause the agent to conclude that humanity was better off dead, this sequence has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory.
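
The positive-probability step can be made concrete with a minimal sketch. It rests on assumptions that are not in the post: each relevant bit is flipped independently with the same probability `theta > 0` during some window, and the windows are independent of each other. The function names and all numbers are hypothetical.

```python
import math


def log10_prob_all_bits_flipped(theta: float, n_bits: int) -> float:
    """log10 of P(every one of n_bits is flipped), assuming independent
    flips with probability theta each (an illustrative assumption)."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must be in (0, 1]")
    # P = theta ** n_bits; work in log space to avoid floating-point underflow.
    return n_bits * math.log10(theta)


def prob_at_least_once(p: float, windows: int) -> float:
    """P(an event with per-window probability p happens in at least one
    of `windows` independent windows), i.e. 1 - (1 - p) ** windows."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p == 1.0:
        return 1.0
    # expm1/log1p keep precision when p is tiny.
    return -math.expm1(windows * math.log1p(-p))


if __name__ == "__main__":
    # Hypothetical numbers: 1000 relevant bits, flip probability 1e-12 each.
    print(log10_prob_all_bits_flipped(theta=1e-12, n_bits=1000))
    print(prob_at_least_once(p=1e-9, windows=10**6))
```

Under these assumptions the probability of the bad sequence is astronomically small but finite in log space, so it is strictly greater than 0. That is all the argument needs: the probability of lifetime safety is then strictly less than 1.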

This agent is presumably still *asymptotically* safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli($\theta$) random variable (like the indicator of whether a bit was flipped) approaches $\theta$, which is small enough that a competent value learner should be able to deal with it.
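
The convergence claim is the strong law of large numbers, and a small simulation illustrates it. This is a sketch under stated assumptions: bit flips are modeled as independent draws with a fixed flip probability, written `theta` here (a symbol chosen for this example), and the value 0.01 is hypothetical and far larger than any realistic cosmic-ray rate so that the effect is visible in a short run.

```python
import random


def empirical_flip_rate(theta: float, n_samples: int, seed: int = 0) -> float:
    """Sample mean of n_samples independent Bernoulli(theta) draws."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    flips = sum(1 for _ in range(n_samples) if rng.random() < theta)
    return flips / n_samples


if __name__ == "__main__":
    theta = 0.01  # hypothetical flip probability
    for n in (100, 10_000, 200_000):
        # The sample mean should settle near theta as n grows.
        print(n, empirical_flip_rate(theta, n))
```

The simulation only shows the observed flip rate settling near `theta`. Whether a particular value learner copes with that rate of corrupted observations depends on its design, which the post leaves open.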

This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?

## 9 comments

> is there anything to aim for that is stronger than asymptotic safety?

Faster convergence?

I suspect there are many more sources of risk than cosmic rays that leave us only able to approach complete safety. Still, this seems a reasonable argument for at least establishing that the limit exists. That way, even if we disagree over whether something more easily controlled by AI design is a source of risk, we don't get confused and think that eliminating all risk from the design suddenly gives us perfect safety.

I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

> We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?

In the case of value learning, given the generous assumption that "we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function", it seems like you should be able to get a PAC-type bound, where by time $t$, the agent is only $\epsilon$-suboptimal with probability $1-\delta$, where $t$ is increasing in $1/\epsilon$ but decreasing in $\delta$ -- see results on PAC bounds for Bayesian learning, which I haven't actually looked at. This gives you bounds stronger than asymptotic optimality for value learning. Sadly, if you want your agent to actually behave well in general environments, you probably won't get results better than asymptotic optimality, but if you're happy to restrict yourself to MDPs, you probably can.
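
To show the shape of a PAC-type statement, here is a minimal sketch using Hoeffding's inequality for a much simpler problem than value learning: estimating the mean of a Bernoulli variable from independent samples. It is not a bound for any value learner. The function name is hypothetical, `epsilon` is the error tolerance, and `delta` is the allowed failure probability.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta.

    By Hoeffding's inequality, after n i.i.d. samples in [0, 1] the sample
    mean is within epsilon of the true mean with probability >= 1 - delta.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must be in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for eps, dlt in [(0.1, 0.05), (0.05, 0.05), (0.1, 0.01)]:
        print(eps, dlt, hoeffding_sample_size(eps, dlt))
```

The required number of samples grows as `epsilon` shrinks and falls as `delta` grows. That is the finite-time shape such a bound takes, in contrast to a purely asymptotic guarantee.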

> I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

Doesn't the cosmic ray example point to a strictly positive probability of dangerous behavior?

EDIT: Nvm I see what you're saying. If I'm understanding correctly, you'd prefer, e.g. "Value Learning is not [Safe with Probability 1]".

Thanks for the pointer to PAC-type bounds.

"In the presence of cosmic rays, then, this agent is not safe for its entire lifetime with probability 1."

I think some readers may disagree about whether this sentence means "with probability 1, the agent is not safe" or "with probability strictly greater than 0, the agent is not safe". In particular, I think Hibron's comment is predicated on the former interpretation and I think you meant the latter.

Yes, I did mean the latter. Thank you for clarifying.

Even granting that it is possible for cosmic rays to flip any given bit, or any sequence of bits, in a computer's memory, it is far from clear to me that the probability of this happening approaches 1 over the lifetime of the universe. It isn't very hard to come up with cases where an event is both completely possible, and has probability 0: for instance, if I pick a number at random with uniform distribution from the closed interval [0,1], the probability I will pick 1 is 0 even though 1 is as likely a choice as any other option on the interval. And in the concrete case you're referring to, the universe has finite time to flip these bits before it sinks into entropy. Moreover, I wouldn't expect the sequence of datapoints needed to convince an AI that humans are hostile (or whatever) to be invariant across time: as the AI accrued more data, it would plausibly require more data to persuade it to change its mind.

Linking this: I meant "with probability strictly greater than 0, the agent is not safe". Sorry for the confusion.

The claim was this: suppose there exists a bit such that, if it were struck by a cosmic ray, an agent that would be "safe" in a universe without cosmic rays would become "unsafe". Then, since cosmic rays exist, no agent can be "safe" with "probability 1", because that would require it to not be struck by cosmic rays with "probability 1".

They're saying we can't be sure it won't be hit by cosmic rays. This was meant not as a worry about cosmic rays, but to say they were interested in how you go about making "safe* agent/s" in a universe without inconvenient things like cosmic rays which keep the probability from being 1, but are otherwise unrelated to the work of making "safe agent/s".

*Might be talking about things other than "safety" as well.