Comments
What do you mean by "in full generality instead of the partial version attained by policy selection"?
Paul - how widely do you want this shared?
The "benign induction problem" link is broken.
I agree it's not a complete solution, but it might be a good path towards creating a task-AI, which is a potentially important unsolved sub-problem.
I spoke with Huw about this idea. I was thinking along similar lines at some point, but only for "safe-shutdown", e.g. if you had a self-driving car that anticipated encountering a dangerous situation and wanted to either:
- pull over immediately
- cede control to a human operator
It seems intuitive to give it a shutdown policy that triggers in such cases, and that aims to minimize a combined objective of time-to-shutdown and risk-of-shutdown. (Of course, this doesn't deal with interrupting the agent, à la Armstrong and Orseau.)
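A minimal sketch of that combined objective (all names and the weighting are made up, just to pin down what I mean):

```python
# Minimal sketch of the combined shutdown objective (hypothetical names/weights):
# pick the shutdown option minimizing expected time-to-shutdown plus a weighted
# estimate of the risk incurred while shutting down.
def choose_shutdown_option(options, time_to_shutdown, shutdown_risk, risk_weight=10.0):
    """options: candidate shutdown behaviours (e.g. 'pull over', 'cede control').
    time_to_shutdown, shutdown_risk: callables scoring each option."""
    return min(options, key=lambda o: time_to_shutdown(o) + risk_weight * shutdown_risk(o))
```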
Huw pointed out that a similar strategy can be used for any "genie"-style goal (i.e. you want an agent to do one thing as efficiently as possible, and then shut down until you give it another command), which made me substantially more interested in it.
This seems similar in spirit to giving your agent a short horizon, but now you also have regular terminations, by default, which has some extra pros and cons.
I reason as follows:
- Omega inspires belief only after the agent encounters Omega.
- According to UDT, the agent should not update its policy based on this encounter; it should simply follow it.
- Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
- the agent does update, in which case, why not update on the result of the coin-flip? or
- the agent doesn't update, in which case, what matters is simply the optimal policy given the original prior.
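To make the second branch concrete with the usual counterfactual-mugging payoffs ($100 asked for, $10,000 counterfactually offered; illustrative numbers, not from this thread): under the original prior,

\mathbb{E}[\text{pay when asked}] = \tfrac{1}{2}(+10{,}000) + \tfrac{1}{2}(-100) = 4950 > 0 = \mathbb{E}[\text{never pay}],

so the policy selected from the original prior pays even after seeing the unlucky coin-flip.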
I agree... if there are specific things you don't want to be able to do / predict, then you can do something very similar to the cited "Censoring Representations" paper.
But if you want to censor all "out-of-domain" knowledge, I don't see a good way of doing it.
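For the specific-attribute case, here is a rough sketch of the kind of adversarial objective I have in mind (PyTorch-style, names mine; only loosely in the spirit of the cited paper):

```python
import torch.nn.functional as F

# Rough sketch (hypothetical names): an encoder supports the main task while an
# adversary tries to recover the attribute we want censored; the encoder is then
# penalized whenever the adversary succeeds, so the censored attribute becomes
# unpredictable from the representation.
def censoring_losses(encoder, task_head, adversary, x, y_task, y_censored):
    z = encoder(x)
    task_loss = F.cross_entropy(task_head(z), y_task)
    adversary_loss = F.cross_entropy(adversary(z.detach()), y_censored)  # train adversary on frozen z
    encoder_loss = task_loss - F.cross_entropy(adversary(z), y_censored)  # confuse the adversary
    return encoder_loss, adversary_loss  # optimize with separate optimizers
```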
This seems only loosely related to my OP.
But it is quite interesting... so you're proposing that we can make safe AIs by, e.g. giving them a prior which puts 0 probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past / future (thus giving us a more “rational” contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree based on open-source prisoner’s dilemma.
Anyways, I think that's a promising avenue to investigate.
Having a good model of the world seems like a necessary condition for an AI to pose a significant Xrisk.
OK that makes sense, thanks. This is what I suspected, but I was surprised that so many people are saying that UDT gets mugged without stipulating this; it made it seem like I was missing something.
Playing devil's advocate:
- P(mugger) and P(anti-mugger) aren't the only relevant quantities IRL
- I don't think we know nearly enough to have a good idea of what policy UDT would choose for a given prior. This leads me to doubt the usefulness of UDT.
It's not the same (but similar), because my proposal is just about learning a model of impact, and has nothing to do with the agent's utility function.
You could use the learned impact function to help measure (and penalize) impact, however.
Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, i.e. measuring long-term impacts.
It was mostly a gut feeling when I posted, but let me try and articulate a few:
1. It relies on having a good representation. Small problems with the representation might make it unworkable. Learning a good enough representation and verifying that you've done so doesn't seem very feasible. Impact may be missed if the representation doesn't properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard-to-learn ways.

2. I haven't looked into it, but ATM I have no theory about when this scheme could be expected to recover the "correct" model (I don't even know how that would be defined... I'm trying to "learn" my way around the problem :P)
To put #1 another way, I'm not sure that I've gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact).
On the other hand, I was inspired to consider this idea when thinking about Yoshua's proposal about causal disentangling mentioned at the end of his Asilomar talk here: https://www.youtube.com/watch?v=ZHYXp3gJCaI. This (and maybe some other similar work, e.g. on empowerment) seems to provide a way to direct an agent's learning towards maximizing its influence, which might help... although having an agent learn by maximizing its influence seems like a bad idea... but I guess you might then be able to add a conflicting objective (like a regularizer) to actually limit the impact...
So then you'd end up with some sort of adversarial-ish set-up, where the agent is trying to both:
- maximize potential impact (i.e. by understanding its ability to influence the world)
- minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact).
Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D We want an agent to learn about how to maximize its impact in order to avoid doing so.
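A very rough sketch of the two-objective idea (all names hypothetical; just to make the adversarial-ish structure explicit):

```python
# Very rough sketch (hypothetical names): one term rewards the agent for modelling
# its *potential* influence on the world (empowerment-style), while another term
# penalizes the *realized* impact of the actions it actually takes.
def combined_objective(task_reward, potential_impact_estimate, realized_impact,
                       alpha=1.0, beta=1.0):
    return (task_reward
            + alpha * potential_impact_estimate   # learn what it *could* influence
            - beta * realized_impact)             # but refrain from actually doing so
```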
(How) can an agent confidently predict its potential impact without trying potentially impactful actions?
I think it certainly can, because humans can. We use a powerful predictive model of the world to do this.
… and that’s all I have to say ATM
Thanks, I think I understand that part of the argument now. But I don't understand how it relates to:
"10. We should expect simple reasoning rules to correctly generalize even for non-learning problems. "
^Is that supposed to be a good thing or a bad thing? "Should expect" as in we want to find rules that do this, or as in rules will probably do this?
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
Points 5-9 seem to basically be saying: "We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding".
I don't really understand point 10, especially this part:
"They would most likely generalize in an unaligned way, since the reasoning rules would likely be contained in some sub-agent (e.g. consider how Earth interpreted as an “agent” only got to the moon by going through reasoning rules implemented by humans, who have random-ish values; Paul’s post on the universal prior also demonstrates this)."
I really agree with #2 (and I think with #1, as well, but I'm not as sure I understand your point there).
I've been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn't seem obvious to most... but I haven't really considered that "efficient aligned AIs almost certainly exist as points in mindspace". In fact I'm not sure I agree 100% (basically because "Moloch" (http://slatestarcodex.com/2014/07/30/meditations-on-moloch/)).
I think "trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed" remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really "takes off".
I don't see this as being the case. As Vadim pointed out, we don't even know what we mean by "aligned versions" of algos, ATM. So we wouldn't know if we're succeeding or failing (until it's too late and we have a treacherous turn).
It looks to me like Wei Dai shares my views on "safety-performance trade-offs" (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/).
I'd paraphrase what he's said as:
"Orthogonality implies that alignment shouldn't cost performance, but says nothing about the costs of 'value loading' (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don't know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don't even have clear criteria for success."
Which I emphatically agree with.
So after talking w/Stuart, I guess what he means by "humans learning from the AI's actions" is that what humans' beliefs about U converge to actually changes (for the better). I'm not sure if that's really desirable, atm.
On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents'). So I think ultimately we want a richer set of counterfactuals, including, e.g., that humans continue to exist indefinitely (otherwise P_{H_t} becomes undefined when humanity is extinct).
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.
I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent's ability to learn about U via some observations when taking null actions (i.e. A and U share some descendant(s), D, and A knows something about P(D | U, A=null)).
RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the human's belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at timestep t) if the probability H assigns to u* is higher given an agent's action a than it would be given the null action, i.e.
P_{H_t}(u^* | A_0 = a) > P_{H_t}(u^* | A_0 = A_1 = ... = null).
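As a toy restatement of that condition (interface and names hypothetical):

```python
# Toy restatement of the condition above (all names hypothetical).
# human_credence(u, t, actions) = probability the human assigns to utility function u
# at step t, given the AI's action sequence so far.
def helps_learning(human_credence, u_star, t, actions_taken, null_action=None):
    baseline = [null_action] * len(actions_taken)   # A_0 = A_1 = ... = null
    return human_credence(u_star, t, actions_taken) > human_credence(u_star, t, baseline)
```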
RE: my last question: After talking to Stuart, I think one way of viewing the problem with such a proposal is that the agent cares about its future expected utility (which depends on the state/history, not just the MDP).
Why doesn't normalizing rewards work?
(i.e. set max_pi(expected returns) = 1 and min_pi(expected returns) = 0, for all environments)... I assume this is what you're talking about at the end?
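Concretely, the normalization I mean (my notation, not from the post) would be

\tilde{V}(\pi, E) = \dfrac{V(\pi, E) - \min_{\pi'} V(\pi', E)}{\max_{\pi'} V(\pi', E) - \min_{\pi'} V(\pi', E)},

so that in every environment E the best policy gets normalized value 1 and the worst gets 0.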
"But in the setting you described, the only impact of the policy is on the agent’s actions"
I don't think so. P_M(\zeta | \pi) is meant to describe the distribution over trajectories given a policy (according to the model). Unless I'm missing something, the model could contain non-causal correlations.
Doesn't seem workable to me: being "completely ignorant" suggests an improper prior. An agent with a proper prior over its utility function can integrate over it and maximize expected utility, and which action maximizes expected utility will depend on this prior.
Thanks! I love having central repos.
A quick question / comment, RE: "I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections."
Q: What do you mean (or have in mind) in terms of "turning [...] objections"? I'm not very familiar with the phrase.
Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you've given. Recently I've been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.
Skimmed it.
It would be helpful to define "stopping point" and "stopping distance".
Wrt local optima:
Deep Neural Nets were historically thought to suffer from local optima. Recently, this viewpoint has been challenged; see, e.g. "The Loss Surfaces of Multilayer Networks" http://arxiv.org/abs/1412.0233 and references.
Although the issue remains unclear, I currently suspect that local optima are not a practical obstacle for an (omniscient) hill-climber in the real world.
I wasn't convinced overall by the statement about tiling (or not). I think you should give more detailed arguments about why you do or don't expect these agents to tile, and explain the set-up a bit more, too: are you imagining agents that take a single action, based on their current policy, to adopt a new policy, which is then not subject to further modification? Or, if not, how can you ensure that agents do not modify their policy in such a way that policy_new encourages further modifications, which can compound?
I don't understand why you say:
- it "seems to require a richer model than we usually use in [RL]". Are you suggesting that a model as I've defined it is not satisfactory/sufficient for some reason?
- "This seems to happen in your setting." Can you elaborate a bit?
So the big problem I see with this is that it is still in the optimization framework, assuming that we actually want to optimize the initial criterion. While we can imagine changing the initial criterion, this is already something we can effectively do with RL if we specify our reward to be something communicated by a human overseer (but of course that doesn't really solve the problem...)
The proposal is reminiscent of the Actor-Critic framework from RL (analogy: actor - model, critic - criterion), which learns a policy (the actor) and a value function (the critic) simultaneously.
In that case, you have the true reward function playing the role of the initial criterion, so you don't actually get to evaluate the true criterion (which would be something like distance from the optimal policy); you get what amounts to noisy samples of it. The goal in both cases is to learn a good model (i.e. a policy, for Actor-Critic).
I think there is a conceptual issue with this proposal as it stands, namely that the interplay between the changes in the model and the criterion is not taken into account. E.g. there is no guarantee that recursively applying F to the initial_model using the criteria output by X would give you anything like the model output by X.
The cool thing about Actor-Critic is that you can prove (under suitable assumptions) that this method actually gives you an unbiased estimate of the true policy gradient (Sutton 99: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf). IIRC, it requires the assumption that the critic is trained to convergence in-between each update of the actor, though.
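For reference, a minimal tabular version of that actor-critic update (illustrative of the RL analogy only, not of the proposal under discussion; all names mine):

```python
import numpy as np

# Minimal one-step tabular actor-critic sketch. The TD error from the critic plays
# the role of a noisy evaluation of the actor's choice; the actor takes a
# policy-gradient step on its softmax logits for the visited state.
def actor_critic_step(logits, values, state, action, reward, next_state,
                      gamma=0.99, actor_lr=1e-2, critic_lr=1e-1):
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += critic_lr * td_error                  # critic update
    probs = np.exp(logits[state]) / np.exp(logits[state]).sum()
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0                             # d/dlogits of log pi(action|state)
    logits[state] += actor_lr * td_error * grad_log_pi     # actor update
    return td_error
```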