Comments
You are of course perfectly right. What I meant was: so that their convex hull is full-dimensional and contains the origin. I fixed it. Thanks for spotting this!
Exactly! Thanks for providing this concise summary in your words.
In the next post we generalize the target from a single point to an interval to get even more freedom that we can use for increasing safety further.
In our current ongoing work, we generalize that further to the case of multiple evaluation metrics, in order to get closer to plausible real-world goals, see our teaser post.
Alex Turner's post you referenced first convinces me that his arguments about "orbit-level power-seeking" apply to maximizers and quantilizers/satisficers. Let me reiterate that we are not suggesting quantilizers/satisficers are a good idea, but that I firmly believe explicit safety criteria rather than plain randomization should be used to select plans.
He also claims in that post that the "orbit-level power-seeking" issue affects all schemes that are based on expected utility: "There is no clever EU-based scheme which doesn't have orbit-level power-seeking incentives." I don't see a formal proof of that claim though, maybe I missed it. The rationale he gives below that claim seems to boil down to a counting argument again, which suggests to me some tacit assumption that the agent still chooses uniformly at random from some set of policies. As this is not what we suggest, I don't see how it applies to our algorithms.
Re power-seeking in general: I believe one important class of safety criteria one should use to select from the many possible plans that can fulfill an aspiration-type goal is criteria that aim to quantify the amount of power/resources/capabilities/control potential the agent has at each time step. There are some promising metrics for this already (including "empowerment", reachability, and Alex Turner's AUP). We are currently investigating some versions of such measures, including ones we believe might be novel. A key challenge in doing so is again tractability. Counting the reachable states for example might be intractable, but approximating that number by a recursively computable metric based on Wasserstein distance and Gaussian approximations to latent state distributions seems tractable and might turn out to be good enough.
Thank you for the warm encouragement.
We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.
In particular, I perfectly agree that conflicting goals are also a severe problem for safety that needs to be addressed (while I don't believe there is a unique problem for safety that deserves being called "the" problem). In my thinking, the goals of an AGI system are always the direct or indirect consequences of the task it is given by some human who is authorized to give the system a task. If that is the case, the problem of conflicting goals is ultimately an issue of conflicting goals between humans. In your paperclip example, the system should reject the task of producing a trillion paperclips because that likely interferes with the foreseeable goals of other humans. I firmly believe we need to find a design feature that makes sure that the system rejects tasks that conflict with other human goals in this way. For the most powerful systems, we might have to do something like what davidad suggests in his Open Agency Architecture, where plans devised by the AGI need to be approved by some form of human jury. I believe such a system would reject almost all maximization-type goals and accept almost exclusively aspiration-type goals, and this is the reason why I want to find out how such a goal could then be fulfilled in a rather safe way.
Re quantilization/satisficing: I think that apart from the potentially conflicting goals issue, there are at least two more issues with plain satisficing/quantilization (understood as picking a policy uniformly at random from those that promise at least X return in expectation or are among the top X percent of the feasibility interval): (1) It might be computationally intractable in complex environments that require many steps, unless one finds a way to do that sequentially (i.e., from time step to time step). (2) The unsafe ways to fulfill the goal might not be scarce enough to have sufficiently small probability when choosing policies uniformly at random. The latter is the reason why I currently believe that the freedom to solve a given aspiration-type goal in all kinds of different ways should be used to select a policy that does so in a rather safe way, as judged on the basis of some generic safety criteria. This is why we also investigate in this project how generic safety criteria (such as those discussed for impact regularization in the maximization framework) should be integrated (see post #3 in the sequence).
"Hence the information what I will do cannot have been available to the predictor." If the latter statement is correct, then how could the predictor have "often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation"?
There are many possible explanations for this data. Let's say I start my analysis with the model that the predictor is guessing, and my model attaches some prior probability to them guessing right in a single case. I might also have a prior about the likelihood of being lied to about the predictor's success rate, etc. Now I make the observation that I am being told the predictor was right every single time in a row. Based on this incoming data, I can easily update my beliefs about what happened in the previous prediction exercises: I will conclude that (with some credence) the predictor guessed right in each individual case or that (also with some credence) I am being lied to about their prediction success. This is all very simple Bayesian updating, no problem at all. As long as my prior beliefs assign nonzero credence to the possibility that the predictor guesses right (and I see no reason why that shouldn't be a possibility), I don't need to assign any posterior credence to the (physically impossible) assumption that they could actually foretell the actions.
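To make the updating concrete, here is a minimal sketch in Python. All numbers are made up for illustration (the per-case guessing probability, the priors, and the number of reported successes are my assumptions, not part of the thought experiment), and the hypothesis names are just labels.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: j / total for h, j in joint.items()}

n_reported_successes = 10   # assumed: "right every single time" means 10 cases
p_single_guess = 0.5        # assumed chance that a pure guess is right

# Prior credences (hypothetical). True foresight gets prior 0 because
# I consider it physically impossible.
priors = {"lucky_guesser": 0.99, "lied_to": 0.01, "true_foresight": 0.0}

# Probability of being told "n out of n correct" under each hypothesis.
likelihoods = {
    "lucky_guesser": p_single_guess ** n_reported_successes,
    "lied_to": 1.0,
    "true_foresight": 1.0,
}

post = posterior(priors, likelihoods)
for hypothesis, credence in post.items():
    print(f"{hypothesis}: {credence:.4f}")
```

With these particular numbers most of the posterior mass moves to "lied_to", and "true_foresight" stays at exactly zero because its prior was zero. Other priors would shift the split between the first two hypotheses but not that last point.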
Take a possible world in which the predictor is perfect (meaning: they were able to make a prediction, and there was no possible extension of that world's trajectory in which what I will actually do deviates from what they have predicted). In that world, by definition, I no longer have a choice. By definition I will do what the predictor has predicted. Whatever has caused what I will do lies in the past of the prediction, hence in the past of the current time point. There is no point in asking myself now what I should do as I no longer have causal influence on what I will do. I can simply relax and watch myself doing what I have been caused to do some time before. I can of course ask myself what might have caused my action and try to predict from that what I will do. If I come to believe that it was myself who decided at some earlier point in time what I will do, then I can ask myself what I should have decided at that earlier point in time. If I believe that at that earlier point in time I already knew that the predictor would act in the way it did, and if I believe that I have made the decision rationally, then I should conclude that I have decided to one-box.
The original version of Newcomb's paradox in Nozick 1969 is not about a perfect predictor however. It begins with (1) "Suppose a being in whose power to predict your choices you have enormous confidence.... You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below". So the information you are given is explicitly only about things from the past (how could it be otherwise). It goes on to say (2) "You have a choice between two actions". Information (2) implies that what I will do has not been decided yet and I still have causal influence on what I will do. Hence the information what I will do cannot have been available to the predictor. This implies that the predictor cannot have made a perfect prediction about my behaviour. Indeed nothing in (1) implies that they have, the information given is not about my future action at all. After I have made my decision, it might turn out, of course, that it happens to coincide with what the predictor has predicted. But that is irrelevant for my choice as it would only imply that the predictor will have been lucky this time. What should I make of information (1)? If I am confident that I still have a choice, that question is of no significance for the decision problem at hand and I should two-box. If I am confident that I don't have a choice but have decided already, the reasoning of the previous paragraph applies and I should hope to observe that I will one-box.
What if I am unsure whether or not I still have a choice? I might have the impression that I can try to move my muscles this way or that way, without being perfectly confident that they will obey. What action should I then decide to try? I should decide to try two-boxing. Why? Because that decision is the dominant strategy: if it turns out that indeed I can decide my action now, then we're in a world where the predictor was not perfect but merely lucky and in that world two-boxing is dominant; if it instead turns out that I was not able to override my earlier decision at this point, then we're in a world where what I try now makes no difference. In either case, trying to two-box is undominated by any other strategy.
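The dominance claim can be checked mechanically. The sketch below is only an illustration: it uses the customary payoffs of $1,000 and $1,000,000 (an assumption, they are not stated above), and it encodes my premise that the content of the opaque box is already fixed at the moment I try to act.

```python
SMALL, BIG = 1_000, 1_000_000  # assumed customary Newcomb payoffs

def payoff(action, box_filled):
    """Money received for the action actually performed."""
    opaque = BIG if box_filled else 0
    transparent = SMALL if action == "two-box" else 0
    return opaque + transparent

def outcome(attempt, world):
    """world = (attempt_is_effective, box_filled, predetermined_action)."""
    attempt_is_effective, box_filled, predetermined_action = world
    # If I cannot override my earlier decision, my attempt changes nothing.
    action = attempt if attempt_is_effective else predetermined_action
    return payoff(action, box_filled)

worlds = [
    (effective, filled, predetermined)
    for effective in (True, False)
    for filled in (True, False)
    for predetermined in ("one-box", "two-box")
]

never_worse = all(
    outcome("two-box", w) >= outcome("one-box", w) for w in worlds
)
sometimes_better = any(
    outcome("two-box", w) > outcome("one-box", w) for w in worlds
)
print(never_worse, sometimes_better)
```

Trying to two-box is never worse in any of the eight worlds and strictly better in those where the attempt is effective, which is all the argument needs. Anyone who rejects the premise that the box content is independent of my attempt will of course reject this table as well.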
Can you please explain the "zero-probability possible world"?
Hi Nathan,
I'm not sure. I guess it depends on what your definition of "agent" is. In my personal definition, following Yann LeCun's recent whitepaper, the "agent" is a system with a number of different modules, one of them being a world model (in our case, an MDP that it can use to simulate consequences of possible policies), one of them being a policy (in our case, an ANN that takes states as inputs and gives action logits as outputs), and one module being a learning algorithm (in our case, a variant of Q-learning that uses the world model to learn a policy that achieves a certain goal). The goal that the learning algorithm aims to find a suitable policy for is an aspiration-based goal: make the expected return equal some given value (or fall into some given interval). As a consequence, when this agent behaves like this very often in various environments with various goals, we can expect it to meet its goals on average (under mild conditions on the sequence of environments and goals, such as sufficient probabilistic independence of stochastic parts of the environment and bounded returns, so that the law of large numbers applies).
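As a toy illustration of what "make the expected return equal some given value" means, and of the averaging argument, here is a minimal sketch. It is not our Q-learning variant: it simply randomizes between two hypothetical policies whose expected returns (`lo` and `hi`) are assumed to be known, and the bounded noise model is made up.

```python
import random

def mixing_probability(lo, hi, target):
    """Probability of using the 'hi' policy so the mixture's mean is target."""
    if not lo <= target <= hi:
        raise ValueError("aspiration must lie in the feasible interval")
    return 0.5 if hi == lo else (target - lo) / (hi - lo)

def run_episode(rng, lo, hi, p):
    """One episode: pick a policy at random, then add bounded noise."""
    mean = hi if rng.random() < p else lo
    return mean + rng.uniform(-1.0, 1.0)

rng = random.Random(0)
lo, hi, target = 2.0, 10.0, 5.0   # hypothetical feasible interval and aspiration
p = mixing_probability(lo, hi, target)

returns = [run_episode(rng, lo, hi, p) for _ in range(100_000)]
average = sum(returns) / len(returns)
print(f"p = {p}, average return = {average:.3f}, aspiration = {target}")
```

Individual episodes land near 2 or near 10, never near 5. Only the long-run average approaches the aspiration, which is exactly the sense in which the goal is met "on average".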
Now regarding your suggestion that the learned policy (what you call the frozen net I think) could be checked by humans before being used: that is a good idea for environments and policies that are not too complex for humans to understand. In more complex cases, one might want to involve another AI that tries to prove the proposed policy is unsafe for reasons not taken into account in selecting it in the first place, and one can think of many variations in the spirit of "debate" or "constitutional AI" etc.
Excellent! I have three questions:
1. How would we get to a certain upper bound on ?
2. As collisions with the boundary happen exactly when one action's probability hits zero, it seems the resulting policies are quite large-support, hence quite probabilistic, which might be a problem in itself, making the agent unpredictable. What is your thinking about this?
3. Related to 2., it seems that while your algorithm ensures that expected true return cannot decrease, it might still lead to quite low true returns in individual runs. So do you agree that this type of algorithm is a safety ingredient amongst other ingredients, rather than meant to be a sufficient solution to safety?
I'm sorry but I fail to see the analogy to momentum or Adam, in neither of which the vector or distance from the current point to the initial point plays any role as far as I can see. It is also different from regularizations that modify the objective function, say to penalize moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.
I like the clarity of this post very much! Still, we should be aware that all this hinges on what exactly we mean by "the model".
If "the model" only refers to one or more functions, like a policy function pi(s) and/or a state-value function V(s) and/or a state-action value function Q(s,a) etc., but does not refer to the training algorithm, then all you write is fine. This is how RL theory uses the word "model".
But some people here also use the term "the model" in a broader sense, potentially including the learning algorithm that adjusts said functions, and in that case "the model" does see the reward signal. A better and more common term for the combination of model and learning algorithm is "agent", but some people seem to be a little sloppy in distinguishing "model" and "agent". One can of course also imagine architectures in which the distinction is less clear, e.g., when the whole "AI system" consists of even more components such as several "agents", each of which uses different "models". Some actor-critic systems can for example be interpreted as systems consisting of two agents (an actor and a critic). And one can also imagine hierarchical systems in which a parameterized learning algorithm used in the low-level component is adjusted by a (hyper-)policy function on a higher level that is learned by a 2nd-level learning algorithm, which might as well be hyperparameterized by an even higher-level learned policy, and so on, up towards one final "base" learning algorithm that was hard-coded by the designer.
So, in the context of AGI or ASI, I believe the concept of an "AI system" is the most useful term in this ontology, as we cannot be sure what the architecture of an ASI will be, how many "agents" and "policies" on how many "hierarchical levels" it will contain, what their division of labor will be, and how many "models" they will use and adjust in response to observations in the environment.
In summary, as the outermost-level learning algorithm in such an "AI system" will generally see some form of "reward signal", I believe that most statements that are imprecisely phrased in terms of a "model" getting "rewarded" can be fixed by simply replacing the term "model" by "AI system".
replacing the SGD with something that takes the shortest and not the steepest path
Maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, take a small step in a direction that has the minimal scalar product with x – x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.
I believe this is not correct as long as "gradient-descend" means some standard version of gradient descent because those are all local, can follow highly nonlinear paths, and do not memorize the initial value to try staying close to it.
But maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, take a small step in a direction that has the minimal scalar product with x – x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
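Here is a minimal sketch of one such step, in pure Python. It rests on a few assumptions of mine: I read "the current gradient" as the descent direction (the negative gradient), I scale the step by the gradient norm, and in the degenerate case where x – x0 is parallel to the gradient I simply fall back to steepest descent. The function names are hypothetical. The direction is obtained in closed form by rotating the descent direction towards x0 – x by at most alpha.

```python
import math

def cone_descent_direction(grad, x, x0, alpha):
    """Unit vector d within angle alpha of -grad minimizing d . (x - x0)."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    if g_norm == 0.0:
        return [0.0] * len(x)                 # stationary point: do not move
    u = [-g / g_norm for g in grad]           # steepest-descent unit vector
    w = [xi - x0i for xi, x0i in zip(x, x0)]  # displacement from the start
    a = sum(wi * ui for wi, ui in zip(w, u))  # component of w along u
    w_perp = [wi - a * ui for wi, ui in zip(w, u)]
    b = math.sqrt(sum(p * p for p in w_perp))
    if b < 1e-12:
        return u                              # assumed fallback: steepest descent
    v = [-p / b for p in w_perp]              # orthogonal to u, pointing back to x0
    # Rotate u towards -w, but by no more than alpha.
    theta = min(alpha, math.pi - math.atan2(b, a))
    return [math.cos(theta) * ui + math.sin(theta) * vi for ui, vi in zip(u, v)]

def cone_descent_step(grad_fn, x, x0, alpha=0.5, step=0.05):
    """One update; grad_fn may return a stochastic gradient estimate."""
    grad = grad_fn(x)
    g_norm = math.sqrt(sum(g * g for g in grad))
    d = cone_descent_direction(grad, x, x0, alpha)
    return [xi + step * g_norm * di for xi, di in zip(x, d)]

# Gradient points along +x, we have drifted "up" from the start at the origin:
d = cone_descent_direction([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], math.pi / 4)
print(d)  # tilted by alpha away from pure descent, back towards the start
```

For alpha below 90 degrees every such direction is still a descent direction, so a small enough step decreases the objective. Whether the iteration actually ends up at the minimum closest to x0 is exactly the open question, and this sketch makes no claim about it.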
Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?
I'm asking this because
- I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that "optimally" saves the world in some way we did not foresee and don't like.
- I try to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now in order to buy time for your plan to be put into practice :-)
Definition 4: Expectation w.r.t. a Set of Sa-Measures
This definition is obviously motivated by the plan to later apply some version of maximin rule, so that only the inf matters.
I suggest that we also study versions that employ other decision-under-ambiguity rules such as Hurwicz' rule or Savage's minimax regret rule.
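To illustrate how much the choice of rule can matter, here is a small sketch on a finite payoff table. This is a deliberately simplified setting of my own (a handful of actions and scenarios with made-up payoffs), not the sa-measure formalism of the post, and the optimism weight 0.5 for Hurwicz' rule is an arbitrary choice.

```python
# Rows: actions. Columns: payoffs under three scenarios (hypothetical numbers).
payoffs = {
    "A": [2, 2, 2],
    "B": [0, 5, 6],
    "C": [1, 1, 9],
}

def maximin(table):
    """Pick the action whose worst-case payoff is largest."""
    return max(table, key=lambda a: min(table[a]))

def hurwicz(table, optimism):
    """Weigh best case by `optimism` and worst case by 1 - optimism."""
    def score(a):
        return optimism * max(table[a]) + (1 - optimism) * min(table[a])
    return max(table, key=score)

def minimax_regret(table):
    """Pick the action whose largest regret over scenarios is smallest."""
    n_scenarios = len(next(iter(table.values())))
    best = [max(table[a][s] for a in table) for s in range(n_scenarios)]
    def worst_regret(a):
        return max(best[s] - table[a][s] for s in range(n_scenarios))
    return min(table, key=worst_regret)

print(maximin(payoffs), hurwicz(payoffs, 0.5), minimax_regret(payoffs))
```

On this table the three rules pick three different actions (A, C and B respectively), so swapping the rule is not a cosmetic change.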
From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with an LLM) could be designed that we could then simply tell to "make me a tea of this quantity and quality within this time and with this probability" and it would attempt to do just that, without trying to make more or better tea or faster or with higher probability.
even when the agents are unable to explicitly bargain or guarantee their fulfilment of their end by external precommitments
I believe there is a misconception here. The actual game you describe is the game between the programmers, and the fact that they know in advance that the others' programs will indeed be run with the code that their own program has access to does make each program submission a binding commitment to behave in a certain way.
Game theory has long known that if binding commitments are possible, most dilemmas can be solved easily. In other words, I believe this is very nice but is quite far from being the "huge success" you claim it is.
Put differently: The whole thing depends crucially on the fact that X can be certain that Y will use the strategy (=code) X thinks it will use. But how on Earth would a real agent ever be able to know such a thing about another agent?
I just stumbled upon this and noticed that a real-world mechanism for international climate policy cooperation that I recently suggested in this paper can be interpreted as a special case of your (G,X,Y) framework.
Assume a fixed game G where
- each player's action space is the nonnegative reals,
- U(x,y) is weakly decreasing in x and weakly increasing in y,
- V(x,y) is weakly decreasing in y and weakly increasing in x.
(Many public goods games, such as the Prisoners' Dilemma, have such a structure.)
Let's call an object a Conditional Commitment Function (CCF) iff it is a bounded, continuous, and weakly increasing function from the nonnegative reals into the nonnegative reals. (Intended interpretation of a CCF C: If opponent agrees to do y, I agree to do any x that has x <= C(y))
Now consider programs of the following kind:
C = <some CCF>
if code(opponent) equals code(myself) except that C is replaced by some CCF D:
    output the largest x >= 0 for which there is a y <= D(x) with x <= C(y)
else:
    output 0
Let's denote this program Z(C), where C is the CCF occurring in line 1 of the program. Finally, let's consider the meta-game where two programmers A and B, knowing G, each simultaneously choose a CCF and submit the corresponding program Z(C). The two programs are then executed once to determine actions (x,y), and A gets U(x,y) while B gets V(x,y).
(In the real world, the "programmers" could be the two parliaments of two countries that pass two binding laws (the "programs"), and the actions could be domestic levels of greenhouse gas emissions reductions.)
In our paper, we prove that the outcomes that will result from the strong Nash equilibria of this meta-game are exactly the Pareto-optimal outcomes (x,y) that both programmers prefer to the outcome (0,0).
(In an N (instead of 2) player context, the outcomes of strong Nash equilibria are exactly the ones from a certain version of the underlying base game's core, a subset of the Pareto frontier that might however be empty).
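For intuition, the outcome of running Z(C) against Z(D) can be approximated numerically. The sketch below is not from the paper: it uses a plain grid search, two made-up CCFs, and an assumed upper bound on actions. It relies on the observation that, because C is weakly increasing, "there is a y <= D(x) with x <= C(y)" holds exactly when x <= C(D(x)). Only the branch in which both programs have the Z(·) form is covered.

```python
def largest_supported_action(own_ccf, other_ccf, upper_bound, steps=10_000):
    """Largest grid point x in [0, upper_bound] with x <= own_ccf(other_ccf(x))."""
    best = 0.0
    for i in range(steps + 1):
        x = upper_bound * i / steps
        if x <= own_ccf(other_ccf(x)) + 1e-12:
            best = x
    return best

# Two hypothetical CCFs (bounded, continuous, weakly increasing):
def C(y):
    """A: 'I match whatever you do, up to 3.'"""
    return min(y, 3.0)

def D(x):
    """B: 'I do one more than you, up to 5.'"""
    return min(x + 1.0, 5.0)

x_star = largest_supported_action(C, D, upper_bound=10.0)  # output of Z(C)
y_star = largest_supported_action(D, C, upper_bound=10.0)  # output of Z(D)
print(x_star, y_star)
```

In this example the two outputs are mutually consistent: each player's action is exactly what their own CCF promises in response to the other's action.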
I'd be interested in learning whether you think this is an interesting application context to explore the theories you discuss.
Dear Robert, I just found out about your work and absolutely love it.
Has the following idea been explored yet?
- The AI system is made of two agents, a strategic agent S and a controller agent C.
- S's reward function approximates the actual objective function of the system as defined by the designer.
- S can only propose actions to C, only knows what C tells it about the environment and the actions actually taken, and only has as many compute resources as C gives it.
- C's reward function encodes hard constraints such as the three laws of robotics or some other formal ethical constraint system, in the form of a binary reward (1 for compliance, 0 for non-compliance).
- C has access to the actual observations and has the power to either actually take the action proposed by S or not.
- In addition, C is free to tell S anything regarding whether it actually took the proposed action and what the observations are, and can curtail S's compute resources to avoid being outsmarted by S.
- If indifferent in light of its reward function, C will take the proposed action, will be honest about observations, and will not curtail resources (but will not get a positive reward from this because that could be exploited by S).
Having just read Scott's Geometric Expectation stuff, I want to add that of course another variant of all of this is to replace every occurrence of a mean or expectation by a geometric mean or geometric expectation to make the whole thing more risk-averse.
In its suggested form Maximal Lottery-Lotteries is still a majoritarian system in the sense that a mere majority of 51% of the voters can make sure that candidate A wins regardless how the other 49% vote. For this, they only need to give A a rating of 1 and all other candidates a rating of 0.
One can also turn the system into a non-majoritarian system in which power is distributed proportionally in the sense that any group of x% of the voters can make sure that candidate A gets at least x% winning probability, similar to what is true of the MaxParC voting system used in vodle.
The only modification needed to achieve this is to replace (the set of all lotteries on C) in your formula by the set of those lotteries on C which every single ballot rates at least as good as the benchmark lottery. In this, the benchmark lottery is the lottery of drawing one ballot uniformly at random and electing the highest-rated candidate (as in the "random ballot" or "random dictator" method).
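The restricted set of lotteries is easy to compute for small examples. In the sketch below, ballots are dictionaries of ratings, a ballot's rating of a lottery is taken to be its expected rating, and ties for a ballot's top candidate are split evenly. These representation choices and the example electorate are my assumptions for illustration.

```python
def benchmark_lottery(ballots):
    """Random-ballot lottery: draw a ballot uniformly, elect its top candidate."""
    lottery = {c: 0.0 for c in ballots[0]}
    for ballot in ballots:
        top = max(ballot.values())
        winners = [c for c, r in ballot.items() if r == top]
        for c in winners:  # assumed tie-breaking: split evenly
            lottery[c] += 1.0 / (len(ballots) * len(winners))
    return lottery

def expected_rating(ballot, lottery):
    return sum(lottery[c] * ballot[c] for c in lottery)

def is_admissible(lottery, ballots, tol=1e-12):
    """True if every ballot rates the lottery at least as high as the benchmark."""
    bench = benchmark_lottery(ballots)
    return all(
        expected_rating(b, lottery) >= expected_rating(b, bench) - tol
        for b in ballots
    )

# Hypothetical electorate: a two-thirds majority for A, a minority for C,
# and a compromise candidate B that everyone rates highly.
ballots = [
    {"A": 1.0, "B": 0.9, "C": 0.0},
    {"A": 1.0, "B": 0.9, "C": 0.0},
    {"A": 0.0, "B": 0.9, "C": 1.0},
]
print(benchmark_lottery(ballots))
print(is_admissible({"A": 1.0, "B": 0.0, "C": 0.0}, ballots))  # majority favourite
print(is_admissible({"A": 0.0, "B": 1.0, "C": 0.0}, ballots))  # compromise
```

Here electing A for sure is not admissible because the minority ballot rates it below the benchmark, while electing the compromise B for sure is. So the majority can no longer force A just by rating everything else 0.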