Behavioral Sufficient Statistics for Goal-Directedness
post by adamShimi · 2021-03-11T15:01:21.647Z · 12 comments
Note: this is a new version, with a new title, of my recent post "A Behavioral Definition of Goal-Directedness". Most of the formulas are the same, except for the triviality one that deals better with what I wanted; the point of this rewrite is to present the ideas in a perspective that makes sense. I'm not proposing a definition of goal-directedness, but just sufficient statistics on the complete behavior that make a behavioral study of goal-directedness more human-legible.
I also use this new version as a first experiment in another approach to feedback: this post includes a lot of questions asked through the elicit prediction feature. A lot. I definitely tried to overshoot the reasonable number to add, to compensate for my tendency to never use them. But don't worry: whether or not there were too many questions will be the subject of another question at the end!
Introduction
In a previous post, I argued for the study of goal-directedness in two steps:
 Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
 Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
Intuitively, understanding goal-directedness should mean knowing which questions to ask about the complete behavior of the system to determine its goal-directedness. Here the “complete” part is crucial; it simplifies the problem by removing the need to infer what the system will do based on limited behavior. Similarly, we don’t care about the tractability/computability of the questions asked; the point is to find what to look for, without worrying (yet) about how to get it.
Despite these simplifications, the behavioral approach still suffers from one massive problem: it's not human-legible. We don't know what to do with this mass of loosely structured information, and have slim hopes of finding the right angle or question by sheer luck.
This post addresses this problem: it proposes human-legible sufficient statistics on this complete behavior that should be enough to deconfuse and clarify most questions about goal-directedness. The next posts then use these statistics to explore a formal understanding of goal-directedness.
Eventually we will have to rely on internal structure; but knowing the property to derive/approximate beforehand should help quite a lot.
Thanks to Joe Collman and Michele Campolo for helpful feedback and discussion on this post.
Preliminaries
Let’s start with the formalisation of the environment. The interface is defined by a set of observations and a set of actions. We have a finite set of environments, which are just finite deterministic POMDPs with no reward, sharing these observations and actions, with a uniform distribution over initial states. Each environment has its own set of states.
My sufficient statistics for goal-directedness actually extend to more reasonable settings (stochastic POMDPs and a general distribution over initial states) straightforwardly, but I start with the simpler deterministic case to get the intuitions right. On the other hand, the assumption that the set of environments is finite (although maybe intractably big) is kept throughout the post because it ensures, without additional work, the well-definedness of some expressions. There might be a way to extend the sufficient statistics to the countable case, but that’s beyond the scope of this post.
The system we study is given by a program that takes as inputs the successive observations and returns the action taken. I use a program in place of a function from histories to actions because it hides the internal state (that I don’t use) while retaining the expressiveness of such a computable function. We can query the behavior of the system on any environment of the set by giving an initial state and seeing what happens; we can also ask potentially uncomputable questions about this behavior (as long as they are well-defined).
Now, when we call a system goal-directed, we usually have a goal for it in mind. The subtlety about a behavioral definition is that we can’t just look inside the model to find the goal; we somehow have to infer goals from the behavior. This is made easier in the setting of this post because we have access to all the behavior and uncomputable procedures, but we still have to do it.
In fact, the sufficient statistics for goal-directedness of the system talk about all possible goals. More specifically, for each goal, I define a vector of numbers called focus, capturing how coherent the goal is with the behavior of the system.
The next section... focuses on defining and motivating this vector.
Focus of a goal
A goal is a function from an environment to a subset of that environment's states. That is, a goal gives for each environment the states to reach. This form is certainly limited; yet it captures enough intuitive goals to not be trivial. Another important constraint is that every goal considered must have a Kolmogorov complexity K no greater than that of the system. What this means is that the goal doesn't just capture the states that the system ends up in for each environment by simulating the system in that environment; if that were the case, then the smallest program implementing the goal would be more complex than the smallest program implementing the system, and we forbid that.
For each goal, its focus for the system is a 4-tuple capturing important properties of the system and the goal. The last three correspond to the last three intuitions (without the farsightedness) about goal-directedness from the literature review that we wrote with Joe Collman and Michele Campolo.
Complexity factor
This is just the Kolmogorov complexity of the goal, rescaled into a fixed range. There’s not much more to say about it. It’s just useful to compare goals that are close or equal on the other factors, to reason about which one is more likely to emerge from training.
Generalization factor
This second element of the focus, the generalization factor, captures how much the system reaches the goal over the set of environments. The formula is the following:
[formula not preserved in this copy]
where the average ranges over the states of each environment from which some goal state is reachable, and the triviality term measures the time it takes for the random uniform policy to put a given percentage of the probability mass on goal states. A bit more formally, if we start with a probability distribution over states with 1 on the initial state and 0 everywhere else, and then update that probability distribution according to the random uniform policy and the environment, the triviality term captures the first time (if any) at which the probability distribution puts more than that percentage of the probability mass on goal states. (It’s more involved than just “the probability that the random uniform policy reaches a goal state” because the simple version trivially goes to 1 in a lot of simple and finite cases.)
The intuition of the formula is straightforward: it’s the average generalization of the system for the goal over the set of environments. The expression averaged is the indicator of whether the system reaches a goal state, minus the "triviality" of the goal (a measure of how difficult it is to reach a goal state). Thanks to this correction (for a good choice of the percentage threshold, which I don't know how to make and motivate yet), trivial goals, like the one outputting the whole state set of each environment, don't generalize well despite being trivially reachable.
A high generalization means that the system reaches a goal state most of the time; a small one that it rarely does. In the former case it makes more sense to consider the goal as a goal of the system.
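As a minimal sketch of the time measurement behind the triviality term (not the exact formula of the post), the following Python pushes a probability distribution forward under the random uniform policy in a deterministic environment and returns the first step at which more than a threshold of the mass sits on goal states. The function name, the `transitions[state][action]` encoding, the `max_steps` cutoff, and the choice to treat goal states as absorbing are all assumptions made for illustration.

```python
from fractions import Fraction


def first_time_mass_on_goal(transitions, start, goal_states, threshold, max_steps=1000):
    """Return the first step t at which the random uniform policy has put
    more than `threshold` of the probability mass on goal states, or None.

    transitions[s][a] is the (deterministic) successor of state s under action a.
    Assumption: goal states are absorbing, so mass that reaches them stays there.
    """
    dist = {start: Fraction(1)}  # all the mass starts on the initial state
    for t in range(max_steps + 1):
        mass_on_goal = sum(p for s, p in dist.items() if s in goal_states)
        if mass_on_goal > threshold:
            return t
        next_dist = {}
        for s, p in dist.items():
            if s in goal_states:
                # Absorbing goal state: keep its mass in place.
                next_dist[s] = next_dist.get(s, Fraction(0)) + p
                continue
            successors = list(transitions[s].values())
            share = p / len(successors)  # uniform over the available actions
            for s2 in successors:
                next_dist[s2] = next_dist.get(s2, Fraction(0)) + share
        dist = next_dist
    return None  # threshold never exceeded within max_steps


# Hypothetical three-state chain: "go" moves right, "stay" does nothing.
chain = {
    0: {"stay": 0, "go": 1},
    1: {"stay": 1, "go": 2},
    2: {"stay": 2, "go": 2},
}
steps_needed = first_time_mass_on_goal(chain, 0, {2}, Fraction(1, 2))
```

Exact fractions are used so that the comparison against the threshold is not affected by floating-point rounding.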
Efficiency factor
This third element of the focus, the efficiency factor, captures how efficiently the system reaches the goal in the (environment, initial state) pairs. The formula is the following:
[formula not preserved in this copy]
where each term is the ratio between the number of steps taken by the optimal policy for the goal to reach a goal state starting at a given initial state, and the number of steps taken by the system to reach a goal state starting at the same initial state.
It’s pretty straightforward; the only subtlety is that the so-called optimal policy is the optimal policy for the reward defined by the goal (-1 for any non-goal state, 0 for a goal state, after which the episode stops), and for all environments in the set. Now, there might be multiple optimal policies (privileging different environments but getting the same expected return over the set). I'm fine with using the one that maximizes this factor. Doing so means comparing the system with the optimal policy for the goal that is most similar to it in some sense.
While the generalization factor captures the proportion of environments in which the system reaches a goal state, the efficiency factor captures how fast it does so compared to the optimal policy for the goal.
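The ratio described above can be sketched for a single (environment, initial state) pair as follows. This is only an illustration under simplifying assumptions: the environment is fully observed (the policy sees the state directly), the optimal step count is obtained by breadth-first search, and a pair where the policy never reaches the goal within `time_limit` is scored 0. All function and parameter names are hypothetical.

```python
from collections import deque


def optimal_steps(transitions, start, goal_states):
    """Fewest steps from start to any goal state (breadth-first search), or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        if state in goal_states:
            return depth
        for successor in transitions[state].values():
            if successor not in seen:
                seen.add(successor)
                queue.append((successor, depth + 1))
    return None


def policy_steps(transitions, policy, start, goal_states, time_limit):
    """Steps the policy takes to reach a goal state, or None if it exceeds time_limit."""
    state = start
    for t in range(time_limit + 1):
        if state in goal_states:
            return t
        state = transitions[state][policy(state)]
    return None


def efficiency_ratio(transitions, policy, start, goal_states, time_limit=1000):
    """Optimal step count divided by the policy's step count, in [0, 1]."""
    best = optimal_steps(transitions, start, goal_states)
    taken = policy_steps(transitions, policy, start, goal_states, time_limit)
    if best is None or taken is None:
        return 0.0  # assumption: unreached goals count as zero efficiency
    if taken == 0:
        return 1.0  # already at a goal state
    return best / taken


# Hypothetical environment with a one-step shortcut and a three-step detour.
detour = {
    0: {"short": 3, "long": 1},
    1: {"short": 2, "long": 2},
    2: {"short": 3, "long": 3},
    3: {"short": 3, "long": 3},
}
slow_ratio = efficiency_ratio(detour, lambda s: "long", 0, {3})
fast_ratio = efficiency_ratio(detour, lambda s: "short", 0, {3})
```

Averaging such ratios over all pairs would give one number per goal; how the post's formula weights the pairs is not recoverable here, so that step is left out.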
Explainability factor
This last element of the focus, the explainability factor, captures how well explained the system is by assuming it is directed towards the goal. The formula is the following:
[formula not preserved in this copy]
where the prediction term measures the average deviation of the system from the actions favored by the action-value function of a second policy, described below.
There are many details and subtleties to unravel here.
 (The policy) Another policy is used in computing the prediction fitness for the goal. It is obtained by doing RL on the reward defined by the goal (see the section on efficiency for more details); importantly, it’s not necessarily an optimal policy for this reward. It can be any policy that results from RL training (when you stop after some time), with the caveat that it must do some fixed factor better than the random uniform policy (in terms of expected return). This factor is a parameter that seems important (to not allow policies without any direction towards the goal), but I have no idea how to decide on its value.
The reason for the use of this policy instead of an optimal policy for the reward is that explainability by a “decent” policy moving towards the goal seems just as good to me.
 (The time limit) This is used for normalizing the prediction using the goal. If the system actually reaches the goal (and thus ends the episode), then the time limit is just the length of this episode. If the system never finishes, then it is some arbitrary time limit after which the episode ends anyway.
Explainability captures Daniel Dennett’s intentional stance through RL: it measures how well-explained the system is through some learned RL method on the goal. There’s one aspect that is missing though: the comparison with the best prediction of simple high-level models (Dennett’s design stance). This other sufficient statistic depends only on the system, so it’s not defined in this section.
Complete Sufficient Statistics
The sufficient statistics on the behavior of the system for studying goal-directedness are then the focus for each goal satisfying the complexity constraint, as well as the explainability of the system in terms of a simple gears-level model:
[formula not preserved in this copy], where the model is a program capturing a simple model of the system (with an assumed upper bound on its Kolmogorov complexity); the prediction term is the proportion of time steps where the action taken by the system in its trajectory starting at a given initial state is the action taken by the model.
When you compare it to the prediction using the goal, the prediction using a simple model looks very binary: either the action taken is the one the system takes or it’s completely wrong. We can rephrase this as saying that the actions are all equally far from each other.
This looks like a reasonable default assumption, but it doesn’t always hold. For example, if the actions are the angles by increments of 5 degrees, then getting the angle just next to the one the system took sure seems closer than taking the opposite one. That is, there is a notion of similarity for the action space. In these cases the definition should probably be altered to use that notion of similarity (and normalize it) instead of the binary default.
Now, given a goal, we can compute the difference between the two explainabilities. Being close to one extreme means being better explained by a simple model, being close to the other extreme means being better explained by the goal, and being close to zero means that both are equally good (or bad) at predicting the system.
What I hope for these sufficient statistics is that they can provide more than just a simple number telling you how goal-directed the system is. They also allow us to think more clearly about many questions related to goals, like the importance of explainability, convergent instrumental subgoals, and inner alignment. The next posts in this sequence explore these in more detail, but we can look quickly at one example (to be revisited later in the sequence) now.
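To make the contrast concrete, here is a small sketch comparing the binary default (an action either matches or it doesn't) with a similarity-based score for the angle example. The normalization of angular distance to a similarity in [0, 1] is one possible choice among many, picked for illustration; the function names are hypothetical.

```python
def match_rate(system_actions, model_actions):
    """Proportion of steps where the model's action equals the system's action."""
    if len(system_actions) != len(model_actions) or not system_actions:
        raise ValueError("need two non-empty action sequences of equal length")
    matches = sum(a == b for a, b in zip(system_actions, model_actions))
    return matches / len(system_actions)


def angular_similarity(a_degrees, b_degrees):
    """Similarity in [0, 1]: 1 for identical angles, 0 for opposite ones."""
    diff = abs(a_degrees - b_degrees) % 360
    diff = min(diff, 360 - diff)  # shortest way around the circle
    return 1 - diff / 180


def similarity_rate(system_actions, model_actions, similarity):
    """Average similarity between the system's and the model's actions."""
    if len(system_actions) != len(model_actions) or not system_actions:
        raise ValueError("need two non-empty action sequences of equal length")
    total = sum(similarity(a, b) for a, b in zip(system_actions, model_actions))
    return total / len(system_actions)


# Hypothetical trajectories of angle actions (in degrees).
system_angles = [0, 5, 90]
model_angles = [0, 10, 270]
binary_score = match_rate(system_angles, model_angles)
graded_score = similarity_rate(system_angles, model_angles, angular_similarity)
```

On this example the binary score only credits the first step, while the graded score also gives almost full credit to the 5-degree miss and none to the opposite angle.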
The one true goal
In some settings, knowing if a clear goal exists, and what it is, matters. Maybe we’re worried about the focus on too narrow a goal, and what it entails. Can we adapt the core definition of goal-directedness to this application?
My current intuition is that this most representative goal should primarily depend on generalization. It should matter more because a goal with better generalization is a goal that points more often in the right direction. This leaves us two cases:
 If there is a goal with a massive lead on generalization (something like 2x the second largest generalization), then I think we should go with that one.
 If there isn't, then we lack a clear representative goal.
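The two cases can be written down as a tiny selection rule. This is only a sketch of the intuition: the 2x lead is the example figure from the text, exposed here as a hypothetical `lead` parameter, and the rule assumes that the best generalization score is positive.

```python
def representative_goal(generalization, lead=2.0):
    """Return the goal whose generalization is at least `lead` times the
    runner-up's, or None when no goal has such a lead.

    generalization maps goal names to generalization scores.
    Assumption: a goal only qualifies if its own score is positive.
    """
    ranked = sorted(generalization.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best_goal, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score > 0 and best_score >= lead * second_score:
        return best_goal
    return None


clear_case = representative_goal({"reach_exit": 0.9, "collect_keys": 0.4, "idle": 0.1})
unclear_case = representative_goal({"reach_exit": 0.6, "collect_keys": 0.4})
```

The goal names are made up; with a lead of 2, the first call picks out a goal and the second does not.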
What’s even more exciting is that framing it that way highlights important questions: what if there are two goals, both with far more generalization, and one with even more than the other? In the absence of a representative goal, are all goals the same or is there a relevant hierarchy?
All these questions, and more, will be solv… addressed at the very least in the following posts of the sequence.
Stochastic version
As promised, I'll explain how to get from the deterministic case above to a more realistic stochastic one. The changes considered are: the system becomes a stochastic policy returning a distribution over actions; the environments become stochastic POMDPs with a stochastic transition function and a stochastic observation function (returning the observation for a given state); and there is a distribution over initial states for each environment.
Here are the changes necessary for the computation of each factor of the focus (no change necessary for complexity, as it just depends on the program itself):
 (Generalization factor) The indicator goes from a binary value to the probability that the system eventually reaches a goal state, computed by extracting a distribution over histories from the distribution over initial states, the POMDP and the policy. Then we take the probability of getting a history that reaches a goal state.
 (Efficiency factor) The ratio goes through similar changes, where the time taken becomes an expected value over the distributions on histories generated.
 (Explainability factor) The prediction error computed now compares distributions at each step. But that's doable with something like KL divergence (maybe we want something different if we allow distributions with 0 probability, which might make KL divergence... diverge).
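For the explainability change, here is a hedged sketch of a per-step comparison between two action distributions using KL divergence. The direction of the divergence (system relative to prediction) and the `epsilon` smoothing, which mixes the prediction with a uniform distribution to avoid the divergence blowing up on zero-probability actions, are assumptions for illustration rather than choices made in the post.

```python
import math


def kl_divergence(p, q, epsilon=0.0):
    """KL(p || q) for two distributions given as {action: probability} dicts.

    With epsilon > 0, q is mixed with the uniform distribution over the
    union of actions, which keeps the result finite.
    """
    actions = set(p) | set(q)
    uniform = 1.0 / len(actions)
    total = 0.0
    for action in actions:
        p_a = p.get(action, 0.0)
        q_a = (1 - epsilon) * q.get(action, 0.0) + epsilon * uniform
        if p_a == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if q_a == 0.0:
            return math.inf  # p puts mass where q puts none
        total += p_a * math.log(p_a / q_a)
    return total


# Hypothetical one-step distributions over two actions.
system_step = {"left": 1.0}
predicted_step = {"right": 1.0}
unsmoothed = kl_divergence(system_step, predicted_step)
smoothed = kl_divergence(system_step, predicted_step, epsilon=0.1)
```

Without smoothing the two disjoint distributions give an infinite divergence; with smoothing the value is finite but depends on `epsilon`, which is one more free parameter to justify.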
Conclusion
I proposed sufficient statistics over the complete behavior of a system encoding relationships with goals (about reaching some states). These properties are the complexity of the goal, the generalization towards the goal, the efficiency of the system when it generalizes, and how well it is explained by a goal-based model. There’s an additional sufficient statistic for the system in general, about how well it is explained by a simple gears-level model.
We can’t compute these directly for concrete systems, as they rely on the knowledge of the complete behavior, and ask many questions that might be uncomputable or at best intractable.
Nonetheless, I believe this is progress. Instead of arguing without common grounding, we can argue using these statistics. And any deeper understanding of goal-directedness we get will provide a guiding light to checking the goal-directedness of some actual AIs.
The next posts in this sequence will explore what can be done with these statistics.
12 comments
Comments sorted by top scores.
comment by johnswentworth · 2021-03-11T18:51:14.111Z
I think you are very confused about the conceptual significance of a "sufficient statistic".
Let's start with the prototypical setup of a sufficient statistic. Suppose I have a bunch of IID variables drawn from a maximum-entropy distribution with some features (i.e. the "true" distribution is maxentropic subject to a constraint on the expectation of those features), BUT I don't know the parameters of the distribution (i.e. I don't know the expected value of the features). For instance, maybe I know that the variables are drawn from a normal distribution, but I don't know the mean and variance of the distribution. In a Bayesian sense, the variables are not actually independent: learning the value of one (or a few) data points tells me something about the distribution parameters (i.e. mean and variance in the Gaussian case), which in turn gives me information about the other (unobserved) data points.
However... if I have a few data points, then all of the information from those which is relevant to other (unobserved) data points is summarized by the sufficient statistic. Or, to put it differently: while the observed and unobserved data points are not independent in a Bayesian sense, they are conditionally independent given the summary statistic. This is a special property of maximum entropy distributions, and is one of the main things which makes them pleasant to work with mathematically.
So: the conceptual significance of a "sufficient statistic" is that it summarizes all of the information from some data which is relevant to some other data/parameter/question.
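To make the Gaussian case concrete, here is a minimal sketch showing that the likelihood of a sample depends on the data only through the count, the sum, and the sum of squares. The two datasets are made up for illustration: they differ as lists but share those three numbers, so every choice of mean and variance assigns them the same log-likelihood.

```python
import math


def gaussian_log_likelihood(data, mean, variance):
    """Log-likelihood of IID data under a normal distribution."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)


def sufficient_statistic(data):
    """Count, sum, and sum of squares: sufficient for a normal distribution."""
    return (len(data), sum(data), sum(x * x for x in data))


# Two different samples with the same sufficient statistic (3, 6.0, 18.0).
first_sample = [0.0, 3.0, 3.0]
second_sample = [1.0, 1.0, 4.0]

likelihood_gaps = [
    abs(
        gaussian_log_likelihood(first_sample, mean, variance)
        - gaussian_log_likelihood(second_sample, mean, variance)
    )
    for mean, variance in [(2.0, 1.0), (0.5, 2.0), (-3.0, 0.25)]
]
```

Since the likelihoods agree for every parameter setting, nothing about the parameters (and hence about unobserved points) can be learned from one sample that cannot be learned from the other.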
Coming back to the post: if you want to claim that a set of variables together constitute "sufficient statistics for goal-directedness", then you need to argue that those variables together summarize all information from the underlying system which could possibly be relevant to goal-directedness. You have to argue that, once we know the sufficient statistics, then there is not any other information about the underlying system which could possibly be relevant to determining how goal-directed the system is. The main challenge is not to argue that all these statistics are relevant, but rather to argue that there cannot possibly be any other relevant information not already fully accounted for by these statistics. As far as I can tell, the post did not even attempt such an argument.
BTW, I do think you should attempt such an argument. The "sufficient statistics" in this post sound like ad-hoc measures which roughly capture some intuitions about goal-directedness, but there's no obvious reason to think they're the right measures. Take the explainability factor, for instance. It's using maximums and averages all over the place; why these operations, rather than a softmax, or weighted average, or order statistic, or log transform, or ...? As far as I can tell, this was an ad-hoc choice, and I expect these sorts of ad-hoc choices to diverge from our intuitive interpretations in corner cases.
The sort of argument needed to justify the term "sufficient statistic" (i.e. arguing that no other information can possibly be relevant) is exactly the sort of argument which makes it clear that we're using the right statistics, rather than ad-hoc metrics which probably diverge from our interpretations in lots of corner cases.
↑ comment by adamShimi · 2021-03-12T15:09:46.458Z
Thanks for the spot-on pushback!
I do understand what a sufficient statistic is, which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't properly defend the claim that the statistics I provide are really sufficient.
If I try to explain myself, what I want to say in this post is probably something like:
 Knowing these intuitive properties about the system and the goals seems sufficient to express and address basically any question we have related to goals and goal-directedness (in a very vague intuitive way that I can't really justify).
 To think about that in a grounded way, here are formulas for each property that look like they capture these properties.
 Now what's left to do is to attack the aforementioned questions about goals and goal-directedness with these statistics, and see if they're enough. (Which is the topic of the next few posts)
Honestly, I don't think there's an argument to show these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to an exploration of goal-directedness, carving more grounded questions:
 Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what's missing.
 Are my formulas adequate formalizations of the intuitive properties?
This post mostly focuses on the second aspect, and to be honest, not even in as much detail as one could go.
Maybe that means this post shouldn't exist, and I should have waited to see if I could literally formalize every question about goals and goal-directedness. But posting it to gather feedback on whether these statistics make sense to people, and if they feel like something's missing, seemed valuable.
That being said, my mistake (and what caused your knee-jerk reaction) was to just say these are literally sufficient statistics instead of presenting it the way I did in this comment. I'll try to rewrite a couple of sentences to make that clear (and add another note at the beginning so your comment doesn't look obsolete).
↑ comment by johnswentworth · 2021-03-12T15:55:28.669Z
I still feel like you're missing something important here.
For instance... in the explainability factor, you measure "the average deviation of [the system] from the actions favored by the action-value function of [the goal's policy]", using the formula
[formula not preserved in this copy]. But why this particular formula? Why not take the log of the action values first, or use a different quantity in the denominator? Indeed, there's a strong argument to be made this formula is a bad choice: the value function is invariant under multiplying by a scalar or adding a constant (i.e. these operations leave the preferences encoded by the value function unchanged), yet this value is not invariant to adding a constant to the action values. So we could change our representation of the "goal" to which we're comparing, in a way which should still represent the same goal, yet the supposed answer to "how well does this goal explain the system's behavior" changes.
Don't get too caught up on this one specific issue; there's a broader problem I'm pointing to here. The problem is with trying to use arbitrary formulas to represent intuitive concepts. If multiple non-equivalent formulas seem like similarly-plausible quantifications of an intuitive concept, then at least one of them is wrong; we have not yet understood the intuitive concept well enough to correctly quantify it. Unless every degree of freedom in the formula is nailed down (up to mathematical equivalence), we haven't actually quantified the intuitive concept, we've just come up with a proxy.
That's what these numbers are: they're not sufficient statistics, they're proxies, in exactly the same sense that "how often a human pushes an approval button" is a proxy for how good an AI's actions are. And they will break down, as proxies always do.
That puts this part in a somewhat different perspective:
Honestly, I don't think there's an argument to show these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to an exploration of goal-directedness, carving more grounded questions:
 Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what's missing.
 Are my formulas adequate formalizations of the intuitive properties?
I claim it makes more sense to word these questions as:
 Given a question about goals and goal-directedness, are these proxies enough to frame and study this question?
 Are these proxies adequate formalizations of the intuitive properties?
The answer to the first question may sometimes be "yes". The answer to the second is definitely "no"; these are proxies, and they absolutely will not hold up if we try to put optimization pressure on them. Goodhart's law will kick in. For instance, tying back to the earlier example, at some point there may be a degree of freedom in how the goal is represented, without changing the substantive meaning of the goal (e.g. adding a constant to the action values). Normally, that won't be much of a problem, but if we put optimization pressure on it, then we'll end up with some big constant added to the action values in order to change the explainability factor, and then the proxy will break down: the explainability factor will cease to be a good measure of explainability.
↑ comment by adamShimi · 2021-03-15T18:11:11.144Z
To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with him.
The summary is that you can see the arguments made and constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot), then it's misleading to call that the solution.
So I've been working these last few days at arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations only has one solution.
↑ comment by johnswentworth · 2021-03-15T18:43:39.298Z
I'm working on writing it up properly, should have a post at some point.
comment by Koen.Holtman · 2021-03-01T14:06:41.569Z
This post proposes such a behavioral definition of goal-directedness. If it survives the artillery fire of feedback and criticism, it will provide a more formal grounding for goal-directedness,
I guess you are looking for critical comments. I'll bite.
Technical comment on the above post
So if I understand this correctly, then this quantity is a metric of goal-directedness. However, I am somewhat puzzled because it only measures directedness to a single goal.
But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible goals?
More general comments on goal-directedness
Reading the earlier posts in this sequence and several of the linked articles, I see a whole bunch of problems.
I think you are being inspired by the Misspecified Goal Argument. From Rohin's introductory post on goal-directedness:
The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.
Rohin then speculates that if we remove the 'goal' from the above argument, we can make the AI safer. He then comes up with a metric of 'goal-directedness' where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin's terminology, an agent gets safer if it is less goal-directed.
Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics, you are looking at observable behavior, not at agent internals.
Where things completely move off the main sequence is in Rohin's next step in developing his intuitive notion of goal-directedness:
This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.
So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin's intuition. But I can't really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.
This all starts to look like a linguistic form of Goodharting: the meaning of the term 'goal-directed' collapses completely because too much pressure is placed on it for control purposes.
To state my own terminology preference: I am perfectly happy to call any possible AI agent a goal-directed agent. This is because people build AI agents to help them pursue some goals they have, which naturally makes these agents goal-directed. Identifying a subclass of agents which we then call non-goal-directed looks like a pretty strange program to me, which can only cause confusion (and an artillery fire of feedback and criticism).
To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.

Is your idea that a lower number on a metric implies more safety? This seems to be Rohin's original idea.

Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of 'will become adversarial and work against us' at all? If so I am not seeing the correlation.
↑ comment by adamShimi · 2021-03-01T14:48:19.727Z
Thanks for taking the time to give feedback!
Technical comment on the above post
So if I understand this correctly, then this quantity is a metric of goal-directedness. However, I am somewhat puzzled because it only measures directedness to a single goal.
But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible goals?
That's not what I had in mind, but it's probably on me for not explaining it clearly enough.
 First, for a fixed goal, the whole focus matters. That is, we also care about the other factors. I plan on writing a post defending why we need all of them, but basically there are situations when using only one of them would make us order things weirdly.
 You're right that we need to consider all goals. That's why the goal-directedness of the system is defined as a function that sends each goal (satisfying the nice conditions) to a focus, the vector of three numbers. So the goal-directedness of the system contains the focus for every goal, and the focus captures the coherence of the system with the goal.
Rohin then speculates that if we remove the 'goal' from the above argument, we can make the AI safer. He then comes up with a metric of 'goal-directedness' where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin's terminology, an agent gets safer if it is less goal-directed.
This doesn't feel like a good summary of what Rohin says in his sequence.
 He says that many scenarios used to argue for AI risks implicitly use systems following goals, and thus that building AIs not having goals might make these scenarios go away. But he doesn't say that new problems can't emerge.
 He doesn't propose a metric of goal-directedness. He just argues that every system is maximizing a utility function, and so this isn't the way to differentiate goal-directed from non-goal-directed systems. The point of this argument is also to say that reasons to believe that AGIs should maximize expected utility are not enough to say that such AGI must necessarily be goal-directed.
Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics, you are looking at observable behavior, not at agent internals.
Where things completely move off the main sequence is in Rohin's next step in developing his intuitive notion of goal-directedness:
"This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal."
So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin's intuition. But I can't really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.
This all starts to look like a linguistic form of Goodharting: the meaning of the term 'goal-directed' collapses completely because too much pressure is placed on it for control purposes.
My previous answer mostly addresses this issue, but let's spell it out: Rohin doesn't say that non-goal-directed systems are automatically safe. What he defends is that
 Non-goal-directed (or low-goal-directed) systems wouldn't be unsafe in many of the ways we study, because these depend on having a goal (convergent instrumental subgoals for example)
 Non-goal-directed competent agents are not a mathematical impossibility, even if every competent agent must maximize expected utility.
 Since removing goal-directedness apparently gets rid of many big problems with aligning AI, and we don't have an argument for why making a competent non-goal-directed system is impossible, then we should try to look into non-goal-directed approaches.
Basically, the intuition of "less goal-directed means safer" makes sense when safer means "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shut down", not when it means "less probability that the AI takes an unexpected and counterproductive action".
Another way to put it is that Rohin argues that removing goal-directedness (if possible) seems to remove many of the specific issues we worry about in AI Alignment, and leaves mostly the near-term "my automated car is running over people because it thinks they are parts of the road" kind of problems.
To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.
 Is your idea that a lower number on a metric implies more safety? This seems to be Rohin's original idea.
 Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of 'will become adversarial and work against us' at all? If so I am not seeing the correlation.
That's a very good and fair question. My reason for not using a single metric is that I think the whole structure of focuses for many goals can tell us many important things (for safety) when looked at from different perspectives. That's definitely something I'm working on, and I think I have nice links for explainability (and others probably coming). But to take an example from the post, it seems that a system with one goal with far more generalization than any other is more at risk of the kind of safety problems Rohin related to goal-directedness.
Replies from: Koen.Holtman↑ comment by Koen.Holtman · 20210301T19:23:28.317Z · LW(p) · GW(p)
This doesn't feel like a good summary of what Rohin says in his sequence.
I was not trying to summarize the entire sequence, only summarizing my impressions of some things he said in the first post of the sequence. Those impressions are that Rohin was developing his intuitive notion of goal-directedness in a very different direction than you have been doing, given the examples he provides.
Which would be fine, but it does lead to questions of how much your approach differs. My gut feeling is that the difference in directions might be much larger than can be expressed by the mere adjective 'behavioral'.
On a more technical note, if your goal is to search for metrics related to "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shut down", then the metrics that have been most productive in my opinion are, first, 'indifference', in the meaning where it is synonymous with 'not having a control incentive'. Other very relevant metrics are 'myopia' or 'short planning horizons' (see for example here [? · GW]) and 'power' (see my discussion in the post Creating AGI Safety Interlocks).
(My paper counterfactual planning has a definition of 'indifference' which I designed to be more accessible than the 'not having a control incentive' definition, i.e. more accessible for people not familiar with Pearl's math.)
None of the above metrics look very much like 'non-goal-directedness' to me, with the possible exception of myopia.
comment by romeostevensit · 20210311T19:53:12.795Z · LW(p) · GW(p)
I noticed myself being dismissive of this approach despite it being potentially relevant to the way I've been thinking about things. Investigating that, I find that I've mostly been writing off anything that pattern-matches to the 'cognitive architectures' family of approaches. The reason for this is that most such approaches want to reify modules and structure. And my current guess is that the brain doesn't have a canonical structure (at least, on the level of abstraction that cognitive architecture focuses on). That is to say, the modules are fluid and their connections to each other are contingent.
Replies from: adamShimi↑ comment by adamShimi · 20210312T15:15:44.311Z · LW(p) · GW(p)
Thanks for commenting on your reaction to this post!
That being said, I'm a bit confused by your comment. You seem to write off approaches which attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all the behavior), extract relevant statistics to study questions related to goal-directedness.
Can you maybe give more details?
comment by Daniel Kokotajlo (danielkokotajlo) · 20210301T17:22:25.447Z · LW(p) · GW(p)
Potential typo: You call the efficiency and explainability factors "generalization factors" when you introduce them
Replies from: adamShimi