Comment by abramdemski on Pavlov Generalizes · 2019-05-19T00:18:12.294Z · score: 4 (2 votes) · LW · GW

Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner's Dilemma, so I think looking at it that way is missing something important about what's going on.

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-25T17:08:22.140Z · score: 4 (2 votes) · LW · GW

Ah, ok. I note that it may have been intended more as a meditative practice, since the goal appears to have been reaching a state of bliss, the epistemic practice being a means to that end. Practicing doubting everything could be an interesting meditation (though it could perhaps be dangerous).

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T19:51:39.127Z · score: 2 (1 votes) · LW · GW
this procedure is called (the weak form of) Pyrrhonian skepticism

What's the strong form?

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T07:19:12.937Z · score: 10 (5 votes) · LW · GW

I think in a conversation I had with you last year, I kept going back to 'state' despite protests because I kept thinking "if AUP works, surely it would be because some of the utility functions calculate a sensible state estimate in a humanlike ontology and then define utility from this". It isn't necessarily the right way to critique AUP, but I think I was right to think those thoughts conditional on that assumption -- ie, even if it isn't the argument you're trying to make for AUP, it seems like a not-unreasonable position to consider, and so thinking about how AUP does in terms of state can be a reasonable and important part of a thought-process assessing AUP. I believe I stopped making the assumption outright at some point, but kept bringing out the assumption as a tool for analysis -- for example, supporting a thought experiment with the argument that there would at least be some utility functions which thought about the external world enough to care about such-and-such. I think in our conversation I managed to appropriately flag these sorts of assumptions, such that you were OK with the role they were playing in the wider argument (well... not in the sense of necessarily accepting the arguments, but in the sense of not thinking I was just repeatedly making the mistake of thinking it has to be about state).

Other people could be thinking along similar lines without flagging it so clearly.

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T01:05:24.869Z · score: 23 (6 votes) · LW · GW
  • Giving people a slider with "safety" written on one end and "capability" written on the other, and then trying to get people to set it close enough to the "safety" end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
    • An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.
    • Even better (though in a similar vein) would be an approach where capability and alignment go hand in hand -- a way to directly optimize for "what I mean, not what I say", such that it is obvious that things are just worse if you depart from this.
    • However, maybe those things are just pipe dreams -- this should not be the fundamental reason to ignore impact measures, unless promising approaches in the other two categories are pointed out; and even then, impact measures as a backup plan would still seem desirable.
      • My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they seem better in terms of the objection which follows.
      • Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure; see the sketch at the end of this comment.)
        • In other words, conditioned on having a story more like "this is how you get the most of what you want" rather than a slider reading "safety ------- capability", I more expect to see a mild optimizer as opposed to an impact measure.
  • Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn't exactly what we want.
    • It is apparent that some attempted impact measures run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem when a lot of optimization pressure is applied. This gives reason for concern even if you can't spot a concrete problem with a given impact measure: impact measures don't address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
    • If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
      • (It is a desirable feature of safety measures that they do not risk decreasing alignment.)
    • On the other hand, a mild optimizer seems to get the spirit of what's wanted from low-impact.
      • This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
  • Both mild optimization and impact measures require separate approaches to "doing what people want".
    • Arguably this is OK, because they could greatly reduce the bar for alignment of specified utility functions. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of "do what I want" before we can be confident even in low-impact or mild-optimization approaches, because it is difficult to confidently say that an approach avoids the risk of hugely violating your preferences while we are still so confused about what human preference even is.
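
As a concrete reference point for the mild-optimization alternative mentioned above, here is a minimal quantilizer sketch (my own illustration, not part of the original discussion; the action set, base distribution, and proxy utility are made up):

```python
import random

def quantilize(actions, base_weights, utility, q=0.1, rng=random):
    """Sample from the top-q utility quantile of the base distribution.

    Instead of argmax-ing a possibly misspecified utility (which invites
    Goodhart problems), a quantilizer applies only limited optimization
    pressure: it restricts the base distribution to its top-q quantile by
    utility and samples from that restriction.
    """
    ranked = sorted(actions, key=utility, reverse=True)  # best to worst by proxy utility
    total = sum(base_weights[a] for a in actions)
    kept, mass = [], 0.0
    for a in ranked:  # discrete version: the boundary action is included whole
        kept.append(a)
        mass += base_weights[a] / total
        if mass >= q:
            break
    return rng.choices(kept, weights=[base_weights[a] for a in kept], k=1)[0]

# Made-up example: a base distribution (e.g. imitating typical behavior)
# plus a proxy utility we only partially trust.
actions = ["safe_plan", "ambitious_plan", "weird_exploit"]
base_weights = {"safe_plan": 0.7, "ambitious_plan": 0.25, "weird_exploit": 0.05}
proxy_utility = {"safe_plan": 1.0, "ambitious_plan": 3.0, "weird_exploit": 100.0}.get
print(quantilize(actions, base_weights, proxy_utility, q=0.3))
```

The point of the sketch is only to show why quantilization feels closer to an "anti-Goodhart" primitive: the amount of optimization pressure is an explicit parameter, rather than an unbounded argmax filtered through a penalty.
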
Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-19T20:58:55.025Z · score: 4 (2 votes) · LW · GW

Ok, so you find yourself in this situation where the Truth Tester has verified that the Predictor is accurate, and you've verified that the Truth Tester is accurate, and the Predictor tells you that the direction you're about to turn your head has a perfect correspondence to the orbit of some particular asteroid. Lacking the orbit information yourself, you now have a subjective link between your next action and the asteroid's path.

This case does appear to present some difficulty for me.

I think this case isn't actually so different from the previous case, because although you don't know the source code of the Predictor, you might reasonably suspect that the Predictor picks out an asteroid after predicting you (or, selects the equation relating your head movement to the asteroid orbit after picking out the asteroid). We might suspect this precisely because it is implausible that the asteroid is actually mirroring our computation in a more significant sense. So using a Truth Tester intermediary increases the uncertainty of the situation, but increased uncertainty is compatible with the same resolution.

What your revision does do, though, is highlight how the counterfactual expectation has to differ from the evidential conditional. We may think "the Predictor would have selected a different asteroid (or different equation) if its computation of our action had turned out different", but, we now know the asteroid (and the equation); so, our evidential expectation is clearly that the asteroid has a different orbit depending on our choice of action. Yet, it seems like the sensible counterfactual expectation given the situation is ... hm.

Actually, now I don't think it's quite that the evidential and counterfactual expectation come apart. Since you don't know what you actually do yet, there's no reason for you to tie any particular asteroid to any particular action. So, it's not that in your state of uncertainty choice of action covaries with choice of asteroid (via some particular mapping). Rather, you suspect that there is such a mapping, whatever that means.

In any case, this difficulty was already present without the Truth Tester serving as intermediary: the Predictor's choice of box is already known, so even though it is sensible to think of the chosen box as what counterfactually varies based on choice of action, on the spot what makes sense (evidentially) is to anticipate the same box having different contents.

So, the question is: what's my naive functionalist position supposed to be? What sense of "varies with" is supposed to necessitate the presence of a copy of me in the (logico-)causal ancestry of an event?

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-17T22:06:15.211Z · score: 2 (1 votes) · LW · GW

It occurs to me that although I have made clear that I (1) favor naive functionalism and (2) am far from certain of it, I haven't actually made clear that I further (3) know of no situation where I think the agent has a good picture of the world and where the agent's picture leads it to conclude that there's a logical correlation with its action which can't be accounted for by a logical cause (ie something like a copy of the agent somewhere in the computation of the correlated thing). IE, if there are outright counterexamples to naive functionalism, I think they're actually tricky to state, and I have at least considered a few cases -- your attempted counterexample comes as no surprise to me and I suspect you'll have to try significantly harder.

My uncertainty is, instead, in the large ambiguity of concepts like "instance of an agent" and "logical cause".

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-17T21:59:33.891Z · score: 2 (1 votes) · LW · GW
"How do you propose to reliably put an agent into the described situation?" - Why do we have to be able to reliably put an agent in that situation? Isn't it enough that an agent may end up in that situation?

For example, we can describe how to put an agent into the counterfactual mugging scenario as normally described (where Omega asks for $10 and gives nothing in return), but critically for our analysis, one can only reliably do so by creating a significant chance that the agent ends up in the other branch (where Omega gives the agent a large sum if and only if Omega would have received the asked-for $10 in the other branch). If this were not the case, the argument for giving the $10 would seem weaker.
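
For concreteness, here is the ex-ante comparison in the standard setup, under the usual illustrative assumptions that the coin is fair and that the "large sum" is $10,000 (both numbers are just for illustration; the problem statement above only fixes the $10):

```python
# Counterfactual mugging, evaluated before the coin flip.
# Tails: Omega asks for $10 and gives nothing in return.
# Heads: Omega pays the large sum iff it predicts you would have paid on tails.
P_HEADS = 0.5
LARGE_SUM = 10_000   # illustrative; the problem only says "a large sum"
ASK = 10

def ex_ante_value(policy_pays_on_tails: bool) -> float:
    heads_payout = LARGE_SUM if policy_pays_on_tails else 0
    tails_payout = -ASK if policy_pays_on_tails else 0
    return P_HEADS * heads_payout + (1 - P_HEADS) * tails_payout

print(ex_ante_value(True))   # 4995.0
print(ex_ante_value(False))  # 0.0
```

The paying policy only comes out ahead because the procedure that reliably lands you in the tails branch also creates the corresponding chance of the heads branch; if you could be placed in the tails branch alone, the comparison would collapse.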

But in terms of how the agent can know the predictor is accurate, perhaps the agent gets to examine its source code after it has run and its implemented in hardware rather than software so that the agent knows that it wasn't modified?

I'm asking for more detail about how the predictor is constructed such that the predictor can accurately point out that the agent has the same output as the box. Counterfactual mugging would be less compelling if we had to rely on the agent merely happening to have the stated subjunctive dependencies, rather than being able to describe a scenario in which it seems very reasonable for the agent to have them. Similarly, your example would be less compelling if the box just happens to contain a slip of paper with our exact actions, and the predictor just happens to guess this correctly, and we just happen to (correctly) trust the predictor. Then I would agree that something has gone wrong, but all that has gone wrong is that the agent had a poor picture of the world (one which is subjunctively incorrect from our perspective, even though it made correct predictions).

On the other hand, if the predictor runs a simulation of us and then purposefully chooses a box whose output is identical to ours, then the situation seems perfectly sensible: "the box" that's correlated with our output subjectively is a box which is chosen differently in cases where our output is different; and the choice-of-box contains a copy of us. So the example works: there is a copy of us somewhere in the computation which correlates with us.

(Also, just wanted to check whether you've read the formal problem description in Logical Counterfactuals and the Co-operation Game)

I've read it now. I think you could already have guessed that I agree with the 'subjective' point and disagree with the 'meaningless to consider the case where you have full knowledge' point.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-17T03:39:16.735Z · score: 4 (2 votes) · LW · GW

I disagree, and I thought my objection was adequately explained. But I think my response will be more concrete/understandable/applicable if you first answer: how do you propose to reliably put an agent into the described situation?

The details of how you set up the scenario may be important to the analysis of the error in the agent's reasoning. For example, if the agent just thinks the predictor is accurate for no reason, it could be that the agent just has a bad prior (the predictor doesn't really reliably tell the truth about the agent's actions being correlated with the box). To that, I could respond that of course we can construct cases we intuitively disagree with by giving the agent a set of beliefs which we intuitively disagree with. (This is similar to my reason for rejecting the typical smoking lesion setup as a case against EDT! The beliefs given to the EDT agent in smoking lesion are inconsistent with the problem setup.)

I'm not suggesting that you were implying that; I'm just saying it to illustrate why it might be important for you to say more about the setup.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-16T21:58:24.701Z · score: 4 (2 votes) · LW · GW

The box itself isn't necessarily thought of as possessing an instance of my consciousness. The bullet I want to bite is the weaker claim that anything subjunctively linked to me has me somewhere in its computation (including its past). In the same way that a transcript of a conversation I had contains me in its computation (I had to speak a word in order for it to end up in the text) but isn't itself conscious, a box which very reliably has the same output as me must be related to me somehow.

I anticipate that your response is going to be "but what if it is only a little correlated with you?", to which I would reply "how do we set up the situation?" and probably make a bunch of "you can't reliably put me into that epistemic state" type objections. In other words, I don't expect you to be able to make a situation where I both assent to the subjective subjunctive dependence and will want to deny that the box has me somewhere in its computation.

For example, the easiest way to make the correlation weak is for the predictor who tells me the box has the same output as me to be only moderately good. There are two possibilities. (1) I can already predict what the predictor will think I'll do, which screens off its prediction from my action, so no subjective correlation; (2) I can't predict confidently what the predictor will say, which means the predictor has information about my action which I lack; then, even if the predictor is poor, it must have a significant tie to me; for example, it might have observed me making similar decisions in the past. So there are copies of me behind the correlation.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-16T01:02:46.038Z · score: 2 (1 votes) · LW · GW

I don't recall hearing anyone else express the extremely naive view we're talking about, and I probably have some specific decision-theory-related beliefs that make it particularly appealing to me, but I don't think it's out of the ballpark of other people's views, so to speak.

The point I make there is that the processes being subjunctively linked to you is more a matter of your state of knowledge than anything about the intrinsic properties of the object itself.

I (probably) agree with this point, and it doesn't seem like much of an argument against the whole position to me -- coming from a Bayesian background, it makes sense to be subjectivist about a lot of things, and link them to your state of knowledge. I'm curious how you would complete the argument -- OK, subjunctive statements are linked to subjective states of knowledge. Where does that speak against the naive functionalist position?

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-16T00:16:10.247Z · score: 2 (1 votes) · LW · GW
I'm very negative on Naive Functionalism. I've still got some skepticism about functionalism itself (property dualism isn't implausible in my mind), but if I had to choose between Functionalist theories, that certainly isn't what I'd pick.

I'm trying to think more about why I feel this outcome is a somewhat plausible one. The thing I'm generating is a feeling that this is 'how these things go' -- that the sign that you're on the right track is when all the concepts start fitting together like legos.

I guess I also find it kind of curious that you aren't more compelled by the argument I made early on, namely, that we should collapse apparently distinct notions if we can't give any cognitive difference between them. I think I later rounded down this argument to Occam's razor, but there's a different point to be made: if we're talking about the cognitive role played by something, rather than just the definition (as is the case in decision theory), and we can't find a difference in cognitive role (even if we generally make a distinction when making definitions), it seems hard to sustain the distinction. Taking another example related to anthropics, it seems hard to sustain a distinction between 'probability that I'm an instance' and 'degree I care about each instance' (what's been called a 'caring measure', I think), when all the calculations come out the same either way, even generating something which looks like a Bayesian update of the caring measure. Initially it seems like there's a big difference, because it's a question of modeling something as a belief or a value; but, unless some substantive difference in the actual computations presents itself, it seems the distinction isn't real. A robot built to think with true anthropic uncertainty vs caring measures is literally running equivalent code either way; it's effectively only a difference in code comments.
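
To illustrate that last point, here is a toy version of the two "different" designs (my own sketch; the instances, weights, and utilities are made up):

```python
# Two readings of the same weights: P(I am this instance) vs. how much I care
# about this instance. The computation an agent runs is identical either way.
instances = ["copy_A", "copy_B"]
weight = {"copy_A": 0.7, "copy_B": 0.3}
utility = {
    "act":  {"copy_A": 10.0, "copy_B": 2.0},
    "wait": {"copy_A": 4.0,  "copy_B": 6.0},
}

def score_as_belief(action):
    # "Anthropic uncertainty" reading: expected utility under P(instance).
    return sum(weight[i] * utility[action][i] for i in instances)

def score_as_caring(action):
    # "Caring measure" reading: utility summed with caring weights.
    return sum(weight[i] * utility[action][i] for i in instances)

assert all(score_as_belief(a) == score_as_caring(a) for a in utility)
print(max(utility, key=score_as_belief))  # same choice under either reading
```

The two scoring functions are literally the same code; the difference lives entirely in the comments.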

Comment by abramdemski on Deconfusing Logical Counterfactuals · 2019-04-15T01:21:19.395Z · score: 2 (1 votes) · LW · GW

Sounds like the disagreement has mostly landed in the area of questions of what to investigate first, which is pretty firmly "you do you" territory -- whatever most improves your own picture of what's going on, that is very likely what you should be thinking about.

On the other hand, I'm still left feeling like your approach is not going to be embedded enough. You say that investigating 2->3 first risks implicitly assuming too much about 1->2. My sketchy response is that what we want in the end is not a picture which is necessarily even consistent with having any 1->2 view. Everything is embedded, and implicitly reflective, even the decision theorist thinking about what decision theory an agent should have. So, a firm 1->2 view can hurt rather than help, due to overly non-embedded assumptions which have to be discarded later.

Using some of the ideas from the embedded agency sequence: a decision theorist may, in the course of evaluating a decision theory, consider a lot of #1-type situations. However, since the decision theorist is embedded as well, the decision theorist does not want to assume realizability even with respect to their own ontology. So, ultimately, the decision theorist wants a decision theory to have "good behavior" on problems where no #1-type view is available (meaning some sort of optimality for non-realizable cases).

Comment by abramdemski on The Happy Dance Problem · 2019-04-15T00:45:09.750Z · score: 4 (2 votes) · LW · GW

I agree, that's a serious issue with the setup here. The simple answer is that I didn't think of that when I was writing the post. I later noticed the problem, but how to react isn't totally obvious.

Defense #1: An easy response is that I was talking about updateful DTs in my smoking lesion discussion. If a DT learns, it is hard to see why it would have seriously miscalibrated estimates of its own behavior. For UDT, there is no similar argument. Therefore the post as written above stands.

Reply: Perhaps that's not very satisfying, though -- despite UDT's fixed prior, failure due to lack of calibration about oneself seems like a particularly damning sort of failure. We might construct the prior using something similar to a reflective oracle to block this sort of problem.

Defense #2: Then, the next easy response is that material-conditional-based UDT 1.0 with such a self-knowledgeable prior has two possible fixed points. The probability distribution described in the post isn't one of them, but one with a more extreme assignment favoring dancing is: if the prior expects the agent to dance with certainty or almost certainly, then dancing looks good, and not dancing looks like a way to guarantee you don't get the money. Again, the concern raised in the post is a valid one, just requiring a tweak to the probabilities in the example.

Reply: Sure, but the solution in this case is very clear: you have to select the best fixed point. This seems like an option which is available to the agent, or to the agent designer.

Defense #3: True, but then you're essentially taking a different counterfactual to decide the consequences of a policy: consideration of what fixed point it puts you in. This implies that you have something richer than just a probability distribution to work with, vindicating the overall point of the post, which is to discuss an issue which arises if you try to "condition on a conditional" when given only a probability distribution on actions and outcomes. Reasoning involving fixed points is going to end up being a (very particular) way to add a more basic counterfactual, as suggested by the post.

Also, even if you do this, I would conjecture there's going to be some other problem with using the material conditional formulation of conditioning-on-conditionals. I would be interested if this turned out not to be true! Maybe there's some proof that the material-conditional approach turns out not to be equivalent to other possible approaches under some assumptions relating to self-knowledge and fixed-points. That would be interesting.

Also also, if we take the fixed-point idea seriously, there are problems we run into there as well. Reflective oracles (and their bounded cousins, for constructing computable priors) don't offer a wonderful notion of counterfactual. Selecting a fixed point offers some logical control over predictors which themselves call the reflective oracle to predict you, but if a predictor does something else (perhaps even re-computes the reflective oracle in a slightly different way, side-stepping a direct call to it but simulating it anyway), the result of using selection of fixed point as a notion of counterfactual could be intuitively wrong. You could try to define a special type of reflective oracle which lack this problem. You could also try other options like conditional oracles. But, it isn't clear how everything should fit together. In particular, if the oracle itself is treated as a part of the observation, what is the type of a policy?

So, "select the best fixed point" may not be the straightforward option it sounds like.

Reply: This seems to not take the concern seriously enough. The overall type signature of "conditioning on conditionals" seems wrong here. The idea of having a probability distribution on actions may be wrong, stopping the argument in the post in its tracks -- IE, the post may be right in its conclusion that there is a problem, but we should have been reasoning in a way which never went down that wrong path in the first place, and the conclusion of the post is making too small of a change to accomplish that.

For example, maybe distributed oracles offer a better picture of decision-making: the real process of deciding occurs in the construction of the fixed point, with nothing left over to decide once a fixed point has been constructed.

Clearly matters are getting too complicated for a simple correction to the argument in the post.

Defense #4: I still stand by the post as a cautionary tale about how not to define UDT, barring any "if you deal with self-reference appropriately, the material conditional option turns out to be equivalent to [some other options]" result, which could make me think the problem is more fundamental as opposed to a problem with a naive material-conditional approach to conditioning. The post might be improved by explicitly dealing with the self-reference issue, but the fact that it's not totally clear how to do so (ie 'select the best fixed point' seems to fix things on the surface but has its own more subtle issues when considered as a general approach) makes such a treatment potentially very complicated, so that it's better to look at the happy dance problem without explicitly worrying about all of that.

The basic point of the post is that formally specifying UDT is complicated even if you assume classical bayesian probability w/o worrying about logical uncertainty. Making UDT into a simple well-defined object requires the further assumption that there's a basic 'policy' object (the observation counterfactual, in the language of the post), with known probabilistic relationships to everything else. This essentially just gives you all the counterfactuals you need, begging the question of where such counterfactual information comes from. This point stands, however naive we might think such an approach is.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-11T01:29:57.589Z · score: 2 (1 votes) · LW · GW
I provided reasons why I believe that Naive Functionalism is implausible in an earlier comment. I'll admit that inconsistency is too strong of a word. My point is just that you need an independent reason to bite the bullet other than simplicity. Like simplicity combined with reasons why the bullets sound worse than they actually are.

Ah, I had taken you to be asserting possibilities and a desire to keep those possibilities open, rather than asserting held views and a desire for theories to conform to those views.

Maybe something about my view which I should emphasize is that since it doesn't nail down any particular notion of counterfactual dependence, it doesn't actually directly bite bullets on specific examples. In a given case where it may seem initially like you want counterfactual dependence but you don't want anthropic instances to live, you're free to change your view about either one. It could be that a big chunk of our differing intuitions lies in this. I suspect you've been thinking of me as wanting to open up the set of anthropic instances much wider than you would want. But, my view is equally amenable to narrowing down the scope of counterfactual dependence, instead. I suspect I'm much more open to narrowing down counterfactual dependence than you might think.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-11T00:33:02.048Z · score: 3 (2 votes) · LW · GW
The argument that you're making isn't that the Abstraction Approach is wrong, it's that by supporting other theories of consciousness, it increases the chance that people will mistakenly fail to choose Naive Functionalism. Wrong theories do tend to attract a certain number of people believing in them, but I would like to think that the best theory is likely to win out over time on Less Wrong.

(I note that I flagged this part as not being an argument, but rather an attempt to articulate a hazy intuition -- I'm trying to engage with you less as an attempt to convince, more to explain how I see the situation.)

I don't think that's quite the argument I want to make. The problem isn't that it gives people the option of making the wrong choice. The problem is that it introduces freedom in a suspicious place.

Here's a programming analogy:

Both of us are thinking about how to write a decision theory library. We have a variety of confusions about this, such as what functionality a decision theory library actually needs to support, what the interface it needs to present to other things is, etc. Currently, we are having a disagreement about whether it should call an external library for 'consciousness' vs implement its own behavior. You are saying that we don't want to commit to implementing consciousness in a particular way, because we may find that we have to change that later. So, we need to write the library in a way such that we can easily swap consciousness libraries.

When I imagine trying to write the code, I don't see how I'm going to call the 'consciousness' library while solving all the other problems I need to solve. It's not that I want to write my own 'consciousness' functionality. It's that I don't think 'consciousness' is an abstraction that's going to play well with the sort of things I need to do. So when I'm trying to resolve other confusions (about the interface, data types I will need, functionality which I may want to implement, etc) I don't want to have to think about calling arbitrary consciousness libraries. I want to think about the data structures and manipulations which feel natural to the problem being solved. If this ends up generating some behaviors which look like a call to the 'naive functionalism' library, this makes me think the people who wrote that library maybe were on to something, but it doesn't make me any more inclined to re-write my code in a way which can call 'consciousness' libraries.

If another programmer sketches a design for a decision theory library which can call a given consciousness library, I'm going to be a bit skeptical and ask for more detail about how it gets called and how the rest of the library is factored such that it isn't just doing a bunch of work in two different ways or something like that.

Actually, I'm confused about how we got here. It seems like you were objecting to the (reductive-as-opposed-to-merely-analogical version of the) connection I'm drawing between decision theory and anthropics. But then we started discussing the question of whether a (logical) decision theory should be agnostic about consciousness vs take a position. This seems to be a related but separate question; if you reject (or hold off on deciding) the connection between decision theory and anthropics, a decision theory may or may not have to take a position on consciousness for other reasons. It's also not entirely clear that you have to take a particular position on consciousness if you buy the dt-anthropics connection. I've actually been ignoring the question of 'consciousness' in itself, and instead mentally substituting it with 'anthropic instance-ness'. I'm not sure what I would want to say about consciousness proper; it's a very complicated topic.

This is an argument for Naive Functionalism vs other theories of consciousness. It isn't an argument for the Abstracting Approach over the Reductive approach. The Abstracting Approach is more complicated, but it also seeks to do more. In order to fairly compare them, you have to compare both on the same domain. And given the assumption of Naive Functionalism, the Abstracting Approach reduces to the Reductive Approach.

An argument in favor of naive functionalism makes applying the abstraction approach less appealing, since it suggests the abstraction is only opening the doors to worse theories. I might be missing something about what you're saying here, but I think you are not only arguing that you can abstract without losing anything (because the agnosticism can later be resolved to naive functionalism), but that you strongly prefer to abstract in this case.

But, I agree that that's not the primary disagreement between us. I'm fine with being agnostic about naive functionalism; I think of myself as agnostic, merely finding it appealing. Primarily I'm reacting to the abstraction approach, because I think it is better in this case for a theory of logical counterfactuals to take a stand on anthropics. The fact that I'm uncertain about naive functionalism is tied to the fact that I'm uncertain about counterfactuals; the structure of my uncertainty is such that I expect information about one to provide information about the other. You want to maintain agnosticism about consciousness, and as a result, you don't want to tie those beliefs together in that way. From my perspective, it seems better to maintain that agnosticism (if desired) by remaining agnostic about the specific connection between anthropics and decision theory which I outlined, rather than by trying to do decision theory in a way which is agnostic about anthropics in general.

Comment by abramdemski on Comparison of decision theories (with a focus on logical-counterfactual decision theories) · 2019-04-10T20:45:44.759Z · score: 9 (4 votes) · LW · GW

You can formalize UDT in a more standard game-theoretic setting, which allows many problems like Parfit's Hitchhiker to be dealt with, if that is enough for what you're interested in. However, the formalism assumes a lot about the world (such as the identity of the agent being a nonproblematic given, as Wei Dai mentions), so if you want to address questions of where that structure is coming from, you have to do something else.

Comment by abramdemski on Comparison of decision theories (with a focus on logical-counterfactual decision theories) · 2019-04-10T20:37:29.278Z · score: 18 (5 votes) · LW · GW

Various comments, written while reading:

The broad categories of causal/evidential/logical are definitely right in terms of what people generally talk about, but it is important to keep in mind that these are clusters rather than fully formalized options. There are many different formalizations of causal counterfactuals, which may have significantly different consequences. Though, around here, people think of Pearlian causality almost exclusively.

"Evidential" means basically one thing, but we can differentiate between what happens in different theories of uncertainty. Obviously, Bayesianism is popular in these parts, but we also might be talking about evidential reasoning in a logically uncertain framework, like logical induction.

Logical counterfactuals are wide open, since there's no accepted account of what exactly they are. Though, modal DT is a concrete proposal which is often discussed.

Again, the causal/evidential/logical split seems good for capturing how people mostly talk about things here, but internally I think of it more as two dimensions: causal/evidential and logical/not. Logical counterfactuals are more or less the "causal and logical" option, conveying intuitions of there being some kind of "logical causality" which tells you how to take counterfactuals.

Also, getting into nitpicks: some might say "evidential" is the non-counterfactual option. A broader term which could be used is "conditional", with counterfactual conditionals (aka subjunctive conditionals) being a subtype. I think evidential conditionals would fall under "indicative conditional" as opposed to "counterfactual conditional". Academic philosophers might also nitpick that logical counterfactuals are not counterfactuals. "Counterfactual" in academic philosophy usually does not include the possibility of counterfacting on logical impossibilities; "counterlogical" is used when logical impossibilities are being considered. Posts on this forum usually ignore all the nitpicks in this paragraph, and I'm not sure I'm even capturing the language of academic decision theorists accurately -- just attempting to mention some distinctions I've encountered.

Other Dimensions:

You're right that reflective consistency is something which is supposed to emerge (or not emerge) from the specification of the decision theory. If there were a 'reflective consistency' option, we would want to just set it to 'yes'; but unfortunately, things are not so easy.

Another source of variation, related to your 'graphical models' point, could broadly be called choice of formalism. A decision problem could be given as an extensive-form game, a causal Bayes net, a program (probabilistic or deterministic), a logical theory (with some choices about how actions, utilities, etc get represented, whether causality needs to be specified, and so on), or many other possibilities.

This is critical; new formalisms such as reflective oracles may allow us to accomplish new things, illuminate problems which were previously murky, make distinctions between things which were previously being conflated, and so on. However, the high-level clusters like CDT, EDT, FDT, and UDT do not specify formalism -- they are more general ideas, which can be formalized in multiple ways.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-10T02:01:30.907Z · score: 2 (1 votes) · LW · GW

I agree that we are using "depends" in different ways. I'll try to avoid that language. I don't think I was confusing the two different notions when I wrote my reply; I thought, and still think, that taking the abstraction approach wrt consciousness is in itself a serious point against a decision theory. I don't think the abstraction approach is always bad -- I think there's something specific about consciousness which makes it a bad idea.

Actually, that's too strong. I think taking the abstraction approach wrt consciousness is satisfactory if you're not trying to solve the problem of logical counterfactuals or related issues. There's something I find specifically worrying here.

I think part of it is, I can't imagine what else would settle the question. Accepting the connection to decision theory lets me pin down what should count as an anthropic instance (to the extent that I can pin down counterfactuals). Without this connection, we seem to risk keeping the matter afloat forever.

Making a theory of counterfactuals take an arbitrary theory of consciousness as an argument seems to cement this free-floating idea of consciousness, as an arbitrary property which a lump of matter can freely have or not have. My intuition that decision theory has to take a stance here is connected to an intuition that a decision theory needs to depend on certain 'sensible' aspects of a situation, and is not allowed to depend on 'absurd' aspects. For example, the table being wood vs metal should be an inessential detail of the 5&10 problem.

This isn't meant to be an argument, only an articulation of my position. Indeed, my notion of "essential" vs "inessential" details is overtly functionalist (eg, replacing carbon with silicon should not matter if the high-level picture of the situation is untouched).

Still, I think our disagreement is not so large. I agree with you that the question is far from obvious. I find my view on anthropics actually fairly plausible, but far from determined.

When you talk about "depends" and say that this is a disadvantage, you mean that in order to obtain a complete theory of anthropics, you need to select a theory of consciousness to be combined with your decision theory. I think that this is actually unfair, because in the Reductive Approach, you do implicitly select a theory of consciousness, which I'll call Naive Functionalism. I'm not using this name to be pejorative, it's the best descriptor I can think of for the version of functionalism which you are using that ignores any concerns that high-level predictors might deserve to be labelled as a consciousness.

"Naive" seems fine here; I'd agree that the position I'm describing is of a "the most naive view here turns out to be true" flavor (so long as we don't think of "naive" as "man-on-the-street"/"folk wisdom").

I don't think it is unfair of me to select a theory of consciousness here while accusing you of requiring one. My whole point is that it is simpler to select the theory of consciousness which requires no extra ontology beyond what decision theory already needs for other reasons. It is less simple if we use some extra stuff in addition. It is true that I've also selected a theory of consciousness, but the way I've done so doesn't incur an extra complexity penalty, whereas you might incur one, if you end up going with something other than what I do.

My argument is that Occam's razor is about accepting the simplest theory that is consistent with the situation. In my mind it seems like you are allowing simplicity to let you ignore the fact that your theory is inconsistent with the situation, which is not how I believe Occam's Razor is supposed to work. So it's not just about the cost, but about whether this is even a sensible way of reasoning.

We agree that Occam's razor is about accepting the simplest theory that is consistent with the situation. We disagree about whether the theory is inconsistent with the situation.

What is the claimed inconsistency? So far my perception of your argument has been that you insist we could make a distinction. When you described your abstraction approach, you said that we could well choose naive functionalism as our theory of consciousness.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-10T00:45:42.176Z · score: 2 (1 votes) · LW · GW

Well, in any case, the claim I'm raising for consideration is that these two may turn out to be the same. The argument for the claim is the simplicity of merging the decision theory phenomenon with the anthropic phenomenon.

Comment by abramdemski on Deconfusing Logical Counterfactuals · 2019-04-10T00:39:23.664Z · score: 2 (1 votes) · LW · GW
In so far as both of us want to talk about decision problems where multiple possible options are considered, we need to provide a different interpretation of what decision problems are. Your approach is to allow the selection of inconsistent actions, while I suggest erasing information to provide a consistent situation.

I can agree that there's an interpretational issue, but something is bugging me here which I'm not sure how to articulate. A claim which I would make and which might be somehow related to what's bugging me is: the interpretation issue of a decision problem should be mostly gone when we formally specify it. (There's still a big interpretation issue relating to how the formalization "relates to real cases" or "relates to AI design in practice" etc -- ie, how it is used -- but this seems less related to our disagreement/miscommunication.)

If the interpretation question is gone once a problem is framed in a formal way, then (speaking loosely here and trying to connect with what's bugging me about your framing) it seems like either the formalism somehow forces us to do the forgetting (which strikes me as odd) or we are left with problems which really do involve impossible actions w/o any interpretation issue. I favor the latter.

My response is to argue as per my previous comment that there doesn't seem to be any criteria for determining which inconsistent actions are considered and which ones aren't.

The decision algorithm considers each output from a given set. For example, with proof-based decision theories such as MUDT, it is potentially convenient to consider the case where the output is true or false (so that the decision procedure can be thought of as a sentence). In that case, the decision procedure considers those two possibilities. There is no "extract the set of possible actions from the decision problem statement" step -- so you don't run into a problem of "why not output 2? It's inconsistent with the problem statement, but you're not letting that stop you in other cases".

It's a property of the formalism, but it doesn't seem like a particularly concerning one -- if one imagines trying to carry things over to, say, programming a robot, there's a clear set of possible actions even if you know the code may come to reliably predict its own actions. The problem of known actions seems to be about identifying the consequences of actions which you know you wouldn't take, rather than about identifying the action set.

I suppose you could respond that I haven't provided criteria for determining what information should be erased, but my approach has the benefit that if you do provide such criteria, logical counterfactuals are solved for free, while it's much more unclear how to approach this problem in the allowing inconsistency approach (although there has been some progress with things like playing chicken with the universe).

I feel like I'm over-stating my position a bit in the following, but: this doesn't seem any different from saying that if we provide a logical counterfactual, we solve decision theory for free. IE, the notion of forgetting has so many free parameters that it doesn't seem like much of a reduction of the problem. You say that a forgetting criterion would solve the problem of logical counterfactuals, but actually it is very unclear how much or how little it would accomplish.

You're at the stage of trying to figure out how agents should make decisions. I'm at the stage of trying to understand what a making a good decision even means. Once there is a clearer understanding of what a decision is, we can then write an algorithm to make good decisions or we may discover that the concept dissolves, in which case we will have to specify the problem more precisely. Right now, I'd be perfectly happy just to have a clear criteria by which an external evaluator could say whether an agent made a good decision or not, as that would constitute substantial progress.

I disagree with the 'stage' framing (I wouldn't claim to understand what making a good decision even means; I'd say that's a huge part of the confusion I'm trying to stare at -- for similar reasons, I disagree with your foundations foundations post in so far as it describes what I'm interested in as not being agent foundations foundations), but otherwise this makes sense.

This does seem like a big difference in perspective, and I agree that if I take that perspective, it is better to simply reject problems where the action taken by the agent is already determined (or call them trivial, etc). To me, that the agent itself needs to judge is quite central to the confusion about decisions.

My point was that there isn't any criteria for determining which inconsistent actions are considered and which ones aren't if you are just thrown a complete description of a universe and an agent.

As mentioned earlier, this doesn't seem problematic to me. First, if you're handed a description of a universe with an agent already in it, then you don't have to worry about defining what the agent considers: the agent already considers what it considers (just like it already does what it does). You can look at a trace of the executed decision procedure and read off which actions it considers. (Granted, you may not know how to interpret the code, but I think that's not the problem either of us are talking about.)

But there's another difference here in how we're thinking about decision theory, connected with the earlier-clarified difference. Your version of the 5&10 problem is that a decision theorist is handed a complete specification of the universe, including the agent. The agent takes some action, since it is fully defined, and the problem is that the decision theorist doesn't know how to judge the agent's decision.

(This might not be how you would define the 5&10 problem, but my goal here is to get at how you are thinking about the notion of decision problem in general, not 5&10 in particular -- so bear with me.)

My version of the 5&10 problem is that you give a decision theorist the partially defined universe with the $5 bill on the table and the $10 bill on the table, stipulating that whatever source code the decision theorist chooses for the agent, the agent itself should know the source code and be capable of reasoning about it appropriately. (This is somewhat vague but can be given formalizations such as that of the setting of proof-based DT.) In other words, the decision theorist works with a decision problem which is a "world with a hole in it" (a hole waiting for an agent). The challenge lies in the fact that whatever agent is placed into the problem by the decision theorist, the agent is facing a fully-specified universe with no question marks remaining.

So, for the decision theorist, the challenge presented by the 5&10 problem is to define an agent which selects the 10. (Of course, it had better select the 10 via generalizable reasoning, not via special-case code which fails to do the right thing on other decision problems.) For a given agent inserted into the problem, there might be an issue or no issue at all.

We can write otherwise plausible-looking agents which take the $5, and for which it seems like the problem is spurious proofs; hence part of the challenge for the decision theorist seems to be the avoidance of spurious proofs. But, not all agents face this problem when inserted into the world of 5&10. For example, agents which follow the chicken rule don't have this problem. This means that from the agent's perspective, the 5&10 problem does not necessarily look like a problem of how to think about inconsistent actions.
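
To make the chicken-rule point concrete, here is a heavily simplified sketch of a proof-based agent for 5&10 (my own illustration; `provable` is a stand-in for a bounded proof search over a theory that includes the agent's own source code, which is of course where the real difficulty lives):

```python
ACTIONS = ["take_5", "take_10"]
UTILITY_LEVELS = [10, 5, 0]   # candidate guarantees to search for, best first

def proof_based_agent(provable):
    # Chicken rule ("playing chicken with the universe"): if the agent can prove
    # it does NOT take some action, it takes that action immediately. Spurious
    # counterfactuals like "A() = take_10 -> U() = 0" typically go through a
    # proof of "A() != take_10", and this rule makes such proofs unavailable
    # (assuming the theory is consistent).
    for a in ACTIONS:
        if provable(f"A() != {a}"):
            return a
    # Otherwise: for each action, find the best provable guarantee
    # "A() = a -> U() = u", and take the action with the highest such u.
    best_action, best_value = ACTIONS[0], float("-inf")
    for a in ACTIONS:
        for u in UTILITY_LEVELS:
            if provable(f"A() = {a} -> U() = {u}"):
                if u > best_value:
                    best_action, best_value = a, u
                break
    return best_action

# Toy stand-in for the proof search: in the 5&10 world, only the "honest"
# conditionals are provable, so the agent takes the $10.
toy_theorems = {"A() = take_5 -> U() = 5", "A() = take_10 -> U() = 10"}
print(proof_based_agent(lambda s: s in toy_theorems))  # -> take_10
```

This is only meant to locate the role of the chicken rule; it glosses over the ordering of the proof search and everything about how `provable` would actually be implemented.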

Transparent Newcomb's already comes with the options and counterfactuals attached. My interest is in how to construct them from scratch.

In the framing above, where we distinguish between the view of the decision theorist and the view of the agent, I would say that:

  • Often, as is (more or less) the case with transparent Newcomb, a decision problem as-presented-to-the-decision-theorist does come with options and counterfactuals attached. Then, the interesting problem is usually to design an agent which (working from generalizable principles) recovers these correctly from within its embedded perspective.
  • Sometimes, we might write down a decision problem as source code, or in some other formalism. Then, it may not be obvious what the counterfactuals are / should be, even from the decision theorist's perspective. We take something closer to the agent's perspective, having to figure out for ourselves how to reason counterfactually about the problem.
  • Sometimes, a problem is given with a full description of its counterfactuals, but the counterfactuals as stated are clearly wrong: putting on our interpret-what-the-counterfactuals-are hats, we come up with an answer which differs from the one given in the problem statement. This means we need to be skeptical of the first case I mentioned, where we think we know what the counterfactuals are supposed to be and we're just trying to get our agents to recover them correctly.

Point being, in all three cases I'm thinking about the problem of how to construct the counterfactuals from scratch -- even the first case where I endorse the counterfactuals as given by the problem. This is only possible because of the distinction I'm making between a problem as given to a decision theorist and the problem as faced by an agent.

Comment by abramdemski on Deconfusing Logical Counterfactuals · 2019-04-08T22:07:03.796Z · score: 4 (2 votes) · LW · GW
I kind of agree with it, but in a way that makes it trivially true. Once you have erased information to provide multiple possible raw counterfactuals, you have the choice to frame the decision problem as either choosing the best outcome or avoiding sub-optimal outcomes. But of course, this doesn't really make a difference.

I think our disagreement is around the status of decision problems before you've erased information, not after. In your post, you say that before erasing information, a problem where what you do is determined is trivial, in that you only have the one option. That's the position I'm disagreeing with. To the extent that erasing information is a useful idea, it is useful precisely for dealing with such problems -- otherwise you would not need to erase the information. The way you're describing it, it sounds like erasing information isn't something agents themselves are supposed to ever have to do. Instead, it is a useful tool for a decision theorist, to transform trivial/meaningless decision problems into nontrivial/meaningful ones. This seems wrong to me.

It seems rather strange to talk about making an outcome inconsistent which was already inconsistent. Why is this considered an option that was available for you to choose, instead of one that was never available to choose? Consider a situation where the world and agent have both been precisely defined. Determinism means there is only one possible option, but decisions problems have multiple possible options. It is not clear which decisions that are inconsistent with what actually happened count as "could have been chosen" and which count as, "were never possible".

I'm somewhat confused about what you're saying in this paragraph and what assumptions you might be making. I think it might help to focus on examples. Two examples which I think motivate the idea:

  • Smoking lesion. It can often be quite a stretch to put an agent into a smoking lesion problem, because the problem assumes certain population statistics which may be impossible to achieve if the population is assumed to make decisions in a particular way. My impression is that some philosophers hold a decision theory like CDT and EDT responsible for what advice it offers in a particular situation, even if it would be impossible to put agents in that situation if they were the sort of agents who followed the advice of the decision theory in question. In other words, even if it is impossible to put EDT agents in a situation where they are representative of a population as described in the smoking lesion problem, EDT is held responsible for offering bad advice to agents in such a situation. I take the motto "decisions are for making bad outcomes inconsistent" as speaking against this view, instead giving EDT credit for making it impossible for an agent to end up in such a situation.
    • (In my post on smoking lesion, I came up with a way to get EDT agents into a smoking-lesion situation; however, it required certain assumptions about their internal architecture. We could take the argument as speaking against such an architecture, rather than EDT. This interpretation seems quite natural to me, because the setup required to get EDT into a smoking lesion situation is fairly unnatural, and one could simply refuse to build agents with such an unnatural architecture.)
  • Transparent Newcomb. In the usual setup, the agent is described as already facing a large sum of money. We are also told that this situation is only possible if the agent one-boxes -- a two-boxing agent won't get this opportunity (or, will get it with much smaller probability). Academic decision theorists tend to, again, judge the decision theory on the quality of advice offered under the assumption that an agent ends up in the situation, disregarding the effect of the decision on whether an agent could be in the situation in the first place. On this view, decision theories such as UDT which one-box are giving bad advice, because if you are already in the situation, you can get more money by two-boxing. In this case, the motto "decisions are for making bad outcomes inconsistent" is supposed to indicate that agents should one-box, so that they can end up in the better situation. A two-boxing decision theory like CDT is judged poorly for making it impossible to get a very good payout.
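
A policy-level payoff comparison for the perfect-predictor version, using the conventional illustrative amounts of $1,000,000 and $1,000 (my own sketch, not part of the original comment):

```python
# Transparent Newcomb with a perfect predictor: the big box is filled iff the
# predictor predicts you one-box upon seeing it full; the small transparent
# box always holds $1,000.
def payoff(one_box_when_full: bool) -> int:
    big_box_full = one_box_when_full   # perfect prediction
    if big_box_full:
        return 1_000_000               # the one-boxing policy sees and takes the full box
    return 1_000                       # a two-boxing policy only ever sees the empty box

print(payoff(True), payoff(False))     # 1000000 1000
```

Judged only from inside the situation of facing a full box, two-boxing looks $1,000 better; judged at the level of policies, the two-boxing policy has made that situation impossible for itself.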

Importantly, transparent Newcomb (with a perfect predictor) is a case where the agent has enough information to know its own action: it must one-box, since it could not be in this situation if it two-boxed. Yet, we can talk about decision theories such as CDT which two-box in such cases. So it is not meaningless to talk about what happens if you take an action which is inconsistent with what you know! What you do in such situations has consequences.

I don't know that you disagree with any of this, since in your original essay you say:

For example, when you have perfect knowledge of the environment and the agent, unless you run into issues with unprovability. Note that degeneracy is more common than you might think since knowing, for example, that it is a utility maximiser, tells you its exact behaviour in situations without options that are tied.

However, you go on to say:

Again, in these cases, the answer to the question, "What should the agent do?" is, "The only action consistent with the problem statement".

which is what I was disagreeing with. We can set up a sort of reverse transparent Newcomb, where you should take the action which makes the situation impossible: Omega cooks you a dinner selected out of those which it predicts you will eat. Knowing this, you should refuse to eat meals which you don't like, even though when presented with such a meal you know you must eat it (since Omega only presents you with a meal you will eat).

(Aside: the problem isn't fully specified until we also say what Omega does if there is nothing you will eat. We could say that Omega serves you nothing in that case.)
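
To make the dinner example concrete, here is a minimal sketch (the menu, the utilities, and the choice to score each policy by the worst meal Omega might arbitrarily serve are all assumptions of mine, purely for illustration):

```python
# Toy version of the dinner problem. Omega (a perfect predictor) serves some meal
# it predicts you will eat, or nothing if there is no such meal.

MENU = {"curry": 10, "soup": 4, "gruel": -5}   # made-up utility of eating each meal
NOTHING = 0                                     # utility of being served nothing

def omega_serves(will_eat):
    """Score a policy by the worst meal Omega might pick among those you would eat."""
    edible = [meal for meal in MENU if will_eat(meal)]
    return min((MENU[m] for m in edible), default=NOTHING)

def eat_anything(meal):
    return True                 # eat whatever is put in front of you

def refuse_bad(meal):
    return MENU[meal] > 0       # refuse meals you don't like

print(omega_serves(eat_anything))  # -5: you can end up eating gruel
print(omega_serves(refuse_bad))    #  4: the gruel branch is now inconsistent, so it never arises
```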

Talking about making your current situation inconsistent doesn't make sense literally, only analogically. After all, if you're in a situation it has to be consistent. The way that I get round this in my post is by replacing talk of decisions given a situation with talk of decisions given an input representing a situation. While you can't make your current situation inconsistent, it is sometimes possible for a program to be written such that it cannot be put in the situation representing an input as its output would be inconsistent with that. And that lets us define what we wanted to define, without the nudge to fudge philosophically.

This seems basically consistent with what I'm saying (indeed, almost the same as what I'm saying), except I take strong objection to some of your language. I don't think you "analogically" make situations inconsistent; I think you actually do. Replacing "situation" with "input representing a situation" seems sort of in the right direction, but the notion of "input" is problematic, because it can be your own internal reasoning which predicts your action accurately.

Of the chicken rule, for example, it is literally (not analogically) correct to say that the algorithm takes a different action if it ever proves that it takes a certain action. It is also true that it never ends up in this situation. We could also say that you never take an action if you have an internal state representing certainty that you take that action. However, it is furthermore true that you never get into such an internal state.
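
Here is a toy sketch of the chicken rule, with the proof search stubbed out as a function handed to the agent (real versions, e.g. proof-based or modal decision theories, search for actual proofs in a formal theory; this is only meant to illustrate the diagonalization step):

```python
# Toy illustration of the chicken rule ("playing chicken with the universe").
ACTIONS = ["left", "right"]

def chicken_agent(prove):
    """prove(a) should return True iff the reasoner proves 'the agent takes a'."""
    for a in ACTIONS:
        if prove(a):
            # Chicken rule: if the reasoner proves we take `a`, take something else,
            # making that "proof" inconsistent with our actual behavior.
            return next(b for b in ACTIONS if b != a)
    return "left"   # otherwise, fall back to ordinary expected-utility reasoning (stubbed)

# A sound reasoner can therefore never prove the agent's action: any such proof
# would be falsified by the agent's own behavior.
print(chicken_agent(prove=lambda a: False))        # "left": no proof found
print(chicken_agent(prove=lambda a: a == "left"))  # "right": an (unsound) proof gets diagonalized against
```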

Similarly, in the example where Omega cooks you something which you will eat, I would think it literally correct to say that you would not eat pudding (supposing that's a property of your decision algorithm).

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-08T20:54:29.501Z · score: 4 (2 votes) · LW · GW

Second point: the motto is something like "anything which is dependent on you must have you inside its computation". Something which depends on you because it is causally downstream of you contains you in its computation in the sense that you have to be calculated in the course of calculating it because you're in its past. The claim is that this observation generalizes.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-08T20:35:45.045Z · score: 2 (1 votes) · LW · GW

I note that I'm still overall confused about what the miscommunication was. Your response now seems to fit my earlier interpretation.

First point: I disagree about how to consider things as pure decision theory problems. Taking as input a list of conscious entities seems like a rather large point against a decision theory, since it makes it dependent on a theory of consciousness. If you want to be independent of questions like that, far better to consider decision theory on its own (thinking only in terms of logical control, counterfactuals, etc), and remain agnostic on the question of a connection between anthropics and decision theory.

In my analogy to mathematics, it could be that there's a lot of philosophical baggage on the logic side and also a lot of philosophical baggage on the mathematical side. Claiming that all of math is tautology could create a lot of friction between these two sets of baggage, meaning one has to bite a lot of bullets which other people wouldn't consider biting. This can be a good thing: you're allowing more evidence to flow, pinning down your views on both sides more strongly. In addition to simplicity, that's related to a theory doing more to stick its neck out, making bolder predictions. To me, when this sort of thing happens, the objections to adopting the simpler view have to be actually quite strong.

I suppose that addresses your third point to an extent. I could probably give some reasons besides simplicity, but it seems to me that simplicity is a major consideration here, perhaps my true reason. I suspect we don't actually disagree that much about whether simplicity should be a major consideration (unless you disagree about the weight of Occam's razor, which would surprise me). I suspect we disagree about the cost of biting this particular bullet.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-08T02:28:05.283Z · score: 2 (1 votes) · LW · GW

Maybe I'm confused about the relevance of your original comment to my answer. I interpreted

Firstly, I think it is cleaner to separate issues about whether simulations have consciousness or not from questions of decision theory

as being about the relationship I outline between anthropics and decision theory -- ie, anthropic reasoning may want to take consciousness into account (while you might think it plausible you're in a physics simulation, it is plausible to hold that you can't be living in a more general type of model which predicts you by reasoning about you rather than simulating you, if the model is not detailed enough to support consciousness), whereas decision theory only takes logical control into account (so the relevant question is not whether a model is detailed enough to be conscious, but rather, whether it is detailed enough to create a logical dependence on your behavior).

I took

"The idea is that you have to be skeptical of whether you're in a simulation" - I'm not a big fan of that framing,

as an objection to the connection I drew between thinking you might be in a simulation (ie, the anthropic question) and decision theory. Maybe you were objecting to the connection between thinking you're in a simulation and anthropics? If so, the claimed connection to decision theory is still relevant. If you buy the decision theory connection, it seems hard to not buy the connection between thinking you're in a simulation and anthropics.

I took

it seems as though that you might be able to perfectly predict someone from high level properties without simulating them sufficiently to instantiate a consciousness

to be an attempt to drive a wedge between anthropics and decision theory, by saying that a prediction might introduce a logical correlation without introducing an anthropic question of whether you might be 'living in' the prediction. To which my response was, I may want to bite the bullet on that one for the elegance of treating anthropic questions as decision-theoretic in nature.

I took

there isn't necessarily a one-to-one relationship between "real" world runs and simulations. We only need to simulate an agent once in order to predict the result of any number of identical runs.

to be an attempt to drive a wedge between decision-theoretic and anthropic cases by the fact that we need to assign a small anthropic probability to being the simulation if it is only run once, to which I responded by saying that the math will work out the same on the decision theory side according to Jessica Taylor's theorem.

My interpretation now is that you were never objecting to the connection between decision theory and anthropics, but rather to the way I was talking about anthropics. If so, my response is that the way I was talking about anthropics is essentially forced by the connection to decision theory.

Comment by abramdemski on Deconfusing Logical Counterfactuals · 2019-04-07T21:32:49.017Z · score: 4 (2 votes) · LW · GW
  • How does the forgetting approach differ from an updateless approach (if it is supposed to)?
  • Why do you think there is a good way to determine which information should be forgotten in a given problem, aside from hand analysis? (Hand analysis utilizes the decision theorist's perspective, which is an external perspective the agent lacks.)
Comment by abramdemski on Deconfusing Logical Counterfactuals · 2019-04-07T21:24:50.075Z · score: 6 (3 votes) · LW · GW
Other degenerative cases include when you already know what decision you'll make or when you have the ability to figure it out. For example, when you have perfect knowledge of the environment and the agent, unless you run into issues with unprovability. Note that degeneracy is more common than you might think since knowing, for example, that it is a utility maximiser, tells you its exact behaviour in situations without options that are tied. Again, in these cases, the answer to the question, "What should the agent do?" is, "The only action consistent with the problem statement".

Why should this be the case? What do you think of the motto "decisions are for making bad outcomes inconsistent"?

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-07T21:04:37.575Z · score: 2 (1 votes) · LW · GW

(Definitely not reporting MIRI consensus here, just my own views:) I find it appealing to collapse the analogy and consider the DT considerations to be really touching on the anthropic considerations. It isn't just functionalism with respect to questions about other brains (such as their consciousness); it's also what one might call cognitive functionalism -- ie functionalism with respect to the map as opposed to the territory (my mind considering questions such as consciousness). What I mean is: if the decision-theoretic questions were isomorphic to the anthropic questions, serving the same sort of role in decision-making, then if I were to construct a mind thinking about one or the other and ask it what it is thinking, there wouldn't be any questions which would differentiate anthropic reasoning from the analogous decision-theoretic reasoning. This would seem like a quite strong argument in favor of discarding the distinction.

I'm not saying that's the situation (we would need to agree, individually, on separate settled solutions to both anthropics and decision theory in order to compare them side by side in that way). I'm saying that things seem to point in that direction.

It seems rather analogous to thinking that logic and mathematics are distinct (logical knowledge encoding tautology only, mathematical knowledge encoding a priori analytic knowledge, which one could consider distinct... I'm just throwing out philosophy words here to try to bolster the plausibility of this hypothetical view) -- and then discovering that within the realm of what you considered to be pure logic, there's a structure which is isomorphic to the natural numbers, with all reasoning about the natural numbers being explainable as purely logical reasoning. It would be possible to insist on maintaining the distinction between the mathematical numbers and the logical structure which is analogous to them, referring to the first as analytic a priori knowledge and the second as tautology. However, one naturally begins to question the mathematical/logical distinction which was previously made. Was the notion of "logical" too broad? (Is higher-order logic really a part of mathematics?) Is number theory really a part of logic, rather than mathematics proper? What other branches of mathematics can be seen as structures within logic? Perhaps all mathematics is tautology, as Wittgenstein had it?

This position certainly has some counterintuitive consequences, which should be controversial. From a decision-theoretic perspective, it is practical to regard any means of predicting you which has equivalent predictive power to be equally "an instance of you" and hence equally conscious: a physics-style attempt to simulate you, or a logic-style attempt to reason about what you would do.

As for the question of simulating a person once to predict them a hundred times, the math all works out nicely if you look at Jessica Taylor's post on the memoryless cartesian setting. A subjectively small chance of being the one simulation when a million meat copies are playing Newcomb's will suffice for decision-theoretic purposes. How exactly everything works out depends on the details of formalizing the problem in the memoryless cartesian setting, but the theorem guarantees that everything balances. (I find this fact surprising and somewhat counterintuitive, but working out some examples myself helped me.)
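
Here is a worked version with my own numbers (one simulation used to predict a million meat copies, standard Newcomb payoffs); the ratifiability-style check below is my gloss on how the theorem plays out in this case, not a quotation of it:

```python
# Newcomb's problem where one simulation is used to predict a million "meat" copies.
# $1,000,000 in the opaque box iff the simulation one-boxes; $1,000 always in the
# transparent box. Utility = total money received by the real copies.

N = 1_000_000          # real copies
BIG, SMALL = 1_000_000, 1_000
p_sim = 1 / (N + 1)    # SIA credence in being the one simulation

# CDT+SIA check: does a unilateral deviation (holding every other instance's
# action fixed) raise causal expected utility?

# Candidate policy: everyone one-boxes. Deviating to two-boxing:
gain_if_real = SMALL                  # grab the extra transparent box
loss_if_sim = N * BIG                 # Omega empties every real copy's opaque box
print((1 - p_sim) * gain_if_real - p_sim * loss_if_sim)   # ~ -999,000: one-boxing is stable

# Candidate policy: everyone two-boxes. Deviating to one-boxing:
loss_if_real = SMALL                  # give up the transparent box
gain_if_sim = N * BIG                 # Omega fills every real copy's opaque box
print(p_sim * gain_if_sim - (1 - p_sim) * loss_if_real)   # ~ +999,000: two-boxing is not stable
```

The tiny subjective probability of being the simulation is offset by the simulation's enormous causal leverage over the million real payoffs, so the local CDT+SIA verdict matches the global (UDT-style) comparison of policies.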

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-07T20:15:17.699Z · score: 2 (1 votes) · LW · GW

My thoughts on that are described further here.

Comment by abramdemski on Would solving logical counterfactuals solve anthropics? · 2019-04-06T22:39:19.053Z · score: 20 (7 votes) · LW · GW

I think it depends on how much you're willing to ask counterfactuals to do.

In the paper Anthropic Decision Theory for Self-Locating Agents, Stuart Armstrong says "ADT is nothing but the anthropic version of the far more general Updateless Decision Theory and Functional Decision Theory" -- suggesting that he agrees with the idea that a proposed solution to counterfactual reasoning gives a proposed solution to anthropic reasoning. The overall approach of that paper is to side-step the issue of assigning anthropic probabilities, instead addressing the question of how to make decisions in cases where anthropic questions arise. I suppose this might either be said to "solves anthropics" or "side-steps anthropics", and this choice would determine whether one took Stuart's view to answer "yes" or "no" to your question.

Stuart mentions in that paper that agents making decisions via CDT+SIA tend to behave the same as agents making decisions via EDT+SSA. This can be seen formally in Jessica Taylor's post about CDT+SIA in memoryless cartesian environments, and Caspar Oesterheld's comment about the parallel for EDT+SSA. The post discusses the close connection to pure UDT (with no special anthropic reasoning). Specifically, CDT+SIA (and EDT+SSA) are consistent with the optimality notion of UDT, but don't imply it (UDT may do better, according to its own notion of optimality). This is because UDT (specifically, UDT 1.1) looks for the best solution globally, whereas CDT+SIA can have self-coordination problems (like hunting rabbit in a game of stag hunt with identical copies of itself).
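
To make the coordination point concrete, here is a toy stag hunt between two identical copies, with payoffs I made up:

```python
# Stag pays 5 if the other copy also hunts stag, 0 otherwise; rabbit always pays 3.
def payoff(me, other):
    if me == "stag":
        return 5 if other == "stag" else 0
    return 3

# UDT 1.1 compares global policies; identical copies necessarily act alike.
print({a: payoff(a, a) for a in ("stag", "rabbit")})   # {'stag': 5, 'rabbit': 3} -> choose stag

# A local check in the style of CDT+SIA holds the other copy's action fixed.
# Both symmetric profiles are then self-ratifying, so local reasoning can settle on rabbit.
for a in ("stag", "rabbit"):
    stable = all(payoff(a, a) >= payoff(b, a) for b in ("stag", "rabbit") if b != a)
    print(a, "is an equilibrium:", stable)              # both print True
```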

You could see this as giving a relationship between two different notions of counterfactual, with anthropic reasoning mediating the connection.

CDT and EDT are two different ways of reasoning about the consequences of actions. Both of them are "updateful": they make use of all information available in estimating the consequences of actions. We can also think of them as "local": they make decisions from the situated perspective of an information state, whereas UDT makes decisions from a "global" perspective considering all possible information states.

I would claim that global counterfactuals have an easier job than local ones, if we buy the connection between the two suggested here. Consider the transparent Newcomb problem: you're offered a very large pile of money if and only if you're the sort of agent who takes most, but not all, of the pile. It is easy to say from an updateless (global) perspective that you should be the sort of agent who takes most of the money. It is more difficult to face the large pile (an updateful/local perspective) and reason that it is best to take most-but-not-all; your counterfactuals have to say that taking all the money doesn't mean you get all the money. The idea is that you have to be skeptical of whether you're in a simulation; ie, your counterfactuals have to do anthropic reasoning.
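
With made-up numbers, the global comparison looks like this:

```python
# A perfect predictor offers the $1,000,000 pile only to agents who will take 99% of it;
# a take-it-all agent only ever faces a $1,000 pile. (All amounts are my own choices.)
BIG_PILE, SMALL_PILE = 1_000_000, 1_000

def outcome(policy):
    pile = BIG_PILE if policy == "take most" else SMALL_PILE
    return 0.99 * pile if policy == "take most" else pile

print(outcome("take most"))   # 990000.0: the updateless comparison favors this policy
print(outcome("take all"))    # 1000: this agent never sees the big pile at all

# The local temptation: *given* the big pile in front of you, taking all of it looks
# like +$10,000. A counterfactual that says so is exactly what has to be rejected,
# since a take-all agent could not have been facing the big pile in the first place.
```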

In other words: you could factor the whole problem of logical decision theory in two different ways.

  • Option 1:
    • Find a good logically updateless perspective, providing the 'global' view from which we can make decisions.
    • Find a notion of logical counterfactual which combines with the above to yield decisions.
  • Option 2:
    • Find an updateful but skeptical perspective, which takes (logical) observations into account, but also accounts for the possibility that it is in a simulation and being fooled about those observations.
    • Find a notion of counterfactual which works with the above to make good decisions.
    • Also, somehow solve the coordination problems (which otherwise make option 1 look superior).

With option 1, you side-step anthropic reasoning. With option 2, you have to tackle it explicitly. So, you could say that in option 1, you solve anthropic reasoning for free if you solve counterfactual reasoning; in option 2, it's quite the opposite: you might solve counterfactual reasoning by solving anthropic reasoning.

I'm more optimistic about option 2, recently. I used to think that maybe we could settle for the most basic possible notion of logical counterfactual, ie, evidential conditionals, if combined with logical updatelessness. However, a good logically updateless perspective has proved quite elusive so far.

Alignment Research Field Guide

2019-03-08T19:57:05.658Z · score: 194 (66 votes)

Pavlov Generalizes

2019-02-20T09:03:11.437Z · score: 66 (19 votes)
Comment by abramdemski on Alignment Newsletter #41 · 2019-01-22T01:47:46.644Z · score: 4 (2 votes) · LW · GW
This seems like it is not about the "motivational system", and if this were implemented in a robot that does have a separate "motivational system" (i.e. it is goal-directed), I worry about a nearest unblocked strategy.

I am confused about where you think the motivation system comes into my statement. It sounds like you are imagining that what I said is a constraint, which could somehow be coupled with a separate motivation system. If that's your interpretation, that's not what I meant at all, unless random sampling counts as a motivation system. I'm saying that all you do is sample from what's consented to.

But, maybe what you are saying is that in "the intersection of what the user expects and what the user wants", the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system). If that's what you meant, I think that's a valid concern. What I was imagining is that you are trying to infer "what the user wants" not in terms of end goals, but rather in terms of actions (really, policies) for the AI. So, it is more like an approval-directed agent to an extent. If the human says "get me groceries", the job of the AI is not to infer the end state the human is asking the robot to optimize for, but rather, to infer the set of policies which the human is trying to point at.
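
One possible way to formalize that sampling step (this is my own sketch, not something from the original post) is to renormalize the product of a "what the user expects" distribution and a "what the user wants" distribution over candidate policies, and then sample from the result:

```python
import random

# Made-up candidate policies and probabilities, for illustration only.
expects = {"walk to the store and buy groceries": 0.6, "order delivery": 0.4, "build nanotech": 0.0}
wants   = {"walk to the store and buy groceries": 0.5, "order delivery": 0.5, "build nanotech": 0.3}

unnormalized = {p: expects[p] * wants[p] for p in expects}
total = sum(unnormalized.values())
intersection = {p: w / total for p, w in unnormalized.items()}

print(intersection)             # nanotech gets probability 0: the user doesn't expect it
policy = random.choices(list(intersection), weights=list(intersection.values()))[0]
print(policy)                   # the AI simply enacts the sampled policy
```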

There's no optimization on top of this which could find perverse instantiations of the constraints; the AI just follows the policy which it infers the human would like. Of course, the powerful learning system required for this to work may perversely instantiate these beliefs (ie, there may be daemons, aka inner optimizers).

(The most obvious problem I see with this approach is that it seems to imply that the AI can't help the human do anything which the human doesn't already know how to do. For example, if you don't know how to get started filing your taxes, then the robot can't help you. But maybe there's some way to differentiate between more benign cases like that and less benign cases like using nanotechnology to more effectively get groceries?)

A third interpretation of your concern is that you're saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it's going to be pushing toward perverse instantiations one way or another. I don't have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.

(I feel compelled to mention again that I don't feel strongly that the whole idea makes any sense. I just want to convey why I don't think it's about constraining an underlying motivation system.)

Comment by abramdemski on Alignment Newsletter #41 · 2019-01-20T07:59:27.379Z · score: 8 (4 votes) · LW · GW
Non-Consequentialist Cooperation? (Abram Demski): [...]
However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI's "motivational system". This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are "somewhat" corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn't seem to have an analogous benefit.

You could definitely think of it as a limitation to put on a system, but I actually wasn't thinking of it that way when I wrote the post. I was trying to imagine something which only operates from this principle. Granted, I didn't really explain how that could work. I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.

(It now seems to me that although I put "non-consequentialist" in the title of the post, I didn't explain the part where it isn't consequentialist very well. Which is fine, since the post was very much just spitballing.)

Comment by abramdemski on CDT=EDT=UDT · 2019-01-19T20:22:04.956Z · score: 2 (1 votes) · LW · GW

Agreed. I'll at least edit the post to point to this comment.

Comment by abramdemski on CDT=EDT=UDT · 2019-01-18T21:41:42.577Z · score: 2 (1 votes) · LW · GW

I'm not sure which you're addressing, but, note that I'm not objecting to the practice of illustrating variables with diamonds and boxes rather than only circles so that you can see at a glance where the choices and the utility are (although I don't tend to use the convention myself). I'm objecting to the further implication that doing this makes it not a Bayes net.

Comment by abramdemski on XOR Blackmail & Causality · 2019-01-18T18:57:41.311Z · score: 3 (2 votes) · LW · GW
I hear there is a way to fiddle with the foundations of probability theory so that conditional probabilities are taken as basic and ordinary probabilities are defined in terms of them. Maybe this would solve the problem?

This does help somewhat. See here. But, in order to get good answers from that, you need to already know enough about the structure of the situation.

Maybe I'm late to the party, in which case sorry about that & I look forward to hearing why I'm wrong, but I'm not convinced that epsilon-exploration is a satisfactory way to ensure that conditional probabilities are well-defined. Here's why:

I agree, but I also think there are some things pointing in the direction of "there's something interesting going on with epsilon exploration". Specifically, there's a pretty strong analogy between epsilon exploration and modal UDT: MUDT is like the limit as you send exploration probability to zero, so it never actually happens but it still happens in nonstandard models. However, that only seems to work when you know the structure of the situation logically. When you have to learn it, you have to actually explore sometimes to get it right.

To the extent that MUDT looks like a deep result about counterfactual reasoning, I take this as a point in favor of epsilon exploration telling us something about the deep structure of counterfactual reasoning.

Anyway, see here for some more recent thoughts of mine. (But I didn't discuss the question of epsilon exploration as much as I could have.)

Comment by abramdemski on CDT=EDT=UDT · 2019-01-18T18:40:56.722Z · score: 2 (1 votes) · LW · GW

I disagree. All the nodes in the network should be thought of as grounding out in imagination, in that it's a world-model, not a world. Maybe I'm not seeing your point.

I would definitely like to see a graphical model that's more capable of representing the way the world-model itself is recursively involved in decision-making.

One argument for calling an influence diagram a generalization of a Bayes net could be that the conditional probability table for the agent's policy given observations is not given as part of the influence diagram, and instead must be solved for. But we can still think of this as a special case of a Bayes net, rather than a generalization, by thinking of an influence diagram as a special sort of Bayes net in which the decision nodes have to have conditional probability tables obeying some optimality notion (such as the CDT optimality notion, the EDT optimality notion, etc).

This constraint is not easily represented within the Bayes net itself, but instead imposed from outside. It would be nice to have a graphical model in which you could represent that kind of constraint naturally. But simply labelling things as decision nodes doesn't do much. I would rather have a way of identifying something as agent-like based on the structure of the model for it. (To give a really bad version: suppose you allow directed cycles, rather than requiring DAGs, and you think of the "backwards causality" as agency. But, this is really bad, and I offer it only to illustrate the kind of thing I mean -- allowing you to express the structure which gives rise to agency, rather than taking agency as a new primitive.)
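
To illustrate the "special sort of Bayes net" reading, here is a minimal sketch (the toy structure and numbers are mine) in which the decision node's table is not given but solved for, under a simple expected-utility optimality notion:

```python
from itertools import product

# Structure: Rain -> Observation -> Decision, with Utility depending on (Rain, Decision).
P_RAIN = 0.3
P_OBS_GIVEN_RAIN = {True: 0.9, False: 0.2}     # P(forecast says "rain" | rain)
UTILITY = {(True, "umbrella"): 1, (True, "none"): -10,
           (False, "umbrella"): 0, (False, "none"): 2}

def expected_utility(policy):                  # policy: observation -> action
    eu = 0.0
    for rain, obs in product([True, False], repeat=2):
        p = P_RAIN if rain else 1 - P_RAIN
        p *= P_OBS_GIVEN_RAIN[rain] if obs else 1 - P_OBS_GIVEN_RAIN[rain]
        eu += p * UTILITY[(rain, policy[obs])]
    return eu

policies = [{True: a, False: b} for a in ("umbrella", "none") for b in ("umbrella", "none")]
best = max(policies, key=expected_utility)
print(best)   # {True: 'umbrella', False: 'none'}: the solved-for "CPT" of the decision node
```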

Comment by abramdemski on What makes people intellectually active? · 2019-01-18T18:28:06.383Z · score: 2 (1 votes) · LW · GW
All in all, I can't wrap my head around "what is the difference between a producer and a consumer of thought?" because the question as posed seems to hold rigor, even quality, constant/irrelevant.

I'm not trying to hold it constant, I'm just trying to understand a relatively low standard, because that's the part I feel confused about. It seems relatively much easier to look at bad intellectual output and say how it could have been better, think about the thought processes involved, etc. Much harder to say what goes into producing output at all vs not doing so.

Comment by abramdemski on Is there a.. more exact.. way of scoring a predictor's calibration? · 2019-01-16T18:26:05.525Z · score: 25 (8 votes) · LW · GW

It's important to note that accuracy and calibration are two different things. I'm mentioning this because the OP asks for calibration metrics, but several answers so far give accuracy metrics. Any proper scoring rule is a measure of accuracy as opposed to calibration.

It is possible to be very well-calibrated but very inaccurate; for example, you might know that it is going to be Monday 1/7th of the time, so you give a probability of 1/7th. Everyone else just knows what day it is. On a calibration graph, you would be perfectly lined up; when you say 1/7th, the thing happens 1/7th of the time.

It is also possible to have high accuracy and poor calibration. Perhaps you can guess coin flips when no one else can, but you are wary of your precognitive powers, which makes you underconfident. So, you always place 60% probability on the event that actually happens (heads or tails). Your calibration graph is far out of line, but your accuracy is higher than anyone else's.
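
Putting numbers on those two toy forecasters, with the Brier score as the accuracy measure and the gap between stated probability and observed frequency as the calibration measure:

```python
def brier(predictions):
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

# "Is today Monday?" over one week: the 1/7 forecaster vs someone who reads a calendar.
week = [1, 0, 0, 0, 0, 0, 0]
one_seventh = [(1 / 7, day) for day in week]
calendar    = [(float(day), day) for day in week]
print(brier(one_seventh), brier(calendar))   # ~0.122 vs 0.0: well calibrated, but less accurate

# The wary precog: always 60% on the side of the coin that actually comes up;
# everyone else can only say 50%.
precog = [(0.6, 1)] * 100
chance = [(0.5, 1)] * 100
print(brier(precog), brier(chance))          # 0.16 vs 0.25: the precog is more accurate...
print(abs(0.6 - 1.0), abs(1 / 7 - 1 / 7))    # ...but has a 0.4 calibration gap, vs 0.0
```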

In terms of improving rationality, the interesting thing about calibration is that (as in the precog example) if you know you're poorly calibrated, you can boost your accuracy simply by improving your calibration. In some sense it is a free improvement: you don't need to know anything more about the domain; you get more accurate just by knowing more about yourself (by seeing a calibration chart and adjusting).

However, if you just try to be more calibrated without any concern for accuracy, you could be like the person who says 1/7th. So, just aiming to do well on a score of calibration is not a good idea. This could be part of the reason why calibration charts are presented instead of calibration scores. (Another reason being that calibration charts help you know how to adjust to increase calibration.)

That being said, a decomposition of a proper scoring rule into components including a measure of calibration, like Dark Denego gives, seems like the way to go.
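
For concreteness, one standard decomposition of this kind is Murphy's, in which the Brier score splits into reliability (a calibration term), resolution, and uncertainty. The sketch below is my own illustration, not the one from the comment being referenced:

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Brier = reliability - resolution + uncertainty, for forecasts binned by stated value."""
    n = len(forecasts)
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    base_rate = sum(outcomes) / n
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
    resolution  = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
outcomes  = [1,   1,   1,   0,   0,   0,   1,   0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(rel, res, unc)                                   # ~ 0.0025 0.0625 0.25
print(abs((rel - res + unc) - brier) < 1e-12)          # True: the identity is exact
```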

Comment by abramdemski on CDT=EDT=UDT · 2019-01-16T04:18:38.870Z · score: 12 (3 votes) · LW · GW

I guess, philosophically, I worry that giving the nodes special types like that pushes people toward thinking about agents as not-embedded-in-the-world, thinking things like "we need to extend Bayes nets to represent actions and utilities, because those are not normal variable nodes". Not that memoryless cartesian environments are any better in that respect.

Comment by abramdemski on CDT=EDT=UDT · 2019-01-16T02:45:28.521Z · score: 14 (8 votes) · LW · GW

Hrm. I realize that the post would be comprehensible to a much wider audience with a glossary, but there's one level of effort needed for me to write posts like this one, and another level needed for posts where I try to be comprehensible to someone who lacks all the jargon of MIRI-style decision theory. Basically, if I write with a broad audience in mind, then I'm modeling all the inferential gaps and explaining a lot more details. I would never get to points like the one I'm trying to make in this post. (I've tried.) Posts like this are primarily for the few people who have kept up with the CDT=EDT sequence so far, to get my updated thinking in writing in case anyone wants to go through the effort of trying to figure out what in the world I mean. To people who need a glossary, I recommend searching LessWrong and the Stanford Encyclopedia of Philosophy.

What are the components of intellectual honesty?

2019-01-15T20:00:09.144Z · score: 32 (8 votes)
Comment by abramdemski on Combat vs Nurture & Meta-Contrarianism · 2019-01-15T19:36:05.442Z · score: 10 (4 votes) · LW · GW

I've avoided people/conversations on those grounds, but I'm not sure it is the best way to deal with it. And I really do think good intellectual progress can be made at level 2. As Ruby said in the post I'm replying to, intellectual debate is common in analytic philosophy, and it does well there.

Maybe my description of intellectual debate makes you think of all the bad arguments-are-soldiers stuff. Which it should. But, I think there's something to be said about highly developed cultures of intellectual debate. There are a lot of conventions which make it work better, such as a strong norm of being charitable to the other side (which, in intellectual-debate culture, means an expectation that people will call you out for being uncharitable). This sort of simulates level 3 within level 2.

As for level 1, you might be able to develop some empathy for it at times when you feel particularly vulnerable and need people to do something to affirm your belongingness in a group or conversation. Keep an eye out for times when you appreciate level-one behavior from others, times when you would have appreciated some level-one comfort, or times when other people engage in level one (and decide whether it was helpful in the situation). It's nice when we can get to a place where no one's ego is on the line when they offer ideas, but sometimes it just is. Ignoring it doesn't make it go away; it just makes you manage it ineptly. My guess is that you are involved with more level one situations than you think, and would endorse some of it.

Comment by abramdemski on CDT Dutch Book · 2019-01-14T07:52:28.950Z · score: 2 (1 votes) · LW · GW

(lightly edited version of my original email reply to above comment; note that Diffractor was originally replying to a version of the Dutch-book which didn't yet call out the fact that it required an assumption of nonzero probability on actions.)

I agree that this Dutch-book argument won't touch probability zero actions, but my thinking is that it really should apply in general to actions whose probability is bounded away from zero (in some fairly broad setting). I'm happy to require an epsilon-exploration assumption to get the conclusion.

Your thought experiment raises the issue of how to ensure in general that adding bets to a decision problem doesn't change the decisions made. One thought I had was to make the bets always smaller than the difference in utilities. Perhaps smaller Dutch-books are in some sense less concerning, but as long as they don't vanish to infinitesimal, seems legit. A bet that's desirable at one scale is desirable at another. But scaling down bets may not suffice in general. Perhaps a bet-balancing scheme to ensure that nothing changes the comparative desirability of actions as the decision is made?

For your cosmic ray problem, what about: 

You didn't specify the probability of a cosmic ray. I suppose it should have probability higher than the probability of exploration. Let's say 1/million for cosmic ray, 1/billion for exploration.

Before the agent makes the decision, it can be given the option to lose .01 util if it goes right, in exchange for +.02 utils if it goes right & cosmic ray. This will be accepted (by either a CDT agent or EDT agent), because it is worth approximately +.01 util conditioned on going right, since cosmic ray is almost certain in that case.

Then, while making the decision, cosmic ray conditioned on going right looks very unlikely in terms of CDT's causal expectations. We give the agent the option of getting .001 util if it goes right, if it also agrees to lose .02 conditioned on going right & cosmic ray.

CDT agrees to both bets, and so loses money upon going right.

Ah, that's not a very good money pump. I want it to lose money no matter what. Let's try again: 

Before decision: option to lose 1 millionth of a util in exchange for 2 utils if right&ray.

During decision: option to gain .1 millionth util in exchange for -2 util if right&ray.

That should do it. CDT loses .9 millionth of a util, with nothing gained. And the trick is almost the same as my dutch book for death in damascus. I think this should generalize well.

The amounts of money lost in the Dutch Book get very small, but that's fine.
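
For the record, here is the bookkeeping on those two bets, granting the argument above that CDT accepts both, and reading the .1 millionth gain as conditional on going right (as in the first attempt):

```python
# All side payments are in utils; "right" is the explored action, "ray" the cosmic-ray event.
def bet_before(right, ray):   # lose 1 millionth; gain 2 if right & ray
    return -1e-6 + (2.0 if right and ray else 0.0)

def bet_during(right, ray):   # gain .1 millionth if right; lose 2 if right & ray
    return (0.1e-6 if right else 0.0) + (-2.0 if right and ray else 0.0)

for ray in (True, False):
    net = bet_before(True, ray) + bet_during(True, ray)
    print(f"goes right, ray={ray}: net {net:+.7f} utils")
# goes right, ray=True:  net -0.0000009 utils
# goes right, ray=False: net -0.0000009 utils
# Either way the agent is out 0.9 millionths of a util, with nothing gained in exchange.
```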

CDT=EDT=UDT

2019-01-13T23:46:10.866Z · score: 42 (11 votes)

When is CDT Dutch-Bookable?

2019-01-13T18:54:12.070Z · score: 25 (4 votes)
Comment by abramdemski on CDT Dutch Book · 2019-01-13T08:28:40.350Z · score: 4 (2 votes) · LW · GW

"The expectations should be equal for actions with nonzero probability" -- this means a CDT agent should have equal causal expectations for any action taken with nonzero probability, and EDT agents should similarly have equal evidential expectations. Actually, I should revise my statement to be more careful: in the case of epsilon-exploring agents, the condition is >epsilon rather than >0. In any case, my statement there isn't about evidential and causal expectations being equal to each other, but rather about one of them being conversant across (sufficiently probable) actions.

"differing counterfactual and evidential expectations are smoothly more and more tenable as actions become less and less probable" -- this means that the amount we can take from a CDT agent through a Dutch Book, for an action which is given a different casual expectation than evidential expectation, smoothly reduces as the probability of an action goes to zero. In that statement, I was assuming you hold the difference between evidential and causal expectations constant add you reduce the probability of the action. Otherwise it's not necessarily true.

CDT Dutch Book

2019-01-13T00:10:07.941Z · score: 27 (8 votes)
Comment by abramdemski on Combat vs Nurture & Meta-Contrarianism · 2019-01-12T23:11:02.508Z · score: 9 (4 votes) · LW · GW

I think it's usually a good idea overall, but there is a less cooperative conversational tactic which tries to masquerade as this: listing a number of plausible straw-men in order to create the appearance that all possible interpretations of what the other person is saying are bad. (Feels like from the inside: all possible interpretations are bad; I'll demonstrate it exhaustively...)

It's not completely terrible, because even this combative version of the conversational move opens up the opportunity for the other person to point out the (n+1)th interpretation which hasn't been enumerated.

You can try to differentiate yourself from this via tone (by not sounding like you're trying to argue against the other person in asking the question), but, this will only be somewhat successful since someone trying to make the less cooperative move will also try to sound like they're honestly trying to understand.

Comment by abramdemski on Non-Consequentialist Cooperation? · 2019-01-11T23:54:00.135Z · score: 2 (1 votes) · LW · GW

My gut response is that hillclimbing is itself consequentialist, so this doesn't really help with fragility of value; if you get the hillclimbing direction slightly wrong, you'll still end up somewhere very wrong. On the other hand, Paul's approach rests on something which we could call a deontological approach to the hillclimbing part (IE, amplification steps do not rely on throwing more optimization power at a pre-specified function).

Comment by abramdemski on Non-Consequentialist Cooperation? · 2019-01-11T23:15:16.353Z · score: 4 (2 votes) · LW · GW

I wouldn't say that preference utilitarianism "falls apart"; it just becomes much harder to implement.

And I'd like a little more definition of "autonomy" as a value - how do you operationally detect whether you're infringing on someone's autonomy?

My (still very informal) suggestion is that you don't try to measure autonomy directly and optimize for it. Instead, you try to define and operate from informed consent. This (maybe) allows a system to have enough autonomy to perform complex and open-ended tasks, but not so much that you expect perverse instantiations of goals.

My proposed definition of informed consent is "the human wants X and understands the consequences of the AI doing X", where X is something like a probability distribution on plans which the AI might enact. (... that formalization is very rough)

Is it just the right to make bad decisions (those which contradict stated goals and beliefs)?

This is certainly part of respecting an agent's autonomy. I think more generally respecting someone's autonomy means not taking away their freedom, not making decisions on their behalf without having prior permission to do so, and avoiding operating from assumptions about what is good or bad for a person.

Comment by abramdemski on Non-Consequentialist Cooperation? · 2019-01-11T22:56:11.184Z · score: 7 (4 votes) · LW · GW
Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone's autonomy when they value it themselves. If they don't, why impose it upon them?

One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn't want that, why impose it?

Corrigibility makes sense as something to ensure in its own right because it is good to have in case the value learning is not doing what it should (or something else is going wrong).

I think respect for autonomy is similarly useful. It helps avoid evil-genie (perverse instantiation) type failures by requiring that we understand what we are asking the AI to do. It helps avoid preference-manipulation problems which value learning approaches might otherwise have, because regardless of how well expected-human-value is optimized by manipulating human preferences, such manipulation usually involves fooling the human, which violates autonomy.

(In cases where humans understand the implications of value manipulation and consent to it, it's much less concerning -- though we still want to make sure the AI isn't prone to pressure humans into that, and think carefully about whether it is really OK.)

Is the point here that you expect we can't solve those problems and therefore need an alternative? The idea doesn't help with "the difficulties of assuming human rationality" though so what problems does it help with?

It's less an alternative in terms of avoiding the things which make value learning hard, and more an alternative in terms of providing a different way to apply the same underlying insights, to make something which is less of a ruthless maximizer at the end.

In other words, it doesn't avoid the central problems of ambitious value learning (such as "what does it mean for irrational beings to have values?"), but it is a different way to try to put those insights together into a safe system. You might add other safety precautions to an ambitious value learner, such as [ambitious value learning + corrigibility + mild optimization + low impact + transparency]. Consent-based systems could be an alternative to that agglomerated approach, either replacing some of the safety measures or making them less difficult to include by providing a different foundation to build on.

Is the idea that even trying to do ambitious value learning constitutes violating someone's autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?

I think there are a couple of ways in which this is true.

  • I mentioned cases where a value-learner might violate privacy in ways humans wouldn't want, because the overall result is positive in terms of the extent to which the AI can optimize human values. This is somewhat bad, but it isn't X-risk bad. It's not my real concern. I pointed it out because I think it is part of the bigger picture; it provides a good example of the kind of optimization a value-learner is likely to engage in, which we don't really want.
  • I think the consent/autonomy idea actually gets close (though maybe not close enough) to something fundamental about safety concerns which follow an "unexpected result of optimizing something reasonable-looking" pattern. As such, it may be better to make it an explicit design feature, rather than trust the system to realize that it should be careful about maintaining human autonomy before it does anything dangerous.
  • It seems plausible that, interacting with humans over time, a system which respects autonomy at a basic level would converge to different overall behavior than a value-learning system which trades autonomy off with other values. If you actually get ambitious value learning really right, this is just bad. But, I don't endorse your "why impose it on them?" argument. Humans could eventually decide to run all-out value-learning optimization (without mild optimization, without low-impact constraints, without hard-coded corrigibility). Preserving human autonomy in the meantime seems like a reasonable precaution.
Comment by abramdemski on What makes people intellectually active? · 2019-01-11T21:20:51.846Z · score: 4 (2 votes) · LW · GW

Abstracting your idea a little: in order to go beyond first thoughts, you need some kind of strategy for developing ideas further. Without one, you will just have the same thoughts when you try to "think more" about a subject. I've edited my answer to elaborate on this idea.

Non-Consequentialist Cooperation?

2019-01-11T09:15:36.875Z · score: 40 (13 votes)

Combat vs Nurture & Meta-Contrarianism

2019-01-10T23:17:58.703Z · score: 55 (16 votes)
Comment by abramdemski on What makes people intellectually active? · 2019-01-01T06:59:21.301Z · score: 2 (1 votes) · LW · GW

Well, my original intention was definitely more like "why don't more people keep developing their ideas further?" as opposed to "why don't more people have ideas?" -- but, I definitely grant that sharing ideas is what I actually am able to observe.

Comment by abramdemski on What makes people intellectually active? · 2019-01-01T06:25:21.541Z · score: 10 (7 votes) · LW · GW

If someone had commented with a one-line answer like "people are intellectually active if it is rewarding", I would have been very meh about it -- it's obvious, but trivial. All the added detail you gave makes it seem like a pretty useful observation, though.

Two possible caveats --

  • What determines what's rewarding? Any set of behaviors can be explained by positing that they're rewarding, so for this kind of model to be meaningful, there's got to be a set of rewards involved which are relatively simple and have relatively broad explanatory power.
  • In order for a behavior to be rewarded in the first place, it has to be generated the first time. How does that happen? Animal trainers build up complicated tricks by rewarding steps incrementally approaching the desired behavior. Are there similar incremental steps here? What are they, and what rewards are associated with them?

(Your spelled-out details give some ideas in those directions.)

What makes people intellectually active?

2018-12-29T22:29:33.943Z · score: 75 (31 votes)

Embedded Agency (full-text version)

2018-11-15T19:49:29.455Z · score: 86 (31 votes)

Embedded Curiosities

2018-11-08T14:19:32.546Z · score: 76 (27 votes)

Subsystem Alignment

2018-11-06T16:16:45.656Z · score: 114 (35 votes)

Robust Delegation

2018-11-04T16:38:38.750Z · score: 109 (36 votes)

Embedded World-Models

2018-11-02T16:07:20.946Z · score: 79 (24 votes)

Decision Theory

2018-10-31T18:41:58.230Z · score: 84 (29 votes)

Embedded Agents

2018-10-29T19:53:02.064Z · score: 150 (61 votes)

A Rationality Condition for CDT Is That It Equal EDT (Part 2)

2018-10-09T05:41:25.282Z · score: 17 (6 votes)

A Rationality Condition for CDT Is That It Equal EDT (Part 1)

2018-10-04T04:32:49.483Z · score: 21 (7 votes)

In Logical Time, All Games are Iterated Games

2018-09-20T02:01:07.205Z · score: 83 (26 votes)

Track-Back Meditation

2018-09-11T10:31:53.354Z · score: 57 (21 votes)

Exorcizing the Speed Prior?

2018-07-22T06:45:34.980Z · score: 11 (4 votes)

Stable Pointers to Value III: Recursive Quantilization

2018-07-21T08:06:32.287Z · score: 17 (7 votes)

Probability is Real, and Value is Complex

2018-07-20T05:24:49.996Z · score: 44 (20 votes)

Complete Class: Consequentialist Foundations

2018-07-11T01:57:14.054Z · score: 43 (16 votes)

Policy Approval

2018-06-30T00:24:25.269Z · score: 47 (15 votes)

Machine Learning Analogy for Meditation (illustrated)

2018-06-28T22:51:29.994Z · score: 93 (35 votes)

Confusions Concerning Pre-Rationality

2018-05-23T00:01:39.519Z · score: 36 (7 votes)

Co-Proofs

2018-05-21T21:10:57.290Z · score: 84 (23 votes)

Bayes' Law is About Multiple Hypothesis Testing

2018-05-04T05:31:23.024Z · score: 81 (20 votes)

Words, Locally Defined

2018-05-03T23:26:31.203Z · score: 50 (15 votes)

Hufflepuff Cynicism on Hypocrisy

2018-03-29T21:01:29.179Z · score: 33 (17 votes)

Learn Bayes Nets!

2018-03-27T22:00:11.632Z · score: 84 (24 votes)

An Untrollable Mathematician Illustrated

2018-03-20T00:00:00.000Z · score: 260 (89 votes)

Explanation vs Rationalization

2018-02-22T23:46:48.377Z · score: 31 (8 votes)

The map has gears. They don't always turn.

2018-02-22T20:16:13.095Z · score: 54 (14 votes)

Toward a New Technical Explanation of Technical Explanation

2018-02-16T00:44:29.274Z · score: 127 (45 votes)

Two Types of Updatelessness

2018-02-15T20:19:54.575Z · score: 45 (12 votes)

Hufflepuff Cynicism on Crocker's Rule

2018-02-14T00:52:37.065Z · score: 36 (12 votes)

Hufflepuff Cynicism

2018-02-13T02:15:50.945Z · score: 43 (16 votes)

Stable Pointers to Value II: Environmental Goals

2018-02-09T06:03:00.244Z · score: 27 (8 votes)

Two Coordination Styles

2018-02-07T09:00:18.594Z · score: 82 (28 votes)

All Mathematicians are Trollable: Divergence of Naturalistic Logical Updates

2018-01-28T14:50:25.000Z · score: 4 (4 votes)

An Untrollable Mathematician

2018-01-23T18:46:17.000Z · score: 8 (8 votes)

Policy Selection Solves Most Problems

2017-12-01T00:35:47.000Z · score: 2 (2 votes)

Timeless Modesty?

2017-11-24T11:12:46.869Z · score: 25 (7 votes)

Gears Level & Policy Level

2017-11-24T07:17:51.525Z · score: 85 (30 votes)

Where does ADT Go Wrong?

2017-11-17T23:31:44.000Z · score: 2 (2 votes)

The Happy Dance Problem

2017-11-17T00:50:03.000Z · score: 9 (3 votes)