## Posts

Do Sufficiently Advanced Agents Use Logic? 2019-09-13T19:53:36.152Z · score: 38 (14 votes)
Troll Bridge 2019-08-23T18:36:39.584Z · score: 67 (38 votes)
Conceptual Problems with UDT and Policy Selection 2019-06-28T23:50:22.807Z · score: 39 (12 votes)
What's up with self-esteem? 2019-06-25T03:38:15.991Z · score: 40 (17 votes)
How hard is it for altruists to discuss going against bad equilibria? 2019-06-22T03:42:24.416Z · score: 52 (15 votes)
Paternal Formats 2019-06-09T01:26:27.911Z · score: 60 (27 votes)
Mistakes with Conservation of Expected Evidence 2019-06-08T23:07:53.719Z · score: 135 (43 votes)
Does Bayes Beat Goodhart? 2019-06-03T02:31:23.417Z · score: 45 (14 votes)
Selection vs Control 2019-06-02T07:01:39.626Z · score: 102 (26 votes)
Separation of Concerns 2019-05-23T21:47:23.802Z · score: 70 (22 votes)
Alignment Research Field Guide 2019-03-08T19:57:05.658Z · score: 197 (69 votes)
Pavlov Generalizes 2019-02-20T09:03:11.437Z · score: 68 (20 votes)
What are the components of intellectual honesty? 2019-01-15T20:00:09.144Z · score: 32 (8 votes)
CDT=EDT=UDT 2019-01-13T23:46:10.866Z · score: 42 (11 votes)
When is CDT Dutch-Bookable? 2019-01-13T18:54:12.070Z · score: 25 (4 votes)
CDT Dutch Book 2019-01-13T00:10:07.941Z · score: 27 (8 votes)
Non-Consequentialist Cooperation? 2019-01-11T09:15:36.875Z · score: 43 (15 votes)
Combat vs Nurture & Meta-Contrarianism 2019-01-10T23:17:58.703Z · score: 55 (16 votes)
What makes people intellectually active? 2018-12-29T22:29:33.943Z · score: 79 (39 votes)
Embedded Agency (full-text version) 2018-11-15T19:49:29.455Z · score: 89 (34 votes)
Embedded Curiosities 2018-11-08T14:19:32.546Z · score: 79 (30 votes)
Subsystem Alignment 2018-11-06T16:16:45.656Z · score: 115 (36 votes)
Robust Delegation 2018-11-04T16:38:38.750Z · score: 109 (36 votes)
Embedded World-Models 2018-11-02T16:07:20.946Z · score: 80 (25 votes)
Decision Theory 2018-10-31T18:41:58.230Z · score: 84 (29 votes)
Embedded Agents 2018-10-29T19:53:02.064Z · score: 151 (68 votes)
A Rationality Condition for CDT Is That It Equal EDT (Part 2) 2018-10-09T05:41:25.282Z · score: 17 (6 votes)
A Rationality Condition for CDT Is That It Equal EDT (Part 1) 2018-10-04T04:32:49.483Z · score: 21 (7 votes)
In Logical Time, All Games are Iterated Games 2018-09-20T02:01:07.205Z · score: 83 (26 votes)
Track-Back Meditation 2018-09-11T10:31:53.354Z · score: 57 (21 votes)
Exorcizing the Speed Prior? 2018-07-22T06:45:34.980Z · score: 11 (4 votes)
Stable Pointers to Value III: Recursive Quantilization 2018-07-21T08:06:32.287Z · score: 20 (9 votes)
Probability is Real, and Value is Complex 2018-07-20T05:24:49.996Z · score: 44 (20 votes)
Complete Class: Consequentialist Foundations 2018-07-11T01:57:14.054Z · score: 43 (16 votes)
Policy Approval 2018-06-30T00:24:25.269Z · score: 49 (18 votes)
Machine Learning Analogy for Meditation (illustrated) 2018-06-28T22:51:29.994Z · score: 99 (36 votes)
Confusions Concerning Pre-Rationality 2018-05-23T00:01:39.519Z · score: 36 (7 votes)
Co-Proofs 2018-05-21T21:10:57.290Z · score: 91 (25 votes)
Bayes' Law is About Multiple Hypothesis Testing 2018-05-04T05:31:23.024Z · score: 81 (20 votes)
Words, Locally Defined 2018-05-03T23:26:31.203Z · score: 50 (15 votes)
Hufflepuff Cynicism on Hypocrisy 2018-03-29T21:01:29.179Z · score: 33 (17 votes)
Learn Bayes Nets! 2018-03-27T22:00:11.632Z · score: 84 (24 votes)
An Untrollable Mathematician Illustrated 2018-03-20T00:00:00.000Z · score: 264 (93 votes)
Explanation vs Rationalization 2018-02-22T23:46:48.377Z · score: 31 (8 votes)
The map has gears. They don't always turn. 2018-02-22T20:16:13.095Z · score: 54 (14 votes)
Toward a New Technical Explanation of Technical Explanation 2018-02-16T00:44:29.274Z · score: 132 (47 votes)
Two Types of Updatelessness 2018-02-15T20:19:54.575Z · score: 45 (12 votes)
Two Types of Updatelessness 2018-02-15T20:16:41.000Z · score: 0 (0 votes)
Hufflepuff Cynicism on Crocker's Rule 2018-02-14T00:52:37.065Z · score: 36 (12 votes)
Hufflepuff Cynicism 2018-02-13T02:15:50.945Z · score: 43 (16 votes)

Comment by abramdemski on Do Sufficiently Advanced Agents Use Logic? · 2019-09-15T07:22:00.796Z · score: 2 (1 votes) · LW · GW

Backwards, thanks!

Comment by abramdemski on Formalising decision theory is hard · 2019-09-14T22:03:01.036Z · score: 2 (1 votes) · LW · GW

I am not sure what you mean by "meets the LIC or similar" in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent's past choices),

Yeah, that's what I meant.

, then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent's prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box.

Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but prefer to have been the sort of agent who predictably one-boxes, and also prefer to pre-commit to one-box on the next example if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
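
To make the discounting point concrete, here is a rough sketch. Everything in it is an illustrative assumption rather than something from the thread: the payoffs, the discount factor, and a deliberately crude predictor that empties the big box at the next encounter exactly when the agent two-boxes now.

```python
SMALL = 1_000.0      # small box, always available (illustrative payoff)
BIG = 1_000_000.0    # big box, filled only if the predictor expects one-boxing


def deviation_gain(gamma: float, interval: int) -> float:
    """Net discounted gain from two-boxing once instead of one-boxing once.

    Assumes a hypothetical learning predictor that empties the big box at the
    next encounter (``interval`` steps later) iff the agent two-boxes now.
    """
    return SMALL - (gamma ** interval) * BIG


def policy_value(per_encounter_payoff: float, gamma: float, interval: int) -> float:
    """Discounted value of receiving the same payoff at every encounter."""
    return per_encounter_payoff / (1.0 - gamma ** interval)


def shortest_tempting_interval(gamma: float) -> int:
    """Smallest interval at which a one-off two-boxing deviation pays.

    Only terminates for 0 < gamma < 1.
    """
    interval = 1
    while deviation_gain(gamma, interval) <= 0:
        interval += 1
    return interval
```

With a long enough interval, `deviation_gain` turns positive, so the agent is tempted at each individual encounter, yet `policy_value(BIG, ...)` still exceeds `policy_value(SMALL, ...)`: the agent would rather have been a predictable one-boxer.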

But I wasn't sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.

As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn't literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but "smaller") scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.

I was never convinced that "logical ASP" is a "fair" problem. I once joked with Scott that we can consider a "predictor" that is just the single line of code "return DEFECT" but in the comments it says "I am defecting only because I know you will defect." It was a joke, but it was half-serious. The notion of "weak predictor" taken to the limit leads to absurdity, and if you don't take it to the limit it might still lead to absurdity. Logical inductors are one way to try specifying a "weak predictor", but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.

Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I'm fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.

I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.

Yeah, I guess we can do a variety of things:

• Naming a time limit for the commitment.
• Naming a time at which a time limit for the commitment will be named.
• Naming an ordinal (in some ordinal notation), so that a smaller ordinal must be named every time-step, until a smaller ordinal cannot be named, at which point the commitment runs out
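
The ordinal option in the list above can be sketched as follows. This is only an illustration under assumptions of my own: ordinals below ω² are encoded as pairs `(a, b)` standing for ω·a + b, and the "strategy" for naming smaller ordinals is a hypothetical example. Because every strictly decreasing sequence of ordinals is finite, the commitment always runs out, even though the agent gets to pick a fresh finite limit each time the first component drops.

```python
from typing import Callable, Tuple

Ordinal = Tuple[int, int]  # (a, b) stands for omega*a + b (hypothetical encoding)


def commitment_length(start: Ordinal,
                      name_smaller: Callable[[Ordinal], Ordinal]) -> int:
    """Count the time-steps until no smaller ordinal can be named."""
    steps = 0
    current = start
    while current > (0, 0):
        nxt = name_smaller(current)
        # Tuples compare lexicographically, which matches the ordinal order here.
        if min(nxt) < 0 or not nxt < current:
            raise ValueError("must name a strictly smaller ordinal")
        current = nxt
        steps += 1
    return steps


def make_strategy(fresh_limit: int) -> Callable[[Ordinal], Ordinal]:
    """Count down b; when b hits 0, drop a by one and name a fresh finite limit."""
    def name_smaller(o: Ordinal) -> Ordinal:
        a, b = o
        if b > 0:
            return (a, b - 1)
        return (a - 1, fresh_limit)
    return name_smaller
```

Starting from `(0, n)` recovers the plain time limit, and starting from `(1, 0)` recovers "naming a time at which a time limit will be named".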

I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.

Thinking about LI policy selection where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n -- there's this weird trade-off between how (apparently) "good" the updatelessness is vs how long it takes to be any good at all. I'm fine with notions of rationality being parameterized by an ordinal or some such if it's just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.

So it would be nice to have a rationality notion which clarified this situation.

My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren't as nice in the non-realizable cases such as logical uncertainty. But it doesn't make sense to abandon updateless principles altogether because of this!

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-14T01:43:48.285Z · score: 16 (3 votes) · LW · GW

Response to Section IV:

FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it

I am basically sympathetic to this concern: I think there's a clear intuition that FDT is 2-boxing more than we would like (and a clear formal picture, in toy formalisms which show FDT-ish DTs failing on Agent Simulates Predictor problems).

Of course, it all depends on how logical counterfactuals are supposed to work. From a design perspective, I'm happy to take challenges like this as extra requirements for the behavior of logical counterfactuals, rather than objections to the whole project. I intuitively think there is a notion of logical counterfactual which fails in this respect, but, this does not mean there isn't some other notion which succeeds. Perhaps we can solve the easy problem of one-boxing with a strong predictor first, and then look for ways to one-box more generally (and in fact, this is what we've done -- one-boxing with a strong predictor is not so difficult).

However, I do want to add that when Omega uses very weak prediction methods such as the examples given, it is not so clear that we want to one-box. Will is presuming that Y&S simply want to one-box in any Newcomb problem. However, we could make a distinction between evidential Newcomb problems and functional Newcomb problems. Y&S already state that they consider some things to be functional Newcomb problems despite them not being evidential Newcomb problems (such as transparent Newcomb). It stands to reason that there would be some evidential Newcomb problems which are not functional Newcomb problems, as well, and that Y&S would prefer not to one-box in such cases.

However, the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box.

In this example, it seems quite plausible that there's a (logico-causal) reason for the regularity, so that in the logical counterfactual where you act differently, your reference class also acts somewhat differently. Say you're Scottish, and 10% of Scots read a particular fairy tale growing up, and this is connected with why you two-box. Then in the counterfactual in which you one-box, it is quite possible that those 10% also one-box. Of course, this greatly weakens the connection between Omega's prediction and your action; perhaps the change of 10% is not enough to tip the scales in Omega's prediction.

But, without any account of Y&S’s notion of subjunctive counterfactuals, we just have no way of assessing whether that’s true or not. Y&S note that specifying an account of their notion of counterfactuals is an ‘open problem,’ but the problem is much deeper than that. Without such an account, it becomes completely indeterminate what follows from FDT, even in the core examples that are supposed to motivate it — and that makes FDT not a new decision theory so much as a promissory note.

In the TDT document, Eliezer addresses this concern by pointing out that CDT also takes a description of the causal structure of a problem as given, begging the question of how we learn causal counterfactuals. In this regard, FDT and CDT are on the same level of promissory-note-ness.

It might, of course, be taken as much more plausible that a technique of learning the physical-causal structure can be provided, in contrast to a technique which learns the logical-counterfactual structure.

I want to inject a little doubt about which is easier. If a robot is interacting with an exact simulation of itself (in an iterated prisoner's dilemma, say), won't it be easier to infer that it directly controls the copy than it is to figure out that the two are running on different computers and thus causally independent?

Put more generally: logical uncertainty has to be handled one way or another; it cannot be entirely put aside. Existing methods of testing causality are not designed to deal with it. It stands to reason that such methods applied naively to cases including logical uncertainty would treat such uncertainty like physical uncertainty, and therefore tend to produce logical-counterfactual structure. This would not necessarily be very good for FDT purposes, being the result of unprincipled accident -- and the concern for FDT's counterfactuals is that there may be no principled foundation. Still, I tend to think that other decision theories merely brush the problem under the rug, and actually have to deal with logical counterfactuals one way or another.

Indeed, on the most plausible ways of cashing this out, it doesn’t give the conclusions that Y&S would want. If I imagine the closest world in which 6288 + 1048 = 7336 is false (Y&S’s example), I imagine a world with laws of nature radically unlike ours — because the laws of nature rely, fundamentally, on the truths of mathematics, and if one mathematical truth is false then either (i) mathematics as a whole must be radically different, or (ii) all mathematical propositions are true because it is simple to prove a contradiction and every proposition follows from a contradiction.

To this I can only say again that FDT's problem of defining counterfactuals seems not so different to me from CDT's problem. A causal decision theorist should be able to work in a mathematical universe; indeed, this seems rather consistent with the ontology of modern science, though not forced by it. I find it implausible that a CDT advocate should have to deny Tegmark's mathematical universe hypothesis, or should break down and be unable to make decisions under the supposition. So, physical counterfactuals seem like they have to be at least capable of being logical counterfactuals (perhaps a different sort of logical counterfactual than FDT would use, since physical counterfactuals are supposed to give certain different answers, but a sort of logical counterfactual nonetheless).

(But this conclusion is far from obvious, and I don't expect ready agreement that CDT has to deal with this.)

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-14T00:48:18.075Z · score: 12 (3 votes) · LW · GW

Response to Section VIII:

An alternative approach that captures the spirit of FDT’s aims

I'm somewhat confused about how you can buy FDT as far as you seem to buy it in this section, while also claiming not to understand FDT to the point of saying there is no sensible perspective at all in which it can be said to achieve higher utility. From the perspective in this section, it seems you can straightforwardly interpret FDT's notion of expected utility maximization via an evaluative focal point such as "the output of the algorithm given these inputs".

This evaluative focal point addresses the concern you raise about how bounded ability to implement decision procedures interacts with a "best decision procedure" evaluative focal point (making it depart from FDT's recommendations in so far as the agent can't manage to act like FDT), since those concerns don't arise (at least not so clearly) when we consider what FDT would recommend for the response to one situation in particular. On the other hand, we also can make sense of the notion that taking the bomb is best, since (according to both global-CDT and global-EDT) it is best for an algorithm to output "left" when given the inputs of the bomb problem (in that it gives us the best news about how that agent would do in bomb problems, and causes the agent to do well when put in bomb problems, in so far as a causal intervention on the output of the algorithm also affects a predictor running the same algorithm).

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-14T00:39:57.538Z · score: 16 (5 votes) · LW · GW

Responses to Sections V and VI:

Implausible discontinuities

I'm puzzled by this concern. Is the doctrine of expected utility plagued by a corresponding 'implausible discontinuity' problem because if action 1 has expected value .999 and action 2 has expected value 1, then you should take action 2, but a very small change could mean you should take action 1? Is CDT plagued by an implausible-discontinuity problem because two problems which EDT would treat as the same will differ in causal expected value, and there must be some in-between problem where uncertainty about the causal structure balances between the two options, so CDT's recommendation implausibly makes a sharp shift when the uncertainty is jiggled a little? Can't we similarly boggle at the implausibility that a tiny change in the physical structure of a problem should make such a large difference in the causal structure so as to change CDT's recommendation? (For example, the tiny change can be a small adjustment to the coin which determines which of two causal structures will be in play, with no overall change in the evidential structure.)
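
A small sketch of the point, using the .999 versus 1 figures from the paragraph above; the coin-mixture setup and its payoffs are hypothetical. Any rule that picks the action with the highest expected value flips its recommendation under an arbitrarily small change near a tie, whichever decision theory supplies the expected values.

```python
from typing import Dict


def best_action(expected_values: Dict[str, float]) -> str:
    """Pick the action with the highest expected value."""
    return max(expected_values, key=expected_values.get)


def coin_mixture(p_structure_a: float) -> Dict[str, float]:
    """Expected values when a coin picks between two causal structures.

    Hypothetical payoffs: under structure A, action_1 pays 1 and action_2
    pays 0; under structure B the payoffs are reversed.
    """
    return {
        "action_1": p_structure_a * 1.0,
        "action_2": (1.0 - p_structure_a) * 1.0,
    }


baseline = {"action_1": 0.999, "action_2": 1.0}
nudged = {"action_1": 0.999 + 0.002, "action_2": 1.0}
```

Here `best_action(baseline)` and `best_action(nudged)` disagree, and so do `best_action(coin_mixture(0.49))` and `best_action(coin_mixture(0.51))`, even though the evidential structure could be held fixed across the two coins.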

It seems like what you find implausible about FDT here has nothing to do with discontinuity, unless you find CDT and EDT similarly implausible.

FDT is deeply indeterminate

This is obviously a big challenge for FDT; we don't know what logical counterfactuals look like, and invoking them is problematic until we do.

However, I can point to some toy models of FDT which lend credence to the idea that there's something there. The most interesting may be MUDT (see the "modal UDT" section of this summary post). This decision theory uses the notion of "possible" from the modal logic of provability, so that despite being a deterministic agent and therefore only taking one particular action in fact, agents have a well-defined possible-world structure to consider in making decisions, derived from what they can prove.

I have a post planned that focuses on a different toy model, single-player extensive-form games. This has the advantage of being only as exotic as standard game theory.

In both of these cases, FDT can be well-specified (at least, to the extent we're satisfied with calling the toy DTs examples of FDT -- which is a bit awkward, since FDT is kind of a weird umbrella term for several possible DTs, but also kind of specifically supposed to use functional graphs, which MUDT doesn't use).

It bears mentioning that a Bayesian already regards the probability distribution representing a problem to be deeply indeterminate, so this seems less bad if you start from such a perspective. Logical counterfactuals can similarly be thought of as subjective objects, rather than some objective fact which we have to uncover in order to know what FDT does.

On the other hand, greater indeterminacy is still worse; just because we already have lots of degrees of freedom to mess ourselves up with doesn't mean we happily accept even more.

And in general, it seems to me, there’s no fact of the matter about which algorithm a physical process is implementing in the absence of a particular interpretation of the inputs and outputs of that physical process.

Part of the reason that I'm happy for FDT to need such a fact is that I think I need such a fact anyway, in order to deal with anthropic uncertainty, and other issues.

If you don't think there's such a fact, then you can't take a computationalist perspective on theory of mind -- in which case, I wonder what position you take on questions such as consciousness. Obviously this leads to a number of questions which are quite aside from the point at hand, but I would personally think that questions such as whether an organism is experiencing suffering have to do with what computations are occurring. This ultimately cashes out to physical facts, yes, but it seems as if suffering should be a fundamentally computational fact which cashes out in terms of physical facts only in a substrate-independent way (ie, the physical facts of importance are precisely those which pertain to the question of which computation is running).

But almost all accounts of computation in physical processes have the issue that very many physical processes are running very many different algorithms, all at the same time.

Indeed, I think this is one of the main obstacles to a satisfying account -- a successful account should not have this property.

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-14T00:39:50.433Z · score: 15 (5 votes) · LW · GW

Response to Section VII:

Assessing by how well the decision-maker does in possible worlds that she isn’t in fact in doesn’t seem a compelling criterion (and EDT and CDT could both do well by that criterion, too, depending on which possible worlds one is allowed to pick).

You make the claim that EDT and CDT can claim optimality in exactly the same way that FDT can, here, but I think the arguments are importantly not symmetric. CDT and EDT are optimal according to their own optimality notions, but given the choice to implement different decision procedures on later problems, both the CDT and EDT optimality notions would endorse selecting FDT over themselves in many of the problems mentioned in the paper, whereas FDT will endorse itself.

Most of this section seems to me to be an argument to make careful level distinctions, in an attempt to avoid the level-crossing argument which is FDT's main appeal. Certainly, FDTers such as myself will often use language which confuses the various levels, since we take a position which says they should be confusable -- the best decision procedures should follow the best policies, which should take the best actions. But making careful level distinctions does not block the level-crossing argument, it only clarifies it. FDT may not be the only "consistent fixed-point of normativity" (to the extent that it even is that), but CDT and EDT are clearly not that.

Fourth, arguing that FDT does best in a class of ‘fair’ problems, without being able to define what that class is or why it’s interesting, is a pretty weak argument.

I basically agree that the FDT paper dropped the ball here, in that it could have given a toy setting in which 'fair' is rigorously defined (in a pretty standard game-theoretic setting) and FDT has the claimed optimality notion. I hope my longer writeup can make such a setting clear.

Briefly: my interpretation of the "FDT does better" claim in the FDT paper is that FDT is supposed to take UDT-optimal actions. To the extent that it doesn't take UDT-optimal actions, I mostly don't endorse the claim that it does better (though I plan to note in a follow-up post an alternate view in which the FDT notion of optimality may be better).

The toy setting I have in mind that makes “UDT-optimal” completely well-defined is actually fairly general. The idea is that if we can represent a decision problem as a (single-player) extensive-form game, UDT is just the idea of throwing out the requirement of subgame-optimality. In other words, we don't even need a notion of "fairness" in the setting of extensive-form games -- the setting isn't rich enough to represent any "unfair" problems. Yet it is a pretty rich setting.

This observation was already made here: https://www.lesswrong.com/posts/W4sDWwGZ4puRBXMEZ/single-player-extensive-form-games-as-a-model-of-udt. Note that there are some concerns in the comments. I think the concerns make sense, and I’m not quite sure how I want to address them, but I also don’t think they’re damning to the toy model.

The FDT paper may have omitted this model out of a desire for greater generality, which I do think is an important goal -- from my perspective, it makes sense not to reduce things to the toy model in which everything works out nicely.

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-14T00:35:59.300Z · score: 22 (5 votes) · LW · GW

Here are some (very lightly edited) comments I left on Will's draft of this post. (See also my top-level response.)

Responses to Sections II and III:

I’m not claiming that it’s clear what this means. E.g. see here, second bullet point, arguing there can be no such probability function, because any probability function requires certainty in logical facts and all their entailments.

This point shows the intertwining of logical counterfactuals (counterpossibles) and logical uncertainty. I take logical induction to represent significant progress generalizing probability theory to the case of logical uncertainty, ie, objects which have many of the virtues of probability functions while not requiring certainty about entailment of known facts. So, we can substantially reply to this objection.

However, replying to this objection does not necessarily mean we can define logical counterfactuals as we would want. So far we have only been able to use logical induction to specify a kind of "logically uncertain evidential conditional". (IE, something closer in spirit to EDT, which does behave more like FDT in some problems but not in general.)

I want to emphasize that I agree that specifying what logical counterfactuals are is a grave difficulty, so grave as to seem (to me, at present) to be damning, provided one can avoid the difficulty in some other approach. However, I don't actually think that the difficulty can be avoided in any other approach! I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals. Even EDT has to grapple with this problem, ultimately, due to the need to handle cases where one's own action can be logically known. (Or provide a convincing argument that such cases cannot arise, even for an agent which is computable.)

Guaranteed Payoffs: In conditions of certainty — that is, when the decision-maker has no uncertainty about what state of nature she is in, and no uncertainty about what the utility payoff of each action is — the decision-maker should choose the action that maximises utility.

(Obligatory remark that what maximizes utility is part of what's at issue here, and for precisely this reason, an FDTist could respond that it's CDT and EDT which fail in the Bomb example -- by failing to maximize the a priori expected utility of the action taken.)

FDT would disagree with this principle in general, since full certainty implies certainty about one's action, and the utility to be received, as well as everything else. However, I think we can set that aside and say there's a version of FDT which would agree with this principle in terms of prior uncertainty. It seems cases like Bomb cannot be set up without either invoking prior uncertainty (taking the form of the predictor's failure rate) or bringing the question of how to deal with logically impossible decisions to the forefront (if we consider the case of a perfect predictor).

Why should prior uncertainty be important, in cases of posterior certainty? Because of the prior-optimality notion (in which a decision theory is judged on a decision problem based on the utility received in expectation according to the prior probability which defines the decision problem).

Prior-optimality considers the guaranteed-payoff objection to be very similar to objecting to a gambling strategy by pointing out that the gambling strategy sometimes loses. In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn't look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn't pay well.
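
The gambling comparison can be put in numbers. In this sketch the one-in-a-trillion-trillion failure rate is the one stipulated in Bomb, and the $100 is the cost of taking Right in that problem; the dollar-equivalent cost of burning to death is a made-up figure, included only so that the arithmetic goes through.

```python
from fractions import Fraction

ERROR_RATE = Fraction(1, 10**24)  # predictor failure rate: one in a trillion trillion
COST_RIGHT = 100                  # dollars paid for taking Right
COST_BURNING = 10**12             # hypothetical dollar-equivalent cost of the bomb


def prior_expected_cost(policy: str) -> Fraction:
    """Expected cost of a policy, evaluated according to the prior."""
    if policy == "left":
        # The bomb is in Left only when the predictor wrongly predicted Right.
        return ERROR_RATE * COST_BURNING
    if policy == "right":
        # Taking Right costs the same whether or not the predictor erred.
        return Fraction(COST_RIGHT)
    raise ValueError(f"unknown policy: {policy!r}")


# Cost of burning at which the two policies would break even under the prior.
break_even_cost = Fraction(COST_RIGHT) / ERROR_RATE
```

Under these assumptions the Left policy is better by the prior unless the bomb outcome is valued as worse than `break_even_cost` dollars, which is the sense in which the complaint resembles objecting to a good bet after it happens to lose.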

The right action, according to FDT, is to take Left, in the full knowledge that as a result you will slowly burn to death. Why? Because, using Y&S’s counterfactuals, if your algorithm were to output ‘Left’, then it would also have outputted ‘Left’ when the predictor made the simulation of you, and there would be no bomb in the box, and you could save yourself $100 by taking Left.

And why, on your account, is this implausible? To my eye, this is right there in the decision problem, not a weird counterintuitive consequence of FDT: the decision problem stipulates that algorithms which output 'left' will not end up in the situation of taking a bomb, with very, very high probability. Again, complaining that you now know with certainty that you're in the unlucky position of seeing the bomb seems irrelevant in the way that a gambler complaining that they now know how the dice fell seems irrelevant -- it's still best to gamble according to the odds, taking the option which gives the best chance of success. (But what I most want to convey here is that there is a coherent sense in which FDT does the optimal thing, whether or not one agrees with it.)

One way of thinking about this is to say that the FDT notion of "decision problem" is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified 'bomb' with just the certain information that 'left' is (causally and evidentially) very bad and 'right' is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.

Another way to think about this is to say that FDT "rejects" decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.

Also, I note that this analysis (on the part of FDT) does not hinge in this case on exotic counterfactuals. If you set Bomb up in the Savage framework, you would be forced to either give only the certain choice between bomb and not-bomb (so you don't represent the interesting part of the problem, involving the predictor) or to give the decision in terms of the prior, in which case the Savage framework would endorse the FDT recommendation.

Another framework in which we could arrive at the same analysis would be that of single-player extensive-form games, in which the FDT recommendation corresponds to the simple notion of optimal strategy, whereas the CDT recommendation amounts to the stipulation of subgame-optimality.

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-13T22:07:27.113Z · score: 23 (8 votes) · LW · GW

Replying to one of Will's edits on account of my comments to the earlier draft:

Finally, in a comment on a draft of this note, Abram Demski said that: “The notion of expected utility for which FDT is supposed to do well (at least, according to me) is expected utility with respect to the prior for the decision problem under consideration.” If that’s correct, it’s striking that this criterion isn’t mentioned in the paper. But it also doesn’t seem compelling as a principle by which to evaluate between decision theories, nor does it seem FDT even does well by it.

To see both points: suppose I’m choosing between an avocado sandwich and a hummus sandwich, and my prior was that I prefer avocado, but I’ve since tasted them both and gotten evidence that I prefer hummus. The choice that does best in terms of expected utility with respect to my prior for the decision problem under consideration is the avocado sandwich (and FDT, as I understood it in the paper, would agree). But, uncontroversially, I should choose the hummus sandwich, because I prefer hummus to avocado.

Yeah, the thing is, the FDT paper focused on examples where "expected utility according to the prior" becomes an unclear notion due to logical uncertainty issues. It wouldn't have made sense for the FDT paper to focus on that, given the desire to put the most difficult issues into focus. However, FDT is supposed to accomplish similar things to UDT, and UDT provides the more concrete illustration.

The policy that does best in expected utility according to the prior is the policy of taking whatever you like. In games of partial information, decisions are defined as functions of information states; and in the situation as described, there are separate information states for liking hummus and liking avocado. Choosing the one you like achieves a higher expected utility according to the prior, in comparison to just choosing avocado no matter what. In this situation, optimizing the decision in this way is equivalent to updating on the information; but, not always (as in transparent newcomb, Bomb, and other such problems).

To re-state that a different way: in a given information state, UDT is choosing what to do as a function of the information available, and judging the utility of that choice according to the prior. So, in this scenario, we judge the expected utility of selecting avocado in response to liking hummus. This is worse (according to the prior!) than selecting hummus in response to liking hummus.

Comment by abramdemski on A Critique of Functional Decision Theory · 2019-09-13T21:38:01.762Z · score: 55 (17 votes) · LW · GW

I saw an earlier draft of this, and hope to write an extensive response at some point. For now, the short version:

As I understand it, FDT was intended as an umbrella term for MIRI-style decision theories, which illustrated the critical points without making too many commitments. So, the vagueness of FDT was partly by design. I think UDT is a more concrete illustration of the most important points relevant to this discussion.

• The optimality notion of UDT is clear. "UDT gets the most utility" means "UDT gets the highest expected value with respect to its own prior". This seems quite well-defined, hopefully addressing your (VII).
• There are problems applying UDT to realistic situations, but UDT makes perfect sense and is optimal in a straightforward sense for the case of single-player extensive form games. That doesn't address multi-player games or logical uncertainty, but it is enough for much of Will's discussion.
• FDT focused on the weird logical cases, which is in fact a major part of the motivation for MIRI-style decision theory. However, UDT for single-player extensive-form games actually gets at a lot of what MIRI-style decision theory wants, without broaching the topic of logical counterfactuals or proving-your-own-action directly.
• The problems which create a deep indeterminacy seem, to me, to be problems for other decision theories than FDT as well. FDT is trying to face them head-on. But there are big problems for applying EDT to agents who are physically instantiated as computer programs and can prove too much about their own actions.
• This also hopefully clarifies the sense in which I don't think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
• There's a subtle point here, though, since Will describes the decision problem from an updated perspective -- you already know the bomb is in front of you. So UDT "changes the problem" by evaluating "according to the prior".
From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist to evaluate expected utility in terms of those chances. • Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let's call the way-you-put-agents-into-the-scenario the "construction". We then evaluate agents on how well they deal with the construction. • For examples like Bomb, the construction gives us the overall probability distribution -- this is then used for the expected value which UDT's optimality notion is stated in terms of. • For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible. The point about "constructions" is possibly a bit subtle (and hastily made); maybe a lot of the disagreement will turn out to be there. But I do hope that the basic idea of UDT's optimality criterion is actually clear -- "evaluate expected utility of policies according to the prior" -- and clarifies the situation with FDT as well. Comment by abramdemski on Rationality Exercises Prize of September 2019 ($1,000) · 2019-09-13T03:33:29.135Z · score: 2 (1 votes) · LW · GW

By asking people to leave a comment here linking to their exercises, are you discouraging writing exercises directly as a comment to this post? (Perhaps you're wanting something longer, and so discouraging comments as the arena for listing exercises?)

Comment by abramdemski on Embedded Agency via Abstraction · 2019-09-13T03:01:21.235Z · score: 20 (5 votes) · LW · GW

I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.

On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (ie the perspective you've described to me in person) compares it to gradient descent, rather than expected utility maximization. This is a fuzzy area, and certainly doesn't achieve all the things I mentioned, but doesn't that seem more "dynamic" than "static"? Today's deep learning systems aren't as generally intelligent as cats, but it seems like the gap exists more within learning theory than static decision theory.

More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruits are more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (perhaps robust delegation) can be put aside while still approaching things from a learning-theoretic perspective.

I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that's overstating things a bit -- I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.

In any case, I have no sense that you're making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.

Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.

Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mention early on. A learning picture still admits static snapshots. Also, cats don't get everything right on the first try.

Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems; to such an extent that at times I've said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures -- such as the one provided by reflective oracles -- give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture -- perhaps the picture you are seeking -- would involve bounded rationality, including both logical non-omniscience and regular physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Nice versions of self-reference and other challenges to embedded world-models such as those you mention seem to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling Bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for the empirical uncertainty, we would really like to get some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn't within your scope).

Comment by abramdemski on Formalising decision theory is hard · 2019-09-13T01:30:36.422Z · score: 4 (2 votes) · LW · GW
This might seem surprising at first, because there is also a different incomplete model Φ that says "if you pay the blackmail, infestation will not happen". Φ is false if you use physical causal counterfactuals, but from the agent's perspective Φ is consistent with all observations. However, Φ only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ will have no effect on the ultimate behavior of the agent.

What happens in ASP? (Say you're in an iterated Newcomb's problem with a predictor much slower than you, but which meets the LIC or similar.) I'm concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says "you can get the higher reward by one-boxing" (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says "you can get higher reward by two-boxing" (ie, the agent's action doesn't control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails -- granted, for cases outside of its definition of "fair".

The precommitments have to expire after some finite time.

I agree that something like this generally does the right thing in most cases, with the exception of superrationality in games as a result of commitment races.

I still have a little hope that there will be a nice version, which doesn't involve a commitment-races problem and which doesn't make use of an arbitrary commitment cutoff. But I would agree that things don't look good, and so it is reasonable to put this kind of thing outside of "fair" problems.

Let me add that I am not even sure what are the correct desiderata. In particular, I don't think that we should expect any group of good agents to converge to a Pareto optimal outcome.

I don't currently see why we shouldn't ask to converge to Pareto optima. Obviously, we can't expect to do so with arbitrary other agents; but it doesn't seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve Pareto optima, but there exist strategies which achieve Pareto optima with a broad-ish class of other strategies, including others who use strategies like their own -- while being very difficult to exploit).

But yeah, I'm pretty uncertain about what the desiderata should be -- both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.

Comment by abramdemski on Formalising decision theory is hard · 2019-09-12T20:00:15.109Z · score: 2 (1 votes) · LW · GW

Ahh thanks :p fixed

Comment by abramdemski on Counterfactuals are an Answer, Not a Question · 2019-09-07T05:17:41.492Z · score: 14 (4 votes) · LW · GW

Those of a Bayesian leaning will tend to say things like "probability is subjective", and claim this is an important insight into the nature of probability -- one might even go so far as to say "probability is an answer, not a question". But this doesn't mean you can believe what you want; not exactly. There are coherence constraints. So, once we see that probability is subjective, we can then seek a theory of the subjectivity, which tells us "objective" information about it (yet which leaves a whole lot of flexibility).

The same might be true of counterfactuals. I personally lean toward the position that the constraints on counterfactuals are that they be consistent with evidential predictions, but I don't claim to be unconfused. My position is a "counterfactuals are subjective but have significant coherence constraints" type position, but (arguably) a fairly minimal one -- the constraint is a version of "counterfacting on what you actually did should yield what actually happened", one of the most basic constraints on what counterfactuals should be.

On the other hand, my theory of counterfactuals is pretty boring and doesn't directly solve problems -- it more says "look elsewhere for the interesting stuff".

Edit --

Oh, also, I wanted to pitch the idea that counterfactuals, like a whole bunch of things, should be thought of as "constructed rather than real". This is subtly different from "subjective". We humans are pretty far along in an ongoing process of figuring out how to be and act in the world. Sometimes we come up with formal theories of things like probability, utility, counterfactuals, and logic. The process of coming up with these formal theories informs our practice. Our practice also informs the formal theories. Sometimes a theory seems to capture what we wanted really nicely. My argument is that in an important sense we've invented, not discovered, what we wanted.

So, for example, utility functions. Do utility functions capture human preferences? No, not really, they are pretty far from preferences observed in the wild. However, we're in the process of figuring out what we prefer. Utility functions capture some nice ideas about idealized preferences, so that when we're talking about idealized versions of what we want (trying to figure out what we prefer upon reflection) it is (a) often pretty convenient to think in terms of utilities, and (b) somewhat difficult to really escape the framework of utilities. Similarly for probability and logic as formal models of idealized reasoning.

So, just as utility functions aren't really out there in the world, counterfactuals aren't really out there in the world. But just as it might be that we should think about our preferences in terms of utility anyway (...or maybe abandon utility in favor of better theoretical tools), we might want to equip our best world-model with counterfactuals anyway (...or abandon them in favor of better theoretical tools).

Comment by abramdemski on Formalising decision theory is hard · 2019-09-05T21:56:00.641Z · score: 13 (4 votes) · LW · GW

I very much agree with the point about not decoupling learning and decision theory. I wrote a comment making somewhat similar points.

I believe that this indeed solves both INP and IXB.

I'd like to understand this part.

One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable.

I'm not sure I should find the precommitment solution satisfying. Won't it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality.

If we take the learning-theoretic view, then we get to bring in tools from iterated games. There's a Pavlov-like strategy for playing deterministic iterated games which converges to optimal responses to non-agentic environments and converges to Pareto optima for environments containing agents who use the Pavlov-like strategy. It is not the greatest at being unexploitable, and it also has fairly bad convergence.
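
To make the Pavlov idea concrete, here is a minimal sketch of the classic win-stay, lose-shift rule in a two-player Prisoner's Dilemma. This is not the generalized strategy referred to above; the payoff values, the aspiration threshold, and the names `PAYOFF`, `pavlov_next`, and `play` are all assumptions chosen for illustration. It only shows the self-play behavior: two such players fall into mutual cooperation from any starting pair of moves.

```python
# Row player's payoff in a Prisoner's Dilemma (illustrative values, an assumption).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def pavlov_next(my_last, my_payoff, aspiration=2):
    """Win-stay, lose-shift: keep the last move if it paid at least the
    aspiration level, otherwise switch to the other move."""
    if my_payoff >= aspiration:
        return my_last
    return "D" if my_last == "C" else "C"


def play(first_a, first_b, rounds=10):
    """Let two Pavlov players face each other and return the list of joint moves."""
    a, b = first_a, first_b
    history = []
    for _ in range(rounds):
        history.append((a, b))
        payoff_a = PAYOFF[(a, b)]
        payoff_b = PAYOFF[(b, a)]
        a, b = pavlov_next(a, payoff_a), pavlov_next(b, payoff_b)
    return history


if __name__ == "__main__":
    for start in [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]:
        print(start, "->", play(*start, rounds=5))
```

In this toy setup, starting from ("C", "D") the players pass through mutual defection once and then stay at mutual cooperation. Against an unconditional defector the same rule keeps alternating between cooperating and defecting, which is one simple way to see the exploitability concern.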

However, I don't yet see how to translate the result to logical-induction type learners. Besides requiring deterministic payouts (a property which can probably be relaxed somehow), the algorithm requires an agent to have a definite history -- a well-defined training sequence. Agents based on logical induction are instead forming generalizations based on any sufficiently analogous situation within logic, so they don't have a well-defined history in the right way. (An actual instance of a logical induction agent has an actual temporal history, but this temporal history is not necessarily what it is drawing on to play the game -- it may have never personally encountered a similar situation.)

In other words, I'm hopeful that there could be a learning-theoretic solution, but I don't know what it is yet.

As for superrationality for agents w/o learning theory, there's cooperative oracles, right? We can make computable analogues with distributed oracles. It's not a real solution, specifically in that it ignores learning. So I sort of think we know how to do it in the "static" setting, but the problem is that we live in a learning-theoretic setting rather than a static-rationality setting.

Comment by abramdemski on Embedded Agency via Abstraction · 2019-09-04T05:52:43.393Z · score: 25 (8 votes) · LW · GW
The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I'll write up a post on it at some point.

I look forward to it.

When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it

Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.

Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as "hard mode", but nonetheless consider it to be central.

To name some considerations:

• The "static" way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
• Moreover, "static" decision theory puts a box around "epistemics" with an output to decision-making. This implicitly suggests: "Decision theory is about optimal action under uncertainty -- the generation of that uncertainty is relegated to epistemics." This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
• Viewing things from a learning-theoretic perspective, it doesn't generally make sense to view a single thing (a single observation, a single action/decision, etc) in isolation. So, accounting for logical non-omniscience, we can't expect to make a single decision "correctly" for basically any notion of "correctly". What we can expect is to be "moving in the right direction" -- not at a particular time, but generally over time (if nothing kills us).
• So, describing an embedded agent in some particular situation, the notion of "rational (bounded) agency" should not expect anything optimal about its actions in that circumstance -- it can only talk about the way the agent updates.
• Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what's going on epistemically -- it might not have learned to appropriately react to the given situation yet. So even "reacting optimally given your (epistemic) uncertainty" isn't realistic as an expectation for bounded agents.
• Obviously I also think the "dynamic" view is better in the purely epistemic case as well -- logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
• Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses "rise to the top" over time. It is quite analogous to replicator dynamics as a model of evolution.
• You can do "equilibrium analysis" of evolution, too (ie, evolutionary stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
• (The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL -- multi-agent learning) and think about where the correlations come from.)
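
The analogy between Bayesian learning and replicator dynamics in the list above can be sketched in a few lines. This is only an illustration under made-up assumptions: three hypothetical coin hypotheses (`HEADS_PROB`), an invented observation string, and function names of my own choosing. The point is that a Bayes update and a discrete-time replicator step are the same formula, with likelihood playing the role of fitness, so hypotheses that predict well "rise to the top" over time.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation."""
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    return [p * l / evidence for p, l in zip(prior, likelihoods)]


def replicator_step(shares, fitness):
    """Discrete-time replicator dynamics: each type's population share is
    rescaled by its fitness relative to the mean fitness."""
    mean_fitness = sum(x * f for x, f in zip(shares, fitness))
    return [x * f / mean_fitness for x, f in zip(shares, fitness)]


# Hypothetical hypotheses: the coin lands heads with probability 0.2, 0.5, or 0.8.
HEADS_PROB = [0.2, 0.5, 0.8]


def run(observations, prior):
    """Apply one Bayes update per observation ('H' or 'T')."""
    beliefs = list(prior)
    for obs in observations:
        likelihoods = [p if obs == "H" else 1 - p for p in HEADS_PROB]
        beliefs = bayes_update(beliefs, likelihoods)
    return beliefs


# Made-up data with mostly heads; the 0.8 hypothesis should gain weight.
posterior = run("HHTHHHTH", [1 / 3, 1 / 3, 1 / 3])

if __name__ == "__main__":
    print(posterior)
```

Nothing here captures attractor basins or orbits; it only shows the shared update rule that makes the dynamic reading of Bayesian learning natural.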

Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.

I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything other than Boltzmann brains. In Does Agent-like Behavior imply Agent-like Architecture, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without having an actual agent somewhere causally upstream of it. Similarly, I'm suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.

Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their "static" nature.

[I actually got this static/dynamic picture from komponisto btw (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line-per-line, though, much more of the above is inspired by discussions with Steve Rayhawk.]

Edit: Vanessa made a related point in a comment on another post.

Comment by abramdemski on Deconfuse Yourself about Agency · 2019-09-04T04:23:46.880Z · score: 4 (2 votes) · LW · GW
I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y≠X).
The selection/control distinction seems related, but not quite similar to me. Am I missing something there?

A(Θ)-morphism seems to me to involve both agent-like architecture and agent-like behavior, because it just talks about prediction generally. Mostly I was asking if you were trying to point it one way or the other (we could talk about prediction-of-internals exclusively, to point at structure, or prediction-of-external exclusively, to talk about behavior -- I was unsure whether you were trying to do one of those things).

Since you say that you are trying to formalize how we informally talk, rather than how we should, I guess you weren't trying to make A(Θ)-morphism get at this distinction at all, and were separately mentioning the distinction as one which should be made.

Comment by abramdemski on Troll Bridge · 2019-09-04T04:15:12.224Z · score: 11 (4 votes) · LW · GW
I don't see how this agent seems to control his sanity.

The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn't answer your question?)

Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they've thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?

I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk might involve alien superintelligences trolling human-made AGI. But it isn't exactly high on my list of concerns. The question is more about whether particular theories of counterfactual are right. Troll Bridge might be "too hard" in some sense -- we may just have to give up on it. But, generally, these weird philosophical counterexamples are more about pointing out problems. Complex real-life situations are difficult to deal with (in terms of reasoning about what a particular theory of counterfactuals will actually do), so we check simple examples, even if they're outlandish, to get a better idea of what the counterfactuals are doing in general.

Comment by abramdemski on Troll Bridge · 2019-09-04T04:05:21.169Z · score: 4 (2 votes) · LW · GW

Yep, sorry. The illustrations were not actually originally meant for publication; they're from my personal notes. I did it this way (1) because the pictures are kind of nice, (2) because I was frustrated that no one had written a good summary post on Troll Bridge yet, (3) because I was in a hurry. Ideally I'll edit the images to be more suitable for the post, although adding the omitted content is a higher priority.

Comment by abramdemski on Embedded Agency via Abstraction · 2019-08-29T08:24:34.889Z · score: 6 (3 votes) · LW · GW

I think one difference between us is, I really don't expect standard game-theoretic ideas to survive. They're a good starting point, but, we need to break them down to something more fundamental. (Breaking down probability (further than logical induction already does, that is), while on my radar, is far more speculative than that.)

Basic game theory uses equilibrium analysis. We need a theory of dynamics instead of only equilibrium, because a reasoner needs to find an equilibrium somehow -- and the "somehow" is going to involve computational learning theory. Evolutionary game theory is a step in the right direction but not powerful enough for thinking about superintelligent AI. Other things which seem like steps in the right direction include correlated equilibria (which have somewhat nice "dynamic" stories of reaching equilibrium through learning).

Logical induction is a success case for magically getting nice self-reference properties after a set of desired properties fell into place. Following the "abstraction" intuition could definitely work out that way. Another example is how Hartry Field followed a line of research about the sorites paradox, developed a logic of vagueness, and ended up with a theory of self-referential truth. But the first example involved leaving the Bayesian paradigm, and the second involved breaking map/territory intuitions and classical logic.

Hadn't seen the dice example, is it from Jaynes? (I don't yet see why you're better off randomising)

Comment by abramdemski on Embedded Agency via Abstraction · 2019-08-28T20:56:44.754Z · score: 15 (4 votes) · LW · GW
We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps.

• A significant part of the utility of a map comes from the self-referential effects on the territory; the map needs to be chosen with this in mind to avoid catastrophic self-fulfilling prophecies. (This doesn't feel especially important for your point, but it is part of the puzzle.)
• The definition of naturalized epistemic-goodness can take inspiration from non-self-referential versions of the problem, but faces additional wireheading-like problems, which places significant burden on it. You probably can't just take the "epistemic utility function" from the non-self-referential case. The paper epistemic decision theory by Hilary Greaves explores this issue.
• Thinking about self-reference may influence the "kind of thing" which is being scored. For example, in the non-self-referential setting, classical logic is a reasonable choice. Despite the ambiguities introduced by uncertain reasoning and abstraction, it might be reasonable to think of statements as basically being true or false, modulo some caveats. However, self-reference paradoxes may make non-classical logics more appropriate, with more radically different notions of truth-value. For example, reflective oracles deal with self-reference via probability (as you mention in the post, using Nash equilibria to avoid paradox in the face of self-reference). However, although it works to an extent, it isn't obviously right. Probability in the sense of uncertainty and probability in the sense of I-have-to-treat-this-as-random-because-it-structurally-depends-on-my-belief-in-a-way-which-diagonalizes-me might be fundamentally different from one another.
• This same argument may also apply to the question of what abstraction even is.

I don't think you were explicitly denying any of this; I just wanted to call out that these things may create complications for the research agenda. My personal sense is that it could be possible to come up with the right notion by focusing on the non-self-referential case alone (and paying very close attention to what feels right/wrong), but anticipating the issues which will arise in the self-referential case provides significantly more constraints and thus significantly more guidance. A wide variety of tempting simplifications are available in the absence of self-reference.

I'm especially worried about the "kind of thing" point above. It isn't clear at all what kind of thing beliefs for embedded agents should be. Reflective oracles give a way to rescue probability theory for the embedded setting, but are basically unrealistic. Logical inductors are of course somewhat more realistic (being computable), and look quite different. But, logical inductors don't have great decision-theoretic properties (so far).

Comment by abramdemski on Troll Bridge · 2019-08-26T20:40:37.414Z · score: 4 (2 votes) · LW · GW

I don't totally disagree, but see my reply to Gurkenglas as well as my reply to Andrew Sauer. Uncertainty doesn't really save us, and the behavior isn't really due to the worst-case-minimizing behavior. It can end up doing the same thing even if getting blown up is only slightly worse than not crossing! I'll try to edit the post to add the argument wherein logical induction fails eventually (maybe not for a week, though). I'm much more inclined to say "Troll Bridge is too hard; we can't demand so much of our counterfactuals" than I am to say "the counterfactual is actually perfectly reasonable" or "the problem won't occur if we have reasonable uncertainty".

Comment by abramdemski on Troll Bridge · 2019-08-26T20:32:13.571Z · score: 4 (2 votes) · LW · GW

In this case, we have (by assumption) an output of the program, so we just look at the cases where the program gives that output.

Comment by abramdemski on Troll Bridge · 2019-08-26T20:30:32.546Z · score: 11 (4 votes) · LW · GW

I agree that "it seems that it should". I'll try and eventually edit the post to show why this is (at least) more difficult to achieve than it appears. The short version is that a proof is still a proof for a logically uncertain agent; so, if the Löbian proof did still work, then the agent would update to 100% believing it, eliminating its uncertainty; therefore, the proof still works (via its Löbian nature).
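
For reference, this is the standard statement of Löb's theorem that a "Löbian proof" relies on. The notation is an assumption, not taken from the post: $T$ stands for any theory extending PA, and $\Box P$ abbreviates "$P$ is provable in $T$".

```latex
% External form: if T proves "provability of P implies P", then T proves P.
\text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P.

% Internalized form, provable inside T itself:
T \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P
```

Read this way, the point above is that the derivation never appeals to the agent's degree of belief, only to provability, so merely being uncertain does not block it.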

Comment by abramdemski on Troll Bridge · 2019-08-26T20:26:46.360Z · score: 4 (2 votes) · LW · GW

No, but I probably should have said "iff" or "if and only if". I'll edit.

Comment by abramdemski on Troll Bridge · 2019-08-26T20:25:29.383Z · score: 4 (2 votes) · LW · GW

Comment by abramdemski on Troll Bridge · 2019-08-26T20:24:30.923Z · score: 12 (4 votes) · LW · GW

I agree with your English characterization, and I also agree that it isn't really obvious that the reasoning is pathological. However, I don't think it is so obviously sane, either.

• It seems like counterfactual reasoning about alternative actions should avoid going through "I'm obviously insane" in almost every case; possibly in every case. If you think about what would happen if you made a particular chess move, you need to divorce the consequences from any "I'm obviously insane in that scenario, so the rest of my moves in the game will be terrible" type reasoning. You CAN'T assess that making a move would be insane UNTIL you reason out the consequences w/o any presumption of insanity; otherwise, you might end up avoiding a move only because it looks insane (and it looks insane only because you avoid it, so you think you've gone mad if you take it). This principle seems potentially strong enough that you'd want to apply it to the Troll Bridge case as well, even though in Troll Bridge it won't actually help us make the right decision (it just suggests that expecting the bridge to blow up isn't a legit counterfactual).
• Also, counterfactuals which predict that the bridge blows up seem to be saying that the agent can control whether PA is consistent or inconsistent. That might be considered unrealistic.

Comment by abramdemski on Deconfuse Yourself about Agency · 2019-08-23T23:06:52.167Z · score: 6 (3 votes) · LW · GW

You mention the distinction between agent-like architecture and agent-like behavior (which I find similar to my distinction between selection and control), but how does the concept of -morphism account for this distinction? I have a sense that (formalized) versions of -morphism are going to be more useful (or easier?) for the behavioral side, though it isn't really clear.

Comment by abramdemski on Project Proposal: Considerations for trading off capabilities and safety impacts of AI research · 2019-08-10T07:07:56.212Z · score: 11 (7 votes) · LW · GW

I am a bit surprised to see you begin this post by saying there seems to be a consensus that people shouldn't worry about capabilities consequences of their work, but then, I come from the miri-influenced crowd. I agree that it would be good to have a lot more clarity on how to think about this.

I agree it could be somewhat good for miri to have a hit ml publication, particularly if it was something unlikely to shift progress significantly. I could imagine a universe where this happened if miri happened upon a very interesting safety-advanced thing, the way adversarial counterexamples were this big new thing slightly outside the usual ml way of doing business (ie, not achieving high scores on a task with some improved technique). But it seems fairly unlikely to be worth it to try to play the usual ml game at the level of top ml groups simply for the sake of prestige, because it is likely too hard to gain prestige that way with so many others trying. It seems better in spirit to gain credibility by doing what miri does best and getting recognition for what's good (of the open research). I suspect we have some deep disagreements about background models.

I think the best way to reach ml people in the long run is not through credibility, but through good arguments presented well. Let me clarify: credibility/prestige definitely play a huge role in what the bulk of people think. But the credibility system is good enough that the top credible people are really pretty smart, so to an extent can be swayed by good arguments presented well. This case can definitely be overstated and I feel like I'm presenting a picture which will rightly be criticised as over-optimistic. But I think there are some success stories, and it's the honest leverage path (in contrast to fighting for prestige in a system in which lots of people are similarly doing so).

Anyway, I've hardly said anything about your main point. I don't know how to think about it, and I wish I did. I usually try to think about differential progress and then fail, and fall back on an assessment of how surprised I'd be if something led to big AI progress, and am cautious if it seems within the realm of possibility.

Comment by abramdemski on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-08-08T21:58:38.527Z · score: 12 (5 votes) · LW · GW

I think I already want to back off on my assertion that the categories should not be applied to controllers. However, I see the application to controllers as more complex. It's more clear what it means to (successfully) point a selection-style optimization process at a proxy. In a selection setting, you have the proxy (which the system can access), and the true value (which is not accessible). Wireheading only makes sense when "true" is partially accessible, and the agent severs that connection.

I definitely appreciate your posts on this; it hadn't occurred to me to ask whether the four types apply equally well to selection and control.

Comment by abramdemski on AI Alignment Open Thread August 2019 · 2019-08-08T21:51:07.067Z · score: 5 (3 votes) · LW · GW

I am confused about how the normative question isn't decision-relevant here. Is it that I have a model where it is the relevant question, but you have one where it isn't? To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

It's possible that this isn't my true disagreement, because actually the question seems more complicated than just a question of how large potential downsides are if things go poorly in comparison to potential upsides if things go well. But some kind of analysis of the risks seems relevant here -- if there weren't such large downside risks, I would have lower standards of evidence for claims that things will go well.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

It sounds like we would have to have a longer discussion to resolve this. I don't expect this to hit the mark very well, but here's my reply to what I understand:

• I don't see how you can be confident enough of that view for it to be how you really want to check.
• A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be (keeping in mind that those aren't the only two things in the universe). I'm not sure which of those two disagreements is more important here.

Comment by abramdemski on AI Alignment Open Thread August 2019 · 2019-08-07T21:44:46.167Z · score: 17 (7 votes) · LW · GW

The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, have two important properties:

• They are part of a research program, not an end result. Rough intuitions can absolutely be a useful guide which (hopefully eventually) helps us figure out what mathematical results are possible and useful.
• They primarily point at problems rather than solutions. Because (it seems to me) existential risk is asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good". IE, an argument that there is a risk can be fairly rough and nonetheless be sufficient for me to "not push the button" (in a hypothetical where I could choose to turn on a system today). On the other hand, an argument that pushing the button is net positive has to be actually quite strong. I want there to be a small set of assumptions, each of which individually seems very likely to be true, which taken together would be a guarantee against catastrophic failure.

[This is an "or" condition -- either one of those two conditions suffices for me to take vague arguments seriously.]

On the other hand, I agree with you that I set up a false dichotomy between proof and empiricism. Perhaps a better model would be a spectrum between "theory" and empiricism. Mathematical arguments are an extreme point of rigorous theory. Empiricism realistically comes with some amount of theory no matter what. And you could also ask for a "more of both" type approach, implying a 2d picture where they occupy separate dimensions.

Still, though, I personally don't see much of a way to gain understanding about failure modes of very very capable systems using empirical observation of today's systems. I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

Comment by abramdemski on AI Alignment Open Thread August 2019 · 2019-08-07T21:21:53.371Z · score: 10 (6 votes) · LW · GW

Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don't see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.

If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.

Comment by abramdemski on AI Alignment Open Thread August 2019 · 2019-08-06T17:03:48.278Z · score: 21 (7 votes) · LW · GW

My thoughts: we can't really expect to prove something like "this ai will be beneficial". However, relying on empiricism to test our algorithms is very likely to fail, because it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don't know how to make good guesses about the behavior of very capable systems except through mathematical analysis.

There are two overlapping traditions in machine learning. There's a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there's machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.

(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)

Comment by abramdemski on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-08-05T05:28:56.862Z · score: 17 (4 votes) · LW · GW

I tentatively think it makes the most sense to apply Goodhart exclusively to selection processes rather than control, except perhaps for the causal case.

All flavors of Goodhart require a system to be clearly optimizing a proxy. Selection-style optimization gets direct feedback of some kind; this feedback can be "the true function" or "the proxy" - although calling it either requires anthropomorphic analysis of what the system is trying to do - perhaps via observation of how/why the system was set up. So we can talk about Goodhart by comparing what the selection process explicitly optimizes to our anthropomorphic analysis of what it "tries to" optimize.

A control system, on the other hand, is all anthropomorphism. We look at it efficiently steering the world into a narrow space of possibility, and we conclude that's what it is "trying" to do. So where is the true-vs-proxy to compare?

It might be that we have two or more plausible ways to ascribe goals to the controller, and that these conflict. For example, maybe the controller happens to have an explicit specification of a utility function somewhere in its machine code, and on the other hand, we know that it was built for a specific purpose -- and the two differ from each other. This is simple value misalignment (arguably outright misspecification, not Goodhart?).

However, I can't think of a reason why the thing would need an explicit utility function inside of it unless it was internally implementing a selection process, like a planning algorithm or something. So that brings us back to applying the Goodhart concept to selection processes, rather than control.

You mention model errors. If the controller is internally using model-based reasoning, it seems very likely that it is doing selection-style planning. So again, the Goodhart concept seems to apply to the selection part.

There are some situations where we don't need to apply any anthropomorphic analysis to infer a goal for a controller, because it is definitely responsive to a specific form of feedback: namely, reinforcement learning. In the case of reinforcement learning, a "goodhart-ish" failure which can occur is wireheading. Interesting that I have never been quite comfortable classifying wireheading as one of the four types of Goodhart; perhaps that's because I was applying the selection-vs-control distinction implicitly.

I mentioned that causal goodhart might be the exception. It seems to me that causal failure applies to selection processes too, but ONLY to selection which is being used to implement a controller. In effect it's a type of model error for a model-based controller.

All of this is vague and fuzzy and only weakly endorsed.

Comment by abramdemski on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-08-05T04:50:56.067Z · score: 8 (4 votes) · LW · GW

I think Eliezer's definition still basically makes sense as a measure of optimization power, but the model of optimization which inspired it (basically, optimization-as-random-search) doesn't make sense.
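
As a rough illustration of the measure being discussed: it is usually described as counting bits, taking the negative log of the fraction of possible outcomes that are at least as good as the one achieved. The sketch below assumes a finite outcome space with a uniform measure; the function name and the toy outcome space are hypothetical.

```python
import math

def optimization_power_bits(achieved_utility, outcome_utilities):
    """Bits of optimization: -log2 of the fraction of outcomes that are at
    least as good as the achieved one (uniform measure assumed)."""
    at_least_as_good = sum(1 for u in outcome_utilities if u >= achieved_utility)
    if at_least_as_good == 0:
        raise ValueError("achieved utility exceeds every outcome in the space")
    return -math.log2(at_least_as_good / len(outcome_utilities))

# Hypothetical outcome space: 1024 outcomes whose utility equals their index.
outcomes = list(range(1024))
best_case = optimization_power_bits(1023, outcomes)   # hits the single best outcome
median_case = optimization_power_bits(512, outcomes)  # lands in the top half
print(best_case, median_case)
```

Note that the number depends only on the ranking and the base measure, not on how the outcome was found, which is why the measure can survive even if the random-search picture behind it is dropped.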

Though, I would very likely have a better way of measuring optimization power if I understood what was really going on better.

Comment by abramdemski on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-08-05T04:47:18.809Z · score: 19 (5 votes) · LW · GW

I agree that these abstractions are very limited and should mainly be used to raise concerns. Due to existential risk, there's an asymmetry between concerns vs positive arguments against concerns: if we want to avoid large negative outcomes, we have to take vague concerns seriously in the absence of good arguments against them; but, asymmetrically, seek strong arguments that systems avoid risk. Recently I worry that this can give people an incorrect picture of which ideas I think are firm enough to take seriously. I'll happily discuss fairly vague ideas such as instrumental convergence when talking about risk. But (at least for me personally) the very same ideas will seem overly vague and suspicious, if used in an argument that things will go well. I think this is basically the right attitude to take, but could be confusing to other people.

Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-03T21:16:10.054Z · score: 6 (3 votes) · LW · GW

I'm saying that non-uniqueness of the solution is part of the conceptual problem with Nash equilibria.

Decision theory doesn't exactly provide a "unique solution" -- it's a theory of rational constraints on subjective belief, so, you can believe and do whatever you want within the confines of those rationality constraints. And of course classical decision theory also has problems of its own (such as logical omniscience). But there is a sense in which it is better than game theory about this, since game theory gives rationality constraints which depend on the other player in ways that are difficult to make real.

I'm not saying there's some strategy which works regardless of the other player's strategy. In single-player decision theory, you can still say "there's no optimal strategy due to uncertainty about the environment" -- but, you get to say "but there's an optimal strategy given our uncertainty about the environment", and this ends up being a fairly satisfying analysis. The nash-equilibrium picture of game theory lacks a similarly satisfying analysis. But this does not seem essential to game theory.

Comment by abramdemski on Let's talk about "Convergent Rationality" · 2019-07-03T20:59:22.854Z · score: 8 (4 votes) · LW · GW

Something which seems missing from this discussion is the level of confidence we can have for/against CRT. It doesn't make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (ie, outside-view probability at least 1%), doesn't that have most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I'm not sure where the cutoff should be, so I said 1%.]

My personal position isn't CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.

Another obstacle to those strategies is, even if future AGI isn't sufficiently strategic/agenty/rational to fall into the "rationality attractor", it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don't apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.

Of course it doesn't make sense to generically argue that we should be concerned about CRT in absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing atomic weapons. So there has to be something about AI technology which specifically raises CRT as a concern.

One "something" is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there's more reasoning to be unpacked.

I also agree with other commenters about CRT not being quite the right thing to point at, but, this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.

Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-01T19:38:46.680Z · score: 6 (3 votes) · LW · GW

True, but, I think that's a bad way of thinking about game theory:

• The Nash equilibrium model assumes that players somehow know what equilibrium they're in. Yet, it gives rise to an equilibrium selection problem due to the non-uniqueness of equilibria. This casts doubt on the assumption of common knowledge which underlies the definition of equilibrium.
• Nash equilibria also assume a naive best-response pattern. If an agent faces a best-response agent and we assume that the Nash-equilibrium knowledge structure somehow makes sense (there is some way that agents successfully coordinate on a fixed point), then it would make more sense for an agent to select its response function (to, possibly, be something other than argmax), based on what gets the best response from the (more-naive) other player. This is similar to the UDT idea. Of course you can't have both players do this or you're stuck in the same situation again (ie there's yet another meta level which a player would be better off going to).

Going to the meta-level like that seems likely to make the equilibrium selection problem worse rather than better, but, that's not my point. My point is that Nash equilibria aren't the end of the story; they're a somewhat weird model. So it isn't obvious whether a similar no-free-lunch idea applies to a better model of game theory.

Correlated equilibria are an obvious thing to mention here. They're a more sensible model in a few ways. I think there are still some unjustified and problematic assumptions there, though.

Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-01T19:24:46.601Z · score: 2 (1 votes) · LW · GW

I don't want to claim there's a best way, but I do think there are certain desirable properties which it makes sense to shoot for. But this still sort of points at the wrong problem.

A "naturalistic" approach to game theory is one in which game theory is an application of decision theory (not an extension) -- there should be no special reasoning which applies only to other agents. (I don't know a better term for this, so let's use naturalistic for now.)

Standard approaches to game theory lack this (to varying degrees). So, one frame is that we would like to come up with an approach to game theory which is naturalistic. Coming from the other side, we can attempt to apply existing decision theory to games. This ends up being more confusing and unsatisfying than one might hope. So, we can think of game theory as an especially difficult stress-test for decision theory.

So it isn't that there should be some best strategy in multiplayer games, or even that I'm interested in a "better" player despite the lack of a notion of "best" (although I am interested in that). It's more that UDT doesn't give me a way to think about games. I'd like to have a way to think about games which makes sense to me, and which preserves as much as possible what seems good about UDT.

Desirable properties such as coordination are important in themselves, but are also playing an illustrative role -- pointing at the problem. (It could be that coordination just shouldn't be expected, and so, is a bad way of pointing at the problem of making game theory "make sense" -- but I currently think better coordination should be possible, so, think it is a good way to point at the problem.)

Comment by abramdemski on What's up with self-esteem? · 2019-06-26T19:50:18.346Z · score: 2 (1 votes) · LW · GW

Cool, thanks!

Comment by abramdemski on How hard is it for altruists to discuss going against bad equilibria? · 2019-06-25T02:48:54.919Z · score: 7 (2 votes) · LW · GW
FAI is a sidetrack, if we don't have any path to FNI (friendly natural intelligence).

I don't think I understand the reasoning behind this, though I don't strongly disagree. Certainly it would be great to solve the "human alignment problem". But what's your claim?

If a bunch of fully self-interested people are about to be wiped out by an avoidable disaster (or even actively malicious people, who would like to hurt each other a little bit, but value self-preservation more), they're still better off pooling their resources together to avert disaster.

You might have a prisoner's dilemma / tragedy of the commons -- it's still even better if you can get everyone else to pool resources to avert disaster, while stepping aside yourself. BUT:

• that's more a coordination problem again, rather than an everyone-is-too-selfish problem
• that's not really the situation with AI, because what you have is more a situation where you can either work really hard to build AGI or work even harder to build safe AGI; it's not a tragedy of the commons, it's more like lemmings running off a cliff!

One point of confusion in trying to generalize bad behavior (bad equilibrium is an explanation or cause, bad behavior is the actual problem) is that incentives aren't exogenous - they're created and perpetuated by actors, just like the behaviors we're trying to change. One actor's incentives are another actor's behaviors.

Yeah, the incentives will often be crafted perversely, which likely means that you can expect even more opposition to clear discussion, because there are powerful forces trying to coordinate on the wrong consensus about matters of fact in order to maintain plausible deniability about what they're doing.

In the example being discussed here, it just seems like a lot of people coordinating on the easier route, partly due to momentum of older practices, partly because certain established people/institutions are somewhat threatened by the better practices.

I find it very difficult to agree to any generality without identifying some representative specifics. It feels way too much like I'm being asked to sign up for something without being told what. Relatedly, if there are zero specifics that you think fit the generalization well enough to be good examples, it seems very likely that the generalization itself is flawed.

My feeling is that small examples of the dynamic I'm pointing at come up fairly often, but things pretty reliably go poorly if I point them out, which has resulted in an aversion to pointing such things out.

The conversation has so much gravity toward blame and self-defense that it just can't go anywhere else.

I'm not going to claim that this is a great post for communicating/educating/fixing anything. It's a weird post.

Comment by abramdemski on No, it's not The Incentives—it's you · 2019-06-23T00:57:43.083Z · score: 2 (1 votes) · LW · GW

I see what you mean, but there's a tendency to think of 'homo economicus' as having perfectly selfish, non-altruistic values.

Also, quite aside from standard economics, I tend to think of economic decisions as maximizing profit. Technically, the rational agent model in economics allows arbitrary objectives. But, what kinds of market behavior should you really expect?

When analyzing celebrities, it makes sense to assume rationality with a fame-maximizing utility function, because the people who manage to become and remain celebrities will, one way or another, be acting like fame-maximizers. There's a huge selection effect. So Homo Hollywoodicus can probably be modeled well with a fame-maximizing assumption.

This has nothing to do with the psychology of stardom. People may have all kinds of motives for what they do -- whether they're seeking stardom consciously or just happen to engage in behavior which makes them a star.

Similarly, when modeling politics, it is reasonable to make a Homo Politicus assumption that people seek to gain and maintain power. The politicians whose behavior isn't in line with this assumption will never break into politics, or at best will be short-lived successes. This has nothing to do with the psychology of the politicians.

And again, evolutionary game theory treats reproductive success as utility, despite the many other goals which animals might have.

So, when analyzing market behavior, it makes some sense to treat money as the utility function. Those who aren't going for money will have much less influence on the behavior of the market overall. Profit motives aren't everything, but other motives will be less important than profit motives in market analysis.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-14T08:42:28.221Z · score: 2 (1 votes) · LW · GW
My current understanding of quantilization is "choose randomly from the top X% of actions". I don't see how this helps very much with staying on-distribution... as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.

The base distribution you take the top X% of is supposed to be related to the "on-distribution" distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer's own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently "off-distribution" possibilities for there to be a concern. (I'm not saying these are entirely reasonable assumptions; I'm just saying that this is one way of thinking about quantilization.)

In any case, quantilization seems like it shouldn't work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth's atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren't very valuable.

The base distribution quantilization samples from is about actions, or plans, or policies, or things like that -- not about configurations of atoms.

So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
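
A minimal sketch of that picture, under stated assumptions: the base distribution is approximated by drawing a finite batch of actions, the top fraction `q` by estimated utility is kept, and one of those is picked uniformly. The names `quantilize`, `base_sampler`, and `utility`, and the Gaussian "motor command" example, are hypothetical illustrations rather than anything from the quantilization literature.

```python
import random

def quantilize(base_sampler, utility, q, n_samples=1000, rng=None):
    """Sample n_samples actions from the base distribution, keep the top
    q fraction ranked by utility, and return one of those at random."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    # Actions come from the base distribution, not from a search over outcomes.
    actions = [base_sampler(rng) for _ in range(n_samples)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n_samples))]
    return rng.choice(top)

# Hypothetical example: an action is a single random motor command.
rng = random.Random(0)
chosen = quantilize(lambda r: r.gauss(0.0, 1.0), utility=lambda a: a, q=0.1, rng=rng)
print(chosen)
```

Every action returned was something the base distribution itself produced, which is the sense in which the quantilizer is meant to stay close to "on-distribution" behavior; with `q=1` it reduces to plain sampling from the base distribution.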

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T23:24:13.709Z · score: 19 (9 votes) · LW · GW

Thinking up actual historical examples is hard for me. The following is mostly true, partly made up.

• (#4) I don't necessarily have trouble talking about my emotions, but when there are any clear incentives for me to make particular claims, I tend to shut down. It feels viscerally dishonest (at least sometimes) to say things, particularly positive things, which I have an incentive to say. For example, responding "it's good to see you too" in response to "it's good to see you" sometimes (not always) feels dishonest even when true.
• (#4) Talking about money with an employer feels very difficult, in a way that's related to intuitively discarding any motivated arguments and expecting others to do the same.
• (#6) I'm not sure if I was at the party, but I am generally in the crowd Grognor was talking about, and very likely engaged in similar behavior to what he describes.
• (#5) I have tripped up when trying to explain something because I noticed myself reaching for examples to prove my point, and the "cherry-picking" alarm went off.
• (#5, #4) I have noticed that a friend was selecting arguments for why I should go to the movies with him in a biased way, ignoring arguments to the contrary, and I 'shut down' in the conversation (became noncommittal / slightly unresponsive).
• (#3) I have thought in mistaken ways which would have accepted modest-epistemology arguments, when thinking about decision theory.

Comment by abramdemski on The Schelling Choice is "Rabbit", not "Stag" · 2019-06-09T22:15:22.639Z · score: 15 (7 votes) · LW · GW

By "is a PD", I mean, there is a cooperative solution which is better than any Nash equilibrium. In some sense, the self-interest of the players is what prevents them from getting to the better solution.

By "is a SH", I mean, there is at least one good cooperative solution which is an equilibrium, but there are also other equilibria which are significantly worse. Some of the worse outcomes can be forced by unilateral action, but the better outcomes require coordinated action (and attempted-but-failed coordination is even worse than the bad solutions).

In iterated PD (with the right assumptions, e.g., appropriately high probabilities of the game continuing after each round), tit-for-tat is an equilibrium strategy which results in a pure-cooperation outcome. The remaining difficulty of the game is the difficulty of ending up in that equilibrium. There are many other equilibria which one could equally well end up in, including total mutual defection. In that sense, iteration can turn a PD into a SH.

Other modifications, such as commitment mechanisms or access to the other player's source code, can have similar effects.
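The iterated-PD claim above can be checked numerically with a small sketch. The payoff values (5, 3, 1, 0) are the conventional textbook ones, assumed here rather than taken from the discussion, and `delta` stands for the probability that the game continues after each round. The function names are hypothetical. This compares only a couple of deviations, so it illustrates the equilibrium structure rather than proving it.

```python
# Assumed payoffs for (my move, their move): temptation 5, reward 3,
# punishment 1, sucker 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_history, other_history):
    # Cooperate first, then copy the other player's previous move.
    return "C" if not other_history else other_history[-1]


def always_defect(own_history, other_history):
    return "D"


def discounted_payoff(strategy_a, strategy_b, delta, rounds=2000):
    """Approximate player A's expected total payoff when the game
    continues with probability delta after each round (truncated
    at a fixed number of rounds)."""
    history_a, history_b = [], []
    total, weight = 0.0, 1.0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        total += weight * PAYOFF[(move_a, move_b)]
        history_a.append(move_a)
        history_b.append(move_b)
        weight *= delta
    return total


if __name__ == "__main__":
    for delta in (0.9, 0.3):
        print("delta =", delta)
        print("  TFT vs TFT: ", discounted_payoff(tit_for_tat, tit_for_tat, delta))
        print("  AllD vs TFT:", discounted_payoff(always_defect, tit_for_tat, delta))
        print("  AllD vs AllD:", discounted_payoff(always_defect, always_defect, delta))
        print("  TFT vs AllD:", discounted_payoff(tit_for_tat, always_defect, delta))
```

Under these assumed payoffs, at a high continuation probability neither switching from tit-for-tat to always-defect nor switching from always-defect to tit-for-tat pays off, so both the cooperative and the mutual-defection outcomes are stable, which is the stag-hunt-like structure. At a low continuation probability, defecting against tit-for-tat becomes profitable.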

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T21:57:38.178Z · score: 3 (2 votes) · LW · GW

I view the issue of intellectual modesty much like the issue of anthropics. The only people who matter are those whose decisions are subjunctively linked to yours (it only starts getting complicated when you start asking whether you should be intellectually modest about your reasoning about intellectual modesty)

I agree fairly strongly, but this seems far from the final word on the subject, to me.

One issue with the clever arguer is that the persuasiveness of their arguments might have very little to do with how persuasive they should be, so attempting to work off expectations might fail.

Ah. I take you to be saying that the quality of the clever arguer's argument can be high variance, since there is a good deal of chance in the quality of evidence cherry-picking is able to find. A good point. But, is it 'too high'? Do we want to do something (beyond the strategy I sketched in the post) to reduce variance?

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T21:45:19.269Z · score: 4 (3 votes) · LW · GW