Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-03T21:16:10.054Z · score: 6 (3 votes) · LW · GW

I'm saying that non-uniqueness of the solution is part of the conceptual problem with Nash equilibria.

Decision theory doesn't exactly provide a "unique solution" -- it's a theory of rational constraints on subjective belief, so, you can believe and do whatever you want within the confines of those rationality constraints. And of course classical decision theory also has problems of its own (such as logical omniscience). But there is a sense in which it is better than game theory about this, since game theory gives rationality constraints which depend on the other player in ways that are difficult to make real.

I'm not saying there's some strategy which works regardless of the other player's strategy. In single-player decision theory, you can still say "there's no optimal strategy due to uncertainty about the environment" -- but, you get to say "but there's an optimal strategy given our uncertainty about the environment", and this ends up being a fairly satisfying analysis. The nash-equilibrium picture of game theory lacks a similarly satisfying analysis. But this does not seem essential to game theory.
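To illustrate the "optimal given our uncertainty" point, here is a minimal single-player decision problem (payoffs and prior are made up for illustration):

```python
# No action is best in every state, but given a probability distribution
# over states there is a unique best action *in expectation*.

states = ["sunny", "rainy"]
priors = {"sunny": 0.7, "rainy": 0.3}

# payoff[action][state]
payoff = {
    "picnic": {"sunny": 10, "rainy": -5},
    "museum": {"sunny": 4,  "rainy": 4},
}

def expected_utility(action):
    return sum(priors[s] * payoff[action][s] for s in states)

best = max(payoff, key=expected_utility)
# "picnic" wins in sunny weather, "museum" in rain; neither is optimal
# unconditionally, but under this prior one is optimal in expectation.
print(best, expected_utility(best))
```

This is the satisfying analysis single-player decision theory provides; the complaint above is that Nash-equilibrium game theory has no analogue of the last two lines.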

Comment by abramdemski on Let's talk about "Convergent Rationality" · 2019-07-03T20:59:22.854Z · score: 6 (3 votes) · LW · GW

Something which seems missing from this discussion is the level of confidence we can have for/against CRT. It doesn't make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (ie, outside-view probability at least 1%), doesn't that have most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I'm not sure where the cutoff should be, so I said 1%.]

My personal position isn't CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.

Another obstacle to those strategies is, even if future AGI isn't sufficiently strategic/agenty/rational to fall into the "rationality attractor", it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don't apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.

Of course it doesn't make sense to generically argue that we should be concerned about CRT in absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing atomic weapons. So there has to be something about AI technology which specifically raises CRT as a concern.

One "something" is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there's more reasoning to be unpacked.

I also agree with other commenters about CRT not being quite the right thing to point at, but, this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.

Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-01T19:38:46.680Z · score: 6 (3 votes) · LW · GW

True, but, I think that's a bad way of thinking about game theory:

  • The Nash equilibrium model assumes that players somehow know what equilibrium they're in. Yet, it gives rise to an equilibrium selection problem due to the non-uniqueness of equilibria. This casts doubt on the assumption of common knowledge which underlies the definition of equilibrium.
  • Nash equilibria also assume a naive best-response pattern. If an agent faces a best-response agent and we assume that the Nash-equilibrium knowledge structure somehow makes sense (there is some way that agents successfully coordinate on a fixed point), then it would make more sense for an agent to select its response function (to, possibly, be something other than argmax), based on what gets the best response from the (more-naive) other player. This is similar to the UDT idea. Of course you can't have both players do this or you're stuck in the same situation again (ie there's yet another meta level which a player would be better off going to).

Going to the meta-level like that seems likely to make the equilibrium selection problem worse rather than better, but, that's not my point. My point is that Nash equilibria aren't the end of the story; they're a somewhat weird model. So it isn't obvious whether a similar no-free-lunch idea applies to a better model of game theory.
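A toy version of the "select your response function" point: a commitment-capable agent facing a naive best-responder can pick whichever outcome the follower's best response makes available. (Standard Chicken payoffs; the leader/follower framing is mine, for illustration.)

```python
actions = ["Dare", "Swerve"]
payoff = {  # (leader_action, follower_action) -> (leader_utility, follower_utility)
    ("Dare", "Dare"):     (0, 0),
    ("Dare", "Swerve"):   (3, 1),
    ("Swerve", "Dare"):   (1, 3),
    ("Swerve", "Swerve"): (2, 2),
}

def best_response(leader_action):
    # the naive player just best-responds to whatever the leader commits to
    return max(actions, key=lambda f: payoff[(leader_action, f)][1])

def leader_choice():
    # the leader selects its action *given* the follower's best-response pattern
    return max(actions, key=lambda l: payoff[(l, best_response(l))][0])

l = leader_choice()
f = best_response(l)
print(l, f, payoff[(l, f)])  # leader commits to Dare; follower swerves
```

Of course, as noted above, both players reasoning this way just recreates the original problem one meta-level up.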

Correlated equilibria are an obvious thing to mention here. They're a more sensible model in a few ways. I think there are still some unjustified and problematic assumptions there, though.

Comment by abramdemski on Conceptual Problems with UDT and Policy Selection · 2019-07-01T19:24:46.601Z · score: 2 (1 votes) · LW · GW

I don't want to claim there's a best way, but I do think there are certain desirable properties which it makes sense to shoot for. But this still sort of points at the wrong problem.

A "naturalistic" approach to game theory is one in which game theory is an application of decision theory (not an extension) -- there should be no special reasoning which applies only to other agents. (I don't know a better term for this, so let's use naturalistic for now.)

Standard approaches to game theory lack this (to varying degrees). So, one frame is that we would like to come up with an approach to game theory which is naturalistic. Coming from the other side, we can attempt to apply existing decision theory to games. This ends up being more confusing and unsatisfying than one might hope. So, we can think of game theory as an especially difficult stress-test for decision theory.

So it isn't that there should be some best strategy in multiplayer games, or even that I'm interested in a "better" player despite the lack of a notion of "best" (although I am interested in that). It's more that UDT doesn't give me a way to think about games. I'd like to have a way to think about games which makes sense to me, and which preserves as much as possible what seems good about UDT.

Desirable properties such as coordination are important in themselves, but are also playing an illustrative role -- pointing at the problem. (It could be that coordination just shouldn't be expected, and so, is a bad way of pointing at the problem of making game theory "make sense" -- but I currently think better coordination should be possible, so, think it is a good way to point at the problem.)

Conceptual Problems with UDT and Policy Selection

2019-06-28T23:50:22.807Z · score: 33 (8 votes)
Comment by abramdemski on What's up with self-esteem? · 2019-06-26T19:50:18.346Z · score: 2 (1 votes) · LW · GW

Cool, thanks!

What's up with self-esteem?

2019-06-25T03:38:15.991Z · score: 41 (14 votes)
Comment by abramdemski on How hard is it for altruists to discuss going against bad equilibria? · 2019-06-25T02:48:54.919Z · score: 7 (2 votes) · LW · GW
FAI is a sidetrack, if we don't have any path to FNI (friendly natural intelligence).


I don't think I understand the reasoning behind this, though I don't strongly disagree. Certainly it would be great to solve the "human alignment problem". But what's your claim?

If a bunch of fully self-interested people are about to be wiped out by an avoidable disaster (or even actively malicious people, who would like to hurt each other a little bit, but value self-preservation more), they're still better off pooling their resources together to avert disaster.

You might have a prisoner's dilemma / tragedy of the commons -- it's still even better if you can get everyone else to pool resources to avert disaster, while stepping aside yourself. BUT:

  • that's more a coordination problem again, rather than an everyone-is-too-selfish problem
  • that's not really the situation with AI, because what you have is more a situation where you can either work really hard to build AGI or work even harder to build safe AGI; it's not a tragedy of the commons, it's more like lemmings running off a cliff!

One point of confusion in trying to generalize bad behavior (bad equilibrium is an explanation or cause, bad behavior is the actual problem) is that incentives aren't exogenous - they're created and perpetuated by actors, just like the behaviors we're trying to change. One actor's incentives are another actor's behaviors.

Yeah, the incentives will often be crafted perversely, which likely means that you can expect even more opposition to clear discussion, because there are powerful forces trying to coordinate on the wrong consensus about matters of fact in order to maintain plausible deniability about what they're doing.

In the example being discussed here, it just seems like a lot of people coordinating on the easier route, partly due to momentum of older practices, partly because certain established people/institutions are somewhat threatened by the better practices.

I find it very difficult to agree to any generality without identifying some representative specifics. It feels way too much like I'm being asked to sign up for something without being told what. Relatedly, if there are zero specifics that you think fit the generalization well enough to be good examples, it seems very likely that the generalization itself is flawed.

My feeling is that small examples of the dynamic I'm pointing at come up fairly often, but things pretty reliably go poorly if I point them out, which has resulted in an aversion to pointing such things out.

The conversation has so much gravity toward blame and self-defense that it just can't go anywhere else.

I'm not going to claim that this is a great post for communicating/educating/fixing anything. It's a weird post.

Comment by abramdemski on No, it's not The Incentives—it's you · 2019-06-23T00:57:43.083Z · score: 2 (1 votes) · LW · GW

I see what you mean, but there's a tendency to think of 'homo economicus' as having perfectly selfish, non-altruistic values.

Also, quite aside from standard economics, I tend to think of economic decisions as maximizing profit. Technically, the rational agent model in economics allows arbitrary objectives. But, what kinds of market behavior should you really expect?

When analyzing celebrities, it makes sense to assume rationality with a fame-maximizing utility function, because the people who manage to become and remain celebrities will, one way or another, be acting like fame-maximizers. There's a huge selection effect. So Homo Hollywoodicus can probably be modeled well with a fame-maximizing assumption.

This has nothing to do with the psychology of stardom. People may have all kinds of motives for what they do -- whether they're seeking stardom consciously or just happen to engage in behavior which makes them a star.

Similarly, when modeling politics, it is reasonable to make a Homo Politicus assumption that people seek to gain and maintain power. The politicians whose behavior isn't in line with this assumption will never break into politics, or at best will be short-lived successes. This has nothing to do with the psychology of the politicians.

And again, evolutionary game theory treats reproductive success as utility, despite the many other goals which animals might have.

So, when analyzing market behavior, it makes some sense to treat money as the utility function. Those who aren't going for money will have much less influence on the behavior of the market overall. Profit motives aren't everything, but other motives will be less important than profit motives in market analysis.

How hard is it for altruists to discuss going against bad equilibria?

2019-06-22T03:42:24.416Z · score: 51 (13 votes)
Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-14T08:42:28.221Z · score: 2 (1 votes) · LW · GW
My current understanding of quantilization is "choose randomly from the top X% of actions". I don't see how this helps very much with staying on-distribution... as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.


The base distribution you take the top X% of is supposed to be related to the "on-distribution" distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer's own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently "off-distribution" possibilities for there to be a concern. (I'm not saying these are entirely reasonable assumptions; I'm just saying that this is one way of thinking about quantilization.)

In any case, quantilization seems like it shouldn't work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth's atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren't very valuable.

The base distribution quantilization samples from is about actions, or plans, or policies, or things like that -- not about configurations of atoms.

So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
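For concreteness, here is a minimal quantilizer sketch in the above spirit. The base sampler and utility function are stand-ins I made up, not anything from the literature:

```python
import random

# Sample actions from a base distribution assumed to be "safe"/on-distribution,
# then pick uniformly among the top q fraction by estimated utility.
def quantilize(base_sampler, utility, q=0.1, n=1000, rng=None):
    rng = rng or random.Random(0)
    candidates = [base_sampler(rng) for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return rng.choice(top)

# Example: because every candidate comes from the bounded base distribution,
# the chosen action stays on-distribution, even though it skews toward
# high utility.
action = quantilize(
    base_sampler=lambda rng: rng.uniform(-1.0, 1.0),
    utility=lambda a: a,   # utility increases with a
    q=0.05,
)
print(action)  # a high-utility draw, but still a sample from the base distribution
```

The safety argument lives entirely in the assumption that the base distribution is safe to sample from; the quantilizer itself is just the two-step sample-then-truncate procedure.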

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T23:24:13.709Z · score: 19 (9 votes) · LW · GW

Thinking up actual historical examples is hard for me. The following is mostly true, partly made up.

  • (#4) I don't necessarily have trouble talking about my emotions, but when there are any clear incentives for me to make particular claims, I tend to shut down. It feels viscerally dishonest (at least sometimes) to say things, particularly positive things, which I have an incentive to say. For example, replying "it's good to see you too" to "it's good to see you" sometimes (not always) feels dishonest even when true.

  • (#4) Talking about money with an employer feels very difficult, in a way that's related to intuitively discarding any motivated arguments and expecting others to do the same.
  • (#6) I'm not sure if I was at the party, but I am generally in the crowd Grognor was talking about, and very likely engaged in similar behavior to what he describes.
  • (#5) I have tripped up when trying to explain something because I noticed myself reaching for examples to prove my point, and the "cherry-picking" alarm went off.
  • (#5, #4) I have noticed that a friend was selecting arguments that I should go to the movies with him in a biased way which ignored arguments to the contrary, and 'shut down' in the conversation (becoming noncommittal / slightly unresponsive).
  • (#3) I have thought in mistaken ways which would have accepted modest-epistemology arguments, when thinking about decision theory.
Comment by abramdemski on The Schelling Choice is "Rabbit", not "Stag" · 2019-06-09T22:15:22.639Z · score: 15 (7 votes) · LW · GW

By "is a PD", I mean, there is a cooperative solution which is better than any Nash equilibrium. In some sense, the self-interest of the players is what prevents them from getting to the better solution.

By "is a SH", I mean, there is at least one good cooperative solution which is an equilibrium, but there are also other equilibria which are significantly worse. Some of the worse outcomes can be forced by unilateral action, but the better outcomes require coordinated action (and attempted-but-failed coordination is even worse than the bad solutions).
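These definitions can be checked mechanically on the standard textbook payoff matrices (the numbers below are the usual ones, not from the post):

```python
from itertools import product

def pure_nash(game, actions):
    # outcomes where neither player gains by unilateral deviation
    eqs = []
    for r, c in product(actions, repeat=2):
        ru, cu = game[(r, c)]
        if all(game[(r2, c)][0] <= ru for r2 in actions) and \
           all(game[(r, c2)][1] <= cu for c2 in actions):
            eqs.append((r, c))
    return eqs

def dominates(p, q):  # Pareto dominance
    return p[0] >= q[0] and p[1] >= q[1] and p != q

actions = ["C", "D"]
pd = {("C","C"): (3,3), ("C","D"): (0,5), ("D","C"): (5,0), ("D","D"): (1,1)}
sh = {("C","C"): (4,4), ("C","D"): (0,3), ("D","C"): (3,0), ("D","D"): (2,2)}

pd_eqs = pure_nash(pd, actions)
sh_eqs = pure_nash(sh, actions)
pd_better = [o for o in pd if all(dominates(pd[o], pd[e]) for e in pd_eqs)]
sh_better = [o for o in sh if all(dominates(sh[o], sh[e]) for e in sh_eqs)]
print("PD equilibria:", pd_eqs, "outcomes beating every equilibrium:", pd_better)
print("SH equilibria:", sh_eqs, "outcomes beating every equilibrium:", sh_better)
```

PD has a single equilibrium with a cooperative outcome strictly better than it; SH has two equilibria, with the good cooperative outcome among them rather than above them.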

In iterated PD (with the right assumptions, eg appropriately high probabilities of the game continuing after each round), tit-for-tat is an equilibrium strategy which results in a pure-cooperation outcome. The remaining difficulty of the game is the difficulty of ending up in that equilibrium. There are many other equilibria which one could equally well end up in, including total mutual defection. In that sense, iteration can turn a PD into a SH.
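The iteration point can be checked directly (standard PD payoffs, with a long fixed horizon standing in for a high continuation probability):

```python
# With enough rounds, a unilateral switch from tit-for-tat to always-defect
# lowers your own total payoff, so mutual tit-for-tat is stable -- yet
# mutual always-defect is stable too (defecting against a defector is best).

PAYOFF = {("C","C"): (3,3), ("C","D"): (0,5), ("D","C"): (5,0), ("D","D"): (1,1)}

def tit_for_tat(my_hist, their_hist):
    return their_hist[-1] if their_hist else "C"

def always_defect(my_hist, their_hist):
    return "D"

def play(s1, s2, rounds=100):
    h1, h2, u1, u2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2)
        u1 += p1; u2 += p2
    return u1, u2

coop, _ = play(tit_for_tat, tit_for_tat)    # pure cooperation: 100 * 3
dev, _  = play(always_defect, tit_for_tat)  # one exploitation, then mutual defection
print(coop, dev)
```

Deviating earns 5 once and 1 thereafter, far below steady cooperation, so tit-for-tat-vs-tit-for-tat is an equilibrium; the hard part, as above, is *ending up* there rather than in mutual defection.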

Other modifications, such as commitment mechanisms or access to the other player's source code, can have similar effects.

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T21:57:38.178Z · score: 3 (2 votes) · LW · GW
I view the issue of intellectual modesty much like the issue of anthropics. The only people who matter are those whose decisions are subjunctively linked to yours (it only starts getting complicated when you start asking whether you should be intellectually modest about your reasoning about intellectual modesty)


I agree fairly strongly, but this seems far from the final word on the subject, to me.

One issue with the clever arguer is that the persuasiveness of their arguments might have very little to do with how persuasive they should be, so attempting to work off expectations might fail.

Ah. I take you to be saying that the quality of the clever arguer's argument can be high variance, since there is a good deal of chance in the quality of evidence cherry-picking is able to find. A good point. But, is it 'too high'? Do we want to do something (beyond the strategy I sketched in the post) to reduce variance?

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T21:45:19.269Z · score: 4 (3 votes) · LW · GW

That seems about right.

A concern I didn't mention in the post -- it isn't obvious how to respond to game-theoretic concerns. Carefully estimating the size of the update you should make when someone fails to provide good reason can be difficult, since you have to model other agents, and you might make exploitable errors.

An extreme way of addressing this is to ignore all evidence short of mathematical proof if you have any non-negligible suspicion about manipulation, similar to the mistake I describe myself making in the post. This seems too extreme, but it isn't clear what the right thing to do overall is. The fully-Bayesian approach to estimating the amount of evidence should act similarly to a good game-theoretic solution, I think, but there might be reason to use a simpler strategy with less chance of exploitable patterns.
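As a tiny worked example of sizing such an update (the likelihoods are made up; game-theoretic manipulation would show up as uncertainty about these numbers):

```python
# If good arguments are more likely to surface when the claim is true,
# their absence is evidence against it -- but *how much* depends on the
# assumed likelihood of a good argument appearing under each hypothesis.

def posterior_given_no_argument(prior, p_arg_if_true, p_arg_if_false):
    p_no_if_true = 1 - p_arg_if_true
    p_no_if_false = 1 - p_arg_if_false
    num = prior * p_no_if_true
    return num / (num + (1 - prior) * p_no_if_false)

# A good argument would very likely exist if true: silence is strong evidence.
print(posterior_given_no_argument(0.5, 0.8, 0.2))  # 0.2
# Arguments are hard to find either way: silence barely moves you.
print(posterior_given_no_argument(0.5, 0.3, 0.2))  # ~0.47
```

The exploitability worry above is that a manipulator controls `p_arg_if_true` and `p_arg_if_false` to some degree, so naively plugging in estimates creates patterns others can game.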

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2019-06-09T21:32:52.634Z · score: 19 (7 votes) · LW · GW

Thank you! Appreciative comments really help me to be less risk-averse about posting.

Comment by abramdemski on Paternal Formats · 2019-06-09T21:19:11.590Z · score: 11 (4 votes) · LW · GW

I like your suggestion well enough that I might edit the post. (I'll let it sit a bit to see whether I change my mind.)

Comment by abramdemski on Paternal Formats · 2019-06-09T21:15:53.450Z · score: 6 (3 votes) · LW · GW

Maybe serial-access vs random-access, as in computer memory.

Comment by abramdemski on Paternal Formats · 2019-06-09T21:14:24.099Z · score: 2 (1 votes) · LW · GW

Yeah.... I thought about this problem while writing, but didn't think of an alternative I liked.

Comment by abramdemski on Paternal Formats · 2019-06-09T21:12:14.984Z · score: 3 (2 votes) · LW · GW

I'm curious what your guess was.

Comment by abramdemski on The Schelling Choice is "Rabbit", not "Stag" · 2019-06-09T04:32:35.830Z · score: 27 (8 votes) · LW · GW

We should really be calling it Rabbit Hunt rather than Stag Hunt.

  • The schelling choice is rabbit. Calling it stag hunt makes the stag sound schelling.
  • The problem with stag hunt is that the schelling choice is rabbit. Saying of a situation "it's a stag hunt" generally means that the situation sucks because everyone is hunting rabbit. When everyone is hunting stag, you don't really bring it up. So, it would make way more sense if the phrase was "it's a rabbit hunt"!
  • Well, maybe you'd say "it's a rabbit hunt" when referring to the bad equilibrium you're seeing in practice, and "it's a stag hunt" when saying that a better equilibrium is a utopian dream.
  • So, yeah, calling the game "rabbit hunt" is a stag hunt.
I used to think a lot in terms of Prisoner's Dilemma, and "Cooperate"/"Defect." I'd see problems that could easily be solved if everyone just put a bit of effort in, which would benefit everyone. And people didn't put the effort in, and this felt like a frustrating, obvious coordination failure. Why do people defect so much?
Eventually Duncan shifted towards using Stag Hunt rather than Prisoner's Dilemma as the model here. If you haven't read it before, it's worth reading the description in full. If you're familiar you can skip to my current thoughts below.

In the book The Stag Hunt, Skyrms similarly says that lots of people use Prisoner's Dilemma to talk about social coordination, and he thinks people should often use Stag Hunt instead.

I think this is right. Most problems which initially seem like Prisoner's Dilemma are actually Stag Hunt, because there are potential enforcement mechanisms available. The problems discussed in Meditations on Moloch are mostly Stag Hunt problems, not Prisoner's Dilemma problems -- Scott even talks about enforcement, when he describes the dystopia where everyone has to kill anyone who doesn't enforce the terrible social norms (including the norm of enforcing).

This might initially sound like good news. Defection in Prisoner's Dilemma is an inevitable conclusion under common decision-theoretic assumptions. Trying to escape multipolar traps with exotic decision theories might seem hopeless. On the other hand, rabbit in Stag Hunt is not an inevitable conclusion, by any means.

Unfortunately, in reality, hunting stag is actually quite difficult. ("The schelling choice is Rabbit, not Stag... and that really sucks!")

Rabbit in this case was "everyone just sort of pursues whatever conversational types seem best to them in an uncoordinated fashion", and Stag is "we deliberately choose and enforce particular conversational norms."

This sounds a lot like Pavlov-style coordination vs Tit for Tat style coordination. Both strategies can defeat Moloch in theory, but they have different pros and cons. TfT-style requires agreement on norms, whereas Pavlov-style doesn't. Pavlov-style can waste a lot of time flailing around before eventually coordinating. Pavlov is somewhat worse at punishing exploitative behavior, but less likely to lose a lot of utility due to feuds between parties who each think they've been wronged and must distribute justice.
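A quick simulation of the contrast (standard PD payoffs; "Pavlov" here means win-stay/lose-shift): after one mismatched move, tit-for-tat players feud, while Pavlov players re-coordinate.

```python
PAYOFF = {("C","C"): (3,3), ("C","D"): (0,5), ("D","C"): (5,0), ("D","D"): (1,1)}

def play(step1, step2, first1, first2, rounds=10):
    a1, a2, history = first1, first2, []
    for _ in range(rounds):
        history.append((a1, a2))
        p1, p2 = PAYOFF[(a1, a2)]
        a1, a2 = step1(a1, a2, p1), step2(a2, a1, p2)
    return history

def tit_for_tat(my_last, their_last, my_payoff):
    return their_last

def pavlov(my_last, their_last, my_payoff):
    # win-stay (payoff 3 or 5), lose-shift (payoff 0 or 1)
    return my_last if my_payoff >= 3 else ("D" if my_last == "C" else "C")

tft_run = play(tit_for_tat, tit_for_tat, "C", "D")
pav_run = play(pavlov, pavlov, "C", "D")
print(tft_run)  # (C,D),(D,C),(C,D),... -- an endless feud
print(pav_run)  # (C,D),(D,D),(C,C),(C,C),... -- settles into cooperation
```

The Pavlov pair pays a short cost (one round of mutual defection) to re-coordinate without any agreed norm, which is the flailing-then-coordinating behavior described above.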

When discussing whether to embark on a stag hunt, it's useful to have shorthand to communicate why you might ever want to put a lot of effort into a concerted, coordinated effort. And then you can discuss the tradeoffs seriously.
Much of the time, I feel like getting angry and frustrated... is something like "wasted motion" or "the wrong step in the dance."


Not really strongly contradicting you, but I remember Critch once outlined something like the following steps for getting out of bad equilibria. (This is almost definitely not the exact list of steps he gave; I think there were 3 instead of 4 -- but step #1 was definitely in there.)

1. Be the sort of person who can get frustrated at inefficiencies.

2. Observe the world a bunch. Get really curious about the ins and outs of the frustrating inefficiencies you notice; understand how the system works, and why the inefficiencies exist.

3. Make a detailed plan for a better equilibrium. Justify why it is better, and why it is worth the effort/resources to do this. Spend time talking to the interested parties to get feedback on this plan.

4. Finally, formally propose the plan for approval. This could mean submitting a grant proposal to a relevant funding organization, or putting something up for a vote, or other things. This is the step where you are really trying to step into the better equilibrium, which means getting credible backing for taking the step (perhaps a letter signed by a bunch of people, or a formal vote), and creating common knowledge between relevant parties (making sure everyone can trust that the new equilibrium is established). It can also mean some kind of official deliberation has to happen, depending on context (such as a vote, or some kind of due-diligence investigation, or an external audit, etc).

Comment by abramdemski on Selection vs Control · 2019-06-09T02:48:04.186Z · score: 4 (2 votes) · LW · GW

I guess my claim isn't that the mainstream is explicitly confused about the distinction (ie, makes confused claims), but that the distinction isn't clearly made/taught, which leaves some individuals confused.

I think this has a little to do with the (also often implicit) distinction between research and application (ie, research vs engineering). In the context of pure research, it might make a lot of sense to take shortcuts with toy models which you could not take in the intended application of the algorithms, because you are investigating a particular phenomenon and the shortcuts don't interfere with that investigation. However, these shortcuts can apparently change the type of the problem, and other people can become confused about what problem type you are really trying to solve.

To be a bit more concrete, you might test an AI on a toy model, and directly feed the AI some information about the toy model (as a shortcut). You can do this because the toy model is a simulation you built, so, you have direct access to it. Your intention in the research might be that such direct-fed information would be replaced with learning one day. (To you, your AI is "controller" type.) Others may misinterpret your algorithm as a search technique which takes an explicit model of a situation (they see it as "selection" type).

This could result in other people writing papers which contrast your technique with other "selection"-type techniques. Your algorithm might compare poorly because you made some decisions motivated by eventual control-type applications. This becomes hard to point out because the selection/control distinction is a bit tricky.

As far as I can see, no one there thinks search and planning are the same task.

I'm not sure what you mean about search vs planning. My guess is that search=selection and planning=control. While I do use "search" and "selection" somewhat interchangeably, I don't want to use "planning" and "control" interchangeably; "planning" suggests a search-type operation applied to solve a control problem (the selection-process-within-a-control-process idea).

Also, it seems to me that tons of people would say that planning is a search problem, and AI textbooks tend to reflect this.

With regard to search algorithms being controllers: Here's a discussion I had with ErickBall where they argue that planning will ultimately prove useful for search and I argue it won't.

In the discussion, you say:

Optimization algorithms used in deep learning are typically pretty simple. Gradient descent is taught in sophomore calculus. Variants on gradient descent are typically used, but all the ones I know of are well under a page of code in complexity.

Gradient descent is extremely common these days, but much less so when I was first learning AI (just over ten years ago). To a large extent, it has turned out that "dumber" methods are easier to scale up.

However, much more sophisticated search techniques (with explicit consequentialist reasoning in the inner loop) are still discussed occasionally, especially for cases where evaluating a point is more costly. "Bayesian Optimization" is the subfield in which this is studied (that I know of). Here's an example:

Gaussian Processes for Global Optimization (the search is framed as a sequential decision problem!)

Later, you ask:

How do you reckon long-term planning will be useful for architecture search? It's not a stateful system.

The answer (in terms of Bayesian Optimization) is that planning ahead is still helpful in the same way that planning a sequence of experiments can be helpful. You are exploring the space in order to find the best solution. At every point, you are asking "what question should I ask next, to maximize the amount of information I'll uncover in the long run?". This does not reduce to "what question should I ask next, in order to maximize the amount of information I have right now?" -- but, most optimization algorithms don't even go that far. Most optimization algorithms don't explicitly reason about value-of-information at all, instead doing reasoning which is mainly designed to steer toward the best points it knows how to steer to immediately, with some randomness added in to get some exploration.
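A minimal sketch of that framing, using standard Gaussian-process-posterior and expected-improvement machinery rather than anything specific from the linked paper (kernel, lengthscale, and the toy objective are all my choices):

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.3):
    # squared-exponential kernel between two 1-D arrays of points
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_new)
    Kss = rbf(x_new, x_new)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_obs
    var = np.clip(np.diag(Kss - Ks.T @ Kinv @ Ks), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    # values a candidate by how much it might beat the incumbent,
    # crediting uncertainty -- not just the current posterior mean
    z = (mu - best) / sigma
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    pdf = np.exp(-0.5 * z**2) / sqrt(2 * pi)
    return (mu - best) * cdf + sigma * pdf

x_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.sin(3 * x_obs)            # stand-in for an expensive evaluation
candidates = np.linspace(0, 1, 101)
mu, sigma = gp_posterior(x_obs, y_obs, candidates)
ei = expected_improvement(mu, sigma, y_obs.max())
next_x = candidates[int(np.argmax(ei))]
print(next_x)
```

Even this is only a one-step-lookahead acquisition; full planning-how-to-search would reason about whole sequences of future queries, which is exactly the overhead the paragraph above says is rarely worth paying.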

Yet, this kind of reasoning is not usually worth it, or so it seems based on the present research landscape. The overhead of planning-how-to-search is too costly; it doesn't save time overall.

Comment by abramdemski on Selection vs Control · 2019-06-09T01:52:09.851Z · score: 4 (2 votes) · LW · GW

I agree with most of what you say here, but I think you're over-emphasizing the idea that search deals with unknowns whereas control deals with knowns. Optimization via search works best when you have a good model of the situation. The extreme case for usefulness of search is a game like Chess, where the rules are perfectly known, there's no randomness, and no hidden information. If you don't know a lot about a situation, you can't build an optimal controller, but you also can't set up a very good representation of the problem to solve via search.

This is backwards, actually. “Control” isn’t the crummy option you have to resort to when you can’t afford to search. Searching is what you have to resort to when you can’t do control theory.

Why not both? Most of your post is describing situations where you can't easily solve a control problem with a direct rule, so you spin up a search based on a model of the situation. My paragraph which you quoted was describing a situation where dumb search becomes harder and harder, so you spin up a controller (inside the search process) to help out. Both of these things happen.

Paternal Formats

2019-06-09T01:26:27.911Z · score: 58 (23 votes)

Mistakes with Conservation of Expected Evidence

2019-06-08T23:07:53.719Z · score: 132 (40 votes)
Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-04T16:55:22.770Z · score: 5 (3 votes) · LW · GW

If there's 50% on a paperclips-maximizing utility function and 50% on staples, there's not really any optimization pressure put toward satisfying both.

  • As you say, there's no reason to make 50% of the universe into paperclips; that's just not what 50% probability on paperclips means.
  • It could be that there's a sorta-paperclip-sorta-staple (let's say 'stapleclip' for short), which the AGI will be motivated to find in order to get a moderately high rating according to both strategies.
  • However, it could be that trying to be both paperclip and staple at the same time reduces the overall efficiency. Maybe the most efficient nanometer-scale stapleclip is significantly larger than the most efficient paperclip or staple, as a result of having to represent the critical features of both paperclips and staples. In this case, the AGI will prefer to gamble, tiling the universe with whatever is most efficient, and giving no consideration at all to the other hypothesis.

That's the essence of my concern: uncertainty between possibilities does not particularly push toward jointly maximizing the possibilities. At least, not without further assumptions.
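To make the stapleclip point concrete, here's a toy sketch (all numbers hypothetical): an agent maximizing expected utility under a 50/50 mixture prefers gambling on a pure option whenever the compromise is less efficient, so the mixture applies no pressure toward jointly satisfying both hypotheses.

```python
# Toy illustration (hypothetical numbers): expected utility under a
# 50/50 mixture of a paperclip utility function and a staple utility function.
options = {
    "all_paperclips":  {"paperclip_u": 1.0, "staple_u": 0.0},
    "all_staples":     {"paperclip_u": 0.0, "staple_u": 1.0},
    # The compromise "stapleclip" is less efficient for each goal separately:
    "all_stapleclips": {"paperclip_u": 0.4, "staple_u": 0.4},
}

def expected_utility(opt, p_paperclip=0.5):
    return p_paperclip * opt["paperclip_u"] + (1 - p_paperclip) * opt["staple_u"]

# Either pure gamble (EU 0.5) beats the compromise (EU 0.4), so the
# maximizer tiles the universe with one thing and ignores the other hypothesis.
best = max(options, key=lambda k: expected_utility(options[k]))
print(best)
```

The mixture only favors the compromise if the stapleclip's per-hypothesis utilities average above the pure options' expected value, which is exactly the efficiency question raised above.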

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-04T16:30:48.169Z · score: 2 (1 votes) · LW · GW

See this comment. Stuart and I are discussing what happens after things have converged as much as they're going to, but there's still uncertainty left.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-04T07:48:46.907Z · score: 4 (2 votes) · LW · GW

Concerning #3: yeah, I'm currently thinking that you need to make some more assumptions. But, I'm not sure I want to make assumptions about resources. I think there may be useful assumptions related to the way the hypotheses are learned -- IE, we expect hypotheses with nontrivial weight to have a lot of agreement because they are candidate generalizations of the same data, which makes it somewhat hard to entirely dissatisfy some while satisfying others. This doesn't seem quite helpful enough, but, perhaps something in that direction.

In any case, I agree that it seems interesting to explore assumptions about the mutual satisfiability of different value functions.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-04T07:26:22.775Z · score: 2 (1 votes) · LW · GW
Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly disliked this corner case were ruled out.

That's not the case I'm considering. I'm imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.

Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are "on-distribution", ie, similar to what it has been able to get feedback from humans about (however it does that). The variation is much higher for "off-distribution" plans.

There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is "smaller"; there's room for a lot to happen in the off-distribution space. That's why it reaches weird corner cases.

And, since the variation is higher in off-distribution space, there may be some options that really look quite good, but which achieve very low value under some of the plausible hypotheses. In fact, because the different remaining hypotheses are different, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to every hypothesis. But that is not what the search is told to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)

When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they're off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
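For readers unfamiliar with quantilization, here is a minimal sketch of the "stay on-distribution" idea (this is an illustrative toy, not any particular published implementation): instead of taking the argmax, sample actions from a trusted base distribution and pick randomly among the top q fraction by estimated utility.

```python
import random

def quantilize(base_sampler, utility_estimate, q=0.1, n=1000, rng=random):
    """Sample n actions from the base distribution, then return a uniformly
    random action from the top q fraction by estimated utility. Every
    returned action is a typical sample from the base policy, which limits
    how far optimization pressure can push off-distribution."""
    samples = [base_sampler() for _ in range(n)]
    samples.sort(key=utility_estimate, reverse=True)
    top = samples[: max(1, int(q * n))]
    return rng.choice(top)

# Usage: base policy samples uniformly from [0, 1]; utility is the value itself.
random.seed(0)
action = quantilize(random.random, lambda x: x, q=0.1)
print(action)  # a typical draw from the base policy's top decile
```

The parameter q trades off optimization power against trust in the base distribution: q = 1 just follows the base policy, while q → 0 approaches full argmax and re-introduces the off-distribution risk.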

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T23:52:23.953Z · score: 15 (5 votes) · LW · GW
I fully agree that realizability is needed here. In practice, for the research I'm doing, I'm defining the desired utility as being defined by a constructive process. Therefore the correct human preference set is in there by definition. This requires that the set of possible utilities be massive enough that we're confident we didn't miss anything. Then, because the process is constructive, it has to be realizable once we've defined the "normative assumptions" that map observations to updates of value functions.

My current intuition is that thinking in terms of non-realizable epistemology will give a more robust construction process, even though the constructive way of thinking justifies a kind of realizability assumption. This is partly because it allows us to do without the massive-enough set of hypotheses (which one may have to do without in practice), but also because it seems closer to the reality of "humans don't really have a utility function, not exactly".

However, I think I haven't sufficiently internalized your point about utility being defined by a constructive process, so my opinion on that may change as I think about it more.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T23:43:38.831Z · score: 4 (2 votes) · LW · GW

I am a little frustrated with your reply (particularly the first half), but I'm not sure if you're really missing my point (perhaps I'll have to think of a different way of explaining it) vs addressing it, but not giving me enough of an argument for me to connect the dots. I'll have to think more about some of your points.

Many of your statements seem true for moderately-intelligent systems of the sort you describe, but, don't clearly hold up when a lot of optimization pressure is applied.

If there's even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high.

The VOI incentive can't be so strong that the AI is willing to pay arbitrarily high costs (commit the resources of the whole galaxy to investigating ever-finer details of human preferences, deconstruct each human atom by atom, etc...). So, at some point, it can be worthwhile to entirely compromise one somewhat-plausible utility function for the sake of others.

This would be untrue if, for example, the system maximized the weighted product of the utilities (each hypothesis's weight is used as an exponent on its utility). It would then actually never be worth it to entirely zero out one possible utility function for the sake of optimizing others. That proposal likely has its own issues, but I mention it just to make clear that I'm not bemoaning an inevitable fact of decision theory -- there are alternatives.

As I said, I think Goodhart's law is largely about distributional shift. My scheme incentivizes the AI to mostly take "on-distribution" plans: plans it is confident are good, because many different ways of looking at the data all point to them being good.

This is one of the assertions which seems generally true of moderately intelligent systems optimizing under value uncertainty, but doesn't seem to hold up as a lot of optimization pressure is applied. Good plans will tend to be on-distribution, because that's a good way to reap the gains of many different remaining hypotheses which agree for on-distribution things but disagree elsewhere. Why would the best plans tend to be on-distribution? Why wouldn't they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?

Part of me wants to say "if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!" Or: "If the overseer isn't capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!" But maybe there's an elegant way to implement a more conservative design.

Yeah, that's the direction I'm thinking in. By the way -- I'm not even trying to say that maximizing subjective expected utility is actually the wrong thing to do (particularly if you've got calibration properties, or knows-what-it-knows properties, or some other learning-theoretic properties which we haven't realized we want yet). I'm just saying that the case is not clear, and it seems like we'd want the case to be clear.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T06:26:59.874Z · score: 2 (1 votes) · LW · GW

If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.

To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can’t easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.

I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?

I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can't be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won't this run into siren worlds and so on, by default?

Technically, you don’t need this assumption. As I wrote in this comment: “it’s not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there’s some model in the ensemble that also makes that veto.”

Yeah, it seems possible and interesting to formalize an argument like that.

(I haven’t read a lot about quantilization so I can’t say much about that. However, a superintelligent adversary seems like something to avoid.)

The "adversary" can be something like a mesa-optimizer arising from a search which the system runs in order to solve a problem. If you've got rich enough of a hypothesis space (due to using a rich hypothesis space of world-models, or a rich set of possible human utility functions, etc etc), then you'll have some of those lurking in the hypothesis space. Reasoning in an appropriate way about the possibility, even if you manage to avoid mesa-optimizers in reality, could require game-theoretic reasoning.

OTOH, although quantilization can be justified by a story involving an actual adversary, that's not necessarily the best way to think about what it is really doing. Robustness properties tend to involve some kind of universal quantifier over a bunch of possibilities. Maintaining a property under such a universal quantification is like adversarial game theory; you're trying to do well no matter what strategy the other player uses. So, robustness properties tend to be conveniently described in adversarial terms. That's basically what's going on in the case of quantilization.

Similarly, "adversarial Goodhart" doesn't have to be about superintelligent adversaries, in general. It can be about cases where we want stronger guarantees, and so, are willing to compromise some decision-theoretic optimality in return for better worst-case guarantees.

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T06:07:35.845Z · score: 2 (1 votes) · LW · GW

If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.

I think that's right. Perhaps there is a small distinction to be brought out, but basically, extremal Goodhart is distributional shift brought about by the fact that the AI is optimizing hard.

In any case… I’m not exactly sure what you mean by “calibration”, but when I say “calibration”, I refer to “knowing what you know”. For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I’m reasonably “well-calibrated”; that is, I have a sense of what I do and don’t know.

A calibrated AI system, to me, is one that correctly says “this thing I’m looking at is an unusual thing I’ve never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high”.

Here's what I mean by calibration: there's a function from the probability you give to the frequency observed (or from the expected value you give to the average value observed), and the function approaches a straight x=y line as you learn. That's basically what you describe in the example of the online test. However, in ML, there's a difference between knows-what-it-knows learning (KWIK learning) and calibrated learning. KWIK learning is more like what you describe in the second paragraph above. Calibrated learning is focused on the idea that a system should learn when it is systematically over/under confident, correcting such predictable biases. KWIK learning is more focused on not making claims when you have insufficient evidence to pinpoint the right answer.

Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.

I don't think the inner alignment problem, or the unintended optimization problem, reduce to calibrated learning (or KWIK learning). However, my reasons are somewhat complex. I think it is reasonable to try to make those reductions, so long as you grapple with the real issues.

I’m not super hung up on statistical guarantees, as I haven’t yet seen a way to make them in general which doesn’t require making some sort of unreasonable or impractical assumption about the world (and I’m skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.

Statistical guarantees are just a way to be able to say something with confidence. I agree that they're often impractical, and therefore only a toy model of how things can work (at best). However, I'm not very sympathetic to attempts to solve the problem without actually-quite-strong arguments for alignment-relevant properties being made somehow. The question is, how?

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T05:44:40.100Z · score: 5 (3 votes) · LW · GW

However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.

Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?

If there's a systematic bias in the score, the thing with the highest score may not always be the best action to take. Calibrating the estimates may change the ranking of options.

For example, it could be that expected values above 0.99 are almost always significant overestimates, with an average true value of 0.5. A calibrated learner would observe this and systematically correct such items downwards. The new top choices would probably have values like 0.989 (if that's the only correction applied).

This provides something of a guarantee that systematic Goodhart-type problems will eventually be recognized and corrected, to the extent which they occur.

A meta-rule like that, which corrects observed biases in the aggregate scores, isn't easy to represent as a direct object-level hypothesis about the data. That's why calibrated learning may not be Bayesian. And, without a calibration guarantee, you'd need some other argument as to why representing uncertainty helps to avoid Goodhart.
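A toy sketch of the meta-rule described above (hypothetical numbers): apply the observed correction to raw estimates, then re-rank. The point is that calibration can change which option comes out on top, not just shrink the scores.

```python
# Hypothetical calibration rule learned from past data: estimates above
# 0.99 have historically averaged a true value of 0.5; others are unbiased.
def calibrate(estimate):
    return 0.5 if estimate > 0.99 else estimate

raw_scores = {"weird_corner_case": 0.999, "normal_plan": 0.989}

best_raw = max(raw_scores, key=raw_scores.get)
best_calibrated = max(raw_scores, key=lambda k: calibrate(raw_scores[k]))
print(best_raw, best_calibrated)  # the ranking flips after correction
```

Note that this correction is a function of the estimate itself, not of any object-level feature of the data, which is why it resists being folded into an ordinary Bayesian hypothesis.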

Comment by abramdemski on Does Bayes Beat Goodhart? · 2019-06-03T04:06:10.223Z · score: 2 (1 votes) · LW · GW

D: Wow, I'm surprised I made such a mistake so repeatedly. Thanks!

Comment by abramdemski on Yes Requires the Possibility of No · 2019-06-03T03:42:48.942Z · score: 5 (3 votes) · LW · GW

I think there isn't a consistent change in policy that's best in all the examples, but all the examples show someone who might benefit from recognizing the common dynamic that all the examples illustrate.

Comment by abramdemski on Yes Requires the Possibility of No · 2019-06-03T03:39:40.209Z · score: 4 (2 votes) · LW · GW

WRT #9, a Bayesian might want to believe X because they are in a weird decision theory problem where beliefs make things come true. This seems relatively common for humans unless they can hide their reactions well.

The issue of wanting X to happen does seem rather subtle, especially since there isn't a clean division between things you want to know about and things you might want to influence. The solution of this paradox in classical decision theory is that the agent should already know its own plans, so its beliefs already perfectly reflect any influence which it has on X. Of course, this comes from an assumption of logical omniscience. Bounded agents with logical uncertainty can't reason like that.

Comment by abramdemski on Yes Requires the Possibility of No · 2019-06-03T03:28:47.743Z · score: 4 (2 votes) · LW · GW
I feel that there is an argument to be made that when rejection danger realises you should just eat it in the face without resisting and the failure mode prominently features resisting the rejection. And on the balance if you can't withstand a no then you will not have earned the yes and should not be asking the question in the first place.

10. Jill decides to face any "yes requires the possibility of no" situation by (ahem) eating it in the face. She is frequently happy with this decision, because it forces her to face the truth in situations where she otherwise wouldn't, which makes her feel brave, and gives her more accurate information. However, she finds herself unsure whether she really wants to face the music every single time -- not because she has any concrete reasons to doubt the quality of the policy, but because she isn't sure she would be able to admit to herself if she did. Seeing the problem, she eventually stops forcing herself.

Comment by abramdemski on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-06-03T02:31:41.347Z · score: 14 (4 votes) · LW · GW

Something like a response to this post.

Does Bayes Beat Goodhart?

2019-06-03T02:31:23.417Z · score: 38 (11 votes)
Comment by abramdemski on Selection vs Control · 2019-06-02T22:10:16.433Z · score: 7 (4 votes) · LW · GW

Yeah, I agree that this seems possible, but extremely unclear. If something uses a fairly complex algorithm like FFT, is it search? How "sophisticated" can we get without using search? How can we define "search" and "sophisticated" so that the answer is "not very sophisticated"?

Comment by abramdemski on Selection vs Control · 2019-06-02T18:34:12.628Z · score: 11 (7 votes) · LW · GW

Yeah, I agree with most of what you're saying here.

  • A learned controller which isn't implementing any internal selection seems more likely to be incoherent out-of-distribution (ie lack a strong IRL interpretation of its behavior), as compared with a mesa-optimizer;
  • However, this is a low-confidence argument at present; it's very possible that coherent controllers can appear w/o necessarily having a behavioral objective which matches the original objective, in which case a version of the internal alignment problem applies. (But this might be a significantly different version of the internal alignment problem.)

I think a crux here is: to what extent are mesa-controllers with simple behavioral objectives going to be simple? The argument that mesa-optimizers can compress coherent strategies does not apply here.

Actually, I think there's an argument that certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world model; explicit representation of objective within that world model; but, optimal policy does not use any search). There is also other reason to suspect that these could survive techniques which are designed to make sure mesa-optimizers don't arise: they aren't expending a bunch of processing power on an internal search, so, you can't eliminate them with some kind of processing-power device. (Not that we know of any such device that eliminates mesa-optimizers -- but if we did, it may not help with rocket-type mesa-controllers.)

Terminology point: I like the term 'selection' for the cluster I'm pointing at, but, I keep finding myself tempted to say 'search' in an AI context. Possibly, 'search vs control' would be better terminology.

Comment by abramdemski on Risks from Learned Optimization: Introduction · 2019-06-02T07:54:36.944Z · score: 25 (7 votes) · LW · GW

I wrote something which is sort of a reply to this post (although I'm not really making a critique or any solid point about this post, just exploring some ideas which I see as related).

Selection vs Control

2019-06-02T07:01:39.626Z · score: 97 (24 votes)
Comment by abramdemski on Separation of Concerns · 2019-05-28T21:46:56.191Z · score: 5 (3 votes) · LW · GW

In the scenario I'm imagining, it doesn't apply because you don't fully realize/propagate the fact that you're filtering evidence for yourself. This is partly because the evidence-filtering strategy is smart enough to filter out evidence about its own activities, and partly just because agency is hard and you don't fully propagate everything by default.

I'm intending this mostly as an 'internal' version of "perverse anti-epistemic social pressures". There's a question of why this would exist at all (since it doesn't seem adaptive). My current guess is, some mixture of perverse anti-epistemic social pressures acting on evolutionary timelines, and (again) "agency is hard" -- it's plausible that this kind of thing emerges accidentally from otherwise useful mental architectures, and doesn't have an easy and universally applicable fix.

Comment by abramdemski on Separation of Concerns · 2019-05-24T19:30:28.193Z · score: 15 (5 votes) · LW · GW
I don't understand the OP's point at all, but

If I had to summarize: "Talking about feelings is often perceived as a failure of separation-of-concerns by people who are skilled at various other cognitive separations-of-concerns; but, it isn't necessarily. In fact, if you're really good at separation-of-concerns, you should be able to talk about feelings a lot more than otherwise. This is probably just a good thing to do, because people care about other people's feelings."

Comment by abramdemski on Separation of Concerns · 2019-05-24T19:25:09.891Z · score: 7 (4 votes) · LW · GW

In practice, a lot of things about one person's attitudes toward cooperation 'leak out' to others (as in, are moderately detectable). This includes reading things like pauses before making decisions, which means that merely thinking about an alternative can end up changing the outcome of a situation.

Comment by abramdemski on Separation of Concerns · 2019-05-24T19:21:12.438Z · score: 2 (1 votes) · LW · GW

Significant if true & applicable to other people. I'm a bit skeptical -- I sort of think it works like that, but, sort of think it would be hard to notice places where this strategy failed.

Comment by abramdemski on Separation of Concerns · 2019-05-24T19:13:24.042Z · score: 4 (2 votes) · LW · GW

I agree, it's not quite right. Signalling equilibria in which mostly 'true' signals are sent can evolve in the complete absence of a concept of truth, or even in the absence of any model-based reasoning behind the signals at all. Similarly, beliefs can manage to be mostly true without any explicit modeling of what beliefs are or a concept of truth.

What's interesting to me is how the separation of concerns emerges at all.

Moreover, there isn't an obvious decision-theoretic reason why someone might not want to think about possibilities they don't want to come true (wouldn't you want to think about such possibilities, in order to understand and steer away from them?). So, such perceived incentives are indicative of perverse anti-epistemic social pressures, e.g. a pressure to create a positive impression of how one's life is going regardless of how well it is actually going.

It does seem like it's largely that, but I'm fairly uncertain. I think there's also a self-coordination issue (due to hyperbolic discounting and other temporal inconsistencies). You might need to believe that a plan will work with very high probability in order to go through with every step rather than giving in to short-term temptations. (Though, why evolution crafted organisms which do something-like-hyperbolic-discounting rather than something more decision-theoretically sound is another question.)
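The hyperbolic-discounting point can be made concrete with a standard toy model (hypothetical numbers): discounting a reward by 1/(1 + k·t) produces preference reversals as the rewards draw near, which is exactly the temporal inconsistency that high-confidence beliefs might be papering over.

```python
def hyperbolic(value, delay, k=1.0):
    """Present value of a reward under hyperbolic discounting."""
    return value / (1 + k * delay)

# A small-soon reward vs a large-late reward, from two vantage points.
small, large = 5.0, 10.0

# Planning far in advance (delays 10 and 12): the larger, later reward wins.
assert hyperbolic(large, 12) > hyperbolic(small, 10)

# Right before the small reward arrives (delays 0 and 2): preference reverses.
assert hyperbolic(small, 0) > hyperbolic(large, 2)
```

Exponential discounting (value · d^t) never produces such reversals, since the ratio between the two options is constant in t; that's the sense in which hyperbolic discounting is decision-theoretically unsound.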

Comment by abramdemski on Separation of Concerns · 2019-05-24T19:00:16.707Z · score: 8 (5 votes) · LW · GW

In an environment with very low epistemic-instrumental separation of concerns, an especially intelligent individual actually has an incentive to insulate their own epistemics (become good at lying and acting), so that they can think. Optimizing for true beliefs then becomes highly instrumentally valuable (modulo the cost of keeping up the barrier / the downside risk if discovered).

Also, still thinking about the environment where the overarching social structure doesn't support much separation of concerns, there's still a value to be had in associating with individuals who often naively speak the truth (not the liar/actor type), because there's a lot of misaligned preferences floating around. Your incentives for bending the truth are different from another person's, so you prefer to associate with others who don't bend the truth much (especially if you internally separate concerns). So, separation of concerns seems like a convergent instrumental goal which will be bubbling under the surface even in a dysfunctional superstructure.

Of course, both effects are limited in their ability to encourage truth, and it's a little like arguing that cooperation in the prisoner's dilemma is a convergent instrumental subgoal bubbling under the surface of a defective equilibrium.

Comment by abramdemski on Separation of Concerns · 2019-05-24T18:42:04.887Z · score: 2 (1 votes) · LW · GW

Yeah, I later realized that my comment was not really addressing what you were interested in.

I read you as questioning the argument "separation of concerns, therefore, separation of epistemic vs instrumental" -- not questioning the conclusion, which is what I initially responded to.

I think separation-of-concerns just shouldn't be viewed as an argument in itself (ie, identifying some concerns which you can make a distinction between does not mean you should separate them). That conclusion rests on many other considerations.

Part of my thinking in writing the post was that humans have a relatively high degree of separation between epistemic and instrumental even without special scientific/rationalist memes. So, you can observe the phenomenon, take it as an example of separation-of-concerns, and think about why that may happen without thinking about abandoning evolved strategies.

Sort of like the question "why would an evolved species invent mathematics?" -- why would an evolved species have a concept of truth? (But, I'm somewhat conflating 'having a concept of truth' and 'having beliefs at all, which an outside observer might meaningfully apply a concept of truth to'.)

Comment by abramdemski on Separation of Concerns · 2019-05-24T03:08:33.895Z · score: 7 (4 votes) · LW · GW

I think a lot of words have been spent on this debate elsewhere, but all I feel like citing is biases against overcoming bias. The point it mentions about costs accruing mostly to you is related to the point you made about group rationality. The point about not knowing how to evaluate whether epistemic rationality is useful without developing epistemic rationality -- while it is perhaps intended as little more than a cute retort, I take it fairly seriously; it seems to apply to specific examples I encounter.

My recent view on this is mostly: but if you actually look, doesn't it seem really useful to be able to separate these concerns? Overwhelmingly so?

Separation of Concerns

2019-05-23T21:47:23.802Z · score: 70 (22 votes)
Comment by abramdemski on Pavlov Generalizes · 2019-05-19T00:18:12.294Z · score: 4 (2 votes) · LW · GW

Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner's Dilemma, so I think looking at it that way is missing something important about what's going on.
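For reference, the Pavlov strategy (win-stay, lose-shift) in the iterated Prisoner's Dilemma: repeat your last move after a good outcome, switch after a bad one. The sketch below shows the behavior vanilla RL doesn't replicate without extra structure — two Pavlov players re-coordinating on cooperation after a defection.

```python
def pavlov(my_last, their_last):
    """Win-stay, lose-shift. Moves are 'C' (cooperate) or 'D' (defect).
    If the opponent cooperated, the outcome was good: keep your move.
    If the opponent defected, the outcome was bad: switch your move."""
    if their_last == "C":
        return my_last
    return "D" if my_last == "C" else "C"

# Start from a round where player 1 defected against player 2's cooperation.
history = [("D", "C")]
for _ in range(3):
    a, b = history[-1]
    history.append((pavlov(a, b), pavlov(b, a)))
print(history)  # passes through mutual defection, then locks into (C, C)
```

After (D, C) both players defect once, both then shift to cooperation, and win-stay keeps them there — the self-correcting dynamic that simple reward-following doesn't produce on its own.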

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-25T17:08:22.140Z · score: 4 (2 votes) · LW · GW

Ah, ok. I note that it may have been intended more as a meditative practice, since the goal appears to have been reaching a state of bliss, the epistemic practice being a means to that end. Practicing doubting everything could be an interesting meditation (though it could perhaps be dangerous).

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T19:51:39.127Z · score: 2 (1 votes) · LW · GW
this procedure is called (the weak form of) Pyrrhonian skepticism

What's the strong form?

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T07:19:12.937Z · score: 10 (5 votes) · LW · GW

I think in a conversation I had with you last year, I kept going back to 'state' despite protests because I kept thinking "if AUP works, surely it would be because some of the utility functions calculate a sensible state estimate in a humanlike ontology and then define utility from this". It isn't necessarily the right way to critique AUP, but I think I was right to think those thoughts conditional on that assumption -- ie, even if it isn't the argument you're trying to make for AUP, it seems like a not-unreasonable position to consider, and so thinking about how AUP does in terms of state can be a reasonable and important part of a thought-process assessing AUP. I believe I stopped making the assumption outright at some point, but kept bringing out the assumption as a tool for analysis -- for example, supporting a thought experiment with the argument that there would at least be some utility functions which thought about the external world enough to care about such-and-such. I think in our conversation I managed to appropriately flag these sorts of assumptions such that you were OK with the role they were playing in the wider argument (well... not in the sense of necessarily accepting the arguments, but in the sense of not thinking I was just repeatedly making the mistake of thinking it has to be about state, I think).

Other people could be thinking along similar lines without flagging it so clearly.

Comment by abramdemski on Best reasons for pessimism about impact of impact measures? · 2019-04-23T01:05:24.869Z · score: 23 (6 votes) · LW · GW
  • Giving people a slider with "safety" written on one end and "capability" written on the other, and then trying to get people to set it close enough to the "safety" end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
    • An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of the slider that best navigates the trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms oriented toward this.
    • Even better (but similarly), an approach where capability and alignment go hand in hand would be ideal -- a way to directly optimize for "what I mean, not what I say", such that it is obvious that things are just worse if you depart from this.
    • However, maybe those things are just pipe dreams -- this should not be the fundamental reason to ignore impact measures, unless promising approaches in the other two categories are pointed out; and even then, impact measures as a backup plan would still seem desirable.
      • My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they seem better in terms of the objection which follows.
      • Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure.)
        • In other words, conditioned on having a story more like "this is how you get the most of what you want" rather than a slider reading "safety ------- capability", I more expect to see a mild optimizer as opposed to an impact measure.
  • Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn't exactly what we want.
    • It is apparent that some attempted impact measures run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem when a lot of optimization pressure is applied. This gives reason for concern even if you can't spot a concrete problem with a given impact measure: impact measures don't address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
    • If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
      • (It is a desirable feature of safety measures, that those safety measures do not risk decreasing alignment.)
    • On the other hand, a mild optimizer seems to get the spirit of what's wanted from low-impact.
      • This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
  • Both mild optimization and impact measures require separate approaches to "doing what people want".
    • Arguably this is OK, because they could greatly reduce the bar for alignment of specified utility functions. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of "do what I want" before we can be confident even in low-impact or mild-optimization approaches, because it is difficult to confidently say that an approach avoids the risk of hugely violating your preferences while we remain so confused about what human preference even is.
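To make the contrast in the list above concrete, here is a toy sketch (all names and numbers are illustrative, not drawn from any real alignment codebase) of the two shapes of proposal being compared: an impact-measure approach applies full optimization pressure to a penalized objective, while a quantilizer, the canonical mild-optimization proposal, limits the optimization pressure itself by sampling from the top q fraction (by base-distribution mass) of a trusted base distribution over actions.

```python
import random

def penalized_argmax(actions, utility, impact, lam):
    """Impact-measure shape: maximize U(a) - lam * Impact(a).

    Full optimization pressure is still applied, now to the penalized
    objective -- so a slightly misspecified penalty can itself be
    Goodharted."""
    return max(actions, key=lambda a: utility(a) - lam * impact(a))

def quantilize(actions, base_weights, utility, q):
    """Mild-optimization shape: a q-quantilizer.

    Rank actions by utility, keep the top q fraction of the base
    distribution's probability mass, and sample from that set in
    proportion to base weight -- bounding how far behavior can depart
    from the trusted base distribution."""
    total = sum(base_weights)
    ranked = sorted(zip(actions, base_weights),
                    key=lambda aw: utility(aw[0]), reverse=True)
    kept, mass = [], 0.0
    for a, w in ranked:
        kept.append((a, w))
        mass += w
        if mass >= q * total:
            break
    acts, weights = zip(*kept)
    return random.choices(acts, weights=list(weights))[0]
```

With a uniform base distribution over ten actions and utility equal to the action index, `quantilize(..., q=0.2)` returns one of the top two actions; the penalized maximizer instead deterministically picks whichever action maximizes the penalized objective. The point of the sketch is only the structural difference: the penalty reshapes *what* is optimized, while quantilization caps *how hard* it is optimized.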

Alignment Research Field Guide

2019-03-08T19:57:05.658Z · score: 194 (66 votes)

Pavlov Generalizes

2019-02-20T09:03:11.437Z · score: 68 (20 votes)

What are the components of intellectual honesty?

2019-01-15T20:00:09.144Z · score: 32 (8 votes)


2019-01-13T23:46:10.866Z · score: 42 (11 votes)

When is CDT Dutch-Bookable?

2019-01-13T18:54:12.070Z · score: 25 (4 votes)

CDT Dutch Book

2019-01-13T00:10:07.941Z · score: 27 (8 votes)

Non-Consequentialist Cooperation?

2019-01-11T09:15:36.875Z · score: 42 (14 votes)

Combat vs Nurture & Meta-Contrarianism

2019-01-10T23:17:58.703Z · score: 55 (16 votes)

What makes people intellectually active?

2018-12-29T22:29:33.943Z · score: 78 (32 votes)

Embedded Agency (full-text version)

2018-11-15T19:49:29.455Z · score: 88 (33 votes)

Embedded Curiosities

2018-11-08T14:19:32.546Z · score: 76 (27 votes)

Subsystem Alignment

2018-11-06T16:16:45.656Z · score: 115 (36 votes)

Robust Delegation

2018-11-04T16:38:38.750Z · score: 109 (36 votes)

Embedded World-Models

2018-11-02T16:07:20.946Z · score: 80 (25 votes)

Decision Theory

2018-10-31T18:41:58.230Z · score: 84 (29 votes)

Embedded Agents

2018-10-29T19:53:02.064Z · score: 151 (62 votes)

A Rationality Condition for CDT Is That It Equal EDT (Part 2)

2018-10-09T05:41:25.282Z · score: 17 (6 votes)

A Rationality Condition for CDT Is That It Equal EDT (Part 1)

2018-10-04T04:32:49.483Z · score: 21 (7 votes)

In Logical Time, All Games are Iterated Games

2018-09-20T02:01:07.205Z · score: 83 (26 votes)

Track-Back Meditation

2018-09-11T10:31:53.354Z · score: 57 (21 votes)

Exorcizing the Speed Prior?

2018-07-22T06:45:34.980Z · score: 11 (4 votes)

Stable Pointers to Value III: Recursive Quantilization

2018-07-21T08:06:32.287Z · score: 17 (7 votes)

Probability is Real, and Value is Complex

2018-07-20T05:24:49.996Z · score: 44 (20 votes)

Complete Class: Consequentialist Foundations

2018-07-11T01:57:14.054Z · score: 43 (16 votes)

Policy Approval

2018-06-30T00:24:25.269Z · score: 47 (15 votes)

Machine Learning Analogy for Meditation (illustrated)

2018-06-28T22:51:29.994Z · score: 93 (35 votes)

Confusions Concerning Pre-Rationality

2018-05-23T00:01:39.519Z · score: 36 (7 votes)


2018-05-21T21:10:57.290Z · score: 90 (24 votes)

Bayes' Law is About Multiple Hypothesis Testing

2018-05-04T05:31:23.024Z · score: 81 (20 votes)

Words, Locally Defined

2018-05-03T23:26:31.203Z · score: 50 (15 votes)

Hufflepuff Cynicism on Hypocrisy

2018-03-29T21:01:29.179Z · score: 33 (17 votes)

Learn Bayes Nets!

2018-03-27T22:00:11.632Z · score: 84 (24 votes)

An Untrollable Mathematician Illustrated

2018-03-20T00:00:00.000Z · score: 262 (91 votes)

Explanation vs Rationalization

2018-02-22T23:46:48.377Z · score: 31 (8 votes)

The map has gears. They don't always turn.

2018-02-22T20:16:13.095Z · score: 54 (14 votes)

Toward a New Technical Explanation of Technical Explanation

2018-02-16T00:44:29.274Z · score: 130 (46 votes)

Two Types of Updatelessness

2018-02-15T20:19:54.575Z · score: 45 (12 votes)

Hufflepuff Cynicism on Crocker's Rule

2018-02-14T00:52:37.065Z · score: 36 (12 votes)

Hufflepuff Cynicism

2018-02-13T02:15:50.945Z · score: 43 (16 votes)

Stable Pointers to Value II: Environmental Goals

2018-02-09T06:03:00.244Z · score: 27 (8 votes)
