## Posts

Lessons I've Learned from Self-Teaching 2021-01-23T19:00:55.559Z
Review of 'Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More' 2021-01-12T03:57:06.655Z
Review of 'But exactly how complex and fragile?' 2021-01-06T18:39:03.521Z
Collider bias as a cognitive blindspot? 2020-12-30T02:39:35.700Z
2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs 2020-12-23T17:16:10.174Z
Avoiding Side Effects in Complex Environments 2020-12-12T00:34:54.126Z
Is it safe to spend time with people who already recovered from COVID? 2020-12-02T22:06:13.469Z
Non-Obstruction: A Simple Concept Motivating Corrigibility 2020-11-21T19:35:40.445Z
Math That Clicks: Look for Two-Way Correspondences 2020-10-02T01:22:18.177Z
Power as Easily Exploitable Opportunities 2020-08-01T02:14:27.474Z
Generalizing the Power-Seeking Theorems 2020-07-27T00:28:25.677Z
GPT-3 Gems 2020-07-23T00:46:36.815Z
To what extent is GPT-3 capable of reasoning? 2020-07-20T17:10:50.265Z
What counts as defection? 2020-07-12T22:03:39.261Z
Corrigibility as outside view 2020-05-08T21:56:17.548Z
How should potential AI alignment researchers gauge whether the field is right for them? 2020-05-06T12:24:31.022Z
Insights from Euclid's 'Elements' 2020-05-04T15:45:30.711Z
Problem relaxation as a tactic 2020-04-22T23:44:42.398Z
A Kernel of Truth: Insights from 'A Friendly Approach to Functional Analysis' 2020-04-04T03:38:56.537Z
Research on repurposing filter products for masks? 2020-04-03T16:32:21.436Z
ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z
On Being Robust 2020-01-10T03:51:28.185Z
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z
Seeking Power is Often Robustly Instrumental in MDPs 2019-12-05T02:33:34.321Z
How I do research 2019-11-19T20:31:16.832Z
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z
The Gears of Impact 2019-10-07T14:44:51.212Z
World State is the Wrong Abstraction for Impact 2019-10-01T21:03:40.153Z
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z
Deducing Impact 2019-09-24T21:14:43.177Z
Value Impact 2019-09-23T00:47:12.991Z
Reframing Impact 2019-09-20T19:03:27.898Z
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-25T17:28:03.656Z · LW · GW

A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.

I'd considered 'attractive instrumentality' a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of 'attractive' isn't 'having attractor-like properties.'

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-25T17:26:16.056Z · LW · GW

But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions 'robustly instrumental.'
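
To make "high probability of being optimal across reward functions" concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the thread: the four-state deterministic environment `TRANSITIONS`, the discount `GAMMA`, state-based rewards drawn i.i.d. uniformly from [0, 1], and the helper names. It holds the environment fixed, samples reward functions, and estimates how often each action available at a state is optimal.

```python
import random

# Hypothetical deterministic environment: TRANSITIONS[state][action] -> next state.
# From state 0, "left" leads to a dead end (state 1 only loops), while "right"
# leads to state 2, which still has two options available.
TRANSITIONS = {
    0: {"left": 1, "right": 2},
    1: {"stay": 1},
    2: {"stay": 2, "go": 3},
    3: {"stay": 3},
}
GAMMA = 0.9  # assumed discount rate


def optimal_values(reward, sweeps=100):
    """Value iteration for a state-based reward function (dict: state -> reward)."""
    values = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        values = {
            s: reward[s] + GAMMA * max(values[n] for n in TRANSITIONS[s].values())
            for s in TRANSITIONS
        }
    return values


def optimality_probability(state, samples=2000, seed=0):
    """Estimate, for each action at `state`, the fraction of sampled reward
    functions under which that action is optimal."""
    rng = random.Random(seed)
    counts = {action: 0 for action in TRANSITIONS[state]}
    for _ in range(samples):
        reward = {s: rng.random() for s in TRANSITIONS}
        values = optimal_values(reward)
        # The reward at `state` is the same for every action, so the optimal
        # action is the one whose successor state has the highest optimal value.
        best = max(TRANSITIONS[state], key=lambda a: values[TRANSITIONS[state][a]])
        counts[best] += 1
    return {action: count / samples for action, count in counts.items()}


if __name__ == "__main__":
    print(optimality_probability(0))
```

In this toy environment, going `right` from state 0 keeps more options open, and it should come out optimal for a majority of the sampled reward functions. A different environment or reward distribution could of course give a different answer.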

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-23T02:00:15.468Z · LW · GW

The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

One possibility is that we have to individuate these "instrumental convergence"-adjacent theses using different terminology. I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

However, it doesn't make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about 'instrumentality' from the system's perspective. e.g. I'm not sure it makes sense to say 'edge detectors are robustly instrumental for this network structure on this dataset after X epochs'.

(These are early thoughts; I wanted to get them out, and may revise them later or add another comment)

EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find 'robust instrumentality' to be better as an informal handle, but its formal operationalization seems better for precise thinking.

Comment by turntrout on Formal Solution to the Inner Alignment Problem · 2021-02-19T02:40:47.190Z · LW · GW

civilization that realizes it's in a civilization

"in a simulation", no?

Comment by turntrout on Covid: CDC Issues New Guidance on Opening Schools · 2021-02-18T16:08:19.246Z · LW · GW

One thing I haven't seen worried about as much is: are children going to suffer from long COVID? After it became clear that children were basically safe from severe COVID, long COVID became my next concern. If I had children, I certainly wouldn't want them suffering e.g. long-term fatigue... But I don't recall reading much about this question in your posts.

Without digging into the details right now, a quick Google returns Evidence that long COVID affects children:

The average period from diagnosis to evaluation was ~163 days. Of the group, ~42% had a complete recovery. Within the group, 53% of children were reported to have one or more symptom 120 or more days after diagnosis, fitting the diagnosis of Long COVID. Strikingly, 36% of them had one or two symptoms at the time of evaluation, and 23% three or more symptoms.

(I haven't checked this study's methodology, or done a broader lit review yet)

Comment by turntrout on TurnTrout's shortform feed · 2021-02-17T16:53:49.338Z · LW · GW

I went into a local dentist's office to get more prescription toothpaste; I was wearing my 3M p100 mask (with a surgical mask taped over the exhaust, in order to protect other people in addition to the native exhaust filtering offered by the mask). When I got in, the receptionist was on the phone. I realized it would be more sensible for me to wait outside and come back in, but I felt a strange reluctance to do so. It would be weird and awkward to leave right after entering. I hovered near the door for about 5 seconds before actually leaving. I was pretty proud that I was able to override my naive social instincts in a situation where they really didn't make sense (I will never see that person again, and they would probably see my minimizing shared air as courteous anyways), to both my benefit and the receptionist's.

Also, p100 masks are amazing! When I got home, I used hand sanitizer. I held my sanitized hands right up to the mask filters, but I couldn't even sniff a trace of the alcohol. When the mask came off, the alcohol slammed into my nose immediately.

Comment by turntrout on The feeling of breaking an Overton window · 2021-02-17T15:29:50.545Z · LW · GW

I don't know what you mean to imply by "cheating" (I had expected 'technical truths which are socially innocuous'), but some of these seem downright bad and wrong to say to someone.

Comment by turntrout on “PR” is corrosive; “reputation” is not. · 2021-02-16T02:54:17.140Z · LW · GW

Without taking a position on this dispute, I'd like to note that I've had a similar conversation with Zack ( / Said).

Comment by turntrout on Insights from Euclid's 'Elements' · 2021-02-15T03:47:19.581Z · LW · GW

Yes, this is correct; this phrasing was misleading. IMO, the most succinct formally correct characterization of similarities is:

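As a map $f : \mathbb{R}^n \to \mathbb{R}^n$, a similarity of ratio $r$ takes the form $f(x) = rAx + t$, where $A \in O_n(\mathbb{R})$ is an $n \times n$ orthogonal matrix and $t \in \mathbb{R}^n$ is a translation vector.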

The only difference compared to congruence is that congruence requires $r = 1$.
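
As a quick numerical check of this characterization, here is a small Python sketch restricted to the plane, where the orthogonal matrix is taken to be a rotation by an angle `theta`. The function names and the sample numbers are illustrative assumptions. It shows that such a map scales every distance by the ratio, and that ratio 1 gives a distance-preserving map (a congruence).

```python
import math


def similarity(ratio, theta, translation):
    """Return the plane map x -> ratio * A x + translation, where A is the
    rotation by `theta` (a 2 x 2 orthogonal matrix)."""
    c, s = math.cos(theta), math.sin(theta)

    def apply(point):
        x, y = point
        return (
            ratio * (c * x - s * y) + translation[0],
            ratio * (s * x + c * y) + translation[1],
        )

    return apply


def distance(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


if __name__ == "__main__":
    p, q = (1.0, 2.0), (-3.0, 0.5)
    scale_by_two = similarity(2.0, 0.7, (3.0, -1.0))
    congruence = similarity(1.0, 0.7, (3.0, -1.0))
    # Distances are multiplied by the ratio; with ratio 1 they are unchanged.
    print(distance(scale_by_two(p), scale_by_two(q)) / distance(p, q))
    print(distance(congruence(p), congruence(q)) / distance(p, q))
```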

Comment by turntrout on What does the FDA actually do between getting the trial results and having their meeting? · 2021-02-13T16:05:35.522Z · LW · GW

I agree. I’m saying that if, hypothetically, this were difficult to check and the FDA couldn’t have checked the intermediate data and if... etc, then you could still hire more people.

Comment by turntrout on What does the FDA actually do between getting the trial results and having their meeting? · 2021-02-13T00:50:47.882Z · LW · GW

If a lot of the work is making sure that the stats / methods check out, that's a local validity issue that scales with more people, right?

Comment by turntrout on What does the FDA actually do between getting the trial results and having their meeting? · 2021-02-12T23:45:45.501Z · LW · GW

even if it were rocket science, you could always just hire more rocket scientists to get it done more quickly?

Comment by turntrout on Deducing Impact · 2021-02-12T20:04:59.411Z · LW · GW

The spoiler seems to be empty?

Comment by turntrout on TurnTrout's shortform feed · 2021-02-12T15:03:10.670Z · LW · GW

What kind of reasoning would have allowed me to see MySpace in 2004, and then hypothesize the current craziness as a plausible endpoint of social media? Is this problem easier or harder than the problem of 15-20 year AI forecasting?

Comment by turntrout on TurnTrout's shortform feed · 2021-02-11T17:38:27.469Z · LW · GW

In an interesting parallel to John Wentworth's Fixing the Good Regulator Theorem, I have an MDP result that says:

Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for $|\mathcal{S}|$ reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).

Roughly: being able to complete linearly many tasks in the state space means you have enough information to model the entire environment.
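
The following Python sketch illustrates the flavor of this claim in a much simpler setting than the result itself. It is not the proof, and every specific in it is an assumption made for illustration: deterministic dynamics given as successor sets in `SUCCESSORS`, state-based rewards, the discount `GAMMA`, one indicator reward function per state, and the helper names. Under those assumptions, the optimal value functions for the indicator rewards are enough to read off which states are one step away from which.

```python
import math

GAMMA = 0.9  # assumed discount rate

# Hypothetical deterministic environment: state -> set of reachable next states.
SUCCESSORS = {"a": {"b", "c"}, "b": {"b"}, "c": {"a", "c"}}


def optimal_values(successors, reward, sweeps=400):
    """Value iteration for a state-based reward function."""
    values = {s: 0.0 for s in successors}
    for _ in range(sweeps):
        values = {
            s: reward[s] + GAMMA * max(values[n] for n in successors[s])
            for s in successors
        }
    return values


def indicator_value_functions(successors):
    """One optimal value function per state t, for the reward 'one at t, zero elsewhere'."""
    return {
        t: optimal_values(successors, {s: float(s == t) for s in successors})
        for t in successors
    }


def reconstruct(values):
    """Recover the successor sets from the indicator value functions alone."""
    states = list(values)
    recovered = {s: set() for s in states}
    for t in states:
        v = values[t]
        for s in states:
            if s == t:
                # t can step straight back to itself iff it collects reward every step.
                if math.isclose(v[t], 1 / (1 - GAMMA), abs_tol=1e-9):
                    recovered[s].add(t)
            elif math.isclose(v[s], GAMMA * v[t], abs_tol=1e-9):
                # s is exactly one step from t iff its value is one discount below t's.
                recovered[s].add(t)
    return recovered


if __name__ == "__main__":
    print(reconstruct(indicator_value_functions(SUCCESSORS)) == SUCCESSORS)
```

The reconstruction only recovers successor sets, which is all a deterministic environment contains once action labels are ignored. Stochastic dynamics would need a different argument.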

Comment by turntrout on Fixing The Good Regulator Theorem · 2021-02-11T16:33:56.351Z · LW · GW

Later information can “choose many different games” - specifically, whenever the posterior distribution of system-state $S$ given two possible $X$ values is different, there must be at least one $Y$ value under which optimal play differs for the two $X$ values.

Given your four conditions, I wonder if there's a result like "optimally power-seeking agents (minimizing information costs) must model the world." That is, I think power is about being able to achieve a wide range of different goals (to win at 'many different games' the environment could ask of you), and so if you want to be able to sufficiently accurately estimate the expected power provided by a course of action, you have to know how well you can win at all these different games.

Comment by turntrout on We got what's needed for COVID-19 vaccination completely wrong · 2021-02-10T20:17:17.442Z · LW · GW

I just read Christian's post and I don't see what 'comes close to health fraud.' Please explain?

Comment by turntrout on Fixing The Good Regulator Theorem · 2021-02-10T19:32:13.934Z · LW · GW

Okay, I agree that if you remove their determinism & full observability assumption (as you did in the post), it seems like your construction should work.

I still think that the original paper seems awful (because it's their responsibility to justify choices like this in order to explain how their result captures the intuitive meaning of a 'good regulator').

Comment by turntrout on Fixing The Good Regulator Theorem · 2021-02-10T18:45:04.257Z · LW · GW

Status: strong opinions, weakly held. not a control theorist; not only ready to eat my words, but I've already set the table.

As I understand it, the original good regulator theorem seems even dumber than you point out.

First, the original paper doesn't make sense to me. Not surprising, old papers are often like that, and I don't know any control theory... but here's John Baez also getting stuck, giving up, and deriving his own version of what he imagines the theorem should say:

when I tried to read Conant and Ashby’s paper, I got stuck. They use some very basic mathematical notation in nonstandard ways, and they don’t clearly state the hypotheses and conclusion of their theorem...

However, I have a guess about the essential core of Conant and Ashby’s theorem. So, I’ll state that, and then say more about their setup.

Needless to say, I looked around to see if someone else had already done the work of figuring out what Conant and Ashby were saying...

As pointed out by the authors of [3], the importance and generality of this theorem in control theory makes it comparable in importance to Einstein's $E = mc^2$ for physics. However, as John C. Baez carefully argues in a blog post titled The Internal Model Principle it's not clear that Conant and Ashby's paper demonstrates what it sets out to prove. I'd like to add that many other researchers, besides myself, share John C. Baez' perspective.

Hello?? Isn't this one of the fundamental results of control theory? Where's a good proof of it? It's been cited 1,317 times and confidently brandished to make sweeping claims about how to e.g. organize society or make an ethical superintelligence.

It seems plausible that people just read the confident title (Every good regulator of a system must be a model of that system - of course the paper proves the claim in its title...), saw the citation count / assumed other people had checked it out (yay information cascades!), and figured it must be correct...

### Motte and bailey

The paper starts off by introducing the components of the regulatory system:

OK, so we're going to be talking about how regulators which ensure good outcomes also model their environments, right? Sounds good.

Wait...

Later...

We're talking about the entire outcome space $Z$ again. In the introduction we focused on regulators ensuring 'good' states, but we immediately gave that up to talk about entropy $H(Z)$.

Why does this matter? Well...

### The original theorem seems even dumber than John points out

John writes:

Also, though I don’t consider it a “problem” so much as a choice which I think most people here will find more familiar:

• The theorem uses entropy-minimization as its notion of optimality, rather than expected-utility-maximization

I suppose my intuition is that this is actually a significant problem.

At first glance, Good Regulator seems to basically prove something like 'there's a deterministic optimal policy wrt the observations', but even that's too generous - it proves that there's a deterministic way to minimize outcome entropy. But what does that guarantee us - how do we know that's a 'good regulator'? Like, imagine an environment with a strong "attractor" outcome, like the streets being full of garbage. The regulator can try to push against this, but they can't always succeed due to the influence of random latent variables (this cuts against the determinism assumption, but you note that this can be rectified by reincluding ). However, by sitting back, they can ensure that the streets are full of garbage.

The regulator does so, minimizes the entropy over the unconditional outcome distribution, and is christened a 'good regulator' which has built a 'model' of the environment. In reality, we have a deterministic regulator which does nothing, and our streets are full of trash.
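
A toy calculation makes the gap between the two notions of optimality explicit. The Python sketch below uses made-up numbers: the two policies, their outcome probabilities, and the utility assignment are all illustrative assumptions. It compares the policy chosen by minimizing outcome entropy with the one chosen by maximizing expected utility.

```python
import math

# Hypothetical outcome distributions for two regulator policies.
# "push_back" fights the garbage attractor and sometimes fails;
# "sit_back" does nothing, so the streets reliably fill with garbage.
OUTCOME_DISTRIBUTIONS = {
    "push_back": {"clean": 0.7, "garbage": 0.3},
    "sit_back": {"clean": 0.0, "garbage": 1.0},
}
UTILITY = {"clean": 1.0, "garbage": 0.0}  # assumed preferences over outcomes


def entropy_bits(distribution):
    """Shannon entropy of an outcome distribution, in bits."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def expected_utility(distribution):
    """Expected utility of an outcome distribution under UTILITY."""
    return sum(p * UTILITY[outcome] for outcome, p in distribution.items())


entropy_minimizer = min(
    OUTCOME_DISTRIBUTIONS, key=lambda name: entropy_bits(OUTCOME_DISTRIBUTIONS[name])
)
utility_maximizer = max(
    OUTCOME_DISTRIBUTIONS, key=lambda name: expected_utility(OUTCOME_DISTRIBUTIONS[name])
)

if __name__ == "__main__":
    print("entropy minimizer:", entropy_minimizer)
    print("utility maximizer:", utility_maximizer)
```

With these numbers the entropy criterion selects `sit_back`, since a certain outcome has zero entropy, while the expected-utility criterion selects `push_back`.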

Now, I think it's possible I've misunderstood, so I'd appreciate correction if I have. But if I haven't, and if no control theorists have in fact repaired and expanded this theorem before John's post...

If that's true, what the heck happened? Control theorists just left a $100 bill on the ground for decades? A quick !scholar search doesn't reveal any good critiques.

Comment by turntrout on Open & Welcome Thread – February 2021 · 2021-02-10T04:02:34.878Z · LW · GW

The intercom seems to be down right now, but here's a bug: images can show up as very, very large in private message conversations:

Comment by turntrout on Fixing The Good Regulator Theorem · 2021-02-10T03:53:07.210Z · LW · GW

the regulator could just be the identity function: it takes in $S$ and returns $S$. This does not sound like a “model”.

What is the type signature of the regulator? It's a policy on state space $S$, and it returns states as well? Are those its "actions"? (From the point of view of the causal graph, I suppose $Z$ just depends on whatever the regulator outputs, and the true state $S$, so maybe it's not important what the regulator outputs. Just that by the original account, any deterministic regulator could be "optimal", even if it doesn't do meaningful computation.)

Comment by turntrout on Promoting Prediction Markets With Meaningless Internet-Point Badges · 2021-02-09T16:29:00.874Z · LW · GW

Buying status: pay people to make dumb bets against you. The Metaculus equivalent of buying Likes or Amazon reviews. On priors, if Amazon can't squash this problem, it probably can't be squashed.

Note that this could be mitigated by other people being able to profit off of obvious epistemic inefficiencies in the prediction markets: if your bots drive the community credence down super far, and if other people notice this, then other people might come in and correct part of the issue. This would reduce your advantage relative to other Metaculites.

Comment by turntrout on How do you optimize productivity with respect to your menstrual cycle? · 2021-02-09T00:01:28.254Z · LW · GW

This may seem like a gross or weirdly personal question but I think it's actually quite important.

I'd like to express social approval of this kind of question going on this site. I see no reason why discussing the menstrual cycle should be any more taboo than discussing the REM cycle.

Comment by turntrout on abramdemski's Shortform · 2021-02-07T02:33:28.836Z · LW · GW

Second on reddit being net-negative. Would recommend avoiding before it gets hooks in your brain.

Comment by turntrout on 2019 Review: Voting Results! · 2021-02-05T19:33:38.234Z · LW · GW

From my review:

I think that this debate suffers for lack of formal grounding, and I wouldn't dream of introducing someone to these concepts via this debate. While the debate is clearly historically important, I don't think it belongs in the LessWrong review. I don't think people significantly changed their minds, I don't think that the debate was particularly illuminating, and I don't think it contains the philosophical insight I would expect from a LessWrong review-level essay.

I am surprised that no one gave it a positive vote.

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-05T00:16:05.085Z · LW · GW

I guess maybe now I'm making the present tense claim [that we have good justification for making the "vast majority" claim]?

I mean, on a very skeptical prior, I don't think we have good enough justification to believe it's more probable than not that take-over-the-world behavior will be robustly incentivized for the actual TAI we build, but I think we have somewhat more evidence for the 'vast majority' claim than we did before. (And I agree with a point I expect Richard to make, which is that the power-seeking theorems apply for optimal agents, which may not look much at all like trained agents)

I also wrote about this (and received a response from Ben Garfinkel) about half a year ago.

Comment by turntrout on Non-Obstruction: A Simple Concept Motivating Corrigibility · 2021-02-05T00:09:56.027Z · LW · GW

Thanks for leaving this comment.

I think this kind of counterfactual is interesting as a thought experiment, but not really relevant to conceptual analysis using this framework. I suppose I should have explained more clearly that the off-state counterfactual was meant to be interpreted with a bit of reasonableness, like "what would we reasonably do if we, the designers, tried to achieve goals using our own power?". To avoid issues of probable civilizational extinction by some other means soon after without the AI's help, just imagine that you time-box the counterfactual goal pursuit to, say, a month.

I can easily imagine what my (subjective) attainable utility would be if I just tried to do things on my own, without the AI's help. In this counterfactual, I'm not really tempted to switch on similar non-obstructionist AIs. It's this kind of counterfactual that I usually consider for AU landscape-style analysis, because I think it's a useful way to reason about how the world is changing.

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-04T18:54:54.561Z · LW · GW

One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we're actually likely to train. But this isn't a component of my distinction - in both cases I'm talking about policies which actually arise from training.

Right - I was pointing at the similarity in that both of our distinctions involve some aspect of training, which breaks from the tradition of not really considering training's influence on robust instrumentality. "Quite similar" was poor phrasing on my part, because I agree that our two distinctions are materially different.

On terminology, would you prefer the "training goal convergence thesis"?

I think that "training goal convergence thesis" is way better, and I like how it accommodates dual meanings: the "goal" may be an instrumental or a final goal.

I think "robust" is just as misleading a term as "convergence", in that neither are usually defined in terms of what happens when you train in many different environments.

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

And so, given switching costs, I think it's fine to keep talking about instrumental convergence.

I agree that switching costs are important to consider. However, I've recently started caring more about establishing and promoting clear nomenclature, both for the purposes of communication and for clearer personal thinking.

My model of the 'instrumental convergence' situation is something like:

• The switching costs are primarily sensitive to how firmly established the old name is, to how widely used the old name is, and the number of "entities" which would have to adopt the new name.
• I think that if researchers generally agreed that 'robust instrumentality' is a clearer name[1] and used it to talk about the concept, the shift would naturally propagate through AI alignment circles and be complete within a year or two. This is just my gut sense, though.
• The switch from "optimization daemons" to "mesa-optimizers" seemed to go pretty well
• But 'optimization daemons' didn't have a wikipedia page yet (unlike 'instrumental convergence')

Of course, all of this is conditional on your agreeing that 'robust instrumentality' is in fact a better name; if you disagree, I'm interested in hearing why.[2] But if you agree, I think that the switch would probably happen if people are willing to absorb a small communicational overhead for a while as the meme propagates.

(And I do think it's small - I talk about robust instrumentality all the time, and it really doesn't take long to explain the switch)

On the bright side, I think the situation for 'instrumental convergence / robust instrumentality' is better than the one for 'corrigibility', where we have a single handle for wildly different concepts!

[1] A clearer name - once explained to the reader, at least; 'robust instrumentality' unfortunately isn't as transparent as 'factored cognition hypothesis.'

[2] Especially before the 2019 LW review book is published, as it seems probable that Seeking Power is Often Robustly Instrumental in MDPs will be included. I am ready to be convinced that there exists an even better name than 'robust instrumentality' and to rework my writing accordingly.

Comment by turntrout on Distinguishing claims about training vs deployment · 2021-02-04T00:00:05.336Z · LW · GW

Training convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behaviour aimed towards certain convergent goals.

I think this is important and I've been thinking about it for a while (in fact, it seems quite similar to a distinction I made in a comment on your myopic training post). I'm glad to see a post giving this a crisp handle.

But I think that the 'training convergence thesis' is a bad name, and I hope it doesn't stick (just as I'm pushing to move away from 'instrumental convergence' towards 'robust instrumentality'). There are many things which may converge over the course of training; although it's clear to us in the context of this post, to an outsider, it's not that clear what 'training convergence' refers to.

Furthermore, 'convergence' in the training context may imply that these instrumental incentives tend to stick in the limit of training, which may not be true and distracts from the substance of the claim.

Perhaps "robust instrumentality thesis (training)" (versus "robust instrumentality thesis (optimality)" or "robust finality thesis (training)")?

Fragility of value

I like this decomposition as well. I recently wrote about fragility of value from a similar perspective, although I think fragility of value extends beyond AI alignment (you may already agree with that).

Comment by turntrout on A Critique of Non-Obstruction · 2021-02-03T21:15:46.439Z · LW · GW

Once the AI's bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me

Er - non-obstruction is a conceptual frame for understanding the benefits we want from corrigibility. It is not a constraint under which the AI finds a high-scoring policy. It is not an approach to solving the alignment problem any more than Kepler's laws are an approach for going to the moon.

Generally, broad non-obstruction seems to be at least as good as literal corrigibility. In my mind, the point of corrigibility is that we become more able to wield and amplify our influence through the AI. If pol(P) sucks, even if the AI is literally corrigible, we still won't reach good outcomes. I don't see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility in the real world, where pol is pretty reasonable for the relevant goals.

the "give money then shut off" only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers.

I agree it's possible for pol to shoot itself in the foot, but I was trying to give an example situation. I was not claiming that for every possible pol, giving money is non-obstructive against P and -P. I feel like that misses the point, and I don't see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility.

The point of all this analysis is to think about why we want corrigibility in the real world, and whether there's a generalized version of that desideratum. To remark that there exists an AI policy/pol pair which induces narrow non-obstruction, or which doesn't empower pol a whole lot, or which makes silly tradeoffs... I guess I just don't see the relevance of that for thinking about the alignment properties of a given AI system in the real world.

Comment by turntrout on A Critique of Non-Obstruction · 2021-02-03T18:15:25.929Z · LW · GW

As an author, I'm always excited to see posts like this - thanks for writing this up!

I think there are a couple of important points here, and also a couple of apparent misunderstandings. I'm not sure I understood all of your points, so let me know if I missed something.

Here are your claims:

1. Non-obstruction seems to be useful where our AU landscape is pretty flat by default.
2. Our AU landscape is probably spikey by default.
3. Non-obstruction locks in default spike-tops in S, since it can only make Pareto improvements. (modulo an epsilon here or there)
4. Locking in spike-tops is better than nothing, but we can, and should, do better.

I disagree with #2. In an appropriately analyzed multi-agent system, an individual will be better at some things, and worse at other things. Obviously, strategy-stealing is an important factor here. But in the main, I think that strategy-stealing will hold well enough for this analysis, and that the human policy function can counterfactually find reasonable ways to pursue different goals, and so it won't be overwhelmingly spiky. This isn't a crux for me, though.

I agree with #3 and #4. The AU landscape implies a partial ordering over AI designs, and non-obstruction just demands that you do better than a certain baseline (to be precise: that the AI be greater than a join over various AIs which mediocrely optimize a fixed goal).

There are many ways to do better than the green line (the human AU landscape without the AI); I think one of the simplest is just to be broadly empowering.

Let me get into some specifics where we might disagree / there might be a misunderstanding. In response to Adam, you wrote:

Oh it's possible to add up a load of spikes, many of which hit the wrong target, but miraculously cancel out to produce a flat landscape. It's just hugely unlikely. To expect this would seem silly.

We aren't adding or averaging anything when computing the AU landscape. Each setting of the independent variable (the set of goals we might optimize) induces a counterfactual where we condition our policy function on the relevant goal, and then follow the policy from that state. The dependent variable is the value we achieve for that goal.

Importantly (and you may or may not understand this, it isn't clear to me), the AU landscape is not the value of "the" outcome we would achieve "by default" without turning the AI on. We don't achieve "flat" AU landscapes by finding a wishy-washy outcome which isn't too optimized for anything in particular. We counterfact on different goals, see how much value we could achieve without the AI if we tried our hardest for each counterfactual goal, and then each value corresponds to a point on the green line.

(You can see how this is amazingly hard to properly compute, and therefore why I'm not advocating non-obstruction as an actual policy selection criterion. But I see non-obstruction as a conceptual frame for understanding alignment, not as a formal alignment strategy, and so it's fine.)

Furthermore, I claim that it's in-principle-possible to design AIs which empower you (and thereby don't obstruct you) for payoff functions $P$ and $-P$. The AI just gives you a lot of money and shuts off.

Let's reconsider your very nice graph.

I don't know whether this graph was plotted with the understanding of how the counterfactuals are computed, or not, so let me know.

Anyways, I think a more potent objection to the "this AI not being activated" baseline is "well what if, when you decide to never turn on AI #1, you turn on AI #2, which destroys the world no matter what your goals are. Then you have spikiness by default." This is true, and I think that's also a silly baseline to use for conceptual analysis.

Also, a system of non-obstructing agents may exhibit bad group dynamics and systematically optimize the world in a certain bad direction. But many properties aren't preserved under naive composition: corrigible subagents don't imply a corrigible system, pairwise independent random variables usually aren't mutually independent, and so on.

Similar objections can be made for multi-polar scenarios: the AI isn't wholly responsible for the whole state of the world and the other AIs already in it. However, the non-obstruction / AU landscape frame still provides meaningful insight into how human autonomy can be chipped away. Let me give an example.

• You turn on the first clickthrough maximizer, and each individual's AU landscape looks a little worse than before (in short, because there's optimization pressure on the world towards the "humans click ads" direction, which trades off against most goals)
• ...
• You turn on clickthrough maximizer n and it doesn't make things dramatically worse, but things are still pretty bad either way.
• Now you turn on a weak aligned AI and it barely helps you out, but still classes as "non-obstructive" (comparing 'deploy weak aligned AI' to 'don't deploy weak aligned AI'). What gives?
• Well, in the 'original / baseline' world, humans had a lot more autonomy. If the world is already being optimized in a different direction, your AU will be less sensitive to your goals because it will be harder for you to optimize in the other direction. The aligned weak AI may have been a lot more helpful in the baseline world.
• Yes, you could argue that if they hadn't originally deployed clickthrough-maximizers, they'd have deployed something else bad, and so the comparison isn't that good. But this is just choosing a conceptually bad baseline.

The point (which I didn't make in the original post) isn't that we need to literally counterfact on "we don't turn on this AI", it's that we should compare deploying the AI to the baseline state of affairs (e.g. early 21st century).

Comment by turntrout on Has anybody used quantification over utility functions to define "how good a model is"? · 2021-02-03T02:06:22.470Z · LW · GW

I ask because I already have a result that says this in MDPs: you can compute all optimal value functions iff you know the environment dynamics up to isomorphism.

Comment by turntrout on Has anybody used quantification over utility functions to define "how good a model is"? · 2021-02-02T20:52:49.904Z · LW · GW

For sufficiently rich Z, that means that the summary must include a full model of the environment.

Is this a theorem you've proven somewhere?

Comment by turntrout on Elephant seal 2 · 2021-02-02T14:58:40.028Z · LW · GW

I agree. This seals the deal for me.

Comment by turntrout on Open & Welcome Thread - January 2021 · 2021-02-02T02:46:30.617Z · LW · GW

I hadn't thought about the distinction between gaining and using resources. You can still wreak havoc without getting resources, though, by using them in a damaging way. But I can see why the distinction might be helpful to think about.

I explain my thoughts on this in The Catastrophic Convergence Conjecture. Not sure if you've read that, or if you think it's false, or you have another position entirely.
Comment by turntrout on TurnTrout's shortform feed · 2021-02-02T02:00:54.805Z · LW · GW

If Hogwarts spits back an error if you try to add a non-integer number of house points, and if you can explain the busy beaver function to Hogwarts, you now have an oracle which answers for arbitrary : just state " points to Ravenclaw!". You can do this for other problems which reduce to divisibility tests (so, any decision problem which you can somehow get Hogwarts to compute; if ).

Homework: find a way to safely take over the world using this power, and no other magic.

Comment by turntrout on Open & Welcome Thread - January 2021 · 2021-01-30T23:13:44.207Z · LW · GW

I basically don't see the human mimicry frame as a particularly relevant baseline. However, I think I agree with parts of your concern, and I hadn't grasped your point at first.

The [AUP] equations incentivize the AI to take actions that will provide an immediate reward in the next timestep, but penalizes its ability to achieve rewards in later timesteps.

I'd consider a different interpretation. The intent behind the equations is that the agent executes plans using its "current level of resources", while being seriously penalized for gaining resources. It's like if you were allowed to explore, you're currently on land which is 1,050 feet above sea level, and you can only walk on land with elevation between 1,000 and 1,400 feet. That's the intent.

The equations don't fully capture that, and I'm pessimistic that there's a simple way to capture it:

But what if the only way to receive a reward is to do something that will only give a reward several timesteps later? In realistic situations, when can you ever actually accomplish the goal you're trying to accomplish in a single atomic action?

For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire.
If it's just rewarded for making paperclips, and it can't make a paperclip the next timestep, the AI would instead focus on minimizing impact and not do anything.

I agree that it might be penalized hard here, and this is one reason I'm not satisfied with equation 5 of that post. It penalizes the agent for moving towards its objective. This is weird, and several other commenters share this concern.

Over the last year, I've come to think that "penalize own AU gain" is worse than "penalize average AU gain", in that I think the latter penalty equation leads to more sensible incentives. I still think that there might be some good way to penalize the agent for becoming more able to pursue its own goal. Equation 5 isn't it, and I think that part of your critique is broadly right.

Comment by turntrout on Open & Welcome Thread - January 2021 · 2021-01-30T02:12:59.138Z · LW · GW

I'm wondering how an agent using the attainable utility implementation in equations 3, 4, and 5 could actually be superhuman.

In the "superhuman" analysis post, I was considering whether that reward function would incentivize good policies if you assumed a superintelligently strong optimizer optimized that reward function.

For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire. If it's just rewarded for making paperclips, and it can't make a paperclip the next timestep, the AI would instead focus on minimizing impact and not do anything.

Not necessarily; an optimal policy maximizes the sum of discounted reward over time, and so it's possible for the agent to take actions which aren't locally rewarding but which lead to long-term reward. For example, in a two-step game where I can get rewarded on both time steps, I'd pick actions which maximize $R(s_1, a_1) + \gamma R(s_2, a_2)$. In this case, $R(s_1, a_1)$ could be 0, but the pair of actions could still be optimal.
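A minimal sketch of that two-step point, with made-up numbers: the action names, the reward tables, and the discount of 0.9 are all hypothetical, chosen only so that the first step of the best plan earns nothing. It enumerates every pair of actions and picks the pair with the highest discounted return.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor

# Hypothetical two-step game. Reward on step 1 depends only on the first action.
REWARD_1 = {"reach_for_wire": 0.0, "idle": 0.1}

# Reward on step 2 depends on both actions (the first action sets up the second state).
REWARD_2 = {
    ("reach_for_wire", "bend_wire"): 1.0,  # paperclip made
    ("reach_for_wire", "idle"): 0.0,
    ("idle", "bend_wire"): 0.0,            # no wire in hand, nothing to bend
    ("idle", "idle"): 0.1,
}

def discounted_return(a1, a2, gamma=GAMMA):
    """Step-1 reward plus discounted step-2 reward."""
    return REWARD_1[a1] + gamma * REWARD_2[(a1, a2)]

def best_plan(gamma=GAMMA):
    """Brute-force search over all (first action, second action) pairs."""
    plans = product(REWARD_1, ["bend_wire", "idle"])
    return max(plans, key=lambda plan: discounted_return(*plan, gamma))

plan = best_plan()
print(plan, discounted_return(*plan))
```

Under these assumed numbers, the best plan starts with `reach_for_wire`, which is worth 0 on its own; always idling collects a small reward on both steps but ends up with a lower total.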
I know you could adjust the reward function to reward the AI doing things that you think will help it accomplish your primary goal in the future. For example, you know the AI moving its arm towards the wire is useful, so you could reward that. But then I don't see how the AI could do anything clever or superhuman to make paperclips.

This idea is called "reward shaping" and there's a good amount of literature on it!

Comment by turntrout on Simulacrum 3 As Stag-Hunt Strategy · 2021-01-29T21:03:01.731Z · LW · GW

So: we make our case for Stag, try to convince people it’s the obviously-correct choice no matter what. And… they’re not fooled. But they all pretend to be fooled. And they all look around at each other, see everyone else also pretending to be fooled, and deduce that everyone else will therefore choose Stag. And if everyone else is choosing Stag… well then, Stag actually is the obvious choice. Just like that, Stag becomes the new Schelling point.

This seems like it could be easier for certain kinds of people than for others. One might want to insist to the group that yes, I know that Stag isn't always right, that's silly, and before you know it, you've shattered any hope of reaching Stag-Schelling via this method.

Comment by turntrout on Simulacrum 3 As Stag-Hunt Strategy · 2021-01-28T22:56:23.749Z · LW · GW

This is a great point. I think the WSB situation is fascinating and I hope that someone does a postmortem once the dust has settled. I think it contains a lot of important lessons with respect to coordination problems.
See also Eliezer's Medium post

Comment by turntrout on Covid: Bill Gates and Vaccine Production · 2021-01-28T22:48:41.036Z · LW · GW

To elaborate on your last point: beyond the benefits of appearing to have power, there's a difference between being in a high-power world state (having billions of dollars) and being able to actually compute (and then follow) policies which exercise your power in order to achieve your goals (see Appendix B.4 of Optimal Policies Tend to Seek Power for a formal example of this in reinforcement learning).

Comment by turntrout on Thoughts on the GME situation · 2021-01-28T15:43:22.627Z · LW · GW

Supposing that r/WSB is actually responsible (in part) for spiking $GME (do they really make up enough volume?), I find it fascinating how they're resolving coordination problems and creating common knowledge. One post last night was titled something like "upvote if you aren't selling before ($1000/$5000/some number I forget)!". At first this seemed like classic Reddit karma-whoring / circle-jerking, but then I realized that this actually creates some degree of common knowledge: by upvoting, you weakly signal support to others - and to yourself via consistency pressure - that you won't defect by selling early. And they know you know ..., because you've all seen and upvoted the same post. You know this is true for many people, because there are so many upvotes. There's the diamond / paper hands dichotomy, honoring cooperators and looking down on defectors. There's the "IF HE'S IN I'M IN" meme going on with one of the original traders, u/DeepFuckingValue, expressing the sentiment that given u/DFV's (apparent) multi-million ongoing stake in $GME, the other users will follow their lead by not selling.

I'm sure there are other things I've missed. But this was one of my first thoughts when I learned about this hilarious situation.

Comment by turntrout on Voting Phase for 2019 Review · 2021-01-28T15:32:00.471Z · LW · GW

It seems that I can still modify my votes, despite voting being over?

Comment by turntrout on Optimal play in human-judged Debate usually won't answer your question · 2021-01-27T16:09:23.286Z · LW · GW

I expect optimal play would be approachable via gradient descent in most contexts. With k bits, you can slide pretty smoothly from using all k as a direct answer to using all k to provide high value information, one bit at a time. In fact, I expect there are many paths to ignorance.

This seems off; presumably, gradient descent isn't being performed on the bits of the answer provided, but on the parameters of the agent which generated those bits.

Comment by turntrout on Lessons I've Learned from Self-Teaching · 2021-01-26T15:26:26.013Z · LW · GW

I think I do find Anki’d subjects much easier when I go back to them, yes.

I think that’s probably it - focus on concepts. If you like, I’d be happy to take a look at a deck you make / otherwise give feedback!

Comment by turntrout on The Gears of Impact · 2021-01-26T01:01:40.353Z · LW · GW

See e.g. my most recent AUP paper, equation 1, for simplicity. Why would optimal policies for this reward function have the agent simulate copies of itself, or why would training an agent on this reward function incentivize that behavior?

I think there's an easier way to break any current penalty term, which is thanks to Stuart Armstrong: the agent builds a successor which ensures that the no-op leaves the agent totally empowered and safe, and so no penalty is applied.

Comment by turntrout on Lessons I've Learned from Self-Teaching · 2021-01-25T20:11:24.495Z · LW · GW

my one great success with it was memorizing student names

+1, I used Anki before my CFAR workshop. I had to remind myself not to call people by their names before they even had their nametags on. This was great, because remembering names is generally a source of very mild social anxiety for me.

Comment by turntrout on Lessons I've Learned from Self-Teaching · 2021-01-25T18:24:37.656Z · LW · GW

Because that time you spend using Anki to retain it is time that you could spend seeking deep understandings elsewhere.

Right, but part of having deep understandings of things is to have the deep understanding, and if I just keep acquiring deep understanding without working on retention, I'll lose all but an echo. If I don't use Anki, I won't have the deep understanding later. I therefore consider Anki to help with the long-term goal of having deep understanding, even if moments spent making Anki cards could be used on object-level study.

Comment by turntrout on Lessons I've Learned from Self-Teaching · 2021-01-25T14:48:44.896Z · LW · GW

Thanks for all of these thoughts!

My impression and experience is that obtaining a deep understanding is a lot more fruitful than doing spaced repetition.

I don't see why these should be mutually exclusive, or even significantly trade off against each other. This sounds like saying "drinking water is more important than getting good sleep", which is also true in a literal sense, but which implies some kind of strange tradeoff that I haven't encountered. I aim to get a deep / intuitive understanding, and then use Anki to make myself use that understanding on a regular basis so that I retain it.

However, I think that playing around with those annoying details might be a good means to the end of grasping central concepts. I spent a lot of time telling myself, "It's only the big concepts that matter, so you can skip this section, you don't have to answer these questions, you don't have to do these practice problems."

I agree, and I'm not trying to tell people to not sweat the details. It's more like, you don't have to do every single practice problem, or complete all the bonus sections / applications which don't interest you. Which is a thing I did - e.g. for Linear Algebra Done Right, and then I forgot a lot of linear algebra anyways.

Another approach is, you start with the papers you want to understand, and then backchain into the concepts you have to learn in a big tree. Then you're guaranteed not to waste time. I haven't tried this yet, but it sounds very sensible and I mean to try it soon.

Comment by turntrout on Lessons I've Learned from Self-Teaching · 2021-01-25T14:38:58.140Z · LW · GW

I like to randomly sample X% of the exercises, and read the rest; this lets me know later whether or not I missed something important. Simple rules like "do every fifth exercise" should suffice, with the rule tailored to the number of exercises in the book.
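For what it's worth, both selection rules are easy to script. This is just a sketch: the function names are made up here, and exercises are assumed to be numbered 1 through N in book order.

```python
import random

def every_kth(exercises, k=5):
    """Deterministic rule: keep the k-th, 2k-th, 3k-th, ... exercise."""
    return exercises[k - 1::k]

def sample_fraction(exercises, fraction, seed=None):
    """Random rule: keep roughly `fraction` of the exercises, in book order."""
    if not exercises:
        return []
    rng = random.Random(seed)  # pass a seed to get the same selection again later
    count = max(1, round(len(exercises) * fraction))
    chosen = set(rng.sample(range(len(exercises)), count))
    return [ex for i, ex in enumerate(exercises) if i in chosen]

chapter = list(range(1, 31))  # hypothetical chapter with 30 exercises
print(every_kth(chapter, k=5))
print(sample_fraction(chapter, fraction=0.2, seed=0))
```

The fixed rule is easier to stick to; the random rule avoids systematically skipping, say, the harder problems that some books place at regular positions.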