Corrigibility as outside view 2020-05-08T21:56:17.548Z · score: 31 (13 votes)
How should potential AI alignment researchers gauge whether the field is right for them? 2020-05-06T12:24:31.022Z · score: 20 (8 votes)
Insights from Euclid's 'Elements' 2020-05-04T15:45:30.711Z · score: 121 (46 votes)
Problem relaxation as a tactic 2020-04-22T23:44:42.398Z · score: 102 (42 votes)
A Kernel of Truth: Insights from 'A Friendly Approach to Functional Analysis' 2020-04-04T03:38:56.537Z · score: 32 (6 votes)
Research on repurposing filter products for masks? 2020-04-03T16:32:21.436Z · score: 44 (8 votes)
ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z · score: 38 (9 votes)
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z · score: 37 (12 votes)
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z · score: 29 (9 votes)
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z · score: 26 (9 votes)
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z · score: 28 (8 votes)
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z · score: 28 (8 votes)
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z · score: 41 (9 votes)
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z · score: 33 (9 votes)
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z · score: 40 (12 votes)
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z · score: 45 (14 votes)
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z · score: 15 (6 votes)
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z · score: 7 (1 votes)
On Being Robust 2020-01-10T03:51:28.185Z · score: 40 (17 votes)
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z · score: 25 (8 votes)
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z · score: 22 (5 votes)
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z · score: 41 (14 votes)
Seeking Power is Instrumentally Convergent in MDPs 2019-12-05T02:33:34.321Z · score: 109 (32 votes)
How I do research 2019-11-19T20:31:16.832Z · score: 56 (22 votes)
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 57 (27 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 43 (15 votes)
World State is the Wrong Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 55 (17 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 54 (18 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 59 (17 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 52 (21 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 90 (35 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 30 (10 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 29 (6 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 111 (38 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 42 (14 votes)
Worrying about the Vase: Whitelisting 2018-06-16T02:17:08.890Z · score: 84 (20 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 118 (40 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)


Comment by turntrout on Arguments against myopic training · 2020-07-10T12:12:53.404Z · score: 2 (1 votes) · LW · GW

That's not what I'm asking. Why would that lead to power-seeking? You seem to be identifying "manipulation" with "power-seeking"; power-seeking implies manipulation, but the opposite isn't always true. 

Comment by turntrout on Arguments against myopic training · 2020-07-10T01:09:58.740Z · score: 2 (1 votes) · LW · GW

I described how this happens in the section on manipulating humans

Yes, but I don't understand your case for "finding chains of manipulative inputs which increase myopic reward" entailing power-seeking? Why would that behavior, in particular, lead to the highest myopic reward? If we didn't already know about power-seeking reward maximizers, why would we promote this hypothesis to attention?

This is exactly analogous to why the reward function you described takes over the world.

I disagree? Those objectives seem qualitatively dissimilar. 

Comment by turntrout on Arguments against myopic training · 2020-07-09T22:30:18.886Z · score: 2 (1 votes) · LW · GW

what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).

Sure. But, consider maximizing “TurnTrout has a fun day”-reward (for some imperfect grounding of that concept), and maximizing my approval of actions based on whether I think they’ll lead to a fun adventure. 

The former takes over the world, and I don’t have a very fun day. But what about the latter? 

To some extent, I won’t approve of actions that cause the agent to break, so there will be at least some instrumental subgoal pursuit for the agent. But for a successful power-seeking policy to be optimal, there is a conjunctive burden — we aren’t maximizing long-term discounted reward anymore, and the actions are evaluated locally, independently of any explicit global reward signal. 

Many quasi-independently predicted approval judgments must cohere into a dangerous policy. It’s quite possible that this happens, but I’m not very convinced of that right now. 

Comment by turntrout on Arguments against myopic training · 2020-07-09T18:38:27.214Z · score: 2 (1 votes) · LW · GW

The main argument for this is that most "simple" reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that's what the human says is correct. (Also we can just look at the long list of specification gaming examples so far.)

+1, I was about to write an argument to this effect. 

Also, you can’t always rationalize M as state-based reward maximization, but even if you could, that doesn’t tell you much. Taken on its own, the equivalence argument proves too much, because it would imply random policies have convergent instrumental subgoals: 

Let M(s, a) be drawn uniformly at random from the unit interval the first time it’s called. Have the agent choose the argmax for its policy. This can be rationalized as maximization of some reward function, so it’s probably power-seeking.

This doesn’t hold, obviously. Any argument about approval maximization should use specific facts about how approval is computed. 

Put otherwise, specifying an actual reward function seems to be a good way to get a catastrophic maximizer, but arbitrary action-scoring rules don’t seem to have this property, as Rohin said above. Most reward functions have power-seeking optimal policies, and every policy is optimal for some reward function, but most policies aren’t power-seeking. 

Comment by turntrout on Arguments against myopic training · 2020-07-09T16:49:52.820Z · score: 2 (1 votes) · LW · GW

The equivalence doesn’t just hold in a few cases - in fact, every function M which myopically assigns a value to all state-action pairs is the optimal Q-function for some reward function. So for any myopic training setup, there’s some equivalent nonmyopic training setup - specifically, the one with reward function 

As an aside, you can't rationalize all M like this if you restrict yourself to state-/outcome-based reward functions. Relevant to the main point of the section.
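
A minimal sketch of the quoted equivalence, assuming a small tabular MDP with deterministic transitions (the state/action counts, the discount, and the helper names are all made up for illustration). Given an arbitrary action-scoring function M, it builds the state-action reward R(s, a) = M(s, a) - gamma * max_a' M(s', a') and checks by value iteration that the optimal Q-function for R is M itself. Note that R here depends on the action as well as the state, which is in line with the aside above.

```python
import random

def rationalizing_reward(M, next_state, gamma):
    """R[s][a] = M[s][a] - gamma * max_a' M[s'][a'], with s' = next_state[s][a]."""
    return [[M[s][a] - gamma * max(M[next_state[s][a]])
             for a in range(len(M[s]))] for s in range(len(M))]

def optimal_q(R, next_state, gamma, iters=1000):
    """Tabular Q-value iteration for a deterministic MDP."""
    Q = [[0.0] * len(row) for row in R]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * max(Q[next_state[s][a]])
              for a in range(len(R[s]))] for s in range(len(R))]
    return Q

random.seed(0)
n_states, n_actions, gamma = 5, 3, 0.9  # hypothetical sizes and discount

# Hypothetical deterministic transition table: next_state[s][a] is the successor state.
next_state = [[random.randrange(n_states) for _ in range(n_actions)]
              for _ in range(n_states)]

# Arbitrary action-scoring function, drawn uniformly from the unit interval.
M = [[random.random() for _ in range(n_actions)] for _ in range(n_states)]

R = rationalizing_reward(M, next_state, gamma)
Q_star = optimal_q(R, next_state, gamma)

# Largest entrywise difference between the optimal Q-function for R and M.
max_gap = max(abs(Q_star[s][a] - M[s][a])
              for s in range(n_states) for a in range(n_actions))
```

Since M satisfies the Bellman optimality equation for R by construction, and that equation has a unique solution when gamma < 1, `max_gap` should be zero up to floating-point error. The check works even though M is random, which is the sense in which the equivalence alone says little about power-seeking.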

Comment by turntrout on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-08T20:17:00.860Z · score: 2 (1 votes) · LW · GW

The baseline is "I'm not shut off now, and I can avoid shutdown", so anything like "I let myself be shut down" would be heavily penalized (big optimal value difference).

Comment by turntrout on TurnTrout's shortform feed · 2020-07-07T23:04:03.243Z · score: 13 (4 votes) · LW · GW

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). 

What can we learn about this?

Comment by turntrout on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T13:30:58.448Z · score: 4 (2 votes) · LW · GW

Nice post! I think this notion of time-inconsistency points to a key problem in impact measurement, and if we could solve it (without backtracking on other problems, like interference/offsetting), we would be a lot closer to dealing with subagent issues.

I think the other baselines can also induce time-inconsistent behavior, for the same reason: if reaching the main goal has a side effect of allowing the agent to better achieve the auxiliary goal (compared to starting state / inaction / stepwise inaction), the agent is willing to pay a small amount to restrict its later capabilities. Sometimes this is even a good thing - the agent might "pay" by increasing its power in a very specialized and narrow manner, instead of gaining power in general, and we want that.

Here are some technical quibbles which don't affect the conclusion (yay).

If using an inaction rollout of length , just multiply that penalty by 

I don't think so - the inaction rollout formulation (as I think of it) compares the optimal value after taking action  and waiting for  steps, with the optimal value after  steps of waiting. There's no additional discount there.

Fortunately, when summing up the penalties, you sum terms like , so a lot of the terms cancel.

Why do the absolute values cancel?

Comment by turntrout on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T10:59:38.653Z · score: 3 (2 votes) · LW · GW

Ah, yes, the "compare with current state" baseline. I like that one a lot, and my thoughts regularly drift back to it, but AFAICT it unfortunately leads to some pretty heavy shutdown avoidance incentives.

Since we already exist in the world, we're optimizing the world in a certain direction towards our goals. Each baseline represents a different assumption about using that information (see the original AUP paper for more along these lines). 

Another idea is to train a "dumber" inaction policy and use that for the stepwise inaction baseline at each state. This would help encode "what should happen normally", and then you could think of AUP as performing policy improvement on the dumb inaction policy.

Comment by turntrout on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-06T23:06:17.086Z · score: 3 (2 votes) · LW · GW

Yeah - how do you know what should happen?

Comment by turntrout on Classifying games like the Prisoner's Dilemma · 2020-07-05T13:42:35.094Z · score: 4 (2 votes) · LW · GW

If , then defecting while your playmate cooperates creates value (relative to cooperating). From a social perspective, Krump/Flitz or Flitz/Krump is preferable to Krump/Krump; and in an iterated game of this sort, you'd prefer to alternate  with , than to get a constant . Wikipedia still classes this as a Prisoner's Dilemma, but I think that's dubious terminology, and I don't think it's standard. I might offhand suggest calling it the Too Many Cooks game.

But the only Nash is still , and so there is an incentive to defect from Krump/Flitz. Furthermore, that defection reduces total social welfare, so I think it makes sense to call this variant "PD" still.
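
A small sketch of the point above, using hypothetical payoffs (the numbers are made up; the original condition and payoff symbols are missing from this text). Krump is read as cooperating and Flitz as defecting, following the quoted passage. The code enumerates pure-strategy Nash equilibria of a 2x2 game in which one player defecting against a cooperator yields more total welfare than mutual cooperation.

```python
from itertools import product

# Hypothetical payoffs: (row player's payoff, column player's payoff).
# "K" = Krump (cooperate), "F" = Flitz (defect).
PAYOFFS = {
    ("K", "K"): (3, 3),
    ("K", "F"): (0, 10),
    ("F", "K"): (10, 0),
    ("F", "F"): (1, 1),
}
ACTIONS = ("K", "F")

def is_pure_nash(profile):
    """True if neither player gains by unilaterally switching actions."""
    row, col = profile
    row_ok = all(PAYOFFS[(row, col)][0] >= PAYOFFS[(alt, col)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, col)][1] >= PAYOFFS[(row, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

def welfare(profile):
    """Total payoff summed over both players."""
    return sum(PAYOFFS[profile])

pure_nash = [p for p in product(ACTIONS, ACTIONS) if is_pure_nash(p)]
```

With these numbers, Flitz strictly dominates Krump for both players, so Flitz/Flitz is the only equilibrium. Krump/Flitz has the highest total welfare, yet the Krump player gains by switching to Flitz, and that switch lowers total welfare.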

Comment by turntrout on Noise on the Channel · 2020-07-02T12:28:11.007Z · score: 10 (2 votes) · LW · GW

I think there's also the "no awkward silences" factor.

Comment by turntrout on AvE: Assistance via Empowerment · 2020-07-01T03:24:56.612Z · score: 3 (2 votes) · LW · GW

I'm a big fan of this, conceptually (will read the paper tomorrow morning). Attainable utility preservation is secretly trying to preserve human power. As a nitpick, though, they should probably approximate "average goal achievement ability" instead of empowerment (for formal reasons outlined in Appendix A of Optimal Farsighted Agents Tend to Seek Power).

As I've written previously, if we could build competitive agents which reliably increased human control-over-the-future, I think that would be pretty damn good. Don't worry about CEV for now - let's just get into a stable future. 

But, getting accurate models of humans seems difficult, and human power is best measured with respect to the policies which our cognitive algorithms can actually discover (I recently gave a curated talk on this - transcript coming soon). Assuming optimality could create weird incentives, but maybe the paper has something to say about that.

All in all, I don't feel optimistic about AvE-like approaches actually scaling to superhuman, if they need to explicitly pick out a human from the environment. 

Comment by turntrout on TurnTrout's shortform feed · 2020-06-29T03:16:21.709Z · score: 2 (1 votes) · LW · GW

aren't we all secretly hurting still?

Comment by turntrout on TurnTrout's shortform feed · 2020-06-29T00:46:47.566Z · score: 17 (6 votes) · LW · GW

For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.

No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.

Text: "Consider a bijection $f: X \to Y$"

My mental narrator: "Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar"

Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer "click" and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?

And then... and then, a month ago, I got fed up. What if it was all just in my head, at this point? I'm only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?

I wanted my hands back - I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed his hands were fine and healing and the pain was all psychosomatic.

And... it worked, as far as I can tell. It totally worked. I haven't dictated in over three weeks. I play Beat Saber as much as I please. I type for hours and hours a day with only the faintest traces of discomfort.


Comment by turntrout on - A Petition · 2020-06-28T13:13:34.065Z · score: 4 (2 votes) · LW · GW

I'm supposing the "violence" and "force" here are figurative, as in "a forceful response to their deanonymization of Scott"?

Comment by turntrout on TurnTrout's shortform feed · 2020-06-28T04:07:40.864Z · score: 4 (2 votes) · LW · GW

Maybe. I don't feel like that's the key thing I'm trying to point at here, though. The fact that you can understand any one of those in a reasonable amount of time is still surprising, if you step back far enough.

Comment by turntrout on TurnTrout's shortform feed · 2020-06-27T20:29:07.335Z · score: 2 (1 votes) · LW · GW

Physics has existed for hundreds of years. Why can you reach the frontier of knowledge with just a few years of study? Think of all the thousands of insights and ideas and breakthroughs that have been had - yet, I do not imagine you need most of those to grasp modern consensus.

Idea 1: the tech tree is rather horizontal - for any given question, several approaches and frames are tried. Some are inevitably more attractive or useful. You can view a Markov decision process in several ways - through the Bellman equations, through the structure of the state visitation distribution functions, through the environment's topology, through Markov chains induced by different policies. Almost everyone thinks about them in terms of Bellman equations, there were thousands of papers on that frame pre-2010, and you don't need to know most of them to understand how deep Q-learning works.

Idea 2: some "insights" are wrong (phlogiston) or approximate (Newtonian mechanics) and so are later discarded. The insights become historical curiosities and/or pedagogical tools and/or numerical approximations of a deeper phenomenon. 

Idea 3: most work is on narrow questions which end up being dead-ends or not generalizing. As a dumb example, I could construct increasingly precise torsion balance pendulums, in order to measure the mass of my copy of Dune to increasing accuracies. I would be learning new facts about the world using a rigorous and accepted methodology. But no one would care. 

More realistically, perhaps only a few other algorithms researchers care about my refinement of a specialized sorting algorithm (from  to ), but the contribution is still quite publishable and legible. 

I'm not sure what publishing incentives were like before the second half of the 20th century, so perhaps this kind of research was less incentivized in the past.

Comment by turntrout on TurnTrout's shortform feed · 2020-06-26T19:41:44.688Z · score: 5 (3 votes) · LW · GW

Idea: learn by making conjectures (math, physical, etc) and then testing them / proving them, based on what I've already learned from a textbook. 

Learning seems easier and faster when I'm curious about one of my own ideas.

Comment by turntrout on Abstraction, Evolution and Gears · 2020-06-26T02:20:08.450Z · score: 4 (2 votes) · LW · GW

I like this analysis. Whereas Seeking Power is Instrumentally Convergent in MDPs aimed to explain why seeking power is instrumentally convergent, this predicts qualitative properties of resource-hungry policies ("only cares about resource-relevant summary to the extent it affects long-term action spaces"). I think this frame also makes plain that most sensory reward functions don't differentiate between "humans flourish + I have power" world histories, and "humans perish + I have power" world histories.

More generally, I think mediating-long-term-action-space is part of how we intuitively decide what to call “resources” in the first place.

At this point, I ground my understanding of resources in the POWER(s) formulation (ability to achieve goals in general), and I think your take agrees with my model of that. Most goals we could have care more about the long-term action space, and so care about the stable and abstraction-friendly components of the environment which can widen that space (eg "resources"). Alternatively, resources are those things which tend to increase your POWER.

Comment by turntrout on MichaelA's Shortform · 2020-06-25T18:55:09.862Z · score: 2 (1 votes) · LW · GW

Thanks for the clarification. Yes, I'm suggesting bullet point 2.

LH intelligence evaluates learning algorithms. It makes sense to say an algorithm can adapt to a wide range of environments (in their precise formal sense: achieves high return under the universal mixture over computable environments), and maybe that it's more "charismatic" (has hard-coded social skills, or can learn them easily in relevant environments). But it doesn't make sense to say that an algorithm is physically stronger - that has to be a fact which is encoded by the environment's state (especially in this dualistic formalism).

The paper's math automatically captures these facts, in my opinion. I agree the boundary gets fuzzier in an embedded context, but so do a lot of things right now.
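
For reference, the Legg-Hutter measure being discussed can be written as below. This is the standard statement as I recall it, so treat the notation as an assumption: $E$ is the class of computable environments, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V^{\pi}_{\mu}$ is the expected return of policy $\pi$ in $\mu$.

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

Only the policy $\pi$ is held fixed across the sum. Anything like physical strength would have to live inside each environment $\mu$, which is the distinction drawn above.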

Comment by turntrout on MichaelA's Shortform · 2020-06-24T13:10:10.720Z · score: 2 (1 votes) · LW · GW

But if you gave me a billion dollars, that would vastly increase my ability to achieve goals in a wide range of environments, even if it doesn’t affect anything we’d typically want to refer to as my “intelligence”.

I don't think it would - the "has a billion dollars" is a stateful property - it depends on the world state. I think the LH metric is pretty reasonable and correctly ignores how much money you have. The only thing you "bring" to every environment under the universal prior, is your reasoning abilities.

My understanding is that this analysis conflates "able to achieve goals in general in a fixed environment" (power/resources) vs "able to achieve high reward in a wide range of environments" (LH intelligence), but perhaps I have misunderstood.

Comment by turntrout on Open & Welcome Thread - June 2020 · 2020-06-24T11:52:08.939Z · score: 5 (3 votes) · LW · GW

It's on EAForum, perhaps?

Comment by turntrout on TurnTrout's shortform feed · 2020-06-24T01:16:46.096Z · score: 2 (1 votes) · LW · GW

When under moral uncertainty, rational EV maximization will look a lot like preserving attainable utility / choiceworthiness for your different moral theories / utility functions, while you resolve that uncertainty.

Comment by turntrout on The ground of optimization · 2020-06-23T19:50:25.698Z · score: 2 (1 votes) · LW · GW

Would this work for highly non-monotonic utility functions? 

Comment by turntrout on The ground of optimization · 2020-06-23T19:48:49.093Z · score: 4 (2 votes) · LW · GW

I see the tension you're pointing at. I think I had in mind something like "an AI is reliably optimizing utility function u over the configuration space (but not necessarily over universe-histories!) if it reliably moves into high-rated configurations", and you could draw different epsilon-neighborhoods of optimality in configuration space. It seems like you should be able to talk about dog-maximizers without requiring that the agent robustly end up in the maximum-dog configurations (and not in max-minus-one-dog configs). 

I'm still confused about parts of this.

Comment by turntrout on The ground of optimization · 2020-06-23T18:15:29.928Z · score: 2 (1 votes) · LW · GW

I'm saying the target set for non-mild optimization is the set of configurations which maximize proxy-ness. Just take the argmax. By contrast, we might want to sample uniformly randomly from the set of satisficing configurations, which is much larger. 

(This is assuming a fixed initial state)

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T17:33:57.050Z · score: 4 (3 votes) · LW · GW

Looks beautiful. Great work! :)

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T16:11:21.336Z · score: 6 (3 votes) · LW · GW

I think it's more understandable that reasonable people would be upset about doxxing than about criticism. I don't think it's understandable to the point that outside observers would actually go "oh, OK, fair reply to the NYT's bad taste". Realistically speaking, I think they would think very poorly of us for "retaliating".

It seems improbable that the responses suggested by Mati_Roy would lead to positive changes at NYT. 

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T15:57:12.874Z · score: 2 (1 votes) · LW · GW


Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T15:46:49.496Z · score: 3 (2 votes) · LW · GW

Makes sense. I hadn't realized my comment might be seen as an endorsement of Mati_Roy's list. 

To clarify: I think it's more reasonable to respond like this to the name issue than to criticism. I don't personally endorse that list.

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T14:08:21.688Z · score: 20 (11 votes) · LW · GW

I do not expect RationalWiki to write favorably of LW/EY in worlds where LW/EY is net-good, and unfavorably in worlds where LW/EY is net-bad. I expect RW to write unfavorably of LW/EY either way, so I don't care much for their analysis (although I have read some of it, to make sure I'm not missing something). 

I think LW cares much more about truth-seeking, while RW gives me more of a "360-degree-spin-and-then-dunk on the Outgroup" vibe.

Disclaimer: I've only read a few articles on there.

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T14:01:42.660Z · score: 4 (10 votes) · LW · GW

I literally did not reply to the top-level comment, but rather to its child. Did you reply to the wrong person?

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T13:32:46.267Z · score: 43 (17 votes) · LW · GW

While I think it's wrong for the NYT to publish Scott's name and I support disincentivizing them from doing that, I don't support "ruining [the reporter's life]", "making [the reporter an enemy]", or "painting a target on [their] back". I understand the game theory here, but I also don't want us to be the kind of community that does stuff like that. 

Comment by turntrout on SlateStarCodex deleted because NYT wants to dox Scott · 2020-06-23T13:29:25.077Z · score: 8 (5 votes) · LW · GW

The difference is, I think, that we're just asking them to not publish Scott's name. I would not support this kind of pushback against criticism, even if it were silly. 

Edit: This is not an endorsement of Mati_Roy's list. I do endorse politely writing to NYT.

Comment by turntrout on The ground of optimization · 2020-06-22T23:43:30.162Z · score: 5 (3 votes) · LW · GW

Mild optimization: the easiest way to solve hard tasks may be to specify a proxy, which an AI maximizes. The AI steers into configurations which maximize the proxy function. Simple proxies don't usually have target sets which we like, because human value is complex. However, maybe we just want the AI to randomly select a configuration which satisfies the proxy, instead of finding the maximally-proxy-ness configuration, which may be bad due to extremal Goodhart. 

Quantilization tries to solve this by randomly selecting a target configuration from some top quantile, but this is sensitive to how world states are individuated. 
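
A toy sketch contrasting the three selection rules mentioned above, under strong simplifying assumptions: a finite list of configurations, a uniform base distribution over that list, and a made-up proxy. The function names are illustrative, not taken from any library.

```python
import random

def maximize(configs, proxy):
    """Non-mild optimization: pick the argmax of the proxy."""
    return max(configs, key=proxy)

def satisfice(configs, proxy, threshold, rng):
    """Sample uniformly from configurations whose proxy score meets the threshold."""
    return rng.choice([c for c in configs if proxy(c) >= threshold])

def quantilize(configs, proxy, q, rng):
    """Sample uniformly from the top q-fraction of configurations, ranked by proxy."""
    ranked = sorted(configs, key=proxy, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

rng = random.Random(0)
configs = list(range(100))  # hypothetical configuration space

def proxy(c):
    """Hypothetical proxy score: the configuration's own index."""
    return c

best = maximize(configs, proxy)                # the single proxy-maximizing config
ok = satisfice(configs, proxy, 50, rng)        # any config scoring at least 50
top10 = quantilize(configs, proxy, 0.10, rng)  # any config in the top 10%

# Individuation sensitivity: count config 0 as 900 extra "distinct" states,
# and the top 10% of the list now reaches all the way down to score 0.
finer = configs + [0] * 900
top10_finer = sorted(finer, key=proxy, reverse=True)[:int(len(finer) * 0.10)]
```

The last two lines show the individuation issue in miniature: nothing about the world changed, only how many list entries stand for the lowest-scoring state, and the top decile went from scores 90 through 99 to scores 0 through 99.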

Comment by turntrout on Zut Allais! · 2020-06-22T02:46:01.769Z · score: 10 (2 votes) · LW · GW

Fun fact: "Zut Allais!" is a play on words with respect to the French zut alors, which roughly translates to "darn it".

Comment by turntrout on Image GPT · 2020-06-21T15:29:41.319Z · score: 4 (2 votes) · LW · GW


I also worked for a day as a cashier at a grocery store.

Comment by turntrout on Sunday Jun 21st – Online Talks by Curated Authors · 2020-06-21T14:04:40.104Z · score: 2 (1 votes) · LW · GW

We'll share a call link on this post (most likely using Zoom). 

Will this be added soon?

Comment by turntrout on Seeking Power is Instrumentally Convergent in MDPs · 2020-06-19T20:07:32.844Z · score: 2 (1 votes) · LW · GW

Update: I generalized these results to stochastic MDPs (before, I assumed determinism).

Comment by turntrout on TurnTrout's shortform feed · 2020-06-10T21:29:33.673Z · score: 6 (3 votes) · LW · GW

The new "Broader Impact" NeurIPS statement is a good step, but incentives are misaligned. Admitting fatally negative impact would set a researcher back in their career, as the paper would be rejected. 

Idea: Consider a dangerous paper which would otherwise have been published. What if that paper were published title-only on the NeurIPS website, so that the researchers can still get career capital?

Problem: How do you ensure resubmission doesn't occur elsewhere?

Comment by turntrout on Inaccessible information · 2020-06-06T13:54:57.991Z · score: 2 (1 votes) · LW · GW

As an intuition pump, imagine a company that is run entirely by A/B tests for metrics that can be easily checked. This company would burn every resource it couldn’t measure — its code would become unmaintainable, its other infrastructure would crumble, it would use up goodwill with customers, it would make no research progress, it would become unable to hire, it would get on the wrong side of regulators.

It seems like part of this problem is easy-ish, and part is hard.

The easy part: seems like you can formally capture what resources are via average optimal value. A system which actually increased my average optimal value wrt the future seems quite helpful. Basically, this is just an alternative statement of instrumental convergence - ceteris paribus, making sure I'm highly able to paint houses blue also probably means I can autonomously pursue my actual values.*

* This probably reads weird, but I don't have time to go in depth on this right now. Happy to clarify more later.

But, average optimal value is itself inaccessible. It's less inaccessible than eg my true moral values and desires, but it still requires reasoning about something in the world which cannot be directly observed. Furthermore, "average optimal value" relies on a notion of counterfactual that is itself an abstraction - "how well could (this person) achieve this other goal (which they won't actually pursue)". We'd have to pin down that abstraction, too.

Comment by turntrout on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-03T02:30:14.526Z · score: 7 (4 votes) · LW · GW

I'm not Wei, but I think my estimate falls within that range as well. 

Comment by turntrout on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-06-01T15:42:21.033Z · score: 2 (1 votes) · LW · GW

If it isn't constant-length, then it seems strange to assume Solomonoff induction would posit a large objective universe, given that such positing wouldn't help it predict its inputs efficiently (since such prediction requires locating agents).

But Solomonoff induction doesn’t rank hypotheses on whether they allow efficient predictions of some feature of interest; it ranks them based on posterior probabilities (prior probability + to what extent the hypothesis accurately predicted observations so far).

Comment by turntrout on Open & Welcome Thread—May 2020 · 2020-06-01T02:03:30.704Z · score: 3 (2 votes) · LW · GW

Welcome! Come on in, the water's fine. 

Comment by turntrout on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-05-31T17:46:26.775Z · score: 10 (2 votes) · LW · GW

The way I understand Solomonoff induction, it doesn't seem like the complexity of specifying the observer scales logarithmically with the total number of observers. It's not like there's a big phone book of observers in which locations are recorded. Rather, it should be the complexity of saying "and my camera is here". 

Comment by turntrout on AGIs as populations · 2020-05-27T00:19:38.666Z · score: 4 (2 votes) · LW · GW

In other words, I think of patching your way to good arguments

As opposed to what?

Comment by turntrout on TurnTrout's shortform feed · 2020-05-26T18:01:51.657Z · score: 6 (3 votes) · LW · GW

Sentences spoken aloud are a latent space embedding of our thoughts; when trying to move a thought from our mind to another's, our thoughts are encoded with the aim of minimizing the other person's decoder error.

Comment by turntrout on Conclusion to 'Reframing Impact' · 2020-05-25T17:13:11.912Z · score: 2 (1 votes) · LW · GW

If you're managing a factory, I can say "Rohin, I want you to make me a lot of paperclips this month, but if I find out you've increased production capacity or upgraded machines, I'm going to fire you". You don't even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.

I think this is a natural concept and is distinct from particular formalizations of it.

edit: consider the three plans

  1. Make 10 paperclips a day
  2. Make 10 paperclips a day, but take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
  3. Take over the planet and make millions of paperclips each day.

Comment by turntrout on Conclusion to 'Reframing Impact' · 2020-05-25T00:50:13.636Z · score: 2 (1 votes) · LW · GW

Why do you object to the latter?