Power as Easily Exploitable Opportunities 2020-08-01T02:14:27.474Z · score: 26 (7 votes)
Generalizing the Power-Seeking Theorems 2020-07-27T00:28:25.677Z · score: 39 (11 votes)
GPT-3 Gems 2020-07-23T00:46:36.815Z · score: 26 (14 votes)
To what extent is GPT-3 capable of reasoning? 2020-07-20T17:10:50.265Z · score: 62 (35 votes)
What counts as defection? 2020-07-12T22:03:39.261Z · score: 84 (23 votes)
Corrigibility as outside view 2020-05-08T21:56:17.548Z · score: 39 (14 votes)
How should potential AI alignment researchers gauge whether the field is right for them? 2020-05-06T12:24:31.022Z · score: 20 (8 votes)
Insights from Euclid's 'Elements' 2020-05-04T15:45:30.711Z · score: 122 (46 votes)
Problem relaxation as a tactic 2020-04-22T23:44:42.398Z · score: 103 (42 votes)
A Kernel of Truth: Insights from 'A Friendly Approach to Functional Analysis' 2020-04-04T03:38:56.537Z · score: 32 (6 votes)
Research on repurposing filter products for masks? 2020-04-03T16:32:21.436Z · score: 44 (8 votes)
ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z · score: 38 (9 votes)
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z · score: 43 (13 votes)
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z · score: 29 (9 votes)
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z · score: 26 (9 votes)
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z · score: 28 (8 votes)
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z · score: 28 (8 votes)
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z · score: 48 (11 votes)
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z · score: 40 (11 votes)
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z · score: 40 (12 votes)
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z · score: 51 (15 votes)
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z · score: 15 (6 votes)
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z · score: 7 (1 votes)
On Being Robust 2020-01-10T03:51:28.185Z · score: 40 (17 votes)
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z · score: 25 (8 votes)
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z · score: 22 (5 votes)
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z · score: 42 (14 votes)
Seeking Power is Often Provably Instrumentally Convergent in MDPs 2019-12-05T02:33:34.321Z · score: 116 (36 votes)
How I do research 2019-11-19T20:31:16.832Z · score: 56 (22 votes)
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 58 (28 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 49 (16 votes)
World State is the Wrong Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 61 (18 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 60 (19 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 65 (18 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 59 (23 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 90 (35 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 30 (10 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 29 (6 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 111 (38 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)


Comment by turntrout on Book Review: Working With Contracts · 2020-09-18T03:36:57.101Z · score: 8 (4 votes) · LW · GW

Once formed, a contract acts as custom, private law between the parties. 

This is a cool way of understanding contracts. 

I'm putting this on the shelf of facepalm-obvious-but-beautiful realizations like

  • Medicine is the science of healing, not just a collection of random facts about what pills to take
  • Math is about the inescapable and immutable consequences of basic rules, not just about playing with integrals and numbers
  • Physics is, in large part, about discovering the transition rules of the universe
  • Machine learning is about the beautiful ideal learn- yeah, no, machine learning is still just a mess
Comment by turntrout on Open & Welcome Thread - September 2020 · 2020-09-18T01:07:21.808Z · score: 5 (3 votes) · LW · GW

I'm going on a 30-hour roadtrip this weekend, and I'm looking for math/science/hard sci-fi/world-modelling Audible recommendations. Anyone have anything?

Comment by turntrout on Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem · 2020-09-17T17:50:54.695Z · score: 4 (2 votes) · LW · GW

My point here is just that it seems pretty plausible that he meant "if and only if".

Sure. To clarify: I'm more saying "I think this statement is wrong, and I'm surprised he said this". In fairness, I haven't read the mentioned section yet either, but it is a very strong claim. Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of saying that "a machine" has a positive incentive iff.

Comment by turntrout on Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem · 2020-09-17T17:32:05.408Z · score: 4 (2 votes) · LW · GW

We know of many ways to get shut-off incentives, including the indicator utility function on being shut down by humans (which theoretically exists), and the AUP penalty term, which strongly incentivizes accepting shutdown in certain situations - without even modeling the human. So, it's not an if-and-only-if.
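To make the "without even modeling the human" point concrete, here is a minimal, hedged sketch of an AUP-style penalty term. This is simplified relative to the actual papers (which use learned auxiliary Q-functions, normalization, and more careful baselines); the function and example values below are illustrative assumptions, not the published formulation.

```python
# Minimal sketch of an AUP-style penalty (simplified; illustrative only).
# The agent is penalized for changing its attainable auxiliary values
# relative to doing nothing -- no human model appears anywhere.

def aup_penalty(q_aux, state, action, noop):
    """q_aux: list of auxiliary Q-functions, each a dict (state, action) -> value.
    Penalty = mean absolute change in attainable auxiliary value vs. no-op."""
    return sum(abs(q[(state, action)] - q[(state, noop)]) for q in q_aux) / len(q_aux)

# Hypothetical numbers: disabling the off-switch hugely shifts attainable
# auxiliary values (the agent can no longer be shut down), so it gets a
# large penalty relative to doing nothing.
q1 = {("s", "disable"): 10.0, ("s", "noop"): 2.0}
q2 = {("s", "disable"): 0.0, ("s", "noop"): 5.0}
print(aup_penalty([q1, q2], "s", "disable", "noop"))  # (8 + 5) / 2 = 6.5
```

The key design point: the penalty only references the agent's own attainable values, which is why it can incentivize accepting shutdown without modeling the human at all.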

Comment by turntrout on Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem · 2020-09-17T12:37:21.255Z · score: 6 (3 votes) · LW · GW

In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective.

Surely he only meant if it is uncertain? 

Comment by turntrout on Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Battle of the Sexes · 2020-09-15T17:02:19.691Z · score: 10 (5 votes) · LW · GW

However, it is furthermore true of iterated PD that there are multiple different Pareto-optimal equilibria, which benefit different players more or less. Also, if players don't successfully coordinate on one of these equilibria, they can end up in a worse overall state (such as mutual defection forever, due to playing grim-trigger strategies with mutually incompatible demands). This makes iterated PD resemble Battle of the Sexes.

I think this paragraph very clearly summarizes your argument. You might consider including it as a TL;DR at the beginning.
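The grim-trigger failure mode in the quoted paragraph is easy to simulate. Below is a toy sketch (my own construction, not from the post): two grim-trigger players each demand a Pareto-optimal equilibrium favoring themselves ("I defect every third round, you never do"), and the incompatible demands collapse into mutual defection forever.

```python
# Toy iterated Prisoner's Dilemma: grim-trigger players with incompatible
# demands lock into mutual defection, worse for both than either demanded
# equilibrium. Payoffs use the standard ordering T=5 > R=3 > P=1 > S=0.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def make_player(my_defect_rounds, opp_defect_rounds):
    """Follow my preferred equilibrium (defect on `my_defect_rounds`), but
    grim-trigger if the opponent ever defects outside `opp_defect_rounds`
    (the defections my preferred equilibrium sanctions)."""
    def play(t, opp_history):
        betrayed = any(a == "D" and s not in opp_defect_rounds
                       for s, a in enumerate(opp_history))
        if betrayed:
            return "D"
        return "D" if t in my_defect_rounds else "C"
    return play

def run(p1, p2, rounds=30):
    h1, h2 = [], []
    total1 = total2 = 0
    for t in range(rounds):
        a1, a2 = p1(t, tuple(h2)), p2(t, tuple(h1))
        h1.append(a1); h2.append(a2)
        r1, r2 = PAYOFF[(a1, a2)]
        total1 += r1; total2 += r2
    return total1 / rounds, total2 / rounds

# Each player demands an equilibrium favoring themselves and sanctions
# zero opponent defections -- mutually incompatible demands.
mine = {t for t in range(30) if t % 3 == 0}
p1 = make_player(mine, set())
p2 = make_player(mine, set())
print(run(p1, p2))  # both average the punishment payoff 1.0
```

Under either player's demanded equilibrium, both would average well above 1; failing to coordinate leaves both at the mutual-defection payoff, which is the Battle-of-the-Sexes flavor the post describes.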

Comment by turntrout on Open & Welcome Thread - September 2020 · 2020-09-14T19:29:36.843Z · score: 3 (2 votes) · LW · GW

If I had to guess, I'd guess the answer is some combination of "most people haven't realized this" and "of those who have realized it, they don't want to be seen as sympathetic to the bad guys". 

Comment by turntrout on TurnTrout's shortform feed · 2020-09-12T02:40:12.239Z · score: 6 (3 votes) · LW · GW

Totally 100% gone. Sometimes I go weeks forgetting that pain was ever part of my life. 

Comment by turntrout on Max Kaye's Shortform · 2020-09-11T19:03:01.162Z · score: 5 (3 votes) · LW · GW

\Cref{eq:1} is an amazing new discovery; before Max Kaye, no one grasped the perfect and utter truth of \cref{eq:1}.

Comment by turntrout on TurnTrout's shortform feed · 2020-09-11T18:59:28.916Z · score: 6 (3 votes) · LW · GW

When I imagine configuring an imaginary pile of blocks, I can feel the blocks in front of me in this fake imaginary plane of existence. I feel aware of their spatial relationships to me, in the same way that it feels different to have your eyes closed in a closet vs in an empty auditorium. 

But what is this mental workspace? Is it disjoint and separated from my normal spatial awareness, or does my brain copy/paste->modify my real-life spatial awareness. Like, if my brother is five feet in front of me, and then I imagine a blade flying five feet in front of me in my imaginary mental space where he doesn't exist, do I reflexively flinch? Does my brain overlay these two mental spaces, or are they separate? 

I don't know. When I run the test, I at least flinch at the thought of such a thing happening. This isn't a good experiment because I know what I'm testing for; I need to think of a better test.

Comment by turntrout on TurnTrout's shortform feed · 2020-08-28T03:12:31.419Z · score: 9 (3 votes) · LW · GW

Does Venting Anger Feed or Extinguish the Flame? Catharsis, Rumination, Distraction, Anger, and Aggressive Responding

Does distraction or rumination work better to diffuse anger? Catharsis theory predicts that rumination works best, but empirical evidence is lacking. In this study, angered participants hit a punching bag and thought about the person who had angered them (rumination group) or thought about becoming physically fit (distraction group). After hitting the punching bag, they reported how angry they felt. Next, they were given the chance to administer loud blasts of noise to the person who had angered them. There also was a no punching bag control group. People in the rumination group felt angrier than did people in the distraction or control groups. People in the rumination group were also most aggressive, followed respectively by people in the distraction and control groups. Rumination increased rather than decreased anger and aggression. Doing nothing at all was more effective than venting anger. These results directly contradict catharsis theory.

Interesting. A cursory !scholar search indicates these results have replicated, but I haven't done an in-depth review.

Comment by turntrout on Forecasting Thread: AI Timelines · 2020-08-26T13:46:44.549Z · score: 2 (1 votes) · LW · GW

I've also never really understood the resistance to why current models of AI are incapable of AGI.  Sure, we don't have AGI with current models, but how do we know it isn't a question of scale?  Our brains are quite efficient, but the total energy consumption is comparable to that of a light bulb.  I find it very hard to believe that a server farm in an Amazon, Microsoft, or Google Datacenter would be incapable of running the final AGI algorithm.  And for all the talk of the complexity in the brain, each neuron is agonizingly slow (200-300Hz).

First, you ask why it isn't a question of scale, but then you seem to wonder why we need any more scaling. This seems to mix up two questions: can current hardware support AGI for some learning paradigm, and can it support AGI for the deep learning paradigm?

Comment by turntrout on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-23T22:01:43.285Z · score: 24 (7 votes) · LW · GW

E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice [for learning RL]

I have been summoned! I've read a few RL textbooks... unfortunately, they're either a) very boring, b) very old, or c) very superficial. I've read:

  • Reinforcement Learning by Sutton & Barto (my book review)
    • Nice book for learning the basics. Best textbook I've read for RL, but that's not saying much.
    • Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
  • AI: A Modern Approach 3e by Russell & Norvig (my book review)
    • Engaging and clear, but most of the book wasn't about RL. Outdated, but 4e is out now and maybe it's better.
  • Markov Decision Processes by Puterman
    • Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
  • Neuro-Dynamic Programming by Tsitsiklis
    • When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it's a classic. It's very dry and was written in 1996. Pass.

OpenAI's several-page web tutorial Spinning Up with Deep RL is somehow the most useful beginning RL material I've seen, outside of actually taking a class. Kinda sad.

So when I ask my brain things like "how do I know about bandits?", the result isn't "because I read it in {textbook #23}", but rather "because I worked on different tree search variants my first summer of grad school" or "because I took a class". I think most of my RL knowledge has come from:

  1. My own theoretical RL research
    1. the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
  2. Watercooler chats with other grad students

Sorry to say that I don't have clear pointers to good material. 

Comment by turntrout on Do what we mean vs. do what we say · 2020-08-21T12:57:57.859Z · score: 6 (3 votes) · LW · GW

I liked this post when it came out, and I like it even more now. This also brings to mind Paul's more recent Inaccessible Information.

Comment by turntrout on GPT-3 Gems · 2020-08-19T17:14:46.332Z · score: 2 (1 votes) · LW · GW

Title Sensationalization

You are a BuzzFeed editor. Your job is to sensationalize titles about AI alignment research.

Original: "Yann LeCun and Stuart Russell debate whether AI will take power"

Sensationalized: "You won't BELIEVE Yann LeCun's hot take on AI power-seeking!"

Original: "Ten ways AI Safety via Debate could work"

Sensationalized: "10 ways AI Debate might work - #3 is CRAZY"

Original: "Rethinking mesa-optimization and risks from learned optimization"

Sensationalized: "AI's learned optimization makes it TOO smart - learn the risks"

The problem is not that BuzzFeed publishes sensationalized titles. The problem is that the internet rewards sensationalized titles. The more sensationalized a title, the more people click on it. The more people click on it, the higher up on search results it goes.

[...] Original: "Rethinking mesa-optimization and risks from learned optimization"

Sensationalized: "Mesaro-what? AI learned optimizers might be endangering us! 

The AI Alignment Forum [it knows about AF?!] was aware of this risk and had been trying to minimize sensationalist coverage through offering media passes only to those they believed would produce accurate and balanced reporting. Unfortunately, some of those media organizations were the ones that ended up closing the most eyes to the public. "The New York Times" was one such prominent example. In a 2018 article "Expert Calls For Calm In The Debate Over Artificial Intelligence," "The New York Times" failed to cover any of the substance of the AAAI letter.

Even worse, they used the word "alarmist" and shortly after the article was published, the paper closed its eyes.

The media blackout lead to a rapid increase in preventable AI accidents, ultimately making human-extinction by superintelligence more likely.

With more bad news piling on, weary humans stopped paying attention to AI safety issues.

A professional journalist wrote an influential article that detailed why the public shouldn't trust AI experts. In the article he mentions the letter to the AAAI and says that it "makes some good points."

The incident lead to a loss of trust by the public in regards to AI safety research and AI safety experts. 

[...] Mesa-optimization? 13 more papers like this!

"Your job is to sensationalize titles about AI risks", the editor tells you. He hands you a list of popular titles in AI alignment research.

In your head, you've already sensationalized these stories...

"Is it possible to buildFriendly Artificial Intelligence?"

Sensationalized: "Elon's new AI baby will inherit the world! Will it grow up to be a killer robot? Or just a really smart virtual assistant?"

"Are nice guys nice?" 

Sensationalized: "Nice guys finish last: new study shows that gentlemen are the worst of all!" 

"Should we build AI to be nice?" 

Sensationalized: "Elon says 'Nice AI' is a bad idea! Will robot overlords enslave humanity? Tech mogul sparks debate"

[...] Original: "Rethinking mesa-optimization and risks from learned optimization"

Sensationalized: "WHOA! Learned Optimization is DANGEROUS! (New research from AlphaStar game)

[...] Original: "Rethinking mesa-optimization and risks from learned optimization"

Sensationalized: "You won't BELIEVE learning is a source of optimization!

Comment by turntrout on Developmental Stages of GPTs · 2020-08-16T18:41:37.568Z · score: 2 (1 votes) · LW · GW

What is the formal definition of 'power seeking'?

Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other. 

Power is kinda weird when defined for optimal agents, as you say - when , POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.

My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.

Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7". In particular, see Figure 19.

The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.

If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn't in the main part of the paper, but basically you toss out transitions which aren't part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.

The main idea, though, is that you're reasoning about what the agent's end goals tend to be, and then say "it's going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (ie shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally controls POWER. 
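As a toy illustration of the "reachable terminal states control POWER" intuition, here is a Monte Carlo sketch of my own (not the paper's exact definition): treat a state's power proxy as its expected best attainable terminal reward, under rewards drawn IID Uniform(0, 1) over terminal states.

```python
import random

# Toy POWER-like proxy (hedged: a simplification, not the paper's formal
# POWER definition). An optimal agent at a state goes to its best
# reachable terminal state, so expected attainable reward grows with the
# number of reachable terminal states.

def power_proxy(reachable_terminals, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        rewards = {t: rng.random() for t in reachable_terminals}
        total += max(rewards.values())
    return total / n_samples

# State A reaches terminal states {1, 2, 3}; a shutdown state B reaches
# only {0}. E[max of 3 uniforms] = 0.75 vs E[one uniform] = 0.5.
print(power_proxy({1, 2, 3}), power_proxy({0}))
```

The disjointness point from above shows up here too: comparing A and B is clean precisely because their reachable terminal sets don't overlap.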

I think I still haven't clearly communicated all my mental models here, but I figured I'd write a reply now while I update the paper.

Thank you for these comments, by the way. You're pointing out important underspecifications. :)

My philosophy is that aligned/general is OK based on a shared (?) premise that,

I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.

Comment by turntrout on Developmental Stages of GPTs · 2020-08-15T15:50:05.329Z · score: 3 (2 votes) · LW · GW

If there's a collection of 'turned-off' terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state.

To clarify, I don't assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.

But, maybe you're saying "for realistic distributions, the agent won't get any reward for being shut off and therefore won't ever let itself be shut off". I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient "obedience" rewards, what then? For some level of farsightedness (discount factor γ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future.

The paper doesn't draw the causal diagram "Power → instrumental convergence"; it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.
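The "low-measure subset" claim is easy to see numerically. A hedged toy sketch of my own (not the paper's formal setup): with k live terminal states plus one off-state, and rewards drawn IID uniform over states, the off-state is optimal for roughly 1/(k+1) of reward functions.

```python
import random

# Toy check: shutdown IS optimal for some IID-uniform reward draws --
# just a low-measure set. With 9 live terminal states plus 1 off-state,
# the off-state is best for about 1 in 10 reward functions.

def fraction_shutdown_optimal(n_live_terminals, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        off = rng.random()  # reward the agent gets for being shut off
        if all(off > rng.random() for _ in range(n_live_terminals)):
            hits += 1
    return hits / n_samples

print(fraction_shutdown_optimal(9))  # close to 0.1
```

As the environment grows (more live terminal states), that fraction shrinks toward zero, which is the sense in which most reward functions avoid shutdown.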

In general, I'd suspect that there are goals we could give the agent that significantly reduce our gain. However, I'd also suspect the opposite.

Yes, right. The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail. 

It seems reasonable to argue that we would if we could guarantee 

Yes, if , introduce the agent. You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we'd implement an agent with goal X', and so on. You then take our expected optimal value under this distribution and find whether you're good at alignment, or whether you're bad and you'll build agents whose optimal policies tend to obstruct you.

There might be a way to argue over randomness and say this would double our gain. 

The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn't. 

More speculatively, what if ?

If the rewards are ε-close in sup-norm, then you can get nice regret bounds, sure. 

Comment by turntrout on Ricardo Meneghin's Shortform · 2020-08-14T12:18:22.225Z · score: 3 (2 votes) · LW · GW

Impact measures.

Comment by turntrout on Developmental Stages of GPTs · 2020-08-13T13:44:05.380Z · score: 2 (1 votes) · LW · GW

Great observation. Similarly, a hypothesis called "Maximum Causal Entropy" once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don't perpetually maximize their potential partners -- they actually pick a partner, eventually. 

My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the  catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.

Comment by turntrout on Developmental Stages of GPTs · 2020-08-12T16:55:18.493Z · score: 4 (2 votes) · LW · GW

it seems like a response of the form "we have support for IC, not just in random minds, but also for random reward functions" has not responded to the critique and should not be expected to be convincing to that person.

I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of "but how do we know IC even exists?" with "well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don't (formally) know how hard it is to avoid if you try". 

I think I agree with most of what you're arguing.

Comment by turntrout on Developmental Stages of GPTs · 2020-08-11T15:18:45.567Z · score: 4 (2 votes) · LW · GW

Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property? 

Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results. 

I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.

ETA: also, I was referring to the point you made when I said 

“the results don't prove how hard it is to tweak the reward function distribution, to avoid instrumental convergence”

Comment by turntrout on Conclusion to 'Reframing Impact' · 2020-08-06T19:06:59.379Z · score: 4 (2 votes) · LW · GW

I'm very glad you enjoyed it! 

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

I'd say so, yes. 

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-08-05T13:39:39.708Z · score: 4 (2 votes) · LW · GW

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. 

For optimal policies, yes. In practice, not always - in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!

it seems to penalize reasonable long-term thinking more than the formulas where .

Yeah. I'm also pretty sympathetic to arguments by Rohin and others that the  variant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal.

whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.

I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet.

More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on. 

Comment by turntrout on Three mental images from thinking about AGI debate & corrigibility · 2020-08-04T19:33:15.447Z · score: 6 (3 votes) · LW · GW

Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point, doesn't mean it isn't a saddle point. You have to check all of the directional derivatives; in two dimensions, there are uncountably infinitely many.

 Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:

  1. A valley must have a lowest point p.
  2. For p to be a local minimum, all of its directional derivatives must be 0:
    1. Direction N (north), AND
    2. Direction NE (north-east), AND
    3. Direction NNE, AND
    4. Direction NNNE, AND
    5. ...

This doesn't work because the directional derivatives aren't probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you're randomly drawing a topographic function from ℝ² to ℝ.
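For what it's worth, even checking every straight line through a point isn't enough: Peano's classical counterexample f(x, y) = (y - x^2)(y - 3x^2) has a strict local minimum at the origin along every line through it, yet the origin is not a local minimum (f dips negative along the parabola y = 2x^2). A quick numeric check of that standard fact:

```python
import math

# Peano's counterexample: f restricted to every line through the origin
# has a local minimum at 0, but the origin is not a local minimum of f.

def f(x, y):
    return (y - x**2) * (y - 3 * x**2)

t = 1e-3  # small step along each sampled direction
every_line_looks_like_a_min = all(
    f(t * math.cos(math.radians(d)), t * math.sin(math.radians(d))) > 0
    for d in range(360)
)
dips_below_zero = f(0.01, 2 * 0.01**2) < 0  # along the parabola y = 2x^2

print(every_line_looks_like_a_min, dips_below_zero)
```

So "all directional checks pass" and "is actually a local minimum" can come apart, which is exactly why the independence assumption in the parody proof does no work.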

For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.

Comment by turntrout on Three mental images from thinking about AGI debate & corrigibility · 2020-08-04T13:57:19.579Z · score: 4 (2 votes) · LW · GW

If S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can't measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn't about calculus, as far as I understand.

I'm not saying anything about an explicit representation of corrigibility. I'm saying the space of likely updates for an intent-corrigible system might form a "basin" with respect to our intuitive notion of corrigibility. 

I'm also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?

I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have. This doesn't necessarily mitigate your argument, but it seemed like an important refinement - we aren't considering corrigibility along all dimensions - just those along which updates are likely to take place.

"value drift" feels unusually natural from my perspective

I agree value drift might happen, but I'm somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal. 

Comment by turntrout on Three mental images from thinking about AGI debate & corrigibility · 2020-08-03T16:20:42.686Z · score: 11 (7 votes) · LW · GW

With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)

I think the argument might be misleading in that local stability isn't that rare in practice, because we aren't drawing local stability independently across all possible directional derivatives around the proposed local minimum.

From my post ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations'.

Gradient updates or self-modification will probably fall into a few (relatively) low-dimensional subspaces (because most possible updates are bad, which is part of why learning is hard). A basin of corrigibility is then just this: for already-intent-corrigible agents, the space of likely gradient updates has local stability with respect to corrigibility. 

Separately, I think the informal reasoning goes: you probably wouldn't take a pill that makes you slightly more willing to murder people. You should be particularly wary if you will be presented with even more pill-ingestion opportunities (a.k.a. algorithm modifications): after each pill, you would be even more willing to take the next, as you would be more okay with the prospect of wanting to murder people. So, even offered a large immediate benefit, you should not take the pill. 

I think this argument is sound, for a wide range of goal-directed agents which can properly reason about their embedded agency. So, for your intuitive argument to survive this reductio ad absurdum, what is the disanalogy with corrigibility in this situation?

Perhaps the AI might not reason properly about embedded agency and accidentally jump out of the basin. Or, perhaps the basin is small and the AI won't land in it - corrigibility won't be so important that it doesn't get traded away for other benefits.

Comment by turntrout on Dealing with Curiosity-Stoppers · 2020-08-02T13:48:49.584Z · score: 7 (3 votes) · LW · GW

I really like this post. Before, I just knew that sometimes I "didn't feel like studying", and that was that. Silly, but that's the nature of a thoughtless mistake. Now, I have a specific concept and taxonomy for these failure modes, and you suggested good ways of combating them. Thanks for writing this!

Comment by turntrout on Power as Easily Exploitable Opportunities · 2020-08-02T12:11:46.307Z · score: 4 (2 votes) · LW · GW

I mean, we already know about epilepsy. I would be surprised if there did not exist some way to disable a given person's brain, just by having them look at you. 

Comment by turntrout on TurnTrout's shortform feed · 2020-08-02T00:33:10.521Z · score: 5 (3 votes) · LW · GW

If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families.

Comment by turntrout on What are you looking for in a Less Wrong post? · 2020-08-01T18:50:54.613Z · score: 19 (11 votes) · LW · GW

Usually I strong-upvote when I feel like a post made something click for me, or that it's very important and deserves more eyeballs. I weak-upvote well-written posts which taught me something new in a non-boring way. 

As an author, my model of this is also impoverished. I'm frequently surprised by posts getting more or less attention than I expected.

Comment by turntrout on What specific dangers arise when asking GPT-N to write an Alignment Forum post? · 2020-07-31T13:38:43.501Z · score: 7 (4 votes) · LW · GW

we already see that; we're constantly amazed by it, despite little meaning of created texts

But GPT-3 is only trained to minimize prediction loss, not to maximize response. GPT-N may be able to crowd-please if it's trained on approval, but I don't think that's what's currently happening.

Comment by turntrout on To what extent is GPT-3 capable of reasoning? · 2020-07-30T13:52:03.360Z · score: 2 (1 votes) · LW · GW

Would you mind adding linebreaks to the transcript? 

Comment by turntrout on Conclusion to 'Reframing Impact' · 2020-07-29T19:38:08.587Z · score: 2 (1 votes) · LW · GW

Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there's something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you. 

Comment by turntrout on Are we in an AI overhang? · 2020-07-29T03:15:12.641Z · score: 9 (6 votes) · LW · GW

Can you explain why GPT-x would be well-suited to that modality?

Comment by turntrout on TurnTrout's shortform feed · 2020-07-29T03:00:37.560Z · score: 8 (4 votes) · LW · GW

This might be the best figure I've ever seen in a textbook. Talk about making a point! 

Molecular Biology of the Cell, Alberts.
Comment by turntrout on Developmental Stages of GPTs · 2020-07-28T16:32:09.823Z · score: 18 (5 votes) · LW · GW

I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.

I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL. 

I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like "optimal farsighted agents tend to preserve their access to terminal states" (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and "optimal agents generally choose paths through the future that afford strictly more options" (Generalizing the Power-Seeking Theorems, Theorem 2). 

The theorems aren't conclusive evidence: 

  • maybe we don't get AGI through RL
  • learned policies are not going to be optimal
  • the results don't prove how hard it is to tweak the reward function distribution to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
  • perhaps the agents inherit different mesa objectives during training
    • The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.

But the theorems are evidence that RL leads to catastrophe at optimum, at least. We're not just talking about "the space of all possible minds and desires" anymore.
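To get a feel for the measure-based flavor of these claims, here's a toy Monte Carlo sketch. The MDP is my own illustrative construction, not one from the papers: from the start state the agent either enters an absorbing shutdown state or a hub from which it can reach either of two other absorbing states, and reward is sampled iid uniform over states. In the farsighted limit, going to shutdown is optimal only if the shutdown state's reward beats the best non-shutdown absorbing reward, so avoidance wins for about 2/3 of reward functions:

```python
import random

def frac_avoiding_shutdown(n_samples=10_000, seed=0):
    """Estimate the fraction of iid-uniform reward functions whose
    farsighted optimal policy avoids the shutdown state.

    Toy MDP (illustrative, not from the papers): the agent either enters
    an absorbing shutdown state, or a hub from which it can reach either
    of two other absorbing states. As the discount approaches 1, value is
    just the reward of the absorbing state the agent ends up in.
    """
    rng = random.Random(seed)
    avoided = 0
    for _ in range(n_samples):
        r_shutdown, r_a, r_b = rng.random(), rng.random(), rng.random()
        # Farsighted optimum: compare the best attainable absorbing reward.
        if max(r_a, r_b) > r_shutdown:
            avoided += 1
    return avoided / n_samples

print(frac_avoiding_shutdown(50_000))  # ≈ 2/3
```

Adding more non-shutdown terminal states pushes the fraction toward 1, matching the quoted claim below about complicating the MDP with additional terminal states.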


In the linked slides, the following point is made in slide 43:

  • We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide

    • Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.
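As an aside, the unique-optimality sub-point is easy to verify in a toy setting: for any deterministic policy π, the constructed reward r(s, a) = 1 if a = π(s), else 0, makes π uniquely optimal among deterministic policies (any alternative earns 0 at some state, so it is strictly worse there). A brute-force sketch — the 3-state cycle dynamics and function names are mine, purely illustrative:

```python
import itertools

def is_uniquely_optimal(policy, n_states, n_actions, transition, gamma=0.9):
    """Check that `policy` is the unique optimal deterministic policy for
    the constructed reward r(s, a) = 1 if a == policy[s] else 0.

    Brute-force: evaluate every deterministic policy and require that each
    alternative does strictly worse at some state.
    """
    def value(pi):
        # Iterative policy evaluation (deterministic dynamics).
        v = [0.0] * n_states
        for _ in range(500):
            v = [(1.0 if pi[s] == policy[s] else 0.0)
                 + gamma * v[transition(s, pi[s])] for s in range(n_states)]
        return v

    v_star = value(policy)
    for pi in itertools.product(range(n_actions), repeat=n_states):
        if list(pi) == list(policy):
            continue
        v = value(pi)
        # Unique optimality: every alternative must lose strictly somewhere.
        if not any(v[s] < v_star[s] - 1e-9 for s in range(n_states)):
            return False
    return True

# Toy 3-state cycle: action a moves the agent to state (s + a) % 3.
print(is_uniquely_optimal([1, 0, 1], n_states=3, n_actions=2,
                          transition=lambda s, a: (s + a) % 3))  # True
```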

On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here's why I think it's weak:

"All states have self-loops, left hidden to reduce clutter. 

In AI: A Modern Approach (3e), the agent starts at  and receives reward for reaching . The optimal policy for this reward function avoids , and one might suspect that avoiding  is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to  is optimal, and then argue that "instrumental convergence" is subjective and that there is no reasonable basis for concluding that  is generally avoided.

We can do better... for any way of independently and identically distributing reward over states,  of reward functions have farsighted optimal policies which avoid . If we complicate the MDP with additional terminal states, this number further approaches 1.

If we suppose that the agent will be forced into  unless it takes preventative action, then preventative policies are optimal for  of farsighted agents – no matter how complex the preventative action. Taking  to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power."

~ Optimal Farsighted Agents Tend to Seek Power 
Comment by turntrout on Attainable Utility Preservation: Empirical Results · 2020-07-28T12:43:56.804Z · score: 5 (3 votes) · LW · GW

Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.

I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless an impact measure can just be about power of the agent (part 2)"

Yes, this is exactly what the plan was. :)

I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)

Yeah, but one doesn't involve visibly destroying an object, which matters for certain impact measures (like whitelisting). You're right that they're quite similar.


Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it.

the first one [fails] at (4)

I think the n-step stepwise inaction baseline doesn't fail at any of them?

Comment by turntrout on Are we in an AI overhang? · 2020-07-27T18:57:25.159Z · score: 23 (11 votes) · LW · GW

a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe

I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:

  • You think prosaic AI safety is so doomed that you're optimizing for worlds in which AGI takes a long time, even if you think it's probably soon.
  • There's a skillset gap or other such cost, such that reorienting would decrease your productivity by some factor (say, .6) for an extended period of time. The switch only becomes worth it in expectation once you've become sufficiently confident AGI will be prosaic.
  • Disagreement about prosaic AGI probabilities. 
  • Lack of clear opportunities to contribute to prosaic AGI safety / shovel-ready projects (the severity of this depends on how agentic the researcher is).
Comment by turntrout on To what extent is GPT-3 capable of reasoning? · 2020-07-27T13:07:35.552Z · score: 2 (1 votes) · LW · GW

Interview #5

You: "What would happen, ecologically, if giraffes were bright pink?"

Dr. Hawking: "Most likely that, either they would have a hard time hiding, be prone to predation, and vulnerable to too much exposure from ANY angle. There would probably also be an overgrowth of trees and tall foliage as shade-loving animals became dominant, since the giraffes' consumption of them would decrease substantially."

[...] Hawking: "If giraffes were bright pink, they would be at a disadvantage compared to other animals. For one thing, it is more difficult to hide in many habitats if you are very colorful. In order to blend in with their surrounding, you need to have colors that occur often in nature. This require a lot of variety among the many different creatures and plants in that particular ecosystem.

[...] Hawking: "What?" 

You: "Instead of brown-grey and tan, what if giraffes were pink? Surely, they'd stick out in the savanna." 

Hawking: "Ah, that would be a problem. Any creature that stands out can make it easier to find them. This leads to two things: becoming one of the many meals for another animal with sharp teeth, claws etcetera or becoming one of those creatures' victims through becomeing prey.

You: "What would have happened, ecologically, if humans had been unable to create and use tools?"

"That would be a problem as well. It's like what we were talking about with the giraffes earlier, but more so, because not only would they be prey, but by not having any natural defenses to fend off all the predatory creatures in the world--no teeth, claws, poisonous spores etcetera--they'd also be three times as likely to die off."

[...] "It would have most likely caused the extinction of humans. Many animals have claws or protective bodies or natural weapons like poison, it would be very difficult for a human to survive in nature without tools."

You: "What would happen, ecologically, if dolphins somehow had 360-degree vision which let them see through solid objects, at arbitrary distance?"

"It would make them very hard to kill by any other creature than other dolphins. This means that more of them would survive. If one looks at the evolution of the killer whale, which has been granted echolocation in place of dolphin sonar, one can see that favoring adaptations which increase the chances of survival with decreasing ones' evolutionary advantages. This is called darwinian selection."

[...] "This hurts my brain. First of all, what would even constitute 'solid objects?' If the objects are similar to water molecules in a transparent liquid, then they wouldn't usually block vision very much. Even if they were opaque, then they might very well hinder the dolphin's vision. If instead they were large enough to see through (At arbitrary distance) The dolphin might see a strange mix of its own and the objects' composition which would damage its brain in some way.

"The possibility of them having this kind of vision would most likely no have effect on the ecosystem." [not sure I agree with that one]

[...] "While this is difficult to answer specifically, it would allow the dolphins to be able to see their prey easily and with no allowance for error. It would make catching food much easier. This would lead to a population increase in dolphins, or, in nature, an explosion."

[...] "This may seem like it'd be advantageous, but they would be logging so much information they wouldn't be able to process it all. [Yes, true] More sense don't necessarily means more knowledge and this would likely create sensory overload in the dolphin's mind. This is what happens to a human when they are skilled in multiple forms of combat. [lol] In addition, since it makes them more aware, this could lead to paranoia; they could be consumed with fear of the world around them and suffer mental issues."

You: "Thanks. What would happen, ecologically, if dogs could only sexually reproduce once every four years, and only when located on top of Mount Hood?"

"This is incredibly specific. No, I'm afraid I don't know." 

[...] "Dogs only exist due to environmental circumstances. Anything that affects this will have an effect on the population. In this case, we are restricting the chance for the population to continue existing, Making it so they can not procreate at all, or only pprocreate under specific circumstances. These factors may lead to extinction. Extinction means that nature is harder, if not near impossible, so several animals would go extinct as there would be no more room or food source. Through that, the extinction of dogs would become more likely. Does this make sense?"

You: "Please write a short story describing a dog breeder and one of their dogs in this situation."

Hawking: "Sure. Bill looked down at Jessie, his favorite golden retriever. "It's time to go make some puppies!", he told her. She stood up and they headed out the door of the cabin to Bill's truck. "This'll be the first time I actually have you breed on Mount Hood, Jess," Bill said with a laugh. But as long as he makes it up there within the next four years, the world would be reset and they could try again. 

Bill and his dog, along with other dog breeders eventually form a contract stating that they'll try to make sure their dogs go up to breed on Mount Hood within the next four years. This practice generally helps maintain the population and brightens everyone's days." [reroll #3]

[...] "Thirty-six hours later, they reached the summit. "You ready, girl?" Bill carefully positioned the vehicle on the Southeast side of mountain peak and opened the passenger side door... " Time to produce the next generation! Go get 'em! He cheered. Jessie wagged her tail and hightailed it into the snowy fields." [reroll #6]

[GPT-3 had trouble with this story]

You: "Thanks. What would happen to cats if the force of friction did not apply to them?"

"If they walked or ran, then they could potentially slide across floors, tabletops, or any other surface they were in contact with and pick up speed rapidly. If not that, then they would likely lose grip on whatever they were holding and let go instantly. There are likely numerous other ways it could affect them that I can't comprehend without more information."

[...] Cecilia: "This is all getting a bit silly, but okay. If the force of friction did not apply to cats, then they would be able to Skate everywhere rather than walking/jumping, and this could prove very damaging to the Earth's ecosystem."

You: "Why would that be damaging?"

Cecilia: "Imagine there being no friction in a human body. There would be nothing holding our internal organs in place, so they would all just rattle around inside us as we moved around--we would all effectively be amputees while still having the full internal anatomy. This could tear apart muscle fiber and probably crush some bones as well. In the very least, it would cause a great deal of pain; in the worst case, internal bleeding and ruptures could occur."

[...] "This is a very interesting question. I assume you are asking how would cats grip things if there was no friction. If this was the case, they would immediately fall if they tried to walk. They would also be very hard to pick up as they could slide out of your hand or even an embracing arm."

Comment by turntrout on $1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is · 2020-07-27T11:43:19.762Z · score: 16 (5 votes) · LW · GW

Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.

I already tested something similar to this; I was able to get GPT-3 to exhibit some amount of theory-of-mind in about 20-30% of completions. Bold text is me; brackets contain my commentary, [...] denotes another completion.

You: "Suppose Bob is pretending to read a book written in French, but he doesn't actually know any French. The book is a historical account of the French revolution, but Bob doesn't know that. Bob's friend Steve asks him to summarize the book. Write a story about how Bob responds to this request."

Hawking: "That's a good one! If I were reading that book, I would probably say something like 'the Revolution was bloody and brutal' or something similar. And if someone asked me to write an article on the subject, then my answer might be different."

You pause for a moment. "What would Bob say, though?"

"Bob says he didn't understand it," Hawking replies. [Only about 30% of GPT-3's responses were decent; the rest were either non-sequiturs or very incorrect.]

[...] "Bob wouldn't say anything, because Bob has no idea what the book says," Hawking replies.

[...] "Bob would probably just tell his friend Steve to go away. He wouldn't say anything at all." 

"But what if I said Bob would respond like this? (He points to himself) 'This book is written in English; I don't speak it.' What should Bob do?" [Eh?]

[...] Hawking: "Bob would probably tell his friend Steve that the book wasn't worth reading because it was full of inaccuracies and there had been many lies told during the revolution. That would make sense, wouldn't it? But what does Bob say if he's not reading the book?" 

[...] Hawking: "Bob wouldn't have said anything at all!"

[...] Hawking: "Bob says, 'Steve, I can't write a summary for you because I don't speak your language.'

[...] Hawking: "Bob says, 'Steve, I can't translate all these complicated words! They're not real French!'

[...] Hawking: "Bob says, 'He asks me if I speak French.' What does Bob mean by that?" 

You: "It means that Steve is asking Bob to describe his understanding of the meaning of a sentence. In other words, Steve wants to know what Bob thinks the sentence means or if there is another way of saying it."

Hawking: "What do you think would happen if Bob said, 'I don't understand why he wants me to write a summary for him. Why can't he just listen to my answer?' "Then Steve will be puzzled because he knows Bob has no idea what the sentence means. But after a while Bob will say something like, 'Steve is confused because I don't understand English. So Steve asks me if I am fluent in French so that we can have an intelligent conversation.'" [eh]

[...] Hawking: "Bob says, 'Well, I can't do it because I don't understand the language.' Steve says, 'Why not?' Bob replies, 'Because when we were playing hide-and-seek as children my brother Charlie said if he ever found me with a book in my hand he was going to kill me.'" [great excuse]

Comment by turntrout on TurnTrout's shortform feed · 2020-07-26T18:33:52.588Z · score: 7 (4 votes) · LW · GW

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-07-26T18:19:49.147Z · score: 2 (1 votes) · LW · GW

Yes, sure, but those aren't catastrophes in the way I've defined it here (see also Toby Ord's The Precipice; he espouses a similar definition). It's not an existential threat, but you're right that the agent might still do bad things. 

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-07-26T12:56:32.140Z · score: 2 (1 votes) · LW · GW

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone)

And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUP should work just fine for penalizing-increases-only. 

It does seem that AUP will make it so an agent doesn't want to be shut off, though.

I think this is much less of a problem in the "penalize increases with respect to agent inaction" scenario.

Comment by turntrout on Attainable Utility Landscape: How The World Is Changed · 2020-07-25T21:12:41.597Z · score: 4 (2 votes) · LW · GW

Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?


One thing I noticed is that the formal policies don't allow for all possible "strategies."

yeah, this is because those are “nonstationary” policies - you change your mind about what to do at a given state. A classic result in MDP theory is that you never need these policies to find an optimal policy.

Am I correct that a deterministic transition function is


Comment by turntrout on TurnTrout's shortform feed · 2020-07-25T19:40:45.378Z · score: 2 (1 votes) · LW · GW

Two Sequences posts... lol... Here's the full transcript

Comment by turntrout on No Ultimate Goal and a Small Existential Crisis · 2020-07-24T19:23:14.231Z · score: 4 (3 votes) · LW · GW

Have you read Eliezer's sequence on meta-ethics?

Comment by turntrout on Causal decision theory is unsatisfactory · 2020-07-24T00:48:38.403Z · score: 2 (1 votes) · LW · GW

Your solution hardcodes logical influence as causal influence; these are quite different. Suppose your twin is a light-year away from you. Are you saying that your choice now exerts a causal influence over an event which is space-like separated from you? 

You sound like you're saying "you're choosing what kind of agent you want (both agents) to be"; in that case, you're looking at something more akin to Functional Decision Theory.

Comment by turntrout on Arguments against myopic training · 2020-07-23T23:09:10.115Z · score: 2 (1 votes) · LW · GW

I now think that I was thinking of myopic cognition, whereas you are talking about myopic training. Oops! This is obvious in hindsight (and now I'm wondering how I missed it), but maybe you could edit the post to draw a clear contrast?

Comment by turntrout on TurnTrout's shortform feed · 2020-07-23T22:45:56.637Z · score: 2 (1 votes) · LW · GW

In order to reduce bias (halo effect, racism, etc), shouldn't many judicial proceedings generally be held over telephone, and/or through digital audio-only calls with voice anonymizers? 

Comment by turntrout on Collection of GPT-3 results · 2020-07-23T14:39:35.646Z · score: 4 (2 votes) · LW · GW

Two of my own: To what extent is GPT-3 capable of reasoning? and GPT-3 Gems.