## Posts

Seeking Power is Provably Instrumentally Convergent in MDPs 2019-12-05T02:33:34.321Z · score: 101 (27 votes)
How I do research 2019-11-19T20:31:16.832Z · score: 55 (21 votes)
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 53 (23 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 38 (12 votes)
World State is the Wrong Level of Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 51 (15 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 53 (17 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 57 (16 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 47 (18 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 82 (30 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 28 (9 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 28 (6 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 111 (38 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 41 (14 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 115 (38 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)
Internalizing Internal Double Crux 2018-04-30T18:23:14.653Z · score: 80 (19 votes)
The First Rung: Insights from 'Linear Algebra Done Right' 2018-04-22T05:23:49.024Z · score: 77 (22 votes)
Unyielding Yoda Timers: Taking the Hammertime Final Exam 2018-04-03T02:38:48.327Z · score: 40 (12 votes)
Open-Category Classification 2018-03-28T14:49:23.665Z · score: 36 (8 votes)
The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach' 2018-03-25T06:55:46.204Z · score: 68 (18 votes)
Lightness and Unease 2018-03-21T05:24:26.289Z · score: 53 (15 votes)
How to Dissolve It 2018-03-07T06:19:22.923Z · score: 41 (15 votes)
Ambiguity Detection 2018-03-01T04:23:13.682Z · score: 33 (9 votes)
Set Up for Success: Insights from 'Naïve Set Theory' 2018-02-28T02:01:43.790Z · score: 62 (18 votes)
Walkthrough of 'Formalizing Convergent Instrumental Goals' 2018-02-26T02:20:09.294Z · score: 27 (6 votes)
Interpersonal Approaches for X-Risk Education 2018-01-24T00:47:44.183Z · score: 29 (8 votes)

Comment by turntrout on Towards a New Impact Measure · 2019-12-11T03:42:47.951Z · score: 12 (3 votes) · LW · GW

This is my post.

### How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today. I currently think that power-seeking behavior is the worst part of goal-directed agency, incentivizing things like self-preservation and taking-over-the-planet. Unless we assume an "evil" utility function, an agent only seems incentivized to hurt us in order to become more able to achieve its own goals. But... what if the agent's own utility function penalizes it for seeking power? What happens if the agent maximizes a utility function while penalizing itself for becoming more able to maximize that utility function?

This doesn't require knowing anything about human values in particular, nor do we need to pick out privileged parts of the agent's world model as "objects" or anything, nor do we have to disentangle butterfly effects. The agent lives in the same world as us; if we stop it from making waves by gaining a ton of power, we won't systematically get splashed. In fact, it should be extremely easy to penalize power-seeking behavior, because power-seeking is provably instrumentally convergent. That is, penalizing increasing your ability to e.g. look at blue pixels should also penalize power increases.

The main question in my mind is whether there's an equation that gets exactly what we want here. I think there is, but I'm not totally sure.

Comment by turntrout on Babble · 2019-12-08T15:46:29.422Z · score: 4 (2 votes) · LW · GW

I weakly think this post should be included in Best of LessWrong 2018. Although I'm not an expert, the post seems sound. The writing style is nice and relaxed. The author highlights a natural dichotomy; thinking about Babble/Prune has been useful to me on several occasions. For example, in a research brainstorming / confusion-noticing session, I might notice I'm not generating any ideas (Prune is too strong!). Having this concept handle lets me notice that more easily.

One improvement to this post could be the inclusion of specific examples of how the author used this dichotomy to improve their idea generation process.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-07T17:04:30.106Z · score: 5 (3 votes) · LW · GW

That makes sense, thanks. I then agree that it isn't always true that actively decreases, but it should generally become harder for us to optimize. This is the difference between a utility decrease and an attainable utility decrease.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-07T17:00:04.111Z · score: 7 (4 votes) · LW · GW

Yeah, this is a great connection which I learned about earlier in the summer. I think this theory explains what's going on when they say

They argue that simple mechanical systems that are postulated to follow this rule show features of “intelligence,” hinting at a connection between this most-human attribute and fundamental physical laws.

Basically, since near-optimal agents tend to go towards states of high power, and near-optimal agents are generally ones which are intelligent, observing an agent moving towards a state of high power is Bayesian evidence that it is intelligent. However, as I understand it, they have the causation wrong: instead of physical laws -> power-seeking and intelligence, intelligent goal-directed behavior tends to produce power-seeking.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-06T14:01:31.842Z · score: 4 (2 votes) · LW · GW

I don't immediately see why this wouldn't be true as well as the "intermediate version". Can you expand?

Comment by turntrout on Understanding “Deep Double Descent” · 2019-12-06T00:25:13.879Z · score: 15 (6 votes) · LW · GW

Great walkthrough! One nitpick:

find a regime where increasing the amount of training data by four and a half times actually decreases test loss (!):

Did you mean to say increases test loss?

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-05T23:03:38.539Z · score: 6 (3 votes) · LW · GW

What happens if it only has 1.1x as much power?

Then it won't always be instrumentally convergent, depending on the environment in question. For Tic-Tac-Toe, there's an exact proportionality in the limit of farsightedness (see theorem 46). In general, there's a delicate interaction between control provided and probability which I don't fully understand right now. However, we can easily bound how different these quantities can be; the constant depends on the distribution we choose (it's at most 2 for the uniform distribution). The formal explanation can be found in the proof of theorem 48, but I'll try to give a quick overview.

The power calculation is the average attainable utility. This calculation breaks down into the weighted sum of the average attainable utility when is best, the average attainable utility when is best, and the average attainable utility when is best; each term is weighted by the probability that its possibility is optimal.[1] Each term is the power contribution of a different possibility.[2]

Let's think about 's contribution to the first (simple) example. First, how likely is to be optimal? Well, each state has an equal chance of being optimal, so of goals choose . Next, given that is optimal, how much reward do we expect to get? Learning that a possibility is optimal tells us something about its expected value. In this case, the expected reward is still ; the higher this number is, the "happier" an agent is to have this as its optimal possibility.

In general,

If the agent can "die" in an environment, more of its "ability to do things in general" is coming from not dying at first. Like, let's follow where the power is coming from, and that lets us deduce things about the instrumental convergence. Consider the power at a state. Maybe 99% of the power comes from the possibilities for one move (like the move that avoids dying), and 1% comes from the rest. Part of this is because there are "more" goals which say to avoid dying at first, but part also might be that, conditional on not dying being optimal, agents tend to have more control.

By analogy, imagine you're collecting taxes. You have this weird system where each person has to pay at least 50¢, and pays no more than $1. The western half of your city pays $99, while the eastern half pays $1. Obviously, there have to be more people living in this wild western portion – but you aren't sure exactly how many more. Even so, you know that there are at least 99 people west, and at most 2 people east; so, there are at least 49.5 times as many people in the western half.

In the exact same way, the minimum possible average control is not doing better than chance (1/2 is the expected value of an arbitrary possibility), and the maximum possible is all agents being in heaven (a reward of 1 is maximal). So if 99% of the power comes from one move, then this move is at least 49.5 times as likely as any other move.

1. opt(f,), in the terminology of the paper. ↩︎

2. Power(f,); see definition 9. ↩︎
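The arithmetic behind the tax-collector bound can be checked with a few lines of Python. This is only a sketch of the analogy, not of the paper's proof; the function name `min_population_ratio` and its parameters are made up for illustration.

```python
from fractions import Fraction

def min_population_ratio(total_a, total_b, min_payment, max_payment):
    """Lower bound on (people in A) / (people in B), given each group's total payment.

    Hypothetical helper: assumes every person pays between min_payment and max_payment.
    """
    fewest_in_a = Fraction(total_a) / Fraction(max_payment)  # everyone in A pays the maximum
    most_in_b = Fraction(total_b) / Fraction(min_payment)    # everyone in B pays the minimum
    return fewest_in_a / most_in_b

# West pays $99, east pays $1, each person pays between 50 cents and $1.
ratio = min_population_ratio(99, 1, Fraction(1, 2), 1)
print(float(ratio))
```

The same computation applies to the power decomposition if "payment" is read as the expected attainable utility conditional on a possibility being optimal, bounded between chance and the maximum reward.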

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-05T22:01:15.583Z · score: 5 (3 votes) · LW · GW

My conclusion would be that the intermediate version is true but the strong version false then. Would you say that's an accurate summary?

I'm not totally sure I fully follow the conclusion, but I'll take a shot at answering - correct me if it seems like I'm talking past you.

Taking to be some notion of human values, I think it's both true that actively decreases and becomes harder for us to optimize. Both of these are caused, I think, by the agent's drive to take power / resources from us. If this weren't true, we might expect to see only "evil" objectives inducing catastrophically bad outcomes.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-05T21:17:19.546Z · score: 6 (3 votes) · LW · GW

Yes in this case, although note that that only tells us about the rules of the game, not about the reward function - most agents we're considering don't have the normal Tic-Tac-Toe reward function.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-05T21:15:43.579Z · score: 6 (3 votes) · LW · GW

Yeah, you could think of the control as coming from . Will rephrase.

Comment by turntrout on Seeking Power is Provably Instrumentally Convergent in MDPs · 2019-12-05T21:13:37.459Z · score: 6 (3 votes) · LW · GW

Yes. The full expansions (with no limit on the time horizon) are

, where .

Comment by turntrout on Fetch The Coffee! · 2019-12-05T02:32:32.597Z · score: 2 (1 votes) · LW · GW

The work is now public.

Comment by turntrout on TurnTrout's shortform feed · 2019-12-04T00:50:30.859Z · score: 15 (4 votes) · LW · GW

Listening to Eneasz Brodski's excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn't realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually "align" agents, and what are good frames for thinking about that?

Here's to hoping we don't sate the former curiosity too early.

Comment by turntrout on A list of good heuristics that the case for AI x-risk fails · 2019-12-03T05:37:16.885Z · score: 15 (4 votes) · LW · GW

Sounds like their problem isn't just misleading heuristics, it's motivated cognition.

Comment by turntrout on World State is the Wrong Level of Abstraction for Impact · 2019-12-01T03:14:04.708Z · score: 8 (3 votes) · LW · GW

This seems like a misreading of my post.

When you change your ontology, concepts like "cat" or "vase" don't become meaningless, they just get translated.

That’s a big part of my point.

Also, you know that AIXI's reward function is defined on its percepts and not on world states, right? It seems a bit tautological to say that its utility is local, then.

Comment by turntrout on The Correct Contrarian Cluster · 2019-11-30T15:03:12.405Z · score: 12 (3 votes) · LW · GW

A sizable shift has occurred because of him, which is different than your interpretation of my position. If you’re convincing Stuart Russell, who is convincing Turing award winners like Yoshua Bengio and Judea Pearl, then there was something that wasn’t considered.

Comment by turntrout on The Correct Contrarian Cluster · 2019-11-29T15:06:25.898Z · score: 7 (3 votes) · LW · GW

The latter has come to be true, in no small part as a result of his writing. This implies that there was indeed something academics were missing about alignment.

Comment by turntrout on TurnTrout's shortform feed · 2019-11-29T02:52:46.899Z · score: 12 (6 votes) · LW · GW

My life has gotten a lot more insane over the last two years. However, it's also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.

Before, life felt like... a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.

I love how curious I've become, even about "unsophisticated" things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I'd kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn't suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.

Maybe that's right, and maybe not. The point is that two years ago, I'd have thought this was just "how the world worked", and it was up to physicists to understand the details. Whatever, right? But now, I'm this starry-eyed kid in a secret shop full of wonderful secrets. Some secrets are already understood by some people, but not by me. A few secrets I am the first to understand. Some secrets remain unknown to all. All of the secrets are enticing.

My life isn't always like this; some days are a bit gray and draining. But many days aren't, and I'm so happy about that.

Socially, I feel more fascinated by people in general, more eager to hear what's going on in their lives, more curious what it feels like to be them that day. In particular, I've fallen in love with the rationalist and effective altruist communities, which was totally a thing I didn't even know I desperately wanted until I already had it in my life! There are so many kind, smart, and caring people, inside many of whom burns a similarly intense drive to make the future nice, no matter what. Even though I'm estranged from the physical community much of the year, I feel less alone: there's a home for me somewhere.

Professionally, I'm working on AI alignment, which I think is crucial for making the future nice. Two years ago, I felt pretty sidelined - I hadn't met the bars I thought I needed to meet in order to do Important Things, so I just planned for a nice, quiet, responsible, normal life, doing little kindnesses. Surely the writers of the universe's script would make sure things turned out OK, right?

I feel in the game now. The game can be daunting, but it's also thrilling. It can be scary, but it's important. It's something we need to play, and win. I feel that viscerally. I'm fighting for something important, with every intention of winning.

I really wish I had the time to hear from each and every one of you. But I can't, so I do what I can: I wish you a very happy Thanksgiving. :)

Comment by turntrout on The Correct Contrarian Cluster · 2019-11-28T17:54:36.243Z · score: 10 (5 votes) · LW · GW

You should also take into account that Eliezer seems to have been right, as an “amateur” AI researcher, about AI alignment being a big deal.

Comment by turntrout on TurnTrout's shortform feed · 2019-11-26T18:43:56.755Z · score: 4 (2 votes) · LW · GW

I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.

Alex Mennen suggested a great way of thinking about it. For some odd prime $p$, consider the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. This group is abelian and has even order $p-1$. Now, consider a primitive root / generator $g$. By definition, every element of the group can be expressed as $g^e$. The quadratic residues are those expressible by even $e$ (this is why, for prime numbers, half of the group is square mod $p$). This also lets us easily see that the residual subgroup is closed under multiplication by $g^2$ (which generates it), that two non-residues multiply to make a residue, and that a residue and non-residue make a non-residue. The Legendre symbol then just tells us, for $a = g^e$, whether $e$ is even.

Now, consider composite numbers $n$ whose prime decomposition only contains $0$ or $1$ in the exponents. By the fundamental theorem of finite abelian groups and the Chinese remainder theorem, we see that a number is square mod $n$ iff it is square mod all of the prime factors.

I'm still a little confused about how to think of squares mod .
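The primitive-root picture of quadratic residues is easy to check by brute force for small primes. The sketch below is illustrative only: the helper names are invented here, and the search is deliberately naive rather than efficient.

```python
def primitive_root(p):
    """Smallest generator of the multiplicative group mod an odd prime p (brute force)."""
    for g in range(2, p):
        if len({pow(g, e, p) for e in range(1, p)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def residues_by_even_powers(p):
    """Quadratic residues described as the even powers of a generator."""
    g = primitive_root(p)
    return {pow(g, e, p) for e in range(0, p - 1, 2)}

def residues_by_squaring(p):
    """Quadratic residues described directly as nonzero squares mod p."""
    return {x * x % p for x in range(1, p)}

def is_square_mod(a, n):
    """Brute-force test of whether a is a square mod n."""
    return any(x * x % n == a % n for x in range(n))

for p in (3, 5, 7, 11, 13):
    assert residues_by_even_powers(p) == residues_by_squaring(p)

# Squarefree modulus 15 = 3 * 5: square mod 15 iff square mod 3 and mod 5.
print(all(is_square_mod(a, 15) == (is_square_mod(a, 3) and is_square_mod(a, 5))
          for a in range(15)))
```

Both descriptions give the same set of size $(p-1)/2$, and the last check is the Chinese remainder theorem statement for one squarefree modulus.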

Comment by turntrout on An Untrollable Mathematician Illustrated · 2019-11-22T17:01:15.930Z · score: 12 (3 votes) · LW · GW

Abram's writing and illustrations often distill technical insights into accessible, fun adventures. I've come to appreciate the importance and value of this expository style more and more over the last year, and this post is what first put me on this track. While more rigorous communication certainly has its place, clearly communicating the key conceptual insights behind a piece of work makes those insights available to the entire community.

Comment by turntrout on How I do research · 2019-11-22T03:03:19.703Z · score: 17 (6 votes) · LW · GW

Don't lose "much" informational content? Alex, you're a goddamned alignment researcher... Do you want them to spend time fretting about exactly how to phrase their comment so as to not hurt your feelings?

This is a straw interpretation of what I'm trying to communicate. This argument isn't addressing the actual norm I plan on enforcing, and seems to instead cast me as walling myself off from anything I might not like.

The norm I'm actually proposing is that if you see an easy way to make an abrasive thing less abrasive, you take it. If the thing still has to be abrasive, that's fine. Remember, I said

needlessly abrasive

Am I to believe that if people can't say the thing that first comes to mind in the heat of the moment, there isn't any way to express the same information? What am I plausibly losing out on by not hearing "stooping to X" instead of "resorting to X"? That Said first thought of "stooping"? That he's passionate about typography?

I don't see many reasonable situations where this kind of statement

The thing you're doing really doesn't work, because [reason X]. You should do [alternative] because [reason Y].

doesn't suffice (again, this isn't a "template"; it's just an example of a reasonable level of frankness).

I've been actively engaged with LW for a good while, and this issue hasn't come up until now (which makes me think it's uncommon, thankfully). Additionally, I've worked with people in the alignment research community. No one has seemed to have trouble communicating efficiently in a minimally abrasive fashion, and I don't see why that will change.

I don't currently plan on commenting further on this thread, although I thank both you and Said for your contributions and certainly hope you keep commenting on my posts!

Comment by turntrout on Do you get value out of contentless comments? · 2019-11-21T22:57:12.771Z · score: 19 (12 votes) · LW · GW

Good question!

Yes, and probably. The comment makes me feel more rewarded for whatever effort I put into the post.

Comment by turntrout on TurnTrout's shortform feed · 2019-11-20T21:52:55.015Z · score: 9 (4 votes) · LW · GW

I feel very excited by the AI alignment discussion group I'm running at Oregon State University. Three weeks ago, most attendees didn't know much about "AI security mindset"-ish considerations. This week, I asked the question "what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don't just fit a bad story to the reward function. Think carefully."

There was some discussion and initial optimism, after which someone said "wait, those optimistic solutions are just the ones you'd prioritize! What's that called, again?" (It's called anthropomorphic optimism.)

I'm so proud.

Comment by turntrout on How I do research · 2019-11-20T21:25:42.758Z · score: 31 (11 votes) · LW · GW

I see that you're trying to enforce civility norms.

Very clever.

You're right that Said's criticism was substantive, and I didn't mean to downplay that in my comment. I do, in fact, think that Said is right: my formatting messes with archiving and search, and there are better alternatives. He has successfully persuaded me of this. In fact, I'll update the post after writing this comment!

The reason I made that comment is that I notice his tone makes it harder for me to update on and accept his argumentation. Although an ideal reasoner might not mind, I do. The additional difficulty of updating is a real cost, and the tone just seemed consistently unreasonable for the situation.

I don't think we should just prioritize authors simply getting thicker skin, although I agree that's a good thing for authors to strive for individually. Here is some of my reasoning.

• Suppose I were a newcomer to the site, I wrote a post about my research habits, and then I received this comment thread in return. Would I write more posts? 2017-me would not have done so. Suppose I even saw this happening on someone else's post. Would I write posts? No. I, like many people I have anecdotally heard of, was already intimidated by the perceived harshness of this site's comments. I think you might not currently be appreciating how brutal this site can look at first. If there are tradeoffs we can make along the lines of saying "if you're resorting to X" instead of "if you're stooping to X", tradeoffs that don't really lose much informational content but do significantly reduce potential for abrasion, it seems sensible to make them.

• Truly thickening one's skin seems pretty difficult. Maybe I can just sit down and Internal Double Crux this, but even if so, do we just expect authors to do this in general?

• Microhedonics. Even an author with reasonably (but imperfectly) thick skin might be slightly discouraged from engaging with the community. Obviously there is a balance to be struck here, but the line I drew does not seem unreasonable to me.

ETA: My comment also wasn't saying that people have to specifically follow the scripted example. They don't need to say they just prefer X, or whatever. The "good" example is probably overly flowery. Just avoid being needlessly abrasive.

Comment by turntrout on How I do research · 2019-11-20T18:53:13.507Z · score: 4 (2 votes) · LW · GW

That's a good, simple idea, thanks! I'll consider doing that.

Comment by turntrout on How I do research · 2019-11-20T18:19:42.968Z · score: 11 (6 votes) · LW · GW

Meta: at first, I found the tone of this thread to be fun and nerdy, but I'm quickly changing my mind. In fairness to you and the other commenters, I specified no moderation guidelines on this post. However, here are my thoughts on this kind of thing going forward:

I don't at all mind people debating my formatting choices, or saying they didn't like something I did. Here's a good comment:

As an aside, I see that you tried to employ makeshift drop caps, but I don't think it works well like that. Personally, I prefer using [technique X], so you might consider that instead. Additionally, it has the benefits of Y and Z. ✔️

this attempt at “drop caps” is an insult to drop caps. ❌

If you're stooping to the use of Unicode alternates for your "drop-caps"... ❌

This is needlessly abrasive, and I won't tolerate it on my posts in the future.

Comment by turntrout on How I do research · 2019-11-20T17:26:39.966Z · score: 2 (1 votes) · LW · GW

FYI, I'm told that normal users are not allowed to use this kind of formatting, but the admins can edit it in.

ETA maybe you could do , though.

Comment by turntrout on How I do research · 2019-11-20T15:48:33.936Z · score: 4 (2 votes) · LW · GW

Bullets wouldn’t work because some tips had several paragraphs, and it would have been awkward to make a new subsubsection with eg two sentences (make simplifying assumptions). So, I did this, and I, like Zack, liked how it looks.

Comment by turntrout on How I do research · 2019-11-20T06:04:59.726Z · score: 1 (5 votes) · LW · GW

How can I make amends, Said?

ETA: the special formatting apparently doesn't show up on the front page's comment feed. Weird.

Comment by turntrout on TurnTrout's shortform feed · 2019-11-13T17:18:29.555Z · score: 12 (4 votes) · LW · GW

Yesterday, I put the finishing touches on my chef d'œuvre, a series of important safety-relevant proofs I've been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now - now, I'm done.

I've had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.

I don't know why this happens. But right now, I'm still a little tired, even after getting a good night's sleep.

Comment by turntrout on Rohin Shah on reasons for AI optimism · 2019-11-06T16:32:10.923Z · score: 2 (1 votes) · LW · GW

Not on topic, but from the first article:

In reality, much of the success of a government is due to the role of the particular leaders, particular people, and particular places. If you have a mostly illiterate nation, divided 60%/40% into two tribes, then majoritarian democracy is a really, really bad idea. But if you have a homogeneous, educated, and savvy populace, with a network of private institutions, and a high-trust culture, then many forms of government will work quite well. Much of the purported success of democracy is really survivorship bias. Countries with the most human capital and strongest civic institutions can survive the chaos and demagoguery that comes with regular mass elections. Lesser countries succumb to chaos, and then dictatorship.

I don't see why having a well-educated populace would protect you against the nigh-inevitable value drift of even well-intentioned leaders when they ascend to power in a highly authoritarian regime.

Comment by turntrout on But exactly how complex and fragile? · 2019-11-04T16:28:47.880Z · score: 3 (2 votes) · LW · GW

There's a distinction worth mentioning between the fragility of human value in concept space, and the fragility induced by a hard maximizer running after its proxy as fast as possible.

Like, we could have a distance metric whereby human value is discontinuously sensitive to nudges in concept space, while still being OK practically (if we figure out eg mild optimization). Likewise, if we have a really hard maximizer pursuing a mostly-robust proxy of human values, and human value is pretty robust itself, bad things might still happen due to implementation errors (the AI is incorrigibly trying to accrue human value for itself, instead of helping us do it).

Comment by turntrout on TurnTrout's shortform feed · 2019-11-04T01:29:44.252Z · score: 11 (4 votes) · LW · GW

With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it's not: it has factorization $2 = (1+i)(1-i)$. Here's what's happening.

You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by $1+i$, we're scaling it by $\sqrt{2}$ and rotating it counterclockwise by $45^\circ$.

This gets us to the purple vector. Now, we multiply by $1-i$, scaling it up by $\sqrt{2}$ again (in green), and rotating it clockwise by the same amount. You can even deal with the scaling and rotations separately (scale twice by $\sqrt{2}$, with zero net rotation).
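The scale-and-rotate picture can be checked numerically with Python's built-in complex numbers. This is just a sanity check of the factorization; the variable names are chosen here for illustration.

```python
import cmath
import math

a, b = complex(1, 1), complex(1, -1)   # the Gaussian integers 1+i and 1-i

product = a * b                        # (1+i)(1-i) = 1 - i^2 = 2
scale_a, angle_a = cmath.polar(a)      # modulus and argument of 1+i
scale_b, angle_b = cmath.polar(b)      # modulus and argument of 1-i

print(product)
print(scale_a * scale_b, angle_a + angle_b)   # roughly 2 and 0, up to floating-point error
print(math.degrees(angle_a), math.degrees(angle_b))
```

Multiplying the moduli handles the scaling and adding the arguments handles the rotation, which is why the two can be treated separately.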

Comment by turntrout on Fetch The Coffee! · 2019-10-29T22:34:51.869Z · score: 2 (1 votes) · LW · GW

I guess I'm not clear what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action: how close it goes? And it's trained to maximize an evaluation function for its proximity, with just theta being the parameter?

That assumption is doing a lot of work, it's not clear what is packed into that, and it may not be sufficient to prove the argument.

Well, my reasoning isn't publicly available yet, but this is in fact sufficient, and the assumption can be formalized. For any MDP, there is a discount rate $\gamma$, and for each reward function there exists an optimal policy for that discount rate. I'm claiming that given $\gamma$ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.

(All of this can be formally defined in the right way. If you want the proof, you'll need to hold tight for a while)
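To make the claim concrete, here is a toy illustration in Python. It is not the proof referred to above, and every detail is an assumption made for the sketch: a hypothetical deterministic MDP in which the start state leads either to a "hub" with three absorbing successor states or to a single absorbing dead end, with rewards drawn independently and uniformly from [0, 1].

```python
import random

def prefers_hub(rewards, gamma):
    """True if the optimal policy moves from the start state to the hub (requires gamma < 1)."""
    r_hub, r_dead_end, *r_leaves = rewards
    # An absorbing state repeats its reward forever: r + gamma*r + ... = r / (1 - gamma).
    v_dead_end = r_dead_end / (1 - gamma)
    v_leaves = [r / (1 - gamma) for r in r_leaves]
    # Bellman equation at the hub: collect its reward, then move to the best leaf.
    v_hub = r_hub + gamma * max(v_leaves)
    return v_hub > v_dead_end

def fraction_preferring_hub(gamma, samples=2000, seed=0):
    """Share of sampled reward functions whose optimal policy heads to the hub."""
    rng = random.Random(seed)
    hits = sum(prefers_hub([rng.random() for _ in range(5)], gamma)
               for _ in range(samples))
    return hits / samples

print(fraction_preferring_hub(0.0))    # myopic: only the immediate rewards matter
print(fraction_preferring_hub(0.99))   # farsighted: the hub's extra options matter
```

In this toy setting, a myopic agent is indifferent on average between the two moves, while a farsighted one heads to the hub for most sampled reward functions, because the hub keeps more options open.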

Comment by turntrout on Fetch The Coffee! · 2019-10-29T20:14:35.726Z · score: 2 (1 votes) · LW · GW

It's still going to act instrumentally convergently within the MDP it thinks it's in. If you're assuming it thinks it's in a different MDP that can't possibly model the real world, or if it is in the real world but has an empty action set, then you're right - it won't become an overlord. But if we have a y-proximity maximizer which can actually compute an optimal policy that's farsighted, over a state space that is "close enough" to representing the real world, then it does take over.

The thing that's fuzzy here is "agent acting in the real world". In his new book, Russell (as I understand it) argues that an AGI trained to play Go could figure out it was just playing a game via sensory discrepancies, and then start wireheading on the "won a Go game" signal. I don't know if I buy that yet, but you're correct that there's some kind of fuzzy boundary here. If we knew what exactly it took to get a "sufficiently good model", we'd probably be a lot closer to AGI.

But Russell's original argument assumes the relevant factors are within the model.

If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee.

I think this is a reasonable assumption, but we need to make it explicit for clarity of discourse. Given that assumption (and the assumption that an agent can compute a farsighted optimal policy), instrumental convergence follows.

Comment by turntrout on The First Rung: Insights from 'Linear Algebra Done Right' · 2019-10-29T17:51:00.640Z · score: 15 (4 votes) · LW · GW

There are two cats where I live. Sometimes, I watch them meander around; it's fascinating to think how they go about their lives totally oblivious to the true nature of the world around them. The above incomputability result (ETA: the insolubility of the quintic) surprised me so much that I have begun to suspect that I too am a clueless cat (until I learn complex analysis; you'll excuse me for having a few other textbooks to study first).

So I finally got around to learning Galois theory / group theory, and I now understand this insolubility result (although my book stated some field theoretic results without proof). It's really, really cool, and I'm feeling a lot of things about it right now.

But also... 2018-Trout was just so adorably ignorant! Not only did I make the goof that daozaich pointed out, but that version of me also called the result an "incomputability" result!

You know the feeling you get when you find a drawing you made your mom when you were five, and it's full of adorable little spelling mistakes? Yeah, that's what I'm feeling right now.

Comment by turntrout on Fetch The Coffee! · 2019-10-29T16:44:40.176Z · score: 2 (1 votes) · LW · GW

In which case I think it would be wise for someone with Russell's views not to call the opposition stupid, or to assert that the position is trivial, when in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc. As far as I can tell from limited reading, the arguments around how tightly bundled these things may be are not watertight.

I agree from a general convincing-people standpoint that calling discussants stupid is a bad idea. However, I think it is indeed quite obvious if framed properly, and I don't think the argument needs to come down to nuanced points, as long as we agree on the agent design we're talking about - the Roomba is not a farsighted reward maximizer, and is implied to be trained in a pretty weak fashion.

Suppose an agent is incentivized to maximize reward. That means it's incentivized to be maximally able to get reward. That means it will work to stay able to get as much reward as possible. That means if we mess up, it's working against us.

I think the main point of disagreement here is goal-directedness, but if we assume RL as the thing that gets us to AGI, the instrumental convergence case is open and shut.

Comment by turntrout on Fetch The Coffee! · 2019-10-26T21:01:56.826Z · score: 5 (3 votes) · LW · GW

Here are my readings of the arguments.

Stuart claims that if you give the system a fixed objective, then it's incentivized to ensure it can achieve that goal. Naturally, it takes any advantage to stop us from stopping it from best achieving the fixed objective.

Yann's response reads as "who would possibly be so dumb as to build a system which would be incentivized to stop us from stopping it?".

Now, I also think this response is weak. But let's consider whether this is a reasonable response, even if it seemed reasonable to be confident that it will be easy to see when systems are flawed, and easy to build safeguards; neither is remotely likely to be true, in my opinion.

Suppose you have access to a computer terminal. You have a mathematical proof that if you type random characters into the terminal and then press Enter, one of the most likely outcomes is that it explodes and kills you. Now, a response analogous to Yann's is: "who would be so dumb as to type random things into this computer? I won't. Time to type my next paper.".

I think a wiser response would be to ask what about this computer kills you if you type in random stuff, and think very, very carefully before you type anything. If you have text you want to type, you should be checking it against your detailed gears-level model of the computer to assure yourself that it won't trigger whatever-causes-explosions. It's not enough to say that you won't type random stuff.

ETA: In particular, if, after learning of this proof, you find yourself still exactly as enthusiastic to type as you were before - well, we've discussed that failure mode before.

Comment by turntrout on All I know is Goodhart · 2019-10-26T20:16:33.893Z · score: 3 (2 votes) · LW · GW

I think you're using Able #2 (which makes sense--it's how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:

I am using #2, but I'm aware that there's a separate #1 meaning (and thank you for distinguishing between them so clearly, here!).

Comment by turntrout on All I know is Goodhart · 2019-10-26T16:16:32.501Z · score: 3 (2 votes) · LW · GW

Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t. (This will be in the next part of my sequence on impact).

Comment by turntrout on What are some unpopular (non-normative) opinions that you hold? · 2019-10-26T00:42:23.965Z · score: 12 (3 votes) · LW · GW

The CDC has data on promiscuity and pair bonding (https://www.cdc.gov/nchs/data/nhsr/nhsr036.pdf). I can find no data that indicates neutral or positive effects for promiscuity.

The study focuses on STI rates. How would promiscuity possibly improve these rates?

It seems to me that when pressed on the compact claim you made above, you just made a lot more claims (as pjeby pointed out).

Comment by turntrout on All I know is Goodhart · 2019-10-25T16:55:02.368Z · score: 3 (2 votes) · LW · GW

Let me clarify the distinction I'm trying to point at:

First, Goodhart's law applies to us when we're optimizing a goal for ourselves, but we don't know the exact goal. For example, if I'm trying to make myself happy, I might find a proxy of dancing, even though dancing isn't literally the global optimum. This uses up time I could have used on the actual best solution. This can be bad, but it doesn't seem that bad. I'm pretty corrigible to myself.

Second, Goodhart's law applies to other agents who are instructed to maximize some proxy of what we want. This is bad. If an agent is maximizing the proxy, then it's ensuring it's most able to maximize the proxy, which means it's incentivized to stop us from doing things (unless the proxy specifically includes that, a safeguard which is itself vulnerable to misspecification, or the agent is somehow otherwise more intelligently designed than the standard reward-maximization model). The agent is pursuing the proxy from its own perspective, not from ours.

I think this entropyish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: It's only because of the unique state humans are in (existing in a low entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.

Actually, I think I have a more precise description of the entropyish thing now. Goodhart's Law isn't driven by entropy; Goodhart's Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you'd expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.

I don't think the initial value has much to do with what you label the "AIS version" of Goodhart (neither does the complexity of human values in particular). Imagine we had a reward function that gave one point of reward for each cone detecting red; reward is dispensed once per second. Imagine that the universe is presently low-value; for whatever reason, red stimulation is hard to find. Goodhart's law still applies to agents we build to ensure we can see red forever, but it doesn't apply to us directly - we presumably deduce our true reward function, and no longer rely on proxies to maximize it.

The reason it applies to agents we build is that not only do we have to encode the reward function, but we also have to point to people! This does not have a short description length. With respect to hard maximizers, a single misstep means the agent is now showing itself red, or something.
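
A deliberately tiny sketch of that misstep. The action names and point values are made up for illustration. The only assumption that matters is that the encoded proxy counts red detected by any sensor, while the intended reward counts only red seen by people.

```python
# Hypothetical actions mapped to (proxy points, intended points).
# The proxy failed to "point to people", so the agent's own camera counts too.
ACTIONS = {
    "paint the humans' room red": (5, 5),
    "hand out red glasses": (8, 8),
    "point own camera at a red wall": (100, 0),
}
PROXY, INTENDED = 0, 1

def hard_maximize(actions, index):
    """Pick the single action with the highest score under the given measure."""
    return max(actions, key=lambda name: actions[name][index])

chosen = hard_maximize(ACTIONS, PROXY)        # what the hard maximizer actually does
wanted = hard_maximize(ACTIONS, INTENDED)     # what we hoped it would do

print("agent picks:", chosen)
print("we wanted:  ", wanted)
```

Nothing about the proxy is wrong on the actions where it agrees with the intended reward; a hard maximizer simply goes straight to the one place where the two come apart.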

How proxies interact is worth considering, but (IMO) it's far from the main reason for Goodhart's law being really, really bad in the context of AI safety.

Comment by turntrout on All I know is Goodhart · 2019-10-23T15:32:04.352Z · score: 4 (2 votes) · LW · GW

The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us. Therefore, any universe where we have an effective maximizer of something misspecified is a universe where we're no longer able to get what we want. That's why instrumental convergence is so bad.

Comment by turntrout on What are some unpopular (non-normative) opinions that you hold? · 2019-10-23T15:27:30.900Z · score: 19 (9 votes) · LW · GW

I do not accept this premise, and I'm surprised that it seems to you to be "almost unconsciously accepted by everyone" you've raised it with.

Comment by turntrout on An1lam's Short Form Feed · 2019-10-23T03:12:32.529Z · score: 3 (2 votes) · LW · GW

Although I haven't used Anki for math, it seems to me like I want to build up concepts and competencies, not remember definitions. Like, I couldn't write down the definition of absolute continuity, but if I got back in the zone and refreshed myself, I'd have all of my analysis skills intact.

I suppose definitions might be a useful scaffolding?

Comment by turntrout on Deducing Impact · 2019-10-22T19:35:14.174Z · score: 2 (1 votes) · LW · GW

Great responses.

What you're inferring is impressively close to where the sequence is leading in some ways, but the final destination is more indirect and avoids the issues you rightly point out (with the exception of the "ensuring the future is valuable" issue; I really don't think we can or should build eg low-impact yet ambitious singletons - more on that later).

Comment by turntrout on TurnTrout's shortform feed · 2019-10-22T19:22:03.859Z · score: 4 (2 votes) · LW · GW

I noticed I was confused and liable to forget my grasp on what the hell is so "normal" about normal subgroups. You know what that means - colorful picture time!

First, the classic definition. A subgroup is normal when, for all group elements , (this is trivially true for all subgroups of abelian groups).

ETA: I drew the bounds a bit incorrectly; is most certainly within the left coset ().

Notice that nontrivial cosets aren't subgroups, because they don't contain the identity.

This "normal" thing matters because sometimes we want to highlight regularities in the group by taking a quotient. Taking an example from the excellent Visual Group Theory, the integers have a quotient group consisting of the congruence classes , each integer slotted into a class according to its value mod 12. We're taking a quotient with the cyclic subgroup .
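
A small sketch of that quotient, reading the passage as the integers modulo the subgroup of multiples of 12. The function names are hypothetical. The point is that adding classes gives the same answer no matter which representative of each class you pick.

```python
MODULUS = 12

def congruence_class(n):
    """Slot an integer into its class, labelled by its value mod 12."""
    return n % MODULUS

def add_classes(a, b):
    """Add two classes by adding any representatives and reducing mod 12."""
    return (a + b) % MODULUS

# Shifting either representative by a multiple of 12 never changes the resulting class.
well_defined = all(
    congruence_class((a + 12 * k) + (b + 12 * m)) == add_classes(a, b)
    for a in range(12) for b in range(12)
    for k in (-2, 0, 3) for m in (-1, 0, 5)
)
print(well_defined)
```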

So, what can go wrong? Well, if the subgroup isn't normal, strange things can happen when you try to take a quotient.

Here's what's happening:

Normality means that when you form the new Cayley diagram, the arrows behave properly. You're at the origin, . You travel to using . What we need for this diagram to make sense is that if you follow any you please, applying means you go back to . In other words, . In other words, . In other other words (and using a few properties of groups), .
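
To see the failure concretely, here is a sketch on the symmetric group on three elements, using the left-coset-equals-right-coset form of normality (gH = Hg for every g). The choice of group, the subgroups `H` and `N`, and all function names are assumptions made for illustration, not taken from the pictures in the original comment.

```python
from itertools import permutations

def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def left_coset(g, H):
    return frozenset(compose(g, h) for h in H)

def right_coset(g, H):
    return frozenset(compose(h, g) for h in H)

def is_normal(H, G):
    """H is normal in G when gH == Hg for every g in G."""
    return all(left_coset(g, H) == right_coset(g, H) for g in G)

def coset_product_well_defined(H, G):
    """Check that (aH)(bH) = (ab)H does not depend on the representatives chosen."""
    for a in G:
        for b in G:
            target = left_coset(compose(a, b), H)
            for a2 in left_coset(a, H):
                for b2 in left_coset(b, H):
                    if left_coset(compose(a2, b2), H) != target:
                        return False
    return True

G = list(permutations(range(3)))       # all six permutations of {0, 1, 2}
e = (0, 1, 2)
H = {e, (1, 0, 2)}                     # generated by one transposition: not normal
N = {e, (1, 2, 0), (2, 0, 1)}          # the three rotations: normal

print(is_normal(H, G), coset_product_well_defined(H, G))
print(is_normal(N, G), coset_product_well_defined(N, G))
```

For `N`, the quotient's arrows behave: whichever element of a coset you start from, following the same arrow lands in the same coset. For `H`, two representatives of the same coset can lead to different cosets, which is the "strange thing" that breaks the quotient.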

Comment by turntrout on Reframing Impact · 2019-10-22T15:37:11.792Z · score: 2 (1 votes) · LW · GW

Accessible in what way? I’m planning to put up a full text version at the end.

Comment by turntrout on Invisible Choices, Made by Default · 2019-10-20T20:52:39.385Z · score: 4 (3 votes) · LW · GW

Starting with Duolingo might be OK, but it’s very bare-bones and unlike the real deal of having thoughts in a language and then expressing yourself. In fact, I recommend not using it after you have a rudimentary grasp of the language.

Try entering full sentence-to-sentence cards in Anki. At first, you won’t be able to write good sentences on your own, so use a sentence-search service for that (I don’t remember what they’re called, but at the least you can use the example wordreference sentences). Over the course of a year, I input all the thoughts I’d had that day which I couldn’t translate. By learning to produce words in-context instead of memorizing the multilingual dictionary, I began to run out of additions to the Anki deck. That exercise gave me near-fluency, which has persisted to this day.

So, I think Anki is suited for nearly everything except oral skills. Plus, spaced repetition saves time.

Also, you might try an app like Tandem. If you’re single, try learning to be interesting in the foreign language - it’s naturally motivating, and both of you can chalk up any awkwardness to mistranslation. :)