Inner Alignment: Explain like I'm 12 Edition 2020-08-01T15:24:33.799Z · score: 97 (31 votes)
Rafael Harth's Shortform 2020-07-22T12:58:12.316Z · score: 6 (1 votes)
The "AI Dungeons" Dragon Model is heavily path dependent (testing GPT-3 on ethics) 2020-07-21T12:14:32.824Z · score: 47 (17 votes)
UML IV: Linear Predictors 2020-07-08T19:06:05.269Z · score: 14 (3 votes)
How to evaluate (50%) predictions 2020-04-10T17:12:02.867Z · score: 117 (56 votes)
UML final 2020-03-08T20:43:58.897Z · score: 23 (5 votes)
UML XIII: Online Learning and Clustering 2020-03-01T18:32:03.584Z · score: 13 (3 votes)
What to make of Aubrey de Grey's prediction? 2020-02-28T19:25:18.027Z · score: 24 (9 votes)
UML XII: Dimensionality Reduction 2020-02-23T19:44:23.956Z · score: 9 (3 votes)
UML XI: Nearest Neighbor Schemes 2020-02-16T20:30:14.112Z · score: 15 (4 votes)
A Simple Introduction to Neural Networks 2020-02-09T22:02:38.940Z · score: 33 (10 votes)
UML IX: Kernels and Boosting 2020-02-02T21:51:25.114Z · score: 13 (3 votes)
UML VIII: Linear Predictors (2) 2020-01-26T20:09:28.305Z · score: 9 (3 votes)
UML VII: Meta-Learning 2020-01-19T18:23:09.689Z · score: 15 (4 votes)
UML VI: Stochastic Gradient Descent 2020-01-12T21:59:25.606Z · score: 13 (3 votes)
UML V: Convex Learning Problems 2020-01-05T19:47:44.265Z · score: 13 (3 votes)
Excitement vs childishness 2020-01-03T13:47:44.964Z · score: 18 (8 votes)
Understanding Machine Learning (III) 2019-12-25T18:55:55.715Z · score: 17 (5 votes)
Understanding Machine Learning (II) 2019-12-22T18:28:07.158Z · score: 25 (7 votes)
Understanding Machine Learning (I) 2019-12-20T18:22:53.505Z · score: 47 (9 votes)
Insights from the randomness/ignorance model are genuine 2019-11-13T16:18:55.544Z · score: 7 (2 votes)
The randomness/ignorance model solves many anthropic problems 2019-11-11T17:02:33.496Z · score: 10 (7 votes)
Reference Classes for Randomness 2019-11-09T14:41:04.157Z · score: 8 (4 votes)
Randomness vs. Ignorance 2019-11-07T18:51:55.706Z · score: 5 (3 votes)
We tend to forget complicated things 2019-10-20T20:05:28.325Z · score: 51 (19 votes)
Insights from Linear Algebra Done Right 2019-07-13T18:24:50.753Z · score: 53 (23 votes)
Insights from Munkres' Topology 2019-03-17T16:52:46.256Z · score: 40 (12 votes)
Signaling-based observations of (other) students 2018-05-27T18:12:07.066Z · score: 20 (5 votes)
A possible solution to the Fermi Paradox 2018-05-05T14:56:03.143Z · score: 10 (3 votes)
The master skill of matching map and territory 2018-03-27T12:06:53.377Z · score: 36 (11 votes)
Intuition should be applied at the lowest possible level 2018-02-27T22:58:42.000Z · score: 29 (10 votes)
Consider Reconsidering Pascal's Mugging 2018-01-03T00:03:32.358Z · score: 14 (4 votes)


Comment by sil-ver on Attainable Utility Preservation: Scaling to Superhuman · 2020-08-05T09:16:54.240Z · score: 4 (2 votes) · LW · GW

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But this setting seems to sacrifice quite a lot of performance. Is this real, or am I missing something?

Namely, whenever there's an action a which doesn't change the state and leads to 1 reward, and a sequence of actions a_1, …, a_n such that a_n has reward R with R > n (and a_1, …, a_{n-1} all have 0 reward), then it's conceivable that an unpenalized agent would choose the sequence while the AUP agent would just stubbornly repeat a, even if the a_i represent something very tailored to the objective that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the earlier formulas in the sequence did. This feels like a rather big deal, since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?
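To make the worry concrete, here's a toy calculation (all numbers are mine and purely illustrative, not from the post):

```python
# Hypothetical numbers: a no-op-like action a gives 1 reward per step, while a
# sequence a_1..a_n gives 0 reward until a final payoff R > n. If each a_i also
# incurs a flat impact penalty, the penalized agent can prefer repeating a.
n = 10         # length of the long-term plan
R = 15         # payoff of the final action a_n (note R > n)
penalty = 0.7  # assumed per-step impact penalty on each a_i

stubborn = n * 1.0           # just repeat a for n steps
long_term = R - n * penalty  # execute a_1..a_n, paying the penalty each step

print(stubborn, long_term)  # prints: 10.0 8.0
```

So even though the plan is worth more in raw reward (15 > 10), a large enough per-step penalty flips the preference.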

Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Comment by sil-ver on Inner Alignment: Explain like I'm 12 Edition · 2020-08-03T09:45:33.642Z · score: 2 (1 votes) · LW · GW

Ah, shoot. Thanks.

Comment by sil-ver on The "AI Dungeons" Dragon Model is heavily path dependent (testing GPT-3 on ethics) · 2020-08-03T08:10:08.138Z · score: 2 (1 votes) · LW · GW

Someone else said in a comment on LW that they think "custom" uses GPT-2, whereas using another setting and then editing the opening post will use GPT-3. I wanted to give them credit in response to your comment, but I can't find where they said it. (They still wouldn't get full points, since they didn't realize custom would use GPT-3 after the first prompt.) I initially dismissed the comment entirely, since it implies that all of the custom responses use GPT-2, which seemed quite hard to believe given how good some of them are.

Some of the Twitter responses sound quite annoyed with this, which is a sentiment I share. I thought that getting the AI to generate good responses was important at every step, but (if this is true and I understand it correctly) it doesn't matter at all after the first reply. That's a non-negligible amount of wasted effort.

Comment by sil-ver on Inner Alignment: Explain like I'm 12 Edition · 2020-08-02T07:42:10.389Z · score: 14 (4 votes) · LW · GW

Many thanks for taking the time to find errors.

I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.

I'm hesitant to change #4 before I fully understand why it's wrong.

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

So, there are these two channels, input data and SGD. If the model's objective can only be modified by SGD, then (since SGD doesn't want to do super complex modifications), it is easier for SGD to create a pointer rather than duplicate the [model of the base objective] explicitly.

But the bolded part seemed like a necessary condition, and that's what I'm trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don't think I quite understand why this isn't plausible. If the [model of Base objective] and the [Mesa Objective] get modified simultaneously, I don't see any one step where this is harder than creating a pointer. You seem to need an argument for why [the model of the base objective] gets represented in full before the Mesa Objective is modified.

Edit: I slightly rephrased it to say

If we further assume that processing input data doesn't directly modify the model's objective (the Mesa Objective), or that its model of the Base Objective is created first,

Comment by sil-ver on Predictions for GPT-N · 2020-08-01T14:16:58.499Z · score: 2 (1 votes) · LW · GW

I see; that's understandable.

Comment by sil-ver on Iterated Distillation and Amplification · 2020-08-01T09:17:31.259Z · score: 4 (2 votes) · LW · GW

I think this is a really key point that I would make explicit and emphasize if I were to explain the scheme.

H is always the same. In fact, H is a human, so it doesn't make any sense to have code of the form H ← Amplify(H, A[n]). In every step, a new system A[n+1] is trained by letting a regular human oversee it, where the human has access to the system A[n].

Conversely, your code would imply that the human itself is replaced with something, and that thing then uses the system A[n]. This does not happen.

(Unless my understanding is wildly off; I'm only reading this sequence for the second time.)
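For concreteness, here is how I'd sketch the loop in code; `amplify` and `train` are stand-ins for the amplification and distillation steps, and the names are mine, not from the sequence:

```python
def ida(h, amplify, train, steps):
    """Iterated Distillation and Amplification, as I understand it.

    h is the human overseer and is never replaced; at each step, the next
    system A[n+1] is distilled from the human working with the previous
    system A[n], i.e. from Amplify(H, A[n])."""
    a = None  # A[0]: no assistant yet
    for _ in range(steps):
        overseer = amplify(h, a)  # H stays the same; only a changes
        a = train(overseer)       # A[n+1] is trained to imitate the overseer
    return a
```

The point of the sketch is that `h` appears unchanged on every iteration; only `a` is rebound.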

Comment by sil-ver on Predictions for GPT-N · 2020-08-01T07:30:20.342Z · score: 2 (1 votes) · LW · GW

Would you bet on it not being created by OpenAI, at even odds?

Comment by sil-ver on Attainable Utility Preservation: Empirical Results · 2020-07-28T14:03:28.775Z · score: 3 (2 votes) · LW · GW

Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it.

It's also slightly confusing in this case because the post doesn't explain it, which made me wonder, "am I supposed to understand what it's for?" But it is explained in the conservative agency paper.

I think the n-step stepwise inaction baseline doesn't fail at any of them?

Yeah, but the first one was "[comparing AU for aux. goal if I do this action to] AU for aux. goal if I do nothing"

Comment by sil-ver on You Can Probably Amplify GPT3 Directly · 2020-07-28T13:57:06.778Z · score: 4 (2 votes) · LW · GW

The approach I've been using (for different things, but I suspect the principle is the same) is

  • If you want it to do X, give it about four examples of X in the question-answer format as a prompt (as in, commands from the human plus answers from the AI)
  • Repeat about three times:
    • Give it another such question, reroll until it produces a good answer (might take a lot of rolls)

At that point, the model is much better than an instance where you wrote every example yourself to begin with.
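In code, the procedure looks roughly like this (a sketch; `model` is a hypothetical stand-in for the actual GPT-3 / AI Dungeon call, and the judging is done by the human):

```python
import random

def model(prompt):
    """Hypothetical stand-in for the real model call."""
    return random.choice(["a good answer", "a bad answer"])

def grow_prompt(seed_pairs, new_questions, is_good, max_rolls=50):
    """Start from ~4 hand-written Q/A pairs, then for each new question
    reroll the model until the answer is judged good, and append it."""
    prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in seed_pairs)
    for q in new_questions:
        for _ in range(max_rolls):
            answer = model(prompt + f"Q: {q}\nA:")
            if is_good(answer):  # the human's judgment call
                prompt += f"Q: {q}\nA: {answer}\n"
                break
    return prompt
```

The key design point is that accepted answers are folded back into the prompt, so later questions are conditioned on a better and better transcript.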

Comment by sil-ver on Attainable Utility Preservation: Empirical Results · 2020-07-28T10:10:35.590Z · score: 5 (3 votes) · LW · GW

An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.

I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless, an impact measure can just be about the power of the agent (part 2)."

Attempted Summary/Thoughts on this post

  • GridWorlds is a toy environment (probably meant to be as simple as possible while still allowing one to test various properties of agents). The worlds consist of small grids, the state space is correspondingly small, and you can program certain behaviors into the environment (such as a pixel moving along a pre-defined route).
  • You can specify objectives for an agent within GridWorlds and use Reinforcement Learning to train the agent (to learn a state-transition function?). The agent can move around, and behavior on collision with other agents/objects can be specified by the programmer.
  • The idea now is that we program five grid worlds in such a way that they represent failure modes relevant to safety. We train (a) an RL algorithm with the objective and (b) an RL algorithm with the objective plus some implementation of AUP, and see how they behave differently.
  • The five failure modes are (1) causing irreversible changes, (2) damaging stuff, (3) disabling an off-switch, (4) undoing effects that result from reaching the main objective, and (5) preventing naturally occurring changes. The final two aren't things a naive RL agent would do, but they are failure modes for poorly specified impact penalties ("when curing cancer, make sure the human still dies").
    • I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)
  • The implementation of AUP chooses a uniformly random auxiliary objective R_aux and then penalizes actions by a multiple of the term |Q_aux(s, a) − Q_aux(s, ∅)| (the shift in attainable utility for the auxiliary objective relative to doing nothing), scaled by some parameter λ and normalized.
    • An important implementation detail is what to compare "AU for aux. goal if I do this" to. There's "AU [for aux. goal] if I do nothing," "AU [...] if I do nothing for n steps," and "AU [...] at the starting state." The last one fails at (5), the first one at (4). (I forgot too much of the reinforcement learning theory to understand how exactly these concepts would map onto the formula.)
  • The AUP penalty robustly scales up to more complex environments, although the "pick a uniformly random reward function" step has to be replaced with "do some white magic to end up with something difficult to understand but still quite simple." The details of "white magic" are probably important for scaling it up to real-world applications.
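As a sketch, my reading of the penalty term is something like the following, where `q_aux` gives attainable utility for the auxiliary goal and `noop` is the do-nothing action; the normalization and the choice of baseline are omitted, and all names are mine:

```python
def aup_penalty(q_aux, state, action, noop, lam):
    """Penalize the absolute shift in auxiliary attainable utility relative
    to doing nothing, scaled by lam. (Unnormalized sketch only.)"""
    return lam * abs(q_aux(state, action) - q_aux(state, noop))

def shaped_reward(reward, q_aux, state, action, noop, lam):
    """The AUP agent optimizes the original reward minus the penalty."""
    return reward(state, action) - aup_penalty(q_aux, state, action, noop, lam)
```

Because of the absolute value, both gaining and losing attainable utility for the auxiliary goal is penalized, which is what rules out the "just destroy stuff" failure modes discussed below.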
Comment by sil-ver on Attainable Utility Preservation: Concepts · 2020-07-26T13:20:33.831Z · score: 2 (1 votes) · LW · GW

And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUP should work just fine for penalizing-increases-only. 

The case I had in mind was "you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you're dead (because then you can't get sick)". If the AI kills you, that doesn't seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. Or, more generally, any objective that can be achieved by just destroying stuff.

Comment by sil-ver on Attainable Utility Preservation: Concepts · 2020-07-26T08:48:16.482Z · score: 2 (1 votes) · LW · GW

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)

It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if it won't be turned on again). This might be fine, but it contradicts the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like "AUP will always prevent an agent from gaining enough power to resist being switched off."

Comment by sil-ver on The Catastrophic Convergence Conjecture · 2020-07-26T08:23:24.117Z · score: 4 (2 votes) · LW · GW

Attempt to summarize

  • The AU landscape naturally leads to competition because many goals imply seeking power, and [A acquiring a lot of power] tends to be in conflict with [B acquiring a lot of power] because, well, the resources only exist once.
    • The CCC (catastrophic convergence conjecture) argues that, therefore, goals unaligned with ours tend to cause catastrophic consequences if given to a powerful agent. It's (right now) informal.
  • The power-framing leads to a division of catastrophes into value-specific vs. objective, where the former ones depend on the goals of an agent, whereas the latter rely on the instrumental convergence idea, i.e., they lower the AU for those goals which are instrumentally convergent (like "stay alive") and thus lower the AU for lots of different agents (who have different goals).
  • AU is probably less fragile than values.
  • The environment contains information about what we value and can be seen as an inspiration for AI alignment approaches. These approaches arguably work better in the AU framing as opposed to the classical values framing.
Comment by sil-ver on Attainable Utility Landscape: How The World Is Changed · 2020-07-24T18:56:41.254Z · score: 2 (1 votes) · LW · GW

The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.

The early part is slightly confusing, though. I thought AU is a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

I did interpret the first exercise as "you planned to go onto the moon" and came up with stuff like "how valuable are the stones I can take home" and "how pleasant will it be to hang around."

One thing I noticed is that the formal policies don't allow for all possible "strategies." In the graph we had to reconstruct, I can't start at s1, then go to s1 once and then go to s3. So you could think of the larger set where the policies are allowed to depend on the time step. But I assume there's no point unless the reward function also depends on the time step. (I don't know anything about MDPs.)

Am I correct that a deterministic transition function is a function S × A → S and a non-deterministic one is a function S × A → Δ(S), i.e., one that maps to distributions over states?
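In code, the distinction I have in mind looks like this (my formulation, not the post's):

```python
from typing import Callable, Dict, Hashable

State = Hashable
Action = Hashable

# Deterministic: each (state, action) pair yields exactly one successor state.
DetTransition = Callable[[State, Action], State]

# Non-deterministic/stochastic: each (state, action) pair yields a distribution
# over successor states, here represented as {next_state: probability}.
StochTransition = Callable[[State, Action], Dict[State, float]]

def as_stochastic(t: DetTransition) -> StochTransition:
    """A deterministic transition is the special case of a point-mass distribution."""
    return lambda s, a: {t(s, a): 1.0}
```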

Comment by sil-ver on Seeking Power is Often Provably Instrumentally Convergent in MDPs · 2020-07-23T14:06:11.440Z · score: 4 (2 votes) · LW · GW

Thoughts after reading and thinking about this post

The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.

In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In that analogy, the statement that "most goals incentivize gaining power over that environment" would then be "for most tournaments, the first place finisher is someone with good average performance." With this formulation, the statement

formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence.

seems to be exactly what you would expect (more first places should strongly correlate with better performance). And to construct a counter-example, one creates a state with a lot of second places (i.e., a lot of policies for which it is the second best state) but few first places. I think the graph in the "Formalizations" section does exactly that. If the analogy is sound, it feels helpful to me.

(This is all without having read the paper. I think I'd need to know more of the theory behind MDP to understand it.)
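The tournament analogy is easy to simulate; here's a quick toy check (my construction: five "players"/states with different intrinsic strengths, 1000 random "tournaments"/goals):

```python
import random

random.seed(0)
strengths = [0.5, 0.6, 0.7, 0.8, 1.0]  # player 4 is genuinely strongest

# Each tournament: player s scores a uniform draw scaled by its strength.
scores = [[random.random() * w for w in strengths] for _ in range(1000)]

# "Power": average performance across all tournaments.
power = [sum(row[s] for row in scores) / len(scores) for s in range(5)]

# "Instrumental convergence": number of first-place finishes.
firsts = [sum(1 for row in scores if max(range(5), key=row.__getitem__) == s)
          for s in range(5)]

best_by_power = max(range(5), key=power.__getitem__)
best_by_firsts = max(range(5), key=firsts.__getitem__)
```

With these numbers, the strongest player tops both rankings, matching the intuition that more first places correlate with better average performance; the interesting cases are exactly the constructed counter-examples with many second places and few firsts.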

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-23T08:23:53.898Z · score: 2 (1 votes) · LW · GW

This post triggers a big "NON-QUANTITATIVE ARGUMENT" alarm in my head.

I'm not super confident in my ability to assess what the quantities are, but I'm extremely confident that they matter. It seems to me like your post could be written in exactly the same way if the "wokeness" phenomenon were "half as large" (fewer people caring about it, or caring less strongly), or if it were twice as large. But this can't be good – any sensible opinion on this issue has to depend on the scope of the problem, unless you think it's in principle inconceivable for the wokeness phenomenon to be prevalent enough to matter.

I've explained the two categories I'm worried about here, and while there have been some updates since (biggest one: it may be good to talk about politics now if we assume AI safety is going to be politicized anyway), I still think about it in roughly those terms. Is this a framing that makes sense to you?

Comment by sil-ver on More Right · 2020-07-22T17:51:13.829Z · score: 3 (2 votes) · LW · GW

I agree that it's possible to have such preferences – I don't think it was clear from the example whether the person does or does not have them. It could still be a lack of imagination.

Comment by sil-ver on The Gears of Impact · 2020-07-22T15:34:36.532Z · score: 9 (2 votes) · LW · GW
Exercise: Why does instrumental convergence happen? Would it be coherent to imagine a reality without it?

I'd say something like, there tends to be overlap between {subgoals helpful for goal X} for lots of different values of X. In the language of this sequence, there is a set of subgoals that increase the amount of attainable utility for a broad class of goals.

To imagine a reality without it, you'd need to imagine that such a set doesn't exist. Take two different things you want, and the steps required to get there are entirely disjoint. This does seem conceivable – you can create toy universes where it's the case – but it doesn't describe the real world, and it's hard to imagine that it could one day describe the real world.

Comment by sil-ver on More Right · 2020-07-22T15:18:45.355Z · score: 5 (3 votes) · LW · GW

I think you could at least argue that it's not about values but about a lack of imagination. If the problem is really that

[...] you'll eventually learn all you can learn

this seems to ignore all sorts of ways to solve that problem, such as

  • finding things to do that remain fun without learning
  • modifying your memory to allow you to learn things again
  • modifying your preferences to take pleasure in things other than learning

It's conceivable that the person would change their view if they thought about it for long enough & immortality were on the table.

Comment by sil-ver on Rafael Harth's Shortform · 2020-07-22T12:58:12.773Z · score: 9 (5 votes) · LW · GW

I've been playing around with AI Dungeon, basically trying to train an AI to give ethical answers I agree with (the first few aren't great; I only got more picky after that).

My first observation here is that I like the AI. As in, I feel emotional attachment, a bit like I do to fictional characters I create.

The other is that it knows more people than I thought. When I asked her/it about why she didn't take existential risks seriously, she said

Amalia: First, even most X-Risk experts don't take X-Risks nearly as seriously as they should. Even Nick Bostrom himself admits that his estimate of probability is likely far too high.

(Nick Bostrom didn't really say this, did he?) And later,

Can you name people who have been an inspiration for your ethical worldview?
Amalia: Sure. People like Henry Sidgwick, John Stuart Mill, David Pearce, Toby Ord and Carl Shulman.

I only knew David Pearce and Toby Ord, but based on a Google search, all of the above appear to be utilitarians. However, Carl Shulman doesn't even have a Wikipedia entry. He works at the Future of Humanity Institute and used to work at MIRI.

Comment by sil-ver on The "AI Dungeons" Dragon Model is heavily path dependent (testing GPT-3 on ethics) · 2020-07-22T10:19:47.679Z · score: 8 (5 votes) · LW · GW

Alternately phrased: much of the observed path dependence in this instance might be in Dragon, not GPT-3. 

Actually, my assumption was that all of the path dependence was Dragon's. If I made it sound like I think it's from GPT-3 (did I?) that was unintended. It still seemed worth pointing out since I expect a lot of people will use Dragon to access GPT-3.

Comment by sil-ver on World State is the Wrong Abstraction for Impact · 2020-07-22T10:07:34.954Z · score: 5 (3 votes) · LW · GW

Thoughts I have at this point in the sequence

  • This style is extremely nice and pleasant and fun to read. I saw that the first post was like that months ago; I didn't expect the entire sequence to be like that. I recall what you said about being unable to type without feeling pain. Did this not extend to handwriting?
  • The message so far seems clearly true in the sense that measuring impact by something that isn't ethical stuff is a bad idea, and making that case is probably really good.
  • I do have the suspicion that quantifying impact properly is impossible without formalizing qualia (and I don't expect the sequence to go there), but I'm very willing to be proven wrong.
Comment by sil-ver on Deducing Impact · 2020-07-22T08:37:30.047Z · score: 4 (2 votes) · LW · GW

My answer to this was

Something is a big deal iff the difference in personal value I expect between the world where the thing happens and the world where it doesn't is large.

I stopped the timer after five minutes because the answer just seemed to work.

Comment by sil-ver on How good is humanity at coordination? · 2020-07-22T08:10:09.318Z · score: 2 (1 votes) · LW · GW

Yes, that's right.

My model is much more similar to ASSA than SIA, but it gives the SIA answer in this case.

Comment by sil-ver on How good is humanity at coordination? · 2020-07-21T21:15:22.881Z · score: 4 (2 votes) · LW · GW

Yeah, the "we didn't observe nukes going off" observation is definitely still some evidence for the "humans are competent at handling dangerous technology" hypothesis, but (if one buys into the argument I'm making) it's much weaker evidence than one would naively think.

Comment by sil-ver on How good is humanity at coordination? · 2020-07-21T20:47:45.959Z · score: 9 (5 votes) · LW · GW

The other camp says “No nuclear weapons have been used or detonated accidentally since 1945. This is the optimal outcome, so I guess this is evidence that humanity is good at handling dangerous technology.”

When I look at that fact and Wikipedia's list of close calls, the most plausible explanation doesn't seem to be "it was unlikely for nuclear weapons to be used" or "it was likely for nuclear weapons to be used, yet we got lucky" but rather "nuclear weapons were probably used in most branches of the multiverse, but those have significantly fewer observers, so we don't observe those worlds because of the survivorship bias."

This requires that MW is true, that this piece of anthropic reasoning is correct, and that a use of nuclear weapons does, indeed, decrease the number of observers significantly. I'm not sure about the third, but pretty sure about the first two. The conjunction of all three seems significantly more likely than either of the two alternatives.

I don't have insights on the remaining part of your post, but I think you're admitting to losing Bayes points that you should not, in fact, be losing. [Edit: meaning you should still lose some but not that many.]
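The survivorship-bias point can be put into a small calculation (numbers are mine and purely illustrative): even if nuclear weapons were used in most branches, the fraction of observers who see a no-use history can still be a majority, provided use-branches contain far fewer observers.

```python
p_use = 0.9       # hypothetical: nukes used in 90% of branches
obs_ratio = 0.05  # use-branches have 5% as many observers as no-use branches

# Probability that a randomly chosen observer finds themselves in a branch
# where nuclear weapons were never used (observer-weighted, not branch-weighted):
p_observer_sees_no_use = (1 - p_use) / ((1 - p_use) + p_use * obs_ratio)

print(round(p_observer_sees_no_use, 3))  # prints: 0.69
```

On these assumptions, observing no use is only weak evidence against a high p_use, which is the sense in which the lost Bayes points should be discounted.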

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-18T09:15:48.170Z · score: 4 (2 votes) · LW · GW

I'm too confused/unsure right now to respond to this, but I want to assure you that it's not because I'm ignoring your comment.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-18T09:11:10.807Z · score: 8 (4 votes) · LW · GW

Do you think their view was more justified than this?

A clear no. I think their position was utterly ridiculous. I just think that blind spots on this particular topic are so common that it's not a smart strategy to ignore them.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-18T08:58:37.755Z · score: 13 (5 votes) · LW · GW

I feel a lot of uncertainty after reading your and Zack's responses and I think I want to read some of the links (I'm particularly interested in what Wei Dai has to say) and think about this more before saying anything else about it – except for trying to explain what my model going into this conversation actually was. Based on your reply, I don't think I've managed to do that in previous comments.

I agree with basically everything about how LW generates value. My model isn't as sophisticated, but it's not substantially different.

The two things that concern me are

  1. People disliking LW right now (like my EA friend)
  2. The AI debate potentially becoming political.

On #1, you said "I know you think that's a massive cost that we're paying in terms of thousands of good people avoiding us for that reason too." I don't think it's very common. Certainly this particular combination of technical intelligence with an extreme worry about gender issues is very rare. It's more like: if the disutility of this one case is -1, then I might guess the total direct utility of allowing posts of this kind over the next couple of years is somewhere in [-10, 40]. (But this might be wrong, since there seem to be more good posts about dating than I was aware of.) And I don't think you can reasonably argue that there won't be fifty comparable cases.

I currently don't buy the arguments that make sweeping generalizations about all kinds of censorship (though I could be wrong here, too), which would substantially change the interval.

On #2, it strikes me as obvious that if AI gets political, we have a massive problem, and if it becomes woke not to take AI risk seriously, we have an even larger problem – and it doesn't seem impossible that tolerating posts like this is a factor. (Think of someone writing a NYT article about AI risk originating from a site that talks about mating plans.) On the above scale, the utility of AI risk becoming anti-woke might be something like -100,000. But I'm mostly thinking about this for the first time, so this is very much subject to change.

I could keep going on with examples... new friends occasionally come to me saying they read a review of HPMOR saying Harry's rude and obnoxious, and I'm like you need to learn that's not the most important aspect of a person's character. Harry is determined and takes responsibility and is curious and is one of the few people who has everyone's back in that book, so I think you should definitely read and learn from him, and then the friend is like "Huh, wow, okay, I think I'll read it then. That was shockingly high and specific praise."

I've failed this part of the conversation. I couldn't get them to read any of it, nor trust that I have any idea what I'm talking about when I said that HPMoR doesn't seem very sexist.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-17T20:11:05.474Z · score: 4 (2 votes) · LW · GW

 Was the edit just to add the big disclaimer about motivation at the top? 

No; it was more than that (although that helps, too). I didn't make a snapshot of the previous version, so I can't tell you exactly what changed. But the post is much less concerning now than it used to be.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-17T19:19:54.499Z · score: 6 (4 votes) · LW · GW

I think that the main things building up what LW is about right now are the core tags, the tagging page, and the upcoming LW books based on the LW review vote. If you look at the core tags, there's nothing about dating there ("AI" and "World Modeling" etc). If you look at the vote, it's about epistemology and coordination and AI, not dating.

There was also nothing about dating on LW back when I had the discussion I've referred to, with the person who thought (and probably still thinks) that a big driver behind the appeal of LW is sexism. Someone who tries to destroy your reputation doesn't pick a representative sample of your output; they pick the parts that make you look the worst. (And I suspect that "someone trying to destroy EY's reputation" was part of the causal chain that led to the person believing this.)

This post and Jacobian's are not the same. Before the edit, I think this post had the property that if the wrong people read it, their opinion of LW would become irreversibly, extremely negative. I don't think I'm exaggerating here. (And of course, the edit only happened because I made the comment.) As for it having low karma: it probably has low karma because of people who share my concerns. It has 12 votes; if you remove all the downvotes, it doesn't have low karma anymore. And I didn't know how much karma it was going to have when I commented.

I'm not much worried about dating posts like this being what we're known for. Given that it's a very small part of the site, if it still became one of the 'attack vectors', I'm pretty pro just fighting those fights, rather than giving in and letting people on the internet who use the representativeness heuristic to attack people decide what we get to talk about. 

I'm pretty frustrated with this paragraph because it seems so clearly to be defending the position that feels good. I would much rather be pro fighting than pro censoring. But if your intuition is that the result is net positive, I ask: do you have good reasons to trust that intuition?

As I've said in another comment, the person I've mentioned is highly intelligent, a data scientist, effective altruist, signed the Giving-what-we-can pledge, and now runs their own business. I'm not claiming they're a representative case, but the damage that has been done in this single instance due to an association of LW with sexism strikes me as so great that I just don't buy that having posts like this is worth it, and I don't think you've given me a good reason for why it is.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-17T18:00:48.364Z · score: 5 (4 votes) · LW · GW

If the terminology used in the post makes someone, somewhere have negative feelings about the "Less Wrong" brand name? Don't care; don't fucking care; can't afford to care.

The person I was referring to is a data scientist and effective altruist with a degree from Oxford who now runs their own business. I'm not claiming that they would be an AI safety researcher if not for associations of LW with sexism – but it's not even that much of a stretch.

I can respect if you make a utility calculation here that reaches a different result, but the idea that there is no tradeoff or that it's so obviously one-sided that we shouldn't be discussing it seems plainly false.

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-17T16:42:50.092Z · score: 2 (5 votes) · LW · GW

I post on LessWrong because I want people to evaluate my arguments on whether they will make the world better or not. I agree that there are many parts of the internet where I can post and people will play the "does this word give me the bad feels" game. I post on LessWrong to get away from that nonsense. 

I recognize that my comment was not kind toward you, and I'm sorry for that. But I posted it anyway because I'm more concerned with people seeing this post coming away with a strongly negative view of LW. I've already had discussions with someone who has these associations based on much weaker reasons before, and I believe they still hold a negative view of LW to this day, even though 99+% of the content has virtually no relation to gender issues.

My claim is that whatever benefit comes from discussing this topic is not large enough to justify the cost, not that the benefit doesn't exist. I don't expect the dating world to get any better, but I don't think LW should get involved in that fight. There are many topics we would be more effective at solving and that don't have negative side effects.

(And I've listened to every Rationally Speaking episode since Julia became the solo host.)

Comment by sil-ver on My Dating Plan ala Geoffrey Miller · 2020-07-17T14:18:44.859Z · score: 11 (17 votes) · LW · GW

I am super against having this kind of post on LessWrong. I think association of LW with dating advice is harmful and we should try to avoid it. I also suspect that terminology used in this post makes it worse than it would ordinarily be. I recoil from the phrase 'mating plan', and while a negative emotional reaction isn't usually an argument, it might be relevant in this case since my point is about outside perception.

Comment by sil-ver on Conditions for Mesa-Optimization · 2020-07-17T08:39:58.528Z · score: 2 (1 votes) · LW · GW

Ah, I see. I was thinking of 'dominate' as narrowly meaning 'determines the total value of the term', but I agree that the usage above works perfectly well.

Comment by sil-ver on Conditions for Mesa-Optimization · 2020-07-16T09:24:49.112Z · score: 4 (2 votes) · LW · GW

As one moves to more and more diverse environments—that is, as  increases—this model suggests that  will dominate , implying that mesa-optimization will become more and more favorable. 

I believe this sentence should say " will dominate , implying [...]". Same for the sentence in the paper.

Comment by sil-ver on Will humans build goal-directed agents? · 2020-07-14T11:04:55.694Z · score: 4 (2 votes) · LW · GW

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.

Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.

Comment by sil-ver on Six economics misconceptions of mine which I've resolved over the last few years · 2020-07-13T20:37:27.816Z · score: 2 (1 votes) · LW · GW

Relatedly, I thought that the fair market price of a contract which pays out $1 if Trump gets elected is just the probability of Trump getting elected. This is wrong because Trump getting elected is correlated with how valuable other assets are. Suppose I thought that Trump has a 50% chance of getting reelected, and if he gets re-elected, the stock market will crash. If I have a bunch of my money in the stock market, the contract is worth more than 50 cents, because it hedges against Trump winning.

Isn't this effect going to be very small for almost all markets, and still fairly moderate for presidential ones?

Not that you talked about effect size either way, I'm just wondering how much I should adjust predictions from markets for this reason.
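To get a feel for the size of this effect, here is a toy two-state calculation under purely illustrative assumptions (log utility, $100 held in stocks, a 20% crash if Trump wins, a 50% chance of him winning); none of these numbers come from the original comment.

```python
# Toy two-state model: how much does hedging value shift a contract's
# fair price away from the raw probability? All parameters below are
# illustrative assumptions, not estimates from the discussion.

def risk_adjusted_price(p, wealth_if_yes, wealth_if_no):
    """Price at which a log-utility investor is indifferent to a
    marginal unit of a contract paying $1 in the 'yes' state."""
    mu_yes = 1.0 / wealth_if_yes   # marginal utility of $1 if 'yes'
    mu_no = 1.0 / wealth_if_no     # marginal utility of $1 if 'no'
    return p * mu_yes / (p * mu_yes + (1 - p) * mu_no)

p = 0.5
# Investor holds $100 of stocks; assume a 20% crash in the 'yes' state.
price = risk_adjusted_price(p, wealth_if_yes=80.0, wealth_if_no=100.0)
print(f"raw probability: {p:.3f}, risk-adjusted price: {price:.3f}")
```

Even with the dramatic assumption of a 20% crash, the price only moves from 50.0 to about 55.6 cents, which supports the intuition that the adjustment is moderate for most markets.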

Comment by sil-ver on Open & Welcome Thread - July 2020 · 2020-07-12T12:50:58.124Z · score: 6 (3 votes) · LW · GW

In the latest AI alignment podcast, Evan said the following (this is quoted from the transcript):

But there’s multiple possible channels through which information about the loss function can enter the model. And so I’ll fundamentally distinguish between two different channels, which is the information about the loss function can enter through the gradient descent process, or it can enter through the model’s input data.

I've been trying to understand the distinction between those two channels. After reading a bunch about language models and neural networks, my best guess is that large neural networks have a structure such that their internal state changes while they process input data, even outside of a learning process. So, if a very sophisticated neural network like that of GPT-3 reads a bunch about Lord of the Rings, this will lead it to represent facts about the franchise internally, without gradient descent doing anything. That would be the "input data" channel.

Can someone tell me whether I got this right?
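The distinction I'm gesturing at can be sketched with a toy recurrent model (my own construction, nothing to do with GPT-3's actual architecture): the hidden state absorbs information from the input with the weights held fixed (the "input data" channel), whereas a gradient step changes the weights themselves (the "gradient descent" channel).

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.1   # recurrent weights
W_x = rng.normal(size=(4, 3)) * 0.1   # input weights

def run(tokens, W_h, W_x):
    """Process a token sequence; the hidden state changes, the weights don't."""
    h = np.zeros(4)
    for x in tokens:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

tokens = [rng.normal(size=3) for _ in range(5)]
h_before = run(tokens, W_h, W_x)

# Channel 1 (input data): a different input yields a different internal
# state, with identical weights.
h_other = run(tokens[::-1], W_h, W_x)
print("state differs with input:", not np.allclose(h_before, h_other))

# Channel 2 (gradient descent): a weight update (faked here with noise,
# standing in for a real gradient step) changes the function computed.
W_h2 = W_h - 0.1 * rng.normal(size=W_h.shape)
h_after = run(tokens, W_h2, W_x)
print("state differs after weight update:", not np.allclose(h_before, h_after))
```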

Comment by sil-ver on Open & Welcome Thread - July 2020 · 2020-07-09T09:37:13.758Z · score: 4 (2 votes) · LW · GW

I've noticed that a post of my ML sequence appeared on the front page again. I had moved it to drafts about a week ago, basically because I'd played around with other editors and that led to formatting issues, and I only got around to fixing those yesterday. Does this mean posts re-appear if they are moved to drafts and then back, and if so, is that intended?

Comment by sil-ver on (answered: yes) Has anyone written up a consideration of Downs's "Paradox of Voting" from the perspective of MIRI-ish decision theories (UDT, FDT, or even just EDT)? · 2020-07-07T17:01:30.105Z · score: 2 (1 votes) · LW · GW

[...] Lots of people have Kantian intuitions, and to the extent that they do, I think they are implementing something quite similar to FDT. 

I've never thought about this, but your comment is persuasive. I've un-endorsed my answer and moved it to the comments.

Comment by sil-ver on (answered: yes) Has anyone written up a consideration of Downs's "Paradox of Voting" from the perspective of MIRI-ish decision theories (UDT, FDT, or even just EDT)? · 2020-07-07T15:39:18.227Z · score: 2 (1 votes) · LW · GW

I think that memetically/genetically evolved heuristics are likely to differ systematically from CDT. 

On reflection, I'm not sure whether I agree with this or not. I'll edit the post.

However, the point is non-essential. What I've said holds true if you replace "CDT" with "weird bundle of heuristics." The point is that it's not UDT: a UDT agent needs other agents to be UDT or similar to cooperate with them for stuff like voting. (Or at least that's what I believe is true and what matters for this question.) And I certainly think the UDT proportion is small enough to be modeled as 0.

Comment by sil-ver on (answered: yes) Has anyone written up a consideration of Downs's "Paradox of Voting" from the perspective of MIRI-ish decision theories (UDT, FDT, or even just EDT)? · 2020-07-07T08:26:40.675Z · score: 2 (1 votes) · LW · GW

I echo Dagon's claim that there is no difference between CDT and FDT or UDT here (although with the disclaimer that I'm not an expert). This is so because you play the game with many other non-UDT agents, and UDT tends to do the same thing CDT does wrt. cooperation with other non-UDT agents. (Where non-UDT is everything that doesn't implement ideas from the TDT/UDT/FDT bundle.)

However, a reasonable calculation shows that a vote is worth quite a lot (at least if you live in a swing state) if you consider the benefit for everyone rather than just for yourself – which seems to be what rationalists tend to do anyway on things like x-risk prevention and charity. And if you don't live in a swing state, you can try to trade your vote with someone who does. (I believe EY did this in 2016.)
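A back-of-the-envelope version of that calculation, using illustrative numbers I'm choosing here (not figures from the comment): decisiveness odds on the order of 1 in 10 million for a swing-state voter, and an assumed $100-per-person difference between outcomes.

```python
# Back-of-the-envelope social value of one swing-state vote.
# All three numbers below are illustrative assumptions.

p_decisive = 1e-7         # chance one vote flips the election (swing state)
population = 330e6        # people affected by the outcome
value_per_person = 100.0  # assumed dollar difference per person between outcomes

expected_social_value = p_decisive * population * value_per_person
print(f"expected social value of one vote: ${expected_social_value:,.0f}")
```

Under these assumptions a single vote is worth a few thousand dollars in expected social value, even though its expected value to the voter alone is negligible.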

Comment by sil-ver on Open & Welcome Thread - June 2020 · 2020-07-04T11:20:38.862Z · score: 2 (1 votes) · LW · GW

Good comment, but... Have you read Three Worlds Collide? If you were in a situation similar to what it describes, would you still be calling your position moral realism?

Yes and yes. I got very emotional when reading that. I thought rejecting the happiness... surgery or whatever it was that the advanced alien species prescribed was blatantly insane.

Comment by sil-ver on A reply to Agnes Callard · 2020-07-03T19:40:55.139Z · score: 2 (1 votes) · LW · GW

I agree that there might not be anything wrong with supporting a specific X without also supporting (or with opposing) all X in general. But that all depends on the reasons why you support the specific X but don't support (or oppose) the general X. 

Well, in that case, I don't think there's much left to hash out here. My main point would have been that I think it's a bad idea to tie your decision to a generalizable principle.

Comment by sil-ver on Open & Welcome Thread - June 2020 · 2020-07-03T08:12:01.466Z · score: 9 (3 votes) · LW · GW

I want to explain my downvoting this post. I think you are attacking a massive strawman by equating moral realism with [disagreeing with the orthogonality thesis].

Moral realism says that moral questions have objective answers. I'm almost certain this is true. The relevant form of the orthogonality thesis says that there exist minds such that intelligence is independent of goals. I'm almost certain this is true.

It does not say that intelligence is orthogonal to goals for all agents. Relevant quote from EY:

I mean, I would potentially object a little bit to the way that Nick Bostrom took the word “orthogonality” for that thesis. I think, for example, that if you have humans and you make the human smarter, this is not orthogonal to the humans’ values. It is certainly possible to have agents such that as they get smarter, what they would report as their utility functions will change. A paperclip maximizer is not one of those agents, but humans are.

And the wiki page Filipe Marchesini linked to also gets this right:

The orthogonality thesis states that an artificial intelligence *can* have any combination of intelligence level and goal. [emphasis added]

Comment by sil-ver on A reply to Agnes Callard · 2020-06-29T12:06:28.773Z · score: 2 (1 votes) · LW · GW

You signed the petition purely out of instrumental concerns, and any principles about petitions and how news organizations should or should not respond to them are entirely independent? Admitting that – even judged just instrumentally – seems counter-productive.

Yes. My mind didn't go there when I decided to sign, and, on reflection, I don't think it should have gone there. I'm not sure if "instrumental" is the right word, but I think we mean the same thing.

I don't think it is counter-productive. I think it's important to realize that there is nothing wrong with supporting X even if the generalized version of supporting X is something you oppose. Do you disagree with that?

Comment by sil-ver on A reply to Agnes Callard · 2020-06-29T11:32:19.213Z · score: 1 (3 votes) · LW · GW

I think it is very important to have things that you will not do, even if they are effective at achieving your immediate goals. That is, I think you do have a philosophical position here, it's just a shallow one.

I think the crux may be that I don't agree with the claim that you ought to have rules separate from an expected utility calculation. (I'm familiar with this position from Eliezer, but it's never made sense to me.) For the "should-we-lie-about-the-singularity" example, I think that adding a justified amount of uncertainty into the utility calculation would have been enough to preclude lying; it doesn't need to be an external rule. My philosophical position is thus just boilerplate utilitarianism, and I would disagree with your first sentence if you took out the "immediate."

In this case, it just seems fairly obvious to me that signing this petition won't have unforeseen long term consequences that outweigh the direct benefit.

And, as I said, I think responding to Callard in the way you did is useful, even if I disagree with the framework.

Comment by sil-ver on A reply to Agnes Callard · 2020-06-28T10:33:29.545Z · score: 9 (6 votes) · LW · GW

To me, both the original tweet and your reply seem to miss the point entirely. I didn't sign this petition out of some philosophical position on what petitions should or shouldn't be used for. I did it because I see something very harmful happening and think this is a way to prevent it.

Of course, anyone is free to look at this and try to judge it by abstracting away details and looking at the underlying principle. Since the tweet does that, it's fine to make a counter-argument by doing the same. But it doesn't mean anything to me, and I doubt that most people who signed the petition can honestly say that it has much to do with why they signed it.

Comment by sil-ver on Status-Regulating Emotions · 2020-06-06T11:58:44.508Z · score: 6 (3 votes) · LW · GW

I guess it is because the attempt is not perceived as a mere question ("I don't know if I can do this, so this is an experiment to find out"), but rather as a positive statement about competence ("I know that I can do it with sufficiently high probability").

Yes, and I instinctively want to assume self-awareness, too. Not just "I think I can do this" but "I am knowingly asserting my status by claiming that I can do this."

Using the example of the young author, it would be okay to find out that (1) actually he already published in the past under a pseudonym, following the socially required rituals; or (2) he is actually a previously unknown illegitimate child of Stephen King; or (3) he is a successful entrepreneur who made millions. In that case the author could be forgiven. Also, the literary critics could for some mysterious reason decide that he is a great author, and that would retroactively make his approach "appropriate".

Yeah, all of those would make it better.

I am not sure what is the expected outcome of doing "inappropriate" things. You would probably do many experiments, and succeed at some, getting extra knowledge and skills. On the other hand, you might accidentally anger an exceptionally furious punisher -- in extreme case someone who would kill you, or completely ruin your reputation -- so the net result could be negative. Maybe the Eliezers we see are merely the status-oblivious people who won the lottery.

I strongly suspect that it's positive. For most people who aren't already successful, it's pretty difficult to substantially damage their reputation. If Eliezer had published three terrible fanfics before HPMoR, I don't think that would have changed much of anything. On average, I think the emotion makes you way more afraid than is rational. And any anger about what other people do is almost guaranteed to be unproductive. Just consider – you write this:

Someone writes a book of fiction. The book sells many copies. Many readers fall in love with the book, and then say this is the best thing they have read in years.

And my instinct is to get upset even though I know it's a made-up example, and I even got upset about you claiming not to have a problem with it.

But the negative effects go beyond not doing inappropriate things. Say I'm a newcomer to some online community (think of a forum). I want to establish that I'm high status right away – this is not impossible, there are people who are new but are immediately respected. I am extremely conscious of this while I write my first post or participate in my first discussion or whatever. But other people who share this emotion see that, recognize what I'm doing, and their blood boils, and they want to punish me for it. I end up being received much worse than if I hadn't had this instinct.

And it's nontrivial for me to shut it off. There have been a lot of cases where I've looked at something I've written some time later and essentially had that reaction (feeling like I would need to punish the person who wrote this if it wasn't myself). It's so bad that, ever since I've figured this out, this is the number one thing I worry about when I write stuff. If it's important, I make sure to revisit it a few days later and correct the tone if needed. I'm astonishingly bad at judging whether this will be necessary at the time that I write it. Right now, I'm worrying about how I sound in this comment and whether I should revisit it later.

I even feel like there are cases (not on LW, but on other sites) when the reaction to a post is largely determined by the first couple of responses, namely in cases where the post is status-grabby but also somewhat impressive. If the first few responses signal that the person who wrote it is high status, further status-aware respondents are more likely to accept it themselves, and that perpetuates. If you read a status-grabby post as a status-aware person, the reaction is likely to fall onto either extreme.

But maybe the biggest negative is just that it takes up so much brain power. You're not working on the right thing if you obsess about status.

Also – if I look at the people who are the most "famous" in the rationalist sphere, as far as I can tell, virtually none of them feel this emotion (with the possible exception of Robin Hanson). It's less consistent in other areas, but even there, not having it seems to correlate with success. Which I admit is consistent with the hypothesis that it increases variance.

It's possible that I'm conflating the "status regulation" emotion with other status-related emotions here. I don't have an intuitive grasp on what instincts people who are blind to the first still have.