Measuring Coherence of Policies in Toy Environments 2024-03-18T17:59:08.118Z
Notes from a Prompt Factory 2024-03-10T05:13:39.384Z
Every "Every Bay Area House Party" Bay Area House Party 2024-02-16T18:53:28.567Z
Masterpiece 2024-02-13T23:10:35.376Z
A sketch of acausal trade in practice 2024-02-04T00:32:54.622Z
Succession 2023-12-20T19:25:03.185Z
∀: a story 2023-12-17T22:42:32.857Z
Meditations on Mot 2023-12-04T00:19:19.522Z
The Witness 2023-12-03T22:27:16.248Z
The Soul Key 2023-11-04T17:51:53.176Z
Value systematization: how values become coherent (and misaligned) 2023-10-27T19:06:26.928Z
Techno-humanism is techno-optimism for the 21st century 2023-10-27T18:37:39.776Z
The Gods of Straight Lines 2023-10-14T04:10:50.020Z
Eight Magic Lamps 2023-10-14T04:10:02.040Z
The Witching Hour 2023-10-10T00:19:37.786Z
One: a story 2023-10-10T00:18:31.604Z
Arguments for moral indefinability 2023-09-30T22:40:04.325Z
Alignment Workshop talks 2023-09-28T18:26:30.250Z
Jacob on the Precipice 2023-09-26T21:16:39.590Z
The King and the Golem 2023-09-25T19:51:22.980Z
Drawn Out: a story 2023-07-11T00:08:09.286Z
The virtue of determination 2023-07-10T05:11:00.412Z
You must not fool yourself, and you are the easiest person to fool 2023-07-08T14:05:18.642Z
Fixed Point: a love story 2023-07-08T13:56:54.807Z
Agency begets agency 2023-07-06T13:08:44.318Z
Frames in context 2023-07-03T00:38:52.078Z
Meta-rationality and frames 2023-07-03T00:33:20.355Z
Man in the Arena 2023-06-26T21:57:45.353Z
The ones who endure 2023-06-16T14:40:09.623Z
Cultivate an obsession with the object level 2023-06-07T01:39:54.778Z
The ants and the grasshopper 2023-06-04T22:00:04.577Z
Coercion is an adaptation to scarcity; trust is an adaptation to abundance 2023-05-23T18:14:19.117Z
Self-leadership and self-love dissolve anger and trauma 2023-05-22T22:30:06.650Z
Trust develops gradually via making bids and setting boundaries 2023-05-19T22:16:38.483Z
Resolving internal conflicts requires listening to what parts want 2023-05-19T00:04:20.451Z
Conflicts between emotional schemas often involve internal coercion 2023-05-17T10:02:50.860Z
We learn long-lasting strategies to protect ourselves from danger and rejection 2023-05-16T16:36:08.398Z
Judgments often smuggle in implicit standards 2023-05-15T18:50:07.781Z
From fear to excitement 2023-05-15T06:23:18.656Z
Clarifying and predicting AGI 2023-05-04T15:55:26.283Z
AGI safety career advice 2023-05-02T07:36:09.044Z
Communicating effectively under Knightian norms 2023-04-03T22:39:58.350Z
Policy discussions follow strong contextualizing norms 2023-04-01T23:51:36.588Z
AGISF adaptation for in-person groups 2023-01-13T03:24:58.320Z
The Alignment Problem from a Deep Learning Perspective (major rewrite) 2023-01-10T16:06:05.057Z
Applications open for AGI Safety Fundamentals: Alignment Course 2022-12-13T18:31:55.068Z
Alignment 201 curriculum 2022-10-12T18:03:03.454Z
Some conceptual alignment research projects 2022-08-25T22:51:33.478Z
The alignment problem from a deep learning perspective 2022-08-10T22:46:46.752Z
Moral strategies at different capability levels 2022-07-27T18:50:05.366Z


Comment by Richard_Ngo (ricraz) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T22:07:36.804Z · LW · GW

I think the two things that felt most unhealthy were:

  1. The "no forgiveness is ever possible" thing, as you highlight. Almost all talk about ineradicable sin should, IMO, be seen as a powerful psychological attack.
  2. The "our sins" thing feels like an unhealthy form of collective responsibility—you're responsible even if you haven't done anything. Again, very suspect on priors.

Maybe this is more intuitive for rationalists if you imagine a SJW writing a song about how, even millions of years in the future, anyone descended from westerners should still feel guilt about slavery: "Our sins can never be undone. No single death will be forgiven." I think this is the psychological exploit that's screwed up leftism so much over the last decade, and feels very analogous to what's happening in this song.

Comment by Richard_Ngo (ricraz) on Nick Bostrom’s new book, “Deep Utopia”, is out today · 2024-04-03T17:58:14.865Z · LW · GW

Just read this (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.

Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom's style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn't work very well when describing possible utopias, because they'll be so different from today that it's hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.

The stories and vignettes are somewhat esoteric; it's hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to "benefit" a room heater.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-04-03T17:56:32.583Z · LW · GW

Just read Bostrom's Deep Utopia (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.

Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom's style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn't work very well when describing possible utopias, because they'll be so different from today that it's hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.

The stories and vignettes are somewhat esoteric; it's hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to "benefit" a room heater.

Comment by Richard_Ngo (ricraz) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T02:09:52.704Z · LW · GW

Fantastic work :)

Some thoughts on the songs:

  • I'm overall super impressed by how well the styles of the songs fit the content—e.g. the violins in FHI, the British accent in More Dakka, the whisper in We Do Not Wish, the piratical Litany of Tarrrrski, etc.
  • My favorites to listen to are FHI at Oxford, Nihil Supernum, and Litany of Tarrrrski, because they have both messages that resonate a lot and great tunes.
  • IMO Answer to Job is the best-composed on artistic merits, and will have the most widespread appeal. Tune is great, style matches the lyrics really well (particular shout-out to the "or labor or lust" as a well-composed bar). Only change I'd make is changing "upon lotus thrones" to "on lotus thrones" to scan better.
  • Dath Ilan's Song feels... pretty unhealthy, tbh.
  • I thought Prime Factorization was really great until the bit about the car and the number, which felt a bit jarring.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T22:44:50.771Z · LW · GW

If it was the case that there was important public information attached to Scott's full name, then this argument would make sense to me.

In general having someone's actual name public makes it much easier to find out other public information attached to them. E.g. imagine if Scott were involved in shady business dealings under his real name. This is the sort of thing that the NYT wouldn't necessarily discover just by writing the profile of him, but other people could subsequently discover after he was doxxed.

To be clear, btw, I'm not arguing that this doxxing policy is correct, all things considered. Personally I think the benefits of pseudonymity for a healthy ecosystem outweigh the public value of transparency about real names. I'm just arguing that there are policies consistent with the NYT's actions which are fairly reasonable.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T19:39:28.530Z · LW · GW

But it wasn't a cancellation attempt. The issue at hand is whether a policy of doxxing influential people is a good idea. The benefits are transparency about who is influencing society, and in which ways; the harms include the ones you've listed above, about chilling effects.

It's hard to weigh these against each other, but one way you might do so is by following a policy like "doxx people only if they're influential enough that they're probably robust to things like losing their job". The correlation between "influential enough to be newsworthy" and "has many options open to them" isn't perfect, but it's strong enough that this policy seems pretty reasonable to me.

To flip this around, let's consider individuals who are quietly influential in other spheres. For example, I expect there are people who many news editors listen to, when deciding how their editorial policies should work. I expect there are people who many Democrat/Republican staffers listen to, when considering how to shape policy. In general I think transparency about these people would be pretty good for the world. If those people happened to have day jobs which would suffer from that transparency, I would say "Look, you chose to have a bunch of influence, which the world should know about, and I expect you can leverage this influence to end up in a good position somehow even after I run some articles on you. Maybe you're one of the few highly-influential people for whom this happens to not be true, but it seems like a reasonable policy to assume that if someone is actually pretty influential then they'll land on their feet either way." And the fact that this was true for Scott is some evidence that this would be a reasonable policy.

(I also think that taking someone influential who didn't previously have a public profile, and giving them a public profile under their real name, is structurally pretty analogous to doxxing. Many of the costs are the same. In both cases one of the key benefits is allowing people to cross-reference information about that person to get a better picture of who is influencing the world, and how.)

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T02:36:06.187Z · LW · GW

I don't think the NYT thing played much of a role in Scott being better off now. My guess is a small minority of people are subscribed to his Substack because of the NYT thing (the dominant factor is clearly the popularity of his writing).

What credence do you have that he would have started the substack at all without the NYT thing? I don't have much information, but probably less than 80%. The timing sure seems pretty suggestive.

(I'm also curious about the likelihood that he would have started his startup without the NYT thing, but that's less relevant since I don't know whether the startup is actually going well.)

My guess is the NYT thing hurt him quite a bit and made the potential consequences of him saying controversial things a lot worse for him.

Presumably this is true of most previously-low-profile people that the NYT chooses to write about in not-maximally-positive ways, so it's not a reasonable standard to hold them to. And so as a general rule I do think "the amount of adversity that you get when you used to be an influential yet unknown person but suddenly get a single media feature about you" is actually fine to inflict on people. In fact, I'd expect that many (or even most) people in this category will have a worse time of it than Scott—e.g. because they do things that are more politically controversial than Scott, have fewer avenues to make money, etc.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T22:32:14.556Z · LW · GW

I mean, Scott seems to be in a pretty good situation now, in many ways better than before.

And yes, this is consistent with NYT hurting him in expectation.

But one difference between doxxing normal people versus doxxing "influential people" is that influential people typically have enough power to land on their feet when e.g. they lose a job. And so the fact that this has worked out well for Scott (and, seemingly, better than he expected) is some evidence that the NYT was better-calibrated about how influential Scott is than he was.

This seems like an example of the very very prevalent effect that Scott wrote about in "against bravery debates", where everyone thinks their group is less powerful than they actually are. I don't think there's a widely-accepted name for it; I sometimes use "underdog bias". My main diagnosis of the NYT/SSC incident is that rationalists were caught up by underdog bias, even as they leveraged thousands of influential tech people to attack the NYT.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-26T21:21:41.039Z · LW · GW

Since there's been some recent discussion of the SSC/NYT incident (in particular via Zack's post), it seems worth copying over my twitter threads from that time about why I was disappointed by the rationalist community's response to the situation.

I continue to stand by everything I said below.

Thread 1 (6/23/20):

Scott Alexander is the most politically charitable person I know. Him being driven off the internet is terrible. Separately, it is also terrible if we have totally failed to internalize his lessons, and immediately leap to the conclusion that the NYT is being evil or selfish.

Ours is a community built around the long-term value of telling the truth. Are we unable to imagine reasonable disagreement about when the benefits of revealing real names outweigh the harms? Yes, it goes against our norms, but different groups have different norms.

If the extended rationalist/SSC community could cancel the NYT, would we? For planning to doxx Scott? For actually doing so, as a dumb mistake? For doing so, but for principled reasons? Would we give those reasons fair hearing? From what I've seen so far, I suspect not.

I feel very sorry for Scott, and really hope the NYT doesn't doxx him or anyone else. But if you claim to be charitable and openminded, except when confronted by a test that affects your own community, then you're using those words as performative weapons, deliberately or not.

[One more tweet responding to tweets by @balajis and @webdevmason, omitted here.]

Thread 2 (1/21/21):

Scott Alexander is writing again, on a substack blog called Astral Codex Ten! Also, he doxxed himself in the first post. This post seems like solid evidence that many SSC fans dramatically overreacted to the NYT situation.

Scott: "I still think the most likely explanation for what happened was that there was a rule on the books, some departments and editors followed it more slavishly than others, and I had the bad luck to be assigned to a department and editor that followed it a lot. That's all." [I didn't comment on this in the thread, but I intended to highlight the difference between this and the conspiratorial rhetoric that was floating around when he originally took his blog down.]

I am pretty unimpressed by his self-justification: "Suppose Power comes up to you and says hey, I'm gonna kick you in the balls. ... Sometimes you have to be a crazy bastard so people won't walk all over you." Why is doxxing the one thing Scott won't be charitable about?

[In response to @habryka asking what it would mean for Scott to be charitable about this]: Merely to continue applying the standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives. And not to turn this into a bravery debate.

[In response to @benskuhn saying that Scott's response is understandable, since being doxxed nearly prevented him from going into medicine]: On one hand, yes, this seems reasonable. On the other hand, this is also a fully general excuse for unreasonable dialogue. It is always the case that important issues have had major impacts on individuals. Taking this excuse seriously undermines Scott's key principles.

I would be less critical if it were just Scott, but a lot of people jumped on narratives similar to "NYT is going around kicking people in the balls for selfish reasons", demonstrating an alarming amount of tribalism - and worse, lack of self-awareness about it.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T20:56:15.195Z · LW · GW

+1, I agree with all of this, and generally consider the SSC/NYT incident to be an example of the rationalist community being highly tribalist.

(more on this in a twitter thread, which I've copied over to LW here)

Comment by Richard_Ngo (ricraz) on My PhD thesis: Algorithmic Bayesian Epistemology · 2024-03-25T17:00:58.006Z · LW · GW

Very cool work! A couple of (perhaps-silly) questions:

  1. Do these results have any practical implications for prediction markets?
  2. Which of your results rely on there being a fixed pool of experts who have to forecast a question (as opposed to experts being free to pick and choose which questions they forecast)?
  3. Do you know if your arbitrage-free contract function permits types of collusion that don't leave all experts better off under every outcome, but do make each of them better off in expectation according to their own credences? (I.e. types of collusion that they would agree to in advance.) Apart from just making side bets.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-24T22:31:52.403Z · LW · GW

What are the others?

Comment by Richard_Ngo (ricraz) on On green · 2024-03-24T19:03:25.443Z · LW · GW

Huh, I'd say the opposite. Green-according-to-black says "fuck all the people who are harming nature", because black sees the world through an adversarial lens. But actual green is better at getting out of the adversarial/striving mindset.

Comment by Richard_Ngo (ricraz) on On green · 2024-03-23T02:43:26.052Z · LW · GW

My favorite section of this post was the "green according to non-green" section, which I felt captured really well the various ways that other colors see past green.

I don't fully feel like the green part inside me resonated with any of your descriptions of it, though. So let me have a go at describing green, and seeing if that resonates with you.

Green is the idea that you don't have to strive towards anything. Thinking that green is instrumentally useful towards some other goal misses the whole point of green, which is about getting out of a goal- or action-oriented mindset. When you do that, your perception expands from a tunnel-vision "how can I get what I want" to actually experiencing the world in its unfiltered glory—actually looking at the redwoods. If you do that, then you can't help but feel awe. And when you step out of your self-oriented tunnel, suddenly the world has far more potential for harmony than you'd previously seen, because in fact the motivations that are causing the disharmony are... illusions, in some sense. Green looks at someone cutting down a redwood and sees someone who is hurting themself, by forcibly shutting off the parts of themselves that are capable of appreciation and awe. Knowing this doesn't actually save the redwoods, necessarily, but it does make it far easier to be in a state of acceptance, because deep down nobody is actually your enemy.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T23:42:58.382Z · LW · GW

More thoughts: what's the difference between paying in a counterfactual mugging based on:

  1. Whether the millionth digit of pi (5) is odd or even
  2. Whether or not there are an infinite number of primes?

In the latter case knowing the truth is (near-)inextricably entangled with a bunch of other capabilities, like the ability to do advanced mathematics. Whereas in the former it isn't. Suppose that before you knew either fact you were told that one of them was entangled in this way—would you still want to commit to paying out in a mugging based on it?

Well... maybe? But it means that the counterlogical of "if there hadn't been an infinite number of primes" is not very well-defined—it's hard to modify your brain to add that belief without making a bunch of other modifications. So now Omega doesn't just have to be (near-)omniscient, it also needs to have a clear definition of the counterlogical that's "fair" according to your standards; without knowing that it has that, paying up becomes less tempting.
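The appeal of committing in advance can be made concrete with a toy expected-value calculation. The payoff numbers here are my own illustrative assumptions, not anything from the discussion above:

```python
# Toy counterfactual mugging: before learning the digit, Omega offers a deal.
# If the digit turns out odd, Omega pays you 10_000; if even, you pay 100.
# (These payoffs are hypothetical, chosen only to illustrate the structure.)

def ev_of_committing(p_odd=0.5, payout=10_000, cost=100):
    """Expected value, from the prior perspective, of committing to pay."""
    return p_odd * payout - (1 - p_odd) * cost

# From the prior perspective the commitment is clearly positive-EV, which is
# why UDT-style reasoning says to pay even after learning the digit is even.
print(ev_of_committing())  # 4950.0
```

The trouble the comment points at is that this calculation presupposes a well-defined counterfactual in which the fact came out the other way, which is exactly what breaks down for logically entangled facts like the infinitude of primes.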

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T19:55:50.275Z · LW · GW

Yepp, as in Logical Induction, new traders get spawned over time (in some kind of simplicity-weighted ordering).

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T01:19:47.725Z · LW · GW

Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal.

Yepp, very good point. Am working on a short story about this right now.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:53:52.560Z · LW · GW

Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don't fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that.

Ah, I see. In that case I think I disagree that it happens "by default" in this model. A few dynamics which prevent it:

  1. If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
  2. I don't yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can't just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting?)
  3. Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
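The diminishing-marginal-returns property in point 2 can be illustrated with quadratic voting, where n votes cost n² units of wealth. This is only a sketch of that one property, not a claim about how the full aggregation mechanism would work:

```python
import math

def votes_bought(spend: float) -> float:
    """Under quadratic voting, n votes cost n^2, so spending S buys sqrt(S) votes."""
    return math.sqrt(spend)

# Spending 100x as much buys only 10x the influence, so a single wealthy
# trader can't cheaply dominate the reward function.
print(votes_bought(1.0))    # 1.0
print(votes_bought(100.0))  # 10.0
```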

Yep, that's why I believe "in the limit your traders will already do this". I just think it will be a dominant dynamic of efficient agents in the real world, so it's better to represent it explicitly.

IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.

But actually I don't think that this is a "dominant dynamic" because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you'll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:12:12.943Z · LW · GW

Yep, but you can just treat it as another observation channel into UDT.

Hmm, I'm confused by this. Why should we treat it this way? There's no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm. That's the role V is playing.

It's just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go "hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don't really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal".

Obviously I am not arguing that you should agree to all moral muggings. If a pain-maximizer came up to you and said "hey, looks like we're in a world where pain is way easier to create than pleasure, give me all your resources", it would be nuts to agree, just like it would be nuts to get mugged by "1+1=3". I'm just saying that "sometimes you get mugged" is not a good argument against my position, and definitely doesn't imply "you get mugged everywhere".

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:02:52.603Z · LW · GW

I think real learning has some kind of ground-truth reward.

I'd actually represent this as "subsidizing" some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.

I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you'll need a modification of this framework which explains why that's not the case.

I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don't think this is actually a problem that should be ruled out—it's more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can't be ruled out: it's always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)

they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don't lose much, and you waste less compute

I'm assuming that traders can choose to ignore whichever inputs/topics they like, though. They don't need to make trades on everything if they don't want to.

I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all

Yeah, this is why I'm interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T23:48:47.302Z · LW · GW

Also, you can get rid of this problem by saying "you just want to maximize the variable U". And the things you actually care about (dogs, apples) are just "instrumentally" useful in giving you U.

But you need some mechanism for actually updating your beliefs about U, because you can't empirically observe U. That's the role of V.

leads to getting Pascal's mugged by the world in which you care a lot about easy things

I think this is fine. Consider two worlds:

In world L, lollipops are easy to make, and paperclips are hard to make.

In world P, it's the reverse.

Suppose you're a paperclip-maximizer in world L. And a lollipop-maximizer comes up to you and says "hey, before I found out whether we were in L or P, I committed to giving all my resources to paperclip-maximizers if we were in P, as long as they gave me all their resources if we were in L. Pay up."

UDT says to pay here—but that seems basically equivalent to getting "mugged" by worlds where you care about easy things.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T23:34:20.580Z · LW · GW

Some more thoughts: we can portray the process of choosing a successor policy as the iterative process of making more and more commitments over time. But what does it actually look like to make a commitment? Well, consider an agent that is made of multiple subagents, that each get to vote on its decisions. You can think of a commitment as basically saying "this subagent still gets to vote, but no longer gets updated"—i.e. it's a kind of stop-gradient.

Two interesting implications of this perspective:

  1. The "cost" of a commitment can be measured both in terms of "how often does the subagent vote in stupid ways?", and also "how much space does it require to continue storing this subagent?" But since we're assuming that agents get much smarter over time, probably the latter is pretty small.
  2. There's a striking similarity to the problem of trapped priors in human psychology. Parts of our brains basically are subagents that still get to vote but no longer get updated. And I don't think this is just a bug—it's also a feature. This is true on the level of biological evolution (you need to have a strong fear of death in order to actually survive) and also on the level of cultural evolution (if you can indoctrinate kids in a way that sticks, then your culture is much more likely to persist).

    The (somewhat provocative) way of phrasing this is that trauma is evolution's approach to implementing UDT. Someone who's been traumatized into conformity by society when they were young will then (in theory) continue obeying society's dictates even when they later have more options. Someone who gets very angry if mistreated in a certain way is much harder to mistreat in that way. And of course trauma is deeply suboptimal in a bunch of ways, but so too are UDT commitments, because they were made too early to figure out better alternatives.

    This is clearly only a small component of the story but the analogy is definitely a very interesting one.
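The "still gets to vote, but no longer gets updated" idea can be sketched as a stop-gradient flag on subagents. All the names and numbers below are hypothetical, chosen just to show the mechanism:

```python
from dataclasses import dataclass

@dataclass
class Subagent:
    name: str
    opinion: float          # this subagent's current vote on some decision
    committed: bool = False # a commitment freezes the subagent's opinion

    def update(self, evidence: float, lr: float = 0.5) -> None:
        # Committed subagents still vote but no longer get updated:
        # effectively a stop-gradient on their parameters.
        if not self.committed:
            self.opinion += lr * (evidence - self.opinion)

def vote(subagents):
    # Every subagent votes, committed or not.
    return sum(s.opinion for s in subagents) / len(subagents)

agents = [Subagent("fear_of_death", 1.0, committed=True),
          Subagent("curiosity", 0.0)]
for _ in range(10):
    for s in agents:
        s.update(evidence=-1.0)

# The committed subagent is unmoved by the evidence; the other has
# converged toward it, yet both still count in the aggregate vote.
print(agents[0].opinion, round(agents[1].opinion, 3))  # 1.0 -0.999
```

The "cost" of the commitment shows up directly here: the frozen subagent keeps dragging the vote toward its old opinion, and keeps occupying storage, long after the evidence has moved on.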

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:59:02.118Z · LW · GW

UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that's not using something UDT-like) wouldn't be able to do that in any circumstance

Agreed; apologies for the sloppy phrasing.

Historically it was overwhelmingly the frame until recently, so it's the correct frame for interpreting the intended meaning of texts from that time.

I agree, that's why I'm trying to outline an alternative frame for thinking about it.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:50:21.748Z · LW · GW

Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I'll call it the BAVM model; the one-sentence summary is "internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds". There's little novel here, I'm just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).

In more detail, the three main components are:

  1. A prediction market
  2. An action auction
  3. A value election

You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:

  1. Making accurate predictions about future sensory experiences on the market.
  2. Taking actions which lead to reward or increase the agent's expected future value.

They spend money in three ways:

  1. Bidding to control the agent's actions for the next N timesteps.
  2. Voting on what actions get reward and what states are assigned value.
  3. Running the computations required to figure out all these trades.

Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that's simpler than using different ontologies. (Note that this does require the assumption that simpler traders start off with more money.)

The last component is that it costs traders money to do computation. They can reduce this cost by finding other traders who do similar computations, and merging with them into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
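
As a toy illustration of the first three components (with merging omitted), here's a minimal sketch. The parimutuel payout rule, the specific traders, and the alternating observation stream are all illustrative assumptions of mine, not part of the model as stated:

```python
class Trader:
    """A trader participates in all three mechanisms with one pot of wealth."""
    def __init__(self, name, wealth, predict, bid, act, value):
        self.name, self.wealth = name, wealth
        self.predict = predict  # t -> predicted next observation
        self.bid = bid          # (t, wealth) -> bid to control the next action
        self.act = act          # t -> action taken if in control
        self.value = value      # action -> how much this trader values it, in [0, 1]


def run(traders, steps=10, stake_frac=0.1):
    history = []
    for t in range(steps):
        obs = (t % 2 == 0)  # toy "sensory experience"

        # 1. Prediction market (parimutuel): every trader stakes a fraction of
        # its wealth; the pot is split among correct predictors pro rata.
        pot, winners = 0.0, []
        for tr in traders:
            stake = stake_frac * tr.wealth
            tr.wealth -= stake
            pot += stake
            if tr.predict(t) == obs:
                winners.append((tr, stake))
        winner_stake = sum(s for _, s in winners)
        if winner_stake:
            for tr, s in winners:
                tr.wealth += pot * s / winner_stake

        # 2. Action auction: the highest bidder pays its bid and picks the action.
        controller = max(traders, key=lambda tr: tr.bid(t, tr.wealth))
        controller.wealth -= controller.bid(t, controller.wealth)
        action = controller.act(t)

        # 3. Value election: reward is a wealth-weighted vote over the chosen
        # action, paid to the controlling trader.
        total = sum(tr.wealth for tr in traders)
        reward = sum(tr.wealth * tr.value(action) for tr in traders) / total
        controller.wealth += reward
        history.append(action)
    return history


oracle = Trader("oracle", 100.0,
                predict=lambda t: t % 2 == 0,   # always right
                bid=lambda t, w: min(1.0, w),
                act=lambda t: "A",
                value=lambda a: 1.0 if a == "A" else 0.0)
naive = Trader("naive", 100.0,
               predict=lambda t: True,          # right only half the time
               bid=lambda t, w: min(0.5, w),
               act=lambda t: "B",
               value=lambda a: 1.0 if a == "B" else 0.0)

actions = run([oracle, naive])
# The better predictor accumulates wealth, keeps winning the auction, and so
# its preferred action (and value vote) dominates.
```

This makes the money flow explicit: prediction profits fund auction bids, which is one way the "values dominated by successful predictors" dynamic above can arise.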

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:21:03.722Z · LW · GW

One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that.

I think there's an ambiguity here. UDT makes the agent stop considering updated-away possibilities, but I haven't seen any discussion of UDT which suggests that it stops caring about them in principle (except for a brief suggestion from Paul that one option for UDT is to "go back to a position where I’m mostly ignorant about the content of my values"). Rather, when I've seen UDT discussed, it focuses on updating or un-updating your epistemic state.

I don't think the shift I'm proposing is particularly important, but I do think the idea that "you have your prior and your utility function from the very beginning" is a kinda misleading frame to be in, so I'm trying to nudge a little away from that.

Comment by Richard_Ngo (ricraz) on Measuring Coherence of Policies in Toy Environments · 2024-03-20T19:26:12.230Z · LW · GW

That's what would make the check nontrivial: IIUC there exist policies which are not consistent with any assignment of values satisfying that Bellman equation.

Ah, I see. Yeah, good point. So let's imagine drawing a boundary around some zero-reward section of an MDP, and evaluating consistency within it. In essence, this is equivalent to saying that only actions which leave that section of the MDP have any reward. Without loss of generality we could do this by making some states terminal states, with only terminal states getting reward. (Or saying that only self-loops get reward, which is equivalent for deterministic policies.)

Now, there's some set of terminal states which are ever taken by a deterministic policy. And so we can evaluate the coherence of the policy as follows:

  1. When going to a terminal state, does it always take the shortest path?
  2. For every pair of terminal states in that set, is there some k such that it always goes to one unless the path to the other is at least k steps shorter?
  3. Do these pairings allow us to rank all terminal states?

This could be calculated by working backwards from the terminal states that are sometimes taken, with each state keeping a record of which terminal states are reachable from it via different path lengths. And then a metric of coherence here will allow for some contradictions, presumably, but not many.

Note that going to many different terminal states from different starting points doesn't necessarily imply a lack of coherence—it might just be the case that there are many nearly-equally-good ways to exit this section of the MDP. It all depends on how the policy goes to those states.
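
To make the first check concrete, here's an illustrative sketch (my own, not from the post) for a deterministic MDP represented as a `state -> {action: next_state}` dict; checks 2 and 3 would extend the same backward bookkeeping:

```python
from collections import deque


def reverse_bfs(graph, terminal):
    """Shortest distance from every state to `terminal`, following edges forward."""
    # Build reverse adjacency: which states can step to s in one move.
    rev = {s: [] for s in graph}
    for s, moves in graph.items():
        for nxt in moves.values():
            rev.setdefault(nxt, []).append(s)
    dist = {terminal: 0}
    queue = deque([terminal])
    while queue:
        s = queue.popleft()
        for prev in rev.get(s, []):
            if prev not in dist:
                dist[prev] = dist[s] + 1
                queue.append(prev)
    return dist


def policy_paths(graph, policy, terminals):
    """For each non-terminal state: (terminal reached, steps taken) under the policy."""
    out = {}
    for s in graph:
        if s in terminals:
            continue
        cur, steps = s, 0
        while cur not in terminals:
            cur = graph[cur][policy[cur]]
            steps += 1
        out[s] = (cur, steps)
    return out


def shortest_path_coherent(graph, policy, terminals):
    """Check 1: when going to a terminal state, does the policy take a shortest path?"""
    dists = {t: reverse_bfs(graph, t) for t in terminals}
    return all(steps == dists[term][s]
               for s, (term, steps) in policy_paths(graph, policy, terminals).items())


graph = {
    "a": {"left": "b", "right": "T1"},
    "b": {"left": "T1", "right": "T2"},
    "T1": {}, "T2": {},
}
terminals = {"T1", "T2"}
coherent = {"a": "right", "b": "left"}
incoherent = {"a": "left", "b": "left"}   # reaches T1 in 2 steps when 1 suffices
print(shortest_path_coherent(graph, coherent, terminals))    # True
print(shortest_path_coherent(graph, incoherent, terminals))  # False
```

The record of "which terminal states are reachable via which path lengths" is exactly what the per-terminal BFS tables store, so the pairwise-k check (2) could reuse them directly.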

Comment by Richard_Ngo (ricraz) on Measuring Coherence of Policies in Toy Environments · 2024-03-20T18:14:24.411Z · LW · GW

Note that in the setting we describe here, we start off only with a policy and a (reward-less) MDP. No rewards, no value functions. Given this, there is always a value function or q-function consistent with the policy and the Bellman equations.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T06:55:14.320Z · LW · GW

In UDT2, when you're in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I'll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.
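
The three steps above can be sketched as a toy brute-force search over complete policies. The counterfactual-mugging setup and all payoffs here are illustrative assumptions, and the "small amount of time" in step 2 becomes exhaustive search because the toy space is tiny:

```python
from itertools import product


def udt2_decide(prior, observations, actions, utility, current_obs):
    """Steps 1-2: the non-updated agent (with prior X over worlds) searches over
    full policies (observation -> action maps) for the one maximizing expected U.
    Step 3: the updated agent just runs that policy on its actual observation."""
    policies = [dict(zip(observations, choice))
                for choice in product(actions, repeat=len(observations))]
    best = max(policies,
               key=lambda P: sum(p * utility(world, P) for world, p in prior.items()))
    return best[current_obs]


# Counterfactual mugging: paying in "tails" loses 100 there but (per Omega's
# deal) earns 10000 in "heads". An agent that first updates on seeing tails
# would refuse; the UDT2 procedure pays.
prior = {"heads": 0.5, "tails": 0.5}

def utility(world, policy):
    if world == "heads":
        return 10000 if policy["tails"] == "pay" else 0
    return -100 if policy["tails"] == "pay" else 0

print(udt2_decide(prior, ["heads", "tails"], ["pay", "refuse"], utility, "tails"))
# -> "pay"
```

Note that `utility` is passed to the search unchanged, which is exactly the feature questioned below: the non-updated agent is assumed to already have the updated agent's U.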

The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility function. That seems... suspicious. If you're updating so far back that you don't know who or where you are, how are you meant to know what you care about?

What happens if the non-updated agent isn't given your utility function? On its face, that seems to break its ability to decide which policy P to commit to. But perhaps it could instead choose a policy P(Y,U) which takes as input not just an epistemic state, but also a utility function. Then in step 2, the non-updated agent needs to choose a policy P that maximizes not the agent's current utility function, but rather the utility functions it expects to have across a wide range of future situations.

Problem: this involves aggregating the utilities of different agents, and there's no canonical way to do this. Hmm. So maybe instead of just generating a policy, the non-updated agent also needs to generate a value learning algorithm, that maps from an epistemic state Y to a utility function U, in a way which allows comparison across different Us. Then the non-updated agent tries to find a pair (P, V) such that P(Y) maximizes V(Y) on the distribution of Ys predicted by X. EDIT: no, this doesn't work. Instead I think you need to go back, not just to a previous epistemic state X, but also to a previous set of preferences U' (which include meta-level preferences about how your values evolve). Then you pick P and V in order to maximize U'.

Now, it does seem kinda wacky that the non-updated agent can maybe just tell you to change your utility function. But is that actually any weirder than it telling you to change your policy? And after all, you did in fact acquire your values from somewhere, according to some process.

Overall, I haven't thought about this very much, and I don't know if it's already been discussed. But three quick final comments:

  1. This brings UDT closer to an ethical theory, not just a decision theory.
  2. In practice you'd expect P and V to be closely related. In fact, I'd expect them to be inseparable, based on arguments I make here.
  3. Overall the main update I've made is not that this version of UDT is actually useful, but that I'm now suspicious of the whole framing of UDT as a process of going back to a non-updated agent and letting it make commitments.
Comment by Richard_Ngo (ricraz) on Policy Selection Solves Most Problems · 2024-03-20T04:57:03.813Z · LW · GW

Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward?

Is there a particular reason to think that the answer to this shouldn't just be "first run a logical inductor to P_f(f(n)), then use that distribution to determine how to use P_f(n) to determine how to choose an action from P_n" (at least for large enough n)?

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-18T20:03:01.942Z · LW · GW

The "average" is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you'd prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn't.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-18T18:36:17.158Z · LW · GW

The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?

One answer, which Yudkowsky gives here, is that conscious experiences are just a "weird and more abstract and complicated pattern that matter can be squiggled into".

But that seems to be in tension with another claim he makes, that there's no way for one agent's conscious experiences to become "more real" except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.

Clearly a squiggle-maximizer would not be an average squigglean. So what's the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you're deciding for the good of A) you pick whichever one gives A a better time on average.

Yudkowsky has written before (can't find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he's a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky's preferences are incoherent, and that the only coherent thing to do here is to "expect to be" a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we're at the hinge of history.)

But this is just an answer, it doesn't dissolve the problem. What could? Some wild guesses:

  1. You are allowed to have preferences about the external world, and you are allowed to have preferences about your "thread of experience"—you're just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
  2. Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you're not allowed to be both (on pain of incoherence/being dutch-booked).
  3. Something totally different: the problem here is that we don't have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
  4. Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.
Comment by Richard_Ngo (ricraz) on More people getting into AI safety should do a PhD · 2024-03-15T17:06:42.878Z · LW · GW

But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)

Note that these are very different claims, both because a distribution's half-life (its median) is generally below its mean for right-skewed timelines, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.

then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because TAI could happen in 9 years but it could also happen in 1 year

So? Your research doesn't have to be useful in every possible world. If a PhD increases the quality of your research by, say, 3x (which is plausible, since research is heavy-tailed) then it may well be better to do that research for half the time.

(In general I don't think x-risk-motivated people should do PhDs that don't directly contribute to alignment, to be clear; I just think this isn't a good argument for that conclusion.)

Comment by Richard_Ngo (ricraz) on 'Empiricism!' as Anti-Epistemology · 2024-03-15T02:09:12.450Z · LW · GW

"Well, since it's too late there," said the Scientist, "would you maybe agree with me that 'eternal returns' is a prediction derived by looking at observations in a simple way, and then doing some pretty simple reasoning on it; and that's, like, cool?  Even if that coolness is not the single overwhelming decisive factor in what to believe?"

"Depends exactly what you mean by 'cool'," said the Epistemologist.

"Okay, let me give it a shot," said the Scientist. "Suppose you model me as having a bunch of subagents who make trades on some kind of internal prediction market. The whole time I've been watching Ponzi Pyramid Incorporated, I've had a very simple and dumb internal trader who has been making a bunch of money betting that they will keep going up by 20%. Of course, my mind contains a whole range of other traders too, so this one isn't able to swing the market by itself, but what I mean by 'cool' is that this trader does have a bunch of money now! (More than others do, because in my internal prediction markets, simpler traders start off with more money.)"

"The problem," said the Epistemologist, "is that you're in an adversarial context, where the observations you're seeing have been designed to make that particular simple trader rich. In that context, you shouldn't be giving those simple traders so much money to start off with; they'll just continue being exploited until you learn better."

"But is that the right place to intervene? After all, my internal prediction market is itself an adversarial process. And so the simple internal trader who just predicts that things will continue going up the same amount every year will be exploited by other internal traders as soon as it dares to venture a bet on, say, the returns of the previous company that our friend the Spokesperson worked at. Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader's money."

"You claim," said the Epistemologist, "to have these more sophisticated internal traders. Yet you started this conversation by defending the coolness, aka wealth, of the trader corresponding to the Spokesperson's predictions. So it seems like these sophisticated internal traders are not doing their work so well after all."

"They haven't taken its money yet," said the Scientist, "But they will before it gets a chance to invest any of my money. Nevertheless, your point is a good one; it's not very cool to only have money temporarily. Hmmm, let me muse on this."

The Scientist thinks for a few minutes, then speaks again.

"I'll try another attempt to describe what I mean by 'cool'. Often-times, clever arguers suggest new traders to me, and point out that those traders would have made a lot of money if they'd been trading earlier. Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders' future trades. But I have not world enough or time for this; so I've decided to subsidize new traders based on how they would have done if they'd been trading earlier. Of course, though, this leaves me vulnerable to clever arguers inventing overfitted traders. So the subsidy has to be proportional to how likely it is that the clever arguer could have picked out this specific trader in advance. And for all Spokesperson's flaws, I do think that 5 years ago he was probably saying something that sounded reasonably similar to '20% returns indefinitely!' That is the sense in which his claim is cool."

"Hmm," said the Epistemologist. "An interesting suggestion, but I note that you've departed from the language of traders in doing so. I feel suspicious that you're smuggling something in, in a way which I can't immediately notice."

"Right, which would be not very cool. Alas, I feel uncertain about how to put my observation into the language of traders. But... well, I've already said that simple traders start off with more money. So perhaps it's just the same thing as before, except that when evaluating new traders on old data I put extra weight on simplicity when deciding how much money they start with—because now it also helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses."

("Parenthetically," added the Scientist, "there are plenty of other signals of overfitting I take into account when deciding how much to subsidize new traders—like where I heard about them, and whether they match my biases and society's biases, and so on. Indeed, there are enough such signals that perhaps it's best to think of this as a process of many traders bidding on the question of how easy/hard it would have been for the clever arguer to have picked out this specific trader in advance. But this is getting into the weeds—the key point is that simplicity needs to be extra-strongly-prioritized when evaluating new traders on past data.")

Comment by Richard_Ngo (ricraz) on Notes from a Prompt Factory · 2024-03-10T19:26:00.588Z · LW · GW

I'm sorry you regret reading it. A content warning seems like a good idea, I've added one now.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-06T20:40:31.632Z · LW · GW

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in "subagents" which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly "goal-directed" as you go up the hierarchy. This is an old idea, FWIW; e.g. it's how Minsky frames intelligence in Society of Mind. And it's also somewhat consistent with the claim made in the original shard theory post, that "shards are just collections of subshards".

The problem is the "just". The post also says "shards are not full subagents", and that "we currently estimate that most shards are 'optimizers' to the extent that a bacterium or a thermostat is an optimizer." But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from "heuristic" to "agent", and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral "shards" (like caring about other people's welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

(I make a similar point in the appendix of my value systematization post.)

Comment by Richard_Ngo (ricraz) on Evidential Cooperation in Large Worlds: Potential Objections & FAQ · 2024-02-28T22:17:57.800Z · LW · GW

Is ECL the same thing as acausal trade?

Typically, no. “Acausal trade” usually refers to a different mechanism: “I do this thing for you if you do this other thing for me.” Discussions of acausal trade often involve the agents attempting to simulate each other. In contrast, ECL flows through direct correlation: “If I do this, I learn that you are more likely to also do this.” For more, see Christiano (2022)’s discussion of correlation versus reciprocity and Oesterheld, 2017, section 6.1.

I'm skeptical about the extent to which these are actually different things. Oesterheld says "superrationality may be seen as a special case of acausal trade in which the agents’ knowledge implies the correlation directly, thus avoiding the need for explicit mutual modeling and the complications associated with it". So at the very least, we can think of one as a subset of the other (though I think I'd actually classify it the other way round, with acausal trade being a special case of superrationality).

But it's not just that. Consider an ECL model that concludes: "my decision is correlated with X's decision, therefore I should cooperate". But this conclusion also requires complicated recursive reasoning—specifically, reasons for thinking that the correlation holds even given that you're taking the correlation into account when making your decision.

(E.g. suppose that you know that you were similar to X, except that you are doing ECL and X isn't. But then ECL might break the previous correlation between you and X. So actually the ECL process needs to reason "the outcome of the decision process I'm currently doing is correlated with the outcome of the decision process that they're doing", and I think realistically finding a fixed point probably wouldn't look that different from standard descriptions of acausal trade.)

This may be another example of the phenomenon Paul describes in his post on why EDT > CDT: although EDT is technically more correct, in practice you need to do something like CDT to reason robustly. (In this case ECL is EDT and acausal trade is more CDTish.)

Comment by Richard_Ngo (ricraz) on Announcing Timaeus · 2024-02-22T20:12:32.419Z · LW · GW

FWIW for this sort of research I support a strong prior in favor of publishing.

Comment by Richard_Ngo (ricraz) on Every "Every Bay Area House Party" Bay Area House Party · 2024-02-20T08:41:54.238Z · LW · GW

The thing I'm picturing here is a futures contract where charizard-shirt-guy is obligated to deliver 3 trillion paperclips in exchange for one soul. And, assuming a reasonable discount rate, this is a better deal than only receiving a handful of paperclips now in exchange for the same soul. (I agree that you wouldn't want to invest in a current-market-price paperclip futures contract.)

Comment by Richard_Ngo (ricraz) on Masterpiece · 2024-02-16T05:25:31.103Z · LW · GW

Damn, MMDoom is a good one. New lore: it won the 2055 technique award.

Comment by Richard_Ngo (ricraz) on Masterpiece · 2024-02-15T09:00:19.951Z · LW · GW

Judges' ratings:

Technique: 5/10

The training techniques used here are in general very standard ones (although the dissonance filters were a nice touch). For a higher score on this metric, we would have expected more careful work to increase the stability of self-evaluation and/or the accuracy of the judgments.

Novelty: 7/10

While the initial premise was a novel one to us, we thought that more new ideas could have been incorporated into this entry in order for it to score more highly on this metric. For example, the "outliers" in the entry's predictions were a missed opportunity to communicate an underlying pattern. Similarly, the instability of the self-evaluation could have been incorporated into the entry in some clearer way.

Artistry: 9/10

We consider the piece a fascinating concept—one which forces the judges to confront the automatability of their own labors. Holding a mirror to the faces of viewers is certainly a classic artistic endeavor. We also appreciated the artistic irony of the entry's inability to perceive itself.

Comment by Richard_Ngo (ricraz) on Dreams of AI alignment: The danger of suggestive names · 2024-02-13T12:29:59.498Z · LW · GW

I think we have failed, thus far.  I'm sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy. Not easily would undeserved connotations sneak into our work and discourse. I no longer believe that and no longer hold that trust.

I empathize with this, and have complained similarly (e.g. here).

I have also been trying to figure out why I feel quite a strong urge to push back on posts like this one. E.g. in this case I do in fact agree that only a handful of people actually understand AI risk arguments well enough to avoid falling into "suggestive names" traps. But I think there's a kind of weak man effect where if you point out enough examples of people making these mistakes, it discredits even those people who avoid the trap.

Maybe another way of saying this: of course most people are wrong about a bunch of this stuff. But the jump from that to claiming the community or field has failed isn't a valid one, because the success of a field is much more dependent on max performance than mean performance.

Comment by Richard_Ngo (ricraz) on Dreams of AI alignment: The danger of suggestive names · 2024-02-13T12:14:32.983Z · LW · GW

In this particular case, Ajeya does seem to lean on the word "reward" pretty heavily when reasoning about how an AI will generalize. Without that word, it's harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I've previously complained about this here.

Ryan, curious if you agree with my take here.

Comment by Richard_Ngo (ricraz) on The case for ensuring that powerful AIs are controlled · 2024-01-27T00:39:00.872Z · LW · GW

Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".

I think there are (at least) two strong reasons to like this approach:

1. It’s complementary with alignment.

2.  It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…

As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Ofc that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.

This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.

This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.

You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.

Comment by Richard_Ngo (ricraz) on This might be the last AI Safety Camp · 2024-01-25T19:51:24.219Z · LW · GW

I originally found this comment helpful, but have now found other comments pushing back against it to be more helpful. Upon reflection, I don't think the comparison to MATS is very useful (a healthy field will have a bunch of intro programs), the criticism of Remmelt is less important given that Linda is responsible for most of the projects, the independence of the impact assessment is not crucial, and the lack of papers is relatively unsurprising given that it's targeting earlier-stage researchers/serving as a more introductory funnel than MATS.

Comment by Richard_Ngo (ricraz) on The Hidden Complexity of Wishes · 2024-01-25T01:25:38.087Z · LW · GW

my guess is the brain is highly redundant and works on ion channels that would require actually a quite substantial amount of matter to be displaced (comparatively)

Neurons are very small, though, compared with the size of a hole in a gas pipe that would be necessary to cause an explosive gas leak. (Especially because you then can't control where the gas goes after leaking, so it could take a lot of intervention to give the person a bunch of away-from-building momentum.)

I would probably agree with you if the building happened to have a ton of TNT sitting around in the basement.

Comment by Richard_Ngo (ricraz) on The Hidden Complexity of Wishes · 2024-01-24T02:13:27.504Z · LW · GW

The resulting probability distribution of events will definitely not reflect your prior probability distribution, so I think Thomas' argument still doesn't go through. It will reflect the shape of the wave-function. 

This is a good point. But I don't think "particles being moved the minimum necessary distance to achieve the outcome" actually favors explosions. I think it probably favors the sensor hardware getting corrupted, or it might actually favor messing with the firemens' brains to make them decide to come earlier (or messing with your mother's brain to make her jump out of the building)—because both of these are highly sensitive systems where small changes can have large effects.

Does this undermine the parable? Kinda, I think. If you built a machine that samples from some bizarre inhuman distribution, and then you get bizarre outcomes, then the problem is not really about your wish any more, the problem is that you built a weirdly-sampling machine. (And then we can debate about the extent to which NNs are weirdly-sampling machines, I guess.)

Comment by Richard_Ngo (ricraz) on The Hidden Complexity of Wishes · 2024-01-24T01:47:30.737Z · LW · GW

The outcome pump is defined in a way that excludes the possibility of active subversion: it literally just keeps rerunning until the outcome is satisfied, which is a way of sampling based on (some kind of) prior probability. Yudkowsky is arguing that this is equivalent to a malicious genie. But this is a claim that can be false.

In this specific case, I agree with Thomas that whether or not it's actually false will depend on the details of the function: "The further she gets from the building's center, the less the time machine's reset probability." But there's probably some not-too-complicated way to define it which would render the pump safe-ish (since this was a user-defined function).

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-01-17T00:56:01.868Z · LW · GW

People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:

1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they're relevant. So the speed cost of deception will be amortized across the (likely very long) training period.

2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like "when you see X, initiate the world takeover plan" would therefore constitute a very small proportion of the total information represented in the network; it'd be hard to regularize it away without regularizing away most of the AGI's knowledge.

I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptive alignment, but it needs to be more nuanced. One idea I've been playing around with: the tradeoff between conservatism and systematization (as discussed here). An agent that prioritizes conservatism will tend to do the things it has previously done. An agent that prioritizes systematization will tend to do the things that are favored by simple arguments.

To illustrate: suppose you have an argument in your head like "if I get a chance to take a 60/40 double-or-nothing bet for all my assets, I should". Suppose you've thought about this a bunch and you're intellectually convinced of it. Then you're actually confronted with the situation. Some people will be more conservative, and follow their gut ("I know I said I would, but... this is kinda crazy"). Others (like most utilitarians and rationalists) will be more systematizing ("it makes sense, let's do it"). Intuitively, you could also think of this as a tradeoff between memorization and generalization; or between a more egalitarian decision-making process ("most of my heuristics say no") and a more centralized process ("my intellectual parts say yes"). I don't know how to formalize any of these ideas, but I'd like to try to figure it out.
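For concreteness, the systematizing argument here is just an expected-value calculation, while one way to formalize the conservative gut reaction is expected log-wealth (Kelly-style betting). The numbers below are my own illustration, not from the comment:

```python
import math

p_win = 0.6                     # 60/40 double-or-nothing bet
wealth = 1.0                    # normalize current assets to 1

# Systematizing argument: expected wealth is higher if you take the bet.
ev_bet = p_win * (2 * wealth)   # losing branch contributes 0
ev_pass = wealth
assert ev_bet > ev_pass         # 1.2 > 1.0

# One formalization of the conservative intuition: expected log-wealth.
# Betting everything risks log(0) = -inf, so the all-in bet is maximally
# bad on this criterion even though its expected value is positive.
# The Kelly-optimal fraction for an even-odds bet is p - (1 - p).
kelly_fraction = p_win - (1 - p_win)            # 0.2
exp_log_kelly = (p_win * math.log(1 + kelly_fraction)
                 + (1 - p_win) * math.log(1 - kelly_fraction))
assert exp_log_kelly > 0        # betting a fifth, not everything, is optimal
```

So both the "take it" and the "this is kinda crazy" reactions correspond to coherent decision rules; the disagreement is over which systematization of the underlying risk attitude to endorse.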

Comment by Richard_Ngo (ricraz) on The alignment problem from a deep learning perspective · 2024-01-16T23:42:12.174Z · LW · GW

Ty for review. I still think it's better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.

Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level "case for misalignment as an x-risk" has been accepted at a major ML conference, to my knowledge. (Though Langosco's goal misgeneralization paper did this a little bit, and was accepted at ICML.)

Comment by Richard_Ngo (ricraz) on Value systematization: how values become coherent (and misaligned) · 2024-01-12T02:24:31.188Z · LW · GW

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it's fine if the specific gear gets swapped out for a copy. I'm not assuming there's a mistake anywhere; I'm just assuming you switch from caring about one type of property it has (physical) to another (functional).

In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.

I picture you saying "well, you could just not prioritize them". But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing "this particular gear", but you realize that atoms are constantly being removed and new ones added (implausibly, but let's assume it's a self-repairing gear) and so there's no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of "the gear that moves the piston"—then valuing that could be much simpler.

Zooming out: previously you said

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value V, and it doesn't run into direct conflict with any other values you have, and your model of V isn't wrong at the abstraction level it's defined at, you'll never want to change V.

I'm worried that we're just talking about different things here, because I totally agree with what you're saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do), you're going to systematize your values. Second, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, you're going to systematize your values.

Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.