Posts

Tinker 2024-04-16T18:26:38.679Z
Measuring Coherence of Policies in Toy Environments 2024-03-18T17:59:08.118Z
Notes from a Prompt Factory 2024-03-10T05:13:39.384Z
Every "Every Bay Area House Party" Bay Area House Party 2024-02-16T18:53:28.567Z
Masterpiece 2024-02-13T23:10:35.376Z
A sketch of acausal trade in practice 2024-02-04T00:32:54.622Z
Succession 2023-12-20T19:25:03.185Z
∀: a story 2023-12-17T22:42:32.857Z
Meditations on Mot 2023-12-04T00:19:19.522Z
The Witness 2023-12-03T22:27:16.248Z
The Soul Key 2023-11-04T17:51:53.176Z
Value systematization: how values become coherent (and misaligned) 2023-10-27T19:06:26.928Z
Techno-humanism is techno-optimism for the 21st century 2023-10-27T18:37:39.776Z
The Gods of Straight Lines 2023-10-14T04:10:50.020Z
Eight Magic Lamps 2023-10-14T04:10:02.040Z
The Witching Hour 2023-10-10T00:19:37.786Z
One: a story 2023-10-10T00:18:31.604Z
Arguments for moral indefinability 2023-09-30T22:40:04.325Z
Alignment Workshop talks 2023-09-28T18:26:30.250Z
Jacob on the Precipice 2023-09-26T21:16:39.590Z
The King and the Golem 2023-09-25T19:51:22.980Z
Drawn Out: a story 2023-07-11T00:08:09.286Z
The virtue of determination 2023-07-10T05:11:00.412Z
You must not fool yourself, and you are the easiest person to fool 2023-07-08T14:05:18.642Z
Fixed Point: a love story 2023-07-08T13:56:54.807Z
Agency begets agency 2023-07-06T13:08:44.318Z
Frames in context 2023-07-03T00:38:52.078Z
Meta-rationality and frames 2023-07-03T00:33:20.355Z
Man in the Arena 2023-06-26T21:57:45.353Z
The ones who endure 2023-06-16T14:40:09.623Z
Cultivate an obsession with the object level 2023-06-07T01:39:54.778Z
The ants and the grasshopper 2023-06-04T22:00:04.577Z
Coercion is an adaptation to scarcity; trust is an adaptation to abundance 2023-05-23T18:14:19.117Z
Self-leadership and self-love dissolve anger and trauma 2023-05-22T22:30:06.650Z
Trust develops gradually via making bids and setting boundaries 2023-05-19T22:16:38.483Z
Resolving internal conflicts requires listening to what parts want 2023-05-19T00:04:20.451Z
Conflicts between emotional schemas often involve internal coercion 2023-05-17T10:02:50.860Z
We learn long-lasting strategies to protect ourselves from danger and rejection 2023-05-16T16:36:08.398Z
Judgments often smuggle in implicit standards 2023-05-15T18:50:07.781Z
From fear to excitement 2023-05-15T06:23:18.656Z
Clarifying and predicting AGI 2023-05-04T15:55:26.283Z
AGI safety career advice 2023-05-02T07:36:09.044Z
Communicating effectively under Knightian norms 2023-04-03T22:39:58.350Z
Policy discussions follow strong contextualizing norms 2023-04-01T23:51:36.588Z
AGISF adaptation for in-person groups 2023-01-13T03:24:58.320Z
The Alignment Problem from a Deep Learning Perspective (major rewrite) 2023-01-10T16:06:05.057Z
Applications open for AGI Safety Fundamentals: Alignment Course 2022-12-13T18:31:55.068Z
Alignment 201 curriculum 2022-10-12T18:03:03.454Z
Some conceptual alignment research projects 2022-08-25T22:51:33.478Z
The alignment problem from a deep learning perspective 2022-08-10T22:46:46.752Z

Comments

Comment by Richard_Ngo (ricraz) on AI Regulation is Unsafe · 2024-04-25T01:02:53.306Z · LW · GW

I'm not sure who you've spoken to, but at least among the people who I talk to regularly who I consider to be doing "serious AI policy work" (which admittedly is not everyone who claims to be doing AI policy work), I think nearly all of them have thought about ways in which regulation + regulatory capture could be net negative. At least to the point of being able to name the relatively "easy" ways (e.g., governments being worse at alignment than companies).

I don't disagree with this; when I say "thought very much" I mean e.g. to the point of writing papers about it, or even blog posts, or analyzing it in talks, or basically anything more than cursory brainstorming. Maybe I just haven't seen that stuff, idk.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-04-25T00:58:07.001Z · LW · GW

This is particularly weird because your indexical probability then depends on what kind of bet you're offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me...

It seems pretty weird to me too, but to steelman: why shouldn't it depend on the type of bet you're offered? Your indexical probabilities can depend on any other type of observation you have when you open your eyes. E.g. maybe you see blue carpets, and you know that world A is 2x more likely to have blue carpets. And hearing someone say "and the bet is denominated in money not time" could maybe update you in an analogous way.

I mostly offer this in the spirit of "here's the only way I can see to reconcile subjective anticipation with UDT at all", not "here's something which makes any sense mechanistically or which I can justify on intuitive grounds".

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-04-25T00:17:16.411Z · LW · GW

My own interpretation of how UDT deals with anthropics (and I'm assuming ADT is similar) is "Don't think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over."

(Speculative paragraph, quite plausibly this is just nonsense.) Suppose you have copies A and B who are both offered the same bet on whether they're A. One way you could make this decision is to assign measure to A and B, then figure out what the marginal utility of money is for each of A and B, then maximize measure-weighted utility. Another way you could make this decision, though, is just to say "the indexical probability I assign to ending up as each of A and B is proportional to their marginal utility of money" and then maximize your expected money. Intuitively this feels super weird and unjustified, but it does make the "prediction" that we'd find ourselves in a place with high marginal utility of money, as we currently do.

(Of course "money" is not crucial here, you could have the same bet with "time" or any other resource that can be compared across worlds.)
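
As a minimal numeric sketch of the two decision procedures above (with the simplifying assumption, not in the original, that A and B get equal measure): in that case the "weird" indexical rule accepts exactly the same bets as the measure-weighted rule, because the marginal-utility weighting has simply been folded into the "probability".

```python
# Toy check (my own illustration, assuming equal measure for the two copies).
# The bet: pay $1, receive $2 if you turn out to be A, so A nets +$1 and B nets -$1.

def measure_weighted_rule(mu_a, mu_b, measure_a=0.5, measure_b=0.5):
    """Accept iff the measure-weighted change in utility is positive."""
    return measure_a * mu_a * (+1) + measure_b * mu_b * (-1) > 0

def indexical_rule(mu_a, mu_b):
    """Set p(A) proportional to marginal utility of money, then maximize expected dollars."""
    p_a = mu_a / (mu_a + mu_b)
    return p_a * (+1) + (1 - p_a) * (-1) > 0

for mu_a, mu_b in [(2.0, 1.0), (1.0, 3.0), (0.7, 0.7)]:
    assert measure_weighted_rule(mu_a, mu_b) == indexical_rule(mu_a, mu_b)
print("With equal measures, both rules accept exactly the same bets.")
```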

I would say that under UDASSA, it's perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).

Fair point. By "acausal games" do you mean a generalization of acausal trade? (Acausal trade is the main reason I'd expect us to be simulated a lot.)

Comment by Richard_Ngo (ricraz) on AI Regulation is Unsafe · 2024-04-24T22:51:52.136Z · LW · GW

I don't actually think proponents of anti-x-risk AI regulation have thought very much about the ways in which regulatory capture might in fact be harmful to reducing AI x-risk. At least, I haven't seen much writing about this, nor has it come up in many of the discussions I've had (except insofar as I brought it up).

In general I am against arguments of the form "X is terrible but we have to try it because worlds that don't do it are even more doomed". I'll steal Scott Garrabrant's quote from here:

"If you think everything is doomed, you should try not to mess anything up. If your worldview is right, we probably lose, so our best out is the one where your your worldview is somehow wrong. In that world, we don't want mistaken people to take big unilateral risk-seeking actions.

Until recently, people with P(doom) of, say, 10%, have been natural allies of people with P(doom) of >80%. But the regulation that the latter group thinks is sufficient to avoid x-risk with high confidence has, on my worldview, a significant chance of either causing x-risk from totalitarianism, or else causing x-risk via governments being worse at alignment than companies would have been. How high? Not sure, but plausibly enough to make these two groups no longer natural allies.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-04-24T19:59:23.347Z · LW · GW

A tension that keeps recurring when I think about philosophy is between the "view from nowhere" and the "view from somewhere", i.e. a third-person versus first-person perspective—especially when thinking about anthropics.

One version of the view from nowhere says that there's some "objective" way of assigning measure to universes (or people within those universes, or person-moments). You should expect to end up in different possible situations in proportion to how much measure your instances in those situations have. For example, UDASSA ascribes measure based on the simplicity of the computation that outputs your experience.

One version of the view from somewhere says that the way you assign measure across different instances should depend on your values. You should act as if you expect to end up in different possible future situations in proportion to how much power to implement your values the instances in each of those situations has. I'll call this the ADT approach, because that seems like the core insight of Anthropic Decision Theory. Wei Dai also discusses it here.

In some sense each of these views makes a prediction. UDASSA predicts that we live in a universe with laws of physics that are very simple to specify (even if they're computationally expensive to run), which seems to be true. Meanwhile the ADT approach "predicts" that we find ourselves at an unusually pivotal point in history, which also seems true.

Intuitively I want to say "yeah, but if I keep predicting that I will end up in more and more pivotal places, eventually that will be falsified". But.... on a personal level, this hasn't actually been falsified yet. And more generally, acting on those predictions can still be positive in expectation even if they almost surely end up being falsified. It's a St Petersburg paradox, basically.

Very speculatively, then, maybe a way to reconcile the view from somewhere and the view from nowhere is via something like geometric rationality, which avoids St Petersburg paradoxes. And more generally, it feels like there's some kind of multi-agent perspective which says I shouldn't model all these copies of myself as acting in unison, but rather as optimizing for some compromise between all their different goals (which can differ even if they're identical, because of indexicality). No strong conclusions here but I want to keep playing around with some of these ideas (which were inspired by a call with @zhukeepa).

This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

Comment by Richard_Ngo (ricraz) on Mid-conditional love · 2024-04-23T23:40:38.649Z · LW · GW

Suppose we replace "unconditional love" with "unconditional promise". E.g. suppose Alice has promised Bob that she'll make Bob dinner on Christmas no matter what. Now it would be clearly confused to say "Alice promised Bob Christmas dinner unconditionally, so presumably she promised everything else Christmas dinner as well, since it is only conditions that separate Bob from the worms".

What's gone wrong here? Well, the ontology humans use for coordinating with each other assumes the existence of persistent agents, and so when you say you unconditionally promise/love/etc a given agent, then this implicitly assumes that we have a way of deciding which agents are "the same agent". No theory of personal identity is fully philosophically robust, of course, but if you object to that then you need to object not only to "I unconditionally love you" but also to any sentence which contains the word "you", since we don't have a complete theory of what that refers to.

A woman who leaves a man because he grew plump and a woman who leaves a man because he committed treason both possessed ‘conditional love’.

This is not necessarily conditional love, this is conditional care or conditional fidelity. You can love someone and still leave them; they don't have to outweigh everything else you care about.

But also: I think "I love you unconditionally" is best interpreted as a report of your current state, rather than a commitment to maintaining that state indefinitely.

Comment by Richard_Ngo (ricraz) on What is the best way to talk about probabilities you expect to change with evidence/experiments? · 2024-04-19T22:10:47.672Z · LW · GW

The thing that distinguishes the coin case from the wind case is how hard it is to gather additional information, not how much more information could be gathered in principle. In theory you could run all sorts of simulations that would give you informative data about an individual flip of the coin, it's just that it would be really hard to do so/very few people are able to do so. I don't think the entropy of the posterior captures this dynamic.

Comment by Richard_Ngo (ricraz) on What is the best way to talk about probabilities you expect to change with evidence/experiments? · 2024-04-19T21:27:30.821Z · LW · GW

The variance over time depends on how you gather information in the future, making it less general. For example, I may literally never learn enough about meteorology to update my credence about the winds from 0.5. Nevertheless, there's still an important sense in which this credence is more fragile than my credence about coins, because I could update it.

I guess you could define it as something like "the variance if you investigated it further". But defining what it means to investigate further seems about as complicated as defining the reference class of people you're trading against. Also variance doesn't give you the same directional information—e.g. OP would bet on doom at 2% or bet against it at 16%.

Overall though, as I said above, I don't know a great way to formalize this, and would be very interested in attempts to do so.

Comment by Richard_Ngo (ricraz) on What is the best way to talk about probabilities you expect to change with evidence/experiments? · 2024-04-19T20:45:06.670Z · LW · GW

I don't think there's a very good precise way to do so, but one useful concept is bid-ask spreads, which are a way of protecting yourself from adverse selection of bets. E.g. consider the following two credences, both of which are 0.5.

  1. My credence that a fair coin will land heads.
  2. My credence that the wind tomorrow in my neighborhood will be blowing more northwards than southwards (I know very little about meteorology and have no recollection of which direction previous winds have mostly blown).

Intuitively, however, the former is very difficult to change, whereas the latter might swing wildly given even a little bit of evidence (e.g. someone saying "I remember in high school my teacher mentioned that winds often blow towards the equator.")

Suppose I have to decide on a policy that I'll accept bets for or against each of these propositions at X:1 odds (i.e. my opponent puts up $X for every $1 I put up). For the first proposition, I might set X to be 1.05, because as long as I have a small edge I'm confident I won't be exploited.

By contrast, if I set X=1.05 for the second proposition, then probably what will happen is that people will only decide to bet against me if they have more information than me (e.g. checking weather forecasts), and so they'll end up winning a lot of money from me. And so I'd actually want X to be something more like 2 or maybe higher, depending on who I expect to be betting against, even though my credence right now is 0.5.

In your case, you might formalize this by talking about your bid-ask spread when trading against people who know about these bottlenecks.
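
Here's a rough sketch of the arithmetic behind this, with made-up numbers: the only assumption is some probability q that an informed counterparty wins the bet, and the break-even odds are q/(1-q).

```python
def my_expected_profit(x, q):
    """Expected $ per accepted bet at X:1 odds, when the counterparty wins with probability q."""
    return (1 - q) * x - q * 1.0

# Coin: anyone betting against me at X = 1.05 still only wins half the time.
print(my_expected_profit(1.05, 0.5))    # ~ +0.025: a small but safe edge

# Wind: suppose (illustrative assumption) people only take the bet after checking a
# forecast that is right 2/3 of the time. Now X = 1.05 loses money, and I need
# X > q / (1 - q) = 2 just to break even.
print(my_expected_profit(1.05, 2 / 3))  # ~ -0.32
print(my_expected_profit(2.0, 2 / 3))   # ~ 0: the break-even point
```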

Comment by Richard_Ngo (ricraz) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T22:07:36.804Z · LW · GW

I think the two things that felt most unhealthy were:

  1. The "no forgiveness is ever possible" thing, as you highlight. Almost all talk about ineradicable sin should, IMO, be seen as a powerful psychological attack.
  2. The "our sins" thing feels like an unhealthy form of collective responsibility—you're responsible even if you haven't done anything. Again, very suspect on priors.

Maybe this is more intuitive for rationalists if you imagine an SJW writing a song about how, even millions of years in the future, anyone descended from westerners should still feel guilt about slavery: "Our sins can never be undone. No single death will be forgiven." I think this is the psychological exploit that's screwed up leftism so much over the last decade, and feels very analogous to what's happening in this song.

Comment by Richard_Ngo (ricraz) on Nick Bostrom’s new book, “Deep Utopia”, is out today · 2024-04-03T17:58:14.865Z · LW · GW

Just read this (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.

Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom's style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn't work very well when describing possible utopias, because they'll be so different from today that it's hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.

The stories and vignettes are somewhat esoteric; it's hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to "benefit" a room heater.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-04-03T17:56:32.583Z · LW · GW

Just read Bostrom's Deep Utopia (though not too carefully). The book is structured with about half being transcripts of fictional lectures given by Bostrom at Oxford, about a quarter being stories about various woodland creatures striving to build a utopia, and another quarter being various other vignettes and framing stories.

Overall, I was a bit disappointed. The lecture transcripts touch on some interesting ideas, but Bostrom's style is generally one which tries to classify and taxonomize, rather than characterize (e.g. he has a long section trying to analyze the nature of boredom). I think this doesn't work very well when describing possible utopias, because they'll be so different from today that it's hard to extrapolate many of our concepts to that point, and also because the hard part is making it viscerally compelling.

The stories and vignettes are somewhat esoteric; it's hard to extract straightforward lessons from them. My favorite was a story called The Exaltation of ThermoRex, about an industrialist who left his fortune to the benefit of his portable room heater, leading to a group of trustees spending many millions of dollars trying to figure out (and implement) what it means to "benefit" a room heater.

Comment by Richard_Ngo (ricraz) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-03T02:09:52.704Z · LW · GW

Fantastic work :)

Some thoughts on the songs:

  • I'm overall super impressed by how well the styles of the songs fit the content—e.g. the violins in FHI, the British accent for More Dakka, the whisper for We Do Not Wish, the Litany of Tarrrrski, etc.
  • My favorites to listen to are FHI at Oxford, Nihil Supernum, and Litany of Tarrrrski, because they have both messages that resonate a lot and great tunes.
  • IMO Answer to Job is the best-composed on artistic merits, and will have the most widespread appeal. Tune is great, style matches the lyrics really well (particular shout-out to the "or labor or lust" as a well-composed bar). Only change I'd make is changing "upon lotus thrones" to "on lotus thrones" to scan better.
  • Dath Ilan's Song feels... pretty unhealthy, tbh.
  • I thought Prime Factorization was really great until the bit about the car and the number, which felt a bit jarring.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T22:44:50.771Z · LW · GW

If it was the case that there was important public information attached to Scott's full name, then this argument would make sense to me.

In general having someone's actual name public makes it much easier to find out other public information attached to them. E.g. imagine if Scott were involved in shady business dealings under his real name. This is the sort of thing that the NYT wouldn't necessarily discover just by writing the profile of him, but other people could subsequently discover after he was doxxed.

To be clear, btw, I'm not arguing that this doxxing policy is correct, all things considered. Personally I think the benefits of pseudonymity for a healthy ecosystem outweigh the public value of transparency about real names. I'm just arguing that there are policies consistent with the NYT's actions which are fairly reasonable.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T19:39:28.530Z · LW · GW

But it wasn't a cancellation attempt. The issue at hand is whether a policy of doxxing influential people is a good idea. The benefits are transparency about who is influencing society, and in which ways; the harms include the ones you've listed above, about chilling effects.

It's hard to weigh these against each other, but one way you might do so is by following a policy like "doxx people only if they're influential enough that they're probably robust to things like losing their job". The correlation between "influential enough to be newsworthy" and "has many options open to them" isn't perfect, but it's strong enough that this policy seems pretty reasonable to me.

To flip this around, let's consider individuals who are quietly influential in other spheres. For example, I expect there are people who many news editors listen to, when deciding how their editorial policies should work. I expect there are people who many Democrat/Republican staffers listen to, when considering how to shape policy. In general I think transparency about these people would be pretty good for the world. If those people happened to have day jobs which would suffer from that transparency, I would say "Look, you chose to have a bunch of influence, which the world should know about, and I expect you can leverage this influence to end up in a good position somehow even after I run some articles on you. Maybe you're one of the few highly-influential people for whom this happens to not be true, but it seems like a reasonable policy to assume that if someone is actually pretty influential then they'll land on their feet either way." And the fact that this was true for Scott is some evidence that this would be a reasonable policy.

(I also think that taking someone influential who didn't previously have a public profile, and giving them a public profile under their real name, is structurally pretty analogous to doxxing. Many of the costs are the same. In both cases one of the key benefits is allowing people to cross-reference information about that person to get a better picture of who is influencing the world, and how.)

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T02:36:06.187Z · LW · GW

I don't think the NYT thing played much of a role in Scott being better off now. My guess is a small minority of people are subscribed to his Substack because of the NYT thing (the dominant factor is clearly the popularity of his writing).

What credence do you have that he would have started the substack at all without the NYT thing? I don't have much information, but probably less than 80%. The timing sure seems pretty suggestive.

(I'm also curious about the likelihood that he would have started his startup without the NYT thing, but that's less relevant since I don't know whether the startup is actually going well.)

My guess is the NYT thing hurt him quite a bit and made the potential consequences of him saying controversial things a lot worse for him.

Presumably this is true of most previously-low-profile people that the NYT chooses to write about in not-maximally-positive ways, so it's not a reasonable standard to hold them to. And so as a general rule I do think "the amount of adversity that you get when you used to be an influential yet unknown person but suddenly get a single media feature about you" is actually fine to inflict on people. In fact, I'd expect that many (or even most) people in this category will have a worse time of it than Scott—e.g. because they do things that are more politically controversial than Scott, have fewer avenues to make money, etc.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T22:32:14.556Z · LW · GW

I mean, Scott seems to be in a pretty good situation now, in many ways better than before.

And yes, this is consistent with NYT hurting him in expectation.

But one difference between doxxing normal people versus doxxing "influential people" is that influential people typically have enough power to land on their feet when e.g. they lose a job. And so the fact that this has worked out well for Scott (and, seemingly, better than he expected) is some evidence that the NYT was better-calibrated about how influential Scott is than he was.

This seems like an example of the very very prevalent effect that Scott wrote about in "against bravery debates", where everyone thinks their group is less powerful than it actually is. I don't think there's a widely-accepted name for it; I sometimes use underdog bias. My main diagnosis of the NYT/SSC incident is that rationalists were caught up by underdog bias, even as they leveraged thousands of influential tech people to attack the NYT.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-26T21:21:41.039Z · LW · GW

Since there's been some recent discussion of the SSC/NYT incident (in particular via Zack's post), it seems worth copying over my twitter threads from that time about why I was disappointed by the rationalist community's response to the situation.

I continue to stand by everything I said below.

Thread 1 (6/23/20):

Scott Alexander is the most politically charitable person I know. Him being driven off the internet is terrible. Separately, it is also terrible if we have totally failed to internalize his lessons, and immediately leap to the conclusion that the NYT is being evil or selfish.

Ours is a community built around the long-term value of telling the truth. Are we unable to imagine reasonable disagreement about when the benefits of revealing real names outweigh the harms? Yes, it goes against our norms, but different groups have different norms.

If the extended rationalist/SSC community could cancel the NYT, would we? For planning to doxx Scott? For actually doing so, as a dumb mistake? For doing so, but for principled reasons? Would we give those reasons fair hearing? From what I've seen so far, I suspect not.

I feel very sorry for Scott, and really hope the NYT doesn't doxx him or anyone else. But if you claim to be charitable and openminded, except when confronted by a test that affects your own community, then you're using those words as performative weapons, deliberately or not.

[One more tweet responding to tweets by @balajis and @webdevmason, omitted here.]

Thread 2 (1/21/21):

Scott Alexander is writing again, on a substack blog called Astral Codex Ten! Also, he doxxed himself in the first post. This post seems like solid evidence that many SSC fans dramatically overreacted to the NYT situation.

Scott: "I still think the most likely explanation for what happened was that there was a rule on the books, some departments and editors followed it more slavishly than others, and I had the bad luck to be assigned to a department and editor that followed it a lot. That's all." [I didn't comment on this in the thread, but I intended to highlight the difference between this and the conspiratorial rhetoric that was floating around when he originally took his blog down.]

I am pretty unimpressed by his self-justification: "Suppose Power comes up to you and says hey, I'm gonna kick you in the balls. ... Sometimes you have to be a crazy bastard so people won't walk all over you." Why is doxxing the one thing Scott won't be charitable about?

[In response to @habryka asking what it would mean for Scott to be charitable about this]: Merely to continue applying the standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives. And not to turn this into a bravery debate.

[In response to @benskuhn saying that Scott's response is understandable, since being doxxed nearly prevented him from going into medicine]: On one hand, yes, this seems reasonable. On the other hand, this is also a fully general excuse for unreasonable dialogue. It is always the case that important issues have had major impacts on individuals. Taking this excuse seriously undermines Scott's key principles.

I would be less critical if it were just Scott, but a lot of people jumped on narratives similar to "NYT is going around kicking people in the balls for selfish reasons", demonstrating an alarming amount of tribalism - and worse, lack of self-awareness about it.

Comment by Richard_Ngo (ricraz) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T20:56:15.195Z · LW · GW

+1, I agree with all of this, and generally consider the SSC/NYT incident to be an example of the rationalist community being highly tribalist.

(more on this in a twitter thread, which I've copied over to LW here)

Comment by Richard_Ngo (ricraz) on My PhD thesis: Algorithmic Bayesian Epistemology · 2024-03-25T17:00:58.006Z · LW · GW

Very cool work! A few (perhaps-silly) questions:

  1. Do these results have any practical implications for prediction markets?
  2. Which of your results rely on there being a fixed pool of experts who have to forecast a question (as opposed to experts being free to pick and choose which questions they forecast)?
  3. Do you know if your arbitrage-free contract function permits types of collusion that don't leave all experts better off under every outcome, but do make each of them better off in expectation according to their own credences? (I.e. types of collusion that they would agree to in advance.) Apart from just making side bets.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-24T22:31:52.403Z · LW · GW

What are the others?

Comment by Richard_Ngo (ricraz) on On green · 2024-03-24T19:03:25.443Z · LW · GW

Huh, I'd say the opposite. Green-according-to-black says "fuck all the people who are harming nature", because black sees the world through an adversarial lens. But actual green is better at getting out of the adversarial/striving mindset.

Comment by Richard_Ngo (ricraz) on On green · 2024-03-23T02:43:26.052Z · LW · GW

My favorite section of this post was the "green according to non-green" section, which I felt captured really well the various ways that other colors see past green.

I don't fully feel like the green part inside me resonated with any of your descriptions of it, though. So let me have a go at describing green, and seeing if that resonates with you.

Green is the idea that you don't have to strive towards anything. Thinking that green is instrumentally useful towards some other goal misses the whole point of green, which is about getting out of a goal- or action-oriented mindset. When you do that, your perception expands from a tunnel-vision "how can I get what I want" to actually experiencing the world in its unfiltered glory—actually looking at the redwoods. If you do that, then you can't help but feel awe. And when you step out of your self-oriented tunnel, suddenly the world has far more potential for harmony than you'd previously seen, because in fact the motivations that are causing the disharmony are... illusions, in some sense. Green looks at someone cutting down a redwood and sees someone who is hurting themself, by forcibly shutting off the parts of themselves that are capable of appreciation and awe. Knowing this doesn't actually save the redwoods, necessarily, but it does make it far easier to be in a state of acceptance, because deep down nobody is actually your enemy.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T23:42:58.382Z · LW · GW

More thoughts: what's the difference between paying in a counterfactual mugging based on:

  1. Whether the millionth digit of pi (5) is odd or even
  2. Whether or not there are an infinite number of primes?

In the latter case knowing the truth is (near-)inextricably entangled with a bunch of other capabilities, like the ability to do advanced mathematics. Whereas in the former it isn't. Suppose that before you knew either fact you were told that one of them was entangled in this way—would you still want to commit to paying out in a mugging based on it?

Well... maybe? But it means that the counterlogical of "if there hadn't been an infinite number of primes" is not very well-defined—it's hard to modify your brain to add that belief without making a bunch of other modifications. So now Omega doesn't just have to be (near-)omniscient, it also needs to have a clear definition of the counterlogical that's "fair" according to your standards; without knowing that it has that, paying up becomes less tempting.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T19:55:50.275Z · LW · GW

Yepp, as in Logical Induction, new traders get spawned over time (in some kind of simplicity-weighted ordering).

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-22T01:19:47.725Z · LW · GW

Artificial agents can be copied or rolled back (erase memories), which makes it possible to reverse the receipt of information if an assessor concludes with a price that the seller considers too low for a deal.

Yepp, very good point. Am working on a short story about this right now.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:53:52.560Z · LW · GW

Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don't fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that.

Ah, I see. In that case I think I disagree that it happens "by default" in this model. A few dynamics which prevent it:

  1. If the wealthy trader makes reward easier to get, then the price of actions will go up accordingly (because other traders will notice that they can get a lot of reward by winning actions). So in order for the wealthy trader to keep making money, they need to reward outcomes which only they can achieve, which seems a lot harder.
  2. I don't yet know how traders would best aggregate votes into a reward function, but it should be something which has diminishing marginal return to spending, i.e. you can't just spend 100x as much to get 100x higher reward on your preferred outcome. (Maybe quadratic voting? See the sketch after this list.)
  3. Other traders will still make money by predicting sensory observations. Now, perhaps the wealthy trader could avoid this by making observations as predictable as possible (e.g. going into a dark room where nothing happens—kinda like depression, maybe?) But this outcome would be assigned very low reward by most other traders, so it only works once a single trader already has a large proportion of the wealth.
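
Here's a minimal sketch of the quadratic-voting option mentioned in point 2, purely as an illustration of the diminishing-returns property: if v votes cost v^2, then multiplying your spending by 100 only multiplies your votes by 10.

```python
# Minimal illustration (my own sketch, not part of the original model) of why a
# quadratic-cost voting rule gives diminishing returns to spending: buying v votes
# costs v**2, so votes grow only with the square root of the money spent.
import math

def votes_bought(budget: float) -> float:
    """Number of votes affordable when v votes cost v**2 units of wealth."""
    return math.sqrt(budget)

print(votes_bought(1))      # 1.0
print(votes_bought(100))    # 10.0 -- 100x the spending buys only 10x the votes
```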

Yep, that's why I believe "in the limit your traders will already do this". I just think it will be a dominant dynamic of efficient agents in the real world, so it's better to represent it explicitly

IMO the best way to explicitly represent this is via a bias towards simpler traders, who will in general pay attention to fewer things.

But actually I don't think that this is a "dominant dynamic" because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews. And so even if you start off with simple traders who pay attention to fewer things, you'll end up with these big worldviews that have opinions on everything. (These are what I call frames here.)

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:12:12.943Z · LW · GW

Yep, but you can just treat it as another observation channel into UDT.

Hmm, I'm confused by this. Why should we treat it this way? There's no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm. That's the role V is playing.

It's just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go "hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don't really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal".

Obviously I am not arguing that you should agree to all moral muggings. If a pain-maximizer came up to you and said "hey, looks like we're in a world where pain is way easier to create than pleasure, give me all your resources", it would be nuts to agree, just like it would be nuts to get mugged by "1+1=3". I'm just saying that "sometimes you get mugged" is not a good argument against my position, and definitely doesn't imply "you get mugged everywhere".

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-21T00:02:52.603Z · LW · GW

I think real learning has some kind of ground-truth reward.

I'd actually represent this as "subsidizing" some traders. For example, humans have a social-status-detector which is hardwired to our reward systems. One way to implement this is just by taking a trader which is focused on social status and giving it a bunch of money. I think this is also realistic in the sense that our human hardcoded rewards can be seen as (fairly dumb) subagents.

I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you'll need a modification of this framework which explains why that's not the case.

I think this happens in humans—e.g. we fall into cults, we then look for evidence that the cult is correct, etc etc. So I don't think this is actually a problem that should be ruled out—it's more a question of how you tweak the parameters to make this as unlikely as possible. (One reason it can't be ruled out: it's always possible for an agent to end up in a belief state where it expects that exploration will be very severely punished, which drives the probability of exploration arbitrarily low.)

they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don't lose much, and you waste less compute

I'm assuming that traders can choose to ignore whichever inputs/topics they like, though. They don't need to make trades on everything if they don't want to.

I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all

Yeah, this is why I'm interested in understanding how sub-markets can be aggregated into markets, sub-auctions into auctions, sub-elections into elections, etc.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T23:48:47.302Z · LW · GW

Also, you can get rid of this problem by saying "you just want to maximize the variable U". And the things you actually care about (dogs, apples) are just "instrumentally" useful in giving you U.

But you need some mechanism for actually updating your beliefs about U, because you can't empirically observe U. That's the role of V.

leads to getting Pascal's mugged by the world in which you care a lot about easy things

I think this is fine. Consider two worlds:

In world L, lollipops are easy to make, and paperclips are hard to make.

In world P, it's the reverse.

Suppose you're a paperclip-maximizer in world L. And a lollipop-maximizer comes up to you and says "hey, before I found out whether we were in L or P, I committed to giving all my resources to paperclip-maximizers if we were in P, as long as they gave me all their resources if we were in L. Pay up."

UDT says to pay here—but that seems basically equivalent to getting "mugged" by worlds where you care about easy things.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T23:34:20.580Z · LW · GW

Some more thoughts: we can portray the process of choosing a successor policy as the iterative process of making more and more commitments over time. But what does it actually look like to make a commitment? Well, consider an agent that is made of multiple subagents, that each get to vote on its decisions. You can think of a commitment as basically saying "this subagent still gets to vote, but no longer gets updated"—i.e. it's a kind of stop-gradient.
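
One way to make the stop-gradient picture concrete (a sketch of mine, not the author's formalism): subagents vote via weight vectors, and a commitment just freezes a subagent out of the update step while leaving its vote intact.

```python
# "Still gets to vote, but no longer gets updated": committing to a subagent freezes
# it out of the learning step while its vote still counts -- a stop-gradient, in effect.

class Subagent:
    def __init__(self, weights, frozen=False):
        self.weights = list(weights)
        self.frozen = frozen            # frozen=True is the "commitment"

    def vote(self, option_features):
        """Score an option as the dot product of its features with this subagent's weights."""
        return sum(w * f for w, f in zip(self.weights, option_features))

def choose(subagents, options):
    # Every subagent votes, frozen or not.
    scores = [sum(a.vote(opt) for a in subagents) for opt in options]
    return scores.index(max(scores))

def update(subagents, gradient, lr=0.1):
    # Only unfrozen subagents get updated; the frozen one acts as a stop-gradient.
    for a in subagents:
        if not a.frozen:
            a.weights = [w + lr * g for w, g in zip(a.weights, gradient)]

subagents = [Subagent([1.0, 0.0], frozen=True),   # committed: never changes, always votes
             Subagent([0.0, 1.0])]                 # still learning
update(subagents, gradient=[0.0, 5.0])
print(choose(subagents, [[1.0, 0.0], [0.0, 1.0]]))   # the frozen vote still shapes the choice
print([a.weights for a in subagents])                # [[1.0, 0.0], [0.0, 1.5]]
```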

Two interesting implications of this perspective:

  1. The "cost" of a commitment can be measured both in terms of "how often does the subagent vote in stupid ways?", and also "how much space does it require to continue storing this subagent?" But since we're assuming that agents get much smarter over time, probably the latter is pretty small.
  2. There's a striking similarity to the problem of trapped priors in human psychology. Parts of our brains basically are subagents that still get to vote but no longer get updated. And I don't think this is just a bug—it's also a feature. This is true on the level of biological evolution (you need to have a strong fear of death in order to actually survive) and also on the level of cultural evolution (if you can indoctrinate kids in a way that sticks, then your culture is much more likely to persist).

    The (somewhat provocative) way of phrasing this is that trauma is evolution's approach to implementing UDT. Someone who's been traumatized into conformity by society when they were young will then (in theory) continue obeying society's dictates even when they later have more options. Someone who gets very angry if mistreated in a certain way is much harder to mistreat in that way. And of course trauma is deeply suboptimal in a bunch of ways, but so too are UDT commitments, because they were made too early to figure out better alternatives.

    This is clearly only a small component of the story but the analogy is definitely a very interesting one.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:59:02.118Z · LW · GW

UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that's not using something UDT-like) wouldn't be able to do that in any circumstance

Agreed; apologies for the sloppy phrasing.

Historically it was overwhelmingly the frame until recently, so it's the correct frame for interpreting the intended meaning of texts from that time.

I agree, that's why I'm trying to outline an alternative frame for thinking about it.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:50:21.748Z · LW · GW

Here is the best toy model I currently have for rational agents. Alas, it is super messy and hacky, but better than nothing. I'll call it the BAVM model; the one-sentence summary is "internal traders concurrently bet on beliefs, auction actions, vote on values, and merge minds". There's little novel here, I'm just throwing together a bunch of ideas from other people (especially Scott Garrabrant and Abram Demski).

In more detail, the three main components are:

  1. A prediction market
  2. An action auction
  3. A value election

You also have some set of traders, who can simultaneously trade on any combination of these three. Traders earn money in two ways:

  1. Making accurate predictions about future sensory experiences on the market.
  2. Taking actions which lead to reward or increase the agent's expected future value.

They spend money in three ways:

  1. Bidding to control the agent's actions for the next N timesteps.
  2. Voting on what actions get reward and what states are assigned value.
  3. Running the computations required to figure out all these trades.

Values are therefore dominated by whichever traders earn money from predictions or actions, who will disproportionately vote for values that are formulated in the same ontologies they use for prediction/action, since that's simpler than using different ontologies. (Note that this does require the assumption that simpler traders start off with more money.)

The last component is that it costs traders money to do computation. The way they can reduce this is by finding other traders who do similar computations as them, and then merging into a single trader. I am very interested in better understanding what a merging process like this might look like, though it seems pretty intractable in general because it will depend a lot on the internal details of the traders. (So perhaps a more principled approach here is to instead work top-down, figuring out what sub-markets or sub-auctions look like?)
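
Here's a very rough toy skeleton of this loop, with many simplifying assumptions that aren't part of the model as described (binary observations, one-step control, a Brier-style scoring rule standing in for a real prediction market, and wealth-weighted votes standing in for a real election); it's only meant to make the three components and the two income streams concrete.

```python
import random

class Trader:
    """A toy trader; real traders would have actual prediction/bidding/voting strategies."""
    def __init__(self, name, wealth=1.0):
        self.name, self.wealth = name, wealth

    def predict(self, history):     # probability that the next observation bit is 1
        return random.random()

    def bid(self, history):         # (amount offered, proposed action) for the action auction
        return 0.1 * self.wealth, random.choice(["explore", "exploit"])

    def vote(self, outcome):        # how much reward this trader thinks the outcome deserves, in [0, 1]
        return random.random()

class BAVMAgent:
    def __init__(self, traders):
        self.traders = traders
        self.history = []

    def step(self, environment):
        # 1. Prediction market: each trader stakes a fraction of its wealth on the next observation.
        stakes = {t: (0.05 * t.wealth, t.predict(self.history)) for t in self.traders}

        # 2. Action auction: the highest bidder pays its bid and controls the next action.
        winner, (bid, action) = max(((t, t.bid(self.history)) for t in self.traders),
                                    key=lambda pair: pair[1][0])
        winner.wealth -= bid

        observation, outcome = environment(action)
        self.history.append(observation)

        # Settle predictions with a Brier score relative to a 50/50 baseline
        # (a crude stand-in for a real market mechanism).
        for t, (stake, p) in stakes.items():
            t.wealth += stake * ((1.0 - (observation - p) ** 2) - 0.75)

        # 3. Value election: wealth-weighted votes set the reward, which is paid to the controller.
        # (Voting costs and computation/merging costs from the write-up are omitted in this sketch.)
        total_wealth = sum(t.wealth for t in self.traders)
        reward = sum(t.wealth * t.vote(outcome) for t in self.traders) / total_wealth
        winner.wealth += reward

def toy_environment(action):
    return random.randint(0, 1), random.random()    # (observation bit, outcome shown to voters)

agent = BAVMAgent([Trader(f"t{i}") for i in range(5)])
for _ in range(20):
    agent.step(toy_environment)
print({t.name: round(t.wealth, 3) for t in agent.traders})
```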

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T20:21:03.722Z · LW · GW

One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that.

I think there's an ambiguity here. UDT makes the agent stop considering updated-away possibilities, but I haven't seen any discussion of UDT which suggests that it stops caring about them in principle (except for a brief suggestion from Paul that one option for UDT is to "go back to a position where I’m mostly ignorant about the content of my values"). Rather, when I've seen UDT discussed, it focuses on updating or un-updating your epistemic state.

I don't think the shift I'm proposing is particularly important, but I do think the idea that "you have your prior and your utility function from the very beginning" is a kinda misleading frame to be in, so I'm trying to nudge a little away from that.

Comment by Richard_Ngo (ricraz) on Measuring Coherence of Policies in Toy Environments · 2024-03-20T19:26:12.230Z · LW · GW

That's what would make the check nontrivial: IIUC there exist policies which are not consistent with any assignment of values satisfying that Bellman equation.

Ah, I see. Yeah, good point. So let's imagine drawing a boundary around some zero-reward section of an MDP, and evaluating consistency within it. In essence, this is equivalent to saying that only actions which leave that section of the MDP have any reward. Without loss of generality we could do this by making some states terminal states, with only terminal states getting reward. (Or saying that only self-loops get reward, which is equivalent for deterministic policies.)

Now, there's some set of terminal states which are ever taken by a deterministic policy. And so we can evaluate the coherence of the policy as follows:

  1. When going to a terminal state, does it always take the shortest path?
  2. For every pair of terminal states in that set, is there some k such that it always goes to one unless the path to the other is at least k steps shorter?
  3. Do these pairings allow us to rank all terminal states?

This could be calculated by working backwards from the terminal states that are sometimes taken, with each state keeping a record of which terminal states are reachable from it via different path lengths. And then a metric of coherence here will allow for some contradictions, presumably, but not many.

Note that going to many different terminal states from different starting points doesn't necessarily imply a lack of coherence—it might just be the case that there are many nearly-equally-good ways to exit this section of the MDP. It all depends on how the policy goes to those states.
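
Here's a minimal sketch of check 1 under the setup described above, for a deterministic reward-less MDP given as a transitions table; checks 2 and 3 would reuse the same distance table, which is computed by working backwards from the terminal states.

```python
# My own sketch, not from the post: does the policy always reach its chosen terminal
# state via a shortest path? Assumes deterministic transitions and a policy that
# always eventually exits this section of the MDP.
from collections import deque

def policy_path(state, policy, transitions, terminals):
    """Follow the policy from `state`; return (terminal reached, number of steps)."""
    steps = 0
    while state not in terminals:
        state = transitions[state][policy[state]]
        steps += 1
    return state, steps

def shortest_distances(transitions, terminals):
    """For each terminal t, BFS backwards over the graph: dist_to[t][s] = fewest steps from s to t."""
    preds = {}
    for s, actions in transitions.items():
        for nxt in actions.values():
            preds.setdefault(nxt, set()).add(s)
    dist_to = {}
    for t in terminals:
        dist = {t: 0}
        frontier = deque([t])
        while frontier:
            cur = frontier.popleft()
            for p in preds.get(cur, ()):
                if p not in dist:
                    dist[p] = dist[cur] + 1
                    frontier.append(p)
        dist_to[t] = dist
    return dist_to

def always_takes_shortest_path(policy, transitions, terminals):
    dist_to = shortest_distances(transitions, terminals)
    for s in transitions:
        reached, steps = policy_path(s, policy, transitions, terminals)
        if steps != dist_to[reached][s]:
            return False
    return True

# Tiny example: two non-terminal states and two exits.
transitions = {"A": {"right": "B", "down": "T1"}, "B": {"right": "T2", "down": "T1"}}
policy = {"A": "right", "B": "right"}                                  # A -> B -> T2
print(always_takes_shortest_path(policy, transitions, {"T1", "T2"}))   # True
```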

Comment by Richard_Ngo (ricraz) on Measuring Coherence of Policies in Toy Environments · 2024-03-20T18:14:24.411Z · LW · GW

Note that in the setting we describe here, we start off only with a policy and a (reward-less) MDP. No rewards, no value functions. Given this, there is always a value function or q-function consistent with the policy and the Bellman equations.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-20T06:55:14.320Z · LW · GW

In UDT2, when you're in epistemic state Y and you need to make a decision based on some utility function U, you do the following:
1. Go back to some previous epistemic state X and an EDT policy (the combination of which I'll call the non-updated agent).
2. Spend a small amount of time trying to find the policy P which maximizes U based on your current expectations X.
3. Run P(Y) to make the choice which maximizes U.

The non-updated agent gets much less information than you currently have, and also gets much less time to think. But it does use the same utility function. That seems... suspicious. If you're updating so far back that you don't know who or where you are, how are you meant to know what you care about?

What happens if the non-updated agent doesn't get given your utility function? On its face, that seems to break its ability to decide which policy P to commit to. But perhaps it could instead choose a policy P(Y,U) which takes as input not just an epistemic state, but also a utility function. Then in step 2, the non-updated agent needs to choose a policy P that maximizes, not the agent's current utility function, but rather the utility functions it expects to have across a wide range of future situations.

Problem: this involves aggregating the utilities of different agents, and there's no canonical way to do this. Hmm. So maybe instead of just generating a policy, the non-updated agent also needs to generate a value learning algorithm, that maps from an epistemic state Y to a utility function U, in a way which allows comparison across different Us. Then the non-updated agent tries to find a pair (P, V) such that P(Y) maximizes V(Y) on the distribution of Ys predicted by X. EDIT: no, this doesn't work. Instead I think you need to go back, not just to a previous epistemic state X, but also to a previous set of preferences U' (which include meta-level preferences about how your values evolve). Then you pick P and V in order to maximize U'.

Now, it does seem kinda wacky that the non-updated agent can maybe just tell you to change your utility function. But is that actually any weirder than it telling you to change your policy? And after all, you did in fact acquire your values from somewhere, according to some process.

Overall, I haven't thought about this very much, and I don't know if it's already been discussed. But three quick final comments:

  1. This brings UDT closer to an ethical theory, not just a decision theory.
  2. In practice you'd expect P and V to be closely related. In fact, I'd expect them to be inseparable, based on arguments I make here.
  3. Overall the main update I've made is not that this version of UDT is actually useful, but that I'm now suspicious of the whole framing of UDT as a process of going back to a non-updated agent and letting it make commitments.

Comment by Richard_Ngo (ricraz) on Policy Selection Solves Most Problems · 2024-03-20T04:57:03.813Z · LW · GW

Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward?

Is there a particular reason to think that the answer to this shouldn't just be "first run a logical inductor to P_{f(f(n))}, then use that distribution to determine how to use P_{f(n)} to determine how to choose an action from P_n" (at least for large enough n)?

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-18T20:03:01.942Z · LW · GW

The "average" is interpreted with respect to quality. Imagine that your only option is to create low-quality squiggles, or not to do so. In isolation, you'd prefer to produce them than not to produce them. But then you find out that the rest of the multiverse is full of high-quality squiggles. Do you still produce the low-quality squiggles? A total squigglean would; an average squigglean wouldn't.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-18T18:36:17.158Z · LW · GW

The idea that maximally-coherent agents look like squiggle-maximizers raises the question: what would it look like for humans to become maximally coherent?

One answer, which Yudkowsky gives here, is that conscious experiences are just a "weird and more abstract and complicated pattern that matter can be squiggled into".

But that seems to be in tension with another claim he makes, that there's no way for one agent's conscious experiences to become "more real" except at the expense of other conscious agents—a claim which, according to him, motivates average utilitarianism across the multiverse.

Clearly a squiggle-maximizer would not be an average squigglean. So what's the disanalogy here? It seems like @Eliezer Yudkowsky is basically using SSA, but comparing between possible multiverses—i.e. when facing the choice between creating agent A or not, you look at the set of As in the multiverse where you decided yes, and compare it to the set of As in the multiverse where you decided no, and (if you're deciding for the good of A) you pick whichever one gives A a better time on average.

Yudkowsky has written before (can't find the link) that he takes this approach because alternatives would entail giving up on predictions about his future experiences—e.g. constantly predicting he's a Boltzmann brain and will dissolve in the next second. But this argument by Wei Dai shows that agents which reason in this way can be money-pumped by creating arbitrarily short-lived copies of them. Based on this I claim that Yudkowsky's preferences are incoherent, and that the only coherent thing to do here is to "expect to be" a given copy in proportion to the resources it will have available, as anthropic decision theory claims. (Incidentally, this also explains why we're at the hinge of history.)

But this is just an answer, it doesn't dissolve the problem. What could? Some wild guesses:

  1. You are allowed to have preferences about the external world, and you are allowed to have preferences about your "thread of experience"—you're just not allowed to have both. The incoherence comes from trying to combine the two; the coherent thing to do would be to put them into different agents, who will then end up in very different parts of the multiverse.
  2. Another way of framing this: you are allowed to be a decision-maker, and you are allowed to be a repository of welfare, but you're not allowed to be both (on pain of incoherence/being dutch-booked).
  3. Something totally different: the problem here is that we don't have intuitive experience of being agents which can copy themselves, shut down copies, re-merge, etc. If we did, then maybe SSA would seem as silly as expecting to end up in a different universe whenever we went to sleep.
  4. Actually, maybe the operative thing we lack experience with is not just splitting into different subagents, but rather merging together afterwards. What does it feel like to have been thousands of different parallel agents, and now be a single agent with their unified experiences? What sort of identity would one construct in that situation? Maybe this is an important part of dissolving the problem.

Comment by Richard_Ngo (ricraz) on More people getting into AI safety should do a PhD · 2024-03-15T17:06:42.878Z · LW · GW

But if you think TAI is coming within 10 years (for example, if you think that the current half-life on worlds surviving is 10 years; if you think 10 years is the amount of time in which half of worlds are doomed)

Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.

then depending on your distribution-over-time you should absolutely not wait 5 years before doing research, because TAI could happen in 9 years but it could also happen in 1 year

So? Your research doesn't have to be useful in every possible world. If a PhD increases the quality of your research by, say, 3x (which is plausible, since research is heavy-tailed) then it may well be better to do that research for half the time.
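
As a toy calculation (the 3x and the one-half are illustrative assumptions, not estimates):

$$\underbrace{3}_{\text{quality multiplier}} \times \underbrace{\tfrac{1}{2}}_{\text{fraction of time spent on research}} = 1.5 > 1$$

i.e. under these assumptions the PhD-first path still comes out ahead in the worlds where TAI arrives after the PhD finishes.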

(In general I don't think x-risk-motivated people should do PhDs that don't directly contribute to alignment, to be clear; I just think this isn't a good argument for that conclusion.)

Comment by Richard_Ngo (ricraz) on 'Empiricism!' as Anti-Epistemology · 2024-03-15T02:09:12.450Z · LW · GW

"Well, since it's too late there," said the Scientist, "would you maybe agree with me that 'eternal returns' is a prediction derived by looking at observations in a simple way, and then doing some pretty simple reasoning on it; and that's, like, cool?  Even if that coolness is not the single overwhelming decisive factor in what to believe?"

"Depends exactly what you mean by 'cool'," said the Epistemologist.

"Okay, let me give it a shot," said the Scientist. "Suppose you model me as having a bunch of subagents who make trades on some kind of internal prediction market. The whole time I've been watching Ponzi Pyramid Incorporated, I've had a very simple and dumb internal trader who has been making a bunch of money betting that they will keep going up by 20%. Of course, my mind contains a whole range of other traders too, so this one isn't able to swing the market by itself, but what I mean by 'cool' is that this trader does have a bunch of money now! (More than others do, because in my internal prediction markets, simpler traders start off with more money.)"

"The problem," said the Epistemologist, "is that you're in an adversarial context, where the observations you're seeing have been designed to make that particular simple trader rich. In that context, you shouldn't be giving those simple traders so much money to start off with; they'll just continue being exploited until you learn better."

"But is that the right place to intervene? After all, my internal prediction market is itself an adversarial process. And so the simple internal trader who just predicts that things will continue going up the same amount every year will be exploited by other internal traders as soon as it dares to venture a bet on, say, the returns of the previous company that our friend the Spokesperson worked at. Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader's money."

"You claim," said the Epistemologist, "to have these more sophisticated internal traders. Yet you started this conversation by defending the coolness, aka wealth, of the trader corresponding to the Spokesperson's predictions. So it seems like these sophisticated internal traders are not doing their work so well after all."

"They haven't taken its money yet," said the Scientist, "But they will before it gets a chance to invest any of my money. Nevertheless, your point is a good one; it's not very cool to only have money temporarily. Hmmm, let me muse on this."

The Scientist thinks for a few minutes, then speaks again.

"I'll try another attempt to describe what I mean by 'cool'. Often-times, clever arguers suggest new traders to me, and point out that those traders would have made a lot of money if they'd been trading earlier. Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders' future trades. But I have not world enough or time for this; so I've decided to subsidize new traders based on how they would have done if they'd been trading earlier. Of course, though, this leaves me vulnerable to clever arguers inventing overfitted traders. So the subsidy has to be proportional to how likely it is that the clever arguer could have picked out this specific trader in advance. And for all Spokesperson's flaws, I do think that 5 years ago he was probably saying something that sounded reasonably similar to '20% returns indefinitely!' That is the sense in which his claim is cool."

"Hmm," said the Epistemologist. "An interesting suggestion, but I note that you've departed from the language of traders in doing so. I feel suspicious that you're smuggling something in, in a way which I can't immediately notice."

"Right, which would be not very cool. Alas, I feel uncertain about how to put my observation into the language of traders. But... well, I've already said that simple traders start off with more money. So perhaps it's just the same thing as before, except that when evaluating new traders on old data I put extra weight on simplicity when deciding how much money they start with—because now it also helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses."

("Parenthetically," added the Scientist, "there are plenty of other signals of overfitting I take into account when deciding how much to subsidize new traders—like where I heard about them, and whether they match my biases and society's biases, and so on. Indeed, there are enough such signals that perhaps it's best to think of this as a process of many traders bidding on the question of how easy/hard it would have been for the clever arguer to have picked out this specific trader in advance. But this is getting into the weeds—the key point is that simplicity needs to be extra-strongly-prioritized when evaluating new traders on past data.")

Comment by Richard_Ngo (ricraz) on Notes from a Prompt Factory · 2024-03-10T19:26:00.588Z · LW · GW

I'm sorry you regret reading it. A content warning seems like a good idea, I've added one now.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-03-06T20:40:31.632Z · LW · GW

I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in "subagents" which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly "goal-directed" as you go up the hierarchy. This is an old idea, FWIW; e.g. it's how Minsky frames intelligence in Society of Mind. And it's also somewhat consistent with the claim made in the original shard theory post, that "shards are just collections of subshards".

The problem is the "just". The post also says "shards are not full subagents", and that "we currently estimate that most shards are 'optimizers' to the extent that a bacterium or a thermostat is an optimizer." But the whole point of thinking about shards, in my mind, is that it allows us to talk about a gradual spectrum from "heuristic" to "agent", and how the combination of low-level heuristics may in fact give rise to high-level agents which pursue consequentialist goals. I talk about this in my post on value systematization—e.g. using the example of how normal human moral "shards" (like caring about other people's welfare) can aggregate into highly-consequentialist utilitarian subagents. In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

(I make a similar point in the appendix of my value systematization post.)

Comment by Richard_Ngo (ricraz) on Evidential Cooperation in Large Worlds: Potential Objections & FAQ · 2024-02-28T22:17:57.800Z · LW · GW

Is ECL the same thing as acausal trade?

Typically, no. “Acausal trade” usually refers to a different mechanism: “I do this thing for you if you do this other thing for me.” Discussions of acausal trade often involve the agents attempting to simulate each other. In contrast, ECL flows through direct correlation: “If I do this, I learn that you are more likely to also do this.” For more, see Christiano (2022)’s discussion of correlation versus reciprocity and Oesterheld, 2017, section 6.1.

I'm skeptical about the extent to which these are actually different things. Oesterheld says "superrationality may be seen as a special case of acausal trade in which the agents’ knowledge implies the correlation directly, thus avoiding the need for explicit mutual modeling and the complications associated with it". So at the very least, we can think of one as a subset of the other (though I think I'd actually classify it the other way round, with acausal trade being a special case of superrationality).

But it's not just that. Consider an ECL model that concludes: "my decision is correlated with X's decision, therefore I should cooperate". But this conclusion also requires complicated recursive reasoning—specifically, reasons for thinking that the correlation holds even given that you're taking the correlation into account when making your decision.

(E.g. suppose that you know that you were similar to X, except that you are doing ECL and X isn't. But then ECL might break the previous correlation between you and X. So actually the ECL process needs to reason "the outcome of the decision process I'm currently doing is correlated with the outcome of the decision process that they're doing", and I think realistically finding a fixed point probably wouldn't look that different from standard descriptions of acausal trade.)

This may be another example of the phenomenon Paul describes in his post on why EDT > CDT: although EDT is technically more correct, in practice you need to do something like CDT to reason robustly. (In this case ECL is EDT and acausal trade is more CDTish.)

Comment by Richard_Ngo (ricraz) on Announcing Timaeus · 2024-02-22T20:12:32.419Z · LW · GW

FWIW for this sort of research I support a strong prior in favor of publishing.

Comment by Richard_Ngo (ricraz) on Every "Every Bay Area House Party" Bay Area House Party · 2024-02-20T08:41:54.238Z · LW · GW

The thing I'm picturing here is a futures contract where charizard-shirt-guy is obligated to deliver 3 trillion paperclips in exchange for one soul. And, assuming a reasonable discount rate, this is a better deal than only receiving a handful of paperclips now in exchange for the same soul. (I agree that you wouldn't want to invest in a current-market-price paperclip futures contract.)

Comment by Richard_Ngo (ricraz) on Masterpiece · 2024-02-16T05:25:31.103Z · LW · GW

Damn, MMDoom is a good one. New lore: it won the 2055 technique award.

Comment by Richard_Ngo (ricraz) on Masterpiece · 2024-02-15T09:00:19.951Z · LW · GW

Judges' ratings:

Technique: 5/10

The training techniques used here are in general very standard ones (although the dissonance filters were a nice touch). For a higher score on this metric, we would have expected more careful work to increase the stability of self-evaluation and/or the accuracy of the judgments.

Novelty: 7/10

While the initial premise was a novel one to us, we thought that more new ideas could have been incorporated into this entry in order for it to score more highly on this metric. For example, the "outliers" in the entry's predictions were a missed opportunity to communicate an underlying pattern. Similarly, the instability of the self-evaluation could have been incorporated into the entry in some clearer way.

Artistry: 9/10

We consider the piece a fascinating concept—one which forces the judges to confront the automatability of their own labors. Holding a mirror to the faces of viewers is certainly a classic artistic endeavor. We also appreciated the artistic irony of the entry's inability to perceive itself.

Comment by Richard_Ngo (ricraz) on Dreams of AI alignment: The danger of suggestive names · 2024-02-13T12:29:59.498Z · LW · GW

I think we have failed, thus far.  I'm sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy. Not easily would undeserved connotations sneak into our work and discourse. I no longer believe that and no longer hold that trust.

I empathize with this, and have complained similarly (e.g. here).

I have also been trying to figure out why I feel quite a strong urge to push back on posts like this one. E.g. in this case I do in fact agree that only a handful of people actually understand AI risk arguments well enough to avoid falling into "suggestive names" traps. But I think there's a kind of weak man effect where if you point out enough examples of people making these mistakes, it discredits even those people who avoid the trap.

Maybe another way of saying this: of course most people are wrong about a bunch of this stuff. But the jump from that to claiming the community or field has failed isn't a valid one, because the success of a field is much more dependent on max performance than mean performance.