Posts

rvnnt's Shortform 2025-02-13T15:21:33.046Z
Should you publish solutions to corrigibility? 2025-01-30T11:52:05.983Z
Requesting feedback/advice: what Type Theory to study for AI safety? 2020-06-23T17:03:32.452Z

Comments

Comment by rvnnt on rvnnt's Shortform · 2025-02-14T13:55:20.601Z · LW · GW

It's unclear whether there is a tipping point where [...]

Yes. Also unclear whether the 90% could coordinate to take any effective action, or whether any effective action would be available to them. (Might be hard to coordinate when AIs control/influence the information landscape; might be hard to rise up against e.g. robotic law enforcement or bioweapons.)

Don't use passive voice for this. [...]

Good point! I guess one way to frame that would be as

by what kind of process do the humans in law enforcement, military, and intelligence agencies get replaced by AIs? Who/what is in effective control of those systems (or their successors) at various points in time?

And yeah, that seems very difficult to predict or reliably control. OTOH, if someone were to gain control of the AIs (possibly even copies of a single model?) that are running all the systems, that might make centralized control easier? </wild, probably-useless speculation>

Comment by rvnnt on rvnnt's Shortform · 2025-02-13T15:21:33.039Z · LW · GW

A potentially somewhat important thing which I haven't seen discussed:

  • People who have a lot of political power or own a lot of capital are unlikely to be adversely affected if (say) 90% of human labor becomes obsolete and is replaced by AI.
  • In fact, so long as property rights are enforced, and humans retain a monopoly on decisionmaking/political power, such people are not-unlikely to benefit from the economic boost that such automation would bring.
  • Decisions about AI policy are mostly determined by people with a lot of capital or political power. (E.g. Andreessen Horowitz, JD Vance, Trump, etc.)

(This looks like a "decisionmaker is not the beneficiary"-type of situation.)

Why does that matter?

  • It has implications for modeling decisionmakers, interpreting their words, and for how to interact with them.[1]

  • If we are in a gradual-takeoff world[2], then we should perhaps not be too surprised to see the wealthy and powerful push for AI-related policies that make them more wealthy and powerful, while a majority of humans become disempowered and starve to death (or live in destitution, or get put down with viruses or robotic armies, or whatever). (OTOH, I'm not sure if that possibility can be planned/prepared for, so maybe that's irrelevant, actually?)


  1. For example: we maybe should not expect decisionmakers to take risks from AI seriously until they realize those risks include a high probability of "I, personally, will die". As another example: when people like JD Vance output rhetoric like "[AI] is not going to replace human beings. It will never replace human beings", we should perhaps not just infer that "Vance does not believe in AGI", but instead also assign some probability to hypotheses like "Vance thinks AGI will in fact replace lots of human beings, just not him personally; and he maybe does not believe in ASI, or imagines he will be able to control ASI". ↩︎

  2. Here I'll define "gradual takeoff" very loosely as "a world in which there is a >1 year window during which it is possible to replace >90% of human labor, before the first ASI comes into existence". ↩︎

Comment by rvnnt on Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion? · 2025-02-11T18:29:38.036Z · LW · GW

Thank you for (being one of the horrifyingly few people) doing sane reporting on these crucially important topics.

Comment by rvnnt on Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion? · 2025-02-11T11:52:25.963Z · LW · GW

Typo: "And humanity needs all the help we it can get."

Comment by rvnnt on Altman blog on post-AGI world · 2025-02-10T10:53:28.539Z · LW · GW

Out of (1)-(3), I think (3)[1] is clearly most probable:

  • I think (2) would require Altman to be deeply un-strategic/un-agentic, which seems in stark conflict with all the skillful playing-of-power-games he has displayed.
  • (3) seems strongly in-character with the kind of manipulative/deceitful maneuvering-into-power he has displayed thus far.
  • I suppose (1) is plausible; but for that to be his only motive, he would have to be rather deeply un-strategic (which does not seem to be the case).

(Of course one could also come up with other possibilities besides (1)-(3).)[2]


  1. or some combination of (1) and (3) ↩︎

  2. E.g. maybe he plans to keep ASI to himself, but use it to implement all-of-humanity's CEV, or something. OTOH, I think the kind of person who would do that would not exhibit so much lying, manipulation, exacerbating-arms-races, and gambling-with-everyone's-lives. Or maybe he doesn't believe ASI will be particularly impactful; but that seems even less plausible. ↩︎

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-31T14:24:45.879Z · LW · GW

Note that our light cone with zero value might also eclipse other light cones that might've had value if we didn't let our AGI go rogue to avoid s-risk.

That's a good thing to consider! However, taking Earth's situation as a prior for other "cradles of intelligence", I think that consideration brings us back to the question of "should we expect Earth's lightcone to be better or worse than zero-value (conditional on corrigibility)?"

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-31T14:23:34.753Z · LW · GW

To me, those odds each seem optimistic by a factor of about 1000, but ~reasonable relative to each other.

(I don't see any low-cost way to find out why we disagree so strongly, though. Moving on, I guess.)

But this isn't any worse to me than being killed [...]

Makes sense (given your low odds for bad outcomes).

Do you also care about minds that are not you, though? Do you expect most future minds/persons that are brought into existence to have nice lives, if (say) Donald "Grab Them By The Pussy" Trump became god-emperor (and was the one deciding what persons/minds get to exist)?

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-31T14:16:03.657Z · LW · GW

IIUC, your model would (at least tentatively) predict that

  • if person P has a lot of power over person Q,
  • and P is not sadistic,
  • and P is sufficiently secure/well-resourced that P doesn't "need" to exploit Q,
  • then P will not intentionally do anything that would be horrible for Q?

If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave?

That last point also raises the question: Would you prefer the existence of lots of (either happily or grudgingly) submissive slaves over oblivion?

To me it seems that terrible outcomes do not require sadism. Seems sufficient that P be low in empathy, and want from Q something Q does not want to provide (like admiration, submission, sex, violent sport, or even just attention).[1] I'm confused as to how/why you disagree.


  1. Also, AFAICT, about 0.5% to 8% of humans are sadistic, and about 8% to 16% have very little or zero empathy. How did you arrive at "99% of humanity [...] are not so sadistic"? Did you account for the fact that most people with sadistic inclinations probably try to hide those inclinations? (Like, if only 0.5% of people appear sadistic, then I'd expect the actual prevalence of sadism to be more like ~4%.) ↩︎

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-30T14:36:15.138Z · LW · GW

It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely the be friendly to humanity.

I'm assuming neither. I agree with you that both seem (very) unlikely. [1]

It seems like you're assuming that any humans succeeding in controlling AGI is (in expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".


  1. I expect that unaligned ASI would lead to extinction, and our share of the lightcone being devoid of value or disvalue. I'm quite uncertain, though. ↩︎

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-30T14:30:39.271Z · LW · GW

It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.

I think there is a key disanalogy to the situation with AGI: The analogy would be stronger if the bomb were likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power on whoever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of S-risks from corrupt(ible) humans gaining godlike power.

If you agree that there exists that kind of tradeoff, I'm curious as to why you think it's better to trade in the direction of decreasing probability-of-death for increased probability-of-suffering.

So, the question I'm most interested in is the one at the end of the post[1], viz

What (crucial) considerations should one take into account, when deciding whether to publish---or with whom to privately share---various kinds of corrigibility-related results?


  1. Didn't put it in the title, because I figured that'd be too long of a title. ↩︎

Comment by rvnnt on Should you publish solutions to corrigibility? · 2025-01-30T11:54:57.689Z · LW · GW

Taking a stab at answering my own question; an almost-certainly non-exhaustive list:

  • Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn't be made applicable?

  • Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD'ing) AGIs?

  • How much weight one gives to avoiding x-risks vs s-risks.[2]

  • Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results with a smaller number of safety-conscious researchers? (What does the cost-benefit analysis look like? Did I even do one?)

  • How optimistic (or pessimistic) one is about the common-good commitment (or corruptibility) of the people who one thinks might end up wielding corrigible AGIs.


  1. Something like the True Name of corrigibility might at first glance seem applicable only to AIs of whose internals we have some meaningful understanding or control. ↩︎

  2. If corrigibility were easily feasible, then at first glance, that would seem to reduce the probability of extinction (via unaligned AI), but increase the probability of astronomical suffering (under god-emperor Altman/Ratcliffe/Xi/Putin/...). ↩︎

Comment by rvnnt on What are the differences between AGI, transformative AI, and superintelligence? · 2025-01-24T14:27:59.445Z · LW · GW

I think the main value of that operationalization is enabling more concrete thinking/forecasting about how AI might progress. It models some of the relevant causal structure of reality, at a reasonable level of abstraction: not too nitty-gritty[1], not too abstract[2].


  1. which would lead to "losing the forest for the trees", make the abstraction too effortful to use in practice, and/or risk making it irrelevant as soon as something changes in the world of AI ↩︎

  2. e.g. a higher-level abstraction like "AI that speeds up AI development by a factor of N" might at first glance seem more useful. But as you and ryan noted, speed-of-AI-development depends on many factors, so that operationalization would be mixing together many distinct things, hiding relevant causal structures of reality, and making it difficult/confusing to think about AI development. ↩︎

Comment by rvnnt on What are the differences between AGI, transformative AI, and superintelligence? · 2025-01-23T14:24:56.244Z · LW · GW

I think this approach to thinking about AI capabilities is quite pertinent. Could be worth including "Nx AI R&D labor AIs" in the list?

Comment by rvnnt on Daniel Kokotajlo's Shortform · 2025-01-23T13:57:14.521Z · LW · GW

Cogent framing; thanks for writing it. I'd be very interested to read your framing for the problem of "how do we get to a good future for humanity, conditional on the first attractor state for AGI alignment?"[1]


  1. Would you frame it as "the AGI lab leadership alignment problem"? Or a governance problem? Or something else? ↩︎

Comment by rvnnt on Orienting to 3 year AGI timelines · 2025-01-03T15:33:58.555Z · LW · GW

Thanks for the answer. It's nice to get data about how other people think about this subject.

the concern that the more sociopathic people wind up in positions of power is the big concern.

Agreed!

Do I understand correctly: You'd guess that

  • 99% of humans have a "positive empathy-sadism balance",
  • and of those, 90-99% could be trusted to control the world (via controlling ASI),
  • i.e., ~89-98% of humanity could be trusted to control the world with ASI-grade power?

If so, then I'm curious -- and somewhat bewildered! -- as to how you arrived at those guesses/numbers.

I'm under the impression that narcissism and sadism have prevalences of very roughly 6% and 4%, respectively. See e.g. this post, or the studies cited therein. Additionally, probably something like 1% to 10% of people are psychopaths, depending on what criteria are used to define "psychopathy". Even assuming there's a lot of overlap, I think a reasonable guess would be that ~8% of humans have at least one of those traits. (Or 10%, if we include psychopathy.)
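
As a rough back-of-the-envelope check (assuming, unrealistically, that the traits are independent): the fraction of people with at least one of the two traits would be $1 - (1 - 0.06)(1 - 0.04) \approx 9.8\%$, or $1 - (1 - 0.06)(1 - 0.04)(1 - 0.05) \approx 14\%$ if psychopathy is included at ~5%. Overlap between the traits pulls those numbers down toward the largest single prevalence (~6%), so ~8% (or ~10% with psychopathy) lands in between those bounds.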

I'm guessing you disagree with those statistics? If yes, what other evidence leads you to your different (much lower) estimates?

Do you believe that someone with (sub-)clinical narcissism, if given the keys to the universe, would bring about good outcomes for all (with probability >90%)? Why/how? What about psychopaths?

Do you completely disagree with the aphorism that "power corrupts, and absolute power corrupts absolutely"?

Do you think that having good intentions (and +0 to +3 SD intelligence) is probably enough for someone to produce good outcomes, if they're given ASI-grade power?

FWIW, my guesstimates are that

  • over 50% of genpop would become corrupted by ASI-grade power, or are sadistic/narcissistic/psychopathic/spiteful to begin with,
  • of the remainder, >50% would fuck things up astronomically, despite their good intentions[1],
  • genetic traits like psychopathy and narcissism (not sure about sadism), and acquired traits like cynicism, are much more prevalent (~5x odds?) in people who will end up in charge of AGI projects, relative to genpop. OTOH, competence at not-going-insane is likely higher among them too.

it would be so easy to benefit humanity, just by telling your slave AGI to go make it happen. A lot of people would enjoy being hailed as a benevolent hero

I note that if someone is using an AGI as a slave, and is motivated by wanting prestige status, then I do not expect that to end well for anyone else. (Someone with moderate power, e.g. a medieval king, with the drive to be hailed a benevolent hero, might indeed do great things for other people. But someone with more extreme power -- like ASI-grade power -- could just... rewire everyone's brains; or create worlds full of suffering wretches, for him to save and be hailed/adored by; or... you get the idea.)


  1. Even relatively trivial things like social media or drugs mess lots of humans up; and things like "ability to make arbitrary modifications to your mind" or "ability to do anything you want, to anyone, with complete impunity" are even further OOD, and open up even more powerful superstimuli/reward-system hacks. Aside from tempting/corrupting humans to become selfish, I think that kind of situation has high potential to just lead to them going insane or breaking (e.g. start wireheading) in any number of ways.

    And then there are other failure modes, like insufficient moral uncertainty and locking in some parochial choice of values, or a set of values that made sense in some baseline human context but which generalize to something horrible. ("Obviously we should fill the universe with Democracy/Christianity/Islam/Hedonism/whatever!", ... "Oops, turns out Yahweh is pretty horrible, actually!") ↩︎

Comment by rvnnt on Orienting to 3 year AGI timelines · 2024-12-27T13:11:41.488Z · LW · GW

I'd be interested to see that draft as a post!

What fraction of humans in set X would you guess have a "positive empathy-sadism balance", for

  • X = all of humanity?
  • X = people in control of (governmental) AGI projects?

I agree that the social environment / circumstances could have a large effect on whether someone ends up wielding power selfishly or benevolently. I wonder if there's any way anyone concerned about x/s-risks could meaningfully affect those conditions.

I'm guessing[1] I'm quite a bit more pessimistic than you about what fraction of humans would produce good outcomes if they controlled the world.


  1. with a lot of uncertainty, due to ignorance of your models. ↩︎

Comment by rvnnt on Orienting to 3 year AGI timelines · 2024-12-27T13:10:05.209Z · LW · GW

I agree that "strengthening democracy" sounds nice, and also that it's too vague to be actionable. Also, what exactly would be the causal chain from "stronger democracy" (whatever that means) to "command structure in the nationalized AGI project is trustworthy and robustly aligned to the common good"?

If you have any more concrete ideas in this domain, I'd be interested to read about them!

Comment by rvnnt on Orienting to 3 year AGI timelines · 2024-12-27T13:09:35.389Z · LW · GW

Pushing for nationalization or not might affect when it's done, giving some modicum of control.

I notice that I have almost no concrete model of what that sentence means. A couple of salient questions[1] I'd be very curious to hear answers to:

  • What concrete ways exist for affecting when (and how) nationalization is done? (How, concretely, does one "push" for/against nationalization of AGI?)

  • By what concrete causal mechanism could pushing for nationalization confer a modicum of control; and control over what exactly, and to whom?


  1. Other questions I wish I (or people advocating for any policy w.r.t. AGI) had answers to include (i.a.) "How could I/we/anyone ensure that the resulting AGI project actually benefits everyone? Who, in actual concrete practice, would end up effectively having control over the AGI? How could (e.g.) the public hold those people accountable, even as those people gain unassailable power? How do we ensure that those people are not malevolent to begin with, and also don't become corrupted by power? What kinds of oversight mechanisms could be built, and how?" ↩︎

Comment by rvnnt on Orienting to 3 year AGI timelines · 2024-12-23T16:29:54.144Z · LW · GW

make their models sufficiently safe

What does "safe" mean, in this post?

Do you mean something like "effectively controllable"? If yes: controlled by whom? Suppose AGI were controlled by some high-ranking people at (e.g.) the NSA; with what probability do you think that would be "safe" for most people?

Comment by rvnnt on Orienting to 3 year AGI timelines · 2024-12-23T16:29:02.571Z · LW · GW

Doing nationalization right

I think this post (or the models/thinking that generated it) might be missing an important consideration[1]: "Is it possible to ensure that the nationalized AGI project does not end up de facto controlled by not-good people? If yes, how?"

Relevant quote from Yudkowsky's Six Dimensions of Operational Adequacy in AGI Projects (emphasis added):

Opsec [...] Military-grade or national-security-grade security. (It's hard to see how attempts to get this could avoid being counterproductive, considering the difficulty of obtaining trustworthy command and common good commitment with respect to any entity that can deploy such force [...])

Another quote (emphasis mine):

You cannot possibly cause such a project[2] to exist with adequately trustworthy command, alignment mindset, and common-good commitment, and you should therefore not try to make it exist, first because you will simply create a still more dire competitor developing unaligned AGI, and second because if such an AGI could be aligned it would be a hell of an s-risk given the probable command structure.


  1. or possibly a crucial consideration ↩︎

  2. The quote is referring to "[...] a single global Manhattan Project which is somehow not answerable to non-common-good command such as Trump or Putin or the United Nations Security Council. [...]" ↩︎

Comment by rvnnt on Hierarchical Agency: A Missing Piece in AI Alignment · 2024-11-27T20:07:44.815Z · LW · GW

A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like

  • legal systems, police
  • immune systems (esp. in suppressing cancer)
  • social norms, reputation systems, etc.

I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.

Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?

Comment by rvnnt on Another argument against alignment paradigms that center on paperclips · 2024-09-22T18:12:53.314Z · LW · GW

find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system

We could maybe weaken this requirement? Perhaps it would suffice to show/argue that it's feasible[1] to build any kind of "acute-risk-period-ending AI"[2] that is not a "subject of AGI-doom arguments"?

I'd be (very) curious to see such arguments. [3]


  1. within time constraints, before anyone else builds a "subject of AGI-doom arguments" ↩︎

  2. or, "AIs that implement humanity's CEV" ↩︎

  3. If I became convinced that it's feasible to build such a "pivotal AI" that is not "subject to AGI doom arguments", I think that would shift a bunch of my probability mass from "we die due to unaligned AI" to "we die-or-worse due to misaligned humans controlling ASI" and "utopia". ↩︎

Comment by rvnnt on AI, centralization, and the One Ring · 2024-09-14T11:10:31.158Z · LW · GW

I think this is an important subject and I agree with much of this post. However, I think the framing/perspective might be subtly but importantly wrong-or-confused.

To illustrate:

How much of the issue here is about the very singular nature of the One dominant project, vs centralization more generally into a small number of projects?

Seems to me that centralization of power per se is not the problem.

I think the problem is something more like

  • we want to give as much power as possible to "good" processes, e.g. a process that robustly pursues humanity's CEV[1]; and we want to minimize the power held by "evil" processes

  • but: a large fraction of humans are evil, or become evil once prosocial pressures are removed; and we do not know how to reliably construct "good" AIs

  • and also: we (humans) are confused and in disagreement about what "good" even means

  • and even if it were clear what a "good goal" is, we have no reliable way of ensuring that an AI or a human institution is robustly pursuing such a goal.

I agree that (given the above conditions) concentrating power into the hands of a few humans or AIs would in expectation be (very) bad. (OTOH, a decentralized race is also very bad.) But concentration-vs-decentralization of power is just one relevant consideration among many.

Thus: if the quoted question has an implicit assumption like "the main variable to tweak is distribution-of-power", then I think it is trying to carve the problem at unnatural joints, or making a false implicit assumption that might lead to ignoring multiple other important variables.

(And less centralization of power has serious dangers of its own. See e.g. Wei Dai's comment.)

I think a more productive frame might be something like "how do we construct incentives, oversight, distribution of power, and other mechanisms, such that Ring Projects remain robustly aligned to 'the greater good'?"

And maybe also "how do we become less confused about what 'the greater good' even is, in a way that is practically applicable to aligning Ring Projects?"


  1. If such a thing is even possible. ↩︎

Comment by rvnnt on TurnTrout's shortform feed · 2024-08-14T11:21:21.282Z · LW · GW

Upvoted and disagreed. [1]

One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like

[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly "useful/capable".

Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model's weights to 0.0, thereby guaranteeing the non-entrainment of any ("bad") circuits.

I'm curious: what do you think would be a good (...useful?) operationalization of "useful/capable"?

Another issue: K and epsilon might need to be unrealistically small. Once the model starts modifying itself (or constructing successor models) (and possibly earlier), a single strategically-placed sign-flip in the model's outputs might cause catastrophe. [2]


  1. I think writing one's thoughts/intuitions out like this is valuable --- for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best). ↩︎

  2. Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly "localized/concentrated" in some sense. (OTOH, that seems likely to at least eventually be the case?) ↩︎

Comment by rvnnt on Forecasting: the way I think about it · 2024-05-09T11:00:35.708Z · LW · GW

In Fig 1, is the vertical axis P(world)?

Comment by rvnnt on AI Clarity: An Initial Research Agenda · 2024-05-04T12:22:30.037Z · LW · GW

Possibly a nitpick, but:

The development and deployment of AGI, or similarly advanced systems, could constitute a transformation rivaling those of the agricultural and industrial revolutions.

seems like a very strong understatement. Maybe replace "rivaling" with e.g. "(vastly) exceeding"?

Comment by rvnnt on AI #56: Blackwell That Ends Well · 2024-03-21T14:46:20.117Z · LW · GW

Referring to the quote-picture from the Nvidia GTC keynote talk: I searched the talk's transcript, and could not find anything like the quote.

Could someone point out time-stamps of where Huang says (or implies) anything like the quote? Or is the quote entirely made up?

Comment by rvnnt on 7. Evolution and Ethics · 2024-02-17T10:04:47.091Z · LW · GW

That clarifies a bunch of things. Thanks!

Comment by rvnnt on 7. Evolution and Ethics · 2024-02-16T12:49:07.029Z · LW · GW

I'm not sure I understand what the post's central claim/conclusion is. I'm curious to understand it better. To focus on the Summary:

So overall, evolution is the source of ethics,

Do you mean: Evolution is the process that produced humans, and strongly influenced humans' ethics? Or are you claiming that (humans') evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?

and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]

...according to some hypothetical evolved agents' ethical framework, under the assumption that those evolved agents managed to construct the created agents in the right ways (to not want moral patienthood etc.)? Or was the quoted sentence making some stronger claim?

evolution and evolved beings having a special role in Ethics is not just entirely justified, but inevitable

Is that sentence saying that

  • evolution and evolved beings are of special importance in any theory of ethics (what ethics are, how they arise, etc.), due to Evolution being one of the primary processes that produce agents with moral/ethical preferences [1]

or is it saying something like

  • evolution and evolved beings ought to have a special role; or we ought to regard the preferences of evolved beings as the True Morality?

I roughly agree with the first version; I strongly disagree with the second: I agree that {what oughts humans have} is (partially) explained by Evolutionary theory. I don't see how that crosses the is-ought gap. If you're saying that that somehow does cross the is-ought gap, could you explain why/how?


  1. I.e., similar to how one might say "amino acids having a special role in Biochemistry is not just entirely justified, but inevitable"? ↩︎

Comment by rvnnt on Conversation Visualizer · 2023-12-31T12:29:23.783Z · LW · GW

I wonder how much work it'd take to implement a system that incrementally generates a graph of the entire conversation. (Vertices would be sub-topics, represented as e.g. a thumbnail image + a short text summary.) Would require the GPT to be able to (i.a.) understand the logical content of the discussion, and detect when a topic is revisited, etc. Could be useful for improving clarity/productivity of conversations.
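
As a very rough sketch of what the underlying data structure might look like (entirely hypothetical; `summarize_topic` and `matches_topic` are stand-ins for whatever GPT-based summarization/classification such a system would actually need):

```python
from dataclasses import dataclass, field
from typing import Optional


def summarize_topic(text: str) -> str:
    # Placeholder: in a real system this would ask the GPT for a short
    # summary of the new sub-topic (and possibly a thumbnail image).
    return text[:60]


def matches_topic(text: str, summary: str) -> bool:
    # Placeholder: in a real system this would ask the GPT whether `text`
    # continues (or revisits) the sub-topic described by `summary`.
    return any(word in text.lower() for word in summary.lower().split())


@dataclass
class TopicNode:
    """One vertex of the conversation graph: a sub-topic."""
    topic_id: int
    summary: str
    thumbnail_url: Optional[str] = None
    message_indices: list = field(default_factory=list)  # messages touching this topic


@dataclass
class ConversationGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (from_topic, to_topic) transitions
    current_topic: Optional[int] = None

    def add_message(self, msg_index: int, text: str) -> None:
        """Incrementally update the graph with one new message."""
        node = next((n for n in self.nodes if matches_topic(text, n.summary)), None)
        if node is None:  # no existing sub-topic matched: open a new vertex
            node = TopicNode(topic_id=len(self.nodes), summary=summarize_topic(text))
            self.nodes.append(node)
        node.message_indices.append(msg_index)
        # Record a transition edge when the conversation switches (or returns) to a topic.
        if self.current_topic is not None and self.current_topic != node.topic_id:
            self.edges.append((self.current_topic, node.topic_id))
        self.current_topic = node.topic_id
```

Most of the actual work (and cost) would be in the two placeholder calls, i.e. in getting the GPT to reliably segment sub-topics and detect when one is revisited.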

Comment by rvnnt on Vote on Interesting Disagreements · 2023-11-09T10:16:45.424Z · LW · GW

One of the main questions on which I'd like to understand others' views is something like: Conditional on sentient/conscious humans[1] continuing to exist in an x-risk scenario[2], with what probability do you think they will be in an inescapable dystopia[3]?

(My own current guess is that dystopia is very likely.)


  1. or non-human minds, other than the machines/Minds that are in control ↩︎

  2. as defined by Bostrom, i.e. "the permanent and drastic destruction of [humanity's] potential for desirable future development" ↩︎

  3. Versus e.g. just limited to a small disempowered population, but living in pleasant conditions? Or a large population living in unpleasant conditions, but where everyone at least has the option of suicide? ↩︎

Comment by rvnnt on Vote on Interesting Disagreements · 2023-11-09T10:14:21.040Z · LW · GW

That makes sense; but:

so far outside the realm of human reckoning that I'm not sure it's reasonable to call them dystopian.

setting aside the question of what to call such scenarios, with what probability do you think the humans[1] in those scenarios would (strongly) prefer to not exist?


  1. or non-human minds, other than the machines/Minds that are in control ↩︎

Comment by rvnnt on Vote on Interesting Disagreements · 2023-11-08T13:01:59.664Z · LW · GW

non-extinction AI x-risk scenarios are unlikely

Many people disagreed with that. So, apparently many people believe that inescapable dystopias are not-unlikely? (If you're one of the people who disagreed with the quote, I'm curious to hear your thoughts on this.)

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-18T10:44:52.376Z · LW · GW

(Ah. Seems we were using the terms "(alignment) success/failure" differently. Thanks for noting it.)

In-retrospect-obvious key question I should've already asked: Conditional on (some representative group of) humans succeeding at aligning ASI, what fraction of the maximum possible value-from-Evolution's-perspective do you expect the future to attain? [1]

My modal guess is that the future would attain ~1% of maximum possible "Evolution-value".[2]

If tech evolution is similar enough to bio evolution then we should roughly expect tech evolution to have a similar level of success

Seems like a reasonable (albeit very preliminary/weak) outside view, sure. So, under that heuristic, I'd guess that the future will attain ~1% of max possible "human-value".


  1. setting completely aside whether to consider the present "success" or "failure" from Evolution's perspective. ↩︎

  2. I'd call that failure on Evolution's part, but IIUC you'd call it partial success? (Since the absolute value would still be high?) ↩︎

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-17T12:39:02.095Z · LW · GW

Evolution has succeeded at aligning homo sapiens brains to date

I'm guessing we agree on the following:

  • Evolution shaped humans to have various context-dependent drives (call them Shards) and the ability to mentally represent and pursue complex goals. Those Shards were good proxies for IGF in the EEA[1].

  • Those Shards were also good[2] enough to produce billions of humans in the modern environment. However, it is also the case that most modern humans spend at least part of their optimization power on things orthogonal to IGF.

I think our disagreement here maybe boils down to approximately the following question:

With what probability are we in each of the following worlds?

  • (World A) The Shards only work[2:1] conditional on the environment being sufficiently similar to the EEA, and humans not having too much optimization power. If the environment changes too far OOD, or if humans were to gain a lot of power[3], then the Shards would cease to be good[2:2] proxies.

    In this world, we should expect the future to contain only a small fraction[4] of the "value" it would have, if humanity were fully "aligned"[2:3]. I.e. Evolution failed to "(robustly) align humanity".

  • (World B) The Shards (in combination with other structures in human DNA/brains) are in fact sufficiently robust that they will keep humanity aligned[2:4] even in the face of distributional shift and humans gaining vast optimization power.

    In this world, we should expect the future to contain a large fraction of the "value" it would have, if humanity were fully "aligned"[2:5]. I.e. Evolution succeeded in "(robustly) aligning humanity".

  • (World C) Something else?

I think we're probably in (A), and IIUC, you think we're most likely in (B). Do you consider this an adequate characterization?

If yes, the obvious next question would be: What tests could we run, what observations could we make,[5] that would help us discern whether we're in (A) or (B) (or (C))?

(For example: I think the kinds of observations I listed in my previous comment are moderate-to-strong evidence for (A); and the existence of some explicit-IGF-maximizing humans is weak evidence for (B).)


  1. Environment of evolutionary adaptedness. For humans: hunter-gatherer tribes on the savanna, or maybe primitive subsistence agriculture societies. ↩︎

  2. in the sense of optimizing for IGF, or whatever we're imagining Evolution to "care" about. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  3. e.g. ability to upload their minds, construct virtual worlds, etc. ↩︎

  4. Possibly (but not necessarily) still a large quantity in absolute terms. ↩︎

  5. Without waiting a possibly-long time to watch how things in fact play out. ↩︎

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-17T12:32:22.550Z · LW · GW

vast computation some of which is applied to ancestral simulations

I agree that a successful post-human world would probably involve a large amount[1] of resources spent on simulating (or physically instantiating) things like humans engaging in play, sex, adventure, violence, etc. IOW, engaging in the things for which Evolution installed Shards in us. However, I think that is not the same as [whatever Evolution would care about, if Evolution could care about anything]. For the post-human future to be a success from Evolution's perspective, I think it would have to be full of something more like [programs (sentient or not, DNA or digital) striving to make as many copies of themselves as possible].

(If we make the notion of "DNA" too broad/vague, then we could interpret almost any future outcome as "success for Evolution".)


  1. a large absolute amount, but maybe not a large relative amount. ↩︎

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-16T15:20:12.680Z · LW · GW

Humans have not put an end to biological life.

Yup. I, too, have noticed that.

Your doom[1] predictions [...]

C'mon, man, that's obviously a misrepresentation of what I was saying. Or maybe my earlier comment failed badly at communication? In case that's so, here's an attempted clarification (bolded parts added):

If Evolution had a lot more time (than I expect it to have) to align humans to relative-gene-replication-count, before humans put an end to biological life, as they seem to me to be on track to do, based on things I have observed in the past, then [...] But Evolution (almost surely) does not have infinite time [...]

Point being: Sure, Evolution managed to cough up some individuals who explicitly optimize for IGF. But they're exceptions, not the rule; and humanity seems (based on past observations!) to be on track to (mostly) end DNA-based life. So it seems premature to say that Evolution succeeded at aligning humanity.

In case you're wondering what past observations lead me to think that humans are unaligned[2] w.r.t. IGF and on track to end (or transcend) biological life, here are some off the top of my head:

  • Of the people whose opinions on the subject I'm aware of (including myself), nearly all would like to transcend (or end) biological life.[3]

  • Birth rates in most developed nations have been low or below replacement for a long time.[4] There seems to be a negative correlation between wealth/education and number of offspring produced. That matches my impression that as people gain wealth, education, and empowerment in general, most choose to spend it mostly on something other than producing offspring.

  • Diligent sperm bank donors are noteworthy exceptions. Most people are not picking obvious low-hanging fruit for increasing their IGF. Rich people waste money on yachts and stuff, instead of using it to churn out as many high-fitness offspring as possible; etc.

  • AFAIK, most of the many humans racing to build ASI are not doing so with the goal of increasing their IGF. And absent successful attempts to align ASI specifically to producing lots of DNA-based replicators, I don't see strong reason to expect the future to be optimized for quantity of DNA-based replicators.

Perhaps you disagree with the last point above?

There are many other scenarios where DNA flourishes even after a posthuman transition.

Interesting. Could you list a few of those scenarios?


  1. Note: I wasn't even talking (only) about doom; I was talking about humanity seemingly being on track to end biological life. I think the "good" outcomes probably also involve transcending biology/DNA-based replicators. ↩︎

  2. to the extent that it even makes sense to talk about incoherent things like humans being "(mis/un)aligned" to anything. ↩︎

  3. My sample might not be super representative of humanity as a whole. Maybe somewhat representative of people involved in AI, though? ↩︎

  4. At least according to sources like this: https://en.wikipedia.org/wiki/Total_fertility_rate ↩︎

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-12T09:09:34.202Z · LW · GW

evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! [...]

If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or "retries" --- humanity is in the process of executing something like a "sharp left turn", and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.

Comment by rvnnt on Evolution Solved Alignment (what sharp left turn?) · 2023-10-12T09:05:41.004Z · LW · GW

The utility function is fitness: gene replication count (of the human defining genes) [1]

Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all "life". I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.

Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)


  1. Obviously Evolution doesn't actually have a utility function, and if it did, gene-replication-count is probably not it, as TekhneMakre points out. But, let's accept that for now, arguendo. ↩︎

Comment by rvnnt on How to talk about reasons why AGI might not be near? · 2023-09-18T09:34:05.845Z · LW · GW

I mostly agree.

I also think that impact is very unevenly distributed over people; the most impactful 5% of people probably account for >70% of the impact. [1]

And if so, then the difference in positive impact between {informing the top 5%} and {broadcasting to the field in general on the open Internet} is probably not very large. [2]

Possibly also worth considering: Would (e.g.) writing a public post actually reach those few key people more effectively than (e.g.) sending a handful of direct/targeted emails? [3]


  1. Talking about AI (alignment) here, but I think something like this applies in many fields. I don't have a good quantification of "impact" in mind, though, so this is very hand-wavey. ↩︎

  2. Each approach has its downsides. The first approach requires identifying the relevant people, and is likely more effortful. The latter approach has the downside of putting potentially world-ending information in the hands of people who would use it to end the world (a bit sooner than they otherwise would). ↩︎

  3. What is in fact the most effective way to reach whoever needs to be reached? (I don't know.) ↩︎

Comment by rvnnt on How to talk about reasons why AGI might not be near? · 2023-09-17T09:56:25.146Z · LW · GW

If {the reasoning for why AGI might not be near} comprises {a list of missing capabilities}, then my current guess is that the least-bad option would be to share that reasoning in private with a small number of relevant (and sufficiently trustworthy) people[1].

(More generally, my priors strongly suggest keeping any pointers to AGI-enabling capabilities private.)


  1. E.g. the most capable alignment researchers who seem (to you) to be making bad strategic decisions due to not having considered {the reasoning for why AGI might not be near}. ↩︎

Comment by rvnnt on Optimization, loss set at variance in RL · 2023-07-23T10:12:00.188Z · LW · GW

I can't critique your plan, because I can't parse your writing. My suggestion would be to put some effort into improving the clarity of your writing. [1]


  1. Even basic things, such as the avoidance of long sentences, sometimes with side notes included and separated from the main sentence by commas, rather than e.g. em dashes, and making the scopes of various syntactic structures unambiguous, could go a long way towards making your text more legible. ↩︎

Comment by rvnnt on How necessary is intuition, for advanced math? · 2023-07-20T09:44:21.108Z · LW · GW

[...] bridge the "gap" between (less-precise proofs backed by advanced intuition) and (precise proofs simple enough for basically anyone to technically "follow").

Meta: Please consider using curly or square brackets ({} or []) for conceptual/grammatical grouping; please avoid overloading parentheses.

Comment by rvnnt on Simple alignment plan that maybe works · 2023-07-19T10:21:44.593Z · LW · GW

Thumbs up for trying to think of novel approaches to solving the alignment problem.

Every time the model does something that harms the utility function of the dumber models, it gets a loss function.

A few confusions:

  • By "it gets a loss function", did you mean "it gets negative reward"?
  • If yes, doesn't this plan consist entirely of reinforcement learning? How does this "emulate Evolution"?
  • What exactly does the quoted sentence mean? Does the smarter model (S) receive RL signals proportional to... changes in the dumber agents' (D's) total utility?

Some problems, off the top of my head:

  • GPT-like models don't have utility functions.

  • Even if they did, mechinterp is nowhere near advanced enough to be able to reveal models' utility functions.

  • Humans don't have utility functions. It's unclear how this would generalize to human-alignment.

  • It's very much unclear what policy S would end up learning in this RL setup. It's even less clear how that policy would generalize outside of training.

    • If S is given reward proportional to (changes in) D's utility, then basically we're just training S with D's utility function. I.e., just training some arbitrary RL policy/agent. Not much to do with alignment, AFAICT. [1]
    • If S is instead given reward for things like {taking actions that lead to obtaining information about D's utility function}, then... we're training an RL policy/agent on proxies to "alignment". I expect that kind of approach to break down badly (due to Goodhart) when S becomes highly capable.

I don't know how you arrived at this plan, but I'm guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (i.a.) these tools/techniques:


  1. Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and D's be very accurate models of humans. But that's not remotely realistic; and still subject to Goodhart. ↩︎

Comment by rvnnt on Where are the people building AGI in the non-dumb way? · 2023-07-12T15:56:08.629Z · LW · GW

Tamsin Leake's project might match what you're looking for.

Comment by rvnnt on [Linkpost] Introducing Superalignment · 2023-07-06T10:48:16.762Z · LW · GW

[...] iteratively align superintelligence.

To align the first automated alignment researcher, [...]

To validate the alignment of our systems, [...]

What do they mean by "aligned"?

How do we ensure AI systems much smarter than humans follow human intent?

OK. Assuming that

  • sharp left turns are not an issue,
  • and scalable oversight is even possible in practice,
  • and OAI somehow solves the problems of
    • AIs hacking humans (to influence their intents),
    • and deceptive alignment,
    • humans going crazy when given great power,
    • etc.
    • and all the problems no-one has noticed yet,

then, there's the question of "aligned to what"? Whose intent? What would success at this agenda look like?

Maybe: A superintelligence that accurately models its human operator, follows the human's intent[1] to complete difficult-but-bounded tasks, and is runnable at human-speed with a manageable amount of compute, sitting on OAI's servers?

Who would get to use that superintelligence? For what purpose would they use it? How long before the {NSA, FSB, CCP, ...} steal that superintelligence off OAI's servers? What would they use it for?

Point being: If an organization is not adequate in all key dimensions of operational adequacy, then even if they somehow miraculously solve the alignment/control problem, they might be increasing S-risks while only somewhat decreasing X-risks.

What is OAI's plan for getting their opsec and common-good-commitment to adequate levels? What's their plan for handling success at alignment/control?


  1. and does not try to hack the human into having more convenient intents ↩︎

Comment by rvnnt on Palantir's AI models · 2023-06-17T10:13:18.917Z · LW · GW

To what extent would you expect the government's or general populace's responses to "Robots with guns" to be helpful (or harmful) for mitigating risks from superintelligence? (Would getting them worried about robots actually help with x-risks?)

Comment by rvnnt on The ones who endure · 2023-06-17T08:51:22.466Z · LW · GW

Right; that would be a silly thing to think.

My intended message might've been better worded as follows

If staring into abysses is difficult/rough, then adequately staring into the darker abysses might require counter-intuitively large amounts of effort/agency. And yet, I think it might be necessary to grok those darker abysses, if we are to avoid falling into them. That makes me worried.

OTOH, you seem exceptionally reflective, so perhaps that worry is completely unfounded in your case. Anyway, I'm grateful for the work you do; I wish there were more people like you in this world. (Also, your attention is extremely valuable, so please feel free to promptly drop/forget this conversation.)

Comment by rvnnt on The ones who endure · 2023-06-16T17:18:15.335Z · LW · GW

When people call things like this post "rough to write/read", and consider them to require a content warning, I wonder if most people are able to think clearly (or at all) about actually terrible scenarios, and worry that they aren't. (I'm especially worried if those people have influence in a domain where there might be a tradeoff between mitigating X-risks vs mitigating S-risks.)

I liked the description of the good future, though. Thanks for the reminder that things can (maybe) go well, too.

Comment by rvnnt on Instrumental Convergence? [Draft] · 2023-06-16T08:51:25.141Z · LW · GW

Thanks for the response.

To the extent that I understand your models here, I suspect they don't meaningfully bind/correspond to reality. (Of course, I don't understand your models at all well, and I don't have the energy to process the whole post, so this doesn't really provide you with much evidence; sorry.)

I wonder how one could test whether or not the models bind to reality? E.g. maybe there are case examples (of agents/people behaving in instrumentally rational ways) one could look at, and see if the models postdict the actual outcomes in those examples?