AI Alignment Metastrategy 2023-12-31T12:06:11.433Z
Critical review of Christiano's disagreements with Yudkowsky 2023-12-27T16:02:50.499Z
Learning-theoretic agenda reading list 2023-11-09T17:25:35.046Z
[Closed] Agent Foundations track in MATS 2023-10-31T08:12:50.482Z
Which technologies are stuck on initial adoption? 2023-04-29T17:37:34.749Z
The Learning-Theoretic Agenda: Status 2023 2023-04-19T05:21:29.177Z
Compositional language for hypotheses about computations 2023-03-11T19:43:40.064Z
Human beats SOTA Go AI by learning an adversarial policy 2023-02-19T09:38:58.684Z
[Closed] Prize and fast track to alignment research at ALTER 2022-09-17T16:58:24.839Z
[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda 2022-04-19T06:44:18.772Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
Infra-Bayesian physicalism: proofs part II 2021-11-30T22:27:04.744Z
Infra-Bayesian physicalism: proofs part I 2021-11-30T22:26:33.149Z
Infra-Bayesian physicalism: a formal theory of naturalized induction 2021-11-30T22:25:56.976Z
My Marriage Vows 2021-07-21T10:48:24.443Z
Needed: AI infohazard policy 2020-09-21T15:26:05.040Z
Introduction To The Infra-Bayesianism Sequence 2020-08-26T20:31:30.114Z
Deminatalist Total Utilitarianism 2020-04-16T15:53:13.953Z
The Reasonable Effectiveness of Mathematics or: AI vs sandwiches 2020-02-14T18:46:39.280Z
Offer of co-authorship 2020-01-10T17:44:00.977Z
Intelligence Rising 2019-11-27T17:08:40.958Z
Vanessa Kosoy's Shortform 2019-10-18T12:26:32.801Z
Biorisks and X-Risks 2019-10-07T23:29:14.898Z
Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z
Dimensional regret without resets 2018-11-16T19:22:32.551Z
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z
Improved regret bound for DRL 2018-03-02T12:49:27.000Z
More precise regret bound for DRL 2018-02-14T11:58:31.000Z
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z
Bugs? 2018-01-21T21:32:10.492Z
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z
Dominant stochastic markets 2017-03-17T12:16:55.000Z
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z


Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-09T11:56:00.342Z · LW · GW

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.

In the following, all infradistributions are crisp.

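Fix a finite action set $A$ and a finite observation set $O$. For any $k \in \mathbb{N}$ and $\gamma \in (0,1)$, let

$$M_k^\gamma : (A \times O)^\omega \to \Delta (A \times O)^k$$

be defined by

$$M_k^\gamma(h \mid d) := (1-\gamma) \sum_{n=0}^{\infty} \gamma^n \, \mathbf{1}[h = d_{n:n+k}]$$

In other words, this kernel samples a time step $n$ out of the geometric distribution with parameter $\gamma$, and then produces the sequence of length $k$ that appears in the destiny starting at $n$.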
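The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "geometric distribution with parameter gamma" is read as P(n) = (1 - gamma) * gamma**n for n = 0, 1, 2, ..., the infinite destiny is modeled by a hypothetical function `destiny(i)` returning the i-th (action, observation) pair, and the name `sample_window` is made up for this example.

```python
import random


def sample_window(destiny, k, gamma, rng):
    """Sample a length-k window of a destiny, as the kernel does.

    destiny: function i -> (action, observation); a stand-in for an
             infinite sequence (an assumption of this sketch).
    k:       length of the window to return.
    gamma:   parameter in [0, 1); assumed P(n) = (1 - gamma) * gamma**n.
    rng:     a random.Random instance, so that runs are reproducible.
    """
    n = 0
    # Each loop iteration continues with probability gamma, so the loop
    # exits at step n with probability (1 - gamma) * gamma**n.
    while rng.random() < gamma:
        n += 1
    # Return the k consecutive pairs of the destiny starting at step n.
    return tuple(destiny(i) for i in range(n, n + k))


if __name__ == "__main__":
    # Toy destiny: the action is always "a", the observation is the index.
    toy_destiny = lambda i: ("a", i)
    print(sample_window(toy_destiny, k=3, gamma=0.9, rng=random.Random(0)))
```

As gamma approaches 1 the sampled start time spreads over an ever longer horizon, which is why the window statistics capture long-run behavior in that limit.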

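For any continuous[1] function $D : \square (A \times O)^k \to \mathbb{R}$, we get a decision rule. Namely, this rule says that, given infra-Bayesian law $\Lambda$ and discount parameter $\gamma$, the optimal policy is

$$\pi^{*\gamma}_{D\Lambda} := \arg\max_{\pi : O^* \to A} D\left(M^\gamma_{k*} \Lambda(\pi)\right)$$

The usual maximin is recovered when we have some reward function $r : (A \times O)^k \to \mathbb{R}$ and corresponding to it is

$$D_r(\Theta) := \min_{\theta \in \Theta} \mathrm{E}_\theta[r]$$

Given a set $\mathcal{H}$ of laws, it is said to be learnable w.r.t. $D$ when there is a family of policies $\{\pi^\gamma\}_{\gamma \in (0,1)}$ such that for any $\Lambda \in \mathcal{H}$

$$\lim_{\gamma \to 1} \left( \max_{\pi} D\left(M^\gamma_{k*} \Lambda(\pi)\right) - D\left(M^\gamma_{k*} \Lambda(\pi^\gamma)\right) \right) = 0$$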

For  we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any  we have the learnable decision rule

This is the "mesomism" I talked about before.

Also, any monotonically increasing $D$ seems to be learnable, i.e. any $D$ s.t. for $\Theta_1 \subseteq \Theta_2$ we have $D(\Theta_1) \le D(\Theta_2)$. For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
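To make the contrast between maximin and a monotonically increasing rule concrete, here is a small sketch on finite credal sets. The assumptions are mine: a credal set is represented by a finite list of distributions (think of them as the extreme points of a convex set), `maximax` is picked as one simple example of a rule that is monotonically increasing under set inclusion, and all function names are hypothetical.

```python
def expectation(theta, r):
    """Expected reward of r under the distribution theta (dict: outcome -> prob)."""
    return sum(p * r[x] for x, p in theta.items())


def maximin(credal_set, r):
    """Worst-case expected reward: nature picks the least favorable distribution."""
    return min(expectation(theta, r) for theta in credal_set)


def maximax(credal_set, r):
    """Best-case expected reward: nature is treated as collaborative."""
    return max(expectation(theta, r) for theta in credal_set)


# A toy reward function over two outcomes.
r = {"good": 1.0, "bad": 0.0}

# small_set is a subset of big_set (more ambiguity in big_set).
small_set = [{"good": 0.5, "bad": 0.5}]
big_set = small_set + [
    {"good": 0.25, "bad": 0.75},
    {"good": 0.75, "bad": 0.25},
]

if __name__ == "__main__":
    # Enlarging the set can only lower maximin and can only raise maximax.
    print(maximin(small_set, r), maximin(big_set, r))
    print(maximax(small_set, r), maximax(big_set, r))
```

Adding ambiguity never hurts under `maximax`, which is the sense in which "nature" is collaborative for monotonically increasing rules, while under `maximin` it never helps.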

On the other hand, decision rules of the form  are not learnable in general, and so are decision rules of the form  for  monotonically increasing.

Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?

A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.

  1. ^

    We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need $D$ to be at least upper semicontinuous.

  2. ^

    There are weaker conditions than "communicating" that are sufficient, e.g. "resettable" (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.

  3. ^

    I mean theorems like VNM, Savage etc.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-08T13:05:12.979Z · LW · GW

First, given nanotechnology, it might be possible to build colonies much faster.

Second, I think the best way to live is probably as uploads inside virtual reality, so terraforming is probably irrelevant.

Third, it's sufficient that the colonists are uploaded or cryopreserved (via some superintelligence-vetted method) and stored someplace safe (whether on Earth or in space) until the colony is entirely ready.

Fourth, if we can stop aging and prevent other dangers (including unaligned AI), then a timeline of decades is fine.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-08T12:44:32.798Z · LW · GW

I don't know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction).

I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds.

"Commit to not make anyone predictably regret supporting the project or not opposing it" is worrying only by omission -- it's a good guideline, but it leaves the door open for "punish anyone who failed to support the project once the project gets the power to do so".

Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically, don't make threats, i.e. don't try to shape others' incentives in ways that they would be better off precommitting not to go along with.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-07T06:41:31.088Z · LW · GW

It's not because they're not on Earth, it's because they have a superintelligence helping them. Which might give them advice and guidance, take care of their physical and mental health, create physical constraints (e.g. that prevent violence), or even give them mind augmentation like mako yass suggested (although I don't think that's likely to be a good idea early on). And I don't expect their environment to be fragile because, again, designed by superintelligence. But I don't know the details of the solution: the AI will decide those, as it will be much smarter than me.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-07T06:29:31.387Z · LW · GW

I don't have to know in advance that we're in a hard-takeoff singleton world, or even that my AI will succeed in achieving those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).

I think you're misinterpreting Yudkowsky's quote. "Using the null string as input" doesn't mean "without evidence", it means "without other people telling me parts of the answer (to this particular question)".

I'm not sure what is "extremely destructive and costly" in what I described? Unless you mean the risk of misalignment, in which case, see above.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-06T19:47:13.794Z · LW · GW

I know, this is what I pointed at in footnote 1. Although "dumbest AI" is not quite right: the sort of AI MIRI envisions is still very superhuman in particular domains, but is somehow kept narrowly confined to acting within those domains (e.g. designing nanobots). The rationale mostly isn't assuming that at that stage it won't be possible to create a full superintelligence, but assuming that aligning such a restricted AI would be easier. I have different views on alignment, leading me to believe that aligning a full-fledged superintelligence (sovereign) is actually easier (via PSI or something in that vein). On this view, we still need to contend with the question, what is the thing we will (honestly!) tell other people that our AI is actually going to do. Hence, the above.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-06T11:06:00.691Z · LW · GW

People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their "pivotal act" strategy. The latter can be described as "build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order" (e.g. the notorious "melt all GPUs" example). The critics say (justifiably IMO), this strategy looks pretty hostile to many actors and can trigger preemptive actions against the project attempting it and generally foster mistrust.

Is there a good alternative? The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions somewhat "apples and oranges". Suppose that we do live in a hard-takeoff singleton world, what then? One answer is "create a trustworthy, competent, multinational megaproject". Alright, but suppose you can't create a multinational megaproject, but you can build aligned AI unilaterally. What is a relatively cooperative thing you can do which would still be effective?

Here is my proposed rough sketch of such a plan[1]:

  • Commit to not make anyone predictably regret supporting the project or not opposing it. This rule is the most important and the one I'm the most confident of by far. In an ideal world, it should be more-or-less sufficient in itself. But in the real world, it might still be useful to provide more tangible details, which the next items try to do.
  • Within the bounds of Earth, commit to obey international law, and local law at least inasmuch as the latter is consistent with international law, with only two possible exceptions (see below). Notably, this allows for actions such as (i) distributing technology that cures diseases, reverses aging, produces cheap food etc. (ii) lobbying for societal improvements (but see superpersuasion clause below).
  • Exception 1: You can violate any law if it's absolutely necessary to prevent a catastrophe on the scale comparable with a nuclear war or worse, but only to the extent it's necessary for that purpose. (e.g. if a lab is about to build unaligned AI that would kill millions of people and it's not possible to persuade them to stop or convince the authorities to act in a timely manner, you can sabotage it.)[2]
  • Build space colonies. These space colonies will host utopic societies and most people on Earth are invited to immigrate there.
  • Exception 2: A person held in captivity in a manner legal according to local law, who faces the death penalty or is treated in a manner violating accepted international rules about treatment of prisoners, might be given the option to leave to the colonies. If they exercise this option, their original jurisdiction is permitted to exile them from Earth permanently and/or bar them from any interaction with Earth that can plausibly enable activities illegal according to that jurisdiction[3].
  • Commit to adequately compensate any economy hurt by emigration to the colonies or other disruption by you. For example, if space emigration causes the loss of valuable labor, you can send robots to supplant it.
  • Commit to not directly intervene in international conflicts or upset the balance of powers by supplying military tech to any side, except in cases when it is absolutely necessary to prevent massive violations of international law and human rights.
  • Commit to only use superhuman persuasion when arguing towards a valid conclusion via valid arguments, in a manner that doesn't go against the interests of the person being persuaded.

  1. ^

    Importantly, this makes stronger assumptions about the kind of AI you can align than MIRI-style pivotal acts. Essentially, it assumes that you can directly or indirectly ask the AI to find good plans consistent with the commitments below, rather than directing it to do something much more specific. Otherwise, it is hard to use Exception 1 (see below) gracefully.

  2. ^

    A more conservative alternative is to limit Exception 1 to catastrophes that would spill over to the space colonies (see next item).

  3. ^

    It might be sensible to consider a more conservative version which doesn't have Exception 2, even though the implications are unpleasant.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-04-05T15:25:32.512Z · LW · GW

Ratfic idea / conspiracy theory: Yudkowsky traveled back in time to yell at John Nash about how Nash equilibria are stupid[1], and that's why Nash went insane.

h/t Marcus (my spouse)

  1. ^

    They are.

Comment by Vanessa Kosoy (vanessa-kosoy) on tailcalled's Shortform · 2024-03-30T06:46:41.919Z · LW · GW

Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.

Comment by Vanessa Kosoy (vanessa-kosoy) on tailcalled's Shortform · 2024-03-29T18:15:51.506Z · LW · GW

I missed that paragraph on first reading, mea culpa. I think that your story about how it's a win for interpretability and alignment is very unconvincing, but I don't feel like hashing it out atm. Revised to weak downvote.

Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?

Comment by Vanessa Kosoy (vanessa-kosoy) on tailcalled's Shortform · 2024-03-29T17:47:46.764Z · LW · GW

Because it's capability research. It shortens the TAI timeline with little compensating benefit.

Comment by Vanessa Kosoy (vanessa-kosoy) on tailcalled's Shortform · 2024-03-29T17:30:32.757Z · LW · GW

Downvoted because conditional on this being true, it is harmful to publish. Don't take it personally, but this is content I don't want to see on LW.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-03-25T01:27:56.945Z · LW · GW

Formalizing the richness of mathematics

Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.

Here is my proposal for how to formulate a theorem that would make this idea rigorous.

(Wrong) First Attempt

Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis  represents an infradistribution over : the "space of counterpossible computational universes". We can say that  is a "true hypothesis" when there is some  in the credal set  (a distribution over ) s.t. the ground truth  "looks" as if it's sampled from . The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.

We can now try to say that  is "rich" if for any true hypothesis , there is a refinement which is also a true hypothesis and "knows" at least one bit of information that  doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes . But, it's also completely boring: the required  can be constructed by "hardcoding" an additional fact into . This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.

(Wrong) Second Attempt

What if instead we require that  knows infinitely many bits of information that  doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we were talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the $n$-th bit in the sequence depends only on the largest $k$ s.t. $2^k$ divides $n$, but the dependence on $k$ is already uncomputable (or at least inexpressible by a finite automaton).
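A toy version of the sequence-prediction example can be written down directly. This sketch assumes the intended regularity is the 2-adic one (the bit at position n depends only on the largest k such that 2**k divides n). The function `f` below is a hypothetical and deliberately computable stand-in for the dependence on k, which in the example is supposed to be uncomputable, so the code only illustrates the shape of the sequence, not its hardness.

```python
def valuation(n):
    """Return the largest k such that 2**k divides n (requires n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k


def bit(n, f):
    """The n-th bit of the sequence: it depends on n only through valuation(n)."""
    return f(valuation(n))


def f(k):
    """Hypothetical stand-in for the (supposedly uncomputable) dependence on k."""
    return k % 2


if __name__ == "__main__":
    # All positions sharing the same valuation carry the same bit.
    print([bit(n, f) for n in range(1, 17)])
```

For each fixed k, the statement "every position with valuation k carries the bit f(k)" is a single regularity of the kind a finite automaton reading n in binary can check, since it only needs to count trailing zeros up to k. It is the conjunction over all k that encodes the whole function f.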

However, for our original application, this is entirely insufficient. This is because the formal language we use to define  (e.g. combinator calculus) has some "easy" equivalence relations. For example, consider the family of programs of the form "if 2+2=4 then output 0, otherwise...". All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many "new" bits of information. Once again, we are left with brute-force memorization.

(Less Wrong) Third Attempt

Here's the improved condition: For any true hypothesis , there is a true refinement  s.t. conditioning  on any finite set of observations cannot produce a refinement of .

There is a technicality here, because we're talking about infradistributions, so what is "conditioning" exactly? For credal sets, I think it is sufficient to allow two types of "conditioning":

  • For any given observation  and , we can form .
  • For any given observation  s.t. , we can form .

This rules out the counterexample from before: the easy equivalence relation can be represented inside , and then the entire sequence of "novel" bits can be generated by a conditioning.

Alright, so does  actually satisfy this condition? I think it's very probable, but I haven't proved it yet. 

Comment by Vanessa Kosoy (vanessa-kosoy) on New report: Safety Cases for AI · 2024-03-20T17:09:35.748Z · LW · GW

Linkpost to Twitter thread is a bad format for LessWrong. Not everyone has Twitter.

Comment by Vanessa Kosoy (vanessa-kosoy) on Tamsin Leake's Shortform · 2024-03-13T16:48:49.871Z · LW · GW

I agree that in the long-term it probably matters little. However, I find the issue interesting, because the failure of reasoning that leads people to ignore the possibility of AI personhood seems similar to the failure of reasoning that leads people to ignore existential risks from AI. In both cases it "sounds like scifi" or "it's just software". It is possible that raising awareness for the personhood issue is politically beneficial for addressing X-risk as well. (And, it would sure be nice to avoid making the world worse in the interim.)

Comment by vanessa-kosoy on [deleted post] 2024-03-04T13:06:01.266Z


What is ? Also, we should allow adding some valid reward function of .

Comment by vanessa-kosoy on [deleted post] 2024-03-04T12:21:57.921Z

 is a polytope with , corresponding to allowed action distributions at that state. 

I think it's mathematically cleaner to get rid of A and have those be abstract polytopes.

Comment by Vanessa Kosoy (vanessa-kosoy) on Open Thread – Winter 2023/2024 · 2024-03-02T14:04:17.271Z · LW · GW

Has anyone around here tried Relationship Hero and has opinions about it?

Comment by Vanessa Kosoy (vanessa-kosoy) on evhub's Shortform · 2024-02-04T15:43:35.310Z · LW · GW

First, I said I'm not a utilitarian, I didn't say that I don't value other people. There's a big difference!

Second, I'm not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue "you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning". But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1].

Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn't compel me to precommit to a symmetric agreement with future people in advance.

  1. ^

    There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.

Comment by Vanessa Kosoy (vanessa-kosoy) on A sketch of acausal trade in practice · 2024-02-04T14:45:35.711Z · LW · GW

Your "psychohistory" is quite similar to my "metacosmology".

Comment by Vanessa Kosoy (vanessa-kosoy) on evhub's Shortform · 2024-02-03T19:10:25.767Z · LW · GW

Disagree. I'm in favor of (2) because I think that what you call a "tyranny of the present" makes perfect sense. Why would the people of the present not maximize their utility functions, given that it's the rational thing for them to do by definition of "utility function"? "Because utilitarianism" is a nonsensical answer IMO. I'm not a utilitarian. If you're a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.

As to problem (2.b), I don't think it's a serious issue in practice because time until singularity is too short for it to matter much. If it were, we could still agree on a cooperative strategy that avoids a wasteful race between present people.

Comment by Vanessa Kosoy (vanessa-kosoy) on Chapter 1 of How to Win Friends and Influence People · 2024-01-29T12:23:51.683Z · LW · GW

John Wentworth, founder of the stores that bear his name, once confessed: "I learned thirty years ago that it is foolish to scold. I have enough trouble overcoming my own limitations without fretting over the fact that God has not seen fit to distribute evenly the gift of intelligence." 

@johnswentworth is an ancient vampire, confirmed.

Comment by Vanessa Kosoy (vanessa-kosoy) on Open Thread – Winter 2023/2024 · 2024-01-28T11:06:54.208Z · LW · GW

I'm going to be in Berkeley February 8 - 25. If anyone wants to meet, hit me up!

Comment by Vanessa Kosoy (vanessa-kosoy) on AI #48: Exponentials in Geometry · 2024-01-18T15:34:57.836Z · LW · GW

Where does the Base Rate Times report on AI? I don't see it on their front page.

Comment by Vanessa Kosoy (vanessa-kosoy) on The impossible problem of due process · 2024-01-16T17:20:25.613Z · LW · GW

I honestly don't know. The discussions of this problem I encountered are all in the American (or at least Western) context[1], and I'm not sure whether it's because Americans are better at noticing this problem and fixing it, or because American men generate more unwanted advances, or because American women are more sensitive to such advances, or because this is an overreaction to a problem that's much more mild than it's portrayed.

Also, high-status men, really? Men avoiding meetups because they get too many propositions from women is a thing?

  1. ^

    To be clear, we certainly have rules against sexual harassment here in Israel, but that's very different from "don't ask a woman out the first time you meet her".

Comment by Vanessa Kosoy (vanessa-kosoy) on The impossible problem of due process · 2024-01-16T12:35:22.743Z · LW · GW

"It's true that we don't want women to be driven off by a bunch of awkward men asking them out, but if we make everyone read a document that says 'Don't ask a woman out the first time you meet her', then we'll immediately give the impression that we have a problem with men awkwardly asking women out too much — which will put women off anyway."


American social norms around romance continue to be weird to me. For the record, y'all can feel free to ask me out the first time you meet me, even if you do it awkwardly ;)

Comment by Vanessa Kosoy (vanessa-kosoy) on Saving the world sucks · 2024-01-13T17:28:51.259Z · LW · GW

"Virtue is its own reward" is a nice thing to believe in when you feel respected, protected and loved. When you feel tired, lonely and afraid, and nobody cares at all, it's very hard to understand why you should be making big sacrifices for the sake of virtue. But, hey, people are different. Maybe, for you virtue is truly, unconditionally, its own reward, and a sufficient one at that. And maybe EA is a community professional circle only for people who are that stoic and selfless. But, if so, please put the warning in big letters on the lid.

Comment by Vanessa Kosoy (vanessa-kosoy) on Saving the world sucks · 2024-01-13T13:36:56.913Z · LW · GW

There is tension between the stance that "EA is just a professional circle" and the (common) thesis that EA is a moral ideal. The latter carries the connotation of "things you will be rewarded for doing" (by others sharing the ideal). Likely some will claim that, in their philosophy, there is no such connotation: but it is on them to emphasize this, since this runs contrary to the intuitive perception of morality by most people. People who take up the ideology expecting the implied community aspect might understandably feel disappointed or even betrayed when they find it lacking, which might have happened to the OP.

As I said, cooperation is rational. There are, roughly speaking, two mechanisms to achieve cooperation: the "acausal" way and the "causal" way. The acausal way means doing something out of abstract reasoning that, if many others do the same, it will be in everyone's benefit, and moreover many others follow the same reasoning. This might work even without a community, in principle.

However, the more robust mechanism is causal: tit-for-tat. This requires that other people actually reward you for doing the thing. One way to reward is by money, which EA does to some extent: however, it also encourages members to take pay cuts and/or make donations. Another way to reward is by the things money cannot buy: respect, friendship, emotional support and generally conveying the sense that you're a cherished member of the community. On this front, more could be done IMO.

Even if we accept that EA is nothing more than a professional circle, it is still lacking in the respects I pointed out. In many professional circles, you work in an office with peers, leading naturally to a network of personal connections. On the other hand, AFAICT many EAs work independently/remotely (I am certainly one of those), which denies the same benefits.

Comment by Vanessa Kosoy (vanessa-kosoy) on Saving the world sucks · 2024-01-11T14:47:00.865Z · LW · GW

I agree with the OP that: Utilitarianism is not a good description of most people's values, possibly not even a good description of anyone's values. Effective altruism encourages people to pretend that they are intrinsically utilitarian, which is not healthy or truth-seeking. Intrinsic values are (to 1st approximation) immutable. It's healthy to understand your own values, it's bad to shame people for having "wrong" values.

I agree with critics of the OP that: Cooperation is rational, we should be trying to help each other over and above the (already significant) extent to which we intrinsically care about each other, because this is in our mutual interest. A healthy community rewards prosocial behavior and punishes sufficiently antisocial behavior (there should also be ample room for "neutral" though).

A point insufficiently appreciated by either: The rationalist/EA community doesn't reward prosocial behavior enough. In particular, we need much more in the way of emotional support and mental health resources for community members. I speak from personal experience here: I am very grateful to this community for support in the career/professional sense. However, on the personal/emotional level, I never felt that the community cares about what I'm going through.

Comment by Vanessa Kosoy (vanessa-kosoy) on You can just spontaneously call people you haven't met in years · 2024-01-11T08:36:54.822Z · LW · GW

For the record, I contacted 3/4 but it led to nothing, alas. (I also thought of another person to contact but she moved to a different country in the intervening time.)

Comment by Vanessa Kosoy (vanessa-kosoy) on Where I agree and disagree with Eliezer · 2024-01-11T07:07:47.910Z · LW · GW

I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

  1. ^

    I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).

Comment by Vanessa Kosoy (vanessa-kosoy) on The Learning-Theoretic Agenda: Status 2023 · 2024-01-10T10:48:20.248Z · LW · GW

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes computational complexity into account instead of . For example, you can look at the measure  I defined here. More realistically, this measure should be based on the frugal universal prior.

Comment by Vanessa Kosoy (vanessa-kosoy) on Why aren't Yudkowsky & Bostrom getting more attention now? · 2024-01-09T10:09:59.432Z · LW · GW

Part of the reason is that Yudkowsky radicalized his position to stay out of the Overton window. Fifteen years ago, his position was "we need to do research into AI safety, because AI will pose a threat to humanity some time this century". Now, the latter is becoming mainstream-adjacent, but he shifted to "it's too late to do research, we need to stop all capability work or else we all die in 10-15 years". And, "even if we stop all capability work as much as an international treaty can conceivably accomplish, we must augment human intelligence in adults in order to be able to solve the problem in time."

Comment by Vanessa Kosoy (vanessa-kosoy) on MIRI 2024 Mission and Strategy Update · 2024-01-05T15:12:52.414Z · LW · GW

It is tricky, but there might be some ways for data to defend itself.

Comment by Vanessa Kosoy (vanessa-kosoy) on 2023 in AI predictions · 2024-01-02T12:11:27.821Z · LW · GW


I'll toss in some predictions of my own. I predict that all of the following things will not happen without a breakthrough substantially more significant than the invention of transformers:

  • AI inventing new things in science and technology, not via narrow training/design for a specific subtask (like e.g. AlphaFold) but roughly the way humans do it. (Confidence: 80%)
  • AI being routinely used by corporate executives to make strategic decisions, not as a glorified search engine but as a full-fledged advisor. (Confidence: 75%)
  • As above, but politicians instead of corporate executives. (Confidence: 72%)
  • AI learning how to drive using a human driving teacher, within a number of lessons similar to what humans take, without causing accidents (that the teacher fails to prevent) and without any additional driving training data or domain-specific design. (Confidence: 67%)
  • AI winning gold in IMO, using a math training corpus comparable in size to the number of math problems human contestants see in their lifetime. (Confidence: 65%)
  • AI playing superhuman Diplomacy, using a training corpus (including self-play) comparable in size to the number of games played by human players, while facing reputation incentives similar to those of human players. (Confidence: 60%)
  • As above, but Go instead of Diplomacy. (Confidence: 55%)

Comment by Vanessa Kosoy (vanessa-kosoy) on 2023 Unofficial LessWrong Census/Survey · 2023-12-28T10:16:01.857Z · LW · GW

$p$ is the probability of the event that actually occurred. You can't submit $p$ without knowing what is true in advance. For example, suppose you need to predict who wins the next US presidential election. You assign probability 0.6 to Biden, 0.3 to Trump and 0.1 to Eliezer Yudkowsky. Then, if Biden wins, $p = 0.6$. But, if Yudkowsky wins then $p = 0.1$.
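To make the election example concrete, here is a minimal Python sketch. The function name `realized_probability` and the dictionary layout are illustrative assumptions, not part of the survey; the point is only that the quantity is read off from the forecast after the outcome is known.

```python
# Forecast from the example: probabilities over mutually exclusive outcomes.
forecast = {"Biden": 0.6, "Trump": 0.3, "Yudkowsky": 0.1}

def realized_probability(forecast, winner):
    """Return the probability the forecast assigned to the outcome that occurred.

    Hypothetical helper for illustration: the value depends on `winner`,
    which is only known after the fact.
    """
    return forecast[winner]

# Sanity check: the forecast should sum to 1 (up to floating-point error).
assert abs(sum(forecast.values()) - 1.0) < 1e-9

print(realized_probability(forecast, "Biden"))      # probability assigned to Biden
print(realized_probability(forecast, "Yudkowsky"))  # probability assigned to Yudkowsky
```

The same forecast yields different values depending on who wins, which is why the value cannot be submitted in advance.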

Comment by Vanessa Kosoy (vanessa-kosoy) on Critical review of Christiano's disagreements with Yudkowsky · 2023-12-27T18:53:38.664Z · LW · GW

Thank you for the clarification.

How do you expect augmented humanity will solve the problem? Will it be something other than "guessing it with some safe weak lesser tries / clever theory"?

Comment by Vanessa Kosoy (vanessa-kosoy) on Is being sexy for your homies? · 2023-12-14T08:02:47.105Z · LW · GW

Not especially important to your main points, but for the sake of pedantry:

While it's true that transwomen are biologically distinct from ciswomen, medically-transitioning transwomen are also biologically distinct from cismen. In particular, most of them (and all of the post-op) can't make babies with anyone. So, from a purely reproductive perspective, those transwomen are in a group unto itself. From a sexual-attraction perspective, this group is somewhat more similar to ciswomen than to cismen, in the sense that a much bigger fraction of straight men would be attracted to a transwoman (at an advanced stage of medical transition) than the fraction of straight women attracted to that transwoman (even if the fraction of straight men attracted to a same-percentile-of-attractiveness ciswoman would be larger still).

Comment by Vanessa Kosoy (vanessa-kosoy) on Google Gemini Announced · 2023-12-07T12:24:51.182Z · LW · GW

in each of the 50 different subject areas that we tested it on, it's as good as the best expert humans in those areas


That sounds like an incredibly strong claim, but I suspect that the phrasing is very misleading. What kind of tests is Hassabis talking about here? Maybe those are tests that rely on remembering known facts much more than on making novel inferences? Surely Gemini is not (say) as good as the best mathematicians at solving open problems in mathematics?

Comment by Vanessa Kosoy (vanessa-kosoy) on 2023 Unofficial LessWrong Census/Survey · 2023-12-05T14:40:46.140Z · LW · GW

Imagine that, for every question, you will have to pay  dollars if the event you assigned a probability  occurs. Here,  is some sufficiently small constant (this assumes your strategy doesn't fluctuate as  approaches 0). Answer in the optimal way for that game, according to whatever decision theory you follow. (But choosing which questions to answer is not part of the game.)

Comment by Vanessa Kosoy (vanessa-kosoy) on The LessWrong 2022 Review · 2023-12-05T08:46:48.755Z · LW · GW

The LessWrong moderation team will take the voting results as a strong indicator of which posts to include in the Best of 2022 sequence.

Will there also be a Best of 2021 sequence at some point?

Comment by Vanessa Kosoy (vanessa-kosoy) on Neither EA nor e/acc is what we need to build the future · 2023-11-28T16:50:25.325Z · LW · GW

The analogy between SBF and Helen Toner is completely misguided. SBF did deeply immoral things, with catastrophic results for everyone, whatever his motivations may have been. With Toner, we don't know what really happened, but if she indeed was willing to destroy OpenAI for safety reasons, then AFAICT she was 100% justified. The only problem is that she didn't succeed. (Where "success" would mean actually removing OpenAI from the gameboard, rather than e.g. rebranding it as part of Microsoft.)

Comment by Vanessa Kosoy (vanessa-kosoy) on Shallow review of live agendas in alignment & safety · 2023-11-27T12:29:04.739Z · LW · GW

Nice work.

Regarding the Learning-Theoretic Agenda:

  • We don't have 3-6 full time employees. We have ~2 full time employees and another major contributor.
  • In "funded by", Effective Ventures and Lightspeed Grants should appear as well.

Comment by Vanessa Kosoy (vanessa-kosoy) on Open Thread – Autumn 2023 · 2023-11-24T11:40:35.501Z · LW · GW

If I downvote my own post, or a collaborative post with me as one of the authors, does it affect either my karma or my coauthors' karma? I'm guessing "no" but want to make sure.

Comment by Vanessa Kosoy (vanessa-kosoy) on Public Call for Interest in Mathematical Alignment · 2023-11-24T08:15:00.206Z · LW · GW

You are more or less right. By "mathematical approaches", we mean approaches focused on building mathematical models relevant to alignment/agency/learning and finding non-trivial theorems (or at least conjectures) about these models. I'm not sure what the word "but" is doing in "but you mention RL": there is a rich literature of mathematical inquiry into RL. For a few examples, see everything under the bullet "reinforcement learning theory" in the LTA reading list.

Comment by Vanessa Kosoy (vanessa-kosoy) on You can just spontaneously call people you haven't met in years · 2023-11-13T06:13:28.009Z · LW · GW


Made a list of 4 people I can try to contact. Wish me luck!

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2023-11-11T18:09:20.050Z · LW · GW

Here is a way to construct many learnable undogmatic ontologies, including some with finite state spaces.

A deterministic partial environment (DPE) over action set  and observation set  is a pair  where  and  s.t.

  • If  is a prefix of some , then .
  • If  and  is a prefix of , then .

DPEs are equipped with a natural partial order. Namely,  when   and .

Let  be a strong upwards antichain in the DPE poset which doesn't contain the bottom DPE (i.e. the DPE with ). Then, it naturally induces an infra-POMDP. Specifically: 

  • The state space is .
  • The initial infradistribution is .
  • The observation mapping is , where  is the empty history.
  • The transition infrakernel is , where

If  is non-empty for all  and , this is a learnable undogmatic ontology.

Any  yields an example . Namely,  iff  and for any  it holds that:

  1. If  then for any .

I think that for any continuous some non-trivial hidden reward functions over such an ontology, the class of communicating RUMDPs is learnable. If the hidden reward function doesn't depend on the action argument, it's equivalent to some instrumental reward function.

Comment by Vanessa Kosoy (vanessa-kosoy) on Petrov Day Retrospective, 2023 (re: the most important virtue of Petrov Day & unilaterally promoting it) · 2023-09-28T04:53:09.465Z · LW · GW

First, I clicked the link in the second poll[1]. My thought process looked as follows:

  • I quickly skimmed the content of the message
  • My split-second-judgement registered that there is a RACE
  • Moreover, the race is on very small time scales: every second of indecision might cost me victory!
  • Moreover, split-second-judgement estimates that winning the race is good-in-expectation (where "expectation" should be thought of as including the "logical uncertainty" resulting from having to rely on split-second-judgement).
  • Therefore, click NOW before it's too late!

Worse, even after clicking and reading the text again, I misunderstood its content. Somehow, I thought that this year's celebration will be determined by the plurality, whereas next year's will be determined by the fastest minority. This system is strange, but is not obviously defect-y, i.e. not obviously inferior to e.g. using plurality twice in a row, from behind the veil-of-ignorance.

Only after reading the OP and starting to compose this comment in my mind did I understand the actual meaning of the text in the second poll: that only the next year's celebration is decided upon, and only according to a minority (if anyone in a minority clicks). Now, this is more or less clearly defect-y and in hindsight I don't endorse clicking it.

What is my take-away lesson? The process I used to make the decision seems correct to me: if you have to make a split-second decision, then you need to use your split-second judgement because there is nothing else to go by. There might be some case for a bias towards inaction, but it's not an overwhelming case. Personally, I know that I'm usually too slow to respond in emergency scenarios, so I don't want to train myself to prefer inaction.

The right way to optimize this is to train your split-second judgement to do well in the sort of situations in which split-second judgement is likely to be required. The sort of reasoning required of us here is not likely to be tied to a split-second decision anywhere outside of Petrov Day games[2], so I think my split-second judgement did as well as expected and there's nothing to correct.

[EDIT: Actually, there is a correction to be made here, and it refers to my wrong reading of the message after clicking the link. The lesson is: if I make a split-second decision, I need to carefully reexamine it after the fact, in order to understand its true consequences, and beware of anchoring on my split-second reasoning: this anchoring is probably motivated by wanting to justify myself later.]

Second, I think that going with the majority in this case is not honoring your word. You explicitly said "the first to do so out of any minority group". If you break your word and go with the majority, I won't completely lose my trust in you: but that's mostly because this is a game. In a situation with more serious stakes, I expect you to take the precise meaning of your promises way more seriously, and I would be extremely disappointed if you didn't.

Third, I think this was a cool way to celebrate Petrov's Day (modulo the issue with breaking your word, which is really bad and must not be repeated). Kudos!

  1. ^

    My choice in the first poll was "accurately reporting your epistemic state".

  2. ^

    The actual Petrov had more time to make his decision, and also if I got Petrov's job I would train my fast-judgement on Petrov-like situations in advance.

Comment by Vanessa Kosoy (vanessa-kosoy) on Would You Work Harder In The Least Convenient Possible World? · 2023-09-24T09:38:16.932Z · LW · GW

Maybe the Effective Altruist movement should accept people like you because they’re a big tent and they’re friendly and welcoming, but the rationalist community should be elitist and only accept people who say tsuyoku naritai...


This is a disturbing claim, although I realize that the author's opinions don't coincide with those of the "Alice" character. Personally, I'm not a utilitarian, nor do I want to be a utilitarian or think that I "should" be a utilitarian[1]. I do consider myself a person who is empathetic, honest and cooperative[2]. I hope this doesn't disqualify me from the rationalist community?

In general, I'm in favor of promoting societal norms which incentivize making the world better: such norms are obviously in everyone's interest. In this sense, I'm very sympathetic to effective altruism. However, these norms should still regard altruism as supererogatory: i.e., it should be rewarded and encouraged, but its lack should not be severely punished. The alternative is much too vulnerable to abuse.

  1. ^

    IMO utilitarianism is not even logically coherent, due to paradoxes with infinite ethics and Pascal's mugging.

  2. ^

    In the sense of, trying to act according to superrationality.

Comment by Vanessa Kosoy (vanessa-kosoy) on UDT shows that decision theory is more puzzling than ever · 2023-09-16T08:03:23.640Z · LW · GW

...the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)


It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where $\zeta_1$ and $\zeta_2$ are equivalent when there exists some $C > 0$ s.t. $\zeta_1 \leq C \zeta_2$ and $\zeta_2 \leq C \zeta_1$), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent.

One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior $\zeta_1$ are likely to evolve/flourish in universes sampled from prior $\zeta_1$, while agents with prior $\zeta_2$ are likely to evolve/flourish in universes sampled from prior $\zeta_2$. No prior is superior across all universes: there's no free lunch.

For the purpose of AI alignment, the solution is some combination of (i) learn the user's prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda-term (i is insufficient in itself because you need some ur-prior to learn the user's prior). The claim is, different reasonable choices in ii will lead to similar results.

Given all that, I'm not sure what's still unsatisfying. Is there any reason to believe something is missing in this picture?