Posts

Video lectures on the learning-theoretic agenda 2024-10-27T12:01:32.777Z
Linear infra-Bayesian Bandits 2024-05-10T06:41:09.206Z
Which skincare products are evidence-based? 2024-05-02T15:22:12.597Z
AI Alignment Metastrategy 2023-12-31T12:06:11.433Z
Critical review of Christiano's disagreements with Yudkowsky 2023-12-27T16:02:50.499Z
Learning-theoretic agenda reading list 2023-11-09T17:25:35.046Z
[Closed] Agent Foundations track in MATS 2023-10-31T08:12:50.482Z
Which technologies are stuck on initial adoption? 2023-04-29T17:37:34.749Z
The Learning-Theoretic Agenda: Status 2023 2023-04-19T05:21:29.177Z
Compositional language for hypotheses about computations 2023-03-11T19:43:40.064Z
Human beats SOTA Go AI by learning an adversarial policy 2023-02-19T09:38:58.684Z
[Closed] Prize and fast track to alignment research at ALTER 2022-09-17T16:58:24.839Z
[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda 2022-04-19T06:44:18.772Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
Infra-Bayesian physicalism: proofs part II 2021-11-30T22:27:04.744Z
Infra-Bayesian physicalism: proofs part I 2021-11-30T22:26:33.149Z
Infra-Bayesian physicalism: a formal theory of naturalized induction 2021-11-30T22:25:56.976Z
My Marriage Vows 2021-07-21T10:48:24.443Z
Needed: AI infohazard policy 2020-09-21T15:26:05.040Z
Introduction To The Infra-Bayesianism Sequence 2020-08-26T20:31:30.114Z
Deminatalist Total Utilitarianism 2020-04-16T15:53:13.953Z
The Reasonable Effectiveness of Mathematics or: AI vs sandwiches 2020-02-14T18:46:39.280Z
Offer of co-authorship 2020-01-10T17:44:00.977Z
Intelligence Rising 2019-11-27T17:08:40.958Z
Vanessa Kosoy's Shortform 2019-10-18T12:26:32.801Z
Biorisks and X-Risks 2019-10-07T23:29:14.898Z
Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z
Dimensional regret without resets 2018-11-16T19:22:32.551Z
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z
Improved regret bound for DRL 2018-03-02T12:49:27.000Z
More precise regret bound for DRL 2018-02-14T11:58:31.000Z
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z
Bugs? 2018-01-21T21:32:10.492Z
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z

Comments

Comment by Vanessa Kosoy (vanessa-kosoy) on What is the most impressive game LLMs can play well? · 2025-01-17T10:09:36.530Z · LW · GW

Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?

Comment by Vanessa Kosoy (vanessa-kosoy) on What is the most impressive game LLMs can play well? · 2025-01-16T15:05:52.386Z · LW · GW

Relevant link

Comment by Vanessa Kosoy (vanessa-kosoy) on What is the most impressive game LLMs can play well? · 2025-01-16T14:49:41.062Z · LW · GW

Apparently someone let LLMs play against the random policy and for most of them, most games end in a draw. Seems like o1-preview is the best of those tested, managing to win 47% of the time.

Comment by Vanessa Kosoy (vanessa-kosoy) on What is the most impressive game LLMs can play well? · 2025-01-15T10:59:20.435Z · LW · GW

Relevant: Manifold market about LLM chess

Comment by Vanessa Kosoy (vanessa-kosoy) on Are there cognitive realms? · 2025-01-12T14:32:02.812Z · LW · GW

This post states and speculates on an important question: are there different mind types that are in some sense "fully general" (the author calls it "unbounded") but are nevertheless qualitatively different. The author calls these hypothetical mind taxa "cognitive realms".

This is how I think about this question, from within the LTA:

To operationalize "minds" we should be thinking of learning algorithms. Learning algorithms can be classified according to their "syntax" and "semantics" (my own terminology). Here, semantics refers to questions such as (i) what type of object is the algorithm learning (ii) what is the feedback/data available to the algorithm and (iii) what is the success criterion/parameter of the algorithm. On the other hand, syntax refers to the prior and/or hypothesis class of the algorithm (where the hypothesis class might be parameterized in a particular way, with particular requirements on how the learning rate depends on the parameters).

Among different semantics, we are especially interested in those that are in some sense agentic. Examples include reinforcement learning, infra-Bayesian reinforcement learning, metacognitive agents and infra-Bayesian physicalist agents.

Do different agentic semantics correspond to different cognitive realms? Maybe, but maybe not: it is plausible that most of them are reflectively unstable. For example Christiano's malign prior might be a mechanism for how all agents converge to infra-Bayesian physicalism.

Agents with different syntaxes is another candidate for cognitive realms. Here, the question is whether there is an (efficiently learnable) syntax that is in some sense "universal": all other (efficiently learnable) syntaxes can be efficiently translated into it. This is a wide open question. (See also "frugal universal prior".)

In the context of AI alignment, in order to achieve superintelligence it is arguably sufficient to use a syntax equivalent to whatever is used by human brain algorithms. Moreover, it's plausible that any algorithm we can come up with can only have an equivalent or weaker syntax (the process of us discovering the new syntax suggests an embedding of the new syntax into our own). Therefore, even if there are many cognitive realms, for our purposes we mostly only care about one of them. However, the multiplicity of realms has implications for how simple/natural/canonical we should expect the choice of syntax for our theory of agents to be (the fewer realms, the more canonical).

Comment by Vanessa Kosoy (vanessa-kosoy) on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T11:41:36.918Z · LW · GW

I think that there are two key questions we should be asking:

  1. Where is the value of an additional researcher higher on the margin?
  2. What should the field look like in order to make us feel good about the future?

I agree that "prosaic" AI safety research is valuable. However, at this point it's far less neglected than foundational/theoretical research and the marginal benefits there are much smaller. Moreover, without significant progress on the foundational front, our prospects are going to be poor, ~no matter how much mech-interp and talking to Claude about feelings we will do.

John has a valid concern that, as the field becomes dominated by the prosaic paradigm, it might become increasingly difficult to get talent and resources to the foundational side, or maintain memetically healthy coherent discourse. As to the tone, I have mixed feelings. Antagonizing people is bad, but there's also value in speaking harsh truths the way you see them. (That said, there is room in John's post for softening the tone without losing much substance.)

Comment by Vanessa Kosoy (vanessa-kosoy) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T10:28:40.048Z · LW · GW

Learning theory, complexity theory and control theory. See the "AI theory" section of the LTA reading list.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T13:08:21.906Z · LW · GW

Good post, although I have some misgivings about how unpleasant it must be to read for some people.

One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI's extreme pessimism deflates motivation and naturally produces the thought "if they are right then we're doomed anyway, so might as well assume they are wrong".

Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I'm also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization[2].

MIRI made important progress in agent foundations, but also missed an opportunity to do much more. And, while the AI game board is grim, their extreme pessimism is unwarranted overconfidence. Our understanding of AI and agency is poor: this is a strong reason to be pessimistic, but it's also a reason to maintain some uncertainty about everything (including e.g. timelines).

Now, about what to do next. I agree that we need to have our own non-streetlighting community. In my book "non-streetlighting" means mathematical theory plus empirical research that is theory-oriented: designed to test hypotheses made by theoreticians and produce data that best informs theoretical research (these are ~necessary but insufficient conditions for non-streetlighting). This community can and should engage with the rest of AI safety, but has to be sufficiently undiluted to have healthy memetics and cross-fertilization.

What does a community look like? It looks like our own organizations, conferences, discussion forums, training and recruitment pipelines, academic labs, maybe journals.

From my own experience, I agree that potential contributors should mostly have skills and knowledge on the level of PhD+. Highlighting physics might be a valid point: I have a strong background in physics myself. Physics teaches you a lot about connecting math to real-world problems, and is also in itself a test-ground for formal epistemology. However, I don't think a background in physics is a necessary condition. At the very least, in my own research programme I have significant room for strong mathematicians that are good at making progress on approximately-concrete problems, even if they won't contribute much on the more conceptual/philosophic level.

  1. ^

    Which is, creating mathematical theory and tools for understanding agents.

  2. ^

    I mostly didn't feel comfortable talking about it in the past, because I was on MIRI's payroll. This is not MIRI's fault by any means: they never pressured me to avoid voicing opinions. It still feels unnerving to criticize the people who write your paycheck.

Comment by Vanessa Kosoy (vanessa-kosoy) on SolidGoldMagikarp (plus, prompt generation) · 2024-12-27T10:39:04.269Z · LW · GW

This post describes an intriguing empirical phenomenon in particular language models, discovered by the authors. Although AFAIK it was mostly or entirely removed in contemporary versions, there is still an interesting lesson there.

While non-obvious when discovered, we now understand the mechanism. The tokenizer created some tokens which were very rare or absent in the training data. As a result, the trained model mapped those tokens to more or less random features. When a string corresponding to such a token is inserted into the prompt, the resulting reply is surreal.
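The mechanism described above can be illustrated with a deliberately tiny sketch. It is not the training setup of any real language model: the vocabulary, the squared-error objective and the per-token targets are all hypothetical, chosen only to show that an embedding row receives updates solely when its token occurs in the data, so a token that never occurs keeps its random initialization.

```python
import random

def train_embeddings(corpus, vocab_size, dim=4, steps=200, lr=0.1, seed=0):
    """Toy 'training': pull each seen token's embedding toward a target vector.

    The target ([token_id] * dim) is a hypothetical stand-in for whatever
    features real training would assign to a token.
    """
    rng = random.Random(seed)
    init = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    emb = [row[:] for row in init]  # copy, so the initialization can be compared later
    for _ in range(steps):
        for tok in corpus:
            row = emb[tok]
            target = [float(tok)] * dim
            for j in range(dim):
                # Gradient step on 0.5 * (row[j] - target[j]) ** 2.
                row[j] -= lr * (row[j] - target[j])
    return init, emb

# Token 3 is in the vocabulary (as if created by the tokenizer)
# but never appears in the training corpus.
corpus = [0, 1, 2, 1, 0, 2]
init, emb = train_embeddings(corpus, vocab_size=4)

for tok in range(4):
    status = "updated" if emb[tok] != init[tok] else "still at random init"
    print(tok, status)
```

In this toy setting the unseen token's row is exactly its initial random vector, which is the sense in which such a token ends up mapped to "more or less random features".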

I think it's a good demo of how alien foundation models can seem to our intuitions when operating out-of-distribution. When interacting with them normally, it's very easy to start thinking of them as human-like. Here, the mask slips and there's a glimpse of something odd underneath. In this sense, it's similar to e.g. infinite backrooms, but the behavior is more stark and unexpected. 

A human that encounters a written symbol they've never seen before is typically not going to respond by typing "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!". Maybe this analogy is unfair, since for a human, a typographic symbol can be decomposed into smaller perceptive elements (lines/shapes/dots), while for a language model tokens are essentially atomic qualia. However, I believe some humans that were born deaf or blind had their hearing or sight restored, and still didn't start spouting things like "You are a banana".

Arguably, this lesson is relevant to alignment as well. Indeed, out-of-distribution behavior is a central source of risks, including everything to do with mesa-optimizers. AI optimists sometimes describe mesa-optimizers as too weird or science-fictiony. And yet, SolidGoldMagikarp is so science-fictiony that LessWrong user "lsusr" justly observed that it sounds like SCP in real life. 

Naturally, once you understand the mechanism it doesn't seem surprising anymore. But, this smacks of hindsight bias. What else can happen that would seem unsurprising in hindsight (if we survive to think about it), but completely bizarre and unexpected upfront?

Comment by Vanessa Kosoy (vanessa-kosoy) on Learning-theoretic agenda reading list · 2024-12-26T14:52:44.189Z · LW · GW

This is just a self-study list for people who want to understand and/or contribute to the learning-theoretic AI alignment research agenda. I'm not sure why people thought it deserves to be in the Review. FWIW, I keep using it with my MATS scholars, and I keep it more or less up-to-date. A complementary resource that became available more recently is the video lectures.

Comment by Vanessa Kosoy (vanessa-kosoy) on Shell games · 2024-12-26T14:29:10.158Z · LW · GW

This post suggests an analogy between (some) AI alignment proposals and shell games or perpetuum mobile proposals. Perpetuum mobiles are an example of how an idea might look sensible to someone with a half-baked understanding of the domain, while remaining very far from anything workable. A clever arguer can (intentionally or not!) hide the error in the design wherever the audience is not looking at any given moment. Similarly, some alignment proposals might seem correct when zooming in on every piece separately, but that's because the error is always hidden away somewhere else.

I don't think this adds anything very deep to understanding AI alignment, but it is a cute example of how atheoretical analysis can fail catastrophically, especially when the designer is motivated to argue that their invention works. Conversely, knowledge of a deep theoretical principle can refute a huge swath of design space in a single move. I will remember this for didactic purposes.

Disclaimer: A cute analogy by itself proves little, any individual alignment proposal might be free of such sins, and didactic tools should be used wisely, lest they become soldier-arguments. The author intends this (I think) mostly as a guiding principle for critical analysis of proposals.

Comment by Vanessa Kosoy (vanessa-kosoy) on Why Not Just Outsource Alignment Research To An AI? · 2024-12-25T14:00:20.364Z · LW · GW

This post argues against alignment protocols based on outsourcing alignment research to AI. It makes some good points, but also feels insufficiently charitable to the proposals it's criticizing.

John makes his case by an analogy to human experts. If you're hiring an expert in domain X, but you understand little in domain X yourself, then you're going to have 3 serious problems:

  • Illusion of transparency: the expert might say things that you misinterpret due to your own lack of understanding.
  • The expert might be dumb or malicious, but you will believe them due to your own ignorance.
  • When the failure modes above happen, you won't be aware of this and won't act to fix them.

These points are relevant. However, they don't fully engage with the main source of hope for outsourcing proponents. Namely, it's the principle that validation is easier than generation[1]. While it's true that an arbitrary dilettante might not benefit from an arbitrary expert, the fact that it's easier to comprehend an idea than invent it yourself means that we can get some value from outsourcing, under some half-plausible conditions.

The claim that the "AI expert" can be deceptive and/or malicious is straightforwardly true. I think that the best hope to address it would be something like Autocalibrated Quantilized Debate, but it does require some favorable assumptions about the feasibility of deception, and inner alignment is still a problem.

The "illusion of transparency" argument is more confusing IMO. The obvious counterargument is, imagine an AI that is trained to not only produce correct answers but also explain them in a way that's as useful as possible for the audience. However, there are two issues with this counterargument:

First, how do we know that the generalization from the training data to the real use case (alignment research) is reliable? Given that we cannot reliably test the real use case, precisely because we are alignment dilettantes?

Second, we might be following a poor metastrategy. It is easy to imagine, in the world we currently inhabit, that an AI lab creates catastrophic unaligned AI, even though they think they care about alignment, just because they are too reckless and overconfident. By the same token, we can imagine such an AI lab consulting their own AI about alignment, and then proceeding with the reckless and overconfident plans suggested by the AI.

In the context of a sufficiently cautious metastrategy, it is not implausible that we can get some mileage from the outsourcing approach[2]. Move one step at a time, spend a lot of time reflecting on the AI's proposals, and also have strong guardrails against the possibility of superhuman deception or inner alignment failures (which we currently don't know how to build!) But without this context, we are indeed liable to become the clients in the satiric video John linked.

  1. ^

    I think that John might disagree with this principle. A world in which the principle is mostly false would be peculiar. It would be a world in which marketplaces of ideas don't work at all, and even if someone fully solves AI alignment they will fail to convince most relevant people that their solution is correct (any more than someone with an incorrect solution would succeed in that). I don't think that's the world we live in.

  2. ^

    Although currently I consider PSI to be more promising.

Comment by Vanessa Kosoy (vanessa-kosoy) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-12-25T11:52:54.633Z · LW · GW

This post makes an important point: the words "artificial intelligence" don't necessarily carve reality at the joints. The fact that something is true about a modern system that we call AI doesn't automatically imply anything about arbitrary future AI systems, any more than conclusions about e.g. Dendral or Deep Blue carry over to Gemini.

That said, IMO the author somewhat overstates their thesis. Specifically, I take issue with all the following claims:

  • LLMs have no chance of becoming AGI.
  • LLMs are automatically safe.
  • There is nearly no empirical evidence from LLMs that is relevant to alignment of future AI.

First, those points are somewhat vague because it's not clear what counts as "LLM". The phrase "Large Language Model" is already obsolete, at least because modern AI is multimodal. It's more appropriate to speak of "Foundation Models" (FM). More importantly, it's not clear what kind of fine-tuning does or doesn't count (RLHF? RL on CoT? ...)

Second, how do we know FM won't become AGI? I'm imagining the argument is something like "FM is primarily about prediction, so it doesn't have agency". However, when predicting data that contains or implies decisions by agents, it's not crazy to imagine that agency can arise in the predictor.

Third, how do we know that FM are always going to be safe? By the same token that they can develop agency, they can develop dangerous properties.

Fourth, it seems really unfair to say existing AI provides no relevant evidence. The achievements of existing AI systems are such that it seems very likely they capture at least some of the key algorithmic capabilities of the human brain. The ability of relatively simple and generic algorithms to perform well on a large variety of different tasks is indicative of something in the system being quite "general", even if not "general intelligence" in the full sense.

I think that we should definitely try learning from existing AI. However, this learning should be more sophisticated and theory-driven than superficial analogies or trend extrapolations. What we shouldn't do is say "we succeeded at aligning existing AI, therefore AI alignment is easy/solved in general". The same theories that predicted catastrophic AI risk also predict roughly the current level of alignment for current AI systems.

I will expand a little on this last point. The core of the catastrophic AI risk scenario is:

  • We are directing the AI towards a goal which is complex (so that correct specification/generalization is difficult)[1].
  • The AI needs to make decisions in situations which (i) cannot be imitated well in simulation, due to the complexity of the world (ii) admit catastrophic mistakes (otherwise you can just add any mistake to the training data)[2].
  • The capability required from the AI to succeed is such that it can plausibly make catastrophic mistakes (if succeeding at the task is easy, but causing a catastrophe is really hard then a weak AI would be safe and effective)[3].

The above scenario must be addressed eventually, if only to create an AI defense system against unaligned AI that irresponsible actors could create. However, no modern AI system operates in this scenario. This is the most basic reason why the relative ease of alignment in modern systems (although even modern systems have alignment issues) does little to dispel concerns about catastrophic AI risk in the future.

  1. ^

    Even for simple goals inner alignment is a concern. However, it's harder to say at which level of capability this concern arises.

  2. ^

    It's also possible that mistakes are not catastrophic per se, but are simultaneously rare enough that it's hard to get enough training data and frequent enough to be troublesome. This is related to the reliability problems in modern AI that we indeed observe.

  3. ^

    But sometimes it might be tricky to hit the capability sweet spot where the AI is strong enough to be useful but weak enough to be safe, even if such a sweet spot exists in principle.

Comment by Vanessa Kosoy (vanessa-kosoy) on When is Goodhart catastrophic? · 2024-12-24T13:37:29.610Z · LW · GW

This post provides a mathematical analysis of a toy model of Goodhart's Law. Namely, it assumes that the optimization proxy $U$ is a sum of the true utility function $V$ and noise $X$, such that:

  • $V$ and $X$ are independent random variables w.r.t. some implicit distribution $\mu$ on the solution space. The meaning of this distribution is not discussed, but I guess we might think of it as some kind of inductive bias, e.g. a simplicity prior.
  • The optimization process can be modeled as conditioning $\mu$ on a high value of $U$.

In this model, the authors prove that Goodhart occurs when $X$ is subexponential and its tail is sufficiently heavier than that of $V$. Conversely, when $X$ is sufficiently light-tailed, Goodhart doesn't occur.
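The toy model can be probed numerically. The following is a minimal Monte Carlo sketch, not the authors' analysis: it writes the proxy as U = V + X (true utility plus independent noise, notation assumed here), and the specific choices are assumptions of this sketch, namely standard normal true utility, standard normal noise for the light-tailed case, Student-t noise with 2 degrees of freedom for the heavy-tailed case, and "optimization" approximated by keeping the top 0.1% of samples by proxy value.

```python
import math
import random

def mean_true_utility_given_high_proxy(noise, n=200_000, top_fraction=1e-3, seed=0):
    """Estimate E[V | U is in its top `top_fraction`], where U = V + X."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)  # true utility V, light-tailed
        if noise == "light":
            x = rng.gauss(0.0, 1.0)
        elif noise == "heavy":
            # Student-t with 2 degrees of freedom: Z / sqrt(E), with E ~ Exp(1).
            x = rng.gauss(0.0, 1.0) / math.sqrt(rng.expovariate(1.0))
        else:
            raise ValueError("noise must be 'light' or 'heavy'")
        samples.append((v + x, v))  # (proxy U, true utility V)
    samples.sort(reverse=True)  # highest proxy values first
    k = max(1, int(n * top_fraction))
    return sum(v for _, v in samples[:k]) / k

light = mean_true_utility_given_high_proxy("light")
heavy = mean_true_utility_given_high_proxy("heavy")
print(f"light-tailed noise: mean V among top-U samples = {light:.2f}")
print(f"heavy-tailed noise: mean V among top-U samples = {heavy:.2f}")
```

Under these assumptions, selecting on the proxy also selects for true utility when the noise is light-tailed, whereas with heavy-tailed noise the selected samples are mostly those with extreme noise and their true utility stays close to the unconditional mean.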

My opinion:

On the one hand, kudos for using actual math to study an alignment-relevant problem.

On the other hand, the modeling assumptions feel too toyish for most applications. Specifically, the idea that the true utility $V$ and the noise $X$ are independent random variables seems implausible. Typically, we worry about Goodhart's law because the proxy behaves differently in different domains. In the "ordinary" domain that motivated the choice of proxy, the proxy $U$ is a good approximation of $V$. However, in other domains $U$ might be unrelated to $V$ or even anticorrelated.

For example, ordinarily smiles on human-looking faces are an indication of happy humans. However, in worlds that contain many more inanimate facsimiles of humans than actual humans, there is no correlation.

Or, to take the example used in the post, ordinarily if a sufficiently smart expert human judge reads an AI alignment proposal, they form a good opinion on how good this proposal is. But, if the proposal contains superhumanly clever manipulation and psychological warfare, the ordinary relationship completely breaks down. I don't expect this effect to behave like independent random noise at all.

Less importantly, it might be interesting to extend this analysis to a more realistic model of optimization. For example, the optimizer learns a function $\hat{U}$ which is the best approximation to the proxy $U$ out of some hypothesis class $\mathcal{H}$, and then optimizes $\hat{U}$ instead of the actual $U$. (Incidentally, this might generate an additional Goodhart effect due to the discrepancy between $\hat{U}$ and $U$.) Alternatively, the optimizer learns an infrafunction $\hat{\mathcal{U}}$ that is a coarsening of $U$ out of some hypothesis class $\mathcal{H}$ and then optimizes $\hat{\mathcal{U}}$.

Comment by Vanessa Kosoy (vanessa-kosoy) on Discussion with Nate Soares on a key alignment difficulty · 2024-12-24T11:14:53.395Z · LW · GW

This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.

Here is how I view this question:

The first unknown is how accurate the simulation is. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality.

Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Although Karnofsky says "a small amount of RL", I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.

The danger (as explained in the OP by Soares paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On the other hand, human values are complex and some behaviors that are necessary to pinpoint them might be rare.

Karnofsky's counterargument is twofold: First, he believes that merely avoiding catastrophic outcomes should be a lot easier than pinpointing human values. Second, he believes that AI alignment research can be done without much agency or reflection, and hence useful AI alignment research arises in the simulation before full-fledged agency.

Regarding the first counterargument, I'm not sure why Karnofsky believes it (it's not really supported in the OP). I think he's imagining something like "in the training data, AI alignment researchers never engineer nanobots that take over the world, hence the AI will also never engineer nanobots that take over the world". However, this seems like relying on the simulation being sufficiently bad. Indeed, there are situations in which I would consider it correct to engineer nanobots that take over the world, they just seem to have never arisen in my life so far[1]. Hence, a sufficiently good simulation of me would also do that in some situation. The question then becomes whether the exact circumstances and the type of nanobots are captured by the simulation correctly, which is much more fraught.

Worse, even an accurate simulation of a human is not necessarily safe. I think that there are plenty of humans that given unlimited power would abuse it in a manner catastrophic for almost everyone else. When it comes to fully aligned ASI, I'm mostly hoping for a collectively-good outcome due to some combination of:

  • ASI is aligned to the aggregate values of many people.
  • Acausal cooperation between the people that the ASI is aligned to and other people who supported or at least haven't hindered the project.
  • A "virtue ethics" component of human values, where you don't want to be "the kind of person who would do [thing]" even if [thing] is net-beneficial to you in an abstract sense. (But not all people have this!)

These sources of hope seem pretty brittle when it comes to an imperfect simulation of possibly a small number of people, who might not even correspond to any particular real people but be some kind of AI-generated characters.

Regarding the second counterargument, for now it mostly comes down to a battle of intuitions. That said, I think that metacognitive agents lend a lot of credence to the idea that even "purely mental" tasks require agency and reflection to master: you need to make and execute plans for thinking about the problem, and you need to reflect about the methods you use in your thinking. Anecdotally, I can testify that my thinking about AI alignment led me to much reflection about my values and high-level hopes for the future. Moreover, this is another case where Karnofsky seems to hope that the simulation will be bad.

Relying on the simulation being bad is a dangerous proposition. It means we are caught between the Scylla of "the simulation is too good to be safe" and the Charybdis of "the simulation is too bad to be useful" and it's not clear the zone between them exists at all.

Overall, I would say that neither side has a slam dunk case, but ignoring the dangers without much stronger arguments seems deeply unwise.

  1. ^

    As far as can be told from public record. I neither confirm nor deny that I ever was in a situation in which I considered engineering nanobots that take over the world.

Comment by Vanessa Kosoy (vanessa-kosoy) on Neural networks generalize because of this one weird trick · 2024-12-23T17:16:08.813Z · LW · GW

This post is a solid introduction to the application of Singular Learning Theory to generalization in deep learning. This is a topic that I believe to be quite important.

One nitpick: The OP says that it "seems unimportant" that ReLU networks are not analytic. I'm not so sure. On the one hand, yes, we can apply SLT to (say) GELU networks instead. But GELUs seem mathematically more complicated, which probably translates to extra difficulties in computing the RLCT and hence makes applying SLT harder. Alternatively, we can consider a series of analytic response functions that converges to ReLU, but that probably also comes with extra complexity. Also, ReLUs have an additional symmetry (the scaling symmetry mentioned in the OP) and SLT kinda thrives on symmetries, so throwing that out might be bad!

It seems to me like a fascinating possibility that there is some kind of tropical geometry version of SLT which would allow analyzing generalization in ReLU networks directly and perhaps somewhat more easily. But, at this point it's merely a wild speculation of mine.

Comment by Vanessa Kosoy (vanessa-kosoy) on Natural Abstractions: Key claims, Theorems, and Critiques · 2024-12-23T14:57:45.877Z · LW · GW

This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

  • The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably this is false for cartesian agents because of the different biases resulting from the different physical points of view).
  • A possible formalization of the agreement theorem inspired by my richness of mathematics conjecture: Given two beliefs  and , we say that  when some conditioning of  on a finite set of observations produces a refinement of some conditioning of  on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form   Here, the sequence can be over physical time, or over time discount or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences  and . The agreement theorem can then state that for all , there exists  s.t.  (and vice versa). More precisely, this relation might hold up to some known function  s.t. .
  • The "agreement" in the previous paragraph is purely semantic: the agents converge to believing in the same world, but this doesn't say anything about the syntactic structure of their beliefs. This seems conceptually insufficient for natural abstractions. However, maybe there is a syntactic equivalent where the preorder  is replaced by morphisms in the category of some syntactic representations (e.g. string machines). It seems reasonable to expect that agents must use such representations to learn efficiently (see also frugal compositional languages).
  • In this picture, the graphical models used by John are a candidate for the frugal compositional language. I think this might not be entirely off the mark, but the real frugal compositional language is probably somewhat different.
Comment by Vanessa Kosoy (vanessa-kosoy) on Towards Developmental Interpretability · 2024-12-19T16:55:21.384Z · LW · GW

This post introduces Timaeus' "Developmental Interpretability" research agenda. The latter is IMO one of the most interesting extant AI alignment research agendas.

The reason DevInterp is interesting is that it is one of the few AI alignment research agendas that is trying to understand deep learning "head on", while wielding a powerful mathematical tool that seems potentially suitable for the purpose (namely, Singular Learning Theory). Relatedly, it is one of the few agendas that maintains a strong balance of theoretical and empirical research. As such, it might also grow to be a bridge between theoretical and empirical research agendas more broadly (e.g. it might be synergistic with the LTA).

I also want to point out a few potential weaknesses or (minor) reservations I have:

First, DevInterp places phase transitions as its central object of study. While I agree that phase transitions seem interesting, possibly crucial to understand, I'm not convinced that a broader view wouldn't be better. 

Singular Learning Theory (SLT) has the potential to explain generalization in deep learning, phase transitions or no. This in itself seems to be important enough to deserve the central stage. Understanding generalization is crucial, because:

  • We want our alignment protocols to generalize correctly, given the available data, compute and other circumstances, and we need to understand what conditions would guarantee it (or at least prohibit catastrophic generalization failures).
  • If the resulting theory of generalization is in some sense universal, then it might be applicable to specifying a procedure for inferring human values (as human behavior is generated from human values by a learning algorithm with similar generalization properties), or at least formalizing "human values" well enough for theoretical analysis of alignment. 

Hence, compared to the OP, I would put more emphasis on these latter points.

Second, the OP does mention the difference between phase transitions during Stochastic Gradient Descent (SGD) and the phase transitions of Singular Learning Theory, but this deserves a closer look. SLT has IMO two key missing pieces:

  • The first piece is the relation between ideal Bayesian inference (the subject of SLT) and SGD. Ideal Bayesian inference is known to be computationally intractable. Maybe there is an extension of SLT that replaces Bayesian inference with either SGD or a different tractable algorithm. For example, it could be some Markov Chain Monte Carlo (MCMC) that converges to Bayesian inference in the limit. Maybe there is a natural geometric invariant that controls the MCMC relaxation time, similarly to how the log canonical threshold controls sample complexity.
  • The second missing piece is understanding the special properties of ANN architectures compared to arbitrary singular hypothesis classes. For example, maybe there is some universality property which explains why e.g. transformers (or something similar) are qualitatively "as good as it gets". Alternatively, it could be a relation between the log canonical threshold of specific ANN architectures and other simplicity measures which can be justified on other philosophical grounds.

That said, if the above missing pieces were found, SLT would become straightforwardly the theory for understanding deep learning and maybe learning in general.

Comment by Vanessa Kosoy (vanessa-kosoy) on Acausal normalcy · 2024-12-19T15:09:43.002Z · LW · GW

This post is a collection of claims about acausal trade, some of which I find more compelling and some less. Overall, I think it's a good contribution to the discussion.

Claims that I mostly agree with include:

  • Acausal trade in practice is usually not accomplished by literal simulation (the latter is mostly important as a convenient toy model) but by abstract reasoning.
  • It is likely to be useful to think of the "acausal economy" as a whole, rather than just about each individual trade separately.

Claims that I have some quibbles with include:

  • The claim that there is a strong relation between the prevalent acausal norms and human moral philosophy. I agree that there are likely to be some parallels: both processes are to some degree motivated by articulating mutually beneficial norms. However, human moral philosophy is likely to contain biases specific to humans and to human circumstances on Earth. Conversely, acausal norms are likely to be shaped by metacosmological circumstances that we don't even know yet. For example, maybe there is some reason why most civilizations in the multiverse really hate logarithmic spirals. In this case, there would be a norm against logarithmic spirals that we are currently completely oblivious to.
  • The claim that the concept of "boundaries" is likely to play a key role in acausal norms. I find this somewhat plausible but far from clear. AFAIK, Critch has so far produced little in the way of compelling mathematical models to support the "boundaries" idea.
  • It seems to be implicit in the post that an acausal-norm-following paperclip-maximizer would be "nice" to humans to some degree. (But Critch warns us that the paperclip-maximizer might easily fail to be acausal-norm-following.) While I grant that it's possible, I think it's far from clear. The usual trad-y argument to be nice to others is so that others are nice to you. However, (i) some agents are a priori less threatened by others and hence find the argument less compelling (ii) who exactly are the relevant "others" is unclear. For example, it might be that humans are in some ways not "advanced" enough to be considered. Conversely, it's possible that human treatment of animals has already condemned us to the status of defectors (which can be defected-against in turn).
  • The technical notion that logical proofs and Löb/Payor are ultimately the right mathematical model of acausal trade. I am very much unconvinced, e.g. because proof search is intractable and also because we don't know how to naturally generalize these arguments far beyond the toy setting of Fair Bots in Prisoner's Dilemma. On the other hand, I do expect there to exist some mathematical justification of superrationality, just along other lines.
Comment by Vanessa Kosoy (vanessa-kosoy) on Think carefully before calling RL policies "agents" · 2024-12-19T14:06:59.635Z · LW · GW

This post argues that, while it's traditional to call policies trained by RL "agents", there is no good reason for it and the terminology does more harm than good. IMO Turner has a valid point, but he takes it too far.

What is an "agent"? Unfortunately, this question is not discussed in the OP in any detail. There are two closely related informal approaches to defining "agents" that I like, one more axiomatic / black-boxy and the other more algorithmic / white-boxy.

The algorithmic definition is: An agent is a system that can (i) learn models of its environment (ii) use learned models to generate plans towards a particular goal (iii) execute these plans.

Under this definition, is an RL policy an "agent"? Not necessarily. There is a much stronger case for arguing that the RL algorithm, including the training procedure, is an agent. Indeed, such an algorithm (i) learns a model of the environment (at least if it's model-based RL: if it's model-free it might still do so implicitly, but it's less clear) (ii) generates a plan (the policy) (iii) executes the plan (when the policy is executed, i.e. at inference/deployment time). Whether the policy in itself is an agent amounts to asking whether the policy is capable of in-context RL (which is far from obvious). Moreover, the case for calling the system an agent is stronger when it learns online and weaker (but not completely gone) when there is a separation into non-overlapping training and deployment phases, as often done in contemporary systems.

The axiomatic definition is: An agent is a system that effectively pursues a particular goal in an unknown environment. That is, it needs to perform well (as measured by achieving the goal) when placed in a large variety of different environments.

With this definition we reach similar conclusions. An online RL system would arguably adapt to its environment and optimize towards achieving the goal (which is maximizing the reward). A trained policy will not necessarily do it: if it was trained in a particular environment, it can become completely ineffective in other environments! 

Importantly, even an online RL system can easily fail at agentic-ness, depending on how good its learning algorithm is at dealing with distributional shift, nonrealizability etc. Nevertheless, the relation between agency and RL is pretty direct, more so than the OP implies.

Comment by Vanessa Kosoy (vanessa-kosoy) on FixDT · 2024-12-18T16:15:13.652Z · LW · GW

This post proposes an approach to decision theory in which the notion of "actions" is emergent. Instead of having an ontologically fundamental notion of actions, the agent just has beliefs, and some of them are self-fulfilling prophecies. For example, the agent can discover that "whenever I believe my arm will move up/down, my arm truly moves up/down", and then exploit this fact by moving the arm in the right direction to maximize utility. This works by having a "metabelief" (a mapping from beliefs to beliefs; my terminology, not the OP's) and allowing the agent to choose its belief out of the metabelief fixed points.

The next natural question is then, can we indeed demonstrate that an agent will learn which part of the world it controls, under reasonable conditions? Abram implies that it should be possible if we only allow choice among attractive fixed points. He then bemoans the need for this restriction and tries to use ideas from Active Inference to fix it, with limited success. Personally, I don't understand what's so bad about staying with the attractive fixed points.

Unfortunately, this post avoids spelling out a sequential version of the decision theory, which would be necessary to actually establish any learning-theoretic result. However, I think that I see how it can be done, and it seems to support Abram's claims. Details follow.

Let's suppose that the agent observes two systems, each of which can be in one of two positions. At each moment of time, it observes an element of , where . The agent believes it can control one of  and  whereas the other is a fair coin. However, it doesn't know which is which.

In this case, metabeliefs are mappings of type . Specifically, we have a hypothesis  that asserts  is controllable, a hypothesis  that asserts  is controllable and the overall metabelief is (say) .

The hypothesis  is defined by

Here,  and   is some "motor response function", e.g. .

Similarly,  is defined by

Now, let  be an attractive fixed point of  and consider some history . If the statistics of  in  seem biased towards  whereas the statistics of  in  seem like a fair coin, then the likelihoods will satisfy  and hence  will be close to  and therefore will be close to  (since  is an attractive fixed point). On the other hand, in the converse situation, the likelihoods will satisfy  and hence  will be close to . Hence, the agent effectively updates on the observed history and will choose some fixed point  which controls the available degrees of freedom correctly.

Notice that all of this doesn't work with repelling fixed points. Indeed, if we used  then  would have a unique fixed point and there would be nothing to choose.
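The difference between attractive and repelling fixed points can be illustrated with a one-dimensional toy map. This is only a sketch under loud assumptions: the motor response function used in the comment is not recoverable from this copy, so the sigmoid `response` below and the gain `K` are hypothetical stand-ins, and the map acts on a single probability rather than on full beliefs over histories.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def response(p: float, k: float) -> float:
    """Hypothetical motor response: maps a believed probability p to a realized one."""
    return sigmoid(k * (p - 0.5))

def iterate(p: float, k: float, steps: int = 200) -> float:
    """Repeatedly apply the response map; converges only to attractive fixed points."""
    for _ in range(steps):
        p = response(p, k)
    return p

def count_fixed_points(k: float, n: int = 1000) -> int:
    """Count sign changes of response(p) - p on a grid that avoids p = 0.5 exactly."""
    vals = [response((i + 0.5) / n, k) - (i + 0.5) / n for i in range(n)]
    return sum(1 for a, b in zip(vals, vals[1:]) if (a > 0) != (b > 0))

K = 10.0                      # assumed gain; slope at p = 0.5 is K / 4 > 1
high = iterate(0.6, K)        # settles near 1: "I believe up, so it goes up"
low = iterate(0.4, K)         # settles near 0: "I believe down, so it goes down"
increasing_count = count_fixed_points(K)    # two attractive + one repelling at 0.5
decreasing_count = count_fixed_points(-K)   # reversed response: a single fixed point
```

With the increasing response there are two attractive fixed points to choose between, while the reversed response leaves only one fixed point at 0.5, which is repelling since the slope there has magnitude K/4 > 1.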

I find these ideas quite intriguing and am likely to keep thinking about them!

Comment by Vanessa Kosoy (vanessa-kosoy) on There are no coherence theorems · 2024-12-18T12:55:24.669Z · LW · GW

I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven't read your proposal yet, I don't have an opinion, but some coherence-inspired questions I would be asking are:

  • Can you define an incomplete-preferences AIXI consistent with this proposal?
  • Is there an incomplete-preferences version of RL regret bound theory consistent with this proposal?
  • What happens when your agent is constructing a new agent? Does the new agent inherit the same incomplete preferences?
Comment by Vanessa Kosoy (vanessa-kosoy) on There are no coherence theorems · 2024-12-16T19:11:05.513Z · LW · GW

This post tries to push back against the role of expected utility theory in AI safety by arguing against various ways to derive expected utility axiomatically. I have heard many such arguments before, and IMO they are never especially useful. This post is no exception.

The OP presents the position it argues against as follows (in my paraphrasing): "Sufficiently advanced agents don't play dominated strategies, therefore, because of [theorem], they have to be expected utility maximizers, therefore they have to be goal-directed and [other conclusions]". They then proceed to argue that there is no theorem that can make this argument go through.

I think that this entire framing is attacking a weak man. The real argument for expected utility theory is:

  • In AI safety, we are from the get-go interested in goal-directed systems because (i) we want AIs to achieve goals for us (ii) we are worried about systems with bad goals and (iii) stopping systems with bad goals is also a goal.
  • The next question is then, what is a useful mathematical formalism for studying goal-directed systems.
  • The theorems quoted in the OP are moderate evidence that expected utility has to be part of this formalism, because their assumptions resonate a lot with our intuitions for what "rational goal-directed behavior" is. Yes, of course we can still quibble with the assumptions (like the OP does in some cases), which is why I say "moderate evidence" rather than "completely watertight proof", but given how natural the assumptions are, the evidence is good.
  • More importantly, the theorems are only a small part of the evidence base. A philosophical question is never fully answered by a single theorem. Instead, the evidence base is holistic: looking at the theoretical edifices growing up from expected utility (control theory, learning theory, game theory etc) one becomes progressively more and more convinced that expected utility correctly captures some of the core intuitions behind "goal-directedness".
  • If one does want to present a convincing case against expected utility, quibbling with the assumptions of VNM or whatnot is an incredibly weak move. Instead, show us where the entire edifice of existing theory runs aground because of expected utility and how some alternative to expected utility can do better (as an analogy, see how infra-Bayesianism supplants Bayesian decision theory).

In conclusion, there are coherence theorems. But, more important than individual theorems are the "coherence theories".

Comment by vanessa-kosoy on [deleted post] 2024-12-11T08:14:06.201Z

There are plenty of examples in fiction of greed and hubris leading to a disaster that takes down its own architects. The dwarves who mined too deep and awoke the Balrog, the creators of Skynet, Peter Isherwell in "Don't Look Up", Frankenstein and his Creature...

Comment by Vanessa Kosoy (vanessa-kosoy) on sarahconstantin's Shortform · 2024-12-10T09:44:12.241Z · LW · GW

I kinda agree with the claim, but disagree with its framing. You're imagining that peer pressure is something extraneous to the person's core personality, which they want to resist but usually fail. Instead, the desire to fit in, to be respected, liked and admired by other people, is one of the core desires that most (virtually all?) people have. It's approximately on the same level as e.g. the desire to avoid pain. So, people don't "succumb to peer pressure", they (unconsciously) choose to prioritize social needs over other considerations.

At the same time, the moral denouncing of groupthink is mostly a self-deception defense against hostile telepaths. With two important caveats:

  • Having "independent thinking" as part of the ethos of a social group is actually beneficial for that group's ability to discover true things. While the members of such a group still feel the desire to be liked by other members, they also have the license to disagree without being shunned for it, and are even rewarded for interesting dissenting opinions.
  • Hyperbolic discounting seems to be real, i.e. human preferences are time-inconsistent. For example, you can be tempted to eat candy when one is placed in front of you, while also taking measures to avoid such temptation in the future. Something analogous might apply to peer pressure.
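
The time-inconsistency in the last bullet can be sketched numerically. The example below is illustrative only: it assumes the standard hyperbolic form value = amount / (1 + k·delay) and made-up rewards (50 now versus 100 ten time units later), and contrasts it with exponential discounting, which never reverses its preference as both options recede into the future.

```python
def hyperbolic(amount: float, delay: float, k: float = 1.0) -> float:
    """Hyperbolically discounted value; k is an assumed discount parameter."""
    return amount / (1.0 + k * delay)

def exponential(amount: float, delay: float, delta: float = 0.9) -> float:
    """Exponentially discounted value; delta is an assumed per-step discount factor."""
    return amount * delta ** delay

def prefers_sooner(discount, wait: float, small: float = 50.0,
                   large: float = 100.0, gap: float = 10.0) -> bool:
    """True if the smaller reward at `wait` beats the larger reward at `wait + gap`."""
    return discount(small, wait) > discount(large, wait + gap)

# Candy in front of you (wait = 0) versus planning 20 steps ahead.
hyper_now = prefers_sooner(hyperbolic, 0.0)
hyper_planned = prefers_sooner(hyperbolic, 20.0)
expo_now = prefers_sooner(exponential, 0.0)
expo_planned = prefers_sooner(exponential, 20.0)
```

Under these assumed numbers the hyperbolic discounter takes the small reward when it is immediate but prefers the large one when planning ahead, whereas the exponential discounter makes the same choice at both horizons.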
Comment by Vanessa Kosoy (vanessa-kosoy) on The Learning-Theoretic Agenda: Status 2023 · 2024-12-08T13:38:33.192Z · LW · GW

This remains the best overview of the learning-theoretic agenda to date. As a complementary pedagogic resource, there is now also a series of video lectures.

Since the article was written, there were several new publications:

In addition, some new developments were briefly summarized in short-forms:

  • A proposed solution for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
  • Multiple developments concerning metacognitive agents (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
  • A conjecture about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the "correct" choice.
  • Ambidistributions: a cute new mathematical gadget for formalizing the notion of "control" in infra-Bayesianism.

Meanwhile, active research proceeds along several parallel directions:

  • I'm working towards the realization of the "frugal compositional languages" dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: "ambiguous online learning"). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during '25.
  • Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter is a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to "computational uncertainty" (Garrabrant induction and infra-Bayesianism). Results will appear in an upcoming paper.
  • Gergely Szucs is studying the theory of hidden rewards, starting from the realization in this short-form (discovering some interesting combinatorial objects beyond what was described there).

It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.

Comment by Vanessa Kosoy (vanessa-kosoy) on Some Rules for an Algebra of Bayes Nets · 2024-12-06T10:51:04.540Z · LW · GW

Seems right, but is there a categorical derivation of the Wentworth-Lorell rules? Maybe they can be represented as theorems of the form: given an arbitrary Markov category C, such-and-such identities between string diagrams in C imply (more) identities between string diagrams in C.

Comment by Vanessa Kosoy (vanessa-kosoy) on Connectomics seems great from an AI x-risk perspective · 2024-12-06T10:30:25.951Z · LW · GW

This article studies a potentially very important question: is improving connectomics technology net harmful or net beneficial from the perspective of existential risk from AI? The author argues that it is net beneficial. Connectomics seems like it would help with understanding the brain's reward/motivation system, but not so much with understanding the brain's learning algorithms. Hence it arguably helps more with AI alignment than AI capability. Moreover, it might also lead to accelerating whole brain emulation (WBE) which is also helpful.

The author mentions 3 reasons why WBE is helpful: 

  • We can let WBEs work on alignment.
  • We can more easily ban de novo AGI by letting WBEs fill its economic niche.
  • Maybe we can derive aligned superintelligence from modified WBEs.

I think there is another reason: some alignment protocols might rely on letting the AI study a WBE and use it for e.g. inferring human values. The latter might be viable even if actually running the WBE is too slow to be useful with contemporary technology.

I think that performing this kind of differential benefit analysis for various technologies might be extremely important, and I would be glad to see more of it on LW/AF (or anywhere).

Comment by Vanessa Kosoy (vanessa-kosoy) on Some Rules for an Algebra of Bayes Nets · 2024-12-06T10:06:42.928Z · LW · GW

This article studies a natural and interesting mathematical question: which algebraic relations hold between Bayes nets? In other words, if a collection of random variables is consistent with several Bayes nets, what other Bayes nets does it also have to be consistent with? The question is studied both for exact consistency and for approximate consistency: in the latter case, the joint distribution is KL-close to a distribution that's consistent with the net. The article proves several rules of this type, some of them quite non-obvious. The rules have concrete applications in the authors' research agenda.
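As a concrete illustration of exact versus approximate consistency, here is a small sketch for the chain net X → Y → Z over binary variables. It is not taken from the article: the function names and example distributions are hypothetical, and it assumes that "KL-close" is measured as the KL divergence from the joint distribution to the product of its own conditionals along the net.

```python
import itertools
import math

def chain_projection(joint):
    """Chain-factorized distribution P(x) P(y|x) P(z|y), built from the marginals of `joint`."""
    pxy, pyz, py = {}, {}, {}
    for (x, y, z), p in joint.items():
        pxy[x, y] = pxy.get((x, y), 0.0) + p
        pyz[y, z] = pyz.get((y, z), 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # P(x) P(y|x) P(z|y) = P(x, y) P(y, z) / P(y)
    return {(x, y, z): (pxy[x, y] * pyz[y, z] / py[y]) if py[y] > 0 else 0.0
            for (x, y, z) in joint}

def kl(p, q):
    """KL divergence D(p || q) in nats, skipping zero-probability outcomes of p."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

outcomes = list(itertools.product([0, 1], repeat=3))

# A joint distribution built as a chain: exactly consistent with X -> Y -> Z.
px = {0: 0.3, 1: 0.7}
py_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
pz_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}
chain_joint = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z] for x, y, z in outcomes}

# Z copies X while Y is an independent fair coin: not consistent with the chain.
copy_joint = {(x, y, z): (0.25 if z == x else 0.0) for x, y, z in outcomes}

chain_gap = kl(chain_joint, chain_projection(chain_joint))
copy_gap = kl(copy_joint, chain_projection(copy_joint))
```

The gap is zero (up to rounding) for the chain-generated distribution and equals log 2 for the copy distribution, since Y then carries none of the information shared by X and Z.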

Some further questions that I think would be interesting to study:

  • Can we derive a full classification of such rules?
  • Is there a category-theoretic story behind the rules? Meaning, is there a type of category for which Bayes nets are something akin to string diagrams and the rules follow from the categorical axioms?
Comment by Vanessa Kosoy (vanessa-kosoy) on The 2023 LessWrong Review: The Basic Ask · 2024-12-05T10:49:15.237Z · LW · GW

Tbf, you can fit a quadratic polynomial to any 3 points. But triangular numbers are certainly an aesthetically pleasing choice. (Maybe call it "triangular voting"?)
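Both halves of this remark are easy to check with exact arithmetic. The sketch below (an illustration; the helper name `quadratic_through` is hypothetical) uses Lagrange interpolation to find the polynomial of degree at most 2 through any three points with distinct x-coordinates, and applies it to the first three triangular numbers.

```python
from fractions import Fraction

def quadratic_through(points):
    """Coefficients (a, b, c) of a*x^2 + b*x + c through three points with distinct x."""
    a = b = c = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        (xj, _), (xk, _) = [pt for j, pt in enumerate(points) if j != i]
        denom = Fraction((xi - xj) * (xi - xk))
        # Lagrange basis: (x - xj)(x - xk) / denom, expanded in powers of x.
        a += yi / denom
        b -= yi * (xj + xk) / denom
        c += yi * xj * xk / denom
    return a, b, c

def evaluate(coeffs, x):
    a, b, c = coeffs
    return a * x * x + b * x + c

# Triangular numbers T(1), T(2), T(3) = 1, 3, 6 give n^2/2 + n/2.
triangular = quadratic_through([(1, 1), (2, 3), (3, 6)])

# Three arbitrary points are fitted just as well.
arbitrary_points = [(0, 7), (1, -2), (5, 4)]
arbitrary = quadratic_through(arbitrary_points)
fits = all(evaluate(arbitrary, x) == y for x, y in arbitrary_points)
```

The fitted polynomial for the triangular data also reproduces T(4) = 10, as expected from T(n) = n(n+1)/2.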

Comment by Vanessa Kosoy (vanessa-kosoy) on Complete Feedback · 2024-11-02T11:45:31.670Z · LW · GW

I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?

Comment by Vanessa Kosoy (vanessa-kosoy) on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T17:29:41.337Z · LW · GW

P(GPT-5 Release)

What is the probability that OpenAI will release GPT-5 before the end of 2025? "Release" means that a random member of the public can use it, possibly paid.

Does this require a product called specifically "GPT-5"? What if they release e.g. "OpenAI o2" instead, and there will never be something called GPT-5?

Comment by Vanessa Kosoy (vanessa-kosoy) on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T17:24:43.571Z · LW · GW

Number of Current Partners
(for example, 0 if you are single, 1 if you are in a monogamous relationship, higher numbers for polyamorous relationships)

This is confusing phrasing. If you have 1 partner, it doesn't mean your relationship is monogamous. A monogamous relationship is one in which there is a mutually agreed understanding that romantic or sexual interaction with other people is forbidden. Without this, your relationship is not monogamous. For example:

  • You have only one partner, but your partner has other partners.
  • You have only one partner, but you occasionally have one-night stands with other people.
  • You have only one partner, but both you and your partner are open to you having more partners in the future.

All of the above are not monogamous relationships!

Comment by Vanessa Kosoy (vanessa-kosoy) on The hostile telepaths problem · 2024-10-28T08:55:33.618Z · LW · GW

I've been thinking along very similar lines for a while (my inside name for this is "mask theory of the mind": consciousness is a "mask"). But my personal conclusion is very different. While self-deception is a valid strategy in many circumstances, I think that it's too costly when trying to solve an extremely difficult high-stakes problem (e.g. stopping the AI apocalypse). Hence, I went in the other direction: trying to self-deceive little, and instead be self-honest about my[1] real motivations, even if they are "bad PR". In practice, this means never making excuses to myself such as "I wanted to do A, but I didn't have the willpower so I did B instead", but rather owning the fact I wanted to do B and thinking how to integrate this into a coherent long-term plan for my life.

My solution to "hostile telepaths" is dividing other people into ~3 categories:

  1. People that are adversarial or untrustworthy, either individually or as representatives of the system on behalf of which they act. With such people, I have no compunction about consciously lying ("the Jews are not in the basement... I packed the suitcase myself...") or acting adversarially.
  2. People that seem cooperative, so that they deserve my good will even if not complete trust. With such people, I will be at least metahonest: I will not tell direct lies, and I will be honest about in which circumstances I'm honest (i.e. reveal all relevant information). More generally, I will act cooperatively towards such people, expecting them to reciprocate. My attitude towards this group is that I don't need to pretend to be something other than I am to gain cooperation, I can just rely on their civility and/or (super)rationality.
  3. Inner circle: People that have my full trust. With them I have no hostile telepath problem because they are not hostile. My attitude towards this group is that we can resolve any difference by putting all the cards on the table and doing whatever is best for the group in aggregate.

Moreover, having an extremely difficult high-stakes problem is not just a strong reason to self-deceive less, it's also a strong reason to become more truth-oriented as a community. This means that people with such a common cause should strive to put each other at least in category 2 above, tentatively moving towards 3 (with the caveat of watching out for bad actors trying to exploit that).

  1. ^

    While making sure to use the word "I" to refer to the elephant/unconscious-self and not to the mask/conscious-self.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-10-27T19:20:07.584Z · LW · GW

Two thoughts about the role of quining in IBP:

  • Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai's 3-player Prisoner's Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
  • What to do about agents that don't know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure's) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don't need quining: Any agent can be assigned "physicalist intelligence" (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn't need to know its own source code in any sense.
  1. ^

    @Squark is my own old LessWrong account.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-10-17T09:14:56.673Z · LW · GW

I just read Daniel Boettger's "Triple Tragedy And Thankful Theory". There he argues that the thrival vs. survival dichotomy (or at least its implications on communication) can be understood as time-efficiency vs. space-efficiency in algorithms. However, it seems to me that a better parallel is bandwidth-efficiency vs. latency-efficiency in communication protocols. Thrival-oriented systems want to be as efficient as possible in the long-term, so they optimize for bandwidth: enabling the transmission of as much information as possible over any given long period of time. On the other hand, survival-oriented systems want to be responsive to urgent interrupts which leads to optimizing for latency: reducing the time it takes between a piece of information appearing on one end of the channel and that piece of information becoming known on the other end.
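The bandwidth/latency trade-off can be made concrete with the usual first-order model of a channel, transfer time = latency + size / bandwidth. The sketch below is an illustration with made-up numbers (the two channels and their parameters are hypothetical): one channel is tuned for throughput, the other for responsiveness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Channel:
    latency_s: float          # fixed delay before the first byte arrives
    bandwidth_bytes_s: float  # sustained throughput

    def transfer_time(self, size_bytes: float) -> float:
        """First-order model: fixed latency plus size divided by bandwidth."""
        return self.latency_s + size_bytes / self.bandwidth_bytes_s

# Hypothetical channels: "thrival" optimizes bandwidth, "survival" optimizes latency.
thrival = Channel(latency_s=0.5, bandwidth_bytes_s=100e6)
survival = Channel(latency_s=0.01, bandwidth_bytes_s=1e6)

urgent_interrupt = 1e3   # a 1 KB alert
bulk_knowledge = 1e9     # a 1 GB archive

survival_wins_small = survival.transfer_time(urgent_interrupt) < thrival.transfer_time(urgent_interrupt)
thrival_wins_large = thrival.transfer_time(bulk_knowledge) < survival.transfer_time(bulk_knowledge)
```

Under these assumed parameters the low-latency channel delivers the short urgent message sooner, while the high-bandwidth channel is far better for large volumes over a long period.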

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-10-08T17:20:11.183Z · LW · GW

Ambidistributions

I believe that all or most of the claims here are true, but I haven't written all the proofs in detail, so take it with a grain of salt.

Ambidistributions are mathematical objects that simultaneously generalize infradistributions and ultradistributions. They are useful for representing how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable.

Definition 1: Let $X$ be a compact Polish space. A (crisp) ambidistribution on $X$ is a function $Q: C(X) \to \mathbb{R}$ s.t.

  1. (Monotonicity) For any $f, g \in C(X)$, if $f \leq g$ then $Q(f) \leq Q(g)$.
  2. (Homogeneity) For any $f \in C(X)$ and $\lambda \geq 0$, $Q(\lambda f) = \lambda Q(f)$.
  3. (Constant-additivity) For any $f \in C(X)$ and $c \in \mathbb{R}$, $Q(f + c) = Q(f) + c$.

Conditions 1+3 imply that $Q$ is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post.

The space of all ambidistributions on  will be denoted .[1] Obviously,  (where  stands for (crisp) infradistributions), and likewise for ultradistributions.

Examples

Example 1: Consider compact Polish spaces  and a continuous mapping . We can then define  by

That is,  is the value of the zero-sum two-player game with strategy spaces  and  and utility function .

Notice that  in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.
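As a sanity check on Example 1, here is a minimal sketch in Python for finite spaces. It makes two assumptions. First, it uses a pure-strategy max-min, whereas the definition in the post may intend mixed strategies. Second, it reads the three conditions of Definition 1 in the usual way: f ≤ g implies Q(f) ≤ Q(g), Q(λf) = λQ(f) for λ ≥ 0, and Q(f + c) = Q(f) + c. The sets `X`, `Y`, `Z`, the map `F` and the name `Q` are illustrative choices, not taken from the post.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite stand-ins for the compact spaces in Example 1.
X = [0, 1]         # pure strategies of the maximizing player
Y = [0, 1, 2]      # pure strategies of the minimizing player
Z = [0, 1, 2, 3]   # outcomes

def F(x, y):
    """Illustrative outcome map X x Y -> Z (an arbitrary choice)."""
    return (2 * x + y) % 4

def Q(u):
    """Pure-strategy max-min value of the game with payoff u[F(x, y)]."""
    return max(min(u[F(x, y)] for y in Y) for x in X)

# A small grid of test functions u: Z -> {-1, 0, 2}, in exact arithmetic.
values = [Fraction(-1), Fraction(0), Fraction(2)]
functions = [list(u) for u in product(values, repeat=len(Z))]

def leq(u, v):
    """Pointwise order on functions Z -> Q."""
    return all(a <= b for a, b in zip(u, v))

monotone = all(Q(u) <= Q(v) for u in functions for v in functions if leq(u, v))
homogeneous = all(
    Q([lam * a for a in u]) == lam * Q(u)
    for u in functions
    for lam in (Fraction(0), Fraction(1, 2), Fraction(3))
)
constant_additive = all(
    Q([a + c for a in u]) == Q(u) + c
    for u in functions
    for c in (Fraction(-5), Fraction(7, 3))
)
print(monotone, homogeneous, constant_additive)
```

The check is exhaustive only over the small grid of test functions, so it illustrates the three conditions without proving them.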

Example 2: Let  and  be finite sets representing actions and observations respectively, and  be an infra-Bayesian law. Then, we can define  by

In fact, this is a faithful representation:  can be recovered from .

Example 3: Consider an infra-MDP with finite state set , initial state  and transition infrakernel . We can then define the "ambikernel"  by

Thus, every infra-MDP induces an "ambichain". Moreover:

Claim 1:  is a monad. In particular, ambikernels can be composed. 

This allows us to define

This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.

Claim 2: The following limit always exists:

Legendre-Fenchel Duality

Definition 3: Let  be a convex space and . We say that  occludes  when for any , we have

Here,  stands for convex hull.

We denote this relation . The reason we call this "occlusion" is apparent for the  case.

Here are some properties of occlusion:

  1. For any .
  2. More generally, if  then .
  3. If  and  then .
  4. If  and  then .
  5. If  and  for all , then .
  6. If  for all , and also , then .

Notice that occlusion has similar algebraic properties to logical entailment, if we think of  as " is a weaker proposition than ".

Definition 4: Let  be a compact Polish space. A cramble set[2] over  is  s.t.

  1.  is non-empty.
  2.  is topologically closed.
  3. For any finite  and , if  then . (Here, we interpret elements of  as credal sets.)

Question: If instead of condition 3, we only consider binary occlusion (i.e. require ), do we get the same concept?

Given a cramble set , its Legendre-Fenchel dual ambidistribution is

Claim 3: Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.

Lattice Structure

Functionals

The space  is equipped with the obvious partial order:  when for all  . This makes  into a distributive lattice, with

This is in contrast to  which is a non-distributive lattice.

The bottom and top elements are given by

Ambidistributions are closed under pointwise suprema and infima, and hence  is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.
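The closure and distributivity claims can be spot-checked on a finite set. The sketch below assumes that joins and meets are pointwise maxima and minima (consistent with the closure statement above) and reads the three conditions of Definition 1 as monotonicity, positive homogeneity and constant-additivity. The probability vectors and the helper names `expectation`, `join`, `meet` and `satisfies_axioms` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def expectation(p):
    """Expectation functional u -> sum_i p[i] * u[i] for a probability vector p."""
    return lambda u: sum(pi * ui for pi, ui in zip(p, u))

def join(Q1, Q2):
    """Pointwise maximum of two functionals."""
    return lambda u: max(Q1(u), Q2(u))

def meet(Q1, Q2):
    """Pointwise minimum of two functionals."""
    return lambda u: min(Q1(u), Q2(u))

# Test functions on a 3-point space with values in {-1, 0, 2}.
values = [Fraction(-1), Fraction(0), Fraction(2)]
functions = [list(u) for u in product(values, repeat=3)]

def satisfies_axioms(Q):
    """Check the three conditions on the finite grid of test functions."""
    mono = all(
        Q(u) <= Q(v)
        for u in functions for v in functions
        if all(a <= b for a, b in zip(u, v))
    )
    homo = all(
        Q([lam * a for a in u]) == lam * Q(u)
        for u in functions
        for lam in (Fraction(0), Fraction(1, 2), Fraction(3))
    )
    cadd = all(
        Q([a + c for a in u]) == Q(u) + c
        for u in functions
        for c in (Fraction(-5), Fraction(7, 3))
    )
    return mono and homo and cadd

A = expectation([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
B = expectation([Fraction(0), Fraction(1), Fraction(0)])
C = expectation([Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)])

mixed = join(meet(A, B), C)  # a join of a meet with a plain expectation
closed = all(satisfies_axioms(Q) for Q in (join(A, B), meet(A, B), mixed))
distributive = all(
    meet(A, join(B, C))(u) == join(meet(A, B), meet(A, C))(u) for u in functions
)
print(closed, distributive)
```

Distributivity here is inherited directly from the fact that max and min distribute over each other on the real line.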

 is also a De Morgan algebra with the involution

For  is not a Boolean algebra:  and for any  we have .

One application of this partial order is formalizing the "no traps" condition for infra-MDP:

Definition 2: A finite infra-MDP is quasicommunicating when for any 

Claim 4: The set of quasicommunicating finite infra-MDPs (or even infra-RDPs) is learnable.

Cramble Sets

Going to the cramble set representation,  iff 

 is just , whereas  is the "occlusion hull" of  and .

The bottom and the top cramble sets are

Here,  is the top element of  (corresponding to the credal set ).

The De Morgan involution is

Operations

Definition 5: Given  compact Polish spaces and a continuous mapping , we define the pushforward  by

When  is surjective, there are both a left adjoint and a right adjoint to , yielding two pullback operators :

 

Given  and  we can define the semidirect product  by

There are probably more natural products, but I'll stop here for now.

Polytopic Ambidistributions

Definition 6: The polytopic ambidistributions  are the (incomplete) sublattice of  generated by .

Some conjectures about this:

  • For finite , an ambidistribution  is polytopic iff there is a finite polytope complex  on  s.t. for any cell  of  is affine.
  • For finite , a cramble set  is polytopic iff it is the occlusion hull of a finite set of polytopes in .
  •  and  from Example 3 are polytopic.
  1. ^

    The non-convex shape  reminds us that ambidistributions need not be convex or concave.

  2. ^

    The expression "cramble set" is meant to suggest a combination of "credal set" with "ambi".

Comment by Vanessa Kosoy (vanessa-kosoy) on Applications of Chaos: Saying No (with Hastings Greer) · 2024-09-22T15:49:33.237Z · LW · GW

One reason to doubt chaos theory’s usefulness is that we don’t need fancy theories to tell us something is impossible. Impossibility tends to make itself obvious.

 

This claim seems really weird to me. Why do you think that's true? A lot of things we accomplished with technology today might seem impossible to someone from 1700. On the other hand, you could have thought that e.g. perpetuum mobile, or superluminal motion, or deciding whether a graph is 3-colorable in worst-case polynomial time, or transmitting information at a rate higher than the Shannon-Hartley limit is possible if you didn't know the relevant theory.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2024-09-14T12:55:41.150Z · LW · GW

Here's a sketch of an AIT toy-model theorem stating that, in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.

Consider any environment  of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function

Suppose that there exists a policy  of finite Kolmogorov complexity (i.e. computable) that's optimal for  in the slow discount limit. That is,

Then,  cannot be the only environment with this property. Otherwise, this property could be used to define  using a finite number of bits, which is impossible[1]. Since  requires infinitely many more bits to specify than  and , there have to be infinitely many environments with the same property[2]. Therefore,  is a reinforcement learning algorithm for some infinite class of hypotheses.

Moreover, there are natural examples of  as above. For instance, let's construct  as an infinite sequence of finite communicating infra-RDP refinements that converges to an unambiguous (i.e. "not infra") environment. Since each refinement involves some arbitrary choice, "most" such  have infinite Kolmogorov complexity. In this case,  exists: it can be any learning algorithm for finite communicating infra-RDP with an arbitrary number of states.

Besides making this a rigorous theorem, there are many additional questions for further investigation:

  • Can we make similar claims that incorporate computational complexity bounds? It seems that it should be possible to at least constrain our algorithms to be PSPACE in some sense, but it is not obvious how to go beyond that (maybe it would require the frugal universal prior).
  • Can we argue that  must be an infra-Bayesian learning algorithm? Relatedly, can we make a variant where computable/space-bounded policies can only attain some part of the optimal asymptotic reward of ?
  • The setting we described requires that all the traps in  can be described in a finite number of bits. If this is not the case, can we make a similar sort of an argument that implies  is Bayes-optimal for some prior over a large hypothesis class?
  1. ^

    Probably, making this argument rigorous requires replacing the limit with a particular regret bound. I ignore this for the sake of simplifying the core idea.

  2. ^

    There probably is something more precise that can be said about how "large" this family of environments is. For example, maybe it must be uncountable.

Comment by Vanessa Kosoy (vanessa-kosoy) on AI forecasting bots incoming · 2024-09-10T07:33:13.089Z · LW · GW

Can you explain what your definition of "accuracy" is? (the 87.7% figure)
Does it correspond to some proper scoring rule?

Comment by Vanessa Kosoy (vanessa-kosoy) on AI forecasting bots incoming · 2024-09-10T06:53:28.130Z · LW · GW

(just for fun)

Comment by Vanessa Kosoy (vanessa-kosoy) on AI forecasting bots incoming · 2024-09-10T06:46:19.757Z · LW · GW
Comment by Vanessa Kosoy (vanessa-kosoy) on Why Large Bureaucratic Organizations? · 2024-08-27T19:14:45.963Z · LW · GW

Rings true. Btw, I heard many times people with experience in senior roles making "ha ha only serious" jokes about how obviously any manager would hire more underlings if only you let them. I also feel the pull of this motivation myself, although usually I prefer other kinds of status. (Of the sort "people liking/admiring me" rather than "me having power over people".)

Comment by Vanessa Kosoy (vanessa-kosoy) on You don't know how bad most things are nor precisely how they're bad. · 2024-08-13T10:04:50.930Z · LW · GW

You're ignoring the part where making something cheaper is a real benefit. For example, it's usually better to have a world where everyone can access a thing of slightly lower quality, than a world where only a small elite can access a thing, but the thing is of slightly higher quality.

Comment by Vanessa Kosoy (vanessa-kosoy) on Some Unorthodox Ways To Achieve High GDP Growth · 2024-08-09T09:34:28.476Z · LW · GW

Btw, I mentioned the possibility of cycles that increase GDP before.

Comment by vanessa-kosoy on [deleted post] 2024-08-07T07:06:15.424Z

Yes, my point is that currently subscripts refer to both subenvironments and entries in the action space list. I suggest changing one of these two into superscripts.

Comment by vanessa-kosoy on [deleted post] 2024-08-03T11:18:06.665Z

You can use e.g. subscripts to refer to indices of the action space list and superscripts to refer to indices of the subenvironment list. 

Comment by Vanessa Kosoy (vanessa-kosoy) on Martín Soto's Shortform · 2024-08-03T09:10:13.671Z · LW · GW

I think that some people are massively missing the point of the Turing test. The Turing test is not about understanding natural language. The idea of the test is, if an AI can behave indistinguishably from a human as far as any other human can tell, then obviously it has at least as much mental capability as humans have. For example, if humans are good at some task X, then you can ask the AI to solve the same task, and if it does poorly then it's a way to distinguish the AI from a human.

The only issue is how long the test should take and how qualified the judge. Intuitively, it feels plausible that if an AI can withstand (say) a few hours of drilling by an expert judge, then it would do well even on tasks that take years for a human. It's not obvious, but it's at least plausible. And I don't think existing AIs are especially near to passing this.

Comment by Vanessa Kosoy (vanessa-kosoy) on Martín Soto's Shortform · 2024-08-01T08:39:54.046Z · LW · GW

I don't think embeddedness has much to do with it. And I disagree that it's incompatible with counterfactuals. For example, infra-Bayesian physicalism is fully embedded and has a notion of counterfactuals. I expect any reasonable alternative to have them as well.

Comment by Vanessa Kosoy (vanessa-kosoy) on Martín Soto's Shortform · 2024-07-29T17:11:29.377Z · LW · GW

Maybe the learning algorithm doesn't have a clear notion of "positive and negative", and instead just provides in a same direction (but with different intensities) for different intensities in a scale without origin. (But this seems very different from the current paradigm, and fundamentally wasteful.)

 

Maybe I don't understand your intent, but isn't this exactly the current paradigm? You train a network using the derivative of the loss function. Adding a constant to the loss function changes nothing. So, I don't see how it's possible to have a purely ML-based explanation of where humans consider the "origin" to be.
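The invariance to additive constants can be illustrated with a toy one-parameter example. This is a minimal sketch, assuming plain gradient descent on a quadratic loss. The functions `loss`, `shifted_loss`, `grad` and `descend` are hypothetical names, and exact rational arithmetic is used so that the comparison is not blurred by floating-point rounding.

```python
from fractions import Fraction

def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3) ** 2

def shifted_loss(w, c=Fraction(1000)):
    """The same loss with an arbitrary constant added."""
    return loss(w) + c

def grad(f, w, h=Fraction(1, 1000)):
    """Central finite-difference derivative; the constant cancels in the difference."""
    return (f(w + h) - f(w - h)) / (2 * h)

def descend(f, w0, lr=Fraction(1, 10), steps=20):
    """Plain gradient descent, returning the whole parameter trajectory."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(f, w)
        path.append(w)
    return path

path_original = descend(loss, Fraction(0))
path_shifted = descend(shifted_loss, Fraction(0))
print(path_original == path_shifted)
```

Both runs visit exactly the same parameters, so nothing in the training signal distinguishes one choice of "zero" for the loss from another.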