Posts

Stop-gradients lead to fixed point predictions 2023-01-28T22:47:35.008Z
Proper scoring rules don’t guarantee predicting fixed points 2022-12-16T18:22:23.547Z
Extracting Money from Causal Decision Theorists 2021-01-28T17:58:44.129Z
Moral realism and AI alignment 2018-09-03T18:46:44.266Z
The law of effect, randomization and Newcomb’s problem 2018-02-15T15:31:56.033Z
Naturalized induction – a challenge for evidential and causal decision theory 2017-09-22T08:15:09.999Z
A survey of polls on Newcomb’s problem 2017-09-20T16:50:08.802Z
Invitation to comment on a draft on multiverse-wide cooperation via alternatives to causal decision theory (FDT/UDT/EDT/...) 2017-05-29T08:34:59.311Z
Are causal decision theorists trying to outsmart conditional probabilities? 2017-05-16T08:01:27.426Z
Publication on formalizing preference utilitarianism in physical world models 2015-09-22T16:46:54.934Z
Two-boxing, smoking and chewing gum in Medical Newcomb problems 2015-06-29T10:35:58.162Z
Request for feedback on a paper about (machine) ethics 2014-09-28T12:03:05.500Z

Comments

Comment by Caspar Oesterheld (Caspar42) on Using (Uninterpretable) LLMs to Generate Interpretable AI Code · 2024-09-18T17:22:56.291Z · LW · GW

>I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood.

Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" should be taken to require. It seems to matter, for example, that a human can understand how the given rule/heuristic, or something like it, could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it's not doing anything problematic (?).

Comment by Caspar Oesterheld (Caspar42) on My AI Model Delta Compared To Christiano · 2024-09-14T23:18:33.789Z · LW · GW

To some extent, this is all already in Jozdien's comment, but:

It seems that the closest thing to AIs debating alignment (or providing hopefully verifiable solutions) that we can observe is human debate about alignment (and perhaps also related questions about the future). Presumably John and Paul have similar views about the empirical difficulty of reaching agreement in the human debate about alignment, given that they both observe this debate a lot. (Perhaps they disagree about what people's level of (in)ability to reach agreement / verify arguments implies for the probability of getting alignment right. Let's ignore that possibility...) So I would have thought that even w.r.t. this fairly closely related debate, the disagreement is mostly about what happens as we move from human to superhuman-AI discussants. In particular, I would expect Paul to concede that the current level of disagreement in the alignment community is problematic and to argue that this will improve (enough) if we have superhuman debaters. If even this closely related form of debate/delegation/verification process isn't taken to be very informative (by at least one of Paul and John), then it's hard to imagine that much more distant delegation processes (such as those behind making computer monitors) are very informative to their disagreement.

Comment by Caspar Oesterheld (Caspar42) on Using (Uninterpretable) LLMs to Generate Interpretable AI Code · 2024-08-30T06:54:01.981Z · LW · GW

As once discussed in person, I find this proposal pretty interesting and I think it deserves further thought.

Like some other commenters, I think for many tasks it's probably not tractable to develop a fully interpretable, competitive GOFAI program. For example, I would imagine that for playing chess well, one needs to do things like positively evaluating some random-looking feature of a position just on the basis that empirically this feature is associated with higher win rate. However, the approach of the post could be weakened to allow "mixed" programs that have some not so interpretable aspects, e.g., search + a network for evaluating positions is more interpretable than just a network that chooses moves, a search + sum over feature evals is even more interpretable, and so on.

As you say in the post, there seems to be some analogy between your proposal and interpreting a given network. (For interpreting a given chess-playing network, the above impossibility argument also applies. I doubt that a full interpretation of 3600 elo neural nets will ever exist. There'll always be points where you'd want to ask, "why?", and the answer is, "well, on average this works well...") I think if I wanted to make a case for the present approach, I'd mostly try to sell it as a better version of interpretation.

Here's a very abstract argument. Consider the following two problems:

  • Given a neural net (or circuit or whatever) for a task, generate an interpretation/explanation (whatever that is exactly, could be a "partial" interpretation) of that neural net.
  • Given a neural net for a task, generate a computer program that performs the task roughly as well as the given neural net and an interpretation/explanation for this new program.

Interpretability is the first problem. My variant of your suggestion is that we solve the second problem instead. Solving the second problem seems just as useful as solving the first problem. Solving the second problem is at most as hard as solving the first. (If you can solve the first problem, you automatically solve the second problem.)

So actually all we really need to argue is that getting to (use enormous amounts of LLM labor to) write a new program partly from scratch makes the problem strictly easier. And then it's easy to come up with lots of concrete ideas for cases where it might be easier. For instance, take chess. Imposing the use of a GOFAI search algorithm together with a position-evaluation network increases interpretability relative to just training an end-to-end model. It also doesn't hurt performance. (In fact, my understanding is that the SOTA still uses some GOFAI methods, rather than an end-to-end-trained neural net.) You can think of further ways to hard-code things in a way that simplifies interpretability at small costs to performance. For instance, I'd guess that you can let the LLMs write 1000 different Python functions that detect various features in the position (whether White has the bishop pair, whether White's king has three pawns in front of it, etc.). For chess in particular you could of course also just get these functions from prior work on chess engines. Then you feed these into the neural net that you use for evaluating positions. In return, you can presumably make that network smaller (assuming your features are actually useful) while keeping performance constant. This leaves less work for neural interpretation. How much smaller is an empirical question.
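To make the feature-function idea concrete, here is a minimal sketch of what such hand-written (or LLM-written) feature detectors could look like, assuming the python-chess library; the specific features and the feature-vector interface are illustrative choices of mine, not anything from the post.

```python
# Illustrative sketch only: two interpretable feature detectors of the kind an LLM
# could be asked to generate, assuming the python-chess library is installed.
import chess

def has_bishop_pair(board: chess.Board, color: chess.Color) -> bool:
    """True if the given side has two or more bishops."""
    return len(board.pieces(chess.BISHOP, color)) >= 2

def pawn_shield_count(board: chess.Board, color: chess.Color) -> int:
    """Number of own pawns on the three squares directly in front of the king."""
    king_sq = board.king(color)
    if king_sq is None:
        return 0
    rank = chess.square_rank(king_sq) + (1 if color == chess.WHITE else -1)
    if not 0 <= rank <= 7:
        return 0
    king_file = chess.square_file(king_sq)
    count = 0
    for file in (king_file - 1, king_file, king_file + 1):
        if 0 <= file <= 7 and board.piece_at(chess.square(file, rank)) == chess.Piece(chess.PAWN, color):
            count += 1
    return count

def feature_vector(board: chess.Board) -> list[float]:
    """Concatenate interpretable features; a small evaluation net would take this as input."""
    return [
        float(has_bishop_pair(board, chess.WHITE)) - float(has_bishop_pair(board, chess.BLACK)),
        float(pawn_shield_count(board, chess.WHITE) - pawn_shield_count(board, chess.BLACK)),
    ]
```

The point is just that each feature is individually auditable, so only the (now smaller) evaluation network that sits on top of them remains as a target for neural interpretation.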

Comment by Caspar Oesterheld (Caspar42) on Boycott OpenAI · 2024-06-21T20:43:50.471Z · LW · GW

If all you're using is ChatGPT, then now's a good time to cancel the subscription: GPT-4o seems to be about as powerful as GPT-4, and GPT-4o is available for free.

Comment by Caspar Oesterheld (Caspar42) on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T16:06:21.135Z · LW · GW
  • As one further data point, I also heard people close to/working at Anthropic giving "We won't advance the state of the art."-type statements, though I never asked about specifics.
  • My sense is also that Claude 3 Opus is only slightly better than the best published GPT-4. To add one data point: I happen to be working on a benchmark right now, and on that benchmark Opus is only very slightly better than gpt-4-1106. (See my X/Twitter post for detailed results.) So, I agree with LawrenceC's comment that they're arguably not significantly advancing the state of the art.
  • I suppose even if Opus is only slightly better (or even just perceived to be better) and even if we all expect OpenAI to release a better GPT-4.5 soon, Anthropic could still take a bunch of OpenAI's GPT-4 business with this. (I'll probably switch from ChatGPT-4 to Claude, for instance.) So it's not that hard to imagine an internal OpenAI email saying, "Okay, folks, let's move a bit faster with these top-tier models from now on, lest too many people switch to Claude." I suppose that would already be quite worrying to people here. (Whereas, people would probably worry less if Anthropic took some of OpenAI's business by having models that are slightly worse but cheaper or more aligned/less likely to say things you wouldn't want models to say in production.)

Comment by Caspar Oesterheld (Caspar42) on AI things that are perhaps as important as human-controlled AI · 2024-03-03T18:39:26.241Z · LW · GW

>In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors.

I'd be interested in why one might think this to be true. (I only did a very superficial ctrl+f on Lukas' post -- sorry if that post addresses this question.) I'd think that there are lots of dimensions of variation and that within these, AIs could assume a continuous range of values. (If AI training mostly works by training to imitate human data, then one might imagine that (assuming inner alignment) they'd mostly fall within the range of human variation. But I assume that's not what you mean.)

Comment by Caspar Oesterheld (Caspar42) on How LLMs are and are not myopic · 2023-12-08T20:30:11.481Z · LW · GW

>This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

>Are you claiming this would happen even given infinite capacity?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

Comment by Caspar Oesterheld (Caspar42) on How LLMs are and are not myopic · 2023-12-08T19:48:10.022Z · LW · GW

Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).

Say you have a bunch of triplets $(x, y, z)$. You want to train a model that predicts $y$ from $x$ and $z$ from $(x, y)$.

Your model consists of three components: $f$, $g_1$, $g_2$. It makes predictions as follows:

$\hat{y} = g_1(f(x))$

$\hat{z} = g_2(f(x), y)$

(Why have such a model? Why not have two completely separate models, one for predicting $y$ and one for predicting $z$? Because it might be more efficient to use a single $f$ both for predicting $y$ and for predicting $z$, given that both predictions presumably require "interpreting" $x$.)

So, intuitively, it first builds an "inner representation" (embedding) of $x$. Then it sequentially makes predictions based on that inner representation.

Now you train $f$ and $g_1$ to minimize the prediction loss on the $(x, y)$ parts of the triplets. Simultaneously you train $g_2$ to minimize prediction loss on the full $(x, y, z)$ triplets. For example, you update $f$ and $g_1$ with the gradients

$\nabla_{f, g_1} L(\hat{y}, y)$

and you update $f$ and $g_2$ with the gradients

$\nabla_{f, g_2} L(\hat{z}, z)$.

(The $y$ here is the "true" $y$, not one generated by the model itself.)
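Here is a minimal PyTorch sketch of this training setup; the dimensions, losses and optimizer are arbitrary choices of mine, and the only purpose is to make explicit which gradients flow where.

```python
# Minimal sketch of the toy setup above (arbitrary shapes; PyTorch assumed).
import torch
import torch.nn as nn

f  = nn.Linear(8, 16)       # shared "interpretation" of x
g1 = nn.Linear(16, 4)       # predicts y from f(x)
g2 = nn.Linear(16 + 4, 4)   # predicts z from f(x) and the *true* y

opt = torch.optim.SGD(
    list(f.parameters()) + list(g1.parameters()) + list(g2.parameters()), lr=1e-2
)

def training_step(x, y, z):
    h = f(x)
    y_hat = g1(h)
    z_hat = g2(torch.cat([h, y], dim=-1))        # teacher forcing: g2 sees the true y, not y_hat
    loss_y = nn.functional.mse_loss(y_hat, y)    # its gradient reaches f and g1 only
    loss_z = nn.functional.mse_loss(z_hat, z)    # its gradient reaches f and g2 only (not g1)
    opt.zero_grad()
    (loss_y + loss_z).backward()
    opt.step()
```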

This training pressures $g_1$ to be myopic in the second and third sense described in the post. In fact, even if we were to train $g_2$ with the $\hat{y}$ predicted by $g_1$ rather than the true $y$, $g_1$ is pressured to be myopic.

  • Type 3 myopia: Training doesn't pressure $g_1$ to output something that makes the $z$ follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data $y = 0$ implies $z = 0$, while under $y = 1$, $z$ follows some distribution that depends in complicated ways on $x$. Then $g_1$ will not try to predict $y = 0$ more often.
  • Type 2 myopia: $g_1$ won't try to provide useful information to $g_2$ in its output, even if it could. For example, imagine that the $y$s and $z$s are strings representing real numbers. Imagine that $x$ is always a natural number, that $y$ is the $x$-th Fibonacci number and $z$ is, say, the $2x$-th Fibonacci number. Imagine further that the model representing $g_1$ is large enough to compute the $2x$-th Fibonacci number, while the model representing $g_2$ is not. Then one way in which one might think one could achieve low predictive loss would be for $g_1$ to output the $x$-th Fibonacci number and then encode, for example, the $2x$-th Fibonacci number in its decimal digits. And then $g_2$ reads the $2x$-th Fibonacci number off those decimal digits. But the above training will not give rise to this strategy, because $g_2$ gets the true $y$ as input, not the one produced by $g_1$. Further, even if we were to change this, there would still be pressure against this strategy because $g_1$ (unlike $f$) is not optimized to give useful information to $g_2$. (The gradient used to update $g_1$ doesn't consider the loss on predicting $z$.) If it ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information to get higher prediction accuracy on $y$.

Of course, $g_1$ still won't be pressured to be type-1-myopic. If predicting $y$ requires predicting $z$, then $g_1$ will be trained to predict ("plan") $z$.

(Obviously, $g_2$ is pressured to be myopic in this simple model.)

Now what about $f$? Well, $f$ is optimized both to enable predicting $y$ from $x$ and predicting $z$ from $(x, y)$. Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $y$ to achieve better accuracy on $z$. In particular, we should expect trained models to find an efficient tradeoff between accuracy on $y$ and accuracy on $z$. When $y$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $z$.

So, $f$ is not "Type 2" myopic. Or perhaps put differently: the calculations going into predicting $y$ aren't optimized purely for predicting $y$.

However, $f$ is still "Type 3" myopic. Because the prediction made by $g_1$ isn't fed (in training) as an input to $g_2$ or the loss, there's no pressure towards making $f$ influence the output of $g_1$ in a way that has anything to do with $z$. (In contrast to the myopia of $g_1$, this really does hinge on not using $\hat{y}$ in training. If $\hat{y}$ mattered in training, then there would be pressure for $f$ to trick $g_1$ into performing calculations that are useful for predicting $z$. Unless you use stop-gradients...)

* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.

Comment by Caspar Oesterheld (Caspar42) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-25T00:06:21.000Z · LW · GW

At least in this case (celebrities and their largely unknown parents), I would predict the opposite. That is, people are more likely to be able to correctly answer "Who is Mary Lee Pfeiffer's son?" than "Who is Tom Cruise's mother?" Why? Because there are lots of terms / words / names that people can recognize passively but not produce. Since Mary Lee Pfeiffer is not very well known, I think Mary Lee Pfeiffer will be recognizable but not producible for lots of people. (Of people who know Mary Lee Pfeiffer in any sense, I think the fraction of people who can only recognize her name is high.) As another example, I think "Who was born in Ulm?" might be answered correctly by more people than "Where was Einstein born?", even though "Einstein was born in Ulm" is a more common sentence for people to read than "Ulm is the city that Einstein was born in".

If I had to run an experiment to test whether similar effects apply in humans, I'd probably try to find cases where A and B in and of themselves are equally salient but the association A -> B is nonetheless more salient than the association B -> A. The alphabet is an example of this (where the effect is already confirmed).

Comment by Caspar Oesterheld (Caspar42) on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-23T21:25:20.422Z · LW · GW

I mean, translated to algorithmic description land, my claim was: It's often difficult to prove a negative and I think the non-existence of a short algorithm to compute a given object is no exception to this rule. Sometimes someone wants to come up with a simple algorithm for a concept for which I suspect no such algorithm to exist. I usually find that I have little to say and can only wait for them to try to actually provide such an algorithm.

So, I think my comment already contained your proposed caveat. ("The concept has K complexity at least X" is equivalent to "There's no algorithm of length <X that computes the concept.")

Of course, I do not doubt that it's in principle possible to know (with high confidence) that something has high description length. If I flip a coin n times and record the results, then I can be pretty sure that the resulting binary string will take at least ~n bits to describe. If I see the graph of a function and it has 10 local minima/maxima, then I can conclude that I can't express it as a polynomial of degree 10 or less. And so on.
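For the coin-flip example, the underlying counting argument can be made precise (this is a standard fact, not something from the original comment):

$$\Pr_{x \sim \{0,1\}^n}\!\left[ K(x) < n - c \right] \;\le\; \frac{2^{\,n-c} - 1}{2^{\,n}} \;<\; 2^{-c},$$

since there are fewer than $2^{\,n-c}$ programs shorter than $n - c$ bits.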

Comment by Caspar Oesterheld (Caspar42) on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-22T08:15:50.072Z · LW · GW

I think I sort of agree, but...

It's often difficult to prove a negative and I think the non-existence of a crisp definition of any given concept is no exception to this rule. Sometimes someone wants to come up with a crisp definition of a concept for which I suspect no such definition to exist. I usually find that I have little to say and can only wait for them to try to actually provide such a definition. And sometimes I'm surprised by what people can come up with. (Maybe this is the same point that Roman Leventov is making.)

Also, I think there are many different ways in which concepts can be crisp or non-crisp. I think cooperation can be made crisp in some ways and not in others.

For example, I do think that (in contrast to human values) there are approximate characterizations of cooperation that are useful, precise and short. For example: "Cooperation means playing Pareto-better equilibria."

One way in which I think cooperation isn't crisp, is that you can give multiple different sensible definitions that don't fully agree with each other. (For example, some definitions (like the above) will include coordination in fully cooperative (i.e., common-payoff) games, and others won't.) I think in that way it's similar to comparing sets by size, where you can give lots of useful, insightful, precise definitions that disagree with each other. For example, bijection, isomorphism, and the subset relationship can each tell us when one set is larger than or as large as another, but they sometimes disagree and nobody expects that one can resolve the disagreement between the concepts or arrive at "one true definition" of whether one set is larger than another.

When applied to the real world rather than rational agent models, I would think we also inherit fuzziness from the application of the rational agent model to the real world. (Can we call the beneficial interaction between two cells cooperation? Etc.)

Comment by Caspar Oesterheld (Caspar42) on Conditions for Superrationality-motivated Cooperation in a one-shot Prisoner's Dilemma · 2023-07-07T21:54:42.278Z · LW · GW

I guess we have talked about this a bunch last year, but since the post has come up again...

>It then becomes clear what the requirements are besides “I believe we have compatible DTs” for Arif to believe there is decision-entanglement:
>
>“I believe we have entangled epistemic algorithms (or that there is epistemic-entanglement[5], for short)”, and
>“I believe we have been exposed to compatible pieces of evidence”.

I still don't understand why it's necessary to talk about epistemic algorithms and their entanglement as opposed to just talking about the beliefs that you happen to have (as would be normal in decision theory and game theory).

Say Alice has epistemic algorithm A with inputs x that gives rise to beliefs b and Bob has a completely different [ETA: epistemic] algorithm A' with completely different inputs x' that happens to give rise to beliefs b as well. Alice and Bob both use decision algorithm D to make decisions. Part of b is the belief that Alice and Bob have the same beliefs and the same decision algorithm. It seems that Alice and Bob should cooperate. (If D is EDT/FDT/..., they will cooperate.) So it seems that the whole A,x,A',x' stuff just doesn't matter for what they should do. It only matters what their beliefs are. My sense from the post and past discussions is that you disagree with this perspective and that I don't understand why.

(Of course, you can talk about how in practice, arriving at the right kind of b will typically require having similar A, A' and similar x, x'.)

(Of course, you need to have some requirement to the extent that Alice can't modify her beliefs in such a way that she defects but that she doesn't (non-causally) make it much more likely that Bob also defects. But I view this as an assumption about decision-theoretic not epistemic entanglement: I don't see why an epistemic algorithm (in the usual sense of the word) would make such self-modifications.)

Comment by Caspar Oesterheld (Caspar42) on GPT-4 · 2023-06-23T06:36:29.385Z · LW · GW

Three months later, I still find that:
a) Bing Chat has a lot of issues that the ChatGPTs (both 3.5 or 4) don't seem to suffer from nearly as much. For example, it often refuses to answer prompts that are pretty clearly harmless.
b) Bing Chat has a harder time than I expected when answering questions that you can answer by copy-and-pasting the question into Google and then copy-and-pasting the right numbers, sentence or paragraph from the first search result. (Meanwhile, I find that Bing Chat's search still works better than the search plugins for ChatGPT 4, which seem to still have lots of mundane technical issues.) Occasionally ChatGPT (even ChatGPT 3.5) gives better (more factual or relevant) answers "from memory" than Bing Chat gives by searching.

However, when I pose very reasoning-oriented tasks to Bing Chat (i.e., tasks that mostly aren't about searching on Google) (and Bing Chat doesn't for some reason refuse to answer and doesn't get distracted by unrelated search results it gets), it seems clear that Bing Chat is more capable than ChatGPT 3.5, while Bing Chat and ChatGPT 4 seem similar in their capabilities. I pose lots of tasks that (in contrast to variants of Monty Hall (which people seem to be very interested in), etc.) I'm pretty sure aren't in the training data, so I'm very confident that this improvement isn't primarily about memorization. So I totally buy that people who asked Bing Chat the right questions were justified in being very confident that Bing Chat is based on a newer model than ChatGPT 3.5.

Also:
>I've tried (with little success) to use Bing Chat instead of Google Search.
I do now use Bing Chat instead of Google Search for some things, but I still think Bing Chat is not really a game changer for search itself. My sense is that Bing Chat doesn't/can't comb through pages and pages of different documents to find relevant info and that it also doesn't do one search to identify relevant search terms for a second search, etc. (Bing Chat seems to be restricted to a few (three?) searches per query.) For the most part it seems to enter obvious search terms into Bing Search and then give information based on the first few results (even if those don't really answer the question or are low quality). The much more important feature from a productivity perspective is the processing of the information it finds, such as turning the information on some given webpage into a BibTeX entry or applying some method from Stack Exchange to the particularities of one's code.

Comment by Caspar Oesterheld (Caspar42) on Language Models can be Utility-Maximising Agents · 2023-06-20T23:37:08.318Z · LW · GW

Very interesting post! Unfortunately, I found this a bit hard to understand because the linked papers don’t talk about EDT versus CDT or scenarios where these two come apart and because both papers are (at least in part) about sequential decision problems, which complicates things. (CDT versus EDT can mostly be considered in the case of a single decision and there are various complications in multi-decision scenarios, like updatelessness.)

Here’s an attempt at trying to describe the relation of the two papers to CDT and EDT, including prior work on these topics. Please correct me if I’m misunderstanding anything! The writing is not very polished -- sorry!

Ignoring all the sequential stuff, my understanding is that the first paper basically does this: First, we train a model to predict utilities after observing actions, i.e., make predictions conditional on actions. So in particular, we get a function a ---> E[utility | a] that maps an observed action by the agent onto a prediction of future reward/utility. Then if we use some procedure to find the action a that maximizes E[utility | a], it seems that we have an EDT agent. I think this is essentially the case of an “EDT overseer” who rewards based on actions (rather than outcomes) in “Approval-directed agency and the decision theory of Newcomb-like problems”. Also see the discussion of Obstacle 1 in "Two Major Obstacles for Logical Inductor Decision Theory".

Now what could go wrong with this? I think in some sense the problem is generally that it's unclear how the predictive model works, or where it comes from. The second paper (the DeepMind one) basically points out one issue with this. Other issues are known to this community. I’ll start with an issue that has been known to this community: the 5 and 10 problem / the problem of counterfactuals. If the agent always (reliably) chooses the action a that maximizes E[utility | a], then the predictive model’s counterfactual predictions (i.e., predictions for all other actions) could be nonsensical without being strictly speaking wrong. So for example, in 5 and 10, you choose between a five dollar bill and a ten dollar bill. (There’s no catch and you should clearly just take the ten dollar bill.) The model predicts that if you take the five dollar bill, you will get five dollars, and (spuriously / intuitively falsely) that if you take the ten dollar bill, you get nothing. Because you are maximizing expected utility according to this particular predictive model, you take the five dollars. So the crazy prediction for what happens if you take the ten dollars is never falsified.
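Here is a tiny illustration of that failure mode (my own toy code, not from either paper): if the agent always takes the predicted-best action, a spurious prediction about the untaken action is never tested.

```python
# Toy illustration: a spurious conditional prediction persists because the
# agent deterministically avoids the action it (wrongly) predicts to be bad.
predicted_utility = {"take_5": 5.0, "take_10": 0.0}   # spurious prediction for "take_10"
true_utility      = {"take_5": 5.0, "take_10": 10.0}

for _ in range(100):
    action = max(predicted_utility, key=predicted_utility.get)  # always "take_5"
    observed = true_utility[action]
    # only the prediction for the chosen action is ever tested and updated:
    predicted_utility[action] += 0.1 * (observed - predicted_utility[action])

print(predicted_utility)  # the wrong prediction for "take_10" is never corrected
```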

In non-Newcomb-like scenarios, a simple, extremely standard solution to this problem is to train the predictive model (the thing that gives a ---> E[utility | a]) while the agent follows some policy that randomizes over all actions (perhaps one that takes actions with probabilities in proportion to the model's predictions E[utility | a]). My understanding is that this is how the first paper avoids these issues and gives good results. Unfortunately, in Newcomb-like problems these approaches tend to lead to pretty CDT-ish behavior, as shown in "Reinforcement Learning in Newcomblike Environments".

Anyway, the second paper (the DeepMind one) points out another issue related to where the E[utility | action] model comes from. Roughly, the story — which I think is very well described in Section 2 — seems to be the following: the E[utility | action] model is trained on the actions of an expert who knows whether X=1,2 and acts on that fact by choosing A=X; then the E[utility | action] model won't work for a non-expert agent, i.e., one who doesn’t observe X. I view this as a distributional shift issue — you train a model (the a ---> E[utility | a] one) in a setting where A=X, and then you apply it in a setting where sometimes A and X are uncorrelated.

It’s also similar to the Smoking Lesion/medical Newcomb-like problems! Consider the following medical Newcomb-like problem: First we learn the fact that sick people go to the doctor and healthy people don’t go to the doctor. Then without looking at how healthy I am, I don’t go to the doctor so as to gain evidence that I am healthy. Arguably what goes wrong here is also that I’m using a rule for prediction out of distribution on someone who doesn’t look at whether they’re sick. I think it relates to one of the least challenging versions of medical Newcomb-like problems and it’s handled comfortably by the so-called tickle defense.

Interlude: The paper talks about how this relates to hallucination in LLMs. So what’s that about? IIUC, the idea is that when generating text, LLMs incorrectly update based on the text they generate themselves. For example, imagine that you want an LLM to generate ten tokens. Then after generating the first nine tokens $x_1, \ldots, x_9$, it will predict the tenth token from its learned distribution $P(x_{10} \mid x_1, \ldots, x_9)$. But this distribution was trained on fully human-written, not LLM-written, text. So (in my way of thinking), $P(x_{10} \mid x_1, \ldots, x_9)$ might do poorly (i.e., not give a human-like continuation of $x_1, \ldots, x_9$), because it was trained on seeing nine tokens created by a human and having to predict a continuation by a human, rather than nine tokens created by itself/an LLM and having to predict a continuation by a human. For example, we might imagine that if $x_1, \ldots, x_9$ are words that only a human expert confident in a particular claim C would say, then the LLM will predict continuations that confidently defend claim C, even if the LLM doesn’t know anything about C. I'm not sure I really buy this explanation of hallucination. I think the claim would need more evidence than the authors provide. But it's definitely a very interesting point.

Now, back to the original toy model. Again, I would view this as a distribution shift problem. If we make some assumptions, though, we can infer/guess a model (i.e. function a ---> E[utility | a]) that predicts the utility obtained by a non-expert, i.e., an agent who doesn't observe X. Specifically, let’s assume that we are told the conditional distributions P(utility | X=1, A=0) and P(utility | X=0, A=1) (which we never see in training if the agent in training always knows and acts on X). Let’s also assume that we know that the difference between the training distribution and the new setting is that in the new setting the agent chooses A independently of X. Then in the new model we just need to make X and A independent and change nothing else. Formally you use the new distribution P’(X,U|A) = P(X)P(U|A,X), where the Ps on the right-hand side are just the old distribution, instead of P(X,U|A) = P(X|A)P(U|A,X).

It turns out that if we put the original distribution into a causal graph with edges X->A and A->U and X->U and then make a do-intervention on A (a la Pearl), then we get this exact distribution, i.e., P(X,U|do(A)) = P’(X,U|A). (Intuitively, removing the inference from A to X is exactly what the do(A) does if A's parent is X.) So in particular maximizing E[U | do(A)] gives the same result as maximizing E’[U|A]. Anyway, the paper uses the do operator to construct the new predictor, rather than the above argument. They seem to claim that the causal structure (or reasoning about causality) is necessary to construct the new predictor, with which I disagree.
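A toy numeric illustration of this point (my own example and numbers, not from the paper): conditioning on A uses P(X | A) learned from the expert's behavior, while the do-intervention, i.e., the "new setting" distribution P', uses the marginal P(X).

```python
# Toy check: E[U | do(A=a)] vs. E[U | A=a] for the graph X -> A, X -> U, A -> U.
P_X = {1: 0.5, 0: 0.5}
P_A_given_X = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.1, 0: 0.9}}         # expert mostly sets A = X
E_U_given_AX = {(a, x): (10.0 if a == x else 0.0) for a in (0, 1) for x in (0, 1)}

def E_U_do(a):        # E[U | do(A=a)] = sum_x P(x) E[U | a, x]
    return sum(P_X[x] * E_U_given_AX[(a, x)] for x in (0, 1))

def E_U_cond(a):      # E[U | A=a]    = sum_x P(x | a) E[U | a, x]
    pa = sum(P_X[x] * P_A_given_X[x][a] for x in (0, 1))
    return sum(P_X[x] * P_A_given_X[x][a] / pa * E_U_given_AX[(a, x)] for x in (0, 1))

print(E_U_do(1), E_U_cond(1))   # 5.0 vs. 9.0: the conditional estimate is wrong for a non-expert
```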

Is this really CDT? I’m not sure… In the above type of case, this doesn’t come apart from EDT. If we buy that their scenario is a bit like a Smoking Lesion, then one could argue that part of the point of CDT is to solve this type of scenario. (In some sense my response is as in most versions of the Smoking Lesion: Because of the tickle defense, EDT applied properly gets this right anyway, so there’s actually nothing to fix here.) In my view it’s basically just about using the do-calculus to concisely specify the scenario P’ (based on P plus a particular causal graph for P). It seems that one can do these things without being committed to using do(A) in a scenario where there’s some non-causal dependence between A and U (that doesn't disappear outside of training), perhaps via some common cause Y. In any case, the paper doesn’t tell us how to distinguish between U <- Y -> A and A -> Y -> U — all causal relationships are assumed. So while nominally they construct their predictor as E[U | do(A)], it’s a bit unclear how wedded they are to CDT.

Anyway, with a (maybe-causalist) E[U | do(A)] in hand, we can of course build a (maybe-)CDT agent by choosing a to maximize E[U | do(A)]. But I think the paper doesn’t say anything about where to get the causal model from that gives us E[U | do(A)]. They pretty much assume that the model is provided.

I think the “counterfactual teaching” stuff doesn’t really say anything about CDT versus EDT, either. IIUC the basic idea is this. Imagine you want to train an LLM and you want to prevent the issue above. Then intuitively — in my distribution shift view — what we need to do is just train the LLM to make a good prediction of $x_{10}$ upon observing tokens $x_1, \ldots, x_9$ that were generated by itself (rather than humans). The simplest, most obvious way to do this is to let the LLM generate some tokens $x_1, \ldots, x_9$, then get a probabilistic prediction about the next token from the LLM and then ask a human to give a next token $x_{10}$. The loss of the LLM is just the, e.g., log loss of its prediction against the $x_{10}$ provided by the human. One slightly tricky point here is that we only train the LLM to make good predictions on $x_{10}$. We don’t want to train it to output $x_1, \ldots, x_9$ that make $x_{10}$ easier to predict. So we need to be careful to choose the right gradient. I think that’s basically all they’re doing, though. It doesn’t seem like there’s anything causalist here.
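Here is a rough PyTorch sketch of that scheme as I understand it (the toy model and hyperparameters are mine, not the paper's, and `get_human_token` is a hypothetical stand-in for querying a human): the model generates the prefix itself, no gradient flows through the generation, and the loss is only the log loss on the human-provided next token.

```python
# Rough sketch of the training scheme described above (toy model, PyTorch assumed).
import torch
import torch.nn as nn

V, H = 50, 32                      # vocab size, hidden size
embed = nn.Embedding(V, H)
rnn   = nn.GRU(H, H, batch_first=True)
head  = nn.Linear(H, V)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def next_token_logits(prefix):                        # prefix: LongTensor of shape (1, t)
    out, _ = rnn(embed(prefix))
    return head(out[:, -1])                           # logits for the next token

def training_step(get_human_token):
    prefix = torch.zeros(1, 1, dtype=torch.long)      # start token
    with torch.no_grad():                             # no gradient through generation:
        for _ in range(9):                            # the model writes x_1..x_9 itself
            probs = torch.softmax(next_token_logits(prefix), dim=-1)
            nxt = torch.multinomial(probs, 1)
            prefix = torch.cat([prefix, nxt], dim=1)
    logits = next_token_logits(prefix)                # prediction for x_10
    x10 = get_human_token(prefix)                     # human-provided continuation, shape (1,)
    loss = nn.functional.cross_entropy(logits, x10)   # log loss on x_10 only
    opt.zero_grad(); loss.backward(); opt.step()
```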

So, in conclusion: While very interesting, I don't think these papers tell us anything new about how to build an EDT or a CDT agent.

Comment by Caspar Oesterheld (Caspar42) on On the Apple Vision Pro · 2023-06-15T20:09:36.598Z · LW · GW

Nice overview! I mostly agree.

>What I do not expect is something I’d have been happy to pay $500 or $1,000 for, but not $3,500. Either the game will be changed, or it won’t be changed quite yet. I can’t wait to find out.

From context, I assume you're saying this about the current iteration?

I guess willingness to pay for different things depends on one's personal preferences, but here's an outcome that I find somewhat likely (>50%):

  • The first-gen Apple Vision Pro will not be very useful for work, aside from some niche tasks.
    • It seems that to be better than a laptop for working at a coffee shop or something they need to have solved ~10 different problems extremely well and my guess is that they will have failed to solve one of them well enough. For example, I think comfort/weight alone has a >30% probability of making this less enjoyable to work with (for me at least) than with a laptop, even if all other stuff works fairly well.
    • Like you, I'm sometimes a bit puzzled by what Apple does. So I could also imagine that Apple screws up something weird that isn't technologically difficult. For example, the first version of iPad OS was extremely restrictive (no multitasking/splitscreen, etc.). So even though the hardware was already great, it was difficult to use it for anything serious and felt more like a toy. Based on what they emphasize on the website, I could very well imagine that they won't focus on making this work and that there'll be some basic, obvious issue like not being able to use a mouse. If Apple had pitched this more in the way that Spacetop was pitched, I'd be much more optimistic that the first gen will be useful for work.
  • The first-gen Apple Vision Pro will still produce lots of extremely interesting experiences that many people would be happy to pay, say, $1,000 for, but not $3,500 and definitely not much more than $3,500. For example, I think all the reviews I've seen have described the experience as very interesting, intense and immersive. Let's say this novelty value wears off after something like 10 hours. Then a family of four gets 40 hours of fun out of it. Say you're happy to spend on the order of $10 per hour per person for a fun new experience (that's roughly what you'd spend to go to the movie theater, for example); then that'd be a willingness to pay in the hundreds of dollars.

Comment by Caspar Oesterheld (Caspar42) on On the Apple Vision Pro · 2023-06-15T19:35:55.094Z · LW · GW

>All accounts agree that Apple has essentially solved issues with fit and comfort.

Besides the 30min point, is it really true that all accounts agree on that? I definitely remember reading in at least two reports something along the lines of, "clearly you can't use this for hours, because it's too heavy". Sorry for not giving a source!

Comment by Caspar Oesterheld (Caspar42) on Why "AI alignment" would better be renamed into "Artificial Intention research" · 2023-06-15T19:21:23.462Z · LW · GW

Do philosophers commonly use the word "intention" to refer to mental states that have intentionality, though? For example, from the SEP article on intentionality:

>intention and intending are specific states of mind that, unlike beliefs, judgments, hopes, desires or fears, play a distinctive role in the etiology of actions. By contrast, intentionality is a pervasive feature of many different mental states: beliefs, hopes, judgments, intentions, love and hatred all exhibit intentionality.

(This is specifically where it talks about how intentionality and the colloquial meaning of intention must not be confused, though.)

Ctrl+f-ing through the SEP article gives only one mention of "intention" that seems to refer to intentionality. ("The second horn of the same dilemma is to accept physicalism and renounce the 'baselessness' of the intentional idioms and the 'emptiness' of a science of intention.") The other few mentions of "intention" seem to talk about the colloquial meaning. The article seems to generally avoid the word "intention". Instead, it generally uses "intentional" and "intentionality".

Incidentally, there's also an SEP article on "intention" that does seem to be about what one would think it to be about. (E.g., the first sentence of that article: "Philosophical perplexity about intention begins with its appearance in three guises: intention for the future, as I intend to complete this entry by the end of the month; the intention with which someone acts, as I am typing with the further intention of writing an introductory sentence; and intentional action, as in the fact that I am typing these words intentionally.")

So as long as we don't call it "artificial intentionality research" we might avoid trouble with the philosophers after all. I suppose the word "intentional" becomes ambiguous, however. (It is used >100 times in both SEP articles.)

Comment by Caspar Oesterheld (Caspar42) on MetaAI: less is less for alignment. · 2023-06-15T17:28:58.945Z · LW · GW

>They could have turned their safety prompts into a new benchmark if they had ran the same test on the other LLMs! This would've taken, idk, 2–5 hrs of labour?

I'm not sure I understand what you mean by this. They ran the same prompts with all the LLMs, right? (That's what Figure 1 is...) Do you mean they should have tried the finetuning on the other LLMs as well? (I've only read your post, not the actual paper.) And how does this relate to turning their prompts into a new benchmark?

Comment by Caspar Oesterheld (Caspar42) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-06-15T07:54:05.640Z · LW · GW

>I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Sorry if I was cryptic! Yes, it's basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because "quasi-strictly proper scoring rule w.r.t. the max decision rule" is a mouthful. :-P) Does that help?

>much safer than having it effectively chosen for them by their specification of a utility function

So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function, not more! But on to the main point...

>I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

Let's grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max decision rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other.

Maybe put differently: I still don't know what property it is that you think favors zero-sum conditional prediction over proper decision scoring rules. I don't think it can be not wanting to specify your utility function / not wanting the agent to pick actions based on their model of your utility function / wanting to instead choose yourself based on reported distributions, because both methods can be used in this way. Also, note that in both methods the predictors in practice have incentives that are determined by (their beliefs about) the human's values. For example, in zero-sum conditional prediction, each predictor is incentivized to run computations to evaluate actions that it thinks could potentially be optimal w.r.t. human values, and not incentivized to think about actions that it confidently thinks are suboptimal. So for example, if I have the choice between eating chocolate ice cream, eating strawberry ice cream and eating mud, then the predictor will reason that I won't choose to eat mud and that therefore its prediction about mud won't be evaluated. Therefore, it will probably not think much about what it will be like if I eat mud (though it has to think about it a little to make sure that the other predictor can't gain by recommending mud eating).

On whether the property is desirable [ETA: I here mean the property: [human chooses based on reported distribution] but not compared to [explicitly specifying a utility function]]: Perhaps my objection is just what you mean by ELK. In any case, I think my views depend a bit on how we imagine lots of different aspects of the overall alignment scheme. One important question, I think, is how exactly we imagine the human to "look at" the distributions, for example. But my worry is that (similar to RLHF) letting the human evaluate distributions rather than outcomes increases the predictors' incentives to deceive the human. The incentive is to find actions whose distribution looks good (in whatever format you represent the distribution) relative to the other distributions, not actions whose distribution actually is good. Given that the distributions are so large (and less importantly because humans have lots of systematic, exploitable irrationalities related to risk), I would think that human judgment of single outcomes/point distributions is much better than human judgment of full distributions.

Comment by Caspar Oesterheld (Caspar42) on «Boundaries», Part 2: trends in EA's handling of boundaries · 2023-05-30T05:12:00.438Z · LW · GW

Minor typos:

Freedman

I think it's "Freeman"?

Cotton-Baratt 2022

And "Cotton-Barratt" with two rs.

Comment by Caspar Oesterheld (Caspar42) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-29T17:46:05.712Z · LW · GW

>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it.

Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument.

So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor's conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don't need to specify your utility function.

One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome -- the one obtained after choosing the recommended action -- and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn't this much harder? (If you're doing expected utility maximization -- doesn't your approach entail assigning probabilities to all hypothetical outcomes?) In realistic settings, these outcome distributions are huge objects!

Comment by Caspar Oesterheld (Caspar42) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-29T17:22:46.393Z · LW · GW

The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome distributions.

Let's say that you have some arbitrary (potentially wacky, discontinuous) function V that maps a distribution over outcomes onto a real value representing how much you like the distribution over outcomes. Then you can do zero-sum competition as normal and select the action for which V is highest (as usual with "optimism bias", i.e., if the two predictors make different predictions for an action a, then take the maximum of the Vs of the two predictions). This should still be incentive compatible and result in taking the action that is best in terms of V applied to the predictors' beliefs.

(Of course, one could have even crazier preferences. For example, one's preferences could just be a function that takes as input a set of distributions and selects one distribution as its favorite. But I think if this preference function is intransitive, doesn't satisfy independence of irrelevant alternatives and the like, it's not so clear whether the proposed approach still works. For example, you might be able to slightly misreport some option that will not be taken anyway in such a way as to ensure that the decision maker ends up taking a different action. I don't think this is ever strictly incentivized. But it's not strictly disincentivized to do this.)

Interestingly, if V is a strictly convex function over outcome distributions (why would it be? I don't know!), then you can strictly incentivize a single predictor to report the best action and honestly report the full distribution over outcomes for that action! Simply use the scoring rule $S(\hat{p}, o) = V(\hat{p}) + \langle \nabla V(\hat{p}), \delta_o - \hat{p} \rangle$, whose expected value under the true distribution is $V(\hat{p}) + \langle \nabla V(\hat{p}), p - \hat{p} \rangle$, where $\hat{p}$ is the reported distribution for the recommended action, $p$ is the true distribution of the recommended action, $\delta_o$ is the point distribution on the observed outcome $o$, and $\nabla V(\hat{p})$ is a subderivative of $V$ at $\hat{p}$. Because a proper scoring rule is used, the expert will be incentivized to report $\hat{p} = p$ and thus gets an expected score of $V(p)$, where $p$ is the distribution of the recommended action. So it will recommend the action $a$ whose associated distribution maximizes $V$. It's easy to show that if $V$ -- the function saying how much you like different distributions -- is not strictly convex, then you can't construct such a scoring rule. If I recall correctly, these facts are also pointed out in one of the papers by Chen et al. on this topic.
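For what it's worth, the reason this works is just the subgradient inequality (a standard argument, written here in my notation):

$$\mathbb{E}_{o \sim p}\!\left[ S(\hat{p}, o) \right] \;=\; V(\hat{p}) + \langle \nabla V(\hat{p}),\, p - \hat{p} \rangle \;\le\; V(p),$$

with equality exactly at $\hat{p} = p$ when $V$ is strictly convex, so truthful reporting is uniquely optimal and the expert's expected score for recommending action $a$ is $V(p_a)$.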

I don't find this very important, because I find expected utility maximization w.r.t. the predictors' beliefs much more plausible than anything else. But if nothing else, this difference further shows that the proposed method is fundamentally different and more capable in some ways than other methods (like the one I proposed in my comment).

Comment by Caspar Oesterheld (Caspar42) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-28T20:09:48.333Z · LW · GW

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with (I think) two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsidizing would mean scoring relative to some initial prediction $p_0$ provided by the market. Because the initial prediction might differ in how bad it is for different actions, the predictors might prefer a particular action to be taken. Conversely, the predictors might have no incentive to correct an overly optimistic prediction for one of the actions if doing so causes that action not to be taken. The examples in Section 3.2 of the Othman and Sandholm paper show these things.
- The second is "optimism bias" (a good thing in this context): "If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one." (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don't have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant's post "Two Major Obstacles for Logical Inductor Decision Theory": One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you're still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).
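To make the combination of these two ingredients concrete, here is a rough sketch in code. This is my own reconstruction of the mechanism as I understand it, not the authors' implementation; the log scores and the explicit score difference are illustrative choices.

```python
# Rough sketch of zero-sum conditional prediction with optimism bias (my reconstruction).
import math

def expected_utility(dist, utility):               # dist: {outcome: prob}
    return sum(p * utility[o] for o, p in dist.items())

def choose_action(report1, report2, utility):      # report_i: {action: {outcome: prob}}
    def optimistic(a):                              # act on the more optimistic report
        return max(expected_utility(report1[a], utility),
                   expected_utility(report2[a], utility))
    return max(report1, key=optimistic)

def settle(report1, report2, action, outcome):     # scoring after the outcome is observed
    s1 = math.log(report1[action][outcome])         # log score of predictor 1
    s2 = math.log(report2[action][outcome])         # log score of predictor 2
    return s1 - s2, s2 - s1                         # zero-sum payments to predictors 1 and 2
```

Because each predictor's payment is its score minus the other's, the game between them is zero-sum, so neither prefers any particular action to be taken; the `optimistic` max implements the tie-breaking between disagreeing predictions described above.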


My work on decision scoring rules describes the best you can get out of a single predictor. Basically you can incentivize a single predictor to tell you what the best action is and what the expected utility of that action is, but nothing more (aside from some degenerate cases).

Your result shows that if you have two predictors with the same information, then you can get slightly more: you can incentivize them to tell you what the best action is and what the full distribution over outcomes will be if you take the action.

You also get some other stuff (as you describe starting from the sentence, "Additionally, there is a bound on how inaccurate..."). But these other things seem much less important. (You also say: "while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them." But the same is true of decision scoring rules for example.)

Here's one thing that is a bit unclear to me, though.

If you have two predictors that have the same information, there's other, more obvious stuff you can do. For example, here's one:
- Ask Predictor 1 for a recommendation for what to do.
- Ask Predictor 2 for a prediction over outcomes conditional on Predictor 1's recommendation.
- Take the action recommended by Predictor 1.
- Observe an outcome o with a utility u(o).
- Pay Predictor 1 in proportion to u(o).
- Pay Predictor 2 according to a proper scoring rule.

In essence, this is just splitting the task into two: There's the issue of making the best possible choice and there's the issue of predicting what will happen. We assign Predictor 1 to the first and Predictor 2 to the second problem. For each of these problems separately, we know what to do (use proper (decision) scoring rules). So we can solve the overall problem.

So this mechanism also gets you an honest prediction and an honest recommendation for what to do. In fact, one advantage of this approach is that honesty is maintained even if the Predictors 1 and 2 have _different_ information/beliefs! (You don't get any information aggregation with this (though see below). But your approach doesn't have any information aggregation either.)

As per the decision scoring rules paper, you could additionally ask Predictor 1 for an estimate of the expected utility you will obtain. You can also let the Predictor 2 look at Predictor 1's prediction (or perhaps even score Predictor 2 relative to Predictor 1's prediction). (This way you'd get some information aggregation.) (You can also let Predictor 1 look at Predictor 2's predictions if Predictor 2 starts out by making conditional predictions before Predictor 1 gives a recommendation. This gets more tricky because now Predictor 2 will want to mislead Predictor 1.)

I think your proposal for what to do instead of the above is very interesting and I'm glad that we now know that this method exists and that it works. It seems fundamentally different, and it seems plausible that this insight will be very useful. But is there some concrete advantage of zero-sum conditional prediction over the above method?

Comment by Caspar Oesterheld (Caspar42) on Bayesian Networks Aren't Necessarily Causal · 2023-05-21T04:04:26.584Z · LW · GW

>First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) represents the probabilities using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the "true" network is favored even before we get into interventions.
>
>Implication of this for causal epistemics: we have two models which make the same predictions on-distribution, and only make different predictions under interventions. Yet, even without actually observing any interventions, we do have reason to epistemically favor one model over the other.

For people interested in learning more about this idea: This is described in Section 2.3 of Pearl's book Causality. The beginning of Ch. 2 also contains some information about the history of this idea. There's also a more accessible post by Yudkowsky that has popularized these ideas on LW, though it contains some inaccuracies, such as explicitly equating causal graphs and Bayes nets.

Comment by Caspar Oesterheld (Caspar42) on Bayesian Networks Aren't Necessarily Causal · 2023-05-21T00:32:19.400Z · LW · GW

OP:

>You don't like the idea of forced change, of intervention, being so integral to such a seemingly basic notion as causality. It feels almost anthropomorphic: you want the notion of cause and effect within a system to make sense without reference to the intervention of some outside agent—for there's nothing outside of the universe.

RK:

>You may not, but indeed, according to Judea Pearl, interventions are integral to the idea of causation.

Indeed, from the Epilogue of the second edition of Judea Pearl's book Causality (emphasis added):

The equations of physics are indeed symmetrical, but when we compare the phrases “A causes B” versus “B causes A,” we are not talking about a single set of equations. Rather, we are comparing two world models, represented by two different sets of equations: one in which the equation for A is surgically removed; the other where the equation for B is removed. Russell would probably stop us at this point and ask: “How can you talk about two world models when in fact there is only one world model, given by all the equations of physics put together?” The answer is: yes. If you wish to include the entire universe in the model, causality disappears because interventions disappear – the manipulator and the manipulated lose their distinction. However, scientists rarely consider the entirety of the universe as an object of investigation. In most cases the scientist carves a piece from the universe and proclaims that piece in – namely, the focus of investigation. The rest of the universe is then considered out or background and is summarized by what we call boundary conditions. This choice of ins and outs creates asymmetry in the way we look at things, and it is this asymmetry that permits us to talk about “outside intervention” and hence about causality and cause-effect directionality.

Comment by Caspar Oesterheld (Caspar42) on Further considerations on the Evidentialist's Wager · 2023-05-20T16:01:24.138Z · LW · GW

I guess it's too late for this comment (no worries if you don't feel like replying!), but are you basically saying that CDT doesn't make sense because it considers impossible/zero-probability worlds (such as the one where you get 11 doses)?

If so: I agree! The paper on the evidentialist's wager assumes that you should/want to hedge between CDT and EDT, given that the issue is contentious.

Does that make sense / relate at all to your question?

Comment by Caspar Oesterheld (Caspar42) on Contra Hofstadter on GPT-3 Nonsense · 2023-04-23T04:33:46.406Z · LW · GW

Hofstadter's article is very non-specific about why the examples prove much. (Young children often produce completely nonsensical statements, and we don't take that as evidence that they aren't (going to be) generally intelligent.) But here's an attempted concretization of Hofstadter's argument which would render the counterargument in the post invalid (in short: because the post makes things much easier for GPT by announcing that some of the questions will be nonsense).

I take it that when asked a question, GPT's prior (i.e., credence before looking at the question) on "the question is nonsense" is very low, because nonsense questions are rare in the training data. If you have a low prior on "the question is nonsense" this means that you need relatively strong evidence to conclude that the question is nonsense. If you have an understanding of the subject matter (e.g., you know what it means to transport something across a bridge, how wide bridges are, you know what type of object a country is, how large it is, etc.), then it's easy to overcome a very low prior. But if you don't understand the subject matter, then it's difficult to overcome the prior. For example, in a difficult exam on an unfamiliar topic, I might never come to assign >5% to a specific question being a nonsense/trick question. That's probably why occasionally asking trick questions in exams works -- students who understand can overcome the low prior, students who don't understand cannot. Anyway, I take it that this is why Hofstadter concludes that GPT doesn't understand the relevant concepts. If it could, it would be able to overcome the low prior of being asked nonsense questions.

(Others in the comments here give some arguments against this, saying something to the extent that GPT isn't trying at all to determine whether a given question is nonsense. I'm skeptical. GPT is trying to predict continuations in human text, so it's in particular trying to predict whether a human would respond to the question by saying that the question is nonsense.)

Now the OP asks GPT specifically whether a given question is nonsense or not. This, I assume, should cause the model to update toward a much higher credence on getting a nonsense question (before even looking at the question!). (Probably the prior without the announcement is something like 1/10000 (?) and the credence after the announcement is on the order of 1/3.) But getting from a 1/3 credence to a, say, 80% credence (and thus a "yo be real" response) is much easier than getting there from a 1/10000 credence. Therefore, it's much less impressive that the system can detect nonsense when it already has a decently high credence in nonsense. In particular, it's not that difficult to imagine getting from a 1/3 credence to an 80% credence with "cheap tricks". (For example, if a sentence contains words from very different domains -- say, "prime number" from number theory and "Obama" from politics -- that is some evidence that the question is nonsense. Some individual word combinations are rare in sensible text, e.g., "transport Egypt" probably rarely occurs in sensible sentences.)
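
To put rough numbers on this (a back-of-the-envelope calculation using the made-up credences above): in odds form, posterior odds = likelihood ratio × prior odds. Reaching an 80% credence (posterior odds 4:1) from a 1/10000 prior (odds 1:9999) requires a likelihood ratio of roughly 4 × 9999 ≈ 40,000, whereas reaching it from a 1/3 prior (odds 1:2) requires a likelihood ratio of only 4 × 2 = 8. So the announcement lowers the required strength of evidence by a factor of a few thousand.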

Here's another way of putting this: Imagine you had to code a piece of software from scratch that has to guess whether a given question is nonsense or not. Now imagine two versions of the assignment: A) The test set consists almost exclusively of sensible questions. B) The test set has similarly many sensible and nonsense questions. (In both sets, the nonsense questions are as grammatical, etc. as the sensible questions.) You get paid for each question in the test set that you correctly classify. But then also in both cases someone, call her Alice, will look at what your system does for five randomly sampled sensible and five randomly sampled nonsense questions. She then writes an article about it in The Economist, or on LessWrong, or the like. You don't care about what Alice writes. In variant A, I think it's pretty difficult to get a substantial fraction of the nonsense questions right, without doing worse than always guessing "sensible". In fact, I wouldn't be surprised if after spending months on this, I'd be unable to outperform always guessing "sensible". In any case, I'd imagine that Alice will write a scathing article about my system in variant A. But in variant B, I'd immediately have lots of ideas that are all based on cheap tricks (like the ones mentioned at the end of the preceding paragraph). I could imagine that one could achieve reasonably high accuracy on this, by implementing hundreds of cheap tricks and then aggregating them with linear regression or the like. So I'd also be much more hopeful about positive comments from Alice for the same amount of effort.

Comment by Caspar Oesterheld (Caspar42) on Contra Hofstadter on GPT-3 Nonsense · 2023-04-23T03:22:56.775Z · LW · GW

>GPT-3 is very capable of saying "I don't know" (or "yo be real"), but due to its training dataset it likely won't say it on its own accord.

I'm not very convinced by the training data point. People write "I don't know" on the Internet all the time (and "that makes no sense" occasionally). (Hofstadter says both in his article, for example.) Also, RLHF presumably favors "I don't know" over trying to BS, and RLHFed models like those underlying ChatGPT and Bing still frequently make stuff up or output nonsense (though it apparently gets the examples from Hofstadter's article right, see LawrenceC's comment).

Comment by Caspar Oesterheld (Caspar42) on Dutch-Booking CDT: Revised Argument · 2023-04-18T03:16:15.181Z · LW · GW

Minor bibliographical note: A related academic paper is Arif Ahmed's unpublished paper, "Sequential Choice and the Agent's Perspective". (This is from memory -- I read that paper a few years ago.)

Comment by Caspar Oesterheld (Caspar42) on An Appeal to AI Superintelligence: Reasons to Preserve Humanity · 2023-03-20T00:01:49.024Z · LW · GW

>We mentioned both.

Did you, though? Besides Roko's basilisk, the references to acausal trade seem vague, but to me they sound like the kinds that could easily make things worse. In particular, you don't explicitly discuss superrationality, right?

>Finally, while it might have been a good idea initially to treat Roko's basilisk as an information hazard to be ignored, that is no longer possible so the marginal cost of mentioning it seems tiny.

I agree that due to how widespread the idea of Roko's basilisk is, it overall matters relatively little whether this idea is mentioned, but I think this applies similarly in both directions.

Comment by Caspar Oesterheld (Caspar42) on Newcomb's paradox complete solution. · 2023-03-19T09:03:28.238Z · LW · GW

I agree that some notions of free will imply that Newcomb's problem is impossible to set up. But if one of these notions is what is meant, then the premise of Newcomb's problem is that these notions are false, right?

It also happens that I disagree that these notions are relevant to what free will is.

Anyway, if this had been discussed in the original post, I wouldn't have complained.

Comment by Caspar Oesterheld (Caspar42) on An Appeal to AI Superintelligence: Reasons to Preserve Humanity · 2023-03-18T17:36:47.654Z · LW · GW

What's the reasoning behind mentioning the fairly controversial, often deemed dangerous Roko's basilisk over less risky forms of acausal trade (like superrational cooperation with human-aligned branches)?

Comment by Caspar Oesterheld (Caspar42) on Newcomb's paradox complete solution. · 2023-03-15T20:06:09.637Z · LW · GW

Free will is a controversial, confusing term that, I suspect, different people take to mean different things. I think to most readers (including me) it is unclear what exactly the Case 1 versus 2 distinction means. (What physical property of the world differs between the two worlds? Maybe by "not having free will" you mean something very mundane, similar to how I don't have free will about whether to fly to Venus tomorrow because it's just not physically possible for me to fly to Venus, so I have to "decide" not to fly to Venus?)

I generally think that free will is not so relevant in Newcomb's problem. It seems that whether there is some entity somewhere in the world that can predict what I'm doing shouldn't make a difference for whether I have free will or not, at least if this entity isn't revealing its predictions to me before I choose. (I think this is also the consensus on this forum and in the philosophy literature on Newcomb's problem.)

>CDT believers only see the second decision. The key here is realising there are two decisions.

Free will aside, as far as I understand, your position is basically in line with what most causal decision theorists believe: You should two-box, but you should commit to one-boxing if you can do so before your brain is scanned. Is that right? (I can give some references to discussions of CDT and commitment if you're interested.)

If so, how do you feel about the various arguments that people have made against CDT? For example, what would you do in the following scenario?

>Two boxes, B1 and B2, are on offer. You may purchase one or none of the boxes but not both. Each of the two boxes costs $1. Yesterday, Omega put $3 in each box that she predicted you would not acquire. Omega's predictions are accurate with probability 0.75.

In this scenario, CDT always recommends buying a box, which seems like a bad idea because from the perspective of the seller of the boxes, they profit when you buy from them.
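
To spell out the numbers (my own sketch of the reasoning, using only the quantities in the quoted scenario): since you buy at most one box, your credences satisfy P(Omega predicted "buy B1") + P(Omega predicted "buy B2") ≤ 1, so for at least one box Bi you assign probability ≥ 0.5 to it containing the $3. CDT then evaluates buying that box at a causal expected value of at least 0.5 * $3 - $1 = $0.50 > 0, so it always buys some box. But conditional on actually buying Bi, Omega predicted that purchase with probability 0.75, so the expected payoff is 0.25 * $3 - $1 = -$0.25.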

>TDT believers only see the first decision, [...] The key here is realising there are two decisions.

I think proponents of TDT and especially Updateless Decision Theory and friends are fully aware of this possible "two-decisions" perspective. (Though typically Newcomb's problem is described as only having one of the two decision points, namely the second.) They propose that the correct way to make the second decision (after the brain scan) is to take the perspective of the first decision (or similar). Of course, one could debate whether this move is valid and this has been discussed (e.g., here, here, or here).

Also: Note that evidential decision theorists would argue that you should one-box in the second decision (after the brain scan) for reasons unrelated to the first-decision perspective. In fact, I think that most proponents of TDT/UDT/... would agree with this reasoning also, i.e., even if it weren't for the "first decision" perspective, they'd still favor one-boxing. (To really get the first decision/second decision conflict you need cases like counterfactual mugging.)

Comment by Caspar Oesterheld (Caspar42) on GPT-4 · 2023-03-15T17:32:52.442Z · LW · GW

I haven't read this page in detail. I agree, obviously, that on many prompts Bing Chat, like ChatGPT, gives very impressive answers. Also, there are clearly examples on which Bing Chat gives a much better answer than GPT3. But I don't give lists like the one you linked that much weight. For one, for all I know, the examples are cherry-picked to be positive. I think for evaluating these models it is important that they sometimes give indistinguishable-from-human answers and sometimes make extremely simple errors. (I'm still very unsure about what to make of it overall. But if I only knew of all the positive examples and thought that the corresponding prompts weren't selection-biased, I'd think ChatGPT/Bing is already superintelligent.) So I give more weight to my few hours of generating somewhat random prompts (though I confess, I sometimes try deliberately to trip either system up). Second, I find the examples on that page hard to evaluate, because they're mostly creative-writing tasks. I give more weight to prompts where I can easily evaluate the answer as true or false, e.g., questions about the opening hours of places, prime numbers or what cities are closest to London, especially if the correct answer would be my best prediction for a human answer.

Comment by Caspar Oesterheld (Caspar42) on GPT-4 · 2023-03-15T17:14:52.873Z · LW · GW

That's interesting, but I don't give it much weight. A lot of things that are close to Monty Fall are in GPT's training data. In particular, I believe that many introductions to the Monty Hall problem discuss versions of Monty Fall quite explicitly. Most reasonable introductions to Monty Hall discuss that what makes the problem work is that Monty Hall opens a door according to specific rules and not uniformly at random. Also, even humans (famously) get questions related to Monty Hall wrong. If you talk to a randomly sampled human and they happen to get questions related to Monty Hall right, you'd probably conclude (or at least strongly update towards thinking that) they've been exposed to explanations of the problem before (not that they solved it all correctly on the spot). So to me the likely way in which LLMs get Monty Fall (or Monty Hall) right is that they learn to better match it onto their training data. Of course, that is progress. But it's (to me) not very impressive/important. Obviously, it would be very impressive if it got any of these problems right if they had been thoroughly excluded from its training data.

Comment by Caspar Oesterheld (Caspar42) on GPT-4 · 2023-03-14T21:49:40.112Z · LW · GW

To me Bing Chat actually seems worse/less impressive (e.g., more likely to give incorrect or irrelevant answers) than ChatGPT, so I'm a bit surprised. Am I the only one that feels this way? I've mostly tried the two systems on somewhat different kinds of prompts, though. (For example, I've tried (with little success) to use Bing Chat instead of Google Search.) Presumably some of this is related to the fine-tuning being worse for Bing? I also wonder whether the fact that Bing Chat is hooked up to search in a somewhat transparent way makes it seem less impressive. On many questions it's "just" copy-and-pasting key terms of the question into a search engine and summarizing the top result. Anyway, obviously I've not done any rigorous testing...

Comment by Caspar Oesterheld (Caspar42) on 2+2=π√2+n · 2023-02-04T01:04:34.234Z · LW · GW

There's a Math Stack Exchange question: "Conjectures that have been disproved with extremely large counterexamples?" Maybe some of the examples in the answers over there would count? For example, there's Euler's sum of powers conjecture, which only has large counterexamples (for high k), found via ~brute force search.

Comment by Caspar Oesterheld (Caspar42) on The Nature of Counterfactuals · 2023-02-03T03:15:09.937Z · LW · GW

>Imagine trying to do physics without being able to say things like, "Imagine we have a 1kg frictionless ball...", mathematics without being able to entertain the truth of a proposition that may be false or divide a problem into cases and philosophy without being allowed to do thought experiments. Counterfactuals are such a basic concept that it makes sense to believe that they - or something very much like them - are a primitive.

In my mind, there's quite some difference between all these different types of counterfactuals. For example, consider the counterfactual question, "What would have happened if Lee Harvey Oswald hadn't shot Kennedy?" I think the meaning of this counterfactual is kind of like the meaning of the word "chair".
- For one, I don't think this counterfactual is very precisely defined. What exactly are we asked to imagine? A world that is like ours, except that the laws of physics in Oswald's gun were temporarily suspended to save JFK's life? (Similarly, it is not exactly clear what counts as a chair (or to what extent) and what doesn't.)
- Second, it seems that the users of the English language all have roughly the same understanding of what the meaning of the counterfactual is, to the extent that we can use it to communicate effectively. For example, if I say, "if LHO hadn't shot JFK, US GDP today would be a bit higher than it is in fact", then you might understand that to mean that I think JFK had good economic policies, or that people were generally influenced negatively by the news of his death, or the like. (Maybe a more specific example: "If it hadn't suddenly started to rain, I would have been on time." This is a counterfactual, but it communicates things about the real world, such as: I didn't just get lost in thought this morning.) (Similarly, when you tell me to get a "chair" from the neighboring room, I will typically do what you want me to do, namely to bring a chair.)
- Third, because it is used for communication, some notions of counterfactuals are more useful than others, because they are better for transferring information between people. At the same time, usefulness as a metric still leaves enough open to make it practically and theoretically impossible to identify a unique optimal notion of counterfactuals. (Again, this is very similar to a concept like "chair". It is objectively useful to have a word for chairs. But it's not clear whether it's more useful for "chair" to include or exclude .)
- Fourth, adopting whatever notion of counterfactual we adopt for this purpose has no normative force outside of communication -- they don't interact with our decision theory or anything. For example, causal counterfactuals as advocated by causal decision theorists are kind of similar to the "If LHO hadn't shot JFK" counterfactuals. (E.g., both are happy to consider literally impossible worlds.) As you probably know, I'm partial to evidential decision theory. So I don't think these causal counterfactuals should ultimately be the guide of our decisions. Nevertheless, I'm as happy as anyone to adopt the linguistic conventions related to "if LHO hadn't shot JFK"-type questions. I don't try to reinterpret the counterfactual question as a conditional one. (Note that answers to, "how would you update on the fact that JFK survived the assassination?", would be very different from answers to the counterfactual question. ("I've been lied to all my life. The history books are all wrong.") But other conditionals could come much closer.) (Similarly, using the word "chair" in the conventional way doesn't commit one to any course of action. In principle, Alice might use the term "chair" normally, but never sit on chairs, or only sit on green chairs, or never think about the chair concept outside of communication, etc.)

So in particular, the meaning of counterfactual claims about JFK's survival doesn't seem necessarily very related to the counterfactuals used in decision making (e.g., the question, "what would happen if I don't post this comment?", that I asked myself prior to posting this comment).

In math, meanwhile, people seem to consider counterfactuals mainly for proofs by contradiction, i.e., to prove that the claims are contrary to fact. Cf. https://en.wikipedia.org/wiki/Principle_of_explosion , which makes it difficult to use the regular rules of logic to talk about counterfactuals.

Do you agree or disagree with this (i.e., with the claim that these different uses of counterfactuals aren't very closely connected)?

Comment by Caspar Oesterheld (Caspar42) on Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas · 2022-11-26T15:13:58.010Z · LW · GW

In general it seems that currently the podcast can only be found on Spotify.

Comment by Caspar Oesterheld (Caspar42) on Utilitarianism Meets Egalitarianism · 2022-11-23T03:28:18.345Z · LW · GW

So the argument/characterization of the Nash bargaining solution is the following (correct?): The Nash bargaining solution is the (almost unique) outcome o for which there is a rescaling w of the utility functions such that both the utilitarian solution under rescaling w and the egalitarian solution under rescaling w are o. This seems interesting! (Currently this is a bit hidden in the proof.)
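
As a concrete illustration of this characterization (my own toy example, not from the post): take the feasible set {(u1, u2) : u1^2/4 + u2^2 ≤ 1, u1, u2 ≥ 0} with disagreement point (0, 0). The Nash bargaining solution maximizes the product u1*u2 and picks o = (√2, 1/√2). Under the rescaling w = (1, 2), this same o maximizes the weighted sum u1 + 2*u2 (utilitarian) and also maximizes min(u1, 2*u2) (egalitarian), so both rescaled solutions coincide with the Nash solution.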

Do you show the (almost) uniqueness of o, though? You show that the Nash bargaining solution has the property, but you don't show that no other solution has this property, right?

Comment by Caspar Oesterheld (Caspar42) on Tyranny of the Epistemic Majority · 2022-11-23T01:25:29.118Z · LW · GW

Nice!

I'd be interested in learning more about your views on some of the tangents:

>Utilities are bounded.

Why? It seems easy to imagine expected utility maximizers whose behavior can only be described with unbounded utility functions, for example.

>I think many phenomena that get labeled as politics are actually about fighting over where to draw the boundaries.

I suppose there are cases where the connection is very direct (drawing district boundaries, forming coalitions for governments). But can you say more about what you have in mind here?

Also:

>Not, they are in a positive sum

I assume the first word is a typo. (In particular, it's one that might make the post less readable, so perhaps worth correcting.)

Comment by Caspar Oesterheld (Caspar42) on Utilitarianism Meets Egalitarianism · 2022-11-23T01:02:37.369Z · LW · GW

I think in the social choice literature, people almost always mean preference utilitarianism when they say "utilitarianism", whereas in the philosophical/ethics literature people are more likely to mean hedonic utilitarianism. I think the reason for this is that in the social choice and somewhat adjacent game (and decision) theory literature, utility functions have a fairly solid foundation as a representation of preferences of rational agents. (For example, Harsanyi's "[preference] utilitarian theorem" paper and Nash's paper on the Nash bargaining solution make very explicit reference to this foundation.) Whereas there is no solid foundation for numeric hedonic welfare (at least not in this literature, but also not elsewhere as far as I know).

Comment by Caspar Oesterheld (Caspar42) on Further considerations on the Evidentialist's Wager · 2022-11-09T16:41:27.132Z · LW · GW

>Anthropically, our existence provides evidence for them being favored.

There are some complications here. It depends a bit on how you make anthropic updates (if you do them at all). But it turns out that the version of updating that "works" with EDT basically doesn't make the update that you're in the majority. See my draft on decision making with anthropic updates.

>Annex: EDT being counter-intuitive?

I mean, in regular probability calculus, this is all unproblematic, right? Because of the Tower Rule, a.k.a. the Law of total expectation, or similarly conservation of expected evidence. There are also issues of updatelessness, though, which you touch on at various places in the post. E.g., see Almond's "lack of knowledge is [evidential] power" or scenarios like the Transparent Newcomb's problem, wherein EDT wants to prevent itself from seeing the contents of the boxes.
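
(For reference, the identity I'm appealing to is E[X] = E[E[X | Y]]; conservation of expected evidence is the special case where X is the indicator of a hypothesis H, i.e., P(H) = Σ_e P(e) P(H | e).)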

>It seems plausible that evolutionary pressures select for utility functions broadly as ours

Well, at least in some ways similar to ours, right? On questions like whether rooms are better painted red or green, I assume there isn't much reason to expect convergence. But on questions of whether happiness is better than suffering, I think one should expect evolved agents to mostly give the right answers.

>to compare such maximizations, you already need a decision theory (which tells you what "maximizing your goals" even is).

Incidentally I published a blog post about this only a few weeks ago (which will probably not contain any ideas that are new to you).

>Might there be some situation in which an agent wants to ensure all of its correlates are Good Twins

I don't think this is possible.

Comment by Caspar Oesterheld (Caspar42) on EA, Veganism and Negative Animal Utilitarianism · 2022-09-04T19:53:45.493Z · LW · GW

There have been discussions of the suffering of wild animals. David Pearce discusses this, see one of the other comment threads. Some other starting points:

>As a utilitarian then, it should be far more important to wipe out as many animal habitats as possible rather than avoiding eating a relatively small number of animals by being a vegan.

To utilitarians, there are other considerations in assessing the value of wiping out animal habitats, like the effect of such habitats on global warming.

Comment by Caspar Oesterheld (Caspar42) on Worlds Where Iterative Design Fails · 2022-08-31T16:32:04.327Z · LW · GW

Nice post!

What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans? 

Perhaps relatedly:

>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned into a bomb?" and the AI says, "No, it's safe for such and such reasons", then if you can't evaluate these reasons, it doesn't help you that you have asked the right question.

Comment by Caspar Oesterheld (Caspar42) on Announcing: Mechanism Design for AI Safety - Reading Group · 2022-08-21T17:56:44.142Z · LW · GW

Sounds interesting! Are you going to post the reading list somewhere once it is completed?

(Sorry for self-promotion in the below!)

I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.

Here's a pitch in the language of incentivizing AI systems -- the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:
1) It makes predictions about the world.
2) It takes actions that influence the world. (In the paper, we specifically imagine that the agent makes recommendations to a principal who then takes the recommended action.) Note that if the predictions are seen by humanity, they themselves influence the world. So even a pure oracle AI might satisfy 2, as has been discussed before (see end of this comment).
We want to design a reward system for this agent such that the agent maximizes its reward by making accurate predictions and taking actions that maximize our, the principals', utility.

The challenge is that if we reward the accuracy of the agent's predictions, we may set an incentive on the agent to make the world more predictable, which will generally not be aligned with maximizing our utility.

So how can we properly incentivize the agent? The paper provides a full and very simple characterization of such incentive schemes, which we call proper decision scoring rules:

We show that proper decision scoring rules cannot give the [agent] strict incentives to report any properties of the outcome distribution [...] other than its expected utility. Intuitively, rewarding the [agent] for getting anything else about the distribution right will make him [take] actions whose outcome is easy to predict as opposed to actions with high expected utility [for the principal]. Hence, the [agent's] reward can depend only on the reported expected utility for the recommended action. [...] we then obtain four characterizations of proper decision scoring rules, two of which are analogous to existing results on proper affine scoring [...]. One of the [...] characterizations [...] has an especially intuitive interpretation in economic contexts: the principal offers shares in her project to the [agent] at some pricing schedule. The price schedule does not depend on the action chosen. Thus, given the chosen action, the [agent] is incentivized to buy shares up to the point where the price of a share exceeds the expected value of the share, thereby revealing the principal's expected utility. Moreover, once the [agent] has some positive share in the principal's utility, it will be (strictly) incentivized to [take] an optimal action.
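
For intuition, here's a minimal numerical sketch of that shares interpretation. The linear price schedule, the toy probabilities/utilities, and the grid search are my own simplifications for illustration, not the paper's general construction.

```python
import numpy as np

# Toy setting: two actions, three outcomes; u is the principal's utility per outcome.
p_outcome = {"a": np.array([0.8, 0.1, 0.1]),
             "b": np.array([0.2, 0.3, 0.5])}
u = np.array([0.0, 0.5, 1.0])

# Increasing price schedule for shares, independent of the chosen action:
# with price(s) = s, buying q shares costs the integral q^2 / 2.
def cost(q):
    return q ** 2 / 2

# The agent recommends an action and buys q shares; after outcome o is observed,
# it receives q * u[o] minus what it paid for the shares.
def expected_payoff(action, q):
    return q * (p_outcome[action] @ u) - cost(q)

# Let the agent best-respond over a grid of share quantities.
qs = np.linspace(0, 1, 1001)
action, q_star = max(((a, q) for a in p_outcome for q in qs),
                     key=lambda aq: expected_payoff(*aq))

# With price(s) = s, the optimal q equals the expected utility of the chosen
# action, so the purchase reveals E[u | action]; and holding a positive share
# makes the agent prefer the utility-maximizing action.
print("recommended action:", action)   # "b", which maximizes the principal's E[u]
print("shares bought:", q_star)        # ~0.65 = E[u | action "b"]
```

The design choice doing the work here is that the price schedule does not depend on the recommended action, so the agent can only increase its payoff by recommending a higher-expected-utility action and then truthfully "betting" on it.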

Also see Johannes Treutlein's post on "Training goals for large language models", which also discusses some of the above results among other things that seem like they might be a good fit for the reading group, e.g., Armstrong and O'Rourke's work.

My motivation for working on this was to address issues of decision making under logical uncertainty. For this I drew inspiration from the fact that Garrabrant et al.'s work on logical induction is also inspired by market design ideas (specifically prediction markets).

Comment by Caspar Oesterheld (Caspar42) on On infinite ethics · 2022-04-15T20:33:41.258Z · LW · GW

>Because there's "always a bigger infinity" no matter which you choose, any aggregation function you can use to make decisions is going to have to saturate at some infinite cardinality, beyond which it just gives some constant answer.

Couldn't one use a lexicographic utility function that has infinitely many levels? I don't know exactly how this works out technically. I know that maximizing the expectation of a lexicographic utility function is equivalent to the vNM axioms without continuity, see Blume et al. (1989). But they only mention the case of infinitely many levels in passing.

Comment by Caspar Oesterheld (Caspar42) on [Closed] Job Offering: Help Communicate Infrabayesianism · 2022-04-14T04:07:17.305Z · LW · GW

Cool that this is (hopefully) being done! I have had this on my reading list for a while and since this is about the kind of problems I also spend much time thinking about, I definitely have to understand it better at some point. I guess I can snooze it for a bit now. :P Some suggestions:

Maybe someone could write an FAQ page? Also, a somewhat generic idea is to write something that is more example based, perhaps even something that solely gives examples. Part of why I suggest these two is that I think they can be written relatively mechanically and therefore wouldn't take that much time and insight to write. Maybe Vanessa or Alex could also record a talk? (Typically one explains things differently in talks/on a whiteboard and some people claim that one generally does so better than in writing.)

I think for me the kind of writeup that would have been most helpful (and maybe still is) would be some relatively short (5-15 pages), clean, self-contained article that communicates the main insight(s), perhaps at the cost of losing generality and leaving some things informal. So somewhere in between the original intro post / the content in the AXRP episode / Rohin's summary (all of which explain the main idea but are very informal) and the actual sequence (which seems to require wading through a lot of intrinsically not that interesting things before getting to the juicy bits). I don't know to what extent this is feasible, given that I haven't read any of the technical parts yet. (Of course, a lot of projects have this presentation problem, but I think usually there's some way to address this. E.g., compare the logical induction paper, which probably has a number of important technical aspects that I still don't understand or forgot at this point. But where by making a lot of things a bit informal, the main idea can be grasped from the short version, or from a talk.)

Comment by Caspar Oesterheld (Caspar42) on In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy · 2022-02-12T00:01:39.798Z · LW · GW

I now have a draft for a paper that gives this result and others.

Comment by Caspar Oesterheld (Caspar42) on Formalizing Objections against Surrogate Goals · 2021-09-07T20:12:56.296Z · LW · GW

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post).

On this view, it doesn't help substantially to understand / analyze SPIs formally.

But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.