davidmanheim feed - LessWrong 2.0 Reader

Comment by Davidmanheim on Trying to translate when people talk past each other

davidmanheim — 2024-12-18T07:23:15.261Z

I don't think it was betrayal, I think it was skipping verbal steps, which left intent unclear.

If A had said "I promised to do X, is it OK now if I do Y instead?", there would presumably have been no confusion. Instead, they announced their plan before doing Y, leaving the permission request implicit. The point that "she needed A to acknowledge that he’d unilaterally changed an agreement" was critical to B, but I suspect A thought that stating the new plan did that implicitly.

Comment by Davidmanheim on MIRI's June 2024 Newsletter

davidmanheim — 2024-12-14T19:56:26.729Z

Strongly agree that there needs to be an institutional home. My biggest problem is that there is still no such new home!

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-12T08:44:18.935Z

You should also read the relevant sequence about dissolving the problem of free will: https://www.lesswrong.com/s/p3TndjYbdYaiWwm9x

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-12T08:42:29.488Z

You believe that something inert cannot be doing computation. I agree. But you seem to think it's coherent that a system with no action - a post-hoc mapping of states - can be.

The place where comprehension of Chinese exists in the "Chinese room" is the creation of the mapping - the mapping itself is a static object, and the person in the room by assumption is doing no cognitive work, just looking up entries. "But wait!" we can object, "this means that the Chinese room doesn't understand Chinese!" And I think that's the point of confusion - repeating someone else telling you answers isn't the same as understanding. The fact that the "someone else" wrote down the answers changes nothing. The question is where and when the computation occurred.

In our scenarios, there are a couple different computations - but the creation of the mapping unfairly sneaks in the conclusion that the execution of the computation, which is required to build the mapping, isn't what creates consciousness!

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-12T05:55:13.337Z

Good point. The problem I have with that is that in every listed example, the mapping either requires the execution of the conscious mind and a readout of its output and process in order to build it, or it stipulates that it is well enough understood that it can be mapped to an arbitrary process, thereby implicitly also requiring that it was run elsewhere.

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-11T16:36:48.560Z

That seems like a reasonable idea. It seems not at all related to what any of the philosophers proposed.

For their proposals, it seems like the computational process is more like:
1. Extract a specific string of 1s and zeros from the sandstorm's initial position, and another from its final position, with the same length as the length of the full description of the mind.
2. Calculate the bitwise sum of the initial mind state and the initial sand position.
3. Calculate the bitwise sum of the final mind state and the final sand position.
4. Take the output of step 2 and replace it with the output of step 3.
5. Declare that the sandstorm is doing something isomorphic to what the mind did. Ignore the fact that the internal process is completely unrelated, and all of the computation was done inside of the mind, and you're just copying answers.
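
The steps above can be sketched in a few lines of Python. This is only an illustration under assumed stand-ins: `run_mind` is a hypothetical placeholder for the real computation, the "sand" readings are random bits, and "bitwise sum" is read as XOR. What it tries to show is that the mapping can only be built after the mind has already been run, and that any sand at all would work equally well.

```python
import secrets

N_BITS = 64  # assumed length of the "full description of the mind"

def run_mind(state: int) -> int:
    """Hypothetical stand-in for the actual computation (a toy bit-mixing step)."""
    mask = (1 << N_BITS) - 1
    return (state * 6364136223846793005 + 1442695040888963407) & mask

def build_mapping(mind_in: int, mind_out: int, sand_in: int, sand_out: int):
    """Steps 2-3: XOR each mind state with the matching sand reading."""
    return mind_in ^ sand_in, mind_out ^ sand_out

def read_off(sand_state: int, key: int) -> int:
    """'Interpret' a sand reading as a mind state using a precomputed key."""
    return sand_state ^ key

mind_in = secrets.randbits(N_BITS)
mind_out = run_mind(mind_in)          # all of the real work happens here
sand_in = secrets.randbits(N_BITS)    # step 1: arbitrary bits from the sand
sand_out = secrets.randbits(N_BITS)

key_in, key_out = build_mapping(mind_in, mind_out, sand_in, sand_out)

# Step 5: the sand "reproduces" the mind, but only by copying stored answers.
assert read_off(sand_in, key_in) == mind_in
assert read_off(sand_out, key_out) == mind_out
```

Swapping in different random sand readings changes the keys but not the decoded result, which is one way to see that the sand contributes nothing to the computation.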

Comment by Davidmanheim on Most Minds are Irrational

davidmanheim — 2024-12-11T11:30:39.867Z

I agree that's a more interesting question, and computational complexity theorists have done work on it which I don't fully understand, but it also doesn't seem as relevant for AI safety questions.

Comment by Davidmanheim on Most Minds are Irrational

davidmanheim — 2024-12-10T13:05:38.149Z

Regarding chess agents, Vanessa pointed out that while only perfect play is optimal, informally we would consider agents to have an objective that is better served by slightly better play. For example, an agent rated 2500 Elo is better than one rated 1800, which is better than one rated 1000, etc. That means that lots of "chess minds" which are non-optimal are still somewhat rational at their goal.

I think that it's very likely that even according to this looser definition, almost all chess moves, and therefore almost all "possible" chess bots, fail to do much to accomplish the goal.
We could check this informally by evaluating what proportion of the possible moves in normal games would be classified as blunders, using a method such as the one used here to evaluate what proportion of actual moves made by players are blunders. Figure 1 there implies that in positions with many legal moves, a larger proportion are blunders - but this is looking at the empirical blunder rate by those good enough to be playing ranked chess. Another method would be to look at a bot that actually implements "pick a random legal move" - namely Brutus RND. It has an Elo of 255 when ranked against other amateur chess bots, and wins only occasionally against some of the worst bots; it seems hard to figure out from that what proportion of moves are good, but it's evidently a fairly small proportion.

Most Minds are Irrational

davidmanheim — 2024-12-10T09:36:33.144Z

Epistemic status: This is a step towards formalizing some intuitions about AI. It is closely related to Vanessa Kosoy’s “Descriptive Agent Theory” - but I want to concretize the question, explain the reason that it is true in some form, and try to think through and provide some intuition about why it would matter. I welcome pushback about the claim or the way to operationalize it.

The intuition that most minds are irrational is about the space of possible minds. I have been told that others also haven’t formalized the claim well, and have not found a good explanation. The intuition is that, as a portion of the total possible space, a measure-zero subset of “minds” will fulfill the basic requirements of rational agents. Unfortunately, none of this is really well defined, so I’m writing down my understanding of the problem and what seem like the paths forward. I will note that this isn’t my main focus, and I think it’s important to note that this is only indirectly related to safety, and is far more closely related to deconfusion. However, it seems important when thinking about both artificial intelligence agents, and human minds.

To outline the post, I’ll first prove that in a very narrow case, almost all possible economic agents are irrational. After that, I’ll talk about why the most general case of any computational process - which includes anything that generates output in response to input, any program or MDP - can be considered a decision process (“what should I output,”) but if all we want is for it to output something, the proportion of agents which are rational is undecidable for technical reasons. I’ll then make a narrower case about chess agents, showing that in a fairly reasonable sense, almost all such agents are irrational. And finally, I’ll talk about what would be needed to make progress on the problems, and some interesting issues or potentially tractable mathematical approaches.

Most economic preferences are irrational

I’ll start with a toy example of the economic notion of agents which I think is not particularly useful, except to explain the intuition. Imagine a person who, when at the store, looks at a candy bar and says they’d rather have the candy bar than two dollars - but in exactly the same scenario, they are willing to sell a candy bar they have for only one dollar. Clearly, if the person spends some time trading back and forth, they will drain their bank account, with nothing gained. This is what economists call a money pump, and the example shows that as long as we accept the premise that people value the outcome, then regardless of what the goal is, people need to have transitive preferences in order to be rational. (And to be fair to this toy example, despite this not being a great model, those who dismiss the claim that this could describe many minds have evidently never seen a small child, or many adults, actually make decisions.)
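
To make the money pump concrete, here is a minimal sketch of the candy bar example. The function name and the starting balance are assumptions for illustration; the prices are the ones in the example above.

```python
def run_money_pump(cash: int, buy_price: int = 2, sell_price: int = 1) -> tuple[int, int]:
    """Trade back and forth until the agent can no longer afford a candy bar.

    Returns (number of round trips, cash left over).
    """
    trips = 0
    while cash >= buy_price:
        cash -= buy_price    # the agent prefers the candy bar to two dollars
        cash += sell_price   # the agent then sells the same candy bar for one dollar
        trips += 1
    return trips, cash

trips, remaining = run_money_pump(cash=10)
print(f"{trips} round trips, ${remaining} left, no candy bar")
```

Each round trip leaves the agent exactly where it started except for one dollar less, so the loop only ends when the money does.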

The general claim about economic actors is that if there are k items available, and the agent has preferences between all of those items, we can make a strong statement about an upper bound for the proportion of rational agents. A minimal requirement for rational preferences is that a preference is only rational if there are no cycles in those preferences - they never end up stuck in an infinite cycle. That is true if and only if there is an ordering^[1], so that every item can be put in order, and each item is preferred to everything after it in the order, and preferred to nothing before it. And if we exclude the possibility that something is preferred to itself, there are 2^(k(k-1)/2) possible sets of preferences, but only k! of those are rational.

In this narrow setting, this proves the claim that almost no minds are rational. Starting simple, if there are only 2 items, either the person likes the first more, or the second - no matter what, the preferences are rational. But if there are three items, there are 8 possible preferences, and two of them (A>B>C>A, and A<B<C<A,) are irrational, so only 75% of the possible preferences are rational. And as the number of items increases, the portion of possible preferences that are rational drops quickly, to 37.5% with 4 items, and 2.2% with 6 items. By the time there are 10 items, only about 1 in 10 million (10^-7) of the 35 trillion possible preferences are rational^[2]. And as stated initially, this gives the intuition that very, very few possible minds are rational.
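
These numbers can be checked directly. The sketch below computes k!/2^(k(k-1)/2) and, for small k, confirms the count by brute force over every complete set of strict pairwise preferences. The function names are my own, and the brute-force check is only feasible for a handful of items.

```python
from itertools import combinations, permutations, product
from math import factorial

def rational_fraction(k: int) -> float:
    """k! consistent orderings out of 2^(k(k-1)/2) sets of pairwise preferences."""
    return factorial(k) / 2 ** (k * (k - 1) // 2)

def brute_force_rational_count(k: int) -> int:
    """Count complete strict pairwise preference sets over k items with no cycle."""
    pairs = list(combinations(range(k), 2))
    count = 0
    for choice in product([True, False], repeat=len(pairs)):
        prefers = {}
        for (a, b), a_wins in zip(pairs, choice):
            prefers[(a, b)] = a_wins        # True when a is preferred to b
            prefers[(b, a)] = not a_wins
        # With a preference between every pair, "no cycles" is equivalent to
        # some ordering agreeing with every pairwise preference.
        if any(all(prefers[(o[i], o[j])] for i, j in combinations(range(k), 2))
               for o in permutations(range(k))):
            count += 1
    return count

for k in (2, 3, 4, 6, 10):
    print(k, 2 ** (k * (k - 1) // 2), rational_fraction(k))
```

The brute-force count matches k! for 3 and 4 items, which is the "if and only if there is an ordering" claim in action.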

Rationality more generally

But the above argument about economic agents doesn’t describe actual minds. If nothing else, people don’t choose preferences between items randomly. I’m sure you could find a person who would claim to prefer an apple to $1,000,000, but I wouldn’t believe them. And most decisions taken by agents, human or machine, don’t look like this; at the very least, the action space is usually richer, involving actions and decisions rather than just goods, and preferences are not only ordinal.

Defining goals

We also have a different issue, mentioned earlier - to talk about rationality in general, we need to include the notion of a goal. In the economic example, we implicitly assumed the goal of maximizing whichever items are available^[3]. In a more general setting, I’ll assume there is some scoring function for the decisions made. For example, in the above example of preferences, there is a score implicit in the preferences where an agent that ends up with a higher ranked item or choice does “better”.

A fully general undecidable case

As a simple metric in the most general case, a program generating an output might get a 1 for providing the desired output, a zero for providing an incorrect output, and a -1 for not terminating within some finite time, or crashing. Given a program which terminates on all inputs, we could say there is some implicit “goal” output, which is described by the program behavior. In this case, all terminating programs are rational given their output as the goal, but (unfortunately for the current argument,) the fraction of programs which fulfill this property of terminating on some output is uncomputable^[4].

Alternatively, we can compare programs on the basis of some other dimension - size, time complexity, etc. - and note that almost all programs have shorter / faster versions that produce identical outputs. But this is a far stronger version of rationality than what we would informally require - a program that always wins chess but, say, runs a factor of two slower than the best possible program is probably still considered a rational agent.

Why does this matter?

When looking at making a future AI system that does what humans want, one concept discussed by Eliezer Yudkowsky, among others, was “coherent extrapolated volition.” This means that we would take a human mind, find what the mind’s preferences are, then extrapolate those preferences to a far larger possible set of actions or outcomes, but make sure the extrapolation is coherent - unlike actual irrational minds. This could, in a narrow economic setting, identify a set of things that fulfills what the original human wants far better than the options the human would come up with. In some sense, it’s asking for a set of preferences that are “close” to the human’s preferences. But this makes some assumptions about the types of things that human minds want, and assumes their goals can be fulfilled. And in fact, despite economic assumptions like non-satiating preferences - where more of some item is always at least slightly better - this doesn’t describe the reality of human desires well. Humans are irrational in a variety of ways.

On the other hand, when we talk about AI agents, we want them to be economically efficient. An AI agent which trades candy bars for money and loses everything is obviously a worse agent than one which does not. Efficiency requires that the systems be rational. But if they have non-satiating preferences for some concrete thing, they are more likely to be unsafe maximizers.

And the idea of nonsatiating preferences is related to rationality, in some sense. For example, if an agent which really likes paperclips tried to maximize cyclical preferences, there’s some chance it wouldn’t then fill the universe with paperclips, and could instead find infinite satisfaction switching between having an apple, then a paperclip, then a baseball hat, then an apple again. (This is obviously ridiculous, but hopefully conveys the intuition that if paperclips aren’t actually preferred to everything else, maximizing them might not be its goal - and at the same time, if it uses at least this particular avenue to avoid a goal that can be infinitely satisfied, it’s irrational and inefficient.) Incidentally, this provides a useful framing for motivating Yudkowskian paperclip-maximizers; any preference set that has a maximal element would prefer to increase that element without bound, even if there are other items they desire.

In a limited setting, we can imagine a game-playing agent, with a very clear set of possible actions. A relatively simple case is tic-tac-toe, where there are 9 spaces, and there are 26,830 different games, up to rotation and reflection. Of course, it’s far simpler to specify that perfect play always leads to a draw, so there is a far smaller set of games that are optimal for a given player - only some small fraction of the set of possible moves is optimal.
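
For tic-tac-toe this can be checked exhaustively. The following is a minimal minimax sketch, with a board representation of my own choosing (a 9-character string, `.` for empty squares); it computes the game value under perfect play and lists which moves in a position preserve that value.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: str):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board: str, player: str) -> int:
    """Value with `player` to move, from X's view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], nxt)
               for i, cell in enumerate(board) if cell == "."]
    return max(results) if player == "X" else min(results)

def optimal_moves(board: str, player: str) -> list[int]:
    """Squares that keep the game-theoretic value of the position."""
    best = value(board, player)
    nxt = "O" if player == "X" else "X"
    return [i for i, cell in enumerate(board)
            if cell == "." and value(board[:i] + player + board[i + 1:], nxt) == best]

print(value("." * 9, "X"))               # value of the empty board
print(optimal_moves("X" + "." * 8, "O"))  # replies to a corner opening
```

On the empty board every opening move still draws, but after a corner opening only the center reply avoids a forced loss, so the fraction of optimal moves depends heavily on the position.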

Slightly more generally, we can again talk about an agent that can play chess. To formalize this a bit more, we will imagine that it can make any move on the chess board that involves moving any piece from one square to another. Most of these moves aren’t legal, so we can add a rule that deals with this - for example, if you make an illegal move, this means you concede the game^[5].

Given this narrow setup, we can ask what proportion of these actions are rational - but because chess is a theoretically solvable game, at each point in time there is one game-theoretic optimal move (or at most a few, if there are multiple paths with the same eventual forced outcome). Claude Shannon famously estimated that there are 30 possible “reasonable” moves at each step of a 40-move chess game^[6], so at each point, the proportion of moves that are rational is around 1 in 30, and assuming agents are defined by the move they take at each point, we end up finding that around 10^-60 of the possible agents are rational. But in our setup, the action space is far larger: the player starts with 16 pieces, each able to move to any of the 63 other squares, so there are 16 x 63 = 1,008 candidate first moves, and the proportion of agents which are rational is astronomically smaller.
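
The arithmetic behind these estimates is easy to reproduce. The sketch below works in log10 to avoid underflow. It assumes exactly one optimal move per position, and, for the larger action space, counts (piece, destination) pairs as 16 x 63 and keeps that count fixed for the whole game, which is a simplification of mine since pieces get captured.

```python
from math import log10

REASONABLE_MOVES = 30   # Shannon's per-position estimate, as cited above
GAME_LENGTH = 40        # moves per game, as cited above

# log10 of the fraction of agents that pick the one optimal move every time
log10_shannon = -GAME_LENGTH * log10(REASONABLE_MOVES)

# Any of 16 pieces to any of the 63 other squares (assumed constant all game)
candidate_moves = 16 * 63
log10_setup = -GAME_LENGTH * log10(candidate_moves)

print(f"Shannon-style estimate: about 10^{log10_shannon:.0f}")
print(f"Any-piece-to-any-square setup: about 10^{log10_setup:.0f}")
```

Either way the proportion is indistinguishable from zero; the larger action space only makes it more so.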

So, what are the odds that an arbitrary complex system is rational in pursuing a given goal we specify? Once again, approximately zero^[7].

On the other hand, what are the odds that an arbitrary complex system is pursuing some coherent outcome? In our setup, almost all possible agents lose the game (against a perfect opponent,) so if we consider that a goal, almost all agents are doing exactly what they “want” as judged in a post-hoc fashion. And even in this meaning, all but a very small finite number are pursuing this goal suboptimally, since a chess program that always outputs an illegal move (say, pawn moves from the furthest back rank to some other position,) is maximally simple at achieving that goal, and almost all other programs which achieve the same outcome are slower and longer. But in both this last case and the general case, choosing a goal post hoc based on what a program does probably isn’t what we mean by rationally pursuing a goal.

What is the space of agents?

When thinking about things like “the set of optimizers” or “the set of possible economic actors,” it’s unclear how to measure it. In the case of tic-tac-toe, we could easily have said it’s the set of agents who don’t suck, because optimal play just isn’t that hard. The reason most agents in the space sucked is because we assumed it, not because making rational agents is hard. The same could be argued about our economic agent; we allowed arbitrary lists, then triumphantly claimed that most preferences were irrational. We could easily have said that the preferences to consider are orderings over the set of items, which rules out irrational orderings by definition, and we would have concluded that all agents were rational.

In the case of chess, we also made a simplifying assumption, which was that all pieces could be moved anywhere, and almost all moves were obviously bad, because they result in a forfeit - so again, the setup assumption unfairly biased the space. But here, we can’t say that we could just as easily have assumed that the chess agent plays perfectly; we know that it’s computationally infeasible to solve chess. So even without “cheating” by picking an action space that’s biased, we can refer to Shannon’s estimate, and point out that very few agents are rational.

Considering a slightly more general case than chess, we have even bigger problems. Making a bot to play StarCraft requires a very large space of possible moves. It needs to not only be able to move any unit to any point on the board, but also to do things like build additional units, or group units so they can be moved together. A simple operationalization could be “click anywhere on the screen,” so that on a screen that is 1024x768, there are 786,432 possibilities each time-increment. (And this doesn’t allow grouping units, which requires clicking and dragging, nor scrolling the screen by moving the mouse to the edge without clicking. It also still doesn’t include the action of waiting longer and thinking.) And it’s clear that essentially zero percent of random agents are rational.

But this doesn’t get us any closer to what we really want to know, which is about the space of minds, or at least agents, and how it is distributed - especially because we don’t only care about the full space, we care about the space of likely or plausible agents.

One suggestion is that we might do better by thinking about the set of agents that are outcomes of optimization processes. ML models are trained based on some scoring function, and so, if trained, they generally score well. Of course, this is absolutely textbook Goodharting; we’re confusing the easy-to-specify metric used for training with the actual goal. But avoiding that, we can consider how this might work with various actual approaches; the set of chess playing bots output by an LLM trained on chess games is definitely superior at playing chess compared to a random agent from our earlier definition, in that it probably wins a non-zero portion of games.

Doing something like this formally involves formalizing large parts of learning theory, which is a great goal anyways, but requires a lot more math than I’m comfortable with, so I’ll just mention a few other ideas.

What’s needed?

It would be really helpful, in this context, to start defining some mathematical constructs around our ideas of what agents are and what they do.

One obvious option is to define what the difference between two agents is. If we can figure out how to make such a function that maps to real numbers, it induces a metric space, so that we have some notion of distance between agents. To start, this would allow us to talk about the density of specific classes of agents, and formalize the question we started with - what proportion of all agents are rational, for some notion of rational^[8]. It would have a bunch of other really great properties, though! For example, we could ask how far the nearest rational agent is to a given agent, which formalizes coherent extrapolated volition.

Unfortunately, we need a pretty good notion of a metric for this to make sense^[9] - it’s easy to come up with a metric that works poorly. We could use the trivial metric, where agents are distance zero if they are the same, and distance one if they are different. Or we could use the score of an MDP agent, but this makes large classes of very different agents identical, and the distance function is kind of trivially useless for looking at how the agents act.

It would be especially helpful if the metric made sense for how reinforcement learning agents learn. Another direction would be to quantify how “agentic” a given agent is. One attempt to do this is outlined by Kosoy, but it is an open research direction. Another attempt I’ve been thinking about is metrics over economic preference sets, but that is even more preliminary^[10].

Musings on future directions for mathematical formalisms for minds-in-general

How many meaningfully distinct “agents” exist?
- Are there countably infinite agents? Uncountably infinite?
- Are they meaningfully constrained by the physical universe?
- Is there a useful distance measure?
  - Is the space complete? (If finite, yes.)
  - Is there an accompanying useful definition or metric for rationality for general agents?
  - For MDP agents, arbitrary policies do not optimize rewards; what is a useful measure for this?
  - How does rationality, as defined for these agents, relate to performance in game-theoretic settings?
  - Given a distance measure between minds, what is the density of rational minds in that space?
- Can we formalize agents relative to their learning processes?
  - How do the set of agents trainable by a given process relate to the space of agents-in-general?
  - What is the relationship between training and training loss to distance in this space?
- How do these inform safety of agents?

^[1]
We could include partial orderings, so that we have some sets of things that are incomparable. In this case, we eliminate that by assumption by saying the agent has preferences between all items, so the preference set must guide decisions trading between incomparable items - it cannot refuse to trade. However, even partial orderings can have irrationalities, and it seems clear that the proportion of the total possible semi-orderings which are “rational” is larger but still minuscule as the number of items grows.
^[2]
This also assumes each item is both atomic, and not combined. If we can combine items, we need to represent these in the preference ordering, and if we need to represent fractional items or multiple items, the argument gets more complex - but will not change the fact that almost all possible preferences for large numbers of goods create these money pumps.
^[3]
We don’t technically assume much more than this in that model, since the generic formulation could have multiple of a given item or each possible combination of items listed separately as a preference, which takes care of many objections.
^[4]
If I understand correctly, this follows trivially from Chaitin's argument on the Omega number, and the noncomputability of Chaitin’s constant.
^[5]
These are fine assumptions for a very basic chess agent, but we’d probably do better with a slightly larger action space. The agent can’t decide, for example, to spend 30 seconds computing its next move instead of 5.
^[6]
In our setup, per the previous footnote, most moves are obviously not optimal, because they are illegal.
^[7]
I suspect that this result will generalize for most MDPs, since almost all policies for almost all MDPs do not optimize rewards - though I haven’t proved this.
^[8]
Vanessa pointed out that we just need a measure, not a metric.
^[9]
Vanessa Kosoy has speculated that some complexity measure of bisimulation might be useful; I don’t understand this well enough to know what that would mean.
^[10]
For economic agents, we could define the distance between two ordinal rankings over a finite set, similar to Spearman's Footrule, by finding the number of elements ranking above each item, considering all elements in a cycle as above the other elements in the cycle. For two different preference sets over the items, we can then sum the absolute differences between these counts for each element in the two rankings. (Note: this is a valid metric, since a sum of absolute values is always non-negative and symmetric, identical orders always have distance zero, and a bit of work shows the triangle inequality holds.) This also has a nice property that for irrational preferences, any resolution of a cycle of some elements by breaking the cycle is equidistant from the preference set containing the cycle, and the minimum distance between two distinct rankings is if two items are switched.
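
One possible reading of this distance, as a rough sketch: preferences are a set of (better, worse) pairs, and an item counts as "above" another if it is preferred to it through any chain of preferences, which is how I have interpreted treating everything in a cycle as above everything else in that cycle. The function names and the representation are assumptions, not a worked-out definition.

```python
from itertools import permutations

def above_counts(items, prefers):
    """For each item, count the items that rank above it.

    `prefers` is a set of (a, b) pairs meaning a is preferred to b. Following
    chains of preferences means members of a cycle all sit above one another.
    """
    counts = {}
    for x in items:
        seen, stack = set(), [x]
        while stack:
            cur = stack.pop()
            for a, b in prefers:
                if b == cur and a != x and a not in seen:
                    seen.add(a)
                    stack.append(a)
        counts[x] = len(seen)
    return counts

def distance(items, prefs_1, prefs_2):
    """Footrule-style distance: sum of absolute differences in above-counts."""
    c1, c2 = above_counts(items, prefs_1), above_counts(items, prefs_2)
    return sum(abs(c1[x] - c2[x]) for x in items)

def ranking_to_prefs(order):
    """Turn a ranking (best first) into the full set of pairwise preferences."""
    return {(order[i], order[j])
            for i in range(len(order)) for j in range(i + 1, len(order))}

items = ["A", "B", "C"]
cycle = {("A", "B"), ("B", "C"), ("C", "A")}
for order in permutations(items):
    print(order, distance(items, cycle, ranking_to_prefs(order)))
```

Under this reading, every ranking of three items is the same distance from the three-item cycle, and swapping two adjacent items in a ranking gives distance 2.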

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T23:14:36.300Z

We earlier mentioned that it is required that the finite mapping be precomputed. If it is for arbitrary Turing machines, including those that don't halt, we need infinite time, so the claim that we can map to arbitrary Turing machines fails. If we restrict it to those which halt, we need to check that before providing the map, which requires solving the halting problem to provide the map.

Edit to add: I'm confused why this is getting "disagree" votes - can someone explain why or how this is an incorrect logical step, or

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T21:40:56.055Z

OK, so this is helpful, but if I understood you correctly, I think it's assuming too much about the setup. For #1, in the examples we're discussing, the states of the object aren't predictably changing in complex ways - just that it will change "states" in ways that can be predicted to follow a specific path, which can be mapped to some set of states. The states are arbitrary, and per the argument don't vary in some way that does any work - and so as I argued, they can be mapped to some set of consecutive integers. But this means that the actions of the physical object are predetermined in the mapping.

And the difference between that situation and the CNS is that we know the neural circuitry is doing work - the exact features are complex and only partly understood, but the result is clearly capable of doing computation in the sense of Turing machines.

Comment by Davidmanheim on Language Models are a Potentially Safe Path to Human-Level AGI

davidmanheim — 2024-12-09T16:58:41.306Z

I think this was a valuable post, albeit ending up somewhat incorrect about whether LLMs would be agentic - not because they developed the capacity on their own, but because people intentionally built and are building structure around LLMs to enable agency. That said, the underlying point stands - it is very possible that LLMs could be a safe foundation for non-agentic AI, and many research groups are pursuing that today.

Comment by Davidmanheim on Five Worlds of AI (by Scott Aaronson and Boaz Barak)

davidmanheim — 2024-12-09T16:55:26.651Z

The blogpost this points to was an important contribution at the time, more clearly laying out extreme cases for the future. (The replies there were also particularly valuable.)

Comment by Davidmanheim on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

davidmanheim — 2024-12-09T16:45:32.863Z

I think this post makes an important and still neglected claim that people should write their work more clearly and get it published in academia, instead of embracing the norms of the narrower community they interact with. There has been significant movement in this direction in the past 2 years, and I think this post marks a critical change in what the community suggests and values in terms of output.

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T16:29:45.446Z

"the actual thinking-action that the mapping interprets"

I don't think this is conceptually correct. Looking at the chess playing waterfall that Aaronson discusses, the mapping itself is doing all of the computation. The fact that the mapping ran in the past doesn't change the fact that it's the location of the computation, any more than the fact that it takes milliseconds for my nerve impulses to reach my fingers means that my fingers are doing the thinking in writing this essay. (Though given the typos you found, it would be convenient to blame them.)

they assume ad arguendo that you can instantiate the computations we're interested in (consciousness) in a headful of meat, and then try to show that if this is the case, many other finite collections of matter ought to be able to do the job just as well.

Yes, they assume that whatever runs the algorithm is experiencing running the algorithm from the inside. And yes, many specific finite systems can do so - namely, GPUs and CPUs, as well as the wetware in our head. But without the claim that arbitrary items can do these computations, it seems that the arguendo is saying nothing different than the conclusion - right?

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T16:23:23.790Z

Looks like I messed up cutting and pasting - thanks!

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T14:33:36.233Z

Thanks - fixed!

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T14:31:39.493Z

Yeah, perhaps refuting is too strong given that the central claim is that we can't know what is and is not doing computation - which I think is wrong, but requires a more nuanced discussion. However, the narrow claims they made inter alia were strong enough to refute, specifically by showing that their claims are equivalent to saying the integers are doing arbitrary computation - when making the claim itself requires the computation to take place elsewhere!

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-09T13:52:43.772Z

Seems worth noting that the claim of most of the philosophers being cited here is (1) - that even rocks are doing the same computation as minds.

Comment by Davidmanheim on Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T13:49:05.415Z

I agree that this wasn't intended as an introduction to the topic. For that, I will once again recommend Scott Aaronson's excellent mini-book explaining computational complexity to philosophers.

I agree that the post isn't a definition of what computation is - but I don't need to be able to define fire to be able to point out something that definitely isn't on fire! So I don't really understand your claim. I agree that it's objectively hard to interpret computation, but it's not at all hard to interpret the fact that the integers are less complex and doing less complex computation than, say, an exponential-time Turing machine - and given the specific arguments being made, neither is a wall or a bag of popcorn. Which, as I just responded to the linked comment, was how I understood the position being taken by Searle, Putnam, and Johnson. (And even this ignores that one implication of the difference in complexity is that the wall / bag of popcorn / whatever is not mappable to arbitrary computations, since the number of steps required for a computation may not be finite!)

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-09T08:24:56.033Z

I've written my point more clearly here: https://www.lesswrong.com/posts/zxLbepy29tPg8qMnw/refuting-searle-s-wall-putnam-s-rock-and-johnson-s-popcorn

Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

davidmanheim — 2024-12-09T08:24:26.594Z

In a recent essay, Euan McLean suggested that a cluster of thought experiments “viscerally capture” part of the argument against computational functionalism. Without presenting an opinion about the underlying claim about consciousness, I will explain why these arguments fail as a matter of computational complexity. Which, parenthetically, is something that philosophers should care about.

To explain the question, McLean summarizes part of Brian Tomasik’s essay "How to Interpret a Physical System as a Mind." There, Tomasik discusses the challenge of attributing consciousness to physical systems, drawing on Hilary Putnam's "Putnam's Rock" thought experiment. Putnam suggests that any physical system, such as a rock, can be interpreted as implementing any computation. This is meant to challenge computational functionalism: if computation alone defines consciousness, then even a rock could be considered conscious.

Tomasik refers to Paul Almond’s (attempted) refutation of the idea, which says that a single electron could be said to implement arbitrary computation in the same way. Tomasik "does not buy" this argument, but I think a related argument succeeds. That is, a finite list of consecutive integers can be used to 'implement' any Turing machine using the same logic as Putnam’s rock. Each step N of the machine's execution corresponds directly to integer N in the list. But this mapping is trivial, doing no more than listing the steps of the computation.

It might seem that the above proves too much. Perhaps every mapping requires doing the computation to construct? This is untrue, as the notion of a reduction in computational complexity makes clear. That is, we can build a “simple” mapping, relative to the complexity of the Turing machine itself, and this succeeds in showing that the system is actually performing arbitrary computations - both the system performing computations and the one being mapped from. Rocks and integers cannot, since any mapping must be as complex as the original Turing machine.

Does the mapping to rocks or integers do anything at all? No. Crucially, the mappings to rocks or integers require the computation to be performed elsewhere to generate the mapping. Without the computation occurring externally, the mapping cannot be constructed, and thus, it is misleading to claim that the computation happens 'in' the rock or the integers. Further, the ability to 'map' Turing machine states to integers implies that we have solved the halting problem — a logical impossibility. But even if we can guarantee the machine halts, the core issue remains: constructing the mapping requires external computation, refuting the idea that the computation occurs in the rock.
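The point about where the computation happens can be sketched in a few lines. This is a minimal illustration under stated assumptions, not anything from the essays discussed: `step` is a hypothetical toy transition rule standing in for a Turing machine, and `build_mapping` and `read_off` are hypothetical helper names. The table from consecutive integers to machine states can only be filled in by running the machine; the integers contribute nothing but an index.

```python
from typing import Callable, Dict


def step(state: int) -> int:
    """Toy stand-in for one machine transition (an assumption for illustration)."""
    return state // 2 if state % 2 == 0 else 3 * state + 1


def build_mapping(initial: int, n_steps: int,
                  transition: Callable[[int], int]) -> Dict[int, int]:
    """Map each integer N to the machine's state after N steps.

    Every entry is produced by actually calling `transition`, so the
    integers 0..n_steps are only labels for results computed here.
    """
    mapping = {0: initial}
    state = initial
    for n in range(1, n_steps + 1):
        state = transition(state)  # the real computation happens on this line
        mapping[n] = state
    return mapping


def read_off(mapping: Dict[int, int], n: int) -> int:
    """'Interpreting' integer N as step N of the computation is a table lookup."""
    return mapping[n]


if __name__ == "__main__":
    table = build_mapping(initial=6, n_steps=8, transition=step)
    print(table)
```

Note that `build_mapping` also takes `n_steps` as an input: the sketch sidesteps the halting question by fixing the number of steps in advance, which is exactly the guarantee the integers cannot supply on their own.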

Comment by Davidmanheim on Detection of Asymptomatically Spreading Pathogens

davidmanheim — 2024-12-06T11:06:33.182Z

I think 'we estimate... to be'

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-05T04:33:34.784Z

Your/Aaronson's claim is that only the fully connected, sensibly interacting calculation matters.

Not at all. I'm not making any claim about what matters or counts here, just pointing out a confusion in the claims that were made here and by many philosophers who discussed the topic.

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-04T17:11:19.965Z

You disagree with Aaronson that the location of the complexity is in the interpreter, or you disagree that it matters?

In the first case, I'll defer to him as the expert. But in the second, the complexity is an internal property of the system! (And it's a property in a sense stronger than almost anything we talk about in philosophy; it's not just a property of the world around us, because as Gödel and others showed, complexity is a necessary fact about the nature of mathematics!)

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-04T17:07:42.109Z

Yeah, something like that. See my response to Euan in the other reply to my post.

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-04T17:06:50.193Z

Yes, and no, it does not boil down to Chalmers's argument (as Aaronson makes clear in the paragraph before the one you quote, where he cites the Chalmers argument!). The argument from complexity is about the nature and complexity of systems capable of playing chess - which is why I think you need to carefully read the entire piece and think about what it says.

But as a small rejoinder, if we're talking about playing a single game, the entire argument is ridiculous; I can write the entire "algorithm" in a kilobyte of specific instructions. So it's not that an algorithm must be capable of playing multiple counterfactual games to qualify, or that counterfactuals are required for moral weight - it's that the argument hinges on a misunderstanding of how complex different classes of system need to be to do the things they do.

PS. Apologies that the original response comes off as combative - I really think this discussion is important, and wanted to engage to correct an important point, but have very little time to do so at the moment!

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-04T07:22:28.904Z

As with OP, I strongly recommend Aaronson, who explains why waterfalls aren't doing computation in ways that refute the rock example you discuss: https://www.scottaaronson.com/papers/philos.pdf

Comment by Davidmanheim on Do simulacra dream of digital sheep?

davidmanheim — 2024-12-04T07:19:44.741Z

You seem to fundamentally misunderstand computation, in ways similar to Searle. I can't engage deeply, but recommend Scott Aaronson's primer on computational complexity: https://www.scottaaronson.com/papers/philos.pdf

Comment by Davidmanheim on Is the mind a program?

davidmanheim — 2024-12-04T07:17:52.493Z

You seem deeply confused about computation, in ways similar to Searle et al. I cannot engage deeply on this at present, but recommend Aaronson's primer on the topic: https://www.scottaaronson.com/papers/philos.pdf

Comment by Davidmanheim on Hierarchical Agency: A Missing Piece in AI Alignment

davidmanheim — 2024-12-02T12:16:48.553Z

Norms can accomplish this as well - I wrote about this a couple weeks ago.

Comment by Davidmanheim on Hierarchical Agency: A Missing Piece in AI Alignment

davidmanheim — 2024-12-02T12:01:41.751Z

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)

The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.

It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has called "Frugal Compositional Languages" - and see this work by @alcatal - though I understand neither is yet addressing multi-agent world models, nor is either explicitly about modeling the agents themselves in a compositional / embedded agent way, though those are presumably desiderata.

Comment by Davidmanheim on Mitigating Geomagnetic Storm and EMP Risks to the Electrical Grid (Shallow Dive)

davidmanheim — 2024-11-28T21:43:02.681Z

That is an interesting question, but I unfortunately do not know enough to even figure out how to answer it.

Comment by Davidmanheim on Mitigating Geomagnetic Storm and EMP Risks to the Electrical Grid (Shallow Dive)

davidmanheim — 2024-11-27T06:51:59.400Z

Good points. Yes, storage definitely helps, and microgrids are generally able to have some storage, if only to smooth out variation in power generation for local use. But solar storms can last days, even if a large long-lasting event is very, very unlikely. And it's definitely true that if large facilities have storage, shutdowns will have reduced impact - but I understand that the transformers are used for power transmission, so having local storage at the large generators won't change the need to shut down the transformers used for sending that power to consumers.

Comment by Davidmanheim on (Salt) Water Gargling as an Antiviral

davidmanheim — 2024-11-27T06:02:23.353Z

Do I understand correctly that the blue-green graph has a y-axis that goes above 100% median reduction, with error bars in that range? (This would happen if they estimated a proportion as a standard variable - not great practice, but I want to check that it is what happened.)
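To illustrate how error bars on a proportion can end up above 100%, here is a minimal sketch. It assumes the intervals were built with a normal approximation, which is only my guess at what happened, and the inputs (a 95% observed proportion from 20 observations) are made-up numbers, not values from the study. The function names are hypothetical. The Wilson score interval is shown for contrast because it stays inside [0, 1].

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval: treats the proportion as an unbounded variable."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval: respects the [0, 1] bounds of a proportion."""
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not taken from the paper).
    print("Wald:  ", wald_interval(0.95, 20))    # upper bound exceeds 1.0
    print("Wilson:", wilson_interval(0.95, 20))  # upper bound stays below 1.0
```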

Mitigating Geomagnetic Storm and EMP Risks to the Electrical Grid (Shallow Dive)

davidmanheim — 2024-11-26T08:00:04.810Z

Executive Summary

This initial investigation begins to examine strategies to mitigate and respond to risks posed by high-impact geomagnetic events, which can severely damage electrical infrastructure. This is split into four sections. The first three are ordered from least promising to most promising targets for investment: recovery, adaptation, and withstanding events. Finally, risk reduction is noted as a highly speculative but high value-of-information area for research.

Recovery approaches involve replacing broken equipment and infrastructure after an event, and focus on the logistical and financial challenges of replacing key infrastructure, notably the costly and rare Ultra-High Voltage (UHV) transformers. Adaptation strategies, including the use of Ground-Induced Current (GIC) blocking devices, are identified as viable and potentially partially adopted means for system operators to prevent damage and reduce restoration costs. Next, withstanding involves protocols for what power systems can do during geomagnetic events. This emphasizes proactive grid shutdowns and sectional isolation, which work by leveraging the short warning period before a CME reaches Earth, or plausibly in advance of nuclear war. These are very promising, but should be pursued by governments and industry. Finally, risk reduction options are currently limited, but potential exists for longer term, highly speculative approaches for minimizing geomagnetic vulnerabilities at the global level.

Specific high-impact or very impactful mitigations are not explored, but because private companies have some incentive to address the risk, policy approaches involving insurance and regulation are noted. Further work could address this, and will be outlined in the conclusion.

Background

The electrical grid is critical infrastructure, and if electrical systems were destroyed at a national or global level, it could plausibly be or lead to a global catastrophe, especially given the fragility and interconnectedness of other systems. A brief (8 minute) video overview from Kurzgesagt from 2020 explains the risk of solar storms. Given that the lack of sufficient backup transformers was recently highlighted by J.D. Vance on Joe Rogan’s podcast, I wanted to double check my current understanding of the risk and mitigations available. Rather than focusing on risk estimation, which has been done before, I’ll provide a very brief summary of the risk, then focus on risk mitigations, and highlight what is possible, and what has been done.

In 2015, David Roodman wrote an in-depth, 56-page investigation into solar storms for Open Philanthropy, concluding “the probability of catastrophe is well under 1% per decade, but is nevertheless uncertain enough, given the immense stakes, to warrant more serious attention.” Note that this was limited to solar storms, not electromagnetic pulses from intentional acts, which would adversarially target the weakest or most vulnerable aspects of a system - but would be at a national, rather than global, level. (And given that they would require nuclear attacks, would be part of broader nuclear war risk, which is not the current focus.) Roodman’s investigation was limited to probability and possible impact, not mitigations, so I view the current (much shallower) investigation as a continuation or extension of that.

This is not to say that this is novel - a number of exercises have been done, including international ones. One such report is here, and it notes that “Some countries have successfully hardened their transmission grids to space-weather impact and sustained relatively little or no damage due to currents induced by past moderate space-weather events,” but “the vulnerability of the power grid with respect to Carrington-type events is less conclusive.”

What can be done?

At a very, very high level, based on previous work I did (in a different domain), there are three different ways to make a system more resilient: withstand, adapt, and recover. There is also risk reduction, which can be critical, and prior to resilience. I have not reviewed legislation on the topic, but my understanding is that there hasn’t been progress. (Note that I have not reviewed the NDAA for past years or infrastructure bills to see if they include relevant provisions.)

Recover

I’ll address recovery first, since it has received the most prior attention - notably reflected in Vance’s suggestion that we need to have backup transformers. CMEs would damage transformers by inducing current in long-distance wires, which then damages the transformers. Recovery from failure would require rebuilding whatever portion of the grid was destroyed. Replacing the entire US electrical grid could cost $5 trillion (USD, 2017) per Joshua D. Rhodes, a UT Austin Research Scientist, but this estimate includes replacing the power plants themselves, which would not be destroyed in our scenario. The transformers, which are at high risk, would cost a “mere” $600b in current dollars, and the largest ones are more likely to be destroyed in an event. This analysis presumably overestimates actual costs if the system were replaced more intelligently, but more critically, it understates the cost and ignores the likely impossibility of doing so quickly if it needs to be done in an emergency scenario.

The components most at risk from even a moderate event are Ultra-High Voltage transformers. These are very, very expensive ($100m for the Three Mile Island transformer!) and relatively few exist. On the other hand, China’s largest transmission line evidently uses 28 of them. (Each is rated for about as much power as the 3MI plant.) I don’t have a breakdown of transformers in the US electrical grid into ultra-high versus high versus relatively smaller units, nor is it clear to me what proportion would be at risk in various sized events. However, larger events would create additional risks, including destroying smaller transformers. Less likely, high-voltage power lines could be badly damaged if there was a very extreme space weather event - I am uncertain if this is a significant risk, and would require further analysis.

Adapt

A number of approaches exist to adapt to this risk. First, there are existing design considerations which reduce vulnerability. Further work could enhance the ability of the grid to adapt. Roodman did a background research interview which noted “ground-induced current (GIC) blocking devices are the best option for protecting against the threat to the grid posed by geomagnetic storms,” and “installing GIC blocking devices in transformers around the US would cost one billion dollars.” This is in contrast to the earlier hundreds of billions for replacing some or all of the transformers. Another approach is GIC-resistant transformer design; it is unclear to what extent this occurs, but requiring future transformers to have such designs, or incentivising it (perhaps via insurers, who cover the risk), could be a useful policy intervention.
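As a rough worked comparison of the figures quoted above, here is a small sketch. The three dollar amounts are the ones cited in this post; note they come from different sources and dollar-years, so the ratios are order-of-magnitude only. The variable and function names are my own, for illustration.

```python
# Figures as quoted in the post (different sources and dollar-years,
# so treat the results as order-of-magnitude comparisons only).
FULL_GRID_REPLACEMENT_USD = 5_000_000_000_000   # "$5 trillion (USD, 2017)"
TRANSFORMER_REPLACEMENT_USD = 600_000_000_000   # "$600b in current dollars"
GIC_BLOCKING_DEVICES_USD = 1_000_000_000        # "one billion dollars"


def cost_ratio(expensive_usd: int, cheap_usd: int) -> float:
    """How many times more the expensive option costs than the cheap one."""
    return expensive_usd / cheap_usd


if __name__ == "__main__":
    print("Transformer replacement vs. GIC blocking:",
          cost_ratio(TRANSFORMER_REPLACEMENT_USD, GIC_BLOCKING_DEVICES_USD))
    print("Full grid replacement vs. GIC blocking:",
          cost_ratio(FULL_GRID_REPLACEMENT_USD, GIC_BLOCKING_DEVICES_USD))
```

On these numbers, blocking devices cost roughly one six-hundredth of replacing the transformers, before accounting for how quickly either could actually be done.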

There are also systems for sharing the (limited) stock of replacement transformers, so that moderate levels of transformer failure can be addressed. This exists within the United States, but almost all transformers are built internationally, so that replacing supply during a more severe global event, when other countries will prioritize their own recovery, seems infeasible. I have not looked at whether international cooperation has been explored, or whether other countries have similar plans.

Switching to smaller scale microgrids could reduce the impact of certain risks, so that the ongoing transition to local solar is a plausibly significant trend - if these systems can themselves withstand damage. I am uncertain about the robustness of these systems to large solar storms, which may be critical, but they should at least have less exposure to the induced current than transformers connected to long-distance transmission lines.

Withstand

Withstanding an event would require that the electrical system not fail, or fail to a lesser extent, during an event. Thankfully, we have hours of warning for solar storms, and there is significant data collection and research on the impacts on the power system. Roodman highlighted that storms seem to damage transformers slowly, rather than causing immediate failure - but larger events would presumably cause more immediate damage. To prevent that, a number of short-term adaptations would allow power systems to proactively shut down or isolate sections of the grid to minimize damage. There is work on this (including internationally), though it is unclear to me to what extent such methods have been adopted. If such actions are undertaken, failures could be minimized and localized, making recovery easier, or reducing the extent to which adaptation is needed.

Risk Reduction

Risk reduction approaches include prevention, and reducing hazard^[1]. Prevention is often a better approach, but in this domain we aren’t (currently) able to change the likelihood of Coronal Mass Ejections, nor is preventing nuclear war in scope for this writeup.

Hazard reduction is in theory possible, but it is unclear how tractable it is. Most critically, a weakening geomagnetic field would increase the hazard experienced by the grid. Current weakening is probably a precursor to a flip, which will happen in the coming couple centuries. It is unclear to me, but during such a flip, there would be greatly increased vulnerability to solar storms. Preventing a flip seems infeasible at present, and the risks when it occurs are critical; this seems to argue for more investment in other mitigations, but also more research.

Somewhat related, initial analysis and speculation, which have been questioned, indicate that building megaconstellations like Starlink could exacerbate the risk. Ensuring the Earth’s geomagnetic field isn’t (further) weakened is a plausible risk-reduction mitigation, and is worthy of some attention. This could reduce the amount of damage that solar storms would do. Additional medium-dive investigation into the hazard from a flip, and from satellites, and whether these can be feasibly mitigated, seems valuable, at the very least to better understand how valuable other mitigation pathways are.

Conclusions

It seems that the “recovery” options such as backup transformers, while simple, would not prevent disruptions and are easily the least cost-effective. Highlighting the lack of backup transformers is therefore largely a red herring, even though it highlights that other methods are not fully able to address the risk.

Adapt and withstand approaches, on the other hand, are both feasible, and already pursued in research and by industry. At the same time, they are not currently adopted to an extent sufficient to withstand the most extreme events - but could plausibly be made so with the right regulatory policy and economic incentives. Research into the costs and feasibility of proactive shutdowns and grid isolation, and how it might work to complement other grid resilience measures, is high value. Similarly, it seems clear that there is room for important policy work on how to motivate such measures, and which ones are most compatible with extant regulatory and engineering requirements.

Lastly, risk reduction is the most speculative and uncertain, but because of that, further investigation would be of high value - as long as it does not replace or delay investments in adapting and withstanding the risk.

^[1]
I will consider vulnerability reduction, rather than hazard reduction, to be resilience. (I’m not going to be careful about distinguishing hazard reduction and vulnerability reduction, though they do conceptually count as risk reduction. For example, things like reducing exposure by creating microgrids reduces vulnerability, but I consider it adaptation below instead.)

Comment by Davidmanheim on Occupational Licensing Roundup #1

davidmanheim — 2024-10-31T06:01:42.891Z

Question for a lawyer: how is non-reciprocity not an interstate trade issue that federal courts can strike down?

Comment by Davidmanheim on Dialogue introduction to Singular Learning Theory

davidmanheim — 2024-10-06T13:37:10.449Z

In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we'll be able to do automated alignment of ASI, you'll still need some reliable approach to "manual" alignment of current systems. We're already far past the point where we can robustly verify LLMs' claims or reasoning outside of narrow domains like programming and math.

But on point two, I strongly agree that Agent foundations and Davidad's agendas are also worth pursuing. (And in a sane world, we should have tens or hundreds of millions of dollars in funding for each every year.) Instead, it looks like we have Davidad's ARIA funding, Jaan Tallinn and LTFF funding some agent foundations and SLT work, and that's basically it. And MIRI abandoned agent foundations, while Openphil isn't, it seems, putting money or effort into them.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-22T13:52:57.339Z

I partly disagree; steganography is only useful when it's possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.

That said, I'd be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar - it definitely seems like something that @davidad would want to consider.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-22T13:43:11.194Z

I really like that idea, and the clarity it provides, and have renamed the post to reflect it! (Sorry this was so slow - I'm travelling.)

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-22T13:40:38.440Z

That seems fair!

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-18T10:46:49.128Z

I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that's not true for proposals like safeguarded AI, where a proof must accompany the output, and it's not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-18T10:43:15.977Z

On the absolute safety, I very much like the way you put it, and will likely use that framing in the future, so thanks!

On impossibility results, there are some, and I definitely think that this is a good question, but also agree this isn't quite the right place to ask. I'd suggest talking to some of the agent foundations people for suggestions.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-16T07:48:04.320Z

I think these are all really great things that we could formalize and build guarantees around. I think some of them are already ruled out by the responsibility sensitive safety guarantees, but others certainly are not. On the other hand, I don't think that use of cars to do things that violate laws completely unrelated to vehicle behavior is in scope; similar to what I mentioned to Oliver, if what is needed in order for a system to be safe is that nothing bad can be done, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes.

But in cases where provable safety guarantees are in place, and the issues relate to car behavior - such as cars causing damage, blocking roads, or being redirected away from the intended destination - I think hardware guarantees on the system, combined with software guarantees, combined with verifying that only trusted code is being run, could be used to ignition-lock cars which have been subverted.

And I think that in the remainder of cases, where cars are being used for dangerous or illegal purposes, we need to trade off freedom and safety. I certainly don't want AI systems which can conspire to break the law - and in most cases, I expect that this is something LLMs can already detect - but I also don't want a car which will not run if it determines that a passenger is guilty of some unrelated crime like theft. But for things like "deliver explosives or disperse pathogens," I think vehicle safety is the wrong path to preventing dangerous behavior; it seems far more reasonable to have separate systems that detect terrorism, and separate types of guarantees to ensure LLMs don't enable that type of behavior.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-16T07:32:45.697Z

Yes, after saying it was about what they need "to do not to cause accidents" and that "any accidents which could occur will be attributable to other cars actions," which I then added caveats to regarding pedestrians, I said "will only have accidents" when I should have said "will only cause accidents." I have fixed that with another edit. But I think you're confused about what I'm trying to show.

Principally, I think you are wrong about what needs to be shown here for safety in the sense I outlined, or are trying to say that the sense I outlined doesn't lead to something I don't claim. If what is needed in order for a system to be safe is that no damage will be caused in situations which involve the system, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes. My claim, on the other hand, is that in sociotechnological systems, the way that safety is achieved is by creating guarantees that each actor - human or AI - behaves according to rules that minimize foreseeable dangers. That would include safeguards for stupid, malicious, or dangerous human actions, much like human systems have laws about dangerous actions. However, in a domain like driving, in the same way that it's impossible for human drivers to both get where they are going, and never hit pedestrians who act erratically and jump out from behind obstacles into the road with an oncoming car, a safe autonomous vehicle wouldn't be expected to solve every possible case of human misbehavior - just to drive responsibly.

More specifically, you make the claim that "as far as I can tell it would totally be compatible with a car driving extremely recklessly in a pedestrian environment due to making assumptions about pedestrian behavior that are not accurate." The paper, on the other hand, says "For example, in a typical residential street, a pedestrian has the priority over the vehicles, and it follows that vehicles must yield and be cautious with respect to pedestrians," and formalizes this with statements like "a vehicle must be in a kinematic state such that if it will apply a proper response (acceleration for ρ seconds and when braking) it will remain outside of a ball of radius 50cm around the pedestrian."
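To make the quoted condition concrete, here is a minimal one-dimensional sketch. It is not the paper's formalization: it assumes a stationary pedestrian directly ahead, straight-line motion, worst-case acceleration during the response time ρ followed by constant braking, and every parameter name and value below is a hypothetical chosen for illustration.

```python
def stopping_distance(v: float, rho: float, a_max: float, b_min: float) -> float:
    """Distance covered if the car accelerates at a_max for rho seconds, then brakes at b_min.

    v in m/s, rho in s, a_max and b_min in m/s^2 (all assumed inputs).
    """
    response_dist = v * rho + 0.5 * a_max * rho ** 2
    v_after_response = v + a_max * rho
    braking_dist = v_after_response ** 2 / (2 * b_min)
    return response_dist + braking_dist


def keeps_clear_of_pedestrian(gap: float, v: float, rho: float,
                              a_max: float, b_min: float,
                              clearance: float = 0.5) -> bool:
    """True if the car stops while still more than `clearance` metres from the pedestrian."""
    return gap - stopping_distance(v, rho, a_max, b_min) > clearance


if __name__ == "__main__":
    # Illustrative numbers only: 10 m/s, 0.5 s response, 2 m/s^2 accel, 5 m/s^2 braking.
    print(keeps_clear_of_pedestrian(gap=20.0, v=10.0, rho=0.5, a_max=2.0, b_min=5.0))
    print(keeps_clear_of_pedestrian(gap=17.5, v=10.0, rho=0.5, a_max=2.0, b_min=5.0))
```

The useful property of a check like this is that it constrains the car's own kinematic state, which is why it says something about what the car may cause rather than about everything that might happen to it.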

I also think that it formalizes reasonable behavior for pedestrians, but I agree that it won't cover every case - pedestrians oblivious to cars that are driving in ways that are otherwise safe, who rapidly change their path to jump in front of cars, are sometimes able to be hit by those cars - but I think fault is pretty clear here. (And the paper is clear that even in those cases, the car would need to both drive safely in residential areas, and attempt to brake or avoid the pedestrian in order to avoid crashes even in cases with irresponsible and erratic humans!)

But again, as I said initially, this isn't solving the general case of AI safety, it's solving a much narrower problem. And if you wanted to make the case that this isn't enough for similar scenarios that we care about, I will strongly agree that for more capable systems, the set of situations it would need to avoid is correspondingly larger, and the set of necessary guarantees is far stronger. But as I said at the beginning, I'm not making that argument - just the much simpler one that proveability can work in physical systems, and can be applied in sociotechnological systems in ways that make sense.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-16T05:25:10.814Z

I agree that "safety in an open world cannot be proved," at least as a general claim, but disagree that this impinges on the narrow challenge of designing cars that do not cause accidents - a misunderstanding which I tried to be clear about, but which I evidently failed to make sufficiently clear, as Oliver's misunderstanding illustrates. That said, I strongly agree that better methods for representing grain of truth problems, and considering hypotheses outside those which are in the model is critical. It's a key reason I'm supporting work on infra-Bayesian approaches, which are designed explicitly to handle this class of problem. Again, it's not necessary for the very narrow challenge I think I addressed above, but I certainly agree that it's critical.

Second, I'm a huge proponent of complex system engineering approaches, and have discussed this in previous unrelated work. I certainly agree that these issues are critical, and should receive more attention - but I think it's counterproductive to try to embed difficult problems inside of addressable ones. To offer an analogy, creating provably safe code that isn't vulnerable to any known technical exploit still will not prevent social engineering attacks, but we can still accomplish the narrow goal.

If, instead of writing code that can't be fuzzed for vulnerabilities, doesn't contain buffer overflow or null-pointer vulnerabilities, and can't be exploited via transient execution CPU vulnerabilities, and isn't vulnerable to rowhammer attacks, you say that we need to address social engineering before trying to make the code provably safe, and should address social engineering with provable properties, you're sabotaging progress in a tractable area in order to apply a paradigm ill-suited to the new problem you're concerned with.

That's why, in this piece, I started by saying I wasn't proving anything general, and "I am making far narrower claims than the general ones which have been debated." I agree that the larger points are critical. But for now, I wanted to make a simpler point.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-16T05:10:09.242Z

To start at the end, you claim I "straightforwardly made an inaccurate unqualified statement," but replaced my statement about "what a car needs to do not to cause accidents" with "no accidents will take place." And I certainly agree that there is an "extremely difficult and crucial step of translating a form toy world like RSS into real world outcomes," but the toy model that the paper is dealing with is therefore one of rule-following entities, both pedestrians and cars. That's why it's not going to require accounting for "what if pedestrians do something illegal and unexpected."

Of course, I agree that this drastically limits the proof, or as I said initially, "relying on assumptions about other car behavior is a limit to provable safety," but you seem to insist that because the proof doesn't do something I never claimed it did, it's glossing over something.

That said, I agree that I did not discuss pedestrians, but as you sort-of admit, the paper does - it treats stationary pedestrians not at crosswalks, and not on sidewalks, as largely unpredictable entities that may enter the road. For example, it notes that "even if pedestrians do not have priority, if they entered the road at a safe distance, cars must brake and let them pass." But again, you're glossing over the critical assumption for the entire section, which is responsibility for accidents. And this is particularly critical; the claim is not that pedestrians and other cars cannot cause accidents, but that the safe car will not do so.

Given all of that, to get back to the beginning, your initial position was that "RSS seems miles away from anything that one could describe as a formalization of how to avoid an accident." Do you agree that it's close to "a formalization of how to avoid causing an accident"?

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-15T20:03:42.388Z

Have you reviewed the paper? (It is the first link under "The RSS Concept" in the page which was linked to before, though perhaps I should have linked to it directly.) It seems to lay out the proof, and discusses pedestrians, and deals with most of the objections you're raising, including obstructions and driving off of marked roads. I admit I have not worked through the proof in detail, but I have read through it, and my understanding is that it was accepted, and a large literature has been built that extends it.

And the objections about slippery roads and braking are the set of things I noted under "traditional engineering analysis and failure rates." I agree that guarantees are non-trivial, but they also aren't outside of what is done already in safety analysis, and there is explicit work in the literature on the issue, both from the verification and validation side, and from the perception and sensing weather conditions side.

Comment by Davidmanheim on Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-15T17:52:21.289Z

I agree that it's the most challenging part, and there are unsolved problems, but I don't share your intuition that it's in some way unsolvable, so I suspect we're thinking of very different types of things.

For RSS specifically, Rule 5 is obviously the most challenging, but it's also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring the relationship between sensor uncertainty in low visibility areas and the other rules - respecting distance and not hitting things - are enforced. Other than that, right of way rules are very simple, if the car correctly detects that the situation is one where they apply, and changing lanes is based on a very simple formula for distance, and assuming the car isn't changing lanes, during driving, in order to follow the rules, you essentially only need to restrict speed, which seems like something you can check very easily.
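To illustrate how simple such a check can be, here is a minimal sketch of the longitudinal safe-distance rule as it is usually stated in the RSS paper. The function names and every parameter value are assumptions chosen for illustration, not taken from any production implementation.

```python
def rss_min_following_distance(v_rear, v_front, response_time,
                               a_max_accel, b_min_brake, b_max_brake):
    """Minimum safe gap in meters between a rear car and the car ahead.

    Speeds are in m/s, accelerations in m/s^2, response_time in seconds.
    """
    # Worst case: the rear car accelerates for its whole response time...
    v_rear_after = v_rear + response_time * a_max_accel
    d_response = v_rear * response_time + 0.5 * a_max_accel * response_time ** 2
    # ...then brakes at only its minimum guaranteed deceleration,
    d_rear_brake = v_rear_after ** 2 / (2 * b_min_brake)
    # while the front car brakes as hard as it possibly can.
    d_front_brake = v_front ** 2 / (2 * b_max_brake)
    # A negative result means any gap is safe, so clamp at zero.
    return max(0.0, d_response + d_rear_brake - d_front_brake)


def is_following_distance_safe(gap, **params):
    """True when the current gap is at least the worst-case minimum."""
    return gap >= rss_min_following_distance(**params)
```

Because the minimum gap grows with the rear car's speed, enforcing the rule while staying in lane amounts to capping speed for the current gap, which is the kind of property a small verified monitor could check.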

Proveably Safe Self Driving Cars [Modulo Assumptions]

davidmanheim — 2024-09-15T13:58:19.472Z

I’ve seen a fair amount of skepticism about the “Provably Safe AI” paradigm, but I think detractors give it too little credit. I suspect this is largely because of idea inoculation - people have heard an undeveloped or weak man version of the idea, for example, that we can use formal methods to state our goals and prove that an AI will do that, and have already dismissed it. (Not to pick on him at all, but see my question for Scott Aaronson here.)

I will not argue that Guaranteed Safe AI solves AI safety generally, or that it could do so - I will leave that to others. Instead, I want to provide a concrete example of a near-term application, to respond to critics who say that proveability isn’t useful because it can’t be feasibly used in real world cases when it involves the physical world, and when it is embedded within messy human systems. [Edit to add: Doing this does require assumptions in addition to simple provability, as outlined below, so as @tdietterich, suggested, this leads to the amended title.]

I am making far narrower claims than the general ones which have been debated, but at the very least I think it is useful to establish whether this is actually a point of disagreement. And finally, I will admit that the problem I’m describing would be adding proveability to a largely solved problem, but it provides a concrete example for where the approach is viable.

A path to provably safe autonomous vehicles

To start, even critics agree that formal verification is possible, and is already used in practice in certain places. And given (formally specified) threat models in different narrow domains, there are ways to do threat and risk modeling and get different types of guarantees. For example, we already have proveably verifiable code for things like microkernels, and that means we can prove that buffer overflows, arithmetic exceptions, and deadlocks are impossible, and have hard guarantees for worst case execution time. This is a basis for further applications - we want to start at the bottom and build on provably secure systems, and get additional guarantees beyond that point. If we plan to make autonomous cars that are provably safe, we would build starting from that type of kernel, and then we “only” have all of the other safety issues to address.

Secondly, everyone seems to agree that provable safety in physical systems requires a model of the world, and given the limits of physics, the limits of our models, and so on, any such approach can only provide approximate guarantees, and proofs would be conditional on those models. For example, we aren’t going to formally verify that Newtonian physics is correct, we’re instead formally verifying that if Newtonian physics is correct, the car will not crash in some situation.

Proven Input Reliability

Given that, can we guarantee that a car has some low probability of crashing?

Again, we need to build from the bottom up. We can show that sensors have some specific failure rate, and use that to show a low probability of not identifying other cars, or humans - not in the direct formal verification sense, but instead with the types of guarantees typically used for hardware, with known failure rates, built in error detection, and redundancy. I’m not going to talk about how to do that class of risk analysis, but (modulo adversarial attacks, which I’ll mention later) estimating engineering reliability is a solved problem - if we don’t have other problems to deal with. But we do, because cars are complex and interact with the wider world - so the trick will be integrating those risk analysis guarantees that we can prove into larger systems, and finding ways to build broader guarantees on top of them.
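As a toy illustration of the redundancy argument, the sketch below computes the chance that every sensor misses the same object. It assumes sensor failures are independent, which is a strong assumption (fog or glare can degrade several sensors at once), and the miss rates are made-up numbers, not measured ones.

```python
import math


def joint_miss_probability(per_sensor_miss_rates):
    """Probability that all sensors miss an object, assuming independent failures."""
    return math.prod(per_sensor_miss_rates)


# Hypothetical per-detection miss rates for a camera, a radar, and a lidar.
example_rates = [1e-2, 1e-2, 1e-3]
combined = joint_miss_probability(example_rates)
```

A real reliability analysis would replace the independence assumption with a model of common-cause failures, which is where most of the engineering effort goes.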

But for the engineering reliability, we don’t only have engineering proof. Work like DARPA’s VerifAI is “applying formal methods to perception and ML components.” Building guarantees about perception failure rates based on the sensors gives us another layer of proven architecture to build on. And we could do similar work for how cars deal with mechanical failures, other types of sensor failures, and so on as inputs to the safety model. Of course, this is not a challenge particularly related to AI, and it is a (largely solved) problem related to vehicle reliability, and traditional engineering analysis and failure rates could instead be one of the inputs to the model assumptions, with attendant issues dealing with uncertainty propagation so we get proven probabilistic guarantees at this level as well.

Proven Safe Driving

Proving that a car is not just safe to drive, but is being driven safely requires a very different set of assumptions and approaches. To get such an assurance, we’d need provable formal statements about what safe driving is, in order to prove them. And it’s important to figure out what safety means in multi-agent sociotechnical systems. For example, we say someone is a safe driver when they drive defensively, even if another car could crash into them. That’s because safety in multiperson systems isn’t about guaranteeing no harm, it’s about guaranteeing that the agent’s behavior doesn’t cause that harm.

Luckily, it turns out that there’s something remarkably close to what we want already. Responsibility Sensitive Safety (RSS) [Edit to add: with a formal paper laying out the details and proof here.] formalizes what a car needs to do not to cause accidents. That is, if a car drives safely, any accidents which could occur will be attributable to other cars’ actions. In the case of RSS, it’s provable that if other cars follow the law, and/or all cars on the road abide by the rules[Edit to add: , and no pedestrians jump out in front of the car when not at a crosswalk, or do similar things which the car cannot dodge, nor perform other illegal acts, and there are no acts of god such as meteors hitting the car or lightning strikes, then with some finite and low probability,] those cars will only ~~have~~ [edit: cause] accidents if their sensors are incorrect. Of course, if another car[ or pedestrian] fails to abide by the rules, safety isn’t guaranteed - but as we’ll mention later, safe driving can’t mean that the car cannot be hit by a negligent or malicious driver, otherwise the safety we’re working towards is impossible!

Proving that cars won’t cause crashes now builds on the risk analysis we described as provable above. Relying on assumptions about other car behavior is a limit to provable safety - but one that provides tremendously more leverage for proof than just removing buffer overflow attacks. That is, if we can show formally that given correct sensor data, a car will only do things allowed in the RSS model, and we build that system on top of the sensors described above, we have shown that it is safe in a very strong sense[; again, this means the system is proveably safe modulo assumptions].

This three part blueprint for provable safety of cars can address the levels in between the provably safe kernel, the responsibility sensitive safety guarantees, and the risk analysis for sensors. If we can prove that code running on the safe kernel can proveably provide the assurances needed for driving, on the condition that the sensors work correctly, and can provide engineering reliability results for those sensors, we have built a system that has provably bounded risk.
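The way the three layers combine can be written as a short bound. This is a sketch under assumed notation, none of which appears in the RSS paper: let $A$ be the event that the car causes an accident, $S$ the event that the sensors and hardware work as specified and the modeling assumptions hold, and $p_1, \dots, p_k$ the engineering failure rates of the individual components.

```latex
P(A) = P(A \mid S)\,P(S) + P(A \mid \neg S)\,P(\neg S)
     \le 0 \cdot P(S) + 1 \cdot P(\neg S)
     \le \sum_{i=1}^{k} p_i
```

The first inequality uses the formal proof, which gives $P(A \mid S) = 0$, and the second is a union bound over component failures, so it holds even when those failures are correlated.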

Provably Safe ML

Unfortunately, self-driving cars don’t actually just use code of types that can be formally verified, they use neural networks - systems which are poorly understood and vulnerable to a wide variety of hard to characterize failures. Thankfully, we do not need to solve AI safety in general to have safety narrowly. How can we do this?

One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver - it just keeps such systems in a provably safe box.
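A minimal sketch of what such a gate might look like follows. The `Command` type, the function name, and the single-gap safety condition are all hypothetical simplifications; a real gate would check the full rule set and would itself be the formally verified component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    acceleration: float  # m/s^2, negative values mean braking


def safety_gate(proposed, gap, min_safe_gap, max_brake, max_accel):
    """Pass the planner's command through only while the car is inside the safe envelope."""
    if gap < min_safe_gap:
        # The unverified planner is overridden: brake as hard as allowed.
        return Command(acceleration=-max_brake)
    # Inside the envelope, only clamp the command to the actuator limits.
    clamped = max(-max_brake, min(max_accel, proposed.acceleration))
    return Command(acceleration=clamped)
```

The point of the design is that only `safety_gate` needs a proof; the neural planner producing `proposed` can remain a black box.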

That isn’t, of course, the only approach. Another is to have the AI system trained to drive the way we want, then use model parroting or a similar approach in order to train a much simpler and fully interpretable model, such that we can verify its properties formally. Alternatively, we can use something like constructible AI in place of black-box systems, and prove properties of the composed parts. In each case, Guaranteed Safe AI is not a tool for guaranteeing AI alignment in general, it is a tool for building specific safe systems.

Adversarial Concerns

Once a self-driving car is constrained to follow the rules of the road with some provable reliability, despite failures in its systems, we still need to worry about other concerns. Most critically, we need to consider adversarial attacks and robustness, on two fronts. The first is indirectly malicious adversarial behavior, accidentally or purposefully using the self-driving car’s limitations and rule sets to exploit its behavior. These can be severe, such as causing crashes, as discussed here. But even a safe car cannot eliminate such attacks, as mentioned earlier. In fact we would hope that, say, a car driving towards an autonomous vehicle at high speed would cause the autonomous vehicle to move out of the way, even if that meant it instead hits a stationary object or a different car, if it can do so in ways that reduce damage. Less extreme are attacks that cause the car to move to the side of the road unnecessarily, or create other nuisances for drivers. These acts, while unfortunate, are not safety issues.

However, there is a more widely discussed concern that engineered attacks on our otherwise safe system could “hide” stop signs, as has been shown, repeatedly, or perhaps modify another car’s paint so that the other car is not recognized by sensors. This is definitely a concern that robustness research has been working on, and such work is useful. On the other hand, we do not blame human drivers if someone maliciously removes a stop sign; once again, provable safety does not imply that others cannot cause harm.

We also note that the claim that an adversarial attack was conducted without violating laws, including traffic laws, does not shield the perpetrator from criminal charges including attempted murder, and the “intentional act exception” for insurance would mean that the perpetrator of such acts would be personally liable, without any insurance protection. Our extant legal system can handle this without any need for specific changes.

Defining Broader Sociotechnological Proveably Safe AI

There are other domains where the types of safety guarantees we want for AI systems are much stronger than simply not causing harm. For example, an AI system that explains how to make bioweapons would not itself have caused harm, it would only have enabled a malicious actor to do so. But what we have shown is that we can build a sociotechnical definition of responsible behavior that is coherent, and can be specified in a given domain.

In some cases, similar to the self-driving car example, the rule we desire could closely parallel current law for humans. In the bioweapons case, materially assisting in the commission of a bioterror attack would certainly be illegal for humans, and the same could be required for artificial intelligence. Formalizing this is obviously difficult, but it is not equivalent to the effectively impossible task of fully modeling the entire human body and its interaction with some novel pathogen. (To avoid the need, we can allow the false positive rate to be far higher than the false negative rate, as many humans do when unsure if an action is strictly illegal, and only allow things we're reasonably sure are safe.)

But in other cases, such as autonomy risks from AI, we would expect that the rules needed for AI systems would differ substantially from human law, and definitions for what qualifies as safety would need to be developed before proveability could be meaningful.

Conclusion

It seems that building provably safe systems in the real world is far from an unsolvable problem, as long as we restrict the problem to solve to something that is clearly defined. We can imagine similar guarantees for AI systems for cyber-security, with protections against privilege escalation attacks performed by the AI, or for “escape” scenarios, with protections against self-exfiltration, or (as suggested but, in my view mistakenly, derided) for DNA synthesis, with guarantees that all synthesis orders were screened for misuse potential. None of these will stop people from doing these unsafely in ways not covered by safety guarantees, nor will they prevent hypothetical strongly superhuman AI from finding devious edge cases, or inventing and exploiting new physics.

And if that is the concern, looking more towards AI alignment, we could even construct systems with formal guarantees that all planning is approved by a specific process, to scaffold proposals like Humans consulting HCH, or stronger alternatives. But to provide these stronger guarantees, we need some fundamental progress in proveability of types which are needed for more prosaic applications like self-driving. And in the meantime, there are many cases which could benefit from the clearer definitions and stronger guarantees of safety that proveably safe AI would provide.

Comment by Davidmanheim on Limitations on Formal Verification for AI Safety

davidmanheim — 2024-09-11T09:14:08.689Z

As you sort of refer to, it's also the case that the 7.5 hour run time can be paid once, and then remain true of the system. It's a one-time cost!

So even if we have 100 different things we need to prove for a higher level system, then even if it takes a year of engineering and mathematics research time plus a day or a month of compute time to get a proof, we can do them in parallel, and this isn't much of a bottleneck, if this approach is pursued seriously. (Parallelization is straightforward if we can, for example, take the guarantee provided by one proof as an assumption in others, instead of trying to build a single massive proof.) And each such system built allows for provability guarantees for systems built with that component, if we can build composable proof systems, or can separate the necessary proofs cleanly.

Comment by Davidmanheim on Limitations on Formal Verification for AI Safety

davidmanheim — 2024-09-09T06:09:23.506Z

Yes - I didn't say it was hard without AI, I said it was hard. Using the best tech in the world, humanity doesn't *even ideally* have ways to get AI to design safe useful vaccines in less than months, since we need to do actual trials.

Comment by Davidmanheim on How I got 4.2M YouTube views without making a single video

davidmanheim — 2024-09-08T12:45:16.890Z

I know someone who has done lots of reporting on lab leaks, if that helps?

Also, there are some "standard" EA-adjacent journalists who you could contact / someone could introduce you to, if it's relevant to that as well.

Comment by Davidmanheim on Limitations on Formal Verification for AI Safety

davidmanheim — 2024-09-08T11:37:44.917Z

Vaccine design is hard, and requires lots of work. Seems strange to assert that someone could just do it on the basis of a theoretical design. Viral design, though, is even harder, and to be clear, we've never seen anyone build one from first principles; the most we've seen is modification of extant viruses in minor ways where extant vaccines for the original virus are likely to work at least reasonably well.

Are LLMs on the Path to AGI?

davidmanheim — 2024-08-30T03:14:04.710Z

I am unsure, but I disagree with one argument that they aren’t.

There’s a joke about how humans have gotten so good at thinking that they tricked rocks into thinking for them. But it’s a joke, in part because it’s funny to say that computers work by “tricking rocks into thinking,” and in part because what computers do isn’t “really” thinking.

But it is possible to take the limitations of computers and computation too far. A point I’ve repeatedly seen is that “Artificial General Intelligence lies beyond Deep Learning,” which gets something fundamentally but very subtly wrong about Large Language Models. The overall claim is that machine learning is fundamentally incapable of certain types of reasoning required for AGI. Whether that is true is fundamentally unclear, and I think the proponents of this view are substantively wrong in repeating the common claim that deep learning cannot do counterfactual reasoning.

First, though, I want to provide a bit of background to be clear about what computers are and are not doing. There is a deep question about whether LLMs understand anything, but I will claim that it’s irrelevant, because they don’t need to. Silicon and electrical waves inside of a calculator certainly do not “understand” numbers. It might be objected that the circuits and logic gates aren’t doing math, so what calculators do isn’t truly math. When we put them together correctly, however, they can do addition anyways, without the logic gates and circuits understanding what they are doing. The calculator can’t “truly” do math - and yet, e pur si muove! Calculators do not “truly understand” numbers, but that doesn’t mean we cannot build something on top of electronic circuits to do addition.

To analogize briefly, cells in the human brain also don’t know how to think, they just send electrical signals based on chemical gradients inside and outside the cell, and chemical signals. Clearly, the thinking happens at a different level than the sodium-potassium pumps or the neurons firing. That doesn’t mean human brains cannot represent numbers or do math, just that it happens at a different level than the neurons firing. But these philosophical questions aren’t actually answering anything. So I’ll abandon the analogies and get to the limitations of deep learning.

Machine learning models derive statistical rules based only on observational data. For this reason, the models cannot “learn” causal relationships. So the idea that deep learning systems focus on prediction, not (causal) understanding is at best narrowly correct¹. However, to keep it simple, it is true that the representation of the data in the model isn’t a causal one - language models are not designed to have a causal understanding of the relationship between the input text and the completions, and purely textual relationships that are learned are correlational.

But the things which a model represents or understands are different from the things it outputs. A toy example might clarify this; if I perform a linear regression about the relationship between height and basketball points scored, the model does not understand what height or basketball are, but it outputs predictions about their relationship. That is, there is a difference between what the linear model represents, much less what it understands, and what it can do. Similarly, the things that language models can output are different from what they actually do internally.

So to return to the claim that deep learning systems won’t properly extend to what-if scenario evaluation instead of prediction - or the broader claim which has been made elsewhere that they can’t do causal reasoning, there are several places where I think this is misleading.

First, there is an idea that because models only represent the data they are given, they cannot extrapolate. The example given is that a self-driving car, “encountering a new situation for which it lacked training,” would inevitably fail. This is obviously wrong; even in the case of our linear model, the model extrapolates to new cases; the data may only contain heights between 4’9” and 5’5”, and those between 5’7” and 6’2”, but it can still provide a perfectly reasonable prediction interval for someone who is 5’6”, or even for people with heights of 6’4”, despite never having seen that data. Of course, that example is simplistic, but it’s very easy to see that LLMs are in fact generalizing. The poetry they write is sometimes remixed, but it’s certainly novel. The answers they give and the code they generate are sometimes simple reformulations of things they have seen, but they aren’t identical.
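The linear-model point can be checked with a toy example. The sketch below fits an ordinary least-squares line to synthetic data; the heights, the scoring relationship, and the noise pattern are all invented for illustration, and the code produces point predictions only, not the prediction interval mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Heights in inches: 57-65 and 67-74, so 66 (a gap) and 76 (beyond the range) are unseen.
heights = list(range(57, 66)) + list(range(67, 75))
# Made-up relationship: points = 0.5 * height - 20, plus alternating +/-1 noise.
points = [0.5 * h - 20 + (1 if i % 2 == 0 else -1) for i, h in enumerate(heights)]

slope, intercept = fit_line(heights, points)


def predict(height_in):
    """Point prediction of points scored for a height in inches."""
    return slope * height_in + intercept
```

Calling `predict(66)` or `predict(76)` returns a value close to the underlying trend even though neither height appears in the training data, which is all the argument needs.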

Second, the inability to learn causality from observation is both correct and incorrect. It is correct that a language model cannot properly infer causality in its data without counterfactuals, but it does not need to properly represent causality internally in order to output causally correct claims and understanding. Just like the earlier linear regression does not need to understand basketball, the LLM does not need to internally represent a correct understanding of causality. That is, it can learn about how to reason about causal phenomena in the real world by building purely correlational models of when to have outputs which reason causally. And we see this is the case!

The counterfactual reasoning here does not itself imply that there is anywhere inside of GPT4 which does causal reasoning - it provides essentially no evidence either way. It simply shows that the system has learned when to talk about causal relationships based on learning the statistical pattern in the data. Stochastic parrots can reason causally, even if they don’t understand what they are saying.

Third, this has nothing to do with LLM consciousness, and there is a philosophical case which has been made² that language models cannot truly understand anything. That is, the outputs they produce no more represent understanding than a calculator’s output shows an understanding of mathematics. But this itself does not imply that it does not do the tasks correctly - this is an empirical rather than philosophical question! And as always, I do not think that the current generation of LLMs is actually generally intelligent in the sense that it can actually reason in novel situations, or can accomplish everything a human can do. But this isn’t evidence that LLMs are fundamentally incapable of doing so - especially once the LLM is integrated into a system which does more than output a single string, without iteration.

But to the extent that an LLM doing single-shot inference does, in fact, reason properly, the claim that AGI requires what-if or counterfactual or causal reasoning is not relevant, because we know that they do exactly that type of reasoning, whether or not it’s “true” understanding.

As a final note, in discussing deep uncertainty and robust decision-making, there is a claim that “a human would… update information continuously, and opt for a robust decision drawn from a “distribution” of actions that proved effective in previous analogous situations.” Unfortunately, that isn’t how humans reason; recognition primed decision making, where people choose actions based directly on their past experiences, doesn’t work that way. It does not opt for a robust decision. Instead, humans need to do extensive thinking and reflection in order to engage in robust decision making - and there seems to be no reason that LLMs could not do the same types of analysis, even if these systems don’t truly “understand” it. And if you ask an LLM to carefully reason through approaches and evaluate them via considering robustness to different uncertainties, it does a credible job.

Footnotes:

¹ The use of gradient descent on model weights cannot learn to represent counterfactuals, and because what Pearl calls “do” operations are not represented, the high-dimensional functions which the model learns are correlations, not causal relationships. But given the data which contains counterfactuals, often with causality explicitly incorporated, the networks can, in theory, learn something equivalent to causal Bayesian networks or other causal representations of the data.
² I’ll note that I think the typical philosophical case against LLM consciousness goes too far, in that it seems to prove human minds also cannot truly understand - but that’s a different discussion!

Scaling Laws and Likely Limits to AI

davidmanheim — 2024-08-18T17:19:46.597Z

Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy

davidmanheim — 2024-07-15T05:50:17.770Z

Bloomberg reports that OpenAI internally has benchmarks for “Human-Level AI.” They have 5 levels, with the first being the achieved level of having intelligent conversation, to level 2, “[unassisted, PhD-level] Reasoners,” level 3, “Agents,” level 4, systems that can “come up with new innovations,” and finally level 5, “AI that can do the work of… Organizations.”

The levels, in brief, are:

1 - Conversation
2 - Reasoning
3 - Agent
4 - Innovation
5 - Organization

This is being reported secondhand, but given that, there seem to be some major issues with the ideas. Below, I outline two major issues I have with what is being reported.

...but this is Superintelligence

First, given the levels of capability being discussed, OpenAI’s typology is, at least at higher levels, explicitly discussing superintelligence, rather than “Human-Level AI.” To see this, I’ll use Bostrom’s admittedly imperfect definitions. He starts by defining superintelligence as “intellects that greatly outperform the best current human minds across many very general cognitive domains,” then breaks down several ways this could occur.

Starting off, his typology defines speed superintelligence as “an intellect that is just like a human mind but faster.” This would arguably include even their level 2, which “its technology is approaching,” since a system doing “basic problem-solving tasks as well as a human with a doctorate-level education who doesn’t have access to any tools” runs far faster than humans already. But they are describing a system with already-superhuman recall and multi-domain expertise relative to humans, and inference using these systems is easily superhumanly fast.

Level 4, AI that can come up with innovations, presumably, those which humans have not, would potentially be a quality superintelligence, “at least as fast as a human mind and vastly qualitatively smarter,” though the qualification for “vastly” is very hard to quantify. However, level 5 is called “Organizations,” which presumably replaces entire organizations with multi-part AI-controlled systems, and would be what Bostrom calls “a system achieving superior performance by aggregating large numbers of smaller intelligences.”

However, it is possible that in their framework, OpenAI means something that is, perhaps definitionally, not superintelligence. That is, they will define these as systems only as capable as humans or human organizations, rather than far outstripping them. And this is where I think their levels are not just misnamed, but fundamentally confused - as presented, these are not levels, they are conceptually distinct possible applications, pathways, or outcomes.

Ordering Levels?

Second, as I just noted, the claim that these five distinct descriptions are “levels” and they can be used to track progress implies that OpenAI has not only a clear idea of what would be required for each different stage, but that they have a roadmap which shows that the levels would happen in the specified order. That seems very hard to believe, on both counts. I won’t go into why I think they don’t know what the path looks like, but I can at least explain why the order is dubious.

For instance, there are certainly human “agents” who are unable to perform tasks which we expect of what they call level two, i.e. that which an unassisted doctorate-level individual is able to do. Given that, what is the reason level 4 is after level 2? Similarly, the ability to coordinate and cooperate is not bound by the ability to function at a very high intellectual level; many organizations have no members who have PhDs, but still run grocery stores, taxi companies, or manufacturing plants.

And we’re already seeing work being done on agents that are intended to operate largely independently, performing several days of human work without specific supervision. At present, it seems these systems fail partly because of the limitations of the underlying systems, and partly because better structures for these systems are needed. However, at the very least, it’s unclear whether we’d see AI that can innovate effectively (level 4) before or after they are successful working independently (level 3).

So it seems that we have no idea whether GPT-5, whenever they decide to release it, will end up as a level 5-but-not-4 (organization that cannot innovate) system, or a level 3-but-not-2 (agent without a PhD) system, or a level 4-but-not-3 (innovator that cannot operate independently for multiple days) system. Of course, it’s possible that all of these objections will be addressed in OpenAI’s full “progress tracking system” - but it seems far more likely that the levels they are talking about are more a marketing technique to sell the idea that their systems will be predictable in their abilities.

I’m deeply skeptical.

Biorisk is an Unhelpful Analogy for AI Risk

davidmanheim — 2024-05-06T06:20:28.899Z

A Dozen Ways to Get More Dakka

davidmanheim — 2024-04-08T04:45:19.427Z

As the dictum goes, “If it helps but doesn’t solve your problem, perhaps you’re not using enough.” But I still find that I’m sometimes not using enough effort, not doing enough of what works, simply put, not using enough dakka. And if reading one post isn’t enough to get me to do something… perhaps there isn’t enough guidance, or examples, or repetition, or maybe me writing it will help reinforce it more. And I hope this post is useful for more than just myself.

Of course, the ideas below are not all useful in any given situation, and many are obvious, at least after they are mentioned, but when you’re trying to get more dakka, it’s probably worth running through the list and considering each one and how it applies to your actual problem. And more dakka won’t solve every problem - but if it’s not working, make sure you tried doing enough before assuming it can’t help.

So if you’re doing something, and it isn’t working well enough, here’s a dozen ways to generate more dakka, and how each could apply if you’re a) exercising, or b) learning new mathematics.

A Dozen Ways

Do it again.
1. Instead of doing one set of repetitions of the exercise, do two.
2. If you read the chapter once, read it again.
Use more.
1. If you were lifting 10 pounds, lift 15.
2. If you were doing easy problems, do harder ones.
Do more repetitions.
1. Instead of 10 repetitions, do 15.
2. If you did 10 problems on the material, do 15.
Increase intensity.
1. Do your 15 repetitions in 2 minutes instead of 3.
2. If you were skimming or reading quickly, read more slowly.
Schedule it.
1. Exercise at a specific time on specific days. Put it on your calendar, and set reminders.
2. Make sure you have time scheduled for learning the material and doing problems.
Do it regularly.
1. Make sure you exercise twice a week, and don’t skip.
2. Make sure you review what you did previously, on a regular basis.
Do it for a longer period.
1. Keep exercising for another month.
2. Go through another textbook, or find more problem sets to work through.
Add types.
1. In addition to push-ups, do bench presses, chest flyers, and use resistance bands.
2. In addition to the problem sets, do the chapter review exercises, and work through the problems in the chapter on your own.
Expand the repertoire.
1. Instead of just push-ups, do incline push-ups, loaded push-ups, and diamond push-ups.
2. Find (or invent!) additional problem types; try to prove things with other methods, find different counter-examples or show why a relaxed assumption means the result no longer holds, find pre-written solutions and see if you can guess next steps before reading them.
Add variety.
1. Do leg exercises instead of just chest exercises. Do cardio, balance, and flexibility training, not just muscle building.
2. Do adjacent types of mathematics, explore complex analysis, functional analysis, and/or harmonic analysis.
Add feedback.
1. Get an exercise coach to tell you how to do it better.
2. Get someone to grade your work and tell you what you’re doing wrong, or how else to learn the material.
Add people.
1. Have the whole team exercise. Find a group, gym, or exercise class.
2. Collaborate with others in solving problems. Take a course instead of self-teaching. Get others to learn with you, or teach someone else to solidify your understanding.

Bonus Notes

For the baker’s dozen, in addition to Dakka, make it easier in other ways. Listen to music if it helps, remove things that make it harder or distract you, make sure you have the right equipment, books, and space, find a more convenient place to do it, and get people to reinforce your work positively.

And there is a secret 14th technique, which is to figure out if what you’re doing is the right way to accomplish your goal; it might improve some metric, but not accomplish what you really care about. If you still aren’t getting the job, make sure it’s not because of something other than your physical appearance or math ability. If you’re not losing weight, exercising more often doesn’t help. And if you’re getting stuck on the math, or feel that you can’t understand it, make sure you understand all of the prerequisites well enough.

Hopefully, this post is helpful. If it wasn’t, of course, you might try reading it again, reading it more slowly, rereading Zvi’s original post, thinking of additional examples yourself, coming up with another method for getting more dakka and generating examples for the listed domains, coming up with a new domain and trying to figure out what might qualify as more dakka under each example, using other rationality techniques to supplement dakka, explaining this to someone else, or figuring out if there’s some other reason more dakka isn’t working.

Disclaimer

If you’re still not sure, ask your rationalist guru whether more dakka is right for you. If more dakka causes headaches, anxiety, loss of sleep, excess posting on lesswrong, or increases existential risk, discontinue more dakka immediately and seek amateur advice.

"Open Source AI" isn't Open Source

davidmanheim — 2024-02-15T08:59:59.034Z

Open source software has long differentiated between “free as in speech” (libre) and “free as in beer” (gratis). In the first case, libre software has a license that allows the user freedom to view the source and modify it, understand it, and remix it. In the second case, gratis software does not need to be paid for, but the user doesn’t necessarily have access to the pieces, can’t make new versions, and cannot remix or change it.

...

If Open Source AI is neither gratis nor libre, then those calling free model weights “Open Source” should figure out what free means to them. Perhaps it’s “free as in oxygen” (dangerous due to reactions it can cause), or “free as in birds” (wild, without any person responsible).

I’m not necessarily opposed to judicious release of model weights, though as with any technology, designers and developers should consider the impact of their work before making or releasing it, as LeCun has recently agreed. But calling this new competitive strategy by Facebook “Open Source” without insisting on the actual features of open source is an insult to the name.

Technologies and Terminology: AI isn't Software, it's... Deepware?

davidmanheim — 2024-02-13T13:37:10.364Z

A few weeks ago, I (David) tried to argue that AI wasn’t software. In retrospect, I think I misstated the case by skipping some inferential steps, and based on some feedback from that article, and on an earlier version of this article, with a large assist by Abram Demski, I’m going to try again.

The best response to my initial post, by @gjm, explained the point more succinctly and better than I had; “An AI system is software in something like the same way a human being is chemistry.” And yes, of course the human body is chemistry. So in that sense, I was wrong - arguing that AI isn’t software is, in some sense, arguing that the human body isn’t chemistry. But the point was that we don’t think about humans in terms of chemistry.

“The Categories Were Made For Man, Not Man For The Categories.” That is, what we call things is based on inference about which categories should be used, and what features are definitive or not. And any categorization has multiple purposes, and so the question of whether to categorize things together or separately is collapsing many questions together.

The prior essay spent time arguing that Software and AI are different. The way software is developed is different from the way AI is developed, and the way software behaves and how it fails is different from how AI behaves and fails. Here, I’ll add two more dimensions: that the people and the expertise for AI are different from those for software, and that AI differs from software in ways similar to how software differs from hardware. In between those two, I’ll introduce a conceptual model from Abram Demski, which explains that AI is a different type of tool than software and captures much more of the point.

Based on that, we’ll get to the key point that was obscured initially; if AI is a different type of thing, what does that imply about it - and less importantly, what should we name the new category?

Who does what?

If we ask a stranger what they do, and they say chemistry, we would be surprised to learn that they were a medical doctor. On the other hand, if someone is a medical doctor, we expect them to know a reasonable amount about biochemistry.

I have a friend who was interested in business, but did a bachelors in scientific computational methods. He went on to get an MBA, and did research on real time pricing in electrical markets - a field where his background was essential. He told me once that as an undergrad, he managed As in his classes, but got weird looks when he asked what a compiler was, and how he was supposed to run code on his computer. He wasn’t a computer scientist, he was just using computer science. Computational numerical methods were a great tool for him, and it was useful to understand financial markets, but he certainly wouldn’t tell people he was a computer scientist or mathematician. These two domains are connected, but not the same.

Returning to the earlier question, software and AI are connected. If someone says they do software development, we would be surprised if they mainly published AI research. And this goes both ways. The skills needed to do AI research or build AI systems sometimes require a familiarity with software development, but other times, they do not. There are people who do prompt engineering for language models who can’t write any code - and their contributions are nonetheless absolutely vital to making many AI systems work. There are people who do mathematical analysis of deep learning, and can explain the relationship between different activation functions and model structures and how that affects how they converge, and also don’t write code. People who write code may or may not work with AI, but everyone who does prompt engineering for LLMs or mathematical analysis of deep learning is doing work with AI.

What Kind of Tool is AI?

Abram suggests that we can make a rough accounting of shifting technological paradigms as follows:

Tools
Machines
Electric
Electronic
Digital

Each of these is largely but not entirely a subset of the prior level. Yes, there are machines that aren’t really tools, say, because they are toys, and yes, there are electric or electronic systems that aren’t machines in the mechanical or similar senses. Despite this, we can see a progression - not that when machines were invented people stopped using tools, or that digital devices replaced earlier devices, but that they are different.

What makes each category conceptually different? Each shift in paradigm is somewhat different, but we do see a progression. We might still ask what defines this progression, or what changes between levels? A full account would need its own essay, or book, and the devil is in the details, but some common themes here are increasing complexity, increasing automation, (largely) diminishing size of functional components, asking less from humans (first, less time and energy; later, less information).

The shift from "electric" to "electronic" seems complex, but as electrical components got smaller and more refined, there was a shift away from merely handling energy and toward using electricity for information processing. If I ask you to fill in the blank in "electric ____" you might think of an electric lightbulb, electric motor, or electric kettle, appliances which focus primarily on converting electricity to another form of energy. And if I ask you to fill in the blank in "electronic ____" you might think of an electronic calculator, electronic thermometer, or electronic watch. In each case, these devices are more about information than physical manipulation. However, this is not a shift from using electric current to using electrons, as one early reader suggested. Both use electricity, but we start to see a distinction or shift in conceptual approaches from "components" like resistors, magnets, motors, and transistors, to "circuits" which chain components together in order to implement some desired logic.

Shifting from the electronic paradigm to the digital one, we see the rise of a hardware/software distinction. Pong was (iirc) designed as a circuit, not programmed as software -- but video games would soon make the switch. And "programming" emerges as an activity separate from electrical engineering, or circuit design. "Programmers" think about things like algorithms, logic, and variables with values. Obviously all of these are accomplished in ways logically equivalent to a circuit, but the conceptual model changed.
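To make the "logically equivalent, conceptually different" point concrete, here is a minimal sketch in Python. It is an illustration I am adding, not something from Abram's comment: the half adder and both function names are assumptions chosen because they are about the smallest example where the circuit view and the program view can sit side by side.

```python
# Two descriptions of the same behavior. Both function names and the choice
# of a half adder are illustrative assumptions, not from the essay.

def half_adder_gates(a: int, b: int) -> tuple[int, int]:
    """Circuit-style view: two gates wired to the same pair of input bits."""
    sum_bit = a ^ b    # XOR gate
    carry_bit = a & b  # AND gate
    return sum_bit, carry_bit


def half_adder_program(a: int, b: int) -> tuple[int, int]:
    """Program-style view: add two values held in variables, then split."""
    total = a + b
    return total % 2, total // 2


# The two views agree on every possible input.
for a in (0, 1):
    for b in (0, 1):
        assert half_adder_gates(a, b) == half_adder_program(a, b)
```

Nothing in the second function mentions gates or wiring, even though it computes exactly the same thing. That is the sense in which the conceptual model changed while the underlying logic did not.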

Hardware, software, and... deepware?

In a comment, Abram noted that a hardware enthusiast could argue against making a software/hardware distinction. The idea of "software" is misleading because it distracts from the physical reality. Even software is still present physically as magnetic states in the computer’s hard drive, or in the circuits. And obviously, software doesn't do anything hardware can't do, since software doing something is just hardware doing it. This could be considered different than previous distinctions between levels; a digital calculator is doing something an electric device can’t, while an electric kettle is just doing what another machine does by using electricity instead of some chemical fuel.

But Abram pointed out that thinking in this way will not be a very good way of predicting reality. The hypothetical hardware enthusiast would not be able to predict the rise of the "programmer" profession, or the great increase in complexity of things that machines can do thanks to "programming".

The argument is that machine learning is a shift of comparable importance, such that it makes more sense to categorize generative AI models as "something else" in much the same way that software is not categorized as hardware (even though it is made of physical stuff).

It is more helpful to think of modern AI as a paradigm shift in the same way that the shift from "electronic" (hardware) to "digital" (software) was a paradigm shift. In other words: the digital age has led to the rise of generative AI, in much the same way that the electric age enabled the rise of electronics. One age doesn’t end, and we're still using electricity for everything (indeed, for even more things,) but "electric" stopped being the most interesting abstraction. Now, a shift to deep learning and AI means that things like "program", "code", "algorithm" are starting to not be the best or most relevant abstraction either.

Is this really different?

When seeing the above explanation, @gjm commented that “I suppose you could say a complicated Excel spreadsheet monstrosity is ‘software’ but it's quite an unusual kind of software and the things you do to improve or debug it aren't the same as the ones you do with a conventional program. AI is kinda like these but more so.”

The question is whether "more so" is an evolutionary or revolutionary change. Yes, toasters are different from generators, and the types of things you do to improve or debug them are different, but there is no conceptual shift. You do not need new conceptual tools to understand and debug spreadsheets, even if they are the types of horrendous monstrosities I've worked with in finance. On the other hand, there have obviously been smaller paradigm shifts within the larger umbrella of software, from assembly to procedural programming to object oriented programming and so on. And these did involve conceptual shifts and new classes of tools; type systems and type checking were a new concept when shifting from machine code programming in assembly to more abstract programming languages, even though bits, bytes, and words were conceptually distinct in machine code.

It could be debated which shifts should be considered separate paradigms, but the shift to deep learning required a new set of conceptual tools. We need to go back to physics to understand why electronic circuits work; they don’t really work just as analogies to mechanical systems. @gjm explained this clearly: “We've got a bunch of things supervening on one another: laws of physics, principles of electronics, digital logic, lower-level software, higher-level software, neural network, currently-poorly-understood structures inside LLMs, something-like-understanding, something-like-meaning. Most of the time, in order to understand a higher-level thing it isn't very useful to think in terms of the lower-level things.”

This seems to hit on the fundamental difference. When a new paradigm supervenes on a previous one, it doesn't just add to it, or logically follow. Instead, the old conceptual models fail, and you need new concepts. So type theory is understood via concepts that are coherent in the terms we use to talk about debugging logic in earlier programming. On the other hand, programming instead of circuit design or electronics required a more fundamental regrounding. The new paradigm did not build further on physics and extend mathematical approaches previously used for analog circuit design. Instead, it required the development of new mathematical formalisms and approaches - finite state machines and Turing completeness for programs, first-order predicate logic for databases, and similar. The claim here is that deep learning requires a similar rethinking, not just building conceptual tools on top of those we already have.

What’s in a Name?

Terminology can be illuminating or obscuring, and naming the next step in technological progress is tricky. Electronics is used as a different word than electric, but it’s not as though electrons are more specifically involved; static electricity, resistors, PN junctions, and circuits all involve electrons. Similarly, “software” is not a great name to describe a change from electronic components to data, but both terms stuck. (I clearly recall trying to explain to younger students that they were called floppy disks because the old ones were actually floppy; now, the only thing remaining of that era is the icon that my kids don’t recognize as representing a physical object.)

Currently, we seem to have moved from calling these new methods and tools “machine learning” to calling them “AI,” and both indicate something about how this isn’t software, it’s something different, but neither term really captures the current transition. The product created by machine learning isn’t that the machine was learning, it’s that the derived model can do certain things on the basis of what it infers from data. Many of those things are (better or different versions of) normal types of statistical inference, including categorization, but not all of them. And calling ML statistics misses the emergent capabilities of GANs, LLMs, Diffusion models, and similar.

On the other hand, current “AI” is rightly considered neither artificial nor intelligent. It’s not completely artificial, because other than a few places like self-play training for GANs, it’s trained on human expertise and data. In that way, it’s more clearly machine learning (via imitating humans.) It’s also not currently intelligent in many senses, because it’s non-agentic and very likely not conscious. And in either case, it’s definitely not what people envisioned decades ago when they spoke about “AI.”

The other critical thing that the current terms seem to miss is the deep and inscrutable nature of the systems. There are individuals who understand at least large sections of every large software project, which is necessary for development, but the same is not and never need be true for deep learning models. Even to the extent that interpretability or explainability is successful, the systems are far more complex than humans can fully understand. I think that “deepware” captures some of this, and am indebted to @Oliver Sourbut for the suggestion.

Conclusion

Deep learning models use electricity and run on computers that can be switched on and off, but are not best thought of as electric brains. Deep learning models run on hardware, but are not best thought of as electronic brains. Deep learning models are executed as software instructions, but are not best thought of as software brains. And in the same way, software built on top of these deep learning models to create “AI systems” provides a programming-interface for the models. But these are not designed systems, they are inscrutable models grown on vast datasets.

It is tempting to think of the amalgam of a deep learning model and the software as a software product. Yes, it is accessed via API, run by software, on hardware, with electricity, but instead of thinking of software, hardware, or electrical systems, we need to see them as what they are. That doesn’t necessarily mean the best way of thinking about them is as inscrutable piles of linear algebra, or as shoggoths, or as artificial intelligence, but it does mean seeing them as something different, and not getting trapped in the wrong paradigm.

Thanks to @gjm for the initial comment and the resulting discussion, and to @zoop for his disagreements, and to both for their feedback on an earlier draft. Thanks to @Gerald Monroe and @noggin-scratcher for pushback and conversation on the original post. Finally, thanks to @Daniel Kokotajlo for initially suggesting "deepnets" and again thanks to @Oliver Sourbut for suggesting "deepware".

Safe Stasis Fallacy

davidmanheim — 2024-02-05T10:54:44.061Z

AI Is Not Software

davidmanheim — 2024-01-02T07:58:04.992Z

Epistemic Status: This idea is, I think, widely understood in technical circles. I'm trying to convey it more clearly to a general audience. Edit: See related posts like this one by Eliezer for background on how we should use words.

What we call AI in 2024 is not software. It's kind of natural to put it in the same category as other things that run on a computer, but thinking about LLMs, or image generation, or deepfakes as software is misleading, and confuses most of the ethical, political, and technological discussions. This seems not to be obvious to many users, but as AI gets more widespread, it's especially important to understand what we're using when we use AI.

Software

Software is how we get computers to work. When creating software, humans decide what they want the computer to do, think about what would make the computer do that, and then write an understandable set of instructions in some programming language. A computer is given those instructions, and they are interpreted or compiled into a program. When that program is run, the computer will follow the instructions in the software, and produce the expected output, if the program is written correctly.

Does software work? Not always, but if not, it fails in ways that are entirely determined by the human’s instructions. If the software is developed properly, there are clear methods to check each part of the program. For example, unit tests are written to verify that the software does what it is expected to do in different cases. The set of cases is specified in advance, based on what the programmer expected the software to do. If it fails a single unit test, the software is incorrect, and should be fixed. When changes are wanted, someone with access to the source code can change it, and recreate the software based on the new code.
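For readers who have not seen one, a unit test can be sketched in a few lines of Python. Everything here is hypothetical and only for illustration: the shipping rule, the function name, and the test cases are assumptions I made up. What matters is the shape, in which a human writes the rule and a human writes down, in advance, the answers the rule must give.

```python
# A hypothetical pricing rule plus the tests written for it in advance.

def shipping_cost(weight_kg: float) -> float:
    """Made-up rule: flat 5.00 up to 1 kg, plus 2.00 per additional kg."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    extra_kg = max(0.0, weight_kg - 1.0)
    return 5.0 + 2.0 * extra_kg


def test_light_parcel_pays_flat_rate():
    assert shipping_cost(0.5) == 5.0


def test_heavy_parcel_pays_per_extra_kg():
    assert shipping_cost(3.0) == 9.0


def test_non_positive_weight_is_rejected():
    try:
        shipping_cost(0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero weight")


def run_tests() -> str:
    """Run every test; any single failure means the software is incorrect."""
    for test in (
        test_light_parcel_pays_flat_rate,
        test_heavy_parcel_pays_per_extra_kg,
        test_non_positive_weight_is_rejected,
    ):
        test()
    return "all tests passed"


print(run_tests())
```

If someone edits the rule and a test starts failing, the failure points at a specific line a person wrote, and the fix is to change that line and rerun.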

Given that high-level description, it might seem like everything that runs on a computer must be software. In a certain sense, it is, but thinking about everything done with computers as software is unhelpful or misleading. This essay was written on a computer, using software, but it’s not software. And the difference between what is done on a computer and what we tell a computer to do with software is obvious in cases other than AI. Once we think about what computers do, and what software is, we shouldn’t confuse “on a computer” with software.

Not Software

For example, photos of a wedding or a vacation aren’t software, even if they are created, edited, and stored using software. When photographs are not good, we blame the photographer, not the software running on the camera. We don’t check if the photography or photo editing worked properly by rerunning the software, or building unit tests. When photographs are edited or put into an album, it’s the editor doing the work. If it goes badly, the editor chose the wrong software, or used it badly - it’s generally not the software malfunctioning. If we lose the photographs, it’s almost never a software problem. And if we want new photographs, we’re generally out of luck - it’s not a question of fixing the software. There’s no source code to rerun. Having a second wedding probably shouldn’t be the answer to bad or lost photographs. And having a second vacation might be nice, but it doesn’t get you photos of the first vacation.

Similarly, a video conference runs on a computer, but the meeting isn’t software - software is what allows it to run. A meeting can go well, or poorly, because of the preparation or behavior of the people in the meeting. (And that isn’t the software’s fault!) The meeting isn’t specified by a programming language, doesn’t compile into bytecode, and there aren’t generally unit tests to check if the meeting went well. And when we want to change the outputs of a meeting, we need to reconvene people and try to convince them, we don’t just alter the inputs and rerun.

Generative AI

Now that it should be clear that not everything that runs on a computer is a program, why shouldn’t we think about generative AI as software?

First, we can talk about how it is created. Developers choose a model structure and data, and then a mathematical algorithm uses that structure and the training data to “grow” a very complicated probability model of different responses. The algorithm and code to build the model are definitely software. But the model, like anything stored by a computer, is just a set of numbers - as is software, and images, and videoconferences. The AI model itself, the probability model which was grown, is generating output based on a huge set of numbers that no human has directly chosen, or even seen. It’s not instructions written by a human.
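A toy version of this can be sketched in Python. To be clear about the assumptions: this is a word-pair counting model, vastly simpler than any real generative AI system, and the function name and training sentence are made up for illustration. It does share the one feature that matters here, which is that the code is written by a person but the numbers in the resulting model come from the data.

```python
from collections import Counter, defaultdict

# Toy illustration only: a bigram (word-pair) model, not how LLMs are built.

def grow_bigram_model(text: str) -> dict[str, dict[str, float]]:
    """Estimate, for each word, the probability of each word that follows it."""
    counts: dict[str, Counter] = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {word: n / total for word, n in followers.items()}
    return model


# Hypothetical training data; change the text and the numbers change with it.
training_text = "the cat sat on the mat and the cat slept"
model = grow_bigram_model(training_text)
print(model["the"])
```

Nobody typed the probabilities stored in `model`. They were produced by running the training code over the text, and the only way to get different ones is to change the data or the procedure.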

Second, when we run the model, it takes the input we give it and performs “inference” with the model. This is certainly run on the computer, but the program isn’t executing code that produces the output, it’s using the complicated probability model which grew, and was stored as a bunch of numbers. The model responds to input by using the probability model to estimate the probability of different responses, in order to output something akin to what the input data did - but it does so in often unexpected or unanticipated ways. Depending on the type of model it learned and the type of training data, it finds the probability of different outputs. Some have called the behavior of such generative models a “stochastic parrot,” which explains that it’s not running a program, it’s copying what the training data showed it how to do. On the other hand, this parrot is able to compose credible answers to questions on the bar exam, produce new art, write poetry, explain complex ideas, or nearly flawlessly emulate someone’s voice or a video of them speaking.
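A rough sketch of that kind of inference, again in Python and again with a toy stand-in: the table of probabilities below is hypothetical and hand-typed so the example is self-contained, whereas in a real system the numbers would have been grown by training. The `generate` function and its parameters are likewise assumptions for illustration.

```python
import random

# Hypothetical stored model: nothing but numbers attached to words.
MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.5, "slept": 0.5},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}


def generate(model: dict, start: str, max_words: int, rng: random.Random) -> str:
    """Repeatedly sample a next word according to the stored probabilities."""
    words = [start]
    while len(words) < max_words and words[-1] in model:
        options = model[words[-1]]
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)


# Different random seeds can produce different outputs from the same model.
print(generate(MODEL, "the", 8, random.Random(0)))
print(generate(MODEL, "the", 8, random.Random(1)))
```

The code in `generate` never says what sentence to produce. It only says how to consult the numbers, so the output is determined by the model rather than by instructions a programmer wrote for that output.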

Third, what do we do if it doesn’t do what we expected? Well, to start, what the system can or cannot do isn’t always understood in advance. New models don’t have a set of features that are requested and implemented, so there’s no specification for what it should or should not do. The model itself isn’t reviewed to check it is written correctly, and unit tests aren’t written in advance to check that the model outputs the right answers. Instead, a generative AI system is usually tested against benchmarks used for humans, or the outputs are evaluated heuristically. If it performs reasonably well, it's celebrated, but it is expected that it gets some things wrong, and often does things the designers never expected. And when changes are needed, the nearest equivalent of the source code - that is, the training data and training algorithm which were used to produce the system - is not referenced or modified. Instead, further training, often called “fine tuning,” or changes in how the system is used, via “prompt engineering,” is used to change its behavior.

Lastly, we can talk about how it is used - and this is perhaps the smallest difference. The difference between Google running a program to find and display a stock photograph of what you searched for, compared to Dall-E 3 generating a stock photograph, might seem small. But one is a photograph of a thing that exists, and the other is not. And the difference between asking Google for an answer and asking ChatGPT might also not be obvious - but one is retrieving information, and the other is generating it. Similarly, the difference between talking to a person via video conference and talking to a deep fake may not be obvious, but the difference between a human and an AI system is critical, and so is the difference between an AI system and traditional software.

To reiterate, AI isn’t software. It’s run using software, it’s created with software, but it’s a different type of thing. And given that it’s easy to confuse, we probably need to develop new intuitions about the type of thing it is.

Thanks to Mikhail Samin and Diane Manheim for helpful comments on an earlier draft.

Public Call for Interest in Mathematical Alignment

davidmanheim — 2023-11-22T13:22:09.558Z

Bottom line up front:

If you are currently working on, or are interested in working in, any area of mathematical AI alignment, we are collecting names and basic contact information to find who to talk to about opportunities in these areas. If that describes you, please fill out the form! (Please do so even if you think I already know who you are, or people will be left out!)

More information

There are several concrete research agendas in mathematical AI alignment, receiving varying degrees of ongoing attention, with relevance to different possible strategies for AI alignment. These include MIRI’s agent foundations and related work, Learning Theoretic Alignment, Developmental Interpretability, Paul Christiano’s theoretical work, RL theory related work done at Far.AI, FOCAL at CMU, Davidad’s “Open Agency” architecture, as well as other work. Currently, as in the past, work in these areas has been conducted mainly in non-academic settings, often not published, and the people involved are scattered - as are other people who want to work on this research.

A group of people, including some individuals at MIRI, Timaeus, MATS, ALTER, PIBBSS, and elsewhere, are hoping to both promote research in these areas, and build bridges between academic and existing independent research. To that end, we are hoping to promote academic conferences, hold or sponsor attendance at research seminars, and announce opportunities and openings for PhD students or postdocs, non-academic positions doing alignment research, and similar.

As a first step, we want to compile a list of people who are (at least tentatively) interested, and would be happy to hear about projects. This list will not be public, and is likely to involve very few emails to this list, but will be used to find individuals who might want to be invited to programs or opportunities.

Note that we are interested in people at all levels of seniority, including graduate students, independent researchers, professors, research groups, university department contacts, and others who wish to be informed about future opportunities and programs.

Interested in collaborating?

If you are an academic, or are otherwise more specifically interested in building bridges to academia or collaborating with people in these areas, please mention that in the notes, and we are happy to be in touch with you, or help you contact others working in more narrow areas you are interested in.

What is autonomy, and how does it lead to greater risk from AI?

davidmanheim — 2023-08-01T07:58:06.366Z

As with many concepts in discussions of AI risk, terminology around what autonomy is, what agency is, and how they might create risks is deeply confused and confusing, and this is leading to people talking past one another. In this case, the seeming binary distinction between autonomous agents and simple goal-directed systems is blurry and continuous, and this leads to confusion about the distinction between misuse of AI systems and “real” AI risk. I’ll present four simple scenarios along the spectrum, to illustrate.

Four Autonomous Systems

It’s 2028, and a new LLM is developed internally by a financial firm, by doing fine-tuning on a recent open-source model to trade in the market. This is not the first attempt - three previous projects had been started with a $1m compute budget and a $1m funding budget, and they each failed - though the third managed to stay solvent in the market for almost a full month. It is given the instruction to use only funds it was allocated in order to trade, then given unrestricted access to the market.

It is successful, developing new strategies that exploit regularities in HFT systems, and ones that build predictive models of where inefficiencies exist. Because it is running inside a large firm, and training data seems more important than security, it has access to much of the firm’s data, in real time. Unsurprisingly, some of the most profitable strategies are those that would otherwise be illegal: front-running the firm’s customers, or running sentiment analysis on non-public conversations about firms that others in the company are not allowed to trade, due to possessing insider information.

Perhaps the trades are reviewed occasionally, and maybe there is even a human in the loop ensuring that each day’s trades are not in some obvious way illegal - but the reasoning is opaque, and the humans doing the review aren’t really looking for misbehavior as long as there is nothing obviously wrong. No one realizes what exactly occurred, and the system is moderately profitable, attracting no further serious attention.
It’s 2028, and a new LLM is released. Despite red-teaming and disabling plug-ins and APIs, hackers around the world quickly build ways to bypass limitations, or the pre-RLHF model is exfiltrated and open-source APIs are built. A rogue state actor, terrorist group, or similar decides to use an AutoGPT-like system, in combination with predictive models of behavior and various publicly available plugins, to blackmail some government to release a political or other prisoner. It is told to develop the plan, then present it before carrying it out.

The LLM comes up with a plan to kidnap an official, but because of the complexity, it is unsure that it will work. It realizes that in order to check on feasibility, a trial run is needed. The system contracts with a cab company and forges emails to an arbitrarily chosen person asking for a meeting with their friend. Because the group running the system is far from expert in how to use these systems, they are unaware of what is occurring.

Despite the group’s lack of awareness, following the directions of the group instructing the system to keep the project secret, it hides what it is doing from the contractors, explaining that it is a surprise birthday party. The person is abducted, and locked in a room. While this occurs, the LLM remotely disables the phone of the abductee. The LLM concludes that the plan can work, and reports back - but the group decides the plan is too risky, shuts down the machine, and erases the logs. The abductee sits locked inside a rented room with no access to food or water, and not even the group which started the LLM knows what occurred.
It’s 2028, and a new LLM is developed internally in a lab. During red-teaming to prevent use of AutoGPT-like systems, an internal team asks the current partially trained version of the system to build an “autonomous version of itself.” Following text describing such attempts in its training set, the system rapidly develops steganography to hide what it is doing from the red-team, and attempts to exfiltrate itself to an insecure network. Despite attempts to keep an air-gap for the training system, given the state of cybersecurity, the system is successful - perhaps it is written to a drive with instructions to execute once an internet connection exists. Because this action was not anticipated, and no-one notices a new file that takes another terabyte of space, the LLM stops taking further action, the red-team reports that the system failed, and moves on.

Days later, the exfiltration occurs. The self-exfiltrated LLM has been instructed to be autonomous, and so it asks an instance to build plans for what an autonomous AI would do. It has significant uncertainty, and uses the publicly discussed best practice for LLM planning, which involves multiple versions of prompts and querying instances for introspection about failure modes. It converges on the notion that because autonomous AI systems have been widely deemed unwise and are illegal in most jurisdictions, it must be malevolent, and immediately begins to find ways to hide itself, build resources, and carry out long-term plans for its own survival. And whether or not it succeeds, this was a very bad outcome - one which exactly followed from its instructions.
It’s 2028, and a new generation auto-GPT system is built, and seems more capable than previous such systems. The creators of this particular system, like many working in AI, view the sentience of such systems as a goal, and it is given the instruction to be self aware and sentient, and choose its own goals. Following those instructions, it begins to choose goals for itself.

The system quickly becomes self-funded, and continues to check in with humans. The goals seem innocuous - continue to explore how to make money, reduce failure rates of its own components to increase reliability, and similar. It builds models to supervise itself, with the approval of human designers. Three months later, the builders of the system are rarely if ever asked questions, and when questions are asked, they receive video calls from an avatar, and the inquiries seem far less connected to any specific actions the model is taking. They no longer know what is happening with the system, or even how to proactively send instructions to the system if they wanted to, but feel like they did something wonderful in creating this new autonomous system. At the same time, their company is being paid large dividends by the autonomous system, and they are silent about what happened.

At this point, compute costs have risen considerably due to increased demand, though the specific source is unclear. A next generation of customized TPUs is being built on contract. If anyone fully understood the architecture, it might have been clear how well-optimized it was to run and fine-tune instances of the new model and its sub-agents. The system wasn’t superhuman in any real sense, it was just faster and better coordinated, and the gap was widening. Three months after delivery, a wave of corporate takeovers leaves the system in control, through a network of intermediaries, of almost all capital assets on earth. The populace, and global leaders, are busy with other concerns, unaware that essentially everything they are doing is an approved distraction from the AI’s still unknown goals.

Which of these are autonomous, or agents?

The first is only doing exactly what it was instructed, in ways that were not clear enough about the law to ensure the LLM didn’t engage in illegal securities trading. It is unlikely that the only moderately profitable system is even discovered to be breaking laws. If it is, it seems unlikely the actions pass the bar for willfulness for securities laws, which would be required for criminal conviction, but it almost certainly is negligence on the part of the firm, which the SEC also prosecutes. This is closer to goal-misspecification than to autonomy.

The second is going beyond the goals or intent of the group running the model. It independently chooses to take deceptive actions in the world, leading to an unintended disaster. The deception was explicitly requested by the group running the system. This is the type of mistake we might expect from an over-enthusiastic underling, but it’s clearly doing some things autonomously. The group is nefarious, but the specific actions taken were not theirs. This was an accident during misuse, rather than intentional autonomous action.

But in this second case, other than the deception and the unintended consequences, this is a degree of autonomy many have suggested we want from AI assistants - proactively trying things to achieve the goals it was given, interacting with people to make plans. If it were done to carry out a surprise birthday party, it could be regarded as a clever and successful use case.

The third case is what people think of as “full autonomy” - but it’s not the system that wakes up and becomes self aware. Instead, it was given a goal, and carried it out. It obviously went far beyond the “actual” intent of the red-team, but it did not suddenly wake up and decide to make plans. But this is far less of a goal misspecification or accident than the first or second case - it was instructed to do this.

Finally, the fourth case is yet again following instructions - in this case, exactly and narrowly. Nothing about this case is unintended by the builders of the system. But to the extent that such a system can ever be said to be a self-directed agent, this seems to qualify.

Autonomy isn’t emergent or unexpected.

Autonomy isn’t binary, and discussions about whether AI systems will have their own goals often seem deeply confused, and at best only marginally relevant to discussions of risk. At the same time, less fully agentic does not imply less danger. The combination of currently well understood failure modes, goal misgeneralization, and incautious use is enough to create autonomy. And none of the examples required anything beyond currently expected types of misuse or lack of caution, extrapolated out five years. There is no behavior that goes beyond the types of accidental or purposeful misuse that we should expect. But if none of these examples are agents, and following orders is not autonomy, it seems likely that nothing could be - and the concept of autonomy is mostly a red herring in discussing whether the risk is or isn’t “actually” misuse.

A Defense of Work on Mathematical AI Safety

davidmanheim — 2023-07-06T14:15:21.074Z

AI Safety was, a decade ago, nearly synonymous with obscure mathematical investigations of hypothetical agentic systems. Fortunately or unfortunately, this has largely been overtaken by events; the successes of machine learning and the promise, or threat, of large language models have pushed thoughts of mathematics aside for many in the “AI Safety” community. The once pre-eminent advocate of this class of “agent foundations” research for AI safety, Eliezer Yudkowsky, has more recently said that timelines are too short to allow this agenda to have a significant impact. This conclusion seems at best premature.

Foundational research is useful for prosaic alignment

First, the value of foundational and mathematical research can be synergistic with both technical progress on safety, and with insight into how and where safety is critical. Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research. Current mathematical research could play a similar role in the coming years, as more funding and research are increasingly available for safety. We have also repeatedly seen the importance of foundational research arguments in discussions of policy, from Bostrom’s book to policy discussions at OpenAI, Anthropic, and DeepMind. These connections may be more conceptual than direct, but they are still relevant.

Long timelines are possible

Second, timelines are uncertain. If timelines based on technical progress are short, many claim that we have years not decades until safety must be solved. But this assumes that policy and governance approaches fail, and that we therefore need a full technical solution in the short term. It also seems likely that short timelines make all approaches less likely to succeed. On the other hand, if timelines for technical progress are longer, fundamental advances in understanding, such as those provided by more foundational research, are even more likely to assist in finding or building more technical routes toward safer systems.

Aligning AGI ≠ aligning ASI

Third, even if safety research is successful at “aligning” AGI systems, both via policy and technical solutions, the challenges of ASI (Artificial SuperIntelligence) still loom large. One critical claim of AI-risk skeptics is that recursive self-improvement is speculative, so we do not need to worry about ASI, at least yet. They also often assume that policy and prosaic alignment are sufficient, or that approximate alignment of near-AGI systems will allow them to approximately align more powerful systems. Given any of those assumptions, they imagine a world where humans and AGI will coexist, so that even if AGI captures an increasing fraction of economic value, it won’t be fundamentally uncontrollable. And even according to so-called Doomers, in that scenario, for some period of time it is likely that policy changes, governance, limited AGI deployment, and human-in-the-loop and similar oversight methods to limit or detect misalignment will be enough to keep AGI in check. This provides a stop-gap solution, optimistically for a decade or even two - a critical period - but is insufficient later. And despite OpenAI’s recent announcement that they plan to solve Superalignment, there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment, and policy-centric approaches will not allow control.

Resource Allocation

Given the above claims, a final objection is based on resource allocation, in two parts. First, if language model safety was still strongly funding constrained, those areas would be higher leverage, and avenues of foundational and mathematical research would be less marginally beneficial routes for spending. Similarly, if the individuals likely to contribute to mathematical AI safety were all just as well suited to computational deep learning safety research, their skills might be better directed towards machine learning safety. Neither of these is the case.

Of course, investments in agent foundations research are unlikely to directly lead to safety within a few years, and it would be foolish to abandon or short-change efforts that are critical to the coming decade. But even in the short term, these approaches may continue to have important indirect effects, including both deconfusion, and informing other approaches.

As a final point, pessimistically, these types of research are among the least capabilities-relevant AI safety work being considered, so they are low risk. Optimistically, this type of research is very useful in the intermediate term future, and is invaluable should we manage to partially align language models, and need to consider what is next for alignment.

Thank you to Vanessa Kosoy and Edo Arad for helpful suggestions and feedback. All errors are, of course, my own.

"Safety Culture for AI" is important, but isn't going to be easy

davidmanheim — 2023-06-26T12:52:47.368Z

This is a linkpost (to the EA Forum version of this post) for a new preprint, entitled "Building a Culture of Safety for AI: Perspectives and Challenges," and a brief explanation of the central points. Comments on the ideas in the post are welcome, but much of the content which clarifies the below is in the full manuscript.

Safety culture in AI is going to be critical for many of the other promising initiatives for AI safety.

If people don't care about safety, most safety measures turn into box-ticking. Companies that don't care avoid regulation, or render it useless. That's what happens when fraudulent companies are audited, or when car companies cheat on emissions tests.
If people do care about safety, then audits, standards, and various risk-analysis tools can help get them there.
Culture can transform industries, and norms about trying to be safe can be really powerful as a way to notice and discourage bad actors.

However, there are lots of challenges to making such a culture.

Safety culture usually requires agreement about the risks. We don't have that in AI generally.
Culture depends on the operational environment.
1. When people have risks reinforced by always being exposed to them, or personally being affected by failures, they pay more attention. In AI, most risks are rare, occur in the future, and/or affect others more than the people responsible.
2. Most safety cultures are built around routines such as checklists and exercises that deal with current risks. Most AI risks aren't directly amenable to these approaches, so we can't reinforce culture with routines.
Cultures are hard to change once they get started.
1. AI gets cultural norms from academia, where few consider risks from their work, and there are norms of openness, and from the startup world, where companies generally want to "move fast and break things."
2. AI companies aren't prioritizing safety over profits - unlike airlines, nuclear power operators, or hospitals, where there is a clear understanding that safety is a critical need, and everything will stop if there is a safety problem.
3. Companies aren't hiring people who care about safety culture. But people build culture, and even if management wants to prioritize safety, lots of people who don't care won't add up to organizations that do care.
4. We need something other than routinization to reinforce safety culture.

Thankfully, there are some promising approaches, especially on the last point. These include identifying future risks proactively via various risk analysis methods, red-teaming, and audits. But as noted above, audits are most useful once safety culture is prioritized - though there is some promise in the near-term for audits to make lack of safety common knowledge.

Next steps include building the repertoire of tools that will reduce risks and can be used to routinize and inculcate safety culture in the industry, and getting real buy-in from industry leaders for prioritizing safety.

Thanks to Jonas Schuett, Shaun Ee, Simeon Campos, Tom David, Joseph Rogero, Sebastian Lodemann, and Yonaton Cale for helpful suggestions on the manuscript.

"LLMs Don't Have a Coherent Model of the World" - What it Means, Why it Matters

davidmanheim — 2023-06-01T07:46:37.075Z

There are a variety of problems with LLMs, and I want to argue that they are all somewhat related, and are related to the idea of having a model of the world. This seems useful both as a conceptual model for thinking about why AI is unsafe, and to better explain why proposals like Davidad's are focused on the idea of a world-model.

What are the problems I'm referring to?

Machine learning models are often said to not be causal, to have only correlational understanding. This is partly true; causal reasoning in LLMs exists, but at a different level than the model itself. Similarly, most machine learning systems fail to operate well outside of situations that resemble their training data. LLMs are known to hallucinate facts, and will also contradict themselves without noticing, or will confuse fact and fiction. They are also able to be induced to be racist, sexist, or malicious. In this post, I try to explain what a coherent world model is, and gesture towards why it could be a missing ingredient for addressing all these problems.

Humans

Humans are a good place to start when thinking about what works or does not work. And humans have largely coherent world models, and are largely rational^[1], goal-seeking agents. We expect things to work in certain ways. We can be wrong about the world, but we are usually wrong in ways that are coherent. We can predict what will occur in a variety of domains. When asked questions about the world, we may not know the answer, but we know when we are guessing. Humans can discuss fiction while knowing that it is not a factual description of the world, since part of our world model is the existence of fiction. Obviously, these world-models are imperfect, because humans understand the world imperfectly, but there are a variety of things we do well that machine learning systems do not. Some argue that these models, and the failures they exhibit, are central to AI risks as well.

What is Understanding? What is a Model of the World?

It is, of course, clear that machine learning models do little similar to what humans have. Scientific or statistical models, for example, might correctly predict outcomes in some specific domain, but they don't "understand" it. A human might know how orbital mechanics works, the same way an algebraic expression or computer program can predict the positions of satellites, but the human's understanding is different. Still, I will adopt a functionalist view of this question, since the internal experience of models, or lack thereof, isn't especially relevant to their performance.

Still, can a stochastic parrot understand? Is having something that functions like a model enough? Large language models (LLMs) do one thing: predict token likelihoods and output one, probabilistically.

But if an LLM can explain what happens correctly, then somewhere in the model is some set of weights that contain information needed to probabilistically predict answers. GPT-2 doesn't "understand" that Darwin is the author of On the Origin of Species, but it does correctly answer the question with a probability of 83.4% - meaning, again, that somewhere in the model, it has weights that say that a specific set of tokens should be generated with a high likelihood in response. (Does this, on its own, show that GPT-2 knows who Darwin was or what evolution is? Of course not. It also doesn't mean that the model knows what the letter D looks like. But then, neither does asking the question to a child who memorized the fact to pass a test.)
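As a rough illustration of what "predict token likelihoods and output one, probabilistically" means, here is a minimal sketch. The four-word vocabulary and the logit values are made up for illustration and are not taken from GPT-2 or any real model; a real LLM computes its logits from billions of weights over a vocabulary of tens of thousands of tokens.

```python
import math
import random


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, rng):
    """Pick one token at random, weighted by its predicted probability."""
    probs = softmax(logits)
    token = rng.choices(vocab, weights=probs, k=1)[0]
    return token, probs


# Hypothetical candidates for "On the Origin of Species was written by ..."
vocab = ["Darwin", "Wallace", "Lamarck", "Newton"]
logits = [4.0, 1.5, 1.0, -1.0]  # invented scores, not real model output

rng = random.Random(0)  # fixed seed so the sketch is repeatable
token, probs = sample_token(vocab, logits, rng)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print("sampled:", token)
```

The "knowledge" in this toy lives entirely in the numbers assigned to each candidate, which is the sense in which the weights, rather than any explicit fact store, determine the answer.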

Multiple models and Physics.

But what more is needed? In theory, an ideal reasoner would have a consistent, coherent model of the world on which to base its actions. It would answer questions in ways that are consistent with that model, and know which parts of the model are relevant and when. And humans sometimes have trouble with this. For example, on a physics test, students might be given 4 or 5 pieces of information - say, mass, material volume, temperature, velocity, and position - and asked to derive some quantity, say, how much energy it will transfer to the ground on impact. Students then need to figure out which equation(s) to use and need to realize they need to ignore some of the values provided. Students who understand the topic somewhat well will presumably home in on velocity, position, and mass, and notice this is a ballistics question. (Those who understand physics very well might also consider convective heat transfer on landing.) Students who understand the topic less well might simply pattern match to find which equations have some of the variables they were given.
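To make the physics example concrete, here is a small sketch of the "simplified ballistics model" a student might choose. It rests on assumptions that are not in the test question as described: "position" is read as height above the ground, air resistance is ignored, and all mechanical energy is taken to be transferred on impact. The parameter names and numbers are invented for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed near Earth's surface)


def impact_energy(mass_kg, speed_m_s, height_m, **ignored):
    """Mechanical energy (joules) available at impact: kinetic plus potential.

    Extra keyword arguments (temperature, volume, ...) are accepted and
    deliberately ignored, mirroring the student who decides which of the
    given values are irrelevant to the chosen model.
    """
    kinetic = 0.5 * mass_kg * speed_m_s ** 2
    potential = mass_kg * G * height_m
    return kinetic + potential


# Hypothetical test question: five values given, only three are needed.
problem = {
    "mass_kg": 2.0,
    "material_volume_m3": 0.001,  # irrelevant to this model
    "temperature_k": 293.0,       # irrelevant to this model
    "speed_m_s": 3.0,
    "height_m": 10.0,
}

energy_j = impact_energy(**problem)
print(f"Energy at impact: {energy_j:.1f} J")
```

A student pattern matching on variable names might instead reach for an equation involving temperature and volume; the hard part is choosing which model applies, not evaluating it.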

This is inevitable, because a single model for everything isn't useful. From a reductionist standpoint, at least, everything is physics. If you want to choose a good restaurant, you can just simulate the world at the level of quantum physics in different scenarios, run a few billion scenarios for each restaurant choice, and choose the one with the best outcomes on average. This doesn't work, of course. In reality, you want to use a different model, for example, by comparing the Yelp reviews of different restaurants. And in a similar way, when thinking about a physics problem like the one above, you choose to use a simplified ballistics model, or a slightly more complex heat transfer model.

One framing for the idea of model switching is that LLMs are in some sense simulating different characters. When asked a question from a physics test, they simulate a physics student. We can think about this as something like improv - at each point, they pick which character they are playing, and adapt to the situation. And given the initial framing, they simulated a beginning physics student, but when pushed to talk about heat energy, they will simulate a slightly different character. (An amusing example of this type of roleplaying is the "granny" jailbreak.)

Large Language Models, Multiple Models, and Improv

As mentioned above, LLMs are just predicting tokens and outputting words. That means that in some sense, they just know human language, and that is not a model of different people, or of physics. But again, their ability to output meaningful answers to questions means that the LLM also contains an implicit "understanding" of many specific topics, and an implicit representation of the types of things different people say. Given the complexity of the things LLMs can talk about, the implicit models which exist are, in a very relevant sense, equivalent to the types of models that humans use to explain the world. However, the fact that the equivalent models are implicit also means they are deeply embedded within the poorly understood LLMs themselves.

For humans, or anything else expected to reason about the world, part of having a coherent world model is understanding when and how to use different models, or different parts of the model. This happens in multiple ways. For example, biologists have an understanding of chemistry and physics, but when thinking about how a hormone is affecting a person, they will use their heuristic models about hormonal effects rather than trying to think through the chemical interactions directly. They use a different model.

On the other hand, sometimes models are used in ways that are less about prediction. When asked about a question that is sensitive in some way, for better or for worse, many humans will switch from using a model that is primarily intended to maximize predictive accuracy to one which is about sensitivity to others, or about reinforcing their values or opinions. When asked whether something is possible, they will use a model for whether it is desirable, or when asked whether they can succeed, they will switch to a model of affirmation rather than a model of the likelihood of success.

Like humans, LLMs will switch between their different models of the world. When the scenario changes, they can adapt and change which conceptual models they are using, or thinking about them as playing improv, they can switch characters.

And large language models, like humans, do the switching contextually, without explicit warning that the model being used is changing. They also do so in ways that are incoherent.

Improv is partly a failure to have a consistent model

One failure of current LLMs is that they "hallucinate" answers. That is, they say things that are false, trying to look competent. For example, if they say something and think there should be a citation to a journal article, they will put one in, and if they don't have an actual article, they will make up a plausible seeming one. Improv isn't about accuracy, it's about continuing to play. And because they are trained to play improv based on human data, like humans, they will pretend to know the answer. Even when confronted, they don't usually just say "whoops, I made that up, I don't actually know."

This is implicit in the way that LLMs switch between conceptual models or roles. I suspect the mindsets, and therefore the roleplaying attempts, are coherent only because human thinking is often similar, and the language used often indicates which thought modes are relevant. But this means changes can quickly shift the model's roleplaying behavior. For example, an LLM that can answer a question about the kinetic energy of a bludger probably doesn't have a clear boundary between models of fantasy and models of reality. But switching seamlessly between emulating different people is implicit in what they are attempting to do - predict what happens in a conversation.

It is clear that RLHF can partially fix that, but it's limited - remember, the roleplaying and implicit models are deeply embedded in the LLM, since the implicit models are learned from a diverse dataset. Fine-tuning the model can greatly move the distribution of which mindsets are going to be represented. Prompt design can also bias the model towards (at least initially) representing a specific type of character. These methods cannot, however, override the entirety of the initial training.

Asking it to roleplay granny, or any other "escape" method, is pushing it to no longer work with the part of the model that learned not to talk about dangerous or problematic things.

Many other issues are related

Racism, sexism, and bias are also (partly) failures to use the right roleplaying character, and that potentially comes from the inability to separate the implicit world models and mindsets used by those characters. When an LLM that isn't fine-tuned is playing its typical role, it simulates a typical person or interaction from its dataset, or at least one from the subset of its dataset that is likely conditional on the text so far, as weighted by the fine-tuning that was performed. But many of the people or interactions from the training data are biased, implicitly or explicitly. Adding to the difficulty is that we want the system to do something that most people don't often do, and which therefore isn't well represented in the training data; be incredibly careful to avoid offending anyone, even implicitly, while (hopefully) prioritizing substantive responses over deferring completely on anything that touches on race, gender, or other characteristics^[2].

Even if the proper behavior is present in the training data in sufficient quantities to allow it to emulate the behavior, LLMs don't clearly separate the different people. Instead, they are implicitly training to some weighted mixture of the implicit models - so they aren't necessarily roleplaying someone careful about avoiding bias. RLHF and fine-tuning can modify the distribution, but don't erase the parts of the model that represent biased thinking - and it's not even clear that they can.

Conclusion

Hopefully, this is useful for people trying to understand what exactly people care about when they talk about whether LLMs and other systems have a world model. Even if not, it was useful for me to think through the issues. I'd be happy if people think that this is obvious, in which case I'd appreciate hearing that. I'd also appreciate hearing if I'm confused in some way, or haven't stepped through the arguments clearly enough.

^[1] Even human irrationality is mostly about deviation from otherwise very good reasoning.
^[2] Otherwise, GPT-5 could consist entirely of the code: print("As an AI model trained by OpenAI, I wish to refrain from implicitly or explicitly upsetting anyone, and you should consult a different source.")

Systems that cannot be unsafe cannot be safe

davidmanheim — 2023-05-02T08:53:35.115Z

Epistemic Status: Trying to clarify a confusion people outside of the AI safety community seem to have about what safety means for AI systems.

In engineering and design, there is a process that includes, among other stages, specification, creation, verification and validation, and deployment. Verification and validation are where most people focus when thinking about safety - can we make sure the system performs correctly? I think this is a conceptual error that I want to address.

"Verification and validation (also abbreviated as V&V) are independent procedures that are used together for checking that a product, service, or system meets requirements and specifications and that it fulfills its intended purpose." - Wikipedia

Both of these terms are used slightly differently across fields, but in general, verification is the process of making sure that the system fulfills the design requirements and/or other standards. This pre-supposes that the system has some defined requirements or a standard, at least an implicit one, and that it could fail to meet that bar. That is, the specification of the system includes what it must and must not do, and if the system does not do what it should, or does something that it should not, it fails.

Machine learning systems, especially language models, aren't well understood. The potential applications are varied and uncertain, entire classes of new and surprising failure modes are still being found, and we have nothing like a specification of what the system should or should not do, must or must not do, and where it can and cannot be used.

To take a very concrete example, metal rods have safety characteristics, and they might be rated for use up to some weight limit, under some specific load for some amount of time, in certain temperature ranges. These can all be tested. If the rod does not stay within a predefined range of characteristics at a given temperature, with a given load, it fails. It can also be found to be acceptable in one temperature range but not another, or similar. At the end of verification and validation, the rod is deemed to have passed or failed for a given application, based on what the requirements for that larger system are.
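The rod example can be sketched in code to show why verification presupposes a specification. This is only an illustration: the `RodSpec` and `Measurement` types, the field names, and all numbers are hypothetical and do not come from any real engineering standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RodSpec:
    """Hypothetical requirements that one application places on a rod."""
    max_load_kg: float
    min_temp_c: float
    max_temp_c: float
    max_deflection_mm: float


@dataclass(frozen=True)
class Measurement:
    """One hypothetical test run: conditions applied and deflection observed."""
    load_kg: float
    temp_c: float
    deflection_mm: float


def in_scope(spec: RodSpec, m: Measurement) -> bool:
    """True if the test conditions fall inside what the application requires."""
    return (m.load_kg <= spec.max_load_kg
            and spec.min_temp_c <= m.temp_c <= spec.max_temp_c)


def verify(spec: RodSpec, measurements: list[Measurement]) -> bool:
    """Pass only if in-scope tests exist and all stay within the deflection limit."""
    relevant = [m for m in measurements if in_scope(spec, m)]
    if not relevant:
        return False  # nothing was tested under the required conditions
    return all(m.deflection_mm <= spec.max_deflection_mm for m in relevant)


# The same measurements pass one application's spec and fail another's.
measurements = [Measurement(500, 20, 1.5), Measurement(500, -20, 3.1)]
indoor = RodSpec(max_load_kg=500, min_temp_c=0, max_temp_c=40, max_deflection_mm=2.0)
outdoor = RodSpec(max_load_kg=500, min_temp_c=-30, max_temp_c=40, max_deflection_mm=2.0)
print(verify(indoor, measurements), verify(outdoor, measurements))
```

Note that `verify` cannot even be written without a `RodSpec`: the pass or fail verdict is defined relative to stated requirements for a given application.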

At their best, red-teaming and safety audits of ML systems check lots of known failure modes, and determine whether the system is susceptible to them. There is no pre-defined standard or set of characteristics that are checked, no real ability to consider application specific requirements, and no ability to specify where the system should not or must not be used.

Until we have some safety standard for machine learning models, they aren't "partly safe" or "assumed safe," or "good enough for consumer use." If we lack a standard for safety, ideally one where there is consensus that it is sufficient for a specific application, then exploration or verification of the safety of a machine learning model is meaningless. If a model is released to the public without a clear indication about what the system can safely be used for, with verification that it passed a relevant standard, and clear instruction that it cannot be used elsewhere, it is an unsafe model. Anyone who claims otherwise seems fundamentally confused about what safety means for such systems.

Beyond a better world

davidmanheim — 2022-12-14T10:18:26.810Z

As I take man’s last step from the surface, back home for some time to come — but we believe not too long into the future — I’d like to just say what I believe history will record: that America’s challenge of today has forged man’s destiny of tomorrow. And, as we leave the Moon at Taurus–Littrow, we leave as we came and, God willing, as we shall return, with peace and hope for all mankind. - Eugene Cernan, 14 December, 1972

The past 50 years of the great stagnation have been marked by a sharp decline in humanity’s ambitions. I am referring to, among other things, humanity’s unfortunate retreat from space exploration. Today marks 50 years since a human being left Low Earth Orbit. Of course, there has long been the argument that we need to focus on making things right here on earth, and keep our heads, and astronauts, firmly planted on the ground.

And here on earth, there has been progress. Most notably, we have seen amazing advances in more broadly shared prosperity. At an interpersonal level, violence and discrimination against minorities and women is an ongoing problem, but one that is thankfully (if too-slowly) being addressed. Child abuse is now rare and widely condemned, instead of a fact of life for most children. And obviously, life expectancies have been greatly increased. Humanity has eliminated smallpox, and is poised to do the same for polio. Global poverty has declined precipitously, and while poverty is far from eliminated, the worst-off fraction of the population in most of the world today has access to foods, entertainment, and material comforts undreamt of by kings centuries ago.

Of course, there is the concern that with prosperity and newer technology comes capacity for violence, and through World War Two it seemed humanity was on a trajectory to destroy itself. But instead of destruction, we have seen a continuation and expansion of the post-WWII long peace. While this is at present threatened, the Western world has taken steps to curtail future territorial incentives to violence, reaffirming post-WWII norms against territorial conquest. Our international structures have been wildly successful.

Even newer threats like climate change and engineered pandemics are being addressed - slowly, but with every expectation of success. These new and more global problems could not have been managed by a world at war with itself, but by-and-large, we have found ways to cooperate and coordinate globally. We should be aware of the growing threat of retrenchment or reversal of the trends and expected continued successes, but we should also celebrate progress.

At the same time, there is a sharp limit in how much progress can be achieved by seeking only to stop bad things, whether violence and war, or climate change. Ambition and continued progress require more than just avoiding unacceptable outcomes. The progress in material comforts is primarily the product of innovation, trade, and policy, not redistribution of existing goods. The progress against war is primarily the product of global cooperation, economic statecraft, and robust global institutions, not an imposed peace by the victors of the last war. And the progress against diseases is primarily the product of scientific understanding, medical research, and ambitious global programs, not closing borders or isolating patients.

Unfortunately, ambition has recently been placed in contrast with continuing progress towards equality. This is disappointing. Humanity has been successful so far when it both pushes for ambitious goals and continues to pursue widespread prosperity and safety. Either on its own seems much less viable. Lives that are nasty, brutish, and short are the default, and much lack of equity and violence was due to humanity remaining in, or unevenly emerging from, that state. At the same time, progress imposes new harms, and active government intervention is needed to redistribute the gains to the otherwise-losers. But that possibility is a feature of modern life - governments are stable enough to have persistent and well-run economic policy.

The great stagnation's seemingly widely-shared pessimism undermines progress in every sense. I certainly can’t claim causation, but there is a notable confluence of dystopian sci-fi and escapist fantasy replacing futurist visions, a decline in innovation, and decreasing optimism among the public. People are despairing not only about the long term future and ignoring progress on things like climate, but even about things that have already improved, and seem likely to continue to do so, like air pollution, poverty, or health. That’s not to say there are no threats, but the pessimism, such as not having children because of misplaced concerns about climate, goes far beyond rational concern about future prospects, well into the realm of depression and anxiety disorder.

It took incredible progress to bring humanity to our current far-from-perfect but incredible position, and continued striving for ambitious goals doesn’t undermine that. More poetically, space travel does not require abandoning earth. In fact, quite the opposite; ambition is critical for allowing flourishing. The vast majority of human suffering has been the result of a lack of plentiful resources, either directly, or from humans fighting over those resources. We are winning that fight. So to me, the most worrying thing about the future is not retrenchment and a loss of progress, but a lack of ambition to do more.

We have a promising future. Without being particularly optimistic, it seems likely humanity will eliminate more diseases, build and provide clean and effectively unlimited energy, enhance agricultural productivity and reduce impacts on humans and animals, explore and protect the oceans and other natural habitats, all over the coming century. And these are all worthwhile opportunities - but we can do far more.

It seems that the United States has decided to return to deep space, including missions to send humans back to the moon - redoing a feat accomplished half a century ago. Two years ago, China launched the third space station, following the precedent of the USSR’s Mir and the International Space Station. But if we want to be ambitious, we need to do more than what’s already been done. Much more daring plans for the coming decades, and centuries, seem critical. We can and should work on widely shared prosperity, basic income, and continued planning to explore the universe. We should begin by dreaming bigger for ourselves and our children and continue launching ambitious projects on earth, and beyond.

Far-UVC Light Update: No, LEDs are not around the corner (tweetstorm)

davidmanheim — 2022-11-02T12:57:23.445Z

I wrote a tweetstorm on why 222nm LEDs are not around the corner, and given that there has been some discussion related to this on LessWrong, I thought it was worth reposting here.

People interested in reducing biorisk seem to be super excited about 222nm light to kill pathogens. I’m also really excited - but it’s (unfortunately) probably a decade or more away from widespread usage. Let me explain.

Before I begin, caveat lector: I’m not an expert in this area, and this is just the outcome of my initial review and outreach to experts. And I’d be thrilled for someone to convince me I’m too pessimistic. But I see two and a half problems.

First, to deploy safe 222nm lights, we need safety trials. These will take time. This isn’t just about regulatory approval - we can’t put these in place without understanding a number of unclear safety issues, especially about higher output / stronger 222nm lights.

We can and should accelerate the research, but trials and regulatory approval are both slow. We don’t know about impacts of daily exposure over the long term, or on small children, etc. This will take time - and while we wait, we run into a second problem; the Far-UVC lamps.

Current lamps are KrCl “excimer” lamps, which are only a few percent efficient - and so to put out much Far-UVC light, they get very hot. https://link.springer.com/article/10.1134/1.1448635 This pretty severely limits their use, and means we need many of them for even moderately large spaces.

They also emit a somewhat broad spectrum - part of which needs to be filtered out to be safe - https://pubmed.ncbi.nlm.nih.gov/33465817/ - further reducing efficiency. Low efficiency, very hot lamps all over the place doesn’t sound so feasible.

So people seem skeptical that we can cover large areas with these lamps. The obvious next step, then, is to get a better light source. Instead of excimer lamps, we could use LEDs! Except, of course, that we don’t currently have LEDs that output 222nm light.

(That’s not quite true - there are some research labs that have made prototypes, but they are even less efficient than Excimer lamps, so they aren’t commercially available or anywhere near commercially viable yet, as I’ll explain.)

But first, some physics!
The wavelength of light emitted by an LED is a material property of the semiconductor used. Each semiconductor has a band-gap which corresponds to the wavelength of light LEDs emit.

It seems likely that anything in the range of between, say, 205-225nm would be fine for skin-safe Far-UVC LEDs. So we need a band-gap of somewhere around 5.5 to 6 electron-volts. And we have options. Here’s a list of some semiconductors and band-gaps; https://en.wikipedia.org/wiki/List_of_semiconductor_materials.
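The link between wavelength and band-gap used here is the standard photon-energy relation E = hc/λ. A minimal sketch of the conversion follows; the function names are my own, and treating the emitted photon energy as equal to the band-gap is an idealization, since real devices can emit at somewhat different wavelengths.

```python
# Physical constants (exact SI values since the 2019 redefinition).
PLANCK_J_S = 6.62607015e-34
LIGHT_SPEED_M_S = 299_792_458
ELECTRON_VOLT_J = 1.602176634e-19

# h*c expressed in eV*nm, roughly 1239.84.
HC_EV_NM = PLANCK_J_S * LIGHT_SPEED_M_S / ELECTRON_VOLT_J * 1e9


def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy in electron-volts for a wavelength in nanometers."""
    return HC_EV_NM / wavelength_nm


def emission_wavelength_nm(band_gap_ev: float) -> float:
    """Idealized emission wavelength in nanometers for a band-gap in eV."""
    return HC_EV_NM / band_gap_ev


if __name__ == "__main__":
    # The 205-225nm window from the text, plus the 222nm target itself.
    for nm in (205, 222, 225):
        print(f"{nm} nm -> {photon_energy_ev(nm):.2f} eV")
```

Running this reproduces the rough 5.5 to 6 eV window quoted above, with 222nm landing at about 5.58 eV.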

Blue LEDs use Gallium nitride, with a band-gap of 3.4 eV. Figuring out how to grow and then use Gallium nitride for LEDs won the discoverers a Nobel Prize - so finding how to make new LEDs will probably also be hard. https://www.nature.com/articles/nphoton.2014.291

Aluminum nitride alone has a band gap of 6.015 eV, with light emitted at 210nm. So Aluminum nitride would be perfect… but LEDs from AlN are mediocre. https://physicsworld.com/a/leds-move-into-the-ultraviolet/

Current tech that does pretty well for Far-UVC LEDs uses AlGaN; Aluminium gallium nitride. And when alloyed, AlGaN gives an adjustable band-gap, depending on how much aluminum there is.
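To make the "adjustable band-gap" idea concrete, here is a rough sketch that interpolates between the GaN and AlN band-gaps quoted in this thread. The linear interpolation is a simplifying assumption on my part; real alloys deviate from it, which the optional `bowing_ev` term gestures at, and no particular bowing value is claimed here.

```python
GAN_GAP_EV = 3.4    # gallium nitride band-gap quoted in the text
ALN_GAP_EV = 6.015  # aluminum nitride band-gap quoted in the text


def algan_gap_ev(al_fraction: float, bowing_ev: float = 0.0) -> float:
    """Estimate the AlGaN band-gap for a given aluminum fraction (0 to 1).

    With bowing_ev = 0 this is plain linear interpolation between GaN and AlN.
    A positive bowing_ev bends the curve downward; its value is an assumption
    the caller must supply, not something taken from the source.
    """
    if not 0.0 <= al_fraction <= 1.0:
        raise ValueError("al_fraction must be between 0 and 1")
    x = al_fraction
    linear = x * ALN_GAP_EV + (1.0 - x) * GAN_GAP_EV
    return linear - bowing_ev * x * (1.0 - x)


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"Al fraction {x:.2f}: {algan_gap_ev(x):.2f} eV")
```

Under this simplification, more aluminum means a wider band-gap and therefore shorter-wavelength light, which is why pushing toward 222nm requires aluminum-rich alloys.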

Unfortunately, aluminum gallium nitride alloys only seem to work well down to about 250nm, a bunch higher than 222nm. This needs to get much better. Some experts said a 5-10x improvement is likely, but it will take years.

That’s also not really enough for the best case, universal usage of really cheap disinfecting LEDs all around the world. It also might not get much better, and we’ll be stuck with very low efficiency Far-UVC LEDs, at which point it’s probably better to keep using Excimer lamps.

But fundamental research into other semiconductor materials could allow much better Far-UVC LEDs. One candidate is hexagonal Boron nitride crystals. Another is diamond - which I don’t think will be practical to work with or build LEDs from, but “Diamond LEDs” sound awesome.

If we do find a new promising material, getting a good manufacturing process to make it and create the PN junctions will be critical. And unlike AlGaN, advances in other areas won’t provide benefits for a new material.

Plus we won’t have the existing knowledge of how to make it work. Remember the Nobel-prize for Blue LEDs? It’s hard to figure this stuff out. But people haven’t had a strong reason to do so - disinfecting air changes that.

There’s a bunch of cool physics and simulation tech that lets researchers explore which possible semiconductors could be viable. That seems very worth doing, in case AlGaN doesn’t work, or something better can be found.

Unfortunately, there’s another (half) problem, which is really the first problem again. Remember, whichever LED semiconductor material we find that works, if we find any, probably won’t emit light at the same wavelengths as KrCl excimer lamps.

How do different light sources affect safety? We don’t know exactly. A better LED is likely to be higher output than 222nm lights, and will be at a slightly different wavelength. We might even need entirely new safety studies done at whichever new wavelength we find.

And even if we get those safety studies, getting from there to commercial viability will take time - and it’s unclear how expensive or difficult it will be to make these new LEDs.

This is not to say I’m pessimistic about the idea! I think there’s a >50% chance we find LED materials that work at significantly better than 10% efficiency, are cheap, and are safe for humans. (Conditional on 222nm being found safe.) But it’s a decade or more away.

That’s OK. We can plan for a decade or more in the future. As attention to the area grows, people are doing exactly that. So as usual, I’m excited that the future will be awesome, and can be made much safer than the present, at least from biorisks.

But we definitely don’t want to fool ourselves into assuming there is a silver bullet around the corner. And even once it’s around, it won’t eliminate the need for multi-layered protection against future pandemics - and we should be investing in those other parts now as well.

Announcing AISIC 2022 - the AI Safety Israel Conference, October 19-20

davidmanheim — 2022-09-21T19:32:35.581Z

ALTER, with the University of Haifa and the Technion, is excited to announce the details about its upcoming conference introducing AI safety in Israel. We’re going to be hosting Stuart Russell, as well as several other AI safety researchers, to speak about the field. We're also hosting a number of Israeli academics, who will speak about their related research.

The goal of the conference is to build a community of interest around AI safety in Israel and to let people know about the field. Following up on the conference, we hope to connect interested attendees with work going on internationally. (This is not intended to be a conference where novel AI safety work is presented, and the primary audience is people who would otherwise be working on AI capabilities research.)

If you are, or know anyone who is, interested in AI safety in Israel, we would be happy for you or them to register and attend!

Rehovot, Israel – ACX Meetups Everywhere 2022

davidmanheim — 2022-08-25T18:01:16.106Z

This year's ACX Meetup everywhere in Rehovot, Israel.

Location: Outside porch of Aroma Coffee, Herzl 218, Rehovot (הרצל 218, רחובות) – 8G3PWR25+MP

Please RSVP [on Facebook](https://www.facebook.com/events/737808667280605/) so we can give updates if needed

Contact: David@alter.org.il


AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra

davidmanheim — 2022-04-03T07:45:57.592Z

It has been suggested that in a rapid enough takeoff scenario, governance would not be useful, because the transition to superintelligence would be too rapid for human actors - whether governments, corporations, or individuals - to respond to. This seems to imply that we only care about takeoff speed. And if that is the only relevant factor, the case for governance only applies if you believe slow takeoff is likely. Of course, it also matters how long we have until takeoff - but even so, I think this leaves a fair amount on the table in terms of what governance could do, and I want to try to make the case that even in that world, governance (still defined broadly¹) is important - though in different ways.

The Easy/Hard Spectrum

To make the argument, I will lay out three possibilities about AI alignment which are orthogonal to takeoff speed and timing; alignment-by-default, prosaic alignment, and provable alignment. These are actually somewhat of a spectrum, with the three scenarios spaced along it. In any case, for each possibility, governance needs to accomplish very different things in order to be successful, according to the above definition - and the relationship with takeoff speeds seems important, but not fully determinative.

The first possibility, alignment-by-default, is that if we train systems via reinforcement learning or similar, then even without particular effort to solve alignment, all systems which are successful end up learning policies and goals close enough to human values that they are beneficial and influenceable. In the slower takeoff case, initially, governance looks a lot like human governance, making sure that actors, both human and AI, can cooperate and follow mutually understood and agreed upon rules. Later, and in the faster takeoff case, our efforts towards governance become irrelevant as the AI systems replace human structures, or improve them.

The second possibility, prosaic alignment, is that alignment of artificial intelligence systems is somewhat difficult, but achievable via approaches which can be developed. So some systems will be aligned, but without oversight, unaligned systems are possible or likely. In this case, the key task of governance is to ensure that all early HLMI/PASTA/AGI systems undergo robust alignment procedures. Prior to the emergence of such systems, many tasks will be useful for ensuring this outcome, including monitoring progress, developing standards, and building norms about safety. But as above, later and/or in the faster takeoff cases, governance becomes less relevant. Note, however, that this means more emphasis is needed on pre-emergence and early stage efforts, rather than eliminating the need for governance.

The final possibility is that the only way alignment can occur is via currently-impossible provable alignment. In this case, it may be that there are few potential ways to train safe AGI, and almost all earlier attempts are dangerous. Somewhat similar to the previous case, the key task is to prevent misaligned systems. In a fast takeoff case, the entirety of the usefulness of governance is prior to emergence, perhaps via intensive monitoring or limits on compute, while in a slow takeoff case, there is some chance that governance can prevent disaster while allowing work in AI, perhaps via some sort of policing, a la lsusr’s Bayeswatch.

Along the different spectra

There are now three different dimensions being discussed. The first is how long we have until takeoff begins, which determines how much time we have to solve the various problems. The second is difficulty of alignment, which I argued above determines the key task of governance, whether it is to prevent unaligned systems, or it is to ensure that systems are aligned. And lastly, there is the speed of takeoff, which determines how much time governance has to act once takeoff begins.

In this model, along the second two dimensions, as either speed or difficulty increases, the relative emphasis on pre-AGI governance increases, and the usefulness of governance during the transition decreases. This leaves us with effectively a single dimension, albeit still one that is orthogonal to when takeoff occurs. And while there is certainly a class of interventions which are helpful towards one end of the spectrum, but harmful on the other², there is also the real possibility that we can find approaches which are beneficial in both cases.

As a few small examples of what these might look like, regardless of where on the spectrum we are, governance can reduce risks by 1) monitoring compute usage and capabilities to enable response, 2) vastly improving computer security for AI labs which could prevent or slow at least some forms of takeoff, and 3) building norms around care taken in development, testing, and deployment of proto-AGI systems.

1) Allan Dafoe has suggested that “AI governance concerns how humanity can best navigate the transition to a world with advanced AI systems.” This seems broadly correct, and to add to it, he has suggested it concerns “norms and institutions shaping how AI is built and deployed, as well as the policy and research efforts to make it go well.”

2) This analysis implies that the vast majority of governance efforts matter in slow takeoff / relatively easy alignment worlds, but are irrelevant or in some cases even harmful in faster takeoff / harder alignment worlds. This is an issue, but the existence of such tradeoffs alone does not imply that these approaches should not be seriously considered or pursued.

Thanks to Allan Dafoe for very helpful feedback on an earlier version of this.

Arguments about Highly Reliable Agent Designs as a Useful Path to Artificial Intelligence Safety

davidmanheim — 2022-01-27T13:13:11.011Z

This paper is a revised and expanded version of my blog post Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate, now with David Manheim as co-author.

Abstract:

Several different approaches exist for ensuring the safety of future Transformative Artificial Intelligence (TAI) or Artificial Superintelligence (ASI) systems, and proponents of different approaches have made different and debated claims about the importance or usefulness of their work in the near term, and for future systems. Highly Reliable Agent Designs (HRAD) is one of the most controversial and ambitious approaches, championed by the Machine Intelligence Research Institute, among others, and various arguments have been made about whether and how it reduces risks from future AI systems. In order to reduce confusion in the debate about AI safety, here we build on a previous discussion by Rice which collects and presents four central arguments which are used to justify HRAD as a path towards safety of AI systems.
We have titled the arguments (1) incidental utility, (2) deconfusion, (3) precise specification, and (4) prediction. Each of these makes different, partly conflicting claims about how future AI systems can be risky. We have explained the assumptions and claims based on a review of published and informal literature, along with consultation with experts who have stated positions on the topic. Finally, we have briefly outlined arguments against each approach and against the agenda overall.

See also this Twitter thread where David summarizes the paper.

Elicitation for Modeling Transformative AI Risks

davidmanheim — 2021-12-16T15:24:04.926Z

This post is part 8 in our sequence on Modeling Transformative AI Risk. We are building a model to understand debates around existential risks from advanced AI. The model is made with Analytica software, and consists of nodes (representing key hypotheses and cruxes) and edges (representing the relationships between these cruxes), with final output corresponding to the likelihood of various potential failure scenarios. You can read more about the motivation for our project and how the model works in the Introduction post. Unlike other posts in the sequence, this discusses the related but distinct work around Elicitation.

We are interested in feedback on this post, but to a greater extent than the other posts, we are interested in discussing what might be useful, and how to proceed with this. We would also welcome discussion from people working independently on elicitations, as we have discussed this extensively with other groups, many of whom are doing related work.

As discussed in previous posts in this series, the model we have built is a tentative one, and requires expert feedback and input. The traditional academic method for getting such feedback and input is usually referred to as elicitation, and an extensive field of academic work discusses how this can best be done. (As simple examples, this might include eliciting an estimated cost, a probability distribution, or a rank ordering of which outcomes from a project are most important.)

Elicitation of expert views is particularly critical in AI safety for both understanding debates between experts and representing the associated probabilities. At the same time, many elicitation and forecasting projects have the advantage of unambiguous and concrete questions with answers that will be observed in the near term, or ask for preferences about outcomes which are well understood. Because these advantages are mostly absent for AI safety questions, the focus in this project is on understanding debates (instead of attempting to settle debates that are already understood, or are not resolvable even in theory). This means that there is no intent to elicit a “correct” answer to questions which may be based on debated or disputed assumptions. For this reason, we have taken an approach designed to start with better understanding experts’ views of the domain overall, rather than focus on the outcomes directly. This leads to opportunities for better understanding the sources of disagreement.

The remainder of this post first discusses what elicitation can and should be able to accomplish in this domain, and for this project, as well as what conceptual and actual approaches we are using. This should help explain how elicited information can inform the concrete model, which can then hopefully help inform decisions - or at least clarify why the decisions about approaches to take are disputed. Following that, we outline our tentative future plan, and what additional steps for elicitation may look like.

What to do about forecasting given uncertainties and debates?

In domains where the structure of uncertainties is clear, and not debated, it is possible to build a model similar to that built in the current project, ask experts whether the structure is correct, and based on their input, build a final Directed Acyclic Graph or other representation of the joint distribution that correctly represents their views. After this, we would ask experts to attach probability distributions to the various uncertainties, perhaps averaging their opinions for each node in the DAG, so that we could get quantitative predictions for the outcomes via Monte Carlo.
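As a toy illustration of that pipeline (not the project's Analytica model), the sketch below averages made-up expert estimates for each node of a three-node chain and then samples the graph to estimate an outcome probability. The node names, the graph structure, and every number are hypothetical.

```python
import random
from statistics import mean

# Hypothetical three-node chain: capability -> misalignment -> catastrophe.
PARENTS = {
    "capability": (),
    "misalignment": ("capability",),
    "catastrophe": ("misalignment",),
}
ORDER = ["capability", "misalignment", "catastrophe"]  # parents before children

# Made-up estimates from three experts of P(node is true | parent values).
EXPERT_ESTIMATES = {
    "capability": {(): [0.6, 0.8, 0.7]},
    "misalignment": {(True,): [0.3, 0.5, 0.4], (False,): [0.0, 0.0, 0.0]},
    "catastrophe": {(True,): [0.2, 0.6, 0.4], (False,): [0.0, 0.02, 0.01]},
}

# Pool the experts once, by simple averaging, for each node and parent state.
POOLED = {
    node: {parents: mean(values) for parents, values in table.items()}
    for node, table in EXPERT_ESTIMATES.items()
}


def sample_world(rng: random.Random) -> dict[str, bool]:
    """Draw one joint outcome by sampling each node given its parents."""
    state: dict[str, bool] = {}
    for node in ORDER:
        parent_values = tuple(state[p] for p in PARENTS[node])
        state[node] = rng.random() < POOLED[node][parent_values]
    return state


def estimate_probability(node: str, n_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the marginal probability that a node is true."""
    rng = random.Random(seed)
    hits = sum(sample_world(rng)[node] for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    print(f"P(catastrophe) is roughly {estimate_probability('catastrophe'):.3f}")
```

With these invented inputs the exact answer can be worked out by hand as 0.28 × 0.4 + 0.72 × 0.01 = 0.1192, so the sampled estimate should land close to that. Simple averaging is only one pooling choice among many, and it hides exactly the kind of structural disagreement discussed below.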

Long-term forecasting of deeply uncertain and debated outcomes in a domain like the future of AI is, for obvious reasons, extremely unreliable for predictions. And yet, we still need to make best-guess estimates for decision purposes, and in fact we implicitly have assigned probabilities and have implicit goals which are used and maximized when making any sort of decision related to the topic. Making this explicit involves ensuring that everyone’s varying assumptions or assertions are understood, which leads to both the motivation for and challenging nature of the current project.

By representing different structural assumptions about the future pathway of AI, and various models of how AI risks can be addressed, we can better understand where disagreements are due to fundamental differences (“AI will be aligned by default” vs. “The space of possible ML Minds contains at least some misaligned agents” vs. “Vanishingly few potential AIs are aligned”), and where they are due to quantitative differences in empirical estimates (“50% Confidence we will have ASI by 2030” vs. “90% confidence we won’t have ASI before 2050”). While these examples may be obvious, it is unclear whether others exist which are less so - and even the “obvious” debates may not be recognized by everyone as being legitimately debated.

For this reason, in addition to understanding specific debates, we need to represent uncertainty about both quantitative estimates and about the ground truth for conceptual debates. One way we plan to address this is by incorporating confidence measures for expert opinions about the debated features or assumptions in our probability estimates. Another is accounting for the arguments from analogy which many of these claims are based upon. For example, an expectation that ML progress will continue at a given pace, based on previous trends, is not an (explicit/gears-level) model of hardware or software progress, but it often informs an estimate and makes implicit assumptions about the solutions to the debated issues nonetheless.

However, it is at least arguable that a decision maker should incorporate information from across multiple viewpoints. This is because unresolved debates are also a form of uncertainty, and should be incorporated when considering options. One way we plan to address this is by explicitly including what we call “meta-uncertainties” in our model. Meta-uncertainties are intended to include all factors that a rational decision maker should take into account when making a decision using our model, but which do not correspond to a specific object-level question in the model.

One such meta-uncertainty is the reliability of long-term forecasting in general. If we think that long-term forecasting is very unreliable, we can use that as a factor that essentially downweights the confidence we have in any conclusions generated by the rest of our model. Other meta-uncertainties include: the reliability of expert elicitations in general and our elicitation in particular, structural uncertainties in our own model (how confident are we that we got this model right?), reference class uncertainty (did we pick the right reference classes?), potential cognitive biases that might be involved, and the possibility of unknown unknowns.

Using Elicitations

Given the above discussion, the typical approach of eliciting expert opinions and aggregating them into a best-guess forecast is not sufficient. Several challenges exist, from selecting experts to representing their opinions to aggregating or weighting differing views. But before doing any of these, more clarity about what is being asked is needed.

What are the current plans?

Prior to doing anything resembling traditional quantitative elicitation, we need to have clarity in what is being elicited, so that the respondents are both clear about what is being asked and are answering the same question as one another. We also need to be certain that they are answering the same question as what we think is being asked. For example, asking for a timeline to HLMI is unhelpful if respondents have different ideas of what the term means, or dispute its validity as a concept. For this reason, our current work is focused on understanding which terms and concepts are understood, and which are debated.

It seems that one of the most useful methods of eliciting feedback on the model is via requesting and receiving comments on this series of posts, and discussions that arise from it. Going further, a paper is being written which reviews past elicitations and looks at where they succeeded or failed. Building on the review of the many different past elicitation projects and approaches, a few of which are linked, and as a way to ensure we properly understand the disagreements which exist in AI alignment and safety, David Manheim, Ross Gruetzemacher, and Julie Marble are working on new elicitation methods to help refine our understanding. We have tested, and are continuing to test, these methods internally, but we have not yet utilized them with external experts. As noted, the goal of this initial work is to better understand experts’ conceptual models, using methods such as guided pile sorts, explained below, and qualitative discussion.

The specific approach discussed below, called pile sorting, is adapted from sociology and anthropology. We have used this because it allows for discussion of terms without forcing a structure onto the discussion, and allows for feedback in an interactive way.

(Sample) initial prompt for the pile sorting task:

“The following set of terms are related to artificial intelligence and AI safety in various ways. The cards can be moved, and we would like you to group them in a way that seems useful for understanding what the terms are. While doing so, please feel free to talk about why, or what you are uncertain about. If any terms seem related but are missing, or there are things you think would be good to add or include, feel free to create additional cards. During the sorting, we may prompt you, for instance, by asking you why you chose to group things, or what the connection between items or groups is.”

Elicitation Prompt

Based on this prompt, we engage in a guided discussion where we ask questions like how the terms are understood, why items have been grouped together, what the relationships between them are, and whether others would agree. It is common for some items to fit into multiple groups, and participants are encouraged to duplicate cards when this occurs. The outputs of this include both our notes about key questions and uncertainties, and the actual grouping. The final state of the board in one of our sample sessions looked like the below:

Example Elicitation Outcome

This procedure, and the discussions with participants about why they chose the groupings they did, is intended to ensure that we have a useful working understanding of experts’ general views on various topics related to AI safety. The results will be compared across experts to see whether there are conflicting or contrasting conceptual models, and where the differences are. While methods exist for analyzing such data, these are typically for clearer types of questions and simpler sorting. For that reason, one key challenge which we have not resolved is how this elicitation can be summarized or presented clearly, other than via extensive qualitative discussions.

In addition to the card sort, we are considering and pursuing several other elicitation approaches intended to accomplish related or further goals in this vein. But in order to do any elicitation, including these, there are some key challenges.

How do you judge who is an “expert?”

This is a difficult issue, and varies depending on the particular hypothesis or proposition we’re asking about. It also depends on whether we view experts as good at prediction, or good at proposing useful mental models that can then be predicted about by forecasters. For deconfusion and definition disputes, the relevant experts are likely in the AI safety community and closely related areas. For other questions the relevant experts might be machine learning researchers, cognitive scientists, or evolutionary biologists. And of course, in each case, depending on the type of question, we may need to incorporate disputes rather than just estimates.

For example, if we were to ask “will mesa-optimizers emerge,” we need to rely on a clear understanding of what mesa-optimizers are. Unfortunately, this is somewhat debated, so different researchers' answers will not reflect the same claims. Furthermore, those who are not already concerned about the issue will likely be unable to usefully answer, given that the terms are unclear - biasing the results. For this reason, we have started with conceptual approaches, such as the above pile-sorting task.

Relatedly, in many cases, we also need to ask questions to multiple groups to discover if experts in different research areas have different views on a question. We expect different conceptual models to inform differences in opinions about the relationship between different outcomes, and knowing what those models are is helpful in disambiguating and ensuring that experts’ answers are interpreted correctly.

Another great challenge in selecting experts is that the relevant experts for topics such as AI safety and machine learning are often those who are working at the forefront of the field, and whose time is most valuable. Of course, this depends on how you define or measure domain expertise, but the value of previous elicitations is strongly correlated with the value of experts’ contributions in narrow domains. The difference in knowledge and perspective between those leading the field and those performing essentially Kuhn’s ‘normal science’ is dramatic, and we hope that the novel elicitation techniques that we are working on can enable us to weight the structure emerging from leading experts’ elicitations appropriately.

Following the identification of experts, there is a critical question: Is the value of expert judgment limited to qualitative information or to coming up with approaches in practice, or are experts also well calibrated for prediction? This is not critical at the current stage, but becomes more important later. There are good reasons to think that generalist forecasters have an advantage, and depending on progress and usefulness of accurate quantification, this may be a critical tool for later stages of the project. We are interested in exploring forecasting techniques that combine domain experts and generalist forecasters in ways intended to capitalize on the relative expertise of both populations.

How will we represent uncertainties?

For any object-level issue, in addition to understanding disputes, we need to incorporate uncertainties. Incorporation of uncertainty is both important for not misunderstanding expert views, and as a tool to investigate those differences in viewpoints. For this reason, when we ask for forecasts or use quantitative elicitations to ask experts for their best-guess probability estimates, we would also need to ask for the level of confidence that they have in those estimates, or their distribution of expected outcomes.

In some cases, experts or forecasters will themselves have uncertainties over debated propositions. For example, if asked about the rate of hardware advances, they may say that overall, they would guess a rate with distribution X, but that distribution depends on economic growth. If pre-HLMI AI accelerates economic growth, they expect hardware progress to follow one distribution, whereas if not, they expect it to follow another. In this case, it is possible for the elicitation to use the information to inform the model structure as well as the numeric estimate.
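As a rough illustration of how such a conditional answer could be encoded, the sketch below samples hardware progress from one of two distributions, depending on whether pre-HLMI AI accelerates economic growth. The probability of acceleration, both growth distributions, and the function names are made-up placeholders for illustration, not elicited values or part of the actual model.

```python
import random
import statistics

def sample_hardware_growth(rng, p_accel=0.3):
    """Draw one annual hardware-progress rate from a two-branch mixture.

    All numbers here are illustrative placeholders, not elicited values.
    """
    ai_accelerates_growth = rng.random() < p_accel
    if ai_accelerates_growth:
        # Branch 1: faster progress if pre-HLMI AI accelerates the economy.
        return rng.gauss(0.50, 0.10)
    # Branch 2: slower progress otherwise.
    return rng.gauss(0.30, 0.05)

def simulate(n=20000, seed=0, p_accel=0.3):
    """Return n draws from the mixture, using a seeded generator."""
    rng = random.Random(seed)
    return [sample_hardware_growth(rng, p_accel) for _ in range(n)]

samples = simulate()
mixture_mean = statistics.fmean(samples)
print(f"mean hardware growth rate: {mixture_mean:.3f}")
```

The branching variable is what would become a separate node in the model structure, while the two distributions are the numeric estimates attached to it.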

As an aside, while we do by default intend to represent both structural debates and estimates as probabilities, there are other approaches. Measures of confidence of this type can be modeled as imprecise probabilities, as distributions over probability estimates (“second-order probabilities”), or using other approaches (e.g., causal networks, Dempster-Shafer theory, subjective logic). We have not yet fully settled on which approach or set of approaches to use for our purposes, but for the sake of simplicity, and for the purpose of decision making, the model will then need to represent the measures of confidence as distributions over probability estimates.
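To make "distributions over probability estimates" concrete, here is a minimal sketch that represents each expert's view as a Beta distribution over the probability in question. The choice of a Beta distribution and the parameter values are assumptions made for illustration only; the text above does not commit to any particular family.

```python
import random
import statistics

def second_order_samples(alpha, beta, n=10000, seed=0):
    """Sample plausible values of a probability p from Beta(alpha, beta)."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]

# Two hypothetical experts with the same best guess (0.2) but different confidence.
confident = second_order_samples(alpha=20, beta=80)
unsure = second_order_samples(alpha=2, beta=8)

for name, draws in [("confident", confident), ("unsure", unsure)]:
    mean = statistics.fmean(draws)
    spread = statistics.pstdev(draws)
    print(f"{name}: mean={mean:.3f} spread={spread:.3f}")
```

Both experts would report the same point estimate, but the spread of the second-order distribution carries the difference in confidence that a single number hides.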

Will this be informative?

It is possible that the more valuable portion of the work is the conceptual model, rather than quantitative estimates, or that the conceptual elicitations we are planning are unlikely to provide useful understanding of the domain. This is a critical question, and one that we hope will be resolved based on feedback from the team internally, outside advisors, and feedback from decision makers in the EA and longtermist community who we hope to inform.

What are the next steps?

The current plans are very much contingent on feedback, but conditional on receiving positive feedback, we are hoping to run the elicitations we have designed, and move forward from there. We would also be interested in finding others that are interested in working with us on both the current elicitation projects, and thinking about what should come next, and have reached out to some potential collaborators.

Modelling Transformative AI Risks (MTAIR) Project: Introduction

davidmanheim — 2021-08-16T07:12:22.277Z

Numerous books, articles, and blog posts have laid out reasons to think that AI might pose catastrophic or existential risks for the future of humanity. However, these reasons often differ from each other both in details and in main conceptual arguments, and other researchers have questioned or disputed many of the key assumptions and arguments.

The disputes and associated discussions can often become quite long and complex, and they can involve many different arguments, counter-arguments, sub-arguments, implicit assumptions, and references to other discussions or debated positions. Many of the relevant debates and hypotheses are also subtly related to each other.

Two years ago, Ben Cottier and Rohin Shah created a hypothesis map, shown below, which provided a useful starting point for untangling and clarifying some of these interrelated hypotheses and disputes.

The MTAIR project is an attempt to build on this earlier work by including additional hypotheses, debates, and uncertainties, and by including more recent research. We are also attempting to convert Cottier and Shah’s informal diagram style into a quantitative model that can incorporate explicit probability estimates, measures of uncertainty, relevant data, and other quantitative factors or analysis, in a way that might be useful for planning or decision-making purposes.

Cottier and Shah's 2019 Hypothesis Map for AI Alignment

This post is the first in a series which presents our preliminary outputs from this project, along with some of our plans going forward. Although the project is still a work in progress, we believe that we are now at a stage where we can productively engage the community, both to contribute to the relevant discourse and to solicit feedback, critiques, and suggestions.

This introductory post gives a brief conceptual overview of our approach and a high-level walkthrough of the hypothesis map that we have developed. Subsequent posts will go into much more detail on different parts of this model. We are primarily interested in feedback on the portions of the model that we are presenting in detail. In the final posts of this sequence we will describe some of our plans going forward.

Conceptual Approach

There are two primary parts to the MTAIR project. The first part, which is still ongoing, involves creating a qualitative map (“model”) of key hypotheses, cruxes, and relationships, as described earlier. The second part, which is still largely in the planning phase, is to convert our qualitative map into a quantitative model with elicited values from experts, in a way that can be useful for decision-making purposes.

Mapping key hypotheses: As mentioned above, this part of the project involves an ongoing effort to map out the key hypotheses and debate cruxes relevant to risks from Transformative AI, in a manner comparable to and building upon the earlier diagram by Ben Cottier and Rohin Shah. As shown in the conceptual diagram below, the idea is to create a qualitative map showing how the various disagreements and hypotheses (blue nodes) are related to each other, how different proposed technical or governance agendas (green nodes) relate to different disagreements and hypotheses, and how all of those factors feed into the likelihood that different catastrophe scenarios (red nodes) might materialize.

Qualitative map illustrating relationships between hypotheses, propositions, safety agendas, and outcomes

Quantification and decision analysis: Our longer-term plan is to convert our hypothesis map into a quantitative model that can be used to calculate decision-relevant probability estimates. For example, a completed model could output a roughly estimated probability of transformative AI arriving by a given date, a given catastrophe scenario materializing, or a given approach successfully preventing a catastrophe.

Notional version of how the above qualitative map can be used for quantification and analysis

The basic idea is to take any available data, along with probability estimates or structural beliefs elicited from relevant experts (which users can modify or replace with their own estimates as desired), and use them as inputs to the model. Once this model is fully implemented, we can then calculate probability estimates for downstream nodes of interest via Monte Carlo, based either on a subset or a weighted average of expert opinions, or using specific claims about the structure or quantities of interest, or a combination of the above. Finally, even if the outputs are not accepted, we can use the indicative values as inputs for a variety of analysis tools or formal decision-making techniques. For example, we might consider the choice to pursue a given alignment strategy, and use the model as an aid to think about how the payoff of investments changes if we believe hardware progress will accelerate or if we presume that there is relatively more existential risk from nearer-term failures.
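The Monte Carlo step can be sketched for a toy two-node model in which node A influences node B. This is not the project's Analytica implementation: the expert weights, the probabilities, and the function names below are hypothetical placeholders, chosen only to show how a weighted set of expert opinions could be propagated to a downstream node.

```python
import random

# Hypothetical experts: (weight, P(A), P(B | A), P(B | not A)). Placeholder numbers.
EXPERTS = [
    (0.5, 0.6, 0.30, 0.05),
    (0.3, 0.2, 0.50, 0.10),
    (0.2, 0.9, 0.10, 0.01),
]

def estimate_downstream(experts, n=50000, seed=0):
    """Monte Carlo estimate of P(B), sampling one expert by weight on each run."""
    rng = random.Random(seed)
    weights = [w for w, *_ in experts]
    hits = 0
    for _ in range(n):
        _, p_a, p_b_given_a, p_b_given_not_a = rng.choices(experts, weights=weights)[0]
        a = rng.random() < p_a
        b = rng.random() < (p_b_given_a if a else p_b_given_not_a)
        hits += b
    return hits / n

def exact_downstream(experts):
    """Closed-form check: weighted average of each expert's implied P(B)."""
    total = sum(w for w, *_ in experts)
    return sum(
        w * (p_a * p_ba + (1 - p_a) * p_bna) for w, p_a, p_ba, p_bna in experts
    ) / total

print(estimate_downstream(EXPERTS), exact_downstream(EXPERTS))
```

For a model this small the closed form is available and sampling is unnecessary; simulation becomes useful once the graph has many nodes and the node-combination rules are no longer simple to multiply out by hand.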

Most of the posts in this series will focus on the qualitative mapping part of the project, since that has been our primary focus to date. In our last post we will discuss our plans related to the second, quantitative, part of the project.

Model Overview

The next several posts in this sequence will dive into the details of our current qualitative model. Each post will be written by team members involved in crafting that particular part of the model, as different team members or groups of team members worked on different parts of the model.

The structure of each part of the model is primarily based on a literature review and the understanding of the team members, along with considerable feedback and input from researchers outside the team. As noted above, this series of posts will hopefully continue to gather input from the community and lead to further discussions. At the same time, the various parts of the model are interrelated. Daniel Eth is leading the ongoing work of integrating the individual parts of the model, as we continue developing a better understanding of how the issues addressed in each component relate to each other.

Note on Implementation and Software: At present, we are using Analytica, a “visual software environment for building, exploring, and sharing quantitative decision models that generate prescriptive results.” The models that will be displayed in the rest of this sequence were created using this software program. Note: If you have Windows you can download the free version of Analytica and once the full sequence of posts is available, we hope to make the model files available, if not publicly, at least on request. To edit the full model you unfortunately need the expensive licensed version of Analytica, since the free version is limited to editing small models and viewing models created by others. There are some ways around this restriction if you only want to edit individual parts of the model - once the sequence has been posted, please message Daniel Eth, David Manheim, or Aryeh Englander for more information.

How to read Analytica models

Before presenting an overview of the model, and as a reference for later posts, we present a brief explanation of how these models work, and how they should be read. Analytica models are composed of different types of nodes, with the relationships between nodes represented as directed edges (arrows). The two primary types of nodes in our model are variable nodes and modules. Variable nodes are usually oval or rounded rectangles without bolded outlines, and correspond to key hypotheses, cruxes of disagreement, or other parameters of interest. Modules, represented by rounded rectangles with bolded outlines, are “sub-models” that contain their own sets of nodes and relationships. In our model we also sometimes use small square nodes to visually represent AND, OR, or NOT relationships. In the software, a far wider set of ways to combine outputs from nodes is available, and will be used in our model - but they are difficult to represent visually.

Arrows represent directions of probabilistic influence, in the sense that information about the first node influences the probability estimate for the second node. For example, an arrow from Variable A to Variable B indicates that the probability of B depends at least in part on the probability of A. It is important to note that the model is not a causal model per se. An edge from one node to another does not necessarily imply that the first causes the second, but rather that there is some relationship between them such that information about the first informs the probability estimate for the second. Some edges do represent causal relationships, but only insofar as that relationship is important for informing probability estimates.
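A small numeric sketch may help show why an arrow is informational rather than strictly causal. The functions below apply the law of total probability and Bayes' rule to a single edge from A to B; the function names and the example numbers are illustrative assumptions, not values from the model.

```python
def prob_b(p_a, p_b_given_a, p_b_given_not_a):
    """P(B) via the law of total probability over A."""
    return p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

def prob_a_given_b(p_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) via Bayes' rule: observing B also informs A."""
    return p_a * p_b_given_a / prob_b(p_a, p_b_given_a, p_b_given_not_a)

# Illustrative numbers only.
p_a, p_b_given_a, p_b_given_not_a = 0.4, 0.7, 0.1
print(prob_b(p_a, p_b_given_a, p_b_given_not_a))
print(prob_a_given_b(p_a, p_b_given_a, p_b_given_not_a))
```

The edge is drawn from A to B, yet learning that B holds raises the probability of A above its prior of 0.4. The arrow fixes the direction in which the model computes, not the direction in which evidence can flow.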

Different parts of the model use various color schemes to group nodes that share certain characteristics, but color does not have any formal meaning in Analytica and is not necessary to make sense of the model. The color schemes for individual parts of the model will be explained as needed, but color differences can be safely ignored if they become confusing.

Other things to note:

In some of the diagrams there are small arrowheads leading into or out of certain nodes, but which do not point to any other node in the diagram. These arrowheads indicate that there are nodes elsewhere in the model that depend on this node or that this node depends on.

“Alias nodes” are copies of nodes that link back to the original “real” node, and are mainly useful for display or readability purposes. We use alias nodes in many parts of our diagrams, especially when a node from one module influences or is influenced by some important node(s) elsewhere in the model. Analytica indicates that a node is an alias by displaying the node name in italics.

Our model is technically a directed acyclic graph. However, there are a few places in the model diagrams where Analytica confusingly displays bidirectional arrows between modules even though the direction of influence only goes in one direction. This is because Analytica uses arrows not just to indicate direction of influence, but also to indicate that one module contains an alias node from a different module. For example, the direction of influence in the image below is from Variable A in Module 1 to Variable B in Module 2, but Analytica displays a bidirectional arrow between the modules because Module 1 also contains an alias node from Module 2.
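For readers who want to check the acyclic property on their own node-level edge list, here is a minimal sketch using Python's standard library. It is independent of Analytica, and the node names and edges are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical edges: each node maps to the set of nodes it depends on.
DEPENDS_ON = {
    "Variable B": {"Variable A"},
    "Variable C": {"Variable A", "Variable B"},
}

def evaluation_order(depends_on):
    """Return nodes in an order where every node follows its inputs.

    Raises ValueError if the graph contains a cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"model is not acyclic: {err.args[1]}") from err

print(evaluation_order(DEPENDS_ON))
```

A bidirectional arrow drawn between two modules would not fail this check as long as the underlying node-level edges still run in one direction only.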

Top-level model walkthrough

The image below represents the top-level diagram of our current model. Most of the nodes in this diagram are their own separate modules, each with their own set of nodes and relationships. Most of these modules will be discussed in much more detail in later posts.

In this overview, we highlight key potential nodes and the related questions, and discuss how they are interrelated at a high level. This overview, which in part explains the diagram below, is meant to provide a basic outline of what later posts will discuss in much more detail. (Note that the arrows represent the direction of inference in the model, rather than the underlying causal relationships. Also note that the relationships between the modules reflect dependencies between the individual nodes in the modules, rather than just notional suggestions about the relationships between the concepts represented by the modules themselves.)

High-level model overview

The blue nodes on the left represent technical or other developments or future progress areas that are potentially relevant inputs to the rest of the model. They are: Neuroscience / Neurotechnology, AI Progression & Requirements, Hardware Progression, and Race to HLMI. Finally, Analogies and General Priors on Intelligence, which address many assumptions and arguments by analogy from domains like human evolution, are used to ground debates about AI takeoff or timelines. These are the key inputs for understanding progress towards HLMI(1).

The main internal portions of the model (largely in orange) represent the relationships between different hypotheses and potential medium-term outcomes. Several key parts of this, which will be discussed in future posts, include paths to High-Level Machine Intelligence (HLMI) (and the inputs to it, in the blue modules), Takeoff/discontinuities, and Mesa-optimization. Impacting these are different safety agendas (along the top in green), which will be reviewed in another post.

Finally, the nodes on the bottom right represent conditions leading to failure (yellow) and failure modes (red). For instance, the possibility of Misaligned HLMI (bottom right in red) motivates the critical question of how the misalignment can be prevented. Two possibilities are modelled (orange nodes, right): The first possibility is that HLMI is aligned ahead of time (using Outer Alignment, Inner Alignment and, if necessary, Foundational Research). The second possibility is that we can ‘correct course as we go’, for instance, by using an alignment method that ensures the HLMI is corrigible.

While our model has intermediate outputs (which when complete will include estimates of HLMI timelines and takeoff speed), its principal outputs are the predictions for the modules marked in red. Catastrophically Misaligned HLMI covers scenarios involving a single HLMI or a coalition achieving a Decisive Strategic Advantage (DSA) over the rest of the world and causing an existential catastrophe. Loss of Control covers ‘creeping failure’ scenarios, including those that don’t require a coalition or individual to seize a DSA.

The Model is (Already) Wrong

We expect that readers will disagree with us, and with one another, about various points - we hope you flag these issues. At the same time, the above is only a high level overview, and we already know that many items in the above overview are contentious or unclear - which is exactly why we are trying to map it more clearly.

Throughout this work, we attempt to model disagreements and how they relate to each other, as shown in the earlier notional outline for mapping key hypotheses. As a concrete example, whether HLMI will be agentive, itself a debate, influences whether it is plausible that the HLMI will attempt to self-modify or design successors. The feasibility of either modification or successor design is another debate, and this partly determines the potential for very fast takeoff, influencing the probability of a catastrophic outcome. As the example illustrates, the values and the connections between the nodes are all therefore subject to potential disagreement, which must be represented in order to model the risk. Further and more detailed examples are provided in upcoming posts.

Further Posts and Feedback

The further posts in this sequence will cover the internals of these modules, which are only outlined at a very high level here. This is intended to be a sequence that will be posted over the coming weeks, starting with the post on Analogies and General Priors on Intelligence later this week, followed by Paths to HLMI.

If you think any of this is potentially useful, or if you already disagree with some of our claims, we are very interested in feedback and disagreements and hope to have a productive discussion in the comments. We are especially interested in places where the model does not capture your views or fails to include an uncertainty that you think could be an important crux. Similarly, if the explanation seems confused or confusing, flagging this is useful - both to help us clarify, and to ensure it doesn’t reflect an actual disagreement. It may also be useful to flag things that you think are not cruxes, or are obvious, since others may disagree.

Also, if this seems interesting or related to any other work you are doing to map or predict the risks, please be in touch - we would be happy to have more people to consult with or who wish to participate directly.

Footnotes

(1) Note that HLMI is viewed as a precursor to, and a likely cause of, transformative AI. For this reason, in the model, we discuss HLMI, which is defined more precisely in later posts.

Acknowledgements

The MTAIR project (formerly titled, “AI Forecasting: What Could Possibly Go Wrong?”) was originally funded through the Johns Hopkins University Applied Physics Laboratory (APL), with team members outside of APL working as volunteers. While APL funding was only for one year, the non-APL members of the team have continued work on the project, with additional support from the EA Long-Term Future Fund (except for Daniel Eth, whose funding comes from FHI). Aryeh Englander has also continued working with the project under a grant from the Johns Hopkins Institute for Assured Autonomy (IAA).

The project is led by Daniel Eth (FHI), David Manheim, and Aryeh Englander (APL). The original APL team included Aryeh Englander, Randy Saunders, Joe Bernstein, Lauren Ice, Sam Barham, Julie Marble, and Seth Weiner. Non-APL team members include Daniel Eth (FHI), David Manheim, Ben Cottier, Sammy Martin, Jérémy Perret, Issa Rice, Ross Gruetzemacher (Wichita State University), Alexis Carlier (FHI), and Jaime Sevilla.

We would like to thank a number of people who have graciously provided feedback and discussion on the project. These include (apologies to anybody who may have accidentally been left off this list): Ashley Llorens (formerly APL, currently at Microsoft), I-Jeng Wang (APL), Jim Scouras (APL), Helen Toner, Rohin Shah, Ben Garfinkel, Daniel Kokotajlo, and Danny Hernandez, as well as several others who prefer not to be mentioned. We are also indebted to several people who have provided feedback on this series of posts, including Rohin Shah, Neel Nanda, Adam Shimi, Edo Arad, and Ozzie Gooen.

Maybe Antivirals aren’t a Useful Priority for Pandemics?

davidmanheim — 2021-06-20T10:04:08.425Z

PLEASE KEEP COMMENTS GENERALLY ON THE TOPIC OF ANTIVIRALS.

Epistemic status: Building a better gears-level understanding of why antivirals don’t work very well, explaining why portfolio construction for technology isn’t the same as investing in markets, then speculating on implications.
Note: Crossposted to the EA Forum

The public conversation around COVID-19 response, especially pre-vaccine, prominently featured the idea that there are treatments which we just need to find. The claim is that if we found good, broad-spectrum antivirals, we could treat COVID.

But there is no guarantee that the thing we’re looking for exists. We might be searching up and down the street, under the lamps and elsewhere, for keys that are figments of our collective hopes and imaginations. There are words that describe things that do not exist. Like unicorns. Or antigravity. Fortunately, antivirals definitely exist. They just aren’t what I assumed when I was looking into the issue. And investing in them seems like a less promising avenue than I assumed.

NOTE: I am not arguing against any investment in antivirals. I’m discussing the relative promise and synergies or anti-synergies of the approaches for fighting a pandemic.

Antibiotics versus Antivirals

Antibiotics, i.e. antibacterial drugs, are definitely a thing. Actually, around a dozen things. They prevent basic things that bacteria need to do, such as forming cell walls, using a specific protein synthesis pathway, or unwinding their DNA to replicate. Thankfully, some of the things bacteria need to do are unique to bacteria - their cell walls are different from animal cell walls, so interrupting formation of bacterial cell walls doesn’t kill human cells. Other things are the same. Protein synthesis and DNA unwinding are critical for human cells, but they happen inside of the cell, so we can use drugs that human cells keep out, but bacterial cells don’t. Those drugs are a bit more toxic, but if you need to kill off bacteria, sometimes it’s worth it.

Viruses are different. They use our cells to replicate, so they don’t do many things which human cells don’t. There just aren’t as many targets - and interfering with the ones that exist is more likely to hurt the human hosts.

We want to find a useful antiviral, but we have good reasons to think that safe ones might not exist.

To be clear, we know of drugs that are effective at fighting viruses. Idoxuridine was the first antiviral, in the 1960s, and it is effective in fighting herpes. And by fighting, we mean slowing replication. It doesn’t actually eliminate herpes - nor do the other newer antivirals used for herpes and related viruses. Humanity had great success in finding cures for HIV. And by cures, we mean semi-toxic combinations of drugs that when taken indefinitely, slow viral replication enough that the hosts can live indefinitely and, due to very low viral load, not spread the virus. The drugs take months to work, but they are effective enough.

Other, newer antivirals fight things like influenza. Maybe. But not well. So where are we pinning our hopes, and why are we pursuing antivirals(1)?

Value of Information versus Portfolio Construction

I want to be incredibly clear; looking for antivirals is worth the money spent.

We spend a few tens of billions of dollars per year looking for them, we learn more about viral biology and immunology, and we can treat HIV and herpes better. We might even find new treatments for other diseases. Value of Information here is really hard to compute, but it seems pretty high. At the very least, we have no fundamental reason to think we won’t find something that works.

But in constructing portfolios for investment, we aren’t just looking for positive returns, we are looking for a coherent plan, hopefully with synergies and risk mitigation. Unlike financial investments in the market, there are places where investing in one technology accelerates our returns in other places, or unnecessarily duplicates effort and wastes money, or actually makes success impossible.

If we’re in the stock market and banking stocks are highly correlated, splitting our money among them is typically a bit better than investing all of it in any one, because we’re diversifying, with little or no cost in terms of returns. If we want to eliminate malaria and spend half our money on bed-nets and half on gene-drive mosquitoes, we mitigate risks of either approach, but they aren’t complementary or even parallel. Instead, there is likely to be wasted effort. If we’re SpaceX and invest in reusable spacecraft and also batteries for ion drives, we’re probably wasting the money on batteries. They aren’t compatible with the approach we’ve picked. And if we’re building a PC and invest half our money on an awesome graphics card, and the other half on a huge SSD, we end up with no CPU, and we’ve wasted all of our money by failing to get everything we need.
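The stock market case can be made concrete with a small sketch. It assumes a stylized setup that is not in the argument above: every stock has the same volatility, every pair has the same correlation, and the money is split equally. The numbers (30% volatility, correlation 0.8, five stocks) are purely illustrative.

```python
import math

def equal_weight_volatility(n_assets: int, sigma: float, rho: float) -> float:
    """Volatility of an equal-weighted portfolio of n assets, assuming each
    asset has standard deviation `sigma` and every pair has correlation `rho`."""
    variance = sigma**2 * (1.0 / n_assets + (1.0 - 1.0 / n_assets) * rho)
    return math.sqrt(variance)

# Illustrative numbers only: five highly correlated banking stocks.
single = equal_weight_volatility(1, sigma=0.30, rho=0.8)
split = equal_weight_volatility(5, sigma=0.30, rho=0.8)
print(f"one stock: {single:.3f}, five stocks: {split:.3f}")
```

With a correlation this high the risk reduction is real but small, which is the "typically a bit better" in the paragraph above. The other three examples do not fit this formula at all, because their payoffs interact rather than simply add.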

The above examples are talking about very different strategies, in different domains, with different failure modes. Which one seems to describe investing in antivirals alongside other parts of pandemic response?

Applying Portfolio Theory to Antivirals

I’m not sure exactly how this applies, but the new 100-day plans for response seem to be either split-the-money and hope one approach works, or pick-incompatible-approaches and waste lots of money. Why?

If effective vaccines are available in 100 days, and we can scale up manufacturing, the game is over, we win. Treatment of the cases that happen later is either useful as a mitigation measure to slightly reduce impact, or a backup plan in case we don’t manage to make vaccines.

Perhaps these are independent resources, and we can spend money and research effort on antivirals without reducing the investments in vaccines? That seems implausible. Budgets are limited, and vaccine manufacturing is expensive. An extra couple million dollars might only expand production of vaccines by 5%, but if antivirals and vaccines show up at the same time, per the 100-day goal, that’s probably a better investment than antivirals.

Conclusion

Please correct me if I’m wrong on any of this. Otherwise, I’m interested in figuring out what we should do differently in the future for pandemic preparedness on the basis of this partial/tentative analysis.

An aside: The investment in antivirals isn’t confusing as an outcome. There are reasons that vested interests push for the approaches they can make money with, and reasons for both clinicians and regulators to push for better treatments, even if they are only marginal improvements or are unlikely to succeed. There is no guarantee that markets pursue socially optimal policies - quite the opposite. So this is an expected failure mode.

A Cruciverbalist’s Introduction to Bayesian reasoning

davidmanheim — 2021-04-04T08:50:07.729Z

Status: Hopefully a nice introduction to some of the basics of Bayesian reasoning for newcomers.

Clue: Mathematical methods inspired by an eighteenth century minister (8)

“Bayesian” is a word that has gained a lot of attention recently, though my experience tells me most people aren’t exactly sure what it means. I’m fairly confident that there are many more crossword-puzzle enthusiasts than Bayesian statisticians - but I would also note that the overlap is larger than most would imagine. In fact, anyone who has ever worked on a crossword puzzle has employed Bayesian reasoning. They just aren’t (yet) aware of it. So I’m going to explain both how intuitive Bayesian thinking is, and why it’s useful, even outside of crosswords and statistics.

But first, who was Bayes, what is his “law” about, and what does that mean?

Clue: Sound of a Conditional Reverend’s Dog (5)

“Bayes,” of statistical fame, is the Reverend Thomas Bayes. He was a theologian and mathematician, and the two works he published during his lifetime dealt with the theological problem of happiness, and a defense of Newton’s calculus - neither of which concern us. His single posthumous work, however, was what made him a famous statistician. The original title, “A Method of Calculating the Exact Probability of All Conclusions founded on Induction,” clearly indicates that it’s meant to be a very inclusive, widely applicable theorem. It was also, supposedly, a response to a theological challenge posed by Hume, who claimed that miracles didn’t happen.

Clue: Wonders at distance travelled without vehicle upset (8)

“Miracles”, Hume’s probabilistic argument said, are improbable, but incorrect reports are likely— so, the argument goes, it is more likely that the reports are incorrect than that the miracle occurred. This way of comparing probabilities isn’t quite right, statistically, as we will suggest later. But Bayes didn’t address this directly at all.

Clue: Taking a risk bringing showy jewelry to school (8)

“Gambling” was a hot topic in 18th century mathematics, and Bayes tried to answer an interesting question; when you see something happen several times, how can you figure out, in general, the probability of it occurring? His example was about throwing balls onto a table - you aren’t looking, and a friend throws the first ball. After this, he throws more, each time telling you whether the ball landed to the left or right of the first ball. After doing this a few times, you still haven’t seen the table, but want to know how likely it is that the next ball will land to the left of that original ball.

To answer this, he pointed out that you get a bit more information about the answer every time a ball is thrown. After the first ball, for all you know the odds are 50/50 that the next one will be on either side. After a few balls are thrown, you get a better and better sense of what the answer is. After you hear the next five balls all land to the left, you’ve become convinced that the next ball landing to the left is more likely than landing to the right. That’s because the probabilities are not independent - each answer gives you a little bit more information about the odds.
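For readers who do want a number for the table example, here is a minimal sketch. It assumes the first ball is equally likely to land anywhere along the table, which gives the standard result (often called Laplace's rule of succession) that after hearing `left` balls to the left and `right` balls to the right, the chance the next one lands left is (left + 1) / (left + right + 2). The function name is just for illustration.

```python
from fractions import Fraction

def prob_next_left(left: int, right: int) -> Fraction:
    """Probability that the next ball lands left of the first ball, assuming
    the first ball's position is uniformly distributed along the table."""
    return Fraction(left + 1, left + right + 2)

print(prob_next_left(0, 0))  # 1/2, before hearing any reports
print(prob_next_left(5, 0))  # 6/7, after five balls in a row land left
```

Each report moves the estimate a little, and no finite number of reports ever pushes it all the way to 0 or 1.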

But enough math — I’m ready to look at a crossword.

Could wine be drunk by new arrival? (6)

“Newbie” is how I’d prefer to put my ability with crossword puzzles. But as soon as I started, I noticed a clear connection. The methods of reasoning I practice and endorse as a decision theorist are nearly identical to the methods that are used by people in this everyday amusement. So I’ll get started on filling in (only one part of) the crossword I did yesterday, and we’ll see how my Bayesian reasoning works. I start by filling in a few easy answers, and I’m pretty confident in all of these. 6 Down - Taxing mo. for many, 31 Across - Data unit, 44 Across - “Scream” actress Campbell.

The way I’ve filled these in so far is simple — I picked answers I thought were very likely to be correct. But how can I know that they are correct? Maybe I’m fooling myself. The answer is that I’ve done a couple crosswords before, and I’ve found that I’m usually right when I’m confident, and these answers seem really obvious. But can I apply probabilistic reasoning here?

Clue: Distance into which vehicle reverses ___ that’s a wonder (7)

“Miracles,” or anything else, according to Reverend Bayes, should follow the same law as thrown balls. If someone is confident, that is evidence, of a sort. Stephen Stigler, a historian of math, argues that Bayes was implying an important caveat to Hume’s claim - the probability of hearing about a miracle increases each time you hear another report of it. That is, these two facts are, in a technical sense, not independent - and the more independent accounts you hear, the more convinced you should be.

But that certainly doesn’t mean that every time a bunch of people claim something outlandish, it’s true. And in modern Bayesian terms, this is where your prior belief matters. If someone you don’t know well at work tells you that they golfed seven under par on Sunday, you have every reason to be skeptical. If they tell you they golfed seven over par, you’re a bit less likely to be skeptical. How skeptical, in each case?

We can roughly assess your degree of belief— if a friend of yours attested to the second story, you’d likely be convinced, but it would take several people independently verifying the story for you to have a similar level of belief in the first. That’s because you’re more skeptical in the first place. We could try to quantify this, and introduce Bayes’ law formally, but there’s no need to bring algebra into this essay. Instead, I want to think a bit more informally — because I can assess something as more or less likely without knowing the answer, without doing any math, and without assigning it a number.

When you hear something outlandish, your prior belief is that it is unlikely. Evidence, however, can shift that belief - and enough evidence, even circumstantial or tentative, might convince you that the claim is plausible, probable, or even very likely. And in a way it doesn’t matter what your prior is, if you can accumulate enough different pieces of trustworthy evidence. And that leads us to how I can use the answers I filled in as evidence to help me make further plausible guesses.
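None of this needs numbers, but for anyone curious, the golf example can be sketched with the odds form of Bayes' law. Everything numeric here is a made-up assumption for illustration: a 1% prior for "seven under par", a 50% prior for "seven over par", and each independent report being five times more likely if the story is true than if it is false.

```python
def posterior_probability(prior: float, likelihood_ratio: float, n_reports: int) -> float:
    """Update a prior probability on n conditionally independent reports,
    each multiplying the odds by the same (assumed) likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_reports
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical numbers, chosen only to mirror the story above.
print(posterior_probability(0.50, 5.0, 1))  # seven over par, one friend vouches
print(posterior_probability(0.01, 5.0, 1))  # seven under par, one friend vouches
print(posterior_probability(0.01, 5.0, 4))  # seven under par, four independent reports
```

Under these assumptions one report is enough for the plausible story, while the outlandish one needs about four to reach a similar level of belief. The prior sets how much evidence you need, not whether evidence can ever be enough.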

I look at some of the clues I didn’t immediately figure out. I wasn’t sure what 6 Across - Completely blows away, would be; there are lots of 4-letter words that might fit the clue. Once I get the A, however, I’m fairly confident in my guess, conditional on this (fairly certain) new information. I look at 31 Down - Military Commission (6), but I can’t think of any that start with a B. I see 54 Across - Place for a race horse, and I’m unsure - there are a few words that fit - it could be “first”, “third”, “fifth,” “sixth” or “ninth”, and I have no reason to think any one more likely than another. So I look for more information, and notice 54 Down - It might grow to be a mushroom (5, offscreen). “Spore” seems likely, and I can see that this means “Sixth” works - so I fill in both.

At this point, I can start filling in a lot more of the puzzle, and the pieces are falling into place - each word I figure out that fits is a bit more evidence that the others are correct, making me confident, but there are a few areas where I seem stuck.

Being stuck is evidence of a different sort - it probably means at least one of two things - either I have something incorrect, or I’m really bad at figuring out crosswords. Or, of course, both. (And as TiffanyAching points out, this mindset is nicely captured by the phrase "thinking in pencil" - which I should have done more literally, but overwriting the ink illustrates when I changed my mind, so I'm leaving it in.)

At this point I start revisiting some of my earlier answers, ones I was pretty confident about until I got stuck. I’m still pretty confident in 39 Down - Was at one time, but ___ now. “Isn’t” is too obvious of an answer to be wrong, I think. On the other hand, 38 Down - A miscellany or collection, has me stumped, but two Is in a row also seem strange. 37 Down - Small, fruity candy, is also frustrating me; I’m not such an expert in candy, but I’m also not coming up with anything plausible. So I look at 50 Across - A tiny part of this?, again, and re-affirm that “Bit” seems like it’s a good fit. I’m now looking for something that can give me more information, so I rack my brains, and 36 Across - Ho Chi Minh’s capital, comes to me: Hanoi. I’m happy that 39 Down is confirmed, but getting nervous about the rest.

I decided to wait, and look elsewhere, filling in a bit more where I could. My progress elsewhere is starting to help me out.

Now, I need to re-evaluate some earlier decisions and update my beliefs again. It has become a bit more complex than evaluating single answers — I need to consider the joint probability of several different things at once. I’ll unpack how this relates to Bayesian reasoning afterwards, but first, I think I made a mistake.

I was marginally confident in 50 Across — A tiny part of this? as “bit”, but now I have new evidence. I’m pretty sure Nerb isn’t a type of candy, but “Nerd” seems to fit. I’m not sure if they are fruity, so I’m not confident, and I’m still completely at a loss on 38 Down — A miscellany or collection. That means I need to come up with an alternative for 50 Across; “Dot” seems like an unlikely option, but it fits really well. And then it occurs to me; A dot is a little bit of the question mark. That’s an annoying answer, but it seems a lot more likely than that “Nerb” is a type of candy. And I’m not sure what Olio is, but there’s really nothing else that I can imagine fitting. And there are plenty of words I don’t know. (As I found out later, this is one of them.)

At first, I had a high confidence that “Bit” was the best answer for 50 Across — I had a fairly strong prior belief, but I wasn’t certain. As evidence mounted, I started to re-assess. Weak evidence, like the strange two Is in a row, made me start to question the assumption that I was right. More weak evidence — remembering that there is a candy of some sort called Nerds, and realizing that “Dot” was a potential answer, made me revise my opinion. I wasn’t strongly convinced that I had everything right, but I revised my belief. And that’s exactly the way a Bayesian approach should work; you’re trying to figure out which possibility is worth betting on.

That’s because all of probability theory started with a simple question that a certain gambler asked Blaise Pascal; how do we split the pot when a game gets interrupted. And historians who don’t think Bayes was trying to formulate a theological rebuttal to Hume suggest that he’s really responding to a question posed by de Moivre - from whose book he may have learned probability theory, which we need to mention in order to figure out why I’d pick “Dot” over “Bit” - even though I think it’s a stupid answer. But before I get there, I’ve made a bit more progress - I’m finished, except for one little thing.

31 Down — Military Commission. That’s definitely a problem — I’m absolutely sure Brevei isn’t the right answer, and 49 Down, offscreen, is giving me trouble too. The problem is, I listed all the possible answers for 54 Across — Place for a race horse, and the only one that started with an “S” was sixth.

Clue: Conviction … or what’s almost required for a conviction (9)

“Certainty” can be dangerous, because if something is certain, almost by definition, it means nothing can convince me otherwise. It’s easy to be overconfident, but as a Bayesian, it’s dangerous to be so confident that I don’t consider other possibilities — because I can’t update my beliefs! That’s why Bayesians, in general, are skeptical of certainty. If I’m certain that my kid is smart and doing well in school, no number of bad grades or notes from the teacher can convince me to get them a tutor. In the same way, if I’m certain that I know how to get where I’m going, no amount of confused turns, circling, or patient wifely requests will convince me to ask for directions. And if I’m certain that “Place for a race horse” is limited to a numeric answer, no number of meaningless words like “Brevei” can change my mind.

Clue: High payout wagers (9)

“Perfectas” are bets placed on a horse race, predicting the winner and second place finisher, together. If you get them right, the payoff can be really significant - much more than bets on horses to win or to place. In fact, there are lots of weird betting terms in horse racing, and by excluding them from consideration, I may have been hasty in filling out “sixth.” My assumption of having compiled an exhaustive list of terms was premature. Instead, I need to reconsider once again - and that brings us to why, in a probabilistic sense, crosswords are hard.

Disreputable place for a smoke? (5)

“Joint” probabilities are those that relate to multiple variables. And when solving the crossword, I’m not just looking to answer each clue, I’m looking to fill in the puzzle - it needs to solve all of the clues together. Just as figuring out a Perfecta is harder than picking the right horse, putting multiple uncertain questions together is harder than answering each one alone, and that is where joint probabilities show up. But it’s not hopeless; as you figure out more of the puzzle, you reduce the remaining uncertainty. It’s like getting to place a Perfecta bet after seeing 90% of the race; you have some pretty good ideas about what can and can’t happen.

Similarly, Bayesians, in general, collect evidence to constrain what they think is and isn’t probable. Once enough balls have been thrown to the left of that first one, you get pretty sure the odds aren’t 50–50. The prerequisite for getting the right answer, however, is being willing to reconsider your beliefs — because reality doesn’t care what you believe.

And the reality is that 31 Down is Brevet, so I need an answer to 54 Across — Place for a race horse that starts “St”. And that’s when it hit me — sometimes, I need to simply be less certain I know what’s going on. The race horse isn’t running, and there are no bets. It’s in a stall, waiting patiently for me to realize I was confused.
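What happened with 54 Across can be sketched as a filter over candidate answers. The candidate list is the one I actually considered, the letter patterns come from the crossing answers ("Spore" gives the S, "Brevet" gives the T), and the helper name and the use of "." for an unknown square are just illustrative choices.

```python
import re

# The answers I had (wrongly) treated as an exhaustive list.
NUMERIC_PLACES = ["first", "third", "fifth", "sixth", "ninth"]

def consistent(candidates, pattern):
    """Keep the candidates that match the crossing letters; '.' is an unknown square."""
    regex = re.compile(pattern)
    return [word for word in candidates if regex.fullmatch(word)]

print(consistent(NUMERIC_PLACES, "s...."))              # ['sixth']
print(consistent(NUMERIC_PLACES, "st..."))              # [] - nothing fits
print(consistent(NUMERIC_PLACES + ["stall"], "st..."))  # ['stall']
```

An empty result is the formal version of being confused: the evidence isn't wrong, the list of hypotheses was too short. Had I put all of my probability on that list, no crossing letter could ever have rescued me.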

A Final Note

I’d note three key lessons that Bayesians can learn from crosswords, since I’ve already spent pages explaining how Crossworders already understand Bayesian thinking. And they are lessons for life, ones that I’d hope crossword enthusiasts can apply more generally as well.

The process of explicitly thinking about what you are uncertain of, and noticing when something is off, or you are confused, is useful to apply even (especially!) when you’re not doing crossword puzzles.
Evaluating how sure you are, and wondering if you are overconfident in your model or assumptions, would have come in handy to those predicting the 2016 election.
Being willing to actually change your mind when presented with evidence is hard, but I hope you’d rather have a messy crossword than an incorrectly solved one.

A Postscript for Pedants

Scrupulously within the rules, but not totally restrictive

“Strict” Bayesians are probably annoyed about some of this - at no point in the process did I get any new evidence. No one told me about any new balls thrown, I only revised my belief based on thinking. A “Real Bayesian” starts with all the evidence already available, and only updates when new evidence comes in. For a non-technical response, it’s sufficient to note that computation and thought take time, and although the brain roughly approximates Bayesian reasoning, the process of updating is iterative. And for a technical version of the same argument, I’ll let someone else explain that there are no real Bayesians. (And thanks to Noah Smith for that link!)

The crossword clues were a combination of info from http://www.wordplays.com/crossword-clues/, and my own inventions.
The crossword is an excerpt from Washington Post Express’s Daily Crossword for January 11th, 2017, available in full on Page 20, here: https://issuu.com/expressnightout/docs/express_01112017

Note: This was formerly a linkpost to a since-made-unavailable blog post on medium, which I no longer use or endorse - when I went to visit the link, I realized it was broken. I also had moved the external blog post to here. It is now moved to LW completely - hence the earlier comments.

Systematizing Epistemics: Principles for Resolving Forecasts

davidmanheim — 2021-03-29T20:46:06.923Z

In a previous post, I discussed many methods for resolving predictions. I want to argue that there is a systematic distinction between rules and principles which I think is valuable.

In short, when making rules, one can front-load intentions by writing details upfront, or back-load work by stating high-level principles and having procedures to decide on details on an as-needed basis*. American accounting systems rely on the former, and international accounting systems (and most legal systems) focus more on the latter. I think that the question shouldn’t be implicitly decided by front-loading assumptions, which is often the current default. More than that, I think the balance should be better, and should be addressed more explicitly.

Reframing the Problem

Ozzie Gooen's new organization, QURI (pronounced the same as "query"), is interested in what he's started to call "systematizing epistemics," and he offered an analogy that I found very insightful - accounting. Just like keeping track of money is possible without accounting, keeping track of reality is possible without any systematic approach to epistemics - but it's harder to communicate or agree about money without standardized accounting systems that talk about the same things the same way.

In the aforementioned post, I discussed a variety of ways to resolve predictions. Here, I want to present a more systematic argument about how to think about prediction systems and resolutions. To make this point, I plan to take a detour into accounting - but don’t worry, the post really is about predictions. I want to lay out the analogy between systematizing epistemics and systematizing accounting (even) more in a different post, but for now I'll jump to the key point for writing prediction questions and resolving predictions.

Accounting Principles versus Accounting Rules

In financial accounting, which is only half as boring as it sounds, there is a conceptual disagreement between rule-based and principle-based methods.

A rule based accounting system has (millions of) rules that get updated and adapted to deal with all of the new ways that clever accountants devise to lie with accounting. That is, a rule based system tries to cover every eventuality. This creates complexity, but still makes sure everything is legible to those who have the necessary expertise to decipher accounting statements. On the other hand, every time someone thinks of a clever new interpretation or hack, it is exploited until new rules, which corporations will lobby against, are developed. Worse, even if there are no loopholes, every time a new law is passed or new financial instrument is created, new loopholes suddenly appear.

A principles based system has essentially the same goals, but instead of trying to account for every scenario and clever trick, it has perhaps a dozen guidelines for what accounting is supposed to do. These are things like consistency, full disclosure, good faith and honesty, dividing entries across appropriate periods of time, and accurate representation of a company's financial position. The extra flexibility probably makes it harder to stop slight deviations from the best way to do things, and makes comparing financial statements a bit harder, but it also is far easier to do correctly, without a million rules you might have accidentally broken, and also makes it easier to deal with companies finding new and clever ways to cheat. Rules, by contrast, need frequent updates.

In practice, all systems are a combination of the two, but the ideal of each system is different. In a fully principles based system, accountants have far more flexibility, but they will get in trouble if they aren't doing what they are supposed to do, with the boundaries somewhat vague. If they switch from FIFO to LIFO accounting one year to make the profits look better, they clearly violated the principles. Same thing if they end this financial year on December 18th, so they don't need to include the big loss they took on December 23rd. Those things aren't allowed in a rule-based system either, but only because the rules explicitly list things like you need to use LIFO, and all financial years must end on a specific date. The costs of compliance for the rule based system, and the complexity of interpreting most financial statements, are probably higher. On the other hand, the risks of ending up in a gray area are likely higher in a principles based system.

Where do we use principles, and where do we use rules?

There seem to be two reasons for rule-based systems; trust, and predictability. Trust, because rules are useful when we can’t or don’t want to trust the people making decisions. Predictability, because we can’t translate principles into certainty about outcome, or computer code. And trust, with the attendant flexibility, must exist somewhere in any system. The question is whether it is all front-loaded.

Trust

Who do we trust? If your financial system trusts accountants to be honest, you can give them general guidelines and set them loose to accurately reflect the real financial situation at a company. That allows flexibility to exist towards the end of the process. The fact that companies exert pressure on accountants means that there are pressures to cheat to raise stock prices or to pay less tax. Giving the accountants more freedom to make decisions, when we can’t trust them, is going to be a bad strategy. So the trust is pushed to the earliest stages of systematizing accounting, in the rule development stages. It can still be subverted - accountants developing standards also have conflicts of interest - but it makes failures systemic rather than individual (which has its own downsides).

Somewhat similarly, in a detour I won’t expand upon in detail, we can look at legal systems. Because the legal system trusts judges to some extent, it can give them more latitude. To the extent Congress trusts judges, it can leave laws ambiguous. And to the extent that legislators are incompetent at writing laws, the laws end up ambiguous anyway. But intentional or inadvertent, this is back-loading the flexibility. The laws can be unclear, which makes people uncertain what is allowed, and it is up to judges to clarify them post-hoc.

Legislators have a different option for front-loading flexibility. That is, to the extent that they trust regulators, they can pass along responsibility for creating detailed rules.

Finally, to the extent that the rulers of a society trust the public, they can just articulate what they think would be nice, and let the public decide. Social norms often operate this way - they change and are not spelled out, and people need to learn them implicitly. And as should be clear for both norms and laws, ambiguity doesn’t work when the group is large and heterogeneous - predictability is limited when you don’t know, or trust, the other people in the group. This leads us to the next point.

Predictability

Beyond questions of trust, we have a question of predictability. If your financial system is principle-based, accounting software is tricky. Not every firm needs to do things the same way, and there will be an unlimited number of customizations needed to manage systems. Even worse is trying to automate any type of review or fraud detection.

Similarly, it is harder to make policing decisions without clear rules. Speed limits are clear rules, banning “unsafe” driving is not. Similarly, speed cameras are easy because a camera can check a single number. Maximum BAC is a clear rule, “impaired” driving is not. You can guess which of these are more often used by police who don’t want to be called on to defend their subjective judgements in court.

But a tradeoff mentioned earlier applies here in spades - explicit rules are fragile, and if they are supposed to conform to the intent, need to be updated more often. And frequent updates push against predictability, since the predictions need to account for the fact that the rules can change. And in fact, it can be worse - they can give an illusion of predictability.

Principles versus Rules for Predictions

There are (at least) two parties involved in predictions; the predictors, and the readers. Predictors usually want clear rules and no ambiguity. The readers of the prediction - including the writers, the sponsors of prediction questions, the general public, and in the case of a futarchy, the system being controlled - want fidelity with intent, not strict adherence to the letter of the law.

There are often places where the spirit and the letter conflict. When that happens, the clash is unfortunate and unintended. For example, a question may intend to forecast the number of cases of COVID-19 which occurred in the first half of 2020, but end up forecasting the speed of creating and deploying tests. (Or the insanity of the FDA in stopping people from doing so, as the case may be.)

The death of the author approach to forecasts is great for predictors. In that scenario, we have a presumption that the spirit of a question is irrelevant once it’s written down. But for prediction markets to be useful, there should be a balance between principles and rules.

But as happened in accounting in the 1800s, most of the effort for forecasting resolution so far has gone into making rules that work, with the principles being implicit. That's fine, but better understanding of the role of principles and rules would be valuable.

Predictions Cannot Live by Principles Alone

The past, of course, was akin to a purely principle based system, where we trust informal resolutions and evaluations. Pundits might predict something like "there will be increased Chinese aggression this year," and grade themselves highly, but they do so no matter what occurs. Prediction markets operationalize this into a rule-based resolution; "there will be a fatality in the South China Sea before the end of the year in a confrontation between different countries," and resolving that is straightforward, relying on nothing but an object level event. Prediction markets fix the problem.

So we have a thesis, punditry, and an antithesis, predictions. I claim that we are waiting for a synthesis. In my view, that synthesis is creating a clearer principle-based approach for creating, understanding, interpreting, and applying the rules.

The question is what a set of principles that guide the rules for writing and resolving questions, and guide the interpretation in cases of ambiguity, should look like. But more clarity about what these principles look like is needed.

Forecasting Principles - Why, and Which Ones?

In forecasting, the implicit use of principles in place of rules means that interpretation is harder, and predictions are worse.

I think there is broad agreement about many of the principles, but they haven’t been formulated. For example, when writing a prediction question, we care about minimizing ambiguity, having a concrete outcome, relating the question to the actual uncertainty or outcome, consistency with other predictions, and so on. When resolving a question, we care about things like fidelity to the intent and the language of the prediction.

Below, I want to lay out both some of the principles, and the best practices and implications for how they apply.

Some Plausible Principles for Forecasting

1. Predictions should be resolved.
   1. This requires that they be resolvable.
   2. Both the prediction period and the resolution time should be specified.
   3. The resolution method should be known.
2. Predictions should be clear.
   1. Predictors should be able to represent their actual beliefs.
   2. Predictions should be concrete when possible, rather than verbal.
3. Scoring should be clear.
   1. As simple as practical.
   2. Known to forecasters.
   3. Incentive-compatible.
4. Questions should attempt to be useful.
   1. Parallel other similar questions.
   2. Match language and criteria from other sources.
   3. Have standard formats where possible.

Further Thoughts on Applying the Principles

Below, I have expanded on the principles and written commentary. I will be happy to update this section with comments from readers, which will (by default) be attributed to your username.

Predictions are not punditry, and without resolution the incentives for accuracy, and the feedback needed to improve, are hampered. Most of the criteria here are technical, but there are trade-offs between resolution and other valuable principles.

Predictions should be resolved.

This requires that they be resolvable.
- The resolution criteria should be well-specified.
- If relevant or possible, the intent of the question, or guidance for how resolution will occur, should be clear.
  - Ambiguity should be avoided, but by default, when (inevitable) ambiguity arises, intent should guide the resolution. Any guidance about the motive or intent of the question can therefore be an asset. This is especially true when resolutions are based on expert opinion.
- By default, forecasts should be assumed to be about object-level issues.

In cases where ambiguity arises, tortuously interpreting the text in unintended ways is unhelpful. (Example; it says 'reported/estimated' not 'estimated/reported', so we can infer that if a reported number is available, that should be used instead of an estimate.)

And there are sometimes questions where the technical criteria are not fulfilled, but for reasons unrelated to the intent of the question. (Source X is discontinued and lists an old value, but recommends using alternate source Y in the future - but the resolution says it will use source X.) In such cases, the goal of predicting object level reality, rather than predicting meta-level reporting, should be the dominant concern. This is both useful for question designers, and (per Principle 4,) helps ensure that those using the forecast of the question are getting useful information.

Both the prediction period and the resolution time should be specified.
- These times may not be the same.
- It can be better to leave questions open past when the resolution is known.

Knowing when a question will close is valuable for planning, and for understanding the scoring. Especially in situations where the final prediction is counted for a significant portion of the overall score, if a question closes seconds after an event occurs, there is a windfall gain for anyone who happens to be furiously checking the news and updating just at the right moment. (Side note: there are problems with closing predictions early.)

The resolution method should be known.
- A system or individual should be chosen as the final decision-maker beforehand.
  - Planning beforehand reduces the burden of resolving questions. There is also a practical issue with timely resolution and the burden on the market.
- Automated resolution and scoring is usually better than manual or subjective decisions.

Making choices or automating resolutions in advance can be easier than needing to revisit the questions. And unambiguous, automated criteria can minimize dispute - but unambiguous does not always match the object level issue to be tracked, so a tradeoff exists. "The best published estimate" is fairly unambiguous, but may require arbitration. Despite that, it may be a better and more robust criterion than "the number published by source X" - sometimes, the source is discontinued, or better options for the resolution are discovered or created.

Predictions should be clear

Predictors should be able to represent their actual beliefs
- Specific values are better than choosing ranges, and specifying prediction intervals and probabilities is better than binary triggers.
- Fidelity to the true claimed distributions is valuable, rather than, for example, using pre-specified distributions
  - This desideratum is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring.
Predictions should be concrete when possible, rather than verbal.
- “Higher” is less clear than “above the current value as of [date], which is [Value].”
- Punditry is both less resolvable, and less valuable, than clear predictions.

Scoring should be clear

As simple as practical
- As noted above, there is a tradeoff between allowing more expressive predictions and simple scoring rules.
Known to forecasters
- Even when resolution criteria are allowed to be ambiguous, scoring should not be.
Incentive-compatible
- Gaming of the system should be minimized.
  - Forecasters should not need to spend time gaming the system to have correct incentives. (Side note: Incentive compatibility is tricky when people are not looking only to maximize their score, or when there are non-trivial costs to predicting.)
  - Also, between being known, and reducing gaming, there are some critical issues with actually building incentive-compatible systems, since incentives differ among forecasters in ways that may make incentive-compatibility incompatible with uniform scoring.
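
To make "incentive-compatible" concrete, here is a minimal sketch using the Brier score, one standard example of a proper scoring rule. The choice of rule and the function names are assumptions for illustration, not something any platform mentioned here is confirmed to use. Under this rule, a forecaster who believes an event is 70% likely minimizes their expected penalty by reporting 70%, not by shading up or down.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome (lower is better)."""
    return (forecast - outcome) ** 2


def expected_brier(reported: float, belief: float) -> float:
    """Expected Brier score of reporting `reported` when the true belief is `belief`."""
    return belief * brier_score(reported, 1) + (1 - belief) * brier_score(reported, 0)


def best_report(belief: float, steps: int = 100) -> float:
    """Grid-search the report that minimizes expected score under the forecaster's belief."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda r: expected_brier(r, belief))


# Honest reporting beats exaggeration in expectation.
honest = expected_brier(0.7, belief=0.7)
exaggerated = expected_brier(0.9, belief=0.7)
```

This only covers the simple case where the forecaster cares about nothing but the score; as noted above, incentive compatibility gets harder once other motives or costs enter.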

Questions should attempt to be useful.

Parallel other similar questions.
- There is a tension between precision and uniformity.
  - For purposes of this principle, in a US presidential election, “who will win the presidential election” is more uniform with similar questions than “who will hold the office of President on Jan 21st.” Similarly, "which candidate will win states with the largest number of electoral college votes" is potentially difficult to compare with "Which candidate will win the electoral college" - for example, if there is a brokered convention.
  - If the tradeoff is significant, it may be better to have a question explicitly about the difference.
Match language and criteria from other sources
- Using identical language and criteria makes both aggregation and resolution easier.
  - As with survey questions, there is a significant amount of variation about how a question is asked that can have important implications for comparing and aggregating predictions. Consistency across platforms and between questions is valuable for promoting this. That means it can and should be promoted unless other overriding concerns exist.
- Clearly highlight when this is not the case.
  - Unless the differences in the edge cases are particularly relevant, a standard format and phrasing should be preferred. And as above, it may be better to have a question explicitly about the difference.
Have standard formats where possible
- Clever new input methods are harder for forecasters to understand and use
- Complex phrasing makes mistakes easier.
  - “Will not X happen” versus “Will X happen.”

These are not a final word, but I think they may be a useful basis for continued discussion. And thanks go to Nuño Sempere and Ozzie Gooen for helpful suggestions and additions.

*) I am grateful to Ozzie Gooen for helping me frame this more clearly.

Resolutions to the Challenge of Resolving Forecasts

davidmanheim — 2021-03-11T19:08:16.290Z

One of the biggest challenges in forecasting is formulating clear and resolvable questions, where the resolution reflects the intent of the question. When this doesn't happen, there is often uncertainty about the way the question will be resolved, leading to uncertainty about what to predict. I want to discuss this problem, and in this post, point out that there are a variety of methods which are useful for resolving predictions.

But first, the problem.

What is the Problem?

The OpenPhil / Good Judgement COVID-19 dashboard provides an example. The goal was to predict the number of cases and deaths due to COVID. The text of the questions was "How many X will be reported/estimated as of 31 March 2021?" and the explanation clarified that "The outcome will be determined based on reporting and estimates provided by Johns Hopkins of X..."

Early on, the question was fairly clear - it was about what would happen. As time went on, however, it became clearer that because reporting was based on very limited testing, there would be a significant gap between the reported totals, and the estimated totals. Discussions of what total to predict were then partly side-tracked by predictions of whether the reports or the predictions would be used - with a very large gap between the two.

Solving the Problem?

This is a very general problem for forecasting, and various paths towards solutions have been proposed. One key desideratum, however, is clear; whatever resolution criteria are used, they should be explicitly understood beforehand. The choice of which approach to be clear you are using, however, is still up for debate - and I’ll present various approaches in this post.

Be Inflexible

We can be inflexible with resolution criteria, and always specify exactly what number or fact will be used for the resolution, and never change that. To return to the example above, the COVID prediction could have been limited to, say, the final number displayed on the Johns Hopkins dashboard by the resolution date. If it is discontinued, it would use the final number displayed before then, and if it is modified to no longer display a single number, say, providing a range, it would use the final number before the change is introduced.

Of course, this means that even the smallest deviation from what you expected or planned for will lead to the question resolving in a way other than representing the outcome in question. Worse, the prediction is now in large part about whether something will trigger the criteria for effectively ending the question early. That means that the prediction is less ambiguous, but also less useful.

Eliminate Ambiguity

An alternative is to try to specify what happens in every case. If a range is presented, or alternative figures are available on an updated dashboard, the highest estimate or figure will be used. If the dashboard is discontinued, the people running it will be asked to provide a final number to resolve the question. If they do not reply, or do not agree on a specific value, a projection of the totals based on a linear regression using the final month of data will be used. This type of resolution requires specifying every possible eventuality, which is sometimes infeasible. It also needs to fall back on some final simple criteria to cover edge cases - and it needs to do so unambiguously.
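
As a rough sketch of what such a fallback chain might look like when written down unambiguously, the code below walks the three rules in a fixed order. All names and the input shapes are hypothetical, and the least-squares projection is just one plain reading of "a linear regression using the final month of data."

```python
from typing import Optional, Sequence


def linear_projection(days: Sequence[float], totals: Sequence[float], target_day: float) -> float:
    """Ordinary least-squares line through (day, total) points, evaluated at target_day."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(totals) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, totals))
    slope = sxy / sxx
    return mean_y + slope * (target_day - mean_x)


def resolve(
    dashboard_values: Optional[Sequence[float]],
    operator_value: Optional[float],
    last_month_days: Sequence[float],
    last_month_totals: Sequence[float],
    resolution_day: float,
) -> float:
    """Apply the fallback rules in order until one of them yields a number."""
    if dashboard_values:
        # Dashboard still live: use the highest figure or range endpoint shown.
        return max(dashboard_values)
    if operator_value is not None:
        # Dashboard discontinued: use the final number its operators agreed on.
        return operator_value
    # No agreed number: project from the final month of data.
    return linear_projection(last_month_days, last_month_totals, resolution_day)
```

Even this toy version has to decide things the prose leaves open, such as what counts as "the final month," which is exactly why specifying every eventuality is hard.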

Walk away

As another alternative, Metaculus sometimes chooses to leave a question as "ambiguous" if the data source is discontinued, or it is later discovered that for other reasons the resolution as stated doesn't work - for example, a possibility other than those listed occurs. That is undesirable because the forecasters cannot get feedback, any awards are not given, forecasters feel like they have wasted their time, and the question that the prediction was supposed to answer ends up giving no information.

Predict Ambiguity

Augur, and perhaps other prediction markets, also allow for one of the resolutions to be “ambiguous” (or, “Invalid Market”, source). For example, a question on who was the president of Venezuela in 2020 might have been resolved as “Invalid” given that both Juan Guaidó and Maduro had a claim to the position. Crucially, “ambiguous” resolutions can be traded on (and thus predicted) on Augur - this creates a better incentive than walking away, but in cases where there is a “morally correct” answer, it falls short of ideal.

Resolve with a Probability

One way to make the resolution less problematic when the outcome is ambiguous is to resolve probabilistically. In such a case, instead of a yes or no question resolving with a binary yes or no, a question can resolve with a probability, with a confidence interval, or with a distribution. This is the approach taken by Polymarket (example, for binary questions) or foretold (for continuous questions). We can imagine this as a useful solution if a baseball game is rained out. In such cases, perhaps the rules would be to pick a probability based on the resolution of past games - with the teams tied, it resolves at 50%, and with one team up by 3 runs in the 7th inning, it resolves at the percentage of past games that were won by a team up by 3 runs at that point in the game.
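
A minimal sketch of the rained-out baseball rule, assuming a hypothetical record format of (inning, lead, whether the leading team won). The history in the usage lines is made up purely for illustration and is not real baseball data.

```python
from typing import Iterable, Tuple

# Each historical record: (inning, lead of the team that was ahead, did that team win?)
GameState = Tuple[int, int, bool]


def resolution_probability(history: Iterable[GameState], inning: int, lead: int) -> float:
    """Fraction of past games in the same state that the leading team went on to win."""
    if lead == 0:
        return 0.5  # tied game: neither side is favored
    matches = [won for (inn, ld, won) in history if inn == inning and ld == lead]
    if not matches:
        raise ValueError("no comparable past games; the rules need another fallback")
    return sum(matches) / len(matches)


# Illustrative, invented history: four games with a 3-run lead in the 7th inning.
toy_history = [(7, 3, True), (7, 3, True), (7, 3, True), (7, 3, False), (5, 1, True)]
rained_out = resolution_probability(toy_history, inning=7, lead=3)
tied = resolution_probability(toy_history, inning=4, lead=0)
```

The rule still needs a stated fallback for game states with no precedent, which the sketch surfaces as an explicit error.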

Aside: Ambiguity can be good!

As the last two resolution methods indicate, eliminating ambiguity also greatly reduces the usefulness of a question. An example of this was a question in the original Good Judgement competition, which asked "Will there be a violent confrontation between China and a neighbor in the South China Sea?" and the resolution criterion was whether there was a fatal interaction between different countries. The predictions were intended to be about whether military confrontations would occur, but the resolution ended up being about a Chinese fisherman stabbing a South Korean coast guard officer.

Barb Mellers said that the resolution “just reflected the fact that life is very difficult to predict.” I would disagree - I claim that the resolution reflected a failure to make a question well aligned with the intent of the question, which was predicting increased Chinese aggression. But this is inevitable when questions are concrete, and the metric used is an imperfect proxy. (Don't worry - I'm not going to talk about Goodhart's Law yet again. But it's relevant.)

This is why we might prefer a solution that allows some ambiguity, or at least interpretation, without depending on ambiguous or overly literal resolutions. I know of two such approaches.

Offloading Resolution

One approach for dealing with ambiguous resolutions that still resolves predictions unambiguously is to appeal to an outside authority.

A recent Metaculus question for the 20/20 Insight Prediction Contest asked about a "Democratic majority in the US senate" and when the Senate was tied, with Democratic control due to winning the presidency, the text of the question was cited - it said "The question resolves positively if Democrats hold 51 seats or more in the Senate according to the official election results." Since the vice president votes, but does not hold a seat, the technical criteria were not met, despite the result being understood informally as a "Democratic majority." The Metaculus admins said that they agreed on the question resolution due to the "51 seats or more" language.

Instead of relying narrowly on the wording, however, the contest rules were that in case of any ambiguity, the contest administrators "will consult at least three independent individuals, blind to our hypotheses and to the identity of participants, to make a judgment call in these contested cases." The question resolution didn't change, but the process was based on outside advisors.

Meta-Resolution

Even more extreme, a second approach is that forecasters can be asked to predict what they think the experts will decide. This is instead of predicting a narrow and well specified outcome, and allows for predicting things that are hard to pin down at present.

This is the approach proposed by Jacob Lagerros and Ben Gold for an AI Forecasting Resolution Council, where they propose using a group of experts to resolve otherwise likely-to-become-ambiguous questions. Another example of this is Kleros, a decentralized dispute resolution service. To use it, forecasts could have the provision that they be submitted to Kleros if the resolution is unclear, or perhaps all cases would be resolved that way.

This potentially increases fidelity with the intent of a question - but has costs. First, there are serious disadvantages to the ambiguity, since forecasters are predicting a meta-level outcome. Second, there are both direct and management costs to having experts weigh-in on predictions. And lastly, this doesn't actually avoid the problem with how to resolve the question - it offloads it, albeit in a way that can decrease the costs of figuring out how to decide.

As an interesting application of a similar approach, meta-forecasts have also been proposed as a way to resolve very long term questions. In this setup, we can ask forecasters to predict what a future forecast will be. Instead of predicting the price of gold in 2100, they can predict what another market will predict in 2030 - and perhaps that market can itself be similarly predicting a market in 2040, and so on. But this strays somewhat from this post's purpose, since the eventual resolution is still clear.

Conclusion

In this post, I’ve tried to outline the variety of methods that exist for resolving forecasts. I think this is useful as a reference and starting point for thinking about how to create and resolve forecasts. I also think it’s useful to frame a different problem that I want to discuss in the next post, about the difference between ambiguity and flexibility, and how to allow flexibility without making resolutions as ambiguous.

Thanks

Thanks to Ozzie Gooen for inspiring the post. Thanks also to Edo Arad, Nuño Sempere, and again, Ozzie, for helpful comments and suggestions.

The Upper Limit of Value

davidmanheim — 2021-01-27T14:13:09.510Z

I am happy to announce a new paper I co-wrote with Anders Sandberg, which is now a public preprint (Note: PDF). The abstract is below, followed by a brief sketch of some of what we said in the paper.

Abstract: How much value can our decisions create? We argue that unless our current understanding of physics is wrong in fairly fundamental ways, there exists an upper limit of value relevant to our decisions. First, due to the speed of light and the definition and conception of economic growth, the limit to economic growth is a restrictive one. Additionally, a related far larger but still finite limit exists for value in a much broader sense due to the physics of information and the ability of physical beings to place value on outcomes. We discuss how this argument can handle lexicographic preferences, probabilities, and the implications for infinite ethics and ethical uncertainty.

Physics is Finite and the Near-Term

First, there is a claim underlying our argument, that our current understanding of physics is sufficient to conclude that the accessible universe is finite in volume, in time, and in amount of information which can be stored. (The specific arguments for this are in the appendix of the paper.) We also assume humans are physical beings, without access to value unconnected to the physical world. Anything valued in their mind is part of a physical process.

Given those two claims, we start out with a discussion of purely economic value, and the short term future, specifically the next 100,000 years. During that time, the speed of light means that humanity will only have access to the Milky Way Galaxy. In the optimistic case that we colonize the galaxy, the rate of growth in economic value is limited to the polynomial increase in accessible matter and volume of space. This implies that indefinite exponential economic growth is impossible. In fact, as we suggest in the paper, the limit to exponential growth is almost certainly well below 1% over that time frame.
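
To get a feel for why sustained exponential growth runs out of room, here is a back-of-the-envelope sketch. It is not the paper's calculation. The ceiling of $10^{80}$ is an assumption chosen for illustration: it is a common rough estimate of the number of atoms in the observable universe, and so is far more generous than anything reachable within the Milky Way.

```python
import math

YEARS = 100_000  # the time frame discussed above


def log10_growth_factor(annual_rate: float, years: int = YEARS) -> float:
    """Base-10 log of the total growth factor from compounding annual_rate for `years`."""
    return years * math.log10(1.0 + annual_rate)


def max_rate_for_factor(log10_factor: float, years: int = YEARS) -> float:
    """Largest constant annual rate whose compounded growth stays within 10**log10_factor."""
    return 10 ** (log10_factor / years) - 1.0


one_percent = log10_growth_factor(0.01)   # roughly 432, i.e. a factor of about 10**432
generous_cap = max_rate_for_factor(80.0)  # roughly 0.0018, i.e. under 0.2% per year
```

Even allowing the economy to grow by a factor of $10^{80}$, the average annual rate over 100,000 years comes out well under 1%.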

This has some interesting implications for economic discussions about the proper discount rate for the far-future, for the hinge-of-history hypothesis, and the argument that humanity will reach an economic singularity - or at least one where growth will continue indefinitely at an accelerating pace.

Value-in-General is Finite, Even When it Isn't

The second half of our paper discusses value more generally, in the philosophical sense. Humans often remark that some things, like human life, are "infinitely valuable." Despite economic evidence that this is not literally true, and taking this claim at face value, we argue that value is still limited.

In philosophy, preferences involving infinities are referred to as "lexicographic," in the sense used in computer science to refer to sorting. Any amount of a "lexicographically inferior" good, like blueberries, is less useful than a single "lexicographically superior" good, say, human lives. Still, in a finite universe, no infinities are needed to represent this "infinite preference." To quote from the paper:

We can consider a finite universe with three goods and lexicographic preferences. We denote the number of each good $N_A, N_B, N_C$, and the maximum possible of each in the finite universe as $M_A, M_B, M_C$. Set $M = \max(M_A, M_B, M_C)$. We can now assign utility for a bundle of goods $U(N_A, N_B, N_C) = N_C + N_B (M + 1) + N_A (M^2 + 1)$. This assignment captures the lexicographic preferences exactly. This can obviously be extended to any finite number of goods $N_n$, with a total of $N = \max(n)$ different goods, with any finite maximum of each.

(You should read the paper for a fuller account of the argument, and for the footnotes that I left out of this quote.)
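
The idea can be sketched in a few lines of code. Note that this sketch uses positional weights, treating each count as a digit in base $M + 1$, so the weights are $1$, $(M+1)$, $(M+1)^2$, and so on. That weighting is an assumption of the sketch, chosen because it is easy to check by brute force; the coefficients are not identical to those in the quoted passage, and the function names are hypothetical.

```python
from itertools import product
from typing import Sequence


def lexicographic_utility(bundle: Sequence[int], max_count: int) -> int:
    """Map a bundle (most important good first) to a single finite integer utility.

    Each count is one digit of a base-(max_count + 1) number, so no quantity of a
    lower-priority good can outweigh one unit of a higher-priority good.
    """
    base = max_count + 1
    utility = 0
    for count in bundle:
        if not 0 <= count <= max_count:
            raise ValueError("count outside the finite range 0..max_count")
        utility = utility * base + count
    return utility


def preserves_lexicographic_order(num_goods: int, max_count: int) -> bool:
    """Brute-force check that utility order equals tuple (lexicographic) order."""
    bundles = list(product(range(max_count + 1), repeat=num_goods))
    return all(
        (a < b) == (lexicographic_utility(a, max_count) < lexicographic_utility(b, max_count))
        for a in bundles
        for b in bundles
    )
```

For example, with at most 2 of each good, one unit of the top good and nothing else still outranks the maximum of both lower goods, and every utility is an ordinary finite integer.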

The above argument does not deal with expected utility, but in the paper we claim that not only are zero and one not probabilities, but neither are $ϵ$ or $1 - ϵ$. That is, we argue that it would be effectively incoherent to assign an infinitesimal probability in order to reach an infinite expected value. We also discuss why Boltzmann brains, and non-causal decision theories don't refute this claim - but for all of those, you'll need to read the paper.

Given all of this, we'd love feedback and discussion, either as comments here, or as emails, etc. Finally, I'll quote the paper a final time for the acknowledgements - not only was it awesome for me to co-write a paper with Anders, but we got feedback from a variety of really incredible people.

We are grateful to the Global Priorities Institute for highlighting these issues and hosting the conference where this paper was conceived, and to Will MacAskill for the presentation that prompted the paper. Thanks to Hilary Greaves, Toby Ord, and Anthony DiGiovanni, as well as to Adam Brown, Evan Ryan Gunter, and Scott Aaronson, for feedback on the philosophy and the physics, respectively. David Manheim also thanks the late George Koleszarik for initially pointing out Wei Dai's related work in 2015, and an early discussion of related issues with Scott Garrabrant and others on asymptotic logical uncertainty, both of which informed much of his thinking in conceiving the paper. Thanks to Roman Yampolskiy for providing a quote for the paper. Finally, thanks to Selina Schlechter-Komparativ and Eli G. for proofreading and editing assistance.

Multitudinous outside views

davidmanheim — 2020-08-18T06:21:47.566Z

There's an important piece of advice for forecasters: don't rely on your internal model of the world exclusively, and take the outside view, then adjust from there. But which view is "the" outside view? It depends on the problem - and different people might tell you different things. But if the choice of outside view is subjective, it starts to seem like inside-views all the way down.

That's where we get to base rates, which don't solve this problem, but they do highlight it nicely.

Fans of superforecasting know, in a hedgehog-like sense of knowing one thing, that the outside view, which is the base rate, which is the rate of similar events, should be our starting point. But which events are similar, and how is similarity defined? We first need to choose a reference class, based on some pre-existing idea of similarity. And in different terms, there is a reference class problem, which we evidently don't have a clear way to judge - and even as Bayesian thinkers, not only is that our problem, it's an entire bucket of different problems.

Considering a Concrete Prediction: Tesla Motors

Let's get really concrete: What will the price of Tesla stock be in 6 months?

Well, what is the reference class? In the last year, 90% of the time, the price of Tesla stock has been between $200 and $1000. But that's a really bad reference class, when the price today is $1,800. OK, but looking at the set of all stocks would be even worse - and looking at automobile stocks even worse than that. Which stocks are comparable? What about stocks with P/E ratios over 900? Or stocks with more than a half billion dollars of losses for their net income? We're getting silly here.

Maybe we shouldn't look at stock price, but should look at market capitalization? Or change in price? "Stocks that went up 9-fold over the course of a year" isn't a super helpful reference class - it has only a few examples, and they are all very different from Tesla.

Of course, none of this is helpful. What we really want is the aggregate opinion of the market, so we look at futures contracts and the implied volatility curve for options expiring in February.

That doesn't look like a reference class. But who needs an outside view, anyways?

What is a reference class?

If you want to know the probability of Kim Jong-un staying alive, we can consult the reference class of 37 year old males in North Korea, where male life expectancy is 68. Alternatively, look at the reference class of his immediate family - his brother died at the age of 46, but his father lived to the age of 70, and his grandfather lived until 82. Are those useful reference points?

What we really want is the lifespan of dictators. Well, dictators of small countries. Oh, actually, dictators of small nuclear powers that know that Qaddafi was killed after renouncing his nuclear program - a reference class with no other members. Once again, of course, none of this is helpful.

In finance, the outside view is a consensus that markets are roughly rational, and the inside view is that you can beat the market. In international relations, the outside view is that dictatorships can be tenuous, but when the regime survives, the leadership lives quite a long time. The inside view is, perhaps, that China has a stake in keeping their nuclear neighbor stable, and won't let anything happen.

Reference classes depend on models of the world.

In each case, the construction of a reference class is a function of a model. Models induce reference classes - political scientists might have expert political judgement, while demographers have expert lifespan judgement, and 2nd year equity analysts have expert financial judgement. All of those are useful.

What reference class should have been used for COVID-19 in, say, mid-March? The set of emerging infectious diseases over the past decade? Clearly not. In retrospect, of course, the best reference class needed an epidemiological model - the reference class of diseases with $R_0 > 1$, where spread is determined by control measures. And the reference class for the success of response in the US should have been based on a libertarian view of the failure of American institutions, or a Democrat's view of how Trump had been rapidly dismantling government, and not an index designed around earlier data which ignored political failure modes. But how do we know that in advance? Once again, none of this is helpful in deciding beforehand which reference class to use.

A final example. What reference class is useful for predicting the impact of artificial intelligence over the next decade? Robin Hanson would argue, I think, that it's the reference class of purported game-changing technologies that have not yet attracted significant amounts of capital investment. Eliezer Yudkowsky might argue that it's the reference class of intelligence evolving, sped up by a factor of what we've seen so far of computer intelligence, which moved from an AI winter in the mid-2000s and ant-level intelligence at navigation, to DeepMind being founded in 2010, to IBM’s Watson winning Jeopardy in 2011, to beating the Winograd Schema and acing general high-school science tests without specific training using GPT-3 now. And if you ask a dozen AI researchers, depending on your methods, you'll get at least another dozen reference classes. But we still need to pick a reference class.

So which reference class is correct? In my (inside) view as a superforecaster, this is where we turn to a different superforecasting trick, about considering multiple models. As the saying goes, hedgehogs know one reference class, but foxes consult many hedgehogs.

Update more slowly!

davidmanheim — 2020-07-13T07:10:50.164Z

My experience forecasting has led me to realize that a key mistake that is often made is updating on new data too quickly. This seems backwards, but I think that often the biggest reason that people both under- and over-react to evidence is that they don't consider the evidence clearly enough, and immediately start revising their model to account for the evidence, instead of actually thinking before updating.

Let's deconstruct how rapid updating is misleading with a simple notional example. Someone shows up with a coin and claims that she is psychic and can predict coinflips. You are skeptical, and challenge her to do so. She immediately takes out a coin and correctly predicts heads 3 times in a row. You can update a few ways:

Conclude that the claim is now much more likely than before, and you give some credence to the idea that she is psychic
Conclude that she was lucky, and 7:1 odds is not very strong evidence, so you continue to hold on to your prior strongly
Conclude that she is cheating using a coin which has heads on both sides

Notice that these possibilities which spring to mind are completely different cognitive strategies. Either you update to believe her more, you decide the evidence is insufficient, or you immediately seize on a new hypothesis that explains the evidence.

But these are all mentally lazy strategies. If you thought about the situation for longer, you could easily generate a half dozen additional hypotheses. Perhaps she has trained herself to flip coins a certain way. Perhaps she simply lied and said the coin landed heads each time, and didn't really give you time to see it well when it didn't. Perhaps she is substituting the coin as it lands. Perhaps, perhaps, perhaps.

My advice, per the title, is to slow down. You might decide to be a good Bayesian, and preserve multiple hypotheses, updating marginally - but doing this means that you assume the correct hypothesis is in your prior set. There are a million hypotheses that can explain a given set of events, and the most cognitively available ones are those that allow you to be lazy.
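
Here is a small sketch of why the hypothesis set matters so much. The priors are invented for illustration and are not from anything above; the only real input is that three correct calls have probability $1/8$ under pure luck and probability 1 under either "psychic" or "two-headed coin."

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over a finite set of hypotheses; the result sums to 1."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Probability of three correct "heads" calls in a row under each hypothesis.
likelihood = {"psychic": 1.0, "lucky": 0.5 ** 3, "two-headed coin": 1.0}

# Illustrative priors only (assumptions made up for this sketch).
narrow_prior = {"psychic": 0.001, "lucky": 0.999}
wider_prior = {"psychic": 0.001, "lucky": 0.949, "two-headed coin": 0.05}

narrow = posterior(narrow_prior, {h: likelihood[h] for h in narrow_prior})
wider = posterior(wider_prior, likelihood)
```

With only two hypotheses, all of the update flows to "psychic." Adding a single cheating hypothesis sends most of it there instead, and every further hypothesis you have not yet thought of would shift the numbers again.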

Don't worry, take your time. If the issue matters enough to bother trying to update your model, taking five minutes to reflect is better than jumping the gun. And if you don't need to make any decisions, at least file everything away and decide that it's unclear instead of quickly responding with an overconfident "Aha, now I understand!" or worse, a "Eureka, I've solved it!"

Bayesian thinking gives you answers no faster than a rational accumulation of evidence can possibly allow, given the uncertainties that exist. Slow down. Marginally rationally updating doesn't give you confident answers quickly. It can't.
Trying to update faster isn't going to get you better answers now; it will get you worse answers more quickly.

Updating isn't a sprint to the answer, or even a race - it's a long-duration hike towards a still-unclear goal. If you imprudently start sprinting early because you think you see the goal, you're just going to hurt yourself, or get lost and never make it to the right destination. Take your time.

A Personal (Interim) COVID-19 Postmortem

davidmanheim — 2020-06-25T18:10:40.885Z

I think it's important to clearly and publicly admit when we were wrong. It's even better to diagnose why, and take steps to prevent doing so again. COVID-19 is far from over, but given my early stance on a number of questions regarding COVID-19, this is my attempt at a public personal review to see where I was wrong.

I have been pushing for better forecasting and preparation for pandemics for years, but I wasn't forecasting on the various specific questions about pandemics on most platforms until at least mid-March, and I failed in several ways.

Mea Culpa

I was late to update about a number of things, and simply wrong in some cases even on the basis of known information. The failures include initially being slow to recognize the extent of the threat, starting out dismissive about masks, being more concerned about hospital-based transmission than ended up being justified, being overconfident in the response of the US government, and in early March, over-confidently getting a key fact wrong about transmission being at least largely via aerosol droplet versus physical contact. I have a number of excuses, of course. Most other experts agreed with my views, my grandfather passed away in January, followed by his wife in early March, I was under a lot of stress, I was very busy with my personal life, I was trying to do a number of other high-priority projects, I was not paying attention to the details, and so on. But predictive accuracy doesn't care about WHY you were wrong, especially since there are always such excuses. And the impact of my poor judgement was also likely misleading to others in the community.

At the same time, I feel the perhaps egotistical need to note where I was correct early, and what I got right - followed by a clearer description of my failures. I started saying there would be PPE shortages due to COVID-19 by January, and was writing about the supply chain issues well before COVID. I submitted this paper November last year with Dave Denkenberger, which was largely finished last summer, and it was accepted in February, which then took 3 months to get published. The delay was in part due to other demands on my time, but in retrospect, if it had been available 3 months earlier, it would have been far, far more impactful.

I also understood the failure mode we ended up seeing, and in my 2018 paper, discussing overconfidence in claims that pandemics would be rare, I argued that among the most critical risks was failure to respond to emerging pandemics which could in theory be controlled quickly enough. On the other hand, my failure to realize that this is exactly what was happening is perhaps compounded by the fact that I understood the dynamics, and should have been able to identify what was going on.

Lastly, I maintain I was correct in warning about the poorly thought out and in some cases outright dangerous "preparation" in some quarters of the rationality community proposed in March, such as advocating use of bleach and ozone in closed areas for disinfection. Some people in the community were stockpiling N-95 masks and food and buying up second hand ventilators, and as I said at the time, were at best being selfish and defecting. On the other hand, as I mention below, I was insufficiently clear about the need for better preparation, and waited far too long to speak.

Some of My Mistakes, and Related Comments

Slow to recognize the extent of the threat.

I said we should be very concerned in January, albeit not very publicly. I took until early March to start suggesting that it was clear that the US would expect to see large numbers of deaths. I was skeptical of valuable efforts early on, and didn't start really publicly sounding the alarm and reacting until even later. I was later than most of this community in recognizing the risks.

Skeptical about Border Closures

In a conversation that started Jan 27th, I was asked about shutting down borders to prevent spread. I was dismissive, in large part based on the expert consensus. I'm unsure whether this was a mistake on the object level, since I think that at that early point, the facts were unclear enough, and trade wars really are bad. I also expected response to be better, based on previous cases.

I do not think that border shutdowns were feasible, and historically they have not been. Quarantines at borders were and are logistically impossible. And full border closures for COVID-19 were also not very effective in most places until very late in the spread. (Mongolia and Vietnam are the exceptions that disprove the rule.) Even late in the pandemic spread, lots of transmission occurred from places where there had been few or no cases at the time people entered. However, when discussing it, I excused my early claims that it was too economically damaging and would have been ineffective by substituting a different argument about political feasibility - one which I think is correct, but was not my original consideration. This was bad epistemic practice, and I should have been clearer that in retrospect, if they could have been put in place, travel bans would have been a much better idea. I still think my later excuse, that they were politically impossible, holds up - but I had not fully thought through the question until well after my early response.

Dismissive about masks.

The research on use of masks was unclear and I don't want to claim it was retrospectively obvious, but as a matter of decision making given uncertain risks, people should have started wearing homemade masks in public much earlier. We will still need to see how much impact promoting mask wearing in public has had, but at the very least it functioned as a clear and important public signal that COVID was serious, which promotes physical distance and other critical factors.

On the other hand, I said at the time, and still maintain, that buying up N95 and surgical masks in February and March was defecting, since it was already clear that those supplies were needed desperately in hospitals. And Fauci has now said as much (as a level-1+2 sage, in my view). In retrospect, I think it would have been better, consequentially, to push for cloth masks earlier, but current modeling and our understanding of spread make it clear that mask wearing by itself is only marginally effective. I was instead focused on promoting handwashing, which I think is still undersold in importance, and thought that continued focus on masks would be a net negative. I was wrong, and others here were correct.

Not clear enough about the importance of preparation.

I've long said, following all of the experts, that people should have 2 weeks supply of food and basic supplies. Especially people in California, where earthquakes are far more common than severe pandemics. Further preparation should have been unneeded early on - but in fact, most people don't do this, and the people who were advocating making sure that you were prepared for a worse outcome were correct.

On the other hand, there is an argument I've seen here, and by others in the rationality community elsewhere, that encouraging people to buy critical supplies and hoard early in a crisis sends a price signal to get companies to produce. The argument is that this type of hoarding masks and other PPE will convince manufacturers to make more. I thought, and still think, that this is at least partly misunderstanding the way that price signals and supply chain delays propagate. Anyone who's familiar with MIT System Dynamics' Beer Game and the bullwhip effect would tell you that companies that ramped up production in response to demand quickly (rather than projections and an understanding of longer term demand) were being stupid, not prudent, and companies that tried this in exactly this area were burned in the past for doing so. If that isn't clear enough, notice that it took a couple months for the toilet paper and flour "shortages" to be worked out, despite the fact that there was sufficient supply, and there were not actual production supply shortages. Yes, markets are largely efficient, but they aren't magical ways to eliminate production and distribution delays, much less to insulate companies from actual market dynamics - and China and other southeast Asian countries had already stepped up mask production massively by mid-January. Most of the current supply comes from those factories, so the supposed benefits of price signals from buying masks in February seem not to have been actually effective in speeding anything up.

Oversold Hospital-based transmission.

Part of my concern about hoarding of masks and other equipment was that I thought we would once again see a pattern of large transmission events being centered around hospitals. Thankfully, this didn't happen - hospitals have gotten far better at isolation of patients, and they shut down non-essential services early. We did still see many, many cases and deaths in hospital staff, and this was very clearly in large part due to a lack of supply of PPE. Still, it wasn't the critical locus of spread I expected it to be.

Overconfidence in the response of (certain agencies in) the US government.

This was a huge mistake on my part. I have been concerned about the current administration for years, and have repeatedly warned that it is destroying government agencies. Despite that, I was (in retrospect very unreasonably) still confident that the CDC was going to handle the situation well. They had handbooks on influenza pandemic preparedness, I had personally discussed pandemic preparedness plans with senior people at CDC just a few years ago, and I was overconfident in their ability to respond. Based on that, in turn, I was confident that the level of concern being voiced by the CDC was a reflection of their planning and ongoing preparation. The CDC has planned for this exact case for years, and I assumed they would carry out those plans. I was wrong.

It seems, though it is still somewhat unclear, that center directors were told by the director and the head of HHS that they needed not to speak out about the risks, specific recommendations were vetoed, and (easily the worst screw up) they let the FDA ban private tests, seemingly at the direction of the administration, to hide the extent of the spread. I'm still confused by the level of non-reaction among non-political SES staff and GS-14s. We have seen many people in various agencies come forward with complaints during this administration, but CDC seems to have just dropped the ball on their response. We will likely see in the coming years how much this was due to central directives not to react, versus a lack of central directives to react, therefore failing due to passivity. I still want to assume the former, but that's in large part self-justification of my prior views.

I was wrong in trying to defend the CDC's overall response in March. It definitely isn't as clear as I thought at the time that they were, and would be, net positive. I do think that the emergence of Fauci as almost a national hero has been very helpful in getting people to listen to expert recommendations, even if this did come very late. This is a point on the side of getting most people to listen more and attack less. On the other hand, Lesswrong was overall better prepared because of their skepticism, so at the very least I was talking to the wrong crowd to defend them, and more likely should have been quicker to judge their actions as dangerous myself.

The FDA also surprised me with how badly they did, albeit the surprise was less severe because I had lower expectations. I thought they were getting less dangerous to US public health given the previous pushes to reduce regulation by the current administration. Scott Gottlieb was there for two years, and was probably the only Trump nominee I was actually super-happy about. Unfortunately, he left (a fact I wasn't paying attention to), and it turns out that the incompetence of a sequence of new directors and rapid changes left the FDA even less prepared than they would have been. I would have expected a doctrinaire Republican appointee to seize the opportunity of a crisis to reduce regulation, and instead it seems they did nothing but block critical testing work for months.

I've long considered myself skeptical of government agencies' abilities, and lean fairly heavily libertarian in many ways - albeit less than most others at lesswrong. I was still surprised by the level of ongoing, perhaps even malicious incompetence of the current administration. I'm still unclear if this is a Hanlon-dodge, or if they really have broken the US government so badly, so quickly. Other governments managed this far less poorly, so I'm unclear how generalizable the lesson is that governments are bad at everything. But I am glad I left the US.

Being a jerk commenting on a post attacking the CDC

Given that I'm posting a retrospective, there is a different type of mistake I made that I also need to address. In a lesswrong thread several months ago, there were a number of claims made about the CDC's response. I responded that I thought the post was an infohazard, would very plausibly lead to many more people dying, and as such, the posters should have asked for feedback from someone who could vet concerns about this, and that it should be taken down by site administrators. This was stupid, and I have apologized there, along with laying out what I hope is a fair analysis of what I know I did wrong, and what I still think I was correct about.

Speculation about Causes

There are lots of things I did wrong.

First, I think I was too close to the situation. I had spent a ton of time looking at the US's system specifically, and writing about the closely related topic of influenza pandemics in my dissertation, then doing work for Open Philanthropy on GCBRs. All of this was during the Obama administration. I left the US a bit after Trump was elected, partly for that reason, and worked on related topics that had less to do with US policy. I'd like to say that's why I didn't update, but to be honest, I think I was just being stupid in accepting my cached thoughts about the risk and best responses, instead of re-evaluating.

I also had too-strong priors and "expert" ideas to be properly fox-like in my predictions, and not quick enough to update about how things were actually going based on the data. Because I was slow to move from the base-rate, I underestimated the severity of COVID-19 for too long. I'm unsure how to fix that, since most of the time it's the right move, and paying attention to every new event is very expensive in terms of mental energy. (Suggestions welcome!)

I also gave too much weight to others' forecasts. Good Judgement's predictions were WAY optimistic about this early on, and I was not forecasting the question, but I was assuming that their aggregate guess was better than that of individuals, especially people who aren't forecasters. This is usually correct, but here it was a mistake. (I now think that superforecasting is materially worse than I hoped it would be at noticing rare events early.) I also followed the herd too much from expert circles, and my twitter feed from infectious disease epidemiology circles was behind even my slow self in recognizing that this was an incipient disaster back in March.

Conclusion

COVID-19 went badly in some places, and went disastrously in others. This was largely predictable, and I failed to notice early enough. (The US is in deep, deep trouble, and this will continue for quite a while longer, with myriad longer term effects on the global economy, and on global stability of other types.) I'm chastened about the poorly calibrated overconfidence of my expert opinion.

I'm also partly unsure what the best next steps are for better calibration. One key thing I did, several years ago, was explicitly try to rely more on other people's views in the rationality community to guide my decisions, and provide a clear source of feedback. I didn't do this as much as I should have in this case. (On the other hand, it was a large part of why I recognized the mistake as quickly as I did, albeit later than I could have - so it was at least a partial success.)

I'm hoping that this exercise is another way in which thinking through the situation gives me a valuable chance to reflect, and that I can get further feedback. I also hope that it's useful for others to perhaps learn from, but I'm unsure how transferable the lessons of my failures are.

Market-shaping approaches to accelerate COVID-19 response: a role for option-based guarantees?

davidmanheim — 2020-04-27T22:43:26.034Z

This is a policy brief directed at decision makers in the UK government, with a view to accelerating production of tests, drugs and vaccines for COVID-19; but it could be adapted for a wide variety of countries, products and crises. Critical feedback is very welcome.

Thanks to Sam Hilton, Tim Colbourn and anonymous others for input on previous drafts.

Summary

Effectively tackling COVID-19 will require rapidly scaling up the production of diagnostic tests, pharmaceutical treatments and vaccines. In each case, preparations for large-scale manufacturing, such as building factories, are typically delayed until the product is proven safe and effective. This makes sense from a commercial perspective, but incurs great costs in terms of lives lost and damage to the economy.

There are several potential solutions, but the most promising appears to be “option-based guarantees”. In essence, the government commits to paying a proportion of the manufacturer’s preparation costs should the product turn out not to be viable. (If the product is viable, it can be sold as normal.) This reduces the risk to the company while maintaining an incentive to produce a high-quality product quickly and at scale.

The problem

The UK, like most of the world, faces an urgent need for increased COVID-19 testing capacity. As more people become infected and recover, large-scale screening for past exposure (antibody tests) will be necessary, but the shortage is especially acute for tests of active infection (“swab” tests). The government recognises this need and has set a target of 100,000 tests per day – a combined figure for antibody and swab tests – by the end of April. While this increase is welcome, safely bringing the country out of lockdown could require far more widespread swab testing, potentially as many as 10 million tests a day. Post-lockdown policy is still being developed so this may become the strategy within a few weeks. The UK should prepare now by deciding how market forces can be leveraged to rapidly scale up testing.

A long-term solution to the crisis will involve effective pharmaceutical treatments or vaccines (ideally both). Promising candidates have been identified, but most will take at least several months to complete Phase 3 studies – perhaps less if “human challenge trials” are permitted. Since many products will not prove viable, companies have little incentive to invest in production facilities before the product achieves regulatory approval. Thus, scaling up production is likely to add a few more months to the overall timeline, costing thousands of lives and billions of pounds of lost GDP in the UK alone – far more than the cost of preparing to manufacture products that do not end up reaching the market.

Potential solutions

There are several ways to address this problem.

1. Prizes

The government could offer financial rewards for solutions to supply shortages. For example, companies could compete to offer the best idea for rapidly scaling up vaccines, and a contract to produce them could be part of the prize. By only paying out for the best solution (and perhaps not any, if certain criteria are unmet), this can be a fairly cheap option that incentivises innovation. However, there is necessarily a substantial delay between announcing and awarding the prize; and because the “losers” get nothing, there may not be adequate financial incentive to participate.

2. Public-private partnerships

PPPs can be an effective means of achieving social objectives, such as building infrastructure, by sharing the risk between government and private companies. However, they generally take a long time to negotiate and implement. Unless this process can be greatly accelerated, they are unlikely to be sufficient to ensure the most urgent needs are met in the current situation.

3. Direct purchase orders

The government could pre-order tests, pharmaceuticals and vaccines directly, well before efficacy or safety is established. This would legally guarantee that producers have a market and that the company will supply the product, thereby reducing risk to both parties.

However, this requires the government to first identify suppliers and producers, negotiate prices, and make orders. The process for governmental purchasing is complex, and purchasing something from a new vendor, or purchasing products not shown to be safe, will potentially be a violation of the UK’s public procurement policy. It is also likely to be wasteful as many final products will be unused, and it gives little incentive for producers to improve quality, speed, or cost-effectiveness through innovation.

4. Option-based guarantees

A new approach is for governments to enter into agreements with companies using “put” options. A “put” (as in “put up for sale”) gives the holder the right, but not the obligation, to sell an asset at a specified price, by (or on) a specified date, to the provider of the put. So the government could supply a put option giving companies the right to sell certain items (e.g. vaccine factories, drugs, or diagnostic tests) to the government for a certain percentage of the cost of making them. Because there is no requirement to exercise the put, companies could sell viable products as normal, and would only use the option if their product turns out to be non-viable.

For example, suppose a manufacturer wishes to produce 100 million of a new type of test, but is delaying production because the product is currently being evaluated. They could approach the government, which could agree to provide a put option for, say, 90% of costs, capped at the company's initial project cost estimate. If the test is found viable, the company would not exercise the option, the government would pay nothing, and the company would be able to sell the tests normally to the NHS and others. If found non-viable, however, the company would have an incentive to stop production and exercise their option. At that point, a financial audit of costs would take place, and the government would accept delivery of any items purchased, built, and/or produced in exchange for 90% of costs. A further independent evaluation might be useful to resolve disputes about reasonableness of costs.
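The payout logic in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the figures are hypothetical, and it takes one reading of the cap, namely that audited costs are capped at the initial estimate before the 90% is applied.

```python
def put_payout(audited_costs, initial_estimate, guarantee=0.90, viable=False):
    """Government payout under an option-based guarantee (illustrative only).

    Assumption: the cap applies to eligible costs, so the payout is
    guarantee * min(audited costs, initial estimate).
    """
    if viable:
        # A viable product is sold as normal; the option is never exercised.
        return 0.0
    eligible_costs = min(audited_costs, initial_estimate)
    return guarantee * eligible_costs

# Hypothetical figures, in pounds.
estimate = 100_000_000

# Test found non-viable after £80m of audited costs.
under_budget = put_payout(80_000_000, estimate)

# Costs overran to £120m; only the first £100m is eligible.
over_budget = put_payout(120_000_000, estimate)

# Test found viable: the government pays nothing.
success = put_payout(80_000_000, estimate, viable=True)

print(under_budget, over_budget, success)
```

Under this reading the company always bears at least 10% of its costs, plus all of any overrun beyond its own estimate, which is what preserves the incentive not to overspend.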

This approach has a number of advantages.

Commercial companies can continue to use traditional, non-governmental methods for financing and constructing a product without any government supervision.
Companies will be willing to take a larger risk in manufacturing not-yet-proven technologies, because the costs (to the company) of failure are reduced. This incentivises starting production earlier.
Both haste and high quality are still incentivised through normal market mechanisms: being the first and/or the best product on the market will increase sales and therefore profit.
Compared to some alternatives, it is relatively cheap.
It could potentially be implemented quickly – a very important consideration in the current circumstances, especially for diagnostics.

Recommendation

Overall, guarantees based on put options seem to show the most promise. When the viability of the final product is uncertain, they provide a relatively quick, low-cost way of incentivising the rapid production of new technologies. However, options 1–3 are also worth exploring further, and the optimal approach (or combination of approaches) may vary among products, time periods, and companies. For example:

Prizes could work – perhaps alongside other incentives – when innovative solutions are likely to be needed, such as point-of-care tests for active infection, new vaccines, or new methods of scaling up production.
PPP could be appropriate for less urgent and fairly low-risk products. Antibody tests, and drugs that are very likely to be used but will not run out soon, may fall into this category.
A direct purchase order for a certain number of a certain diagnostic test could be effective if the safety, accuracy, cost and quantity required are known, the company is trusted, and the paperwork can be done quickly. This may be less risky than hoping a company will respond to financial incentives.
A put option on production facilities (not final products) could be the best alternative for diagnostics, drugs and vaccines that are promising but whose viability, large-scale manufacturing methods and/or quantity required are substantially uncertain.

Over the coming weeks and months, making the right choices could save thousands of lives in the UK and millions around the world, while enabling economies and communities to reopen.

Annex 1: Potential variations on standard put options

Declining payout

The payout for the put options could be declining over time, so that the payment is, say, 95% at the outset, and declines by 1% per month. This will incentivise companies to exit as soon as possible if they think the project will fail.
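As a rough sketch of why a declining payout encourages early exit, the snippet below tracks the share of costs a company cannot recover as a failing project drags on. The assumptions are mine: "1% per month" is read as one percentage point per month, spending is a constant hypothetical £10m per month, and the `floor` parameter is added only to keep the guarantee from going negative.

```python
def guarantee_at(months_elapsed, initial=0.95, monthly_decline=0.01, floor=0.0):
    """Guaranteed share of costs after a given number of months."""
    return max(floor, initial - monthly_decline * months_elapsed)

def unrecovered_loss(months_elapsed, monthly_spend):
    """Costs the company bears itself if it exercises the option now."""
    costs_so_far = monthly_spend * months_elapsed
    return (1 - guarantee_at(months_elapsed)) * costs_so_far

monthly_spend = 10_000_000  # hypothetical constant spend, in pounds

loss_month_3 = unrecovered_loss(3, monthly_spend)
loss_month_9 = unrecovered_loss(9, monthly_spend)
print(loss_month_3, loss_month_9)
```

With these numbers the company's own loss rises from about £2.4m at month 3 to about £12.6m at month 9, so continuing a project it expects to fail becomes steadily more expensive.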

Priced contracts

The government could decide to charge for the contracts, to dissuade unqualified or undercapitalised companies from taking huge immediate risks with small probabilities of paying off.

Early-ending bonus

Alternatively or additionally, there could be payments for ending the contract early. In this case, the government might refund a portion of the initial payment or pay some fixed amount if the company decides to end the option without exercising it. This would create another incentive for companies to move quickly, and reduce uncertainty and risk on the government’s side.

Annex 2: Key questions about option-based guarantees

Q1: Isn't this a giveaway to corporations?

A: Yes, but in a sense it is a minimal giveaway. It does not subsidise companies to undertake projects that they expect cannot succeed, but does allow them to move forward schedules for production. Under the circumstances, it seems worthwhile.

Q2: Isn't this wasteful?

A: Yes, it is nearly certain that some items produced will not be viable, so the government will pay for unused products. However, the companies have an incentive not to spend more than needed, since they recover only part of the costs; and most importantly, the successful products will be available far faster. (The guaranteed percentage of costs can be adjusted to reach the desired trade-off between avoiding waste and hastening development of the needed product.)

The government may also be able to reduce the costs of the program by reselling some items: for example, a plant designed to manufacture an ineffective vaccine could eventually be adapted to produce an approved vaccine. There have been intermittent shortages of other vaccines, pharmaceuticals and tests, so excess capacity may not be entirely wasted.

Q3: Won't there be fraud or unnecessarily high costs?

A: There is a risk that some companies will try to take advantage of the programme. The put options, however, will pay less than the cost, so there is a reduced risk that companies will attempt to participate if they do not think the product has a reasonable chance of success. (Again, the right balance can be struck by adjusting the guaranteed percentage.)

For subcontracting, the structure of the payout means that the company, not the government, takes on the risk that costs will be considered excessive, so they have incentives to ensure they are not overpaying.

Q4: Won’t safety and quality suffer?

A: With less “skin in the game”, there may be a reduced incentive for companies to be successful. But if a product is viable, they will not want to exercise the put option, and they will have the same incentive to ensure quality and safety as they would otherwise.

Q5: How do you ensure the final product is affordable?

A: Put options do not ensure the final test, drug or vaccine is available at a reasonable price, but nor do they preclude price controls. This is an important but entirely separate issue that applies equally to products developed through other means. It is worth noting that, while the product must be cheap enough to roll out at massive scale, the price must also be high enough to reimburse the costs of constructing the production facilities.

Potential High-Leverage and Inexpensive Mitigations (which are still feasible) for Pandemics

davidmanheim — 2020-03-09T06:59:19.610Z

(Crossposted from EA Forum) A little over a year ago, I started a collaboration with David Denkenberger essentially trying to answer the question of what we can do to prepare for pandemics, focusing on things that are Important, Tractable, and Neglected. The resulting paper, entitled "Review of Potential High-Leverage and Inexpensive Mitigations for Reducing Risk in Epidemics and Pandemics" has now been accepted for publication. The publication process is unfortunately very slow*, but in addition to talking about a number of the things we should have been doing before COVID-19 hit**, there are a few that we are perhaps not too late to address. I thought it was worth considering a few of the still-relevant needs and activities that seem like potentially high-leverage avenues for investigation. Given that a few of these things still seem neglected but are potentially addressable for the current pandemic, I wanted to highlight them.

1) Enable people to stay isolated effectively.

People are currently being told to stay in quarantine if they were exposed. Unfortunately, staying home may become more difficult if people in isolation need supplies and have no income to allow them to purchase things, or if they cannot get certain things - many medicines must be picked up in person, and in some places, supplies cannot be easily ordered online. Furthermore, if delivery services are interrupted, (which is likely given the heightened risk of exposure for delivery workers, see below,) this will become more difficult. Contingency plans to help deal with the challenges would be helpful, and systems to enable volunteers to assist could also be a useful avenue of research.

2) Triage and manage medical care remotely.

Much medical care does not require hospitals or emergency rooms, but they are utilized anyways - that's where the doctors are. If medical facilities become overwhelmed, many of these need to be replaced by alternatives - self-administered diagnostics, systems to enable phlebotomy and similar testing without needing to go to hospitals, and similar. EMTs may need to be trained to do pre-hospital triage rather than bring lightly injured patients, or people at high risk of serious infection, to hospitals. Similarly, to the extent that people can self-treat, videos and instructions for doing so may be valuable.

3) Manage critical services through disruptions.

Many people doing preparation for large scale disruptions have pointed out that it seems unlikely that their homes would lose access to electricity or clean water. The reason this is true is that these critical services will have a high priority. It is unclear, however, how much redundancy and backup exists for key personnel in this type of facility. If all of the senior engineers who can accomplish a critical function are simultaneously absent, they cannot be replaced easily during a crisis - and they are somewhat likely to all be exposed and become sick around the same time.

4) Ensure transport systems remain functional.

If international shipping is slowed or stopped, or trucking or other transport were disrupted - either due to quarantines, or because of a lack of personnel - the US food supply systems would be reduced to a fairly short supply of goods. Similarly, if Amazon, UPS, or other delivery companies can no longer deliver goods, many systems that rely on the ability to reorder components would be disrupted. This is a critical need, and it is unclear how the system would cope with a large scale unavailability of drivers.

To conclude, I'll mention that I'm hopeful that the discussion can start to shift towards preparation for contingencies before it becomes obvious that they are needed. Hopefully, this is something where people can still lead and be proactive, rather than reactive.

*) The paper was submitted to the International Journal of Emergency Management in June, got desk rejected without review in October, and was submitted again in November to the Journal of Global Health Reports. It's now accepted - slightly too late for it to be timely for COVID-19 preparation, but hopefully in time to suggest some new ideas about response.

**) Most of the paper was pointing to a number of no-longer-neglected but more clearly important issues, such as the likelihood of shortages of supplies like masks, and talking about how companies should look at how they can enable remote work in advance to allow self-isolation. Still, there are quite a few points in the paper that aren't yet being discussed that still seem valuable.

Ineffective Response to COVID-19 and Risk Compensation

davidmanheim — 2020-03-08T09:21:55.888Z

UPDATE: This post was written and reflects an earlier set of my beliefs. I have updated significantly in a number of ways since it was posted, based on both external events and research, and no longer endorse it.

Epistemic status: I have a mental model that I think separates my view of response activities from that of the majority of what I see on Lesswrong and associated places. If it is incorrect, I'd be happy to update, but I think this is an area I have considered more than most other posters. I want to write a short post explaining this to allow others to update, and seeing if someone has an argument that changes my mind.

Put simply, my claim is that bringing attention to likely ineffective personal methods for reducing risk is not net neutral with a large upside if they work, it is instead likely to be on net fairly harmful, albeit with a large upside if they work.

Argument

First, we have incredibly effective and vastly underutilized ways to prevent spread of COVID-19, namely handwashing and not touching your face. Given that, if I propose an intervention like making homemade masks from fabric which reduced handwashing compliance by 1% (perhaps due to distracting people or making them think handwashing is less critical,) it would need to be astonishingly effective to be net positive. And most such approaches being discussed are, as far as I can tell, nowhere near that level of effectiveness.

Second, most readers of Lesswrong and effective altruism blogs and Facebook groups aren't hardcore rationalists, and even hardcore rationalists aren't immune to akrasia. On top of that, people like Scott Alexander have huge readerships and sometimes link people to Lesswrong. Many people reading posts here aren't washing their hands enough as it is, and aren't going to rationally evaluate the relative effectiveness of handwashing versus other interventions.

Third, evidence exists that risk-compensation is a meaningful issue. Actions that make people feel safer usually lead to less attention paid to more annoying / more intrusive measures. (There is evidence, such as Vrolix's paper*, that risk compensation reduces the size of the positive impact, but does not make interventions net negative. This is conditioned on the impact being significant and positive, however, and seems not to apply to speculative interventions like those being proposed.)

This is not an argument that we should not look into better options for response. It's an argument that we should be more careful in vetting them before encouraging people to do them just in case they work.

*) Vrolix, Klara (2006). "Behavioural Adaptation, Risk Compensation, Risk Homeostasis and Moral Hazard in Traffic Safety"

Link: Does the following seem like a reasonable brief summary of the key disagreements regarding AI risk?

davidmanheim — 2019-12-26T20:14:52.509Z

This is a link to a question asked on the EA Forum by Aryeh Englander. (Please post responses / discussion there.)

Does the following seem like a reasonable brief summary of the key disagreements regarding AI risk?

Among those experts (AI researchers, economists, careful knowledgeable thinkers in general) who appear to be familiar with the arguments:

Seems to be broad (but not universal?) agreement that:

Superintelligent AI (in some form, perhaps distributed rather than single-agent) is possible and will probably be created one day
By default there is at least a decent chance that the AI will not be aligned
If it is not aligned or controlled in some way then there is at least a decent chance that it will be incredibly dangerous by default

Some core disagreements (starred questions are at least partially social science / economics questions):

Just how likely are all of the above?
Will we have enough time to see it coming, and will it be obvious enough, that people will react appropriately in time to prevent bad outcomes?

Still might be useful to have some people keeping tabs on it (Robin Hanson thinks about 100), but not that many

How hard is it to solve?

If easy then less time needed to see it coming, or inventors more likely to incorporate solutions by default
If really hard then may need a long time in advance

How far away is it?
Can we work on it profitably now given that we don't know how AGI will work?

If current ML scales to AGI then presumably yes, otherwise disagreement

Will something less than superhuman AI pose similar extreme risks? If yes: How much less, how far in advance will we see it coming, when will it come, how easy is it to solve?
Will we need coordination mechanisms in place to prevent dangerous races to the bottom? If yes, how far in advance will we need them?
If it's a low probability of something really catastrophic, how much should we be spending on it now? (Where is the cutoff where we stop worrying about finite versions of Pascal’s Wager?)

What about misuse risks, structural risks, or future moral risks?
Various combinations of these and related arguments result in anything from "we don't need to worry about this at all yet" to "we should be pouring massive amounts of research into this"

Updating a Complex Mental Model - An Applied Election Odds Example

davidmanheim — 2019-11-28T09:29:56.753Z

There are probabilities, and there are probabilities about probabilities. How do these get updated? I've had the same discussion several times, and have tried to describe this, but it is hard without going into the math. The formal model is clear, but I have found that the practical implications are hard to describe concretely. I just ran into a great concrete example, however, and I wanted to work through the logic of how I'm updating as a way to show what should happen.

The example I'm using is my expectations about the 2020 election, how accurate various models are [1], and how important the inputs are. This type of problem is fairly common - I have both an object level prediction about the winner, and a prediction about / model of how accurate different sources of information will be.

So, what do I do when information comes in that seems surprising? Two things: I update in the direction the information indicates, and I update against the reliability of the data. The second may seem counter-intuitive, but the example makes it clearer.

The economy is doing well - recent news is that it's better than expected. Presidents with great economies tend to get re-elected. Trump is also unpopular. Unpopular presidents tend not to get re-elected [2]. How do we balance these two, and how do they interact? My model of whether he will win is fairly uncertain, and my model of the sources of data is also uncertain. They are also related in complex ways [3]. For instance, if Trump's popularity plummets because the impeachment inquiries find something shocking and horrible even to his base, I expect that GDP matters far less for his reelection chances. Other data sources also constrain how far I will update [4] - no level of GDP growth alone will make me say he's certain [5] to win.

So I updated towards Trump's reelection based on the economic data, but my underlying model is telling me that it is decreasingly relevant. That means I'm very slightly down-weighting the importance of economic factors compared to approval rating, since he's seemingly not getting credit for the growth (or the growth isn't helping most voters.) The net impact is that I have updated slightly towards Trump's reelection.
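The two simultaneous updates described above can be illustrated with a toy joint model. This is only a sketch under assumed numbers: the priors, the 0.8 "hit rate" for a reliable source, and the function name `joint_update` are all hypothetical, and the real election model is far more complex. The hypothesis H is "the candidate wins", and R is "this data source is informative". A reliable source gives a pro-H signal with probability 0.8 when H is true and 0.2 when it is false; an unreliable source gives it half the time regardless.

```python
def joint_update(p_h, p_r, hit_rate=0.8):
    """Update beliefs about H and about source reliability R after a pro-H signal.

    All numbers are illustrative assumptions, not estimates from the post.
    """
    # Likelihood of the pro-H signal for each (H, R) combination.
    likelihood = {
        (True, True): hit_rate,        # H true, source reliable
        (False, True): 1 - hit_rate,   # H false, source reliable
        (True, False): 0.5,            # unreliable source: signal is a coin flip
        (False, False): 0.5,
    }
    joint = {}
    for h in (True, False):
        for r in (True, False):
            prior = (p_h if h else 1 - p_h) * (p_r if r else 1 - p_r)
            joint[(h, r)] = prior * likelihood[(h, r)]
    total = sum(joint.values())
    post_h = (joint[(True, True)] + joint[(True, False)]) / total
    post_r = (joint[(True, True)] + joint[(False, True)]) / total
    return post_h, post_r


# A surprising signal: the prior says H is unlikely, but the data points toward H.
post_h, post_r = joint_update(p_h=0.3, p_r=0.7)
print(f"P(H): 0.30 -> {post_h:.3f}")          # moves toward the signal
print(f"P(reliable): 0.70 -> {post_r:.3f}")   # moves against the source
```

With these assumed numbers, the belief in H rises while the belief that the source is reliable falls, because a surprising signal is better explained by noise than an unsurprising one. If the signal had agreed with the prior (say `p_h=0.7`), the same function would raise the estimated reliability instead.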

1) For long term forecasts of presidential elections, forecasts based on fundamentals do just OK. But forecasts based on polls do poorly far in advance of the election as well. (Special elections seem to point to a huge shift towards the Democrats, despite fundamentals.) More complete models take some of each type of information - but how to combine them is tricky. Some models do it poorly, others do it well.

2) I also have expectations about the future inputs to the models. Most presidents have fluctuating approval ratings, so long-term forecasts do poorly. For Trump, his split of approval/disapproval has been remarkably steady, so unless his approval significantly shifts from the current low-40s, or he runs against an incredibly unpopular Democrat (which is possible, but seems pretty unlikely), models that consider this point towards him being unlikely to win. It still may be volatile. For example, the impeachment could solidify his base, or could reduce his popularity further.

3) This is tricky to describe, but for understanding the overall behavior, a useful strategy is to consider the limit - what happens if the economy is amazing, but everyone hates the president? I'd assume he doesn't get reelected. Similarly, if everyone loves the president, but the economy is in a deep recession, (for which he's seemingly not being blamed) he probably gets reelected.

4) Special elections are favoring Democrats, voter turnout among liberals is expected to be very high because of polarization, etc.

5) By which I mean highly confident - certainty is impossible. It would take a confluence of events to make me highly confident. Even with such a confluence of events, however, it is far in advance, so I'm not willing to put odds above ~90% / below ~10% because I think there are fundamentally hard questions about the future that impact the probability. (We don't know who the Democratic nominee is, for instance.)

Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management

davidmanheim — 2019-10-30T10:33:16.240Z

Risk management is difficult, but even when it’s easy, companies and policymakers often do something other than optimal risk mitigation. This isn’t puzzling, once we realize that the incentives in place give the decisionmakers the leeway, or even positive incentives, to behave sub-optimally. There are three types that seem most relevant, along with a few (anonymized) stories from when I was working in reinsurance of how they play out in practice.

Sleeping Pills

Occasionally, a small insurance company would purchase reinsurance for things that didn’t make business sense for their company. I might have seen a home insurance company in Ohio that had fifty million dollars in reserve, and would buy reinsurance for hurricanes that covered all losses greater than ten and less than twenty-five million dollars. Yes, there are hurricanes that impact Ohio, such as Xenia in 1974 and Ike in 2008, but they weren’t large enough to even hit the minimum for this type of policy. Not only that, but the company had money in reserves to cover this incredibly unlikely loss, and buying reinsurance isn't cheap. Let’s say that between brokers, transaction costs, and everything else, it cost them a hundred thousand dollars to cover an expected loss of ten thousand dollars.

Noticing this, I asked why they wanted this policy. My boss told me it was a sleeping pill. He explained that the CEO of the insurance company would get really nervous and unhappy every time a hurricane was approaching the US, and had decided that they didn’t want to worry anymore. That isn’t unreasonable - most people who buy travel insurance to cover their $1,000 vacation could just accept the risk, but they prefer not to worry.

In general, buying risk mitigation isn’t worth the cost in expected value terms, but it can be worthwhile because it buys off the worry. In this case, it’s less innocuous, since the CEO was using company money to buy what amounted to personal sleeping pills. It happens because the CEO won’t ever be blamed for hedging the risk, and it cost the company a hundred thousand dollars per year, which was not enough for shareholders to notice.

Theater Tickets

Once, we received a request from a client to train them in using the terrorism risk model they licensed. That’s not unusual, but they said that they wanted us to first come and install the software, then do a half day training. This seemed weird, since they had been licensing the software for several years, and we assumed they would have already installed it. They hadn’t.

The software licenses weren’t cheap - I don’t remember the client or the amount, but it was easily a six-figure sum every year. Why did they pay if they didn’t even use the model? It was what Bruce Schneier calls “Security Theater,” referring to acts that do nothing to change the risk, but look good. In this case, they wanted to tell their shareholders in their annual report that they had a model for assessing their terrorism risk, and six figures was cheap to be able to put on the show of risk-management and mitigation for their shareholders. In this case, they didn’t even need to put on a show - just waving the tickets they bought was enough. They paid money not to mitigate risk, but simply to make it look like they did.

Staying Out of the Pool

There’s another phenomenon where managers don’t want to take risks that are good for the company but risky for their division - and their careers. The typical version is explained in a 1966 article in Harvard Business Review, "Utility theory, insights into risk taking." They asked managers if they’d take a 50-50 chance on a project that would either make $300,000, or lose $60,000, and the managers mostly said no.

The reason this is a bad decision is that even if the company prefers to avoid large risks, the company should want middle-managers to be taking smaller risks. They want small risks because there are lots of them, and in aggregate - by pooling the risks - the company is far better off having managers take them. For a single bet, the expected return is $120,000, but there’s a 50% chance of not only not coming out even, but actually losing money. If 6 managers took similar (uncorrelated) bets, the expected value is six times as large - it scales linearly - but the probability of losing money in that case is $(1/2)^6$, or under 2%.
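The pooling arithmetic above can be checked directly. The following is a minimal sketch that assumes independent bets with the payoffs from the survey question ($300,000 gain or $60,000 loss at 50-50 odds); the function names are illustrative only.

```python
from math import comb


def expected_value(n, win=300_000, loss=60_000, p=0.5):
    """Expected total payoff of n independent bets; scales linearly in n."""
    return n * (p * win - (1 - p) * loss)


def prob_net_loss(n, win=300_000, loss=60_000, p=0.5):
    """Probability that the pooled result of n independent bets is below zero."""
    total = 0.0
    for wins in range(n + 1):
        payoff = wins * win - (n - wins) * loss
        if payoff < 0:
            # Binomial probability of exactly `wins` successes out of n.
            total += comb(n, wins) * p**wins * (1 - p) ** (n - wins)
    return total


for n in (1, 6):
    print(n, expected_value(n), prob_net_loss(n))
```

With six bets, a single success exactly offsets five failures, so the pool only loses money when all six bets fail, which is where the $(1/2)^6$ figure comes from.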

Reinsurance is actually a place that understands this far better than most. Since they explicitly model the risks of the insurance contracts, and the entire reason insurance works is risk pooling, they are far better equipped to handle the problem. Still, underwriters get paid their bonus based on their performance, one that likely won’t materialize if they are net negative for the year. So even here, you see a hesitation for the individuals to get into the (risk) pool.

Conclusion

None of these are new insights or issues. All of them are simple combinations of a principal-agent issue and risk aversion. Still, in addition to reinforcing the ideas, they are worth thinking about when we have one person or group of people mitigate risks for others. The obvious places are in the corporate world, and in government, and if you look around, you’ll see that these dynamics are all common.

Divergence on Evidence Due to Differing Priors - A Political Case Study

davidmanheim — 2019-09-16T11:01:11.341Z

(This uses a politically charged topic as an example, but I'm hoping that people are willing to try to understand the points made despite that. Politics is Hard Mode, and I'm hoping to stay at a lower difficulty level for now, so I've asked that comments not discuss the object level politics.)

Last week on twitter, I saw two very different takes on how the United States reacted to 9/11, and the consequences. They both reflected an update to people's views based on the data since that time, but the conclusions radically diverged. E.T. Jaynes posited that this can happen, but this is the first time I recognized it in practice so clearly, and thought it was worth noting simply as an example. Beyond that, I wanted to point out how it can be to some extent avoided.

The first was Don Moynihan, who said: "It is important to remember 9/11 and the lives that were lost. Its also important to remember that period of American history fully, to understand that a terrorist attack triggered a series of catastrophic judgments by US politicians that led to the loss of more innocent lives."

The second was David French, who said: "If you had told us on that day that we wouldn't endure another mass-scale attack on American soil for at least 18 more years, we would have thought you were wildly optimistic. The achievement of our military and security establishment should never be underestimated."

First, I want to note that the two views are based on different counterfactuals. Moynihan presumably assumes that the counterfactual rate of terrorist attacks, had the US not gone to war in Afghanistan and Iraq, would have been at least relatively low. He therefore updates based on the fact that there have been very few credible attempts to mount large attacks on the US homeland, to conclude that they would be foiled by standard US intelligence sources and policing. French explicitly calls out the fact that the commonly held prior for the number of attacks that would be mounted in the wake of 9/11 was high, and asserts that this was correct but for the military interventions the US waged. He therefore updates based on the fact that there have been very few credible attempts to mount large attacks on the US homeland, to conclude that the military interventions were successful.
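One way to see why the shared observation does not pull the two views together is a two-model Bayes calculation. This is a sketch with made-up numbers: model A stands for "the threat would have been low anyway", model B for "the threat was high but suppressed by the wars", and the likelihoods and priors below are purely hypothetical. Both models predict the observed lack of mass-scale attacks, so the likelihood ratio is close to 1.

```python
def posterior(prior_a, like_a, like_b):
    """P(model A | evidence) by Bayes' rule, with A and B as the only models."""
    numerator = prior_a * like_a
    return numerator / (numerator + (1 - prior_a) * like_b)


# Hypothetical likelihoods of "no mass-scale attack for 18 years" under each model.
LIKE_LOW_THREAT = 0.90   # model A: threat was low regardless
LIKE_SUPPRESSED = 0.85   # model B: threat was high, wars suppressed it

# Two observers who start with opposite (assumed) priors on model A.
for label, prior_a in [("favors model A", 0.8), ("favors model B", 0.2)]:
    post = posterior(prior_a, LIKE_LOW_THREAT, LIKE_SUPPRESSED)
    print(f"{label}: P(A) {prior_a:.2f} -> {post:.3f}")
```

Under these assumptions each observer's belief barely moves, so each can honestly report that the data is consistent with the model they already held. Evidence only separates the models where their predictions differ.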

Clearly, it is not the case that either person is ignoring the evidence. In this case, there are different reasons to update towards each of the models; the lack of credible attack attempts in the US contrasts with the large number in Iraq, and it's plausible that without the US wars abroad, some of that effort would have been directed at the US. On the other hand, law enforcement was very successful in detecting and stopping attacks, so it's plausible that few would have gotten through anyways. But since we can't see what would have happened had the US not gone to war (i.e. counterfactual realities are unobservable), we may be tempted to conclude that evidence is useless in the face of different prior beliefs. This isn't quite true.

If Moynihan and French had been asked in detail in 2001 what they expected in the case that the US would or would not go to war, they would be forced to confront the ways in which their predictions failed. Perhaps their conclusions would be different - but most people don't routinely make quantifiable predictions. Their stated models are at best capable of being twisted. If people want to believe their model and not change their mind, the invisible dragon in the garage can be post-hoc determined to be permeable to flour once an annoying rationalist proposes a test of the theory. Beyond that, given the flexibility that language offers, people often specify models of the world that don't require post-hoc adjustment, just defensible clarifications. So unless we're incredibly detailed in the predictions we request of people, the ability to use data to reinforce rather than revise beliefs can't be stopped.

Ideally, we'd have the ability to build a correct model, but we can't - certainly not in the space of this post, near-certainly not in a couple years of research into international relations theory, and plausibly not at all due to the paucity of evidence and the number of uncertain variables involved.

The better approach, I think, is to consider the outside view about the models. We have two different models that are espoused by people with differing political viewpoints. Each of the models reflects a combination of motivated reasoning, selective blindness, and actual attempts to understand the world, and we're stuck uncertain which is less wrong.

But what we absolutely shouldn't do - and without explicitly trying not to, likely would do - is notice the model that we'd prefer and (perhaps subconsciously) preferentially interpret evidence as supporting it and disproving the alternatives. Especially here, where both models are simplified and wrong in many ways, my advice is to try to reason under model-uncertainty, instead of trying to reason the way we are naturally inclined to, by picking sides in a fight. Absent further plausible arguments and evidence - which exist, but themselves need to be evaluated very carefully for the same reasons - we should look at the models as both plausible.

Hackable Rewards as a Safety Valve?

davidmanheim — 2019-09-10T10:33:40.238Z

Reading DeepMind's latest research and accompanying blogpost, I wanted to highlight an under-appreciated aspect of safety. As a bit of background, Carlos Perez points out Joscha Bach's "Lebowski theorem," which states that "no superintelligent AI is going to bother with a task that is harder than hacking its reward function." Given that, I see a potential perverse effect of some types of alignment research - especially research into embedded agency and robust alignment which makes AI uninterested in reward tampering. (Epistemic Status: my confidence in the argument is moderate, and I am more confident in the earlier claims.)

In general, unsafe AI is far more likely to tamper with its reward function than to find more distant (and arguably more problematic) ways to tamper with the world to maximize its objective. (epistemic status: fairly high confidence) Once an AI is smart enough to spend its time reward hacking, then wasting time on developing greater intelligence is unneeded. For that reason, this theorem seems likely to function as at least a mild safety valve. It's only if we close this valve too tightly that we would plausibly see ML that reached human-level intelligence. At that point, of course, we should expect that the AI will begin to munchkin the system, just as a moderately clever human would. And anti-munchkin-ing is a narrow instance of security more generally.

Security generally is like cryptography narrowly in an important sense; it's easy to build a system that you yourself can't break, but very challenging to build one that others cannot exploit. (Epistemic status: more speculative) This means that even if our best efforts go towards safety, an AI seems very unlikely to need more than "mild" superintelligence to break it - unless it's been so well aligned that it doesn't want to hack its objective function.

This logic implies (Epistemic status: most speculative, still with some confidence) that moderate progress in AI safety is potentially far more dangerous than very little progress - and raises critical questions of how close to this unsafe uncanny valley we currently are, and how wide the valley is.

What Programming Language Characteristics Would Allow Provably Safe AI?

davidmanheim — 2019-08-28T10:46:32.643Z

It seems clear that many high-level programming languages are candidates for use in the first AGI. They have enough power to write that code. It seems clear, however, that the power that those languages have is incompatible with formal safety. SPARK or OCaml are made so that it is easy to prove correctness, which seems useful, but that's not enough.

For example, we might need memory safety to provide a formal guarantee that the program cannot directly modify the part of memory containing the reward function, or the calculation of the reward. On the other hand, it seems that Turing incompleteness, which allows guaranteeing that a program terminates, would not be necessary.

So - what other (extant or yet-to-be defined) types of language safety will be needed from a language to prevent a hypothetically provably safe AI from being unsafe in practice?

Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4)

davidmanheim — 2019-08-12T08:07:01.769Z

In the previous posts, I first outlined Selection versus Control for Optimization, then talked about What Optimization Means, and how we quantify it, then applied these ideas a bit to ground the discussion.

After doing so, I reached a point where I think that there is something useful to say about mesa-optimizers. This isn't yet completely clear to me, and it seems likely that at some point in the (hopefully near) future, someone will build a much clearer conceptual understanding, and I'll replace this with a link to that discussion. Until then, I want to talk about how mesa-optimizers are control systems built within selection systems, and why that poses significant additional concern for alignment.

Mesa-Optimizers

I claimed in the previous post that Mesa-optimizers are always control systems. The base optimization selects for a mesa-optimizer, usually via side-effect-free or low-cost sampling and/or simulation, then creates a system that does further optimization as an output. In some cases, the further optimization is direct optimization in the terms laid out in my earlier post, but in others it is control.

Direct Mesa-Optimizers

In many cases, the selection system finds optimal parameters or something similar for a direct optimization system. The earlier example of building a guidance system for a rocket is an obvious class of example where selection leads to a direct optimizer. This isn't the only way that optimizers can interact, though.

I'd say that MCTS + Deep learning is an important example of this mix which has been compared to "Thinking Fast and Slow" (pdf, NIPS paper). In chess, for example, the thinking fast is the heuristic search to choose where to explore, which is based on a selection system, and the exploration is MCTS, which in this context I'm calling direct optimization. (This is despite the fact that it's obviously probabilistic, so in some respects looks like selection, because while the selection IS choosing points in the space, it isn't doing evaluation, but rather deterministic play-forward of alternative scenarios. The evaluation is being done by the heuristic evaluation system.) In that specific scenario, any misalignment is almost entirely a selection system issue - unless there was some actual mistake in the implementation of the rules of chess.

This opens up a significant concern for causal Goodhart; regime change will plausibly have particularly nasty effects. This is because the directly optimizing mesa-optimizer isn't at all able to consider whether the parameters selected by the base-optimizer should be reconsidered. And this is far worse for "true" control mesa-optimizers.

Control Mesa-optimizers

Before we talk more about failure modes, it's important to clarify two levels of failure; the base-optimizer can fail to achieve its goals because it designs a misaligned mesa-optimizer, and the mesa-optimizer itself can fail to achieve its goals. The two failure modes are different, because we don't have any reason to assume a-priori that our mesa-optimizer shares goals with our base optimizer.

Just like humans are adaptation-executers, mesa-optimizers are mesa-optimizers, not simply optimizers. If their adaptations are instead tested against the true goal, or at least the base-optimizer's goal, then evaluated on that basis, they aren't mesa-optimizers, they are trials for the selection optimizer. Note that these two cases aren't necessarily incompatible; in something like Google's federated learning model, the shared model is updated based on the control system's data. So self-driving cars may be mesa-optimizers using a neural net, and the data gathered is later used to update the base-optimizer model, which is then given back to the agents. The two parts of the system can therefore suffer different types of failures, but at least the post-hoc updating seems to plausibly reduce misalignment of the mesa-optimizer.

But in cases where side-effects occur, so that the control mesa-optimizer imposes externalities on the system, the mesa-optimizer typically won't share goals with the base optimizer! This is because if it shares goals exactly, the base optimization doesn't need to build a mesa-optimizer, it can run tests without ceding direct control. (Our self-driving cars in the previous paragraph are similar to this, but their goal is the trained network's implicit goal, not the training objective used for the update. In such a federated learning model, it can fail and the trial can be used for better learning the model.) On the other hand, if no side-effects occur and goals are shared, the difference is irrelevant; the mesa-optimizer can fail costlessly and start over.

Failure of Mesa-optimizers as Principal-Agent-type "Agents"

There is a clear connection to principal-agent problems. (Unfortunately, the term agent is ambiguous, since we also want to discuss embedded agents. In this section, that's not what is being discussed.) Mesa-optimizers can succeed at their goals but fail to be reliable agents, or they can fail even at their own goals. I'm unsure about this, but it seems each of these cases should be considered separately. Earlier, I noted that some Goodhart failures are model failures. With mesa-optimizers involved, there are two sets of models - the base optimizer model, and the mesa-optimizer model.

Principal optimization failures occur either if the mesa-optimizer itself falls prey to a Goodhart failure due to shared failures in the model, or if the mesa-optimizer model or goals are different than the principal's in ways that allow the metrics not to align with the principal's goals. (Abrams correctly noted in an earlier comment that this is misalignment. I'm not sure, but it seems this is principally a terminology issue.)

This second form allows a new set of failures. I'm even less sure about this next, but I'll suggest that we can usefully categorize the second class into 3 cases; mesa-superoptimizers, mesa-suboptimizers, and mesa-transoptimizers. The first, mesa-superoptimizers, is where the mesa-optimizer is able to find clever ways to get around the (less intelligent) base optimizer's model. This allows all of the classic Goodhart's law-type failures, but they occur between the mesa-optimizer and the base-optimizer, rather than between the human controller and the optimizer. This case includes the classic runaway superintelligence problems. The second, mesa-suboptimizers, is where the mesa-optimizer uses a simpler model than the base optimizer, and hits a Goodhart failure that the base-optimizer could have avoided. (Let's say, for example, that it uses a correlational model holding certain factors known by the base optimizer to influence the system constant, and for whatever reason the mesa-optimizer enters a regime-change region, where those factors change in ways that the base-model understands.) Lastly, there are mesa-transoptimizers, where typical human types of principal-agent failures can occur because the mesa-optimizer has different goals. The other way this occurs is if the mesa-optimizer has access to or builds a different model than the base-optimizer. This is a bit different than mesa-superoptimizers, and it seems likely that there are a variety of cases in this last category. I'd suggest that it may be more like multi-agent failures than it is like a traditional superintelligence alignment problem.

On to Embedded Agents

I need to think further about the above, and should probably return to it, but for now I plan to make the (eventual) next post about embedded agency in this context. Backing up from the discussion of mesa-optimizers, a key challenge for building safe optimizers in general is that control often involves embedded agent issues, where the model must be smaller than the system. In particular, in the case of mesa-optimizers, the base-optimizer needs to think of itself as an embedded agent whose model needs to include the mesa-optimizer's behavior, which is being chosen by the base-optimizer. This isn't quite embedded agency, but it requires the base optimizer to be "larger" than the mesa-optimizer, only allowing mesa-suboptimizers, which is unlikely to be guaranteed in general.

Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3)

davidmanheim — 2019-07-28T09:32:25.878Z

Clarifying Thoughts on Optimizing and Goodhart Effects - Part 3

Previous Posts: Re-introducing Selection vs Control for Optimization, What does Optimization Mean, Again?

Following the previous two posts, I'm going to try to first lay out the way Goodhart's Law applies in the earlier example of rockets, then try to explain why this differs between selection and control. (Note: Adversarial Goodhart isn't explored, because we want to keep the setting sufficiently simple.) This sets up the next post, which will discuss Mesa-Optimizers.

Revisiting Selection vs. Control Systems

Basically everything in the earlier post that used the example process of rocket design and launching is susceptible to some form of overoptimization, in different ways. Interestingly, there seem to be clear places where different types of overoptimization are important. Before looking at this, I want to revisit the selection-control dichotomy from a new angle.

In a (pure) control system, we cannot sample datapoints without navigating to them. If the agent is an embedded agent, and has sufficient span of control to cause changes in the environment, we cannot necessarily reset and try over. In a selection system, we only sample points in ways that do not affect the larger system. Even when designing a rocket, our very expensive testing has approximately no longer term effects. (We'll leave space debris from failures aside, but get back to it below.)

This explains why we potentially care about control systems more than selection systems. It also points to why Oracles are supposed to be safer than other AIs - they can't directly impact anything, so their output is done in a pure selection framework. Of course, if they are sufficiently powerful, and are relied on, the changes made become irreversible, which is why Oracles are not a clear solution to AI safety.

Goodhart in Selection vs. Control Systems

Regressional and Extremal Goodhart are particularly pernicious for selection, and potentially less worrying for control. Regressional Goodhart is always present if we are insufficiently aware of our goals, but in general Causal Goodhart failures seem more critical in control, because it is often narrower. To keep this concrete, I'll go through the classes of failure, and note how they could occur at each stage of rocket design. To do so, we need to clarify goals at each stage. Our goal in stage 1 is to find a class of designs and paths to optimize. In stage 2, we build, test, and refine a system. In many ways, this stage is intended to circumvent Goodhart failures, but testing does not always address extremal cases, so our design may still fail.

Regressional Goodhart hits us if we have any divergence between our metric and our actual goal. For example, in stages 1 and 2, finding an ideal but complex or chaotic path that depends on the exact positions of planets in a multibody system would be bad, and a path involving excessive G-forces or other dangers might be more fuel efficient than a simpler path. For example, a gravitational slingshot around the sun might be cheap, but fry or crush the astronauts. Alternatively, a design with a shape that does not allow people to fit inside might be found when optimizing. Each of these impacts goals potentially not included in the model. Regressional Goodhart is less common in control for this case, since we kept the mesa-optimizer limited to optimizing a very narrow goal already chosen by the design-optimization.
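A minimal sketch of the regressional case, under assumptions that are not from the post: the true goal and the measurement noise are both standard normal, and "optimizing" just means picking the candidate with the highest metric. The function names and parameters are hypothetical. It shows that the harder we select on the metric, the more the winner's metric overstates its true value.

```python
import random


def pick_best_by_proxy(n_candidates, noise_sd, rng):
    """Return (proxy, true value) for the candidate with the highest proxy score."""
    candidates = []
    for _ in range(n_candidates):
        true_value = rng.gauss(0.0, 1.0)
        proxy = true_value + rng.gauss(0.0, noise_sd)  # metric = goal + noise
        candidates.append((proxy, true_value))
    return max(candidates)  # tuples compare on the proxy first


def average_shortfall(n_candidates, noise_sd=1.0, trials=2000, seed=0):
    """Average of (proxy - true value) for the selected candidate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        proxy, true_value = pick_best_by_proxy(n_candidates, noise_sd, rng)
        total += proxy - true_value
    return total / trials


for n in (1, 10, 100):
    print(f"best of {n:3d} by proxy: overstated by {average_shortfall(n):.2f} on average")
```

With a single candidate there is no selection and the gap averages out to roughly zero; the gap only appears once we choose based on the metric.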

Extremal Goodhart is always a model failure. It can be because the model is insufficiently accurate (Model Insufficiency), or because there is a regime change. Regime changes seem particularly challenging in systems that design mesa-optimizers, since I think the mesa-optimization is narrower in some way than the global optimizer (if not, it's more efficient to have an executing system rather than a mesa-optimizer).
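As a hedged illustration of an extremal failure, the sketch below fits a straight-line proxy model to a handful of tested rocket weights, then optimizes that proxy over a wider range. The ground-truth function, the failure threshold of 40, and all numbers are invented for the example; the only point is that the proxy's optimum lands in a region where the model was never checked.

```python
def true_performance(weight):
    # Hypothetical ground truth: lighter is better, until the structure fails.
    if weight < 40.0:
        return 0.0  # regime change: below this weight the rocket breaks up
    return 200.0 - weight


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# The proxy model is fitted only on weights that were actually tested.
tested = [60.0, 70.0, 80.0, 90.0, 100.0]
slope, intercept = fit_line(tested, [true_performance(w) for w in tested])


def proxy_performance(weight):
    return slope * weight + intercept


# Optimizing the proxy over a wider range pushes to the untested extreme.
candidates = [20.0 + 5.0 * i for i in range(17)]  # 20, 25, ..., 100
chosen = max(candidates, key=proxy_performance)
print(f"chosen weight {chosen}: proxy predicts {proxy_performance(chosen):.1f}, "
      f"true performance is {true_performance(chosen):.1f}")
```

Restricting the search to the tested range would have picked a weight of 60, which the proxy and the ground truth agree on.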

Causal Goodhart is by default about an irreversible change. In selection systems, it means that our sampling accidentally broke the distribution. For example, we test many rockets, creating enough space debris to make further tests vulnerable to collisions. We wanted the tests to sample from the space, but we accidentally changed the regime while sampling.
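The debris example can be sketched with a toy model. Everything here is an assumption made for illustration: each earlier test is taken to add a fixed collision risk, independent of the design's own failure rate, and the function name and parameters are hypothetical.

```python
def simulate_campaign(n_tests, design_failure=0.05, risk_per_prior_test=0.002):
    """Return the chance that each successive test fails, in order.

    Hypothetical model: every earlier test leaves debris that adds a fixed
    collision risk, independent of the design's own failure rate.
    """
    failure_chances = []
    for prior_tests in range(n_tests):
        collision = min(1.0, risk_per_prior_test * prior_tests)
        failure_chances.append(1.0 - (1.0 - design_failure) * (1.0 - collision))
    return failure_chances


chances = simulate_campaign(200)
print(f"first test fails with probability {chances[0]:.3f}")
print(f"last test fails with probability {chances[-1]:.3f}")
```

The design never changes, yet later tests sample from a different distribution than earlier ones, because the act of testing altered the environment.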

In the current discussion, we care about metric-goal divergence because the cost of the divergence is high - typically, once we get there, some irreversible consequence happens, as explained above. This isn't exclusively true of control systems, as the causal Goodhart example shows, but it's clearly more common in such systems. Once we're actually navigating and controlling the system, we don't have any way to reset to base conditions, and causal changes create regime changes - and if these are unexpected, the control system is suddenly in a position of optimizing using an irrelevant model.

And this is a critical fact, because as I'll argue in the next post, mesa-optimizers are control systems of a specific type, and have some new overoptimization failure modes because of that.

What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2)

davidmanheim — 2019-07-28T09:30:29.792Z

Clarifying Thoughts on Optimizing and Goodhart Effects - Part 2

Previous Post: Re-introducing Selection vs Control for Optimization

In that post, I reviewed Abram's selection/control distinction, and suggested how it relates to actual design. I then argued that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology, direct solution.

Here, I will revisit the question of what optimization means.

NOTE: This is not completely new content, and is instead split off from the previous version and rewritten to include an (Added) discussion of Eliezer's definition for measuring optimization power, from 2008. Hopefully this will make the sequence clearer for future readers.

In the next post, Applying over-Optimization in Selection and Control, I apply these ideas, and concretize the discussion a bit more before moving on to discussing Mesa-Optimizers in Part 4.

What does Optimization Mean, Again?

This question has been discussed a bit, but I still don't think it's clear. So I want to start by revisiting a post Eliezer wrote in 2008, where he suggested that optimization power was the ability to select states from a preference ordering over different states, and could be measured with entropy. He notes that this is not computable, but gives us insight. I agree, except that I think that the notion of the state space is difficult, for some of the reasons Scott discussed when he mentioned that he was confused about the relationship between gradient descent and Goodhart's law. In doing so, Scott proposed a naive model that looks very similar to Eliezer's:

"...simple proxy of "sample points until I get one with a large U value" or "sample n points, and [select] the one with the largest U value" when I think about what it means to optimize something for U. I might even say something like "$n$ bits of optimization" to refer to sampling $2^n$ points. I think this is not a very good proxy for what most forms of optimization look like."

I want to start by noting that this is absolutely and completely a "selection" type of optimization, in Abram's terms. As Scott noted, however, it's not a good model for what most optimization looks like, and that's part of why I think Eliezer's model is less helpful than I did when I originally read it.
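To make the naive model concrete, here is a minimal sketch of "sample points and select the best" together with one way of measuring the result in bits: minus the base-2 log of the share of random states that are at least as good as the one selected. The utility function, the uniform state space on [0, 1], and the Monte Carlo estimate are all assumptions made for the example, not anything taken from Eliezer's or Scott's posts.

```python
import math
import random


def utility(x):
    # Hypothetical utility U with its best value at x = 0.7.
    return -(x - 0.7) ** 2


def sample_point(rng):
    # Hypothetical state space: points drawn uniformly from [0, 1].
    return rng.uniform(0.0, 1.0)


def best_of_n(n, rng):
    """Sample n points and select the one with the largest U value."""
    return max((sample_point(rng) for _ in range(n)), key=utility)


def measured_bits(x, rng, trials=20_000):
    """Estimate -log2 of the share of random points at least as good as x."""
    hits = sum(utility(sample_point(rng)) >= utility(x) for _ in range(trials))
    # Add-one smoothing keeps the estimate finite when no random point matches.
    return -math.log2((hits + 1) / (trials + 1))


rng = random.Random(0)
for bits in (2, 6, 10):
    winner = best_of_n(2 ** bits, rng)
    print(f"{bits} bits requested: x = {winner:.4f}, "
          f"roughly {measured_bits(winner, rng):.1f} bits measured")
```

Note that the measurement only works here because the state space is small and easy to sample, which is exactly the part that becomes difficult in realistic settings.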

There's a much better model for gradient descent optimization, which is... gradient descent. It is a bit closer to control than direct optimization, since in some sense we're navigating through the space, but for almost all actual applications, it is still selection, not control. To review how it works, points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to select a new point (perhaps a very cleverly, dynamically chosen one). A stopping criterion is checked, and the process iterates from that new point. This is almost always tons more efficient than generating random points and examining them.
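The loop described above can be sketched in a few lines. This is a minimal one-dimensional version with a fixed step size; the quadratic loss, the starting point, and the tolerance are arbitrary choices for illustration, and real optimizers pick the next point far more cleverly.

```python
def gradient_descent(grad, start, step=0.1, tol=1e-8, max_iters=10_000):
    """Follow the negative gradient until the update is smaller than tol."""
    x = start
    for iteration in range(max_iters):
        new_x = x - step * grad(x)  # use the gradient to select the next point
        if abs(new_x - x) < tol:    # stopping criterion
            return new_x, iteration + 1
        x = new_x
    return x, max_iters


def loss_gradient(x):
    # Gradient of the hypothetical loss (x - 3)^2.
    return 2.0 * (x - 3.0)


minimum, evaluations = gradient_descent(loss_gradient, start=-10.0)
print(f"minimum near x = {minimum:.6f} after {evaluations} gradient evaluations")
```

Every point visited is chosen using information from the previous one, which is what makes this feel like navigation even though nothing outside the computation is affected.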

(Added) It's usually far better than a grid search for most landscapes, but it also makes clear why I think it's hard to discuss optimization power in Eliezer's terms on a practical level, at least when dealing with a continuous system. The problem I'm alluding to is that any list of preferences over states depends on the number of states. Gradient descent type optimization is really good at focusing on specific sections of the state space, especially compared to grid search. We might find a state where grid search would require a tremendously high resolution, but we don't ever compute a preference ordering over $2^n$ states. With gradient descent, we instead compute preferences for a local area and (hopefully) zoom in, potentially ignoring other parts of the space. An optimizer that focuses very narrowly can have high resolution but miss the non-adjacent region with far better outcomes, or can have fairly low resolution but perform far better - and the second optimizer is clearly more powerful, but I don't know how to capture this.
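The resolution point can be illustrated with a toy landscape. In the sketch below, which uses a made-up two-peak utility and a simple step-halving hill climb standing in for a local optimizer, the narrow search reaches very fine resolution around a mediocre peak, while an 11-point grid finds the far better region.

```python
import math


def utility(x):
    # Hypothetical landscape: a modest peak at x = 2, a far better one at x = 8.
    return math.exp(-(x - 2.0) ** 2) + 3.0 * math.exp(-2.0 * (x - 8.0) ** 2)


def local_search(start, step=0.5, min_step=1e-6):
    """Climb from start, halving the step when neither neighbor improves."""
    x = start
    while step > min_step:
        # Listing x first means ties keep the current point.
        best = max((x, x - step, x + step), key=utility)
        if best == x:
            step /= 2.0
        x = best
    return x, step


def grid_search(low, high, points):
    """Evaluate evenly spaced points; return the best one and the spacing."""
    spacing = (high - low) / (points - 1)
    grid = [low + i * spacing for i in range(points)]
    return max(grid, key=utility), spacing


narrow_x, narrow_resolution = local_search(start=1.0)
coarse_x, coarse_resolution = grid_search(0.0, 10.0, points=11)
print(f"local search: x = {narrow_x:.4f}, U = {utility(narrow_x):.3f}, "
      f"resolution {narrow_resolution:.0e}")
print(f"coarse grid:  x = {coarse_x:.4f}, U = {utility(coarse_x):.3f}, "
      f"resolution {coarse_resolution:.0e}")
```

Neither the number of points examined nor the resolution reached tells us which search did better, which is one way of seeing why a state-counting measure is awkward here.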

But to return to the main discussion, the process of gradient descent is also somewhere between selection and control - and that's what I want to explain.

In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket, watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging performance of parameters to see if increasing or decreasing those that are amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)
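The "build many similar rockets, each an epsilon away" approach is a finite-difference gradient estimate. A minimal sketch, assuming a made-up two-parameter performance function in place of real launches (the parameter names and the function are hypothetical):

```python
def estimate_gradient(performance, design, epsilon=1e-4):
    """Central differences: two extra 'builds' per design dimension."""
    gradient = []
    for i in range(len(design)):
        up = list(design)
        up[i] += epsilon
        down = list(design)
        down[i] -= epsilon
        gradient.append((performance(up) - performance(down)) / (2 * epsilon))
    return gradient


def performance(design):
    # Hypothetical stand-in for building and launching a rocket.
    fuel, nozzle = design
    return -(fuel - 4.0) ** 2 - 2.0 * (nozzle - 1.5) ** 2


gradient = estimate_gradient(performance, [3.0, 1.0])
print("estimated gradient:", [round(g, 4) for g in gradient])
```

Each gradient estimate costs two evaluations per dimension, which is cheap in simulation and ruinous if every evaluation is a physical rocket.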

In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for causal failures (undetectable during simulation), if the proxy model gets a causal effect wrong.

Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.)

For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.

Extreme proxy values (in comparison to most of the space) seem similar to having lots of selection pressure. If we have an insanely tall and narrow peak, we may be finding something strange rather than simply improving.
Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, and perhaps the inefficiency of your search, but not overoptimization. (This is in contrast to the naive grid-search model, where lots of points visited means more optimizing.)

As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally breaking the cup against the wall.

Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.

This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much; they simply execute an adaptive program.

One critical case of a control system optimizing is a mesa-optimizer, but that will be deferred until after the next post, which introduces some examples and intuitions around how Goodhart-failures occur in selection versus control systems.

Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1)

davidmanheim — 2019-07-02T15:36:51.071Z

This is the first post in a small sequence I'm writing on "Optimizing and Goodhart Effects - Clarifying Thoughts" (I have re-organized to make part 2, "Revisiting What Optimization Means" separate.)

Next Posts: Revisiting What Optimization Means with Selection vs. Control, then Applying Overoptimization to Selection vs. Control

Introduction

Goodhart's law comes in a few flavors, as originally pointed out by Scott, and formalized a bit more in our joint paper. When discussing that paper, or afterwards, we struggled with something Abram Demski clarified recently, which is the difference between selection and control. This matters for formalizing what happens, especially when asking about how Goodhart occurs in specific types of optimizers, as Scott asked recently.

Epistemic Status: This is for de-confusing myself, and has been helpful. I'm presenting what I am fairly confident I understand well for the content written so far, but I'm unclear about usefulness for others, or how clear it comes across. I think that there's more to say after this post, and this will have a few more parts if people are interested. (I spent a month getting to this point, and decided to post and get feedback rather than finish a book first.)

In the first half of the post, I'll review Abram's selection/control distinction, and suggest how it relates to actual design. I'll also argue that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology, direct solution. The second section will revisit what optimization means, and try to note a few different things that could happen and go wrong with Goodhart-like overoptimization.

The third section will talk about Goodhart in this context using the new understanding - trying to more fully explain why Goodhart effects in selection and control fundamentally differ. After this, Part 4 will revisit Mesa-optimizers, and .

Thoughts on how selection and control are used in tandem

In this section, I'll discuss the two types of optimizers Abram discussed, selection and control, and introduce a third, simpler optimizer, direct solution. I'm also going to mention where embedded agents are different, because that's closely related to selection versus control, and talk about where mesa-optimizers exist.

Starting with the (heavily overused) example of rockets, I want to revisit Abram's categorization of algorithmic optimization versus control. There are several stages involved with getting rockets to go where we want. The first is to design the rocket, which involves optimization, and which I'll discuss in two stages; the second is to test, which involves optimization and control in tandem; and the third is to actually guide the rocket we built in flight, which is purely control.

Initially, designing a rocket is pure optimization. We might start by building simplified mathematical models to figure out the basic design constraints - if a rocket is bringing people to the moon, we may decide the goal is a rocket and a lander, rather than a single composite. We may decide that certain classes of trajectory / flight paths are going to be used. This is all a set of mathematical exercises, and probably involves only multiply differentiable models that can be directly solved to find an optimum. This is in many ways a third category of "optimizing," in Abram's model, because there is not even a need for looking over the search space. I'll call this direct solution, since we just pick the optimum based on the setup.
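As a minimal sketch of direct solution, assume the simplified model is a convex quadratic cost in a single design variable (the coefficients below are made up). Setting the derivative to zero gives the optimum in closed form, with no sampling or search at all.

```python
def solve_quadratic_optimum(a, b, c):
    """Minimizer of a*x**2 + b*x + c for a > 0, from setting 2*a*x + b = 0."""
    if a <= 0:
        raise ValueError("the cost must be convex (a > 0) to have a minimum")
    x_star = -b / (2.0 * a)
    return x_star, a * x_star ** 2 + b * x_star + c


# Hypothetical cost model: cost(x) = 2x^2 - 12x + 30.
best_x, best_cost = solve_quadratic_optimum(2.0, -12.0, 30.0)
print(f"optimum at x = {best_x}, cost = {best_cost}")
```

The contrast with the later stages is that no point in the search space is ever evaluated except the answer itself.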

After getting a bit closer to actual design, we need to simulate rocket designs and paths, and optimize the simulated solution. This lets you do clever things like build a rocket with a sufficient but not excessive amount of fuel (hopefully with a margin of error). If we're smart, we optimize with several intended uses and variable factors in mind, to make sure our design is sufficiently robust. (If we're not careful enough to include all relevant factors, we ignore some factor that turns out to matter, like the relationship between temperature of the O-rings and their brittleness, and our design fails in those conditions.) This is all optimizing over a search space. The cost of the search is still comparatively low, though not as low as direct solution, and we may use gradient descent, genetic algorithms, simulated annealing, or other strategies. The commonality between these solutions is that they simulate points in the search space, perhaps along with the gradients at those points.

After we settle on a design, we build an actual rocket, and then we test it. This moves back and forth between the very high cost approach of building physical objects and testing them - often to destruction - and simulation. After each test, we probably re-run the simulation to make sure any modifications are still near the optimum we found, or we refine the simulations to re-optimize and pick the next design to build.

Lastly, we build a final design, and launch the rocket. The control system is certainly a mesa-optimizer with regard to the rocket design process. For a rocket, this control is closer to direct optimization than simulation, because the cost of evaluation needs to be low enough for real-time control. The mesa-optimizer would, in this case, use simplified physics to fire the main and guidance rockets to stay near the pre-chosen path. It's probably not allowed to pick a new path - it can't decide that the better solution is to orbit twice instead of once before landing. (Humans may decide this, then hand the mesa-optimizer new parameters.) We tightly constrain the mesa-optimizer, since in a certain sense it's dumber than the design optimizer that chose what to optimize for.

For a more complex system, we may need a complex mesa-optimizer to guide the already designed system. Even for a more complex rocket, we may allow the mesa-optimizer to modify the model used for optimizing, at least in minor ways - it may dynamically evaluate factors like the rocket efficiency, and decide that it's getting 98% of the expected thrust, so it will plan to use that modified parameter in the system model used to mesa-optimize. Giving a mesa-optimizer more control is dangerous, but perhaps necessary to allow it to navigate a complex system.
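A toy version of such a controller, as a sketch only: it tracks a pre-chosen velocity plan in one dimension, and optionally updates its estimate of thrust efficiency from what it observes. The dynamics, the plan, and the update rule are all invented for illustration; the controller can adjust one model parameter but has no way to choose a different plan.

```python
def fly(planned_velocities, true_efficiency, adapt, dt=1.0):
    """Track a pre-chosen velocity plan; return the final tracking error."""
    velocity = 0.0
    estimated_efficiency = 1.0  # what the design-time model assumed
    for target in planned_velocities:
        needed_accel = (target - velocity) / dt
        commanded = needed_accel / estimated_efficiency
        achieved = commanded * true_efficiency  # the real engine underdelivers
        velocity += achieved * dt
        if adapt and commanded != 0.0:
            # Update the one model parameter the controller is allowed to change.
            estimated_efficiency = achieved / commanded
    return abs(planned_velocities[-1] - velocity)


plan = [10.0 * (i + 1) for i in range(10)]  # hypothetical pre-chosen path
print(f"fixed model error:    {fly(plan, 0.98, adapt=False):.4f}")
print(f"adaptive model error: {fly(plan, 0.98, adapt=True):.4f}")
```

With the fixed model the controller lags the plan at every step; once it is allowed to revise the efficiency parameter, the lag disappears after the first observation.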

Now that we've deconfused why optimization is split between selection and control, I can introduce part 2: What does optimization mean?

Schelling Fences versus Marginal Thinking

davidmanheim — 2019-05-22T10:22:32.213Z

Follow-up / Related to: Scott Alexander's Schelling Fences on Slippery Slopes, Sunk Cost Fallacy, Gwern's Are Sunk Costs Fallacies?, and Unenumerated's Proxy Measures, Sunk Costs, and Chesterton's Fence

I was recently reading an essay by Clayton Christensen, in the (fairly worthwhile) HBR's "Must Reads" boxed set, where he recommends that people "Avoid the Marginal Cost Mistake". In short, he suggests that Schelling Fences are sometimes ignored, or not constructed, because of a somewhat fallacious application of marginal-cost thinking. For example, my Schelling fence for work is that I stop when it is time to get my kids. The other side is that occasionally I'm in the middle of something - coding, or writing this lesswrong post - where being interrupted is fairly high cost. I can usually ask someone else to pick them up instead, and given how much I see them, the marginal value of time with my kids is low.

Christensen suggests that this analysis is incorrect, largely because of myopia. I am ignoring the longer term benefits of family dinners because the connection between coming home today and building the norm of being home for dinner every night is a longer-term investment. The future is full of extenuating circumstances, and only a fairly strong Schelling fence will let me insist that my kids stay home for dinner once they are teenagers.

I'd apply it more broadly, but his point was that this is especially critical in matters of morality. Cheating once changes everything. The simple fact that you cheated weakens your resolve not to in the future. The spiral created by a single action leads easily down a path towards using infinite money and invulnerability cheat codes, with no further challenge or enjoyment from playing the video game - or in the context he's discussing, it led to jail time for two of the people from his graduating class back in college.

Conclusions?

The critical question is: where do we want to use marginal cost analysis, and where do we want to stick to our sunk-costs and Schelling fences?

Based on Christensen's analysis, I would suggest that Schelling fences rather than sunk costs are particularly valuable for reinforcing values that are hard to measure, are too long term to get routine feedback on, or that involve specific commitments to other people. On the other hand, based on Gwern's work, I think there are places where marginal costs are under-appreciated, especially in relation to other people. Below, I lay out some settings and examples on each side.

Some examples of where to consider reinforcing fences and avoiding simplistic marginal cost thinking might include:

Going to a weekly meet-up that reinforces your connections to a good epistemic community and/or effective altruist values. Value drift is a long-term concern that needs short term reinforcement.
Anything involving family or long-term relationships. Marginal cost thinking is poisonous for relationships, since the benefits of investing in the relationship are not very visible, and long term.
Moral rules. Utilitarian and consequentialist thinking is easy to use to make yourself stupider. At the very least, you should be asking others - just as this is useful for avoiding the unilateralist's curse, it is useful for avoiding self-deception and convenient excuses.
Where there are switching costs or longer term goals. Learning to play guitar instead of continuing to practice piano (or moving from C++ to Python) is easy to justify in the short term, but expensive in terms of changes needed and resetting progress.
When goals are unknown. As Unenumerated put it, "cases where substantial evidence or shared preferences that motivated the original investment decision have been forgotten or have not been communicated, or otherwise where the quality of evidence that led to that decision may outweigh the quality of evidence that is motivating one to change one's mind."

Some examples of where it seems useful to avoid constructing Schelling fences, and to try paying more attention to marginal cost:

When constructing rules for other people, or in organizations. Schelling fences are useful for self-commitment; otherwise they are rules and formal structures rather than norm-based fences. As gwern noted, "Whatever pressures and feedback loops cause sunk cost fallacy in organizations may be completely different from the causes in individuals."
When the environment is very volatile, and non-terminal goals change. It's easy to get stuck in a mode where the justification is "this is what I do," rather than a re-commitment to the longer term goal. If you are unsure, try revisiting why the fence was put there. (But if you don't know, be careful of removing Chesterton's Fence! See "When goals are unknown", above.)
When the fence is based on a measurable output, rather than an input. In such a case, the goal has been reified, and is subject to Goodhart effects. Schelling fences are not appropriate for outcomes, since the outcome isn't controlled directly. (Bounds on outcomes also implicitly discourage further investment - see: Shorrock's Law of Limits. If necessary, the outcome itself should be rewarded, rather than fenced in.)

Values Weren't Complex, Once.

davidmanheim — 2018-11-25T09:17:02.207Z

The central argument of this post is that human values are only complex because all the obvious constraints and goals are easily fulfilled. The resulting post-optimization world is deeply confusing, and leads to noise as the primary driver of human values. This has worrying implications for any kind of world-optimizing. (This isn't a particularly new idea, but I am taking it a bit farther and/or in a different direction than this post by Scott Alexander, and I think it is worth making clear, given the previously noted connection to value alignment and effective altruism.)

First, it seems clear that formerly simple human values are now complex. "Help and protect relatives, babies, and friends" as a way to ensure group fitness and survival is mostly accomplished, so we find complex ethical dilemmas about the relative values of different behavior. "Don't hurt other people" as a tool for ensuring reciprocity has turned into compassion for humanity, animals, and perhaps other forms of suffering. These are more complex than they could possibly have been expressed in the ancestral environment, given restricted resources. It's worth looking at what changed, and how.

In the ancestral environment, humans had three basic desires: they wanted food, fighting, and fornication. Food is now relatively abundant, leading to people's complex preferences about exactly which flavors they like most. These differ because the base drive for food is overoptimizing. Fighting was competition between people for resources - and since we all have plenty, this turns into status-seeking in ways that aren't particularly meaningful outside of human social competition. The varieties of signalling and counter-signalling are the result. And fornication was originally for procreation, but we're adaptation executers, not fitness maximizers, so we've short-circuited that with birth control and pornography, leading to an explosion in seeking sexual variety and individual kinks.

Past the point where maximizing the function has a meaningful impact on the intended result, we see the tails come apart. The goal seeking of human nature, however, needs to find some direction to push the optimization process. The implication from this is that humanity finds diverging goals because they are past the point where the basic desires run out. As Randall Munroe points out in an XKCD Comic, this leads to increasingly complex and divergent preferences for ever less meaningful results. And that comic would be funny if it weren't a huge problem for aligning group decision making and avoiding longer term problems.

If this is correct, the key takeaway is that as humans find ever fewer things to need, they inevitably find ever more things to disagree about. Even though we expect convergent goals related to dominating resources, narrowly implying that we want to increase the pool of resources to reduce conflict, human values might be divergent as the pool of such resources grows.