Link: Does the following seem like a reasonable brief summary of the key disagreements regarding AI risk? 2019-12-26T20:14:52.509Z · score: 11 (5 votes)
Updating a Complex Mental Model - An Applied Election Odds Example 2019-11-28T09:29:56.753Z · score: 10 (4 votes)
Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management 2019-10-30T10:33:16.240Z · score: 26 (14 votes)
Divergence on Evidence Due to Differing Priors - A Political Case Study 2019-09-16T11:01:11.341Z · score: 27 (11 votes)
Hackable Rewards as a Safety Valve? 2019-09-10T10:33:40.238Z · score: 18 (5 votes)
What Programming Language Characteristics Would Allow Provably Safe AI? 2019-08-28T10:46:32.643Z · score: 5 (5 votes)
Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) 2019-08-12T08:07:01.769Z · score: 17 (9 votes)
Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) 2019-07-28T09:32:25.878Z · score: 19 (6 votes)
What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) 2019-07-28T09:30:29.792Z · score: 29 (6 votes)
Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) 2019-07-02T15:36:51.071Z · score: 29 (6 votes)
Schelling Fences versus Marginal Thinking 2019-05-22T10:22:32.213Z · score: 23 (14 votes)
Values Weren't Complex, Once. 2018-11-25T09:17:02.207Z · score: 34 (15 votes)
Oversight of Unsafe Systems via Dynamic Safety Envelopes 2018-11-23T08:37:30.401Z · score: 11 (5 votes)
Collaboration-by-Design versus Emergent Collaboration 2018-11-18T07:22:16.340Z · score: 12 (3 votes)
Multi-Agent Overoptimization, and Embedded Agent World Models 2018-11-08T20:33:00.499Z · score: 9 (4 votes)
Policy Beats Morality 2018-10-17T06:39:40.398Z · score: 15 (15 votes)
(Some?) Possible Multi-Agent Goodhart Interactions 2018-09-22T17:48:22.356Z · score: 21 (5 votes)
Lotuses and Loot Boxes 2018-05-17T00:21:12.583Z · score: 29 (6 votes)
Non-Adversarial Goodhart and AI Risks 2018-03-27T01:39:30.539Z · score: 64 (14 votes)
Evidence as Rhetoric — Normative or Positive? 2017-12-06T17:38:05.033Z · score: 1 (1 votes)
A Short Explanation of Blame and Causation 2017-09-18T17:43:34.571Z · score: 1 (1 votes)
Prescientific Organizational Theory (Ribbonfarm) 2017-02-22T23:00:41.273Z · score: 3 (4 votes)
A Quick Confidence Heuristic; Implicitly Leveraging "The Wisdom of Crowds" 2017-02-10T00:54:41.394Z · score: 1 (2 votes)
Most empirical questions are unresolveable; The good, the bad, and the appropriately under-powered 2017-01-23T20:35:29.054Z · score: 3 (4 votes)
A Cruciverbalist’s Introduction to Bayesian reasoning 2017-01-12T20:43:48.928Z · score: 1 (2 votes)
Map:Territory::Uncertainty::Randomness – but that doesn’t matter, value of information does. 2016-01-22T19:12:17.946Z · score: 6 (11 votes)
Meetup : Finding Effective Altruism with Biased Inputs on Options - LA Rationality Weekly Meetup 2016-01-14T05:31:20.472Z · score: 1 (2 votes)
Perceptual Entropy and Frozen Estimates 2015-06-03T19:27:31.074Z · score: 17 (12 votes)
Meetup : Complex problems, limited information, and rationality; How should we make decisions in real life? 2013-10-09T21:44:19.773Z · score: 3 (4 votes)
Meetup : Group Decision Making (the good, the bad, and the confusion of welfare economics) 2013-04-30T16:18:04.955Z · score: 4 (5 votes)


Comment by davidmanheim on Becoming Unusually Truth-Oriented · 2020-01-27T06:49:11.652Z · score: 4 (2 votes) · LW · GW

On rereading, I really liked the second portion of Abram's post, but I strongly second this comment, and think it does not go far enough - on two fronts.

First, my understanding is that there is disagreement in the literature about whether dreams are things that get recalled at all: it's unclear exactly how much is post-facto confabulation, and how much is your brain inferring details that weren't imagined at the time. Given that, I think recalling dreams is a misleading example, and regardless of which interpretation is correct, it would be a worrying model for recall.

Second, I agree with the personal epistemic issue, since I am largely aphantasic, and rarely "remember" dreams. That said, the reason I reacted strongly against the opening passage is a combination of my personal inside-view inability to relate, and my understanding that the example is misleading, especially compared to how useful the remainder of the post is. That means that I was turned off from the post early on, and had to reread the second half to reevaluate whether I was unfairly dismissive. I decided that I was, but it was a high bar, and if I hadn't known that the author was usually really insightful, I wouldn't have done so.

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-15T17:53:35.890Z · score: 2 (1 votes) · LW · GW


Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-13T11:46:04.984Z · score: 4 (2 votes) · LW · GW

Yes - I suspect a large amount of the variance is explained by features we can measure, and the residual may be currently unexplained, but filtering on the features you can measure probably gets most of what is needed.

However, I don't think the conclusion necessarily follows.

The problem is a causal reasoning / incentive issue - just because people who update frequently do well doesn't mean that telling people you'll pay those who update frequently will cause them to do better once they update more often. For instance, if you took MMORPG players and gave them money on condition that they spend money on the game, you'd screw up the relationship between spending and success.

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-12T11:26:49.055Z · score: 4 (2 votes) · LW · GW

That makes sense as an approach - but as mentioned initially, I think the issue with calling people superforecasters is deeper, since it's unclear how much of the performance is even about their skill, rather than other factors.

Instead of basketball and the NBA, I'd compare superforecasting to performance at a modern (i.e. pay-to-win) mobile MMORPG: you need to be good to perform near the top, but the other factor that separates winners and losers is being willing to invest much more than others in loot boxes and items (i.e. time spent forecasting) because you really want to win.

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-11T17:52:05.553Z · score: 4 (2 votes) · LW · GW


Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-11T17:51:46.796Z · score: 6 (3 votes) · LW · GW

Superforecasters used only public information, or information they happened to have access to - but the original project was run in parallel with a (then secret) prediction platform for inside the intelligence community. It turned out that the intelligence people were significantly outperformed by superforecasters, despite having access to classified information and commercial information sources, so it seems clear that the information access wasn't particularly critical for the specific class of geopolitical predictions they looked at. This is probably very domain dependent, however.

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-10T13:26:03.750Z · score: 6 (3 votes) · LW · GW
it reads to me like "2% of people are superheroes" — they have performance that is way better than the rest of the population on these tasks.

As you concluded in other comments, this is wrong. But there doesn't need to be a sharp cutoff for there to be "way better" performance. If the top 1% consistently have Brier scores on a class of questions of 0.01, the next 1% have Brier scores of 0.02, and so on, you'd see "way better" performance without a sharp cutoff - and the median Brier score of 0.5, exactly as good as flipping a coin, is WAY worse than the people at the top. (Let's assume everyone else is at least as good as flipping a coin, so the bottom half are all equally useless.)
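To make those numbers concrete, here is a sketch of the Brier score in the convention GJP used, where the score is summed over both outcomes (0 is perfect, 2 is maximally wrong), so a flat 50% forecast on a binary question scores 0.5:

```python
def brier(forecast_p, outcome):
    """Two-sided Brier score for a binary question (0 = perfect, 2 = worst).

    forecast_p: the probability assigned to the event happening.
    outcome: 1 if the event happened, 0 otherwise.
    """
    # Sum of squared errors across both possible outcomes (the GJP convention).
    return (forecast_p - outcome) ** 2 + ((1 - forecast_p) - (1 - outcome)) ** 2

print(brier(0.5, 1))             # 0.5 - always guessing 50% scores 0.5 either way
print(round(brier(0.95, 1), 3))  # 0.005 - confident and right: near-perfect
print(round(brier(0.95, 0), 3))  # 1.805 - confident and wrong: heavily penalized
```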

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-10T13:19:02.642Z · score: 4 (2 votes) · LW · GW

Agreed. As I said, "it is unlikely that there is a sharp cutoff at 2%, there isn't a discontinuity, and power law is probably the wrong term."

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-09T07:57:08.146Z · score: 2 (1 votes) · LW · GW

See my response below - and the dataset of forecasts is now public if you wanted to check the numbers.

Comment by davidmanheim on Are "superforecasters" a real phenomenon? · 2020-01-09T07:55:36.228Z · score: 62 (20 votes) · LW · GW

The clear answer to the question posed, "do the performances of GJP participants follow a power-law distribution, such that the best 2% are significantly better than the rest," is yes - with a minor quibble, and a huge caveat. (Epistemic status: I'm very familiar with the literature, have personal experience as a superforecaster since the beginning, have had discussions with Dan Gardner and the people running the project, have had conversations with the heads of Good Judgement Inc, etc.)

The minor quibble is identified in other comments: it is unlikely that there is a sharp cutoff at 2%, there isn't a discontinuity, and power law is probably the wrong term. Aside from those "minor" issues, yes, there is a clear group of people who outperformed multiple years in a row, and this group was fairly consistent from year to year. Not only that, but the order within that group is far more stable than chance. That clearly validates the claim that "superforecasters are a real thing."

But the data showing that those people are better is based on a number of things, many of which aren't what you would think. First, the biggest difference between top forecasters and the rest is frequency of updates and a corresponding willingness to change their minds as evidence comes in. People who invest time in trying to forecast well do better than those who don't - to that extent, it's a skill like most others. Second, success at forecasting is predicted by most of the things that predict success at almost everything else - intelligence, time spent, and looking for ways to improve. Some of the techniques that Good Judgement advocates for superforecasters came from people who read Kahneman and Tversky, Tetlock, and related research, and tried to apply the ideas. The things that worked were adopted - but not everything helped. Other techniques were original to the participants - for instance, explicitly comparing your estimate for a question across different timeframes, to ensure it is a coherent and reasonable probability. (Will X happen in the next 4 months? If we changed that to one month, would my estimate be about a quarter as high? What about if it were a year? If my intuition for the answer is about the same, I need to fix that.) Ideas like this are not natural ability; they are just applying intelligence to a problem they care about.
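That timeframe check can be made mechanical. As a minimal sketch - assuming, purely for illustration, a constant per-month hazard rate, which is my simplification rather than part of the technique - the implied probabilities at other horizons are:

```python
def rescale_horizon(p_event, months, new_months):
    """Given the probability of an event within `months`, return the implied
    probability within `new_months`, assuming a constant per-month hazard."""
    monthly_survival = (1 - p_event) ** (1 / months)  # chance of "no event" per month
    return 1 - monthly_survival ** new_months

# A 20% chance over 4 months should NOT be 20% at every horizon:
print(round(rescale_horizon(0.20, 4, 1), 3))   # 0.054 - roughly a quarter as high
print(round(rescale_horizon(0.20, 4, 12), 3))  # 0.488 - roughly half, over a year
```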

Also, many of the poorer performers were people who didn't continue forecasting, and their initial numbers got stale - they presumably would have updated. The best performers, on the other hand, checked the news frequently, and updated. At times, we would change a forecast once the event had / had not happened, a couple days before the question was closed, yielding a reasonably large "improvement" in our time-weighted score. This isn't a function of being naturally better - it's just the investment of time that helps. (This also explains a decent part of why weighting recency in aggregate scores is helpful - it removes stale forecasts.)

So in short, I'm unconvinced that superforecasters are a "real" thing, except in the sense that most people don't try, and people who do will do better, and improve over time. Given that, however, we absolutely should rely on superforecasters to make better predictions than the rest of us - as long as they continue doing the things that make them good forecasters.

Comment by davidmanheim on What are we assuming about utility functions? · 2019-12-31T13:15:32.175Z · score: 2 (1 votes) · LW · GW

See my other reply about pseudo-pareto improvements - but I think the "understood + endorsed" idea is really important, and worth further thought.

Comment by davidmanheim on What are we assuming about utility functions? · 2019-12-31T13:13:47.316Z · score: 6 (1 votes) · LW · GW

My current best understanding is that if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisficing on different human values by creating near-Pareto improvements for intra-human values. But inconsistent values don't even allow Pareto improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do "wrong", so that we can pick what an AI should respect of human preferences, and what can be ignored.

For instance, I love my children, and I like chocolate. I'm also inconsistent in my preferences, in ways that differ between the two; at a given moment in time, I'm much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but inconsistent preference for my children over the more consistent preference for candy. I don't know how to formalize this in a way that would generalize, which seems like a problem. Similar problems exist for time preferences and other typical inconsistencies - they are either inconsistent, or at least can be exploited by an AI whose model doesn't try to resolve those inconsistencies.

With a super-prospect theory, I would hope we may be able to define a CEV or similar, which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV - but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.

(Beyond this, I think we need to expect that between-human values will differ, and we can keep things safe by insisting on near-Pareto improvements: only changes that are a Pareto improvement with respect to a very large portion of people, with relatively minor dis-improvements for the remainder. But that's a different discussion.)

Comment by davidmanheim on What are we assuming about utility functions? · 2019-12-25T19:30:15.154Z · score: 2 (1 votes) · LW · GW
"Arguably, you can't fully align with inconsistent preferences"
My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevent meaningful "alignment"? (Maybe it takes a big discursive can of worms, but I actually haven't seen this discussed on a serious level so I'm quite happy to just read references).

I've been thinking about whether you can have AGI that only aims for Pareto improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler's thoughts on what he has called "pareto-topia". (I haven't gotten anywhere thinking about this because I'm spending my time on other things.)

Comment by davidmanheim on What are we assuming about utility functions? · 2019-12-25T19:25:49.311Z · score: 13 (3 votes) · LW · GW

I don't think you're putting enough weight on what REALLY convinced economists, which was the tractability that assuming utility provides, and their enduring physics envy. (But to be fair, who wouldn't wish that their domain was as tractable as Newtonian physics ended up being.)

But yes, Utility is a useful enough first approximation for humans that it's worth using as a starting point. But only as a starting point. Unfortunately, too many economists are instead busy building castles on their assumptions, without trying to work with better approximations. (Yes, prospect theory and related. But it's hard to do the math, so micro-economic foundations of macroeconomics mostly just aren't being rebuilt.)

I certainly agree that this isn't a good reason to consider human inability to approximate a utility function when looking at modeling AGI. But it's absolutely critical when discussing what we're doing to align with human "values," and figuring out what that looks like. That's why I think that far more discussion on this is needed.

Comment by davidmanheim on Some Comments on "Goodhart Taxonomy" · 2019-12-25T18:54:05.394Z · score: 22 (5 votes) · LW · GW

Glad to see engagement on this - and I should probably respond to some of these points, but before doing so, want to point to where I've already done work on this, since much of that work either admits your points, or addresses them.

First, I think you should read the paper I wrote with Scott that extended the thoughts from his post. It certainly doesn't address all of this, but we were very clear that adversarial Goodhart was less clear than the other modes and needed further work. We also drew a clearer connection to the way the tails come apart, and clarified some of the sub-cases of both extremal and causal Goodhart. Following that, I wrote another post on the topic, trying to expand on the points made in the paper - but specifically excluding multi-agent issues, because they were hard and I wasn't clear enough about how they worked.

I tried to do a bit of that work in a paper, Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence. This attempts to provide a categorization for multi-agent cases similar to the one made in Scott's post. It made a few key points that I think need further discussion about the relationship to embedded agents, and other issues. I was less successful than I hoped at cutting through the confusion, but a key point it does make is that all multi-agent failures are actually single agent failure modes, but they are caused by misaligned goals or coordination failures. (And these aren't all principal-agent issues, though I agree that many are. For instance, some cases are tragedy of the commons, and others are more direct corruption of the other agents.) I also summarized the paper a bit and expanded on certain key points in another lesswrong post.

And since I'm giving a reading list, I also think my even more recent, but only partially-completed sequence of posts on optimization and selection versus control (in the single agent cases) might clarify some of the points about Regressional versus Extremal Goodhart further. Post one of that sequence is here.

Comment by davidmanheim on Updating a Complex Mental Model - An Applied Election Odds Example · 2019-12-03T08:46:47.971Z · score: 2 (1 votes) · LW · GW

Sorry, I should have clarified that the news was US GDP Growth:

Comment by davidmanheim on Spaghetti Towers · 2019-11-28T10:57:24.031Z · score: 11 (3 votes) · LW · GW

This idea has become part of my conceptual toolkit for discussing / describing a key failure mode.

(Note: the below part of the comment is from my Facebook comment about the article when it came out.)

There's a great connection you make to bureaucracy, and it's definitely worth exploring.

This gives me a good language to discuss something I've noted a number of times. I'd posit that selection pressure for bureaucracy limits how stupid the system gets, as a function of the best simple alternative and the difficulty of transitioning to it without turning off the system. This means that for critical systems where there is no incremental pathway to improve, the bureaucracy is near-permanent even if there are better alternatives - see the US healthcare system. For less critical systems, once an alternative is found, as long as the transition isn't too long or too expensive, and there are incentives that actually promote efficiency, it will happen. The critical fact is that the incentives need to exist for everyone involved - not just the end user. So if Bob in accounting doesn't like the change, unless someone else (like senior management) can induce cooperation, it never happens.

Comment by davidmanheim on On Doing the Improbable · 2019-11-28T10:50:03.686Z · score: 10 (2 votes) · LW · GW

This post has influenced my evaluations of what I am doing in practice by forcing me to consider lowering the bar for expected success for high-return activities. Despite "knowing" about how to shut up and multiply, and needing to expect a high failure rate if taking reasonable levels of risk, I didn't consciously place enough weight on these. This helped move me more in that direction, which has led to both an increased number of failures to get what I hoped for, and a number of mostly unexpected successes when applying for / requesting / attempting things.

It is worth noting that I still need to work on the reaction I have to failing at these low cost, high-risk activities. I sometimes have a significant emotional reaction to failing, which is especially problematic because the emotional reaction to failing at a long-shot can influence my mood for multiple days or weeks afterwards.

Comment by davidmanheim on The Costly Coordination Mechanism of Common Knowledge · 2019-11-28T10:43:59.140Z · score: 4 (2 votes) · LW · GW

Until seeing this post, I did not have a clear way of talking about common knowledge. Despite understanding the concept fairly well, this post made the points more clearly than I had seen them made before, and provided a useful reference when talking to others about the issue.

Comment by davidmanheim on A LessWrong Crypto Autopsy · 2019-11-28T10:42:36.754Z · score: 6 (1 votes) · LW · GW

This post has been a clear example of how rationality has and has not worked in practice. It is also a subject of critical practical importance for future decisions, so it frequently occurs to me as a useful example of how and why rationality does and does not help with (in retrospect) critical decisions.

Comment by davidmanheim on Embedded Agents · 2019-11-28T10:40:40.380Z · score: 10 (2 votes) · LW · GW

This post has significantly changed my mental model of how to understand key challenges in AI safety, and also given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.

Comment by davidmanheim on A test for symbol grounding methods: true zero-sum games · 2019-11-27T12:50:22.400Z · score: 11 (0 votes) · LW · GW

I don't think this is straightforward in practice - putting a Cartesian boundary in place avoids exactly the key problem. Any feature of the world used as the item to minimize/maximize must be measured, and incorruptible measurement systems seem like a non-trivial problem. For instance, how do I get my AGI to maximize blue in an area, instead of maximizing the blue input into its sensor when pointed at that area? We need to essentially solve value loading and understand a bunch of embedded agent issues to really talk about this.

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-24T10:03:04.624Z · score: 2 (1 votes) · LW · GW

There is also overhead to scaling and difficulty aligning goals that they want to avoid. (As above, I think my Ribbonfarm post makes this clear.) Once you get bigger, the only way to ensure alignment is to monitor - trust, but verify. And verification is a large part of why management is so costly - it takes time away from actually doing work, it is pure overhead for the manager, and even then, it's not foolproof.

When you're small, on the other hand, high-trust is almost unnecessary, because the entire org is legible, and you can see that everyone is (or isn't) buying in to the goals. In typical startups, they are also well aligned because they all have similar levels of payoff if things go really well.

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-24T09:57:45.294Z · score: 4 (2 votes) · LW · GW

My claim is that *competence* isn't the critical limiting factor in most cases because structure doesn't usually allow decoupling, not that it's not limited. When it IS the limiting factor, I agree with you, but it rarely is. And I think alignment is a different argument.

In EA orgs, alignment can solve the delegation-without-management problem because it can mitigate principal-agent issues. Once we agree on goals, we're working towards them, and we can do so in parallel and coordinate only when needed. In most orgs, alignment cannot accomplish this, because it's hard to get people to personally buy into your goals when those goals are profit maximization for a company. (Instead, you use incentive structures like bonuses to align them. But then you need to monitor them, etc.)

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-20T19:57:06.054Z · score: 6 (1 votes) · LW · GW

On your points about scaling, I mostly agree, but want to note that there are fundamental issues with scaling that I explained in a post here:

The post is rather long. In short, however, I don't think that your Kingdom metaphor works, because large bureaucracies are big *not* because they have many mini-kingdoms doing similar things in parallel, but because they need to specialize and allow cross-functional collaboration, which requires lots of management.

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T16:14:34.796Z · score: 9 (5 votes) · LW · GW

I partly agree, but burden of proof is often the wrong framing for truth seeking.

The article provides strong evidence that ads are ineffective in certain classes of cases, and that fact in turn provides weaker evidence that ads are ineffective more generally. To support Akshat's skepticism that the result generalizes, we'd need evidence or priors that point towards ads being differentially effective depending on the type - targeted keywords vs. brand-ad keywords, and brand presence versus no brand presence.

In the first case, I'm somewhat skeptical that the difference between targeted and brand keywords will be large. My prior for the second difference is that there would be some difference, as Gordon argued in another comment. I don't know of any evidence in either direction, but I haven't looked. (The actual result doesn't matter to me except as an exercise in Bayesian reasoning, but if it matters to you or others, it's plausibly high VoI to search a bit.)

Comment by davidmanheim on Book Review: Design Principles of Biological Circuits · 2019-11-20T08:20:12.413Z · score: 4 (3 votes) · LW · GW

It's not just selection effects on organisms - it's within organisms. The examples given are NOT fully understood, so (for example) the bacterial transcription network motifs only contain the relationships that we understand. Given that loops and complex connections are harder to detect, that has selected for simplicity.

Given that, I still want to read the book and see the argument more fully.

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T08:12:03.075Z · score: 9 (4 votes) · LW · GW

There was a bubble, and there is also secular growth in the market, with a lot of churn that makes buy-and-hold a fairly bad idea. Those aren't inconsistent. Here's a graphic of the churn. Most of the early companies died.

But if you put all your money in the hot IPOs of Netscape, Yahoo, Lycos, and Excite in 1995, you'd have done very poorly. If you extended this to 1996, you could add Mindspring and Check Point (the only one that did well, so far - up 29x, for a 16% annualized return to date). It took until 1997 to get any long-term fantastic return, for Amazon - which is up 1000x since 1997, or a 37% annual return - fantastic, but if you were prescient, and it was an entire tenth of your portfolio, on average you did just OK. Skipping ahead to 1999-2000, the height of the bubble, here's the list. Nothing made big bucks.

So we can construct a portfolio with 10 stocks, 8 of which went bust, and 2 of which, Check Point and Amazon, did well. Your compound 22-year return? 5.25%. (And if you bought an S&P index fund in 1998 at 1,000, you'd have made 6.5% annually.)
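For anyone who wants to check the arithmetic, the annualized figures follow from the standard compound-return formula (the year counts below are my approximations):

```python
def annualized(multiple, years):
    """Convert a total return multiple over `years` into an annualized rate."""
    return multiple ** (1 / years) - 1

# Check Point: 29x from ~1996 to 2019.
print(round(annualized(29, 23) * 100, 1))    # 15.8 - the ~16% quoted
# Amazon: 1000x from ~1997 to 2019.
print(round(annualized(1000, 22) * 100, 1))  # 36.9 - the ~37% quoted
# A 5.25% compound return over 22 years corresponds to only ~3.08x in total.
print(round(annualized(3.08, 22) * 100, 2))  # 5.25
```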

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T07:52:18.781Z · score: 4 (2 votes) · LW · GW

Agreed on both points.

Comment by davidmanheim on Aumann Agreement Game · 2019-11-19T15:32:36.355Z · score: 7 (2 votes) · LW · GW

For those interested in running the game, I put together a Python script that pulls questions from an API for trivia questions, and outputs 2 LaTeX files that can be compiled and used, one for the quizmaster, plus a set of corresponding cut-out cards for the players.

Pull requests and improvement suggestions are welcome!
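For a sense of the generation step, a minimal sketch of the idea (the sample questions and LaTeX layout here are illustrative stand-ins, not taken from the actual script, which pulls questions from a trivia API):

```python
# Illustrative hard-coded questions; the real script fetches these from an API.
questions = [
    {"q": "In what year was the Bank of England founded?", "a": "1694"},
    {"q": "How many moons does Mars have?", "a": "2"},
]

def quizmaster_tex(qs):
    """Numbered question list with answers, for the quizmaster's sheet."""
    items = "\n".join(rf"\item {x['q']} \textbf{{({x['a']})}}" for x in qs)
    return "\\begin{enumerate}\n" + items + "\n\\end{enumerate}"

def card_tex(qs):
    """One framed, answer-free card per question, to cut out for the players."""
    return "\n".join(rf"\fbox{{\parbox{{0.45\textwidth}}{{{x['q']}}}}}" for x in qs)

print(quizmaster_tex(questions))
print(card_tex(questions))
```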

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-19T05:52:53.303Z · score: 14 (3 votes) · LW · GW
To this I'll just add that this problem is somewhat solvable, but it's tricky.

This is a very important point. I will self-promote and mention my pre-print paper on metric design and avoiding Goodharting (not in the context of AI):

Abstract: Metrics are useful for measuring systems and motivating behaviors. Unfortunately, naive application of metrics to a system can distort the system in ways that undermine the original goal. The problem was noted independently by Campbell and Goodhart, and in some forms it is not only common, but unavoidable due to the nature of metrics. There are two distinct but interrelated problems that must be overcome in building better metrics; first, specifying metrics more closely related to the true goals, and second, preventing the recipients from gaming the difference between the reward system and the true goal. This paper describes several approaches to designing metrics, beginning with design considerations and processes, then discussing specific strategies including secrecy, randomization, diversification, and post-hoc specification. Finally, it will discuss important desiderata and the trade-offs involved in each approach.

(Currently working on a rewrite, but feedback on the ideas and anything missing is especially appreciated.)

Comment by davidmanheim on Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management · 2019-11-06T13:23:25.577Z · score: 2 (1 votes) · LW · GW

Yeah, I could see giving the sleeping-pills phenomenon a different name, but (while I am unsure) I think the term is well known in the industry, so I'm sticking with it.

Still, I'm open to better suggestions, but sugar pill / placebo doesn't seem better, since it's much less clear as an analogy.

Comment by davidmanheim on Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management · 2019-11-06T13:21:07.717Z · score: 2 (1 votes) · LW · GW

Absolutely true - they are slightly less related to principal-agent problems, but I can add stories illustrating these, because I have some good ones.

Re: Risk distribution, there's a great game that someone suggested to understand what 100 year floods look like, since people complain they happen every year. Get a room full of, say, 40 people, and give them each 2 dice. Tell them snake-eyes is a flood, then have everyone roll, and to call out if there was a flood. Then roll and announce again. And again. Yes, many "years" will have no floods, but some will have 2 or even 3 - because the risk isn't highly correlated across areas.
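For anyone who wants to check the game's arithmetic, the exact binomial probabilities for 40 areas each facing a 1-in-36 annual flood chance can be computed directly:

```python
from math import comb

N_AREAS, P_FLOOD = 40, 1 / 36  # 40 players, snake-eyes on two dice

def prob_k_floods(k):
    """Probability that exactly k of the independent areas flood in one year."""
    return comb(N_AREAS, k) * P_FLOOD**k * (1 - P_FLOOD) ** (N_AREAS - k)

p0 = prob_k_floods(0)
p2_plus = 1 - prob_k_floods(0) - prob_k_floods(1)
print(round(p0, 2))       # 0.32 - about a third of years have no flood at all
print(round(p2_plus, 2))  # 0.31 - but nearly a third have two or more floods
```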

Re: Threshold of misery, I knew an underwriter who said they would write New York City nuclear terrorism insurance whenever they had the chance, at basically any price. His reasoning? He lives and works in NYC, so if a nuclear bomb goes off, he's dead, and the fact that he lost money isn't what matters. Otherwise he collects premiums.

Comment by davidmanheim on Prospecting for Conceptual Holes · 2019-10-30T10:38:16.221Z · score: 13 (9 votes) · LW · GW

It's worth noting that these types basically match the Johari Window for types of risks; known knowns, known unknowns, unknown knowns, and unknown unknowns. This is because, at least according to one definition of the term, risks are things we don't expect. Given that definition, a risk is the product of a conceptual hole of some sort - and so the two have a somewhat trivial mapping.

Comment by davidmanheim on All I know is Goodhart · 2019-10-24T10:59:12.269Z · score: 4 (2 votes) · LW · GW

This adversarial issue can be true, and is critical, but I don't think it's what Stuart was pointing to in his post or his reply.

Comment by davidmanheim on All I know is Goodhart · 2019-10-24T10:56:54.922Z · score: 10 (3 votes) · LW · GW

I think you want to differentiate between different mechanisms for Goodhart's law. The categorization that Scott Garrabrant put together, and I worked with him on refining, is here:

Given that, I see several different things going on.

First, if I read the post correctly, Stuart is discussing regressional Goodhart, in this case the general issue of what Greg Lewis called "The Tails Come Apart". This occurs whether or not the true value function is known. (As a historical note, this is a broader and, as Scott pointed out, a more fundamentally unavoidable claim than either what Goodhart meant, or what Campbell was referring to.)
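A quick simulation of that regressional effect (all names and parameters are illustrative, not from either post):

```python
import random

def select_by_proxy(n=100_000, noise=1.0, seed=1):
    """Proxy = true value + independent noise. Picking the proxy-maximizing
    item predictably over-selects noise: the tails come apart."""
    rng = random.Random(seed)
    true_vals = [rng.gauss(0, 1) for _ in range(n)]
    proxies = [v + rng.gauss(0, noise) for v in true_vals]
    i = max(range(n), key=lambda k: proxies[k])
    return proxies[i], true_vals[i], max(true_vals)

proxy_score, true_score, best_possible = select_by_proxy()
# The proxy-maximizing item's true value falls below its proxy score,
# and (almost always) below the best item as ranked by true value.
```

Note that nothing adversarial is happening here, and the true value function is known exactly up to noise; the divergence at the tail is purely statistical.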

Second, the divergence in your example between "a measurement of my internal sense of beauty" and "my internal sense of beauty itself" is a second Goodhart effect, which is (at least) a causal one, where repeated queries change the estimates due to psychological biases, etc. In that case, there's also a nasty potential adversarial Goodhart issue, if the AI gets to make the queries and exploits those biases.

Alternatively, if the initial sample of "your internal sense of beauty" is a fixed sample, there is a sampling and inference issue for the preferences for embedded agents - inferring a continuous, potentially unbounded function from a finite sample. That's an important and fundamental issue, but it's only partially about, in this case, extremal Goodhart. It's also a more general issue about inferring preferences, i.e. learning is hard and this is learning.

Comment by davidmanheim on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T06:53:59.169Z · score: 16 (5 votes) · LW · GW

I commented on the thread (after seeing this) in order to add a link to my paper that addresses Bengio's last argument:

@Yoshua Bengio I attempted to formalize this argument somewhat in a recent paper. I don't think the argument there is particularly airtight, but I think it provides a significantly stronger argument for why we should believe that interaction between optimizing systems is fundamentally hard.

Paper abstract: "An important challenge for safety in machine learning and artificial intelligence systems is a set of related failures involving specification gaming, reward hacking, fragility to distributional shifts, and Goodhart’s or Campbell’s law. This paper presents additional failure modes for interactions within multi-agent systems that are closely related. These multi-agent failure modes are more complex, more problematic, and less well understood than the single-agent case, and are also already occurring, largely unnoticed. After motivating the discussion with examples from poker-playing artificial intelligence (AI), the paper explains why these failure modes are in some senses unavoidable. Following this, the paper categorizes failure modes, provides definitions, and cites examples for each of the modes: accidental steering, coordination failures, adversarial misalignment, input spoofing and filtering, and goal co-option or direct hacking. The paper then discusses how extant literature on multi-agent AI fails to address these failure modes, and identifies work which may be useful for the mitigation of these failure modes."

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-10-02T10:42:00.773Z · score: 4 (2 votes) · LW · GW

Re: examples of point #1, I don't think that shaming in this forum is productive - it's polarizing and stigmatizing rather than helpful. But I do know of several individuals and a couple organizations which are guilty of this, each repeatedly.

I do think that people should be more willing to personally / privately respond if someone does something, and I have done so in several specific cases where someone decided on a unilateralist approach that I thought was damaging.

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-09-29T10:08:28.581Z · score: 23 (8 votes) · LW · GW
Even ignoring the above problem, I'm confused why it's valuable to build up a "real tradition" among LW users, given that the wider unilateralist curse problem that our world faces can't possibly be solved by LW users having such a tradition.

A few points.

First, I don't think it's clear that in the Rationalist / EA community, there is enough reinforcement of this, and I routinely see issues with people "going rogue" and unilaterally engaging in activities that others have warned them would be dangerous, net negative, etc.

Second, it's valuable even as an exemplar; we should be able to say that there is such a community, and that they are capable of exercising at least this minimal level of restraint.

Third, I think it's clear that in the next decade the number of people in the Rationalist-sphere that are in actual positions of (relatively significant) power will continue to grow, and we have already seen some such people emerge in government and in the world of NGOs. For AI, in particular, there are many people who have significant influence in making decisions that could significantly affect humanity's future. Their participation in this, even if passive, seems likely to at least give them a better understanding of what is needed when they are faced with these choices.

Comment by davidmanheim on Divergence on Evidence Due to Differing Priors - A Political Case Study · 2019-09-17T20:28:56.581Z · score: 3 (2 votes) · LW · GW

They don't contradict directly, but they reflect nearly incompatible updates to their world-models based on the same data.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-12T06:54:17.571Z · score: 2 (1 votes) · LW · GW
But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that make easy-to-subvert channels harder to hit) to not happen, or to make it significantly less likely to happen, given that people want to build AIs that do things beside reward hacking

I was mostly noting that I hadn't thought of this, hadn't seen it mentioned, and so my model for returns to non-fundamental alignment AI safety investments didn't previously account for this. Reflecting on that fact now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI-safety.

(Now, some low confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent NAI/Near-human AGI, may be making the Foom! scenario more likely. That means that attention to AI safety that pushes for "safer self-driving cars" and "reducing and mitigating side-effects" is plausibly a net negative if done poorly, instead of being benign.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-11T09:08:49.045Z · score: 2 (0 votes) · LW · GW

My claim here is that superintelligence is a result of training, not a starting condition. Yes, a SAI would do bad things unless robustly aligned, but building the SAI requires it not to wirehead at an earlier stage in the process. My claim is that I am unsure that there is a way to train such a system that was not built with safety in mind such that it gets to a point where it is more likely to gain intelligence than it is to find ways to reward hack - not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-09-05T11:19:25.416Z · score: 3 (2 votes) · LW · GW

This is really helpful - thanks!

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-08-29T09:05:20.527Z · score: 6 (2 votes) · LW · GW

Yes, an unsafe AI cannot be boxed on unsafe hardware, nor can any AI running on physical hardware be made immune to attacks - but those are very different questions. For this question, first we assume that a provably safe AI can be written; then I wanted to ask what language would be needed.

Comment by davidmanheim on Where are people thinking and talking about global coordination for AI safety? · 2019-08-26T10:13:12.607Z · score: 2 (1 votes) · LW · GW

No, that's implicit in the model - either *some* crisis requiring higher capacity than we have will overwhelm us and we'll all die (and it doesn't matter which crisis), or the variance is relatively small so no such event occurs, and/or our capacity to manage risks grows quickly enough that we avoid the upper tail.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-20T11:18:23.249Z · score: 2 (1 votes) · LW · GW

Fully agree - I was using the example to make a far less fundamental point.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:56:54.942Z · score: 2 (1 votes) · LW · GW
2) Calling the issues between the agents because of model differences "terminology issues" could also work well - this may be a little like people talking past each other.

I really like this point. I think it's parallel to the human issue where different models of the world can lead to misinterpretation of the "same" goal. So "terminology issues" would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature and using a wall-thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be mis-aligned because it interprets "temperature" as referring to a different fact than the base-optimizer. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we're now calling terminology issues.
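As a toy illustration of that thermometer case (the numbers and names here are hypothetical):

```python
FLOOR_OFFSET = 2.0  # hypothetical: the floor runs 2 degrees colder than the wall

def wall_reading(room_temp):
    return room_temp  # the base optimizer's referent for "temperature"

def floor_reading(room_temp):
    return room_temp - FLOOR_OFFSET  # the mesa-optimizer's referent

def mesa_set_temperature(target=20.0):
    """The mesa-optimizer drives *its* sensor to the target, i.e. it heats
    the room until floor_reading(room_temp) == target."""
    return target + FLOOR_OFFSET

room_temp = mesa_set_temperature(20.0)
# Both agents faithfully pursued "temperature = 20", but the wall
# thermometer now reads 22: same word, different referent.
```

The misalignment here isn't in either agent's optimization; it's entirely in the unshared mapping from the word "temperature" to a fact about the world.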

I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It's possible that two models disagree on representation, but agree on all object level claims - think of using different coordinate systems. Because terminology issues can cause mistakes, I'd suggest that agents with non-shared world models can only reliably communicate via object-level claims.

The implication for AI alignment might be that we need AI to either fundamentally model the world the same way as humans, or need to communicate only via object-level goals and constraints.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:44:17.623Z · score: 2 (2 votes) · LW · GW
It seems like there's a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same score (and that isn't a perfect score), because it solved them the way I solve them, that doesn't sound like misalignment.

The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That's why I'm calling it a principal alignment failure - even though it's the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:22:34.816Z · score: 4 (2 votes) · LW · GW

Separate from my other comment, I want to question your assumption that we must worry about an AI-takeoff that is exponentially better than humans at everything, so that a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least) and while it is unlikely that the overall limits are coincidentally the same as / near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, instead of perfectly align the AI to our goals.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:15:33.897Z · score: 10 (5 votes) · LW · GW

I'll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.

First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor's Quantilizers paper, which you note, that shows a way to build systems that optimize but are not nearly as subject to Goodhart's curse. I think you are too dismissive. Similarly, you are dismissive of optimizing the target directly. I think that the epistemological issues you point to are possible to mitigate to the extent that they won't cause misalignment between reality and an AI's representation of that reality. Once that is done, the remaining issue is aligning "true" goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way.

Second, you note that you don't think we will solve alignment. I agree, because I think that "alignment" presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment is impossible. This, however, allows a very different approach: a system that optimizes only when it finds Pareto-improvements across a set of sub-alignment metrics or goals, constraining the possibility of runaway optimization. Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are all aligned with some human goals, so that the system will be limited in its ability to be misaligned.
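A sketch of that acceptance rule (the metric values below are hypothetical):

```python
def is_pareto_improvement(current, proposal):
    """Accept a proposed change only if it is at least as good on every
    sub-alignment metric and strictly better on at least one."""
    pairs = list(zip(current, proposal))
    return all(p >= c for c, p in pairs) and any(p > c for c, p in pairs)

# Scores on three hypothetical sub-alignment metrics:
ok = is_pareto_improvement((1.0, 2.0, 3.0), (1.5, 2.0, 3.0))
blocked = is_pareto_improvement((1.0, 2.0, 3.0), (5.0, 1.9, 3.0))
# The second proposal is rejected despite a big gain on metric 1,
# because it trades away any amount of metric 2.
```

The design intent is that runaway optimization of any single metric is blocked as soon as it starts costing anything on the others, at the price of leaving many (possibly good) trades on the table.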

Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that the system can reliably perform limited tasks (“put a strawberry on a plate without producing any catastrophic side-effects.”)