Posts

Updating a Complex Mental Model - An Applied Election Odds Example 2019-11-28T09:29:56.753Z · score: 10 (4 votes)
Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management 2019-10-30T10:33:16.240Z · score: 26 (14 votes)
Divergence on Evidence Due to Differing Priors - A Political Case Study 2019-09-16T11:01:11.341Z · score: 27 (11 votes)
Hackable Rewards as a Safety Valve? 2019-09-10T10:33:40.238Z · score: 18 (5 votes)
What Programming Language Characteristics Would Allow Provably Safe AI? 2019-08-28T10:46:32.643Z · score: 5 (5 votes)
Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) 2019-08-12T08:07:01.769Z · score: 17 (9 votes)
Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) 2019-07-28T09:32:25.878Z · score: 19 (6 votes)
What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) 2019-07-28T09:30:29.792Z · score: 29 (6 votes)
Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) 2019-07-02T15:36:51.071Z · score: 27 (5 votes)
Schelling Fences versus Marginal Thinking 2019-05-22T10:22:32.213Z · score: 23 (14 votes)
Values Weren't Complex, Once. 2018-11-25T09:17:02.207Z · score: 34 (15 votes)
Oversight of Unsafe Systems via Dynamic Safety Envelopes 2018-11-23T08:37:30.401Z · score: 11 (5 votes)
Collaboration-by-Design versus Emergent Collaboration 2018-11-18T07:22:16.340Z · score: 12 (3 votes)
Multi-Agent Overoptimization, and Embedded Agent World Models 2018-11-08T20:33:00.499Z · score: 9 (4 votes)
Policy Beats Morality 2018-10-17T06:39:40.398Z · score: 15 (15 votes)
(Some?) Possible Multi-Agent Goodhart Interactions 2018-09-22T17:48:22.356Z · score: 21 (5 votes)
Lotuses and Loot Boxes 2018-05-17T00:21:12.583Z · score: 29 (6 votes)
Non-Adversarial Goodhart and AI Risks 2018-03-27T01:39:30.539Z · score: 64 (14 votes)
Evidence as Rhetoric — Normative or Positive? 2017-12-06T17:38:05.033Z · score: 1 (1 votes)
A Short Explanation of Blame and Causation 2017-09-18T17:43:34.571Z · score: 1 (1 votes)
Prescientific Organizational Theory (Ribbonfarm) 2017-02-22T23:00:41.273Z · score: 3 (4 votes)
A Quick Confidence Heuristic; Implicitly Leveraging "The Wisdom of Crowds" 2017-02-10T00:54:41.394Z · score: 1 (2 votes)
Most empirical questions are unresolveable; The good, the bad, and the appropriately under-powered 2017-01-23T20:35:29.054Z · score: 3 (4 votes)
A Cruciverbalist’s Introduction to Bayesian reasoning 2017-01-12T20:43:48.928Z · score: 1 (2 votes)
Map:Territory::Uncertainty::Randomness – but that doesn’t matter, value of information does. 2016-01-22T19:12:17.946Z · score: 6 (11 votes)
Meetup : Finding Effective Altruism with Biased Inputs on Options - LA Rationality Weekly Meetup 2016-01-14T05:31:20.472Z · score: 1 (2 votes)
Perceptual Entropy and Frozen Estimates 2015-06-03T19:27:31.074Z · score: 10 (11 votes)
Meetup : Complex problems, limited information, and rationality; How should we make decisions in real life? 2013-10-09T21:44:19.773Z · score: 3 (4 votes)
Meetup : Group Decision Making (the good, the bad, and the confusion of welfare economics) 2013-04-30T16:18:04.955Z · score: 4 (5 votes)

Comments

Comment by davidmanheim on Updating a Complex Mental Model - An Applied Election Odds Example · 2019-12-03T08:46:47.971Z · score: 2 (1 votes) · LW · GW

Sorry, I should have clarified that the news was US GDP Growth: https://www.bea.gov/news/2019/gross-domestic-product-third-quarter-2019-second-estimate-corporate-profits-third-quarter

Comment by davidmanheim on Spaghetti Towers · 2019-11-28T10:57:24.031Z · score: 7 (3 votes) · LW · GW

This idea has become part of my conceptual toolkit for discussing / describing a key failure mode.

(Note: the below part of the comment is from my Facebook comment about the article when it came out.)

There's a great connection you make to bureaucracy, and it's definitely worth exploring.

This gives me a good language to discuss something I've noted a number of times. I'd posit that selection pressure on bureaucracies limits how stupid the system gets, as a function of the best simple alternative and the difficulty of transitioning to it without turning off the system. This means that for critical systems where there is no incremental pathway to improvement, the current system is near-permanent even if there are better alternatives - see the US healthcare system. For less critical systems, once an alternative is found, as long as the transition isn't too long or too expensive, and there are incentives that actually promote efficiency, it will happen. The critical fact is that the incentives need to exist for everyone who is involved - not just the end user. So if Bob in accounting doesn't like the change, unless someone else (like senior management) can induce cooperation, it never happens.

Comment by davidmanheim on On Doing the Improbable · 2019-11-28T10:50:03.686Z · score: 2 (1 votes) · LW · GW

This post has influenced my evaluation of what I am doing in practice by forcing me to consider lowering the bar for expected success on high-return activities. Despite "knowing" about how to shut up and multiply, and about needing to expect a high failure rate if taking reasonable levels of risk, I didn't consciously place enough weight on these considerations. This post helped move me further in that direction, which has led to both an increased number of failures to get what I hoped for, and a number of mostly unexpected successes when applying for / requesting / attempting things.

It is worth noting that I still need to work on my reaction to failing at these low-cost, high-risk activities. I sometimes have a significant emotional reaction to failing, which is especially problematic because failing at a long shot can influence my mood for days or weeks afterwards.


Comment by davidmanheim on The Costly Coordination Mechanism of Common Knowledge · 2019-11-28T10:43:59.140Z · score: 4 (2 votes) · LW · GW

Until seeing this post, I did not have a clear way of talking about common knowledge. Despite understanding the concept fairly well, this post made the points more clearly than I had seen them made before, and provided a useful reference when talking to others about the issue.

Comment by davidmanheim on A LessWrong Crypto Autopsy · 2019-11-28T10:42:36.754Z · score: 6 (1 votes) · LW · GW

This post provides a clear example of how rationality has and has not worked in practice. It is also a subject of critical practical importance for future decisions, so it frequently occurs to me as a useful illustration of how and why rationality does and does not help with (in retrospect) critical decisions.

Comment by davidmanheim on Embedded Agents · 2019-11-28T10:40:40.380Z · score: 10 (2 votes) · LW · GW

This post has significantly changed my mental model of how to understand key challenges in AI safety, and also given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.

Comment by davidmanheim on A test for symbol grounding methods: true zero-sum games · 2019-11-27T12:50:22.400Z · score: 11 (0 votes) · LW · GW

I don't think this is straightforward in practice - and putting a Cartesian boundary in place avoids exactly the key problem. Any feature of the world used as the item to minimize/maximize is measured, and building uncorruptible measurement systems seems like a non-trivial problem. For instance, how do I get my GAI to maximize blue in an area instead of maximizing the blue input into its sensor when pointed at that area? We need to essentially solve value loading and understand a bunch of embedded agent issues to really talk about this.

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-24T10:03:04.624Z · score: 2 (1 votes) · LW · GW

There is also overhead to scaling and difficulty aligning goals that they want to avoid. (As above, I think my Ribbonfarm post makes this clear.) Once you get bigger, the only way to ensure alignment is to monitor - trust, but verify. And verification is a large part of why management is so costly - it takes time away from actually doing work, it is pure overhead for the manager, and even then, it's not foolproof.

When you're small, on the other hand, high trust is almost unnecessary, because the entire org is legible, and you can see that everyone is (or isn't) buying in to the goals. In typical startups, people are also well aligned because they all have similar levels of payoff if things go really well.

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-24T09:57:45.294Z · score: 4 (2 votes) · LW · GW

My claim is that *competence* isn't the critical limiting factor in most cases because structure doesn't usually allow decoupling, not that it's not limited. When it IS the limiting factor, I agree with you, but it rarely is. And I think alignment is a different argument.

In EA orgs, alignment can solve the delegation-without-management problem because it can mitigate principal-agent issues. Once we agree on goals, we're working towards them, and we can do so in parallel and coordinate only when needed. In most orgs, alignment cannot accomplish this, because it's hard to get people to personally buy into your goals when those goals are profit maximization for a company. (Instead, you use incentive structures like bonuses to align them. But then you need to monitor them, etc.)

Comment by davidmanheim on Towards optimal play as Villager in a mixed game · 2019-11-20T19:57:06.054Z · score: 6 (1 votes) · LW · GW

On your points about scaling, I mostly agree, but want to note that there are fundamental issues with scaling that I explained in a post here: https://www.ribbonfarm.com/2016/03/17/go-corporate-or-go-home/

The post is rather long. In short, however, I don't think that your Kingdom metaphor works, because large bureaucracies are big *not* because they have many mini-kingdoms doing similar things in parallel, but because they need to specialize and allow cross-functional collaboration, which requires lots of management.

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T16:14:34.796Z · score: 9 (5 votes) · LW · GW

I partly agree, but burden of proof is often the wrong framing for truth seeking.

The article provides strong evidence that ads are ineffective in certain classes of cases, and that fact in turn provides weaker evidence that ads are ineffective more generally. To support Akshat's skepticism that the result generalizes, we'd need evidence or priors that point towards ads being differentially effective depending on the type - targeted keywords vs. brand-ad keywords, and brand presence versus no brand presence.

In the first case, I'm somewhat skeptical that the difference between targeted and brand keywords will be large. My prior for the second difference is that there would be some difference, as Gordon argued in another comment. I don't know of any evidence in either direction, but I haven't looked. (The actual result doesn't matter to me except as an exercise in Bayesian reasoning, but if it matters to you or others, it's plausibly high VoI to search a bit.)

Comment by davidmanheim on Book Review: Design Principles of Biological Circuits · 2019-11-20T08:20:12.413Z · score: 4 (3 votes) · LW · GW

It's not just selection effects on organisms - it's also within organisms. The examples given are NOT fully understood, so (for example) the bacterial transcription network motifs only contain the relationships that we understand. Given that loops and complex connections are harder to detect, that has selected for simplicity.

Given that, I still want to read the book and see the argument more fully.

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T08:12:03.075Z · score: 9 (4 votes) · LW · GW

There was a bubble, and there is also secular growth in the market, with a lot of churn that makes buy-and-hold a fairly bad idea. Those aren't inconsistent. Here's a graphic of the churn. Most of the early companies died.

But if you put all your money in the hot IPOs of Netscape, Yahoo, Lycos, and Excite in 1995, you'd have done very poorly. If you extended this to 1996, you could add Mindspring and Checkpoint (the only one that did well so far; it is up 29x, for a 16% annualized return to date). It took until 1997 to get any long-term fantastic return, with Amazon - which is up 1000x since 1997, or a 37% annual return - fantastic, but if you were prescient, and it was an entire tenth of your portfolio, on average you did just OK. Skipping ahead to 1999-2000, the height of the bubble, here's the list. Nothing made big bucks.

So we can construct a portfolio with 10 stocks, 8 of which went bust, and 2 of which, Checkpoint and Amazon, did well. Your compound 22-year return? 5.25%. (And if you bought an S&P index fund in 1998 at 1,000, you'd have made 6.5% annually.)
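
For anyone who wants to check this kind of arithmetic themselves, here is a minimal sketch of the annualized-return calculation (the multiples and dates above are approximate, so treat the outputs as approximate too):

```python
def annualized_return(multiple, years):
    """Compound annual growth rate implied by a total return multiple over a holding period."""
    return multiple ** (1 / years) - 1

# Roughly matches the figures above: a 29x return over ~23 years is about 16%/year,
# and a 1000x return over ~22 years is about 37%/year.
print(f"{annualized_return(29, 23):.1%}")    # ~15.8%
print(f"{annualized_return(1000, 22):.1%}")  # ~36.9%
```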


Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-20T07:52:18.781Z · score: 4 (2 votes) · LW · GW

Agreed on both points.

Comment by davidmanheim on Aumann Agreement Game · 2019-11-19T15:32:36.355Z · score: 2 (1 votes) · LW · GW

For those interested in running the game, I put together a Python script that pulls questions from a trivia-question API and outputs 2 LaTeX files that can be compiled and used: one for the quizmaster, and one with the corresponding cut-out cards for the players.

https://github.com/davidmanheim/Aumann_Game_Printouts

Pull requests and improvement suggestions are welcome!
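
To give a sense of what such a script involves, here is a minimal sketch (this is not the code from the repo above; it uses the Open Trivia DB API as a stand-in question source, and the LaTeX layout is illustrative only):

```python
import html
import requests

def fetch_questions(n=10):
    """Pull n multiple-choice trivia questions from the Open Trivia DB API."""
    resp = requests.get("https://opentdb.com/api.php", params={"amount": n, "type": "multiple"})
    resp.raise_for_status()
    # Note: trivia text may contain LaTeX special characters; a real script should escape them.
    return [
        {"question": html.unescape(q["question"]),
         "answer": html.unescape(q["correct_answer"])}
        for q in resp.json()["results"]
    ]

def to_latex(questions):
    """Emit a quizmaster sheet and cut-out player cards as two LaTeX documents."""
    master = "\n".join(
        f"\\item {q['question']} \\\\ \\textbf{{Answer: {q['answer']}}}" for q in questions
    )
    cards = "\n\n".join(
        f"\\fbox{{\\parbox{{0.45\\textwidth}}{{{q['question']}}}}}" for q in questions
    )
    wrap = "\\documentclass{article}\\begin{document}%s\\end{document}"
    return wrap % ("\\begin{enumerate}" + master + "\\end{enumerate}"), wrap % cards

if __name__ == "__main__":
    quizmaster_tex, cards_tex = to_latex(fetch_questions())
    with open("quizmaster.tex", "w") as f:
        f.write(quizmaster_tex)
    with open("cards.tex", "w") as f:
        f.write(cards_tex)
```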

Comment by davidmanheim on The new dot com bubble is here: it’s called online advertising · 2019-11-19T05:52:53.303Z · score: 13 (2 votes) · LW · GW
To this I'll just add that this problem is somewhat solvable, but it's tricky.

This is a very important point. I will self-promote and mention my pre-print paper on metric design and avoiding Goodharting (not in the context of AI): https://mpra.ub.uni-muenchen.de/90649/1/MPRA_paper_90649.pdf

Abstract: Metrics are useful for measuring systems and motivating behaviors. Unfortunately, naive application of metrics to a system can distort the system in ways that undermine the original goal. The problem was noted independently by Campbell and Goodhart, and in some forms it is not only common, but unavoidable due to the nature of metrics. There are two distinct but interrelated problems that must be overcome in building better metrics; first, specifying metrics more closely related to the true goals, and second, preventing the recipients from gaming the difference between the reward system and the true goal. This paper describes several approaches to designing metrics, beginning with design considerations and processes, then discussing specific strategies including secrecy, randomization, diversification, and post-hoc specification. Finally, it will discuss important desiderata and the trade-offs involved in each approach.

(Currently working on a rewrite, but feedback on the ideas and anything missing is especially appreciated.)

Comment by davidmanheim on Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management · 2019-11-06T13:23:25.577Z · score: 2 (1 votes) · LW · GW

Yeah, I could see giving the sleeping-pills phenomenon a different name, but (while I am unsure) I think the term is well known in the industry, so I'm sticking with it.

Still, I'm open to better suggestions, but sugar pill / placebo doesn't seem better, since it's much less clear as an analogy.

Comment by davidmanheim on Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management · 2019-11-06T13:21:07.717Z · score: 2 (1 votes) · LW · GW

Absolutely true - they are slightly less related to principal-agent problems, but I can add stories illustrating these, because I have some good ones.

Re: Risk distribution, there's a great game that someone suggested to understand what 100-year floods look like, since people complain they happen every year. Get a room full of, say, 40 people, and give them each 2 dice. Tell them snake-eyes is a flood, then have everyone roll and call out if there was a flood. Then roll and announce again. And again. Yes, many "years" will have no floods, but some will have 2 or even 3 - because the risk isn't highly correlated across areas.
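
A quick simulation of that game (a toy sketch; the per-region flood probability here is the 1-in-36 snake-eyes chance rather than exactly 1-in-100) shows the clustering effect:

```python
import random

def simulate_years(n_regions=40, n_years=20, p_flood=1/36, seed=0):
    """Each region independently floods with probability p_flood each year;
    return the number of regions that flood in each year."""
    rng = random.Random(seed)
    return [sum(rng.random() < p_flood for _ in range(n_regions)) for _ in range(n_years)]

print(simulate_years())
# Typical output: many years with 0 floods, but occasional years with 2 or 3,
# even though each individual region expects a flood far less than once per year.
```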

Re: Threshold of misery, I knew an underwriter who said he would write New York City nuclear terrorism insurance whenever he had the chance, at basically any price. His reasoning? He lives and works in NYC, so if a nuclear bomb goes off, he's dead, and the fact that he lost money won't be what matters. Otherwise, he collects premiums.


Comment by davidmanheim on Prospecting for Conceptual Holes · 2019-10-30T10:38:16.221Z · score: 13 (9 votes) · LW · GW

It's worth noting that these types basically match the Johari Window for types of risks: known knowns, known unknowns, unknown knowns, and unknown unknowns. This is because, at least according to one definition of the term, risks are things we don't expect. Given that definition, a risk is the product of a conceptual hole of some sort - and so the two have a somewhat trivial mapping.

Comment by davidmanheim on All I know is Goodhart · 2019-10-24T10:59:12.269Z · score: 4 (2 votes) · LW · GW

This adversarial issue can be true, and is critical, but I don't think it's what Stuart was pointing to in his post or his reply.

Comment by davidmanheim on All I know is Goodhart · 2019-10-24T10:56:54.922Z · score: 10 (3 votes) · LW · GW

I think you want to differentiate between different mechanisms for Goodhart's law. The categorization that Scott Garrabrant put together, and I worked with him on refining, is here: https://arxiv.org/abs/1803.04585

Given that, I see several different things going on.

First, if I read the post correctly, Stuart is discussing regressional Goodhart, in this case the general issue of what Greg Lewis called "The Tails Come Apart". This occurs whether or not the true value function is known. (As a historical note, this is a broader and, as Scott pointed out, a more fundamentally unavoidable claim than either what Goodhart meant, or what Campbell was referring to.)

Second, there is the potential for divergence between, in your example, "a measurement of my internal sense of beauty" and "my internal sense of beauty itself". That is a second Goodhart effect, which is (at least) a causal one, where repeated queries change the estimates due to psychological biases, etc. In that case, there's also a nasty potential adversarial Goodhart issue, if the AI gets to make the queries and exploits those biases.

Alternatively, if the initial sample of "your internal sense of beauty" is a fixed sample, there is a sampling and inference issue for the preferences for embedded agents - inferring a continuous, potentially unbounded function from a finite sample. That's an important and fundamental issue, but it's only partially about, in this case, extremal Goodhart. It's also a more general issue about inferring preferences, i.e. learning is hard and this is learning.

Comment by davidmanheim on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T06:53:59.169Z · score: 16 (5 votes) · LW · GW

I commented on the thread (after seeing this) in order to add a link to my paper that addresses Bengio's last argument:

@Yoshua Bengio I attempted to formalize this argument somewhat in a recent paper. I don't think the argument there is particularly airtight, but I think it provides a significantly stronger argument for why we should believe that interaction between optimizing systems is fundamentally hard.
https://www.mdpi.com/2504-2289/3/2/21/htm

Paper abstract: "An important challenge for safety in machine learning and artificial intelligence systems is a set of related failures involving specification gaming, reward hacking, fragility to distributional shifts, and Goodhart’s or Campbell’s law. This paper presents additional failure modes for interactions within multi-agent systems that are closely related. These multi-agent failure modes are more complex, more problematic, and less well understood than the single-agent case, and are also already occurring, largely unnoticed. After motivating the discussion with examples from poker-playing artificial intelligence (AI), the paper explains why these failure modes are in some senses unavoidable. Following this, the paper categorizes failure modes, provides definitions, and cites examples for each of the modes: accidental steering, coordination failures, adversarial misalignment, input spoofing and filtering, and goal co-option or direct hacking. The paper then discusses how extant literature on multi-agent AI fails to address these failure modes, and identifies work which may be useful for the mitigation of these failure modes."

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-10-02T10:42:00.773Z · score: 4 (2 votes) · LW · GW

Re: examples of point #1, I don't think that shaming in this forum is productive - it's polarizing and stigmatizing rather than helpful. But I do know of several individuals and a couple of organizations that are guilty of this, each repeatedly.

I do think that people should be more willing to personally / privately respond if someone does something, and I have done so in several specific cases where someone decided on a unilateralist approach that I thought was damaging.

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-09-29T10:08:28.581Z · score: 23 (8 votes) · LW · GW
Even ignoring the above problem, I'm confused why it's valuable to build up a "real tradition" among LW users, given that the wider unilateralist curse problem that our world faces can't possibly be solved by LW users having such a tradition.

A few points.

First, I don't think it's clear that in the Rationalist / EA community, there is enough reinforcement of this, and I routinely see issues with people "going rogue" and unilaterally engaging in activities that others have warned them would be dangerous, net negative, etc.

Second, it's valuable even as an exemplar; we should be able to say that there is such a community, and that they are capable of exercising at least this minimal level of restraint.

Third, I think it's clear that in the next decade the number of people in the Rationalist-sphere who are in actual positions of (relatively significant) power will continue to grow, and we have already seen some such people emerge in government and in the world of NGOs. For AI in particular, there are many people who have significant influence over decisions that could affect humanity's future. Their active (i.e. passive) participation in this seems likely to at least give them a better understanding of what is needed when they are faced with these choices.

Comment by davidmanheim on Divergence on Evidence Due to Differing Priors - A Political Case Study · 2019-09-17T20:28:56.581Z · score: 3 (2 votes) · LW · GW

They don't contradict directly, but they reflect nearly incompatible updates to their world-models based on the same data.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-12T06:54:17.571Z · score: 2 (1 votes) · LW · GW
But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that make easy-to-subvert channels harder to hit) to not happen, or to make it significantly less likely to happen, given that people want to build AIs that do things beside reward hacking

I was mostly noting that I hadn't thought of this, hadn't seen it mentioned, and so my model for returns to non-fundamental alignment AI safety investments didn't previously account for this. Reflecting on that fact now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI-safety.

(Now, some low confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent NAI/Near-human AGI, may be making the Foom! scenario more likely. That means that attention to AI safety that pushes for "safer self-driving cars" and "reducing and mitigating side-effects" is plausibly a net negative if done poorly, instead of being benign.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-11T09:08:49.045Z · score: 2 (0 votes) · LW · GW

My claim here is that superintelligence is a result of training, not a starting condition. Yes, a SAI would do bad things unless robustly aligned, but building the SAI requires it not to wirehead at an earlier stage in the process. My claim is that I am unsure that there is a way to train such a system that was not built with safety in mind such that it gets to a point where it is more likely to gain intelligence than it is to find ways to reward hack - not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-09-05T11:19:25.416Z · score: 3 (2 votes) · LW · GW

This is really helpful - thanks!

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-08-29T09:05:20.527Z · score: 6 (2 votes) · LW · GW

Yes, an unsafe AI cannot be boxed on unsafe hardware, nor can any AI running on physical hardware be made immune to attacks - but those are very different questions. For this question, we first assume that a provably safe AI can be written; I then wanted to ask what language would be needed.

Comment by davidmanheim on Where are people thinking and talking about global coordination for AI safety? · 2019-08-26T10:13:12.607Z · score: 2 (1 votes) · LW · GW

No, that's implicit in the model - either *some* crisis requiring higher capacity than we have will overwhelm us and we'll all die (and it doesn't matter which), or the variance is relatively small so no such event occurs, and/or our capacity to manage risks grows quickly enough that we avoid the upper tail.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-20T11:18:23.249Z · score: 2 (1 votes) · LW · GW

Fully agree - I was using the example to make a far less fundamental point.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:56:54.942Z · score: 2 (1 votes) · LW · GW
2) Calling the issues between the agents because of model differences "terminology issues" could also work well - this may be a little like people talking past each other.

I really like this point. I think it's parallel to the human issue where different models of the world can lead to misinterpretation of the "same" goal. So "terminology issues" would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature and using a wall thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets "temperature" as referring to a different fact than the base optimizer. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we're now calling terminology issues.

I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It's possible that two models disagree on representation, but agree on all object level claims - think of using different coordinate systems. Because terminology issues can cause mistakes, I'd suggest that agents with non-shared world models can only reliably communicate via object-level claims.

The implication for AI alignment might be that we need AI to either fundamentally model the world the same way as humans, or need to communicate only via object-level goals and constraints.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:44:17.623Z · score: 2 (2 votes) · LW · GW
It seems like there's a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same score (and that isn't a perfect score), because it solved them the way I solve them, that doesn't sound like misalignment.

The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That's why I'm calling it a principal alignment failure - even though it's the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:22:34.816Z · score: 4 (2 votes) · LW · GW

Separate from my other comment, I want to question your assumption that we must worry about an AI-takeoff that is exponentially better than humans at everything, so that a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least), and while it is unlikely that the overall limits are coincidentally the same as or near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, instead of perfectly aligning the AI to our goals.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:15:33.897Z · score: 10 (5 votes) · LW · GW

I'll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.

First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor's Quantilizers paper, which you note; it shows a way to build systems that optimize but are not nearly as subject to Goodhart's curse, and I think you are too dismissive of it. Similarly, you are dismissive of optimizing the target directly. I think that the epistemological issues you point to can be mitigated to the extent that they won't cause misalignment between reality and an AI's representation of that reality. Once that is done, the remaining issue is aligning the "true" goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way.
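
As a rough illustration of the quantilizer idea (my own paraphrase of the construction, not code from the paper): instead of taking the argmax of a proxy utility, sample from the top q fraction (by base probability mass) of a trusted base distribution over actions, ranked by the proxy.

```python
import random

def quantilize(actions, base_prob, proxy_utility, q=0.1, rng=None):
    """Sample from the base distribution, conditioned on the action being in the
    top-q quantile of proxy utility (by base probability mass), instead of argmax."""
    rng = rng or random.Random(0)
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base_prob[a]
        if mass >= q:
            break
    weights = [base_prob[a] for a in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy usage: 100 actions, a uniform base distribution, and a proxy utility equal to the
# action index. An argmax would always pick action 99; the 10% quantilizer samples among
# 90-99, which limits how hard any errors in the proxy at the extreme can be exploited.
actions = list(range(100))
base = {a: 1 / 100 for a in actions}
print(quantilize(actions, base, proxy_utility=lambda a: a, q=0.1))
```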

Second, you note that you don't think we will solve alignment. I agree, because I think that "alignment" presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment is impossible. This, however, allows a very different approach: a system that optimizes only when it finds Pareto improvements across a set of sub-alignment metrics or goals, to constrain the possibility of runaway optimization. Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are all aligned with some human goals, so that the system will be limited in its ability to be misaligned.
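
A minimal sketch of that constraint (the metric names are hypothetical; the point is just that a candidate change is accepted only if no sub-metric gets worse and at least one improves):

```python
def is_pareto_improvement(old_scores, new_scores):
    """Accept a change only if no sub-alignment metric gets worse and at least one improves."""
    no_worse = all(new >= old for old, new in zip(old_scores, new_scores))
    some_better = any(new > old for old, new in zip(old_scores, new_scores))
    return no_worse and some_better

# e.g. scores on (proxy_goal, low_side_effects, human_approval)
print(is_pareto_improvement((0.5, 0.9, 0.7), (0.6, 0.9, 0.7)))  # True
print(is_pareto_improvement((0.5, 0.9, 0.7), (0.9, 0.4, 0.7)))  # False: runaway on one metric
```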

Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that the system can reliably perform limited tasks ("put a strawberry on a plate without producing any catastrophic side-effects").

Comment by davidmanheim on Intertheoretic utility comparison: examples · 2019-08-12T01:34:36.089Z · score: 2 (1 votes) · LW · GW

This is very interesting - I hadn't thought about utility aggregation for a single agent before, but it seems clearly important now that it has been pointed out.

I'm thinking about this in the context of both the human brain as an amalgamation of sub-agents, and organizations as amalgamations of individuals. Note that we can treat organizations as rationally maximizing some utility function in the same way we can treat individuals as doing so - but I think that for many or most voting or decision structures, we should be able to rule out the claim that they are following any weighted combination of the normalized utilities of the agents involved in the system, under any intertheoretic comparison. This seems like a useful result if we can prove it. (Alternatively, it may be that certain decision rules map to specific intertheoretic comparison rules, which would be even more interesting.)

Comment by davidmanheim on How can we respond to info-cascades? [Info-cascade series] · 2019-08-08T15:04:09.761Z · score: 8 (2 votes) · LW · GW

1) It's neither noise nor rapid increase - it's delayed feedback. Control theorists in engineering have a really clear, basic result that delayed feedback is really, really bad in various ways. There are entire books on how to handle it well - https://books.google.ch/books?id=Cy_wCAAAQBAJ&pg=PR9&lpg=PR9 - but doing it without these more complex techniques is bad. (The toy simulation at the end of this comment illustrates the effect.)

2) You either hire a control theorist, or (more practically) you avoid the current feedback mechanism, and instead get people on the phone to talk about and understand what everyone needs, as opposed to relying on their delayed feedback in the form of numeric orders.
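
A toy discrete-time sketch of point 1 (my own illustration, not a technique from the book linked above): the same proportional correction that converges smoothly with immediate feedback starts to ring, and then diverge, once it acts on observations that are a few periods stale.

```python
def simulate(delay, gain=0.5, steps=60, target=100.0, start=80.0):
    """Each step, the correction is based on the inventory observed `delay` periods ago."""
    inventory = [start] * (delay + 1)  # history; most recent value last
    for _ in range(steps):
        observed = inventory[-(delay + 1)]        # stale observation
        correction = gain * (target - observed)   # proportional response to the gap
        inventory.append(inventory[-1] + correction)
    return inventory

for d in (0, 2, 4):
    trajectory = simulate(d)
    worst_late_gap = max(abs(x - target) for x in trajectory[-10:])
    print(f"delay={d}: largest gap from target over the last 10 steps = {worst_late_gap:.2f}")
# delay=0 settles on the target; delay=2 converges but with ringing;
# delay=4 oscillates with growing amplitude.
```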

Comment by davidmanheim on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-08-08T15:00:22.759Z · score: 2 (1 votes) · LW · GW

I think I agree that Eliezer's definition is theoretically correct, but I definitely agree that I need to understand this better.

Comment by davidmanheim on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-08-07T07:35:29.483Z · score: 8 (3 votes) · LW · GW

Thanks for the feedback. I agree that in a control system, any divergence between intent and outcome is an alignment issue, and I agree that this makes overoptimization different in control versus selection. Despite the conceptual confusion, I definitely think the connections are worth noting - not only "wireheading," but the issues with mesa-optimizers. And I definitely think that causal failures are particularly important in this context.

But I strongly endorse how weak and fuzzy this is - which is a large part of why I wanted to try to de-confuse myself. That's the goal of this mini-sequence, and I hope that doing so publicly in this way at least highlights where the confusion is, even if I can't successfully de-confuse myself, much less others. And if there are places where others are materially less confused than me and/or you, I'd love for them to write responses or their own explainers on this.

Comment by davidmanheim on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-07-30T14:43:08.167Z · score: 2 (1 votes) · LW · GW

A couple points.

First, the reason I wasn't happy with entropy as a metric is that it doesn't allow (straightforward) comparison of different types of optimization, as I discussed. The entropy of a probability distribution output isn't comparable to the entropy over states that Eliezer defines, for example.

Second, I'm not sure false positives and false negatives are the right conceptual tools here. I can easily show examples of each - gradient descent can fail horribly in many ways, and lucky starting parameters on specific distributions can lead to unreasonably rapid convergence - but in both cases, it's a relationship between the algorithm and the space being optimized.

Comment by davidmanheim on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-07-30T05:02:49.930Z · score: 4 (2 votes) · LW · GW

As always, constructive criticism on whether I'm still confused, or whether the points I'm making are clear, is welcome!

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-15T15:29:04.163Z · score: 2 (1 votes) · LW · GW

Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we've assumed away a number of key problems - among the most critical of which seem to be:

We have an assumed notion of optimality, and I think an implicit assumption that the optimal point is unique, which seems to be needed to define reward - Abram Demski has noted in another post that this is very problematic.

We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent's indifference. (As an aside, I'm unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we're computing.) If we can compute another agent's reward function when designing our agent, however, we can plausibly hijack that agent.

We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That's a critical issue I'm still trying to wrap my head around, because it's unclear to me how a system can reason in those cases.

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-14T11:50:28.267Z · score: 2 (1 votes) · LW · GW

The way the agents interact across interruptions seems to exactly parallel interactions between agents where we design for correct behavior for agents separately, and despite this, agents can corrupt the overall design by hijacking other agents. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go quite a ways towards solving the generalized issues with Goodhart-like exploitation between agents.

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-10T08:09:37.142Z · score: 2 (1 votes) · LW · GW

This seems like an important issue, but given the example, I'm also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and the example you use is assumed to be "obvious" and so a patch would be obviously needed. This seems unclear - we need to understand the strategic motives to diagnose what is happening, and given that we don't have good ideas for explainability, I'm unsure how in general to notice these effects to allow patching. (I have been working on this and thinking about it a bit, and don't currently have good ideas.)

Comment by davidmanheim on Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) · 2019-07-07T10:38:58.193Z · score: 2 (1 votes) · LW · GW

Note: as I mentioned in the post, I'd love feedback of all types - for better clarity, criticism of my understanding, suggestions for how to continue, and places the discussion still seems confused.

Comment by davidmanheim on Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) · 2019-07-07T10:37:53.965Z · score: 2 (1 votes) · LW · GW

I like that intuition overall, but there's a sense in which adaptive search gives far more resolution than grid search, so the analysis seems wrong: if I use gradient descent, my eventual accuracy is roughly the step size near the end, which gives far more precise answers than "only" checking a grid of M^N points spaced equally.
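
To make that concrete (a toy sketch of my own, not tied to anything in the original post): on a simple one-dimensional problem, a 10-point grid can't do better than roughly half the grid spacing, while gradient descent keeps refining the answer and ends up many orders of magnitude closer.

```python
f = lambda x: (x - 0.7213) ** 2      # minimum at x* = 0.7213
grad = lambda x: 2 * (x - 0.7213)

# Grid search: M = 10 equally spaced points on [0, 1]
grid = [i / 9 for i in range(10)]
x_grid = min(grid, key=f)

# Gradient descent with a modest fixed step size
x = 0.0
for _ in range(100):
    x -= 0.1 * grad(x)

print(abs(x_grid - 0.7213))  # ~0.05 -- limited by the grid spacing
print(abs(x - 0.7213))       # ~1e-10 -- many orders of magnitude more precise
```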

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-07-04T08:09:35.623Z · score: 2 (1 votes) · LW · GW

Pretty sure we're agreeing here. I was originally just supporting cousin_it's claim, not claiming that Nash Equilibria are a useful-enough solution concept. I was simply noting that - while they are weaker than a useful-enough concept would be - they can show the issue with non-uniqueness clearly.

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-07-02T09:54:16.843Z · score: 4 (2 votes) · LW · GW

Agreed that it's insufficient, but I think it shows that there's no way to specify strategies that work regardless of other players' strategies, and that this generalizes to better solution concepts, which, I agree, "make the equilibrium selection problem worse".

I'd also point out an oft-noted critical failure of Nash equilibria, which is that they assume infinite computation, and (therefore) no logical uncertainty. A game can pay out the seventeenth digit of BB(200) to player 1 and the eighteenth digit to player 2, and we must assume these are known and can be used to find the NE. I haven't thought the following through completely, but it seems obvious that this issue can be used to show why NE is not generally a useful/valid solution concept for embedded agents, because they would need models of themselves and of other agents their own size to predict goals / strategies.

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-06-30T10:24:54.295Z · score: 4 (2 votes) · LW · GW

I think it might be worth noting that there's a trivial no-free-lunch theorem we can state about multiplayer games that can formalize your intuition.

In (at least a large class of) cases where there are multiple Nash equilibria, if different players aim for different equilibria, the best strategy depends on the strategy of the player you face. I think that's all we need to say to show there is no best player.
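
A concrete toy version of the claim (a hypothetical 2x2 coordination game, not from the original post): with two pure-strategy Nash equilibria, the payoff-maximizing choice depends entirely on which equilibrium the opponent is aiming for, so no single strategy is best against all opponents.

```python
# Symmetric coordination game: both pick "A" -> 2 each, both pick "B" -> 1 each, mismatch -> 0.
payoff = {("A", "A"): 2, ("B", "B"): 1, ("A", "B"): 0, ("B", "A"): 0}

def best_response(opponent_move):
    return max(("A", "B"), key=lambda my_move: payoff[(my_move, opponent_move)])

print(best_response("A"))  # "A" -- best if the opponent aims for the (A, A) equilibrium
print(best_response("B"))  # "B" -- best if the opponent aims for the (B, B) equilibrium
```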

Comment by davidmanheim on A case for strategy research: what it is and why we need more of it · 2019-06-23T10:21:28.046Z · score: 2 (1 votes) · LW · GW

Glad to hear that you aren't recommending strategy research in general - because that's what it looked like.

And yes, I think it's incredibly hard to make sure we're not putting effort into projects with negative expected value, and I think that attention hazards are critical, and are the biggest place where strategy research has the potential to increase risks rather than ameliorate them. (Which is exactly why I'm confused that anyone would suggest that more such research should be done publicly and/or shared. And it's why I don't think that a more detailed object-level discussion makes sense here, in public.)