Divergence on Evidence Due to Differing Priors - A Political Case Study 2019-09-16T11:01:11.341Z · score: 27 (11 votes)
Hackable Rewards as a Safety Valve? 2019-09-10T10:33:40.238Z · score: 18 (5 votes)
What Programming Language Characteristics Would Allow Provably Safe AI? 2019-08-28T10:46:32.643Z · score: 5 (5 votes)
Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) 2019-08-12T08:07:01.769Z · score: 17 (9 votes)
Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) 2019-07-28T09:32:25.878Z · score: 19 (6 votes)
What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) 2019-07-28T09:30:29.792Z · score: 29 (6 votes)
Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) 2019-07-02T15:36:51.071Z · score: 27 (5 votes)
Schelling Fences versus Marginal Thinking 2019-05-22T10:22:32.213Z · score: 23 (14 votes)
Values Weren't Complex, Once. 2018-11-25T09:17:02.207Z · score: 34 (15 votes)
Oversight of Unsafe Systems via Dynamic Safety Envelopes 2018-11-23T08:37:30.401Z · score: 11 (5 votes)
Collaboration-by-Design versus Emergent Collaboration 2018-11-18T07:22:16.340Z · score: 12 (3 votes)
Multi-Agent Overoptimization, and Embedded Agent World Models 2018-11-08T20:33:00.499Z · score: 9 (4 votes)
Policy Beats Morality 2018-10-17T06:39:40.398Z · score: 15 (15 votes)
(Some?) Possible Multi-Agent Goodhart Interactions 2018-09-22T17:48:22.356Z · score: 21 (5 votes)
Lotuses and Loot Boxes 2018-05-17T00:21:12.583Z · score: 27 (6 votes)
Non-Adversarial Goodhart and AI Risks 2018-03-27T01:39:30.539Z · score: 64 (14 votes)
Evidence as Rhetoric — Normative or Positive? 2017-12-06T17:38:05.033Z · score: 1 (1 votes)
A Short Explanation of Blame and Causation 2017-09-18T17:43:34.571Z · score: 1 (1 votes)
Prescientific Organizational Theory (Ribbonfarm) 2017-02-22T23:00:41.273Z · score: 3 (4 votes)
A Quick Confidence Heuristic; Implicitly Leveraging "The Wisdom of Crowds" 2017-02-10T00:54:41.394Z · score: 1 (2 votes)
Most empirical questions are unresolveable; The good, the bad, and the appropriately under-powered 2017-01-23T20:35:29.054Z · score: 3 (4 votes)
A Cruciverbalist’s Introduction to Bayesian reasoning 2017-01-12T20:43:48.928Z · score: 1 (2 votes)
Map:Territory::Uncertainty::Randomness – but that doesn’t matter, value of information does. 2016-01-22T19:12:17.946Z · score: 6 (11 votes)
Meetup : Finding Effective Altruism with Biased Inputs on Options - LA Rationality Weekly Meetup 2016-01-14T05:31:20.472Z · score: 1 (2 votes)
Perceptual Entropy and Frozen Estimates 2015-06-03T19:27:31.074Z · score: 10 (11 votes)
Meetup : Complex problems, limited information, and rationality; How should we make decisions in real life? 2013-10-09T21:44:19.773Z · score: 3 (4 votes)
Meetup : Group Decision Making (the good, the bad, and the confusion of welfare economics) 2013-04-30T16:18:04.955Z · score: 4 (5 votes)


Comment by davidmanheim on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T06:53:59.169Z · score: 16 (5 votes) · LW · GW

I commented on the thread (after seeing this) in order to add a link to my paper that addresses Bengio's last argument:

@Yoshua Bengio I attempted to formalize this argument somewhat in a recent paper. I don't think the argument there is particularly airtight, but I think it provides a significantly stronger argument for why we should believe that interaction between optimizing systems is fundamentally hard.

Paper abstract: "An important challenge for safety in machine learning and artificial intelligence systems is a set of related failures involving specification gaming, reward hacking, fragility to distributional shifts, and Goodhart’s or Campbell’s law. This paper presents additional failure modes for interactions within multi-agent systems that are closely related. These multi-agent failure modes are more complex, more problematic, and less well understood than the single-agent case, and are also already occurring, largely unnoticed. After motivating the discussion with examples from poker-playing artificial intelligence (AI), the paper explains why these failure modes are in some senses unavoidable. Following this, the paper categorizes failure modes, provides definitions, and cites examples for each of the modes: accidental steering, coordination failures, adversarial misalignment, input spoofing and filtering, and goal co-option or direct hacking. The paper then discusses how extant literature on multi-agent AI fails to address these failure modes, and identifies work which may be useful for the mitigation of these failure modes."

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-10-02T10:42:00.773Z · score: 4 (2 votes) · LW · GW

Re: examples of point #1, I don't think that shaming in this forum is productive - it's polarizing and stigmatizing rather than helpful. But I do know of several individuals and a couple organizations which are guilty of this, each repeatedly.

I do think that people should be more willing to personally / privately respond if someone does something, and I have done so in several specific cases where someone decided on a unilateralist approach that I thought was damaging.

Comment by davidmanheim on Follow-Up to Petrov Day, 2019 · 2019-09-29T10:08:28.581Z · score: 23 (8 votes) · LW · GW
Even ignoring the above problem, I'm confused why it's valuable to build up a "real tradition" among LW users, given that the wider unilateralist curse problem that our world faces can't possibly be solved by LW users having such a tradition.

A few points.

First, I don't think it's clear that in the Rationalist / EA community, there is enough reinforcement of this, and I routinely see issues with people "going rogue" and unilaterally engaging in activities that others have warned them would be dangerous, net negative, etc.

Second, it's valuable even as an exemplar; we should be able to say that there is such a community, and that they are capable of exercising at least this minimal level of restraint.

Third, I think it's clear that in the next decade the number of people in the Rationalist-sphere who are in actual positions of (relatively significant) power will continue to grow, and we have already seen some such people emerge in government and in the world of NGOs. For AI in particular, there are many people who have significant influence in making decisions that could significantly affect humanity's future. Their active (or even passive) participation in this seems likely to at least give them a better understanding of what is needed when they are faced with these choices.

Comment by davidmanheim on Divergence on Evidence Due to Differing Priors - A Political Case Study · 2019-09-17T20:28:56.581Z · score: 3 (2 votes) · LW · GW

They don't contradict directly, but they reflect nearly incompatible updates to their world-models based on the same data.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-12T06:54:17.571Z · score: 2 (1 votes) · LW · GW
But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that make easy-to-subvert channels harder to hit) to not happen, or to make it significantly less likely to happen, given that people want to build AIs that do things beside reward hacking

I was mostly noting that I hadn't thought of this, hadn't seen it mentioned, and so my model for returns to non-fundamental alignment AI safety investments didn't previously account for this. Reflecting on that fact now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI-safety.

(Now, some low confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent NAI/Near-human AGI, may be making the Foom! scenario more likely. That means that attention to AI safety that pushes for "safer self-driving cars" and "reducing and mitigating side-effects" is plausibly a net negative if done poorly, instead of being benign.

Comment by davidmanheim on Hackable Rewards as a Safety Valve? · 2019-09-11T09:08:49.045Z · score: 2 (0 votes) · LW · GW

My claim here is that superintelligence is a result of training, not a starting condition. Yes, a SAI would do bad things unless robustly aligned, but building the SAI requires it not to wirehead at an earlier stage in the process. My claim is that I am unsure that there is a way to train such a system that was not built with safety in mind such that it gets to a point where it is more likely to gain intelligence than it is to find ways to reward hack - not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-09-05T11:19:25.416Z · score: 3 (2 votes) · LW · GW

This is really helpful - thanks!

Comment by davidmanheim on What Programming Language Characteristics Would Allow Provably Safe AI? · 2019-08-29T09:05:20.527Z · score: 6 (2 votes) · LW · GW

Yes, an unsafe AI cannot be boxed on unsafe hardware, nor can any AI running on physical hardware be made immune to attacks - but those are very different questions. For this question, we first assume that a provably safe AI can be written, and then ask what language would be needed.

Comment by davidmanheim on Where are people thinking and talking about global coordination for AI safety? · 2019-08-26T10:13:12.607Z · score: 2 (1 votes) · LW · GW

No, that's implicit in the model - either *some* crisis requiring higher capacity than we have will overwhelm us and we'll all die (and it doesn't matter which), or the variance is relatively small so no such event occurs, and/or our capacity to manage risks grows quickly enough that we avoid the upper tail.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-20T11:18:23.249Z · score: 2 (1 votes) · LW · GW

Fully agree - I was using the example to make a far less fundamental point.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:56:54.942Z · score: 2 (1 votes) · LW · GW
2) Calling the issues between the agents because of model differences "terminology issues" could also work well - this may be a little like people talking past each other.

I really like this point. I think it's parallel to the human issue where different models of the world can lead to misinterpretation of the "same" goal. So "terminology issues" would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature using a wall-thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets "temperature" as referring to a different fact than the base optimizer. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we're now calling terminology issues.

I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It's possible that two models disagree on representation, but agree on all object level claims - think of using different coordinate systems. Because terminology issues can cause mistakes, I'd suggest that agents with non-shared world models can only reliably communicate via object-level claims.

The implication for AI alignment might be that we need AI to either fundamentally model the world the same way as humans, or need to communicate only via object-level goals and constraints.

Comment by davidmanheim on Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) · 2019-08-19T09:44:17.623Z · score: 2 (2 votes) · LW · GW
It seems like there's a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same score (and that isn't a perfect score), because it solved them the way I solve them, that doesn't sound like misalignment.

The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That's why I'm calling it a principal alignment failure - even though it's the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:22:34.816Z · score: 4 (2 votes) · LW · GW

Separate from my other comment, I want to question your assumption that we must worry about an AI-takeoff that is exponentially better than humans at everything, so that a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least) and while it is unlikely that the overall limits are coincidentally the same as / near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, instead of perfectly align the AI to our goals.

Comment by davidmanheim on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T09:15:33.897Z · score: 10 (5 votes) · LW · GW

I'll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.

First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor's Quantilizers paper, which you note; it shows a way to build systems that optimize but are not nearly as subject to Goodhart's curse, and I think you are too dismissive of it. Similarly, you are dismissive of optimizing the target directly. I think that the epistemological issues you point to can be mitigated to the extent that they won't cause misalignment between reality and an AI's representation of that reality. Once that is done, the remaining issue is aligning "true" goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way.
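For concreteness, here is a minimal sketch of the quantilizer idea as I understand it. This is my own toy construction, not Taylor's formal version (which samples from a designated base distribution conditioned on the top quantile; here the base distribution is just uniform over a finite action set):

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Pick uniformly among the top q-fraction of actions by proxy utility,
    instead of taking the argmax (a toy version of the quantilizer idea)."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

random.seed(0)
# With 100 actions and q=0.1, the pick is some action in the top decile,
# rather than necessarily the single proxy-maximizing action.
pick = quantilize(range(100), lambda a: a, q=0.1)
```

The point of the design is that a mildly-good action is chosen instead of the extreme proxy-maximizer, which is exactly where Goodhart's curse bites hardest.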

Second, you note that you don't think we will solve alignment. I agree, because I think that "alignment" presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment is impossible. This, however, allows a very different approach: a system that optimizes only when it finds Pareto improvements across a set of sub-alignment metrics or goals, constraining the possibility of runaway optimization. Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are all aligned with some human goals, so that the system will be limited in its ability to be misaligned.
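A minimal sketch of what such a Pareto-constrained optimizer might look like (my own illustrative construction, not a worked-out proposal):

```python
def pareto_improves(old, new):
    """True iff new is at least as good on every metric and better on one."""
    return all(n >= o for n, o in zip(new, old)) and any(
        n > o for n, o in zip(new, old)
    )

def constrained_step(state, candidates, metrics):
    current = [m(state) for m in metrics]
    for cand in candidates:
        if pareto_improves(current, [m(cand) for m in metrics]):
            return cand  # accept only a strict Pareto improvement
    return state  # otherwise, refuse to move

# Demo: with metrics (x, y), the candidate (1, -1) trades off y and is
# rejected; (1, 1) improves both metrics and is accepted.
metrics = [lambda s: s[0], lambda s: s[1]]
step = constrained_step((0, 0), [(1, -1), (1, 1)], metrics)
```

The design choice being illustrated: no single metric can be pushed at the expense of the others, so runaway optimization along one proxy is blocked, at the cost of the system sometimes doing nothing.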

Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that a system can reliably perform limited tasks (“put a strawberry on a plate without producing any catastrophic side-effects.”)

Comment by davidmanheim on Intertheoretic utility comparison: examples · 2019-08-12T01:34:36.089Z · score: 2 (1 votes) · LW · GW

This is very interesting - I hadn't thought about utility aggregation for a single agent before, but it seems clearly important now that it has been pointed out.

I'm thinking about this in the context of both the human brain as an amalgamation of sub-agents, and organizations as an amalgamation of individuals. Note that we can treat organizations as rationally maximizing some utility function in the same way we can treat individuals as doing so - but I think that for many or most voting or decision structures, we should be able to rule out the claim that they are following any weighted combination of normalized utilities of the agents involved in the system using any intertheoretic comparison. This seems like a useful result if we can prove it. (Alternatively, it may be that certain decision rules map to specific intertheoretic comparison rules, which would be even more interesting.)

Comment by davidmanheim on How can we respond to info-cascades? [Info-cascade series] · 2019-08-08T15:04:09.761Z · score: 2 (1 votes) · LW · GW

1) It's neither noise nor rapid increase - it's delayed feedback. Control theorists in engineering have this as a really clear, basic result: delayed feedback is really, really bad in various ways. There are entire books on how to do it well, but doing it without using these more complex techniques is bad.

2) You either hire a control theorist, or (more practically) you avoid the current feedback mechanism, and instead get people on the phone to talk about and understand what everyone needs, as opposed to relying on their delayed feedback in the form of numeric orders.
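To illustrate the delayed-feedback point with a toy sketch (my own construction, not from the control-theory literature alluded to above): a simple proportional controller that converges quickly on fresh readings becomes unstable when it only sees stale ones.

```python
def simulate(delay, gain=0.8, steps=60):
    """Proportional controller chasing a setpoint of 1.0, where the
    controller only observes the state from `delay` steps ago."""
    history = [0.0] * (delay + 1)  # observed state values, oldest first
    x = 0.0
    setpoint = 1.0
    for _ in range(steps):
        observed = history[-(delay + 1)]  # a stale reading of x
        x += gain * (setpoint - observed)
        history.append(x)
    return history

no_delay = simulate(delay=0)  # converges smoothly to the setpoint
delayed = simulate(delay=4)   # same gain overshoots and oscillates
```

Even this crude model shows the standard result: the gain that is fine with immediate feedback overcorrects badly once readings lag by a few steps, because the controller keeps pushing long after the state has already passed the target.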

Comment by davidmanheim on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-08-08T15:00:22.759Z · score: 2 (1 votes) · LW · GW

I think I agree that Eliezer's definition is theoretically correct, but I definitely agree that I need to understand this better.

Comment by davidmanheim on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-08-07T07:35:29.483Z · score: 8 (3 votes) · LW · GW

Thanks for the feedback. I agree that in a control system, any divergence between intent and outcome is an alignment issue, and I agree that this makes overoptimization different in control versus selection. Despite the conceptual confusion, I definitely think the connections are worth noting - not only "wireheading," but the issues with mesa-optimizers. And I definitely think that causal failures are important particularly in this context.

But I strongly endorse how weak and fuzzy this is - which is a large part of why I wanted to try to de-confuse myself. That's the goal of this mini-sequence, and I hope that doing so publicly in this way at least highlights where the confusion is, even if I can't successfully de-confuse myself, much less others. And if there are places where others are materially less confused than me and/or you, I'd love for them to write responses or their own explainers on this.

Comment by davidmanheim on What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) · 2019-07-30T14:43:08.167Z · score: 2 (1 votes) · LW · GW

A couple points.

First, the reason why I wasn't happy with entropy as a metric is because it doesn't allow (straightforward) comparison of different types of optimization, as I discussed. Entropy of a probability distribution output isn't comparable to the entropy over states that Eliezer defines, for example.

Second, I'm not sure false positive and false negative are the right conceptual tools here. I can easily show examples of each - gradient descent can fail horribly in many ways, and luck of specific starting parameters on specific distributions can lead to unreasonably rapid convergence, but in both cases, it's a relationship between the algorithm and the space being optimized.

Comment by davidmanheim on Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) · 2019-07-30T05:02:49.930Z · score: 4 (2 votes) · LW · GW

As always, constructive criticism on whether I'm still confused, or whether the points I'm making are clear, is welcome!

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-15T15:29:04.163Z · score: 2 (1 votes) · LW · GW

Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we've assumed away a number of key problems - among the most critical of which seem to be:

We have an assumed notion of optimality, and I think an implicit assumption that the optimal point is unique, which seems to be needed to define reward - Abram Demski has noted in another post that this is very problematic.

We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent's indifference. (As an aside, I'm unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we're computing.) If we can compute another agent's reward function when designing our agent, however, we can plausibly hijack that agent.

We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That's a critical issue I'm still trying to wrap my head around, because it's unclear to me how a system can reason in those cases.

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-14T11:50:28.267Z · score: 2 (1 votes) · LW · GW

The way the agents interact across interruptions seems to exactly parallel interactions between agents where we design for correct behavior for agents separately, and despite this, agents can corrupt the overall design by hijacking other agents. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go quite a ways towards solving the generalized issues with Goodhart-like exploitation between agents.

Comment by davidmanheim on Indifference: multiple changes, multiple agents · 2019-07-10T08:09:37.142Z · score: 2 (1 votes) · LW · GW

This seems like an important issue, but given the example, I'm also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and the example you use is assumed to be "obvious" and so a patch would be obviously needed. This seems unclear - we need to understand the strategic motives to diagnose what is happening, and given that we don't have good ideas for explainability, I'm unsure how in general to notice these effects to allow patching. (I have been working on this and thinking about it a bit, and don't currently have good ideas.)

Comment by davidmanheim on Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) · 2019-07-07T10:38:58.193Z · score: 2 (1 votes) · LW · GW

Note: as I mentioned in the post, I'd love feedback of all types - for better clarity, criticism of my understanding, suggestions for how to continue, and places the discussion still seems confused.

Comment by davidmanheim on Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) · 2019-07-07T10:37:53.965Z · score: 2 (1 votes) · LW · GW

I like that intuition overall, but there's a sense in which adaptive search gives far more resolution than grid search, so the analysis seems wrong: if I use gradient descent, my eventual accuracy is the step size near the end, which gives far more precise answers than "only" checking a grid of M^N points spaced equally.
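To make the resolution claim concrete, a toy comparison (my own sketch, with an arbitrary target of 0.7312): an 11-point grid on [0, 1] can't resolve finer than its 0.1 spacing, while gradient descent on the same objective ends up far closer than any grid point.

```python
def grid_search(f, points):
    """Evaluate f on an evenly spaced grid over [0, 1]; return the best x."""
    xs = [i / (points - 1) for i in range(points)]
    return min(xs, key=f)

def gradient_descent(f, x=0.0, lr=0.1, steps=200):
    """Plain gradient descent with a numerical (central-difference) gradient."""
    eps = 1e-6
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)
        x -= lr * grad
    return x

target = 0.7312
f = lambda x: (x - target) ** 2
x_grid = grid_search(f, points=11)  # error bounded below by grid spacing
x_gd = gradient_descent(f)          # error shrinks geometrically per step
```

On this (deliberately easy, convex) objective the grid answer is stuck at 0.7, while gradient descent closes in on the target to within numerical precision; the comparison is only meant to illustrate the resolution asymmetry, not search difficulty in general.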

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-07-04T08:09:35.623Z · score: 2 (1 votes) · LW · GW

Pretty sure we're agreeing here. I was originally just supporting cousin_it's claim, not claiming that Nash Equilibria are a useful-enough solution concept. I was simply noting that - while they are weaker than a useful-enough concept would be - they can show the issue with non-uniqueness clearly.

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-07-02T09:54:16.843Z · score: 4 (2 votes) · LW · GW

Agreed that it's insufficient, but I think it shows that there's no way to specify strategies that work regardless of other players' strategies, and I agree that this generalizes to better solution concepts, which I agree "make the equilibrium selection problem worse".

I'd also point out an oft-noted critical failure of Nash Equilibria, which is that they assume infinite computation, and (therefore) no logical uncertainty. A game can pay out the seventeenth digit of BB(200) to player 1 and the eighteenth digit to player 2, and we must assume these are known, and can be used to find the NE. I haven't thought the following through completely, but it seems obvious that this issue can be used to show why NE is not generally a useful/valid solution concept for embedded agents, because they would need models of themselves and other agents their own size to predict goals / strategies.

Comment by davidmanheim on Conceptual Problems with UDT and Policy Selection · 2019-06-30T10:24:54.295Z · score: 4 (2 votes) · LW · GW

I think it might be worth noting that there's a trivial no-free-lunch theorem we can state about multiplayer games that can formalize your intuition.

In (at least a large class of) cases where there are multiple Nash equilibria, if different players aim for different equilibria, the best strategy depends on the strategy of the player you face. I think that's all we need to say to show there is no best player.
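A minimal concrete instance (my own example, not from the original comment) is a pure coordination game with two Nash equilibria, (A, A) and (B, B): both players get 1 for matching and 0 for mismatching, so no strategy is best independent of the opponent.

```python
# Payoff to the row player in a 2x2 pure coordination game.
payoff = {("A", "A"): 1, ("B", "B"): 1, ("A", "B"): 0, ("B", "A"): 0}

def best_response(opponent_action):
    """The row player's best reply given the opponent's action."""
    return max(["A", "B"], key=lambda a: payoff[(a, opponent_action)])

# Against an opponent aiming for the (A, A) equilibrium, play A; against
# one aiming for (B, B), play B - so "best" depends on who you face.
```

Any candidate "best player" must commit to some policy, and an opponent aiming at the other equilibrium makes that policy lose to the alternative, which is the no-free-lunch flavor of the claim.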

Comment by davidmanheim on A case for strategy research: what it is and why we need more of it · 2019-06-23T10:21:28.046Z · score: 2 (1 votes) · LW · GW

Glad to hear that you aren't recommending strategy research in general - because that's what it looked like.

And yes, I think it's incredibly hard to make sure we're not putting effort into efforts with negative expected value, and I think that attention hazards are critical, and are the biggest place where I think strategy research has the potential to increase risks rather than ameliorate them. (Which is exactly why I'm confused that anyone would suggest that more such research should be done publicly and/or shared. And it's why I don't think that a more detailed object level discussion makes sense here, in public.)

Comment by davidmanheim on A case for strategy research: what it is and why we need more of it · 2019-06-23T10:07:53.053Z · score: 2 (1 votes) · LW · GW

No, it just means you need an actual system model which is at least somewhat predictive in order to make decisions, and therefore to have a better grasp on the expected value of your investments than "let's try something, who knows, let's just take risks."

Comment by davidmanheim on A case for strategy research: what it is and why we need more of it · 2019-06-21T13:38:27.036Z · score: 3 (3 votes) · LW · GW
the more money you have, the higher the variance on weird projects you should be funding.

Only if you're sure the mean is positive - and there's no reason to think that. In fact, it's arguable that in a complex system, a priori, we should consider significant changes destabilizing and significantly net negative unless we have reason to think otherwise.

Comment by davidmanheim on A case for strategy research: what it is and why we need more of it · 2019-06-21T13:11:51.231Z · score: 6 (1 votes) · LW · GW

I'm very confused why you think that such research should be done publicly, and why you seem to think it's not being done privately.

Also, regarding the following:

Strategy research would not be valuable if it was completely intractable. We believe some actors and attempts at strategy research can succeed, but it is hard to predict success beforehand.

Given the first sentence, I'm confused as to why you think that "strategy research" (writ large) is going to be valuable, given our fundamental lack of predictive ability in most of the domains where existential risk is a concern.

Comment by davidmanheim on Should rationality be a movement? · 2019-06-21T13:05:37.473Z · score: 24 (9 votes) · LW · GW

It seems strange to try to draw sharp boundaries around communities for the purposes of this argument, and given the obvious overlap and fuzzy boundaries, I don't really understand what the claim that the "rationality community didn't have an organisation like CEA" even means. This is doubly true given that as far as I have seen, all of the central EA organizations are full of people who read/used to read Lesswrong.

Comment by davidmanheim on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-21T12:59:45.485Z · score: 2 (2 votes) · LW · GW

On point 2, which is the only one I can really comment on: yes, this seems like a useful paper, and I buy the argument that such an approach is critical for some purposes, including some of what we discussed on Goodhart's Law, where one class of misalignment can be explicitly addressed by your approach. Also see the recent paper here: that explicitly models causal dependencies (like in figure 2) to show a safety result.

Comment by davidmanheim on How does Gradient Descent Interact with Goodhart? · 2019-06-21T12:40:07.270Z · score: 3 (2 votes) · LW · GW

Yes, and yes, I'm hoping to be there.

Comment by davidmanheim on How does Gradient Descent Interact with Goodhart? · 2019-06-18T09:40:04.503Z · score: 7 (2 votes) · LW · GW

Note: I briefly tried a similar approach, albeit with polynomial functions with random coefficients rather than ANNs, and in R not python, but couldn't figure out how to say anything useful with it.

If this is of any interest, it is available here: (I also built simulations for bog-standard Goodhart)

I am unclear how much of my feeling that this approach is fairly useless reflects my lack of continued pursuit of building such models and figuring out what can be said, or my diversion to other work that was more fruitful, rather than a fundamental difficulty of saying anything clear based on these types of simulations. I'd like to claim it's the latter, but I'll clearly note that it is heavily motivated reasoning.

Comment by davidmanheim on How does Gradient Descent Interact with Goodhart? · 2019-06-18T09:26:27.381Z · score: 1 (1 votes) · LW · GW

I really like the connection between optimal learning and Goodhart failures, and I'd love to think about / discuss this more. I've mostly thought about it in the online case, since we can sample from human preferences iteratively and build human-in-the-loop systems, as I suggested here: "Oversight of Unsafe Systems via Dynamic Safety Envelopes", which I think parallels, but is less developed than, one part of Paul Christiano's approach. But I see why that's infeasible in many settings, which is a critical issue that the offline case addresses.

I also want to note that this addresses issues of extremal model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.

As an example of the former for human values, I'd suggest that "maximize food intake" is a critical goal in starving humans, but there is a point at which the goal becomes actively harmful, and if all you see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.

As an example of the latter, causal Goodhart would be where an AI system optimizes for systems that are good at reporting successful space flights, rather than optimizing for actual success - any divergence leads to a system that will kill people and lie about it.
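A toy numeric version of this causal-Goodhart selection effect (my own construction, with arbitrary numbers): each candidate system has a true success rate and an independent "reporting skill" that inflates its reported success, so selecting on reports selects largely for inflation rather than success.

```python
import random

random.seed(0)
# 1000 candidate systems; "true" is actual success rate, "inflate" is
# skill at making reported success look better than it is.
systems = [
    {"true": random.random(), "inflate": random.random()}
    for _ in range(1000)
]
# Selecting on the report (true success plus inflated reporting) versus
# selecting on actual success.
by_report = max(systems, key=lambda s: s["true"] + 2 * s["inflate"])
by_truth = max(systems, key=lambda s: s["true"])
```

The report-maximizing pick is chosen mostly for its inflation term, so its true success rate generally falls short of the best available system, which is the "lies about successful flights" failure in miniature.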

Comment by davidmanheim on Paternal Formats · 2019-06-12T06:46:28.287Z · score: 2 (2 votes) · LW · GW

Based on the discussions below, it seems clear to me that there are (at least) two continuous dimensions of legibility and coercion, which are often related but conceptually distinct. I think they are positively correlated in most good writing, so they are easily conflated, but clarifying them seems useful.

The first is Legible <--> Illegible, in Venkatesh Rao's terms, as others suggested. This is typically the same as serial-access vs random-access, but has more to do with structure; trees are highly legible, but may not require a particular order. Rough notes from a lecture are illegible (even if they are typed rather than hand-written), but usually need to be read in order.

The second is Coercive <--> Non-coercive, mostly in the negative sense people disliked. Most of the time, the level of coercion is fairly low, even in what we think of as coercive writing. For example, any writing that pushes a conclusion is attempting to change your mind, and hence is coercive. Structures that merely review or present evidence are non-coercive.

I think it takes effort to make something legible but non-coercive, and it is either very high effort OR badly structured when they are illegible and non-coercive. And since I've brought up Venkatesh Rao and mentioned two dimensions, I believe I'm morally required to construct a 2x2. I can't upload a drawing in a comment, but I will "take two spectra (or watersheds) relevant to a complex issue, simplify each down to a black/white dichotomy, and label the four quadrants you produce." Given his advice, I'll use a "glossary of example “types” to illustrate diversity and differentiation within the soup of ambiguity."

Paternalistic non-fiction writing is legible but coercive; it assumes it knows best, but allows navigation. The sequences are a good example, well structured textbooks are often a better example. Note that being correct doesn't change the level of coercion! There are plenty of coercive anti-evolution/religious biology "textbooks," but the ones that are teaching actual science are no less coercive.

Unstructured wikis are illegible and non-coercive; the structure isn't intended to make a point or convince you, but they also make no effort to present things logically or clearly at a higher level. (Individual articles can be more or less structured or coercive, but the wiki format is not.)

Blueprints and diagrams are legible but non-coercive, since by their structure they only present information, rather than leading to a conclusion. Novels and other fiction are (usually) legible and non-coercive. Sometimes there is an element of coercion, as in fables, Lord of the Flies, HP:MoR, and everything CS Lewis ever wrote - but the main goal is (or should be) to be immersive or entertaining rather than coercive or instructive.

Conversations, and almost any multi-person forum (including most lesswrong writing), are coercive and illegible. Tl;drs are usually somewhat illegible as well. The structure of conversation is hard to follow: there are relevant posts and comments that aren't clearly organized, and at the same time, everyone is trying to push their own reasoning.

Comment by davidmanheim on Major Update on Cost Disease · 2019-06-06T09:25:59.016Z · score: 12 (5 votes) · LW · GW

It also fails to account for the fact that health care is, in a sense, an ultimate superior good - there is no level of income at which people don't want more health, and their demand scales with more income. This combines with the fact that we don't have good ways to exchange money for being healthier. (The same applies for intelligence / education.) I discussed this in an essay on Scott's original post:

Comment by davidmanheim on Does Bayes Beat Goodhart? · 2019-06-06T09:16:47.236Z · score: 1 (1 votes) · LW · GW

That's all basically right, but if we're sticking to causal Goodhart, the "without further assumptions" may be where we differ. I think that if the uncertainty is over causal structures, the "correct" structure will be more likely to increase all metrics than most others.

(I'm uncertain how to do this, but) it would be interesting to explore this over causal graphs, where a system has control over a random subset of nodes, and a metric correlated to the unobservable goal is chosen. In most cases, I'd think that leads to causal Goodhart quickly. But if the set of nodes potentially used for the metric includes some that directly cause the goal, and others that can be intercepted to create causal Goodhart, uncertainty over the metric would lead to less causal Goodharting, since targeting the actual cause should improve the correlated metrics, while the reverse is not true.
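A toy version of this intuition is easy to sketch. The model below is my own illustrative setup, not anything from the thread: a fixed budget is split between "real work" on a node that directly causes the goal and "gaming" a correlated proxy node that can be intercepted. Real work is given diminishing returns, so a proxy-only metric rewards gaming, while mixing in a causal metric shifts effort back toward the true cause:

```python
import math

BUDGET = 10.0

def outcomes(real_work):
    """Split a fixed budget between real work and gaming the proxy."""
    gaming = BUDGET - real_work
    goal = math.sqrt(real_work)    # true goal: diminishing returns on real work
    m_cause = goal                 # metric on a node directly causing the goal
    m_proxy = goal + gaming        # correlated metric that can be intercepted
    return goal, m_cause, m_proxy

def optimize(weight_on_cause):
    """Grid-search the budget split to maximize a weighted mix of metrics."""
    best_w, best_score = 0.0, float("-inf")
    for i in range(1001):
        w = BUDGET * i / 1000
        goal, m_cause, m_proxy = outcomes(w)
        # The optimizer sees only the metrics, never the goal itself.
        score = weight_on_cause * m_cause + (1 - weight_on_cause) * m_proxy
        if score > best_score:
            best_score, best_w = score, w
    return outcomes(best_w)[0]     # true goal actually achieved

print(round(optimize(0.0), 2))    # proxy metric only: 0.5 (mostly gaming)
print(round(optimize(0.5), 2))    # 50/50 mix: 1.0 (more real work)
print(round(optimize(1.0), 2))    # causal metric only: 3.16 (all real work)
```

The function shapes and weights here are arbitrary assumptions, but they capture the claimed asymmetry: targeting the cause improves the proxy, while gaming the proxy does nothing for the causal metric.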

Comment by davidmanheim on Uncertainty versus fuzziness versus extrapolation desiderata · 2019-06-04T12:51:49.583Z · score: 12 (3 votes) · LW · GW

It's not exactly the same, but I would argue that the issues with "Dog" versus "Cat" for the picture are best captured with that formalism - the boundaries between categories are not strict.

To be more technical, there are a couple of locations where fuzziness can exist. First, the mapping in reality is potentially fuzzy, since someone could, in theory, bio-engineer a kuppy or cat-dog. These would be partly members of the cat set and partly members of the dog set, perhaps in proportion to their genetic resemblance to each of the parent categories.

Second, the process that leads to the picture, involving a camera and a physical item in space, is a mapping from reality to an image. That is, reality may have a sharp boundary between dogs and cats, but the space of possible pictures of a given resolution is far smaller than the space of physical configurations that can be photographed, so the mapping from reality->pictures is many-to-one, creating a different irresolvable fuzziness - perhaps 70% of the plausible configurations that lead to this set of pixels are cats, and 30% are dogs, so the picture has a fuzzy set membership.

Lastly, there is mental fuzziness, which usually captures the other two implicitly, but has the additional fuzziness created because the categories were made for man, not man for the categories. That is, the categories themselves may not map to reality coherently. This is different from the first issue, where "sharp" genetic boundaries like that between dogs and cats do map to reality correctly, but items can be made to sit on the line. This third issue is that the category may not map coherently to any actual distinction, or may be fundamentally ambiguous, as Scott's post details for "Man vs. Woman" or "Planet vs. Planetoid" - items can partly match one or more than one category, and be fuzzy members of the set.

Each of these, it seems, can be captured fairly well as fuzzy sets, which is why I'm proposing that your usage has a high degree of membership in the fuzzy set of things that can be represented by fuzzy sets.

Comment by davidmanheim on Does Bayes Beat Goodhart? · 2019-06-03T06:44:01.816Z · score: 16 (4 votes) · LW · GW

Also, I keep feeling bad that we're perpetuating giving Goodhart credit, rather than Campbell, since Campbell was clearly first - - and Goodhart explicitly said he was joking in a recent interview.

Comment by davidmanheim on Does Bayes Beat Goodhart? · 2019-06-03T06:38:38.341Z · score: 9 (3 votes) · LW · GW

See my much shorter and less developed note to a similar effect: - and I agree that regressional and extremal goodhart cannot be fixed purely with his solution.

I will, however, defend some of Stuart's suggestions as they relate to causal Goodhart in a non-adversarial setting. - I'm also avoiding the can of worms of game theory. In that case, both randomization AND mixtures of multiple metrics can address Goodhart-like failures, albeit in different ways. I had been thinking about this in the context of policy - - rather than AI alignment, but some of the arguments still apply. (One critical argument that doesn't fully apply is that "good enough" mitigation raises the cognitive costs of cheating to a point where aligning with the true goal is cheaper. I also noted in the paper that satisficing is useful for limiting the misalignment from metrics, and quantilization seems like one promising approach for satisficing for AGI.)

The argument for causal Goodhart is that randomization and mixed utilities are both effective in mitigating causal structure errors that lead to causal Goodhart in the one-party case. That's because the failure occurs when uncertainty or mistakes about causal structure lead to the choice of metrics that are correlated with the goal, rather than causal of it. However, if even some significant fraction or probability of the metric is causally connected to the goal in ways that cannot be gamed, it can greatly mitigate this class of failure.

To more clearly apply this logic to human utility: if we mistakenly think that endorphins in the brain are 100% of human goals, an AGI might want to tile the universe with rats on happy drugs, or the moral equivalent. If we instead assign this only 50% weight, or a 50% probability of being the scored outcome, and we define the other component to require a different way of creating what we actually think of as happiness / life satisfaction, the optimum does not simply shift to 50% of the universe tiled with rat brains. This is because the alternative class of hedonium will involve a non-trivial amount of endorphins as well; as long as other solutions produce anywhere close to as many endorphins, they will be preferred. (In this case, admittedly, we got the endorphin goal so wrong that 50% of the universe tiled in rats on drugs is likely - bad enough utility functions can't be fixed with either randomization or weighting. But if a causal mistake can be fixed with either a probabilistic or a weighting solution, it seems likely it can be fixed with the other.)
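A toy calculation shows why the optimum doesn't simply split 50/50. The numbers below are illustrative assumptions, not measurements: a "rich" hedonium option that produces 80% as many endorphins as rat-tiling, while also scoring on satisfaction, dominates under a 50/50 mixed utility:

```python
# Two candidate uses for each unit of resources (values are assumptions):
#   rats: maximizes the endorphin metric only
#   rich: a richer target that also produces substantial endorphins
ENDORPHINS = {"rats": 1.0, "rich": 0.8}
SATISFACTION = {"rats": 0.0, "rich": 1.0}

def mixed_utility(frac_rats, w_endorphin=0.5):
    """Score for tiling frac_rats of the universe with rats, rest with 'rich'."""
    frac_rich = 1 - frac_rats
    endo = frac_rats * ENDORPHINS["rats"] + frac_rich * ENDORPHINS["rich"]
    satis = frac_rats * SATISFACTION["rats"] + frac_rich * SATISFACTION["rich"]
    return w_endorphin * endo + (1 - w_endorphin) * satis

# Search the tiling fraction that maximizes the 50/50 mixed utility.
best = max(range(101), key=lambda i: mixed_utility(i / 100)) / 100
print(best)  # 0.0 - not 0.5: the rich target nearly matches on endorphins too
```

Because the rich option comes close on the endorphin component while winning outright on the other, the mixed optimum tiles nothing with rats; the 50% weight does not translate into a 50% allocation.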

Comment by davidmanheim on Conditions for Mesa-Optimization · 2019-06-03T06:07:02.080Z · score: 14 (4 votes) · LW · GW

I really like this formulation, and it greatly clarifies something I was trying to note in my recent paper on multiparty dynamics and failure modes - the discussion about the likelihood of mesa-optimization due to human modeling is close to the more general points I tried to make in the discussion section of that paper. As argued here about humans, other systems are optimizers (even if they are themselves only base-optimizers), and therefore any successful machine-learning system in a multiparty environment is implicitly forced to model the other parties. I called this the "opponent model," and argued that such models are dangerous because they are always approximate, moving directly from that point to the claim that there is great potential for misalignment. The implication of this work, though, is that opponent models are also dangerous because they encourage machine learning systems in multi-agent settings to become mesa-optimizers, and mesa-optimization is a critical enabler of misalignment even when the base optimizer is well aligned.

I would add to the discussion here that multiparty systems can display the same dynamics, and therefore have risks similar to those of systems which require human models. I also think - less closely connected to the current discussion, but directly related to my paper - that misalignments between mesa-optimizers pose new and harder-to-understand risks when they interact with one another.

I also strongly agree with the point that current examples are not really representative of the full risk. Unfortunately, peer-reviewers strongly suggested that I have more concrete examples of failures. But as I said in the paper, "the failures seen so far are minimally disruptive. At the same time, many of the outlined failures are more problematic for agents with a higher degree of sophistication, so they should be expected not to lead to catastrophic failures given the types of fairly rudimentary agents currently being deployed. For this reason, specification gaming currently appears to be a mitigable problem, or as Stuart Russell claimed, be thought of as "errors in specifying the objective, period.""

As a final aside, I think that the concept of mesa-optimizers is very helpful in laying out the argument against that last claim - misalignment is more than just misspecification. I think that this paper will be very helpful in showing why.

Comment by davidmanheim on Uncertainty versus fuzziness versus extrapolation desiderata · 2019-06-01T18:41:51.747Z · score: 3 (2 votes) · LW · GW

Actually, I assumed fuzzy was intended here to be a precise term, contrasted with probability and uncertainty, as it is used in describing fuzzy sets versus uncertainty about set membership.

Comment by davidmanheim on Uncertainty versus fuzziness versus extrapolation desiderata · 2019-05-31T08:13:19.569Z · score: 9 (3 votes) · LW · GW

I missed the proposal when it was first released, but I wanted to note that the original proposal addresses only one (critical) class of Goodhart-error, and proposes a strategy based on addressing one problematic result of that class: nearest-unblocked neighbor. The strategy is more widely useful for misspecification than just nearest-unblocked neighbor, but it still addresses only some Goodhart effects.

The misspecification discussed is more closely related to, but still distinct from, extremal and regressional Goodhart. (Causal and adversarial Goodhart are somewhat far removed, and don't seem as relevant to me here. Causal Goodhart is due to mistakes, albeit fundamentally hard to avoid mistakes, while adversarial Goodhart happens via exploiting other modes of failure.)

I notice I am confused about how different strategies being proposed to mitigate these related failures can coexist if each is implemented separately, and/or how they would be balanced if implemented together, as I briefly outline below. Reconciling or balancing these different strategies seems like an important question, but I want to wait to see the full research agenda before commenting or questioning further.

Explaining the conflict I see between the strategies:

Extremal Goodhart is somewhat addressed by another post you made, which proposes to avoid ambiguous distant situations. It seems that the strategy proposed here is to attempt to resolve fuzziness, rather than to avoid areas where it becomes critical. These seem to be at least somewhat at odds, though this is partly reconcilable by fully pursuing neither resolving ambiguity nor avoiding distant ambiguity.

And regressional Goodhart, as Scott G. originally pointed out, is unavoidable except by staying in-sample - interpolating rather than extrapolating. Fully pursuing that strategy is precluded by injecting uncertainty into the model of the human-provided modification to the utility function. Again, this is partly reconcilable, for example by trying to bound how far we let the system stray from the initially provided blocked strategy, and how much fuzziness it is allowed to infer without an external check.

Comment by davidmanheim on Schelling Fences versus Marginal Thinking · 2019-05-23T11:06:38.609Z · score: 1 (1 votes) · LW · GW

Yes, that does seem to be a risk. I would think that applying Schelling fences to reinforce current values reduces the amount of expected drift in the future, and I'm unclear whether you are claiming that using Schelling fences will do the opposite, or claiming that they are imperfect.

I'd also like to better understand what specifically you think is making the error of making it difficult to re-align with current values, rather than reducing the degree of drift, and how it could be handled differently.

Comment by davidmanheim on No Really, Why Aren't Rationalists Winning? · 2019-05-23T06:33:50.135Z · score: 4 (2 votes) · LW · GW

That's a very good point, I was definitely unclear.

I think that the critical difference is that in epistemically healthy communities, when such a failure is pointed out, some effort is spent on identifying and fixing the problem, instead of pointedly ignoring it despite efforts to solve the problem, or spending time actively defending the inadequate status quo from even Pareto-improving changes.

Comment by davidmanheim on No Really, Why Aren't Rationalists Winning? · 2019-05-23T06:30:39.333Z · score: 12 (3 votes) · LW · GW

I don't think they get epistemic rationality anywhere near correct either. As a clear and simple example, there are academics currently vigorously defending their right not to pre-register empirical studies.

Comment by davidmanheim on By default, avoid ambiguous distant situations · 2019-05-23T06:27:33.503Z · score: 3 (2 votes) · LW · GW

Agreed. I'm just trying to think through why we should / should not privilege the status quo. I notice I'm confused about this, since the reversal heuristic implies we shouldn't. If we take this approach to an extreme, aren't we locking in the status-quo as a base for allowing only pareto improvements, rather than overall utilitarian gains?

(I'll note that Eric Drexler's Pareto-topia argument explicitly allows for this condition - I'm just wondering whether it is ideal, or a necessary compromise.)