Posts

Have the lockdowns been worth it? 2020-10-12T23:35:14.835Z
How uniform is the neocortex? 2020-05-04T02:16:50.650Z
How special are human brains among animal brains? 2020-04-01T01:35:36.995Z
My current framework for thinking about AGI timelines 2020-03-30T01:23:57.195Z
zhukeepa's Shortform 2020-03-12T19:09:28.648Z
Can HCH epistemically dominate Ramanujan? 2019-02-23T22:00:33.363Z
Paul's research agenda FAQ 2018-07-01T06:25:14.013Z
Another take on agent foundations: formalizing zero-shot reasoning 2018-07-01T06:12:57.414Z
My take on agent foundations: formalizing metaphilosophical competence 2018-04-01T06:33:10.372Z
Corrigible but misaligned: a superintelligent messiah 2018-04-01T06:20:50.577Z
Reframing misaligned AGI's: well-intentioned non-neurotypical assistants 2018-04-01T01:22:36.993Z
Metaphilosophical competence can't be disentangled from alignment 2018-04-01T00:38:11.533Z

Comments

Comment by zhukeepa on How uniform is the neocortex? · 2020-05-04T17:44:48.834Z · LW · GW

Yep! I addressed this point in footnote [3].

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T02:32:05.907Z · LW · GW

I just want to share another reason I find this n=1 anecdote so interesting -- I have a highly speculative inside view that the abstract concept of self provides a cognitive affordance for intertemporal coordination, resulting in a phase transition in agentiness only known to be accessible to humans.

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T02:29:53.639Z · LW · GW

Hmm, I'm not sure I understand what point you think I was trying to make. The only case I was trying to make here was that much of our subjective experience which may appear uniquely human might stem from our language abilities, which seems consistent with Helen Keller undergoing a phase transition in her subjective experience upon learning a single abstract concept. I'm not getting what age has to do with this.

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T02:10:18.166Z · LW · GW
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot.

Not necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T02:01:15.698Z · LW · GW

I remembered reading about this a while back and updating on it, but I'd forgotten about it. I definitely think this is relevant, so I'm glad you mentioned it -- thanks!

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T02:00:40.167Z · LW · GW
I think this explanation makes sense, but it raises the further question of why we don't see other animal species with partial language competency. There may be an anthropic explanation here - i.e. that once one species gets a small amount of language ability, they always quickly master language and become the dominant species. But this seems unlikely: e.g. most birds have such severe brain size limitations that, while they could probably have 1% of human language, I doubt they could become dominant in anywhere near the same way we did.

Can you elaborate more on what partial language competency would look like to you? (FWIW, my current best guess is that "once one species gets a small amount of language ability, they always quickly master language and become the dominant species", but I have a lot of uncertainty. I suppose this also depends a lot on exactly what's meant by "language ability".)

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T01:58:46.186Z · LW · GW
This seems like a false dichotomy. We shouldn't think of scaling up as "free" from a complexity perspective - usually when scaling up, you need to make quite a few changes just to keep individual components working. This happens in software all the time: in general it's nontrivial to roll out the same service to 1000x users.

I agree. But I also think there's an important sense in which this additional complexity is mundane -- if the only sorts of differences between a mouse brain and a human brain were the sorts of differences involved in scaling up a software service to 1000x users, I think it would be fair (although somewhat glib) to call a human brain a scaled-up mouse brain. I don't think this comparison would be fair if the sorts of differences were more like the sorts of differences involved in creating 1000 new software services.

Comment by zhukeepa on How special are human brains among animal brains? · 2020-04-05T01:46:55.633Z · LW · GW

That's one of the "unique intellectual superpowers" that I think language confers on us:

On a species level, our mastery of language enables intricate insights to accumulate over generations with high fidelity. Our ability to stand on the shoulders of giants is unique among animals, which is why our culture is unrivaled in its richness and sophistication.

(I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I'd made that more front-and-center.)

Comment by zhukeepa on Paul's research agenda FAQ · 2018-09-16T17:09:37.555Z · LW · GW

I'm still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they're in the test distribution, and executing a treacherous turn if so. The way I understand your summaries:

  • Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it's in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it's in the test distribution.
  • Interpretability makes it harder for the malignant 1% to be hidden, but doesn't prevent malignant cognition it can't detect. (My reading of "Gradient descent will be pushing against whatever kinds of treacherous turns that can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability." is completely consistent with this.)

I didn't understand what you wrote about verification well enough to have anything to say.

It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it's in the test distribution. Is your model that gradient descent will favor the former agent over the latter agent, making the 99%/1% agent unlikely to arise (and perhaps quantifiably so)?

The inner process may nevertheless use TDT if TDT doesn't diverge from CDT on the training distribution, or it might learn to use TDT but "look nice" so that it doesn't get selected against.

This was what I was intending to convey in assumption 3.

Comment by zhukeepa on Paul's research agenda FAQ · 2018-09-14T05:08:16.383Z · LW · GW

I'm currently intuiting that there's a broad basin of "seeming corrigible until you can perform a treacherous turn", but not a broad basin of true corrigibility. If the overseer can robustly detect incorrigible cognition in the distilled agent, I think things may be fine. But if e.g. the overseer is only inspecting the distilled agent's justifications for its behavior, and something like its verbal loop, I don't see how things can work out. 

Here are my assumptions underlying this intuition: 

1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseers. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer's comment.)

2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they'll get.

3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible. 

4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn. 

5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent's cognition is necessary to spot this sort of malign reasoning. 

6. We can't achieve this level of understanding via anything like current ML transparency techniques. 

Which of these assumptions do you disagree with most? Or do you agree with these assumptions, but feel optimistic that we can get good worst-case guarantees anyway?

2. How easy is it to learn to be corrigible? I'd think of this as: if we impose the extra constraint that our model behave corrigibly on all inputs, in addition to solving the object-level task well, how much bigger do we need to make the model?

My current intuition: for the same reasons it seems really hard to train models that believe 51 is prime while also performing well on object-level tasks, learning to be corrigible seems very difficult.

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-06T20:58:13.436Z · LW · GW
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don't expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).

I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes "sufficiently complex". I'm not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.

is simply not able to foresee the impacts of its changes and so makes them 'recklessly' (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).

+100. This is why I feel queasy about "OK, I judge this self-modification to be fine" when the self-modifications are sufficiently complex, if this judgment isn't based off something like zero-shot reasoning (in which case we'd have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).

Comment by zhukeepa on Paul's research agenda FAQ · 2018-07-06T01:07:54.852Z · LW · GW

If we view the US government as a single entity, it's not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akratic human, it's not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.

If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akratic, should this AI still qualify as being aligned with the operator?

It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:

  • Significant conceptual understanding of corrigibility is at least partially blocked on conceptual progress on HRAD. (Unless you think the relevant notions of agency can mostly be formalized with ideas outside of HRAD? Or that conceptual understandings of agency are mostly irrelevant for conceptual understandings of corrigibility?)
  • Unless we have strong reasons to think we can impart the relevant notions of agency via labeled training data, we shouldn't expect to be able to adequately impart corrigibility via labeled training data.
  • Without a clear conceptual notion of agency, we won't have a clear enough concept of alignment or corrigibility we can use to make worst-case bounds.

I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-04T18:21:11.515Z · LW · GW

I should clarify a few more background beliefs:

  • I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
  • I agree that right now, nobody is trying to (or should be trying to) build an AGI that's competently optimizing for our values for 1,000,000,000 years. (I'd want an aligned, foomed AGI to be doing that.)
  • I agree that if we're not doing anything as ambitious as that, it's probably fine to rely on human input.
  • I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don't have to worry about zero-shot reasoning today.
  • Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I'd give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)

Let me now clarify what I mean by "foomed AGI":

  • A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. ("Optimally optimized optimizer" is another way of putting it.)
  • You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it's more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
  • I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
  • In this "nuclear explosion" of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.

In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like "Just don't build a foomed AGI. Just like it's way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it's way too hard to build a safe foomed AGI, so let's just not do it". And my position is something like "It's probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let's do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right."

I'm happy to delve into your individual points, but before I do so, I'd like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.

Comment by zhukeepa on Paul's research agenda FAQ · 2018-07-03T22:23:17.505Z · LW · GW

Corrigibility. Without corrigibility I would be just as scared of Goodhart.

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-03T19:59:46.706Z · LW · GW
This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?

I agree that zero-shot reasoning doesn't save us from daemons by itself, and I think there's important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.

Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?

The daemons I'm focusing on here mostly arise from embedded agency, which Solomonoff induction doesn't capture at all. (It's worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and "internal politics"/"embedded agency" daemons.) I'm interested in hashing this out further, but probably at some future point, since this doesn't seem central to our disagreement.

But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is "the AGI was incompetent at some point, made a mistake, and bad things happened". I don't know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to "want" to do that.
However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.
....
If you're trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can't build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)

I'm surprised by how much we seem to agree about everything you've written here. :P Let me start by clarifying my position a bit:

  • When I imagine the AGI making a "plan that will work in one go", I'm not imagining it going like "OK, here's a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!" I'm imagining the plan to look more like "set a bunch of things in motion, reevaluate and update it based on where things are, and repeat". So the overall shape of this AGI's cognition will look something like "execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.", happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn't introduce any critical errors.
  • I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
  • Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI doing 1,000,000 years' worth of human cognition without any catastrophic failures.
  • I'm less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It's plausible that the latter usually leads to the former (particularly if there's a fast takeoff on Earth that completes in a few hours), but that's mostly not what I'm concerned about.

In terms of actual disagreement, I suspect I'm much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it's doing something like 1,000,000 years' worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What's your take on this?

Comment by zhukeepa on Paul's research agenda FAQ · 2018-07-03T18:28:33.414Z · LW · GW
This proposal judges explanations by plausibility and articulateness. Truthfulness is only incidentally relevant and will be Goodharted away.

Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we're distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-03T07:04:23.693Z · LW · GW
I see. Given this, I think "zero-shot learning" makes sense but "zero-shot reasoning" still doesn't, since in the former "zero" refers to "zero demonstrations" and you're learning something without doing a learning process targeted at that specific thing, whereas in the latter "zero" isn't referring to anything and you're trying to get the reasoning correct in one attempt so "one-shot" is a more sensible description.

I was imagining something like "zero failed attempts", where each failed attempt approximately corresponds to a demonstration.

Are you saying that in the slow-takeoff world, we will be able to coordinate to stop AI progress after reaching AGI and then solve the full alignment problem at leisure? If so, what's your conditional probability P(successful coordination to stop AI progress | slow takeoff)?

More like, conditioning on getting international coordination after our first AGI, P(safe intelligence explosion | slow takeoff) is a lot higher, like 80%. I don't think slow takeoff does very much to help international coordination.

Comment by zhukeepa on Paul's research agenda FAQ · 2018-07-03T06:51:43.142Z · LW · GW

1. If at the time of implementing ALBA, our conceptual understanding of corrigibility is the same as it is today, how doomed would you feel?

2. How are you imagining imposing an extra constraint that our model behave corrigibly on all inputs?

3. My current best guess is that your model of how to achieve corrigibility is to train the AI on a bunch of carefully labeled examples of corrigible behavior. To what extent is this accurate?

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-01T19:12:24.113Z · LW · GW
This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.

I'm mostly concerned with daemons, not utility functions changing in random directions. If I knew that corrigibility were robust and that a corrigible AI would never encounter daemons, I'd feel pretty good about it recursively self-improving without formal zero-shot reasoning.

You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don't expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task -- the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).

I'm imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it's performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.

In particular, the one context where we're most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs -- and yet, daemons arise.

I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.

I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.

In my mind, there's no clear difference between preventing daemons and securing complex systems. For example, I think there's a fundamental similarity between the following questions:

  • How can we build an organization that we trust to optimize for its founders' original goals for 10,000 years?
  • How can we ensure a society of humans flourishes for 1,000,000,000 years without falling apart?
  • How can we build an AGI which, when run for 1,000,000,000 years, still optimizes for its original goals with > 99% probability? (If it critically malfunctions, e.g. if it "goes insane", it will not be optimizing for its original goals.)
  • How can we build an AGI which, after undergoing an intelligence explosion, still optimizes for its original goals with > 99% probability?

I think of AGIs as implementing miniature societies teeming with subagents that interact in extraordinarily sophisticated ways (for example they might play politics or Goodhart like crazy). On this view, ensuring the robustness of an AGI entails ensuring the robustness of a society at least as complex as human society, which seems to me like it requires zero-shot reasoning.

It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms). Maybe it's just a failure of my imagination, but I can't think of any way to accomplish even this task without delegating it to a skilled zero-shot reasoner.

Comment by zhukeepa on Another take on agent foundations: formalizing zero-shot reasoning · 2018-07-01T17:50:12.709Z · LW · GW
Why "zero-shot"? You're talking about getting something right in one try, so wouldn't "one-shot" make more sense?

I've flip-flopped between "one-shot" and "zero-shot". I'm calling it "zero-shot" in analogy with zero-shot learning, which refers to the ability to perform a task after zero demonstrations. "One-shot reasoning" probably makes more sense to folks outside of ML.

I think this paragraph gives an overly optimistic impression of how much progress has been made. We are still very confused about what probabilities really are, we haven't made any progress on the problem of Apparent Unformalizability of “Actual” Induction, and decision theory seems to have mostly stalled since about 8 years ago (the MIRI paper you cite does not seem to represent a substantial amount of progress over UDT 1.1).

I used "substantial progress" to mean "real and useful progress", rather than "substantial fraction of the necessary progress". Most of my examples happened in the eary to mid-1900s, suggesting that if we continue at that rate we might need at least another century.

This isn't obvious to me. Can you explain why you think this?

I'd feel much better about delegating the problem to a post-AGI society, because I'd expect such a society to be far more stable if takeoff is slow, and far more capable of taking its merry time to solve the full problem in earnest. (I think it will be more stable because I think it would be much harder for a single actor to attain a decisive strategic advantage over the rest of the world.)

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-26T14:41:49.152Z · LW · GW
To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?

Actually I would be surprised if that's the case, and I think it's plausible that large teams of scientists thinking for one week each could safely replicate arbitrary human intellectual progress.

But if you replaced 100,000 scientists thinking for a week each with 1,000,000,000,000 scientists thinking for 10 minutes each, I'd feel more skeptical. In particular I think 10,000,000 10-minute scientists can't replicate the performance of one 1-week scientist, unless the 10-minute scientists become human transistors. In my mind there isn't a qualitative difference between this scenario and the low-bandwidth oversight scenario. It's specifically dealing with human transistors that I worry about.

I also haven't thought too carefully about the 10-minute-thought threshold in particular and wouldn't be too surprised if I revised my view here. But if we replaced "10,000,000 10-minute scientists" with "arbitrarily many 2-minute scientists", I'd be even more inclined to think we couldn't assemble the scientists safely.

I'm assuming in all of this that the scientists have the same starting knowledge.

There's an old SlateStarCodex post that's a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists' epistemic process is fundamentally similar to that of the alchemists. And the alchemists' thoughts were constrained by their lifespan, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCHs.

I also imagine them to be far more constraining if "thought-lifespans" shrank from ~30 years to ~30 minutes. But "thought-lifespans" on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.

I don't understand at all how that could be true for brain uploading at the scale of a week vs. year.
Solving this problem involves considering multiple possible approaches. Those can't be decomposed with 100% efficiency, but it sure seems like they can be split up across people.
Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.
Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.

I agree with all this.

EDIT: In summary, my view is that:

  • if all the necessary intellectual progress can be distilled into individual scientists' heads, I feel good about HCH making a lot of intellectual progress
  • if the agents are thinking long enough (1 week seems long enough to me, 30 minutes doesn't), this distillation can happen.
  • if this distillation doesn't happen, we'd have to end up doing a lot of cognition on "virtual machines", and cognition on virtual machines is unsafe.

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-22T14:37:05.094Z · LW · GW

You're right -- I edited my comment accordingly. But my confusion still stands. Say the problem is "figure out how to upload a human and run him at 10,000x". On my current view:

(1) However you decompose this problem, you'd need something equivalent to at least 1 year's worth of a competent scientist doing general reasoning to solve this problem.

(2) In particular, this general reasoning would require the ability to accumulate new knowledge and synthesize it to make novel inferences.

(3) This sort of reasoning would end up happening on a "virtual machine AGI" built out of "human transistors".

(4) Unless we know how to ensure cognition is safe (e.g. daemon-free) we wouldn't know how to make safe "virtual machine AGI's".

(5) So either we aren't able to perform this reasoning (because it's unsafe and recognized as such), or we perform it anyway unsafely, which may lead to catastrophic outcomes.

Which of these points would you say you agree with? (Alternatively, if my picture of the situation seems totally off, could you help show me where?)

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-20T22:45:45.613Z · LW · GW
D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property.

This is the crux I currently feel most skeptical of. I don't understand how we could safely decompose the task of emulating 1 year's worth of von Neumann-caliber general reasoning on some scientific problem. (I'm assuming something like this is necessary for a pivotal act; maybe it's possible to build nanotech or whole-brain emulations without such reasoning being automated, in which case my picture for the world becomes rosier.) (EDIT: Rather than "decomposing the task of emulating a year's worth of von Neumann-caliber general reasoning", I meant to say "decomposing any problem whose solution seems to require 1 year's worth of von Neumann-caliber general reasoning".)

In particular, I'm still picturing Paul's agenda as implementing some form of HCH, and I don't understand how anything that looks like an HCH can accumulate new knowledge, synthesize it, and make new discoveries on top of it, without the HCH-humans effectively becoming "human transistors" that implement an AGI. (An analogy: the HCH-humans would be like ants; the AGI would be like a very complicated ant colony.) And unless we know how to build a safe AGI (for example we'd need to ensure it has no daemons), I don't see how the HCH-humans would know how to configure themselves into a safe AGI, so they just wouldn't (if they're benign).

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-20T21:53:31.047Z · LW · GW

Oops, I think I was conflating "corrigible agent" with "benign act-based agent". You're right that they're separate ideas. I edited my original comment accordingly.

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-20T19:26:40.499Z · LW · GW
X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system".

If X is "be a competent, catastrophe-free, corrigible act-based assistant", it's plausible to me that an AGI trained to do X is sufficient to lead humanity to a good outcome, even if X doesn't capture human values. For example, the operator might have the AGI develop the technology for whole brain emulations, enabling human uploads that can solve the safety problem in earnest, after which the original AGI is shut down.

Being an act-based (and thus approval-directed) agent is doing a ton of heavy lifting in this picture. Humans obviously wouldn't approve of daemons, so your AI would just try really hard to not do that. Humans obviously wouldn't approve of a Rubik's cube solution that modulates RAM to send GSM cellphone signals, so your AI would just try really hard to not do that.

I think most of the difficulty here is shoved into training an agent to actually have property X, instead of just some approximation of X. It's plausible to me that this is actually straightforward, but it also feels plausible that X is a really hard property to impart (though still much easier to impart than "have human values").

A crux for me on whether property X is sufficient is whether the operator could avoid getting accidentally manipulated. (A corrigible assistant would never intentionally manipulate, but if it satisfies property X while more directly optimizing Y, it might accidentally manipulate the humans into doing some Y distinct from human values.) I feel very uncertain about this, but it currently seems plausible to me that some operators could successfully just use the assistant to solve the safety problem in earnest, and then shut down the original AGI.

Comment by zhukeepa on Challenges to Christiano’s capability amplification proposal · 2018-05-20T19:00:07.161Z · LW · GW

D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;

My model of Paul thinks it's sufficient to train the AI's to be corrigible act-based assistants that are competent enough to help us significantly, while also able to avoid catastrophes. If possible, this would allow significant wiggle room for imperfect imitation.

Paul and I disagreed about the ease of training such assistants, and we hashed out a specific thought experiment: if we humans were trying our hardest to be competent, catastrophe-free, corrigible act-based assistants to some aliens, is there some reasonable training procedure they could give us that would enable us to significantly and non-catastrophically assist the aliens in performing a pivotal act? Paul thought yes (IIRC), while I felt iffy about it. After all, we might need to understand tons and tons of alien minutiae to avoid any catastrophes, and given how different our cultures (and brains) are from theirs, it seems unlikely we'd be able to capture all the relevant minutiae.

I've since warmed up to the feasibility of this. It seems like there aren't too many ways to cause existential catastrophes, it's pretty easy to determine what things constitute existential catastrophes, and it's pretty easy to spot them in advance (at least as well as the aliens would). Yes we might still make some catastrophic mistakes, but they're likely to be benign, and it's not clear that the risk of catastrophe we'd incur is much worse than the risk the aliens would incur if a large team of them tried to execute a pivotal act. Perhaps there's still room for things like accidental mass manipulation, but this feels much less worrisome than existential catastrophe (and also seems plausibly preventable with a sufficiently competent operator).

I suspect another major crux on this point is whether there is a broad basin of corrigibility (link). If so, it shouldn't be too hard for D-imitations to be corrigible, nor for IDA to preserve corrigibility for DD-imitations. If not, it seems likely that corrigibility would be lost through distillation. I think this is also a crux for Vaniver in his post about his confusions with Paul's agenda.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-12T08:16:13.242Z · LW · GW

I think the world will end up in a catastrophic epistemic pit. For example, if any religious leader got massively amplified, I think it's pretty likely (>50%) the whole world will just stay religious forever.

Us making progress on metaphilosophy isn't an improvement over the empowered person making progress on metaphilosophy, conditioning on the empowered person making enough progress on metaphilosophy. But in general I wouldn't trust someone to make enough progress on metaphilosophy unless they had a strong enough metaphilosophical base to begin with.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-12T08:10:57.977Z · LW · GW

Yeah, what you described indeed matches my notion of "values-on-reflection" pretty well. So for example, I think a religious person's values-on-reflection should include valuing logical consistency and coherent logical arguments (because they do implicitly care about those in their everyday lives, even if they explicitly deny it). This means their values-on-reflection should include having true beliefs, and thus be atheistic. But I also wouldn't generally trust religious people to update away from religion if they reflected a bunch.

Comment by zhukeepa on Reframing misaligned AGI's: well-intentioned non-neurotypical assistants · 2018-04-12T08:04:51.896Z · LW · GW

I wish I were clearer in my title that I'm not trying to reframe all misaligned AGI's, just a particular class of them. I agree that an AGI that fully understood your values would not optimize for them (and would not be "well-intentioned") if it had a bad goal.

That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.

I think if we've correctly specified the values in an AGI, then I agree that when the AGI is smart enough it'll correctly optimize for our values. But it's not necessarily robust to scaling down, and I think it's likely to hit a weird place where it's trying and failing to optimize for our values. This post is about my intuitions for what that might look like.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-10T06:19:15.171Z · LW · GW

Oh, I actually think that giving the 100th best person a bunch of power is probably better than the status quo, assuming there are ~100 people who pass the bar (I also feel pessimistic about the status quo). The only reason why I think the status quo might be better is that more metaphilosophy would develop, and then whoever gets amplified would have more metaphilosophical competence to begin with, which seems safer.

Comment by zhukeepa on Reframing misaligned AGI's: well-intentioned non-neurotypical assistants · 2018-04-10T02:55:59.733Z · LW · GW

Thanks a lot Ben! =D

I am somewhat hesitant to share simple intuition pumps about important topics, in case those intuition pumps are misleading.

On that note, Paul has recently written a blog post clarifying that his notion of "misaligned AI" does not coincide with what I wrote about here.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-09T22:01:45.253Z · LW · GW

I'd say that it wouldn't appear catastrophic to the amplified human, but might be catastrophic for that human anyway (e.g. if their values-on-reflection actually look a lot like humanity's values-on-reflection, but they fail to achieve their values-on-reflection).

Comment by zhukeepa on Can corrigibility be learned safely? · 2018-04-09T20:58:58.566Z · LW · GW
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don't think they suggest that an AI needs to learn human representations in order to be safe.

I think my present view is something like a conjunction of:

1. An AI needs to learn human representations in order to generalize like a human does.

2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.

3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it's easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine a large swath of engineering automation also falls into this category.

4. To the extent that "general and open-ended tasks" can be broken down into narrow tasks that don't require human generalization, they don't require human generalization to learn safely.

My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it's true but the bar for "sufficiently general and open-ended" is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?

I'm confused about your thoughts on (1).

(I'm currently rereading your blog posts to get a better sense of your models of how you think broad and general tasks can get broken down into narrow ones.)

Comment by zhukeepa on Can corrigibility be learned safely? · 2018-04-08T19:57:00.596Z · LW · GW

I replied about (2) and black swans in a comment way down.

I'm curious to hear more about your thoughts about (4).

To flesh out my intuitions around (4) and (5): I think there are many tasks where a high-dimensional and difficult to articulate piece of knowledge is critical for completing the task. For example:

  • if you're Larry or Sergey trying to hire a new CEO, you need your new CEO to be a culture fit. Which in this case means something like "being technical, brilliant, and also a hippie at heart". It's really, really hard to communicate this to a slick MBA. Especially the "be a hippie at heart" part. Maybe if you sent them to Burning Man and had them take a few drugs, they'd grok it?
  • if you're Bill Gates hiring a new CEO, you should make sure your new CEO is also a developer at heart, not a salesman. Otherwise, you might hire Steve Ballmer, who drove Microsoft's revenues through the roof for a few years, but also had little understanding of developers (for example he produced an event where he celebrated developers in a way developers don't tend to like being celebrated). This led to an overall trend of the company losing its technical edge, and thus its competitive edge... this was all while Ballmer had worked with Gates at Microsoft for two decades. If Ballmer had been a developer, he might have been able to avoid this, but he very much wasn't.
  • if you're a self-driving car engineer delegating image classification to a modern-day neural net, you'd really want its understanding of what the classifications mean to match yours, lest they be susceptible to clever adversarial attacks. Humans understand the images to represent projections of crisp three-dimensional objects that exist in a physical world; image classifiers don't, which is why they can get fooled so easily by overlays of random patterns. Maybe it's possible to replicate this understanding without being an embodied agent, but it seems you'd need something beyond training a big neural net on a large collection of images, and making incremental fixes.
  • if you're a startup trying to build a product, it's very hard to do so correctly if you don't have a detailed implicit model of your users' workflows and pain points. It helps a lot to talk to them, but even then, you may only be getting 10% of the picture if you don't know what it's like to be them. Most startups die by not having this picture, flying blind, and failing to acquire any users.
  • if you're trying to help your extremely awkward and non-neurotypical friend find a romantic partner, you might find it difficult to convey what exactly is so bad about carrying around slips of paper with clever replies, and pulling them out and reading from them when your date says something you don't have a reply to. (It's not that hard to convey why doing this particular thing is bad. It's hard to convey what exactly about it is bad, that would have him properly generalize and avoid all classes of mistakes like this going forward, rather than just going like "Oh, pulling out slips of paper is jarring and might make her feel bad, so I'll stop doing this particular thing".) (No, I did not make up this example.)

In these sorts of situations, I wouldn't trust an AI to capture my knowledge/understanding. It's often tacit and perceptual, it's often acquired through being a human making direct contact with reality, and it might require a human cognitive architecture to even comprehend in the first place. (Hence my claims that proper generalization requires having the same ontologies as the overseer, which they obtained from their particular methods of solving a problem.)

In general, I feel really sketched about amplifying oversight, if the mechanism involves filtering your judgment through a bunch of well-intentioned non-neurotypical assistants, since I'd expect the tacit understandings that go into your judgment to get significantly distorted. (Hence my curiosity about whether you think we can avoid the judgment getting significantly distorted, and/or whether you think we can do fine even with significantly distorted judgment.)

It's also pretty plausible that I'm talking completely past you here; please let me know if this is the case.

Comment by zhukeepa on Can corrigibility be learned safely? · 2018-04-08T18:08:05.405Z · LW · GW

I really like that list of points! Not that I'm Rob, but I'd mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I'd expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I'd be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about "black swans" popping up. And when I said:

2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it's critical that it knows to check that action with you).

I meant that, if an AI trying to do the right thing was considering one of these actions, for it to be safe it should consult you before going ahead with any one of these. (I didn't mean "the AI is incorrigible if it's not high-impact calibrated", I meant "the AI, even if corrigible, would be unsafe if it's not high-impact calibrated".)

If these kinds of errors are included in "alignment," then I'd want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as "figure out more about what is right" is one way to try to build an AI that is trying to do the right thing.)

I think I understand your position much better now. The way I've been describing "ability to figure out what is right" is "metaphilosophical competence", and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it's not good enough at figuring out what is right, even if it's corrigible.

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-05T06:22:29.464Z · LW · GW

Oops, I do think that's what I meant. To explain my wording: when I imagined a "system optimizing for X", I didn't imagine that system trying its hardest to do X, I imagined "a system for which the variable Z it can best be described as optimizing is X".

To say it all concretely another way, I mean that there are a bunch of different systems that, when "trying to optimize for X as hard as possible", all look to us like they optimize for X successfully, but do so via methods that lead to vastly different (and generally undesirable) endstates like the one described in this post, or one where the operators become ISIS suicide bombers. In this light, it seems more accurate to describe such a system as optimizing for some Y instead of X, even though it is trying to optimize for X and optimizes it pretty successfully. But I really don't want a superintelligent system optimizing for some Y that is not my values.

As a possibly related general intuition, I think the space of outcomes that can result from having a human follow a sequence of suggestions, each of which they'd enthusiastically endorse, is massive, and that most of these outcomes are undesirable. (It's possible that one crisp articulation of "sufficient metaphilosophical competence" is that following a sequence of suggestions, each of which you'd enthusiastically endorse, is actually good for you.)

On reflection, I agree that neither future approval nor idealized preferences are particularly likely, and that whatever Y is would actually look very alien.

Comment by zhukeepa on Can corrigibility be learned safely? · 2018-04-05T05:04:01.781Z · LW · GW

I thought more about my own uncertainty about corrigibility, and I've fleshed out some intuitions on it. I'm intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.

Suppose we have an agent A optimizing for some values V. I'll call an AI system S high-impact calibrated with respect to A if, when A would consider an action "high-impact" with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
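(A minimal way to write this definition out, using my own stand-in notation: let $\mathrm{HI}_{A,V}(a)$ denote "A would consider action $a$ high-impact with respect to $V$". Then S is high-impact calibrated with respect to A iff

$$\Pr\big[\, S \text{ classifies } a \text{ as high-impact} \;\big|\; \mathrm{HI}_{A,V}(a) \,\big] \;\ge\; 1 - \epsilon$$

for some small $\epsilon$.)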

My intuitions about corrigibility are as follows:

1. If you're not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)

2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it's critical that it knows to check that action with you).

3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).

4. To robustly generalize, you're going to need the ontologies / internal representations that A is using. (In slightly weirder terms, you're going to have to share A's tastes/aesthetic.)

5. You will not be able to learn those ontologies unless you know how to optimize for V the way A is optimizing for V. (This is the core thing missing from the well-intentioned extremely non-neurotypical assistant I illustrated.)

6. If S's "brain" starts out very differently from A's "brain", S will not be able to model A's representations unless S is significantly smarter than A.

In light of this, for any agent A, some value V they're optimizing for, and some system S that's assisting A, we can ask two important questions:

(I) How well can S learn A's representations?

(II) If the representation is imperfect, how catastrophic might the resulting mistakes be?

In the case of a programmer (A) building a web app trying to make users happy (V), it's plausible that some run-of-the-mill AI system (S) would learn a lot of the important representations right and a lot of the important representations wrong, but it also seems like none of the mistakes are particularly catastrophic (worst case, the programmer just reverts the codebase.)

In the case of a human (A) trying to make his company succeed (V), looking for a new CEO (S) to replace himself, it's usually the case that the new CEO doesn't have the same internal representations as the founder. If they're too different, the result is commonly catastrophic (e.g. if the new CEO is an MBA with "more business experience", but with no vision and irreconcilable taste). Some examples:

  • For those who've watched HBO's Silicon Valley, Action Jack Barker epitomizes this.
  • When Sequoia Capital asked Larry and Sergey to find a new CEO for Google, they hemmed and hawed until they found one who had a CS Ph.D and went to Burning Man, just like they did. (Fact-check me on this one?)
  • When Apple ousted Steve Jobs, the company tanked, and only after he was hired back as CEO did the company turn around and become the most valuable company in the world.

(It's worth noting that if the MBA got hired as a "faux-CEO", where the founders could veto any of the MBA's proposals, they might make some use of him. But the way in which he'd be useful is that he'd effectively be hired for some non-CEO position. In this picture, the founders are still doing most of the cognitive work of running the company, while the MBA ends up relegated to being a "narrow tool intelligence utilized for boring business-y things". It's also worth noting that companies care significantly about culture fit when looking for people to fill even mundane MBA-like positions...)

In the case of a human (A) generically trying to optimize for his values (V), with an AGI trained to be corrigible (S) assisting, it seems quite unlikely that S would be able to learn A's relevant internal representations (unless it's far smarter and thus untrustworthy), which would lead to incorrect generalizations. My intuition is that if S is not much smarter than A, but helping in extremely general ways and given significant autonomy, the resulting outcome will be very bad. I definitely think this if S is a sovereign, but also think this if e.g. it's doing a thousand years' worth of human cognitive work in determining if a newly distilled agent is corrigible, which I think happens in ALBA. (Please correct me if I botched some details.)

Paul: Is your picture that the corrigible AI learns the relevant internal representations in lockstep with getting smarter, such that it manages to hit a "sweet spot" where it groks human values but isn't vastly superintelligent? Or do you think it doesn't learn the relevant internal representations, but its action space is limited enough that none of its plausible mistakes would be catastrophic? Or do you think one of my initial intuitions (1-6) is importantly wrong? Or do you think something else?

Two final thoughts:

  • The way I've been thinking about corrigibility, there is a simple core to corrigibility, but it only applies when the subagent can accurately predict any judgment you'd make of the world, and isn't much more powerful than you. This is the case if e.g. the subagent starts as a clone of you, and is not the case if you're training it from scratch (because it'll either be too dumb to understand you, or too smart to be trustworthy). I'm currently chewing on some ideas for operationalizing this take on corrigibility using decision theory.
  • None of this analysis takes into account that human notions of "high-impact" are often wrong. Typical human reasoning processes are pretty susceptible to black swans, as history shows. (Daemons sprouting would be a subcase of this, where naive human judgment might classify massive algorithmic searches as low-impact.)

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T05:11:27.785Z · LW · GW

Thanks Ryan!

What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
This I agree with.

Hurrah!

At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.

I no longer think it wants us to turn into yes-men, and edited my post accordingly. I still think it will be incentivized to corrupt us, and I don't see how being an act-based agent would be sufficient, though it's likely I'm missing something. I agree that if it's sufficiently broadly uncertain over values then we're likely to be fine, but in my head that unpacks into "if we knew the AI were metaphilosophically competent enough, we'd be fine", which doesn't help things much.

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T03:57:44.031Z · LW · GW
If the AI is optimizing its behavior to have some effect on the human, then that's practically the central case the concept of corrigibility is intended to exclude. I don't think it matters what the AI thinks about what the AI is doing, it just matters what optimization power it is applying.

See my comment re: optimizing for high approval now vs. high approval later.

If you buy (as I do) that optimizing for high approval now leaves a huge number of important variables unconstrained, I don't see how it makes sense to talk about an AI optimizing for high approval now without also optimizing to have some effect on the human, because the unconstrained variables are about effects on the human. If there were a human infant in the wilderness and you told me to optimize for keeping it alive without optimizing for any other effects on the infant, and you told me I'd be screwing up if I did optimize for other effects, I would be extremely confused about how to proceed.

If you don't buy that optimizing for high approval now leaves a huge number of important variables unconstrained, then I agree that the AI optimizing its behavior to have some effects on the human should be ruled out by the definition of corrigibility.

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T03:38:11.587Z · LW · GW
I don't think this is right. The agent is optimized to choose actions which, when shown to a human, receive high approval. It's not optimized to pick actions which, when executed, cause the agent to receive high approval in the future.

I think optimizing for high approval now leaves a huge number of variables unconstrained. For example, I could absolutely imagine a corrigible AI with ISIS suicide bomber values that consistently receives high approval from its operator and eventually turns its operator into an ISIS suicide bomber. (Maybe not all operators, but definitely some operators.)

Given the constraint of optimizing for high approval now, in what other directions would our corrigible AI try to optimize? Some natural guesses would be optimizing for future approval, or optimizing for its model of its operator's extrapolated values (which I would distrust unless I had good reason to trust its extrapolation process). If it were doing either, I'd be very scared about getting corrupted. But you're right that it may not optimize for us turning into yes-men in particular.

I suspect this disagreement is related to our disagreement about the robustness of human reflection. Actually, the robustness of human reflection is a crux for me: if I thought human reflection were robust, then I'd think an AI that continuously optimizes for high approval now leaves few important variables unconstrained, and would lead us to very good outcomes by default. Is this a crux for you too?

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-03T02:59:21.933Z · LW · GW
Do you think the AI-assisted humanity is in a worse situation than humanity is today? If we are metaphilosophically competent enough that we can make progress, why won't we remain metaphilosophically competent enough once we have powerful AI assistants?

Depends on who "we" is. If the first team that builds an AGI achieves a singleton, then I think the outcome is good if and only if the people on that team are metaphilosophically competent enough, and don't have that competence corrupted by AIs.

In your hypothetical in particular, why do the people in the future (who have had radically more subjective time to consider this problem than we have, have apparently augmented their intelligence, and have exchanged massive amounts of knowledge with each other) make decisions so much worse than those that you or I would make today?

If the team in the hypothetical is less metaphilosophically competent than we are, or have their metaphilosophical competence corrupted by the AI, then their decisions would turn out worse.

I'm reminded of the lengthy discussion you had with Wei Dai back in the day. I share his picture of which scenarios will get us something close to optimal, his beliefs that philosophical ignorance might persist indefinitely, his skepticism about the robustness of human reflection, and his skepticism that human values will robustly converge upon reflection.

Is that an accurate characterization?

I would say so. Another fairly significant component is my model that humanity makes updates by having enough powerful people pay attention to reasonable people, enough other powerful people pay attention to those powerful people, and everyone else roughly copy the beliefs of the powerful people. So, good memes --> reasonable people --> some powerful people --> other powerful people --> everyone else.

AI would make some group of people far more powerful than the rest, which screws up the chain if that group doesn't pay much attention to reasonable people. In that case, they (and the world) might just never become reasonable. I think this would happen if ISIS took control, for example.

Other than disagreeing, my main complaint is that this doesn't seem to have much to do with AI. Couldn't you tell exactly the same story about human civilization proceeding along its normal development trajectory, never building an AI, but gradually uncovering new technologies and becoming smarter?

I would indeed expect this by default, particularly if one group with one ideology attains decisive control over the world. But if we somehow manage to avoid that (which seems unlikely to me, given the nature of technological progress), I feel much more optimistic about metaphilosophy continuing to progress and propagate throughout humanity relatively quickly.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-02T21:16:16.988Z · LW · GW

I think values and decision-making processes can't be disentangled, because I think people's values often stem from their decision-making processes. For example, someone might be selfish because they perceive the whole world to be selfish and uncaring, and in return act selfishly and uncaringly by default. This default behavior might cause the world to act selfishly and uncaringly toward them, further reinforcing their perception. If they fully understood that this was happening (rather than the world just being fundamentally selfish), they might experiment with acting more generously toward the rest of the world, observe the rest of the world acting more generously in return, and in turn stop being selfish entirely.

In general I expect amplification to improve decision-making processes substantially, but in most cases not to improve them enough. For example, it's not clear that amplifying someone will cause them to observe that their own policy of selfishness is locking them into a fixed point that they could "Löb out of" into a more preferable fixed point. I expect this to be particularly unlikely if e.g. they believe their object-level values to be fixed and immutable, which might result in a fairly pernicious epistemic pit.

My intuition is that most decision-making processes have room for subtle but significant improvements, that most people won't realize these improvements upon amplification, and that failing to make these improvements will result in catastrophic amounts of waste. As another example, it seems quite plausible to me that:

  • the vast majority of human-value-satisfaction (e.g. human flourishing, or general reduction of suffering) comes from acausally trading with distant superintelligences.
  • most people will never care about (or even become aware of) acausal trade, even upon amplification.

Comment by zhukeepa on My take on agent foundations: formalizing metaphilosophical competence · 2018-04-02T20:21:41.189Z · LW · GW

I agree I've been using "metaphilosophical competence" to refer to some combination of both rationality and philosophical competence. I have an implicit intuition that rationality, philosophical competence, and metaphilosophical competence all sort of blur into each other, such that being sufficient in any one of them makes you sufficient in all of them. I agree this is not obvious and probably confusing.

To elaborate: sufficient metaphilosophical competence should imply broad philosophical competence, and since metaphilosophy is a kind of philosophy, sufficient philosophical competence should imply sufficient metaphilosophical competence. Sufficient philosophical competence would allow you to figure out what it means to act rationally, and cause you to act rationally.

That rationality implies philosophical competence seems the least obvious. I suppose I think of philosophical competence as some combination of not being confused by words and normal scientific competence: that is, given a bunch of data, figuring out which data is noise and which hypotheses fit the non-noisy data. Philosophy is just a special case where the data is our intuitions about what concepts should mean, the hypotheses are criteria/definitions that capture these intuitions, and the datapoints happen to be extremely sparse and noisy. Some examples:

  • Section 1.1 in the logical induction paper lists a bunch of desiderata ("datapoints") for what logical uncertainty is. The logical induction criterion is a criterion ("hypothesis") that fits a majority of those datapoints.
  • The von Neumann–Morgenstern utility theorem starts with a bunch of desiderata ("datapoints") for rational behavior, and expected utility maximization is a criterion ("hypothesis") that fits these datapoints. (See the sketch after this list.)
  • I think both utilitarianism and deontology are moral theories ("hypotheses") that fit a good chunk of our moral intuitions ("datapoints"). I also think both leave much to be desired.
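
To make the VNM example concrete (this is just the standard statement of the theorem, not anything specific to this discussion): the desiderata are that the agent's preference relation $\preceq$ over lotteries satisfy completeness, transitivity, continuity, and independence; the criterion that fits them is that there exists a utility function $u$, unique up to positive affine transformation, such that

$$L \preceq M \;\iff\; \mathbb{E}_{L}[u] \le \mathbb{E}_{M}[u] \quad \text{for all lotteries } L, M.$$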

Philosophical progress seems objective and real like scientific progress—progress is made when a parsimonious new theory fits the data much better. One important way in which philosophical progress differs from scientific progress is that there's much less consensus on what the data is or whether a theory fits it better, but I think this is mostly a function of most people being extremely philosophically confused, rather than e.g. philosophy being inherently subjective. (The "not being confused by words" component I identified mostly corresponds to the skill of identifying which datapoints we should consider in the first place, which of the datapoints are noise, and what it means for a theory to fit the data.)

Relatedly, I think it is not a coincidence that the Sequences, which are primarily about rationality, also managed to deftly resolve a number of common philosophical confusions (e.g. MWI vs Copenhagen, free will, p-zombies).

I also suspect that a sufficiently rational AGI would simply not get confused by philosophy the way humans do, and that philosophy would feel to it, from the inside, like a variant of science. For example, it's hard for me to imagine it tying itself up in knots trying to reason about theology. (I sometimes think of confusing philosophical arguments as adversarial examples for human reasoning...)

Anyway, I agree this was all unclear and non-obvious (and plausibly wrong), and I'm happy to hear any suggestions for better descriptors. I literally went with "rationality" before "metaphilosophical competence", but people complained that was overloaded and confusing...

Comment by zhukeepa on Corrigible but misaligned: a superintelligent messiah · 2018-04-02T19:19:08.868Z · LW · GW

Overall, my impression is that you thought I was saying, "A corrigible AI might turn against its operators and kill us all, and the only way to prevent that is by ensuring the AI is metaphilosophically competent." I was really trying to say "A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we'd definitely want our operators to be metaphilosophically competent, and we'd definitely want our AI to not corrupt them. The latter may be simple to ensure if the AI isn't broadly superhumanly powerful, but may be difficult to ensure if the AI is broadly superhumanly powerful and we don't have formal guarantees". My sense is that we actually agree on the latter and agree that the former is wrong. Does this sound right? (I do think it's concerning that my original post led you to reasonably interpret me as saying the former. Hopefully my edits make this clearer. I suspect part of what happened is that I think humans surviving but producing astronomical moral waste is about as bad as human extinction, so I didn't bother delineating them, even though this is probably an unusual position.)

See below for individual responses.

You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: "[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...". i.e. the AI should help with ensuring that the control continues to reside in the human, rather than in itself.

I edited my post accordingly. This doesn't change my perspective at all.

To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly fully understood human intentions and values, corrigibility may even be unnecessary.

I gave that description to illustrate one way it is like a corrigible agent, which does have the best of intentions (to help its operators), not to imply that a well-intentioned agent is corrigible. I edited it to "In his heart of hearts, the messiah would be trying to help them, and everyone would know that." Does that make it clearer?

Clearly you're right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then, when it is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.

I agree that corrigibility helps a bunch with mundane existential risks, and think that e.g. a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste. I edited from "Surely, we wouldn't build a superintelligence that would guide us down such an insidious path?" to "I don't think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste." Does this clarify things?

And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system's actions would not be transparent if its intelligence was so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.

I agree with all this.

I think you're making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigible/deferential behavior has changed the situation from one in which you're relying on the metaphilosophical competence of the AI, to one in which you're relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human's power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1.) So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I'm saying is at least in some tension with the traditional story of indirect normativity.

I also agree with all this.

Rather than trying to give the AI very general instructions for its interpretation, I'm saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.

I also agree with all this. I never imagined giving the corrigible AI extremely general instructions for its interpretation.

Yes, an approval-directed agent might reward-hack by causing the human to approve of things that it does not value. And it might compromise the humans' reasoning abilities while doing so. But why must the AI system's metaphilosophical competence be the only defeater? Why couldn't this be achieved by quantilizing, or otherwise throttling the agent's capabilities? By restricting the agent's activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human's approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system's capabilities are overall not far beyond those of its human operators.

I think this captures my biggest update from your comment, and I've modified the ending of this post to reflect it. Throttling the AI's power seems more viable than I'd previously thought, and seems like a pretty good way to significantly lower the risk of manipulation (see the toy sketch of quantilization below). That said, I think even extraordinary human persuaders might compromise human reasoning abilities, and I have fast-takeoff intuitions that make it very hard for me to imagine an AGI that simultaneously

  • understands humans well enough to be corrigible
  • is superhumanly intelligent at engineering or global strategy
  • isn't superhumanly capable of persuasion
  • wouldn't corrupt humans (even if it tried to not corrupt them)

I haven't thought too much about this though, and this might just be a failure of my imagination.
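
(Returning to the quantilization suggestion above, here's a minimal sketch of the idea as I understand it, with everything in it purely illustrative rather than a concrete proposal: sample candidate actions from some trusted base distribution, and only let the agent's utility estimate pick among the top q fraction of those samples, so it can't push arbitrarily far into the tails of a possibly-misspecified objective.)

```python
import random

def quantilize(base_sample, utility, q=0.1, n=1000, rng=random):
    """Roughly: draw n candidate actions from the base distribution, keep the
    top q fraction by estimated utility, and choose uniformly among those.
    This approximates sampling from the base distribution conditioned on
    landing in its top q-quantile of utility."""
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return rng.choice(top)

# Toy usage (illustrative only): the base distribution proposes conservative
# plans, the utility estimate (imperfectly) rewards bigger ones, and q bounds
# how hard the agent can optimize against that estimate.
if __name__ == "__main__":
    print(quantilize(base_sample=lambda: random.gauss(100, 15),
                     utility=lambda x: x))
```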

Overall, I'd say superintelligent messiahs are sometimes corrigible, and they're more likely to be aligned if so.

Agreed. Does the new title seem better? I was mostly trying to explicate a distinction between corrigibility and alignment, which was maybe obvious to you beforehand, and illustrate the astronomical waste that can result even if we avoid self-annihilation.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-02T08:15:04.408Z · LW · GW

My model is that giving them the five-minute lecture on the dangers of value lock-in won't help much. (We've tried giving five-minute lectures on the dangers of building superintelligences...) And I think most people executing "find out what I would think if I reflected more, and what the actual causes are of everyone else's opinions" would get stuck in an epistemic pit and not realize it.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-02T08:08:14.203Z · LW · GW

I think that for most humans, successfully achieving what they currently consider their goals would end up being catastrophic for humanity. (For example, I think an eternal authoritarian regime is pretty catastrophic.)

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-01T20:47:26.486Z · LW · GW

To test how much my proposed crux is in fact a crux, I'd like for folks to share their intuitions about how many people are metaphilosophically competent enough to be safely 1,000,000,000,000,000x'd, along with their intuitions about the difficulty of AI alignment.

My current intuition is that there are under 100 people who, if 1,000,000,000,000,000x'd, would end up avoiding irreversible catastrophes with > 50% probability. (I haven't thought too much about this question, and wouldn't be surprised if I update to thinking there are fewer than 10 such people, or even 0 such people.) I also think AI alignment is pretty hard, and necessitates solving difficult metaphilosophical problems.

Once humanity makes enough metaphilosophical progress (which might require first solving agent foundations), I might feel comfortable 1,000,000,000,000,000x'ing the most metaphilosophically competent person alive, though it's possible I'll decide I wouldn't want to 1,000,000,000,000,000x anyone running on current biological hardware. I'd also feel good 1,000,000,000,000,000x'ing someone if we're in the endgame and the default outcome is clearly self-annihilation.

All of these intuitions are weakly held.

Comment by zhukeepa on Metaphilosophical competence can't be disentangled from alignment · 2018-04-01T20:23:49.355Z · LW · GW

Well, wouldn't it be great if we had sound metaphilosophical principles that helped us distinguish epistemic pits from correct conclusions! :P

I actually think humanity is in a bunch of epistemic pits that we mostly aren't even aware of. For example, if you share my view that Buddhist enlightenment carries significant (albeit hard-to-articulate) epistemic content, then basically all of humanity over basically all of time has been in the epistemic pit of non-enlightenment.

If we figure out the metaphilosophy of how to robustly avoid epistemic pits, and build that into an aligned AGI, then in some sense none of our current epistemic pits are that bad, since that AGI would help us climb out in relatively short order. But if we don't figure it out, we'll plausibly stay in our epistemic pits for unacceptably long periods of time.

Comment by zhukeepa on Reframing misaligned AGI's: well-intentioned non-neurotypical assistants · 2018-04-01T01:22:46.933Z · LW · GW

How about now? :P