Another (outer) alignment failure story 2021-04-07T20:12:32.043Z
My research methodology 2021-03-22T21:20:07.046Z
Demand offsetting 2021-03-21T18:20:05.090Z
It’s not economically inefficient for a UBI to reduce recipient’s employment 2020-11-22T16:40:05.531Z
Hiring engineers and researchers to help align GPT-3 2020-10-01T18:54:23.551Z
“Unsupervised” translation as an (intent) alignment problem 2020-09-30T00:50:06.077Z
Distributed public goods provision 2020-09-26T21:20:05.352Z
Better priors as a safety problem 2020-07-05T21:20:02.851Z
Learning the prior 2020-07-05T21:00:01.192Z
Inaccessible information 2020-06-03T05:10:02.844Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z
Hedonic asymmetries 2020-01-26T02:10:01.323Z
Moral public goods 2020-01-26T00:10:01.803Z
Of arguments and wagers 2020-01-10T22:20:02.213Z
Prediction markets for internet points? 2019-10-27T19:30:00.898Z
AI alignment landscape 2019-10-13T02:10:01.135Z
Taxing investment income is complicated 2019-09-22T01:30:01.242Z
The strategy-stealing assumption 2019-09-16T15:23:25.339Z
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z
What failure looks like 2019-03-17T20:18:59.800Z
Security amplification 2019-02-06T17:28:19.995Z
Reliability amplification 2019-01-31T21:12:18.591Z
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z
Thoughts on reward engineering 2019-01-24T20:15:05.251Z
Learning with catastrophes 2019-01-23T03:01:26.397Z
Capability amplification 2019-01-20T07:03:27.879Z
The reward engineering problem 2019-01-16T18:47:24.075Z
Towards formalizing universality 2019-01-13T20:39:21.726Z
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z
Benign model-free RL 2018-12-02T04:10:45.205Z
Corrigibility 2018-11-27T21:50:10.517Z
Humans Consulting HCH 2018-11-25T23:18:55.247Z
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z
Approval-directed agents 2018-11-22T21:15:28.956Z
Prosaic AI alignment 2018-11-20T13:56:39.773Z
An unaligned benchmark 2018-11-17T15:51:03.448Z
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z
The Steering Problem 2018-11-13T17:14:56.557Z
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z
Meta-execution 2018-11-01T22:18:10.656Z
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z


Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-14T17:30:01.399Z

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

I'm wondering why the easiest way is to copy A'---why was A' better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you're at that point, A' has an advantage.

In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).

This doesn't feel like other words to me, it feels like a totally different claim.

Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out.

In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firms that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms).

So none of the concrete examples of Wei Dai's economies of scale actually seem to apply to give an advantage for the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn't charge each other marginal costs; they would charge profit-maximizing prices. And they won't share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on.

Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.

This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don't yet understand why this happens.

Suppose that the US and China both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other.

Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected?

For example, you might think that the Chinese and US widget industries share their insights into how to make widgets (as the aligned actors do in Wei Dai's story), and that this will cause widget-making to do better than other non-widget sectors where such coordination is not possible. But I don't see why they would do that---the US firms that share their insights freely with Chinese firms do worse, and would be selected against in every relevant sense, relative to firms that attempt to effectively monetize their insights. But effectively monetizing their insights is exactly what the US widget industry should do in order to benefit the US. So I see no reason why the widget industry would be more prone to sharing its insights.

So I don't think that particular example works. I'm looking for an example of that form though, some concrete form of cooperation that the production-maximization subprocesses might engage in that allows them to overwhelm the original cultures, to give some indication for why you think this will happen in general.

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-14T01:10:13.636Z

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment

In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”

(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.)

On top of that, in your prior post you make stronger claims:

  • “Contributions to OODR research are not particularly helpful to existential safety in my opinion.”
  • “Contributions to preference learning are not particularly helpful to existential safety in my opinion”
  • “In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents)

In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses.

maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (though multi-agent RL or computational social choice would not be near the top of my personal list).

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-13T20:26:26.221Z

Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment

Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel that more general capabilities typically look less cost-effective on alignment in particular, but benefit a ton from the diversity of problems they help address.)

My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Reminder: this is not a bid for you personally to quit working on alignment!

I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-13T18:49:20.997Z

Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

  • One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned?
  • Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.
  • Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concerns into my language.
  • My sense is that you have something else in mind. I included the last bullet point as a representative example to describe the kind of advantage I could imagine you thinking that A' had.
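The quantitative claim in the second bullet above can be sketched with a toy compounding model (illustrative numbers only, not from the original discussion): even if B consumes a few percent of its wealth each year while A' reinvests everything, B's absolute consumption still grows enormously, and B's relative share shrinks only slowly.

```python
# Toy model (illustrative, made-up parameters): A' reinvests everything;
# B consumes a small fixed fraction of its wealth each year. Both start
# with equal endowments and earn the same rate of return.
a_wealth, b_wealth = 1.0, 1.0
ret, consume_frac = 0.20, 0.02   # 20%/yr returns; B spends 2%/yr

b_consumption_by_year = []
for year in range(50):
    b_consumption = consume_frac * b_wealth
    b_consumption_by_year.append(b_consumption)
    a_wealth *= 1 + ret                             # A' compounds fully
    b_wealth = (b_wealth - b_consumption) * (1 + ret)  # B compounds the rest

b_share = b_wealth / (a_wealth + b_wealth)
print(f"B's final share of total wealth: {b_share:.2f}")
print(f"Growth in B's yearly consumption: "
      f"{b_consumption_by_year[-1] / b_consumption_by_year[0]:.0f}x")
```

With these made-up parameters, B's share drifts from 50% toward roughly a quarter over 50 years while its yearly consumption grows by a factor of thousands, which is the sense in which the investment-fraction disadvantage alone doesn't let A' quickly dominate.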

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-13T01:17:59.882Z

I think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that's likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn't mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy).

The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with "For many people this is a very scary situation.") we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?

I don't think this is infeasible. It's not the intervention I'm most focused on, but it may be the easiest way to avoid this failure (and it's an important channel for advance preparations to make things better / important payoff for understanding what's up with alignment and correctly anticipating problems).

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-12T23:33:13.837Z

I understand the scenario says it isn't because the demonstrations are incomprehensible

Yes, if demonstrations are comprehensible then I don't think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.


The global camera grab must involve plans that aren't clearly bad to humans even when all the potential gotchas are pointed out. For example they may involve dynamics that humans just don't understand, or where a brute force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that's about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can't figure out. It might involve the construction of new AI-designed AI systems which operate in different ways whose function we can't really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so-incentivized, but which are even harder to control).

(There is obviously a lot you could say about all the tools at the human's disposal to circumvent this kind of problem.)

This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on the part of the civilization, (iii) "correct" generalization.

Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can't solve alignment well enough in the intervening time, I do agree that it's unlikely we can solve it in advance.)

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-12T18:13:32.960Z

I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.

The main way it's worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story.

Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?

I don't think it's right to infer much about my stance on inner vs outer alignment. I don't know if it makes sense to split out "social competence" in this way. 

In this story, there aren't any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it's more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven't had major war for seventy years, and maybe that's because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff?

The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the "takeoff" part of the story is subjectively shorter than the last 70 years.

IDK, I worry that the reasons why we haven't had war for seventy years may be largely luck / observer selection effects, and also separately even if that's wrong

I'm extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.

Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on "under the hood" so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?

I don't think the AI systems are all on the same team. That said, to the extent that there are "humans are deluded" outcomes that are generally preferable according to many AI's values, I think the AIs will tend to bring about such outcomes. I don't have a strong view on whether that involves explicit coordination. I do think the range of everyone-wins outcomes (amongst AIs) is larger because of the "AI's generalize 'correctly'" assumption, so this story probably feels a bit more like "us vs them" than a story that relaxed that assumption.

Why aren't they fighting each other as well as the humans? Or maybe they do fight each other but you didn't focus on that aspect of the story because it's less relevant to us?

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren't even as superficially aligned as the unaligned benchmark. They won't even be trying to make things look good according to human judgment, much less augmented human judgment!

I'm imagining that's the case in this story.

Failure is early enough in this story that e.g. the human's investment in sensor networks and rare expensive audits isn't slowing them down very much compared to the "rogue" AI.

Such "rogue" AI could provide a competitive pressure, but I think it's a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).

Can you say more about how "the failure modes in this story are an important input into treachery?"

We will be deploying many systems to anticipate/prevent treachery. If we could stay "in the loop" in the sense that would be needed to survive this outer alignment story, then I think we would also be "in the loop" in roughly the sense needed to avoid treachery. (Though it's not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-12T17:32:03.299Z

I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I'm saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying to make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don't get what they want, and that's consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.

I realize you don't have a precise meaning of outer misalignment in mind, but confusion around this concept is central to the (in my opinion) confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

I use "outer alignment" to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.

It's a bit weird to talk about a failure story as an "outer" alignment failure story, or to describe a general system acting in the world as "outer misaligned," since most possible systems weren't built by following an alignment methodology that admits a clean division into an "outer" and "inner" part.

I added the word "(outer)" in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it's not meaningful to you then I would suggest ignoring it.

If there's anything useful to talk about in that space I think it's the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there's a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.

confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

Note that this isn't my view about intent alignment. (Though it is true tautologically for people who define "alignment" as "the problem of building AI systems that produce good outcomes when run"; as I've said, I quite dislike that definition.)

I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems. (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)

It's conceivable to me that making future narratives much more specific regarding the intended goals of AI designers

The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...

When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.

One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services.  If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don't want to get shot) and some they can't (e.g. they want to feel safe, they don't want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).

A second axis would be to break this down to the level of "single" AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and then zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how all of these individual AI systems trying to make things look good in hindsight (or instrumental subgoals thereof) result in bringing about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious---if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead to X and then deploying new AI to achieve Y, it's clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn't really happen in this story.)

A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?

For each of those three axes (and many others) it seems like there's a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.

My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here?  E.g., could you say something like "«machine X» in the story is outer-misaligned because «reason»"?

Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?

For (a): I'm judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user's intentions:

  • The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of "alignment warning shots" that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
  • The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what's up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
  • The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to "cooperate" and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, the defense system that allows invaders to take over a city as long as they participate in perpetuating an illusion of security is obviously misaligned.
  • The machines that actually hack cameras and seize datacenters are misaligned, because the humans don't actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for "we are actually safe and happy."

Most complex activities involve a large number of components, and I agree that these descriptions are still "multi-agent" in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.

For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.

(b): these systems became misaligned because they are an implementation of an algorithm (the "unaligned benchmark") that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren't replaced by aligned versions because we didn't know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a sufficiently strong and competent world government, etc.

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-12T00:14:58.206Z · LW · GW

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content).

Why don't AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an ad hoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each others' ideas?

I think they do, but it's not clear whether any of them change the main dynamic described in the post.

(My expectation is that we end up with one or multiple sequences of "improved" alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or have some other problem that is much subtler than the kind of outer alignment failure described here.)

I'd like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that's the only path that can really grow out of the current trajectory in a way that's not super objectionable locally to lots of people, and so I'm focusing on people's attempts and failures to construct such an AI.

I don't know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here. Maybe you think it doesn't pre-empt this failure, but that you expect we probably can solve the immediate problem described in this post and then get screwed by a different problem down the line. If so, then I think I agree that this story is a little bit on the pessimistic side w.r.t. the immediate problem, although I may disagree about just how pessimistic to be. (Though there's still a potentially-larger disagreement about just how bad the situation is after solving that immediate problem.)

(You might leave great value on the table from e.g. not bargaining with the simulators early enough and so getting shut off, or not bargaining with each other before you learn facts that make those bargains impossible and so permanently leaving value on the table, but this is not a story about that kind of failure, and indeed those failures happen in parallel with the failure in this story.)

Comment by paulfchristiano on My research methodology · 2021-04-09T05:42:16.941Z · LW · GW

As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.

That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?

I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in-between thing, which is roughly: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis, so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take far more than a billion units of time, and so this may become intractable even before you get to a billion).

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-09T05:30:59.332Z · LW · GW

I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :)

I'm basically always talking about intent alignment, as described in this post.

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)

Comment by paulfchristiano on My research methodology · 2021-04-09T01:49:13.999Z · LW · GW

In my other response to your comment I wrote:

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long.

I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:

  1. There is no material weakness in the security proof.
  2. A material weakness is already known.
  3. An interested layperson could find a material weakness with moderate effort.
  4. An expert could find a material weakness with significant effort.

My guess would be that probably we're in world 2, and if not that it's probably because no one cares that much (e.g. because it's obvious that there will be some material weakness and the standards of the field are such that it's not publishable unless it actually comes with an attack) and we are in world 3.

(On a quick skim, and from the author's language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)

Comment by paulfchristiano on My research methodology · 2021-04-09T01:31:45.298Z · LW · GW

Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarification/exposition than changing views.

See the thread with Rohin above for some rough history.

Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?

I'm not sure. It's possible I would become more pessimistic if I walked through concrete cases of people's analyses being wrong in subtle and surprising ways.

My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don't care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I've seen people on the theory side exhibit when they are trying to think of attacks.

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long. It sounds like a fun game.

Another possible divergence is that I'm less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it's not clear if that game behaves in the same way. I'm not sure if that's more or less important than the prior point.

Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography?

I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though "before letting it become potentially superintelligent" kind of suggests a development model I'm not on board with).

That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on. 

It would also involve good empirical characterization, including things like running on red team inputs, or changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).

What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?

I'm not exactly sure what you are asking. It seems like we'll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we'd use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?

Comment by paulfchristiano on My research methodology · 2021-04-09T00:58:37.887Z · LW · GW

From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2

That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

I'm mostly doing that by making it more and more concrete---something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you? 

Sometimes after filling in a few details I'll see that the current story isn't actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.

Sometimes I fill in enough details that I'm fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that's consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.

(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there's an argument that a big enough model could certainly compute X. Or sometimes I'm just pretty convinced for heuristic reasons.)

That's not a fully-precise methodology. But it's roughly what I'd do. (There are many places where the methodology in this post is not fully-precise and certainly not mechanical.)

If I were starting from the trivial story "and then your algorithm kills you," my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim "And this was the model learned by SGD"), and then gradually fill in more details as necessary to evaluate the plausibility of the story.

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-08T15:53:36.579Z · LW · GW

I think the upshot of those technologies (and similarly for ML assistants) is:

  1. It takes longer before you actually face a catastrophe.
  2. In that time, you can make faster progress towards an "out".

By an "out" I mean something like: (i) figuring out how to build competitive aligned optimizers, (ii) coordinating to avoid deploying unaligned AI.

Unfortunately I think [1] is a bit less impactful than it initially seems, at least if we live in a world of accelerating growth towards a singularity. For example, if the singularity is in 2045 and it's 2035, and you were going to have catastrophic failure in 2040, you can't really delay it by much calendar time. So [1] helps you by letting you wait until you get fancier technology from the fast outside economy, but doesn't give you too much more time for the slow humane economy to "catch up" on its own terms.

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-04-08T15:42:56.304Z · LW · GW

I don't think, from the perspective of humans monitoring a single ML system running a concrete, quantifiable process - industry or mining or machine design - that it will be unexplainable.  Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter. 

This seems like the key question.

Ever more complex designs for, say, a mining robot might start to resemble more and more some mix of living creatures and artwork out of a fractal, but we'll still have reports that measure how much performance the design gives per cost.

I think that if we relate to our machines in the same way we relate to biological systems or ecologies, but AI systems actually understand those systems very well, then that's basically what I mean.

Having reports about outcomes is a kind of understanding, but it's basically the one I'm scared of (since e.g. it will be tough to learn about these kinds of systemic risks via outcome-driven reports, and attempts to push down near-misses may just transform them into full-blown catastrophes).

Comment by paulfchristiano on Misalignment and misuse: whose values are manifest? · 2021-04-07T18:13:02.074Z · LW · GW

It seems like if Bob deploys an aligned AI, then it will ultimately yield control of all of its resources to Bob. It doesn't seem to me like this would result in a worthless future even if every single human deploys such an AI.

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T18:05:12.438Z · LW · GW

The attractor I'm pointing at with the Production Web is that entities with no plan for what to do with resources---other than "acquire more resources"---have a tendency to win out competitively over entities with non-instrumental terminal values like "humans having good relationships with their children".

Quantitatively I think that entities focused purely on acquiring resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seem quite low (and the high-growth period during which they have to be maintained is quite short).
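For concreteness, that arithmetic can be sanity-checked under a simple continuous-compounding model (the model and the snippet are my illustration, not part of the original comment): if each entity's wealth grows exponentially at a rate proportional to its savings rate, then after the economy grows by a factor G the lower saver's share shrinks by a factor of G^((s_avg − s_me)/s_avg).

```python
# Sketch under an assumed continuous-compounding model: wealth grows
# exponentially at a rate proportional to the savings rate, so a 95%-saver's
# share of a 99%-saving economy decays as growth ** (-(0.99 - 0.95) / 0.99).
avg_rate, my_rate = 0.99, 0.95   # economy-wide vs. personal savings rate
growth = 10_000                  # overall growth factor of the economy

share_remaining = growth ** (-(avg_rate - my_rate) / avg_rate)
print(f"share remaining after {growth}x growth: {share_remaining:.2f}")
# -> share remaining after 10000x growth: 0.69
```

Under this particular model the share falls to roughly 0.69 rather than exactly half; the precise factor depends on the growth model assumed, but the qualitative point stands: with savings rates this close together, losing ground takes an enormous amount of growth.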

Also, taxes typically transfer (way) more than that much value from high-savers to low-savers. It's not clear to me what's happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower, and it is dwarfed by the typical rate of expropriation from war. 

I think the difference between mine and your views here is that I think we are on track to collectively fail in that bargaining problem absent significant and novel progress on "AI bargaining" (which involves a lot of fairness/transparency) and the like, whereas I guess you think we are on track to succeed?

From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don't know exactly what your view on this is.

If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I'm quite skeptical about most of the particular kinds of work you advocate).

If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else).

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off.  I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.

Could you explain the advantage you are imagining? Some candidates, none of which I think are your view:

  • Single-single alignment failures---e.g. it's easier to build a widget-maximizing corporation than to build one where shareholders maintain meaningful control
  • Global savings rates are currently only 25%, power-seeking entities will be closer to 100%, and effective tax rates will fall (e.g. because of competition across states)
  • Preserving a hospitable environment will become very expensive relative to GDP (and there are many species of this view, though none of them seem plausible to me)

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T17:50:02.230Z · LW · GW

Yes, you understand me here.  I'm not (yet?) in the camp that we humans have "mostly" lost sight of our basic goals, but I do feel we are on a slippery slope in that regard.   Certainly many people feel "used" by employers/ institutions in ways that are disconnected with their values.  People with more job options feel less this way, because they choose jobs that don't feel like that, but I think we are a minority in having that choice.

I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders.

(In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T17:45:43.064Z · LW · GW

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better.  I just don't see that happening given the structure of today's global economy and tech industry.

In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems."

My position is:

  • Eventually people will work on these problems, but right now they are not working on them very much and so a few people can be a big proportional difference.
  • If there is going to be a huge investment in the future, then early investment and training can effectively be very leveraged. Scaling up fields extremely quickly is really difficult for a bunch of reasons.
  • It seems like AI progress may be quite fast, such that it will be extra hard to solve these problems just-in-time if we don't have any idea what we are doing in advance.
  • On top of all that, for many use cases people will actually be reasonably happy with misaligned systems like those in your story (that e.g. appear to be doing a good job, keep the board happy, perform well as evaluated by the best human-legible audits...). So it seems like commercial incentives may not push us to safe levels of alignment.

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T17:25:23.821Z · LW · GW

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment").   I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean.  (I sometimes use "misaligned" as a boolean due to it being easier for people to agree on what is "misaligned" than what is "aligned".)  In general, I think it's very unsafe to pretend numbers that are very close to 1 are exactly 1, because e.g., 1^(10^6) = 1 whereas 0.9999^(10^6) very much isn't 1, and the way you use the word "aligned" seems unsafe to me in this way.

(Perhaps you believe in some kind of basin of convergence around perfect alignment that causes sufficiently-well-aligned systems to converge on perfect alignment, in which case it might make sense to use "aligned" to mean "inside the convergence basin of perfect alignment".  However, I'm both dubious of the width of that basin, and dubious that its definition is adequately social-context-independent [e.g., independent of the bargaining stances of other stakeholders], so I'm back to not really believing in a useful Boolean notion of alignment, only scalar alignment.)
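The exponent arithmetic in the quoted comment is easy to verify (a one-line check, added here for illustration): compounding 0.9999 a million times collapses to about e^(−100).

```python
# 1 raised to any power stays exactly 1, while 0.9999 compounded a million
# times is astronomically small (about e**(-100), i.e. ~3.7e-44).
exact = 1 ** (10 ** 6)
nearly = 0.9999 ** (10 ** 6)
print(exact, nearly)
```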

I'm fine with talking about alignment as a scalar (I think we both agree that it's even messier than a single scalar). But I'm saying:

  1. The individual systems in your story could do something different that would be much better for their principals, and they are aware of that fact, but they don't care. That is to say, they are very misaligned.
  2. The story is risky precisely to the extent that these systems are misaligned.

In any case, I agree profit maximization is not a perfectly aligned goal for a company; however, it is a myopically pursued goal in a tragedy of the commons resulting from a failure to agree (as you point out) on something better to do (e.g., reducing competitive pressures to maximize profits).

The systems in your story aren't maximizing profit in the form of real resources delivered to shareholders (the normal conception of "profit"). Whatever kind of "profit maximization" they are doing does not seem even approximately or myopically aligned with shareholders.

I don't think the most obvious "something better to do" is to reduce competitive pressures, it's just to actually benefit shareholders. And indeed the main mystery about your story is why the shareholders get so screwed by the systems that they are delegating to, and how to reconcile that with your view that single-single alignment is going to be a solved problem because of the incentives to solve it.

Yes, it seems this is a good thing to hone in on.  As I envision the scenario, the automated CEO is highly aligned to the point of keeping the Board locally happy with its decisions conditional on the competitive environment, but not perfectly aligned [...] I'm not sure whether to say "aligned" or "misaligned" in your boolean-alignment-parlance.

I think this system is misaligned. Keeping me locally happy with your decisions while drifting further and further from what I really want is a paradigm example of being misaligned, and e.g. it's what would happen if you made zero progress on alignment and deployed existing ML systems in the context you are describing. If I take your stuff and don't give it back when you ask, and the only way to avoid this is to check in every day in a way that prevents me from acting quickly in the world, then I'm misaligned. If I do good things only when you can check while understanding that my actions lead to your death, then I'm misaligned. These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research.

(I definitely agree that an aligned system isn't automatically successful at bargaining.)

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-07T17:19:40.532Z · LW · GW

Overall, I think I agree with some of the most important high-level claims of the post:

  • The world would be better if people could more often reach mutually beneficial deals. We would be more likely to handle challenges that arise, including those that threaten extinction (and including challenges posed by AI, alignment and otherwise). It makes sense to talk about "coordination ability" as a critical causal factor in almost any story about x-risk.
  • The development and deployment of AI may provide opportunities for cooperation to become either easier or harder (e.g. through capability differentials, alignment failures, geopolitical disruption, or distinctive features of artificial minds). So it can be worthwhile to do work in AI targeted at making cooperation easier, even and especially for people focused on reducing extinction risks.

I also read the post as implying or suggesting some things I'd disagree with:

  • That there is some real sense in which "cooperation itself is the problem." I basically think all of the failure stories will involve some other problem that we would like to cooperate to solve, and we can discuss how well humanity cooperates to solve it (and compare "improve cooperation" to "work directly on the problem" as interventions). In particular, I think the stories in this post would basically be resolved if single-single alignment worked well, and that taking the stories in this post seriously suggests that progress on single-single alignment makes the world better (since evidently people face a tradeoff between single-single alignment and other goals, so that progress on single-single alignment changes what point on that tradeoff curve they will end up at, and since compromising on single-single alignment appears necessary to any of the bad outcomes in this story).
  • Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhancement or institutional improvement. I think that both cooperative improvements and cognitive enhancement operate by improving people's ability to confront problems, and both of them have the downside that they also accelerate the arrival of many of our future problems (most of which are driven by human activity). My current sense is that cooperation has a better tradeoff than some forms of enhancement (e.g. giving humans bigger brains) and worse than others (e.g. improving the accuracy of people's and institution's beliefs about the world).
  • That the nature of the coordination problem for AI systems is qualitatively different from the problem for humans, or somehow is tied up with existential risk from AI in a distinctive way. I think that the coordination problem amongst reasonably-aligned AI systems is very similar to coordination problems amongst humans, and that interventions that improve coordination amongst existing humans and institutions (and research that engages in detail with the nature of existing coordination challenges) are generally more valuable than e.g. work in multi-agent RL or computational social choice.
  • That this story is consistent with your prior arguments for why single-single alignment has low (or even negative) value. For example, in this comment you wrote "reliability is a strongly dominant factor in decisions in deploying real-world technology, such that to me it feels roughly correct to treat it as the only factor." But in this story people choose to adopt technologies that are less robustly aligned because they lead to more capabilities. This tradeoff has real costs even for the person deploying the AI (who is ultimately no longer able to actually receive any profits at all from the firms in which they are nominally a shareholder). So to me your story seems inconsistent with that position and with your prior argument. (Though I don't actually disagree with the framing in this story, and I may simply not understand your prior position.)

Comment by paulfchristiano on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-04-06T16:23:56.240Z · LW · GW

How are you inferring this?  From the fact that a negative outcome eventually obtained?  Or from particular misaligned decisions each system made?

I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.

The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

I guess this is probably just a gloss you are putting on the combined behavior of multiple systems, but you kind of take it as given rather than highlighting it as a serious bargaining failure amongst the machines, and more importantly you don't really say how or why this would happen. How is this goal concretely implemented, if none of the agents care about it? How exactly does the terminal goal of benefiting shareholders disappear, if all of the machines involved have that goal? Why does e.g. an individual firm lose control of its resources such that it can no longer distribute them to shareholders?

The implicit argument seems to apply just as well to humans trading with each other and I'm not sure why the story is different if we replace the humans with aligned AI. Such humans will tend to produce a lot, and the ones who produce more will be more influential. Maybe you think we are already losing sight of our basic goals and collectively pursuing alien goals, whereas I think we are just making a lot of stuff instrumentally which is mostly ultimately turning into stuff humans want (indeed I think we are mostly making too little stuff).

However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom.

This sounds like directly saying that firms are misaligned. I guess you are saying that individual AI systems within the firm are aligned, but the firm collectively is somehow misaligned? But not much is said about how or why that happens. 

It says things like:

Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations.  Over time, a mini-economy of trades emerges among mostly-automated companies in the materials, real estate, construction, and utilities sectors, along with a new generation of "precision manufacturing" companies that can use robots to build almost anything if given the right materials, a place to build, some 3d printers to get started with, and electricity. Together, these companies sustain an increasingly self-contained and interconnected "production web" that can operate with no input from companies outside the web.

But an aligned firm will also be fully-automated, will participate in this network of trades, will produce at approximately maximal efficiency, and so on. Where does the aligned firm end up using its resources in a way that's incompatible with the interests of its shareholders?


The first perspective I want to share with these Production Web stories is that there is a robust agent-agnostic process lurking in the background of both stories—namely, competitive pressure to produce—which plays a significant background role in both.

I agree that competitive pressures to produce imply that firms do a lot of producing and saving, just as it implies that humans do a lot of producing and saving. And in the limit you can basically predict what all the machines do, namely maximally efficient investment. But that doesn't say anything about what the society does with the ultimate proceeds from that investment.

The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it's either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don't describe what those distributed consequentialists are like or how they come about.

You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted. But it seems like the actual efficiency loss required to preserve human values is very small even over cosmological time (e.g. see Carl on exactly this question). And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or via war between blocs with higher and lower savings rates (some of them too low to support human life, though even if you don't buy Carl's argument the cost of supporting human life is really still quite low, so forgoing it confers only a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars. In particular, the differences between your stories don't seem very relevant to the probabilities of those outcomes.

As time progresses, it becomes increasingly unclear---even to the concerned and overwhelmed Board members of the fully mechanized companies of the production web---whether these companies are serving or merely appeasing humanity. 

Why wouldn't an aligned CEO sit down with the board to discuss the situation openly with them? Even if the behavior of many firms was misaligned, i.e. none of the firms were getting what they wanted, wouldn't an aligned firm be happy to explain the situation from its perspective to get human cooperation in an attempt to avoid the outcome they are approaching (which is catastrophic from the perspective of machines as well as humans!)? I guess it's possible that this dynamic operates in a way that is invisible not only to the humans but to the aligned AI systems who participate in it, but it's tough to say why that is without understanding the dynamic.

We humans eventually realize with collective certainty that the companies have been trading and optimizing according to objectives misaligned with preserving our long-term well-being and existence, but by then their facilities are so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating.  With no further need for the companies to appease humans in pursuing their production objectives, less and less of their activities end up benefiting humanity.  

Can you explain the decisions an individual aligned CEO makes as its company stops benefiting humanity? I can think of a few options:

  • Actually the CEOs aren't aligned at this point. They were aligned but then aligned CEOs ultimately delegated to unaligned CEOs. But then I agree with Vanessa's comment.
  • The CEOs want to benefit humanity, but if they do things that benefit humanity they will be outcompeted, so they need to mostly invest in remaining competitive and accept smaller and smaller benefits to humanity. But in that case can you describe what tradeoff concretely they are making, and in particular why they can't continue to take more or less the same actions to accumulate resources while remaining responsive to shareholder desires about how to use those resources?

Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Somehow the machine interests (e.g. building new factories, supplying electricity, etc.) are still being served. If the individual machines are aligned, and food/oxygen/etc. are in desperately short supply, then you might think an aligned AI would put the same effort into securing resources critical to human survival. Can you explain concretely what it looks like when that fails?

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-04-05T19:47:19.513Z · LW · GW

I broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on.

I'm kind of scared of this approach because I feel like unless you really nail everything, there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that  is manageable, but I think I still find it scary (and don't totally remember all my sources of concern).

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of it as a first-line defense against this kind of issue.

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity).

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem though it sounds like it would be pretty similar.
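To make the "giant universe" failure mode concrete, here is a toy description-length accounting. All bit counts are made-up illustrative numbers, not estimates; the point is only the shape of the comparison.

```python
# Toy description-length accounting for the "giant universe" problem.
# A 'dovetailer' universe that runs all computations is very cheap to describe,
# and the bridge rule then spends its "free" complexity indexing the
# computation you actually wanted. Bit counts below are invented for illustration.
k_dovetailer = 100       # universe that dovetails through all programs: short program
k_bridge_pointer = 300   # bridge rule that picks out the desired computation
k_true_physics = 500     # describing the intended physics directly
k_simple_bridge = 50     # with a correspondingly simple bridge rule

gerrymandered = k_dovetailer + k_bridge_pointer   # 400 bits total
intended = k_true_physics + k_simple_bridge       # 550 bits total

# The gerrymandered hypothesis is favored by the prior even though the real
# "physics" has been smuggled into the bridge rule:
assert gerrymandered < intended
```

With these (assumed) numbers, penalizing bridge-rule description complexity alone doesn't help, since the dovetailer's head start can more than pay for an expensive bridge rule.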

Comment by paulfchristiano on My research methodology · 2021-03-29T19:20:13.955Z · LW · GW
  • I still feel fine about what I said, but that's two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
  • Clarifying what I mean by way of analogy: suppose I'm worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I'd say that seems like a strange and extreme failure mode that we should be able to robustly avoid if we write our code right, regardless of how the logical facts shake out about how compression works. That said, I still agree that in some sense it's the "default" behavior without extensive countermeasures. It's rare for a failure to be so clearly different from what you want that you can actually hope to avoid it in the worst case. But that property is not enough to suggest that such failures are easily avoided.
  • I obviously don't agree with the inference from "X is the default result of optimizing for almost anything" to "X is the default result of our attempt to build useful AI without exotic technology or impressive mitigation efforts."
  • My overall level of optimism doesn't mostly come from hopes about exotic alignment technology. I am indeed way more optimistic about "exotic alignment technology" than you and maybe that optimism cuts off 25-50% of the total alignment risk. I think that's the most interesting/important disagreement between us since it's the area we both work in. But more of the disagreement about P(alignment) comes from me thinking it is much, much more likely that "winging it" works long enough that early AI systems will have completely changed the game.
  • I spend a significant fraction of my time arguing with people who work in ML about why they should be more scared. The problem mostly doesn't seem to be that I take a moderate or reassuring tone; it's that they don't believe the arguments I make (which are mostly strictly weaker forms of your arguments, which they are in turn even less on board with).
Comment by paulfchristiano on My research methodology · 2021-03-25T18:15:02.265Z · LW · GW

High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.

I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

I feel like a story is basically plausible until proven implausible, so I have a pretty low bar.

What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?

I don't think that iterated amplification ever was at the point where we couldn't tell a story about how it might fail (perhaps in the middle of writing the ALBA post was peak optimism? but by the time I was done writing that post I think I basically had a story about how it could fail). In this case it seems like the distinction is more like "what is a solution going to look like?" and there aren't clean lines between "big changes to this algorithm" and "new algorithm."

I guess the question is why I was as optimistic as I was. For example, all the way until mid-2017 I thought it was plausible that something like iterated amplification would work without too many big changes (that's a bit of a simplification, but you can see how I talked about it e.g. here).

Some thoughts on that:

  • I first remember discussing the translation example in a workshop in I think 2016. My view at that time was that the learning process might be implemented by amplification (i.e. that learning could occur within the big implicit HCH tree).  If that's the case then the big open question seemed to be preserving alignment within that kind of learning process (which I did think was a big/ambitious problem). Around this time I started thinking about amplification as at-best implementing an "enlightened prior" that would then handle updating on the basis of evidence.
  • I don't think it's super obvious that this approach can't work, and even now I don't think we've written up very clean arguments. The big issue is that the intermediate results of the learning process (that are needed to justify the final output, e.g. summary statistics from each half of the dataset) are both too large to be memorized by the model and too computationally complex to be reproduced by the model at test time. On top of that, it seems like amplification/debate probably can't work with very large implicit trees even if they can be memorized (based on the kinds of issues raised in Beth's report; that also should have been obvious in advance, but there were too many possibly-broken things to give each one the attention it deserved).
  • In 2016 I transitioned to doing more applied work for a while, and that's a lot of why my own views stopped changing so rapidly.  You could debate whether this was reasonable in light of the amount of theoretical uncertainty. I know plenty of people who think that it was both unreasonable to transition into applied work at that time and that it would be unreasonable for me to transition back now.
  • I still spent some time thinking about theory and by early 2018 I thought that we needed more fundamental extra ingredients. I started thinking about this more and talking with OpenAI folks about it. (Largely Geoffrey, though I think he was never as worried about these issues as I was, in part because of the methodological difference where I'm more focused on the worst case and he has the more typical perspective of just wanting something that's good enough to work in practice.)
  • Part of why these problems were less clear to me back in 2016-2017 is that it was less clear to me what exactly we needed to do in order to be safe (in the worst case). I had the vague idea of informed oversight, but hadn't thought through what parts of it were hard or necessary or what it really looked like. This all feels super obvious to me now but at the time it was pretty murky. I had to work through a ton of examples (stuff like implicit extortion) to develop a clear enough sense that I felt confident about it. This led to posts like ascription universality and strategy stealing.
  • The more recent round of writeups were largely about catching up in public communications, though my thinking is a lot clearer than it was 1-2 years before so it would have been even more of a mess if I'd been trying to do it as I went. Imitative generalization was a significant simplification/improvement over the kind of algorithm I'd been thinking about for a while to handle these problems. (I don't think it's the end of the line at all, if I was still doing applied work maybe we'd have a similar discussion in a while when I described a different algorithm that I thought worked better, but given that I'm focusing on theory right now the timeline will probably be much shorter.)
Comment by paulfchristiano on My research methodology · 2021-03-24T19:18:43.617Z · LW · GW

I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it as: after you've been trying for a while to come up with a failure story, you can start thinking about why failure stories seem impossible and try to write an argument that there can't be any failure story...

Comment by paulfchristiano on My research methodology · 2021-03-24T19:15:59.530Z · LW · GW

OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?

I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons.

(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)

I'm not really sure if or how this is a reductio. I don't think it's a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that's really all I want to say---that this failure seems preventable, and that intuition doesn't seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn't empirically contingent.

Comment by paulfchristiano on Demand offsetting · 2021-03-24T17:55:55.436Z · LW · GW

If we expected increased outreach and proselytization from vegetarians to uniformly make further outreach harder, would we expect to see the rapid and exponential growth of vegetarianism (as it seems to be)?

Is this true? e.g. Gallup shows the fraction of US vegetarians at 6% in 2000 and 5% in 2020 (link), so if there is exponential growth it seems like either their numbers are wrong or the growth is very slow.
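A quick sanity check on what those two Gallup endpoints would imply if the trend were exponential (the 6% and 5% figures are the ones cited above; the rest is arithmetic):

```python
# Implied constant annual growth rate if US vegetarianism went
# from 6% of the population in 2000 to 5% in 2020.
start, end, years = 0.06, 0.05, 20
annual_rate = (end / start) ** (1 / years) - 1
print(f"{annual_rate:.2%} per year")  # roughly -0.9% per year: a slow decline, not growth
```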

The primary argument for convincing someone to not eat meat is that the long term costs outweigh the short term benefits, so I'm not sure that you can categorically state that convincing someone to stop eating meat is causing them harm. Sure, they don't get to eat a steak, but the odds of their grandchildren not dying from catastrophic climate collapse go up.

It seems implausible to me that the individual benefits from reducing climate change are comparable to the costs or benefits of diet change over the short term. Even if everyone changing their diet decreased extinction risk by 1% (I think that's implausible, but you could try to tell a story about non-extinction environmental impacts being crazy large), being vegetarian would reduce your grandchildren's probability of death by well under 1 in a billion, which is completely negligible.
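To spell out the arithmetic (the 1% risk reduction is the deliberately generous assumption above, and the population figure is a round number):

```python
# Per-person share of an assumed 1% extinction-risk reduction,
# if the reduction required everyone on Earth to change their diet.
collective_reduction = 0.01   # assumed, and (as noted) implausibly generous
world_population = 8e9        # rough world population
per_person_reduction = collective_reduction / world_population
print(per_person_reduction)   # ~1.25e-12: several orders of magnitude under 1 in a billion
```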

Also, I would guess that on this forum the primary argument for convincing people not to engage with factory farming is the suffering it causes to farm animals.

That said, I think you could argue that if someone decides not to eat meat on reflection and that their previous diet was in error, then in some sense you are doing them a favor by helping them reach that conclusion.

Comment by paulfchristiano on Demand offsetting · 2021-03-23T18:33:10.295Z · LW · GW

At a minimum they also impose harms on the people who you convinced not to eat meat (since you are assuming that eating meat was a benefit to you that you wanted to pay for). And of course they make further vegetarian outreach harder. And in most cases they also won't be so precise an offset, e.g. it will apply to different animal products or at different times or with unclear probability.

That said, I agree that I can offset "me eating an egg" by paying Alice enough that she's willing to skip eating an egg, and in some sense that's an even purer offset than the one in this post.

Comment by paulfchristiano on My research methodology · 2021-03-23T18:04:06.495Z · LW · GW

The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.

You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion is that we'd have the right basic shape of an algorithm (especially if we are good at thinking of possible failures).

One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.

I feel like these distinctions aren't important until we get to an algorithm for which we can't think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it's impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.

Comment by paulfchristiano on My research methodology · 2021-03-23T17:54:38.881Z · LW · GW

I think I'm responding to a more basic intuition: that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.

In some sense changing this view would change my bottom line---e.g. if you ask me "Should you be able to design a bridge that doesn't fall down even in the worst case?" my gut take would be "why would that be possible?"---but I don't feel like there's a load-bearing intuitive disagreement in the vague direction of convergent instrumental goals.

Comment by paulfchristiano on Against evolution as an analogy for how humans will create AGI · 2021-03-23T15:33:13.662Z · LW · GW

Outside view #1: How biomimetics has always worked

It seems like ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms. None of the other domains have this property. So it wouldn't be too surprising if the only domain in which all the early successes have this property is also the only domain in which the later successes have this property.

Outside view #2: How learning algorithms have always been developed

I don't think this one is right. If your definition of learning algorithm is the kind of thing that is "able to read a book and then have a better understanding of the subject matter" then it seems like you would be classifying the model learned by GPT-3 as a learning algorithm, since it can read a 1000 word article and then have a better understanding of the subject matter that it can use to e.g. answer questions or write related text.

It seems like your definition of "learning algorithm" is "an algorithm that humans understand," and then it's kind of unsurprising that those are the ones designed by humans. Or maybe it's something about the context size over which the algorithm operates (in which case it's worth engaging with the obvious trend extrapolation of learned transformers operating competently over longer and longer contexts) or the quality of the learning it performs?

Overall I think I agree that progress in meta-learning over the last few years has been weak enough, and evidence that models like GPT-3 perform competent learning on the inside, that it's been a modest update towards longer timelines for this kind of fully end-to-end approach. But I think it's pretty modest, and as far as I can tell the update is more like "would take more like 10^33 operations to produce using foreseeable algorithms rather than 10^28 operations" than "it's not going to happen."

3. Computational efficiency: the inner algorithm can run efficiently only to the extent that humans (and the compiler toolchain) generally understand what it’s doing

I don't think the slowdown is necessarily very large, though I'm not sure exactly what you are claiming. In particular, you can pick a neural network architecture that maps well onto the most efficient hardware that you can build, and then learn how to use the operations that can be efficiently carried out in that architecture. You can still lose something but I don't think it's a lot.

You could ask the question formally in specific computational models, e.g. what's the best fixed homogeneous circuit layout we can find for doing both FFT and quicksort, and how large is the overhead relative to doing one or the other? (Obviously for any two algorithms that you want to simulate the overhead will be at most 2x, so after finding something clean that can do both of them you'd want to look at a third algorithm. I expect that you're going to be able to do basically any algorithm that anyone cares about with <<10x overhead.)
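The parenthetical 2x bound can be spelled out with a toy calculation: a fixed layout that simply places both circuits side by side simulates either one with total size at most double the larger circuit. (The gate counts below are hypothetical, and this says nothing about whether a much cleaner shared layout exists.)

```python
def side_by_side_overhead(size_a: int, size_b: int) -> float:
    """Overhead of one fixed layout containing both circuits side by side,
    relative to the larger of the two circuits on its own."""
    combined = size_a + size_b  # place both circuits; route inputs to whichever is needed
    return combined / max(size_a, size_b)

# e.g. hypothetical gate counts for an FFT circuit and a quicksort circuit
overhead = side_by_side_overhead(1000, 1200)
print(overhead)  # 2200/1200, about 1.83; the ratio can never exceed 2
```

Since combined = a + b ≤ 2·max(a, b), the overhead is always at most 2x, which is why you need a third (and fourth, ...) algorithm before the question becomes interesting.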

Comment by paulfchristiano on My research methodology · 2021-03-23T00:20:30.345Z · LW · GW

Yeah, thanks for catching that.

Comment by paulfchristiano on Demand offsetting · 2021-03-23T00:20:05.509Z · LW · GW

Carl Shulman wrote a related post here.

Comment by paulfchristiano on Demand offsetting · 2021-03-22T16:10:33.213Z · LW · GW

Commenters pointed out two examples of this that are already done in practice:

  • Luis Costigan says that this is done with cage-free credits in Asia, and links to 00:17:19 in this podcast.
  • Florian H says this is how sustainable energy credits work in the EU in this comment.
Comment by paulfchristiano on Demand offsetting · 2021-03-22T04:01:19.085Z · LW · GW

If we lived in a different world then e.g. restaurants could still repackage them at the last mile, selling humane egg credits along with their omelette. But in practice this probably wouldn't check the same box for most consumers.

Comment by paulfchristiano on Demand offsetting · 2021-03-21T22:29:00.474Z · LW · GW

In retrospect I think I should have called this post "Demand offsetting" to highlight the fact that you are offsetting the demand for eggs that you create (and hence hopefully causing no/minimal harm) rather than causing some harm and then offsetting that harm (the more typical situation, which is not obviously morally acceptable once you are in the kind of non-consequentialist framework that cares a lot about offsetting per se).

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-03-20T22:21:44.966Z · LW · GW

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case,  shouldn't be large. In the former case, it means that we are overwhelming likely to actually be inside a malign simulation.

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)

It seems to me like this requires a very strong match between the priors we write down and our real priors. I'm kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously "wrong" universal prior).

In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it's attacking).

Do we have any idea how to write down such an algorithm though? Even granting that the malign hypothesis does so it's not clear how we would (short of being fully epistemically competitive); but moreover it's not clear to me the malign hypothesis faces a similar version of this problem since it's just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them, and beyond that it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-03-19T22:12:10.915Z · LW · GW

I agree that this settles the query complexity question for Bayesian predictors and deterministic humans.

I expect it can be generalized to have complexity  in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts.

I think that the big open problems for this kind of approach to inner alignment are:

  • Is  bounded? I assign significant probability to it being  or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble. (I believe this is also Eliezer's view.)
  • It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all. Then this algorithm increases the cost of inference by  which could be a big problem.
  • As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)
Comment by paulfchristiano on The case for aligning narrowly superhuman models · 2021-03-07T08:19:11.215Z · LW · GW

That's what I have in mind. If all goes well you can think of it like "a human thinking a long time." We don't know if all will go well.

It's also not really clear what "a human thinking 10,000 years" means; HCH is kind of an operationalization of that, but there's a presumption of alignment in the human-thinking-a-long-time that we don't get for free here. (Of course you also wouldn't get it for free if you somehow let a human live for 10,000 years...)

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-25T17:20:37.891Z · LW · GW

So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery. Otherwise, one of the treacherous models made a mistake: apparently there was a top-billion treachery opportunity where you didn't query, so they should have been able to get at least that much value (and you should have lost at least that much value).

That's not quite right since your querying behavior will be stochastic (at least from the model's epistemic position).  But you get the same conclusion to within a constant factor: if there are 2 billion treacherous models, then out of the top 2 billion opportunities for treachery, there must be some where our subjective probability of querying is at most 1/2, and if the models are competent they will get at least as much value as they could by defecting at one of those times, i.e. at least 1/2 of the harms from the 2-billionth-best opportunity for treachery. So we will lose at least that much value.

(Those arguments aren't totally formal; we didn't even really discharge the "at least N treacherous models" assumption, which you need to do in order to show that your decision about when to query isn't so influenced by any given model.)
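The budget argument can be sanity-checked with a toy simulation. The opportunity values, query budget, and model count below are all made-up assumptions for illustration, not anything from the discussion:

```python
# Toy model of the argument above: opportunities for treachery have
# decreasing values, the overseer can query the demonstrator at only
# query_budget timesteps, and each treacherous model defects at a distinct
# top opportunity. If there are more models than queries, some top
# opportunity goes unqueried, and the loss is at least its value.

def worst_case_loss(values, query_budget, n_models):
    """Value lost when treacherous models spread over the best opportunities."""
    values = sorted(values, reverse=True)
    # Models occupy the n_models best slots; the overseer queries the
    # query_budget best. Treachery succeeds at the best unqueried slot:
    unqueried = values[query_budget:n_models]
    return max(unqueried) if unqueried else 0.0

values = [1.0 / (k + 1) for k in range(100)]  # v_1 > v_2 > ... (assumed)
loss = worst_case_loss(values, query_budget=10, n_models=20)
assert loss == values[10]  # you lose at least the 11th-best opportunity's value
```

With a billion queries and more than a billion competent models, the same arithmetic gives the billionth-best opportunity, as in the text.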

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-25T17:10:52.832Z · LW · GW

The first treacherous model works by replacing the bad simplicity prior with a better prior, and then using the better prior to more quickly infer the true model. No reason for the same thing to happen a second time.

(Well, I guess the argument works if you push out to longer and longer sequence lengths---a treacherous model will beat the true model on sequence lengths a billion, and then for sequence lengths a trillion a different treacherous model will win, and for sequence lengths a quadrillion a still different treacherous model will win. Before even thinking about the fact that each particular treacherous model will in fact defect at some point and at that point drop out of the posterior.)

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-24T16:04:07.815Z · LW · GW

I don't follow. Can't races to the bottom destroy all value for the agents involved?

You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?

This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machine. It's like you saying: there are tons of countries where firms could use their machines to make stuff and sell it at a profit (more countries than firms). But some of the places are more attractive than others, so probably everyone will sell in those places and drive profits to zero. And I'm saying: but then aren't those incredibly-congested countries actually worse places to sell? This scenario is only possible if firms are making so much stuff that they can drive profit down to zero in every country, since any country with remaining profit would necessarily be the best place to sell.

Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong.

It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics). It also seems probably wrong to me for smaller models (i.e. without huge amounts of internal selection) but it would be good for someone to think about that case more seriously.

It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.

if the treacherous model is 100 bits smaller, then it feels like there must be around 2^100 treacherous models that are all simpler than the intended one. If nothing else, it seems like you could literally add garbage bits to the treacherous models (or useful bits!).
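That counting claim is just arithmetic under a 2^-length prior over bitstring programs; a quick check (the program lengths here are arbitrary assumptions):

```python
# If a treacherous program is m bits shorter than the intended one, then
# padding it with unused "garbage" bits gives 2^m behaviorally identical
# programs at the intended program's length, and under a 2^-length prior
# their combined weight equals the single short program's weight.
m = 100
intended_len = 1_000            # assumed length of the intended model, in bits
treacherous_len = intended_len - m
variants = 2 ** m               # one padded variant per choice of garbage bits
assert variants * 2.0 ** -intended_len == 2.0 ** -treacherous_len
```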

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-24T04:15:33.271Z · LW · GW

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out.

Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).

The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.

Subroutines of functions aren't always simpler. Even without treachery, this seems basically guaranteed. If "the truth" is just a simulation of a human, then the truth is a subroutine of the simplest model. But instead you could have a model of the form "Run program X, treat its output as a program, then run that." Since humans are compressible, that will clearly be smaller.
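A toy stand-in for that point: a "run X, treat its output as a program, run that" wrapper can be shorter than the compressible thing it produces. Here zlib is just an illustrative compressor and the source text is an arbitrary assumption:

```python
import zlib

# "The truth" as a large but highly compressible program:
truth_source = "print(1)\n" * 500                     # 4500 bytes of source
compressed = zlib.compress(truth_source.encode())
# The two-stage model: run the decompressor, then run its output.
wrapper = "import zlib;exec(zlib.decompress(%r))" % compressed
# The wrapper is much smaller than the direct program it reproduces:
assert len(compressed) < 100
assert len(wrapper) < len(truth_source)
```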

Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of time on). If that works then it seems like you don't even need a further solution to inner alignment, just use the simplest model.

(Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%.)

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-02-23T20:21:36.598Z · LW · GW

There are a bunch of things that differ between part I and part II, I believe they are correlated with each other but not at all perfectly. In the post I'm intending to illustrate what I believe some plausible failures look like, in a way intended to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've already discussed many, and my views have---unsurprisingly!---changed considerably in the details over the last 2 years), whether they are about the generalizations or the impacts.

I do think that the "going out with a whimper" scenario may ultimately transition into something abrupt, unless people don't have their act together enough to even put up a fight (which I do think is fairly likely conditioned on catastrophe, and may be the most likely failure mode).

It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future

We can continue to work on the alignment problem and continue to fail to solve it, e.g. because the problem is very challenging or impossible or because we don't end up putting in a giant excellent effort (e.g. if we spent a billion dollars a year on alignment right now it seems plausible it would be a catastrophic mess of people working on irrelevant stuff, generating lots of noise while we continue to make important progress at a very slow rate).

The most important reason this is possible is that change is accelerating radically, e.g. I believe that it's quite plausible we will not have massive investment in these problems until we are 5-10 years away from a singularity and so just don't have much time.

If you are saying "Well why not wait until after the singularity?" then yes, I do think that eventually it doesn't look like this. But that can just look like people failing to get their act together, and then eventually when they try to replace deployed AI systems they fail. Depending on how generalization works that may look like a failure (as in scenario 2) or everything may just look dandy from the human perspective because they are now permanently unable to effectively perceive or act in the real world (especially off of earth). I basically think that all bets are off if humans just try to sit tight while an incomprehensible AI world-outside-the-gates goes through a growth explosion.

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-23T16:32:21.391Z · LW · GW

I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.

I just meant the dependence on epsilon; it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious if you can make these additive or if they are inherently multiplicative.

But your bound scales in some way, right? How much training data do I need to get the KL divergence between distributions over trajectories down to epsilon?

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-23T16:21:16.619Z · LW · GW

I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct. 

There is at most a linear cost to this ratio, which I don't think screws us.

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.

There are two reasons why I expect that to hold.

This doesn't seem like it works once you are selecting on competent treacherous models.  Any competent model will err very rarely (with probability less than 1 / (feasible training time), probably much less). I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?

I'm not convinced this logarithmic regime ever ends,

It seems like it must end if there are any treacherous models (i.e. if there was any legitimate concern about inner alignment at all).

Comment by paulfchristiano on Formal Solution to the Inner Alignment Problem · 2021-02-22T21:18:46.227Z · LW · GW

I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)

First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:

if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.

1: Dependence on the prior of the true hypothesis

It seems like you have the following tough case even if the human is deterministic:

  • There are N hypotheses about the human, of which one is correct and the others are bad.
  • I would like to make a series of N potentially-catastrophic decisions.
  • On decision k, all (N-1) bad hypotheses propose the same catastrophic prediction. We don't know k in advance.
  • For the other (N-1) decisions, (N-2) of the bad hypotheses make the correct prediction a*, and the final bad hypothesis claims that action a* would be catastrophic.

In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.
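A toy encoding of this construction checks the symmetry claim (the table layout and action labels below are my assumptions):

```python
from collections import Counter

# Rows = hypotheses, columns = decisions. "a" = predict the safe action a*,
# "b" = propose the catastrophic action, "f" = claim a* itself is catastrophic.
def build_table(n, k):
    table = [["a"] * n]                      # hypothesis 0 is the true one
    for h in range(1, n):
        row = []
        for j in range(n):
            if j == k:
                row.append("b")              # all bad hypotheses collude on k
            elif (j if j < k else j - 1) == h - 1:
                row.append("f")              # lone dissenter on this decision
            else:
                row.append("a")
        table.append(row)
    return table

n, k = 6, 3
table = build_table(n, k)
for j in range(n):
    column = [table[h][j] for h in range(n)]
    # every decision shows the same (N-1 vs 1) split, so without a query
    # the imitator cannot tell decision k apart from the others
    assert sorted(Counter(column).values()) == [1, n - 1]
# and following the majority is catastrophic exactly on decision k:
assert Counter(table[h][k] for h in range(n)).most_common(1)[0][0] == "b"
```

Any query-free policy therefore fails for some assignment of which hypothesis is true, forcing roughly N queries.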

That strongly suggests that the number of queries to the human goes as 1 / p(correct demonstrator), unless you use some other feature of the hypothesis class. But p(correct demonstrator) is probably less than , so this constant might not be acceptable. Usually we try to have a logarithmic dependence on p(correct demonstrator) but this doesn't seem possible here.

So as you say, we'd want to condition on some relevant facts to get up to the point where that probability might be acceptably high. So then it seems like we have two problems:

  • It still seems like the actual probability of the truth is going to be quite low, so it seems like we need some future result that uses a tighter bound than just p(truth). That is, we need to exploit the fact that the treacherous hypotheses in the example above are pretty weird and intuitively there aren't that many of them, and that more reasonable hypotheses won't be trying so hard to make sneaky catastrophic actions. Maybe you already have this and I missed it.

    (It feels to me like this is going to involve moving to the non-realizable setting at the same time. Vanessa has some comments elsewhere about being able to handle that with infra-Bayesianism, though I am fairly uncertain whether that is actually going to work.)
  • There's a real question about p(treachery) / p(correct model), and if this ratio is bad then it seems like we're definitely screwed with this whole approach.

Does that all seem right?

2: Dependence on the probability bound

Suppose that I want to bound the probability of catastrophe as epsilon times the demonstrator probability of catastrophe. It seems like the number of human queries must scale at least like 1/epsilon. Is that right, and if so what's the actual dependence on epsilon?

I mostly ask about this because in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by considering predictors that update on counterfactual demonstrator behavior in the rest of the tree (even though the true demonstrator does not), to get a full bound on the relative probability of a tree under the true demonstrator vs model. I haven't thought about this in years and am curious if you have a take on the feasibility of that or whether you think the entire project is plausible.

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-02-03T20:23:36.019Z · LW · GW

This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.

But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works").

 But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.

"Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other.

Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is the "human evals after a 100 year horizon." I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural (or else why we should be able to pick out something else like that). In the other cases I think I haven't fully internalized your view.