## Posts

It’s not economically inefficient for a UBI to reduce recipients’ employment 2020-11-22T16:40:05.531Z
Hiring engineers and researchers to help align GPT-3 2020-10-01T18:54:23.551Z
“Unsupervised” translation as an (intent) alignment problem 2020-09-30T00:50:06.077Z
Distributed public goods provision 2020-09-26T21:20:05.352Z
Better priors as a safety problem 2020-07-05T21:20:02.851Z
Learning the prior 2020-07-05T21:00:01.192Z
Inaccessible information 2020-06-03T05:10:02.844Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z
Hedonic asymmetries 2020-01-26T02:10:01.323Z
Moral public goods 2020-01-26T00:10:01.803Z
Of arguments and wagers 2020-01-10T22:20:02.213Z
Prediction markets for internet points? 2019-10-27T19:30:00.898Z
AI alignment landscape 2019-10-13T02:10:01.135Z
Taxing investment income is complicated 2019-09-22T01:30:01.242Z
The strategy-stealing assumption 2019-09-16T15:23:25.339Z
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z
What failure looks like 2019-03-17T20:18:59.800Z
Security amplification 2019-02-06T17:28:19.995Z
Reliability amplification 2019-01-31T21:12:18.591Z
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z
Thoughts on reward engineering 2019-01-24T20:15:05.251Z
Learning with catastrophes 2019-01-23T03:01:26.397Z
Capability amplification 2019-01-20T07:03:27.879Z
The reward engineering problem 2019-01-16T18:47:24.075Z
Towards formalizing universality 2019-01-13T20:39:21.726Z
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z
Benign model-free RL 2018-12-02T04:10:45.205Z
Corrigibility 2018-11-27T21:50:10.517Z
Humans Consulting HCH 2018-11-25T23:18:55.247Z
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z
Approval-directed agents 2018-11-22T21:15:28.956Z
Prosaic AI alignment 2018-11-20T13:56:39.773Z
An unaligned benchmark 2018-11-17T15:51:03.448Z
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z
The Steering Problem 2018-11-13T17:14:56.557Z
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z
Meta-execution 2018-11-01T22:18:10.656Z
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z
Implicit extortion 2018-04-13T16:33:21.503Z
Prize for probable problems 2018-03-08T16:58:11.536Z
Argument, intuition, and recursion 2018-03-05T01:37:36.120Z

Comment by paulfchristiano on Learning the prior · 2021-01-25T16:21:47.123Z · LW · GW

Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).

Comment by paulfchristiano on Learning the prior · 2021-01-24T16:39:44.160Z · LW · GW

was optimized to imitate H on D

It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!).

I suppose this works, but then couldn't we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?

The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

Comment by paulfchristiano on Learning the prior · 2021-01-23T23:15:27.201Z · LW · GW

I think your description is correct.

The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.

My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it implies on D*. But what is that going to look like? If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network...

In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective, partly based on the kinds of considerations in this post.

But since the amplification/debate models are ML models, and we’re running these models to aid human decisions on x*, aren’t we back to relying on ML OOD generalization, and so back where we started?

I now think we're going to have to actually have z* reflect something more like the structure of the unaligned neural network, rather than another model (Mz) that outputs all of the unaligned neural network's knowledge.

That said, I'm not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z. Then the prior calculation can access all of the words i in order to evaluate the plausibility of the string represented by Mz. We then use that same set of words at training time and test time. If some index i is used at test time but not at training time, then the model responsible for evaluating Prior(z) is incentivized to access that index in order to show that z is unnecessarily complex. So every index i should be accessed on the training distribution. (Though they need not be accessed explicitly, just somewhere in the implicit exponentially large tree).
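As a toy sketch of this construction (all names and the scoring rule are illustrative assumptions, not from the discussion): the intractably large z is only ever accessed through index queries to Mz, and the prior evaluation makes the same queries at training and test time, so no OOD generalization by Mz is required.

```python
# Toy sketch: an intractably large "z" represented implicitly by a model Mz.
# Everything here is an illustrative assumption, not the actual proposal.

def make_Mz():
    """Stand-in for a learned model: returns the i-th word of a huge z."""
    def Mz(i: int) -> str:
        # A real Mz would be a neural network; here z is just a simple rule.
        return f"word{i % 1000}"
    return Mz

def evaluate_prior(Mz, indices):
    """The prior computation only ever sees z through index queries.

    If some index matters at test time, the evaluator is incentivized to
    have queried it during training too (e.g. to show z is unnecessarily
    complex), so the query distribution matches across phases.
    """
    words = [Mz(i) for i in indices]
    # Placeholder "plausibility" score: shorter descriptions are more plausible.
    return -sum(len(w) for w in words)

Mz = make_Mz()
train_queries = [0, 7, 42]
test_queries = [0, 7, 42]  # the same indices are accessed at test time
assert evaluate_prior(Mz, train_queries) == evaluate_prior(Mz, test_queries)
```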

Like I said, I'm a bit less optimistic about doing this kind of massive compression. For now, I'm just thinking about the setting where our human has plenty of time to look at z in detail even if it's the same size as the weights of our neural network. If we can make that work, then I'll think about how to do it in the case of computationally bounded humans (which I expect to be straightforward).

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-21T01:39:18.253Z · LW · GW

I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.

It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which ones generalize "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).

If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I'm not sure if there's actually a big quantitative disagreement though rather than a communication problem.

I also think it's quite likely that the story in my post is unrealistic in a bunch of ways and I'm currently thinking more about what I think would actually happen.

Some more detailed responses that feel more in-the-weeds:

you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons

I might not understand this point. For example, suppose I'm training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has "robust motivations" then they would most likely be to predict accurately, but I'm not sure about why the model necessarily has robust motivations.

I feel similarly about goals like "plan to get high reward (defined as signals on channel X, you can learn how the channel works)." But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.

But it seems to me by default, during early training periods the AI won't have much information about either the overseer's knowledge (or the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural.

It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.

my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post

Definitely, I think it's critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).

That said, a major part of my view is that it's pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it's not a big deal which since they both seem bad and seem averted in the same way.

I think the really key question is how likely it is that we get some kind of "intended" generalization like friendliness. I'm frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I'm also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.

(or anywhere else, to my knowledge)

Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).

Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.

I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this, but I don't know pointers offhand.

I haven't written that much about why I think generalizations like "just be helpful" aren't that likely.  I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.

There are some Google Doc comment threads with MIRI where I've written about why I think those are plausible (namely that it's plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both MIRI and I think it's a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of "short hops" where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-20T03:01:32.281Z · LW · GW

I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-20T02:58:16.094Z · LW · GW

I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).

Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on the training set. It's also not clear to me that "follow the spirit of the instructions" is better-specified than "do things the instruction-giver would rate highly if we asked them"---informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.

On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years.

I've trained in simulation on tasks where I face a wide variety of environments, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don't agree with the claim that it's unrealistic to imagine such models generalizing to reality by wanting something-like-reward.

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task [...]

Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).

My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren't useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.

Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.

It seems like your take is "consequences over long enough horizons to be useful will be way too expensive to use for training," which seems close to 50/50 to me.

I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap.

I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I'm significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.

In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.

I'm imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It's also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I've never encountered anything like this during training and it's no longer clear what I'm even aiming at). It doesn't feel like these are the kinds of underspecification you are referring to.

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-19T17:37:32.622Z · LW · GW

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

I think this is the key point and it's glossed over in my original post, so it seems worth digging in a bit more.

I think there are many plausible models that generalize successfully to longer horizons, e.g. from 100 days to 10,000 days:

• Acquire money and other forms of flexible influence, and then tomorrow switch to using a 99-day (or 9999-day) horizon policy.
• Have a short-term predictor, and apply it over more and more steps to predict longer horizons (if your predictor generalizes then there are tons of approaches to acting that would generalize).
• Deductively reason about what actions are good over 100 days (vs 10,000 days), since deduction appears to generalize well from a big messy set of facts to new very different facts.
• If I've learned to abstract seconds into minutes, minutes into hours, hours into days, days into weeks, and then plan over weeks, it's pretty plausible that the same procedure can abstract weeks into months and months into years. (It's kind of like I'm now working on a log scale and asking the model to generalize from 1, 2, ..., 10 to 11, 12, 13.)
• Most possible ways of reasoning are hard to write down in a really simple list, but I expect that many hard-to-describe models also generalize. If some generalize and some do not, then training my model over longer and longer horizons (3 seconds, 30 seconds, 5 minutes...) will gradually knock out the non-generalizing modes of reasoning and leave me with the modes that do generalize to longer horizons.
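The second mechanism above can be sketched concretely (the dynamics and numbers here are made-up illustrations): a predictor trained only on short horizons extends to long horizons simply by being composed with itself, so nothing new has to be learned to "plan further out."

```python
# Toy sketch of "a short-term predictor applied over more steps".
# The dynamics here (1% daily growth) are an illustrative assumption.

def one_step(state: float) -> float:
    """Stand-in for a learned 1-day predictor."""
    return state * 1.01

def predict_horizon(state: float, days: int) -> float:
    """Predict over a long horizon by iterating the short-horizon predictor."""
    for _ in range(days):
        state = one_step(state)
    return state

# Trained/validated on 100-day horizons, then applied to 10,000 days:
short = predict_horizon(1.0, 100)     # horizons seen in training
long = predict_horizon(1.0, 10_000)   # same procedure, just run longer
assert long > short
```

If the one-step model generalizes, the long-horizon behavior comes for free; the open question in the text is whether the learned model's motivations generalize the same way.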

This is roughly why I'm afraid that models we train will ultimately be able to plan over longer horizons than those that appear in training.

But many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons (and in particular the first 4 above seem like they'd all be undesirable if generalizing from easily-measured goals, and would lead to the kinds of failures I describe in part I of WFLL).

I think one reason that my posts about this are confusing is that I often insist that we don't rely on generalization because I don't expect it to work reliably in the way we hope. But that's about what assumptions we want to make when designing our algorithms---I still think that the "generalizes in the natural way" model is important for getting a sense of what AI systems are going to do, even if I think there is a good chance that it's not a good enough approximation to make the systems do exactly what we want. (And of course I think if you are relying on generalization in this way you have very little ability to avoid the out-with-a-bang failure mode, so I have further reasons to be unhappy about relying on generalization.)

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-19T16:59:10.527Z · LW · GW

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

Yes.

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.

I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

I consider those pretty wide open empirical questions. The view that we can get good generalization of this kind is fairly common within ML.

I do agree that once you generalize motivations from easily measurable tasks with short feedback loops to tasks with long feedback loops, then you may also be able to get "good" generalizations, and this is a way that you can solve the alignment problem. It seems to me that there are lots of plausible ways to generalize to longer horizons without also generalizing to "better" answers (according to humans' idealized reasoning).

(Another salient way in which you get long horizons is by doing something like TD learning, i.e. train a model that predicts its own judgment in 1 minute. I don't know if it's important to get into the details of all the ways people can try to get things to generalize over longer time horizons, it seems like there are many candidates. I agree that there are analogously candidates for getting models to optimize the things we want even if we can't measure them easily, and as I've said I think it's most likely those techniques will be successful, but this is a post about what happens if we fail, and I think it's completely unclear that "we can generalize to longer horizons" implies "we can generalize from the measurable to the unmeasurable".)
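The TD-style construction mentioned in the parenthetical can be sketched as a standard tabular TD(0) toy (the chain environment and constants are illustrative assumptions): the value estimate for a distant reward is bootstrapped entirely from the model's own one-step-ahead prediction, so training never needs a long feedback loop.

```python
# Toy TD(0) sketch: a value function trained only on one-step targets
# (reward plus its own next-step estimate), never on full long-horizon
# returns. The 5-state chain environment is an illustrative assumption.

N, GAMMA, ALPHA = 5, 0.9, 0.1
V = [0.0] * N  # value estimates for states 0..4; entering state 5 ends
               # the episode with reward 1.0

for _ in range(2000):
    s = 0
    while s < N:
        s_next = s + 1
        reward = 1.0 if s_next == N else 0.0
        # Bootstrap: the target uses the model's own estimate of the next state.
        target = reward + GAMMA * (V[s_next] if s_next < N else 0.0)
        V[s] += ALPHA * (target - V[s])
        s = s_next

# V[0] approaches GAMMA**4: the discounted value of a reward 5 steps away,
# learned without ever training on the full 5-step return.
assert abs(V[0] - GAMMA ** 4) < 0.01
```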

(Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)

When we deploy humans in the real world they do seem to have many desires resembling various plausible generalizations of evolutionary fitness (e.g. to intrinsically want kids even in unfamiliar situations, to care about very long-term legacies, etc.). I totally agree that humans also want a bunch of kind of random spandrels. This is related to the basic uncertainty discussed in the previous paragraphs. I think the situation with ML may well differ because, if we wanted to, we can use training procedures that are much more likely to generalize than evolution.

I don't think it's relevant to my argument that humans can learn sophisticated concepts in a new domain, the question is about the motivations of humans.

Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?

Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?

Yes, I'm saying that part 1 is where you are able to get what you measure and part 2 is where you aren't.

Also, as I say, I expect the real world to be some complicated mish-mash of these kinds of failures (and for real motivations to be potentially influenced both by natural generalizations of what happens at training time and also by randomness / architecture / etc., as seems to be the case with humans).

The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).

Wanting to earn more money or grow users or survive over the long term is also an easily measured goal, and in practice firms crucially exploit the fact that these goals are contiguous with their shorter easily measured proxies. Non-profits that act in the world often have bottom-line metrics that they use to guide their action and seem better at optimizing goals that can be captured by such metrics (or metrics like donor acquisition).

The mechanism by which you are better at pursuing easily-measureable goals is primarily via internal coherence / stability.

I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning.

I've said that previously human world-steering is the only game in town but soon it won't be, so the future is more likely to be steered in ways that a human wouldn't steer it, and that in turn is more likely to be a direction humans don't like. This doesn't speak to whether the harms on balance outweigh the benefits, which would require an analysis of the benefits but is also pretty irrelevant to my claim (especially given that all of the world-steerers enjoy these benefits, and we are primarily concerned with relative influence over the very long term). I'm trying to talk about how the future could get steered in a direction that we don't like if AI development goes in a bad direction; I'm not trying to argue something like "Shorter AI timelines are worse" (which I also think is probably true but about which I'm more ambivalent).

If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.

I don't see a plausible way that humans can use physical power for some long-term goals and not other long-term goals, whereas I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).

I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this?

If Facebook's news feed would generate actions chosen to have the long-term consequence of increasing political polarization then I'd say it was steering the future towards political polarization. (And I assume you'd say it was an agent.)

As is, I don't think Facebook's newsfeed steers the future towards political polarization in a meaningful sense (it's roughly the same as a toaster steering the world towards more toast).

Maybe that's quantitatively just the same kind of thing but weak, since after all everything is about generalization anyway. In that case the concern seems like it's about world-steering that scales up as our technology and models improve (such that they will eventually become competitive with human world-steering), whereas the news feed doesn't scale up since it's just relying on some random association about how short-term events happen to lead to polarization (and nor will a toaster if you make it better and better at toasting). I don't really have views on this kind of definitional question, and my post isn't really relying on any of these distinctions.

Something like A/B testing is much closer to future-steering, since scaling it up in the obvious way (and scaling to effects across more users and longer horizons rather than independently randomizing) would in fact steer the future towards whatever selection criteria you were using. But I agree with your point that such systems can only steer the very long-term future once there is some kind of generalization.

Comment by paulfchristiano on Some thoughts on risks from narrow, non-agentic AI · 2021-01-19T02:07:44.676Z · LW · GW

If it’s via a deliberate plan to suppress them while also overcoming human objections to doing so, then that seems less like a narrow system “optimising for an easily measured objective” and more like an agentic and misaligned AGI

I didn't mean to make any distinction of this kind. I don't think I said anything about narrowness or agency. The systems I describe do seem to be optimizing for easily measurable objectives, but that seems mostly orthogonal to these other axes.

I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to much matter.

Secondly, let’s talk about existing pressures towards easily-measured goals. I read this as primarily referring to competitive political and economic activity - because competition is a key force pushing people towards tradeoffs which are undesirable in the long term.

I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.

Regarding the extent and nature of competition I do think I disagree with you fairly strongly but it doesn't seem like a central point.

the US, for example, doesn’t seem very concerned right now about falling behind China.

I think this is in fact quite high on the list of concerns for US policy-makers and especially the US defense establishment.

Further, I don’t see where the meta-level optimisation for easily measured objectives comes from.

Firms and governments and people pursue a whole mix of objectives, some of which are easily measured. The ones pursuing easily-measured objectives are more successful, and so control an increasing fraction of resources.
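
To illustrate how even a small per-period advantage compounds into control of most resources, here is a toy model (all numbers hypothetical, not an empirical claim):

```python
def resource_share(advantage, periods):
    """Toy selection dynamics: one group's resources grow slightly faster each
    period because its goals are easier to measure and hence pursue. Returns
    the faster-growing group's share of total resources."""
    easy, hard = 1.0, 1.0  # equal starting resources
    for _ in range(periods):
        easy *= 1.0 + advantage  # small compounding edge from measurability
    return easy / (easy + hard)
```

With a 2% per-period edge, the easily-measured-goals group holds roughly 98% of resources after 200 periods, despite starting from parity.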

So if narrow AI becomes very powerful, we should expect it to improve humanity’s ability to steer our trajectory in many ways.

I don't disagree with this at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. (If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.) That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as they seem to in your examples. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult and say a little bit about how that difficulty might materialize.

I don't really know if or how this is distinct from what you call the second species argument. It feels like you are objecting to a distinction I'm not intending to make.

Comment by paulfchristiano on Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment · 2021-01-16T02:26:43.676Z · LW · GW

If it turns out not to be possible then the AI should never fade away.

If the humans in the container succeed in becoming wiser, then hopefully it is wiser for us to leave this decision up to them than to preemptively make it now (and so I think the situation is even better than it sounds superficially).

But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could move further.

It seems like the real thing up for debate will be about power struggles amongst humans---if we had just one human, then it seems to me like the grandparent's position would be straightforwardly incoherent. This includes, in particular, competing views about what kind of structure we should use to govern ourselves in the future.

Comment by paulfchristiano on Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment · 2021-01-16T02:20:27.190Z · LW · GW

I buy into the delegation framing, but I think that the best targets for delegation look more like "slightly older and wiser versions of ourselves with slightly more space" (who can themselves make decisions about whether to delegate to something more alien). In the sand-pit example, if the child opted into that arrangement then I would say they have effectively delegated to a version of themselves who is slightly constrained and shaped by the supervision of the adult. (But in the present situation, the most important thing is that the parent protects them from the outside the world while they have time to grow.)

Comment by paulfchristiano on Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment · 2021-01-16T02:17:19.198Z · LW · GW

I basically agree that humans ought to use AI to get space, safety and time to figure out what we want and grow into the people we want to be before making important decisions. This is (roughly) why I'm not concerned with some of the distinctions Gabriel raises, or that naturally come to mind when many people think of alignment.

That said, I feel your analogy misses a key point: while the child is playing in their sandbox, other stuff is happening in the world---people are building factories and armies, fighting wars and grabbing resources in space, and so on---and the child will inherit nothing at all unless their parent fights for it.

So without (fairly extreme) coordination, we need to figure out how to have the parent acquire resources and then ultimately "give" those resources to the child. It feels like that problem shouldn't be much harder than the parent acquiring resources for themselves (I explore this intuition some in this post on the "strategy stealing" assumption), so that this just comes down to whether we can create a parent who is competent while being motivated to even try to help the child. That's what I have in mind while working on the alignment problem.

On the other hand, given strong enough coordination that the parent doesn't have to fight for their child, I think that the whole shape of the alignment problem changes in more profound ways.

I think that much existing research on alignment, and my research in particular, is embedded in the "agency hand-off paradigm" only to the extent that is necessitated by that situation.

I do agree that my post on indirect normativity is embedded in a stronger version of the agency hand-off paradigm. I think the main reason for taking an approach like that is that a human embedded in the physical world is a soft target for would-be attackers and creates a security vulnerability. If we are happy handing off control to a hypothetical version of ourselves in the imagination of our AI, then we can achieve additional security by doing so, and this may be more appealing than other mechanisms to achieve a similar level of security (like uploading or retreating to a secure physical sanctuary). In some sense all of this is just about saying what it means to ultimately "give" the resources to the child, and it does so by trying to construct an ideal environment for them to become wiser, after which they will be mature enough to provide more direct instructions. (But in practice I think that these proposals may involve a jarring transition that could be avoided by using a physical sanctuary instead, or just ensuring that our local environments remain hospitable.)

Overall it feels to me like you are coming from a similar place to where I was when I wrote this post on corrigibility, and I'm curious if there are places where you would part ways with that perspective (given the consideration I raised in this comment).

(I do think "aligned with who?" is a real question since the parent needs to decide which child will ultimately get the resources, or else if there are multiple children playing together then it matters a lot how the parent's decisions shape the environment that will ultimately aggregate their preferences.)

Comment by paulfchristiano on What technologies could cause world GDP doubling times to be <8 years? · 2020-12-12T23:21:19.112Z · LW · GW

France is the other country for which Our World in Data has figures going back to 1400 (I think from Maddison), here's the same graph for France:

There is more crazy stuff going on, but broadly the picture looks the same and there is quite a lot of acceleration between 1800 and the 1950s. The growth numbers are 0.7% for 1800-1850, 1.2% for 1850-1900, 1.2% for 1900-1950, 2.8% for 1950-2000.

And for the even messier case of China:

Growth averages 0 from 1800 to 1950, and then 3.8% from 1950-2000 and 6.9% from 2000-2016.
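
For reference, the growth figures in these comments are compound annual rates; a minimal sketch of how such a rate (and the implied doubling time) is computed from two endpoints of a GDP-per-capita series:

```python
import math

def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

def doubling_time(rate):
    """Years for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)
```

For example, an economy that quadruples over 50 years grew at about 2.8% per year, and a 2.8% growth rate implies a doubling time of roughly 25 years.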

Comment by paulfchristiano on What technologies could cause world GDP doubling times to be <8 years? · 2020-12-12T22:56:27.487Z · LW · GW

I think past acceleration is mostly about a large number of improvements that build on one another rather than a small number of big wins (as Katja points out), and future acceleration will probably be more of the same. It seems like almost all of the tasks that humans currently do could plausibly be automated without "AGI" (though it depends on how exactly you define AGI), and if you improve human productivity a bunch in enough industries then you are likely to have faster growth.

I expect "21st century acceleration is about computers taking over cognitive work from humans" will be the analog of "The industrial revolution is about engines taking over mechanical work from humans / beasts of burden."

From that perspective, asking "What technology short of AGI would take over cognitive work from humans, and how?" is analogous to asking "What technology short of a universal actuator would take over mechanical work from humans, and how?" The answer is just: a bunch of stuff that's specific to the details of each type of work.

Thoughts on some particular technologies, kind of at random:

• I think that most of that automation is likely to involve new software, and so the size of the software industry is likely to grow a bunch. Increasing productivity in the software industry (likely via ML) would then be an important driver of productivity growth despite software currently being a small share of GDP.
• I think that cheap solar power, automation of manufacturing and construction (including manufacturing industrial tools and construction of factories), and automation of service jobs are also very important stories.
• I think that the West probably could be growing considerably faster even without qualitative technological change, so part of the story may be western countries either getting out of their current slump or being overtaken.

The other part of your post is about how much qualitative change would correspond to a doubling of growth rates. I think you are moderately underestimating the extent of historical acceleration and so overestimating how much qualitative change would be needed:

• I think the US over the last 200 years is a particularly bad comparison because at the beginning of the period it was benefiting a lot from colonization. Below I talk about the UK which I think is probably more representative. I chose the UK as the most natural frontier economy after the industrial revolution, but I expect the exercise would be similar for other countries without complications.
• Looking at growth over the last 200 years hides the fact that there was a period of more rapid acceleration followed by a stagnation. If we instead compared 1800 to 1950 we'd see a larger change in growth rates accompanied by a smaller qualitative change. So that's probably more useful if you are looking for an existence proof (and I think low levels of current growth likely make acceleration easier).

In 1800 the US was growing rapidly in significant part because colonists were still taking new land and then increasing utilization of that land. So over the last 200 years you have a decrease in some kinds of growth and an increase in others. I don't know much about this and it may be completely wrong, but given that the US was growing so much faster than the rest of the world and that there's such a simple explanation that seems to check out, that's what I'd assume is going on. If that's right then it can still be OK to use the US as an example but you can't use raw growth numbers to infer something about technological change.

If you want to see what's happening in frontier economies since the industrial revolution then it seems more natural to use something like per capita GDP in the UK. If I look up the GDP per capita in the UK time series at Our World in Data and turn that into a graph of (GDP per capita growth rate) vs (time), I get:

So it seems to me like things really did change a lot as technology improved, growing from 0.4% in 1800-1850, to 1% in 1850-1900, to 0.8% in 1900-1950, to 2.4% in 1950-2000. What we're talking about is a further change similar in scope to the change from 1800 to 1850 or from 1900 to 1950.

(I don't know if there are other reasons the UK isn't representative. I think the most obvious candidate would be that 1900-1950 was a really rough period for the UK, and then 1950-2000 potentially involves some catch-up growth.)

Comment by paulfchristiano on What technologies could cause world GDP doubling times to be <8 years? · 2020-12-12T22:28:18.546Z · LW · GW

the acceleration due to the agricultural revolution was due to agriculture + a few other things maybe

This linguistic trick only seems to work because you have a single word ("agriculture") that describes most human work at that time. If you want the analogous level of description for "what is going to improve next?" just look up what people are doing in the modern economy. If you need more words to describe the modern economy, then I guess it's going to be "more technologies" that are responsible this time (though in the future, when stuff we do today is a smaller part of the economy, they may describe it using a smaller number of words).

we can say the acceleration due to the industrial revolution was due to engines + international trade + mass-production methods + scientific institutions + capitalist institutions + a few other things I'm forgetting

If you are including lists that long then I guess the thing that's going to change is "improved manufacturing + logistics + construction + retail + finance" or whatever, just sample all the stuff that humans do and then it gets improved.

(I think "a few other things" is actually quite a lot of other things, unless you construe "mass-production" super broadly.)

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-25T05:37:08.599Z · LW · GW

Many people's view of a UBI depends on whether recipients in fact stop working. For example, people are interested in running studies on that question, often with a clear indication that they would support a UBI if and only if recipients don't significantly decrease hours worked.

What are we to make of this concern?

A natural way to understand it is to separate the effects of UBI into {recipients may decide to reduce hours worked} from {all other effects}. Then the concern could be understood as a suggestion that this change in hours worked is bad even if the other effects of a UBI would be good. Put differently, people who express this concern may believe that a UBI would be good if we magically causally intervened to ensure that people continued working the same amount, while the effects of UBI alone are more uncertain.

The reason to respond to this view, rather than directly analyzing all the effects of a UBI together, is that it seemed to me to indicate a moral error that could be separated from the other complex empirical questions at stake.

(Given that this seems like a kind of unenlightening thread about a topic that's not super important to me, I'll probably drop it.)

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-24T16:35:05.803Z · LW · GW

Capital costs in food production are significant. Land will still cost money, materials will still cost money, and machines will still cost money (though the cost of machines above and beyond the raw material cost could rapidly fall).

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-24T16:30:58.724Z · LW · GW

You could argue that people don't take that into account when deciding not to work (so that I can make the world better by forcing people to work for their own benefit).

The first step would be believing that people who stop working (because they don't have to) end up being less healthy; I have no idea if that's true. It's a bit hard to study, since interventions like "inherit a bunch of money" and "receive a UBI" mostly affect health via the channel of "now you have a bunch of money," and that obscures any negative effect from not having to work.

(And on the other hand, comparing the employed to the unemployed is extremely confounded and I'm skeptical it gives any evidence on this question. It would be pretty surprising if people who had a harder time finding work weren't less healthy and happy.)

The best would be to compare people receiving an unconditional transfer to people receiving a transfer with a work requirement, but I'm not aware of studies on that.

You could also have some anecdotal evidence about that. People I know who are voluntarily unemployed seem to eat and exercise better, but they are probably not representative of the people affected by a welfare work requirement.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-24T16:25:55.933Z · LW · GW

If we're talking economic efficiency, then your own utility should be included.

My starting assumption is that I decided not to work because I believe I am better off. We are wondering if my decision to stop working was inefficient, i.e. if it makes the world worse off despite me voluntarily choosing to do it. So the salient questions are (i) how does this affect everyone else? Does it cause harms to the rest of the world? (ii) am I predictably making a mistake (e.g. by not adequately accounting for the ways in which working benefits my future self)?

In a UBI scenario, you should be able to stop working while still consuming

Yes, I'm talking about the additional consumption if you earn+spend more money.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-23T02:20:01.489Z · LW · GW

Basically the whole case comes down to the externalities of working+consuming though (both the case in favor and the case against). It seems the point stands that the externalities of working and consuming are both relevant, there's not really an asymmetry there, and I don't see how this is related to "getting distracted by the money flows."

Like, I might produce value because someone gets more surplus from hiring me than they would have gotten from hiring someone else (in the competitive limit that gap converges to 0 and they are indifferent, but presumably it won't be 0 in the real world). And similarly I might produce value because someone gets more surplus from selling to me than they would have gotten from selling to someone else. But those things seem symmetrical.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T23:59:18.262Z · LW · GW

I may still be misunderstanding.

When I work I create value for the world, which is ultimately measured in benefits to other humans. And when I go spend my money I impose a cost on the world which is ultimately measured in the effort those people put in to give me what I bought, or the other people who could have had the thing that did not, or whatever.

It seems like the question is about the balance between the value I create by working, and the value others lose when I consume, isn't it? It's relevant both how much other people value what I do for them, and how much other people value the effort they put in for me.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T23:53:41.811Z · LW · GW

If I earn less money then I spend less money. The question is whether the combination of {me working} + {me consuming} is better or worse for the rest of the world than {me relaxing}, since what's at issue is precisely whether individuals who decide not to work are a sign of social inefficiency.

For the purpose of that comparison, the consumption seems just as relevant as the production. You seem to be disagreeing, but I'm not sure why. Yes, it's true that if I give someone a UBI they will also spend the UBI, and that's the same as any redistribution, but that's not relevant to analyzing whether their decision to not work is socially inefficient.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T17:56:35.123Z · LW · GW

It seems like their problem is that they can't pay for a UBI without crazy distortions (and likely can't raise enough money for a large UBI regardless).

I'm not sure what exactly the reductio is for the medieval society. Giving low-income workers money will generally raise the price of goods produced by low-income workers but that doesn't generally indicate any efficiency loss.

I do definitely agree that paying someone a $100 UBI causes a loss of $100 to the taxpayers who paid for it. But that happens regardless of whether the recipients stop working.

Comment by paulfchristiano on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T17:25:02.221Z · LW · GW

That society might gain additional benefit from how you spend your money is merely coincidental.

I'm not sure what "coincidental" means here. The question is how much more or less than \$100 of value you create by working, and that seems to depend about as much on how you spend your money as it does on how you earn your money.

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-21T02:09:55.394Z · LW · GW

A number of blogs seem to treat [AI existential safety, AI alignment, and AI safety] as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety.

I strongly agree with the benefits of having separate terms and generally like your definitions.

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable or greater than human extinction in terms of their moral significance.”

I like "existential AI safety" as a term to distinguish from "AI safety" and agree that it seems to be clearer and have more staying power. (That said, it's a bummer that "AI existential safety forum" is a bit of a mouthful.)

If I read that term without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-21T01:38:44.343Z · LW · GW

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that's hard because:

I'm not sure what's most natural, but I do consider this a fairly unlikely way of achieving outcome C.

I think the best argument for this kind of outcome is from Wei Dai, but I don't think it gets you close to the "direct democracy" outcome. (Even if you had state control and AI systems aligned with the state, it seems unlikely and probably undesirable for the state to be replaced with an aggregation procedure implemented by the AI itself.)

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-21T01:25:07.582Z · LW · GW

It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?

The OP writes "contributions to AI alignment are also generally unhelpful to existential safety." I don't think I'm taking a strong stand in favor of putting all our hopes on plan A, I'm trying to understand the perspective on which plan B is much more important even before considering neglectedness.

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different.

I agree that would be premature. That said, I still found it notable that OP saw such a large gap between the importance of CSC and other areas on and off the list (including MARL). Given that I would have these things in a different order (before having thought deeply), it seemed to illustrate a striking difference in perspective. I'm not really trying to take a strong stand, just using it to illustrate and explore that difference in perspective.

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-20T06:59:24.757Z · LW · GW

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

It's unclear to me how this particular outcome relates to social alignment (or at least to the kinds of research areas in this post). Some possibilities:

• Does failure to solve social alignment mean that firms and governments cannot use AI to represent their shareholders and constituents? Why might that be? (E.g. what's a plausible approach to atomic alignment that couldn't be used by a firm or government?)
• Does AI progress occur unevenly such that some group gets much more power/profit, and then uses that power? If so, how would technical progress on alignment help address that outcome? (Why would the group with power be inclined to use whatever techniques we're imagining?) Also, why does this happen?
• Does AI progress somehow complicate the problem of governance or corporate governance such that those organizations can no longer represent their constituents/shareholders? What is the mechanism (or any mechanism) by which this happens? Does social alignment help by making new forms of organization possible, and if so should I just be thinking of it as a way of improving those institutions, or is it somehow distinctive?
• Do we already believe that the situation is gravely unequal (e.g. because governments can't effectively represent their constituents and most people don't have a meaningful amount of capital) and AI progress will exacerbate that situation? How does social alignment prevent that?

(This might make more sense as a question for the OP, it just seemed easier to engage with this comment since it describes a particular more concrete possibility. My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se.)

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-20T04:49:59.265Z · LW · GW

If single/single alignment is solved it feels like there are some salient "default" ways in which we'll end up approaching multi/multi alignment:

• Existing single/single alignment techniques can also be applied to empower an organization rather than an individual. So we can use existing social technology to form firms and governments and so on, and those organizations will use AI.
• AI systems can themselves participate in traditional social institutions. So AI systems that represent individual human interests can interact with each other e.g. in markets or democracies.

I totally agree that there are many important problems in the world even if we can align AI. That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

For example, let's take the considerations you discuss under CSC:

Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world.  This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren't excited about.

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor.  In short, an “algorithmic government” will be needed to govern “algorithmic society”.  Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use.  However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Again, it's not clear what you expect to happen when existing institutions are empowered by AI and mostly coordinate the activities of AI.

The last line reads to me like "If we were smarter, when our legal system may no longer be up to the challenge," with which I agree. But it seems like the main remedy is "if we were smarter, we would hopefully work on improving our legal system in tandem with the increasing demands we impose on it."

It feels like the salient actions to take to me are (i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research, (ii) work on improving the relative capability of AI at those tasks that seem more useful for guiding society in a positive direction.

I consider (ii) to be one of the most important kinds of research other than alignment for improving the impact of AI, and I consider (i) to be all-around one of the most important things to do for making the world better. Neither of them feels much like CSC (e.g. I don't think computer scientists are the best people to do them) and it's surprising to me that we end up at such different places (if only in framing and tone) from what seem like similar starting points.

Comment by paulfchristiano on Some AI research areas and their relevance to existential safety · 2020-11-20T04:36:30.580Z · LW · GW

Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely, it will just be X). I may be misunderstanding.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that's the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
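
That multi-axis picture can be made concrete with a toy model (all quantities and rates here are hypothetical, chosen only to illustrate the direction of the effect):

```python
def deployed_attributes(threshold, reliability_research):
    """Toy deployment model: development improves speed and reliability each
    step, and the system ships once reliability + speed crosses a fixed
    threshold. Extra reliability research steepens the reliability curve."""
    reliability, speed = 0.0, 0.0
    while reliability + speed < threshold:
        speed += 1.0                               # baseline capability progress
        reliability += 0.5 + reliability_research  # faster with more research
    return reliability, speed
```

Under this model, more reliability research means systems ship sooner, with higher reliability and lower speed, rather than shipping at the same fixed reliability X.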

I'm not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks. My sense is that most of the locally-contrarian views in this post are driven by locally-contrarian quantitative estimates of various risks. If that's the case, then it seems like the main thing that would shift my view would be some argument about the relative magnitude of risks. I'm not sure if other readers feel similarly.

Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

This is a plausible view, but I'm not sure what negative consequences you have in mind (or how it affects the value of progress in the field rather than the educational value of hanging out with people in the field).

Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety.  Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time?  This is a special case of an OODR question.

That task---how do we test that this system will consistently have property P, given that we can only test property P at training time?---is basically the goal of OODR research. Your prioritization of OODR suggests that maybe you think that's the "easy part" of the problem (perhaps because testing property P is so much harder), or that OODR doesn't make meaningful progress on that problem (perhaps because the nature of the problem is so different for different properties P?). Whatever it is, it seems like that's at the core of the disagreement and you don't say much about it. I think many people have the opposite intuition, i.e. that much of the expected harm from AI systems comes from behaviors that would have been recognized as problematic at training time.

In any case, I see AI alignment in turn as having two main potential applications to existential safety:

1. AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
2. AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

Here is one standard argument for working on alignment. It currently seems plausible that AI systems will be trying to do stuff that no one wants and that this could be very bad if AI systems are much more competent than humans. Prima facie, if the designers of AI systems are able to better control what AI systems are trying to do, then those AI systems are more likely to be trying to do what the developers want. So if we are able to give developers that ability, we can reduce the risk of AI competently doing stuff no one wants.

This isn't really a metaphor, it's a direct path for impact. It's unclear if you think that this argument is mistaken because developers will be able to control what their AI systems are trying to do, because they won't be motivated to deploy AI until they have that control, because it's not much better for AI systems to be trying to do what their developers want, because there are other more important reasons that AI systems could be trying to do stuff that no one wants, because there are other risks unrelated to AI trying to do stuff no one wants, or something else altogether.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Like you, I'm opposed to plans where people try to take over the world in order to make it safer. But this looks like a bit of a leap. For example, AI alignment may help us build powerful AI systems that help us negotiate or draft agreements, which doesn't seem like taking over the world to make it safer.

Comment by paulfchristiano on My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda · 2020-10-12T03:59:37.745Z · LW · GW

What would a corrigible but not-intent-aligned AI system look like?

Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can't be both corrigible and intent aligned.

Comment by paulfchristiano on Puzzle Games · 2020-10-09T04:44:53.423Z · LW · GW

It's meant to be read before playing, added a comment clarifying.

Comment by paulfchristiano on Puzzle Games · 2020-10-09T01:00:51.355Z · LW · GW

Follow-up now that I've finished.

(This is spoiler'ed as per this post's spoiler policy, but it's designed to provide a rules clarification relevant to the parent and to be read before finishing the game.)

Here's a simplified model of resetting: the game tracks the most recent landmass you've stepped on. When you reset, all trees from that landmass are returned to their initial state. You are moved to the location where you first stepped foot on that landmass.

That model isn't exactly right (since it would make it way too easy for the player to get stuck), but every puzzle is solvable under that model. I had a single solution that would have worked under that model but didn't work under the actual behavior of resetting, which was a tiny bit frustrating but not a big deal.

If you reset half of a raft then it becomes a lone log (this makes it possible to split a long log in two or to rotate a log by integrating it into a raft then resetting). If you reset something that's holding another log up, the other log will fall down. I think there are some tricky corner cases but you never need to deal with any more complicated than those two basics and you can just pretend that you automatically lose if you create a tricky situation (e.g. if you reset when a tree's starting position is occupied).

Comment by paulfchristiano on Hiring engineers and researchers to help align GPT-3 · 2020-10-05T19:28:18.441Z · LW · GW

described by Eliezer as “directly, straight-up relevant to real alignment problems.”

Worth saying that Eliezer still thinks our team is pretty doomed and this is definitely not a general endorsement of our agenda. I feel excited about our approach and think it may yet work, but I believe Eliezer's position is that we're just shuffling around the most important difficulties into the part of the plan that's vague and speculative.

I think it's fair to say that Reflection is on the Pareto frontier of {plays ball with MIRI-style concerns, does mainstream ML research}. I'm excited for a future where either we convince MIRI that aligning prosaic AI is plausible, or MIRI convinces us that it isn't.

Comment by paulfchristiano on Hiring engineers and researchers to help align GPT-3 · 2020-10-05T19:19:02.180Z · LW · GW

I think that "imitate a human who is trying to be helpful" is better than "imitate a human who is writing an article on the internet," even though it's hard to define "helpful." I agree that's not completely obvious for a bunch of reasons.

(GPT-3 is better if your goal is in fact to predict text that people write on the internet, but that's a minority of API applications.)

Comment by paulfchristiano on Hiring engineers and researchers to help align GPT-3 · 2020-10-05T19:07:20.831Z · LW · GW

will these jobs be long-term remote? if not, on what timeframe will they be remote?

We expect to be requiring people to work from the office again sometime next year.

how suitable is the research engineering job for people with no background in ml, but who are otherwise strong engineers and mathematicians?

ML background is very helpful. Strong engineers who are interested in learning about ML are also welcome to apply, though no promises about how well we'll handle those applications in the current round.

Comment by paulfchristiano on Hiring engineers and researchers to help align GPT-3 · 2020-10-05T19:05:19.063Z · LW · GW

The team is currently 7 people and we are hiring 1-2 additional people over the coming months.

I am optimistic that our team and other similar efforts will be hiring more people in the future and continuously scaling up, and that over the long term there could be a lot of people working on these issues.

(The post is definitely written with that in mind and the hope that enthusiasm will translate into more than just hires in the current round. Growth will also depend on how strong the pool of candidates is.)

Comment by paulfchristiano on Puzzle Games · 2020-10-02T02:06:24.677Z · LW · GW

I totally understand why resetting had to be kind of complicated / ad hoc, and I think that this was a reasonable compromise. I don't think uncertainty about resetting matters much in the scheme of things, it's a great game.

Comment by paulfchristiano on Puzzle Games · 2020-10-01T06:39:08.389Z · LW · GW

Partial answer to my question (significantly more spoilers):

You can get to the credits without resetting. Extra puzzles appear to require resetting, though maybe not in a very subtle way. I don't know if resetting has a simple description. It definitely depends on invisible facts about the map.

Comment by paulfchristiano on “Unsupervised” translation as an (intent) alignment problem · 2020-09-30T21:39:42.614Z · LW · GW

Good point, changed.

Originally it was "as an alignment problem" but this has the problem that it also refers to "aligning" unaligned datasets. The new way is bulkier but probably better overall.

Comment by paulfchristiano on “Unsupervised” translation as an (intent) alignment problem · 2020-09-30T18:03:07.596Z · LW · GW

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

This is basically what the helper model does, except:

• For competitiveness you should learn and evaluate the dictionary at the same time you are training the model; running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
• Most knowledge about language isn't easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we'd prefer to have a model that answers questions about meaning than a model that outputs a static dictionary.
• I don't know what standard you want to use for "helpful for understanding the passage" but I think "helps predict the next word correctly" is probably the best approach (since the goal is to be competitive and that's how GPT learned).

After making those changes we're back at the learning the prior proposal.

I think that proposal may work passably here because we can potentially get by with a really crude prior---basically we think "the helper should mostly just explain the meaning of terms" and then we don't need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section "A vague hope" is a little bit too pessimistic for the given context of unsupervised translation.

Comment by paulfchristiano on Puzzle Games · 2020-09-28T16:11:10.406Z · LW · GW

Question about Monster's Expedition:

The reset mechanic seems necessary to make the game playable in practice, but it seems very unsatisfying. It is unclear how you'd make it work in a principled way; the actual implementation seems extremely confusing, seems to depend on invisible information about the environment, and has some weird behaviors that I think are probably bugs. Unfortunately, it currently seems possible that probing the weirdest behaviors of resetting (e.g. breaking conservation-of-trees) could be the only way to access some places. It's also possible that mundane applications of resetting are essential but you aren't intended to explore weird edge cases, which would be the least satisfying outcome of all.

So two questions:

1. Is it possible to beat the game without resetting? Can I safely ignore it as a mechanic? This is my current default assumption and it's working fine so far.

2. Is the reset mechanic actually lawful/reasonable and I just need to think harder?

(Given the quality of the game I'm hoping that at least one of those is "yes." If "no answer" seems like the best way to enjoy the game I'm open to that as well.)

Comment by paulfchristiano on Puzzle Games · 2020-09-27T23:19:31.335Z · LW · GW

This list is almost the same as mine. I would include Hanano Puzzle 2 at tier 2 and Cosmic Express at tier 3. I haven't played Twisty Little Passages or Kine, though I'll try them on this recommendation.

We're putting together a self-contained campaign for engine-game.com which is aiming to be Tier-2-according-to-Paul. We'll see if other folks agree when it's done. It has a very different flavor from the other games on the list.

Comment by paulfchristiano on Distributed public goods provision · 2020-09-27T16:24:08.616Z · LW · GW

I think you inevitably need to answer "What is the marginal impact of funding?" if you are deciding how much to fund something.

(I will probably write about approaches to randomizing to be maximally efficient with research time at some point in the future. My current plan is something like: write out public goods I know of that I benefit from, then sample one of them to research and fund by 1/p(chosen) more than I normally would.)
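A minimal sketch of that randomization scheme (my reading of the plan; the goods and amounts are hypothetical): sample one public good with probability p proportional to what you would normally contribute, then scale your contribution by 1/p, so the expected funding to every good matches funding all of them while you only research one.

```python
import random
from typing import Dict, Tuple

def fund_by_sampling(goods: Dict[str, float], rng: random.Random) -> Tuple[str, float]:
    """Pick one public good to research and fund.

    `goods` maps each good to the amount you would normally contribute.
    The chosen good is funded by 1/p(chosen) times its normal amount, which
    keeps the expected contribution to each good equal to funding it directly.
    """
    names = list(goods)
    total = sum(goods.values())
    # Sample proportional to the amount you would normally give.
    probs = [goods[n] / total for n in names]
    chosen = rng.choices(names, weights=probs, k=1)[0]
    p = goods[chosen] / total
    return chosen, goods[chosen] / p  # scale up by 1/p(chosen)

# Example: three goods you would normally fund with these amounts.
choice, amount = fund_by_sampling(
    {"wiki": 10.0, "library": 30.0, "park": 60.0}, random.Random(0)
)
```

With sampling proportional to the normal contribution, the scaled-up amount always equals the total budget, and in expectation each good receives exactly what it would have under direct funding.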

This isn't really meant to be a quick rule of thumb, it's meant to be a way to answer the question at all.

Comment by paulfchristiano on Search versus design · 2020-08-17T14:46:52.045Z · LW · GW

I liked this post.

I'm not sure that design will end up being as simple as this picture makes it look, no matter how well we understand it---it seems like factorization is one kind of activity in design, but it feels like overall "design" is being used as a kind of catch-all that is probably very complicated.

An important distinction for me is: does the artifact work because of the story (as in "design"), or does the artifact work because of the evaluation (as in search)?

This isn't so clean, since:

• Most artifacts work for a combination of the two reasons---I design a thing then test it and need a few iterations---there is some quantitative story where both factors almost always play a role for practical artifacts.
• There seem to be many other reasons things work (e.g. "it's similar to other things that worked" seems to play a super important role in both design and search).
• A story seems like it's the same kind of thing as an artifact, and we could also talk about where *it* comes from. A story that plays a role in a design itself comes from some combination of search and design.
• During design it seems likely that humans rely very extensively on searching against mental models, which may not be introspectively available to us as a search but seems like it has similar properties.

Despite those and more complexities, it feels to me like if there is a clean abstraction it's somewhere in that general space, about the different reasons why a thing can work.

Post-hoc stories are clearly *not* the "reason why things work" (at least at this level of explanation). But also if you do jointly search for a model+helpful story about it, the story still isn't the reason why the model works, and from a safety perspective it might be similarly bad.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-22T01:41:35.308Z · LW · GW
Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected

For reference, this is the topic of section 7 of AI Safety via Debate.

In the limit they seem equivalent: (i) it's easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).

The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that's a more detailed discussion.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-22T01:34:35.946Z · LW · GW
I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.
[...]
The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).

If I assert "X because Y & Z" and the depth limit is 0, you aren't intended to say "Yup, checks out," unless Y and Z and the implication are self-evident to you. Low-depth debates are supposed to ground out with the judge's priors / low-confidence in things that aren't easy to establish directly (because if I'm only updating on "Y looks plausible in a very low-depth debate" then I'm going to say "I don't know but I suspect X" is a better answer than "definitely X"). That seems like a consequence of the norms in my original answer.

In this context, a circular argument just isn't very appealing. At the bottom you are going to be very uncertain, and all that uncertainty is going to propagate all the way up.
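A toy model of that dynamic (my own illustration, not from any debate paper; the discount factor is arbitrary): at depth 0 the judge falls back on their prior rather than saying "checks out," and that uncertainty propagates up, so a circular chain of support never gains confidence however deep the debate runs.

```python
from typing import Dict

def confidence(claim: str, support: Dict[str, str], prior: float, depth: int) -> float:
    """Judge's confidence in `claim` after a depth-`depth` debate.

    `support` maps each claim to the claim offered in its defense. At depth 0
    the judge falls back on the prior instead of accepting the claim outright.
    """
    if depth == 0 or claim not in support:
        return prior
    # A claim is at most as credible as its support, discounted slightly
    # for the inferential step itself.
    return min(1.0, 0.95 * confidence(support[claim], support, prior, depth - 1))

# A circular argument: X because Y, Y because X.
circular = {"X": "Y", "Y": "X"}
# However deep the debate, confidence in X never rises above the prior.
assert confidence("X", circular, prior=0.5, depth=10) <= 0.5
```

The circularity just carries the depth-0 prior all the way back up (shrunk by the per-step discount), matching the claim that the bottom-level uncertainty propagates to the top.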

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.

If you do it this way the debate really doesn't seem to work, as you point out.

I currently think all my concerns can be addressed if we abandon the link to factored cognition and defend a less ambitious thesis about debate.

For my part I mostly care about the ambitious thesis.

If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.

I disagree with both of these as objections to the basic strategy, but don't think they are very important.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-19T21:09:26.655Z · LW · GW

Sorry for not understanding how much context was missing here.

The right starting point for your question is this writeup which describes the state of debate experiments at OpenAI as of end-of-2019 including the rules we were using at that time. Those rules are a work in progress but I think they are good enough for the purpose of this discussion.

In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.

Our current rules---to hopefully be published sometime this quarter---handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it's important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don't expect this paragraph to be clear and don't think it's worth getting into until we publish an update, but wanted to flag it.

Do the debaters know how long the debate is going to be?

Yes.

To what extent are you trying to claim some relationship between the judge strategy you're describing and the honest one? EG, that it's eventually close to honest judging? (I'm asking whether this seems like an important question for the discussion vs one which should be set aside.)

If debate works, then at equilibrium the judge will always be favoring the better answer. If furthermore the judge believes that debate works, then this will also be their honest belief. So if judges believe in debate then it looks to me like the judging strategy must eventually approximate honest judging. But this is downstream of debate working; it doesn't play an important role in the argument that debate works or anything like that.

Comment by paulfchristiano on Challenges to Christiano’s capability amplification proposal · 2020-07-18T05:58:41.419Z · LW · GW

Providing context for readers: here is a post someone wrote a few years ago about issues (ii)+(iii) which I assume is the kind of thing Czynski has in mind. The most relevant thing I've written on issues (ii)+(iii) are Universality and consequentialism within HCH, and prior to that Security amplification and Reliability amplification.

Comment by paulfchristiano on Challenges to Christiano’s capability amplification proposal · 2020-07-18T03:45:56.591Z · LW · GW

I think not.

For the kinds of questions discussed in this post, which I think are easier than "Design Hessian-Free Optimization" but face basically the same problems, I think we are making reasonable progress. I'm overall happy with the progress but readily admit that it is much slower than I had hoped. I've certainly made updates (mostly about people, institutions, and getting things done, but naturally you should update differently).

Note that I don't think "Design Hessian-Free Optimization" is amongst the harder cases, and these physics problems are a further step easier than that. I think that sufficient progress on these physics tasks would satisfy the spirit of my remark 2y ago.

I appreciate the reminder at the 2y mark. You are welcome to check back in 1y later and if things don't look much better (at least on this kind of "easy" case), treat it as a further independent update.