Comments
This caused me to find your substack! Sorry I missed it earlier; I'm looking forward to catching up.
FWIW, I found the Strawberry Appendix especially helpful for understanding how this approach to ELK could solve (some form of) outer alignment.
Other readers, consider looking at the appendix even if you don't feel like you fully understand the main body of the post!
Nice post! I see where you're coming from here.
(ETA: I think what I'm saying here is basically "3.5.3 and 3.5.4 seem to me like they deserve more consideration, at least as backup plans -- I think they're less crazy than you make them sound." So I don't think you missed these strategies, just that maybe we disagree about how crazy they look.)
I haven't thought this through all the way yet, and don't necessarily endorse these strategies without more thought, but:
It seems like there could be a category of strategies for players with "good" AGIs to prepare to salvage some long-term value when/if a war with "bad" AGIs does actually break out, because the Overton window will stop being relevant at that point. This prep might be doable without breaking what we normally think of as Overton windows*, and could salvage a percentage of the future light-cone, but would come at the cost of not preventing a huge war/catastrophe, and could cost a big percentage of the future light-cone (depending on how "winnable" a war is from what starting points).
For example, a team could create a bunker that is well-positioned to be defended; or get as much control of civilization's resources as Overton allows and prepare plans to mobilize and expand into a war footing if "bad" AGI emerges; or prepare to launch von Neumann probes. Within the bubble of resources the "good" AGI controls legitimately before the war starts, the AGI might be able to build up a proprietary or stealthy technological lead over the rest of the world, effectively stockpiling its own supply of energy to make up for the fact that it's not consuming the free energy that it doesn't legitimately own.
Mnemonically, this strategy is something like "In case of emergency, break Overton window" :) I don't think your post really addresses these kinds of strategies, but it's very possible that I missed it (in which case my apologies).
*(We could argue that there's an Overton window that says "if there's a global catastrophe coming, it's unthinkable to just prepare to salvage some value, you must act to stop it!", which is why "prepare a bunker" is seen as nasty and antisocial. But that seems to be getting close to a situation where multiple Overton maxims conflict and no norm-following behavior is possible :) )
Thanks for the post, I found it helpful! The "competent catastrophes" direction sounds particularly interesting.
This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.
It didn't bug me ¯\_(ツ)_/¯
Thanks for the post! FWIW, I found this quote particularly useful:
Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!
The fact that it showed up right before an eye-catching image probably helped :)
This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.
Thanks for the writeup! This Google doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit
This seems like a useful lens -- thanks for taking the time to post it!
I do agree. I think the main reason to stick with "robustness" or "reliability" is that that's how the problems of "my model doesn't generalize well / is subject to adversarial examples / didn't really hit the training target outside the training data" are referred to in ML, and it gives a bad impression when people rename problems. I'm definitely most in favor of giving a new name like "hitting the target" if we think the problem we care about is different in a substantial way (which could definitely happen going forward!)
OK -- if it looks like the delay will be super long, we can certainly ask him how he'd be OK w/ us circulating / attributing those ideas. In the meantime, there are pretty standard norms about unpublished work that's been shared for comments, and I think it makes sense to stick to them.
I agree re: terminology, but probably further discussion of unpublished docs should just wait until they're published.
Thanks for writing this, Will! I think it's a good + clear explanation, and "high/low-bandwidth oversight" seems like a useful pair of labels.
I've recently found it useful to think about two kind-of-separate aspects of alignment (I think I first saw these clearly separated by Dario in an unpublished Google Doc):
1. "target": can we define what we mean by "good behavior" in a way that seems in-principle learnable, ignoring the difficulty of learning reliably / generalizing well / being secure? E.g. in RL, this would be the Bellman equation or recursive definition of the Q-function. The basic issue here is that it's super unclear what it means to "do what the human wants, but scale up capabilities far beyond the human's".
2. "hitting the target": given a target, can we learn it in a way that generalizes "well"? This problem is very close to the reliability / security problem a lot of ML folks are thinking about, though our emphasis and methods might be somewhat different. Ideally our learning method would be very reliable, but the critical thing is that we should be very unlikely to learn a policy that is powerfully optimizing for some other target (malign failure / daemon). E.g. inclusive genetic fitness is a fine target, but the learning method got humans instead -- oops.
I've largely been optimistic about IDA because it looks like a really good step forward for our understanding of problem 1 (in particular because it takes a very different angle from CIRL-like methods that try to learn some internal values-ish function by observing human actions). Problem 2 wasn't really on my radar before (maybe because problem 1 was so open / daunting / obviously critical); now it seems like a huge deal to me, largely thanks to Paul, Wei Dai, some unpublished Dario stuff, and more recently some MIRI conversations.
Current state:
- I do think problem 2 is super-worrying for IDA, and probably for all ML-ish approaches to alignment? If there are arguments that different approaches are better on problem 2, I'd love to see them. Problem 2 seems like the most likely reason right now that we'll later be saying "uh, we can't make aligned AI, time to get really persuasive to the rest of the world that AI is very difficult to use safely".
- I'm optimistic about people sometimes choosing only problem 1 or problem 2 to focus on with a particular piece of work -- it seems like "solve both problems in one shot" is too high a bar for any one piece of work. It's most obvious that you can choose to work on problem 2 and set aside problem 1 temporarily -- a ton of ML people are doing this productively -- but I also think it's possible and probably useful to sometimes say "let's map out the space of possible solutions to problem 1, and maybe propose a family of new ones, w/o diving super deep on problem 2 for now."
I really like this post, and am very glad to see it! Nice work.
I'll pay whatever cost I need to for violating norms against non-useful comments in order to say this -- an upvote didn't seem like enough.
Thanks for writing this -- I think it's a helpful kind of reflection for people to do!
Ah, gotcha. I'll think about those points -- I don't have a good response. (Actually adding "think about"+(link to this discussion) to my todo list.)
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
These objections are all reasonable, and 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.
I'm afraid that the point I was trying to make didn't come across, or that I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence, and we will only ever be able to say "well, I ran out of ideas for how to break it", so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.
Do you think it's unlikely that we'll be able to make positive arguments for the safety of schemes like Paul's? If so, I'd be really interested in why -- apologies if you've already tried to explain this and I just haven't figured that out.
"naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized, they're here as a base case for the induction. Does that make sense?
If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's fine" if we were really running something like this.)
To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn't going to kill us. If you think it's really unlikely that any argument of this inductive form could be run, I'd be interested in that (or if Paul or someone else thought I'm on the wrong track / making some kind of fundamental mistake.)
My comment, for the record:
I'm glad to see people critiquing Paul's work -- it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of "benign", I don't think humans are benign, so I'm not going to argue with that. Instead, I'll say what I think about building aligned AIs out of simulated human judgement.
I agree with you that listing and solving problems with such systems until we can't think of more problems is unsatisfying, and that we should have positive arguments for confidence that we won't hit unforeseen problems; maybe at some point we need to give up on getting those arguments and do the best we can without them, but it doesn't feel like we're at that point yet. I'm guessing the main difference here is that I'm hopeful about producing those arguments and you think it's not likely to work.
Here's an example of how an argument might go. It's sloppy, but I think it shows the flavor that makes me hopeful (a couple of toy sketches below illustrate steps iii and v). Meta-execution preserving a "non-corrupting" invariant:
i. define a naturally occurring set of queries nQ.
ii. have some reason to think that nq in nQ are very unlikely to cause significant value drift in Som in 1 hour (nq are "non-corrupting").
iii. let Q be the closure of nQ under "Som spends an hour splitting q into sub-queries".
iv. have some reason to think that Som's processing never purposefully converts non-corrupting queries into corrupting ones.
v. have some defense against random noise producing corrupting nq or q.
vi. conclude that all q in Q are non-corrupting, and so the system won't involve any value-drifted Soms.
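To make step (iii) concrete, here's a minimal toy sketch (my own illustration; the splitting rule is a made-up stand-in for Som's actual behavior) of closing nQ under splitting:

```python
def som_split(query):
    # Toy stand-in for "Som spends an hour splitting q into sub-queries":
    # here we just break a compound question at semicolons.
    parts = [p.strip() for p in query.split(";") if p.strip()]
    return parts if len(parts) > 1 else []

def closure(nQ, max_rounds=100):
    # Step (iii): Q is the closure of the naturally occurring set nQ under
    # Som's splitting. The invariant we want (steps ii, iv, vi) is that
    # everything reachable this way stays non-corrupting.
    Q, frontier = set(nQ), set(nQ)
    for _ in range(max_rounds):
        new = {sq for q in frontier for sq in som_split(q)} - Q
        if not new:
            break
        Q |= new
        frontier = new
    return Q

print(closure({"summarize the report; list open questions; flag anything odd"}))
```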
This kind of system would run sort of like your (2) or Paul's meta-execution (https://ai-alignment.com/meta-execution-27ba9b34d377).
There are some domains where this argument seems clearly true and Som isn't just being used as a microprocessor, e.g. Go problems or conjectures to be proven. In these cases it seems like (ii), (iii), and (iv) are true by virtue of the domain -- no Go problems are corrupting -- and Som's processing doesn't contribute to the truth of (iii).
For some other sets Q, it seems like (ii) will be true because of the nature of the domain (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour), (iv) will be true because it would take significant work on Som's part to convert a non-scary q into a scary q' and that Som wouldn't want to do this unless they were already corrupted, and (v) can be made true by using a lot of different "noise seeds" and some kind of voting system to wash out noise-produced corruption.
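And here's an equally toy sketch of the defense in (v): run Som's processing under many independent noise seeds and take a majority vote, so rare noise-produced corruption gets washed out (the corruption rate and answer strings are made up for illustration):

```python
import random
from collections import Counter

def answer_with_seed(query, seed):
    # Stand-in for "Som answers q under one particular noise seed"; we just
    # simulate a small chance of a noise-corrupted answer.
    rng = random.Random(seed)
    if rng.random() < 0.05:
        return "corrupted answer"
    return "clean answer to: " + query

def voted_answer(query, n_seeds=25):
    # Step (v): a majority vote across seeds washes out isolated corruption.
    answers = [answer_with_seed(query, seed) for seed in range(n_seeds)]
    best, _ = Counter(answers).most_common(1)[0]
    return best

print(voted_answer("split this page of text into sub-queries"))
```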
Obviously this argument is frustratingly informal, and maybe I could become convinced that it can't be strengthened, but I think I'd mostly be convinced by trying and failing, and it seems reasonably likely to me that we could succeed.
Paul seems to have another kind of argument for another kind of system in mind here (https://ai-alignment.com/aligned-search-366f983742e9), with a sketch of an argument at "I have a rough angle of attack in mind". Obviously this isn't an argument yet, but it seems worth looking into.
FWIW, Paul is thinking and writing about the kinds of problems you point out, e.g. in this post (https://ai-alignment.com/security-amplification-f4931419f903), this post (https://ai-alignment.com/reliability-amplification-a96efa115687), or this post (https://ai-alignment.com/implementing-our-considered-judgment-6c715a239b3e, search "virus" on that page). Not sure if his thoughts are helpful to you.
If you're planning to follow up this post, I'd be most interested in whether you think it's not likely to be possible to design a process that we can be confident will avoid Sim drift. I'd also be interested to know if there are other approaches to alignment that seem more promising to you.
I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?
If I read Paul's post correctly, ALBA is supposed to do this in theory -- I don't understand the theory/practice distinction you're making.
I'm not sure you've gotten quite ALBA right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.
As I understand it, ALBA proposes the following process (toy sketch after this list):
- H trains A to choose actions that would get the best immediate feedback from H. A is benign (assuming that H could give not-catastrophic immediate feedback for all actions and that the learning process is robust). H defines the feedback, and so A doesn't make decisions that are more effective at anything than H is; A is just faster.
- A (and possibly H) is (are) used to define a slow process A+ that makes "better" decisions than A or H would. (Better is in quotes because we don't have a definition of better; the best anyone knows how to do right now is look at the amplification process and say "yep, that should do better.") Maybe H uses A as an assistant, maybe a copy of A breaks down a decision problem into parts and hands them off to other copies of A, maybe A makes decisions that guide a much larger cognitive process.
- The whole loop starts over with A+ used as H.
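Here's a very rough sketch, in toy code of my own (the decomposition step is just a placeholder, so this isn't ALBA itself, only its loop structure), of how I'm reading the process above:

```python
def distill(feedback):
    # Step 1: train a fast agent A to pick whichever action gets the best
    # *immediate* feedback from the current overseer. (Here "training" is
    # perfect; a real A would be a learned approximation.)
    return lambda candidates, situation: max(candidates, key=lambda a: feedback(a, situation))

def amplify(agent, feedback):
    # Step 2: define a slow process A+ that is supposed to give "better"
    # feedback than A or H alone. The real scheme decomposes the evaluation
    # and hands pieces to copies of A; this placeholder just reuses the old
    # feedback unchanged.
    def improved_feedback(action, situation):
        return feedback(action, situation)
    return improved_feedback

def alba_loop(H_feedback, rounds=3):
    feedback = H_feedback
    for _ in range(rounds):
        A = distill(feedback)            # fast agent trained on the current overseer
        feedback = amplify(A, feedback)  # step 3: A+ becomes the new overseer
    return distill(feedback)

# Toy usage: the "human" overseer just prefers shorter answers.
final_agent = alba_loop(lambda action, situation: -len(action))
print(final_agent(["do the long careful thing", "act"], situation=None))
```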
The claim is that step 2 produces a system that is able to give "better" feedback than the human could -- feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting, etc. This should make it able to handle circumstances further and further outside human-ordinary, eventually scaling up to extraordinary circumstances. So, while you say that the best case to hope for is something weaker than this, it seems like ALBA is claiming to do more.
A second objection is that while you call each of these feedback signals a "reward function", each system is only trained to take actions that maximize the very next reward it gets (not the sum of future rewards). This means that each system is only effective at anything insofar as the feedback function it's maximizing at each step considers the long-term consequences of each action. So, if the feedback function is limited in the way you describe, we don't have reason to think that the system will be competent at anything outside of the "normal circumstances + a few exceptions" you describe -- all of its planning power comes from that feedback function, so we should expect it to be basically incompetent wherever it is incompetent.
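In symbols (my notation, not anything from the writeups): each trained system implements roughly

$$\pi(s) \approx \arg\max_a \; r_H(s, a) \quad\text{rather than}\quad \arg\max_\pi \; \mathbb{E}\Big[\textstyle\sum_t \gamma^t \, r_H(s_t, a_t)\Big],$$

so any look-ahead has to already be built into the feedback r_H rather than coming from the trained system's own planning.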
FWIW, this also reminded me of some discussion in Paul's post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:
The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.
I'm not sure if that's relevant to your point, but it seemed like you might be interested.
Discussed briefly in Concrete Problems, FYI: https://arxiv.org/pdf/1606.06565.pdf
This is a neat idea! I'd be interested to hear why you don't think it's satisfying from a safety point of view, if you have thoughts on that.
Thanks for writing this, Jessica -- I expect to find it helpful when I read it more carefully!
Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of "figuring out how to think about big / self-like hypotheses". Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?
Thanks Jessica. This was helpful, and I think I see more what the problem is.
Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis that contains a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It's clearly not possible for the system to hold and update infinitely many hypotheses the way Solomonoff does, and a system would need some kind of logical uncertainty or other magic to evaluate complex or self-referential hypotheses, but it seems like these hypotheses should be "in its class". Does this make sense, or do you think there is a mistake there?
Re point 2: I'm not confident that's an accurate summary; I'm precisely proposing that the agent learn a model of the world containing a model of the agent (approximate or precise). I agree that evaluating this kind of model will require logical uncertainty or similar magic, since it will be expensive and possibly self-referential.
Re point 3: I see what you mean, though for self-modeling the agent being predicted should only be as smart as the agent doing the prediction. It seems like approximation and logical uncertainty are the main ingredients needed here. Are there particular parts of the unbounded problem that are not solved by reflective oracles?
Thanks, Paul -- I missed this response earlier, and I think you've pointed out some of the major disagreements here.
I agree that there's something somewhat consequentialist going on during all kinds of complex computation. I'm skeptical that we need better decision theory to do this reliably -- are there reasons or intuition-pumps you know of that have a bearing on this?
Thanks Jessica, I think we're on similar pages -- I'm also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.
Thanks Jessica -- sorry I misunderstood about hijacking. A couple of questions:
- Is there a difference between "safe" and "accurate" predictors? I'm now thinking that you're worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.
- My feeling is that our current understanding of planning -- if I run this computation, I will get the result, and if I run it again, I'll get the same one -- is sufficient for harder prediction tasks. Are there particular aspects of planning that we don't yet understand well that you expect to be important for planning computation during prediction?
I agree with paragraphs 1, 2, and 3. To recap, the question we're discussing is "do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?"
A couple of notes on paragraph 4:
- I'm not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don't include built-in planning faculties.
- You are bringing up understandability of an NTM-based human-decision-predictor. I think that's a fine thing to talk about, but it's different from the question we were talking about.
- You're also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.
In paragraph 5, you seem to be proposing that to make any competent predictor, we'll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument ("predicting planners requires planning faculties so that you can emulate the planner" vs "predicting anything requires some amount of prioritization and decision-making"). In these cases, I'm more skeptical that a deep theoretical understanding of decision-making is important, but I'm open to talking about it -- it just seems different from the original question.
Overall, I feel like this response is out-of-scope for the current question -- does that make sense, or do I seem off-base?
Thanks, Jessica. This argument still doesn't seem right to me -- let me try to explain why.
It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won't need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it'll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?
(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)
"Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place."
I've had this conversation with Nate before, and I don't understand why I should think it's true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?
Very thoughtful post! I was so impressed that I clicked the username to see who it was, only to see the link to your LessWrong profile :)
Just wanted to mention that watching this panel was one of the things that convinced me to give AI safety research a try :) Thanks for re-posting, it's a good memory.
To at least try to address your question: one effect could be that there are coordination problems, where many people would be trying to "change the world" in roughly the same direction if they knew that other people would cooperate and work with them. This would result in less of the attention drain you suggest. This seems more like what I've experienced.
I'm more worried about people being stupid than mean, but that could be an effect of the bubble of non-mean people I'm part of.
Cool, thanks; sounds like I have about the same picture. One missing ingredient for me that was resolved by your answer, and by going back and looking at the papers again, was the distinction between consistency and soundness (on the natural numbers), which is not a distinction I think about often.
In case it's useful, I'll note that the procrastination paradox is hard for me to take seriously on an intuitive level, because some part of me thinks that requiring correct answers in infinite decision problems is unreasonable; so many reasoning systems fail on these problems, and infinite situations seem so unlikely, that they are hard for me to get worked up about. This isn't so much a comment on how important the problem actually is, but more about how much argumentation may be required to convince people like me that they're actually worth working on.
I don't (confidently) understand why the procrastination paradox indicates a problem to be solved. Could you clarify that for me, or point me to a clarification?
First off, it doesn't seem like this kind of infinite buck-passing could happen in real life; is there a real-life (finite?) setting where this type of procrastination leads to bad actions? Second, it seems to me that similar paradoxes often come up in other situations where agents have infinite time horizons and can wait as long as they want -- does the problem come from the infinity, or from something else?
The best explanation that I can give is "It's immediately obvious to a human, even in an infinite situation, that the only way to get the button pressed is to press it immediately. Therefore, we haven't captured human reasoning (about infinite situations), and we should capture that human reasoning in order to be confident about AI reasoning." This is AFAICT the explanation Nate gives in the Vingean Reflection paper. Is that how you would express the problem?
Is this sort of a way to get an agent with a DT that admits acausal trade (as we think the correct decision theory would) to act more like a CDT agent? I wonder how different the behaviors of the agent you specify are from those of a CDT agent -- in what kinds of situations would they come apart? When does "I only value what happens given that I exist" (roughly) differ from "I only value what I directly cause" (roughly)?
I don't have the expertise to evaluate it, but Brian Greene suggests this experiment.
I would encourage you to apply, these ideas seem reasonable!
As far as choosing, I would advise you to choose the idea for which you can make the case most strongly that it is Topical and Impactful, as defined here.
It seems that if it is desired, the overseer could also set their behaviour and intentions so that the approval-directed agent acts as we would want an oracle or tool to act. This is a nice feature.
I think Nick Bostrom and Stuart Armstrong would also be interested in this, and might have good feedback for you.
High-level feedback: this is a really interesting proposal, and looks like a promising direction to me! Most of my inline comments on Medium are more critical, but that doesn't reflect my overall assessment.
That's what I thought at first, too, but then I looked at the paper, and their figure looks right to me. Could you check my reasoning here?
On p.11 of Vincent's and Nick's survey, there's a graph "Proportion of experts with 10%/50%/90% confidence of HLMI by that date". At around the 1 in 10 mark of proportion of experts -- the horizontal line from 0.1 -- the graph shows that 1 in 10 experts thought there was a 50% chance of HLAI by 2020 or so (the square-boxes-line), and 1 in 10 thought there was a 90% chance of HLAI by 2030 or so (the triangles-line). So, interpolating between those two points, maybe 1 in 10 researchers think there's a 70% chance of HLAI by 2025 or so, which is roughly in line with the journalist's remark.
Did I do that right? Do you think the graph is maybe incorrect? I haven't checked the number against other parts of the paper.
There's a good chance that the reviewer got the right number by accident, I think, but it doesn't seem far enough away to call out.
I would be curious to see more thoughts on this from people who have thought more than I have about stable/reliable self-improvement/tiling. Broadly speaking, I am also somewhat skeptical that it's the best problem to be working on now. However, here are some considerations in favor:
It seems plausible to me that an AI will be doing most of the design work before it is a "human-level reasoner" in your sense. The scenario I have in mind is a self-improvement cycle by a machine specialized in CS and math, which is either better than humans at these things, or is changing too rapidly for humans to effectively help it. This would create what Bostrom has called (in private correspondence) a "competence gap", where the AI can and does self-improve, but may not solve the tiling problem or balance risk the way we would have liked it to. In this case, being able to solve this problem for it directly is helpful.
A 30% efficiency improvement seems quite large for machine learning, even for major software changes. I'm not sure how much this affects your overall point.
On the value of work now vs. later, I would probably try to determine this mostly by thinking about how much this work will help us grow interest in the area among people who will wield useful skills and influence later. So far, work on the Löbian obstacle has been pretty good on this metric (if you count it as partially responsible for attracting Benja and Nate, attention from mathematicians, its importance to past workshops, Nik Weaver, etc.).
I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?
Hm, I don't know what the definition is either. In my head, it means "can get an arbitrary amount of money from", e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.
Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it's not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as "Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege".