Comment by rohinmshah on Alignment Newsletter #41 · 2019-01-17T20:48:30.962Z · score: 2 (1 votes) · LW · GW

Idk, it seems hard to do; I personally have had trouble doing it. The future is vast, complex, and hard to fit in your head, so when trying to make an argument that eliminates all possible bad behaviors while including all the good ones, it seems like you're going to forget some cases. Proofs let you avoid that because they hold you to a very high standard, but there's no equivalent with conceptual thinking.

(These aren't very clear/are confused, if that wasn't obvious already.)

Another way of putting it is that conceptual thinking doesn't seem to have great feedback loops, which experiments clearly have and theory kind of has (you can at least get the binary true/false feedback once you prove any particular theorem).

Comment by rohinmshah on Comments on CAIS · 2019-01-17T19:25:25.851Z · score: 2 (1 votes) · LW · GW
Suppose an AI service realises that it is able to seize many more resources with which to fulfil its bounded utility function. Would it do so? If no, then it's not rational with respect to that utility function. If yes, then it seems rather unsafe, and I'm not sure how it fits Eric's criterion of using "bounded resources".

Yes, it would. The hope is that there do not exist ways to seize and productively use tons of resources within the bound. (To be clear, I'm imagining a bound on time, i.e. finite horizon, as opposed to a bound on the maximum value of the utility function.)
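To make the distinction concrete, here is the difference in my own made-up notation (just a sketch, not anything from Eric's report):

```latex
% A bound on time (finite horizon $H$): only the first $H$ steps contribute.
\[ U_{\text{horizon}} = \sum_{t=0}^{H} r(s_t, a_t) \]
% A bound on the maximum value: the whole future contributes, but utility is capped.
\[ U_{\text{capped}} = \min\!\Big( \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t), \; U_{\max} \Big) \]
```

Under the first kind of bound, seizing lots of resources only helps if they can be put to productive use within the horizon H, which is the hope expressed above; under the second, long resource-gathering plans can still pay off as long as the cap hasn't been reached.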

I agree with Eric's claim that R&D automation will speed up AI progress. The point of disagreement is more like: when we have AI technology that's able to do basically all human cognitive tasks (which for want of a better term I'll call AGI, as an umbrella term to include both CAIS and agent AGI), what will it look like? It's true that no past technologies have looked like unified agent AGIs - but no past technologies have also looked like distributed systems capable of accomplishing all human tasks either. So it seems like the evolution prior is still the most relevant one.

I don't really know what to say to this beyond "I disagree", it seems like a case of reference class tennis. I'm not sure how much we disagree -- I do agree that we should put weight on the evolution prior.

I think the whole paradigm of RL is an example of a bias towards thinking about agents with goals, and that as those agents become more powerful, it becomes easier to anthropomorphise them (OpenAI Five being one example where it's hard not to think of it as a group of agents with goals).

But there were so many other paradigms that did not look like that.

I would withdraw my objection if, for example, most AI researchers took the prospect of AGI from supervised learning as seriously as AGI from RL.

There are lots of good reasons not to expect AGI from supervised learning, most notably that with supervised learning you are limited to human performance.

I claim that this sense of "in the loop" is irrelevant, because it's equivalent to the AI doing its own thing while the human holds a finger over the stop button. I.e. the AI will be equivalent to current CEOs, the humans will be equivalent to current boards of directors.

I've lost sight of what original claim we were disagreeing about here. But I'll note that I do think that we have significant control over current CEOs, relative to what we imagine with "superintelligent AGI optimizing a long-term goal".

I think of CEOs as basically the most maximiser-like humans.

I agree with this (and the rest of that paragraph) but I'm not sure what point you're trying to make there. If you're saying that a CAIS-CEO would be risky, I agree. This seems markedly different from worries that a CAIS-anything would behave like a long-term goal-directed literally-actually-maximizer.

I then mentioned that to build systems which implement arbitrary tasks, you may need to be operating over arbitrarily long time horizons. But probably this also comes down to how decomposable such things are.

Agreed that decomposability is the crux.

People are arguing for a focus on CAIS without (to my mind) compelling arguments for why we won't have AGI agents eventually, so I don't think this is a strawman.

Eventually is the key word here. Conditional on AGI agents existing before CAIS, I certainly agree that we should focus on AGI agent safety, which is the claim I thought you were making. Conditional on CAIS existing before AGI agents, I think it's a reasonable position to say "let's focus on CAIS, and then coordinate to either prevent AGI agents from existing or to control them from the outside if they will exist". In particular, approaches like boxing or supervision by a strong overseer become much more likely to work in a world where CAIS already exists.

Also, there is one person working on CAIS and tens to hundreds working on AGI agents (depending on how you count), so arguing for more of a focus on CAIS doesn't mean that you think that CAIS is the most important scenario.

This depends on having pretty powerful CAIS and very good global coordination, both of which I think of as unlikely (especially given that in a world where CAIS occurs and isn't very dangerous, people will probably think that AI safety advocates were wrong about there being existential risk). I'm curious how likely you think this is though?

I don't find it extremely unlikely that we'll get something along these lines. I don't know, maybe something like 5%? (Completely made up number, it's especially meaningless because I don't have a concrete enough sense of what counts as CAIS and what counts as good global coordination to make a prediction about it.) But I also think that the actions we need to take look very different in different worlds, so most of this is uncertainty over which world we're in, as opposed to confidence that we're screwed except in this 5% probability world.

If agent AGIs are 10x as dangerous, and the probability that we eventually build them is more than 10%, then agent AGIs are the bigger threat.

While this is literally true, I have a bunch of problems with the intended implications:

  • Saying "10x as dangerous" is misleading. If CAIS leads to >10% x-risk, it is impossible for agent AGI to be 10x as dangerous (ignoring differences in outcomes like s-risks). So by saying "10x as dangerous" you're making an implicit claim of safety for CAIS. If you phrase it in terms of probabilities, "10x as dangerous" seems much less plausible (the arithmetic is spelled out just after this list).
  • The research you do and actions you take in the world where agent AGI comes first are different from those in the world where CAIS comes first. I expect most research to significantly affect one of those two worlds but not both. So the relevant question is the probability of a particular one of those worlds.
  • I expect our understanding of low-probability / edge-case worlds to be very bad, in which case most research aimed at improving these worlds is much more likely to be misguided and useless. This cuts against arguments of the form "We should focus on X even though it is unlikely or hard to understand because if it happens then it would be really bad/dangerous." Yes, you can apply this to AI safety in general, and yes, I do think that a majority of AI safety research will turn out to be useless, primarily because of this argument.
  • This is an argument only about importance. As I mentioned above, CAIS is much more neglected, and plausibly is more tractable.
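Spelling out the arithmetic behind the first bullet (treating "dangerous" as the probability of existential catastrophe, which is my framing):

```latex
% Measuring "dangerous" as the probability of existential catastrophe:
\[ P(\text{catastrophe} \mid \text{CAIS}) > 0.1
   \;\Longrightarrow\;
   10 \cdot P(\text{catastrophe} \mid \text{CAIS}) > 1
   \;\ge\; P(\text{catastrophe} \mid \text{agent AGI}), \]
% so agent AGI cannot literally be 10x as dangerous in probability terms.
```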

Because they have long-term convergent instrumental goals, and CAIS doesn't. CAIS only "cares" about self-improvement to the extent that humans are instructing it to do so, but humans are cautious and slow.

Agreed, though I don't think this is a huge effect. We aren't cautious and slow about current AI development, because we're confident it isn't dangerous; the same can happen in CAIS with its basic AI building blocks. But good point, I agree this pushes me toward thinking that AGI agents will self-improve faster.

Also because even if building AGI out of task-specific strongly-constrained modules is faster at first, it seems unlikely that it's anywhere near the optimal architecture for self-improvement.

Idk, that seems plausible to me. I don't see strong arguments in either direction.

It's something like "the first half of CAIS comes true, but the services never get good enough to actually be comprehensive/general. Meanwhile fundamental research on agent AGI occurs roughly in parallel, and eventually overtakes CAIS." As a vague picture, imagine a world in which we've applied powerful supervised learning to all industries, and applied RL to all tasks which are either as constrained and well-defined as games, or as cognitively easy as most physical labour, but still don't have AI which can independently do the most complex cognitive tasks (Turing tests, fundamental research, etc).

I agree that seems like a good model. It doesn't seem clearly superior to CAIS though.

Comment by rohinmshah on Human-AI Interaction · 2019-01-17T18:41:16.193Z · score: 4 (2 votes) · LW · GW
However, in this post you suggest that ambitious vs narrow value learning is about the amount of feedback the algorithm requires.

That wasn't exactly my point. My main point was that if we want an AI system that acts autonomously over a long period of time (think centuries), but it isn't doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system "on track" (since my instrumental values will change over that period of time). Will add a summary sentence to the post.

I think it depends on the details of the implementation

Agreed, I was imagining the "default" implementation (eg. as in this paper).

For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.

Something along these lines seems promising, I hadn't thought of this possibility before.

If the reward function weights rewards according to the certainty of the narrow value learning system that they are the correct reward, that creates incentives to keep the narrow value learning system operating, so the narrow value learning system can acquire greater certainty and provide a greater reward.

Yeah, uncertainty can definitely help get around this problem. (See also the next post, which should hopefully go up soon.)
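As a toy illustration of the incentive described in the quoted comment (the particular weighting scheme here is made up, not a concrete proposal from either of us):

```python
def certainty_weighted_reward(action, reward_estimate, reward_variance):
    """Toy sketch: scale the estimated reward by how certain the narrow value
    learner is about it. An agent maximizing this quantity does better (in
    expectation) if the value learning system keeps running and becomes more
    certain, which pushes against disabling or tampering with that system."""
    certainty = 1.0 / (1.0 + reward_variance[action])   # in (0, 1]
    return certainty * reward_estimate[action]

# Tampering might look good on the raw estimate but be highly uncertain, so the
# certainty-weighted reward can still prefer the ordinary action:
estimates = {"ordinary_action": 1.0, "tamper_with_value_learner": 1.5}
variances = {"ordinary_action": 0.1, "tamper_with_value_learner": 5.0}
for a in estimates:
    print(a, certainty_weighted_reward(a, estimates, variances))
```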

Comment by rohinmshah on Human-AI Interaction · 2019-01-17T18:31:52.129Z · score: 3 (2 votes) · LW · GW

I agree that all of these seem like good aspects of human-AI interaction to have, especially for narrow AI systems. For superhuman AI systems, there's a question of how much of this should the AI infer for itself vs. make sure to ask the human.

Comment by rohinmshah on Comments on CAIS · 2019-01-17T18:28:06.367Z · score: 2 (1 votes) · LW · GW
You said "In fact, I don’t want to assume that the agent even has a preference ordering" but I'm not sure why.

You could model a calculator as having a preference ordering, but that seems like a pretty useless model. Similarly, if you look at current policies that we get from RL, it seems like a relatively bad model to say that they have a preference ordering, especially a long-term one. It seems more accurate to say that they are executing a particular learned behavior that can't be easily updated in the face of changing circumstances.

On the other hand, the (training process + resulting policy) together is more reasonably modeled as having a preference ordering.

While it's true that so far the only model we have for getting generally intelligent behavior is to have a preference ordering (perhaps expressed as a reward function) that is then optimized, it doesn't seem clear to me that any AI system we build must have this property. For example, GOFAI approaches do not seem like they are well-modeled as having a preference ordering, similarly with theorem proving.

(GOFAI and theorem proving are also examples of technologies that could plausibly have led to what-I-call-AGI-which-is-not-what-Eric-calls-an-AGI-agent, but whose internal cognition does not resemble that of an expected utility maximizer.)

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-17T18:19:25.133Z · score: 2 (1 votes) · LW · GW
Why can't someone just take a plan maker, connect it to a plan executer, and connect that to the Internet to access other services as needed?

I think Eric would not call that an AGI agent.

Setting aside what Eric thinks and talking about what I think: There is one conception of "AGI risk" where the problem is that you have an integrated system that has optimization pressure applied to the system as a whole (similar to end-to-end training) such that the entire system is "pointed at" a particular goal and uses all of its intelligence towards that. The goal is a long-term goal over universe-histories. The agent can be modeled as literally actually maximizing the goal. These are all properties of the AGI itself.

With the system you described, there is no end-to-end training, and it doesn't seem right to say that the overall system is aimed at a long-term goal, since it depends on what you ask the plan maker to do. I agree this does not clearly solve any major problem, but it does seem markedly different to me.

I think that Eric's conception of "AGI agent" is like the first thing I described. I agree that this is not what everyone means by "AGI", and it is particularly not the thing you mean by "AGI".

You might argue that there seems to be no effective safety difference between an Eric-AGI-agent and the plan maker + plan executor. The main differences seem to be about what safety mechanisms you can add -- such as looking at the generated plan, or using human models of approval to check that you have the right goal. (Whereas an Eric-AGI-agent is so opaque that you can't look at things like "generated plans", and you can't check that you have the right goal because the Eric-AGI-agent will not let you change its goal.)

With an Eric-AGI-agent, if you try to create a human model of approval, that would need to be an Eric-AGI-agent itself in order to effectively supervise the first Eric-AGI-agent, but in that case the model of approval will be literally actually maximizing some goal like "be as accurate as possible", which will lead to perverse behavior like manipulating humans so that what they approve is easier to predict. In CAIS, this doesn't happen, because the approval model is not searching over possibilities that involve manipulating humans.

Comment by rohinmshah on Ambitious vs. narrow value learning · 2019-01-17T17:57:45.836Z · score: 2 (1 votes) · LW · GW
is there a mathematical theory of instrumental value learning, that we can expect practical algorithms to better approximate over time, which would let us predict what future algorithms might look like or be able to do?

Not to my knowledge, though partly I'm hoping that this sequence will encourage more work on that front. Eg. I'd be interested in analyzing a variant of CIRL where the human's reward exogenously changes over time. This is clearly an incorrect model of what actually happens, and in particular breaks down once the AI system can predict how the human's reward will change over time, but I expect there to be interesting insights to be gained from a conceptual analysis.
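A minimal way to write down that variant (my own formalization sketch, roughly following the original CIRL setup, not something from the literature):

```latex
% Standard CIRL: a two-player game in which the reward parameter $\theta$ is known
% to the human H but not the robot R, and both act to maximize the shared return
\[ \sum_t \gamma^t \, r(s_t, a^H_t, a^R_t; \theta). \]
% Exogenous-change variant: let the parameter drift according to a fixed process
\[ \theta_{t+1} \sim D(\cdot \mid \theta_t), \]
% and evaluate each step against the current parameter:
\[ \sum_t \gamma^t \, r(s_t, a^H_t, a^R_t; \theta_t). \]
% Here $D$ does not depend on either player's actions: the admittedly incorrect
% simplification noted above, which breaks down once the AI system can predict
% (or influence) how $\theta$ will change.
```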

"You" meaning the user?

Yes.

Does the user need to know when they need to provide the AI with more training data? Or can we expect the AI to know when it should ask the user for more training data?

Hopefully not. I meant only that the user would need to provide more data; it seems quite possible to have the AI system figure out when that is necessary.

If the latter, what can we expect the AI to do in the meantime (e.g., if the user is asleep and it can't ask)?

I don't imagine this as "suddenly the reward changed dramatically and following the old reward is catastrophic", more like "the human's priorities have shifted slightly, you need to account for this at some point or you'll get compounding errors, but it's not crucial that you do it immediately". To answer your question more directly, in the meantime the AI can continue doing what it was doing in the past (and in cases where it is unsure, it preserves option value, though one would hope this doesn't need to be explicitly coded in and arises from "try to help the human").

Comment by rohinmshah on And My Axiom! Insights from 'Computability and Logic' · 2019-01-17T08:53:54.050Z · score: 3 (2 votes) · LW · GW

Another way of putting it: you can't possibly know that there isn't some device out in the universe that lets you do more powerful things than your model (eg. a device that can tell you whether an arbitrary Turing machine halts), so it can never be proven that your model captures real-world computability.

Alignment Newsletter #41

2019-01-17T08:10:01.958Z · score: 14 (2 votes)
Comment by rohinmshah on Non-Consequentialist Cooperation? · 2019-01-17T07:14:16.048Z · score: 5 (2 votes) · LW · GW

This seems like an interesting idea for how to build an AI system in practice, along the same lines as corrigibility. We notice that value learning is not very robust: if you aren't very good at value learning, then you can get very bad behavior, and human values are sufficiently complex that you do need to be very capable in order to be sufficiently good at value learning. With (a particular kind of) corrigibility, we instead set the goal to be to make an AI system that is trying to help us, which seems more achievable even when the AI system is not very capable. Similarly, if we formalize or learn informed consent reasonably well (which seems easier to do since it is not as complex as "human values"), then our AI systems will likely have good behavior (though they will probably not have the best possible behavior, since they are limited by having to respect informed consent).

However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI's "motivational system". This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are "somewhat" corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn't seem to have an analogous benefit.

Comment by rohinmshah on Directions and desiderata for AI alignment · 2019-01-16T00:46:30.563Z · score: 6 (3 votes) · LW · GW

The three directions of reliability/robustness, reward learning, and amplification seem great, though robustness seems particularly hard to achieve. While there is current work on adversarial training, interpretability and verification, even if all of the problems that researchers currently work on were magically solved, I don't have a story for how that leads to robustness of (say) an agent trained by iterated amplification.

I am more conflicted about the desiderata. They seem very difficult to satisfy, and they don't seem strictly necessary to achieve good outcomes. The underlying view here is that we should aim for something that we know is sufficient to achieve good outcomes, and only weaken our requirements if we find a fundamental obstacle. My main issue with this view is that even if it is true that the requirements are impossible to satisfy, it seems very hard to know this, and so we may spend a lot of time trying to satisfy these requirements and most of that work ends up being useless. I can imagine that we try to figure out ways to achieve robustness for several years in order to get a secure AI system, and it turns out that this is impossible to do in a way where we know it is robust, but in practice any AI system that we train will be sufficiently robust that it never fails catastrophically. In this world, we keep trying to achieve robustness, never find a fundamental obstruction, but also never succeed at creating a secure AI system.

Another way of phrasing this is that I am pessimistic about the prospects of conceptual thinking, which seems to be the main way by which we could find a fundamental obstruction. (Theory and empirical experiments can build intuitions about what is and isn't hard, but given the complexities of the real world it seems unlikely that either would give us the sort of crystallized knowledge that you're aiming for.) Phrased this way, I put less credence in this opinion, because I think there are a few examples of conceptual thinking being very important, though not that many.

Human-AI Interaction

2019-01-15T01:57:15.558Z · score: 17 (5 votes)
Comment by rohinmshah on Comments on CAIS · 2019-01-13T18:44:44.569Z · score: 9 (2 votes) · LW · GW
This seems like a trait which AGIs might have, but not a part of how they should be defined.

There's a thing that Eric is arguing against in his report, which he calls an "AGI agent". I think it is reasonable to say that this thing can be fuzzily defined as something that approximates an expected utility maximizer.

(By your definition of AGI, which seems to be something like "thing that can do all tasks that humans can do", CAIS would be AGI, and Eric is typically contrasting CAIS and AGI.)

That said, I disagree with Wei that this is relatively crisp: taken literally, the definition is vacuous because all behavior maximizes some expected utility. Maybe we mean that it is long-term goal-directed, but at least I don't know how to cash that out. I think I agree that it is more crisp than the notion of a "service", but it doesn't feel that much more crisp.

Comment by rohinmshah on Comments on CAIS · 2019-01-13T17:51:21.621Z · score: 11 (3 votes) · LW · GW
And since AI services aren’t “rational agents” in the first place

AI services can totally be (approximately) VNM rational -- for a bounded utility function. The point is the boundedness, not the lack of VNM rationality. It is true that AI services would not be rational agents optimizing a simple utility function over the history of the universe (which is what I read when I see the phrase "AGI agent" from Eric).

As a basic prior, our only example of general intelligence so far is ourselves - a species composed of agentlike individuals who pursue open-ended goals.

Note that CAIS is suggesting that we should use a different prior: the prior based on "how have previous advances in technology come about". I find this to be stronger evidence than how evolution got to general intelligence.

Humans think in terms of individuals with goals, and so even if there's an equally good approach to AGI which doesn't conceive of it as a single goal-directed agent, researchers will be biased against it. 

I'm curious how strong an objection you think this is. I find it weak; in practice most of the researchers I know think much more concretely about the systems they implement than "agent with a goal", and these are researchers who work on deep RL. And in the history of AI, there were many things to be done besides "agent with a goal"; expert systems/GOFAI seems like the canonical counterexample.

There'll be significant pressure to reduce the extent to which humans are in the loop of AI services, for efficiency reasons.

Agreed for tactical decisions that require quick responses (eg. military uses, surgeries); this seems less true for strategic decisions. Humans are risk-averse and the safety community is cautioning against giving control to AI systems. I'd weakly expect that humans continue to be in the loop for nearly all important decisions (eg. remaining as CEOs of companies, but with advisor AI systems that do most of the work), until eg. curing cancer, solving climate change, ending global poverty, etc. (I'm not saying they'll stop being in the loop after that, I'm saying they'll remain in the loop at least until then.) To be clear, I'm imagining something like how I use Google Maps: basically always follow its instructions, but check that it isn't eg. routing me onto a road that's closed.

A clear counterargument is that some companies will have AI CEOs, and they will outcompete the others, and so we'll quickly transition to the world where all companies have AI CEOs. I think this is not that important -- having a human in the loop need not slow down everything by a huge margin, since most of the cognitive work is done by the AI advisor, and the human just needs to check that it makes sense (perhaps assisted by other AI services).

To the extent that you are using this to argue that "the AI advisor will be much more like an agent optimising for an open-ended goal than Eric claims", I agree that the AI advisor will look like it is "being a very good CEO". I'm not sure I agree that it will look like an agent optimizing for an open-ended goal, though I'm confused about this.

Even if we have lots of individually bounded-yet-efficacious modules, the task of combining them to perform well in new tasks seems like a difficult one which will require a broad understanding of the world.

Broad understanding isn't incompatible with services; Eric gives the example of language translation.

An overseer service which is trained to combine those modules to perform arbitrary tasks may be dangerous because if it is goal-oriented, it can use those modules to fulfil its goals

The main point of CAIS is that services aren't long-term goal-oriented; I agree that if services end up being long-term goal-oriented they become dangerous. In that case, there are still approaches that help us monitor when something bad happens (eg. looking at which services are being called upon for which task, limiting the information flow into any particular service), but the adversarial optimization danger is certainly present. (I think but am not sure that Eric would broadly agree with this take.)

My guess is that Eric would argue that this overseer would itself be composed of bounded services, in which case the real disagreement is how competitive that decomposition would be

Yup, that's the argument I would make.

Conditional on both sorts of superintelligences existing, I think (and I would guess that Eric agrees) that CAIS superintelligences are significantly less likely to cause existential catastrophe. And in general, it’s easier to reduce the absolute likelihood of an event the more likely it is (even a 10% reduction of a 50% risk is more impactful than a 90% reduction of a 5% risk). So unless we think that technical research to reduce the probability of CAIS catastrophes is significantly more tractable than other technical AI safety research, it shouldn’t be our main focus.

If you go via the CAIS route you definitely want to prevent unbounded AGI maximizers from being created until you are sure of their safety or that you can control them. (I know you addressed that in the previous point, but I'm pretty sure that no one is arguing to focus on CAIS conditional on AGI agents existing and being more powerful than CAIS, so it feels like you're attacking a strawman.)

Eventually we’ll have the technology to build unified agents doing unbounded maximisation. Once built, such agents will eventually overtake CAIS superintelligences because they’ll have more efficient internal structure and will be optimising harder for self-improvement.

Given a sufficiently long delay, we could use CAIS to build global systems that can control any new AGIs, in the same way that government currently controls most people.

I also am not sure why you think that AGI agents will optimize harder for self-improvement.

So while CAIS may be a good model of early steps towards AGI, I think it is a worse model of the period I’m most worried about.

Compared to what? If the alternative is "a vastly superintelligent AGI agent that is acting within what is effectively the society of 2019", then I think CAIS is a better model. I'm guessing that you have something else in mind though.

Comment by rohinmshah on Ambitious vs. narrow value learning · 2019-01-12T14:43:51.716Z · score: 4 (2 votes) · LW · GW
How would this kind of narrow value learning work in a mathematical or algorithmic sense?

I'm not sure I understand the question. Inverse reinforcement learning, preference learning (eg. deep RL from human preferences) and inverse reward design are some existing examples of narrow value learning.
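For concreteness, here is the core of one of those methods, deep RL from human preferences, as a simplified sketch (roughly the Bradley-Terry comparison model from that line of work; this is not anyone's actual training code, and the small network is arbitrary):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Learns a reward function from human comparisons of trajectory segments."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def segment_return(self, segment):            # segment: tensor of shape (T, obs_dim)
        return self.net(segment).sum()            # predicted total reward of the segment

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    """Bradley-Terry style loss: treat the human's choice as a noisy comparison
    of the two segments' (unknown) total rewards."""
    ra, rb = model.segment_return(seg_a), model.segment_return(seg_b)
    p_a = torch.sigmoid(ra - rb)                  # modeled P(human prefers segment A)
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    return nn.functional.binary_cross_entropy(p_a, target)
```

The learned reward model is then used as the reward signal for an ordinary RL algorithm, with new human comparisons gathered as training proceeds.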

since instrumental goals and values can be invalidated by environmental changes (e.g., I'd stop valuing US dollars if I couldn't buy things with them anymore), how does the value learner know when that has happened?

By default, it doesn't. You have to put active work to make sure the value learner continues to do what you want. Afaik there isn't any literature on this.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-12T00:27:57.877Z · score: 4 (2 votes) · LW · GW

I quickly skimmed the table of contents to generate this list, so it might have both false positives and false negatives.

Section 1: We typically make progress using R&D processes; this can get us to superintelligence. Implicitly also makes the claim that this is qualitatively different from AGI, though doesn't really argue for that.

Section 8: Optimization pressure points away from generality, not towards it, which suggests that strong optimization pressure doesn't give you AGI.

Section 12.6: AGI and CAIS solve problems in different ways. (Combined with the claim, argued elsewhere: CAIS will happen first.)

Section 13: AGI agents are more complex. (Implicit claim: and so harder to build.)

Section 17: Most complex tasks involve several different subtasks that don't interact much; so you get efficiency and generality gains by splitting the subtasks up into separate services.

Section 38: Division of labor + specialization are useful for good performance.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-11T17:00:12.975Z · score: 4 (2 votes) · LW · GW

I agree that it's an important crux, and that the arguments are not sufficiently strong that everyone should believe Eric's position. I do think that he has provided arguments that support his position, though they are in a different language/ontology than is usually used here.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-10T22:50:46.047Z · score: 2 (1 votes) · LW · GW

Yeah, that seems right, I don't think anyone is arguing against that claim.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-10T20:14:20.183Z · score: 4 (2 votes) · LW · GW
AIXI would be very good at making complex plans and doing well first time.

Agreed, I claim we have no clue how to make anything remotely like AIXI in the real world.

Humans have, at least somewhat, the ability to notice that there should be a good plan in this region, find and execute that plan successfully.

Agreed, in a CAIS world, the system of interacting services would probably notice the plan but not execute it because of some service that is meant to prevent it from doing crazy things that humans would not want.

What I am saying is that this form of AI is sufficiently limited that there are still large incentives to make AGI and the CAIS can't protect us from making an unfriendly AGI.

This definitely seems like the crux for many people. I'm quite unsure about this point; it seems plausible to me that CAIS could in fact do most things such that there aren't very large incentives, especially if the Factored Cognition hypothesis is true.

I'm also not sure how strong the self improvement can be when the service maker service is only making little tweaks to existing algorithms rather than designing strange new algorithms.

I don't see why it would have to be little tweaks to existing algorithms, it seems plausible to have the R&D services consider entirely new algorithms as well.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-10T17:35:59.166Z · score: 10 (3 votes) · LW · GW
Can you explain how you'd implement these services?

Not really. I think of CAIS as suggesting that we take an outside view that says "looking at how AI has been progressing, and how humans generally do things, we'll probably be able to do more and more complex tasks as time goes on". But the emphasis that CAIS places is that the things we'll be able to do will be domain-specific tasks, rather than getting a general-purpose reasoner. I don't have a detailed enough inside view to say how complex tasks might be implemented in practice.

I agree with the rest of what you said, which feels to me like considering a few possible inside-view scenarios and showing that they don't work.

One way to think about this is through the lens of iterated amplification. With iterated amplification, we also get the property that our AI systems will be able to do more and more complex tasks as time goes on. The key piece that enables this is the ability to decompose problems, so that iterated amplification always bottoms out into a tree of questions and subquestions down to leaves which the base agent can answer. You could think of (my conception of) CAIS as a claim that a similar process will happen in a decentralized way for all of ML by default, and at any point the things we can do will look like an explicit iterated amplification deliberation tree of depth one or two, where the leaves are individual services and the top level question will be some task that is accomplished through a combination of individual services.
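As a toy sketch of what such a shallow deliberation tree looks like (everything here, including the depth-limited recursion and the example services mentioned at the end, is made up for illustration):

```python
def answer(question, services, decompose, depth=2):
    """Toy deliberation tree in the iterated-amplification style described above.

    `services` maps leaf questions to answers (individual bounded services);
    `decompose` splits a question into subquestions plus a function that
    combines their answers. In the CAIS reading, `depth` stays at 1 or 2.
    """
    if question in services:
        return services[question]          # leaf: answered by a single service
    if depth == 0:
        raise ValueError(f"no service can answer {question!r} at this depth")
    subquestions, combine = decompose(question)
    return combine([answer(q, services, decompose, depth - 1) for q in subquestions])
```

E.g. a depth-one tree for "plan this delivery route" might just combine a traffic-prediction service with a route-optimization service.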

I could try to think in that direction after I get a better sense of what kinds of services might be both feasible and trustworthy in the CAIS world. It seems easy to become too optimistic/complacent under the CAIS model if I just try to imagine what safety-enhancing services might be helpful without worrying about whether those services would be feasible or how well they'd work at the time when they're needed.

Agreed, I'm making a bid for generating ideas without worrying about feasibility and trustworthiness, but not spending too much time on this and not taking the results too seriously.

Comment by rohinmshah on Alignment Newsletter #40 · 2019-01-10T17:25:07.045Z · score: 2 (1 votes) · LW · GW

As far as I know, only MIRI has really engaged with this problem, and they have only talked about it as a problem, not suggested any solutions.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-10T17:22:37.130Z · score: 7 (4 votes) · LW · GW
Why should we expect that this conceptual breakthrough would come later than other conceptual breakthroughs needed to achieve CAIS?

I think I share Eric's intuition that this problem is hard in a more fundamental way than other things, but I don't really know why I have this intuition. Some potential generators:

  • ML systems seem to be really good at learning tasks, but really bad at learning explicit reasoning. I think of CAIS as being on the side of "we never figure out explicit reasoning at the level that humans do it", and making up for this deficit by having good simulators that allow us to learn from experience, or by collecting much more data across multiple instances of AI systems, or by trying out many different AI designs and choosing the one which performs best.
  • It seems like humans tend to build systems by making individual parts that we can understand and predict well, and putting those together in a way where we can make some guarantees/predictions about what will happen. CAIS plays to this strength, whereas "figure out how to do very-long-term-planning" doesn't.

I don't see why it wouldn't, unless these services are specifically designed to be corrigible (in which case the "corrigible" part seems much more important than the "service" part).

Yeah, you're right, I definitely said the wrong thing there. I guess the difference is that the convergent instrumental subgoals are now "one level up" -- they aren't subgoals of the AI service itself, they're subgoals of the plan that was created by the AI service. It feels like this is qualitatively different and easier to address, but I can't really say why. More generators:

  • In this setting, convergent instrumental subgoals happen only if the plan-making service is told to maximize outcomes. However, since it's one level up, it should be easier to ask for something more like "do X, interpreted pragmatically and not literally".
  • Things that happen one level up in the CAIS world are easier to point at and more interpretable, so it should be easier to find and fix issues of this sort.

(You could of course say "just because it's easier that doesn't mean people will do it", but I could imagine that if it's easy enough this becomes best practice and people do it by default, and you don't actually gain very much by taking these parts out.)

I was assuming that long term strategic planners (as described in section 27) are available as an AIS, and would be one of the components of the hypothetical AGI.

Yeah, here also what I should have said is that the long term optimization is happening one level up, whereas with the typical AGI agent scenario it feels like the long term optimization needs to happen at the base level, and that's the thing we don't know how to do.

Comment by rohinmshah on What is narrow value learning? · 2019-01-10T17:00:05.030Z · score: 2 (1 votes) · LW · GW

Hmm, I agree that Paul's definition is different from mine, but it feels to me like they are both pointing at the same thing.

I think this means that under your definition, behavioral cloning and approval-directed agents are subsets of narrow value learning

That's right.

whereas under Paul's definition they are disjoint from narrow value learning.

I'm not sure. I would have included them, because sufficiently good behavioral cloning/approval-directed agents would need to learn instrumental goals and values in order to work effectively in a domain.

was this overloading of the term intentional?

It was intentional, in that I thought that these were different ways of pointing at the same thing.

What is narrow value learning?

2019-01-10T07:05:29.652Z · score: 18 (5 votes)
Comment by rohinmshah on Imitation learning considered unsafe? · 2019-01-09T23:22:57.213Z · score: 3 (2 votes) · LW · GW

Yeah, I agree with all of those clarifications.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T23:21:11.540Z · score: 5 (3 votes) · LW · GW
Part of the reason that AI alignment is hard is that The Box is FULL of Holes! Breaking Out is EASY!

Note that under the CAIS worldview, in order to be competent in some domain you need to have some experience in that domain (i.e. competence requires learning). Or at least, that's the worldview under which I find CAIS most compelling. In that case, the AI would have had to try breaking out of the box a few times in order to get good at it, and why would it do that? Even if it ever hit upon this plan, when it tried it for the first time it would get a gradient pushing that behavior away, since it wouldn't help with achieving the goal. Only after significant learning would it be able to execute these weird plans in a way that actually succeeds and helps achieve the goal, and that significant learning will not happen.

The only thing that distinguishes one from the other is what humans prefer.

CAIS would definitely use human preference information, see eg. section 22.

This might be a good approach, but I don't feel it answers the question "I have a humanoid robot a hypercomputer and a couple of toddlers, how can I build something to look after the kids for a few weeks (without destroying the world) ?"

It's not really an approach to AI safety, it's mostly meant to be a different prediction about how we achieve superintelligence. (There are definitely some prescriptive aspects of CAIS, and some arguments that it is safer than AGI agents, but mostly it is meant to be descriptive, I believe.)

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T23:09:16.450Z · score: 15 (4 votes) · LW · GW
Do you get it?

I doubt I will ever be able to confidently answer yes to that question.

That does not seem to be his position though, because if AGI is not much more capable than CAIS, then there would be no need to talk specifically about how to defend the world against AGI, as he does at length in section 32.

My model is that he does think AGI won't be much more capable than CAIS (see sections 12 and 13 in particular, and 10, 11 and 16 also touch on the topic), but lots of people (including me) kept making the argument that end-to-end training tends to improve performance and so AGI would outperform CAIS, and so he decided to write a response to that.

In general, my impression from talking to him and reading earlier drafts is that the earlier chapters are representative of his core models, while the later chapters are more like responses to particular arguments, or specific implications of those models.

I can give one positive argument for AGI being harder to make than SI-level CAIS. All of our current techniques for building AI systems create things that are bounded in the time horizon they are optimizing over. It's actually quite unclear how we would use current techniques to get something that does very-long-term-planning. (This could be the "conceptual breakthroughs" point.) Seems a lot easier to get a bunch of bounded services and hook them up together in such a way that they can do the sorts of things that AGI agents could do.

The one scenario that is both concrete and somewhat plausible to me is that we run powerful deep RL on a very complex environment, and this finds an agent that does very-long-term-planning, because that's what it takes to do well on the environment. I don't know what Eric thinks about this scenario, but it doesn't seem to influence his thinking very much (and in fact in the OP I argued that CAIS isn't engaging enough with this scenario).

Why couldn't someone just take some appropriate AI services, connect them together in a straightforward way, and end up with an AGI?

If you take a bunch of bounded services and connect them together in some straightforward way, you wouldn't get something that is optimizing over the long term. Where would the long-term optimization come from?

For example, you could take any long term task and break it down into the "plan maker" which thinks for an hour and gives a plan for the task, and the "plan executor" which takes an in-progress plan and executes the next step. Both of these are bounded and so could be services, and their combination is generally intelligent, but the combination wouldn't have convergent instrumental subgoals.
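A toy sketch of that decomposition (the interfaces and placeholder bodies are invented for illustration, not a real proposal):

```python
import time

def plan_maker(task, time_budget_seconds=3600):
    """Bounded service: deliberates for at most the time budget, then returns a
    plan as a list of concrete steps. It never acts in the world itself."""
    deadline = time.time() + time_budget_seconds
    plan = []
    while time.time() < deadline and not plan:
        plan = [f"step 1 of {task!r}", f"step 2 of {task!r}"]  # stand-in for real planning
    return plan

def plan_executor(step):
    """Bounded service: executes exactly one step of an in-progress plan and then
    stops; it carries no long-term objective of its own."""
    print("executing:", step)                                  # stand-in for real actuation

def run(task):
    # Any long-term structure lives in this glue code (and in the humans who can
    # inspect the plan between steps), not inside either bounded service.
    for step in plan_maker(task, time_budget_seconds=1):
        plan_executor(step)

run("book venue and speakers for a conference")
```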

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T10:23:38.062Z · score: 4 (2 votes) · LW · GW

Sorry, when I said "there are lots of other tasks that are not as clear", I meant that there are a lot of other tasks relevant to policing and security that are not as clear, such as police to deal with threats that evade surveillance. I think the optimism here comes from our ability to decompose tasks, such that we can take a task that seems to require goal-directed agency (like "be the police") and turn it into a bunch of subtasks that no longer look agential.

Comment by rohinmshah on AI safety without goal-directed behavior · 2019-01-09T10:19:18.850Z · score: 2 (1 votes) · LW · GW
Well I'm not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.

Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don't think it's an advantage, then I don't think we disagree here.

Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don't see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).

That makes sense, I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it's not clear whether non-goal-directed AI could do something similar.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T10:08:56.731Z · score: 5 (3 votes) · LW · GW
I suspect you can take each of your comprehensive AI services and swap out the specific algorithm you were using for a one true learning algorithm without making the result any more of an agent.

Mostly agreed, but if we find the one true learning algorithm, then CAIS is no longer on the development path towards AGI agents, and I would predict that someone builds an AGI agent in that world because it could have lots of economic benefits that have not already been captured by CAIS services.

Indeed, this feels to me like a fundamental defining characteristic of superintelligence refers to... it refers to a specific bit of computer code that is able to learn better and faster, using fewer computational resources, than whatever algorithms the human brain uses.

I actually see CAIS as an argument against this. I think we could get superintelligent services by having lots of specialization (unlike humans, who are mostly general and a little bit specialized for their jobs), by aggregating learning across many actors (whereas humans can't learn from other humans' experience), by making models much larger and with much more compute (whereas humans are limited by brain size). Humans could still outperform AI services on things like power usage, sample efficiency, compute requirements, etc. while still having lots of AI services that can perform nearly any task at a superhuman level.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T09:59:08.672Z · score: 2 (1 votes) · LW · GW

I don't think he'd make a strong claim about that, but I wouldn't be surprised if he assigned that possibility significant credence. I assign that possibility relatively low credence. I assign much more credence to the position that we'll never need to solve the problem of designing a human-friendly superintelligent goal-directed agent.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T09:55:27.631Z · score: 6 (3 votes) · LW · GW
Why wouldn't someone create an AGI by cobbling together a bunch of AI services, or hire a bunch of AI services to help them design an AGI, as soon as they could?

Because any task that an AGI could do, CAIS could do as well. (Though I don't agree with this -- unified agents seem to work better.)

But if quickly building an AGI can potentially allow someone to take over the world before "unopposed preparation" can take place, isn't that a compelling motivation by itself for many people?

I suspect he would claim that quickly building an AGI would not allow you to take over the world, because the AGI would not be that much more capable than the CAIS service cluster.

It may be the case that people try to take over the world just with CAIS, and maybe that could succeed. I think he's arguing only against AGI accident risk here, not against malicious uses of AI. (I think you already knew that, but it wasn't fully clear on reading your comment.)

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T02:25:38.740Z · score: 4 (2 votes) · LW · GW

It's linked in the first sentence of the post. Though I guess I link to the pdf instead of the web page.

I tried to make this a link post, but I got an error message saying that it has already been linked before.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T02:23:48.116Z · score: 4 (2 votes) · LW · GW

It sounds like he's talking about services. From the post:

A service is an AI system that delivers bounded results for some task using bounded resources in bounded time.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T02:18:03.032Z · score: 2 (1 votes) · LW · GW

Monitoring surveillance in order to see if anyone is breaking rules seems to be quite a bounded task, and in fact is one that we are already in the process of automating (using our current AI systems, which are basically all bounded).

Of course, there are lots of other tasks that are not as clear. But to the extent that you believe the Factored Cognition hypothesis, you should believe that we can make bounded services that nevertheless do a very good job.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-09T02:11:14.240Z · score: 6 (3 votes) · LW · GW

If by agent we mean "system that takes actions in the real world", then services can be agents. As I understand it, Eric is only arguing against monolithic AGI agents that are optimizing a long-term utility function and that can learn/perform any task.

Current factory robots definitely look like a service, and even the soon-to-come robots-trained-with-deep-RL will be services. They execute particular learned behaviors.

If I remember correctly, Gwern's argument is basically that Agent AI will outcompete Tool AI because Agent AI can optimize things that Tool AI cannot, such as its own cognition. In the CAIS world, there are separate services that improve cognition, and so the CAIS services do get the benefit of ever-improving cognition, without being classical AGI agents. But overall I agree with this point (and disagree with Eric) because I expect there to be lots of gains to be had by removing the boundaries between services, at least where possible.

Comment by rohinmshah on AI safety without goal-directed behavior · 2019-01-09T02:00:00.565Z · score: 2 (1 votes) · LW · GW
For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents.

Strong agree, and I do think it's the biggest downside of trying to build non-goal-directed agents.

The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify.

For the case of idealized humans, couldn't real humans defer to idealized humans if they thought that was better?

Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn't seem very likely to me.

For an explicit set of values, those values come from humans, so wouldn't they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.

Alignment Newsletter #40

2019-01-08T20:10:03.445Z · score: 21 (4 votes)
Comment by rohinmshah on AI safety without goal-directed behavior · 2019-01-08T17:53:59.363Z · score: 5 (3 votes) · LW · GW

While I mostly agree with all three of your advantages, I am more optimistic about non-goal-directed approaches to AI safety. I think this is primarily because I'm generally optimistic about AI safety, and the well-documented problems with goal-directed agents makes me pessimistic about that particular approach.

If I had to guess at what drives my optimism that you don't have, it would be that we can aim for an adequate, not-formalized solution, and this will very likely be okay. All else equal, I would prefer a more formal solution, but I don't think we have the time for that. I would guess that while this lack of formality makes me only a little more worried, it is a big source of worry for you and MIRI researchers. This means that argument 1 isn't a big update for me.

Re: argument 2, it's worth noting that a system that has some chance of causing catastrophe is going to be less economically efficient. Now people might build it anyway because they underestimate the chance of catastrophe, or because of race dynamics, but I'm hopeful that (assuming it's true) we can convince all the relevant actors that goal-directed agents have a significant chance of causing catastrophe. In that case, non-goal-directed agents have a lower bar to meet. But overall this is a significant update.

Re: argument 3, I don't really see why goal-directed agents are more likely to avoid human safety problems. It seems intuitively plausible -- if you get the right goal, then you don't have to rely on humans, and so you avoid their safety problems. However, even with goal-directed agents, the goal has to come from somewhere, which means it comes from humans. (If not, we almost certainly get catastrophe.) So wouldn't the goal have all of the human safety problems anyway?

I'm also optimistic about our ability to solve human safety problems in non-goal-directed approaches -- see for example the reply I just wrote on your CAIS comment.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-08T17:38:17.991Z · score: 4 (2 votes) · LW · GW

I actually think the CAIS model gives me optimism for these sorts of problems. As long as we acknowledge that the problems exist and can be an issue, we could develop services that help us mitigate them. Safety in the CAIS world already depends on having services that are in charge of good engineering, testing, red teaming, monitoring, etc., as well as services that evaluate objectives and make sure humans would approve of them. It seems fairly easy to expand this to include services that consider how disruptive new technologies will be, how underdetermined human values are, whether a proposed plan reduces option value, what risk aversion implies about a particular plan of action, what blind spots people have, etc.

I'd be interested in a list of services that you think would be helpful for addressing human safety problems. You might think of this as "our best current guess at metaphilosophy and metaphilosophy research".

(I know you were mainly talking about the document's framing, I don't have much to say about that.)

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-08T17:23:52.598Z · score: 7 (4 votes) · LW · GW

That seems right. I would argue that CAIS is more likely than any particular one of the other scenarios that you listed, because it is primarily taking trends from the past and projecting them into the future, whereas most other scenarios require something qualitatively new -- eg. an AGI agent (before CAIS) would happen if we find the one true learning algorithm, ems require us to completely map out the brain in a way that we don't have any results for currently, even in simple cases like C. elegans. But CAIS is probably not more likely than a disjunction over all of those possible scenarios.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-08T16:38:41.581Z · score: 2 (1 votes) · LW · GW

Yeah, I agree that even without the training it would be goal-directed, that comes from the MCTS.

Note though that if we stop training and also stop using MCTS and you play a game against it, it will beat you and yet I would say that it is not goal-directed.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-01-08T16:33:36.651Z · score: 2 (1 votes) · LW · GW

Depends what you mean by "generally intelligent". Any individual service could certainly have deep and broad knowledge about the world (as with eg. a language translation service), but no service will be able to do all tasks (eg. the service-creating-service is not going to be able to edit genomes, except by creating a new service that learns how to edit genomes).

With that caveat, yes, this assumes that we'll be able to build services that optimize for bounded tasks. But this is meant more as a description of how existing AI systems already work. Current RL agents are best modeled as maximizing the reward obtained within the current episode. (This isn't exactly right, because the value function is trying to capture reward that can be obtained in the future, but in practice this doesn't make much of a difference.)
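To spell out the kind of bound I mean (this is just the standard episodic RL objective in my own notation, not something from the report): the agent maximizes reward within a single episode of bounded length, rather than a discounted sum over all future time.

```latex
\[
\underbrace{J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right]}_{\text{bounded (episodic) objective}}
\qquad \text{vs.} \qquad
\underbrace{J_{\infty}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]}_{\text{unbounded objective}}
\]
```

The learned value function estimates the remaining reward, which is where the caveat above about "capturing the future" comes in.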

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

2019-01-08T07:12:29.534Z · score: 83 (28 votes)
Comment by rohinmshah on Imitation learning considered unsafe? · 2019-01-07T20:39:19.096Z · score: 4 (2 votes) · LW · GW

This sounds to me like an argument that inner optimizers are particularly likely to arise in imitation learning, because humans are pretty close to optimizers. Does that seem right?

Comment by rohinmshah on AI safety without goal-directed behavior · 2019-01-07T20:26:48.352Z · score: 3 (2 votes) · LW · GW
I usually think that logic-based reasoning systems are the canonical example of an AI without goal-directed behaviour.

Yeah, that seems right to me. Though it's not clear how you'd use a logic-based reasoning system to act in the world -- if you do that by asking the question "what action would lead to the maximum value of this function", which it then computes using logic-based reasoning, then the resulting behavior would be goal-directed.
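As a toy sketch of that point (hypothetical function names, with the reasoner stubbed out): the reasoner itself just answers entailment queries, but wrapping the very same reasoner in an argmax over actions yields a goal-directed system.

```python
def entails(knowledge_base, formula):
    # Hypothetical stand-in for a logic-based reasoner: does `formula`
    # follow from `knowledge_base`? (Stubbed as set membership.)
    return formula in knowledge_base

def answer_question(knowledge_base, formula):
    # Used this way, the system just reports what follows from its
    # knowledge -- I wouldn't call this goal-directed.
    return entails(knowledge_base, formula)

def choose_action(knowledge_base, actions, outcomes, utility):
    # Used to answer "what action would lead to the maximum value of this
    # function?", the combined system is goal-directed, even though the
    # underlying reasoner is exactly the same.
    def value_of(action):
        return max(
            (utility[o] for o in outcomes
             if entails(knowledge_base, ("leads_to", action, o))),
            default=float("-inf"),
        )
    return max(actions, key=value_of)

# For example, with a knowledge base saying pressing the button opens the door:
kb = {("leads_to", "press_button", "door_open")}
print(choose_action(kb, ["press_button", "wait"], ["door_open"], {"door_open": 1.0}))
# -> "press_button"
```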

I'm fairly sure you can specify the behaviour of _anything_

Yup. I actually made this argument two posts ago.

Comment by rohinmshah on Intuitions about goal-directed behavior · 2019-01-07T14:10:39.042Z · score: 2 (1 votes) · LW · GW

Agreed, changed that sentence so that it no longer claims it is a sufficient condition. Thanks for catching that!

AI safety without goal-directed behavior

2019-01-07T07:48:18.705Z · score: 40 (12 votes)
Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-06T11:55:07.327Z · score: 2 (1 votes) · LW · GW

As I understand it, the first one is an argument for value lock-in, and the third one is an argument for interpretability -- does that seem right to you?

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-06T02:21:15.774Z · score: 2 (1 votes) · LW · GW
But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you're optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process.

But if there are safety problems in approval, wouldn't there also be safety problems in the human's behavior, which imitation learning would copy?

Similarly, if there are safety problems in the estimation process, wouldn't there also be safety problems in the prediction of what action a human would take?
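To make the structural difference concrete, here's a toy sketch (my own illustration with made-up models, not anyone's proposed implementation). Applying an argmax to a learned approval estimate systematically seeks out the estimator's overestimation errors, whereas imitation just outputs the action the human is predicted to take, so estimation errors show up as occasional mistakes rather than as adversarially selected actions:

```python
import random

# Hypothetical learned models; imagine noisy neural nets.
def estimated_approval(state, action):
    true_approval = -abs(action - 3)         # the human most approves of action 3
    estimation_error = random.gauss(0, 1.0)  # model error
    return true_approval + estimation_error

def predicted_human_action(state):
    # An imperfect model of which action the human would actually take.
    return 3 if random.random() < 0.9 else random.choice(range(10))

def approval_directed_act(state, actions):
    # Optimizes over the *estimate*: selects whichever action the model
    # happens to overrate, i.e. it exploits the estimation process.
    return max(actions, key=lambda a: estimated_approval(state, a))

def imitative_act(state, actions):
    # Just does what the human is predicted to do; model errors are not
    # being optimized against.
    return predicted_human_action(state)
```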

From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?

I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I'll add a pointer to this discussion to the post.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-06T02:15:06.342Z · score: 6 (3 votes) · LW · GW

If you've seen the human acquire resources, then you'll acquire resources in the same way.

If there's now some new resource that you've never seen before, you may acquire it if you're sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This assumes that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching or asking the human. If you imagine imitation learning exactly as we do it today, the system would extrapolate somehow, in a way that isn't necessarily what the human would do -- maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
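Here's a rough sketch of the kind of uncertainty-aware imitation I have in mind (my own toy formulation -- today's imitation learning methods don't actually work like this): maintain a distribution over what the human would do, act when confident, and otherwise ask or keep watching.

```python
from collections import Counter

def posterior_over_human_actions(state, models):
    # `models` is an ensemble of hypotheses about the human's policy;
    # disagreement among their predictions is a crude proxy for uncertainty.
    predictions = [m(state) for m in models]
    counts = Counter(predictions)
    return {action: n / len(predictions) for action, n in counts.items()}

def imitate_with_uncertainty(state, models, ask_human, confidence_threshold=0.9):
    # Act like the human when confident; otherwise resolve the uncertainty
    # by asking (or watching) the human instead of extrapolating.
    posterior = posterior_over_human_actions(state, models)
    best_action, confidence = max(posterior.items(), key=lambda kv: kv[1])
    if confidence >= confidence_threshold:
        return best_action
    return ask_human(state)
```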

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-06T02:04:39.816Z · score: 3 (2 votes) · LW · GW

^ Yes to all of this.

A little bit of nuance: IRL is considered to be a form of imitation learning because in many cases the inferred reward is only meant to reproduce the human's performance and isn't expected to generalize outside of the training distribution.

There are versions of IRL that are meant to go beyond imitation. For example, adversarial IRL tries to infer a reward that generalizes to new environments, in which case it is doing something more than imitation.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-06T01:30:38.401Z · score: 6 (3 votes) · LW · GW

I don't think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you'll be uncertain about what the human is going to do. If you're uncertain in this way, and you are getting your goals from a human, then you don't pursue all of the convergent instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
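A quick worked version of the off-switch result, with my own numbers just to illustrate:

```latex
\[
\begin{aligned}
&\text{Suppose } u = +1 \text{ with probability } 0.6 \text{ and } u = -2 \text{ with probability } 0.4.\\
&\text{Act directly (disable the off switch):}\quad \mathbb{E}[u] = 0.6(1) + 0.4(-2) = -0.2\\
&\text{Switch itself off:}\quad 0\\
&\text{Defer to the human (who allows the action iff } u > 0\text{):}\quad \mathbb{E}[\max(u,0)] = 0.6(1) + 0.4(0) = 0.6
\end{aligned}
\]
```

Since the expected value of deferring is always at least max(E[u], 0), deferring is at least as good as acting or switching off, and strictly better whenever the agent is genuinely uncertain about the sign of u -- so the usual survival incentive goes away.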

It may be that "goal-directed" is the wrong word for the property I'm talking about, but I'm predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-05T21:02:26.613Z · score: 4 (2 votes) · LW · GW

Yes, as long as you keep doing the MCTS + training. The value/policy networks by themselves are not goal-directed.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-05T21:00:35.350Z · score: 2 (1 votes) · LW · GW

Yeah, I was imagining that we would convince AI researchers that goal-directed systems are dangerous, and that we should build the non-goal-directed versions instead.

Comment by rohinmshah on Will humans build goal-directed agents? · 2019-01-05T20:58:22.918Z · score: 14 (5 votes) · LW · GW
Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I was imagining something more like B for the imitation learning case.

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)

That analysis seems right to me.

With respect to whether it is what I want, I wouldn't say that I want any of these things in particular; I'm more pointing at the existence of systems that aren't goal-directed, yet behave like agents.

Will humans build goal-directed agents?

2019-01-05T01:33:36.548Z · score: 38 (9 votes)

Alignment Newsletter #39

2019-01-01T08:10:01.379Z · score: 33 (10 votes)

Alignment Newsletter #38

2018-12-25T16:10:01.289Z · score: 9 (4 votes)

Alignment Newsletter #37

2018-12-17T19:10:01.774Z · score: 26 (7 votes)

Alignment Newsletter #36

2018-12-12T01:10:01.398Z · score: 22 (6 votes)

Alignment Newsletter #35

2018-12-04T01:10:01.209Z · score: 15 (3 votes)

Coherence arguments do not imply goal-directed behavior

2018-12-03T03:26:03.563Z · score: 53 (17 votes)

Intuitions about goal-directed behavior

2018-12-01T04:25:46.560Z · score: 28 (9 votes)

Alignment Newsletter #34

2018-11-26T23:10:03.388Z · score: 26 (5 votes)

Alignment Newsletter #33

2018-11-19T17:20:03.463Z · score: 25 (7 votes)

Alignment Newsletter #32

2018-11-12T17:20:03.572Z · score: 20 (4 votes)

Future directions for ambitious value learning

2018-11-11T15:53:52.888Z · score: 39 (8 votes)

Alignment Newsletter #31

2018-11-05T23:50:02.432Z · score: 19 (3 votes)

What is ambitious value learning?

2018-11-01T16:20:27.865Z · score: 43 (12 votes)

Preface to the Sequence on Value Learning

2018-10-30T22:04:16.196Z · score: 61 (22 votes)

Alignment Newsletter #30

2018-10-29T16:10:02.051Z · score: 31 (13 votes)

Alignment Newsletter #29

2018-10-22T16:20:01.728Z · score: 16 (5 votes)

Alignment Newsletter #28

2018-10-15T21:20:11.587Z · score: 11 (5 votes)

Alignment Newsletter #27

2018-10-09T01:10:01.827Z · score: 16 (3 votes)

Alignment Newsletter #26

2018-10-02T16:10:02.638Z · score: 14 (3 votes)

Alignment Newsletter #25

2018-09-24T16:10:02.168Z · score: 22 (6 votes)

Alignment Newsletter #24

2018-09-17T16:20:01.955Z · score: 10 (5 votes)

Alignment Newsletter #23

2018-09-10T17:10:01.228Z · score: 17 (5 votes)

Alignment Newsletter #22

2018-09-03T16:10:01.116Z · score: 15 (4 votes)

Do what we mean vs. do what we say

2018-08-30T22:03:27.665Z · score: 30 (15 votes)

Alignment Newsletter #21

2018-08-27T16:20:01.406Z · score: 26 (6 votes)

Alignment Newsletter #20

2018-08-20T16:00:04.558Z · score: 13 (6 votes)

Alignment Newsletter #19

2018-08-14T02:10:01.943Z · score: 19 (5 votes)

Alignment Newsletter #18

2018-08-06T16:00:02.561Z · score: 19 (5 votes)

Alignment Newsletter #17

2018-07-30T16:10:02.008Z · score: 35 (6 votes)

Alignment Newsletter #16: 07/23/18

2018-07-23T16:20:03.039Z · score: 44 (11 votes)

Alignment Newsletter #15: 07/16/18

2018-07-16T16:10:03.390Z · score: 42 (13 votes)

Alignment Newsletter #14

2018-07-09T16:20:04.519Z · score: 15 (8 votes)

Alignment Newsletter #13: 07/02/18

2018-07-02T16:10:02.539Z · score: 74 (26 votes)

The Alignment Newsletter #12: 06/25/18

2018-06-25T16:00:42.856Z · score: 15 (5 votes)

The Alignment Newsletter #11: 06/18/18

2018-06-18T16:00:46.985Z · score: 8 (1 votes)

The Alignment Newsletter #10: 06/11/18

2018-06-11T16:00:28.458Z · score: 16 (3 votes)

The Alignment Newsletter #9: 06/04/18

2018-06-04T16:00:42.161Z · score: 8 (1 votes)

The Alignment Newsletter #7: 05/21/18

2018-05-21T16:00:45.356Z · score: 8 (1 votes)

The Alignment Newsletter #5: 05/07/18

2018-05-07T16:00:11.059Z · score: 8 (1 votes)

The Alignment Newsletter #4: 04/30/18

2018-04-30T16:00:13.425Z · score: 8 (1 votes)

The Alignment Newsletter #3: 04/23/18

2018-04-23T16:00:32.988Z · score: 8 (1 votes)

The Alignment Newsletter #2: 04/16/18

2018-04-16T16:00:18.678Z · score: 8 (1 votes)

Announcing the Alignment Newsletter

2018-04-09T21:16:54.274Z · score: 72 (20 votes)