Comment by rohinmshah on Partial preferences needed; partial preferences sufficient · 2019-03-21T20:18:07.865Z · score: 2 (1 votes) · LW · GW
The counter-examples are of that type because the examples are often of that type - presented formally, so vulnerable to a formal solution.

It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. These concepts feel very, very informal (to the point that I think we should try to make them more formal, though I'm not optimistic about getting them to the level of formality you typically use).

(I still need to get a handle on ascription universality, which might be making these concepts more formal, but from what I understand of it so far it's still much less formal than you usually work with.)

we need to define what a "reasonable" utility function

My argument is that we don't need to define this formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence in < 1-in-a-billion chance of failure.

Comment by rohinmshah on Partial preferences needed; partial preferences sufficient · 2019-03-20T23:53:44.283Z · score: 8 (4 votes) · LW · GW

I guess my point is that your counterexamples/problems all have this very formal no-free-lunch theorem aspect to them, and we solve problems that have no-free-lunch theorems all the time -- in fact a lot of the programming languages community is tackling such problems and getting decent results in most cases.

For this reason you could say "okay, while there is a no-free-lunch theorem here, probably when the AI system carves reality at the joints, it ends up with features / cognition that doesn't consider a utility function defined on something like turning on a yellow light to be reasonable". You seem to be opposed to any reasoning of this sort, and I don't know why.

Alignment Newsletter #49

2019-03-20T04:20:01.333Z · score: 17 (4 votes)
Comment by rohinmshah on Partial preferences needed; partial preferences sufficient · 2019-03-20T00:31:12.298Z · score: 5 (3 votes) · LW · GW
Now, people working in these areas don't often disagree with this formal argument; they just think it isn't that important. They feel that getting the right formalism is most of the work, and finding the right U is easier, or just a separate bolt-on that can be added later.
My intuition, formed mainly by my many failures in this area, is that defining the U is absolutely critical, and is much harder than the rest of the problem. Others have different intuitions, and I hope they're right.

I'm curious if you're aiming for justified 99.9999999% confidence in having a friendly AI on the first try (i.e. justified belief that there's no more than a 1 in a billion chance of not-a-friendly-AI-on-the-first-try). I would agree that defining U is necessary to hit that sort of confidence, and that it's much harder than the rest of the problem.

ETA: The reason I ask is that this post seems very similar to the problem I have with impact measures (briefly: either you fail to prevent catastrophes, or you never do anything useful), but I wouldn't apply that argument to corrigibility. I think the difference might be that I'm thinking of "natural" things that agents might want, whereas you're considering the entire space of possible utility functions. I'm trying to figure out why we have this difference.

Comment by rohinmshah on How can we respond to info-cascades? [Info-cascade series] · 2019-03-13T16:58:52.092Z · score: 6 (3 votes) · LW · GW

Pretty sure you know this already, and it's not exactly infrastructure, but it seems like if you have a nice formal process for eliciting people's beliefs, then you want to explicitly ask them for their impressions, not credences (or alternatively for both).

Comment by rohinmshah on Alignment Newsletter #48 · 2019-03-12T22:20:46.414Z · score: 2 (1 votes) · LW · GW

What advantages do you think this has compared to vanilla RL on U + AUP_Penalty?

Comment by rohinmshah on Alignment Newsletter #48 · 2019-03-12T16:49:19.172Z · score: 4 (2 votes) · LW · GW
Question about quantilization: where does the base distribution come from? You and Jessica both mention humans, but if we apply ML to humans, and the ML is really good, wouldn't it just give a prediction like "With near certainty, the human will output X in this situation"? (If the ML isn't very good, then any deviation from the above prediction would reflect the properties of the ML algorithm more than properties of the human.)

I don't have a great answer to this. Intuitively, at the high level, there are a lot of different plans I "could have" taken, and the fact that I didn't take them is more a result of what I happened to think about rather than a considered decision that they were bad. So in the limit of really good ML, one thing you could do is to have a distribution over "initial states" and then ask for the induced distribution over human actions. For example, if you're predicting the human's choice from a list of actions, then you could make the prediction depending on different orderings of the choices in the list, and different presentations of the list. If you're predicting what the human will do in some physical environment, you could check to see what would be done if the human felt slightly colder or slightly warmer, or if they had just thought of particular random words or sentences, etc. All of these have issues in the worst case (e.g. if you're making a decision about whether to wear a jacket, slightly changing the temperature of the room will make the decision worse), but seem fine in most cases, suggesting that there could be a way of making this work, especially if you can do it differently for different domains.
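To make the perturbation idea concrete, here is a minimal sketch of how such an induced distribution could feed into a quantilizer; `predict_action`, `perturb`, and `utility` are hypothetical placeholders rather than an existing implementation:

```python
import random
from collections import Counter

def base_distribution(predict_action, perturb, context, n_samples=1000):
    """Empirical distribution over the human's action, induced by predicting
    under many small perturbations of the 'initial state' (list orderings,
    temperature tweaks, primed words, etc.)."""
    counts = Counter(predict_action(perturb(context)) for _ in range(n_samples))
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}

def quantilize(dist, utility, q=0.1):
    """Sample from the top-q fraction of the base distribution (ranked by
    utility), with probability proportional to the base distribution."""
    ranked = sorted(dist.items(), key=lambda kv: utility(kv[0]), reverse=True)
    top, mass = [], 0.0
    for action, prob in ranked:
        top.append((action, prob))
        mass += prob
        if mass >= q:
            break
    actions, weights = zip(*top)
    return random.choices(actions, weights=weights, k=1)[0]
```

The design choice doing the work is that the base distribution comes from small, plausibly-irrelevant perturbations of the situation, so it stays close to "things the human might actually have done".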

Alignment Newsletter #48

2019-03-11T21:10:02.312Z · score: 29 (11 votes)
Comment by rohinmshah on Example population ethics: ordered discounted utility · 2019-03-11T16:22:22.026Z · score: 4 (3 votes) · LW · GW

It does recommend against creating humans with lives barely worth living, and it equivalently recommends painlessly killing such people as well. If your population is a single person with utility 1000 and γ=.99, then this would recommend against creating a person with utility 1.
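For concreteness, here is the arithmetic behind that example, assuming (as the numbers imply) that utilities are sorted from lowest to highest, with the worst-off person discounted least:

```python
def ordered_discounted_utility(utilities, gamma=0.99):
    # Sort ascending so the worst-off person gets the least discounting.
    return sum(gamma ** i * u for i, u in enumerate(sorted(utilities)))

ordered_discounted_utility([1000])     # 1000.0
ordered_discounted_utility([1000, 1])  # 1 + 0.99 * 1000 = 991.0 < 1000, so don't create (or do kill)
```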

Comment by rohinmshah on How dangerous is it to ride a bicycle without a helmet? · 2019-03-10T00:26:00.121Z · score: 6 (3 votes) · LW · GW

Yeah, I agree with all of that. (I didn't realize the point about the relative sizes of reference classes until I read your reply to habryka more carefully.)

Perhaps another way to make the point about the argument for voting being stronger is that it affects your decisionmaking even if you are not altruistic. Here by stronger I mean that the argument is "more robust" or "less suspicious".

Comment by rohinmshah on How dangerous is it to ride a bicycle without a helmet? · 2019-03-09T22:36:12.090Z · score: 5 (3 votes) · LW · GW
Your decision inflicts the micromorts not just on yourself, but on all the people in the reference class, for the proportionally greater total number of micromorts that given this consideration turn into actual morts very easily.

But your decision also causes the corresponding benefits to accrue to all the people in the reference class? So the decision you make should be the same; it just becomes more consequentially important.

The voting case is different because the benefits are superlinear in the number of people you affect (at least up to a point) -- a million people voting the same way as you probably have more than a million times the chance of swinging the election.

ETA: Never mind, misunderstood habryka's reply, I'm basically saying the same thing. Though I still think that the case for applying the argument to voting is much stronger than the case for applying it in other decisions where benefits are linear.

Alignment Newsletter #47

2019-03-04T04:30:11.524Z · score: 21 (5 votes)
Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-23T22:55:49.997Z · score: 2 (1 votes) · LW · GW
On the scientific/technological side, you can also use scientific/engineering papers (which I'm guessing has to be at least an order of magnitude greater in volume than philosophy writing)

This still seems like it is continuing the status quo (where we put more effort into technology relative to philosophy) rather than differentially benefitting technology.

My main point is that it seems a lot harder for technological progress to go "off the rails" due to having access to ground truths (even if that data is sparse) so we can push it much harder with ML.

Yeah, that seems right, to the extent that we want to use ML to "directly" work on technological / philosophical progress. To the extent that it has to factor through some more indirect method (e.g. through human reasoning as in iterated amplification) I think this becomes an argument to be pessimistic about solving metaphilosophy, but not that it will differentially benefit technological progress (or at least this depends on hard-to-agree-on intuitions).

I think there's a strong argument to be made that you will have to go through some indirect method because there isn't enough data to attack the problem directly.

(Fwiw, I'm also worried about the semi-supervised RL part of iterated amplification for the same reason.)

The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don't understand, so we can't rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though.

Yeah, I agree that this is a strong argument for your position.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-22T17:03:17.026Z · score: 4 (2 votes) · LW · GW
I thought from a previous comment that you already agree with the latter

Yeah, that's why I said "I probably agreed with this in the past". I'm not sure whether my underlying models changed or whether I didn't notice the contradiction in my beliefs at the time.

It's basically that the most obvious way of using ML to accelerate philosophical progress seems risky

It feels like this is true for the vast majority of plausible technological progress as well? E.g. most scientific experiments / designed technologies require real-world experimentation, which means you get very little data, making it very hard to naively automate with ML. I could make a just-so story where philosophy has much more data (philosophy writing), that is relatively easy to access (a lot of it is on the Internet), and so will be easier to automate.

My actual reason for not seeing much of a difference is that (conditional on short timelines) I expect that the systems we develop will be very similar to humans in the profile of abilities they have, because it looks like we will develop them in a manner similar to how humans were "developed" (I'm imagining development paths that look like e.g. OpenAI Five, AlphaStar, GPT-2 as described at SlateStarCodex). So the zeroth-order prediction is that there won't be a relative difference between technological and philosophical progress. A very sketchy first-order prediction based on "there is lots of easily accessible philosophy data" suggests that philosophical progress will be differentially advanced.

See the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy for more details.

Yeah, I agree that that particular method of making philosophical progress is not going to work.

I guess this is more of an argument for overall pessimism rather than for favoring one approach over another, but I still wanted to point out that I don't agree with your relative optimism here.

Yeah, that's basically my response.

I don't have good arguments for my optimism (and I did remove it from the newsletter opinion for that reason). Nonetheless, I am optimistic. One argument is that over the past few centuries it seems like philosophical progress has been making the world better faster than technological progress has been causing bad distributional shifts -- but of course even if our ancestors' values had been corrupted we would not see it that way, so this isn't a very good argument.

Alignment Newsletter #46

2019-02-22T00:10:04.376Z · score: 18 (8 votes)
Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-21T17:44:03.168Z · score: 2 (1 votes) · LW · GW
Solving metaphilosophy is itself a philosophical problem, so if we haven't made much progress on metaphilosophy by the time we get human-level AI, AI probably won't be able to help much with solving metaphilosophy (especially relative to accelerating technological progress).

I could interpret this in two ways:

  • Conditioned on metaphilosophy being hard to solve, AI won't be able to help us with it.
  • Conditioned on us not trying to solve metaphilosophy, AI won't be able to help us with it.

The first interpretation is independent of whether or not we work on metaphilosophy, so it can't be an argument for working on metaphilosophy.

The second interpretation seems false to me, and not because I think there are many considerations that overall come out to make it false -- I don't see any arguments in favor of it. Perhaps one argument is that if we don't try to solve metaphilosophy, then AI won't infer that we care about it, and so won't optimize for it. But that seems very weak, since we can just say that we do care, and that's much stronger evidence. We can also point out that we didn't try to solve the problem because it wasn't the most urgent one at the time.

Implementing the hybrid approach may be more of a technological problem but may still involve hard philosophical problems so it seems like a good idea to look more into it now to determine if that is the case and how feasible it looks overall (and hence how "doomed" approach 5 is, if approach 5 depends on implementing the hybrid approach at some point).

This suggests to me that you think that corrigible AI can't help us figure out hard philosophical problems or metaphilosophy? That would also explain the paragraph above. If so, that's definitely a crux for me, and I'd like to see arguments for that.

I guess you could also make this argument if you think AI is going to accelerate technological progress relative to (meta)philosophical progress. I probably agreed with this in the past, but now that I'm thinking more about it I'm not sure I agree any more. I suspect I was interpreting this as "technological progress will be faster than (meta)philosophical progress" instead of the actually-relevant "the gap between technological progress and (meta)philosophical progress will grow faster than it would have without AI". Do you have arguments for this latter operationalization?

Background: I generally think humans are pretty "good" at technological progress and pretty "bad" at (meta)philosophical progress, and I think AI will be similar. If anything, I might expect the gap between the two to decrease, since humans are "just barely" capable of (meta)philosophical progress (animals aren't capable of it, whereas they are somewhat capable of technological progress), and so there might be more room to improve. But this is based on what I expect are extremely fragile and probably wrong intuitions.

Also it seems like a good idea to try to give the hybrid approach as much of a head start as possible, because any value corruption that occurs prior to corrigible AI switching to a hybrid design probably won't get rolled back.

This is also dependent on the crux above.

Maybe I should clarify that I'm not against people working on corrigibility, if they think that is especially promising or they have a comparative advantage for working on that.

I didn't get the impression that you were against people working on corrigibility. Similarly, I'm not strongly against people working on metaphilosophy. What I'd like to do here is clarify what about metaphilosophy is likely to be necessary before we build powerful AI systems.

Does that seem reasonable to you?

Given your beliefs definitely. It's reasonable by my beliefs too, though it's not what I would do (obviously).

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-21T03:05:19.729Z · score: 2 (1 votes) · LW · GW
I think no, because using either metaphilosophy or the hybrid approach involving idealized humans, an AI could potentially undo any corruption that happens to the user after it becomes powerful enough (i.e., by using superhuman persuasion or some other method).

Couldn't the overseer and the corrigible AI together attempt to solve metaphilosophy / use the hybrid approach if that was most promising? (And to the extent that we could solve metaphilosophy / use the hybrid approach now, it should only be easier once we have a corrigible AI.)

Maybe come back to this after we settle the above question?

Yeah, seems right.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-20T15:43:33.675Z · score: 2 (1 votes) · LW · GW
These were meant to be arguments that approach 5 (corrigibility) is "doomed"

Aren't they also equally powerful arguments that approaches 1-3 are doomed? I could see approach 4 as getting around the problem, though I'd hope that approach 4 could be subsumed under approach 5.

I agree that they are arguments against the statement "I am most optimistic that the last approach will “just work”". Would you agree with "The last approach seems to be the most promising to work on"?

Comment by rohinmshah on Coherent behaviour in the real world is an incoherent concept · 2019-02-20T02:36:35.111Z · score: 3 (3 votes) · LW · GW

I'm pretty sure I have never mentioned Eliezer in the Value Learning sequence. I linked to his writings because they're the best explanation of the perspective I'm arguing against. (Note that this is different from claiming that Eliezer believes that perspective.) This post and comment thread attributed the argument and belief to Eliezer, not me. I responded because it was specifically about what I was arguing against in my post, and I didn't say "I am clarifying the particular argument I am arguing against and am unsure what Eliezer's actual position is" because a) I did think that it was Eliezer's actual position, b) this is a ridiculous amount of boilerplate and c) I try not to spend too much time on comments.

I'm not feeling particularly open to feedback currently, because honestly I think I take far more care about this sort of issue than the typical researcher, but if you want to list a specific thing I could have done differently, I might try to consider how to do that sort of thing in the future.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-20T02:20:16.029Z · score: 7 (4 votes) · LW · GW
What kind of strategy/policy work do you have in mind?

Assessing the incentives for whether or not people will try to intentionally corrupt values, as well as figuring out how to change those incentives if they exist. I don't know exactly, my point was more that this seems like an incentive problem. How would you attack this from a technical angle -- do you have to handcuff the AI to prevent it from ever corrupting values?

Don't we usually assume that the AI is ultimately corrigible to the user or otherwise has to cater to the user's demands, because of competition between different AI providers? In that scenario, the end user also has to care about getting philosophy correct and being risk-averse for things to work out well, right? Or are you imagining some kind of monopoly or oligopoly situation where the AI providers all agree to be paternalistic and keep certain kinds of choices and technologies away from users? If so, how do you prevent AI tech from leaking out (ETA: or being reinvented) and enabling smaller actors from satisfying users' risky demands? (ETA: Maybe you're thinking of a scenario that's more like 4 in my list?)

Yes, AI systems sold to end users would be corrigible to them, but I'm hoping that most of the power is concentrated with the overseers. End users could certainly hurt themselves, but broader governance would prevent them from significantly harming everyone else. Maybe you're worried about end users having their values corrupted, and then democracy preventing us from getting most of the value? But even without value corruption I'd be quite afraid of end-user-defined democracy + powerful AI systems, and I assume you'd be too, so value corruption doesn't seem to be the main issue.

Another issue is that if AIs are not corrigible to end users but to overseers or their companies, that puts the overseers or companies in positions of tremendous power, which would be corrupting in its own way.

Agreed that this is a problem.

It seems that in general one could want to be risk-averse but not know how, so just having people be risk averse doesn't seem enough to ensure safety.
[...] it's unclear what it's supposed to do if such queries can themselves corrupt the overseer or user. [...]
BTW, Alex Zhu made a similar point in Acknowledging metaphilosophical competence may be insufficient for safe self-amplification.

In all of these cases, it seems like the problem is independent of AI. For risk aversion, if you wanted to solve it now, presumably you would try to figure out how to be risk-averse. But you could also do this with the assistance of an AI system. Perhaps the AI system does something risky while it is helping you figure out risk aversion? This doesn't feel very likely to me.

For the second one, presumably the queries would also corrupt the human if the human thought of them? If you'd like to solve this problem by creating a theory of value corruption and using that to decide whether queries were going to corrupt values, couldn't you do that with the assistance of the AI, and it waits on the potentially corrupting queries until that theory is complete?

For Alex's point, if there are risks during the period that an AI is trying to become metaphilosophically competent that can't be ignored, why aren't there similar risks right now that can't be ignored?

(These could all be arguments that we're doomed and there's no hope, but they don't seem to be arguments that we should differentially be putting in current effort into them.)

Comment by rohinmshah on Alignment Newsletter #43 · 2019-02-20T01:57:13.508Z · score: 2 (1 votes) · LW · GW

Thanks!

Comment by rohinmshah on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-19T01:19:28.261Z · score: 2 (1 votes) · LW · GW

Got it, thanks for clarifying.

Comment by rohinmshah on Coherent behaviour in the real world is an incoherent concept · 2019-02-19T01:17:59.203Z · score: 2 (1 votes) · LW · GW
In summary it seems like you misunderstood Eliezer due to not noticing a distinction that he draws between "intelligent" (or "cognitively powerful") and "highly optimized".

That's true, I'm not sure what this distinction is meant to capture. I'm updating that the thing I said is less likely to be true, but I'm still somewhat confident that it captures the general gist of what Eliezer meant. I would bet on this at even odds if there were some way to evaluate it.

Eliezer explicitly disclaimed this: [...]
In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not "highly optimized" and hence not an expected utility maximizer.

This is a tiny bit of his writing, and his tone makes it clear that this is unlikely. This is different from what I expected (when something has the force of a theorem you don't usually call its negation just "unlikely" and have a story for how it could be true), but it still seems consistent with the general story I said above.

In any case, I don't want to spend any more time figuring out what Eliezer believes, he can say something himself if he wants. I mostly replied to this comment to clarify the particular argument I'm arguing against, which I thought Eliezer believed, but even if he doesn't it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-19T00:44:23.282Z · score: 2 (1 votes) · LW · GW

I'm super unsure about the intentional case, and agree that I want to see more work on that front, but it feels like a particular problem that can be solved with something like strategy/policy work. Put another way, intentional value corruption seems like a non-central example of problems that arise from philosophical difficulty. I agree that corrigibility + good overseers does not clearly solve it.

For the unintentional case, I think that overseers who care about getting philosophy right are going to think about value drift, because many of us are currently thinking about it. It seems like as long as the overseers make this apparent to the AI system and are sufficiently risk-averse, a corrigible AI system would take care not to corrupt their values. (The AI system might fail at this, but this doesn't seem that likely to me, and it feels very hard to make progress on that particular point without more details on how the AI system works.)

I do think that we want to think about how to ensure that there are overseers who care about getting the questions right, who know about value drift, who will be sufficiently risk-averse, etc.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-18T06:43:58.652Z · score: 5 (2 votes) · LW · GW

Planned newsletter opinion: This seems like a real problem, but I'm not sure how important it is. I am most optimistic that the last approach will "just work", where we solve alignment and there are enough overseers who care about getting these questions right that we do solve these philosophical problems. However, I'm very uncertain about this since I haven't thought about it enough (it seems like a question about humans rather than about AI). Regardless of importance, it does seem to have almost no one working on it and could benefit from more thought.

Comment by rohinmshah on Coherent behaviour in the real world is an incoherent concept · 2019-02-18T05:48:28.853Z · score: 0 (2 votes) · LW · GW
Rohin seems to think the point is "Simply knowing that an agent is intelligent lets us infer that it is goal-directed" but Eliezer doesn't seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That's actually one of MIRI's research objectives even though they take a different approach from Paul's.)

I think the point (from Eliezer's perspective) is "Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer". The main implication is that there is no way to affect the details of a superintelligent AI except by affecting its utility function, since everything else is fixed by math (specifically the VNM theorem). Note that this is (or rather, appears to be) a very strong condition on what alignment approaches could possibly work -- you can throw out any approach that isn't going to affect the AI's utility function. I think this is the primary reason for Eliezer making this argument. Let's call this the "intelligence implies EU maximization" claim.

Separately, there is another claim that says "EU maximization by default implies goal-directedness" (or the presence of convergent instrumental subgoals, if you prefer that instead of goal-directedness). However, this is not required by math, so it is possible to avoid this implication, by designing your utility function in just the right way.

Corrigibility is possible under this framework by working against the second claim, i.e. designing the utility function in just the right way that you get corrigible behavior out. And in fact this is the approach to corrigibility that MIRI looked into.

I am primarily taking issue with the "intelligence implies EU maximization" argument. The problem is, "intelligence implies EU maximization" is true, it just happens to be vacuous. So I can't say that that's what I'm arguing against. This is why I rounded it off to arguing against "intelligence implies goal-directedness", though this is clearly a bad enough summary that I shouldn't be saying that any more.
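To spell out the sense in which the claim is true but vacuous, here is the standard construction (sketched for a deterministic policy): any behavior whatsoever maximizes expected utility for some utility function over complete trajectories.

```latex
% Given any policy \pi, define a utility function over trajectories \tau = (h_0, a_0, h_1, a_1, \dots):
U_\pi(\tau) =
\begin{cases}
  1 & \text{if } a_t = \pi(h_t) \text{ for every timestep } t \text{ in } \tau, \\
  0 & \text{otherwise.}
\end{cases}
% Then \pi attains the maximum possible expected utility of 1, so
\pi \in \arg\max_{\pi'} \; \mathbb{E}_{\tau \sim \pi'}\!\left[ U_\pi(\tau) \right],
% i.e. "is an expected utility maximizer" by itself places no constraint on behavior.
```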

Comment by rohinmshah on Pedagogy as Struggle · 2019-02-18T03:59:26.518Z · score: 1 (4 votes) · LW · GW

Strong agree. One of my subgoals during teaching is often to confuse students. See also this video, which basically captures the reason why.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-02-18T03:48:43.004Z · score: 8 (4 votes) · LW · GW

That was the summary :P The full thing was quite a bit longer. I also didn't want to misquote Eric.

Maybe the shorter summary is: there are two axes which we can talk about. First, will systems be transparent, modular and structured (call this CAIS-like), or will they be opaque and well-integrated? Second, assuming that they are opaque and well-integrated, will they have the classic long-term goal-directed AGI-agent risks or not?

Eric and I disagree on the first one: my position is that for any particular task, while CAIS-like systems will be developed first, they will gradually be replaced by well-integrated ones, once we have enough compute, data, and model capacity.

I'm not sure how much Eric and I disagree on the second one: I think it's reasonable to predict that the resulting systems are specialized for particular bounded tasks and so won't be running broad searches for long-term plans. I would still worry about inner optimizers; I don't know what Eric thinks about that worry.

This summary is more focused on my beliefs than Eric's, and is probably not a good summary of the intent behind the original comment, which was "what does Eric think Rohin got wrong in his summary + opinion of CAIS", along with some commentary from me trying to clarify my beliefs.

My updates were mainly about actually carving up the space in the way above. There were probably other updates as well, but I often find it hard to introspect on how my beliefs are updating.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-02-17T21:17:51.485Z · score: 13 (4 votes) · LW · GW

Eric and I have exchanged a few emails since I posted this summary, I'm posting some of it here (with his permission), edited by me for conciseness and clarity. The paragraphs in the quotes are Eric's, but I have rearranged his paragraphs and omitted some of them for better flow in this comment.

There is a widespread intuition that AGI agents would by nature be more integrated, flexible, or efficient than comparable AI services. I am persuaded that this is wrong, and stems from an illusion of simplicity that results from hiding mechanism in a conceptually opaque box, a point that is argued at some length in Section 13.
Overall, I think that many of us have been in the habit of seeing flexible optimization itself as a problem, when optimization is instead (in the typical case) a strong constraint on a system’s behavior (see Section 8). Flexibility of computation in pursuit of optimization for bounded tasks seems simply useful, regardless of planning horizon, scope of considerations, or scope of required knowledge.

I agree that AGI agents hide mechanism in an opaque box. I also agree that the sort of optimization that current ML does, which is very task-focused, is a strong constraint on behavior. There seems to be a different sort of optimization that humans are capable of, where we can enter a new domain and perform well in it very quickly; I don't have a good understanding of that sort of optimization, and I think that's what the classic AGI agent risks are about.

Relatedly, I've used the words "monolithic AGI agent" a bunch in the summary and the post. Now, I want to instead talk about whether AI systems will be opaque and well-integrated, since that's the main crux of our disagreement. It's plausible to me that even if they are opaque and well-integrated, you don't get the classic AGI agent risks, because you don't get the sort of optimization I was talking about above.

In this connection, you cite the power of end-to-end training, but Section 17.4 (“General capabilities comprise many tasks and end-to-end relationships”) argues that, because diverse tasks encompass many end-to-end relationships, the idea that a broad set of tasks can be trained “end to end” is mistaken, a result of the narrowness of current trained systems in which services form chains rather than networks that are more wide than deep. We should instead expect that broad capabilities will best be implemented by sets of systems (or sets of end-to-end chains of systems) that comprise well-focused competencies: Systems that draw on distinct subtask competencies will typically be easier to train and provide more robust and general performance (Section 17.5).  Modularity typically improves flexibility and generality, rather than impeding it.
Note that the ability to employ subtask components in multiple contexts constitutes a form of transfer learning, and [...] this transfer learning can carry with it task-specific aspects of behavioral alignment.

This seems like the main crux of the disagreement. My claim is that for any particular task, given enough compute, data and model size, an opaque, well-integrated, unstructured AI system will outperform a transparent, modular collection of services. This is only on the axis of performance at the task: I agree that the structured system will generalize better out of distribution (which leads to robustness, flexibility, and better transfer learning). I'm basing this primarily off of empirical evidence and intuitions:

  • For many tasks so far (computer vision, NLP, robotics), transitioning from a modular architecture to end-to-end deep learning led to large boosts in performance.
  • My impression is that many interdisciplinary academics are able to transfer ideas and intuitions from one of their fields to the other, allowing them to make big contributions that more experienced researchers could not. This suggests that patterns of problem-solving from one field can transfer to another in a non-trivial way, and that this kind of transfer is best achieved with well-integrated systems.
  • Psychology research can be thought of as an attempt to systematize/modularize our knowledge about humans. Despite a huge amount of work in psychology, our internal, implicit, well-integrated models of humans are way better than our explicit theories.

Humans definitely solve large tasks in a very structured way; I hypothesize that this is because for those tasks the limits of human compute/data/brain size prevent us from getting the benefits of an unstructured heuristic approach.

Speaking of integration:

Regarding integration, I’ve argued that classic AGI-agent models neither simplify nor explain general AI capabilities (Section 13.3), including the integration of competencies. Whatever integration of functionality one expects to find inside an opaque AGI agent must be based on mechanisms that presumably apply equally well to integrating relatively transparent systems of services. These mechanisms can be dynamic, rather than static, and can include communication via opaque vector embeddings, jointly fine-tuning systems that perform often-repeated tasks, and matching of tasks to services, (including service-development services) in semantically meaningful “task spaces” (discussed in Section 39 “Tiling task-space with AI services can provide general AI capabilities”).
[...]
Direct lateral links between competencies such as organic synthesis, celestial mechanics, ancient Greek, particle physics, image interpretation, algorithm design, traffic planning (etc.) are likely to be sparse, particularly when services perform object-level tasks. This sparseness is, I think, inherent in natural task-structures, quite independent of human cognitive limitations.

(The paragraphs above were written in a response to me while I was still using the phrase "AGI agents")

I expect that the more you integrate the systems of services, the more opaque they will become. The resulting system will be less interpretable; it will be harder to reason about what information particular services do not have access to (Section 9.4); and it is harder to tell when malicious behavior is happening. The safety affordances identified in CAIS no longer apply because there is not enough modularity between services.

Re: sparseness inherent in task-structures, I think this is a result of human cognitive limitations but don't know how to argue more for that perspective.

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-17T19:56:33.041Z · score: 2 (1 votes) · LW · GW
What would you say is the primary source of the problem?

The fact that humans don't generalize well out of distribution, especially on moral questions; and the fact that progress can cause distribution shifts that cause us to fail to achieve our "true values".

What do you think the implications of this are?

Um, nothing in particular.

I'm not sure why you ask this though.

It's very hard to understand what people actually mean when they say things, and a good way to check is to formulate an implication of (your model of) their model that they haven't said explicitly, and then see whether you were correct about that implication.

Comment by rohinmshah on How the MtG Color Wheel Explains AI Safety · 2019-02-17T00:21:53.421Z · score: 12 (4 votes) · LW · GW
Other strategies I want to put in this cluster include formal verification, informed oversight and factorization.

Why informed oversight? It doesn't feel like a natural fit to me. Perhaps you think any oversight fits in this category, as opposed to the specific problem pointed to by informed oversight? Or perhaps there was no better place to put it?

Corrigibility is largely about making systems that are superintelligent without being themselves fully agentic.

This seems very different from the notion of corrigibility that is "a system that is trying to help its operator". Do you think that these are two different notions, or are they different ways of pointing at the same thing?

Comment by rohinmshah on The Argument from Philosophical Difficulty · 2019-02-16T19:55:29.801Z · score: 5 (2 votes) · LW · GW

A lot of this doesn't seem specific to AI. Would you agree that AI accelerates the problem and makes it more urgent, but isn't the primary source of the problem you've identified?

How would you feel about our chances for a good future if AI didn't exist (but we still go forward with technological development, presumably reaching space exploration eventually)? Are human safety problems an issue then? Some of the problems, like intentional value manipulation, do seem to become significantly easier.

Comment by rohinmshah on Why is so much discussion happening in private Google Docs? · 2019-02-16T19:47:31.493Z · score: 2 (1 votes) · LW · GW

FYI I've had this experience as well, though it's not particularly strong or common.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-02-15T22:20:18.607Z · score: 4 (2 votes) · LW · GW
I see a few criticisms about how this doesn't really solve the problem, it only delays it because we expect a unified agent to outperform the combined services.

Not sure if you're talking about me, but I suspect that my criticism could be read that way. Just want to clarify that I do think "we expect a unified agent to outperform the combined services" but I don't think this means we shouldn't pursue CAIS. That strategic question seems hard and I don't have a strong opinion on it.

Comment by rohinmshah on Learning preferences by looking at the world · 2019-02-15T22:17:46.727Z · score: 5 (1 votes) · LW · GW
(But maybe these questions aren't very important if the main point here isn't offering RLSP as a concrete technique for people to use but more that "state of the world tells us a lot about what humans care about".)

Yeah, I think that's basically my position.

But to try to give an answer anyway, I suspect that the benefits of having a lot of data via large-scale IRL will make it significantly outperform RLSP, even if you could get a longer time horizon on RLSP. There might be weird effects where the RLSP reward is less Goodhart-able (since it tends to prioritize keeping the state the same), which would make the RLSP reward better to maximize, even though it captures fewer aspects of "what humans care about". On the other hand, RLSP is much more fragile; slight errors in dynamics / features / action space will lead to big errors in the inferred reward. I would guess this is less true of large-scale IRL, so in practice I'd guess that large-scale IRL would still be better. But both would be bad.

Comment by rohinmshah on Learning preferences by looking at the world · 2019-02-14T18:46:37.249Z · score: 4 (2 votes) · LW · GW
I'm confused that this idea is framed as an alternative to impact measures, because I thought the main point of impact measures is "prevent catastrophe" and this doesn't aim to do that.

I didn't mean to frame it as an alternative to impact measures, but it is achieving some of the things that impact measures achieve. Partly I wrote this post to explicitly say that I don't imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true. I guess I didn't communicate that effectively.

In the AI that RLSP might be a component of, what is doing the "prevent catastrophe" part?

That depends more on the AI part than on RLSP. I think the actual contribution here is the observation that the state of the world tells us a lot about what humans care about, and the RLSP algorithm is meant to demonstrate that it is in principle possible to extract those preferences.

If I were forced to give an answer to this question, it would be that RLSP would form a part of a norm-following AI, and that because the AI was following norms it wouldn't do anything too crazy. However, RLSP doesn't solve any of the theoretical problems with norm-following AI.

But the real answer is that this is an observation that seems important, but I don't have a story for how it leads to us solving AI safety.

Can you also compare the pros and cons of this idea with other related ideas, for example large-scale IRL? (I'm imagining attaching recording devices to lots of people and recording their behavior over say months or years and feeding that to IRL.)

Any scenario I construct with RLSP has clear problems, and similarly large-scale IRL also has clear problems. If you provide particular scenarios I could analyze those.

For example, if you literally think just of running RLSP with a time horizon of a year vs. large-scale IRL over a year and optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.

It seems like there's gotta be a principled way to combine this idea with inverse reward design. Is that something you've thought about?

Yeah, I agree they feel very composable. The main issue is that the observation model in IRD requires a notion of a "training environment" that's separate from the real world, whereas RLSP assumes that there is one complex environment in which you are acting.

Certainly if you first trained your AI system in some training environments and then deployed it in the real world, you could use IRD during training to get a distribution over reward functions, and then use that distribution as your prior when running RLSP. It's maybe plausible that if you did this you could simply optimize the resulting reward function rather than doing risk-averse planning (which is how IRD gets the robot to avoid lava); that would be cool. It's hard to test because all of the IRD environments don't satisfy the key assumption of RLSP (that humans have optimized the environment for their preferences).

Comment by rohinmshah on Alignment Newsletter #45 · 2019-02-14T18:28:12.673Z · score: 8 (5 votes) · LW · GW

:)

Comment by rohinmshah on X-risks are a tragedies of the commons · 2019-02-14T18:27:41.644Z · score: 3 (2 votes) · LW · GW

Ah, you're right, we don't really agree, I misunderstood.

I think we basically agree on the actual object-level thing, and I'm mostly disagreeing on the use of "tragedy of the commons" as a description of it. I don't think this is important though, so I'd prefer to drop it.

Tbc, I agree with this:

If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive [...] to underinvest in reducing Xrisk. There's still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.
Comment by rohinmshah on Three Kinds of Research Documents: Clarification, Explanatory, Academic · 2019-02-14T18:21:29.666Z · score: 3 (2 votes) · LW · GW
Academic documents, as I interpret them, aim to be acceptable to the academic community or considered academic.

There are good non-signaling reasons for academic documents being the way that they are. Consider the following properties of academia:

  • A field is huge, such that it is very hard to learn all of it
  • The group of people working on the field is enormous, requiring decentralized coordination
  • Fields of inquiry take decades, meaning that there needs to be a way of onboarding new people

Consider how you might try to write explanatory posts for such a group that are shorter than books, and I suspect you'll recover many of the properties of academic articles (perhaps modernized, e.g. links instead of citations).

Alignment Newsletter #45

2019-02-14T02:10:01.155Z · score: 26 (8 votes)
Comment by rohinmshah on Learning preferences by looking at the world · 2019-02-13T17:04:06.030Z · score: 4 (2 votes) · LW · GW
Whether it is necessary to simulate the past to figure out the cost of deviating from the present state, I am not sure.

You seem to be proposing low-impact AI / impact regularization methods. As I mentioned in the post:

we are gaining significantly on the “do what we want” desideratum: the point of inferring preferences is that we do not also penalize positive impacts that we want to happen.

Almost everything we want to do is irreversible / impactful / entropy-increasing, and many things that we don't care about are also irreversible / impactful / entropy-increasing. If you penalize irreversibility / impact / entropy, then you will prevent your AI system from executing strategies that would be perfectly fine and even desirable. My intuition is that typically this would prevent your AI system from doing anything interesting (e.g. replacing CEOs).

Simulating the past is one way that you can infer preferences from the state of the world; it's probably not the best way and I'm not tied to that particularly strategy. The important bit is that the state contains preference information and it is possible in theory to extract it.

Comment by rohinmshah on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T07:06:59.406Z · score: 8 (3 votes) · LW · GW
You can use RL for the distillation step.

Yeah, I know, my main uncertainty was with how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn't sure where the "sequential" part came in).

The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and "RL" seems like the normal way to talk about that algorithm/problem. You could instead call it "contextual bandits."

I get the need for REINFORCE, but I'm not sure I understand the value function baseline part.

Here's a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:

  • The states are the question + in-progress answer
  • The actions are "append the word w to the answer"
  • All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer's answer to "How good is answer <answer> to question <question>?"

And we can apply RL algorithms to this problem.

Is that equivalent to what you're saying?
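For concreteness, here is a minimal sketch of the sparse-reward framing in those bullet points; the `overseer_score` argument is a hypothetical stand-in for the amplified overseer's judgment, not code from the post:

```python
class AnswerEnv:
    """Sparse-reward episode: states are (question, answer-so-far), actions append
    a word, and the only nonzero reward comes when the answer is finished."""

    def __init__(self, question, overseer_score, end_token="<END>"):
        self.question = question
        self.overseer_score = overseer_score  # hypothetical: overseer's rating of (question, answer)
        self.end_token = end_token
        self.answer = []

    def step(self, word):
        if word == self.end_token:
            reward = self.overseer_score(self.question, " ".join(self.answer))
            return (self.question, tuple(self.answer)), reward, True   # episode ends
        self.answer.append(word)
        return (self.question, tuple(self.answer)), 0.0, False         # zero reward mid-episode
```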

You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it's generic RL.

Just to make sure I'm understanding correctly, this is recursive reward modeling, right?

Does "imitation learning" refer to an autoregressive model here? I think of IRL+RL a possible mechanism for imitation learning, and it's normally the kind of algorithm I have in mind when talking about "imitation learning" (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)

Yeah, that was bad wording on my part. I was using "imitation learning" to refer both to the problem of imitating the behavior of an agent, as well as the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. cross-entropy loss.

I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)

Comment by rohinmshah on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T06:35:10.905Z · score: 2 (1 votes) · LW · GW

I'm seeing a one-hour old empty comment, I assume it got accidentally deleted somehow?

ETA: Nvm, I can see it on LessWrong, but not on the Alignment Forum.

Comment by rohinmshah on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T03:28:09.251Z · score: 5 (2 votes) · LW · GW

I agree with Wei Dai that the schemes you're describing do not sound like imitation learning. Both of the schemes you describe sound to me like RL-IA. The scheme that you call imitation-IA seems like a combination random search + gradients method of doing RL. There's an exactly analogous RL algorithm for the normal RL setting -- just take the algorithm you have, and replace all instances of M2("How good is answer X to Y?") with R(X), where R is the reward function.

One way that you could do imitation-IA would be to compute the amplified model's answer X* a bunch of times to get a dataset and train on that dataset.

I am also not sure exactly what it means to use RL in iterated amplification. There are two different possibilities I could imagine:

  • Using a combination of IRL + RL to achieve the same effect as imitation learning. The hope here would be that IRL + RL provides a better inductive bias for imitation learning, helping with sample efficiency.
  • Instead of asking the amplified model to compute the answer directly, we ask it to provide a measure of approval, e.g. by asking "How good is answer X to Y?", or by asking "Which is a better answer to Y, X1 or X2?" and learning from that signal (see optimizing with comparisons), using some arbitrary RL algorithm.

I'm quite confident that RL+IA is not meant to be the first kind. But even with the second kind, one question does arise -- typically with RL we're trying to optimize the sum of rewards across time, whereas here we actually only want to optimize the one-step reward that you get immediately (which is the point of maximizing approval and having a stronger overseer). So then I don't really see why you want RL, which typically is solving a hard credit assignment problem that doesn't arise in the one-step setting.

Comment by rohinmshah on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T03:05:39.828Z · score: 5 (2 votes) · LW · GW
What is P_M(X*)?

It's the probability that the model M that we're training assigns to the best answer X*. (M is outputting a probability distribution over D.)

The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you're increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.
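A minimal PyTorch-style sketch of that REINFORCE update, with a value-function baseline to reduce variance; `policy`, `baseline`, and `reward_fn` are hypothetical placeholders rather than code from the post:

```python
import torch

def reinforce_step(policy, baseline, optimizer, question, reward_fn, n_samples=16):
    """Raise the log-probability of sampled answers in proportion to how much
    their reward exceeds the value-function baseline."""
    losses = []
    for _ in range(n_samples):
        answer, log_prob = policy.sample_with_log_prob(question)   # sum of per-word log-probs
        reward = reward_fn(question, answer)                       # e.g. overseer's "How good is X?" score
        advantage = reward - baseline(question).detach()           # baseline trained separately to predict reward
        losses.append(-log_prob * advantage)                       # minimizing this ascends E[log pi(X) * advantage]
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The baseline is what keeps the variance of the gradient estimate manageable when rewards are all of similar magnitude; without it, every sampled answer's log-probability gets pushed up by roughly the same amount.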

Learning preferences by looking at the world

2019-02-12T22:25:16.905Z · score: 47 (13 votes)
Comment by rohinmshah on X-risks are a tragedies of the commons · 2019-02-12T18:50:47.848Z · score: 2 (1 votes) · LW · GW

Cool, I think we agree.

Comment by rohinmshah on X-risks are a tragedies of the commons · 2019-02-08T21:37:58.116Z · score: 4 (2 votes) · LW · GW

Tragedies of the commons usually involve some personal incentive to defect, which doesn't seem true in the framework you have. Of course, you could get such an incentive if you include race dynamics where safety takes "extra time", and then it would seem like a tragedy of the commons (though "race to the bottom" seems more appropriate).

Comment by rohinmshah on (notes on) Policy Desiderata for Superintelligent AI: A Vector Field Approach · 2019-02-06T22:47:09.557Z · score: 10 (2 votes) · LW · GW
It'd also be great to (publicly) hear that someone else actually read the paper and checked whether my notes missed something important or are inaccurate.

I read the paper over a year ago (before the update), and reviewing my notes, they look similar to yours (but less detailed).

Alignment Newsletter #44

2019-02-06T08:30:01.424Z · score: 20 (6 votes)
Comment by rohinmshah on Greatest Lower Bound for AGI · 2019-02-06T08:08:49.423Z · score: 10 (6 votes) · LW · GW

Either he's not trying to be calibrated, or he's not good at being calibrated; probably the former. Like, my inside view also screams fairly loudly that AGI in 2020 is never going to happen -- but assigning 99% confidence to my inside view is far too much confidence. I expect LeCun is mostly trying to communicate what his inside view is confident about.

There are lots of good non-alignment ML researchers whose timelines are much much shorter (including many working at DeepMind and OpenAI). Of course, it could be that they are the ones who are wrong and LeCun is right, but I don't see a particularly compelling reason to make that judgment.

Comment by rohinmshah on Greatest Lower Bound for AGI · 2019-02-05T22:22:34.096Z · score: 13 (6 votes) · LW · GW

Given the sheer amount of effort DeepMind and OpenAI are putting into the problem, and the fact that what they are working on need not be clear to us, and the fact that forecasting is hard, I think it's hard to place less than 1% on short timelines. You could justify less than 1% on 2019, maybe even 2020, but you should probably put at least 1% on 2021.

(This is assuming you have no information about DeepMind or OpenAI besides what they publish publicly.)

Comment by rohinmshah on Conclusion to the sequence on value learning · 2019-02-05T19:44:02.775Z · score: 4 (2 votes) · LW · GW
In the abstract, one open problem about "not-goal-directed agents" is "when do they turn into goal-directed ones?"; this seems similar to the problem of inner optimizers, at least in the sense that solutions which would prevent the emergence of inner optimizers could likely also work for non-goal-directed things

I agree that inner optimizers are a way that non-goal directed agents can become goal directed. I don't see why solutions to inner optimizers would help align non goal-directed things. Can you say more about that?

From the "alternative solutions", in my view, what is under-investigated are attempts to limit capabilities - make "bounded agents". One intuition behind it is that humans are functional just because goals and utilities are "broken" in a way compatible with our planning and computational bounds. I'm worried that efforts in this direction got bucketed with "boxing", and boxing got some vibe as being uncool. (By making something bounded I mean for example making bit-flips costly in a way which is tied to physics, not naive solutions like "just don't connect it to the internet")

I am somewhat worried about such approaches, because it seems hard to make such agents competitive with unaligned agents. But I agree that it seems under-investigated.

I'm particularly happy about your points on the standard claims about expected utility maximization.

Thanks!

Comment by rohinmshah on Conclusion to the sequence on value learning · 2019-02-05T19:40:59.017Z · score: 6 (3 votes) · LW · GW

I am also unsure how much people think that's the primary problem. I feel fairly confident that Eliezer thinks (or thought at some recent point) that this was the primary problem. I came into the field thinking of this as the primary problem.

It certainly seems that many people assume that a superintelligent AI system has a utility function. I don't know their reasons for this assumption.

Comment by rohinmshah on How does Gradient Descent Interact with Goodhart? · 2019-02-04T06:28:23.328Z · score: 11 (5 votes) · LW · GW
Human approval is a good proxy for human value when sampling (even large numbers of) inputs/plans, but a bad proxy for human value when choosing inputs/plans that were optimized via local search. Local search will find ways to hack the human approval while having little effect on the true value.

I do have this intuition as well, but it's because I expect local search to be far more powerful than random search. I'm not sure how large you were thinking with "large". Either way, though, I expect that sampling will result in a just-barely-approved plan, whereas local search will result in a very-high-approval plan -- which basically means that the local search method was a more powerful optimizer. (As intuition for this, note that since image classifiers have millions of parameters, their success suggests that gradient descent is capable of somewhere between thousands and trillions of bits of optimization. The corresponding random search would need an astronomically large number of samples. There's a toy sketch of this comparison below.)

Another way of stating this: I'd be worried about procedure B because it seems like if it is behaving adversarially, then it can craft a plan that gets lots of human approval despite being bad, whereas procedure A can't do that.

However, if procedure B actually takes a lot of samples before finding a design that humans approve, then we'd get a barely-approved plan, and then I feel about the same amount of worry with either procedure.
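As a toy illustration of the "more powerful optimizer" point above (nothing here comes from the original post; the 100-dimensional quadratic "approval" score is made up purely for the sketch):

```python
import numpy as np

# Toy comparison of random search vs. gradient-based local search on a made-up
# smooth "approval" score (higher is better; the maximum value of 0 is at x = target).
rng = np.random.default_rng(0)
dim = 100
target = rng.normal(size=dim)

def score(x):
    return -np.sum((x - target) ** 2)

# Random search: keep the best of 100,000 independent samples.
best_random = max(score(x) for x in rng.normal(size=(100_000, dim)))

# Local search: 1,000 gradient-ascent steps starting from the origin.
x = np.zeros(dim)
for _ in range(1_000):
    x += 0.01 * (-2 * (x - target))  # gradient of score with respect to x
best_local = score(x)

print(f"best of 100,000 random samples: {best_random:.1f}")  # still far below 0
print(f"after 1,000 gradient steps:     {best_local:.6f}")   # essentially at the optimum
```

With a comparable number of score evaluations, the gradient-based search essentially reaches the optimum while the best random sample is still far away; matching the local-search result by sampling alone would take astronomically many draws.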

How do you feel about a modification of procedure A, where the sampled plans are evaluated by an ML model of human approval, and only if they reach an acceptably high threshold are they sent to a human for final approval?
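Concretely, the modification I have in mind looks something like this (sample_plan, approval_model, ask_human, and the threshold are all hypothetical placeholders, not anything from the post):

```python
# Sketch of the modified procedure A: random sampling, filtered by a learned
# model of human approval, with a human giving the final sign-off.
# All function names below are hypothetical placeholders.
APPROVAL_THRESHOLD = 0.9

def modified_procedure_a(sample_plan, approval_model, ask_human, max_tries=100_000):
    for _ in range(max_tries):
        plan = sample_plan()                            # random sampling, no local search
        if approval_model(plan) >= APPROVAL_THRESHOLD:  # cheap learned filter
            if ask_human(plan):                         # expensive final human check
                return plan
    return None  # give up if no plan clears both filters within the budget
```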

Also, a random note that might be important: any specifications that we give to the rocket design system are very strong indicators of what we will approve.

Comment by rohinmshah on Reliability amplification · 2019-02-04T02:20:12.914Z · score: 4 (2 votes) · LW · GW

(Rambling + confused, I'm trying to understand this post)

It seems like all of this requires the assumption that our agents have a small probability of failure on any given input. If there are some questions on which our agent is very likely to fail, then this scheme actually hurts us, amplifying the failure probability on those questions. Ah, right, that's the problem of security amplification.

So really, the point of reliability amplification is to decrease the chance that the agent becomes incorrigible, which is a property of the agent's "motivational system" that doesn't depend on particular inputs. And if any part of a deliberation tree is computed incorrigibly, then the output of the deliberation is itself incorrigible, which is why we get the amplification of failure probability with capability amplification.
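To put rough numbers on the "actually hurts us" point above, assuming the simplest version of the scheme I can picture, a majority vote over three independent copies of the agent (this exact setup is my assumption, not something stated in the post):

```python
# If each copy fails independently with probability eps, a 3-way majority vote
# fails when at least two copies fail:
#   3 * eps^2 * (1 - eps) + eps^3  =  3*eps^2 - 2*eps^3
def majority_of_three_failure(eps: float) -> float:
    return 3 * eps**2 - 2 * eps**3

print(majority_of_three_failure(0.01))  # ~0.0003: reliability is amplified
print(majority_of_three_failure(0.6))   # ~0.648:  failure is amplified instead
```

So the scheme helps exactly when the per-copy failure probability on an input is below 1/2, and hurts when it is above, which is why inputs the agent is very likely to fail on are a problem for it.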

When I phrase it this way, it seems like this is another line of defense that's protecting against the same thing as techniques for optimizing worst-case performance. Do you agree that if those techniques work "perfectly" then there's no need for reliability amplification?

This is an interesting failure model though -- how does incorrigibility arise such that it is all-or-nothing, and doesn't depend on input? Why aren't there inputs that almost always cause our agent to become incorrigible? I suppose the answer to that is that we'll start with an agent that uses such small inputs that it is always corrigible, and our capability amplification procedure will ensure that we stay corrigible.

But then in that case why is there a failure probability at all? That assumption is strong enough to say that the agent is never incorrigible.

TL;DR: Where does the potential incorrigibility arise from in the first place? I would expect it to arise in response to a particular input, but that doesn't seem to be your model.

Comment by rohinmshah on Conclusion to the sequence on value learning · 2019-02-04T01:59:28.560Z · score: 5 (3 votes) · LW · GW

Yes, I agree that's a corollary.

Conclusion to the sequence on value learning

2019-02-03T21:05:11.631Z · score: 44 (10 votes)

Alignment Newsletter #43

2019-01-29T21:10:02.373Z · score: 15 (5 votes)

Future directions for narrow value learning

2019-01-26T02:36:51.532Z · score: 12 (5 votes)

The human side of interaction

2019-01-24T10:14:33.906Z · score: 16 (4 votes)

Alignment Newsletter #42

2019-01-22T02:00:02.082Z · score: 21 (7 votes)

Following human norms

2019-01-20T23:59:16.742Z · score: 23 (8 votes)

Reward uncertainty

2019-01-19T02:16:05.194Z · score: 18 (5 votes)

Alignment Newsletter #41

2019-01-17T08:10:01.958Z · score: 23 (4 votes)

Human-AI Interaction

2019-01-15T01:57:15.558Z · score: 18 (6 votes)

What is narrow value learning?

2019-01-10T07:05:29.652Z · score: 21 (7 votes)

Alignment Newsletter #40

2019-01-08T20:10:03.445Z · score: 21 (4 votes)

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

2019-01-08T07:12:29.534Z · score: 90 (34 votes)

AI safety without goal-directed behavior

2019-01-07T07:48:18.705Z · score: 40 (12 votes)

Will humans build goal-directed agents?

2019-01-05T01:33:36.548Z · score: 39 (10 votes)

Alignment Newsletter #39

2019-01-01T08:10:01.379Z · score: 33 (10 votes)

Alignment Newsletter #38

2018-12-25T16:10:01.289Z · score: 9 (4 votes)

Alignment Newsletter #37

2018-12-17T19:10:01.774Z · score: 26 (7 votes)

Alignment Newsletter #36

2018-12-12T01:10:01.398Z · score: 22 (6 votes)

Alignment Newsletter #35

2018-12-04T01:10:01.209Z · score: 15 (3 votes)

Coherence arguments do not imply goal-directed behavior

2018-12-03T03:26:03.563Z · score: 62 (20 votes)

Intuitions about goal-directed behavior

2018-12-01T04:25:46.560Z · score: 29 (10 votes)

Alignment Newsletter #34

2018-11-26T23:10:03.388Z · score: 26 (5 votes)

Alignment Newsletter #33

2018-11-19T17:20:03.463Z · score: 25 (7 votes)

Alignment Newsletter #32

2018-11-12T17:20:03.572Z · score: 20 (4 votes)

Future directions for ambitious value learning

2018-11-11T15:53:52.888Z · score: 42 (10 votes)

Alignment Newsletter #31

2018-11-05T23:50:02.432Z · score: 19 (3 votes)

What is ambitious value learning?

2018-11-01T16:20:27.865Z · score: 44 (13 votes)

Preface to the sequence on value learning

2018-10-30T22:04:16.196Z · score: 64 (25 votes)

Alignment Newsletter #30

2018-10-29T16:10:02.051Z · score: 31 (13 votes)

Alignment Newsletter #29

2018-10-22T16:20:01.728Z · score: 16 (5 votes)

Alignment Newsletter #28

2018-10-15T21:20:11.587Z · score: 11 (5 votes)

Alignment Newsletter #27

2018-10-09T01:10:01.827Z · score: 16 (3 votes)

Alignment Newsletter #26

2018-10-02T16:10:02.638Z · score: 14 (3 votes)

Alignment Newsletter #25

2018-09-24T16:10:02.168Z · score: 22 (6 votes)

Alignment Newsletter #24

2018-09-17T16:20:01.955Z · score: 10 (5 votes)

Alignment Newsletter #23

2018-09-10T17:10:01.228Z · score: 17 (5 votes)

Alignment Newsletter #22

2018-09-03T16:10:01.116Z · score: 15 (4 votes)

Do what we mean vs. do what we say

2018-08-30T22:03:27.665Z · score: 30 (15 votes)

Alignment Newsletter #21

2018-08-27T16:20:01.406Z · score: 26 (6 votes)

Alignment Newsletter #20

2018-08-20T16:00:04.558Z · score: 13 (6 votes)

Alignment Newsletter #19

2018-08-14T02:10:01.943Z · score: 19 (5 votes)

Alignment Newsletter #18

2018-08-06T16:00:02.561Z · score: 19 (5 votes)

Alignment Newsletter #17

2018-07-30T16:10:02.008Z · score: 35 (6 votes)