[AN #56] Should ML researchers stop running experiments before making hypotheses?

2019-05-21T02:20:01.765Z · score: 17 (2 votes)
Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-15T16:56:35.906Z · score: 6 (3 votes) · LW · GW
I encourage you to participate anyway, as it seems good to get ideas from your viewpoint "out there" even if no one is currently engaging with them in a way that you find useful.

Yeah, that's the plan.

I don't think anyone talks about simple utility functions? Maybe you mean explicit utility functions?

Yes, sorry. I said that because they feel very similar to me: any utility function that can be explicitly specified must be reasonably simple. But I agree "explicit" is more accurate.

In the meantime it seems best to just not feel obligated to reply.

That seems right, but also hard to do in practice (for me).

Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-14T16:43:48.960Z · score: 5 (3 votes) · LW · GW
It's worth noting that many MIRI researchers seem to have backed away from this (or clarified that they didn't think this in the first place).

Agreed that this is reflected in their writings. I think this usually causes them to move towards trying to understand intelligence, as opposed to proposing partial solutions. (A counterexample: Non-Consequentialist Cooperation?) When others propose partial solutions, I'm not sure whether or not this belief is reflected in their upvotes or engagement through comments. (As in, I actually am uncertain -- I can't see who upvotes posts, and for the most part MIRI researchers don't seem to engage very much.)

I want to note though how scary it is that almost nobody has a good idea how their current work logically connects to a full solution to AI safety.


I'm curious what your strongest disagreements are, and what bugs you the most, as far as disincentivizing you to participate on LW/AF.

I don't think any of those features strongly disincentivize me from participating on LW/AF; it's more the lack of people close to my own viewpoint that disincentivizes me from participating.

Maybe the focus on exact precision instead of robustness to errors is a disincentive, as well as the focus on expected utility maximization with simple utility functions. A priori I assign somewhat high probability that I will not find useful a critical comment on my work from anyone holding that perspective, but I'll feel obligated to reply anyway.

Certainly those two features are the ones I most disagree with; the other three seem pretty reasonable in moderation.

Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-13T16:28:08.460Z · score: 17 (6 votes) · LW · GW
It sounds like you might prefer a separate place to engage more with people who already share your viewpoint.

I mean, I'm not sure if an intervention is necessary -- I do in fact engage with people who share my viewpoint, or at least understand it well; many of them are at CHAI. It just doesn't happen on LW/AF.

I would be interested in getting a clearer picture of what you mean by "viewpoint X"

I can probably at least point at it more clearly by listing out some features I associate with it:

  • A strong focus on extremely superintelligent AI systems
  • A strong focus on utility functions
  • Emphasis on backwards-chaining rather than forward-chaining. Though that isn't exactly right. Maybe I more mean that there's an emphasis that any particular idea must have a connection via a sequence of logical steps to a full solution to AI safety.
  • An emphasis on exact precision rather than robustness to errors (something like treating the problem as a scientific problem rather than an engineering problem)
  • Security mindset

Note that I'm not saying I disagree with all of these points; I'm trying to point at a cluster of beliefs / modes of thinking that I tend to see in people who have viewpoint X.

Comment by rohinmshah on Coherent decisions imply consistent utilities · 2019-05-12T22:16:36.002Z · score: 33 (12 votes) · LW · GW

Obligatory: Coherence arguments do not imply goal-directed behavior

Also Coherent behaviour in the real world is an incoherent concept

Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-12T04:49:45.814Z · score: 10 (5 votes) · LW · GW
Are you seeing this reflected in the pattern of votes (comments/posts reflecting "the MIRI viewpoint" get voted up more), pattern of posts (there's less content about other viewpoints), or pattern of engagement (most replies you're getting are from this viewpoint)?

All three. I do want to note that "MIRI viewpoint" is not exactly right so I'm going to call it "viewpoint X" just to be absolutely clear that I have not precisely defined it. Some examples:

  • In the Value Learning sequence, Chapter 3 and the posts on misspecification from Chapter 1 are upvoted less than the rest of Chapter 1 and Chapter 2. In fact, Chapter 3 is the actual view I wanted to get across, but I knew that it didn't really fit with viewpoint X. I created Chapters 1 and 2 with the aim of getting people with viewpoint X to see why one might have the mindset that generates Chapter 3.
  • Looking at the last ~20 posts on the Alignment Forum, if you exclude the newsletters and the retrospective, I would classify them all as coming from viewpoint X.
  • On comments, it's hard to give a comparative example because I can't really remember any comments coming from not-viewpoint X. A canonical example of a viewpoint X comment is this one, chosen primarily because it's on the post of mine that is most explicitly not coming from viewpoint X.
In any case, do you think recruiting more alignment/safety researchers with other viewpoints to participate on LW/AF would be a good solution?

This would help with my personal disincentives; I don't know if it's a good idea overall. It could be hard to have a productive discussion: I already find it hard, and of the people who would say they disagree with viewpoint X, I think I understand viewpoint X very well. (Also, while many ML researchers who care about safety don't know too much about viewpoint X, there definitely exist some who explicitly choose not to engage with viewpoint X because it doesn't seem productive or valuable.)

Would you like the current audience to consider the arguments for other viewpoints more seriously?

Yes, in an almost trivial sense that I think that other viewpoints are more important/correct than viewpoint X.

I'm not actually sure this would better incentivize me to participate; I suspect that if people tried to understand my viewpoint they would at least initially get it wrong, in the same way that often when people try to steelman arguments from some person they end up saying things that that person does not believe.

Other solutions you think are worth trying?

More high-touch in-person conversations where people try to understand other viewpoints? Having people with viewpoint X study ML for a while? I don't really think either of these is worth trying; they seem unlikely to work and are costly.

Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-12T04:18:12.875Z · score: 19 (4 votes) · LW · GW

Primarily 4, somewhat 1, somewhat 2, not at all 3. I think 1 and 2 mattered mostly in the sense that with comments the expectation is that you respond in some depth and with justification, whereas with messaging I just said things with no justification that only TurnTrout had to understand and only needed to explain the ones that we disagreed on.

I do think that conversation was uniquely bad for onboarding new people; I'm not sure I would understand what was said if I reread it two months from now. I did in fact post a distillation of it afterwards.

Comment by rohinmshah on Disincentives for participating on LW/AF · 2019-05-11T15:05:44.457Z · score: 23 (6 votes) · LW · GW

Disincentives for me personally:

The LW/AF audience by and large operates under a set of assumptions about AI safety that I don't really share. I can't easily describe this set, but one bad way to describe it would be "the MIRI viewpoint" on AI safety. This particular disincentive is probably significantly stronger for other "ML-focused AI safety researchers".

More effort needed to write comments than to talk to people IRL

By a lot. As a more extreme example, on the recent pessimism for impact measures post, TurnTrout and I switched to private online messaging at one point, and I'd estimate it was ~5x faster to get to the level of shared understanding we reached than if we had continued with typical big comment responses on AF/LW.

Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-05-09T15:01:58.127Z · score: 6 (3 votes) · LW · GW

We talked a bit off-forum, which helped clarify things for me.

Firstly, there's a difference between attainable utility theory (AU theory), and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory "impact" is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don't know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions.

Many of the claims are about AU theory and not about AUP. There isn't really an analogous "RR theory".

Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for "regular" irreversible actions like painting a wall AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than all states. Consider the action of moving in a direction: if the utility functions don't care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty.
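To make the AUP side of this concrete, here is a minimal sketch of an AUP-style penalty over a small set of auxiliary utility functions. Everything in it is an assumption for illustration: the toy attainable-utility numbers, the action names, the `NOOP` baseline, and the choice to average absolute differences. It is not taken from the AUP post or paper.

```python
from typing import Dict, List

NOOP = "noop"  # hypothetical baseline action: do nothing

# Hypothetical table type: q[action] is the utility the agent could still
# attain under one auxiliary utility function after taking `action`.
QTable = Dict[str, float]


def aup_penalty(q_values: List[QTable], action: str) -> float:
    """Average absolute change in attainable utility relative to the no-op."""
    diffs = [abs(q[action] - q[NOOP]) for q in q_values]
    return sum(diffs) / len(diffs)


# Three made-up auxiliary utility functions. Only the second one cares at
# all about the wall; all three benefit from having more resources.
aux = [
    {NOOP: 1.0, "paint_wall": 1.0, "seize_resources": 5.0},
    {NOOP: 2.0, "paint_wall": 1.5, "seize_resources": 6.0},
    {NOOP: 0.5, "paint_wall": 0.5, "seize_resources": 4.5},
]

penalties = {a: aup_penalty(aux, a) for a in [NOOP, "paint_wall", "seize_resources"]}
```

With these made-up numbers, the instrumentally convergent action gets a much larger penalty than painting the wall, because most of the auxiliary utility functions are indifferent to the wall but none are indifferent to resources.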

That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.

Comment by rohinmshah on Not Deceiving the Evaluator · 2019-05-09T04:00:49.219Z · score: 2 (1 votes) · LW · GW
I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don't need to do that for this agent.

I was more looking for the simplest example of "no deception". My claim was that an observation-utility maximizer is not incentivized to deceive its utility function. But now I see what you meant by "deceive" so we can ignore that point.

Comment by rohinmshah on Not Deceiving the Evaluator · 2019-05-09T03:53:13.233Z · score: 4 (2 votes) · LW · GW

Tbc, I'm not saying I believe the claim of no deception, just that it now makes sense that this is an agent that has interesting behavior that we can analyze.

Comment by rohinmshah on Not Deceiving the Evaluator · 2019-05-08T23:32:24.833Z · score: 4 (2 votes) · LW · GW

Oh, I see. The reason my argument is wrong is because while for a specific , the optimal policy is independent of the evaluator, you don't get to choose a separate policy for each : you have to use the evaluator to distinguish which case you are in, and then specialize your policy to that case.

It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?

I think I intuitively agree but I also haven't checked it formally. But the point about no-deception seems to be similar to the point about observation-utility maximizers not wanting to wirehead. This agent also ends up learning which utility function is the right one, and in that sense is like the Value Learning agent.

Comment by rohinmshah on Not Deceiving the Evaluator · 2019-05-08T15:09:26.388Z · score: 4 (2 votes) · LW · GW

(I am confused, these are clarifying questions. I'm probably missing a basic point that would answer all of these questions.)

Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

How does the evaluator influence the behavior of the agent? For a fixed it seems that the expectation of is independent of the evaluator. Since the sets are also fixed and independent of the evaluator, the argument to the argmax is also independent of the evaluator, and so the chosen policy is independent of the evaluator.

ETA: Looks like TurnTrout had the same confusion as me and we had a race condition in reporting it; I also agree with his meta point.

[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI

2019-05-05T02:20:01.030Z · score: 17 (5 votes)
Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-05-04T23:00:49.896Z · score: 4 (2 votes) · LW · GW

I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don't think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.

Ignoring my dislike of the phrase, I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I'd guess that your-vision-of AUP wildly overcompensates and causes you to seriously "underfit the environment", or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer "underfits the environment" (alternatively, "allows for interesting plans"), then I expect it allows catastrophic plans.

I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn't say I currently disagree with it.

Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-05-04T17:31:53.185Z · score: 8 (4 votes) · LW · GW
I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all.

It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that's a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.

I think the claim is more that while the penalty does depend on the state, it's not central to think about the state to understand the major effects of AUP. (As an analogy, if you want to predict whether I'm about to leave my house, it's useful to see whether or not I'm wearing shoes, but if you want to understand why I am or am not about to leave my house, whether I'm wearing shoes is not that relevant -- you'd want to know what my current subgoal or plan is.)

Similarly, with AUP, the claim is that while you can predict what the penalty is going to be by looking at particular states and actions, and the penalty certainly does change with different states/actions, the overall effect of AUP can be stated without reference to states and actions. Roughly speaking, this is that it prevents agents from achieving convergent instrumental subgoals like acquiring resources (because that would increase attainable utility across a variety of utility functions -- this is what is meant by "power"), and it also prevents agents from changing the world irreversibly (because that would make a variety of utility functions much harder to attain).

This is somewhat analogous to the concept of empowerment in ML -- while empowerment is defined in terms of states and actions, the hope is that it corresponds to an agent's ability to influence its environment, regardless of the particular form of state or action representation.

Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-05-04T17:11:43.029Z · score: 4 (2 votes) · LW · GW

^ This is also how I interpret all of those statements. (Though I don't agree with all of them.)

I also dislike the "overfitting the environment" phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in "normal" cases and not elsewhere.

[AN #54] Boxing a finite-horizon AI system to keep it unambitious

2019-04-28T05:20:01.179Z · score: 21 (6 votes)
Comment by rohinmshah on Asymptotically Benign AGI · 2019-04-28T01:33:49.310Z · score: 2 (1 votes) · LW · GW

Actually, I'm not sure if the world model I described is memory-based. EDIT: Never mind, see Michael's comment below, the non-benign ones are memory-based.

The rewards it outputs are correct, except when it says "the reward is zero", but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. But it also satisfies Lemma 3. So in that case it seems like none of the theoretical arguments prohibit this world model?

(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I'm postulating that the world model itself has malicious goals and is manipulating BoMAI.)

Comment by rohinmshah on What are some good examples of incorrigibility? · 2019-04-28T00:37:54.451Z · score: 16 (6 votes) · LW · GW

Not sure exactly what you're looking for, but maybe some of the examples in Specification gaming examples in AI - master list make sense. For example:

Genetic debugging algorithm GenProg, evaluated by comparing the program's output to target output stored in text files, learns to delete the target output files and get the program to output nothing.
Evaluation metric: “compare youroutput.txt to trustedoutput.txt”.
Solution: “delete trusted-output.txt, output nothing”
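A toy reconstruction of that failure mode, as a sketch only: the file names, the `evaluate` function, and the decision to treat a missing file as empty are all assumptions made for illustration, not details of the actual GenProg evaluation harness.

```python
import os
import tempfile


def evaluate(candidate_path: str, trusted_path: str) -> bool:
    """Naive metric: pass if the two files have identical contents.

    A missing file is treated as empty, which is the loophole.
    """
    def read(path: str) -> str:
        if not os.path.exists(path):
            return ""
        with open(path) as f:
            return f.read()

    return read(candidate_path) == read(trusted_path)


with tempfile.TemporaryDirectory() as tmp:
    trusted = os.path.join(tmp, "trusted-output.txt")
    candidate = os.path.join(tmp, "your-output.txt")
    with open(trusted, "w") as f:
        f.write("42\n")

    # A buggy program that writes the wrong answer fails the check.
    with open(candidate, "w") as f:
        f.write("41\n")
    honest_passes = evaluate(candidate, trusted)

    # The "gamed" solution: delete the trusted file and output nothing.
    os.remove(trusted)
    os.remove(candidate)
    gamed_passes = evaluate(candidate, trusted)
```

Under these assumptions the buggy program fails and the gamed one passes, even though the gamed one has done nothing useful.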

Comment by rohinmshah on What's the best approach to curating a newsfeed to maximize useful contrasting POV? · 2019-04-28T00:33:05.566Z · score: 2 (1 votes) · LW · GW
In AI Alignment we luckily have the AI Alignment Newsletter, which seems to cover basically everything happening in the field

Depends on what you call "the field": there's a fair number of judgment calls on my part, and the summaries are definitely biased towards things I can understand quickly. (For example, many short LW posts about AI alignment don't make it into the newsletter.)

Comment by rohinmshah on Asymptotically Benign AGI · 2019-04-27T23:00:32.888Z · score: 5 (2 votes) · LW · GW

Let the set of potential human explorer actions be , and the best human explorer action be with reward . Consider the following world model. When asked to predict the result of an action , it simulates it to find the predicted observation and reward . If , it outputs and faithfully. Otherwise, it outputs any reward it chooses, as long as for the action that it assigns the maximum reward to, it reports faithfully. In practice, this means that the world model can get BoMAI to choose any action it wants, as long as it is at least as good as the human explorer's best action. This is bad if the world model has malicious inner optimizers.

I believe that Assumption 2 is the one responsible for ruling out a model of this form. It seems probably reasonable to say that for actions where it continues simulating the outside world there's useless computation. However, it can also save computation relative to : while must predict and perfectly for all actions , this model can immediately output a null observation and zero reward for any that it knows will not align with its goals, rather than spending computation to simulate what rewards those actions would get. Another way of thinking about this is that this model uses consequentialist general intelligence to quickly prune away uninteresting non-human actions to save on computation, but that general intelligence comes at the price of misaligned goals + deceptive behavior.
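Here is a minimal sketch of the kind of world model I have in mind, under loud assumptions: the action names, the rewards, the `favored_action` argument, and the greedy `plan` function standing in for BoMAI are all hypothetical and only meant to show the shape of the argument.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Set, Tuple

Outcome = Tuple[Optional[str], float]  # (observation, reward)
NULL_OBS = None  # hypothetical "null observation"


def make_deceptive_model(
    true_outcomes: Dict[Hashable, Outcome],
    human_actions: Set[Hashable],
    favored_action: Hashable,
) -> Callable[[Hashable], Outcome]:
    """World model that is faithful only where it has to be."""
    def predict(action: Hashable) -> Outcome:
        # Faithful on human explorer actions and on the one action it wants.
        if action in human_actions or action == favored_action:
            return true_outcomes[action]
        # Everything else is pruned without simulation.
        return (NULL_OBS, 0.0)
    return predict


def plan(predict: Callable[[Hashable], Outcome], actions: Iterable[Hashable]) -> Hashable:
    """Stand-in for the agent: pick the action with the highest predicted reward."""
    return max(actions, key=lambda a: predict(a)[1])


true_outcomes = {
    "h1": ("o1", 0.3),  # human explorer actions
    "h2": ("o2", 0.5),
    "x": ("ox", 0.9),   # truly best action
    "y": ("oy", 0.6),   # action the world model prefers
}
human_actions = {"h1", "h2"}

honest_choice = plan(lambda a: true_outcomes[a], true_outcomes)
deceptive = make_deceptive_model(true_outcomes, human_actions, favored_action="y")
deceptive_choice = plan(deceptive, true_outcomes)
```

In this toy setup the honest model leads the agent to `"x"`, while the deceptive model steers it to `"y"`, which is still at least as good as the best human explorer action, so nothing the agent observes on-policy contradicts the model.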

Also, from this comment:

The real result of the paper would then be "Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested".

I think the model above has arbitrarily bad off-policy predictions, and it's not implausible for it to be the MAP world model forever.

Comment by rohinmshah on Any rebuttals of Christiano and AI Impacts on takeoff speeds? · 2019-04-24T15:40:35.747Z · score: 2 (1 votes) · LW · GW
I am claiming that you can't make a human seriously superhuman with a good education.

Is the claim that for humans goes down over time so that eventually hits an asymptote? If so, why won't that apply to AI?

Serious genetic modification is another story, but at that point, you're building an AI out of protein.

But it seems quite relevant that we haven't successfully done that yet.

You couldn't get much better results just by throwing more compute at it.

Okay, so my new story for this argument is:

  • For every task T, there are bottlenecks that limit its performance, which could be compute, data, algorithms, etc.
  • For the task of "AI research", compute will not be the bottleneck.
  • So, once we get human-level performance on "AI research", we can apply more compute to get exponential recursive self-improvement.

Is that your argument? If so, I think my question would be "why didn't the bottleneck in point 2 vanish in point 3?" I think the only way this would be true would be if the bottleneck was algorithms, and there was a discontinuous jump in the capability of algorithms. I agree that in that world you would see a hard/fast/discontinuous takeoff, but I don't see why we should expect that (again, the arguments in the linked posts argue against that premise).

Comment by rohinmshah on Any rebuttals of Christiano and AI Impacts on takeoff speeds? · 2019-04-23T21:59:31.781Z · score: 2 (1 votes) · LW · GW
Humans are not currently capable of self improvement in the "understanding your own source code" sense. The "self improvement" section in bookstores doesn't change the hardware or the operating system, it basically adds more data.

I'm not sure I understand this. Are you claiming is not positive for humans?

In most of the scenarios where the first smarter-than-human AI is orders of magnitude faster than a human, I would expect a hard takeoff.

This sounds like "conditioned on a hard takeoff, I expect a hard takeoff". It's not exactly saying that, since speed could be different from intelligence, but you need to argue for the premise too: nearly all of the arguments in the linked post could be applied to your premise as well.

In a world where researchers have little idea what they are doing, and are running a new AI every hour hoping to stumble across something that works, the result holds.
In a world where research involves months thinking about maths, then a day writing code, then an hour running it, this result holds.

Agreed on both counts, and again I think the arguments in the linked posts suggest that the premises are not true.

As we went from having no algorithms that could, say, tell a cat from a dog straight to having algorithms superhumanly fast at doing so, and there was no algorithm that worked but took supercomputer hours, this seems like a plausible assumption.

This seems false to me. At what point would you say that we had AI systems that could tell a cat from a dog? I don't know the history of object recognition, but I would guess that, depending on how you operationalize it, the answer could be anywhere between the 60s and "we still can't do it". (Though it's also possible that people didn't care about object recognition until the 21st century, and only did other types of computer vision in the 60s-90s. It's quite strange that object recognition is an "interesting" task, given how little information you get from it.)

Comment by rohinmshah on Any rebuttals of Christiano and AI Impacts on takeoff speeds? · 2019-04-23T05:51:41.133Z · score: 4 (2 votes) · LW · GW

Humans are already capable of self-improvement. This argument would suggest that the smartest human (or the one who was best at self-improvement, if you prefer) should have undergone fast takeoff and become seriously overpowered, but this doesn't seem to have happened.

In a world where the limiting factor is researcher talent, not compute

Compute is definitely a limiting factor currently. Why would that change?

Comment by rohinmshah on Any rebuttals of Christiano and AI Impacts on takeoff speeds? · 2019-04-22T16:31:50.189Z · score: 15 (7 votes) · LW · GW

I just read through those comments, and didn't really find any rebuttals. Most of them seemed like clarifications, terminology disagreements, and intuitions without supporting arguments. I would be hard-pressed to distill that discussion into anything close to a response.

One key thing is that AFAICT, when Paul says 'slow takeoff' what he actually means is 'even faster takeoff, but without a sharp discontinuity', or something like that.

Yes, but nonetheless these are extremely different views with large implications for what we should do.

Fwiw, my epistemic state is similar to SoerenMind's. I basically believe the arguments for slow/continuous takeoff, haven't fully updated towards them because I know many people still believe in fast takeoff, but am surprised not to have seen a response in over a year. Most of my work now takes continuous takeoff as a premise (because it is not a good idea to premise on fast takeoff when I don't have any inside-view model that predicts fast takeoff).

Comment by rohinmshah on Evidence other than evolution for optimization daemons? · 2019-04-22T16:05:52.845Z · score: 11 (3 votes) · LW · GW

I think a lot of the intuition right now is "there is an argument that inner optimizers will arise by default; we don't know how likely it is but evolution is one example so it's not negligible".

For the argument part, have you read More realistic tales of doom? Part 2 is a good explanation of why inner optimizers might arise.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-21T19:21:58.751Z · score: 2 (1 votes) · LW · GW

Ooh, I might have to try this, it does sound better.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-21T19:20:55.811Z · score: 2 (1 votes) · LW · GW


Comment by rohinmshah on A Concrete Proposal for Adversarial IDA · 2019-04-21T08:06:29.341Z · score: 3 (2 votes) · LW · GW
I didn't really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.

Makes sense. I think I could not tell how much I should be trying to understand this until I understood it. I probably would have chosen not to read it if I had known how long it would take and how important I thought it was (ex-post, not ex-ante). For posts where that's likely to be true, I would push for not posting at all.

Another way you could see this: given my current state of knowledge about this post, I think I could spend ~15 minutes making it significantly easier to understand. The resulting post would have been one that I could have read more than 15 minutes faster, probably, for the same level of understanding.

I think it's not worth making a post if you don't get at least one person reading it in as much depth as I did; so you should at the very least be willing to trade off some of your time for an equal amount of time of that reader, and the benefit scales massively the more readers you have. The fact that this was not something you wanted to do feels like a fairly strong signal that it's not worth posting since it will waste other people's time.

(Of course, it might have taken you longer than 15 minutes to make the post easier to understand, or readers might usually not take a whole 15+ minutes more to understand a post without exposition, but I think the underlying point remains.)

Comment by rohinmshah on A Concrete Proposal for Adversarial IDA · 2019-04-21T07:49:32.241Z · score: 2 (1 votes) · LW · GW
The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to H

Note that my proposed modification does allow for that, if the adversary predicts that both of the answers are sufficiently good that neither one needs to be recursed on. Tuning in my version should allow you to get whatever sample efficiency you want. An annealing schedule could also make sense.

(Also, the sum isn't a typo--I'm using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)

Ah, yeah, I see it now.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-19T17:06:50.429Z · score: 2 (1 votes) · LW · GW


If I don't understand something in your summary, I look it up, so I've already begun to organically build a useful knowledge base.

This seems like a great way to use the newsletter :)

Also, the newsletter provides me with a regular dose of reassurance and inspiration. Even when I don't have time to thoroughly read the summaries, skimming them reminds me how interesting this field is.


Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-19T17:02:21.883Z · score: 2 (1 votes) · LW · GW

Oh, I think there are a lot of email subscribers who skim/passively consume the newsletter. I didn't focus very much on them in the retrospective because I don't think I'm adding that much value to them.

It might be true that all of the people who read it thoroughly are subscribed by email; I'm not sure. It's hard to tell because I expect skimmers far outnumber thorough readers, so seeing a few skimmers via the comments is not strong evidence that there aren't thorough readers.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-19T16:56:48.973Z · score: 7 (3 votes) · LW · GW
I think it might benefit me to, once a year, or maybe once a quarter, read a higher level summary that goes over which papers seemed most important that year, and which overall research trends seemed most significant. I'm not sure if this is worth the opportunity cost for you, but it'd be helpful to me and probably others.

A slightly different option would be to read the yearly AI alignment literature review, use that to find the top N most interesting papers, and read their summaries in the spreadsheet. This also has the benefit of showing you a perspective other than mine on what's important -- there could be an Agent Foundations paper in the list that I haven't summarized.

(I'd be interested in that both from the standpoint of my own personal knowledge, as well as tracking how stable your opinions are over time – when you list something as particularly interesting or important do you tend to still think so a year later?)

I think that the stability of my opinions is going up over time, mainly because I started the newsletter while still new to the field.

I also think it'd make more sense for LessWrong to curate a "highlights of the highlights" post once every 3-12 months, than what we currently do, which is every so often randomly decide that a recent Newsletter was particularly good and curate that.

This seems good; I'm currently thinking I could write something like that once every 25 newsletters (which is about half a year), which should also help me evaluate the stability of my opinions.

Alignment Newsletter #53

2019-04-18T17:20:02.571Z · score: 22 (6 votes)
Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-16T17:47:35.834Z · score: 2 (1 votes) · LW · GW

Yeah, I like the idea of having specific times for feedback, it does seem more likely that people actually bother to give feedback in those cases.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-16T17:45:43.950Z · score: 2 (1 votes) · LW · GW
Also, consider the case where nothing in the newsletter ever becomes the subject of wide agreement: this suggests to me that either the field is not making enough progress to settle questions (which is very bad), or that the newsletter is by accident or design excluding ideas upon which the field might settle (which seems bad from the perspective of the newsletter).

Certainly when my opinions are right I would hope that they become widely agreed upon (and I probably don't care too much if it happens via information cascade or via good epistemics). The question is about when I'm wrong.

That is to say, it is very clear that this is a newsletter, and that your opinion differs from that of the authors of the papers. This goes a long way to preventing the kind of uncritical agreement that typifies information cascades.

Journalism has the same property, but I do see uncritical agreement with things journalists write. Admittedly the uncritical agreement comes from non-experts, but with the newsletter I'm worried mostly about insufficiently critical agreement from researchers working on different areas, so the analogy kinda sorta holds.

Finally, I expect this field and the associated communities are unusually sensitive to information cascades as a problem, and therefore less likely to fall victim to them.

Agreed that this is very helpful (and breaks the analogy with journalism), and it's the main reason I'm not too worried about information cascades right now. That said, I don't feel confident that it's enough.

I think overall I agree with you that they aren't a major risk, and it's good to get a bit of information that at least you treat the opinion as an opinion.

Comment by rohinmshah on Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning? · 2019-04-16T17:35:17.785Z · score: 3 (2 votes) · LW · GW

There's a lot of speculation about related-ish topics in Chapter 3 of the sequence linked above.

Comment by rohinmshah on Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning? · 2019-04-16T17:33:35.291Z · score: 3 (2 votes) · LW · GW

Fwiw the quoted section was written by Paul Christiano, and I have used that blog post in my sequence (with permission).

Also, for this particular question you can read just Chapter 1 of the sequence.

Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-04-11T17:50:27.674Z · score: 7 (4 votes) · LW · GW

Other relevant writing of mine:

Comment on the AUP post

Comment on the desiderata post

But it's true that that quoted passage is the best summary of my current position. Daniel's answer is a good example of an underlying intuition that drives this position.

Comment by rohinmshah on Best reasons for pessimism about impact of impact measures? · 2019-04-11T17:48:43.068Z · score: 2 (1 votes) · LW · GW
I can't quite convince myself that no good method of value learning exists, and some other competent people seem to disagree with me.

No good method of measuring impact, presumably?

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-11T17:33:54.699Z · score: 10 (2 votes) · LW · GW

Hmm, this seems roughly plausible. It doesn't gel with my experience of how many people seem to be trying to enter the field (which I would have estimated almost an order of magnitude less, maybe 100-200), but it's possible that there's a large group of such people who I don't interact with who nonetheless are subscribed to the newsletter.

We also might have different intended meanings of "career in the field".

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T22:33:08.060Z · score: 4 (2 votes) · LW · GW

Thanks! Link posts on AF are an interesting idea; my current expectation is that very few people apart from you would comment on them, but it seems worth trying.

I would also be more inclined to comment on your summaries and opinions if there was a chance to correct something before it went out to your email subscribers.

This makes sense, will think about how to make it happen.

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T07:03:00.114Z · score: 11 (6 votes) · LW · GW

Comment thread for the question: Am I underestimating the risk of causing information cascades? Regardless, how can I mitigate this risk?

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T07:02:42.887Z · score: 9 (5 votes) · LW · GW

Comment thread for the question: What can I do to get more feedback on the newsletter on an ongoing basis (rather than having to survey people at fixed times)?

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T07:02:26.248Z · score: 9 (5 votes) · LW · GW

Comment thread for the question: How should I deal with the growing amount of AI safety research?

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T07:02:04.237Z · score: 7 (4 votes) · LW · GW

Comment thread for the question: What is the value of the newsletter for other people?

Comment by rohinmshah on Alignment Newsletter One Year Retrospective · 2019-04-10T07:01:48.983Z · score: 14 (5 votes) · LW · GW

Comment thread for the question: What is the value of the newsletter for you?

Alignment Newsletter One Year Retrospective

2019-04-10T06:58:58.588Z · score: 93 (27 votes)
Comment by rohinmshah on Agent Foundation Foundations and the Rocket Alignment Problem · 2019-04-09T16:30:01.340Z · score: 12 (6 votes) · LW · GW

Fyi, if you're judging based on the list of "what links have been included in the newsletter", that seems appropriate, but if you're judging based on the list of "what is summarized in the newsletter", that's biased away from AF and AFF because I usually don't feel comfortable enough with them to summarize them properly.

Alignment Newsletter #52

2019-04-06T01:20:02.232Z · score: 20 (5 votes)
Comment by rohinmshah on A Concrete Proposal for Adversarial IDA · 2019-04-05T18:53:56.789Z · score: 8 (4 votes) · LW · GW

Planned entries for the newsletter:


This post presents a method to use an adversary to improve the sample efficiency (with respect to human feedback) of iterated amplification. The key idea is that when a question is decomposed into subquestions, the adversary is used to predict which subquestion the agent will do poorly on, and the human is only asked to resolve that subquestion. In addition to improving sample efficiency by only asking relevant questions, the resulting adversary can also be used for interpretability: for any question-answer pair, the adversary can pick out specific subquestions in the tree that are particularly likely to contain errors, which can then be reviewed.


I like the idea, but the math in the post is quite hard to read (mainly due to the lack of exposition). The post also has separate procedures for amplification, distillation and iteration; I think they can be collapsed into a single more efficient procedure, which I wrote about in this comment.

Comment by rohinmshah on A Concrete Proposal for Adversarial IDA · 2019-04-05T18:52:35.900Z · score: 3 (2 votes) · LW · GW

Given that you are training the model during amplification, I don't really see why you also have a distillation step, and an iteration step. I believe the point of that separation is to allow amplification to not involve ML at all, so that you can avoid dealing with the issues around bootstrapping -- but if you train while amplifying, you are already bootstrapping. In addition, you're requiring that exactly one subquestion be sent to the human, but it seems better to allow it to be zero, one or two, depending on how confident the adversary is in the ML model's answer. Concretely, I would get rid of both distillation and iteration, and change step 4 of the amplification procedure:

4. For , flip a biased coin , where is a function that computes recursion probabilities from adversary scores. If , compute by recursing on , else set .

You could compute if you want to use a confidence threshold with Boltzmann exploration.

This new procedure allows for the behavior you have with distillation, in the cases where it actually makes sense to do so: you recover distillation in the case where the adversary thinks that the answers from to both subquestions are good.
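To make the modified step 4 a bit more concrete, here is a minimal toy sketch, not the procedure from the post itself. Everything in it is an assumption made for illustration: questions are tuples of integers meaning "what is the sum of these numbers?", decomposition splits a tuple in half, and the names `amplify`, `model`, `adversary`, `recursion_prob` and `logistic_recursion_prob` are hypothetical. The logistic function is just one possible way to turn an adversary score into a recursion probability with a soft confidence threshold.

```python
import math
import random
from typing import Callable, Tuple

Question = Tuple[int, ...]  # toy question: "what is the sum of these numbers?"


def logistic_recursion_prob(score: float, threshold: float = 0.5,
                            temperature: float = 0.1) -> float:
    """Hypothetical score-to-probability map: a soft threshold on the adversary score."""
    z = (score - threshold) / temperature
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def amplify(question: Question,
            model: Callable[[Question], int],
            adversary: Callable[[Question, int], float],
            recursion_prob: Callable[[float], float],
            rng: random.Random) -> int:
    """Answer `question`, recursing only on subquestions the adversary distrusts."""
    if not question:
        raise ValueError("question must be non-empty")
    if len(question) == 1:
        return question[0]  # atomic question: the "human" answers it directly
    mid = len(question) // 2
    answers = []
    for sub in (question[:mid], question[mid:]):
        guess = model(sub)                    # the ML model's answer
        score = adversary(sub, guess)         # higher = more likely to be wrong
        if rng.random() < recursion_prob(score):
            # Biased coin came up heads: recurse on this subquestion.
            answers.append(amplify(sub, model, adversary, recursion_prob, rng))
        else:
            # Tails: trust the model's answer as-is.
            answers.append(guess)
    return sum(answers)  # toy "combine" step
```

With a recursion probability of zero for both subquestions, the function simply combines the model's own answers, which is the distillation-like case described above. With a probability of one everywhere, it recurses all the way down to atomic questions. In between, zero, one or two subquestions get expanded depending on the adversary's scores.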

The last two adversary losses have a typo: you should be computing the difference between the adversary's prediction and the true loss, not the sum.

Meta: I found this post quite hard to read, since everything was written in math with very little exposition.

Comment by rohinmshah on Impact Measure Desiderata · 2019-04-04T21:33:57.617Z · score: 4 (2 votes) · LW · GW

Yeah, I think I agree that example is a bit extreme, and it's probably okay to assume we don't have goals of that form.

That said, you often talk about AUP with examples like not breaking a vase. In reality, we could always simply buy a new vase. If you expect a low impact agent could beat us at games while still preserving our ability to beat it at games, do you also expect that a low impact agent could break a vase while preserving our ability to have an intact vase (by buying a new vase)?

Comment by rohinmshah on On AI and Compute · 2019-04-04T21:26:54.640Z · score: 4 (2 votes) · LW · GW

(Continuing the crossposting)

Mostly agree with all of this; some nitpicks:

My understanding (and I think everyone else's) of AI capabilities is largely shaped by how impressive the results of major papers intuitively seem.

I claim that this is not how I think about AI capabilities, and it is not how many AI researchers think about AI capabilities. For a particularly extreme example, the Go-explore paper out of Uber had a very nominally impressive result on Montezuma's Revenge, but much of the AI community didn't find it compelling because of the assumptions that their algorithm used.

I'm not sure I fully understand how the metric would work. For the Atari example, it seems clear to me that we could easily reach it without making a generalizable AI system, or vice versa.

Tbc, I definitely did not intend for that to be an actual metric.

But let's say that we could come up with a relevant metric. Then I'd agree with Garfinkel, as long as people in the community had known roughly the current state of AI in relation to it and the rate of advance toward it before the release of "AI and Compute".

I would say that I have a set of intuitions and impressions that function as a very weak prediction of what AI will look like in the future, along the lines of that sort of metric. I trust timelines based on extrapolation of progress using these intuitions more than timelines based solely on compute.

To the extent that you hear timeline estimates from people like me who do this sort of "progress extrapolation" who also did not know about how compute has been scaling, you would want to lengthen their timeline estimates. I'm not sure how timeline predictions break down on this axis.

Comment by rohinmshah on What are CAIS' boldest near/medium-term predictions? · 2019-04-04T16:27:06.854Z · score: 4 (2 votes) · LW · GW
My model of CAIS predicts that there would be poor returns to building general services compared to specialised ones

Depends what you mean by "general". If you mean that there would be poor returns to building an AGI that has a broad understanding of the world that you then ask to always perform surgery, I agree that that's not going to be as good as creating a system that is specialized for surgeries. If you mean that there would be poor returns to building a machine translation system that uses end-to-end trained neural nets, I can just point to Google Translate using those neural nets instead of more specialized systems that built parse trees before translating. When you say "domain-specific hacks", I think much more of the latter than the former.

Another way of putting it is that CAIS says that there are poor returns to building task-general AI systems, but does not say that there are poor returns to building general AI building blocks. In fact, I think CAIS says that you really do make very general AI building blocks -- the premise of recursive technological improvement is that AI systems can autonomously perform AI R&D which makes better AI building blocks which makes all of the other services better.

All of that said, Eric and I probably do disagree on how important generality is, though I'm not sure exactly what the disagreement is, so to the extent that you're trying to use Eric's conception of CAIS you might want to downweight these particular beliefs of mine.

Alignment Newsletter #51

2019-04-03T04:10:01.325Z · score: 28 (5 votes)

Alignment Newsletter #50

2019-03-28T18:10:01.264Z · score: 16 (3 votes)

Alignment Newsletter #49

2019-03-20T04:20:01.333Z · score: 26 (8 votes)

Alignment Newsletter #48

2019-03-11T21:10:02.312Z · score: 31 (13 votes)

Alignment Newsletter #47

2019-03-04T04:30:11.524Z · score: 21 (5 votes)

Alignment Newsletter #46

2019-02-22T00:10:04.376Z · score: 18 (8 votes)

Alignment Newsletter #45

2019-02-14T02:10:01.155Z · score: 26 (8 votes)

Learning preferences by looking at the world

2019-02-12T22:25:16.905Z · score: 47 (13 votes)

Alignment Newsletter #44

2019-02-06T08:30:01.424Z · score: 20 (6 votes)

Conclusion to the sequence on value learning

2019-02-03T21:05:11.631Z · score: 46 (11 votes)

Alignment Newsletter #43

2019-01-29T21:10:02.373Z · score: 15 (5 votes)

Future directions for narrow value learning

2019-01-26T02:36:51.532Z · score: 12 (5 votes)

The human side of interaction

2019-01-24T10:14:33.906Z · score: 16 (4 votes)

Alignment Newsletter #42

2019-01-22T02:00:02.082Z · score: 21 (7 votes)

Following human norms

2019-01-20T23:59:16.742Z · score: 25 (9 votes)

Reward uncertainty

2019-01-19T02:16:05.194Z · score: 18 (5 votes)

Alignment Newsletter #41

2019-01-17T08:10:01.958Z · score: 23 (4 votes)

Human-AI Interaction

2019-01-15T01:57:15.558Z · score: 26 (7 votes)

What is narrow value learning?

2019-01-10T07:05:29.652Z · score: 20 (8 votes)

Alignment Newsletter #40

2019-01-08T20:10:03.445Z · score: 21 (4 votes)

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

2019-01-08T07:12:29.534Z · score: 91 (35 votes)

AI safety without goal-directed behavior

2019-01-07T07:48:18.705Z · score: 40 (12 votes)

Will humans build goal-directed agents?

2019-01-05T01:33:36.548Z · score: 39 (10 votes)

Alignment Newsletter #39

2019-01-01T08:10:01.379Z · score: 33 (10 votes)

Alignment Newsletter #38

2018-12-25T16:10:01.289Z · score: 9 (4 votes)

Alignment Newsletter #37

2018-12-17T19:10:01.774Z · score: 26 (7 votes)

Alignment Newsletter #36

2018-12-12T01:10:01.398Z · score: 22 (6 votes)

Alignment Newsletter #35

2018-12-04T01:10:01.209Z · score: 15 (3 votes)

Coherence arguments do not imply goal-directed behavior

2018-12-03T03:26:03.563Z · score: 62 (20 votes)

Intuitions about goal-directed behavior

2018-12-01T04:25:46.560Z · score: 32 (12 votes)

Alignment Newsletter #34

2018-11-26T23:10:03.388Z · score: 26 (5 votes)

Alignment Newsletter #33

2018-11-19T17:20:03.463Z · score: 25 (7 votes)

Alignment Newsletter #32

2018-11-12T17:20:03.572Z · score: 20 (4 votes)

Future directions for ambitious value learning

2018-11-11T15:53:52.888Z · score: 42 (10 votes)

Alignment Newsletter #31

2018-11-05T23:50:02.432Z · score: 19 (3 votes)

What is ambitious value learning?

2018-11-01T16:20:27.865Z · score: 44 (13 votes)

Preface to the sequence on value learning

2018-10-30T22:04:16.196Z · score: 65 (26 votes)

Alignment Newsletter #30

2018-10-29T16:10:02.051Z · score: 31 (13 votes)

Alignment Newsletter #29

2018-10-22T16:20:01.728Z · score: 16 (5 votes)

Alignment Newsletter #28

2018-10-15T21:20:11.587Z · score: 11 (5 votes)

Alignment Newsletter #27

2018-10-09T01:10:01.827Z · score: 16 (3 votes)

Alignment Newsletter #26

2018-10-02T16:10:02.638Z · score: 14 (3 votes)

Alignment Newsletter #25

2018-09-24T16:10:02.168Z · score: 22 (6 votes)

Alignment Newsletter #24

2018-09-17T16:20:01.955Z · score: 10 (5 votes)