## Posts

[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning 2019-09-16T17:10:02.103Z · score: 11 (5 votes)
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence 2019-09-10T19:10:01.174Z · score: 24 (8 votes)
[AN #62] Are adversarial examples caused by real but imperceptible features? 2019-08-22T17:10:01.959Z · score: 28 (11 votes)
Call for contributors to the Alignment Newsletter 2019-08-21T18:21:31.113Z · score: 39 (12 votes)
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z · score: 68 (28 votes)
[AN #61] AI policy and governance, from two people in the field 2019-08-05T17:00:02.048Z · score: 11 (5 votes)
[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode 2019-07-22T17:00:01.759Z · score: 25 (10 votes)
[AN #59] How arguments for AI risk have changed over time 2019-07-08T17:20:01.998Z · score: 43 (9 votes)
Learning biases and rewards simultaneously 2019-07-06T01:45:49.651Z · score: 43 (12 votes)
[AN #58] Mesa optimization: what it is, and why we should care 2019-06-24T16:10:01.330Z · score: 50 (13 votes)
[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming 2019-06-05T23:20:01.202Z · score: 28 (9 votes)
[AN #56] Should ML researchers stop running experiments before making hypotheses? 2019-05-21T02:20:01.765Z · score: 22 (6 votes)
[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI 2019-05-05T02:20:01.030Z · score: 17 (5 votes)
[AN #54] Boxing a finite-horizon AI system to keep it unambitious 2019-04-28T05:20:01.179Z · score: 21 (6 votes)
Learning preferences by looking at the world 2019-02-12T22:25:16.905Z · score: 47 (13 votes)
Conclusion to the sequence on value learning 2019-02-03T21:05:11.631Z · score: 48 (11 votes)
Future directions for narrow value learning 2019-01-26T02:36:51.532Z · score: 12 (5 votes)
The human side of interaction 2019-01-24T10:14:33.906Z · score: 18 (5 votes)
Following human norms 2019-01-20T23:59:16.742Z · score: 27 (10 votes)
Reward uncertainty 2019-01-19T02:16:05.194Z · score: 20 (6 votes)
Human-AI Interaction 2019-01-15T01:57:15.558Z · score: 27 (8 votes)
What is narrow value learning? 2019-01-10T07:05:29.652Z · score: 20 (8 votes)
Reframing Superintelligence: Comprehensive AI Services as General Intelligence 2019-01-08T07:12:29.534Z · score: 93 (36 votes)
AI safety without goal-directed behavior 2019-01-07T07:48:18.705Z · score: 50 (15 votes)
Will humans build goal-directed agents? 2019-01-05T01:33:36.548Z · score: 43 (13 votes)
Coherence arguments do not imply goal-directed behavior 2018-12-03T03:26:03.563Z · score: 65 (22 votes)

Comment by rohinmshah on Conditions for Mesa-Optimization · 2019-09-19T20:21:17.309Z · score: 3 (2 votes) · LW · GW
Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)?

Humans don't seem to have one mesa objective that we're optimizing for. Even in this community, we tend to be uncertain about what our actual goal is, and most other people don't even think about it. Humans do lots of things that look like "changing their objective", e.g. maybe someone initially wants to have a family but then realizes they want to devote their life to public service because it's more fulfilling.

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"?

I suspect it would be more efficient, but I'm not sure. (Mostly this is because humans and animals don't seem to have two clean parts, but quite plausibly we'll do something more interpretable than evolution and that will push towards a clean separation.) I also don't know whether it would be better for safety to have it split into two clean parts.

Comment by rohinmshah on Conditions for Mesa-Optimization · 2019-09-19T06:29:30.592Z · score: 3 (2 votes) · LW · GW
This sure seems like "search" to me.

I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.

Usually, you don't have a good model of the system, and then you need something else.

Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"?

I was thinking of following a gradient in supervised learning.
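To make "following a gradient" concrete, here is a minimal sketch of supervised learning by gradient descent on a one-parameter linear model. The data, learning rate, and function name are illustrative assumptions, not anything from the thread. The learner never enumerates or compares candidate solutions; it repeatedly moves the single parameter in the direction the gradient indicates.

```python
def fit_slope_by_gradient_descent(xs, ys, learning_rate=0.1, steps=50):
    """Fit y = w * x by stepping w against the gradient of the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is 2 * mean(x * (w*x - y)).
        gradient = 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return w


# Hypothetical toy data generated by y = 3x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
slope = fit_slope_by_gradient_descent(xs, ys)
```

Whether this update rule counts as "search" is exactly the definitional question in dispute; the sketch only shows what the procedure looks like.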

I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with sparse reward is going to get you very far.

Reinforcement learning with demonstrations or a very dense reward doesn't really look like search, it looks more like someone telling you what to do and you following the instructions faithfully.

Comment by rohinmshah on Conditions for Mesa-Optimization · 2019-09-18T15:55:21.433Z · score: 3 (2 votes) · LW · GW
Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

Humans and systems produced by meta learning both do reasonably well at learning, and don't do "search" (depending on how loose you are with your definition of "search").

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

It's plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven't had much experience on before, but nonetheless because they aren't made up of two clean parts (mesa objective + capabilities) there isn't a single obvious mesa objective that the AI system is optimizing for off distribution. I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don't seem like "search" to me, though they also aren't just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

(Also, reading back through this comment thread, I'm no longer sure whether or not a neural net could learn to play at least the 1-player random version of the SHA game. Certainly in the limit it can just memorize the input-output table, but I wouldn't be surprised if it could get some accuracy even without that.)

Comment by rohinmshah on Utility uncertainty vs. expected information gain · 2019-09-16T23:17:03.960Z · score: 2 (1 votes) · LW · GW

Identifiability of the optimal policy seems too strong: it's basically fine if my household robot doesn't figure out the optimal schedule for cleaning my house, as long as it's cleaning it somewhat regularly. But I agree that conceptually we would want something like that.

Comment by rohinmshah on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T22:59:42.368Z · score: 2 (1 votes) · LW · GW

Given that there was a round of manual review, I would expect human accuracy to be over 80% and probably over 90%.

Comment by rohinmshah on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T22:58:15.752Z · score: 2 (1 votes) · LW · GW

In that case, why isn't this equivalent to impact as a safety protocol? The period during which we use the regularizer is "training".

(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)

Comment by rohinmshah on Utility uncertainty vs. expected information gain · 2019-09-14T15:58:26.641Z · score: 6 (3 votes) · LW · GW

Fwiw I don't find the problem of fully updated deference very compelling. My real rejection of utility uncertainty in the superintelligent-god-AI scenario is:

• It seems hard to create a distribution over utility functions that is guaranteed to include the truth (with non-trivial probability, perhaps). It's been a while since I read it, but I think this is the point of Incorrigibility in the CIRL Framework.
• It seems hard to correctly interpret your observations as evidence about utility functions. In other words, the likelihood is arbitrary and not a fact about the world, and so there's no way to ensure you get it right. This is pointed at somewhat by your first link.

If we somehow magically vanished away these problems, maximizing expected utility under that distribution seems fine, even though the resulting AI system would prevent us from shutting it down. It would be aligned but not corrigible.
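The second bullet can be illustrated with a small sketch. It assumes a softmax ("Boltzmann") likelihood with a rationality parameter; that model, the two candidate utility functions, and all names below are hypothetical choices for illustration. The same observation pushes the posterior in opposite directions depending on which likelihood is assumed, and nothing in the observation itself says which assumption is right.

```python
import math


def posterior_over_utilities(utilities, prior, observed_action, rationality):
    """Bayes update assuming P(action | u) is proportional to exp(rationality * u[action])."""
    unnormalized = {}
    for name, utility in utilities.items():
        normalizer = sum(math.exp(rationality * value) for value in utility.values())
        likelihood = math.exp(rationality * utility[observed_action]) / normalizer
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


# Two hypothetical candidate utility functions over two actions.
utilities = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}

# Same observation (the human picks tea), two different assumed likelihoods.
rational_reading = posterior_over_utilities(utilities, prior, "tea", rationality=2.0)
antirational_reading = posterior_over_utilities(utilities, prior, "tea", rationality=-2.0)
```

Under the "noisily rational" assumption the posterior favors `likes_tea`; under the "noisily anti-rational" assumption it favors `likes_coffee`.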

Comment by rohinmshah on Utility uncertainty vs. expected information gain · 2019-09-13T22:59:28.860Z · score: 2 (1 votes) · LW · GW
Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).

Where's the "pretty fast"? The theorem makes a claim in the limit and says nothing about convergence. (I haven't read the rest of the paper.)

Comment by rohinmshah on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-13T17:39:02.789Z · score: 3 (2 votes) · LW · GW

As phrased in the paper I'm pretty pessimistic, mostly because the paper presents a story with a discontinuity where you throw a huge amount of computation at the problem and then at some point AGI emerges abruptly.

I think it's more likely that there won't be discontinuities -- the giant blob of computation keeps spitting out better and better learning algorithms, and we develop better ways of adapting them to tasks in the real world.

At some point one of these algorithms tries and fails to deceive us, we notice the problem and either fix it or stop using the AI-GA approach / limit ourselves to not-too-capable AI systems.

It seems plausible that you could get something like the Interim Quality-of-Life Improver out of such an approach. You'd have to deal with the problem that by default these AI systems are going to have weird alien drives that would likely make them misaligned with us, but you probably do get examples of systems that would deceive us that you can study and fix.

Comment by rohinmshah on AI Safety "Success Stories" · 2019-09-13T17:26:52.237Z · score: 2 (1 votes) · LW · GW
The main difference seems to be that you don't explicitly mention strong global coordination to stop unaligned AI from arising. Is that something you also had in mind?

It's more of a free variable -- I could imagine the world turning out such that we don't need very strong coordination (because the Quality of Life Improver AI could plausibly not sacrifice competitiveness), and I could also imagine the world turning out such that it's really easy to build very powerful unaligned AI and we need strong global coordination to prevent it from happening.

I think the difference may just be in how we present it -- you focus more on the global coordination part, whereas I focus more on the following norms + improving technology + quality of life part.

There's also Will MacAskill and Toby Ord's "the Long Reflection"

Yeah I think that's the same concept.

Comment by rohinmshah on Two senses of “optimizer” · 2019-09-11T21:29:58.625Z · score: 3 (2 votes) · LW · GW

Hmm, idk, it feels more like an optimizer_1 in that situation. Now that you've posed this question, the super-powerful SAT solver that acts in the world feels like both an optimizer_1 and an optimizer_2.

Comment by rohinmshah on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-11T21:25:45.362Z · score: 4 (3 votes) · LW · GW

Yup, I am aware of these arguments and disagree with them, though I haven't written up the reasons anywhere.

Comment by rohinmshah on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-10T23:43:40.929Z · score: 4 (3 votes) · LW · GW

Yup, this is what I meant.

Comment by rohinmshah on Seven habits towards highly effective minds · 2019-09-10T16:58:46.087Z · score: 2 (1 votes) · LW · GW

Not yet, though I'll keep it in mind for the future (I don't end up using this technique very often now because I'm no longer bottlenecked on ideas; I suspect I haven't used it since Write With Transformer was created).

Also tbc I tend to use this for more mundane problems rather than research problems.

Comment by rohinmshah on AI Safety "Success Stories" · 2019-09-08T01:52:32.309Z · score: 7 (3 votes) · LW · GW
I want to credit Rohin Shah as the person that I got this success story from, but can't find the post or comment where he talked about it.

It might be from Following human norms?

With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.
If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.

Which was referenced again in Learning preferences by looking at the world:

If I had to point towards a particular concrete path to a good future, it would be the one that I outlined in Following human norms. We build AI systems that have a good understanding of “common sense” or “how to behave normally in human society”; they accelerate technological development and improve decision-making; if we really want to have a goal-directed AI that is not under our control but that optimizes for our values then we solve the full alignment problem in the future. Inferring preferences or norms from the world state could be a crucial part of helping our AI systems understand “common sense”.

It's not the same as your Interim Quality-of-Life Improver, but it's got similar aspects.

It's also related to the concept of a "Great Deliberation" where we stabilize the world and then figure out what we want to do. (I don't have a reference for that though.)

If it wasn't these (but it was me), it was probably something earlier; I think I was thinking along the lines of Interim Quality-of-Life Improver in early-to-mid 2018.

Comment by rohinmshah on Seven habits towards highly effective minds · 2019-09-08T00:55:51.041Z · score: 6 (4 votes) · LW · GW
I’d classify most if not all of the tools listed above as tools for evaluating ideas, though, rather than tools for generating ideas. What helps with the latter?

I've often found it useful to have a random word generator throw a few words at me in order to help me generate new thoughts / ideas when thinking about some problem.

Comment by rohinmshah on One Way to Think About ML Transparency · 2019-09-05T14:35:23.508Z · score: 3 (2 votes) · LW · GW

Oh, sure, but if you train on a complex task like image classification, you're only going to get large decision trees (assuming you get decent accuracy), even with regularization.

(Also, why not just use the decision tree if it's interpretable? Why bother with a neural net at all?)

Comment by rohinmshah on One Way to Think About ML Transparency · 2019-09-03T17:38:25.555Z · score: 2 (1 votes) · LW · GW

But like, I could not operate a large decision tree on a piece of paper if I could study it for a while beforehand, because I wouldn't remember all of the yes/no questions and their structure.

I could certainly build a decision tree given data, but I could also build a neural network given data.

(Well, actually I think large decision trees and neural nets are both uninterpretable, so I mostly do agree with your definition. I object to having this definition of interpretability and believing that decision trees are interpretable.)

Comment by rohinmshah on Where are people thinking and talking about global coordination for AI safety? · 2019-09-02T16:19:34.240Z · score: 2 (1 votes) · LW · GW

This is pretty strongly different from my impressions, but I don't think we could resolve the disagreement without talking about specific examples of people, so I'm inclined to set this aside.

Comment by rohinmshah on Two senses of “optimizer” · 2019-08-31T22:04:53.938Z · score: 11 (4 votes) · LW · GW

Planned summary:

The first sense of "optimizer" is an optimization algorithm, that given some formally specified problem computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.

Planned opinion:

I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
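A toy version of "encode the dynamics as a SAT formula and solve for the actions" can be sketched as follows. The world (a single light toggled by a switch), the variable numbering, and the brute-force solver are all assumptions made for illustration; a real SAT solver would replace the enumeration.

```python
from itertools import product


def brute_force_sat(num_vars, clauses):
    """Return the first satisfying assignment as {var: bool}, or None if unsatisfiable."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(zip(range(1, num_vars + 1), values))
        if all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return assignment
    return None


def xor_transition(state, action, next_state):
    """CNF clauses enforcing next_state == state XOR action (flipping toggles the light)."""
    return [
        [-state, -action, -next_state],
        [state, action, -next_state],
        [state, -action, next_state],
        [-state, action, next_state],
    ]


# Variables: light state at times 0..2 and the flip/no-flip action at times 0..1.
S0, A0, S1, A1, S2 = 1, 2, 3, 4, 5
clauses = xor_transition(S0, A0, S1) + xor_transition(S1, A1, S2)
clauses += [[-S0], [S2]]  # the light starts off; the goal is that it ends on

model = brute_force_sat(5, clauses)
plan = [model[A0], model[A1]]  # the actions the solver "chose"
```

The solver itself is still only the first kind of optimizer. It looks like the second kind only because the formula it was handed contains a model of the environment and the satisfying assignment is read off as a plan.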

Comment by rohinmshah on Where are people thinking and talking about global coordination for AI safety? · 2019-08-31T17:37:46.778Z · score: 8 (3 votes) · LW · GW
Is my impression from the limited sample correct?

Seems right to me, yes.

how best to correct this communications gap (and prevent similar gaps in the future) between the two groups of people working on AI risk?

Convince the researchers at OpenAI, FHI and Open Phil, and maybe DeepMind and CHAI, that it's not possible to get safe, competitive AI; then ask them to pass it on to governance researchers.

Comment by rohinmshah on [AN #62] Are adversarial examples caused by real but imperceptible features? · 2019-08-31T05:33:27.209Z · score: 4 (2 votes) · LW · GW
Is it rather that the model space might not have the capacity to correctly imitate the human?

There are lots of reasons that a robot might be unable to learn the correct policy despite the action space permitting it:

• Not enough model capacity
• Not enough training data
• Training got stuck in a local optimum
• You've learned from robot play data, but you've never seen anything like the human policy before

etc, etc.

Not all of these are compatible with "and so the robot does the thing that the human does 5% of the time". But it seems like there can and probably will be factors that are different between the human and the robot (even if the human uses teleoperation), and in that setting imitating factored cognition provides the wrong incentives, while optimizing factored evaluation provides the right incentives.

Comment by rohinmshah on Reframing Superintelligence: Comprehensive AI Services as General Intelligence · 2019-08-27T21:42:33.107Z · score: 3 (3 votes) · LW · GW

I broadly agree, especially if you set aside opacity; I very rarely mean to imply a strict dichotomy.

I do think in the scenario you outlined the main issue would be opacity: the learned language representation would become more and more specialized between the various services, becoming less interpretable to humans and more "integrated" across services.

Comment by rohinmshah on Humans can be assigned any values whatsoever… · 2019-08-26T19:40:14.289Z · score: 4 (2 votes) · LW · GW

If we add assumptions like this, they will inevitably be misspecified, which can lead to other problems. For example, how would you operationalize that is good at optimizing ? What if in reality due to effects currently beyond our understanding, our actions are making the future more likely to be dystopian in some way than if we took random actions? Should our AI infer that we prefer that dystopia, since otherwise we wouldn't be better than random?

Comment by rohinmshah on Does it become easier, or harder, for the world to coordinate around not building AGI as time goes on? · 2019-08-24T22:08:15.811Z · score: 2 (1 votes) · LW · GW

In theory, never (either hyperbolic time discounting is a bias, and never "should" be done, or it's a value, but one that longtermists explicitly don't share).

In practice, hyperbolic time discounting might be a useful heuristic, e.g. perhaps since we are bad at thinking of all the ways that our plans can go wrong, we tend to overestimate the amount of stuff we'll have in the future, and hyperbolic time discounting corrects for that.

Comment by rohinmshah on When do utility functions constrain? · 2019-08-24T22:00:12.282Z · score: 5 (3 votes) · LW · GW

For the record, the VNM theorem is about the fact that you are maximizing expected utility. All three of the words are important, not just the utility function part. The biggest constraint that the VNM theorem applies is that, assuming there is a "true" probability distribution over outcomes (or that the agent has a well-calibrated belief over outcomes that captures all information it has about the environment), the agent must choose actions in a way consistent with maximizing the expectation of some real-valued function of the outcome, which does in fact rule out some possibilities.

It's only when you don't have a probability distribution that the VNM theorem becomes contentless. So one check to see whether or not it's "reasonable" to apply the VNM theorem is to see what happens in a deterministic environment (and the agent can perfectly model the environment) -- the VNM theorem shouldn't add any force to the argument in this setting.
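The deterministic check can be sketched with one standard construction, which is not necessarily the one intended here. In a deterministic environment where the agent knows the dynamics, any policy maximizes the utility function that assigns 1 to the trajectory that policy produces and 0 to every other trajectory. The toy environment, action names, and policy below are hypothetical.

```python
from itertools import product

ACTIONS = ("left", "right", "twitch")  # hypothetical action set


def rollout(policy, horizon):
    """Deterministic toy environment: the state is simply the history of actions so far."""
    history = ()
    for _ in range(horizon):
        history += (policy(history),)
    return history


def rationalizing_utility(policy, horizon):
    """Utility that is 1 on the trajectory the policy produces and 0 everywhere else."""
    target = rollout(policy, horizon)
    return lambda trajectory: 1.0 if trajectory == target else 0.0


def is_utility_maximizer(policy, utility, horizon):
    """Check the policy's trajectory against every trajectory reachable in the toy environment."""
    best = max(utility(t) for t in product(ACTIONS, repeat=horizon))
    return utility(rollout(policy, horizon)) == best


def twitchy_policy(history):
    """An arbitrary policy with no obvious goal."""
    return ACTIONS[(len(history) * 7 + 1) % len(ACTIONS)]


utility = rationalizing_utility(twitchy_policy, horizon=4)
rationalized = is_utility_maximizer(twitchy_policy, utility, horizon=4)
```

Because the construction works for any policy, "is an expected utility maximizer" rules nothing out in this setting. The constraint only appears once outcomes are uncertain and the agent has to trade probabilities off against each other.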

Comment by rohinmshah on Embedded Naive Bayes · 2019-08-24T21:26:41.519Z · score: 4 (2 votes) · LW · GW

I usually imagine the problems of embedded agency (at least when I'm reading LW/AF), where the central issue is that the agent is a part of its environment (in contrast to the Cartesian model, where there is a clear, bright line dividing the agent and the environment). Afaict, "embedded Naive Bayes" is something that makes sense in a Cartesian model, which I wasn't expecting.

It's not that big a deal, but if you want to avoid that confusion, you might want to change the word "embedded". I kind of want to say "The Intentional Stance towards Naive Bayes", but that's not right either.

Comment by rohinmshah on Embedded Naive Bayes · 2019-08-24T19:34:58.177Z · score: 4 (2 votes) · LW · GW

What do you mean by embedded here? It seems you are asking the question "does a particular input-output behavior / computation correspond to some Naive Bayes model", which is not what I would intuitively think of as "embedded Naive Bayes".

Comment by rohinmshah on Clarifying "AI Alignment" · 2019-08-23T21:48:18.301Z · score: 2 (1 votes) · LW · GW

Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:

Isn't HCH also such a multiagent system?

Yes, I shouldn't have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is "trying to do", i.e. I wouldn't say it has a single "motivation". This allows you to say "the system is not intent-aligned", even though you can't say "the system is trying to do X".

Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn't make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:

For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.

Also, I want to note strong agreement with this:

Of course, it also seems quite likely that AIs of the kind that will probably be built ("by default") also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.

Comment by rohinmshah on Clarifying "AI Alignment" · 2019-08-22T21:33:55.207Z · score: 4 (2 votes) · LW · GW

Oh, I see, you're talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn't apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I'd say it fails motivation, mostly because the system doesn't really have a single "motivation").

It doesn't seem like the definition-optimization decomposition helps either? I don't know whether I'd call that a failure of definition or optimization.

Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won't cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)?

I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn't apply this framework to the system (and I also wouldn't apply definition-optimization to the system).

Comment by rohinmshah on Clarifying "AI Alignment" · 2019-08-22T18:24:32.827Z · score: 2 (1 votes) · LW · GW

I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can't talk coherently about their "motivation". Probably all deep-learning-based systems fall into this category.

I also agree that (at least for now, and probably in the future as well) you can't formally specify the "type signature" of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.

My hope here would be to solve the motivation problem and leave the competence problem for later, since by my view that solves most of the problem (I'm aware that you disagree with this).

I don't agree that it's not clean at the conceptual level. It's perhaps less clean than the definition-optimization decomposition, but not much less.

For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn't very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both?

This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don't want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.

Comment by rohinmshah on Alignment Newsletter #24 · 2019-08-22T00:00:43.658Z · score: 4 (2 votes) · LW · GW

Update: A reader suggested that in the open-source implementation of PopArt, the PopArt normalization happens after the reward clipping, counter to my assumption. I no longer understand why PopArt is helping, beyond "it's good for things to be normalized".

Comment by rohinmshah on Forum participation as a research strategy · 2019-08-19T21:51:59.438Z · score: 2 (1 votes) · LW · GW
Do you have any links related to this?

No, I haven't read much about Bayesian updating. But I can give an example.

Consider the following game. I choose a coin. Then, we play N rounds. In each round, you make a bet about whether or not the coin will come up Heads or Tails at 1:2 odds which I must take (i.e. if you're right I give you $2 and if I'm right you give me $1). Then I flip the coin and the bet resolves.

If your hypothesis space is "the coin has some bias b of coming up Heads or Tails", then you will eagerly accept this game for large enough N -- you will quickly learn the bias b from experiments, and then you can keep getting money in expectation.

However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, you'll keep flip-flopping on whether the bias is towards Heads or Tails, and you will quickly converge to near-certainty that the bias is 50% (since the pattern will be HTHTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it because you keep thinking that the EV of the next round is positive.

Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn't help). I don't know of a general way to use noise parameters to avoid issues like this.
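The game can be simulated directly. This is a minimal sketch under stated assumptions: the bettor uses a uniform Beta(1, 1) prior over the bias, bets on whichever side currently looks more likely (Heads on ties), and the flipper always produces the opposite side. The function and variable names are hypothetical.

```python
from fractions import Fraction


def play_rigged_coin_game(num_rounds):
    """Bettor updates a Beta(1, 1) prior on a fixed bias; the flipper always picks the losing side."""
    heads = tails = 0
    winnings = 0
    perceived_evs = []
    for _ in range(num_rounds):
        p_heads = Fraction(heads + 1, heads + tails + 2)  # posterior mean of the bias
        bet = "H" if p_heads >= Fraction(1, 2) else "T"
        p_win = max(p_heads, 1 - p_heads)
        # Win $2 with believed probability p_win, lose $1 otherwise.
        perceived_evs.append(2 * p_win - (1 - p_win))
        outcome = "T" if bet == "H" else "H"  # the flipper controls the coin
        if outcome == "H":
            heads += 1
        else:
            tails += 1
        winnings += 2 if bet == outcome else -1
    final_p_heads = Fraction(heads + 1, heads + tails + 2)
    return winnings, final_p_heads, perceived_evs


winnings, final_p_heads, perceived_evs = play_rigged_coin_game(100)
```

In this sketch the bettor loses a dollar every round, the posterior mean settles at one half, and the bettor's own estimate of the next round's expected value never drops below $0.50. A bettor who could quit therefore never would.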

Comment by rohinmshah on Coherence arguments do not imply goal-directed behavior · 2019-08-17T23:28:40.471Z · score: 2 (1 votes) · LW · GW
I think it's worth pointing out one technical 'caveat'

Yes, good point. I think I was assuming an infinite horizon (i.e. no terminal states), for which either construction works.

My main point, however, is that I think you could do some steelmanning here and recover most of the arguments you are criticizing (based on complexity arguments).

That's the next post in the sequence, though the arguments are different from the ones you bring up.

But I think there are still good arguments for intelligence strongly suggesting some level of "goal-directed behavior". e.g. it's probably physically impossible to implement policies (over histories) that are effectively random, since they look like look-up tables that are larger than the physical universe.

I mean, you could have the randomly twitching robot. But I agree with the broader point, I think, to the extent that it is the "economic efficiency" argument in the next post.

Eliezer has a nice analogy in a comment on one of Paul's posts (I think), about an agent that behaves like it understands math, except that it thinks 2+2=5.

It seems likely the AI's beliefs would be logically coherent whenever the corresponding human beliefs are logically coherent. This seems quite different from arguing that the AI has a goal.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-15T19:57:38.540Z · score: 6 (3 votes) · LW · GW

Re: convergent rationality, I don't buy it (specifically the "convergent" part).

Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.

But really my answer is "there are lots of ways you can get confidence in a thing that are not proofs". I think the strongest argument against is "when you have an adversary optimizing against you, nothing short of proofs can give you confidence", which seems to be somewhat true in security. But then I think there are ways that you can get confidence in "the AI system will not adversarially optimize against me" using techniques that are not proofs.

(Note the alternative to proofs is not trial and error. I don't use trial and error to successfully board a flight, but I also don't have a proof that my strategy is going to cause me to successfully board a flight.)

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-09T22:26:05.339Z · score: 2 (1 votes) · LW · GW

I agree with a). c) seems to me to be very optimistic, but that's mostly an intuition, I don't have a strong argument against it (and I wouldn't discourage people who are enthusiastic about it from working on it).

The argument in b) makes sense; I think the part that I disagree with is:

moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there).

The counterargument is "current AI systems don't look like long term planners", but of course it is possible to respond to that with "AGI will be very different from current AI systems", and then I have nothing to say beyond "I think AGI will be like current AI systems".

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-09T22:14:26.776Z · score: 2 (1 votes) · LW · GW
Or are you just trying to see if anyone can defeat the epistemic humility "trump card"?

Partly (I'm surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.

But maybe it's enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-09T22:11:20.198Z · score: 2 (1 votes) · LW · GW

I'd also argue against the empirical claim in that setting; do you agree with the empirical claim there?

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-09T22:08:47.801Z · score: 2 (1 votes) · LW · GW
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI's (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.

Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many "easy" things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)

I also predict that there will be types of failure we will not notice, or will misinterpret. [...]

All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I'd be more worried about scenarios like these.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-09T21:57:22.675Z · score: 3 (2 votes) · LW · GW

I'm uncertain about weaponization of AI (and did say "if we ignore military applications" in the OP).

Comment by rohinmshah on Following human norms · 2019-08-08T22:48:20.748Z · score: 2 (1 votes) · LW · GW
I just don’t know whether I agree with your assertion that eg AUP “defines” what not to do.

I think I mostly meant that it is not learned.

I kind of want to argue that this means the effect of not-learned things can be traced back to researcher's brains, rather than to experience with the real world. But that's not exactly right, because the actual impact penalty can depend on properties of the world, even if it doesn't use learning.

I don't know; it feels too early to say. I think if the norms end up in some hardcoded form such that they never update over time, nearest unblocked strategies feel very likely. If the norms are evolving over time, then it might be fine. The norms would need to evolve at the same "rate" as the rate at which the world changes.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-08T22:35:48.478Z · score: 2 (1 votes) · LW · GW
To be hopefully clear: I'm applying this normative claim to argue that proof is needed to establish the desired level of confidence.

Under my model, it's overwhelmingly likely that regardless of what we do AGI will be deployed with less than the desired level of confidence in its alignment. If I personally controlled whether or not AGI was deployed, then I'd be extremely interested in the normative claim. If I then agreed with the normative claim, I'd agree with:

proof is needed to establish the desired level of confidence. That doesn't mean direct proof of the claim "the AI will do good", but rather of supporting claims, perhaps involving the learning-theoretic properties of the system (putting bounds on errors of certain kinds) and such.

I don't see how you can be confident enough of that view for it to be how you really want to check.

If I want >99% confidence, I agree that I couldn't be confident enough in that argument.

A system can be optimizing a fairly good proxy, so that at low levels of capability it is highly aligned, but this falls apart as the system becomes highly capable and figures out "hacks" around the "usual interpretation" of the proxy.

Yeah, the hope here would be that the relevant decision-makers are aware of this dynamic (due to previous situations in which e.g. a recommender system optimized the fairly good proxy of clickthrough rate but this led to "hacks" around the "usual interpretation"), and have some good reason to think that it won't happen with the highly capable system they are planning to deploy.

I also note that it seems like we disagree both about how useful proofs will be and about how useful empirical investigations will be

Agreed. It also might be that we disagree on the tractability of proofs in addition to / instead of the utility of proofs.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-08T18:00:31.585Z · score: 2 (1 votes) · LW · GW
you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective)

I'm uncertain, given the potential for AGI to be used to reduce other x-risks. (I don't have strong opinions on how large other x-risks are and how much potential there is for AGI to differentially help.) But I'm happy to accept this as a premise.

I think what's happening now is a good guide to what will happen in the future, at least on short timelines. If AGI is >100 years away, then sure, a lot will change and current facts are relatively unimportant. If it's <20 years away, then current facts seem very relevant. I usually focus on the shorter timelines.

For min(20 years, time till AGI), for each individual trend I identified, I'd weakly predict that trend will continue (except perhaps openness, because that's already changing).

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-08T04:32:19.309Z · score: 2 (1 votes) · LW · GW

I'm more sympathetic to this argument (which is a claim about what might happen in the future, as opposed to what is happening now, which is the analogy I usually encounter, though possibly not on LessWrong). I still think the analogy breaks down, though in different ways:

• There is a strong norm of openness in AI research (though that might be changing). (Though perhaps this was the case with nuclear physics too.)
• There is a strong anti-government / anti-military ethic in the AI research community. I'm not sure what the nuclear analog is, but I'm guessing it was neutral or pro-government/military.
• Governments are staying a mile away from AGI; their interest in AI is in narrow AI's applications. Narrow AI applications are diverse, and many can be done by a huge number of people. In contrast, nukes are a single technology, governments were interested in them, and only a few people could plausibly build them. (This is relevant if you think a ton of narrow AI could be used to take over the world economically.)
• OpenAI / DeepMind are not adversarial towards each other. In contrast, US / Germany were definitely adversarial.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-08T03:20:05.976Z · score: 4 (2 votes) · LW · GW
Because (it seems to me) existential risk seems asymmetrically bad in comparison to potential technology upsides (large as upsides may be), I just have different standards of evidence for "significant risk" vs "significant good".

This is a normative argument, not an empirical one. The normative position seems reasonable to me, though I'd want to think more about it (I haven't because it doesn't seem decision-relevant).

I especially don't see an argument that one could expect all failure modes of very very capable systems to present themselves first in less-capable systems.

The quick version is that to the extent that the system is adversarially optimizing against you, it had to at some point learn that that was a worthwhile thing to do, which we could notice. (This is assuming that capable systems are built via learning; if not then who knows what'll happen.)

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-08T03:00:59.950Z · score: 2 (1 votes) · LW · GW

I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work.

I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-07T19:05:08.775Z · score: 6 (3 votes) · LW · GW

This sounds like the normative claim, not the empirical one, given that you said "what we want is..."

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-07T19:02:42.376Z · score: 3 (2 votes) · LW · GW

I hold this view; none of those are reasons for my view. The reason is much simpler -- before x-risk level failures, we'll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We'll notice this, understand it, and fix the issue.

(A crux I expect people to have is whether we'll actually fix the issue or "apply a bandaid" that is only a superficial fix.)

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-07T18:59:03.267Z · score: 2 (1 votes) · LW · GW

Agree that climate change is a better analogy.

Disagree that nukes seem easier to coordinate around -- there are factors that suggest this (e.g. easier to track who is and isn't making nukes), but there are factors against as well (the incentives to "beat the other team" don't seem nearly as strong).

Comment by rohinmshah on AI Alignment Open Thread August 2019 · 2019-08-06T19:52:53.372Z · score: 8 (3 votes) · LW · GW
First, when I say "proof-level guarantees will be easy", I mean "team of experts can predictably and reliably do it in a year or two", not "hacker can do it over the weekend".

This was also what I was imagining. (Well, actually, I was also considering more than two years.)

we are missing some fundamental insights into what it means to be "aligned".

It sounds like our disagreement is the one highlighted in Realism about rationality. When I say we could check whether the AI is deceiving humans, I don't mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of "deception" that gives us a perfect checker. I don't think notions like "deception", "aligned", "want", "optimize", etc. have a clean formal definition that admits a 100% successful checker. I do think these notions tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult.

Jumping back to the original question, it sounds like the reason that you think that if we don't have proofs we are doomed, is that conditional on us not having proofs, we must not have had any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right?

If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality -- maybe physical phenomena aren't amenable to crisp formal definitions, but "alignment" is.