ofer's Shortform 2019-11-26T14:59:40.664Z · score: 4 (1 votes)
A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z · score: 6 (4 votes)
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z · score: 41 (13 votes)
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z · score: 12 (4 votes)


Comment by ofer on Open question: are minimal circuits daemon-free? · 2019-12-07T22:48:26.162Z · score: 3 (2 votes) · LW · GW

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let C be a minimal circuit that takes as input a string of length n that encodes a Turing machine, and outputs a string that is the concatenation of the first k configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string x that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x, the computation of the output of C simulates a consequentialist; and C is a minimal circuit.
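To make the construction concrete, here is a minimal Python sketch (my own illustration, not from the original comment) of the function in question: a simulator that, given a Turing machine and a step count k, returns the machine's first k configurations. Any circuit computing this function for a machine that encodes a consequentialist must carry out that simulation. The toy transition-table encoding and all names are hypothetical.

```python
def run_tm(transitions, k, tape_len=8):
    """Return the first k configurations (state, head position, tape) of a
    Turing machine, starting from an all-zero tape.

    `transitions` maps (state, read_symbol) -> (new_state, write_symbol, move).
    """
    tape = [0] * tape_len
    state, head = "s0", 0
    configs = []
    for _ in range(k):
        configs.append((state, head, tuple(tape)))  # record the configuration
        key = (state, tape[head])
        if key not in transitions:  # no rule for this pair: the machine halted
            break
        state, tape[head], move = transitions[key]
        head = max(0, min(tape_len - 1, head + move))
    return configs

# A toy machine that writes 1s while moving right.
tm = {("s0", 0): ("s0", 1, +1)}
configs = run_tm(tm, k=4)
```

If the encoded machine simulated a human upload instead of this toy rule, the same simulator (and hence any circuit computing it) would be running that simulation as a subcomputation.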

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-05T15:05:58.436Z · score: 4 (2 votes) · LW · GW

Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?

I agree. We can frame this empirical uncertainty more generally by asking: What is the smallest t such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for t minutes vs. 1,000 days?

Or rather: What is the smallest t such that 'learning to generate answers that humans may give after thinking for t minutes' is not easier than 'learning to generate answers that humans may give after thinking for 1,000 days'?

I should note that, conditioned on the above scenario, I expect that labeled 10-minute-thinking training examples would be at most a tiny fraction of all the training data (when considering all the learning that had a role in building the model, including learning that produced pre-trained weights etcetera). I expect that most of the learning would be either 'supervised with automatic labeling' or unsupervised (e.g. 'predict the next token') and that a huge amount of text (and code) that humans wrote will be used; some of which would be the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-04T02:09:34.272Z · score: 1 (1 votes) · LW · GW

Thank you! I think I now have a better model of how people think about factored cognition.

the 'humans thinking for 10 minutes chained together' might have very different properties from 'one human thinking for 1000 days'. But given that those have different properties, it means it might be hard to train the 'one human thinking for 1000 days' system relative to the 'thinking for 10 minutes' system, and the fact that one easily extends to the other is evidence that this isn't how thinking works.

In the above scenario I didn't assume that humans—or the model—use factored cognition when the 'thinking duration' is long. Suppose instead that the model is running a simulation of a system that is similar (in some level of abstraction) to a human brain. For example, suppose some part of the model represents a configuration of a human brain, and during inference some iterative process repeatedly advances that configuration by a single "step". Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by 20 billion steps (1,000 days); and the runtime is linear in the number of steps.

Generally, one way to make predictions about the final state of complicated physical processes is to simulate them. Solutions that do not involve simulations (or equivalent) may not even exist, or may be less likely to be found by the training algorithm.

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-03T18:19:45.856Z · score: 6 (3 votes) · LW · GW

[Question about factored cognition]

Suppose that at some point in the future, for the first time in history, someone trains an ML model that takes any question as input, and outputs an answer that an ~average human might have given after thinking about it for 10 minutes. Suppose that model is trained without any safety-motivated interventions.

Suppose also that the architecture of that model is such that '10 minutes' is just a parameter, t, that the operator can choose per inference, and there's no upper bound on it; and the inference runtime increases linearly with t. So, for example, the model could be used to get an answer that a human would have come up with after thinking for 1,000 days.

In this scenario, would it make sense to use the model for factored cognition? Or should we consider running this model with t set to 1,000 days to be no more dangerous than running it many times with t set to 10 minutes?

Comment by ofer on ofer's Shortform · 2019-11-26T14:59:40.853Z · score: 12 (4 votes) · LW · GW

Nothing in life is as important as you think it is when you are thinking about it.

--Daniel Kahneman, Thinking, Fast and Slow

To the extent that the above phenomenon tends to occur, here's a fun story that attempts to explain it:

At every moment our brain can choose something to think about (like "that exchange I had with Alice last week"). How does the chosen thought get selected from the thousands of potential thoughts? Let's imagine that the brain assigns an "importance score" to each potential thought, and thoughts with a larger score are more likely to be selected. Since there are thousands of thoughts to choose from, the optimizer's curse makes our brain overestimate the importance of the thought that it ends up selecting.
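The optimizer's-curse effect in this story can be sketched with a small simulation (hypothetical numbers; the Gaussian model of "importance scores" is my own assumption): each candidate thought has a true importance and a noisy score, the highest-scoring thought gets selected, and the selected thought's score systematically overestimates its true importance — more so when there are more candidates.

```python
import random

random.seed(0)

def selection_bias(n_thoughts, noise_std=1.0, trials=2000):
    """Mean (score - true importance) of the highest-scoring thought."""
    total = 0.0
    for _ in range(trials):
        # Each potential thought has a true importance and a noisy score.
        true_vals = [random.gauss(0, 1) for _ in range(n_thoughts)]
        scores = [v + random.gauss(0, noise_std) for v in true_vals]
        best = max(range(n_thoughts), key=lambda i: scores[i])
        total += scores[best] - true_vals[best]
    return total / trials

bias_few = selection_bias(n_thoughts=2)      # choosing among 2 thoughts
bias_many = selection_bias(n_thoughts=1000)  # choosing among 1000 thoughts
```

With thousands of candidates, the winner is typically a thought whose noise term happened to be large, which is exactly the overestimation the quote describes.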

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-22T12:20:42.157Z · score: 3 (2 votes) · LW · GW

If "unintended optimization" refers only to the inner alignment problem, then there's also the malign prior problem.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-22T11:33:06.207Z · score: 1 (1 votes) · LW · GW

Well, the reason I mentioned the "utility function over different states of matter" thing is because if your utility function isn't specified over states of matter, but is instead specified over your actions (e.g. behave in a way that's as corrigible as possible), you don't necessarily get instrumental convergence.

I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM memory cells whose values represent the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. gaining control over as many resources in our world as possible).

My impression is that early thinking about Oracles wasn't really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there's no real reason to believe these early "Oracle" models are an accurate description of current or future (un)supervised learning systems.

It seems possible that something like this has happened. Though as far as I know, we don't currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.

How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase "training distribution" then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?

Therefore, I'm sympathetic to the following perspective, from Armstrong and O'Rourke (2018) (the last sentence was also quoted in the grandparent):

we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-21T15:34:59.152Z · score: 1 (1 votes) · LW · GW

Sorry for the delayed response!

My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I'm not sure how the concept applies in other cases.

I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.

Like, if we aren't using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals?

I'm not sure how to interpret 'unintended optimization', but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.

I'm interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that's potentially smarter than we are.

I agree. So the following is a pending question that I haven't addressed here yet: Would '(un)supervised learning at arbitrarily large scale' produce arbitrarily capable systems that "want to affect the world"?

I won't address this here, but I think this is a very important question that deserves a thorough examination (I plan to reply here with another comment if I end up writing something about it). For now I'll note that my best guess is that most AI safety researchers think that it's at least plausible (>10%) that the answer to that question is "yes".

I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world'). Here's some supporting evidence for this:

  • Stuart Armstrong and Xavier O'Rourke wrote in their Safe Uses of AI Oracles paper:

    we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

  • Stuart Russell wrote in his book Human Compatible (2019):

    if the objective of the Oracle AI system is to provide accurate answers to questions in a reasonable amount of time, it will have an incentive to break out of its cage to acquire more computational resources and to control the questioners so that they ask only simple questions.

Comment by ofer on Robin Hanson on the futurist focus on AI · 2019-11-14T21:30:25.369Z · score: 5 (4 votes) · LW · GW

It seems you consider previous AI booms to be a useful reference class for today's progress in AI.

Suppose we will learn that the fraction of global GDP that currently goes into AI research is at least X times higher than in any previous AI boom. What is roughly the smallest X for which you'll change your mind (i.e. no longer consider previous AI booms to be a useful reference class for today's progress in AI)?

[EDIT: added "at least"]

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-13T11:31:25.014Z · score: 1 (1 votes) · LW · GW

Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer.

I'm not sure whether you're asking about my thoughts on:

  1. how can '(un)supervised learning at arbitrarily large scale' produce such systems; or

  2. conditioned on such systems existing, why might they have convergent instrumental goals that look like those failure modes.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-12T13:59:24.428Z · score: 1 (1 votes) · LW · GW

Are you referring to the possibility of unintended optimization

Yes (for a very broad interpretation of 'optimization'). I mentioned some potential failure modes in this comment.

Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-10T08:25:28.607Z · score: 5 (3 votes) · LW · GW

I'm not sure I understand the question, but in case it's useful/relevant here:

A computer that trains an ML model/system—via something that looks like contemporary ML methods at an arbitrarily large scale—might be dangerous even if it's not connected to anything. Humans might get manipulated (e.g. if researchers ever look at the learned parameters), mind crime might occur, acausal trading might occur, the hardware of the computer might be used to implement effectors in some fantastic way. And those might be just a tiny fraction of a large class of relevant risks that the majority of which we can't currently understand.

Such 'offline computers' might be more dangerous than an RL agent that by design controls some actuators, because problems with the latter might be visible to us at a much lower scale of training (and therefore with much less capable/intelligent systems).

Comment by ofer on Defining Myopia · 2019-11-09T13:47:51.438Z · score: 9 (2 votes) · LW · GW

In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.

Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).

Backtracking from this, I now realize that my core reasoning behind my argument—about evolutionary computation algorithms producing non-myopic behavior by default—was incorrect.

(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms; which seems incorrect!)

Comment by ofer on Lite Blocking · 2019-11-06T08:14:59.031Z · score: 1 (1 votes) · LW · GW

Social networks generally have far more things they could show you than you'll be able to look at. To prioritize they use inscrutable algorithms that boil down to "we show you the things we predict you're going to like".

Presumably, social networks tend to optimize for metrics like time spent and user retention. (There might even be a causal relationship between this optimization and threads getting derailed.)

Also, this seems like a stable/likely state, because if any single social network would unilaterally switch to optimize for 'showing users things they like' (or any other metric different from the above), competing social networks would plausibly be "stealing" their users.

Comment by ofer on Book Review: Design Principles of Biological Circuits · 2019-11-05T16:02:55.619Z · score: 24 (11 votes) · LW · GW

Thanks for writing this!

I used to agree with this position. I used to argue that there was no reason to expect human-intelligible structure inside biological organisms, or deep neural networks, or other systems not designed to be understandable. But over the next few years after that biologist’s talk, I changed my mind, and one major reason for the change is Uri Alon’s book An Introduction to Systems Biology: Design Principles of Biological Circuits.

I'm curious what you think about the following argument:

The examples in this book are probably subject to a strong selection effect: examples for mechanisms that researchers currently understand were more likely to be included in the book than those that no one has a clue about. There might be a large variance in the "opaqueness" of biological mechanisms (in different levels of abstraction). So perhaps this book provides little evidence on the (lack of) opaqueness of, say, the mechanisms that allowed John von Neumann to create the field of cellular automata, at a level of abstraction that is analogous to artificial neural networks.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-05T09:33:30.300Z · score: 1 (1 votes) · LW · GW

May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs.

It seems worth pointing out that due to the inner alignment problem, we shouldn't assume that naively training, say, unsupervised learning models with human-level capabilities (e.g. for the purpose of generating novel FAI proposals) will be safe — conditioned on it being possible capabilities-wise.

Comment by ofer on More variations on pseudo-alignment · 2019-11-05T07:29:58.209Z · score: 3 (2 votes) · LW · GW

I'm not sure that I understand your definition of suboptimality deceptive alignment correctly. My current (probably wrong) interpretation of it is: "a model has a suboptimality deceptive alignment problem if it does not currently have a deceptive alignment problem but will plausibly have one in the future". This sounds to me like a wrong interpretation of this concept - perhaps you could point out how it differs from the correct interpretation?

If my interpretation is roughly correct, I suggest naming this concept in a way that would not imply that it is a special case of deceptive alignment. Maybe "prone to deceptive alignment"?

Comment by ofer on More variations on pseudo-alignment · 2019-11-05T07:24:28.342Z · score: 5 (3 votes) · LW · GW

This post made me re-visit the idea in your paper to distinguish between:

  1. Internalization of the base objective ("The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned."); and
  2. Modeling of the base objective ("The base objective is incorporated into the mesa-optimizer’s epistemic model rather than its objective").

I'm currently confused about this distinction. The phrase "point to" seems to me vague. What should count as a model that points to a representation of the base objective (as opposed to internalizing it)?

Suppose we have a model that is represented by a string of 10 billion bits. Suppose it is the case that there is a set of 100 bits such that if we flip all of them, the model would behave very differently (but would still be very "capable", i.e. the modification would not just "break" it).

[EDIT: by "behave very differently" I mean something like "maximize some objective function that is far away from the base objective on objective function space"]

Is it theoretically possible that a model that fits this description is the result of internalization of the base objective rather than modeling of the base objective?

Comment by ofer on Defining Myopia · 2019-11-04T17:45:26.652Z · score: 3 (2 votes) · LW · GW

I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) "informal strategic stuff" (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.

The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version that perhaps you responded to: I want it to only be about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning for past stock prices or RL for Atari games (when each episode is a new game).

My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolutely myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).

I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:

During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.

I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I'll use the following example to explain why I think that:

Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance); however, if parameter θ has a large value then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ does not (directly) affect predictions.

How should we expect our learning algorithm to update the parameter θ at the end of iteration 2?

If our learning algorithm is gradient descent, it seems that we should NOT expect θ to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ) is expected to be positive.

In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ happens to be larger are expected to outperform the other models in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ (over the model population).
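Here is a toy simulation of this two-iteration setup (my own hypothetical construction: the fitness function, population sizes, and truncation selection are illustrative choices). The parameter has zero partial derivative with respect to any single prediction's loss, so gradient descent never moves it; an evolutionary algorithm that selects on realized fitness increases it anyway.

```python
import random

random.seed(1)

def fitness(theta):
    # Iteration 1: chance performance regardless of theta.
    iter1 = random.random()
    # Iteration 2: a larger theta made these examples more predictable.
    iter2 = random.random() + theta
    return iter1 + iter2

def evolve(pop_size=200, generations=30, mutation_std=0.1):
    """Truncation selection on total fitness across both iterations."""
    population = [random.gauss(0, 1) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # keep the fitter half
        population = [p + random.gauss(0, mutation_std)
                      for p in parents for _ in (0, 1)]
    return sum(population) / len(population)

# Gradient descent: theta has no direct effect on any single prediction, so
# the partial derivative of the per-iteration objective w.r.t. theta is zero.
gd_update = 0.0

evolved_mean_theta = evolve()  # selection pushes theta up regardless
```

The point of the sketch is the asymmetry: the per-iteration gradient signal for theta is identically zero, while selection on cross-iteration fitness reliably drives the population mean of theta upward.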

Comment by ofer on [deleted post] 2019-11-04T08:46:50.195Z


Comment by ofer on Gradient hacking · 2019-11-02T16:42:33.622Z · score: 3 (2 votes) · LW · GW

Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?

Gradient descent is a very simple algorithm. It only "gets rid" of some piece of logic when that is the result of updating the parameters in the direction of the gradient. In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing the parameter θ, by being a model that outputs a very incorrect value if θ is even slightly different from the desired value.
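A one-dimensional numeric sketch of this mechanism (entirely hypothetical: the loss shape, the protected value THETA_STAR, and the constants are my own illustration, not from the comment): the model adds a sharp penalty to its own loss whenever the parameter drifts from a preferred value, so gradient descent keeps it pinned near that value even though the task term alone would push it elsewhere.

```python
THETA_STAR = 2.0  # the value the (hypothetical) mesa-optimizer protects

def loss(theta, task_weight=1.0, sharpness=100.0):
    # The task term alone would prefer theta as small as possible, but the
    # self-sabotage term dominates anywhere near THETA_STAR.
    return task_weight * theta + sharpness * (theta - THETA_STAR) ** 2

def grad(theta, eps=1e-6):
    # Numerical gradient of the loss with respect to theta.
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)

# Gradient descent starting slightly off THETA_STAR converges back to
# (approximately) THETA_STAR instead of following the task term downhill.
theta = 2.5
for _ in range(200):
    theta -= 0.001 * grad(theta)
```

The descent settles just below THETA_STAR (the tiny offset is the task term's residual pull), illustrating how a sharp self-imposed penalty can effectively freeze a parameter against the training signal.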

Comment by ofer on [deleted post] 2019-11-02T16:40:40.310Z


Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-02T09:43:05.575Z · score: 7 (4 votes) · LW · GW
Oracles must somehow be incentivized to give useful answers

A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it's "incentivized to give useful answers" (e.g. an answer to the question: "what is the next word in the text?").

My understanding is that what distinguishes a microscope model is the way it is being used after it's already trained (namely, allowing researchers to look at its internals for the purpose of gaining insights etcetera, rather than making inferences for the sake of using its valuable output). If this is correct, it seems that we should only use safe training procedures for the purpose of training useful microscopes, rather than training arbitrarily capable models.

Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-01T23:46:35.653Z · score: 21 (7 votes) · LW · GW

Thanks for writing this!

I'm curious what Chris's best guess (or anyone else's) is about where to place AlphaGo Zero on that diagram. Presumably its place is somewhere after "Human Performance", but is it close to the "Crisp Abstractions" peak, or perhaps way further - somewhere in the realm of "Increasingly Alien Abstractions"?

Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it.

Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?

(Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it's safe?)

Comment by ofer on Defining Myopia · 2019-10-23T23:49:21.053Z · score: 3 (2 votes) · LW · GW
In particular, I think there’s a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.

This line of thinking seems to me very important!

The following point might be obvious to Evan, but it's probably not obvious to everyone: Objective functions over the agent's output should probably not be interpreted as objective functions over the physical representation of the output (e.g. the configuration of atoms in certain RAM memory cells). That would just be a special case of objective functions over world states. Rather, we should probably be thinking about objective functions over the output as it is formally defined by the code of the agent (when interpreting "the code of the agent" as a mathematical object, like a Turing machine, and using a formal mapping from code to output).

Perhaps the following analogy can convey this idea: think about a human facing Newcomb's problem. The person has the following instrumental goal: "be a person that does not take both boxes" (because that makes Omega put $1,000,000 in the first box). Now imagine that that was the person's terminal goal rather than an instrumental goal. That person might be analogous to a program that "wants" to be a program whose (formally defined) output maximizes a given utility function.

Comment by ofer on Defining Myopia · 2019-10-23T22:14:12.354Z · score: 1 (1 votes) · LW · GW

Taking a step back, I want to note two things about my model of the near future (if your model disagrees with those things, that disagreement might explain what's going on in our recent exchanges):

(1) I expect many actors to be throwing a lot of money at selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.

(2) Suppose there's some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.

Regarding the questions you raised:

How would you propose to apply evolutionary algorithms to online learning?

One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).

How would you propose to apply evolutionary algorithms to non-episodic environments?

I'm not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can't be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there's training data or for which the environment can be simulated).

Comment by ofer on Healthy Competition · 2019-10-21T06:28:29.126Z · score: 2 (2 votes) · LW · GW
If you want to challenge a monopoly with a new org, there's likewise a particular burden to do a good job.

(This seems to depend on whether the job/project in question benefits from concentration.)

Comment by ofer on Healthy Competition · 2019-10-21T06:19:11.327Z · score: 5 (3 votes) · LW · GW

This topic seems very important!

Another potential consideration: In some cases, not having any competition can expose an org to a ~50% probability of having net-negative impact, simply due to the possibility that a counterfactual org (founded by someone else) would have done a better job.

Comment by ofer on Defining Myopia · 2019-10-20T16:43:24.566Z · score: 2 (2 votes) · LW · GW

[EDIT: 2019-11-09: The argument I made here seems incorrect; see here (H/T Abram for showing me that my reasoning was wrong).]

Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.


There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.

Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model to outperform another model—that takes one less Pareto improvement—on future fitness evaluations (all other things being equal).

Comment by ofer on Optimization Provenance · 2019-10-16T18:04:17.157Z · score: 1 (1 votes) · LW · GW
The optimization processes used in the agent must all be capable of controlling whether it creates a mesa-optimizer.

I'm confused about this sentence - my understanding is that the term mesa-optimizer refers to the agent/model itself when it is doing some optimization. I think the term "run-time optimization" (which I've seen in this slide, seemingly from a talk by Yann LeCun) refers to this type of optimization.

4. Does this apply to other forms of optimization daemons?

Isn't every optimization daemon a mesa-optimizer?

I was under the impression that the term "optimization daemon" was used to describe a mesa-optimizer that is a "consequentialist" (I don't know whether there's a common definition for the term "consequentialist" in this context; my own tentative fuzzy definition is "something that has preferences about the spacetime of the world/multiverse".)

Comment by ofer on Gradient hacking · 2019-10-16T04:27:26.074Z · score: 6 (4 votes) · LW · GW

I think this post describes an extremely important problem and research directions, and I hope a lot more research and thought goes into this!

ETA: Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.

Comment by ofer on Impact measurement and value-neutrality verification · 2019-10-15T08:59:49.533Z · score: 4 (3 votes) · LW · GW

Very interesting!

Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn't give us much assurance about the behavior of the model.

Comment by ofer on Thoughts on "Human-Compatible" · 2019-10-10T19:27:37.219Z · score: 5 (3 votes) · LW · GW
Any other ideas for "decoupled" AIs, or risks that apply to this approach in general?

If the question is about all the risks that apply, rather than special risks with this specific approach, then I'll note that the usual risks from the inner alignment problem seem to apply.

Comment by ofer on Machine Learning Projects on IDA · 2019-10-08T16:50:55.094Z · score: 1 (1 votes) · LW · GW

Regarding the following passage from the document:

What kind of built-in operations and environments should we use?
In existing work on NPI, the neural net is given outputs that correspond to basic operations on data. This makes it easier to learn algorithms that depend on those basic operations. For IDA, it would be ideal to learn these operations from examples. (If we were learning from human decompositions, we might not know about these “basic operations on data” ahead of time).

Do you have ideas/intuitions about how "basic operations" in the human brain can be learned? Also, how basic are the "basic operations" you're thinking about here? (Are we talking about something like the activity of an individual biological neuron? How active is a particular area in the prefrontal cortex? Symbolic-level stuff?)

Generally, do you consider imitating human cognition at the level of "basic operations" to be part of the IDA agenda? (As opposed to, say, training a model to "directly" predict the output of a human-working-for-10-minutes).

Comment by ofer on List of resolved confusions about IDA · 2019-10-01T15:13:25.543Z · score: 5 (3 votes) · LW · GW
The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

Comment by ofer on Partial Agency · 2019-09-29T04:21:33.520Z · score: 4 (2 votes) · LW · GW
Toward the end, the parameter value could mutate away.

I agree that it's possible to get myopic models in the population after arbitrarily long runtime due to mutations. It seems less likely the more bits that need to change—in any model in the current population—to get a model that is completely myopic.

From a safety perspective, if the prospect of some learning algorithm yielding a non-myopic model is concerning, the prospect of it creating non-myopic models along the way is plausibly also concerning.

And how would you apply evolutionary algorithms to really non-myopic settings, like reinforcement learning where you can't create any good episode boundaries (for example, you have a robot interacting with an environment "live", no resets, and you want to learn on-line)?

In this example, if we train an environment model on the data collected so far (ignoring the "you want to learn on-line" part), evolutionary algorithms might be an alternative to regular deep learning. More realistically, some actors would probably invest a lot of resources in developing top predictive models for stock prices etc., and evolutionary algorithms might be one of the approaches being experimented with.

Also, people might experiment with evolutionary algorithms as an alternative to RL, for environments that can be simulated, as OpenAI did; they wrote (2017): "Our work suggests that neuroevolution approaches can be competitive with reinforcement learning methods on modern agent-environment benchmarks, while offering significant benefits related to code complexity and ease of scaling to large-scale distributed settings."
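For concreteness, here is a minimal sketch of the evolution-strategies approach in the style of the OpenAI work cited above: perturb the parameters with Gaussian noise, evaluate the fitness of each perturbation, and step in the direction of the perturbations weighted by their (normalized) returns. Function names and hyperparameters are illustrative, not taken from the original paper.

```python
import numpy as np

def evolution_strategies(f, theta, sigma=0.1, lr=0.02, pop=50, iters=200):
    """Toy evolution-strategies loop: estimate an ascent direction for f
    from the fitness of randomly perturbed copies of the parameters."""
    rng = np.random.default_rng(0)
    for _ in range(iters):
        eps = rng.standard_normal((pop, theta.size))       # one noise vector per "worker"
        returns = np.array([f(theta + sigma * e) for e in eps])
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
        theta = theta + (lr / (pop * sigma)) * eps.T @ advantages
    return theta

# Maximize a toy "reward": negative squared distance to a target point.
target = np.array([1.0, -2.0])
theta = evolution_strategies(lambda t: -np.sum((t - target) ** 2), np.zeros(2))
```

No gradients of `f` are ever computed, which is why this family of methods applies even when the fitness signal is a black box.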

Comment by ofer on Partial Agency · 2019-09-28T09:00:30.584Z · score: 9 (4 votes) · LW · GW

[EDIT: 2019-11-09: The argument I made here seems incorrect; see here (H/T Abram for showing me that my reasoning on this was wrong).]

If there's a learning-theoretic setup which incentivizes the development of "full agency" (whatever that even means, really!) I don't know what it is yet.

Consider evolutionary algorithms. It seems that (theoretically) they tend to yield non-myopic models by default given sufficiently long runtime. For example, a network parameter value that causes behavior that minimizes loss in future training inferences might be more likely to end up in the final model than one that causes behavior that minimizes loss in the current inference at a great cost for the loss in future ones.

Comment by ofer on Towards an empirical investigation of inner alignment · 2019-09-24T04:57:07.388Z · score: 2 (2 votes) · LW · GW

Very interesting! This research direction might lead to researchers having better intuitions about what sort of mesa-objectives we're more likely to end up with.

Perhaps similar experiments can be done with supervised learning (instead of RL).

Comment by ofer on AI Alignment Open Thread August 2019 · 2019-09-23T07:11:20.905Z · score: 1 (1 votes) · LW · GW

These biases seem very important to keep in mind!

If "AI safety" refers here only to AI alignment, I'd be happy to read about how overconfidence about the difficulty/safety of one's approach might exacerbate the unilateralist's curse.

Comment by ofer on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-21T18:43:55.147Z · score: 1 (1 votes) · LW · GW

Ah, makes sense, thanks.

Comment by ofer on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-21T17:54:29.930Z · score: 2 (2 votes) · LW · GW
From AlphaStar, we’ve learned that one of two things is true: Either AI can [...] solve basic game theory problems

I'm confused about the mention of game theory. Did AlphaStar play in games that included more than two teams?

Comment by ofer on Feature Wish List for LessWrong · 2019-09-21T15:49:30.116Z · score: 2 (2 votes) · LW · GW

The ability to get notified, in any way, about new comments in specific threads/posts would be very helpful for me!

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-15T10:16:23.760Z · score: 1 (1 votes) · LW · GW

Some tangentially related thoughts:

It seems that in many simple worlds (such as the Bomb world), an indexically-selfish agent with a utility function U over centered histories would prefer to commit to UDT with a utility function U' over uncentered histories; where U' is defined as the sum of all the "uncentered versions" U_1, ..., U_n of U (version U_i corresponds to U when the pointer is assumed to point to agent i).

Things seem to get more confusing in messy worlds in which the inability of an agent to define a utility function (over uncentered histories) that distinguishes between agent 1 and agent 2 does not entail that the two agents are about to make the same decision.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-14T19:05:36.703Z · score: 3 (2 votes) · LW · GW

I agree. It seems that in that situation the person would be "rational" to choose Right.

I'm still confused about the "UDT is incompatible with this kind of selfish values" part. It seems that an indexically-selfish person—after failing to make a binding commitment and seeing the bomb—could still rationally commit to UDT from that moment on, by defining the utility s.t. only copies that found themselves in that situation (i.e. those who failed to make a binding commitment and saw the bomb) matter. That utility is a function over uncentered histories of the world, and would result in UDT choosing Right.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-14T17:12:54.891Z · score: 12 (3 votes) · LW · GW
Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don’t know if I’m a simulation or a real person. If I was selfish in an indexical way, I would think something like “If I’m a simulation then it doesn’t matter what I choose. The simulation will end as soon as I make a choice so my choice is inconsequential. But if I’m a real person, choosing Left will cause me to be burned. So I should choose Right.”

It seems to me that even in this example, a person (who is selfish in an indexical way) would prefer—before opening their eyes—to make a binding commitment to choose Left. If so, the "intuitively correct answer" that UDT is unable to give is actually just the result of a failure to make a beneficial binding commitment.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-13T21:55:24.388Z · score: 15 (10 votes) · LW · GW

(I'm not a decision theorist)

FDT in any form will violate Guaranteed Payoffs, which should be one of the most basic constraints on a decision theory

Fulfilling the Guaranteed Payoffs principle as defined here seems to entail two-boxing in the Transparent Newcomb's Problem, and generally not being able to follow through on precommitments when facing a situation with no uncertainty.

My understanding is that a main motivation for UDT (which FDT is very similar to?) is to get an agent that, when finding itself in a situation X, follows through on any precommitment that—before learning anything about the world—the agent would have wanted to follow through on when it is in situation X. Such a behavior would tend to violate the Guaranteed Payoffs principle, but would be beneficial for the agent?

Comment by ofer on Two senses of “optimizer” · 2019-09-12T07:29:42.323Z · score: 1 (1 votes) · LW · GW
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in.

This definition of optimizer_2 depends on the definition of "environment". It seems that for an RL agent you use the word "environment" to mean the formal environment as defined in RL. How do you define "environment", for this purpose, in non-RL settings?

What should be considered the environment of a SAT solver, or an arbitrary mesa-optimizer that was optimized to be a SAT solver?

Comment by ofer on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-12T06:19:19.602Z · score: 1 (1 votes) · LW · GW
Rohin's opinion: I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.

It seems to me that a SAT solver can be arbitrarily competent at solving SAT problems without being the second kind of optimizer (i.e. without acting upon its environment to change it), even while it solves SAT problems that encode the dynamics of our world. For example, this seems to be the case for a SAT solver that is just a brute force search with arbitrarily large amount of computing power.

[EDIT: When writing this comment, I considered "the environment of a SAT solver" to be the world that contains the computer running the SAT solver. However, this seem to contradict what Joar had in mind in his post].
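To make the brute-force case concrete, here is the kind of SAT solver I have in mind: it enumerates every assignment and checks each clause, with no model of (or effect on) anything outside the formula it is given. This is an illustrative toy, using DIMACS-style integer literals.

```python
from itertools import product

def brute_force_sat(clauses, n_vars):
    """Try every assignment of n_vars boolean variables against a CNF
    formula. Clauses use DIMACS-style literals: k means variable k is
    true, -k means variable k is false."""
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        # A clause is satisfied if at least one of its literals holds.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3) is satisfiable; (x1) and (not x1) is not.
print(brute_force_sat([[1, -2], [2, 3]], 3))
print(brute_force_sat([[1], [-1]], 1))
```

However much computing power we grant this loop, it only ever reads its input and returns an assignment, which is the sense in which it seems not to be an optimizer of its environment.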

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T18:52:38.880Z · score: 1 (1 votes) · LW · GW

Ah, I agree (edited my comment above accordingly).

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T18:05:50.245Z · score: 4 (2 votes) · LW · GW
Interesting... it seems that this doesn't necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.

I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
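The point about gradient descent can be sketched with a toy one-parameter regression (the setup is illustrative): each update is a function of the current episode's loss gradient alone, so a parameter change that sacrifices the current episode for the benefit of future episodes is never directly reinforced by the update rule.

```python
def online_sgd(episodes, theta, lr=0.1):
    """Online gradient descent over a stream of (x, y) episodes.
    Each update uses only the gradient of the current episode's
    squared-error loss."""
    for x, y in episodes:
        grad = 2 * (theta * x - y) * x   # d/dtheta of (theta*x - y)**2
        theta -= lr * grad               # depends on this episode only
    return theta

# Data generated by y = 2x; theta converges toward 2.0.
episodes = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)] * 50
theta = online_sgd(episodes, theta=0.0)
```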

I think it still applies to evolutionary algorithms (which might end up being relevant).

how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?

Maybe learning algorithms that have the following property are more likely to yield models with "cross-episodic behavior":

During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
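As a toy illustration of this property, consider a genetic algorithm with elitist truncation selection: a genome, and any bit within it, persists across generations according to how well its carriers score on all subsequent fitness evaluations, not just the evaluation in which it first appears. The setup below is a hypothetical sketch, not a claim about any particular training pipeline.

```python
import random

def evolve(fitness, genome_len=8, pop_size=20, generations=50, seed=0):
    """Toy genetic algorithm: truncation selection plus bit-flip
    mutation. Surviving genomes are re-evaluated every generation,
    so persistence depends on future fitness evaluations."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # truncation selection
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(genome_len)] ^= 1   # bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Fitness = number of 1-bits; selection accumulates 1s over generations.
best = evolve(fitness=sum)
```

By contrast, a single gradient step has no analogous mechanism by which the future performance of a parameter value feeds back into whether it is kept.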

Also, what name would you suggest for this problem, if not "inner alignment"?

Maybe "non-myopia" as Evan suggested.