A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z · score: 6 (4 votes)
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z · score: 41 (13 votes)
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z · score: 12 (4 votes)


Comment by ofer on Defining Myopia · 2019-10-23T23:49:21.053Z · score: 1 (1 votes) · LW · GW
In particular, I think there’s a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.

This line of thinking seems to me very important!

The following point might be obvious to Evan, but it's probably not obvious to everyone: Objective functions over the agent's output should probably not be interpreted as objective functions over the physical representation of the output (e.g. the configuration of atoms in certain RAM memory cells). That would just be a special case of objective functions over world states. Rather, we should probably be thinking about objective functions over the output as it is formally defined by the code of the agent (when interpreting "the code of the agent" as a mathematical object, like a Turing machine, and using a formal mapping from code to output).

Perhaps the following analogy can convey this idea: think about a human facing Newcomb's problem. The person has the following instrumental goal: "be a person that does not take both boxes" (because that makes Omega put $1,000,000 in the first box). Now imagine that this was the person's terminal goal rather than an instrumental goal. That person might be analogous to a program that "wants" to be a program whose (formally defined) output maximizes a given utility function.

Comment by ofer on Defining Myopia · 2019-10-23T22:14:12.354Z · score: 1 (1 votes) · LW · GW

Taking a step back, I want to note two things about my model of the near future (if your model disagrees with those things, that disagreement might explain what's going on in our recent exchanges):

(1) I expect many actors to be throwing a lot of money at selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.

(2) Suppose there's some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.

Regarding the questions you raised:

How would you propose to apply evolutionary algorithms to online learning?

One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
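As a rough illustration of the idea above, here is a minimal sketch of a selection process whose fitness metric is the average performance over many simulated online learning tasks. Everything in it is an assumption made for illustration: the "system" is just an exponential-moving-average predictor with a single evolved parameter, the tasks are synthetic drifting streams, and the names (`make_stream`, `online_loss`, `evolve`) are hypothetical.

```python
import random

def make_stream(rng, drift, noise, length=200):
    """Simulated online task: a slowly drifting target observed with noise."""
    target, stream = 0.0, []
    for _ in range(length):
        target += rng.gauss(0.0, drift)
        stream.append(target + rng.gauss(0.0, noise))
    return stream

def online_loss(step_size, stream):
    """Run a toy online learner over one stream.

    The learner predicts each value before seeing it, then updates its
    estimate. Returns the mean squared prediction error.
    """
    estimate, total = 0.0, 0.0
    for value in stream:
        total += (value - estimate) ** 2
        estimate += step_size * (value - estimate)
    return total / len(stream)

def fitness(step_size, tasks):
    """Fitness is the negative average loss across all training tasks."""
    return -sum(online_loss(step_size, s) for s in tasks) / len(tasks)

def evolve(tasks, generations=30, offspring=8, seed=0):
    """Elitist (1 + lambda) evolution over the learner's single parameter."""
    rng = random.Random(seed)
    parent = 0.01
    parent_fit = fitness(parent, tasks)
    for _ in range(generations):
        # Mutate the parent, clamping the step size to [0, 1].
        children = [min(1.0, max(0.0, parent + rng.gauss(0.0, 0.1)))
                    for _ in range(offspring)]
        top_fit, top = max((fitness(c, tasks), c) for c in children)
        # Keep the parent unless some child is strictly better.
        if top_fit > parent_fit:
            parent, parent_fit = top, top_fit
    return parent, parent_fit

task_rng = random.Random(1)
training_tasks = [make_stream(task_rng, drift=0.5, noise=0.5) for _ in range(10)]
evolved_step, evolved_fitness = evolve(training_tasks)
```

The evolved learner would then be deployed on a new online task that was not part of the fitness evaluation; this sketch only covers the selection step.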

How would you propose to apply evolutionary algorithms to non-episodic environments?

I'm not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can't be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there's training data or for which the environment can be simulated).

Comment by ofer on Healthy Competition · 2019-10-21T06:28:29.126Z · score: 2 (2 votes) · LW · GW
If you want to challenge a monopoly with a new org, there's likewise a particular burden to do a good job.

(This seems to depend on whether the job/project in question benefits from concentration.)

Comment by ofer on Healthy Competition · 2019-10-21T06:19:11.327Z · score: 5 (3 votes) · LW · GW

This topic seems very important!

Another potential consideration: In some cases, not having any competition can expose an org to a ~50% probability of having net-negative impact, simply due to the possibility that a counterfactual org (founded by someone else) would have done a better job.

Comment by ofer on Defining Myopia · 2019-10-20T16:43:24.566Z · score: 2 (2 votes) · LW · GW
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.

There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.

Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model to outperform another model—that takes one less Pareto improvement—on future fitness evaluations (all other things being equal).

Comment by ofer on Optimization Provenance · 2019-10-16T18:04:17.157Z · score: 1 (1 votes) · LW · GW
The optimization processes used in the agent must all be capable of controlling whether it creates a mesa-optimizer.

I'm confused about this sentence - my understanding is that the term mesa-optimizer refers to the agent/model itself when it is doing some optimization. I think the term "run-time optimization" (which I've seen in this slide, seemingly from a talk by Yann LeCun) refers to this type of optimization.

4. Does this apply to other forms of optimization daemons?

Isn't every optimization daemon a mesa-optimizer?

I was under the impression that the term "optimization daemon" was used to describe a mesa-optimizer that is a "consequentialist" (I don't know whether there's a common definition for the term "consequentialist" in this context; my own tentative fuzzy definition is "something that has preferences about the spacetime of the world/multiverse".)

Comment by ofer on Gradient hacking · 2019-10-16T04:27:26.074Z · score: 6 (4 votes) · LW · GW

I think this post describes an extremely important problem and research directions, and I hope a lot more research and thought goes into this!

ETA: Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.

Comment by ofer on Impact measurement and value-neutrality verification · 2019-10-15T08:59:49.533Z · score: 4 (3 votes) · LW · GW

Very interesting!

Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn't give us much assurance about the behavior of the model.

Comment by ofer on Thoughts on "Human-Compatible" · 2019-10-10T19:27:37.219Z · score: 5 (3 votes) · LW · GW
Any other ideas for "decoupled" AIs, or risks that apply to this approach in general?

If the question is about all the risks that apply, rather than special risks with this specific approach, then I'll note that the usual risks from the inner alignment problem seem to apply.

Comment by ofer on Machine Learning Projects on IDA · 2019-10-08T16:50:55.094Z · score: 1 (1 votes) · LW · GW

Regarding the following passage from the document:

What kind of built-in operations and environments should we use?
In existing work on NPI, the neural net is given outputs that correspond to basic operations on data. This makes it easier to learn algorithms that depend on those basic operations. For IDA, it would be ideal to learn these operations from examples. (If we were learning from human decompositions, we might not know about these “basic operations on data” ahead of time).

Do you have ideas/intuitions about how "basic operations" in the human brain can be learned? Also, how basic are the "basic operations" you're thinking about here? (Are we talking about something like the activity of an individual biological neuron? How active is a particular area in the prefrontal cortex? Symbolic-level stuff?)

Generally, do you consider imitating human cognition at the level of "basic operations" to be part of the IDA agenda? (As opposed to, say, training a model to "directly" predict the output of a human-working-for-10-minutes).

Comment by ofer on List of resolved confusions about IDA · 2019-10-01T15:13:25.543Z · score: 5 (3 votes) · LW · GW
The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

Comment by ofer on Partial Agency · 2019-09-29T04:21:33.520Z · score: 4 (2 votes) · LW · GW
Toward the end, the parameter value could mutate away.

I agree that it's possible to get myopic models in the population after arbitrarily long runtime due to mutations. It seems less likely the more bits need to change (in any model in the current population) to get a model that is completely myopic.

From a safety perspective, if the prospect of some learning algorithm yielding a non-myopic model is concerning, the prospect of it creating non-myopic models along the way is plausibly also concerning.

And how would you apply evolutionary algorithms to really non-myopic settings, like reinforcement learning where you can't create any good episode boundaries (for example, you have a robot interacting with an environment "live", no resets, and you want to learn on-line)?

In this example, if we train an environment model on the data collected so far (ignoring the "you want to learn on-line" part), evolutionary algorithms might be an alternative to regular deep learning. More realistically, some actors would probably invest a lot of resources in developing top predictive models for stock prices etc., and evolutionary algorithms might be one of the approaches being experimented with.

Also, people might experiment with evolutionary algorithms as an alternative to RL, for environments that can be simulated, as OpenAI did; they wrote (2017): "Our work suggests that neuroevolution approaches can be competitive with reinforcement learning methods on modern agent-environment benchmarks, while offering significant benefits related to code complexity and ease of scaling to large-scale distributed settings."

Comment by ofer on Partial Agency · 2019-09-28T09:00:30.584Z · score: 9 (4 votes) · LW · GW
If there's a learning-theoretic setup which incentivizes the development of "full agency" (whatever that even means, really!) I don't know what it is yet.

Consider evolutionary algorithms. It seems that (theoretically) they tend to yield non-myopic models by default given sufficiently long runtime. For example, a network parameter value that causes behavior that minimizes loss in future training inferences might be more likely to end up in the final model than one that causes behavior that minimizes loss in the current inference at a great cost for the loss in future ones.

Comment by ofer on Towards an empirical investigation of inner alignment · 2019-09-24T04:57:07.388Z · score: 2 (2 votes) · LW · GW

Very interesting! This research direction might lead to researchers having better intuitions about what sort of mesa-objectives we're more likely to end up with.

Perhaps similar experiments can be done with supervised learning (instead of RL).

Comment by ofer on AI Alignment Open Thread August 2019 · 2019-09-23T07:11:20.905Z · score: 1 (1 votes) · LW · GW

These biases seem very important to keep in mind!

If "AI safety" refers here only to AI alignment, I'd be happy to read about how overconfidence about the difficulty/safety of one's approach might exacerbate the unilateralist's curse.

Comment by ofer on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-21T18:43:55.147Z · score: 1 (1 votes) · LW · GW

Ah, makes sense, thanks.

Comment by ofer on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-21T17:54:29.930Z · score: 2 (2 votes) · LW · GW
From AlphaStar, we’ve learned that one of two things is true: Either AI can [...] solve basic game theory problems

I'm confused about the mention of game theory. Did AlphaStar play in games that included more than two teams?

Comment by ofer on Feature Wish List for LessWrong · 2019-09-21T15:49:30.116Z · score: 2 (2 votes) · LW · GW

The ability to get notified, in any way, about new comments in specific threads/posts would be very helpful for me!

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-15T10:16:23.760Z · score: 1 (1 votes) · LW · GW

Some tangentially related thoughts:

It seems that in many simple worlds (such as the Bomb world), an indexically-selfish agent with a utility function U over centered histories would prefer to commit to UDT with a utility function U' over uncentered histories, where U' is defined as the sum of all the "uncentered versions" of U (version i corresponds to U when the pointer is assumed to point to agent i).

Things seem to get more confusing in messy worlds in which the inability of an agent to define a utility function (over uncentered histories) that distinguishes between agent1 and agent2 does not entail that the two agents are about to make the same decision.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-14T19:05:36.703Z · score: 3 (2 votes) · LW · GW

I agree. It seems that in that situation the person would be "rational" to choose Right.

I'm still confused about the "UDT is incompatible with this kind of selfish values" part. It seems that an indexically-selfish person—after failing to make a binding commitment and seeing the bomb—could still rationally commit to UDT from that moment on, by defining the utility s.t. only copies that found themselves in that situation (i.e. those who failed to make a binding commitment and saw the bomb) matter. That utility is a function over uncentered histories of the world, and would result in UDT choosing Right.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-14T17:12:54.891Z · score: 12 (3 votes) · LW · GW
Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don’t know if I’m a simulation or a real person. If I was selfish in an indexical way, I would think something like “If I’m a simulation then it doesn’t matter what I choose. The simulation will end as soon as I make a choice so my choice is inconsequential. But if I’m a real person, choosing Left will cause me to be burned. So I should choose Right.”

It seems to me that even in this example, a person (who is selfish in an indexical way) would prefer—before opening their eyes—to make a binding commitment to choose left. If so, the "intuitively correct answer" that UDT is unable to give is actually just the result of a failure to make a beneficial binding commitment.

Comment by ofer on A Critique of Functional Decision Theory · 2019-09-13T21:55:24.388Z · score: 13 (8 votes) · LW · GW

(I'm not a decision theorist)

FDT in any form will violate Guaranteed Payoffs, which should be one of the most basic constraints on a decision theory

Fulfilling the Guaranteed Payoffs principle as defined here seems to entail two-boxing in the Transparent Newcomb's Problem, and generally not being able to follow through on precommitments when facing a situation with no uncertainty.

My understanding is that a main motivation for UDT (which FDT is very similar to?) is to get an agent that, when finding itself in a situation X, follows through on any precommitment that—before learning anything about the world—the agent would have wanted to follow through on when it is in situation X. Such a behavior would tend to violate the Guaranteed Payoffs principle, but would be beneficial for the agent?

Comment by ofer on Two senses of “optimizer” · 2019-09-12T07:29:42.323Z · score: 1 (1 votes) · LW · GW
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in.

This definition of optimizer_2 depends on the definition of "environment". It seems that for an RL agent you use the word "environment" to mean the formal environment as defined in RL. How do you define "environment", for this purpose, in non-RL settings?

What should be considered the environment of a SAT solver, or an arbitrary mesa-optimizer that was optimized to be a SAT solver?

Comment by ofer on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-12T06:19:19.602Z · score: 1 (1 votes) · LW · GW
Rohin's opinion: I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.

It seems to me that a SAT solver can be arbitrarily competent at solving SAT problems without being the second kind of optimizer (i.e. without acting upon its environment to change it), even while it solves SAT problems that encode the dynamics of our world. For example, this seems to be the case for a SAT solver that is just a brute-force search with an arbitrarily large amount of computing power.
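To make "just a brute force search" concrete, here is a minimal sketch of such a solver. The clause encoding (signed integers, as in the DIMACS convention) and the function name `brute_force_sat` are choices made for this illustration only. The program enumerates assignments and checks them; nothing in it models or acts on anything outside the formula it is given.

```python
from itertools import product

def brute_force_sat(num_vars, clauses):
    """Return a satisfying assignment as a tuple of bools, or None.

    clauses is a CNF formula: each clause is a list of non-zero ints,
    where k means "variable k is true" and -k means "variable k is false".
    """
    for assignment in product([False, True], repeat=num_vars):
        # A clause is satisfied if at least one of its literals holds.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 or x2) and (not x1)
example = brute_force_sat(2, [[1, 2], [-1]])
```

The running time is exponential in `num_vars`, which is why the argument above assumes an arbitrarily large amount of computing power.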

[EDIT: When writing this comment, I considered "the environment of a SAT solver" to be the world that contains the computer running the SAT solver. However, this seems to contradict what Joar had in mind in his post.]

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T18:52:38.880Z · score: 1 (1 votes) · LW · GW

Ah, I agree (edited my comment above accordingly).

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T18:05:50.245Z · score: 4 (2 votes) · LW · GW
Interesting... it seems that this doesn't necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.

I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).

I think it still applies to evolutionary algorithms (which might end up being relevant).

how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?

Maybe learning algorithms that have the following property are more likely to yield models with "cross-episodic behavior":

During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.

Also, what name would you suggest for this problem, if not "inner alignment"?

Maybe "non-myopia" as Evan suggested.

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T10:44:19.543Z · score: 5 (3 votes) · LW · GW
Inner alignment - The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).

If the trained model tries to minimize loss in future episodes, it definitely seems dangerous, but I'm not sure that we should consider this an inner-alignment failure. In some sense we got the behavior that our episodic learning algorithm was optimizing for.

For example, consider the following episodic learning algorithm: At the end of each episode, if the model failed to achieve the episode's goal, its network parameters are completely randomized (and if it achieved the goal, the model is unchanged). If we run this learning algorithm for an arbitrarily long time, we should expect to end up with a model that behaves in a way that results in achieving the goal in every future episode (if such a model exists).
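A toy simulation of this scheme, as a minimal sketch under strong simplifying assumptions: the "model" is a tuple of bits rather than a network, episodes cycle through a fixed number of types, and an episode's goal is achieved iff the bit for its type is 1. The function name and parameters are hypothetical.

```python
import random

def run_reset_on_failure(num_episode_types=4, num_episodes=10_000, seed=0):
    """Simulate the 'randomize all parameters on failure' learning scheme.

    Episode t has type t % num_episode_types, and the model achieves
    the goal of that episode iff its bit for that type is 1.
    """
    rng = random.Random(seed)

    def random_model():
        return tuple(rng.randint(0, 1) for _ in range(num_episode_types))

    model = random_model()
    resets = 0
    for t in range(num_episodes):
        achieved_goal = model[t % num_episode_types] == 1
        if not achieved_goal:
            # Failure: throw the whole model away and start from scratch.
            model = random_model()
            resets += 1
        # Success: the model is left unchanged.
    return model, resets

final_model, num_resets = run_reset_on_failure()
```

In this toy setting the only model that is never reset is the one that achieves the goal in every episode type, so a long enough run ends up there even though each update looks only at the current episode.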

Comment by ofer on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T10:31:20.461Z · score: 2 (2 votes) · LW · GW

I think the following is potentially another remaining safety problem:

[EDIT: actually it's an inner alignment problem, using the definition here]

[EDIT2: i.e. using the following definitions from the above link:

  • "we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. "
  • "We will call the problem of eliminating the base-mesa objective gap the inner alignment problem".


Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many "luckier" copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by "magically" finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those "luckier" copies.

Comment by ofer on Are minimal circuits deceptive? · 2019-09-07T22:26:00.149Z · score: 4 (3 votes) · LW · GW

Very interesting!

I'm confused about why the "spontaneous meta-learning" in Ortega et al. is equivalent to (or a special case of?) mesa-optimization, which was also suggested in MIRI's August 2019 Newsletter. My understanding of Ortega et al. is that "spontaneous meta-learning" describes a scenario in which training on a sequence from a single generator is equivalent to training on sequences from multiple generators. I haven't seen them discuss this issue in the context of the trained model itself doing search/optimization.

Comment by ofer on Concrete experiments in inner alignment · 2019-09-07T08:09:20.959Z · score: 4 (3 votes) · LW · GW
To what extent do models care about their performance across episodes? If there exists a side-channel which only increases next-episode performance, under what circumstances will a model exploit such a thing?

If an agent is trained with an episodic learning scheme and ends up with a behavior that maximizes reward across episodes, I'm not sure we should consider this an inner alignment failure. In some sense, we got the behavior that our learning scheme was optimizing for. [EDIT: this is not necessarily true for all learning algorithms, e.g. gradient descent, see discussion here]

To quickly see this, imagine an episodic learning scheme where—at the end of each episode—if the agent failed to achieve the episode's goal then its policy network parameters are completely randomized, and otherwise the agent is unchanged. Assuming we have infinite resources, if we run this learning scheme for an arbitrarily long time, we should expect to end up with an agent that tries to achieve goals in future episodes.

Comment by ofer on One Way to Think About ML Transparency · 2019-09-03T17:15:52.776Z · score: 1 (2 votes) · LW · GW
You'd have to memorize all the training data and labels too.

(just noting that the same goes for a decision tree that isn't small enough s.t. the human can memorize it)

Comment by ofer on One Way to Think About ML Transparency · 2019-09-03T15:59:25.246Z · score: 2 (2 votes) · LW · GW
Unless the human could memorize the initialization parameters, they would be using a different neural network to classify.

Why wouldn't the "human-trained network" be identical to the "original network"? [EDIT: sorry, missed your point. If the human knows the logic of the random number generator that was used to initialize the parameters of the original network, they can manually run the same logic themselves.]

By the same logic, any decision tree that is too large for a human to memorize does not allow theory simulatability as defined in the OP.

Comment by ofer on One Way to Think About ML Transparency · 2019-09-03T07:24:07.541Z · score: 3 (3 votes) · LW · GW

I'm still not sure about the distinction. A human with an arbitrarily large amount of time & paper could "train" a new NN (instead of working to "extract a decision tree"), and then "use" that NN.

Comment by ofer on Two senses of “optimizer” · 2019-08-22T09:55:11.500Z · score: 1 (1 votes) · LW · GW
Also, as a terminological note, I've taken to using "optimizer" for optimizer_1 and "agent" for something closer to optimizer_2, where I've been defining an agent as an optimizer that is performing a search over what its own action should be.

I'm confused about this part. According to this definition, is "agent" a special case of optimizer_1? If so it doesn't seem close to how we might want to define a "consequentialist" (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).

Comment by ofer on Two senses of “optimizer” · 2019-08-22T09:49:55.793Z · score: 3 (3 votes) · LW · GW

Maybe we're just not using the same definitions, but according to the definitions in the OP as I understand them, a box might indeed contain an arbitrarily strong optimizer_1 while not containing an optimizer_2.

For example, suppose the box contains an arbitrarily large computer that runs a brute-force search for some formal optimization problem. [EDIT: for some optimization problems, the evaluation of a solution might result in the execution of an optimizer_2]

Comment by ofer on Two senses of “optimizer” · 2019-08-22T06:02:02.926Z · score: 2 (3 votes) · LW · GW

It seems useful to have a quick way of saying:

"The quarks in this box implement a Turing Machine that [performs well on the formal optimization problem P and does not do any other interesting stuff]. And the quarks do not do any other interesting stuff."

(which of course does not imply that the box is safe)

Comment by ofer on Clarifying some key hypotheses in AI alignment · 2019-08-16T05:39:53.012Z · score: 7 (5 votes) · LW · GW

Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).

Comment by ofer on Do you use twitter for intellectual engagement? Do you like it? · 2019-08-12T10:34:15.489Z · score: 6 (3 votes) · LW · GW

Looking at the "regular" Twitter feed seems as dangerous for one's productivity as looking at Facebook's feed. Market incentives require Twitter to make their users spend as much time as possible on their platform (using the best ML models they can train for that purpose).

A safer way to use Twitter is to create a very short list of Twitter accounts (the accounts with the largest EV/tweet), and then regularly go over the complete "feed" of just that list, sorted chronologically (not giving Twitter any say in what you see).

Comment by ofer on How can I help research Friendly AI? · 2019-08-09T09:30:25.875Z · score: 3 (2 votes) · LW · GW

If you haven't already, check out the 80,000 Hours website (their goal is to provide useful advice on how people can use their career to do the most good).

Here are some links that seem relevant specifically for you (some might be out of date): (see "AI safety technical researcher" box)

You can also apply for their coaching service.

Comment by ofer on Aligning a toy model of optimization · 2019-07-01T18:56:41.189Z · score: 3 (2 votes) · LW · GW

When you say "test" do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?

If so, I'm concerned that we won't be able to write such a program.

If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don't understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).

Comment by ofer on Aligning a toy model of optimization · 2019-06-30T21:43:41.808Z · score: 4 (3 votes) · LW · GW

In the case of deceptive alignment, our ability to test whether the model behaves badly on input x affects the behavior of the model on input x (and similarly, our ability to come up with an f s.t. Opt(f) allows us to find x --if the model behaves badly on input x-- affects the behavior of the model on input x).

Therefore, to the extent that deceptive alignment is plausible in programs that Opt outputs, the inner alignment problem seems to me very hard.

Comment by ofer on Aligning a toy model of optimization · 2019-06-29T06:16:52.602Z · score: 5 (3 votes) · LW · GW

Not much of an impossibility argument, but I just want to point out that any solution that involves outputting a program should somehow deal with the concern that the program might contain inner optimizers. This aspect seems to me very hard (unless we somehow manage to conclude that the smallest/fastest program that computes some function does not contain inner optimizers).

ETA: the term "inner optimizer" is deprecated in favor of "mesa-optimizer".

Comment by ofer on Risks from Learned Optimization: Introduction · 2019-06-09T19:44:45.734Z · score: 3 (2 votes) · LW · GW

Agreed (haven't thought about that).

Comment by ofer on Risks from Learned Optimization: Introduction · 2019-06-09T05:00:56.105Z · score: 5 (4 votes) · LW · GW

The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).

Comment by ofer on Conditions for Mesa-Optimization · 2019-06-05T11:26:17.386Z · score: 6 (4 votes) · LW · GW
my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).

Consider the following 2-player turn-based zero-sum game as an example of a task in which "heuristics" seemingly can't replace a tree search.

The game starts with an empty string. In each turn the following things happen:

(1) the player adds to the end of the string either "A" or "B".

(2) the string is replaced with its SHA256 hash.

Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.

(Alternatively, consider the 1-player version of this game, starting with a random string.)
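For concreteness, here is a minimal sketch of the exhaustive tree search for the 2-player version. The comment leaves several details open, so the following are assumptions made for this illustration: Player 1 moves first and the players alternate, the "string" after each turn is the hex digest of the SHA-256 hash, and the "first bit" is the most significant bit of that digest.

```python
import hashlib

TURNS = 10

def play(state: str, move: str) -> str:
    """One turn: append the move, then replace the string with its SHA-256 hex digest."""
    return hashlib.sha256((state + move).encode("ascii")).hexdigest()

def first_bit_is_one(state: str) -> bool:
    """The leading bit of the digest is set iff the first hex digit is 8 or more."""
    return int(state[0], 16) >= 8

def player_one_wins(state: str = "", turn: int = 0) -> bool:
    """Exhaustive game-tree search: True iff Player 1 can force a win from here."""
    if turn == TURNS:
        return first_bit_is_one(state)
    outcomes = [player_one_wins(play(state, move), turn + 1) for move in "AB"]
    # Player 1 moves on even-numbered turns (0, 2, ...), Player 2 on odd ones.
    return any(outcomes) if turn % 2 == 0 else all(outcomes)

result = player_one_wins()
```

With 10 turns the tree has only 2^10 leaves, so the search is cheap here. The point of the example is that the cost of this search grows exponentially with the number of turns, while the hash is meant to leave no exploitable structure for a shortcut.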

Comment by ofer on Where are people thinking and talking about global coordination for AI safety? · 2019-05-22T10:39:53.186Z · score: 17 (9 votes) · LW · GW
My question is, who is thinking directly about how to achieve such coordination (aside from FHI's Center for the Governance of AI, which I'm aware of) and where are they talking about it?

OpenAI has a policy team (this 80,000 Hours podcast episode is an interview with three people from that team), and I think their research areas include models for coordination between top AI labs, and improving publication norms in AI (e.g. maybe striving for norms that are more like those in computer security, where people are expected to follow some responsible disclosure process when publishing about new vulnerabilities). For example, the way OpenAI is releasing their new language model GPT-2 seems like a useful way to learn about the usefulness/feasibility of new publication norms in AI (see the "Release Strategy" section here).

I think related work is also being done at the Centre for the Study of Existential Risk (CSER).

Comment by ofer on Interpretations of "probability" · 2019-05-11T14:45:40.883Z · score: 16 (5 votes) · LW · GW
The claim "I think this coin is heads with probability 50%" is an expression of my own ignorance, and 50% probability means that I'd bet at 1 : 1 odds (or better) that the coin came up heads.

Just a minor quibble - using this interpretation to define one's subjective probabilities is problematic because people are not necessarily indifferent about placing a bet that has an expected value of 0 (e.g. due to loss aversion).

Therefore, I think the following interpretation is more useful: Suppose I win [some reward] if the coin comes up heads. I'd prefer to replace the winning condition with "the ball in a roulette wheel ends up in a red slot" for any roulette wheel in which more than 50% of the slots are red.

(I think I first came across this type of definition in this post by Andrew Critch)
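The roulette-wheel interpretation suggests a simple elicitation procedure, sketched below. It is only an illustration under assumptions not stated in the comment: the person's preferences are represented by a hypothetical callback `prefers_wheel(q)`, and those preferences are assumed to be monotone in the fraction `q` of red slots, so that bisection applies.

```python
from typing import Callable

def elicit_probability(prefers_wheel: Callable[[float], bool],
                       tolerance: float = 1e-6) -> float:
    """Bisect on the red fraction of a hypothetical roulette wheel.

    prefers_wheel(q) should return True iff the person would rather win on
    "the ball lands in a red slot" (with a fraction q of red slots) than on
    the original event. The switch-over point is the subjective probability.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if prefers_wheel(mid):
            high = mid  # The wheel is already preferred; try fewer red slots.
        else:
            low = mid   # The event is still preferred; try more red slots.
    return (low + high) / 2

# Someone who prefers the wheel exactly when more than 50% of the slots are red.
coin_estimate = elicit_probability(lambda red_fraction: red_fraction > 0.5)
```

Because the reward is the same on both sides of each comparison, loss aversion about the stake does not enter the procedure, which is the advantage over the betting-odds definition noted above.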

Comment by ofer on Implications of GPT-2 · 2019-05-11T06:27:18.817Z · score: 1 (1 votes) · LW · GW

Thank you for clarifying!

FWIW, when I wrote "the exact same problem but with different labels" I meant "the exact same problem but with different arbitrary names for entities".

For example, I would consider the following two problems to be "the exact same problem but with different labels":

"X+1=2 therefore X="

"Y+1=2 therefore Y="

But NOT the following two problems:



Comment by ofer on Implications of GPT-2 · 2019-05-10T09:58:58.722Z · score: 1 (1 votes) · LW · GW
And this sounds like goal post moving:

I'm failing to see a goal-post-moving between me writing:

It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

and then later writing (in reply to your comment quoting that sentence):

unless a very similar problem appears in the training data - e.g. the exact same problem but with different labels

If I'm missing something I'd be grateful for a further explanation.

Comment by ofer on Alignment Newsletter One Year Retrospective · 2019-04-21T18:05:23.452Z · score: 3 (2 votes) · LW · GW
However, Twitter has become worse over time, possibly because it has learned to show me non-academic stuff that is more attention-grabbing or controversial, despite me trying not to click on those sorts of things.

On Twitter you can create a list of relevant people (e.g. people who tend to tweet about relevant papers/posts) and then go over the complete "feed" of just that list, sorted chronologically.