## Posts

A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z · score: 6 (4 votes)
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z · score: 41 (13 votes)
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z · score: 12 (4 votes)

Comment by ofer on Two senses of “optimizer” · 2019-08-22T09:55:11.500Z · score: 1 (1 votes) · LW · GW
Also, as a terminological note, I've taken to using "optimizer" for optimizer_1 and "agent" for something closer to optimizer_2, where I've been defining an agent as an optimizer that is performing a search over what its own action should be.

I'm confused about this part. According to this definition, is "agent" a special case of optimizer_1? If so it doesn't seem close to how we might want to define a "consequentialist" (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).

Comment by ofer on Two senses of “optimizer” · 2019-08-22T09:49:55.793Z · score: 3 (3 votes) · LW · GW

Maybe we're just not using the same definitions, but according to the definitions in the OP as I understand them, a box might indeed contain an arbitrarily strong optimizer_1 while not containing an optimizer_2.

For example, suppose the box contains an arbitrarily large computer that runs a brute-force search for some formal optimization problem. [EDIT: for some optimization problems, the evaluation of a solution might result in the execution of an optimizer_2]

Comment by ofer on Two senses of “optimizer” · 2019-08-22T06:02:02.926Z · score: 2 (3 votes) · LW · GW

It seems useful to have a quick way of saying:

"The quarks in this box implement a Turing Machine that [performs well on the formal optimization problem P and does not do any other interesting stuff]. And the quarks do not do any other interesting stuff."

(which of course does not imply that the box is safe)

Comment by ofer on Clarifying some key hypotheses in AI alignment · 2019-08-16T05:39:53.012Z · score: 7 (5 votes) · LW · GW

Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).

Comment by ofer on Do you use twitter for intellectual engagement? Do you like it? · 2019-08-12T10:34:15.489Z · score: 6 (3 votes) · LW · GW

Looking at the "regular" Twitter feed seems as dangerous for one's productivity as looking at Facebook's feed. Market incentives require Twitter to make their users spend as much time as possible on their platform (using the best ML models they can train for that purpose).

A safer way to use Twitter is to create a very short list of Twitter accounts (the accounts with the largest EV/tweet), and then regularly go over the complete "feed" of just that list, sorted chronologically (not giving Twitter any say in what you see).

Comment by ofer on How can I help research Friendly AI? · 2019-08-09T09:30:25.875Z · score: 3 (2 votes) · LW · GW

If you haven't already, check out the 80,000 Hours website (their goal is to provide useful advice on how people can use their career to do the most good).

Here are some links that seem relevant specifically for you (some might be out of date):
https://80000hours.org/key-ideas/ (see "AI safety technical researcher" box)
https://80000hours.org/articles/high-impact-careers/#ai-safety-technical-researcher
https://80000hours.org/problem-profiles/positively-shaping-artificial-intelligence/
https://80000hours.org/career-reviews/artificial-intelligence-risk-research/
https://80000hours.org/career-reviews/machine-learning-phd/

You can also apply for their coaching service.

Comment by ofer on Aligning a toy model of optimization · 2019-07-01T18:56:41.189Z · score: 3 (2 votes) · LW · GW

When you say "test" do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?

If so, I'm concerned that we won't be able to write such a program.

If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don't understand how we can use to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).

Comment by ofer on Aligning a toy model of optimization · 2019-06-30T21:43:41.808Z · score: 4 (3 votes) · LW · GW

In the case of deceptive alignment, our ability to test whether the model behaves badly on input affects the behavior of the model on input (and similarly, our ability to come up with a s.t. allows us to find --if the model behaves badly on input --affects the behavior of the model on input ).

Therefore, to the extent that deceptive alignment is plausible in programs that outputs, the inner alignment problem seems to me very hard.

Comment by ofer on Aligning a toy model of optimization · 2019-06-29T06:16:52.602Z · score: 5 (3 votes) · LW · GW

Not much of an impossibility argument, but I just want to point out that any solution that involves outputting a program should somehow deal with the concern that the program might contain inner optimizers. This aspect seems to me very hard (unless we somehow manage to conclude that the smallest/fastest program that computes some function does not contain inner optimizers).

ETA: the term "inner optimizer" is deprecated in favor of "mesa-optimizer".

Comment by ofer on Risks from Learned Optimization: Introduction · 2019-06-09T19:44:45.734Z · score: 1 (1 votes) · LW · GW

Comment by ofer on Risks from Learned Optimization: Introduction · 2019-06-09T05:00:56.105Z · score: 3 (3 votes) · LW · GW

The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).

Comment by ofer on Conditions for Mesa-Optimization · 2019-06-05T11:26:17.386Z · score: 6 (4 votes) · LW · GW
my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).

Consider the following 2-player turn-based zero-sum game as an example for a task in which "heuristics" seemingly can't replace a tree search.

The game starts with an empty string. In each turn the following things happen:

(1) the player adds to the end of the string either "A" or "B".

(2) the string is replaced with its SHA256 hash.

Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.

(Alternatively, consider the 1-player version of this game, starting with a random string.)
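
To make the game concrete, here is a minimal sketch of the exhaustive tree search for it. It rests on several assumptions that the description above leaves open: the players alternate with Player 1 moving first, the string is stored as a SHA-256 hex digest, and "the first bit" means the most significant bit of the final digest. The function names are hypothetical.

```python
import hashlib
from functools import lru_cache

MOVES = ("A", "B")


def step(state: str, move: str) -> str:
    """Append the move, then replace the string with its SHA-256 hex digest."""
    return hashlib.sha256((state + move).encode("ascii")).hexdigest()


def first_bit(state: str) -> int:
    """Most significant bit of a hex-digest state (assumed reading of 'first bit')."""
    return int(state[0], 16) >> 3


@lru_cache(maxsize=None)
def player1_wins(state: str = "", turns_left: int = 10, p1_to_move: bool = True) -> bool:
    """Exhaustive minimax: True iff Player 1 can force the final first bit to be 1."""
    if turns_left == 0:
        return first_bit(state) == 1
    outcomes = [
        player1_wins(step(state, move), turns_left - 1, not p1_to_move)
        for move in MOVES
    ]
    # Player 1 needs one winning move; Player 2 needs every move to still lose.
    return any(outcomes) if p1_to_move else all(outcomes)


if __name__ == "__main__":
    print("Player 1 can force a win:", player1_wins())
```

With 10 turns there are only 2^10 leaf positions, so the search itself is cheap. The point of the example is that there is no apparent shortcut that avoids evaluating the hash chain for each line of play.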

Comment by ofer on Where are people thinking and talking about global coordination for AI safety? · 2019-05-22T10:39:53.186Z · score: 17 (9 votes) · LW · GW
My question is, who is thinking directly about how to achieve such coordination (aside from FHI's Center for the Governance of AI, which I'm aware of) and where are they talking about it?

OpenAI has a policy team (this 80,000 Hours podcast episode is an interview with three people from that team), and I think their research areas include models for coordination between top AI labs, and improving publication norms in AI (e.g. maybe striving for norms that are more like those in computer security, where people are expected to follow some responsible disclosure process when publishing about new vulnerabilities). For example, the way OpenAI is releasing their new language model GPT-2 seems like a useful way to learn about the usefulness/feasibility of new publication norms in AI (see the "Release Strategy" section here).

I think related work is also being done at the Centre for the Study of Existential Risk (CSER).

Comment by ofer on Interpretations of "probability" · 2019-05-11T14:45:40.883Z · score: 16 (5 votes) · LW · GW
The claim "I think this coin is heads with probability 50%" is an expression of my own ignorance, and 50% probability means that I'd bet at 1 : 1 odds (or better) that the coin came up heads.

Just a minor quibble - using this interpretation to define one's subjective probabilities is problematic because people are not necessarily indifferent about placing a bet that has an expected value of 0 (e.g. due to loss aversion).

Therefore, I think the following interpretation is more useful: Suppose I win [some reward] if the coin comes up heads. I'd prefer to replace the winning condition with "the ball in a roulette wheel ends up in a red slot" for any roulette wheel in which more than 50% of the slots are red.

(I think I first came across this type of definition in this post by Andrew Critch)

Comment by ofer on Implications of GPT-2 · 2019-05-11T06:27:18.817Z · score: 1 (1 votes) · LW · GW

Thank you for clarifying!

FWIW, when I wrote "the exact same problem but with different labels" I meant "the exact same problem but with different arbitrary names for entities".

For example, I would consider the following two problems to be "the exact same problem but with different labels":

"X+1=2 therefore X="

"Y+1=2 therefore Y="

But NOT the following two problems:

"1+1="

"1+2="

Comment by ofer on Implications of GPT-2 · 2019-05-10T09:58:58.722Z · score: 1 (1 votes) · LW · GW
And this sounds like goal post moving:

I'm failing to see a goal-post-moving between me writing:

It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

and then later writing (in reply to your comment quoting that sentence):

unless a very similar problem appears in the training data - e.g. the exact same problem but with different labels

If I'm missing something I'd be grateful for a further explanation.

Comment by ofer on Alignment Newsletter One Year Retrospective · 2019-04-21T18:05:23.452Z · score: 3 (2 votes) · LW · GW
However, Twitter has become worse over time, possibly because it has learned to show me non-academic stuff that is more attention-grabbing or controversial, despite me trying not to click on those sorts of things.

On Twitter you can create a list of relevant people (e.g. people who tend to tweet about relevant papers/posts) and then go over the complete "feed" of just that list, sorted chronologically.

Comment by ofer on Alignment Newsletter One Year Retrospective · 2019-04-21T18:01:41.726Z · score: 3 (2 votes) · LW · GW

Explicitly saying that you'd like feedback on the newsletter (like you just did in this post) would probably help, and, as digital_carver suggested, you can include a request for feedback in each newsletter. For example, the Import AI newsletter ends with "If you have suggestions, comments or other thoughts you can reach me at ... or tweet at me..."

Comment by ofer on Alignment Newsletter One Year Retrospective · 2019-04-21T17:56:20.907Z · score: 5 (3 votes) · LW · GW

The newsletter is extremely helpful for me for keeping up to date with AI alignment research. I also find the "Other progress in AI" section very helpful.

Both the summaries and the opinion segments are extremely helpful for me!

Overall, I think that reading (or listening to) all the ANs that I've read so far was an extremely high EV-per-hour time investment.

Comment by ofer on Best reasons for pessimism about impact of impact measures? · 2019-04-11T07:19:05.500Z · score: 9 (5 votes) · LW · GW

Here's a relevant passage by Rohin (from Alignment Newsletter #49, March 2019):

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.
Comment by ofer on A Safer Oracle Setup? · 2019-03-16T15:13:57.718Z · score: 18 (4 votes) · LW · GW

Update: The setup described in the OP involves a system that models humans. See this MIRI article for a discussion on some important concerns about such systems.

Comment by ofer on Asymptotically Unambitious AGI · 2019-03-07T08:15:14.447Z · score: 1 (1 votes) · LW · GW
In none of these world-models, under no actions that it considers, does "episode 117 happen twice."

Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the "current execution". The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the "current execution" is one that ends with the agent receiving a high reward.

Comment by ofer on Thoughts on Human Models · 2019-03-02T16:07:03.592Z · score: 3 (2 votes) · LW · GW
It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.

One scenario that comes to mind: an agent generates a manipulative output that is optimized to be approved by the programmers while causing the agent to seize control over more resources (in a way that is against the actual preferences of the programmers).

Comment by ofer on Implications of GPT-2 · 2019-02-19T10:48:53.895Z · score: 1 (1 votes) · LW · GW

Sorry, I didn't understand the question (and what you meant by "The loss function is undefined after training.").

After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers then the question answering system, which was trained by supervised learning, "attempts" (during inference) to minimize the expected loss function value as defined by the original distribution from which the training examples were sampled; and any other goal system is the result of inner optimizers.

Comment by ofer on Implications of GPT-2 · 2019-02-19T10:30:49.557Z · score: -4 (2 votes) · LW · GW

We might be interpreting "modest logic-related stuff" differently - I am thinking about simple formal problems like sorting a short list of integers.

I wouldn't be surprised if GPT-2 (or its smaller version) is very capable at completing strings like "[1,2," in a way that is merely syntactically correct. Publicly available texts on the internet probably contain a lot of comma-separated number lists in brackets. The challenge is for the model to have the ability to sort numbers (when trained only to predict the next word in internet texts).

However, after thinking about it more I am now less confident that GPT-2 would fail to complete my above sentence with a correctly sorted list, because for any two small integers like 2 and 3 it is plausible that the training data contains more "2,3" strings than "3,2" strings.

"The median number in the list [9,2,1,6,8] is "

I'm pretty sure that GPT-2 would fail at least 1/5 of the time to complete such a sentence (i.e. if we query it multiple times and each time the sentence contains small random integers).
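
The repeated-query test described above could be set up roughly as follows. This is only a sketch of the test harness: `complete` is a hypothetical callable that wraps whatever model is being tested and returns its text completion, and no actual model is called here. The list length and the range of the integers are my own illustrative choices.

```python
import random
import statistics


def make_median_prompt(rng: random.Random, length: int = 5, max_value: int = 9):
    """Return (prompt, expected_answer) for one random median question."""
    numbers = [rng.randint(1, max_value) for _ in range(length)]
    listing = ",".join(str(n) for n in numbers)
    prompt = f"The median number in the list [{listing}] is "
    # With an odd length, statistics.median returns the middle element itself.
    return prompt, statistics.median(numbers)


def failure_rate(complete, trials: int = 100, seed: int = 0) -> float:
    """Fraction of prompts whose completion does not start with the right answer."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        prompt, expected = make_median_prompt(rng)
        if not complete(prompt).strip().startswith(str(expected)):
            failures += 1
    return failures / trials
```

The prediction above would then correspond to `failure_rate(complete)` coming out at 0.2 or higher for a completion function backed by GPT-2.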

Comment by ofer on Implications of GPT-2 · 2019-02-18T21:22:24.853Z · score: 1 (1 votes) · LW · GW

In the case of GPT-2 the "current inference" is the current attempt to predict the next word given some text (it can be either during training or during evaluation).

In the malign-output scenario above the system indeed does not "care" about the future, it cares only about the current inference.

Indeed, the system "has no preference for being invoked". But if it has been invoked and is currently executing, it "wants" to be in a "good invocation" - one in which it ends up with a perfect loss function value.

Comment by ofer on Implications of GPT-2 · 2019-02-18T19:39:13.163Z · score: 1 (1 votes) · LW · GW
The training process optimizes only for immediate prediction accuracy.

Not exactly. The best way to minimize the L2 norm of the loss function over the training data is to simply copy the training data to the weights (if there are enough weights) and use some trivial look-up procedure during inference. To get models that are also useful for inputs that are not from the training data, you probably need to use some form of regularization (or use a model that implicitly carries it out), e.g. add to the objective function being minimized the L2 norm of the weights. Regularization is a way to implement Occam's razor in machine learning.
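
As a toy illustration of adding a weight penalty to the objective, here is a one-parameter sketch. It assumes a squared-error loss and a penalty of `lam * w**2` (the squared L2 norm, which is the common form); the data and the value of `lam` are made up, and the function names are hypothetical. It only shows the mechanics of the penalty term, not the memorization issue itself.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize sum((w*x - y)^2) + lam * w^2 over a single weight w (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative 2*sum(x*(w*x - y)) + 2*lam*w to zero gives this w.
    return sxy / (sxx + lam)


def objective(w, xs, ys, lam):
    """Training loss plus the L2 penalty on the weight."""
    data_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return data_loss + lam * w ** 2


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_plain = fit_ridge_1d(xs, ys, lam=0.0)  # ordinary least squares
w_reg = fit_ridge_1d(xs, ys, lam=5.0)    # penalized: pulled toward zero
```

The penalized weight fits the training data slightly worse but is smaller in magnitude, which is the sense in which the penalty expresses a preference for "simpler" models.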

Suppose that due to the regularization, the training results in a system with the goal system: "minimize the expected value of the loss function at the end of the current inference".
(where the concept of probability, which is required to define expectation, corresponds to how humans interpret the word "probability" in a decision-relevant context)
For such a goal system, the malign-output scenario above seems possible (for a sufficiently capable system).

Comment by ofer on Implications of GPT-2 · 2019-02-18T17:04:42.132Z · score: 5 (3 votes) · LW · GW
Have you looked at the NLP tasks they evaluated it on?

Yes. Nothing I've seen suggests GPT-2 would successfully solve simple formal problems like the one I mentioned in the grandparent (unless a very similar problem appears in the training data - e.g. the exact same problem but with different labels).

Comment by ofer on Implications of GPT-2 · 2019-02-18T15:51:15.408Z · score: 9 (6 votes) · LW · GW

I'm pretty sure that GPT-2 would fail to complete even the sentence: "if we sort the list [3,1,2,2] we get [1,". It's a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

There are no models of the world involved in the latter

The weights of the neural network might represent something that corresponds to an implicit model of the world.

no actions including manipulating a human or inventing exciting proteins.

Putting aside the risk of inner optimizers, suppose we get to superintelligence-level capabilities, and it turns out that the training process produced a goal system such that the neural network yields some malign output that causes many future invocations of the neural network (indistinguishable from the current invocation) in which a perfect loss function value is achieved.

Comment by ofer on Implications of GPT-2 · 2019-02-18T12:13:00.554Z · score: 15 (7 votes) · LW · GW

I don't see how GPT-2 is a step forward towards passing strong versions of the Turing test.

It's a source of superintelligence that doesn't automatically run into utility maximizers.

I'm not familiar with the details of GPT-2 and maybe I'm interpreting the definition of "utility maximizer" incorrectly, but isn't GPT-2 some neural network that is trained to minimize a loss function that corresponds to predicting the next word correctly?

Comment by ofer on Robin Hanson on Lumpiness of AI Services · 2019-02-18T08:36:20.559Z · score: 1 (1 votes) · LW · GW
Alas, as seen in the above criticisms [links in a different spot in the original post], it seems far too common in the AI risk world to presume that past patterns of software and business are largely irrelevant, as AI will be a glorious new shiny unified thing without much internal structure or relation to previous things. (As predicted by far views.)

The rise of deep learning in recent years seems to be evidence in favor of [AI will be a glorious new shiny thing without much relation to previous things] (assuming "previous things" here is limited to things that affected markets at the time).

The history of vastly overestimating the ease of making huge firms in capitalism, and the similar typical nubbie error of overestimating the ease of making large unstructured software systems, are seen as largely irrelevant.

While I see how conventional economic models are obviously useful here, I do not see how they can be useful in predicting the performance of "novel computations" (e.g. a computation that uses 1,000,000 GPU hours and a shiny new neural architecture) or predicting some critical technical properties of the development of transformative systems (e.g. "is there a secret sauce that a top AI lab will suddenly find?").

Comment by ofer on Would I think for ten thousand years? · 2019-02-12T07:45:15.931Z · score: 1 (1 votes) · LW · GW
In most cases my thought is "well, what's the alternative?"

Perhaps we humans should think ourselves for 10,000 years (passing the task from one generation to the next until aging is solved), instead of deferring to some "idealized" digital versions of ourselves.

This would require preventing existential catastrophes, during those 10,000 years, via "conventional means" (e.g. stabilizing the world to some extent).

Comment by ofer on How does Gradient Descent Interact with Goodhart? · 2019-02-04T04:27:30.554Z · score: 7 (4 votes) · LW · GW

"Breaking the vase" is a reference to an example that people sometimes give for an accident that happens in reinforcement learning due to the reward function not being fully aligned with what we want. The scenario is a robot that navigates in a room with a vase, and while we care about the vase, the reward function that we provided does not account for it, and so the robot just knocks it over because it is on the shortest path to somewhere.

Comment by ofer on How does Gradient Descent Interact with Goodhart? · 2019-02-02T06:32:09.814Z · score: 8 (4 votes) · LW · GW
One reason I care about this is that I am concerned about approaches to AI safety that involve modeling humans to try to learn human value.

I also have concerns about such approaches, and I agree with the reason you gave for being more concerned about procedure B ("it would be nice to be able to save human approval as a test set").

I did not understand how this relates specifically to gradient descent. The tendency of gradient descent (relative to other optimization algorithms) to find unsafe solutions, assuming no inner optimizers appear, seems to me to be a fuzzy property of the problem at hand.

One could design problems in which gradient descent is expected to find less-aligned solutions than non-local search algorithms (e.g. a problem in which most solutions are safe, but if you hill-climb from them you get to higher-utility-value-and-not-safe solutions). One could also design problems in which this is not the case (e.g. when everything that can go wrong is the agent breaking the vase, and breaking the vase allows higher utility solutions).

Do you have an intuition that real-world problems tend to be such that the first solution found with utility value of at least X would be better when using random sampling (assuming infinite computation power) than when using gradient descent?

Comment by ofer on Anthropics is pretty normal · 2019-01-17T21:22:48.426Z · score: 2 (2 votes) · LW · GW

Thank you, I understand this now (I found it useful to imagine code that is being invoked many times and is terminated after a random duration; and reflect on how the agent implemented by the code should update as time goes by).

I guess I should be overall more optimistic now :)
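
One way to write down the imagined code above is as a plain Bayesian update on "still running". This is only a sketch under my own simplifying assumptions: there are two hypotheses (safe and dangerous), each with a fixed per-step probability of not being terminated, and all the numbers are illustrative.

```python
def posterior_dangerous(prior_dangerous, survive_safe, survive_dangerous, steps):
    """P(dangerous | still running after `steps` steps), by Bayes' rule."""
    like_dangerous = survive_dangerous ** steps
    like_safe = survive_safe ** steps
    numerator = prior_dangerous * like_dangerous
    return numerator / (numerator + (1.0 - prior_dangerous) * like_safe)


if __name__ == "__main__":
    for t in (0, 10, 50):
        print(t, round(posterior_dangerous(0.5, 0.99, 0.90, t), 4))
```

Under these assumptions, the longer the agent finds itself still running, the more weight it should put on the safe hypothesis.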

Comment by ofer on Anthropics is pretty normal · 2019-01-17T17:31:06.705Z · score: 1 (1 votes) · LW · GW
Therefore A1 would force us to conclude that the safe and the dangerous worlds have exactly the same level of risk!
Similar problems arise if we try and use weaker versions of A1 - maybe our survival is some evidence, just not strong evidence. But Bayes will still hit us, and force us to change our values of terms like P( we survived | dangerous ).

I'm confused by this. The event "we survived" here is actually the event "at least one observer similar to us survived", right? (for some definition of "similar").
If the number of planets on which creatures similar-to-us evolve is sufficiently large, we get:
P( at least one observer similar to us survived ) ≈ P( at least one observer similar to us survived | dangerous )
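
A quick numerical sketch of why the number of planets matters here, assuming (my assumption) that each planet survives independently with the same probability. The survival probabilities and planet counts in the usage note are made up.

```python
def p_at_least_one_survivor(p_survive: float, n_planets: int) -> float:
    """P(at least one of n independent planets survives)."""
    return 1.0 - (1.0 - p_survive) ** n_planets
```

For example, comparing `p_at_least_one_survivor(0.9, n)` (a "safe" world) with `p_at_least_one_survivor(0.01, n)` (a "dangerous" world) shows a large gap for `n = 1` that all but vanishes once `n` is in the thousands, so the event "at least one observer similar to us survived" barely distinguishes the two hypotheses.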

Comment by ofer on What are the open problems in Human Rationality? · 2019-01-13T10:59:06.655Z · score: 2 (2 votes) · LW · GW

Maybe: "What are the most effective interventions for making better predictions/decisions?"

It seems worthwhile to create such a list, ranked according to a single metric as measured in randomized experiments.

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-11T23:38:41.833Z · score: 4 (2 votes) · LW · GW
but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?

I only now read that thread. I think it is extremely worthwhile to gather such estimates.

I think all three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on "no governance interventions"). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic.

Maybe we should gather researchers' credences for predictions like:
"If there will be no governance interventions, competitive aligned AIs will exist in 10 years from now".

I suspect that gathering such estimates from publicly available information might expose us to a selection bias, because very pessimistic estimates might be outside the Overton window (even for the EA/AIS crowd). For example, if Robert Wiblin had concluded that an AI existential catastrophe is 50% likely, I'm not sure that the 80,000 Hours website (which targets a large and motivationally diverse audience) would have published that estimate.

I agree with this motivation to do early work, but in a world where we do need drastic policy responses, I think it's pretty likely that the early work won't actually produce conclusive enough results to show that. For example, if a safety approach fails to make much progress, there's not really a good way to tell if it's because safe and competitive AI really is just too hard (and therefore we need a drastic policy response), or because the approach is wrong, or the people working on it aren't smart enough, or they're trying to do the work too early.

I strongly agree with all of this.

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-10T22:07:19.270Z · score: 4 (2 votes) · LW · GW
If the answer is no, there's also the question of how do we make policy makers take this problem seriously (i.e., that safe AI probably won't be as efficient as unsafe AI) given the existence of more optimistic AI safety researchers (so that they'd be willing to undertake costly preparations for governance solutions ahead of time).

I'm not aware of any AI safety researchers that are extremely optimistic about solving alignment competitively. I think most of them are just skeptical about the feasibility of governance solutions, or think governance related interventions might be necessary but shouldn't be carried out yet.

In this 80,000 Hours podcast episode, Paul said the following:

In terms of the actual value of working on AI safety, I think the biggest concern is this, “Is this an easy problem that will get solved anyway?” Maybe the second biggest concern is, “Is this a problem that’s so difficult that one shouldn’t bother working on it or one should be assuming that we need some other approach?” You could imagine, the technical problem is hard enough that almost all the bang is going to come from policy solutions rather than from technical solutions.
And you could imagine, those two concerns maybe sound contradictory, but aren’t necessarily contradictory, because you could say, “We have some uncertainty about this parameter of how hard this problem is.” Either it’s going to be easy enough that it’s solved anyway, or it’s going to be hard enough that working on it now isn’t going to help that much and so what mostly matters is getting our policy response in order. I think I don’t find that compelling, in part because one, I think the significant probability on the range … like the place in between those, and two, I just think working on this problem earlier will tell us what’s going on. If we’re in the world where you need a really drastic policy response to cope with this problem, then you want to know that as soon as possible.
It’s not a good move to be like, “We’re not going to work on this problem because if it’s serious, we’re going to have a dramatic policy response.” Because you want to work on it earlier, discover that it seems really hard and then have significantly more motivation for trying the kind of coordination you’d need to get around it.
Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-10T18:23:45.164Z · score: 3 (2 votes) · LW · GW
How uncompetitive do you think aligned IDA agents will be relative to unaligned agents

For the sake of this estimate I'm using a definition of IDA that is probably narrower than what Paul has in mind: in the definition I use here, the Distill steps are carried out by nothing other than supervised learning + what it takes to make that supervised learning safe (but the implementation of the Distill steps may be improved during the Amplify steps).

This narrow definition might not include the most promising future directions of IDA (e.g. maybe the Distill steps should be carried out by some other process that involves humans). Without this simplifying assumption, one might define IDA as broadly as: "iteratively create stronger and stronger safe AI systems by using all the resources and tools that you currently have". Carrying out that Broad IDA approach might include efforts like asking AI alignment researchers to get into a room with a whiteboard and come up with ideas for new approaches.

Therefore this estimate uses my narrow definition of IDA. If you like, I can also answer the general question: "How uncompetitive do you think aligned agents will be relative to unaligned agents?".

My estimate:

Suppose it is the case that if OpenAI decided to create an AGI agent as soon as they could, it would have taken them X years (assuming an annual budget of 10M and that the world around them stays the same, and OpenAI doesn't do neuroscience, and no unintentional disasters happen). Now suppose that OpenAI decided to create an aligned IDA agent with AGI capabilities as soon as they could (same conditions). How much time would it take them? My estimate follows; each entry is in the format: [years]: [my credence that it would take them at most that many years] (consider writing down your own credences before looking at mine) 1.0X: 0.1% 1.1X: 3% 1.2X: 3% 1.5X: 4% 2X: 5% 5X: 10% 10X: 30% 100X: 60% Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-10T03:50:12.259Z · score: 1 (1 votes) · LW · GW Generally, I don't see why we should expect that the most capable systems that can be created with supervised learning (e.g. by using RL to search over an arbitrary space of NN architectures) would perform similarly to the most capable systems that can be created, at around the same time, using some restricted supervised learning that humans must trust to be safe. My prior is that the former is very likely to outperform by a lot, and I'm not aware of strong evidence pointing one way or another. So for example, I expect that an aligned IDA agent will be outperformed by an agent that was created by that same IDA framework when replacing the most capable safe supervised learning in the Distill steps with the most capable unrestricted supervised learning available at around the same time. How uncompetitive do you think aligned IDA agents will be relative to unaligned agents I think they will probably be uncompetitive enough to make some complementary governance solutions necessary (this line replaced an attempt for a quantitative answer which turned out long; let me know if you want it). what kinds of governance solutions do you think that would call for? 
I'm very uncertain. It might be the case that our world must stop being a place in which anyone with $10M can purchase millions of GPU hours. I'm aware that most people in the AI safety community are extremely skeptical about governments carrying out "stabilization" efforts etc. I suspect this common view fails to account for likely pivotal events (e.g. some advances in narrow AI that might suddenly allow anyone with sufficient computation power to carry out large scale terror attacks). I think Allan Dafoe's research agenda for AI Governance is an extremely important and neglected landscape that we (the AI safety community) should be looking at to improve our predictions and strategies.

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-09T19:59:11.928Z · score: 4 (2 votes) · LW · GW

In this 2017 post about Amplification (linked from OP) Paul wrote: "I think there is a very good chance, perhaps as high as 50%, that this basic strategy can eventually be used to train benign state-of-the-art model-free RL agents."

The post you linked to is more recent, so either the quote in your comment reflects an update or Paul has other insights/estimates about safe Distill steps.

BTW, I think Amplification might currently be the most promising approach for creating aligned and powerful systems; what I argue is that in order to save the world it will probably need to be complemented with governance solutions.

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-09T17:34:46.911Z · score: 3 (2 votes) · LW · GW

As I understand it, MCTS is used to maximize a given computable utility function, and so it is not alignment-preserving in the general sense that sufficiently strong optimization of an imperfect utility function is not alignment-preserving.
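
The general point about strong optimization of an imperfect utility function can be illustrated with a toy sketch. This is not MCTS: it uses plain exhaustive search as a stand-in for "optimization strength", and the names `true_utility`, `proxy_utility`, and `optimize`, as well as the specific numbers, are made up for illustration. The proxy agrees with the true utility on small candidates only, so a larger search budget eventually finds candidates that score well on the proxy and badly on what we actually care about.

```python
from typing import Callable


def true_utility(x: int) -> int:
    """What we actually care about (hypothetical): rises up to x = 10, then falls."""
    return x if x <= 10 else 20 - x


def proxy_utility(x: int) -> int:
    """The computable utility given to the optimizer; matches true_utility only for x <= 10."""
    return x


def optimize(utility: Callable[[int], int], budget: int) -> int:
    """Exhaustively search the first `budget` integers and return the best one under `utility`."""
    return max(range(budget), key=utility)


# A larger budget stands in for a stronger optimizer.
for budget in (5, 11, 100):
    best = optimize(proxy_utility, budget)
    print(f"budget={budget} best={best} "
          f"proxy={proxy_utility(best)} true={true_utility(best)}")
```

With the weaker budgets the proxy and the true utility move together; at the largest budget the chosen candidate has the highest proxy value and a negative true utility.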

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-09T17:07:10.734Z · score: 4 (2 votes) · LW · GW

Thank you.

I see how the directions proposed there (adversarial training, verification, transparency) can be useful for creating aligned systems. But if we use a Distill step that can be trusted to be safe via one or more of those approaches, I find it implausible that Amplification would yield systems that are competitive relative to the most powerful ones created by other actors around the same time (i.e. actors that create AI systems without any safety-motivated restrictions on the model space and search algorithm).

Comment by ofer on AlphaGo Zero and capability amplification · 2019-01-09T09:19:26.814Z · score: 5 (4 votes) · LW · GW
Using MCTS to achieve a simple goal in the real world wouldn’t preserve alignment, so it doesn’t fit the bill.

Also, an arbitrary supervised learning step that updates the model is not safe. Generally, making that Distill step safe seems to me like the hardest challenge of the iterated capability amplification approach. Are there already research directions for tackling that challenge? (If I understand correctly, your recent paper did not focus on it.)

Comment by ofer on What are good ML/AI related prediction / calibration questions for 2019? · 2019-01-05T17:35:49.620Z · score: 5 (4 votes) · LW · GW

Prediction: yelp.com stops using CAPTCHA as part of the process of creating a new account, by end of 2019. CAPTCHA is defined here as any puzzle that the user is asked to solve to prove they are human.

My credence:

35%
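
For anyone who wants to score this prediction once it resolves, here is a minimal sketch using the Brier score (squared error between the stated probability and the 0/1 outcome; lower is better). The choice of scoring rule is an assumption on my part, since the thread does not specify one, and `brier_score` is a helper defined here rather than a library function.

```python
def brier_score(credence: float, outcome: bool) -> float:
    """Squared error between a stated probability and the 0/1 outcome (lower is better)."""
    return (credence - (1.0 if outcome else 0.0)) ** 2


credence = 0.35  # the credence stated above

# Score under each possible resolution of the question.
print(f"if it happens:        {brier_score(credence, True):.4f}")
print(f"if it does not happen: {brier_score(credence, False):.4f}")

# Reference point: a maximally uncertain 50% forecast scores 0.25 either way.
print(f"50% baseline:          {brier_score(0.5, True):.4f}")
```

A 35% credence beats the 50% baseline if the prediction turns out false and does worse than it if the prediction turns out true.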

Comment by ofer on Will humans build goal-directed agents? · 2019-01-05T17:23:55.337Z · score: 1 (1 votes) · LW · GW

I think I didn't articulate my argument clearly; I tried to clarify it in my reply to Jessica.

I think my argument might be especially relevant to the effort of persuading AI researchers not to build goal-directed systems.

If a result of this effort is convincing more AI researchers of the general premise that x-risk from AI is something worth worrying about, then that's a very strong argument in favor of carrying out the effort (and I agree this result should correlate with convincing AI researchers not to build goal-directed systems, if that's what you argued in your comment).

Comment by ofer on Will humans build goal-directed agents? · 2019-01-05T14:24:25.885Z · score: 2 (2 votes) · LW · GW

I'm not optimizing for raising awareness via an "obvious AI disaster", for multiple reasons, including the huge risk to the reputation of the AI safety community and the unilateralist's curse.

I do think that when considering whether to invest in an effort which might prevent recoverable near-term AI accidents, one should consider the possibility that the effort would prevent pivotal events (e.g. one that would have enabled useful governance solutions resulting in more time for alignment research).

Efforts that prevent recoverable near-term AI accidents might be astronomically net-positive if they help make AI alignment more mainstream in the general ML community.

(anyone who thinks I shouldn't discuss this publicly is welcome to let me know via a PM or anonymously here)

Comment by ofer on Will humans build goal-directed agents? · 2019-01-05T08:01:45.299Z · score: 3 (2 votes) · LW · GW
Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.

I'm not sure this strategy is net positive. If dangerous AI (at least as dangerous as Slaughterbots) is developed before alignment is solved, the world is probably better off if the first visibly dangerous AI is goal-directed rather than, say, an Oracle. The former would probably be a much weaker optimization process and probably won't result in an existential catastrophe; and perhaps will make some governance solutions more feasible.

Comment by ofer on A probabilistic off-switch that the agent is indifferent to · 2018-11-08T12:45:07.583Z · score: 1 (1 votes) · LW · GW

Thanks for the pointer! (I only now got to fully process it)

Both of the problems presented there seem to occur in this setup.

Regarding the problem of preventing resistance, you wrote:

Hypothetically, maybe you can tell the humans not to use bombs if the button fails to depress. But this solution seems both brittle and unrealistic.

I'd just like to share an idea for how this could perhaps be tackled, which isn't a great solution, but does not rely on humans adhering to precommitments.

Suppose our plan is to execute the agent program only for a short while (minutes/hours). We lock a (willing) human overseer in an isolated "control room" where they have access to the button, and only then is the agent invoked. The control room receives input from outside (e.g. it has monitors showing the lab with the robot that the agent controls) but is otherwise completely isolated. The person in the control room won't be able to communicate with the outside world or leave the control room (for the entire duration in which the agent program is supposed to run).

This idea requires that no humans outside of the control room would see what the agent is doing (so the agent won't have an incentive to conceal/manipulate).