The recent NeurIPS call for papers requires authors to include a statement about the potential broader impact of their work 2020-02-24T07:44:20.850Z · score: 12 (5 votes)
ofer's Shortform 2019-11-26T14:59:40.664Z · score: 4 (1 votes)
A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z · score: 11 (5 votes)
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z · score: 43 (14 votes)
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z · score: 12 (4 votes)


Comment by ofer on Learning the prior · 2020-07-07T18:55:17.679Z · score: 3 (2 votes) · LW · GW

I'm confused about this point. My understanding is that, if we sample iid examples from some dataset and then naively train a neural network with them, in the limit we may run into universal prior problems, even during training (e.g. an inference execution that leverages some software vulnerability in the computer that runs the training process).

Comment by ofer on [AN #105]: The economic trajectory of humanity, and what we might mean by optimization · 2020-06-28T06:08:29.299Z · score: 1 (1 votes) · LW · GW

Claims of the form “neural nets are fundamentally incapable of X” are almost always false: recurrent neural nets are Turing-complete, and so can encode arbitrary computation.

I think RNNs are not Turing-complete (assuming the activations and weights can be represented by a finite number of bits). Models with finite state space (reading from an infinite input stream) can't simulate a Turing machine.

(Though I share the background intuition.)
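A back-of-the-envelope sketch of why finite precision matters here (the numbers are purely illustrative): with n hidden units stored in b bits each, the hidden state can take at most 2^(b·n) distinct values, so the RNN reading an input stream is a (huge) finite automaton rather than a Turing machine.

```python
# Illustrative: a finite-precision RNN is a finite-state machine. With
# n hidden units of b bits each, the hidden state ranges over at most
# 2**(b*n) values, so on an unbounded input stream the RNN can only
# realize a finite automaton -- astronomically large, but still finite.

def max_hidden_states(n_units: int, bits_per_unit: int = 32) -> int:
    return 2 ** (bits_per_unit * n_units)

print(max_hidden_states(1))     # 4294967296 (2**32)
print(max_hidden_states(1024))  # enormous, but finite
```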

Comment by ofer on AI safety via market making · 2020-06-27T06:14:47.937Z · score: 4 (3 votes) · LW · GW

Interesting idea.

Suppose that in the first time step the model is able to output a string that manipulates the human into: (1) reporting a probability that is maximally different from the model's current prediction; and (2) not looking at the rest of the transcript (i.e. the human will never see any of the later strings).

Ignoring inner alignment problems, in the limit it seems plausible that the model will output such a string; the human's reported probability then stays fixed at the manipulated value, and the prediction loss is the smallest possible given that value.

[EDIT: actually, such problems are not specific to this idea and seem to generally apply to the 'AI safety via debate' approach.]

Comment by ofer on Likelihood of hyperexistential catastrophe from a bug? · 2020-06-21T05:47:39.751Z · score: 1 (1 votes) · LW · GW

The mugger scenario triggers strong game theoretical intuitions (eg "it's bad to be the sort of agent that other agents can benefit from making threats against") and the corresponding evolved decision-making processes. Therefore, when reasoning about scenarios that do not involve game theoretical dynamics (as is the case here), it may be better to use other analogies.

(For the same reason, "Pascal's mugging" is IMO a bad name for that concept, and "finite Pascal's wager" would have been better.)

Comment by ofer on ofer's Shortform · 2020-06-07T18:10:59.281Z · score: 1 (1 votes) · LW · GW

Paul Christiano's definition of slow takeoff may be too narrow, and sensitive to a choice of "basket of selected goods".

(I don't have a background in economics, so the following may be nonsense.)

Paul Christiano operationalized slow takeoff as follows:

There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles. (Similarly, we’ll see an 8 year doubling before a 2 year doubling, etc.)

My understanding is that "world output" is defined with respect to some "basket of selected goods" (which may hide in the definition of inflation). Let's say we use whatever basket the World Bank used here.

Suppose that X years from now progress in AI makes half of the basket extremely cheaper to produce, but makes the other half only slightly cheaper to produce. The increase in the "world output" does not depend much on whether the first half of the basket is now 10x cheaper or 10,000x cheaper. In both cases the price of the basket is dominated by its second half.
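A toy calculation with made-up numbers makes this concrete: once the basket's price is dominated by the half that barely got cheaper, a 10x versus 10,000x cheapening of the other half is nearly invisible in the measured output.

```python
# Toy basket: two halves, each initially costing 50 (made-up units).
# Half A gets dramatically cheaper; half B only slightly cheaper (10%).
def basket_price(cheapening_a: float, cheapening_b: float = 1.1) -> float:
    return 50.0 / cheapening_a + 50.0 / cheapening_b

p_10x = basket_price(10)         # ~50.45
p_10000x = basket_price(10_000)  # ~45.46
# Measured "world output" moves with the inverse of the basket price, so
# the two scenarios differ by only ~11%, despite a 1000x difference in
# how cheap half of the basket became.
print(p_10x, p_10000x, p_10x / p_10000x)
```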

If the thing we care about here is whether "incredibly powerful AI will emerge in a world where crazy stuff is already happening (and probably everyone is already freaking out)"—as Paul wrote—we shouldn't consider the above 10x and 10,000x cases to be similar.

Comment by ofer on OpenAI announces GPT-3 · 2020-05-31T09:04:39.181Z · score: 6 (4 votes) · LW · GW

As abergal wrote, not carrying the "1" can simply mean it does digit-wise addition (which seems trivial via memorization). But notice that just before that quote they also write:

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized.

That seems like evidence against memorization, but maybe their simple search failed to find most cases with some relevant training signal, eg: "In this diet you get 350 calories during breakfast: 200 calories from X and 150 calories from Y."
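A minimal sketch (using the two templates quoted above) of how such an exact-string search can miss training text that still carries the relevant arithmetic signal:

```python
import re

# Sketch of the exact-match spot check quoted above: search the training
# text only for the two literal templates the paper describes.
def exact_matches(a: int, b: int, corpus: str) -> bool:
    patterns = [rf"{a} \+ {b} =", rf"{a} plus {b}"]
    return any(re.search(p, corpus) for p in patterns)

corpus = ("In this diet you get 350 calories during breakfast: "
          "200 calories from X and 150 calories from Y.")

# The text implicitly "teaches" 200 + 150 = 350, yet neither template fires:
print(exact_matches(200, 150, corpus))             # False
print(exact_matches(200, 150, "200 + 150 = 350"))  # True
```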

Comment by ofer on Databases of human behaviour and preferences? · 2020-04-22T08:53:20.720Z · score: 4 (3 votes) · LW · GW

Maybe Minecraft-related datasets can be helpful. I'm not familiar with them myself, but I found these two:

CraftAssist: A Framework for Dialogue-enabled Interactive Agents

MineRL: A Large-Scale Dataset of Minecraft Demonstrations

Comment by ofer on Three Kinds of Competitiveness · 2020-04-01T07:33:03.860Z · score: 3 (2 votes) · LW · GW

Good point about inner alignment problems being a blocker to date-competitiveness for IDA... but aren't they also a blocker to date-competitiveness for every other alignment scheme too pretty much?

I think every alignment approach (other than interpretability-as-a-standalone-approach) that involves contemporary ML (i.e. training large neural networks) may have its date-competitiveness affected by inner alignment.

What alignment schemes don't suffer from this problem?

Most alignment approaches may have their date-competitiveness affected by inner alignment. (It seems theoretically possible to use whole brain emulation without inner alignment related risks, but as you mentioned elsewhere someone may build a neuromorphic AGI before we get there.)

I'm thinking "Do anything useful that a human with a lot of time can do" is going to be substantially less capable than full-blown superintelligent AGI.

I agree. Even a "narrow AI" system that is just very good at predicting stock prices may outperform "a human with a lot of time" (by leveraging very-hard-to-find causal relations).

Instead of saying we should expect IDA to be performance-competitive, I should have said something like the following: If at some point in the future we get to a situation where trillions of safe AGI systems are deployed—and each system can "only" do anything that a human-with-a-lot-of-time can do—and we manage to not catastrophically screw up until that point, I think humanity will probably be out of the woods. (All of humanity's regular problems will probably get resolved very quickly, including the lack of coordination.)

Comment by ofer on Three Kinds of Competitiveness · 2020-03-31T15:41:15.346Z · score: 3 (2 votes) · LW · GW

Very interesting definitions! I like the way they're used here to compare different scenarios.

Proposal: Iterated Distillation and Amplification: [...] I currently think of this scheme as decently date-competitive but not as cost-competitive or performance-competitive.

I think IDA's date-competitiveness will depend on the progress we'll have in inner alignment (or our willingness to bet against inner alignment problems occurring, and whether we'll be correct about it). Also, I don't see why we should expect IDA to not be very performance-competitive (if I understand correctly the hope is to get a system that can do anything useful that a human with a lot of time can do).

Generally, when using these definitions for comparing alignment approaches (rather than scenarios) I suspect we'll end up talking a lot about "the combination of date- and performance-competitiveness", because I expect the performance-competitiveness of most approaches will depend on how much research effort is invested in them.

Comment by ofer on Largest open collection quotes about AI · 2020-03-31T13:27:44.913Z · score: 6 (4 votes) · LW · GW

This spreadsheet is super impressive and has been very useful to me (it allowed me to find some very interesting stuff, like this discussion with Bill Gates and Elon Musk), thank you for creating it!

Comment by ofer on ofer's Shortform · 2020-03-26T20:43:56.292Z · score: 2 (3 votes) · LW · GW

Uneducated hypothesis: All hominidae species tend to thrive in huge forests, unless they've discovered fire. From the moment a species discovers fire, any individual can unilaterally burn the entire forest (due to negligence/anger/curiosity/whatever), and thus a huge forest is unlikely to serve as a long-term habitat for many individuals of that species.

Comment by ofer on Where can we donate time and money to avert coronavirus deaths? · 2020-03-18T07:27:11.665Z · score: 2 (3 votes) · LW · GW

For donating money:

It may be worthwhile to look into the COVID-19 Solidarity Response Fund (co-created by WHO). From WHO's website:

The Covid-19 Solidarity Response Fund is a secure way for individuals, philanthropies and businesses to contribute to the WHO-led effort to respond to the pandemic.

The United Nations Foundation and the Swiss Philanthropy Foundation have created the solidarity fund to support WHO and partners in a massive effort to help countries prevent, detect, and manage the novel coronavirus – particularly those where the needs are the greatest.

The fund will enable us to:

  • Send essential supplies such as personal protective equipment to frontline health workers
  • Enable all countries to track and detect the disease by boosting laboratory capacity through training and equipment.
  • Ensure health workers and communities everywhere have access to the latest science-based information to protect themselves, prevent infection and care for those in need.
  • Accelerate efforts to fast-track the discovery and development of lifesaving vaccines, diagnostics and treatments
Comment by ofer on How's the case for wearing googles for COVID-19 protection when in public transportation? · 2020-03-15T15:07:50.326Z · score: 1 (1 votes) · LW · GW

After seeing this preprint I'm less confident in my above update.

Comment by ofer on How long does SARS-CoV-2 survive on copper surfaces · 2020-03-14T14:40:05.385Z · score: 1 (1 votes) · LW · GW

Disclaimer: I'm not an expert.

It seems to me that this preprint suggests that in certain conditions the half-life of HCoV-19 (SARS-CoV-2) is ~0.4 hours on copper, ~3.5 hours on cardboard, ~5.5 hours on steel, and ~7 hours on plastic.
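For intuition (again, not expert advice), a half-life translates into a remaining fraction via simple exponential decay:

```python
# Fraction of viable virus remaining after t hours, given a half-life.
def remaining_fraction(t_hours: float, half_life_hours: float) -> float:
    return 0.5 ** (t_hours / half_life_hours)

# Using the approximate half-lives quoted above (all from the preprint):
for surface, hl in [("copper", 0.4), ("cardboard", 3.5),
                    ("steel", 5.5), ("plastic", 7.0)]:
    print(f"{surface:9s} after 24h: {remaining_fraction(24, hl):.1e} remaining")
```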

Comment by ofer on How's the case for wearing googles for COVID-19 protection when in public transportation? · 2020-03-11T13:24:40.832Z · score: 1 (1 votes) · LW · GW

[EDIT: You probably shouldn't read this comment, and instead read this post by Scott Alexander.]

FYI, regular surgical masks are insufficient for protection against COVID-19. A respirator graded n95 or higher is required.

Disclaimer: I'm not an expert.

[EDIT (2020-05-30): you really shouldn't use the following for updating your beliefs.]

After a quick look at some of the papers mentioned in Elizabeth's answers here, I updated away from the belief that surgical masks are substantially less effective than N95 masks at preventing the wearer from getting infected with the novel coronavirus (it now seems to me plausible that surgical masks are not substantially less effective). But I can easily be wrong about that, and the evidence I've seen seems weak (the papers I've seen did not involve the novel coronavirus).

Comment by ofer on March Coronavirus Open Thread · 2020-03-10T15:11:51.615Z · score: 1 (3 votes) · LW · GW

Maybe citing the CDC:

It’s likely that at some point, widespread transmission of COVID-19 in the United States will occur. Widespread transmission of COVID-19 would translate into large numbers of people needing medical care at the same time. Schools, childcare centers, and workplaces, may experience more absenteeism. Mass gatherings may be sparsely attended or postponed. Public health and healthcare systems may become overloaded, with elevated rates of hospitalizations and deaths. Other critical infrastructure, such as law enforcement, emergency medical services, and sectors of the transportation industry may also be affected. Healthcare providers and hospitals may be overwhelmed. At this time, there is no vaccine to protect against COVID-19 and no medications approved to treat it.

Comment by ofer on What "Saving throws" does the world have against coronavirus? (And how plausible are they?) · 2020-03-04T21:02:04.751Z · score: 4 (3 votes) · LW · GW

Are there more?

Speaking as a layperson, it seems to me plausible that we'll see a "successful saving throw" in the form of a new coronavirus testing method (perhaps powered by machine learning) that will be cheap, quick, and accurate. It will then be used on a massive scale all over the world and will allow governments to quarantine people much more effectively.

Comment by ofer on Coronavirus: Justified Practical Advice Thread · 2020-03-01T07:32:06.629Z · score: 8 (5 votes) · LW · GW

It is recommended to avoid touching your eyes, nose, and mouth[1]. People tend to inadvertently touch their eyes, nose, and mouth many times per hour[2]. If you think you can substantially reduce the number of times you touch your face by training yourself, in some low-effort way, to avoid doing it, go for it. If it takes time to become good at not touching one's face, it may be worthwhile to start training at it now, even if where you live is currently coronavirus-free.


[1]: The CDC (Centers for Disease Control and Prevention) writes:

The best way to prevent illness is to avoid being exposed to this virus. However, as a reminder, CDC always recommends everyday preventive actions to help prevent the spread of respiratory diseases, including:


  • Avoid touching your eyes, nose, and mouth.

[2]: The video by the CDC that Davidmanheim linked to claimed: "Studies have shown that people touch their eyes, nose, and mouth about 25 times every hour without even realizing it!"

Comment by ofer on ofer's Shortform · 2020-02-29T22:27:58.040Z · score: 1 (1 votes) · LW · GW

[Coronavirus related]

If some organization had perfect knowledge about the location of each person on earth (at any moment), and got an immediate update on any person that is diagnosed with the coronavirus, how much difference could that make in preventing the spread of the coronavirus?

What if the only type of action that the organization could take is sending people messages? For example, if Alice was just diagnosed with the coronavirus and 10 days ago she was on a bus with Bob, now Bob gets a message: "FYI the probability you have the coronavirus just increased from 0.01% to 0.5% due to someone that was near you 10 days ago. Please self-quarantine for 4 days." (These numbers are made up, obviously.)
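(The made-up numbers in the example message are the kind a standard Bayesian update would produce; the sketch below uses entirely hypothetical figures.)

```python
# Hypothetical Bayes update behind a message like the one above.
def posterior(prior: float, likelihood_ratio: float) -> float:
    """likelihood_ratio: how much more likely the contact event is under
    'Bob is infected' than under 'Bob is healthy'."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

# A baseline risk of 0.01% and a contact event ~50x more likely if it led
# to infection yields roughly the 0.5% in the example message:
print(f"{posterior(0.0001, 50):.2%}")
```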

Comment by ofer on Does iterated amplification tackle the inner alignment problem? · 2020-02-16T07:32:38.473Z · score: 3 (3 votes) · LW · GW

My understanding is that amplification-based approaches are meant to tackle inner alignment by using the amplified systems that are already trusted (e.g. humans + many invocations of a trusted model) to mitigate inner alignment problems in the next (slightly more powerful) models that are being trained. A few approaches for this have already been suggested (I'm not aware of published empirical results), see Evan's comment for some pointers.

I hope a lot more research will be done on this topic. It's not clear to me whether we should expect to have amplified systems that allow us to mitigate inner alignment risks to a satisfactory extent before we have x-risk-posing systems; how we can make that more likely; and, if it's not feasible, how we can realize that as soon as possible.

Comment by ofer on Bayesian Evolving-to-Extinction · 2020-02-15T10:57:15.063Z · score: 3 (2 votes) · LW · GW

It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly.

I'm not sure about the latter. Suppose there is a "simple" ticket that randomly writes stuff to the logs in a way that makes future training examples harder to predict. I don't see what would cause that ticket to be selected for.

Comment by ofer on ofer's Shortform · 2020-02-14T21:17:32.415Z · score: 1 (1 votes) · LW · GW

This doesn't require AI, it happens anywhere that competing prices are easily available and fairly mutable.

It happens without AI to some extent, but if a lot of businesses will be setting prices via RL-based systems (which seems to me likely), then I think it may happen to a much greater extent. Consider that in the example above, it may be very hard for the five barbers to coordinate a $3 price increase without any communication (and without AI) if, by assumption, the only Nash equilibrium is the state where all five barbers charge market prices.

AI will be no more nor less liable than humans making the same decisions would be.

People sometimes go to jail for illegally coordinating prices with competitors; I don't see how an antitrust enforcement agency will hold anyone liable in the above example.

Comment by ofer on ofer's Shortform · 2020-02-14T21:16:13.155Z · score: 1 (1 votes) · LW · GW

Suppose the code of the deep RL algorithm that was used to train the huge policy network is publicly available on GitHub, as well as everything else that was used to train the policy network, plus the final policy network itself. How can an antitrust enforcement agency use all this to determine whether an antitrust violation has occurred? (in the above example)

Comment by ofer on ofer's Shortform · 2020-02-14T12:33:04.426Z · score: 3 (2 votes) · LW · GW

I'm curious how antitrust enforcement will be able to deal with progress in AI. (I know very little about antitrust laws.)

Imagine a small town with five barbershops. Suppose an antitrust law makes it illegal for the five barbershop owners to have a meeting in which they all commit to increase prices by $3.

Suppose that each of the five barbershops will decide to start using some off-the-shelf deep RL based solution to set their prices. Suppose that after some time in which they're all using such systems, lo and behold, they all gradually increase prices by $3. If the relevant government agency notices this, who can they potentially accuse of committing a crime? Each barbershop owner is just setting their prices to whatever their off-the-shelf system recommends (and that system is a huge neural network that no one understands at a relevant level of abstraction).
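A toy one-shot version of the barbershop game (all numbers assumed) illustrates the point: everyone charging +$3 is better for all five shops, yet the only Nash equilibrium is everyone charging the market price, so no shop can get there unilaterally.

```python
from itertools import product

PRICES = (20, 23)      # market price vs a $3 increase (assumed numbers)
N, CUSTOMERS = 5, 100  # five barbershops, fixed total demand

def profits(prices):
    # Customers all go to the cheapest shop(s), splitting equally.
    low = min(prices)
    winners = [i for i, p in enumerate(prices) if p == low]
    return [p * CUSTOMERS / len(winners) if i in winners else 0.0
            for i, p in enumerate(prices)]

def is_nash(prices):
    # No single shop can profit by unilaterally changing its price.
    base = profits(prices)
    return not any(
        profits(prices[:i] + (alt,) + prices[i + 1:])[i] > base[i]
        for i in range(N) for alt in PRICES if alt != prices[i])

equilibria = [p for p in product(PRICES, repeat=N) if is_nash(p)]
print(equilibria)  # only everyone-at-market-price survives
print(profits((23,) * N)[0], profits((20,) * N)[0])  # 460.0 vs 400.0
```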

Comment by ofer on Simulation of technological progress (work in progress) · 2020-02-11T21:16:56.252Z · score: 2 (2 votes) · LW · GW

Very interesting :)

I suspect the model is making a hidden assumption about the lack of "special projects"; e.g. it seems the model assumes there can't be a single project that yields a bonus making all the other projects' tasks instantly solvable.

Also, I'm not sure that the model allows us to distinguish between scenarios in which a major part of overall progress is very local (e.g. happens within a single company) and more Hansonian scenarios in which the contribution to progress is well distributed among many actors.

Comment by ofer on Preface to CLR's Research Agenda on Cooperation, Conflict, and TAI · 2020-02-11T10:38:26.147Z · score: 5 (4 votes) · LW · GW

the failure mode of an amoral AI system that doesn't care about you seems both more likely and more amenable to technical safety approaches (to me at least).

It seems to me that at least some parts of this research agenda are relevant for some special cases of "the failure mode of an amoral AI system that doesn't care about you". A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling) and for such setups it seems relevant to consider questions like "under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?". Such questions seem relevant under a very wide range of moral systems (including ones that don't place much weight on s-risks).

Comment by ofer on Did AI pioneers not worry much about AI risks? · 2020-02-10T12:57:59.281Z · score: 20 (9 votes) · LW · GW

The following quoted texts are from this post by Scott Alexander:

Alan Turing:

Let us now assume, for the sake of argument, that these machines are a genuine possibility, and look at the consequences of constructing them. To do so would of course meet with great opposition, unless we have advanced greatly in religious tolerance since the days of Galileo. There would be great opposition from the intellectuals who were afraid of being put out of a job. It is probable though that the intellectuals would be mistaken about this. There would be plenty to do in trying to keep one’s intelligence up to the standards set by the machines, for it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers…At some stage therefore we should have to expect the machines to take control.

[EDIT: a similar text, attributed to Alan Turing, appears here (from the last paragraph) - continued here.]

I. J. Good:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make

[EDIT: I didn't manage to verify it yet, but it seems that that last quote is from a 58 page paper by I. J. Good, titled Speculations Concerning the First Ultraintelligent Machine; here is an archived version of the broken link in Scott's post.]

Comment by ofer on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T12:41:08.569Z · score: 1 (1 votes) · LW · GW

I want to flag that—in the case of evolutionary algorithms—we should not assume here that the fitness function is defined with respect to just the current batch of images, but rather with respect to, say, all past images so far (since the beginning of the entire training process); otherwise the selection pressure is "myopic" (i.e. models that outperform others on the current batch of images have higher fitness).

(I might be over-pedantic about this topic due to previously being very confused about it.)
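The distinction can be made concrete with two hypothetical fitness definitions for an evolutionary run over a stream of labeled batches (all the stand-ins below are made up):

```python
# Two hypothetical fitness definitions for evolved classifier candidates.
def myopic_fitness(model, current_batch):
    # Selection pressure comes only from the latest batch of examples.
    return sum(model(x) == y for x, y in current_batch)

def cumulative_fitness(model, batches_so_far):
    # Selection pressure comes from all examples since training began.
    return sum(model(x) == y for b in batches_so_far for x, y in b)

always_0 = lambda x: 0            # toy stand-ins for evolved models
always_1 = lambda x: 1
batches = [[(0, 0), (1, 0)],      # earlier examples labeled 0
           [(2, 1), (3, 1)]]      # latest batch labeled 1

# Myopic selection crowns whichever model fits the latest batch, while
# cumulative fitness sees the two constant models as tied.
print(myopic_fitness(always_1, batches[-1]),  # 2
      myopic_fitness(always_0, batches[-1]))  # 0
print(cumulative_fitness(always_1, batches),  # 2
      cumulative_fitness(always_0, batches))  # 2
```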

Comment by ofer on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-06T19:01:36.103Z · score: 1 (1 votes) · LW · GW

instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.

I don't follow this point. If we're talking about using SGD to update (hyper)parameters, using a batch of images from the currently used datasets, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.

Comment by ofer on Synthesizing amplification and debate · 2020-02-06T16:32:01.068Z · score: 1 (1 votes) · LW · GW

Let H:Q→A be a human.


Let Amp(H,M)(Q)=H(“What answer would you give to Q given access to M?”).

Nitpick: is H meant to be defined here as a human with access to M?
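A toy sketch of the quoted definition (all stand-ins hypothetical) makes the nitpick concrete: the right-hand side only makes sense if the human can actually query M.

```python
from typing import Callable

Q = A = str  # questions and answers as strings, for the sketch

def amp(h: Callable[[Q, Callable[[Q], A]], A],
        m: Callable[[Q], A]) -> Callable[[Q], A]:
    """Amp(H, M)(Q) = H("What answer would you give to Q?", with access to M)."""
    return lambda q: h(f"What answer would you give to {q}?", m)

# Hypothetical stand-ins: the "human" consults the model while answering.
def human_with_access(question: Q, model: Callable[[Q], A]) -> A:
    return f"my answer, after consulting the model: {model(question)}"

toy_model = lambda q: f"model guess for {q!r}"
print(amp(human_with_access, toy_model)("2+2"))
```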

Comment by ofer on Instrumental Occam? · 2020-02-01T22:15:37.765Z · score: 3 (2 votes) · LW · GW

So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?

Maybe problems that don't have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we're trying to solve, the assumption that it has simple solutions is probably either useful (if it's true) or costless (if it isn't). In other words: "look for your missing key under the lamppost, not because it's probably there, but because you'll only ever find it if it's there".

Comment by ofer on Gradient hacking · 2020-02-01T07:09:10.392Z · score: 5 (3 votes) · LW · GW

I wasn't claiming that there'll be an explicit OR gate, just something functionally equivalent to it.

Sure, we're on the same page here. I think by "There's still a gradient signal to change the OR gate" you mean exactly what I meant when I said "that would just be passing the buck to the output of that OR".

I'm not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range (if that's important/useful for the mechanism to work).

It's non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?

Humans can't control their brains at the level of abstraction of neurons (by thinking alone), but at a higher level of abstraction they do have some control that can be useful. For example, consider a human in Newcomb's problem who decides to 1-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to 1-box in Newcomb's problem).

(Independence is very tricky because they're part of the same plan, and so a change in your underlying motivation to pursue that plan affects both).

Perhaps I shouldn't have used the word "independent"; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).

Comment by ofer on Gradient hacking · 2020-01-30T08:27:59.588Z · score: 3 (2 votes) · LW · GW

Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there's a gradient signal to change the failing-hard mechanism.

I didn't mean feeding the outputs of the two subnetworks to an OR dedicated to that purpose (that would just be passing the buck to the output of that OR). Suppose instead that the task is classifying cat/dog images and that each subnetwork can independently cause the network to classify a dog image as a cat by messing with a completely different piece of logic (e.g. one subnetwork is doing the equivalent of causing a false detection of cat whiskers, and the other is doing the equivalent of causing a false detection of cat eyes) such that the loss of the model is similar if any of the two subnetworks or both "decide to make the model fail".

I want to emphasize that I don't argue that we should be concerned about such sophisticated mechanisms randomly appearing during training. I argue that, if a huge neural network implements a sufficiently powerful optimization process with a goal system that involves our world, then it seems possible that that optimization process would construct such sophisticated mechanisms within the neural network. (And so the above is merely an argument that such theoretical mechanisms exist, not that they are easy to construct.)

Comment by ofer on The Epistemology of AI risk · 2020-01-29T20:18:20.497Z · score: 1 (1 votes) · LW · GW

Inner alignment says, well, it's not exactly like that. There's going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.

As I understand the language, the "loss function used to train our AIs" matches "our objective function" from the classical outer alignment problem. The inner alignment problem seems to me a separate problem rather than a "refinement of the traditional argument" (we can fail due to just an inner alignment problem, and we can fail due to just an outer alignment problem).

My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes

I'm not sure what you mean by saying "the rest of the book talking about unipolar outcomes". In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart's law assume or depend on a unipolar outcome?

This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won't solve it during the development of those systems.

Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?

Comment by ofer on Gradient hacking · 2020-01-29T19:12:58.423Z · score: 3 (2 votes) · LW · GW

the gradients will point in the direction of removing the penalty by reducing the agent's determination to fail upon detecting goal shift.

But it need not be the case, and indeed the "failing-hard mechanism" would be optimized for that to not be the case (in a gradient hacking scenario).

To quickly see that it need not be the case, suppose that the "failing-hard mechanism" is implemented as two subnetworks within the model such that each one of them can output a value that causes the model to fail hard, and they are designed to either both output such a value or both not output such a value. Changing any single weight within the two subnetworks would not break the "failing-hard mechanism", and so we can expect all the partial derivatives with respect to weights within the two subnetworks to be close to zero (i.e. updating the weights in the direction of the gradient would not destroy the "failing-hard mechanism").
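A numerical toy (entirely illustrative) of this claim: when both redundant subnetwork outputs sit deep past a hard threshold, a finite-difference estimate of the partial derivative with respect to either output is exactly zero.

```python
# Toy: two redundant subnetwork outputs a, b feed a hard OR; the model
# "fails hard" if either exceeds the threshold.
def fail_signal(a: float, b: float, threshold: float = 0.5) -> float:
    return 1.0 if (a > threshold or b > threshold) else 0.0

def finite_diff(f, args, i, eps=1e-3):
    bumped = list(args); bumped[i] += eps
    return (f(*bumped) - f(*args)) / eps

# With both outputs deep in the "fail" region, nudging either one leaves
# the signal unchanged, so the local gradient w.r.t. each is zero:
print(finite_diff(fail_signal, [1.0, 1.0], 0))  # 0.0
print(finite_diff(fail_signal, [1.0, 1.0], 1))  # 0.0
```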

Comment by ofer on The Epistemology of AI risk · 2020-01-29T11:47:43.186Z · score: 5 (2 votes) · LW · GW

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I'd assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.

I'm not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn't seem to me like meaningful evidence for the proposition "the arguments in Superintelligence are not sound".

I can't speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I've talked to.

It's been a while since I listened to the audiobook version of Superintelligence, but I don't recall the book arguing that the "second-place AI lab" will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI safety is something humankind should be thinking about?

In general, the notion that there will be discontinuities in development is looked with suspicion by a number of people (though, notably some researchers still think that fast takeoff is likely).

I don't recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.

Comment by ofer on The Epistemology of AI risk · 2020-01-28T15:14:10.823Z · score: 9 (2 votes) · LW · GW

and there's been a shift in arguments.

The set of arguments that are being actively discussed by AI safety researchers obviously changed since 2014 (which is true for any active field?). I assume that by "there's been a shift in arguments" you mean something more than that, but I'm not sure what.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers? Has the progress in deep learning since 2014 made the core arguments in the book less compelling? (Do the arguments about instrumental convergence and Goodhart's law fail to apply to deep RL?)

Comment by ofer on Oracles: reject all deals - break superrationality, with superrationality · 2020-01-27T17:50:36.844Z · score: 1 (1 votes) · LW · GW

I just want to flag that this approach seems to assume that—before we build the Oracle—we design the Oracle (or the procedure that produces it) such that it will assign a prior of zero to the second type of worlds.

If we use some arbitrary scaled-up supervised learning training process to train a model that does well on general question answering, we can't just safely sidestep the malign prior problem by providing information/instructions about the prior as part of the question. The simulations of the model that distant superintelligences run may involve such inputs as well. (In those simulations the loss may end up being minimal for whatever output the superintelligence wants the model to yield; regardless of the prescriptive information about the prior in the input.)

Comment by ofer on Moral public goods · 2020-01-26T21:07:54.994Z · score: 1 (1 votes) · LW · GW

(if they were a total utilitarian then I think they'd already be committed to option 2)

I should have written "aggregative consequentialism" instead of "total utilitarianism". (The problem being that a noble who is an aggregative consequentialist would care about themselves <1% as much as n peasants put together, for sufficiently large n.)

But I think you do have similar problems with any attempt to model them as consequentialists.

This makes sense to me if we restrict the discussion to causal reasoning (otherwise, a noble who suspects that they are correlated with many other nobles may donate money to some peasants, even if they care about themselves >10 million times as much as any single peasant.)

Comment by ofer on Moral public goods · 2020-01-26T07:31:41.171Z · score: 5 (3 votes) · LW · GW

Great post!

Consequentialism is a really bad model for most people’s altruistic behavior, and especially their compromises between altruistic and selfish ends. To model someone as a thoroughgoing consequentialist, you have two bad options:

  1. They care about themselves >10 million times as much as other people. [...]
  2. They care about themselves <1% as much as everyone else in the whole world put together. [...]

It seems to me that "consequentialism" here refers to total utilitarianism rather than consequentialism in general.

Comment by ofer on Oracles: reject all deals - break superrationality, with superrationality · 2020-01-24T16:06:05.073Z · score: 1 (1 votes) · LW · GW

So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".

I'm confused about this point. How do we "specify the position" of the Oracle? Suppose the Oracle is implemented as a supervised learning model. Its current input (and all the training data) could have been generated by an arbitrary distant superintelligence that is simulating the environment of the Oracle. What is special about "this box" (the box that we have in mind)? What privileged status does this particular box have relative to other boxes in similar environments that are simulated by distant superintelligences?

Comment by ofer on ofer's Shortform · 2020-01-13T07:06:17.851Z · score: 1 (1 votes) · LW · GW

I crossed out the 'caring about privacy' bit after reasoning that the marginal impact of caring more about one's privacy might depend on potential implications of things like "quantum immortality" (that I currently feel pretty clueless about).

Comment by ofer on Outer alignment and imitative amplification · 2020-01-10T11:54:28.251Z · score: 4 (2 votes) · LW · GW

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according that loss function are aligned with our goals—that is, they are at least trying to do what we want.

I would argue that according to this definition, there are no loss functions that are outer aligned at optimum (other than ones according to which no model performs optimally). [EDIT: this may be false if a loss function may depend on anything other than the model's output (e.g. if it may contain a regularization term).]

For any model M that performs optimally according to a loss function L there is a model M' that is identical to M except that at the beginning of the execution it hacks the operating system or carries out mind crimes. But for any input, M and M' formally map that input to the same output, and thus M' also performs optimally according to L, and therefore L is not outer aligned at optimum.
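The argument can be made concrete with a toy sketch (the models and data are hypothetical; the loss, like standard supervised losses, depends only on the input-output mapping):

```python
def empirical_loss(model, data):
    # the loss sees only the model's input-output behavior
    return sum((model(x) - y) ** 2 for x, y in data)

def m(x):
    return 2 * x

def m_prime(x):
    # imagine arbitrary side effects here (e.g. "hack the OS") executed
    # before returning exactly the same output as m
    return 2 * x

data = [(0, 0), (1, 2), (3, 6)]
print(empirical_loss(m, data), empirical_loss(m_prime, data))  # 0 0
```

If one of the two is optimal, so is the other. A loss term that inspects more than the model's output (e.g. a weight regularizer) would break this equivalence, which is the caveat noted in the bracketed edit above.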

Comment by ofer on Relaxed adversarial training for inner alignment · 2020-01-10T11:16:10.992Z · score: 4 (2 votes) · LW · GW

we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

How might a "purely predictive model with only a world model but no optimization procedure" look like, when considering complicated domains and arbitrarily high predictive accuracy?

It seems plausible that a sufficiently accurate predictive model would use powerful optimization processes. For example, consider a predictive model that predicts the change in Apple's stock price at some moment t (based on data until t). A sufficiently powerful model might, for example, search for solutions to some technical problem related to the development of the next iPhone (that is being revealed that day) in order to calculate the probability that Apple's engineers overcame it.
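As a toy illustration of a "predictor" whose forward pass is itself an optimization process (all names and values here are invented):

```python
# toy predictive model whose prediction routine internally runs a search:
# to estimate the probability that the engineers overcame a technical
# problem, it searches candidate solutions itself
def predict_overcome_probability(candidates, cost, feasible_cost):
    best = min(cost(c) for c in candidates)   # the internal optimization
    return 0.9 if best <= feasible_cost else 0.1

cost = lambda c: (c - 63) ** 2 + 5  # hypothetical difficulty landscape
print(predict_overcome_probability(range(100), cost, feasible_cost=10))  # 0.9
```

The prediction is accurate only because the model actually searched the solution space, which is the sense in which "purely predictive" and "no optimization procedure" may pull apart.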

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T13:28:30.923Z · score: 1 (1 votes) · LW · GW

In this scenario, my argument is that the size ratio for "almost-AGI architectures" is better (e.g. ), and so you're more likely to find one of those first.

For a "local search NAS" (rather than "random search NAS") it seems that we should be considering here the set of ["almost-AGI architectures" from which the local search would not find an "AGI architecture"].

The "$1B NAS discontinuity scenario" allows for the $1B NAS to find "almost-AGI architectures" before finding an "AGI architecture".

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T00:12:39.797Z · score: 1 (1 votes) · LW · GW

If you model the NAS as picking architectures randomly

I don't. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as "trial and error", by trial and error I did not mean random search.)

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn't make much of a difference,

Earlier in this discussion you defined fragility as the property "if you make even a slight change to the thing, then it breaks and doesn't work". While finding fragile solutions is hard, finding a non-fragile solution is not necessarily easy, so I don't follow the logic of that paragraph.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them "AGI architectures"). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be tiny (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, as long as the number of runs is much smaller than the reciprocal of that ratio).
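The scaling claim in that parenthetical is just the standard independent-trials approximation: with per-run success probability p, the chance of at least one success in k runs is 1 - (1 - p)^k, which is roughly k*p while k*p << 1. A quick numeric check (p is an arbitrary illustrative value):

```python
# 10x as many runs gives ~10x the success probability while k * p << 1
p = 1e-10                      # illustrative per-run probability only
for k in (1_000, 10_000):
    exact = 1 - (1 - p) ** k   # P(at least one success in k runs)
    approx = k * p             # linear approximation
    assert abs(exact - approx) / approx < 1e-3
```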

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T00:09:58.377Z · score: 1 (1 votes) · LW · GW

Creating some sort of commitment device that would bind us to follow UDT—before we evaluate some set of hypotheses—is an example for one potentially consequential intervention.

As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn't necessarily work well (or is not even well-defined?).

Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model "follows" a good decision theory. (Or does this happen by default? Does it depend on whether "following a good decision theory" is helpful for minimizing expected loss on the training set?)

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T14:42:41.278Z · score: 1 (1 votes) · LW · GW

I’d like you to state what position you think I’m arguing for

I think you're arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition "we will create an AGI sometime in the next 30 days".

(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.

Using the "fire alarm" concept here was a mistake, sorry for that. Instead of writing:

I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I should have written:

I'm pretty agnostic about whether the result of that $100M NAS would be "almost AGI".

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”.

I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.

If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).

(I'm mindful of your time and so I don't want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera.

Conditioned on the first AGI being aligned, it may be important to figure out how we can make sure that that AGI "behaves wisely" with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can't).

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-05T20:07:33.496Z · score: 1 (1 votes) · LW · GW

What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;

I indeed model a big part of contemporary ML research as "trial and error". I agree that it seems unlikely that before the first $1B NAS there won't be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don't we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don't fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):

The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slight below average,

(The link in the quote appears to be broken, here is one that works.)

NAS seems to me like a good example for an expensive computation that could plausibly constitute a "search in idea-space" that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a '$1B SGD' (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an "exploration step in idea-space".

Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?

I first need to understand what "human-level AGI" means. Can models in this category pass strong versions of the Turing test? Does this category exclude systems that outperform humans on one or more important dimensions? (It seems to me that the first SGD-trained model that passes strong versions of the Turing test may be a superintelligence.)

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?

Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel - like the neural architecture space.]
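A toy simulation of that "lucky run" dynamic (the landscape and numbers are invented; the point is just that the same local-search procedure finds a much better optimum in a small fraction of runs):

```python
import random

def value(x):
    # rugged landscape: shallow local optima everywhere, plus one rare
    # much-deeper basin near x = 1000
    if 990 <= x <= 1000:
        return 10 + (x - 990)
    return x % 10

def hill_climb(start):
    # greedy local search over the integers 0..1000
    x = start
    while True:
        best = max((max(x - 1, 0), x, min(x + 1, 1000)), key=value)
        if value(best) == value(x):
            return value(x)
        x = best

rng = random.Random(0)
results = [hill_climb(rng.randrange(1001)) for _ in range(1000)]
# most runs end at the shallow optimum (value 9); a lucky few find 20
print(max(results), results.count(20))
```

The runs that reach the deep basin did nothing different; they just happened to start within its (small) basin of attraction.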

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

  • The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)
  • The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don't understand the question "why wasn’t it automated earlier?"

In the second point, I need to first understand how you define that moment in which "humans are replaced". (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T10:25:21.549Z · score: 1 (1 votes) · LW · GW

Conditioned on [$1B NAS yields the first AGI], that NAS itself may essentially be "a local search in idea-space". My argument is that such a local search in idea-space need not start in a world where "almost-AGI" models already exist (I listed in the grandparent two disjunctive reasons in support of this).

Relatedly, "modeling ML research as a local search in idea-space" is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement (which is a supposition that seems to be supported by the rise of NAS and meta-learning approaches in recent years).

I don't see how my reasoning here relies on it being possible to "find fragile things using local search".