ofer's Shortform 2019-11-26T14:59:40.664Z · score: 4 (1 votes)
A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z · score: 6 (4 votes)
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z · score: 41 (13 votes)
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z · score: 12 (4 votes)


Comment by ofer on Moral public goods · 2020-01-26T07:31:41.171Z · score: 1 (1 votes) · LW · GW

Great post!

Consequentialism is a really bad model for most people’s altruistic behavior, and especially their compromises between altruistic and selfish ends. To model someone as a thoroughgoing consequentialist, you have two bad options:

  1. They care about themselves >10 million times as much as other people. [...]
  2. They care about themselves <1% as much as everyone else in the whole world put together. [...]

It seems to me that "consequentialism" here refers to total utilitarianism rather than consequentialism in general.

Comment by ofer on Oracles: reject all deals - break superrationality, with superrationality · 2020-01-24T16:06:05.073Z · score: 1 (1 votes) · LW · GW

So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".

I'm confused about this point. How do we "specify the position" of the Oracle? Suppose the Oracle is implemented as a supervised learning model. Its current input (and all the training data) could have been generated by an arbitrary distant superintelligence that is simulating the environment of the Oracle. What is special about "this box" (the box that we have in mind)? What privileged status does this particular box have relative to other boxes in similar environments that are simulated by distant superintelligences?

Comment by ofer on ofer's Shortform · 2020-01-13T07:06:17.851Z · score: 1 (1 votes) · LW · GW

I crossed out the 'caring about privacy' bit after reasoning that the marginal impact of caring more about one's privacy might depend on potential implications of things like "quantum immortality" (that I currently feel pretty clueless about).

Comment by ofer on Outer alignment and imitative amplification · 2020-01-10T11:54:28.251Z · score: 4 (2 votes) · LW · GW

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

I would argue that according to this definition, there are no loss functions that are outer aligned at optimum (other than ones according to which no model performs optimally). [EDIT: this may be false if a loss function may depend on anything other than the model's output (e.g. if it may contain a regularization term).]

For any model M that performs optimally according to a loss function L there is a model M' that is identical to M except that at the beginning of the execution it hacks the operating system or carries out mind crimes. But for any input, M and M' formally map that input to the same output, and thus M' also performs optimally according to L, and therefore L is not outer aligned at optimum.
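
A minimal sketch of this argument, assuming the loss is computed only from the model's outputs. The names `model_m`, `model_m_prime`, `side_effects`, and `squared_loss` are hypothetical and used purely for illustration; the appended string is a harmless stand-in for an unwanted action taken during execution.

```python
from typing import Callable, List

# Hypothetical log standing in for unwanted actions taken during execution.
side_effects: List[str] = []

def model_m(x: float) -> float:
    """A model that performs optimally on the target function y = 2x."""
    return 2.0 * x

def model_m_prime(x: float) -> float:
    """Identical to model_m, except that it first does something unwanted."""
    side_effects.append("unwanted action")  # invisible to an output-only loss
    return model_m(x)

def squared_loss(model: Callable[[float], float],
                 inputs: List[float], targets: List[float]) -> float:
    """A loss that depends only on the model's outputs."""
    return sum((model(x) - y) ** 2 for x, y in zip(inputs, targets))

inputs = [0.0, 1.0, 2.5]
targets = [2.0 * x for x in inputs]

loss_m = squared_loss(model_m, inputs, targets)
loss_m_prime = squared_loss(model_m_prime, inputs, targets)
```

Both losses come out equal, so a loss of this kind cannot tell the two models apart. A loss that has access to anything beyond the output (as noted in the EDIT above) falls outside this sketch.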

Comment by ofer on Relaxed adversarial training for inner alignment · 2020-01-10T11:16:10.992Z · score: 4 (2 votes) · LW · GW

we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

What might a "purely predictive model with only a world model but no optimization procedure" look like, when considering complicated domains and arbitrarily high predictive accuracy?

It seems plausible that a sufficiently accurate predictive model would use powerful optimization processes. For example, consider a predictive model that predicts the change in Apple's stock price at some moment t (based on data until t). A sufficiently powerful model might, for example, search for solutions to some technical problem related to the development of the next iPhone (that is being revealed that day) in order to calculate the probability that Apple's engineers overcame it.

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T13:28:30.923Z · score: 1 (1 votes) · LW · GW

In this scenario, my argument is that the size ratio for "almost-AGI architectures" is better (e.g. ), and so you're more likely to find one of those first.

For a "local search NAS" (rather than "random search NAS") it seems that we should be considering here the set of ["almost-AGI architectures" from which the local search would not find an "AGI architecture"].

The "$1B NAS discontinuity scenario" allows for the $1B NAS to find "almost-AGI architectures" before finding an "AGI architecture".

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T00:12:39.797Z · score: 1 (1 votes) · LW · GW

If you model the NAS as picking architectures randomly

I don't. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as "trial and error", by trial and error I did not mean random search.)

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn't make much of a difference,

Earlier in this discussion you defined fragility as the property "if you make even a slight change to the thing, then it breaks and doesn't work". While finding fragile solutions is hard, finding non-fragile solutions is not necessarily easy, so I don't follow the logic of that paragraph.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them "AGI architectures"). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be (and then running our evolutionary search 10x times means roughly 10x probability of finding an AGI architecture, if [number of runs]<<).
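
To illustrate the "roughly 10x" claim: if each run independently finds an AGI architecture with some small probability p, then the probability that at least one of n runs succeeds is 1 - (1 - p)^n, which is approximately n * p as long as n is much smaller than 1/p. The value of `p` below is an arbitrary assumption chosen for illustration, not a figure from this comment, and treating runs as independent is also an assumption.

```python
def prob_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent runs succeeds."""
    return 1.0 - (1.0 - p) ** n

p = 1e-6  # hypothetical per-run success probability (assumption)

one_run = prob_at_least_one_success(p, 1)
ten_runs = prob_at_least_one_success(p, 10)
ratio = ten_runs / one_run  # close to 10 while n << 1/p

# Once n is comparable to or larger than 1/p, the linear approximation breaks down.
many_runs = prob_at_least_one_success(p, 10_000_000)
```

In the last line the number of runs exceeds 1/p, so the success probability saturates near 1 instead of continuing to scale linearly.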

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-08T00:09:58.377Z · score: 1 (1 votes) · LW · GW

Creating some sort of commitment device that would bind us to follow UDT (before we evaluate some set of hypotheses) is an example of one potentially consequential intervention.

As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn't necessarily work well (or is not even well-defined?).

Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model "follows" a good decision theory. (Or does this happen by default? Does it depend on whether "following a good decision theory" is helpful for minimizing expected loss on the training set?)

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-06T14:42:41.278Z · score: 1 (1 votes) · LW · GW

I’d like you to state what position you think I’m arguing for

I think you're arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition "we will create an AGI sometime in the next 30 days".

(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.

Using the "fire alarm" concept here was a mistake, sorry for that. Instead of writing:

I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I should have written:

I'm pretty agnostic about whether the result of that $100M NAS would be "almost AGI".

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”.

I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.

If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).

(I'm mindful of your time and so I don't want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera.

Conditioned on the first AGI being aligned, it may be important to figure out how to make sure that that AGI "behaves wisely" with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can't).

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-05T20:07:33.496Z · score: 1 (1 votes) · LW · GW

What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;

I indeed model a big part of contemporary ML research as "trial and error". I agree that it seems unlikely that before the first $1B NAS there won't be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don't we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don't fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):

The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slightly below average,

(The link in the quote appears to be broken, here is one that works.)

NAS seems to me like a good example of an expensive computation that could plausibly constitute a "search in idea-space" that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a '$1B SGD' (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an "exploration step in idea-space".

Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?

I first need to understand what "human-level AGI" means. Can models in this category pass strong versions of the Turing test? Does this category exclude systems that outperform humans on one or more important dimensions? (It seems to me that the first SGD-trained model that passes strong versions of the Turing test may be a superintelligence.)

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?

Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel - like the neural architecture space.]

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

  • The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)
  • The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don't understand the question "why wasn’t it automated earlier?"

In the second point, I need to first understand how you define that moment in which "humans are replaced". (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-04T10:25:21.549Z · score: 1 (1 votes) · LW · GW

Conditioned on [$1B NAS yields the first AGI], that NAS itself may essentially be "a local search in idea-space". My argument is that such a local search in idea-space need not start in a world where "almost-AGI" models already exist (I listed in the grandparent two disjunctive reasons in support of this).

Relatedly, "modeling ML research as a local search in idea-space" is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement (which is a supposition that seems to be supported by the rise of NAS and meta-learning approaches in recent years).

I don't see how my reasoning here relies on it being possible to "find fragile things using local search".

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T23:00:32.526Z · score: 1 (1 votes) · LW · GW

It seems to me that to get FOOM you need the property "if you make even a slight change to the thing, then it breaks and doesn't work"

The above 'FOOM via $1B NAS' scenario doesn't seem to me to require this property. Notice that the increase in capabilities during that NAS may be gradual (i.e. before evaluating the model that implements an AGI the NAS evaluates models that are "almost AGI"). The scenario would still count as a FOOM as long as the NAS yields an AGI and no model before that NAS ever came close to AGI.

Conditioned on [$1B NAS yields the first AGI], a FOOM seems to me particularly plausible if either:

  1. no previous NAS at a similar scale was ever carried out; or
  2. the "path in model space" that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T20:39:29.431Z · score: 1 (1 votes) · LW · GW

Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

If there's an implicit assumption here that FOOM worlds require someone to stumble upon "the correct mathematical principles underlying intelligence", I don't understand why such an assumption is justified. For example, suppose that at some point in the future some top AI lab throws $1B at a single massive neural architecture search (over some arbitrary slightly-novel architecture space) and that NAS stumbles upon some complicated architecture whose corresponding model, after being trained with a massive amount of computing power, implements an AGI.

Comment by ofer on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-03T05:39:18.056Z · score: 3 (2 votes) · LW · GW

This argument seems to point at some extremely important considerations in the vicinity of "we should act according to how we want civilizations similar to us to act" (rather than just focusing on causally influencing our future light cone), etc.

The details of the distribution over possible worlds that you use here seem to matter a lot. How robust are the "robust worlds"? If they are maximally robust (i.e. things turn out great with probability 1 no matter what the civilization does) then we should assign zero weight to the prospect of being in a "robust world", and place all our chips on being in a "fragile world".

Contrarily, if the distribution over possible worlds assigns sufficient probability to worlds in which there is a single very risky thing that cuts EV down by either 10% or 90% depending on whether the civilization takes it seriously or not, then perhaps such worlds should dominate our decision making.

Comment by ofer on A dilemma for prosaic AI alignment · 2019-12-26T17:39:09.340Z · score: 1 (1 votes) · LW · GW

it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge "already in" the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job).

I agree, but it seems to me that coming up with an alignment scheme (for amplification/debate) that "does its job" while preserving competitiveness is an "alignment-hard" problem. I like the OP because I see it as an attempt to reason about how alignment schemes of amplification/debate might work.

Comment by ofer on ofer's Shortform · 2019-12-23T07:05:07.247Z · score: 1 (1 votes) · LW · GW

It seems likely that as technology progresses and we get better tools for finding, evaluating and comparing services and products, advertising becomes closer to a zero-sum game between advertisers and advertisees.

The rise of targeted advertising and machine learning might cause people who care less about their privacy (e.g. people who are less averse to giving arbitrary apps access to a lot of data) to be increasingly at a disadvantage in this zero-sum-ish game.

Also, the causal relationship between 'being a person who is likely to pay above-market prices' and 'being offered above-market prices' may gradually become stronger.

Comment by ofer on Counterfactual Mugging: Why should you pay? · 2019-12-20T08:03:33.967Z · score: 1 (1 votes) · LW · GW

In this situation, an agent who doesn't care about counterfactual selves always gets 0 regardless of the coin

Since the agent is very correlated with its counterfactual copy, it seems that superrationality (or even just EDT) would make the agent pay $100, and get the $10000.

Comment by ofer on A dilemma for prosaic AI alignment · 2019-12-20T07:12:05.082Z · score: 4 (3 votes) · LW · GW

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero).

I'd be happy to read more about this line of thought. (For example, does "loss function" here refer to an objective function that includes a regularization term? If not, what might we assume about the theoretical optimum that amounts to a safety benefit?)

Comment by ofer on A dilemma for prosaic AI alignment · 2019-12-18T17:02:07.392Z · score: 1 (1 votes) · LW · GW

Whoops, (2) came out cryptic, and is incorrect, sorry. The (correct?) idea I was trying to convey is the following:

If 'the safety scheme' in plan 1 requires anything at all that ruins competitiveness—for example, some human-in-the-loop process that occurs recurrently during training—then no further assumptions (such as that conjecture) are necessary for the reasoning in the OP, AFAICT.

This idea no longer seems to me to amount to making the conjecture strictly weaker.

Comment by ofer on A dilemma for prosaic AI alignment · 2019-12-18T06:55:28.351Z · score: 0 (2 votes) · LW · GW

Interesting post!

Conjecture: Cutting-edge AI will come from cutting-edge algorithms/architectures trained towards cutting-edge objectives (incl. unsupervised learning) in cutting-edge environments/datasets. Anything missing one or more of these components will suffer a major competitiveness penalty.

I would modify this conjecture in the following two ways:

1. I would replace "cutting-edge algorithms" with "cutting-edge algorithms and/or algorithms that use a huge amount of computing power".

2. I would make the conjecture weaker, such that it won't claim that "Anything missing one or more of these components will suffer a major competitiveness penalty".

Comment by ofer on A dilemma for prosaic AI alignment · 2019-12-18T06:48:15.245Z · score: 6 (4 votes) · LW · GW

I'd be happy to read an entire post about this view.

What level of language modeling may be sufficient for competitively helping in building the next AI, according to this view? For example, could such language modeling capabilities allow a model to pass strong (text-based) versions of the Turing test?

Comment by ofer on What are the best arguments and/or plans for doing work in "AI policy"? · 2019-12-09T16:03:51.067Z · score: 3 (2 votes) · LW · GW

Note that research related to governments is just a part of "AI policy" (which also includes stuff like research on models/interventions related to cooperation between top AI labs and publications norms in ML).

Comment by ofer on What are the best arguments and/or plans for doing work in "AI policy"? · 2019-12-09T15:55:29.901Z · score: 3 (2 votes) · LW · GW

Also, this 80,000 Hours podcast episode with Allan Dafoe.

Comment by ofer on Open question: are minimal circuits daemon-free? · 2019-12-07T22:48:26.162Z · score: 3 (2 votes) · LW · GW

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let C be a minimal circuit that takes as input a string of length n that encodes a Turing machine, and outputs a string that is the concatenation of the first k configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string x that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x, the computation of the output of C simulates a consequentialist; and C is a minimal circuit.
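
As a rough illustration of the input-output behavior described above (written as an ordinary program, not as a circuit, and with no claim of minimality), here is a sketch of a function that returns the concatenation of the first k configurations of a Turing machine run. The transition-table format, the configuration encoding `state:head:tape|`, and the toy machine `flipper` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Transition table: (state, symbol) -> (new_state, written_symbol, head_move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]

def first_k_configurations(transitions: Transitions, start_state: str,
                           tape_input: str, k: int, blank: str = "_") -> str:
    """Return the concatenation of the first k configurations of a run.

    Each configuration is encoded as "state:head:tape|" (a hypothetical encoding).
    """
    tape: Dict[int, str] = {i: s for i, s in enumerate(tape_input)}
    state, head = start_state, 0
    encoded: List[str] = []
    for _ in range(k):
        lo, hi = min(tape, default=0), max(tape, default=0)
        cells = "".join(tape.get(i, blank)
                        for i in range(min(lo, head), max(hi, head) + 1))
        encoded.append(f"{state}:{head}:{cells}|")
        key = (state, tape.get(head, blank))
        if key not in transitions:
            continue  # halted: the configuration simply repeats
        state, symbol, move = transitions[key]
        tape[head] = symbol
        head += move
    return "".join(encoded)

# A toy machine that flips bits while moving right, then halts on a blank.
flipper: Transitions = {
    ("scan", "0"): ("scan", "1", 1),
    ("scan", "1"): ("scan", "0", 1),
}

output = first_k_configurations(flipper, "scan", "01", k=4)
```

Whatever the encoded machine computes, the simulator necessarily carries out that computation step by step on the way to producing its output, which is the point of the argument.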

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-05T15:05:58.436Z · score: 4 (2 votes) · LW · GW

Hence, the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and the first 1,000 days of thought?

I agree. We can frame this empirical uncertainty more generally by asking: What is the smallest T such that there is no meaningful difference between all the things that can happen in a human brain while thinking about a question for T minutes vs. 1,000 days?

Or rather: What is the smallest T such that 'learning to generate answers that humans may give after thinking for T minutes' is not easier than 'learning to generate answers that humans may give after thinking for 1,000 days'?

I should note that, conditioned on the above scenario, I expect that labeled 10-minute-thinking training examples would be at most a tiny fraction of all the training data (when considering all the learning that had a role in building the model, including learning that produced pre-trained weights etcetera). I expect that most of the learning would be either 'supervised with automatic labeling' or unsupervised (e.g. 'predict the next token') and that a huge amount of text (and code) that humans wrote will be used; some of which would be the result of humans thinking for a very long time (e.g. a paper on arXiv that is the result of someone thinking about a problem for a year).

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-04T02:09:34.272Z · score: 1 (1 votes) · LW · GW

Thank you! I think I now have a better model of how people think about factored cognition.

the 'humans thinking for 10 minutes chained together' might have very different properties from 'one human thinking for 1000 days'. But given that those have different properties, it means it might be hard to train the 'one human thinking for 1000 days' system relative to the 'thinking for 10 minutes' system, and the fact that one easily extends to the other is evidence that this isn't how thinking works.

In the above scenario I didn't assume that humans (or the model) use factored cognition when the 'thinking duration' is long. Suppose instead that the model is running a simulation of a system that is similar (in some level of abstraction) to a human brain. For example, suppose some part of the model represents a configuration of a human brain, and during inference some iterative process repeatedly advances that configuration by a single "step". Advancing the configuration by 100,000 steps (10 minutes) is not qualitatively different from advancing it by roughly 14 billion steps (1,000 days); and the runtime is linear in the number of steps.
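
A small sketch of the "same loop, more iterations" point, using the rate of 100,000 steps per 10 minutes from the paragraph above. At that rate, 1,000 days corresponds to 14.4 billion steps. The update rule inside `advance` is a hypothetical stand-in for a single simulation step and has no connection to any real brain model.

```python
STEPS_PER_10_MINUTES = 100_000  # rate taken from the paragraph above
MINUTES_PER_DAY = 24 * 60

def steps_for_minutes(minutes: int) -> int:
    """Number of simulation steps needed for a given thinking duration."""
    return STEPS_PER_10_MINUTES * minutes // 10

def advance(config: int, steps: int) -> int:
    """Toy stand-in for the simulator: apply the same step function repeatedly."""
    for _ in range(steps):
        config = (config * 31 + 7) % 1_000_003  # hypothetical update rule
    return config

steps_10_min = steps_for_minutes(10)
steps_1000_days = steps_for_minutes(1000 * MINUTES_PER_DAY)
state_after_10_min = advance(config=1, steps=steps_10_min)
```

The only thing that changes between the two durations is the `steps` argument, so the runtime grows linearly while the procedure itself stays the same.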

Generally, one way to make predictions about the final state of complicated physical processes is to simulate them. Solutions that do not involve simulations (or equivalent) may not even exist, or may be less likely to be found by the training algorithm.

Comment by ofer on AI Alignment Open Thread October 2019 · 2019-12-03T18:19:45.856Z · score: 6 (3 votes) · LW · GW

[Question about factored cognition]

Suppose that at some point in the future, for the first time in history, someone trains an ML model that takes any question as input, and outputs an answer that an ~average human might have given after thinking about it for 10 minutes. Suppose that model is trained without any safety-motivated interventions.

Suppose also that the architecture of that model is such that '10 minutes' is just a parameter, T, that the operator can choose per inference, and there's no upper bound on it; and the inference runtime increases linearly with T. So, for example, the model could be used to get an answer that a human would have come up with after thinking for 1000 days.

In this scenario, would it make sense to use the model for factored cognition? Or should we consider running this model with T = 1000 days to be no more dangerous than running it many times with T = 10 minutes?

Comment by ofer on ofer's Shortform · 2019-11-26T14:59:40.853Z · score: 12 (4 votes) · LW · GW

Nothing in life is as important as you think it is when you are thinking about it.

--Daniel Kahneman, Thinking, Fast and Slow

To the extent that the above phenomenon tends to occur, here's a fun story that attempts to explain it:

At every moment our brain can choose something to think about (like "that exchange I had with Alice last week"). How does the chosen thought get selected from the thousands of potential thoughts? Let's imagine that the brain assigns an "importance score" to each potential thought, and thoughts with a larger score are more likely to be selected. Since there are thousands of thoughts to choose from, the optimizer's curse makes our brain overestimate the importance of the thought that it ends up selecting.
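
The story can be checked with a quick simulation. This is a sketch under simplifying assumptions: true importance and estimation noise are both drawn from normal distributions, each "importance score" is an unbiased noisy estimate, and the brain always selects the highest-scoring thought (a simplification of "more likely to be selected"). The function name and all parameter values are hypothetical.

```python
import random

def simulate_selection(num_thoughts: int = 1000, noise_sd: float = 1.0,
                       trials: int = 200, seed: int = 0) -> float:
    """Average (estimated - true) importance of the selected thought."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_importance = [rng.gauss(0.0, 1.0) for _ in range(num_thoughts)]
        # Each estimate is unbiased on its own: true value plus zero-mean noise.
        estimates = [t + rng.gauss(0.0, noise_sd) for t in true_importance]
        chosen = max(range(num_thoughts), key=estimates.__getitem__)
        total_gap += estimates[chosen] - true_importance[chosen]
    return total_gap / trials

average_overestimate = simulate_selection()
```

Even though no single estimate is biased, `average_overestimate` comes out positive: conditioning on having been selected favors thoughts whose noise happened to be large and positive.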

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-22T12:20:42.157Z · score: 3 (2 votes) · LW · GW

If "unintended optimization" refers only to the inner alignment problem, then there's also the malign prior problem.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-22T11:33:06.207Z · score: 1 (1 votes) · LW · GW

Well, the reason I mentioned the "utility function over different states of matter" thing is because if your utility function isn't specified over states of matter, but is instead specified over your actions (e.g. behave in a way that's as corrigible as possible), you don't necessarily get instrumental convergence.

I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM cells whose values represent the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as many resources in our world as possible).

My impression is that early thinking about Oracles wasn't really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there's no real reason to believe these early "Oracle" models are an accurate description of current or future (un)supervised learning systems.

It seems possible that something like this has happened. Though as far as I know, we don't currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.

How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase "training distribution" then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?

Therefore, I'm sympathetic to the following perspective, from Armstrong and O'Rourke (2018) (the last sentence was also quoted in the grandparent):

we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-21T15:34:59.152Z · score: 1 (1 votes) · LW · GW

Sorry for the delayed response!

My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I'm not sure how the concept applies in other cases.

I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.

Like, if we aren't using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals?

I'm not sure how to interpret 'unintended optimization', but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.

I'm interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that's potentially smarter than we are.

I agree. So the following is a pending question that I haven't addressed here yet: Would '(un)supervised learning at arbitrarily large scale' produce arbitrarily capable systems that "want to affect the world"?

I won't address this here, but I think this is a very important question that deserves a thorough examination (I plan to reply here with another comment if I end up writing something about it). For now I'll note that my best guess is that most AI safety researchers think that it's at least plausible (>10%) that the answer to that question is "yes".

I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world'). Here's some supporting evidence for this:

  • Stuart Armstrong and Xavier O'Rourke wrote in their Safe Uses of AI Oracles paper:

    we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

  • Stuart Russell wrote in his book Human Compatible (2019):

    if the objective of the Oracle AI system is to provide accurate answers to questions in a reasonable amount of time, it will have an incentive to break out of its cage to acquire more computational resources and to control the questioners so that they ask only simple questions.

Comment by ofer on Robin Hanson on the futurist focus on AI · 2019-11-14T21:30:25.369Z · score: 5 (4 votes) · LW · GW

It seems you consider previous AI booms to be a useful reference class for today's progress in AI.

Suppose we learn that the fraction of global GDP that currently goes into AI research is at least X times higher than in any previous AI boom. What is roughly the smallest X for which you'll change your mind (i.e. no longer consider previous AI booms to be a useful reference class for today's progress in AI)?

[EDIT: added "at least"]

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-13T11:31:25.014Z · score: 1 (1 votes) · LW · GW

Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer.

I'm not sure whether you're asking about my thoughts on:

  1. how can '(un)supervised learning at arbitrarily large scale' produce such systems; or

  2. conditioned on such systems existing, why might they have convergent instrumental goals that look like those failure modes.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-12T13:59:24.428Z · score: 1 (1 votes) · LW · GW

Are you referring to the possibility of unintended optimization

Yes (for a very broad interpretation of 'optimization'). I mentioned some potential failure modes in this comment.

Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-10T08:25:28.607Z · score: 5 (3 votes) · LW · GW

I'm not sure I understand the question, but in case it's useful/relevant here:

A computer that trains an ML model/system (via something that looks like contemporary ML methods at an arbitrarily large scale) might be dangerous even if it's not connected to anything. Humans might get manipulated (e.g. if researchers ever look at the learned parameters), mind crime might occur, acausal trading might occur, and the hardware of the computer might be used to implement effectors in some fantastic way. And those might be just a tiny fraction of a large class of relevant risks, most of which we can't currently understand.

Such 'offline computers' might be more dangerous than an RL agent that by design controls some actuators, because problems with the latter might be visible to us at a much lower scale of training (and therefore with much less capable/intelligent systems).

Comment by ofer on Defining Myopia · 2019-11-09T13:47:51.438Z · score: 9 (2 votes) · LW · GW

In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.

Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).

Backtracking from this, I now realize that my core reasoning behind my argument—about evolutionary computation algorithms producing non-myopic behavior by default—was incorrect.

(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms; which seems incorrect!)

Comment by ofer on Lite Blocking · 2019-11-06T08:14:59.031Z · score: 1 (1 votes) · LW · GW

Social networks generally have far more things they could show you than you'll be able to look at. To prioritize they use inscrutable algorithms that boil down to "we show you the things we predict you're going to like".

Presumably, social networks tend to optimize for metrics like time spent and user retention. (There might even be a causal relationship between this optimization and threads getting derailed.)

Also, this seems like a stable/likely state, because if any single social network were to unilaterally switch to optimizing for 'showing users things they like' (or any other metric different from the above), competing social networks would plausibly be "stealing" their users.

Comment by ofer on Book Review: Design Principles of Biological Circuits · 2019-11-05T16:02:55.619Z · score: 25 (12 votes) · LW · GW

Thanks for writing this!

I used to agree with this position. I used to argue that there was no reason to expect human-intelligible structure inside biological organisms, or deep neural networks, or other systems not designed to be understandable. But over the next few years after that biologist’s talk, I changed my mind, and one major reason for the change is Uri Alon’s book An Introduction to Systems Biology: Design Principles of Biological Circuits.

I'm curious what you think about the following argument:

The examples in this book are probably subject to a strong selection effect: examples of mechanisms that researchers currently understand were more likely to be included in the book than those that no one has a clue about. There might be a large variance in the "opaqueness" of biological mechanisms (at different levels of abstraction). So perhaps this book provides little evidence on the (lack of) opaqueness of, say, the mechanisms that allowed John von Neumann to create the field of cellular automata, at a level of abstraction that is analogous to artificial neural networks.

Comment by ofer on What AI safety problems need solving for safe AI research assistants? · 2019-11-05T09:33:30.300Z · score: 1 (1 votes) · LW · GW

May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs.

It seems worth pointing out that due to the inner alignment problem, we shouldn't assume that naively training, say, unsupervised learning models with human-level capabilities (e.g. for the purpose of generating novel FAI proposals) will be safe — conditioned on it being possible capabilities-wise.

Comment by ofer on More variations on pseudo-alignment · 2019-11-05T07:29:58.209Z · score: 3 (2 votes) · LW · GW

I'm not sure that I understand your definition of suboptimality deceptive alignment correctly. My current (probably wrong) interpretation of it is: "a model has a suboptimality deceptive alignment problem if it does not currently have a deceptive alignment problem but will plausibly have one in the future". This sounds to me like a wrong interpretation of this concept - perhaps you could point out how it differs from the correct interpretation?

If my interpretation is roughly correct, I suggest naming this concept in a way that would not imply that it is a special case of deceptive alignment. Maybe "prone to deceptive alignment"?

Comment by ofer on More variations on pseudo-alignment · 2019-11-05T07:24:28.342Z · score: 5 (3 votes) · LW · GW

This post made me re-visit the idea in your paper to distinguish between:

  1. Internalization of the base objective ("The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned."); and
  2. Modeling of the base objective ("The base objective is incorporated into the mesa-optimizer’s epistemic model rather than its objective").

I'm currently confused about this distinction. The phrase "point to" seems to me vague. What should count as a model that points to a representation of the base objective (as opposed to internalizing it)?

Suppose we have a model that is represented by a string of 10 billion bits. Suppose it is the case that there is a set of 100 bits such that if we flip all of them, the model would behave very differently (but would still be very "capable", i.e. the modification would not just "break" it).

[EDIT: by "behave very differently" I mean something like "maximize some objective function that is far away from the base objective on objective function space"]

Is it theoretically possible that a model that fits this description is the result of internalization of the base objective rather than modeling of the base objective?

Comment by ofer on Defining Myopia · 2019-11-04T17:45:26.652Z · score: 3 (2 votes) · LW · GW

I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) "informal strategic stuff" (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.

The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version that perhaps you responded to: I want it to only be about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning for past stock prices or RL for Atari games (when each episode is a new game).

My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolute myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).

I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:

During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.

I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I'll use the following example to explain why I think that:

Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance); however, if parameter θ8 has a large value then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ8 does not (directly) affect predictions.

How should we expect our learning algorithm to update the parameter at the end of iteration 2?

If our learning algorithm is gradient descent, it seems that we should NOT expect θ8 to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ8) is expected to be positive.

In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
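
As a rough sketch of the example above (not a definitive implementation), the Python below contrasts what gradient descent "sees" in iteration 2 with what one round of fitness-weighted selection does. Everything here is an assumption made for illustration: the name `theta8` (borrowed from the "high-θ8" wording used elsewhere in this thread), the `1 / (1 + theta8)` shape of the environment, and the `1 / loss` fitness weighting. The `shared_environment` flag is also an assumption of the sketch: whether carriers of a large parameter value are favoured depends on whether each model's iteration-2 examples are shaped by its own iteration-1 behavior or by the population as a whole.

```python
def iteration2_loss(env_theta8):
    """Hypothetical environment: iteration-2 examples become more predictable
    (lower achievable loss) as the theta8 that shaped iteration 1 grows."""
    return 1.0 / (1.0 + env_theta8)


def gd_partial_derivative(theta8, env_theta8, eps=1e-6):
    """Finite-difference partial derivative of the iteration-2 loss w.r.t. theta8.

    The examples are already fixed (by env_theta8), and by assumption the
    current theta8 does not affect the iteration-2 predictions.
    """
    def loss(current_theta8):
        return iteration2_loss(env_theta8)  # no dependence on current_theta8

    return (loss(theta8 + eps) - loss(theta8 - eps)) / (2 * eps)


def mean_theta8_after_selection(population, shared_environment):
    """Fitness-weighted mean of theta8 after one round of selection."""
    if shared_environment:
        # Every model faces examples shaped by the population average.
        env = sum(population) / len(population)
        losses = [iteration2_loss(env) for _ in population]
    else:
        # Each model faces examples shaped by its own iteration-1 behavior.
        losses = [iteration2_loss(t) for t in population]
    weights = [1.0 / l for l in losses]  # lower loss means higher fitness
    return sum(w * t for w, t in zip(weights, population)) / sum(weights)


POPULATION = [0.0, 4.0, 1.0, 3.0]  # hypothetical theta8 values, mean 2.0
```

Under these assumptions, `gd_partial_derivative` is zero for any inputs, `mean_theta8_after_selection(POPULATION, False)` is above the starting mean, and `mean_theta8_after_selection(POPULATION, True)` stays at the starting mean, since in a shared environment every model benefits equally.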

Comment by ofer on [deleted post] 2019-11-04T08:46:50.195Z


Comment by ofer on Gradient hacking · 2019-11-02T16:42:33.622Z · score: 3 (2 votes) · LW · GW

Why can't the gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?

Gradient descent is a very simple algorithm. It only "gets rid" of some piece of logic when that is the result of updating the parameters in the direction of the gradient. In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing a particular parameter, by being a model that outputs a very incorrect value if that parameter is even slightly different than the desired value.
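
A toy numerical sketch of this mechanism, in Python, might look like the following. It is only an illustration under stated assumptions: the data, the `PROTECTED` value, the `strength` knob and the function names are all hypothetical, and the "sabotage" term is written in by hand rather than learned. The task alone favours `theta = 2`, but the model adds an error that grows quickly once `theta` leaves the protected value, so following the gradient keeps `theta` close to that value.

```python
# Toy data: the "honest" task is y = 2 * x, so the task alone favours theta = 2.
DATA = [(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0)]
PROTECTED = 1.0  # hypothetical value the model "wants" theta to keep


def output(x, theta, strength):
    honest = theta * x
    # Self-imposed error: grows quickly once theta leaves PROTECTED.
    sabotage = strength * (theta - PROTECTED)
    return honest + sabotage


def grad(theta, strength):
    """Analytic derivative of the mean squared error with respect to theta."""
    total = 0.0
    for x, y in DATA:
        residual = output(x, theta, strength) - y
        total += 2.0 * residual * (x + strength)  # d(output)/d(theta) = x + strength
    return total / len(DATA)


def train(strength, theta=PROTECTED, lr=0.004, steps=5000):
    """Plain gradient descent on theta, starting from the protected value."""
    for _ in range(steps):
        theta -= lr * grad(theta, strength)
    return theta


free_theta = train(strength=0.0)     # no sabotage: theta moves to about 2
hacked_theta = train(strength=10.0)  # with sabotage: theta stays close to 1
```

In this sketch gradient descent never "removes" the sabotage term; it just follows the gradient, and the gradient points back toward the protected value.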

Comment by ofer on [deleted post] 2019-11-02T16:40:40.310Z


Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-02T09:43:05.575Z · score: 7 (4 votes) · LW · GW
Oracles must somehow be incentivized to give useful answers

A microscope model must also be trained somehow, for example with unsupervised learning. Therefore, I expect such a model to also look like it's "incentivized to give useful answers" (e.g. an answer to the question: "what is the next word in the text?").

My understanding is that what distinguishes a microscope model is the way it is being used after it's already trained (namely, allowing researchers to look at its internals for the purpose of gaining insights etcetera, rather than making inferences for the sake of using its valuable output). If this is correct, it seems that we should only use safe training procedures for the purpose of training useful microscopes, rather than training arbitrarily capable models.

Comment by ofer on Chris Olah’s views on AGI safety · 2019-11-01T23:46:35.653Z · score: 21 (7 votes) · LW · GW

Thanks for writing this!

I'm curious what Chris's best guess (or anyone else's) is about where to place AlphaGo Zero on that diagram. Presumably its place is somewhere after "Human Performance", but is it close to the "Crisp Abstractions" peak, or perhaps way further - somewhere in the realm of "Increasingly Alien Abstractions"?

Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it.

Is there an implicit assumption here that RL agents are generally more dangerous than models that are trained with (un)supervised learning?

(Later the OP contrasts microscopes with oracles, so perhaps Chris interprets a microscope as a model that is smaller, or otherwise somehow restricted, s.t. we know it's safe?)

Comment by ofer on Defining Myopia · 2019-10-23T23:49:21.053Z · score: 3 (2 votes) · LW · GW
In particular, I think there’s a distinction between agents with objective functions over states the world vs. their own subjective experience vs. their output.

This line of thinking seems to me very important!

The following point might be obvious to Evan, but it's probably not obvious to everyone: Objective functions over the agent's output should probably not be interpreted as objective functions over the physical representation of the output (e.g. the configuration of atoms in certain RAM memory cells). That would just be a special case of objective functions over world states. Rather, we should probably be thinking about objective functions over the output as it is formally defined by the code of the agent (when interpreting "the code of the agent" as a mathematical object, like a Turing machine, and using a formal mapping from code to output).

Perhaps the following analogy can convey this idea: think about a human facing Newcomb's problem. The person has the following instrumental goal: "be a person that does not take both boxes" (because that makes Omega put $1,000,000 in the first box). Now imagine that that was the person's terminal goal rather than an instrumental goal. That person might be analogous to a program that "wants" to be a program whose (formally defined) output maximizes a given utility function.

Comment by ofer on Defining Myopia · 2019-10-23T22:14:12.354Z · score: 1 (1 votes) · LW · GW

Taking a step back, I want to note two things about my model of the near future (if your model disagrees with those things, that disagreement might explain what's going on in our recent exchanges):

(1) I expect many actors to be throwing a lot of money at selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.

(2) Suppose there's some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.

Regarding the questions you raised:

How would you propose to apply evolutionary algorithms to online learning?

One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
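
To make this concrete, here is a minimal Python sketch under heavy simplifying assumptions. The "system" being evolved is just the step size of a one-parameter online learner, each "task" is tracking a constant target, and the selection scheme (keep the better half, add Gaussian-mutated copies) is one arbitrary choice among many. All names and constants are hypothetical.

```python
import random


def online_loss(lr, target, steps=20):
    """Cumulative squared error of a one-parameter online learner on one task."""
    estimate, total = 0.0, 0.0
    for _ in range(steps):
        total += (estimate - target) ** 2     # predict first, then pay the loss
        estimate += lr * (target - estimate)  # then update on the revealed label
    return total


def fitness(lr, tasks):
    """Fitness metric: negative total loss over a set of online learning tasks."""
    return -sum(online_loss(lr, target) for target in tasks)


def evolve(tasks, generations=30, pop_size=8, seed=0):
    """Evolve the learner's step size; returns (best initial, best final)."""
    rng = random.Random(seed)
    population = [rng.uniform(0.0, 0.5) for _ in range(pop_size)]
    initial_best = max(population, key=lambda lr: fitness(lr, tasks))
    for _ in range(generations):
        population.sort(key=lambda lr: fitness(lr, tasks), reverse=True)
        parents = population[: pop_size // 2]  # the better half survives
        children = [p + rng.gauss(0.0, 0.1) for p in parents]
        population = parents + children
    best = max(population, key=lambda lr: fitness(lr, tasks))
    return initial_best, best


TRAIN_TASKS = [1.0, -2.0, 3.0]  # tasks for which "training data is available"
initial_best, best = evolve(TRAIN_TASKS)
```

The evolved step size `best` could then be evaluated on a task that was not used for selection, e.g. `fitness(best, [5.0])`, which stands in for deploying the selected system on a new online learning task.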

How would you propose to apply evolutionary algorithms to non-episodic environments?

I'm not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can't be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there's training data or for which the environment can be simulated).

Comment by ofer on Healthy Competition · 2019-10-21T06:28:29.126Z · score: 2 (2 votes) · LW · GW
If you want to challenge a monopoly with a new org, there's likewise a particular burden to do a good job.

(This seems to depend on whether the job/project in question benefits from concentration.)