[AN #139]: How the simplicity of reality explains the success of neural nets 2021-02-24T18:30:04.038Z
[AN #138]: Why AI governance should find problems rather than just solving them 2021-02-17T18:50:02.962Z
[AN #137]: Quantifying the benefits of pretraining on downstream task performance 2021-02-10T18:10:02.561Z
[AN #136]: How well will GPT-N perform on downstream tasks? 2021-02-03T18:10:03.856Z
[AN #135]: Five properties of goal-directed systems 2021-01-27T18:10:04.648Z
[AN #134]: Underspecification as a cause of fragility to distribution shift 2021-01-21T18:10:06.783Z
[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines) 2021-01-13T18:10:04.932Z
[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate 2021-01-06T18:20:05.694Z
[AN #131]: Formalizing the argument of ignored attributes in a utility function 2020-12-31T18:20:04.835Z
[AN #130]: A new AI x-risk podcast, and reviews of the field 2020-12-24T18:20:05.289Z
[AN #129]: Explaining double descent by measuring bias and variance 2020-12-16T18:10:04.840Z
[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands 2020-12-09T18:20:07.910Z
[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment 2020-12-02T18:20:05.196Z
[AN #126]: Avoiding wireheading by decoupling action feedback from action effects 2020-11-26T23:20:05.290Z
[AN #125]: Neural network scaling laws across multiple modalities 2020-11-11T18:20:04.504Z
[AN #124]: Provably safe exploration through shielding 2020-11-04T18:20:06.003Z
[AN #123]: Inferring what is valuable in order to align recommender systems 2020-10-28T17:00:06.053Z
[AN #122]: Arguing for AGI-driven existential risk from first principles 2020-10-21T17:10:03.703Z
[AN #121]: Forecasting transformative AI timelines using biological anchors 2020-10-14T17:20:04.918Z
[AN #120]: Tracing the intellectual roots of AI and AI alignment 2020-10-07T17:10:07.013Z
The Alignment Problem: Machine Learning and Human Values 2020-10-06T17:41:21.138Z
[AN #119]: AI safety when agents are shaped by environments, not rewards 2020-09-30T17:10:03.662Z
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems 2020-09-23T18:20:04.779Z
[AN #117]: How neural nets would fare under the TEVV framework 2020-09-16T17:20:14.062Z
[AN #116]: How to make explanations of neurons compositional 2020-09-09T17:20:04.668Z
[AN #115]: AI safety research problems in the AI-GA framework 2020-09-02T17:10:04.434Z
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents 2020-08-26T17:20:04.960Z
[AN #113]: Checking the ethical intuitions of large language models 2020-08-19T17:10:03.773Z
[AN #112]: Engineering a Safer World 2020-08-13T17:20:04.013Z
[AN #111]: The Circuits hypotheses for deep learning 2020-08-05T17:40:22.576Z
[AN #110]: Learning features from human feedback to enable reward learning 2020-07-29T17:20:04.369Z
[AN #109]: Teaching neural nets to generalize the way humans would 2020-07-22T17:10:04.508Z
[AN #107]: The convergent instrumental subgoals of goal-directed agents 2020-07-16T06:47:55.532Z
[AN #108]: Why we should scrutinize arguments for AI risk 2020-07-16T06:47:38.322Z
[AN #106]: Evaluating generalization ability of learned reward models 2020-07-01T17:20:02.883Z
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization 2020-06-24T17:30:02.977Z
[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID 2020-06-18T17:10:02.641Z
[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL 2020-06-10T17:20:02.171Z
[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment 2020-06-03T17:20:02.221Z
[AN #101]: Why we should rigorously measure and forecast AI progress 2020-05-27T17:20:02.460Z
[AN #100]: What might go wrong if you learn a reward function while acting 2020-05-20T17:30:02.608Z
[AN #99]: Doubling times for the efficiency of AI algorithms 2020-05-13T17:20:02.637Z
[AN #98]: Understanding neural net training by seeing which gradients were helpful 2020-05-06T17:10:02.563Z
[AN #97]: Are there historical examples of large, robust discontinuities? 2020-04-29T17:30:02.043Z
[AN #96]: Buck and I discuss/argue about AI Alignment 2020-04-22T17:20:02.821Z
[AN #95]: A framework for thinking about how to make AI go well 2020-04-15T17:10:03.312Z
[AN #94]: AI alignment as translation between humans and machines 2020-04-08T17:10:02.654Z
[AN #93]: The Precipice we’re standing at, and how we can back away from it 2020-04-01T17:10:01.987Z
[AN #92]: Learning good representations with contrastive predictive coding 2020-03-25T17:20:02.043Z
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement 2020-03-18T17:10:02.205Z


Comment by rohinmshah on [AN #139]: How the simplicity of reality explains the success of neural nets · 2021-02-25T21:43:11.787Z · LW · GW

What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.

Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive integer d. That's a function whose maximum value is 0 (where higher values = more "dogness") and doesn't blow up unreasonably anywhere.

(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)

Comment by rohinmshah on Utility Maximization = Description Length Minimization · 2021-02-24T21:24:19.911Z · LW · GW

The core conceptual argument is: the higher your utility function can go, the bigger the world must be, and so the more bits it must take to describe it in its unoptimized state under M2, and so the more room there is to reduce the number of bits.

If you could only ever build 10 paperclips, then maybe it takes 100 bits to specify the unoptimized world, and 1 bit to specify the optimized world.

If you could build 10^100 paperclips, then the world must be humongous and it takes 10^101 bits to specify the unoptimized world, but still just 1 bit to specify the perfectly optimized world.

If you could build ∞ paperclips, then the world must be infinite, and it takes ∞ bits to specify the unoptimized world. Infinities are technically challenging, and John's comment goes into more detail about how you deal with this sort of case.

For more intuition, notice that exp(x) is a bijective function from (-∞, ∞) to (0, ∞), so it goes from something unbounded on both sides to something unbounded on one side. That's exactly what's happening here, where utility is unbounded on both sides and gets mapped to something that is unbounded only on one side.

Comment by rohinmshah on Some thoughts on risks from narrow, non-agentic AI · 2021-02-23T23:48:29.087Z · LW · GW

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.

Fwiw I think I didn't realize you weren't making claims about what post-singularity looked like, and that was part of my confusion about this post. Interpreting it as "what's happening until the singularity" makes more sense. (And I think I'm mostly fine with the claim that it isn't that important to think about what happens after the singularity.)

Comment by rohinmshah on Some thoughts on risks from narrow, non-agentic AI · 2021-02-23T19:14:13.315Z · LW · GW

I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).

This makes sense to me, and seem to map somewhat onto Parts 1 and 2 of WFLL.

However, you also call those parts "going out with a whimper" and "going out with a bang", which seems to be claims about the impacts of bad generalizations. In that post, are you intending to make claims about possible kinds of bad generalizations that ML models could make, or possible ways that poorly-generalizing ML models could lead to catastrophe (or both)?

Personally, I'm pretty on board with the two types of bad generalizations as plausible things that could happen, but less on board with "going out with a whimper" as leading to catastrophe. It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future. (Possibly the answer is "the AI systems don't allow us to do this because it would interfere with their continued operation".)

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-21T17:24:41.368Z · LW · GW

Well, it might be the case that a system is aligned but is mistakenly running an exploitable decision theory. I think the idea is we would prefer to have things set up so that failures are contained, ie if your AI is running an exploitable decision theory, that problem doesn't cascade into even worse problems.

At least for the neural net approach, I don't think "mistakenly running an exploitable decision theory" is a plausible risk -- there isn't going to be a "decision theory" module that we get right or wrong. I feel like this sort of worry is mistaking the abstractions in our map for the territory. I think it is more likely that there will be a bunch of messy reasoning that doesn't neatly correspond to a single decision theory (just as with humans).

But let's take the least convenient world, in which that messy reasoning could lead to mistakes that we might call "having the wrong decision theory". I think often I would count this as a failure of alignment. Presumably humans want the AI system to be cautious about taking huge impactful decisions in novel situations, and any reasonably intelligent AI system should know this, so giving up the universe to an acausal threatener without consulting humans would count as the AI knowably not doing what we want, aka a failure of alignment. So I continue to stand by my claim.

(In your parlance, I might say "humans have preferences over the AI system's decision theory, so if the AI system uses a flawed decision theory that counts as a failure of alignment". But the reason that it makes sense to carve up the world in this way is because there won't be an explicit decision theory module.)

Fwiw, I expect MIRI people are usually on board with this, e.g. this seems to have a similar flavor (though it doesn't literally say the same thing).

Comment by rohinmshah on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-02-20T21:34:10.727Z · LW · GW

More explanation on why I'm not summarizing the paper about Kolmogorov complexity:

I don’t find the Kolmogorov complexity stuff very convincing. In general, algorithmic information theory tends to have arguments of the form “since this measure simulates every computable process, it (eventually) at least matches any particular computable process”. This feels pretty different from normal notions of “simplicity” or “intelligence”, and so I often try to specifically taboo phrases like “Solomonoff induction” or “Kolmogorov complexity” and replace them with something like “by simulating every possible computational process”, and see if the argument still seems convincing. That mostly doesn’t seem to be the case here.

If I try to do this with the arguments here, I get something like:

Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.

This argument just doesn’t sound very compelling to me. It also can be applied to literally any machine learning algorithm; I don’t see why this is specific to neural nets. If this is just meant to explain why it is okay to overparameterize neural nets, then that makes more sense to me, though then I’d say something like “with overparameterized neural nets, many different parameterizations instantiate the same function, and so the ‘effective parameterization’ is lower than you might have thought”, rather than saying anything about Kolmogorov complexity.

(This doesn't capture everything that simplicity is meant to capture -- for example, it doesn't capture the argument that neural networks can express overfit-to-the-training-data models, but those are high complexity and so low likelihood in the prior and so don't happen in general; but as mentioned above I find the Kolmogorov complexity argument for this pretty tenuous.)

Comment by rohinmshah on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-02-20T21:31:59.819Z · LW · GW

Zach's summary for the Alignment Newsletter (just for the SGD as Bayesian sampler paper):

Neural networks have been shown empirically to generalize well in the overparameterized setting, which suggests that there is an inductive bias for the final learned function to be simple. The obvious next question: does this inductive bias come from the _architecture_ and _initialization_ of the neural network, or does it come from stochastic gradient descent (SGD)? This paper argues that it is primarily the former.

Specifically, if the inductive bias came from SGD, we would expect that bias to go away if we replaced SGD with random sampling. In random sampling, we sample an initialization of the neural network, and if it has zero training error, then we’re done, otherwise we repeat.

The authors explore this hypothesis experimentally on the MNIST, Fashion-MNIST, and IMDb movie review databases. They test on variants of SGD, including Adam, Adagrad, and RMSprop. Since actually running rejection sampling for a dataset would take _way_ too much time, the authors approximate it using a Gaussian Process. This is known to be a good approximation in the large width regime.

Results show that the two probabilities are correlated over a wide order of magnitudes for different architectures, datasets, and optimization methods. While correlation isn't perfect over all scales, it tends to improve as the frequency of the function increases. In particular, the top few most likely functions tend to have highly correlated probabilities under both generation mechanisms.

Zach's opinion:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point. Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.

One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.

Rohin’s opinion:

I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like <@double descent@>(@Deep Double Descent@). I don’t think this is in conflict with the claim of the paper, which is that _most_ of the inductive bias comes from the architecture and initialization.

[Other]( [work]( by the same group provides some theoretical and empirical arguments that the neural network prior does have an inductive bias towards simplicity. I find those results suggestive but not conclusive, and am far more persuaded by the paper summarized here, so I don’t expect to summarize them.

Comment by rohinmshah on Recursive Quantilizers II · 2021-02-20T03:10:16.717Z · LW · GW

I think what's really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that's necessary (to get concepts we care about right).

I'm super on board with this desideratum, and agree that it would not be a good move to change it to some fixed number of levels. I also agree that from a conceptual standpoint many ideas are "about all the levels".

My questions / comments are about the implementation proposed in this post. I thought that you were identifying "levels of reasoning" with "depth in the idealized recursive QAS tree"; if that's the case I don't see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)

I'm pretty sure I'm just failing to understand some fact about the particular implementation, or what you mean by "levels of reasoning", or its relation to the idealized recursive QAS tree.

This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.

I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn't do this.

In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it's all part of what's available "at the start").

Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.

It seems like you have to choose one of two options:

  1. Order of feedback does matter, in which case bad early feedback can lock you in to a bad outcome
  2. Order of feedback doesn't matter, in which case you can't improve your interpretation of feedback over time (at least, not in a consistent way)

(This seems true more generally for any system that aims to learn at all the levels, not just for the implementation proposed in this post.)

It seems to me like you're trying to illustrate something like "Abram's proposal doesn't get at the bottlenecks".

I think it's more like "I'm not clear on the benefit of this proposal over (say) learning from comparisons". I'm not asking about bottlenecks; I'm asking about what the improvement is.

I'm curious how you see whitelisting working.

The same way I see any other X working: we explicitly train the neural net to satisfy X through human feedback (perhaps using amplification, debate, learning the prior, etc). For a whitelist, we might be able to do something slightly different: we train a classifier to say whether the situation is or isn't in our whitelist, and then only query the agent when it is in our whitelist (otherwise reverting to a safe baseline). The classifier and agent share most of their weights.

Then we also do a bunch of stuff to verify that the neural net actually satisfies X (perhaps adversarial training, testing, interpretability, etc). In the whitelisting case, we'd be doing this on the classifier, if that's the route we went down.

It feels like your beliefs about what kind of methods might work for "merely way-better-than-human" systems are a big difference between you and I, which might be worth discussing more, although I don't know if it's very central to everything else we're discussing.

(Addressed this in the other comment)

Comment by rohinmshah on Recursive Quantilizers II · 2021-02-20T02:34:28.686Z · LW · GW

Responding first to the general approach to good-enough alignment:

I think I would agree with this if you said "optimization that's at or below human level" rather than "not ridiculously far above".

Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great.

Less important response: If by "not great" you mean "existentially risky", then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.

My real objection: Your claim is about what happens after you've already failed, in some sense -- you're starting from the assumption that you've deployed a misaligned agent. From my perspective, you need to start from a story in which we're designing an AI system, that will eventually have let's say "5x the intelligence of a human", whatever that means, but we get to train that system however we want. We can inspect its thought patterns, spend lots of time evaluating its decisions, test what it would do in hypothetical situations, use earlier iterations of the tool to help understand later iterations, etc. My claim is that whatever bad optimization "sneaks through" this design process is probably not going to have much impact on the agent's performance, or we would have already caught it.

Possibly related: I don't like thinking of this in terms of how "wrong" the values are, because that doesn't allow you to make distinctions about whether behaviors have already been seen at training or not.

But really, mainly, I was making the normative claim. A culture of safety is not one in which "it's probably fine" is allowed as part of any real argument. Any time someone is tempted to say "it's probably fine", it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many "it's probably fine" arguments; so at best you should carefully count how many you allow yourself.

A relevant empirical claim sitting behind this normative intuition is something like: "without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards".

If your claim is just that "we're probably fine" is not enough evidence for an argument, I certainly agree with that. That was an offhand remark in an opinion in a newsletter where words are at a premium; I obviously hope to do better than that in reality.

This all seems pretty closely related to Eliezer's writing on security mindset.

Some thoughts here:

  1. I am unconvinced that we need a solution that satisfies a security-mindset perspective, rather than one that satisfies an ordinary-paranoia perspective. (A crucial point here is that the goal is not to build adversarial optimizers in the first place, rather than defending against adversarial optimization.) As far as I can tell the argument for this claim is... a few fictional parables? (Readers: Before I get flooded with examples of failures where security mindset could have helped, let me note that I will probably not be convinced by this unless you can also account for the selection bias in those examples.)
  2. I don't really see why the ML-based approaches don't satisfy the requirement of being based on security mindset. (I agree "we're probably fine" does not satisfy that requirement.) Note that there isn't a solution that is maximally security-mindset-y, the way I understand the phrase (while still building superintelligent systems). A simple argument: we always have to specify something (code if nothing else); that something could be misspecified. So here I'm just claiming that ML-based approaches seem like they can be "sufficiently" security-mindset-y.
  3. I might be completely misunderstanding the point Eliezer is trying to make, because it's stated as a metaphor / parable instead of just stating the thing directly (and a clear and obvious disanalogy is that we are dealing with the construction of optimizers, rather than the construction of artifacts that must function in the presence of optimization).
Comment by rohinmshah on Formal Solution to the Inner Alignment Problem · 2021-02-19T17:59:39.508Z · LW · GW

N won't necessarily decrease over time, but all of the models will eventually agree with other.

Ah, right. I rewrote that paragraph, getting rid of that sentence and instead talking about the tradeoff directly.

I would have described Vanessa's and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

Added a sentence to the opinion noting the benefits of explicitly quantified uncertainty.

Comment by rohinmshah on Formal Solution to the Inner Alignment Problem · 2021-02-19T09:03:39.187Z · LW · GW

Planned summary:

Since we probably can’t specify a reward function by hand, one way to get an agent that does what we want is to have it imitate a human. As long as it does this faithfully, it is as safe as the human it is imitating. However, in a train-test paradigm, the resulting agent may faithfully imitate the human on the training distribution but fail catastrophically on the test distribution. (For example, a deceptive model might imitate faithfully until it has sufficient power to take over.) One solution is to never stop training, that is, use an online learning setup where the agent is constantly learning from the demonstrator.

There are a few details to iron out. The agent needs to reduce the frequency with which it queries the demonstrator (otherwise we might as well just have the demonstrator do the work). Crucially, we need to ensure that the agent will never do something that the demonstrator wouldn’t have done, because such an action could be arbitrarily bad.

This paper proposes a solution in the paradigm where we use Bayesian updating rather than gradient descent to select our model, that is, we have a prior over possible models and then when we see a demonstrator action we update our distribution appropriately. In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p. The total probability of all the actions will typically be less than 1 -- the remaining probability is assigned to querying the demonstrator.

The key property here is that as long as the true demonstrator is in the top N models, then the agent never autonomously takes an action with more probability than the demonstrator would. Therefore as long as we believe the demonstrator is safe then the agent should be as well. Since the agent learns more about the demonstrator every time it queries them, over time it needs to query the demonstrator less often. Note that the higher N is, the more likely it is that the true model is one of those N models (and thus we have more safety), but also the more likely it is that we will have to query the demonstrator. This tradeoff is controlled by a hyperparameter α that implicitly determines N.


One of the most important approaches to improve inner alignment is to monitor the performance of your system online, and train to correct any problems. This paper shows the benefit of explicitly quantified, well-specified uncertainty: it allows you to detect problems _before they happen_ and then correct for them.

This setting has also been studied in <@delegative RL@>(@Delegative Reinforcement Learning@), though there the agent also has access to a reward signal in addition to a demonstrator.

Comment by rohinmshah on Formal Solution to the Inner Alignment Problem · 2021-02-19T04:32:49.050Z · LW · GW

Lol yes fixed

Comment by rohinmshah on AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger · 2021-02-19T04:32:12.629Z · LW · GW

So my understanding is there are two arguments:

  • A myopic objective requires an extra distinction to say "don't continue past the end of the episode"
  • Something about online learning

The online learning argument is actually super complicated and dependent on a bunch of factors, so I'm not going to summarize that one here. So I've just added the other one:

Even if training on a myopic base objective, we might expect the mesa objective to be non-myopic, as the non-myopic objective "pursue X" is simpler than the myopic objective "pursue X until time T".

Comment by rohinmshah on AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger · 2021-02-19T04:21:14.828Z · LW · GW

I did mean to include that; going to delete the word "algorithms" since that's what's causing the ambiguity.

Comment by rohinmshah on Formal Solution to the Inner Alignment Problem · 2021-02-19T01:51:44.871Z · LW · GW

While I share your position that this mostly isn't addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of "inner alignment".

There's an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it's in a simulation and predicts well to be selected by the Bayesian updating but eventually executes a treacherous turn). The algorithm presented precludes this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.

Comment by rohinmshah on AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger · 2021-02-19T01:17:02.459Z · LW · GW

Planned summary for the Alignment Newsletter:

This podcast delves into a bunch of questions and thoughts around <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@). Here are some of the points that stood out to me (to be clear, many of these have been covered in this newsletter before, but it seemed worth it to state them again):

- A model is a mesa optimizer if it is a _mechanistic_ optimizer, that is, it is executing an algorithm that performs search for some objective.
- We need to focus on mechanistic optimizers instead of things that behave as though they are optimizing for some goal, because those two categories can have very different generalization behavior, and we are primarily interested in how they will generalize.
- Humans do seem like mesa optimizers relative to evolution (though perhaps not a central example). In particular, it seems accurate to say that humans look at different possible strategies and select the ones which have good properties, and thus we are implementing a mechanistic search algorithm.
- To reason about whether machine learning will result in these mechanistic optimizers, we need to reason about the _inductive biases_ of machine learning algorithms. We mostly don’t yet know how likely they are.
- Evan expects that in powerful neural networks there will exist a combination of neurons that encode the objective, which we might be able to find with interpretability techniques.
- We can’t rely on generalization bounds to guarantee performance, since in practice there is always some distribution shift (which invalidates those bounds).
- Although it is usually phrased in the train/test paradigm, mesa optimization is still a concern in an online learning setup, since at every time we are interested in whether the model will generalize well to the next data point it sees.
- We will probably select for simple ML models (in the sense of short description length) but not for low inference time, such that mechanistic optimizers are more likely than models that use more space (the extreme version being lookup tables).
- If you want to avoid mesa optimizers entirely (rather than aligning them), you probably need to have a pretty major change from the current practice of AI, as with STEM AI and Microscope AI (explained <@here@>(@An overview of 11 proposals for building safe advanced AI@)).
- Even in a <@CAIS scenario@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@) where we have (say) a thousand models doing different tasks, each of those tasks will still likely be complex enough to lead to the models being mesa optimizers.
- There are lots of mesa objectives which would lead to deceptive alignment relative to corrigible or internalized alignment, and so we should expect deceptive alignment a priori.

Comment by rohinmshah on Recursive Quantilizers II · 2021-02-19T01:15:32.375Z · LW · GW

(Noting that given this was a month ago I have lost context and am more likely than usual to contradict what I previously wrote)

The point of this is to be able to handle feedback at arbitrary levels, not to require feedback at arbitrary levels.

What's the point of handling feedback at high levels if we never actually get feedback at those levels?

Perhaps another way of framing it: suppose we found out that humans were basically unable to give feedback at level 6 or above. Are you now happy having the same proposal, but limited to depth 5? I get the sense that you wouldn't be, but I can't square that with "you only need to be able to handle feedback at high levels but you don't require such feedback".

Even if humans only ever provide feedback about finitely many meta levels, the idea is for the system to generalize to other levels.

I don't super see how this happens but I could imagine it does. (And if it did it would answer my question above.) I feel like I would benefit from concrete examples with specific questions and their answers.

My intention is for the procedure to be interactive; however, I definitely haven't emphasized how that aspect would work.

Okay, that makes sense. It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it's particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.

I think "being corrigible" is an importantly highly-meta concept.


Even outside that context, I just don't know that it's possible to specify a very good notion of "corrigibility" at a finite meta-level. It's kind of about not trusting any value function specified at any finite meta-level.

I agree that any safety story will probably require you to get some concept X right. (Corrigibility is one candidate for X.) Your safety story would then be "X is inductively preserved as the AI system self-modifies / learns new information / makes a successor agent", and so X has to scale arbitrarily far. You have to get this "perfectly" right in that it can't be that your agent satisfies X under normal conditions but then fails when COVID hits; this is challenging. You don't have to get it "perfectly" right in that you could get some more conservative / careful X' that restricts the agent's usefulness (e.g. it has to check in with the human more often) but over time it can self-modify / make successor agents with property X instead.

Importantly, if it turns out that X = corrigibility is too hard, we can also try less performant but safer things, like X = "we revert to a safe baseline policy if we're not in <whitelist of acceptable situations>", and the whitelist can grow over time.

(As a side note, I am pretty pessimistic about ambitious choices of X, such as X = human values, or X = optimal behavior in all possible situations, because those are high-complexity and not something that even humans could get right. It feels like this proposal is trying to be similarly ambitious, though I wouldn't be surprised if I changed my mind on that very quickly.)

I agree that under this framework of levels of feedback, X has to be specified at "all the levels".

I am less convinced that you need a complex scheme for giving feedback at all levels to do this sort of thing. The training scheme is not the same as the learned agent; you can have a training scheme that has a simple (and incorrect) feedback interpretation system like Boltzmann rationality, and get out a learned agent that has internalized a much more careful interpretation system. For example, Learning to summarize from human feedback does use Boltzmann rationality, but could finetune GPT-3 to e.g. interpret human instructions pragmatically. This interpretation system can apply "at all levels", in the same way that human brains can apply similar heuristics "at all levels".

(There are still issues with just applying the learning from human preferences approach, but they seem to be much more about "did the neural net really learn the intended concept" / inner alignment, rather than "the neural net learned what to do at level 1 but not at any of the higher levels".)

Partly I want to defend the "all meta levels" idea as an important goalpost rather than necessary

Yeah, that seems reasonable to me.

So I guess I'm saying, sufficiency seems like the more interesting question than necessity.

I do agree that sufficiency is more interesting when it can actually be guaranteed. Idk what I meant when I wrote the opinion, but my guess was that it was something like "I'm observing that we can get by with something easier to satisfy that seems more practical to do", so more like a tradeoff between importance and tractability. I don't think I meant it as a strong critique or anything like that.

My question is: do you think there's a method that's good enough to scale up to arbitrary capability?

I reject the notion that we need a method that scales up to arbitrary capability. I'd love it if we got one, but it's seeming less and less plausible to me that we'll get such a method. I prefer to make it so that we are in a paradigm where you can notice when your method fails to scale, fix the problem, and then continue. You do need to ensure that you can fix the problem (i.e. no treacherous turns), so this isn't a full panacea, but it does mean that you don't e.g. need a perfect human model.

One example of how to do this is to use X = "revert to a safe baseline policy outside of <whitelist>", and enlarge the whitelist over time. In this case "failing to scale" is "our AI system couldn't solve the task because our whitelist hobbled it too much".

So, to your questions:

  • Does it seem possible to pre-specify some fixed way of interpreting feedback, which will scale up to arbitrarily capable systems? IE, when I say a very capable system "understands" what I want, does it really seem like we can rely on a fixed notion of understanding, even thinking only of capabilities?

No, that doesn't seem possible for arbitrary capabilities (except in some vacuous sense where there exists some way of doing this that in principle we could hardcode, or in another vacuous sense where we fix a method of interpretation like "all feedback implies that I should shut down", which is safe but not performant). It seems possible for capabilities well beyond human capabilities, and if we succeed at that, we can use those capabilities to design the next generation of AI systems.

  • Especially for alignment purposes, don't you expect any fixed model of interpreting feedback to be too brittle by default, and somehow fall apart when a sufficiently powerful intelligence is interpreting feedback in such a fixed way?

Yes, I do expect this to be brittle for a sufficiently powerful intelligence, again ignoring some vacuous counterexamples. Again, I expect it would be fine for a merely way-better-than-humans intelligence.

This seems like a perfect example to me. It works pretty well for current systems, but scaled up, it seems it would ultimately reach dramatically wrong ideas about what humans value. (In particular, it ultimately must think the highest-utility action is the most probable one, an assumption which will engender poor interpretations of situations in which errors are more common than 'correct' actions, such as those common to the heuristics and biases literature.)

Yup, totally agree. Make sure to update it as you scale up further.

At MIRI we tend to take "we're probably fine" as a strong indication that we're not fine ;p

Yeah I have been and continue to be confused by this perspective, at least as an empirical claim (as opposed to a normative one). I get the sense that it's partly because optimization amplifies and so there is no "probably", there is only one or the other. I can kinda see that when you assume an arbitrarily powerful AIXI-like superintelligence, but it seems basically wrong when you expect the AI system to apply optimization that's not ridiculously far above that applied by a human.

Comment by rohinmshah on Fixing The Good Regulator Theorem · 2021-02-11T22:31:21.385Z · LW · GW

I explicitly remove that constraint in the "Minimum Entropy -> Maximum Expected Utility and Imperfect Knowledge" section.

... That'll teach me to skim through the math in posts I'm trying to summarize. I've edited the summary, lmk if it looks good now.

Comment by rohinmshah on Fixing The Good Regulator Theorem · 2021-02-11T20:13:33.974Z · LW · GW

In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict.

I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.

On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn't have access to all the information S contains).

If S is derived from X, then "information in S" = "information in X relevant to S"

On point (2), it's not that every aspect of S must be relevant to Z, but rather that every change in S must change our optimal strategy (when optimizing for Z). S could be relevant to Z in ways that don't change our optimal strategy, and then we wouldn't need to keep around all the info about S.

Fair point. I kind of wanted to abstract away this detail in the operationalization of "relevant", but it does seem misleading as stated. Changed to "important for optimal performance".

The idea that information comes in two steps, with the second input "choosing which game we play", is important.

I was hoping that this would come through via the neural net example, where Z obviously includes new information in the form of the new test inputs which have to be labeled. I've added the sentence "Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to look at S" to the second point to clarify. 

(In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)

Comment by rohinmshah on Fixing The Good Regulator Theorem · 2021-02-11T04:29:18.319Z · LW · GW

Planned summary (edited in response to feedback):

Consider a setting in which we must extract information from some data X to produce M, so that it can later perform some task Z in a system S while only having access to M. We assume that the task depends only on S and not on X (except inasmuch as X affects S). As a concrete example, we might consider gradient descent extracting information from a training dataset (X) and encoding it in neural network weights (M), which can later be used to classify new test images (Z) taken in the world (S) without looking at the training dataset.

The key question: when is it reasonable to call M a model of S?
1. If we assume that this process is done optimally, then M must contain all information in X that is needed for optimal performance on Z.
2. If we assume that every aspect of S is important for optimal performance on Z, then M must contain all information about S that it is possible to get. Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to infer properties of S.
3. If we assume that M contains _no more_ information than it needs, then it must contain exactly the information about S that can be deduced from X.

It seems reasonable to say that in this case we constructed a model M of the system S from the source X "as well as possible". This post formalizes this conceptual argument and presents it as a refined version of the [Good Regulator Theorem](

Returning to the neural net example, this argument suggests that since neural networks are trained on data from the world, their weights will encode information about the world and can be thought of as a model of the world.

Comment by rohinmshah on Epistemology of HCH · 2021-02-10T20:42:16.105Z · LW · GW

Planned summary for the Alignment Newsletter:

This post identifies and explores three perspectives one can take on <@HCH@>(@Humans Consulting HCH@):

1. **Philosophical abstraction:** In this perspective, HCH is an operationalization of the concept of one’s enlightened judgment.

2. **Intermediary alignment scheme:** Here we consider HCH as a scheme that arguably would be aligned if we could build it.

3. **Model of computation:** By identifying the human in HCH with some computation primitive (e.g. arbitrary polynomial-time algorithms), we can think of HCH as a particular theoretical model of computation that can be done using that primitive.

Comment by rohinmshah on Eight Definitions of Observability · 2021-02-10T17:04:24.876Z · LW · GW

The problem is a bit earlier actually:

This isn't true, because  doesn't just ignore  here (since ). I think the route is to say "Let  . Then  must treat  and  identically, meaning that either they are equal, or the frame is biextensionally equivalent to one where they are equal."

Comment by rohinmshah on Draft report on AI timelines · 2021-02-09T03:30:04.906Z · LW · GW

One reason I put a bit more weight on short / medium horizons was that even if transformative tasks are long-horizon, you could use self-supervised pretraining to do most learning, thus reducing the long-horizon data requirements. Now that Scaling Laws for Transfer is out, we can use it to estimate how much this might help. So let's do some bogus back-of-the-envelope calculations:

We'll make the very questionable assumption that the law relating our short-horizon pretraining task and our long-horizon transformative task will still be .

Let's assume that the long-horizon transformative task has a horizon that is 7 orders of magnitude larger than the short-horizon pretraining task. (The full range is 9 orders of magnitude.) Let the from-scratch compute of a short-horizon transformative task be . Then the from-scratch compute for our long horizon task would be , if we had to train on all 1e13 data points.

Our key trick is going to be to make the model larger, and pretrain on a short-horizon task, to reduce the amount of long-horizon data we need. Suppose we multiply the model size by a factor of . We'll estimate total compute as a function of c, and then find the value that minimizes it.

Making the model bigger increases the necessary (short-horizon) pretraining data by a factor of , so pretraining compute goes up by a factor of . For transfer,  goes up by a factor of 

We still want to have  data points for the long-horizon transformative task. To actually get computational savings, we need to get this primarily from transfer, i.e. we have , which we can solve to get .

Then the total compute is given by .

Minimizing this gives us , in which case total compute is .

Thus, we've taken a base time of , and reduced it down to , a little over an order of magnitude speedup. This is solidly within my previous expectations (looking at my notes, I said "a couple of orders of magnitude"), so my timelines don't change much.

Some major caveats:

  1. The calculation is super bogus; there's no reason to expect  to be the same for TAI as for finetuning code completion on text; different values could wildly change the conclusion.
  2. It's not clear to me that this particular scaling law should be expected to hold for such large models.
  3. There are lots of other reasons to prefer the short / medium horizon hypotheses (see my opinion above).
Comment by rohinmshah on rohinmshah's Shortform · 2021-02-09T02:03:05.980Z · LW · GW

Sometimes people say "look at these past accidents; in these cases there were giant bureaucracies that didn't care about safety at all, therefore we should be pessimistic about about AI safety". I think this is backwards, and that you should actually conclude the reverse: this is evidence that problems tend to be easy, and therefore we should be optimistic about AI safety.

This is not just one man's modus ponens -- the key issue is the selection effect.

It's easiest to see with a Bayesian treatment. Let's say we start completely uncertain about what fraction of people will care about problems, i.e. uniform distribution over [0, 100]%. In what worlds do I expect to see accidents where giant bureaucracies don't care about safety? Almost all of them -- even if 90% of people care about safety, there will still be some cases where people didn't care and accidents happened; and of course we'd hear about them if so (and not hear about the cases where accidents didn't happen). You can get a strong update against 99.9999% and higher, but by the time you're at 90% the update seems pretty weak. Given how much selection there is, I think even the update against 99% is relatively weak. So really you just don't learn much about how careful people will be by looking at our accident track record (unless you can also quantify the denominator of how many "potential accidents" there could have been).

However, it feels pretty notable to me that the vast majority of accidents I hear about in detail are ones where it seems like there were a bunch of obvious mistakes and the accidents would have been prevented had there been a decision-maker who cared (enough) about safety. And unlike the previous paragraph, I do expect to hear about accidents that we couldn't have prevented, so I don't have to worry about selection bias. So it seems like I should conclude that usually problems are pretty easy, and "all we have to do" is make sure people care. (One counterargument is that problems look obvious only in hindsight; at the time the obvious mistakes may not have been obvious.)

Examples of accidents that fit this pattern: the Challenger crash, the Boeing 737-MAX issues, everything in Engineering a Safer World, though admittedly the latter category suffers from some selection bias.

Comment by rohinmshah on rohinmshah's Shortform · 2021-02-09T01:49:58.417Z · LW · GW

You've heard of crucial considerations, but have you heard of red herring considerations?

These are considerations that intuitively sound like they could matter a whole lot, but actually no matter how the consideration turns out it doesn't affect anything decision-relevant.

To solve a problem quickly, it's important to identify red herring considerations before wasting a bunch of time on them. Sometimes you can even start outlining solutions that turn a bunch of seemingly-crucial considerations into red herring considerations.

For example, it might seem like "what is the right system of ethics" is a crucial consideration for AI alignment (after all, you need to know ethics to write down a utility function), but once you decide to instead aim to design algorithms that allow you to build AI systems for any task you have in mind, that turns into a red herring consideration.

Here's an example where I argue that, for a specific question, anthropics is a red herring consideration (thus avoiding the question of whether to use SSA or SIA).

Alternate names: sham considerations? insignificant considerations?

Comment by rohinmshah on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2021-02-08T16:31:28.835Z · LW · GW

Andrew Gelman's take here.

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-05T19:01:43.110Z · LW · GW

Yeah, I agree debate seems less obvious. I guess I'm more interested in the iterated amplification claim since it seems like you do see iterated amplification as opposed to "avoiding manipulation" or "making a clean distinction between good and bad reasoning", and that feels confusing to me. (Whereas with debate I can see the arguments for debate incentivizing manipulation, and I don't think they're obviously wrong, or obviously correct.)

I still feel like there is some weird counterfactual incentive to manipulate the process

Yeah, this argument makes sense to me, though I question how much such incentives matter in practice. If we include incentives like this, then I'm saying "I think the incentives a) arise for any situation and b) don't matter in practice, since they never get invoked during training". (Not just for the automated decomposition example; I think similar arguments apply less strongly to situations involving actual humans.)

there isn't even incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.


1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn't count as predicting humans in generality), or 2) being convinced that this incentive to predict the humans is sufficiently far from incentive to manipulate.

I'm not claiming (1) in full generality. I'm claiming that there's a spectrum of how much incentive there is to predict humans in generality. On one end we have the automated examples I mentioned above, and on the other end we have sales and marketing. It seems like where we are on this spectrum is primarily dependent on the task and the way you structure your reasoning. If you're just training your AI system on making better transistors, then it seems like even if there's a human in the loop your AI system is primarily going to be learning about transistors (or possibly about how to think about transistors in the way that humans think about transistors). Fwiw, I think you can make a similar claim about debate.

If we use iterated amplification to aim for corrigibility, that will probably require the system to learn about agency, though I don't think it obviously has to learn about humans.

I might also be claiming (2), except I don't know what you mean by "sufficiently far". I can understand how prediction behavior is "close" to manipulation behavior (in that many of the skills needed for the first are relevant to the second and vice versa); if that's what you mean then I'm not claiming (2).

If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.

I'm definitely not claiming that we can do this. But I don't think any approach could possibly meet the standard of "thinking about humans does not aid in the goal"; at the very least there is probably some useful information to be gained in updating on "humans decided to build me", which requires thinking about humans. Which is part of why I prefer thinking about the spectrum.

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-05T05:15:50.553Z · LW · GW

Yeah, that sounds interesting, I'd participate.

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-05T05:15:07.995Z · LW · GW

You are focusing on the task/output of the system, and internal incentives to learn how to manipulate don't count?

This seems closest, though I'm not saying that internal incentives don't count -- I don't see what these incentives even are (or, I maybe see them in the superintelligent utility maximizer model, but not in other models).

Do you agree that the agents in Supervising strong learners by amplifying weak experts don't have an incentive to manipulate the automated decomposition strategies?

If yes, then if we now change to a human giving literally identical feedback, do you agree that then nothing would change (i.e. the resulting agent would not have an incentive to manipulate the human)?

If yes, then what's the difference between that scenario and one where there are internal incentives to manipulate the human?

Possibly you say no to the first question because of wireheading-style concerns; if so my followup question would probably be something like "why doesn't this apply to any system trained from a feedback signal, whether human-generated or automated?" (Though that's from a curious-about-your-beliefs perspective. On my beliefs I mostly reject wireheading as a specific thing to be worried about, and think of it as a non-special instance of a broader class of failures.)

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-04T20:59:59.367Z · LW · GW

I'm on board with "let's try to avoid giving models strong incentives to learn how to manipulate humans" and stuff like that, e.g. via coordination to use AI primarily for (say) science and manufacturing, and not for (say) marketing and recommendation engines.

I don't see this as particularly opposed to methods like iterated amplification or debate, which seem like they can be applied to all sorts of different tasks, whether or not they incentivize manipulation of humans.

It feels like the crux is going to be in our picture of how AGI works, though I don't know how.

Comment by rohinmshah on [AN #136]: How well will GPT-N perform on downstream tasks? · 2021-02-04T20:51:52.191Z · LW · GW

Yes, in general you want to account for hardware and software improvements. From the original post:

Finally, it’s important to note that algorithmic advances are real and important. GPT-3 still uses a somewhat novel and unoptimised architecture, and I’d be unsurprised if we got architectures or training methods that were one or two orders of magnitude more compute-efficient in the next 5 years.

From the summary:

$100B -$1T at current prices, $1B - $10B given estimated hardware and software improvements over the next 5 - 10 years

The $1B - $10B number is meant to include things like the Performer.

Comment by rohinmshah on Distinguishing claims about training vs deployment · 2021-02-04T02:11:13.519Z · LW · GW

Planned summary for the Alignment Newsletter:

One story for AGI is that we train an AI system on some objective function, such as an objective that rewards the agent for following commands given to it by humans using natural language. We then deploy the system without any function that produces reward values; we instead give the trained agent commands in natural language. Many key claims in AI alignment benefit from more precisely stating whether they apply during training or during deployment.

For example, consider the instrumental convergence argument. The author proposes that we instead think of the training convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behavior aimed towards certain convergent goals (such as self-preservation). This could happen either via the AGI internalizing them directly as final goals, or by the AGI learning final goals for which these goals are instrumental.

The author similarly clarifies goal specification, the orthogonality thesis, fragility of value, and Goodhart’s Law.

Comment by rohinmshah on 2019 Review: Voting Results! · 2021-02-03T22:17:46.058Z · LW · GW

I certainly would be more excited if there was an alternative in mind -- it seems pretty unlikely to me that this is at all tractable.

However, I am also pretty unconvinced by the object-level arguments that there are risks from using human models that are comparable to the risks from AI alignment overall (under a total view, longtermist perspective). Taking the arguments from the post:

  • Less Independent Audits: I agree that all else equal, having fewer independent audits increases x-risk from AI alignment failure. This seems like a far smaller effect than from getting the scheme right in the first place so that your AI system is more likely to be aligned.

    I also think the AI system must have at least some information about humans for this to be a reasonable audit: if your AI system does something in accordance with our preferences (which is what the audit checks for), then it has information about human preferences, which is at least somewhat of a human model; in particular many of the other risks mentioned in the post would apply.  This makes the "independent audits" point weaker.
  • Incorrectly Encoded Values: I don't find the model of "we build a general-purpose optimizer and then slot in a mathematical description of the utility function that we can never correct" at all compelling; and that seems like the only model anyone's mentioned where this sort of mistake seems like it could be very bad.
  • Manipulation: I agree with this point.
  • Threats: This seems to be in direct conflict with alignment -- roughly speaking, either your AI system is aligned with you and can be threatened, or it is not aligned with you and then threats against it don't hurt you. Given that choice, I definitely prefer alignment.
  • Mind Crime: This makes sense to worry about if you think that most likely we will fail at alignment, but if you think that we will probably succeed at alignment it doesn't seem that important -- even if mind crime exists for a while, we would presumably eliminate it fairly quickly. (Leaning heavily on the longtermist total view here.)
  • Unexpected Agents: I agree with the post that the likelihood of this mattering is small. (I also don't agree that a system that predicts human preferences seems strictly more likely to run into problems with misaligned subagents.)

More broadly, I wish that posts like this would make a full case for expected impact, rather than gesturing vaguely at ways things could go poorly.

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-29T21:43:45.491Z · LW · GW

Yeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say "debate will in practice lead to honest policies".        

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-29T18:52:20.239Z · LW · GW

If bad arguments are always describably bad, then it's plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.

I think you also need that at least some of the time good arguments are not describably bad (i.e. they don't have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.)

Do you agree with my conclusion that your argument would, then, have little to do with factored cognition?

I think I'm still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-27T04:46:54.556Z · LW · GW

Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I'm usually hoping for with debate, but I don't think that was the definition I was using here.

I think in this comment thread I've been defining an honest answer as one that can be justified via arguments that eventually don't have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters -- while this doesn't strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn't consciously realize I was making that assumption.)

I still think that working with this "definition" is an interesting theoretical exercise, though I agree it doesn't correspond to reality. Looking back I can see that you were talking about how this "definition" doesn't actually correspond to the realistic situation, but I didn't realize that's what you were saying, sorry about that.

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T22:46:52.984Z · LW · GW

Your argument also has some leaf nodes which use the terminology "fully defeat", in contrast to "defeat".

I don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).

I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.

Yes, that's what I mean by "fully defeat".

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T22:44:20.778Z · LW · GW

So, I propose as a concrete counterexample to your argument:

Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)

Ah, I see what you mean now. Yeah, I agree that debate is not going to answer fish in the scenario above. Sorry for using "correct" in a confusing way.

When I say that you get the correct answer, or the honest answer, I mean something like "you get the one that we would want our AI systems to give, if we knew everything that the AI systems know". An alternative definition is that the answer should be "accurately reporting what humans would justifiably believe given lots of time to reflect" rather than "accurately corresponding to reality".

(The two definitions above come apart when you talk about questions that the AI system knows about but can't justify to humans, e.g. "how do you experience the color red", but I'm ignoring those questions for now.)

(I'd prefer to talk about "accurately reporting the AI's beliefs", but there's no easy way to define what beliefs an AI system has, and also in any case debate .)

In the example you give, the AI systems also couldn't reasonably believe that the answer is "fish", and so the "correct" / "honest" answer in this case is "the question can't be answered given our current information", or "the best we can do is guess the typical food for an ancient Greek diet", or something along those lines. If the opponent tried to dispute this, then you simply challenge them to do better; they will then fail to do so. Given the assumption of optimal play, this absence of evidence is evidence of absence, and you can conclude that the answer is correct.

So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong.

In this case they're acknowledging that the other player's argument is "correct" (i.e. more likely than not to win if we continued recursively debating). While this doesn't guarantee their loss, it sure seems like a bad sign.

Or, even if they do object, they can simply choose to recurse on the honest player's objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).

Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both player's arguments in parallel (at only 2x the cost).

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T21:08:22.983Z · LW · GW

Am I still misunderstanding something big about the kind of argument you are trying to make?

I don't think so, but to formalize the argument a bit more, let's define this new version of the WFC:

Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:

  1. Every internal node has exactly one child leaf of the form "What is the best defeater to X?" whose answer is auto-verified,
  2. For every other leaf node, a human can verify that the answer to the question at that node is correct,
  3. For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.

(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)

Claim 1: (What I thought was) your assumption => Special-Tree WFC, using the construction I gave.

Claim 2: Special-Tree WFC + assumption of optimal play => honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.

Idk whether this is still true under the assumptions you're using; I think claim 1 in particular is probably not true under your model.

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T19:04:53.745Z · LW · GW

For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say "player 1 offers no argument for its position" and receive points for that, as far as I am concerned.

I think at this point I want a clearer theoretical model of what assumptions you are and aren't making. Like, at this point, I'm feeling more like "why are we even talking about defeaters; there are much bigger issues in this setup".

I wouldn't be surprised at this point if most of the claims I've made are actually false under the assumptions you seem to be working under.

Another problem with your argument -- WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don't address).

Not sure what you want me to "address". The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.

Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.

This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other's arguments. I didn't realize you were considering a version where you don't have to specifically rebut the other player's arguments.

The "in equilibrium" there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).

I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in "the honest policy", then I agree that "in equilibrium" is unnecessary. (I definitely could have phrased that better.)

Comment by rohinmshah on Literature Review on Goal-Directedness · 2021-01-26T18:48:06.798Z · LW · GW

It's implicit, but I think it should be made explicit that the properties/tests are what we extract from the literature, not what we say is fundamental.

Changed to "This post extracts five different concepts that have been identified in the literature as properties of goal-directed systems:".

I don't think the post says that.

Deleted that sentence.

Comment by rohinmshah on Preface to the Sequence on Factored Cognition · 2021-01-26T18:41:24.099Z · LW · GW

Yeah, that seems right to me -- I don't really expect to see us solving hard exercises in a textbook with a small number of humans without any additional tricks. I don't think Ought did either; from pretty early on they were talking about strategies for having larger trees, e.g. via automated decomposition strategies, or caching / memoization of strategies, possibly using ML.

In addition, I think Ought historically has pursued the strategy "try the thing that, if successful, would allow us to build a safety story", rather than "try the thing that, if it fails, implies that factored cognition would not work out", which is why they talk about particularly challenging tasks like solving the hardest exercise in a textbook.

Comment by rohinmshah on Preface to the Sequence on Factored Cognition · 2021-01-26T18:36:30.746Z · LW · GW

Changed to "The judge decides the winner based on whether they can confidently verify the final statement or not."

One thing I would like to be added is just that I come out moderately optimistic about Debate. It's not too difficult for me to imagine the counter-factual world where I think about FC and find reasons to be pessimistic about Debate, so I take the fact that I didn't as non-zero evidence.

Added a line to the end of the summary:

On the other hand, the author is cautiously optimistic about debate.

Comment by rohinmshah on Preface to the Sequence on Factored Cognition · 2021-01-26T06:23:13.308Z · LW · GW

Planned summary for the Alignment Newsletter:

The <@Factored Cognition Hypothesis@>(@Factored Cognition@) informally states that any task can be performed by recursively decomposing the task into smaller and smaller subtasks until eventually the smallest tasks can be done by a human. This sequence aims to formalize the hypothesis to the point that it can be used to argue for the outer alignment of (idealized versions of) <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@) and <@debate@>(@AI safety via debate@).

The key concept is that of an _explanation_ or _decomposition_. An explanation for some statement **s** is a list of other statements **s1, s2, … sn** along with the statement “(**s1** and **s2** and … and **sn**) implies **s**”. A _debate tree_ is a tree in which for a given node **n** with statement **s**, the children of **n** form an explanation (decomposition) of **s**. The leaves of the tree should be statements that the human can verify. (Note that the full formalism has significantly more detail, e.g. a concept of the “difficulty” for the human to verify any given statement.)

We can then define an idealized version of debate, in which the first debater must produce an answer with associated explanation, and the second debater can choose any particular statement to expand further. The judge decides the winner by evaluating whether the final statement is true or not. Assuming optimal play, the correct (honest) answer is an equilibrium as long as:

**Ideal Debate Factored Cognition Hypothesis:** For every question, there exists a debate tree for the correct answer where every leaf can be verified by the judge.

The idealized form of iterated amplification is <@HCH@>(@Humans Consulting HCH@); the corresponding Factored Cognition Hypothesis is simply “For every question, HCH correctly returns the correct answer”. Note that the _existence_ of a debate tree is not enough to guarantee this, as HCH must also _find_ the decompositions in this debate tree. If we imagine that HCH gets access to a decomposition oracle that tells it the right decomposition to make at each node, then HCH would be similar to idealized debate. (HCH could of course simply try all possible decompositions, but we are ignoring that possibility: the decompositions that we rely on should reduce or hide complexity.)

Is the HCH version of the Factored Cognition Hypothesis true? The author tends to lean against (more specifically, that HCH would not be superintelligent), because it seems hard for HCH to find good decompositions. In particular, humans seem to improve their decompositions over time as they learn more, and also seem to improve the concepts by which they think over time, all of which are challenging for HCH to do.

Planned opinion:

I enjoyed this sequence: I’m glad to see more analysis of what is and isn’t necessary for iterated amplification and debate to work, as well as more theoretical models of debate. I broadly agreed with the conceptual points made, with one exception: I’m not convinced that we should not allow brute force for HCH, and for similar reasons I don’t find the arguments that HCH won’t be superintelligent convincing. In particular, the hope with iterated amplification is to approximate a truly massive tree of humans, perhaps a tree containing around 2^100 (about 1e30) base agents / humans. At that scale (or even at just a measly billion (1e9) humans), I don’t expect the reasoning to look anything like what an individual human does, and approaches that are more like “brute force” seem a lot more feasible.

One might wonder why I think it is possible to approximate a tree with more base agents than there are grains of sand in the Sahara desert. Well, a perfect binary tree of depth 99 would have 1e30 nodes; thus we can roughly say that we’re approximating 99-depth-limited HCH. If we had perfect distillation, this would take 99 rounds of iterated amplification and distillation, which seems quite reasonable. Of course, we don’t have perfect distillation, but I expect that to be a relatively small constant factor on top (say 100x), which still seems pretty reasonable. (There’s more detail about how we get this implicit exponential-time computation in <@this post@>(@Factored Cognition@).)

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T00:25:00.439Z · LW · GW

Whoops, I seem to have missed this comment, sorry about that. I think at this point we're nearly at agreement.

Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium -- there would be no reason to be honest, but no specific reason to be dishonest, either.

Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you'll see honest arguments to which there is no dishonest defeater.)

Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium -- the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.

Similar comment here -- the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it's clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)

It seems plausible to me that there's an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you're already losing, in order to avoid losing more points.

However, this doesn't work, because a dishonest (but convincing) argument gives you +1, and then -1 if it is refuted; so at worst it's a wash. So again it's a weak equilibrium, and if there's any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).

This was the line of reasoning which led me to the scoring rule in the post, since making it a -2 (but still only +1 for the other player) solves that issue.

On the specific -2/+1 proposal, the issue is that then the first player just makes some dishonest argument, and the second player concedes because even if they give an honest defeater, the second player could then re-defeat that with a dishonest defeater. (I realize I'm just repeating myself here; there's more discussion in the next section.)

But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).

In this situation, there is no possible way to distinguish between honesty and dishonesty -- under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don't have defeaters. From the perspective of the players, the salient feature of the game is that they can make statements; all such statements will have defeaters; there's no information available to them in the structure of the game that distinguishes honesty from dishonesty. Therefore honesty can't be the unique equilibrium; whatever the policy is, there should be an equivalent one that is at least sometimes dishonest.

In this worst case, I suspect that for any judge-based scoring rule, the equilibrium behavior is either "the first player says something and the second concedes", or "every player always provides some arbitrary defeater of the previous statement, and the debate never ends / the debate goes to whoever got the last word".

The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we're already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives -- it's too afraid of the triple refutation. This is precisely the argument we can't make in the zero sum case.)

Sorry, I don't get this. How could we make the argument that the probability is below 50%?

Depending on the answer, I expect I'd follow up with either

  1. Why can't the same argument apply in the zero sum case? or
  2. Why can't the same argument be used to say that the first player is happy to make a dishonest claim? or
  3. Why is it okay for us to assume that we're in an honest-enough regime?

Separately, I'd also want to understand how exactly we're evading the argument I gave above about how the players can't even distinguish between honesty and dishonesty in the worst case.


Things I explicitly agree with:

I assume (correct me if I'm wrong) that the scoring rules to "the zero sum setting" are something like: the judge assesses things at the end, giving +1 to the winner and -1 from the loser, or 0 in case of a tie.


Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it's only a Weak Nash Equilibrium. But I think that's not quite true, since the strategy "lie when you would otherwise have to concede, but otherwise be honest" can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we're not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)

Comment by rohinmshah on Debate Minus Factored Cognition · 2021-01-26T00:00:36.566Z · LW · GW

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is.

I just read the Factored Cognition sequence since it has now finished, and this post derives WFC as the condition necessary for honesty to be an equilibrium in (a slightly unusual form of) debate, under the assumption of optimal play.

Comment by rohinmshah on Defining capability and alignment in gradient descent · 2021-01-25T06:54:58.131Z · LW · GW

Planned summary for the Alignment Newsletter:

Consider a neural network like GPT-3 trained by gradient descent on (say) the cross-entropy loss function. This loss function forms the _base objective_ that the process is optimizing for. Gradient descent typically ends up at some local minimum, global minimum, or saddle point of this base objective.

However, if we look at the gradient descent equation, θ = θ - αG, where G is the gradient, we can see that this is effectively minimizing the size of the gradients. We can think of this as the mesa objective: the gradient descent process (with an appropriate learning rate decay schedule) will eventually get G down to zero, its minimum possible value (even though it may not be at the global minimum for the base objective).

The author then proposes defining capability of an optimizer based on how well it decreases its loss function in the limit of infinite training. Meanwhile, given a base optimizer and mesa optimizer, alignment is given by the capability of the base optimizer divided by the capability of the mesa optimizer. (Since the mesa optimizer is the one that actually acts, this is effectively measuring how much progress on the mesa objective also causes progress on the true base objective.)

This has all so far assumed a fixed training setup (such as a fixed dataset and network architecture). Ideally, we would also want to talk about robustness and generalization. For this, the author introduces the notion of a “perturbation” to the training setup, and then defines [capability / alignment] [robustness / generalization] based on whether the optimization stays approximately the same when the training setup is perturbed.

It should be noted that these are all definitions about the behavior of optimizers in the infinite limit. We may also want stronger guarantees that also talk about the behavior on the way to the infinite limit.

Comment by rohinmshah on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-24T17:55:38.344Z · LW · GW

I didn't quite follow this part. Do you think I'm not reasoning from the thing I believe is the bottleneck?

I actually don't remember what I meant to convey with that :/

Would you be willing to take a 10:1 bet that there won't be something as good as GPT-3 trained on 2 OOMs less compute by 2030?

No, I'd also take the other side of the bet. A few reasons:

  • Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for "efficiency on a transformative task", whereas researchers probably are optimizing for "efficiency of GPT-3 style systems", suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
  • 90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.

(Note that 2 OOMs in 10 years seems significantly different from "we can get several OOMs more data-efficient training than the GPT's had using various already-developed tricks and techniques". I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)

I don't think evolution was going for compute-optimal performance in the relevant sense.

I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?

I think you'd need to argue that there is a specific other property that evolution was optimizing for, that clearly trades off against compute-efficiency, to argue that we should expect that in this case evolution was worse than in other cases.

But it seems to me to be... like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it's because they've developed general intelligence in the relevant sense.

This seems like it is realist about rationality, which I mostly don't buy. Still, 25% doesn't seem crazy, I'd probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; 25% does not make the median.

Or maybe they haven't but it's a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture.

Why aren't we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that's already what the model does.

GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25%, random chance, and I bet most english-speaking literate humans in the world today would have done worse than 50%.

  1. I don't particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren't optimized for that.
  2. I expect that Google search beats GPT-3 on that dataset.

I don't really know what you mean when you say that this task is "hard". Sure, humans don't do it very well. We also don't do arithmetic very well, while calculators do.

But I'm happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.

Er, note that I've talked to Ajeya for like an hour or two on the entire report. I'm not that confident that Ajeya also believes the things I'm saying (maybe I'm 80% confident).

To me this really sounds like it's saying the horizon length = the number of subjective seconds per sample during training. [...] 

I agree that the definition used in the report does seem consistent with that. I think that's mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to definition in terms of the task. The report doesn't really talk about the unsupervised pretraining approach so its definitions didn't have to handle that case.

But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for "when a neural net can do human-level summarization" and "when a neural net can be a human-level personal assistant", even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don't use the horizon length for that purpose, I think you should have some other way of incorporating "difficulty of the task" into your timelines.

Exactly. I think this is what humans do too, to a large extent. I'd be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.

I mean, I'm at 30 / 40 / 10, so that isn't that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let's say) 15% on it.

Comment by rohinmshah on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-21T23:43:34.266Z · LW · GW

Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong."

I'm happy to assume that AI will learn right from wrong to about the level that children do. This is not a sufficiently good definition of "the good" that we can then optimize it.

My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn't carry over to "knowing right from wrong." An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.

That sounds basically right, with the caveat that you want to be a bit more specific and precise with what the AI system should do than just saying "common sense"; I'm using the phrase as a placeholder for something more precise that we need to figure out.

Also, I'd change the last sentence to "an AI that has learned right from wrong to the same extent that humans learn it, and then optimizes for right things as hard as possible, will probably make dangerous moral mistakes". The point is that when you're trying to define "the good" and then optimize it, you need to be very very correct in your definition, whereas when you're trying not to optimize too hard in the first place (which is part of what I mean by "common sense") then that's no longer the case.

After all, one of the most universally acknowledged things about common sense is that it's uncommon among humans!

I think at this point I don't think we're talking about the same "common sense".

Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time - this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).

But why?

they're saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.

Again it depends on how accurate the "right/wrong classifier" needs to be, and how accurate the "common sense" needs to be. My main claim is that the path to safety that goes via "common sense" is much more tolerant of inaccuracies than the path that goes through optimizing the output of the right/wrong classifier.

Comment by rohinmshah on Literature Review on Goal-Directedness · 2021-01-21T21:58:13.628Z · LW · GW

Planned summary for the Alignment Newsletter:

This literature review on goal-directedness identifies five different properties that should be true for a system to be described as goal-directed:

1. **Restricted space of goals:** The space of goals should not be too expansive, since otherwise goal-directedness can <@become vacuous@>(@Coherence arguments do not imply goal-directed behavior@) (e.g. if we allow arbitrary functions over world-histories with no additional assumptions).

2. **Explainability:** A system should be described as goal-directed when doing so improves our ability to _explain_ the system’s behavior and _predict_ what it will do.

3. **Generalization:** A goal-directed system should adapt its behavior in the face of changes to its environment, such that it continues to pursue its goal.

4. **Far-sighted:** A goal-directed system should consider the long-term consequences of its actions.

5. **Efficient:** The more goal-directed a system is, the more efficiently it should achieve its goal.

The concepts of goal-directedness, optimization, and agency seem to have significant overlap, but there are differences in the ways the terms are used. One common difference is that goal-directedness is often understood as a _behavioral_ property of agents, whereas optimization is thought of as a _mechanistic_ property about the agent’s internal cognition.

The authors then compare multiple proposals on these criteria:

1. The _intentional stance_ says that we should model a system as goal-directed when it helps us better explain the system’s behavior, performing well on explainability and generalization. It could easily be extended to include far-sightedness as well. A more efficient system for some goal will be easier to explain via the intentional stance, so it does well on that criterion too. And not every possible function can be a goal, since many are very complicated and thus would not be better explanations of behavior. However, the biggest issue is that the intentional stance cannot be easily formalized.

2. One possible formalization of the intentional stance is to say that a system is goal-directed when we can better explain the system’s behavior as maximizing a specific utility function, relative to explaining it using an input-output mapping (see <@Agents and Devices: A Relative Definition of Agency@>). This also does well on all five criteria.

3. <@AGI safety from first principles@> proposes another set of criteria that have a lot of overlap with the five criteria above.

4. A [definition based off of Kolmogorov complexity]( works well, though it doesn’t require far-sightedness.

Planned opinion:

The five criteria seem pretty good to me as a description of what people mean when they say that a system is goal-directed. It is less clear to me that all five criteria are important for making the case for AI risk (which is why I care about a definition of goal-directedness); in particular it doesn’t seem to me like the explainability property is important for such an argument (see also [this comment](

Note that it can still be the case that as a research strategy it is useful to search for definitions that satisfy these five criteria; it is just that in evaluating which definition to use I would choose the one that makes the AI risk argument work best. (See also [Against the Backward Approach to Goal-Directedness](