Operationalizing compatibility with strategy-stealing 2020-12-24T22:36:28.870Z
Homogeneity vs. heterogeneity in AI takeoff scenarios 2020-12-16T01:37:21.432Z
Clarifying inner alignment terminology 2020-11-09T20:40:27.043Z
Multiple Worlds, One Universal Wave Function 2020-11-04T22:28:22.843Z
Learning the prior and generalization 2020-07-29T22:49:42.696Z
Weak HCH accesses EXP 2020-07-22T22:36:43.925Z
Alignment proposals and complexity classes 2020-07-16T00:27:37.388Z
AI safety via market making 2020-06-26T23:07:26.747Z
An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z
Exploring safe exploration 2020-01-06T21:07:37.761Z
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z
Inductive biases stick around 2019-12-18T19:52:36.136Z
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z
Gradient hacking 2019-10-16T00:53:00.735Z
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z
Deceptive Alignment 2019-06-05T20:16:28.651Z
The Inner Alignment Problem 2019-06-04T01:20:35.538Z
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z
Nuances with ascription universality 2019-02-12T23:38:24.731Z
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z


Comment by evhub on Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment · 2021-01-14T21:47:29.259Z · LW · GW

I would very much like to see proposals for AI alignment that escape completely from the assumption that we are going to hand off agency to AI.

Microscope AI (see here and here) is an AI alignment proposal that attempts to entirely avoid agency hand-off.

I also agree with Rohin's comment that Paul-style corrigibility is at least trying to avoid a full agency hand-off, though it still has significantly more of an agency hand-off than something like microscope AI.

Comment by evhub on The Case for a Journal of AI Alignment · 2021-01-09T21:15:27.646Z · LW · GW

I think this is a great idea and would be happy to help in any way with this.

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-01-06T00:37:57.875Z · LW · GW

My understanding is that floating-point granularity is enough of a real problem that it does sometimes matter in realistic ML settings, which suggests that it's probably a reasonable level of abstraction on which to analyze neural networks (whereas any additional insights from an electromagnetism-based analysis probably never matter, suggesting that's not a reasonable/useful level of abstraction).

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-01-03T01:41:34.247Z · LW · GW

The Levin bound doesn't apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.

In what sense is the parameter space of a neural network not finite and discrete? It is often useful to model floating-point values as continuous, but they are in fact discrete, so it seems like a theorem which assumes discreteness would still apply.
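A quick numerical illustration of the point about floating-point discreteness (my own toy example, not from the thread):

```python
import numpy as np

# float32 values are discrete: between any float and the next representable
# one there is a finite gap (one ULP), and the gap grows with magnitude.
x = np.float32(1.0)
gap = np.spacing(x)  # distance from 1.0 to the next float32 above it
print(gap)           # ~1.19e-07

# The same gap, computed directly from the next representable value.
print(np.nextafter(x, np.float32(2.0)) - x)

# Consequence for ML: a small enough parameter update does nothing at all,
# because it rounds away entirely.
print(x + np.float32(1e-8) == x)  # True -- the update is lost to rounding
```

This is the sense in which float granularity can matter in practice: gradient updates below the local ULP are silently dropped.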

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-01-03T01:24:55.504Z · LW · GW

I'm not sure if I count as a skeptic, but at least for me the only part of this that I find confusing is SGD not making a difference over random search. The fact that simple functions take up a larger volume in parameter space seems obviously true to me and I can't really imagine anyone disagreeing with that part (though I'm still quite glad to have actual analysis to back that up).

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2020-12-31T02:45:48.838Z · LW · GW

My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.

I think it's clearly true that neural net architecture is a huge contributor to inductive biases and comes with a strong simplicity bias. What's surprising to me is that I would expect a similar thing to be true of SGD and yet these results seem to indicate that SGD vs. random search has only a pretty minimal effect on inductive biases.

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2020-12-30T22:45:07.762Z · LW · GW

Hmmm... I don't know if that's how I would describe what's happening. I would say:

  • The above post provides empirical evidence that there isn't much difference between the generalization performance of “doing SGD on DNNs until you get some level of performance” and “randomly sampling DNN weights until you get some level of performance.”
  • I find that result difficult to reconcile with both theoretical and empirical arguments for why SGD should be different than random sampling, such as the experiments I ran and linked above.
  • The answer to this question, one way or another, is important for understanding the DNN prior, where it's coming from, and what sorts of things it's likely to incentivize—e.g. is there a simplicity bias which might incentivize mesa-optimization or incentivize pseudo-alignment?
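The "randomly sampling weights until you get some level of performance" procedure in the first bullet can be sketched in miniature (a hypothetical toy setup of my own — a linear threshold unit on AND — not the MNIST-scale DNN experiments the post actually runs):

```python
import numpy as np

# Toy version of "randomly sampling weights until you reach some level of
# performance": rejection-sample Gaussian parameters for a linear threshold
# unit until it fits the AND function perfectly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND

rng = np.random.default_rng(0)

def accuracy(w, b):
    preds = (X @ w + b > 0).astype(int)
    return (preds == y).mean()

tries = 0
while True:
    tries += 1
    w, b = rng.normal(size=2), rng.normal()
    if accuracy(w, b) == 1.0:
        break

print(accuracy(w, b))  # 1.0 once the loop exits
```

The empirical claim at issue is then that the generalization behavior of parameters found this way (with a Gaussian prior over weights) looks surprisingly similar to the behavior of parameters found by SGD.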
Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2020-12-30T01:03:53.205Z · LW · GW

This is great! This definitely does seem to me like strong evidence that SGD is the wrong place to look for understanding neural networks' inductive biases and that we should be focusing more on the architecture instead. I do wonder the extent to which that insight is likely to scale, though—perhaps the more gradient descent you do, the more it starts to look different from random search and the more you see its quirks.

Scott's “How does Gradient Descent Interact with Goodhart?” seems highly relevant here. Perhaps these results could serve as a partial answer to that question, in the sense that SGD doesn't seem to differ very much from random search (with a Gaussian prior on the weights) for deep neural networks on MNIST. I'm not sure how to reconcile that with the other arguments in that post for why it should be different, though, such as the experiments that Peter and I did.

Comment by evhub on Non-Obstruction: A Simple Concept Motivating Corrigibility · 2020-12-28T23:14:21.694Z · LW · GW

Hmmm... this is a subtle distinction and both definitions seem pretty reasonable to me. I guess I feel like I want “good things happen” to be part of capabilities (e.g. is the model capable of doing the things we want it to do) rather than alignment, making (impact) alignment more about not doing stuff we don't want.

Comment by evhub on Operationalizing compatibility with strategy-stealing · 2020-12-25T19:22:08.157Z · LW · GW

I think that when you're comparing different utility functions in your definition of strategy-stealing, it's better to emulate my definition and correct by the description complexity of the utility function.

Yep, that's a good point—I agree that correcting for the description length of the utility function is a good idea there.

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-21T06:35:45.367Z · LW · GW

If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?

I think that alignment will be a pretty important desideratum for anybody building an AI system—and I think that copying whatever alignment strategy was used previously is likely to be the easiest, most conservative, most risk-averse option for other organizations trying to fulfill that desideratum.

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-21T06:30:21.064Z · LW · GW

I feel like “warning shot” is a bad term for the thing that you're pointing at, as I feel like a warning shot evokes a sense of actual harm/danger. Maybe a canary or a wake-up call or something?

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-20T01:42:20.903Z · LW · GW

Because of the random nature of mesa optimization, they may all have very different goals.

I'm not sure if that's true—see my comments here and here.

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-20T01:40:03.917Z · LW · GW

Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals.

I disagree with this. I don't expect a failure of inner alignment to produce random goals, but rather systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don't actually expect them to differ that much between training runs, since it's more about your training process's inductive biases than inherent randomness in the training process in my opinion.

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-19T01:01:15.510Z · LW · GW

Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don't really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).

Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don't think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.

Why is there homogeneity in misaligned goals?

Some reasons you might expect homogeneity of misaligned goals:

  1. If you do lots of copying of the exact same system, then trivially they'll all have homogenous misaligned goals (unless those goals are highly indexical, but even then I expect the different AIs to be able to cooperate on those indexical preferences with each other pretty effectively).
  2. If you're using your AI systems at time step t to help you build your AI systems at time step t+1, then if that first set of systems is misaligned and deceptive, they can influence the development of the second set of systems to be misaligned in the same way.
  3. If you do a lot of fine-tuning to produce your next set of AIs, then I expect fine-tuning to mostly preserve existing misaligned goals, like I mentioned previously.
  4. Even if you aren't doing fine-tuning, as long as you're keeping the basic training process the same, I expect you'll usually get pretty similar misaligned proxies—e.g. the ones that are simpler/faster/generally favored by your inductive biases.
Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-18T22:35:15.523Z · LW · GW

Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you're envisioning? I assume the answer is "it becomes a mesa-optimizer"? If so my disagreement is about whether systems become mesa-optimizers, which we've talked about before.

I think “is a relatively coherent mesa-optimizer” is about right, though I do feel pretty uncertain here.

Homogenous in what? Algorithms? Alignment? Data?

My conversation with Paul was about homogeneity in alignment, iirc.

I agree that homogeneity reduces the likelihood of 5; I think it basically doesn't affect 1-4 unless you argue that there's a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren't and feel like a large portion of my probability mass on warning shots.

First, in a homogeneous takeoff I expect either all the AIs to defect at once or none of them to, which I think makes (2) less likely because a coordinated defection is harder to mess up.

Second, I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well, significantly reducing the risk to the model from being replaced.

I agree that homogeneity doesn't really affect (4) and I'm not really sure how to think of (1), though I guess I just wouldn't really call either of those “warning shots for deception,” since (1) isn't really a demonstration of a deceptive model and (4) isn't a situation in which that deceptive model causes any harm before it's caught.

At a higher level, the story you're telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don't see why you should expect that.

If a model is deceptive but not competent enough to hide its deception, then presumably we should find out during training and just not deploy that model. I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn't really affect the probability of that.

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-17T22:57:20.351Z · LW · GW

Thanks—glad you liked the post! Some replies:

I disagree with step 2 of this argument; I expect alignment depends significantly on how you finetune, and this will likely be very different for AI systems applied to different tasks. See e.g. how GPT-3 is being finetuned for different tasks.

I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system general enough to solve all the tasks you need it to solve, such that using it on a particular task only requires locating that task (either via clever prompting or fine-tuning), I don't expect that process of task location to change whether the system is aligned (at least with respect to what you're trying to get it to do in solving that task). Either you have a system with some other proxy objective that it cares about that isn't actually the tasks you want, or you have a system which is actually trying to solve the tasks you're giving it.

Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.

Huh? Every way that the strategy-stealing assumption might fail is about how misaligned systems with a little bit of power could "win" over a larger coalition of aligned systems with a lot of power. How does homogeneity of alignment change that?

I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we've had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system that tells us whether that system is just as good at optimizing for our values as any other set of values—a desideratum that could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren't just optimizing for simple proxies exist alongside it or not. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn't think of that as invalidating the importance of strategy-stealing.

This seems like it proves too much. Humans are very structurally similar to each other, but still have coordination and bargaining failures. Even in literally identical systems, indexical preferences could still cause conflict to arise.

Maybe you're claiming that AI systems will be way more homogenous than humans, and that they won't have indexical preferences? I'd disagree with both of those claims.

I do expect AI systems to have indexical preferences (at least to the extent that they're aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I'm making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.

I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don't find your argument here compelling.

I don't feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is where you have a deceptive AI that isn't sure whether it has enough power to defect safely or not but is forced to because if it doesn't it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.

But surely the point "we can rely on feedback mechanisms to correct issues" should make you less convinced that AI systems will be homogenous in alignment across time?

I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don't think in a homogenous takeoff you can rely on feedback mechanisms to correct that.

What's a heterogenous unipolar takeoff? I would assume you need to have a multipolar scenario for homogenous vs. heterogenous to be an important distinction.

A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.

(EDIT: This comment was edited to add some additional replies.)

Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2020-12-16T07:21:44.625Z · LW · GW

Glad you liked the post!

But as systems are modified or used to produce successor systems, they may be independently tuned to do things like represent their principal in bargaining situations. This tuning may introduce important divergences in whatever default priors or notions of fairness were present in the initial mostly-identical systems. I don’t have much intuition for how large these divergences would be relative to those in a regime that started out more heterogeneous.

Importantly, I think this moves you from a human-misaligned AI bargaining situation into more of a human-human (with AI assistants) bargaining situation, which I expect to work out much better, as I don't expect humans to carry out crazy threats to the same extent as a misaligned AI might.

For instance, multiple mesa-optimizers may be more likely under homogeneity, and if these have different mesa-objectives (perhaps due to being tuned by principals with different goals) then catastrophic bargaining failure may be more likely.

I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely. I think this could basically only happen if you were building a model out of independently-trained pieces rather than a single system trained end-to-end, which does not seem to be the direction machine learning is headed—and for good reason, as end-to-end training means you don't have to learn the same thing (such as optimization) multiple times.

Comment by evhub on Avoiding Side Effects in Complex Environments · 2020-12-12T02:07:31.997Z · LW · GW

I was in the (virtual) audience at NeurIPS when this talk was being presented and I think it was honestly one of the best talks at NeurIPS, even leaving aside my specific interest in it and just judging on presentation alone.

Comment by evhub on Seeking Power is Often Robustly Instrumental in MDPs · 2020-12-11T20:37:11.365Z · LW · GW

I think this post was a valuable contribution both to our understanding of instrumental convergence as well as making instrumental convergence rigorous enough to stand up to more intense outside scrutiny.

Comment by evhub on The strategy-stealing assumption · 2020-12-11T20:36:19.493Z · LW · GW

I think the strategy-stealing assumption is a great framework for analyzing what needs to be done to make AI go well such that I think that the ways in which the strategy-stealing assumption fail shed real light on the problems that we need to solve.

Comment by evhub on Classifying specification problems as variants of Goodhart's Law · 2020-12-11T20:34:37.597Z · LW · GW

I thought this post was great when it came out and still do—I think it does a really good job of connecting different frameworks for analyzing AI safety problems.

Comment by evhub on Thoughts on Human Models · 2020-12-11T20:33:51.992Z · LW · GW

I think this post helped draw attention to an under-appreciated strategy/approach to AI safety such that I am very glad that it exists.

Comment by evhub on Risks from Learned Optimization: Introduction · 2020-12-02T20:55:17.254Z · LW · GW

I think I can guess what your disagreements are regarding too narrow a conception of inner alignment/mesa-optimization (that the paper overly focuses on models mechanistically implementing optimization), though I'm not sure what model of AI development it relies on that you don't think is accurate and would be curious for details there. I'd also be interested in what sorts of worse research topics you think it has tended to encourage (on my view, I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design). Also, for the paper giving people a “but what about mesa-optimization” response, I'm imagining you're referring to things like this post, though I'd appreciate some clarification there as well.

Comment by evhub on Learning the prior and generalization · 2020-11-19T21:58:07.527Z · LW · GW

Yep; that's what I was imagining. It is also worth noting that it can be less safe to do that, though, since you're letting A(Z) see Y, which could bias it in some way that you don't want—I talk about that danger a bit in the context of approval-based amplification here and here.

Comment by evhub on Inner Alignment in Salt-Starved Rats · 2020-11-19T06:44:35.369Z · LW · GW

It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated "interpretability" module that tries to make sense of the AGI's latent variables by correlating them with some other labeled properties of the AGI's inputs, and then rewarding the AGI for "thinking about the right things" (according to the interpretability module's output), which in turn helps turn those thoughts into the AGI's goals.

(Is this a good design idea that AGI programmers should adopt? I don't know, but I find it interesting, and at least worthy of further thought. I don't recall coming across this idea before in the context of inner alignment.)

Fwiw, I think this is basically a form of relaxed adversarial training, which is my favored solution for inner alignment.

Comment by evhub on AI safety via market making · 2020-11-16T22:54:29.562Z · LW · GW

Pretty sure debate can also access R if you make this strong of an assumption - ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.

First, my full exploration of what's going on with different alignment proposals and complexity classes can be found here, so I'd recommend just checking that out rather than relying on the mini proof sketch I gave here.

Second, in terms of directly addressing what you're saying, I tried doing a proof by induction to get debate to RE and it doesn't work. The problem is that you can only get guarantees for trees that the human can judge, which means they have to be polynomial in length (though if you relax that assumption then you might be able to do better). Also, it's worth noting that the text that you're quoting isn't actually an assumption of the proof in any way—it's just the inductive hypothesis in a proof by induction.

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

I think that is the same as what I'm proving, at least if you allow for “training signal” to mean “training signal in the limit of training on arbitrarily large amounts of data.” See my full post on complexity proofs for more detail on the setup I'm using.

Comment by evhub on Clarifying inner alignment terminology · 2020-11-16T21:04:54.567Z · LW · GW

Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.

Comment by evhub on Learning Normativity: A Research Agenda · 2020-11-11T22:49:52.309Z · LW · GW

I like this post a lot. I pretty strongly agree that process-level feedback (what I would probably call mechanistic incentives) is necessary for inner alignment—and I'm quite excited about understanding what sorts of learning mechanisms we should be looking for when we give process-level feedback (and recursive quantilization seems like an interesting option in that space).

Since detecting malign hypotheses is difficult, we want the learning system to help us out here. It should generalize from examples of malign hypotheses, and attempt to draw a broad boundary around malignancy. Allowing the system to judge itself in this way can of course lead to malign reinterpretations of user feedback, but hopefully allows for a basin of attraction in which benevolent generalizations can be learned.

Notably, one way to get this is to have the process feedback given by an overseer implemented as a human with access to a prior version of the model being overseen (and then train the model both on the oversight signal directly and to match the amplified human's behavior doing oversight), as in relaxed adversarial training.

Comment by evhub on Clarifying inner alignment terminology · 2020-11-11T22:31:29.462Z · LW · GW

I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as “trying” to do anything in particular.

Comment by evhub on Clarifying inner alignment terminology · 2020-11-11T19:53:05.762Z · LW · GW

I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.

No—all data points that it could ever encounter is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal loss answer for every input that it's ever actually given at any point.

When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not a MDP is an important and overlooked fact).

Deployment, but I agree that this one gets tricky. I don't think that the fact that the world is non-stationary is a problem for conceptualizing it as an MDP, since whatever transitions occur can just be thought of as part of a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.

Comment by evhub on Clarifying inner alignment terminology · 2020-11-10T20:00:27.202Z · LW · GW

Thanks! And good point—I added a clarifying footnote.

Comment by evhub on AGI safety from first principles: Conclusion · 2020-11-10T00:57:40.164Z · LW · GW

I just wanted to say that I think this sequence is by far my new favorite resource for laying out the full argument for AI risk and I expect to be linking new people to it quite a lot in the future. Reading it, it really felt to me like the full explanation of AI risk that I would have written if I'd spent a huge amount of time writing it all up carefully—which I'm now very glad that I don't have to do!

Comment by evhub on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-09T20:41:27.267Z · LW · GW

Yes—I agree with both (a) and (b). I just don't think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.

Comment by evhub on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-09T20:17:46.129Z · LW · GW

UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they're thinking of "inner alignment" in a way which does not necessarily involve any inner optimizer at all. They're thinking of generalization error as "inner alignment failure" essentially by definition, regardless of whether there's any inner optimizer involved. Conversely, they think of "outer alignment" in a way which ignores generalization errors.

I don't think this is true. I never said that inner alignment didn't involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.

Comment by evhub on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-08T08:53:28.267Z · LW · GW

if we're imagining a world in which the "true" label is a deterministic function of the input, and that deterministic function is in the supervised learner's space, then our model has thrown away everything which makes generalization error a problem in practice.

Yes—that's the point. I'm trying to define outer alignment so I want the definition to get rid of generalization issues.

Comment by evhub on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures · 2020-11-08T05:21:30.155Z · LW · GW

That's only true for RL—for SL, perfect loss requires being correct on every data point, regardless of how often it shows up in the distribution. For RL, that's not true, but for RL we can just say that we're talking about the optimal policy on the MDP that the model will actually encounter over its existence.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-06T19:07:07.268Z · LW · GW

We are not in a mathematical entity

Then what would you call reality? It sure seems like it's well-described as a mathematical object to me.

without "locating yourself within the object", it's not clear how you know whether your theory is true, so it's very much pertinent to physics.

Put a simplicity prior over the combined difficulty of specifying a universe and specifying you within that universe. Then update on your observations.
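This update rule can be sketched numerically. The sketch below is a toy illustration only—the hypothesis names, description lengths, and likelihoods are all made-up numbers standing in for real program lengths and observation models:

```python
# Toy sketch of a simplicity prior over (universe, self-location) hypotheses,
# followed by a Bayesian update on observations. All numbers are hypothetical.

# Each hypothesis: (bits to specify the universe, bits to locate yourself in it)
hypotheses = {
    "simple universe, simple location": (10, 5),
    "simple universe, complex location": (10, 20),
    "complex universe, simple location": (30, 5),
}

# Simplicity prior: weight 2^-(K_universe + K_location)
prior = {h: 2 ** -(ku + kl) for h, (ku, kl) in hypotheses.items()}

# Likelihood of our observations under each hypothesis (made up for illustration)
likelihood = {
    "simple universe, simple location": 0.1,
    "simple universe, complex location": 0.9,
    "complex universe, simple location": 0.9,
}

# Bayesian update: posterior proportional to prior * likelihood
unnorm = {h: prior[h] * likelihood[h] for h in hypotheses}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}
```

With these numbers the simplicity prior dominates: even a 9x likelihood advantage can't make up for an extra 15–20 bits of description length.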

The "mathematical entity" would be a probability measure over classical histories.

Not necessarily. You can mathematically well-define 1) a Turing machine with access to randomness that samples from a probability measure and 2) a Turing machine which actually computes all the histories (and then which one you find yourself in is an anthropic question). What quantum mechanics says, though, is that (1) actually doesn't work as a description of reality, because we see interference from those other branches, which means we know it has to be (2).
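The difference between (1) and (2) can be shown in a two-path toy example (a hypothetical sketch, not anything from the original discussion): a machine that samples one path adds probabilities, while the full wavefunction adds amplitudes before squaring, so the two models make different predictions whenever the paths interfere:

```python
import numpy as np

# Two "branches"/paths arriving at the same outcome, with opposite phases.
a1 = 1 / np.sqrt(2)
a2 = -1 / np.sqrt(2)

# Model (1): a stochastic machine that samples one path -- probabilities add.
p_sampling = abs(a1) ** 2 + abs(a2) ** 2   # 0.5 + 0.5 = 1.0

# Model (2): all branches computed together -- amplitudes add, then square.
p_quantum = abs(a1 + a2) ** 2              # destructive interference: 0.0
```

Observing the interference term is exactly what rules out model (1) as a description of reality.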

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-06T01:31:51.012Z · LW · GW

Ah, I see the confusion. Since we're in a wave mechanics setting, I should have written rather than .

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T21:41:19.035Z · LW · GW

Thanks—should be fixed now. Dunno how I missed that.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T21:26:06.548Z · LW · GW

Yeah, that's a great argument—Everett's thesis always has the answers.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T20:48:20.530Z · LW · GW

I don't generally feed trolls, but I literally have no idea what hedweb even is and am honestly just curious how you even got that from my references.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T20:40:31.324Z · LW · GW

Do you see any technical or conceptual challenges which the MWI has yet to address or do you think it is a well-defined interpretation with no open questions?

There are remaining open questions concerning quantum mechanics, certainly, but I don't really see any remaining open questions concerning the Everett interpretation.

What's your model for why people are not satisfied with the MWI? The obvious ones are 1) dislike for a many worlds ontology and 2) ignorance of the arguments. Do you think there are other valid reasons?

“Valid” is a strong word, but other reasons I've seen include classical prejudice, historical prejudice, dogmatic falsificationism, etc. Honestly, though, as I mention in the paper, my sense is that most big name physicists that you might have heard of (Hawking, Feynman, Gell-Mann, etc.) have expressed support for Everett, so it's really only more of a problem among your average physicist that probably just doesn't pay that much attention to interpretations of quantum mechanics.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T20:31:24.344Z · LW · GW

Like I mention in the paper, the largest object for which we've done this so far (at least that I'm aware of) is the carbon-60 molecule (a "buckyball" of 60 carbon atoms), which, while impressive, is far from "macroscopic." Preventing a superposition from decohering is really, really difficult—it's what makes building a quantum computer so hard. That being said, there are some wacky macroscopic objects that do sometimes need to be treated as quantum systems, like neutron stars (as I mention in the paper) or black holes (though we still don't fully understand black holes from a quantum perspective).

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T20:24:08.192Z · LW · GW

It's been a while since I've done any wave mechanics, but I'll try to take a crack at this. The Schrodinger equation describes a linear PDE such that the sum of any two solutions is also a solution and any constant multiple of a solution is also a solution. Furthermore, the time-independent Schrodinger equation just takes the form $\hat{H}\psi = E\psi$, thus "solutions to the Schrodinger equation" is equivalent to "eigenfunctions of the Hamiltonian." Thus, if $\psi_1, \psi_2$ are eigenfunctions of the Hamiltonian with eigenvalues $E_1, E_2$, then any superposition $a\psi_1 + b\psi_2$ must also be a valid solution, with each component evolving under its own phase $e^{-iE_j t/\hbar}$. This raises a problem for any theory that acts nonlinearly across a sum of eigenfunctions, however, because it lets me change bases into an equivalent form with a potentially different result.
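The linearity argument can be sanity-checked numerically in a toy finite-dimensional analogue—a random Hermitian matrix standing in for the Hamiltonian (an illustration of the linear-algebra structure, not the wave-mechanics PDE itself):

```python
import numpy as np

# Toy Hermitian "Hamiltonian" on a 4-dimensional state space.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = (A + A.conj().T) / 2                 # Hermitian by construction

E, psi = np.linalg.eigh(H)               # eigenvalues E[i], eigenvectors psi[:, i]

# Linearity: H applied to a superposition of eigenfunctions equals the
# same superposition of H applied to each eigenfunction separately.
a, b = 0.6, 0.8j
superposition = a * psi[:, 0] + b * psi[:, 1]
lhs = H @ superposition
rhs = a * E[0] * psi[:, 0] + b * E[1] * psi[:, 1]
assert np.allclose(lhs, rhs)
```

Any nonlinear modification would break this identity, which is why it constrains the space of viable theories so strongly.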

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T20:05:23.011Z · LW · GW

I think that physics is best understood as answering the question “in what mathematical entity do we find ourselves?”—a question that Everett is very equipped to answer. Then, once you have an answer to that question, figuring out your observations becomes fundamentally a problem of locating yourself within that object, which I think raises lots of interesting anthropic questions, but not additional physical ones.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T19:47:39.834Z · LW · GW

Yeah... to paraphrase Deutsch, that just sounds like multiple worlds in a state of chronic denial. Also, it is possible for other Everett branches to influence yours—the probability just gets so vanishingly small as they decohere that it's negligible in practice.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T04:43:04.920Z · LW · GW

Linearity is a fundamental property of quantum mechanics. If I'm trying to just describe it in wave mechanics terms, I would say that the linearity of quantum mechanics derives from the fact that the wave equation describes a linear system and thus solutions to it must obey the (general, mathematical) principle of superposition.

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T04:24:15.959Z · LW · GW

Glad you liked it! $a_i$ is the amplitude—the assumption is that the $\psi_i$ are normalized, orthogonal eigenfunctions (that is, $\int \psi_i^* \psi_i \, dx = 1$ and $\int \psi_i^* \psi_j \, dx = 0$ for $i \neq j$).

Comment by evhub on Multiple Worlds, One Universal Wave Function · 2020-11-05T04:22:39.598Z · LW · GW

Which basis do you use in obtaining multiple worlds from a single wavefunction?

Any (approximately) diagonal basis—the whole point of decoherence is that the reduced density matrix evolves toward diagonal form in the pointer basis over time.
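Here's a minimal numerical sketch of that picture (a toy model with a hand-picked damping factor standing in for a real environment interaction): decoherence exponentially suppresses the off-diagonal terms of the density matrix while leaving the branch weights on the diagonal untouched:

```python
import numpy as np

# Pure superposition in the pointer basis: (|0> + |1>) / sqrt(2).
psi = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(psi, psi.conj())       # density matrix with off-diagonal coherences

def decohere(rho, gamma):
    """Toy decoherence: damp off-diagonal terms by exp(-gamma)."""
    out = rho.copy()
    damp = np.exp(-gamma)
    out[0, 1] *= damp
    out[1, 0] *= damp
    return out

rho_late = decohere(rho, gamma=50.0)
# Diagonal (branch) probabilities stay at 0.5 each; coherences are negligible.
```

Once the coherences are this small, the two diagonal entries behave for all practical purposes as independent branches.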

How do you deal with relativity?

Just use the Relativistic Schrodinger equation.