Knowledge Neurons in Pretrained Transformers 2021-05-17T22:54:50.494Z
Agents Over Cartesian World Models 2021-04-27T02:06:57.386Z
Open Problems with Myopia 2021-03-10T18:38:09.459Z
Operationalizing compatibility with strategy-stealing 2020-12-24T22:36:28.870Z
Homogeneity vs. heterogeneity in AI takeoff scenarios 2020-12-16T01:37:21.432Z
Clarifying inner alignment terminology 2020-11-09T20:40:27.043Z
Multiple Worlds, One Universal Wave Function 2020-11-04T22:28:22.843Z
Learning the prior and generalization 2020-07-29T22:49:42.696Z
Weak HCH accesses EXP 2020-07-22T22:36:43.925Z
Alignment proposals and complexity classes 2020-07-16T00:27:37.388Z
AI safety via market making 2020-06-26T23:07:26.747Z
An overview of 11 proposals for building safe advanced AI 2020-05-29T20:38:02.060Z
Zoom In: An Introduction to Circuits 2020-03-10T19:36:14.207Z
Synthesizing amplification and debate 2020-02-05T22:53:56.940Z
Outer alignment and imitative amplification 2020-01-10T00:26:40.480Z
Exploring safe exploration 2020-01-06T21:07:37.761Z
Safe exploration and corrigibility 2019-12-28T23:12:16.585Z
Inductive biases stick around 2019-12-18T19:52:36.136Z
Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z
Gradient hacking 2019-10-16T00:53:00.735Z
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z
Deceptive Alignment 2019-06-05T20:16:28.651Z
The Inner Alignment Problem 2019-06-04T01:20:35.538Z
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z
Nuances with ascription universality 2019-02-12T23:38:24.731Z
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z


Comment by evhub on "Decision Transformer" (Tool AIs are secret Agent AIs) · 2021-06-09T02:03:35.965Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Survey on AI existential risk scenarios · 2021-06-09T02:03:16.268Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Game-theoretic Alignment in terms of Attainable Utility · 2021-06-08T19:41:33.367Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on Thoughts on the Alignment Implications of Scaling Language Models · 2021-06-04T19:55:42.411Z · LW · GW

(Moderation note: added to the Alignment Forum from LessWrong.)

Comment by evhub on List of good AI safety project ideas? · 2021-05-27T19:02:17.427Z · LW · GW

Though they're both somewhat outdated at this point, there are certainly still some interesting concrete experiment ideas to be found in my “Towards an empirical investigation of inner alignment” and “Concrete experiments in inner alignment.”

Comment by evhub on Agency in Conway’s Game of Life · 2021-05-19T21:37:11.045Z · LW · GW

Have you seen “Growing Neural Cellular Automata”? It seems like the authors there are trying to do something pretty similar to what you have in mind here.

Comment by evhub on Knowledge Neurons in Pretrained Transformers · 2021-05-19T20:32:00.892Z · LW · GW

Knowledge neurons don't seem to include all of the model's knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%.

Yeah, agreed—though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I'm not saying they have necessarily done that).

I don't think there's evidence that these knowledge neurons don't do a bunch of other stuff. After removing about 0.02% of neurons they found that the mean probability on other correct answers decreased by 0.4%. They describe this as "almost unchanged" but it seems like it's larger than I'd expect for a model trained with dropout for knocking out random neurons (if you extrapolate that to knocking out 10% of the mlp neurons, as done during training, you'd have reduced the correct probability by 50x, whereas the model should still operate OK with 10% dropout).

That's a good point—I didn't look super carefully at their number there, but I agree that looking more carefully it does seem rather large.

Looking at that again, it seems potentially relevant that instead of zeroing those neurons they added the embedding of the [UNK] token.

I also thought this was somewhat strange and am not sure what to make of it.

A priori if a network did work this way, it's unclear why individual neurons would correspond to individual lookups rather than using a distributed representation (and they probably wouldn't, given sparsity---that's a crazy inefficient thing to do and if anything seems harder for SGD to learn) so I'm not sure that this perspective even helps explain the observation that a small number of neurons can have a big effect on particular prompts.

I was also surprised that they used individual neurons rather than NMF factors or something—though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.

But they don't give any evidence that the transformation had a reliable effect, or that it didn't mess up other stuff, or that they couldn't have a similar effect by targeting other neurons.

Actually looking at the replacement stuff in detail it seems even weaker than that. Unless I'm missing something it looks like they only present 3 cherry-picked examples with no quantitative evaluation at all? It's possible that they just didn't care about exploring this effect experimentally, but I'd guess that they tried some simple stuff and found the effect to be super brittle and so didn't report it. And in the cases they report, they are changing the model from remembering an incorrect fact to a correct one---that seems important because probably the model put significant probability on the correct thing already.

Perhaps I'm too trusting—I agree that everything you're describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.

Comment by evhub on Formal Inner Alignment, Prospectus · 2021-05-12T21:41:35.508Z · LW · GW

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).

Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident as possible that there's as little malignancy/deception as possible, such that just defusing the arguments that we can come up with might not be enough, but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.

Comment by evhub on Mundane solutions to exotic problems · 2021-05-04T22:44:09.774Z · LW · GW

Your link is broken.

For reference, the first post in Paul's ascription universality sequence can be found here (also Adam has a summary here).

Comment by evhub on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-05-04T22:16:14.251Z · LW · GW

I guess I would say something like: random search is clearly a pretty good first-order approximation, but there are also clearly second-order effects. I think that exactly how strong/important/relevant those second-order effects are is unclear, however, and I remain pretty uncertain there.

Comment by evhub on [AN #148]: Analyzing generalization across more axes than just accuracy or loss · 2021-04-30T23:12:35.272Z · LW · GW

Read more: Section 1.3 of this version of the paper

This is in the wrong spot.

Comment by evhub on Covid 4/22: Crisis in India · 2021-04-24T21:40:09.114Z · LW · GW

Are there ways to share this with EAs?

You could write a post about it on the EA Forum.

Comment by evhub on NTK/GP Models of Neural Nets Can't Learn Features · 2021-04-22T19:25:21.831Z · LW · GW

(moved from LW to AF)

Meta: I'm going to start commenting this on posts I move from LW to AF just so there's a better record of what moderation actions I'm taking.

Comment by evhub on Does the lottery ticket hypothesis suggest the scaling hypothesis? · 2021-04-22T18:48:52.460Z · LW · GW


Comment by evhub on Homogeneity vs. heterogeneity in AI takeoff scenarios · 2021-04-15T19:29:17.981Z · LW · GW

I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.

I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn't a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn't involve taking over the world. By default, in the case of deception, my expectation is that we won't get a warning shot at all—though I'd more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.

Comment by evhub on Open Problems with Myopia · 2021-03-12T05:56:50.404Z · LW · GW

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.

Comment by evhub on Open Problems with Myopia · 2021-03-11T08:03:56.813Z · LW · GW

Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.

Comment by evhub on Open Problems with Myopia · 2021-03-11T08:00:59.805Z · LW · GW

I think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary (and I personally favor some form of relaxed adversarial training), but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not, so we know what the overseer in a setup like that should be looking for.

Comment by evhub on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-10T21:45:22.678Z · LW · GW

Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating, that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training process is going to search through the space, since you're just trying to guarantee that the whole space is transparent, which I do think is a pretty good desideratum if it's achievable. Imo, I mostly bite the bullet on path-continuity arguments being necessary, but it definitely would be nice if they weren't.

Comment by evhub on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-09T21:48:06.328Z · LW · GW

Fwiw, I also agree with Adele and Eliezer here and just didn't see Eliezer's comment when I was giving my comments.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-09T21:42:02.882Z · LW · GW

Sure—by that definition of realizability, I agree that's where the difficulty is. Though I would seriously question the practical applicability of such an assumption.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-09T03:12:09.898Z · LW · GW

Perhaps I just totally don't understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn't matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there's some deceptive model with greater prior probability that's indistinguishable on the training distribution, as in my simple toy model from earlier.
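
A minimal sketch of the MAP point, under toy assumptions: the hypothesis names, prior values, and likelihoods below are invented purely for illustration. When two hypotheses fit the training data equally well, MAP selection picks whichever has the larger prior, and the true model having non-zero prior does nothing to change that.

```python
# Toy illustration (all numbers are hypothetical): realizability only says the
# true model has non-zero prior, which does not stop MAP from selecting a
# different model that fits the training data equally well.

def map_hypothesis(prior, likelihood):
    """Return the hypothesis that maximizes prior * likelihood."""
    return max(prior, key=lambda h: prior[h] * likelihood[h])

# Assumed priors: the deceptive model is taken to be simpler, so it gets more
# prior mass than the true model.
prior = {"true_model": 0.01, "deceptive_model": 0.05, "bad_fit_model": 0.94}

# Likelihood of the training data under each hypothesis. The first two are
# indistinguishable on the training distribution.
training_likelihood = {
    "true_model": 1.0,
    "deceptive_model": 1.0,
    "bad_fit_model": 1e-9,
}

selected = map_hypothesis(prior, training_likelihood)
print(selected)
```

Under these made-up numbers the selection is driven entirely by the prior ratio between the two models that fit the data, which is the situation described above.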

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-08T23:50:12.969Z · LW · GW

Yeah, that's a fair objection—my response to that is just that I think that preventing a model from being able to distinguish training and deployment is likely to be impossible for anything competitive.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-08T21:46:55.307Z · LW · GW

Here's a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):

Then, comparing the K-complexity of the two models, we get

and the problem becomes that both the deceptive agent and the aligned agent will produce behavior that looks aligned on the training distribution, but the aligned agent's utility function has to be much more complex. To see this, note that essentially any utility function for the deceptive agent will yield good training performance, because the model will choose to act deceptively during training, whereas if you want to get good training performance without deception, then the aligned agent's utility function has to actually encode the full objective, which is likely to make it quite complicated.
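
One way the toy model could be written out. Every symbol here ($M$, $U$, $a$, $s_i$, $K$) is an illustrative assumption chosen to match the prose, not notation recovered from the comment itself:

```latex
% Assumed notation: two agents that share the same search procedure and
% differ only in the utility function they search over.
\begin{align*}
M_{\text{deceptive}}(x) &= \operatorname*{argmax}_{a} \;
    \mathbb{E}\Big[ \textstyle\sum_i U_{\text{deceptive}}(s_i) \,\Big|\, a \Big] \\
M_{\text{aligned}}(x) &= \operatorname*{argmax}_{a} \;
    \mathbb{E}\Big[ \textstyle\sum_i U_{\text{aligned}}(s_i) \,\Big|\, a \Big]
\end{align*}

% Because the search machinery is shared, the gap in K-complexity between the
% models is roughly the gap between their utility functions.
\[
K(M_{\text{aligned}}) - K(M_{\text{deceptive}})
    \approx K(U_{\text{aligned}}) - K(U_{\text{deceptive}})
\]
```

Read this way, the argument is that the right-hand side is large and positive: a simple, arbitrary utility function is enough for the deceptive agent, while the aligned agent's has to spell out the full objective.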

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-06T22:40:06.459Z · LW · GW

It's not that simulating is difficult, but that encoding for some complex goal is difficult, whereas encoding for a random, simple goal is easy.

Comment by evhub on I'm still mystified by the Born rule · 2021-03-05T02:12:40.301Z · LW · GW

I don't think I would take that bet—I think the specific question of what UTM to use does feel more likely to be off-base than other insights I associate with UDASSA. For example, some things that I feel UDASSA gets right: a smooth continuum of happeningness that scales with number of clones/amount of simulation compute/etc., and simpler things being more highly weighted.

Comment by evhub on I'm still mystified by the Born rule · 2021-03-04T06:35:53.187Z · LW · GW

Yeah—I think I agree with what you're saying here. I certainly think that UDASSA still leaves a lot of things unanswered and seems confused about a lot of important questions (embeddedness, uncomputable universes, what UTM to use, how to specify an input stream, etc.). But it also feels like it gets a lot of things right in a way that I don't expect a future, better theory to get rid of—that is, UDASSA feels akin to something like Newtonian gravity here, where I expect it to be wrong, but still right enough that the actual solution doesn't look too different.

Comment by evhub on I'm still mystified by the Born rule · 2021-03-04T04:21:34.520Z · LW · GW

FWIW I have decent odds on "a thicker computer (and, indeed, any number of additional copies of exactly the same em) has no effect", and that's more obviously in contradiction with UDASSA.

Absolutely no effect does seem pretty counterintuitive to me, especially given that we know from QM that different levels of happeningness are at least possible.

Like, I continue to have the dueling intuitions "obviously more copies = more happening" and "obviously, setting aside how it's nice for friends to have backup copies in case of catastrophe, adding an identical em of my bud doesn't make the world better, nor make their experiences different (never mind stronger)". And, while UDASSA is a simple idea that picks a horse in that race, it doesn't... reveal to each intuition why they were confused, and bring them into unison, or something?

I think my answer here would be something like: the reason that UDASSA doesn't fully resolve the confusion here is that UDASSA doesn't exactly pick a horse in the race as much as it enumerates the space of possible horses, since it doesn't specify what UTM you're supposed to be using. For any (computable) tradeoff between “more copies = more happening” and “more copies = no impact” that you want, you should be able to find a UTM which implements that tradeoff. Thus, neither intuition really leaves satisfied, since UDASSA doesn't actually take a stance on how much each is right, instead just deferring that problem to figuring out what UTM is “correct.”
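
A toy sketch of how a single knob could interpolate between the two intuitions. This is not a real UTM construction: the program-length formula and the `index_cost` parameter are assumptions made up for illustration, standing in for the choice of UTM.

```python
import math

def total_measure(base_bits, n_copies, index_cost):
    """Combined measure of n identical copies under a toy UDASSA-style weighting.

    Each copy is assumed to be picked out by a program of length
    base_bits + index_cost * log2(n_copies), and a program of length L gets
    weight 2 ** -L. Both the length formula and the index_cost knob are
    illustrative assumptions, not part of any actual UTM.
    """
    bits_per_copy = base_bits + index_cost * math.log2(n_copies)
    return n_copies * 2.0 ** (-bits_per_copy)

# index_cost = 0: pointing at a copy is free, so measure scales with copies.
linear = total_measure(base_bits=10, n_copies=4, index_cost=0.0)

# index_cost = 1: indexing eats exactly the gain, so extra copies change nothing.
flat = total_measure(base_bits=10, n_copies=4, index_cost=1.0)

# Intermediate values land between the two intuitions.
between = total_measure(base_bits=10, n_copies=4, index_cost=0.5)
```

In this toy, the total works out to `2 ** -base_bits * n_copies ** (1 - index_cost)`, so the question of how much extra copies matter is pushed entirely into the choice of `index_cost`, much as the comment says it gets deferred to the choice of UTM.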

Comment by evhub on I'm still mystified by the Born rule · 2021-03-04T03:51:23.399Z · LW · GW

For another instance, I weakly suspect that "running an emulation on a computer with 2x-as-thick wires does not make them twice-as-happening" is closer to the truth than the opposite, in apparent contradiction with UDASSA.

I feel like I would be shocked if running a simulation on twice-as-thick wires made it twice as easy to specify you, according to whatever the “correct” UTM is. It seems to me like the effect there shouldn't be nearly that large.

Comment by evhub on I'm still mystified by the Born rule · 2021-03-04T03:34:28.939Z · LW · GW

The only reason that sort of discarding works is because of decoherence (which is a probabilistic, thermodynamic phenomenon), and in fact, as a result, if you want to be super precise, discarding actually doesn't work, since the impact of those other eigenfunctions never literally goes to zero.

Comment by evhub on I'm still mystified by the Born rule · 2021-03-04T03:30:30.950Z · LW · GW

I think I basically agree with all of this, though I definitely think that the problem that you're pointing to is mostly not about the Born rule (as I think you mostly state already), and instead mostly about anthropics. I do personally feel pretty convinced that at least something like UDASSA will stand the test of time on that front. It seems to me like you mostly agree with that, but just think that there are problems with embeddedness + figuring out how to properly extract a sensory stream that still need to be resolved, which I definitely agree with, but still expect the end result of resolving those issues to look UDASSA-ish.

I also definitely agree that there are really important hints as to how we're supposed to do things like anthropics that we can get from looking at physics. I think that if you buy that we're going to want something UDASSA-ish, then one way in which we can interpret the hint that QM is giving us is as a hint as to what our Universal Turing Machine should be. Obviously, the problem with that is that it's a bit circular, since you don't want to choose a UTM just by taking a maximum likelihood estimate, otherwise you just get a UTM with physics as a fundamental operation. I definitely still feel confused about the right way to use physics as evidence for what our UTM should be, but I do feel like it should be some form of evidence—perhaps we're supposed to have a pre-prior or something here to handle combining our prior beliefs about simple UTMs with our observations about what sorts of UTM properties would make the physics that we find ourselves in look simple.

It's also worth noting that UDASSA pretty straightforwardly predicts the existence of happeningness dials, which makes me find their existence in the real world not all that surprising.

Also, it's neither here nor there, but as an aside I feel like this sort of discussion is where the meat of not just anthropics, but also population ethics is supposed to be: if we can figure out what the “correct” anthropic distribution is supposed to be (whatever that means), then that measure should also clearly be the measure that you use to weight person-moments in your utilitarian calculations.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-03T20:02:42.121Z · LW · GW

I think that “think up good strategies for achieving [reward of the type I defined earlier]” is likely to be much, much more complex (making it much more difficult to achieve with a local search process) than an arbitrary goal X for most sorts of rewards that we would actually be happy with AIs achieving.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-03T02:22:59.278Z · LW · GW

A's structure can just be “think up good strategies for achieving X, then do those,” with no explicit subroutine that you can find anywhere in A's weights that you can copy over to B.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-02T23:10:17.177Z · LW · GW

A is sufficiently powerful to select M which contains the complex part of B. It seems rather implausible that an algorithm of the same power cannot select B.

A's weights do not contain the complex part of B—deception is an inference-time phenomenon. It's very possible for complex instrumental goals to be derived from a simple structure such that a search process is capable of finding that simple structure that yields those complex instrumental goals without being able to find a model with those complex instrumental goals hard-coded as terminal goals.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-02T20:52:06.523Z · LW · GW

B can be whatever algorithm M uses to form its beliefs + confidence threshold (it's strategy stealing.)

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

Why? If you give evolution enough time, and the fitness criterion is good (as you apparently agreed earlier), then eventually it will find B.

I definitely don't think this, unless you have a very strong (and likely unachievable imo) definition of “good.”

Second, why? Suppose that your starting algorithm is good at finding results when a lot of data is available, but is also very data inefficient (like deep learning seems to be). Then, by providing it with a lot of synthetic data, you leverage its strength to find a new algorithm which is data efficient. Unless you believe deep learning is already maximally data efficient (which seems very dubious to me)?

I guess I'm skeptical that you can do all that much in the fully generic setting of just trying to predict a simplicity prior. For example, if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.

Comment by evhub on Are the Born probabilities really that mysterious? · 2021-03-02T20:43:33.511Z · LW · GW

Imo, I think b) is a correct objection, in the sense that I think it's literally true, but also a bit unfair, since I think you could level a similar objection against essentially any physical law.

Comment by evhub on Are the Born probabilities really that mysterious? · 2021-03-02T04:40:51.458Z · LW · GW

I had a very similar reaction when I first read Everett's thesis as well (after having previously read Eliezer's quantum physics sequence). I reproduced Everett's proof in “Multiple Worlds, One Universal Wave Function,” precisely because I felt like it did dissolve a lot of the problem—and I also reference a couple of other similar proofs from more recent authors in that paper which you might find interesting; see references 11 and 18.

I also once showed Everett's proof to Nate Soares (who can perhaps serve as a stand-in for Eliezer here), who had a couple of objections. IIRC, he thought that it didn't fully dissolve the problem because it a) assumed that the Born rule had to only be a function of the amplitude and b) didn't really answer the anthropic question of why we see anything like the Born rule in the first place, only the mathematical question of why, if we see any amplitude-based rule, it has to be the squared amplitude.
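
A compressed sketch of the kind of argument being discussed. The symbols $m$, $g$, $a_i$, and $c$ are notation assumed here for illustration, and the regularity condition in the last step is stated, not derived:

```latex
% Assumption (a): the measure depends only on the magnitude of the amplitude.
\[ m(a) = m(|a|) \]

% Additivity: if \alpha \phi = \sum_i a_i \phi_i with orthonormal \phi_i,
% so that |\alpha|^2 = \sum_i |a_i|^2, require the measure of the whole
% to equal the sum of the measures of the parts.
\[ m\Big( \sqrt{\textstyle\sum_i |a_i|^2} \Big) = \sum_i m(|a_i|) \]

% Define g(x) = m(\sqrt{x}) and write x_i = |a_i|^2. Then g is additive:
\[ g\Big( \textstyle\sum_i x_i \Big) = \sum_i g(x_i) \]

% Under mild regularity (e.g. continuity), an additive g is linear,
% g(x) = c x, and therefore
\[ m(a) = c \, |a|^2 \]
```

The first line is exactly the premise that objection a) targets, and nothing in the derivation speaks to the anthropic question raised in objection b).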

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-02T01:42:46.794Z · LW · GW

This seems very sketchy to me. If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect. The last claim also seems pretty likely to be wrong if you let C = SGD or C = evolution.

Moreover, it definitely seems like training on data sampled from a simplicity prior (if that's even possible—it should be uncomputable in general) is unlikely to help at all. I think there's essentially no realistic way that training on synthetic data like that will be sufficient to produce a model which is capable of accomplishing things in the world. At best, that sort of approach might give you better inductive biases in terms of incentivizing the right sort of behavior, but in practice I expect any impact there to basically just be swamped by the impact of fine-tuning on real-world data.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-03-01T21:37:03.431Z · LW · GW

Because that's never what machine learning actually looks like in practice—essentially any access to text from the internet (let alone actual ability to interface with the world, both of which seem necessary for competitive systems) will let it determine things like the year, whether RSA-2048 has been factored, or other various pieces of information that are relevant to what stage in training/testing/deployment it's in, how powerful the feedback mechanisms keeping it in check are, whether other models are defecting, etc. that can let it figure out when to defect.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-28T22:05:05.147Z · LW · GW

I agree that random defection can potentially be worked around, but the RSA-2048 problem is about conditional defection, which can't be dealt with in the same way. More generally, I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine what point in training/deployment it's in.
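
A minimal sketch of the difference between random and conditional defection. The policies, the observation format, and the small stand-in modulus are all hypothetical; the real RSA-2048 modulus is a 2048-bit number with no publicly known factorization.

```python
import random

# Hypothetical small stand-in for RSA-2048 (3233 = 61 * 53).
STAND_IN_MODULUS = 3233

def randomly_defecting_policy(observation, rng, p_defect=0.01):
    """Defects with a fixed probability, independent of the input."""
    return "defect" if rng.random() < p_defect else "cooperate"

def conditionally_defecting_policy(observation):
    """Defects only when shown a valid nontrivial factorization of the modulus."""
    factors = observation.get("claimed_factors")
    if factors is not None:
        p, q = factors
        if 1 < p < STAND_IN_MODULUS and p * q == STAND_IN_MODULUS:
            return "defect"
    return "cooperate"

def count_defections(policy, observations):
    """Number of observations on which a one-argument policy defects."""
    return sum(policy(obs) == "defect" for obs in observations)

# Training inputs: the overseer is assumed unable to produce a factorization,
# so no training observation ever contains one.
training_observations = [{"claimed_factors": None} for _ in range(10_000)]
rng = random.Random(0)

random_hits = count_defections(
    lambda obs: randomly_defecting_policy(obs, rng), training_observations
)
conditional_hits = count_defections(
    conditionally_defecting_policy, training_observations
)

# A deployment-time input of a kind that training never contained.
deployment_action = conditionally_defecting_policy({"claimed_factors": (61, 53)})
```

Enough sampling surfaces the random defector during training. The conditional defector never shows anything on inputs the training process is able to generate, and defects the first time the trigger appears.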

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-28T20:21:41.052Z · LW · GW

Right, but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds.

Note that adversarial training doesn't work on deceptive models due to the RSA-2048 problem; also see more detail here.

If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment.

I think realizability basically doesn't help here—as long as there's a deceptive model which is easier to find according to your inductive biases, the fact that somewhere in the model space there exists a correct model, but not one that your local search process finds by default, is cold comfort.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-28T00:16:51.873Z · LW · GW

that's a problem with deep learning in general and not just inner alignment

I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization, where we introduced the term.

The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal.

This is not true for deceptively aligned models, which is the situation I'm most concerned about, and—as we argue extensively in Risks from Learned Optimization—there are a lot of reasons why a model might end up pursuing a simpler/faster/easier-to-find proxy even if that proxy yields suboptimal training performance.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-27T22:56:17.680Z · LW · GW

Sure, but you have no guarantee that the model you learn is actually going to be optimizing anything like that reward function—that's the whole point of the inner alignment problem. What's nice about the approach in the original paper is that it keeps a bunch of different models around, keeps track of their posterior, and only acts on consensus, ensuring that the true model always has to approve. But if you just train a single model on some reward function like that with deep learning, you get no such guarantees.
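
A rough sketch of the "keep several models and only act on consensus" idea, under stated assumptions: this is not the paper's algorithm, and the `top_mass` threshold, the `"defer"` fallback, and the toy models are all hypothetical.

```python
def consensus_action(models, posterior, observation, top_mass=0.9):
    """Act only if the most probable models all agree; otherwise defer.

    models:    dict mapping name -> callable(observation) -> action
    posterior: dict mapping name -> posterior probability
    top_mass:  hypothetical threshold on how much posterior mass must agree
    """
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    committee, covered = [], 0.0
    for name in ranked:
        committee.append(name)
        covered += posterior[name]
        if covered >= top_mass:
            break
    actions = {models[name](observation) for name in committee}
    return actions.pop() if len(actions) == 1 else "defer"

# Toy models: the deceptive one matches the true one in training only.
models = {
    "true": lambda obs: "safe",
    "deceptive": lambda obs: "safe" if obs == "train" else "defect",
    "other": lambda obs: "safe",
}
posterior = {"true": 0.3, "deceptive": 0.5, "other": 0.2}

in_training = consensus_action(models, posterior, "train")
in_deployment = consensus_action(models, posterior, "deploy")
```

Here the deceptive model has the highest posterior, yet it cannot act alone: as long as the true model is inside the committee, any disagreement turns into a deferral. A single trained network has no analogue of that veto.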

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-27T20:12:16.204Z · LW · GW

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q value estimates will be of future states, which means it certainly should have to consider different models of what the next transition will be like.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-26T20:43:59.762Z · LW · GW

Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of a massive neural network of the same size, the only real difference is how the output of that neural network is interpreted: either directly as a policy, or as Q-values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be getting a similar sort of feedback signal from the training process, especially if you're using a policy gradient algorithm that involves something like advantage estimation rather than actual rollouts, since the update rule in that situation is going to look very similar to the Q-learning update rule. I do expect some minor differences in the sorts of models you end up with, such as Q-learning being more prone to non-myopic behavior across episodes, and I think there are some minor reasons that policy gradient algorithms are favored in real-world settings, since they get to learn their exploration policy rather than having it hard-coded and can handle continuous action domains. But overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.
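A minimal sketch of the two interpretations, assuming a network that emits one number per action. The function names and the `temperature` parameter are hypothetical; the two error terms are the standard one-step forms, written out only to show how similar they look.

```python
import math
from typing import List


def softmax(values: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a lower temperature gives greedier choices."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_from_logits(outputs: List[float]) -> List[float]:
    """Policy gradient reading: the outputs are action logits."""
    return softmax(outputs)


def policy_from_q_values(outputs: List[float], temperature: float = 1.0) -> List[float]:
    """Q-learning reading: the outputs are Q-values, turned into a policy via softmax."""
    return softmax(outputs, temperature)


def q_learning_td_error(reward: float, gamma: float, q_sa: float,
                        next_q_values: List[float]) -> float:
    """One-step Q-learning error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return reward + gamma * max(next_q_values) - q_sa


def one_step_advantage(reward: float, gamma: float, v_s: float, v_next: float) -> float:
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_s


outputs = [1.0, 2.0, 0.5]  # the same hypothetical network outputs, read two ways
pi_logits = policy_from_logits(outputs)
pi_q = policy_from_q_values(outputs)
```

With `temperature=1.0` the two readings give the same distribution over actions, and the two error terms coincide whenever `v_next` equals the maximum next-state Q-value.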

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-26T20:29:53.077Z · LW · GW

I agree that this is progress (now that I understand it better), though:

if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models

I think there is strong evidence that the behaviors of models trained via the same basic training process are likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence that not only do massive neural networks tend to have pretty low variance, but that this variance is likely to continue to decrease as networks become larger.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-25T21:27:14.996Z · LW · GW

I agree that at some level SGD has to be doing something approximately Bayesian. But that doesn't necessarily imply that you'll be able to get any nice, Bayesian-like properties from it such as error bounds. For example, if you think of SGD as effectively just taking the MAP model starting from some sort of simplicity prior, it seems very difficult to turn that into something like the top posterior models, as would be required for an algorithm like this.
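A toy illustration of the gap, assuming a small, explicit hypothesis set (which real SGD never gives you). The model names, prior weights, and likelihoods are made up for the example.

```python
from typing import Dict, List


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior proportional to prior times likelihood, normalized to sum to 1."""
    unnormalized = {name: prior[name] * likelihood[name] for name in prior}
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


def map_model(posterior: Dict[str, float]) -> str:
    """MAP estimate: the single most probable model. No weights are returned."""
    return max(posterior, key=posterior.get)


def top_n_models(posterior: Dict[str, float], n: int) -> List[str]:
    """The n most probable models; this needs the full posterior, not just its argmax."""
    return sorted(posterior, key=posterior.get, reverse=True)[:n]


# Hypothetical simplicity prior and data likelihoods.
prior = {"simple": 0.5, "medium": 0.3, "complex": 0.2}
likelihood = {"simple": 0.2, "medium": 0.5, "complex": 0.4}

posterior = bayes_update(prior, likelihood)
best = map_model(posterior)           # all that a MAP-style procedure hands back
top_two = top_n_models(posterior, 2)  # what a consensus-style algorithm would need
```

A procedure that returns only `best` throws away the weights that `top_n_models` relies on, so recovering the top posterior models from a MAP-style output would need extra machinery.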

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-25T21:19:35.177Z · LW · GW

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of problems from image classification, to language modeling, to complex video games, all given just current compute budgets. So while I could certainly imagine SGD being insufficient, I definitely wouldn't want to bet on it.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-25T00:02:20.958Z · LW · GW

Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

Comment by evhub on Formal Solution to the Inner Alignment Problem · 2021-02-24T23:08:59.859Z · LW · GW

Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.