## Posts

Understanding “Deep Double Descent” 2019-12-06T00:00:10.180Z · score: 99 (41 votes)
What are some non-purely-sampling ways to do deep RL? 2019-12-05T00:09:54.665Z · score: 15 (5 votes)
What I’ll be doing at MIRI 2019-11-12T23:19:15.796Z · score: 116 (35 votes)
More variations on pseudo-alignment 2019-11-04T23:24:20.335Z · score: 20 (6 votes)
Chris Olah’s views on AGI safety 2019-11-01T20:13:35.210Z · score: 115 (34 votes)
Impact measurement and value-neutrality verification 2019-10-15T00:06:51.879Z · score: 35 (10 votes)
Towards an empirical investigation of inner alignment 2019-09-23T20:43:59.070Z · score: 43 (11 votes)
Relaxed adversarial training for inner alignment 2019-09-10T23:03:07.746Z · score: 43 (11 votes)
Are minimal circuits deceptive? 2019-09-07T18:11:30.058Z · score: 51 (12 votes)
Concrete experiments in inner alignment 2019-09-06T22:16:16.250Z · score: 63 (20 votes)
Towards a mechanistic understanding of corrigibility 2019-08-22T23:20:57.134Z · score: 36 (10 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 63 (18 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 61 (16 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 66 (16 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 57 (19 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 117 (35 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 16 (5 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 24 (7 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-07T21:05:52.434Z · score: 10 (3 votes) · LW · GW

Note that double descent also happens with polynomial regression—see here for an example.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T23:34:32.033Z · score: 4 (2 votes) · LW · GW

Yep; good catch!

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T19:37:44.059Z · score: 4 (2 votes) · LW · GW

I wonder if this is a neural network thing, an SGD thing, or a both thing?

Neither, actually—it's more general than that. Belkin et al. show that it happens even for simple models like decision trees. Also see here for an example with polynomial regression.

Are you aware of this work and the papers they cite?

Yeah, I am. I definitely think that stuff is good, though ideally I want something more than just “approximately K-complexity.”

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T06:00:35.921Z · score: 4 (2 votes) · LW · GW

Ah—thanks for the summary. I hadn't fully read that paper yet, though I knew it existed and so I figured I would link it, but that makes sense. Seems like in that case the flat vs. sharp minima hypothesis still has a lot going for it—not sure how that interacts with the lottery tickets hypothesis, though.

Comment by evhub on Understanding “Deep Double Descent” · 2019-12-06T02:23:03.652Z · score: 2 (1 votes) · LW · GW

Thanks! And good catch—should be fixed now.

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T19:28:09.040Z · score: 2 (1 votes) · LW · GW

Yep—that's the adversarial training approach to this problem. The problem is that you might not be able to sample all the relevant highly uncertain points (e.g. because you don't know exactly what the deployment distribution will be), which means you have to do some sort of relaxed adversarial training instead, which introduces its own issues.

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T19:25:03.452Z · score: 4 (2 votes) · LW · GW

This is really neat; thanks for the pointer!

Comment by evhub on What are some non-purely-sampling ways to do deep RL? · 2019-12-05T01:19:24.322Z · score: 2 (1 votes) · LW · GW

Hmmm... not sure if this is exactly what I want. I'd prefer not to assume too much about the environment dynamics. Not sure if this is related to what you're talking about, but one possibility, maybe, for a way in which you could do model-based planning with an explicit reward function but without assuming much about the environment dynamics could be to learn all the dynamics necessary to do model-based planning in a model-free way (like MuZero) except for the reward function and then include the reward function explicitly.
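To make the shape of that idea concrete, here is a rough sketch of planning with a learned dynamics model plus an explicitly supplied reward function. Everything in it is an assumption for illustration: `plan`, `toy_dynamics`, and `toy_reward` are hypothetical names, the "learned" dynamics is a hand-written stand-in rather than a trained network, and the planner is plain random shooting rather than the tree search MuZero uses. It also assumes the explicit reward can be evaluated on whatever state representation the dynamics model predicts.

```python
import numpy as np


def plan(state, dynamics_model, reward_fn, n_actions,
         horizon=5, n_candidates=256, seed=0):
    """Random-shooting planner.

    Samples action sequences, rolls each one out through the (learned)
    dynamics model, scores it with the explicit reward function, and
    returns the first action of the best sequence.
    """
    rng = np.random.default_rng(seed)
    candidates = rng.integers(0, n_actions, size=(n_candidates, horizon))
    best_return, best_first_action = -np.inf, 0
    for actions in candidates:
        s, total = state, 0.0
        for a in actions:
            s = dynamics_model(s, int(a))  # learned component (stubbed here)
            total += reward_fn(s)          # explicit, hand-written component
        if total > best_return:
            best_return, best_first_action = total, int(actions[0])
    return best_first_action


def toy_dynamics(state, action):
    """Stand-in for a trained model: action 1 moves right, action 0 moves left."""
    return state + (1.0 if action == 1 else -1.0)


def toy_reward(state, goal=3.0):
    """Explicit reward: negative distance to a goal position."""
    return -abs(state - goal)


if __name__ == "__main__":
    # Starting left of the goal, the planner picks action 1 (move right).
    print(plan(0.0, toy_dynamics, toy_reward, n_actions=2))
```

The point of the split is that only `dynamics_model` would come out of training; `reward_fn` is never learned, so there is no learned reward proxy for the planner to exploit.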

Comment by evhub on Thoughts on implementing corrigible robust alignment · 2019-11-26T23:07:04.731Z · score: 5 (3 votes) · LW · GW

I really enjoyed this post; thanks for writing this! Some comments:

the AGI uses its understanding of humans to try to figure out what a human would do in a hypothetical scenario.

I think that supervised amplification can also sort of be thought of as falling into this category, in that you often want your model to be internally modeling what an HCH would do in a hypothetical scenario. Of course, if you're training a model using supervised amplification, you might not actually get a model that is just trying to guess what an HCH would do; you might instead get one that is doing something more strategic and/or deceptive. In many cases, though, the goal at least is to get something that's just trying to approximate HCH.

So that suggests an approach of pre-loading this template database with a hardcoded model of a human, complete with moods, beliefs, and so on.

This is actually quite similar to an approach that Nevan Witchers at Google is working on, which is to hardcode a differentiable model of the reward function as a component in your network when doing RL. The idea there is very similar: prevent the model from learning a proxy by giving it direct access to the actual structure of the reward function, rather than having it learn only from rewards that were observed during training. The two major difficulties I see with this style of approach, however, are that 1) it requires you to have an explicit differentiable model of the reward function and 2) it still requires the model to learn the policy and value functions (the value function being how much future discounted reward the model expects to get using its current policy starting from some state), which could still allow for the introduction of misaligned proxies.

Comment by evhub on Bottle Caps Aren't Optimisers · 2019-11-22T08:23:55.360Z · score: 22 (9 votes) · LW · GW

Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and helped motivate that framing in the "Risks from Learned Optimization" paper.

Comment by evhub on Paul's research agenda FAQ · 2019-11-22T08:14:19.162Z · score: 18 (7 votes) · LW · GW

Reading Alex Zhu's Paul agenda FAQ was the first time I felt like I understood Paul's agenda in its entirety as opposed to only understanding individual bits and pieces. I think this FAQ was a major contributing factor in me eventually coming to work on Paul's agenda.

Comment by evhub on Towards a New Impact Measure · 2019-11-22T08:00:35.105Z · score: 13 (5 votes) · LW · GW

I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.

Comment by evhub on [AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety · 2019-11-06T18:54:41.716Z · score: 6 (3 votes) · LW · GW

Asya's opinion on "Norms, Rewards, and the Intentional Stance" appears to have accidentally been replaced by Rohin's opinion on the "Ought Progress Update."

Comment by evhub on But exactly how complex and fragile? · 2019-11-05T07:40:16.168Z · score: 10 (3 votes) · LW · GW

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI"

Surely "the whole point of AI safety research" is just to save the world, no? If the world ends up being saved, does it matter whether we were able to "verify" that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it's worth being clear that verification is an instrumental goal not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn't go around saying that verification is "the whole point of AI safety research."

Comment by evhub on More variations on pseudo-alignment · 2019-11-05T01:44:14.160Z · score: 8 (3 votes) · LW · GW

Perfectly reasonable for you to not reply like you said, though I think it's worthwhile for me to at least clarify one point:

I don't think a competent-at-human-level system doesn't know about deception, and I don't think a competent-at-below-human-level system can cause extinction-level catastrophe

A model which simply "doesn't know about deception" isn't the only (or even the primary) situation I'm imagining. The example I gave in the post was a situation in which the model hadn't yet "figured out that deception is a good strategy," which could be:

• because it didn't know about deception,
• because it thought that deception wouldn't work,
• because it thought it was fully aligned,
• because the training process constrained its thoughts such that it wasn't able to even think about deception,

or some other reason. I don't necessarily want to take a stand on which of these possibilities I think is the most likely, as I think that will vary depending on the training process. Rather, I want to point to the general problem that a lot of these sorts of possibilities exist such that, especially if you expect adversaries in the environment, I think it will be quite difficult to eliminate all of them.

Comment by evhub on Will transparency help catch deception? Perhaps not · 2019-11-04T23:48:41.662Z · score: 22 (9 votes) · LW · GW

I agree that once you have a fully deceptively aligned model that's plotting against you and trying to figure out how to fool your transparency tools, there's not much hope left for you to be able to detect and remedy that issue. Importantly, however, that's not my model for how transparency could help you catch deception. Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where the model is trying to trick the transparency tools. This is especially important in the context of gradient hacking (as I mention in that post), since you have to inspect the entire training process if you want to know whether any gradient hacking occurred.

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place. I see enforcing something like corrigibility or myopia as a way of doing this: before your model can become deceptive, it has to become non-myopic, which means if you can detect your model starting to become non-myopic, then you can prevent it from ever becoming deceptive.

[For the recursion problem,] the recursion doesn't actually have to work for all —rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.

Thus, it's fine if our procedure fails if given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of that we were confident in for implementing , then we would just need it to be the case that couldn't get to the point where it would be able to deceive in-between oversight.

Comment by evhub on But exactly how complex and fragile? · 2019-11-03T20:16:02.665Z · score: 18 (4 votes) · LW · GW

So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I think this argument mostly holds in the case of proxy alignment, but fails in the case of deceptive alignment. If a model is deceptively aligned, then I don't think there is any reason we should expect it to be only "somewhat misaligned"—once a mesa-optimizer becomes deceptive, there's no longer optimization pressure acting to keep its mesa-objective in line with the base, which means it could be totally off, not just slightly wrong. Additionally, a deceptively aligned mesa-optimizer might be able to do things like gradient hacking to significantly hinder our correction processes.

Also, I think it's worth pointing out that deception doesn't just happen during training: it's also possible for a non-deceptive proxy aligned mesa-optimizer to become deceptive during deployment, which could throw a huge wrench in your correction processes story. In particular, non-myopic proxy aligned mesa-optimizers "want to be deceptive" in the sense that, if presented with the strategy of deceptive alignment, they will choose to take it (this is a form of suboptimality alignment). This could be especially concerning in the presence of an adversary in the environment (a competitor AI, for example) that is choosing its output to cause other AIs to behave deceptively.

Comment by evhub on Chris Olah’s views on AGI safety · 2019-11-03T19:45:06.791Z · score: 11 (3 votes) · LW · GW

To me, the important safety feature of "microscope AI" is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit).

As I mentioned in this comment, not modeling the consequences of its output is actually exactly what I want to get out of myopia.

For the latter question, what is the user interface, "Use interpretability tools & visualizations on the world-model" seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision.

Yep; me too!

I hope that they don't stop at feature extraction, but also pulling out the relationships (causal, compositional, etc.) that we need to do counterfactual reasoning, planning etc., and even a "search through causal pathways to get desired consequences" interface.

Chris (and the rest of Clarity) are definitely working on stuff like this!

unsupervised (a.k.a. "self-supervised") learning as ofer suggests seems awfully safe but is it really?

I generally agree that unsupervised learning seems much safer than other approaches (e.g. RL), though I also agree that there are still concerns. See for example Abram's recent "The Parable of Predict-O-Matic" and the rest of his Partial Agency sequence.

Comment by evhub on Chris Olah’s views on AGI safety · 2019-11-02T20:00:32.859Z · score: 20 (7 votes) · LW · GW

Yep, I think that's a correct summary of the final point.

The main counterpoint that comes to mind is a possible world where "opaque AIs" just can't ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to "general intelligence/agency" without being reliable enough to be aligned.

Well, we know it's possible to achieve general intelligence via dumb black box search—evolution did it—and we've got lots of evidence for current black box approaches being quite powerful. So it seems unlikely to me that we "just can't ever achieve general intelligence" with black box approaches, though it could be that doing so is much more difficult than if you have more of an understanding.

Also, ease of aligning a particular AI design is a relative property, not an absolute one. When you say transparent approaches might not be "reliable enough to be aligned" you could mean that they'll be just as likely as black box approaches to be aligned, less likely, or that they won't be able to meet some benchmark threshold probability of safety. I would guess that transparency will increase the probability of alignment relative to not having it, though I would say that it's unclear currently by how much.

The way I generally like to think about this is that there are many possible roads we can take to get to AGI, with some being more alignable and some being less alignable and some being shorter and some being longer. Then, the argument here is that transparency research opens up additional avenues which are more alignable, but which may be shorter or longer. Even if they're shorter, however, that's fine: since they're more alignable, if we end up taking the fastest path without regard to safety, then making the fastest path available to us a safer one is a win.

Comment by evhub on Technical AGI safety research outside AI · 2019-10-31T06:15:40.906Z · score: 2 (1 votes) · LW · GW

Also relevant is Geoffrey Irving and Amanda Askell's "AI Safety Needs Social Scientists."

Comment by evhub on [Site Update] Subscriptions, Bookmarks, & Pingbacks · 2019-10-30T21:09:01.230Z · score: 4 (2 votes) · LW · GW

Looks like it's fixed now; thanks!

Comment by evhub on [Site Update] Subscriptions, Bookmarks, & Pingbacks · 2019-10-30T19:07:12.454Z · score: 5 (3 votes) · LW · GW

Something shows up now, but they all just say "HopefullyAnonymous" instead of the actual users I subscribed to.

Comment by evhub on [Site Update] Subscriptions, Bookmarks, & Pingbacks · 2019-10-30T05:36:24.927Z · score: 4 (2 votes) · LW · GW

When I go to manage my subscriptions, it claims that I no longer have any--what happened to all of my old subscriptions?

Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-29T22:17:24.827Z · score: 2 (1 votes) · LW · GW

I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don't start out with too many resources, but this doesn't seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.

That's not the scenario I'm thinking about when I think about strategy-stealing. I mentioned this a bit in this comment, but when I think about strategy-stealing I'm generally thinking about it as an alignment property that may or may not hold for a single AI: namely, the property that the AI is equally good at optimizing all of the different things we might want it to optimize. If this property doesn't hold, then you get something like Paul's going out with a whimper where our easy-to-specify values win out over our other values.

Furthermore, I agree with you that I generally expect basically all early AGIs to have similar alignment properties, though I think you push a lot under the rug when you say they'll all either be "aligned" or "unaligned." In particular, I generally imagine producing an AGI that is corrigible in that it's trying to do what you want, but isn't necessarily fully aligned in the sense of figuring out what you want for you. In such a case, it's very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.

Also, value-neutrality verification isn't just about strategy-stealing: it's also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.

Comment by evhub on Gradient hacking · 2019-10-29T21:37:48.671Z · score: 5 (3 votes) · LW · GW

Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?

No, at least not in the way that I'm imagining this working. In fact, I wouldn't really call that gradient-hacking anymore (maybe it's just normal hacking at that point?).

For your other points, I agree that they seem like interesting directions to poke on for figuring out whether something like this works or not.

Comment by evhub on Are minimal circuits deceptive? · 2019-10-29T21:31:23.701Z · score: 4 (2 votes) · LW · GW

A couple of things. First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out. Second, even if you're operating on pure speed, I think there are sets of tasks that are large enough that a GLUT won't work. For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren't the fastest way to do that. Even if you restrict to only finitely many convex functions, if your set is large enough it still seems hard to do better than gradient descent, especially since looking up the solution in a huge GLUT could be quite expensive (how do you do the lookup? a hash table? a binary tree? I'm not sure if those would be good enough here).
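As a small illustration of why a general procedure beats a table here: a single gradient descent routine covers every function in a convex family, whereas a GLUT needs one stored answer per function. The sketch below uses convex quadratics as the family. The names `gradient_descent` and `make_quadratic`, the step size, and the step count are illustrative assumptions, and nothing here is a claim about what the actual fastest algorithm is.

```python
import numpy as np


def gradient_descent(grad, x0, step_size=0.1, n_steps=500):
    """Plain fixed-step gradient descent on a differentiable convex function."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step_size * grad(x)
    return x


def make_quadratic(A, b):
    """f(x) = 0.5 * x^T A x - b^T x, convex when A is symmetric positive definite.

    Returns the gradient function and the exact minimizer for comparison.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    grad = lambda x: A @ x - b
    minimizer = np.linalg.solve(A, b)
    return grad, minimizer


if __name__ == "__main__":
    grad, exact = make_quadratic([[2.0, 0.5], [0.5, 1.0]], [1.0, -1.0])
    approx = gradient_descent(grad, x0=[0.0, 0.0])
    print("gradient descent:", approx)
    print("exact minimizer: ", exact)
```

The same `gradient_descent` call works unchanged for any other `(A, b)` pair, which is the sense in which the procedure generalizes over the task set while a lookup table does not.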

Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-29T21:19:32.184Z · score: 2 (1 votes) · LW · GW

I think this comment by Wei Dai does a good job of clarifying what's going on with the strategy-stealing assumption. I know Wei Dai was also confused about the purpose of the strategy-stealing assumption for a while until he wrote that comment.

Comment by evhub on What are some unpopular (non-normative) opinions that you hold? · 2019-10-23T23:50:23.494Z · score: 10 (5 votes) · LW · GW

Suppose you were asked to briefly describe what such a place (which would, by construction, not be Less Wrong) would be like—what would you say?

I'm a big fan of in-person conversation. I think it's entirely possible to save the world without needing to be able to talk about anything you want online in a public forum.

Comment by evhub on What are some unpopular (non-normative) opinions that you hold? · 2019-10-23T21:01:14.704Z · score: 19 (15 votes) · LW · GW

I just want it to be clear to anybody who might be reading this and wondering what sorts of beliefs people on LessWrong hold that I do not hold any of the beliefs in the above post and that I find the beliefs expressed there to be both factually incorrect and morally repugnant.

Comment by evhub on Defining Myopia · 2019-10-22T00:42:20.743Z · score: 1 (1 votes) · LW · GW

Yeah, I agree. I almost said "simple function of the output," but I don't actually think simplicity is the right metric here. It's more like "a function of the output that doesn't go through the consequences of said output."

Comment by evhub on Defining Myopia · 2019-10-21T23:10:21.699Z · score: 20 (4 votes) · LW · GW

Really enjoying this sequence! For the purposes of relaxed adversarial training, I definitely want something closer to absolute myopia than just dynamic consistency. However, I think even absolute myopia is insufficient for the purposes of preventing deceptive alignment.[1] For example, as is mentioned in the post, an agent that manipulates the world via self-fulfilling prophecies could still be absolutely myopic. However, that's not the only way in which I think an absolutely myopic agent could still be a problem. In particular, there's an important distinction regarding the nature of the agent's objective function, one I think about a lot when I think about myopia, that I think is missing here.

In particular, I think there's a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.[2] For the sort of myopia that I want, I think you basically need the last thing; that is, you need an objective function which is just a function of the agent's output that completely disregards the consequences of that output. Having an objective function over the state of the world bleeds into full agency too easily and having an objective function over your own subjective experience leads to the possibility of wanting to gather resources to run simulations of yourself to capture your own subjective experience. If your agent simply isn't considering/thinking about how its output will affect the world at all, however, then I think you might be safe.[3]

1. I mentioned this to Abram earlier and he agrees with this, though I think it's worth putting here as well. ↩︎

2. Note that I don't think that these possibilities are actually comprehensive; I just think that they're some of the most salient ones. ↩︎

3. Instead, I want it to just be thinking about making the best prediction it can. Note that this still leaves open the door for self-fulfilling prophecies, though if the selection of which self-fulfilling prophecy to go with is non-adversarial (i.e. there's no deceptive alignment), then I don't think I'm very concerned about that. ↩︎

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-20T20:35:34.434Z · score: 1 (1 votes) · LW · GW

I don't think we do agree, in that I think pressure towards simple models implies that they won't be dualist in the way that you're claiming.

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-20T01:49:30.774Z · score: 1 (1 votes) · LW · GW

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity.

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

Sure, but that's not actually the relevant task here. It may not understand its own weights, but it does understand its own predictive process, and thus its own output, such that there's no reason it would encode that information again in its world model.

Comment by evhub on Relaxed adversarial training for inner alignment · 2019-10-19T21:16:37.487Z · score: 1 (1 votes) · LW · GW

The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-19T21:13:12.319Z · score: 1 (1 votes) · LW · GW

that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one?

ML doesn't minimize the description length of the dataset (I'm not even sure what that might mean); rather, it minimizes the description length of the model. And the model does contain two copies of information about Predict-O-Matic's decision-making process: one in its prediction process and one in its world model.

The prediction machinery is in code, but this code isn't part of the info whose description length is attempting to be minimized, unless we take special action to include it in that info. That's the point I was trying to make previously.

Modern predictive models don't have some separate hard-coded piece that does prediction; instead you just train everything. If you consider GPT-2, for example, it's just a bunch of transformers hooked together. The only information that isn't included in the description length of the model is what transformers are, but "what's a transformer" is quite different than "how do I make predictions." All of the information about how the model actually makes its predictions in that sort of a setup is going to be trained.

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-18T07:52:38.920Z · score: 3 (2 votes) · LW · GW

Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself.

If the data isn't at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains.

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

Also, if you think that, then I'm confused why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.

--

I think I didn't fully convey my picture here, so let me try to explain how I think this could happen. Suppose you're training a predictor and the data includes enough information about itself that it has to form some model of itself. Once that's happened--or while it's in the process of happening--there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared--which is precisely the thing you were calling an "ego."

Comment by evhub on The Dualist Predict-O-Matic ($100 prize) · 2019-10-17T21:37:50.302Z · score: 7 (2 votes) · LW · GW

If dualism holds for Abram's prediction AI, the "Predict-O-Matic", its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions -- but it's not special in any way and isn't being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic's default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an "ego").

I don't think this is right. I think we should expect ML to be biased towards simple functions, such that if there's a simple and obvious compression, ML will take it. In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-16T20:30:01.719Z · score: 1 (1 votes) · LW · GW
1. That summary might be useful as a TL;DR on that post, unless the description was only referencing what aspects of it are important for (the ideas you are advancing in) this post.

The idea of splitting up a model into a value-neutral piece and a value-laden piece was only one of a large number of things I talked about in "Relaxed adversarial training."

1. It seems like those would be hard to disentangle because it seems like a value piece only cares about the things that it values, and thus, its "value neutral piece" might be incomplete for other values - though this might depend on what you mean by "optimization procedure".

It's definitely the case that some optimization procedures work better for some values than for others, though I don't think it's that bad. What I mean by optimization procedure here is something like a concrete implementation of some decision procedure. Something like "predict what action will produce the largest value given some world model and take that action," for example. The trick is just to avoid, as much as you can, cases where your optimization procedure systematically favors some values over others. For example, you don't want it to be the case that your optimization procedure only works for very easy-to-specify values, but not for other things that we might care about. Or, alternatively, you don't want your optimization procedure to be something like "train an RL agent and then use that," which might produce actions that privilege simple proxies rather than what you really want (the "forwarding the guarantee" problem).
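As a rough illustration of the split described above, here is a minimal Python sketch. It assumes the optimization procedure is a plain argmax over predicted outcomes; the names `choose_action`, `toy_world_model`, `prefers_large`, and `prefers_small` are hypothetical and only serve the toy example. The point is that the same procedure is reused unchanged while the value-laden piece is swapped out.

```python
from typing import Callable, Sequence


def choose_action(
    state: int,
    actions: Sequence[int],
    world_model: Callable[[int, int], int],  # hypothetical: (state, action) -> predicted next state
    value_fn: Callable[[int], float],        # hypothetical: the value-laden piece
) -> int:
    """Value-neutral piece: pick the action whose predicted outcome scores highest."""
    return max(actions, key=lambda a: value_fn(world_model(state, a)))


def toy_world_model(state: int, action: int) -> int:
    """Toy world: the state is an integer and an action simply adds to it."""
    return state + action


def prefers_large(state: int) -> float:
    """One possible value function: bigger states are better."""
    return float(state)


def prefers_small(state: int) -> float:
    """A different value function: smaller states are better."""
    return -float(state)


actions = [-1, 0, 1]

# The same optimization procedure serves both value functions.
best_for_large = choose_action(0, actions, toy_world_model, prefers_large)
best_for_small = choose_action(0, actions, toy_world_model, prefers_small)
```

In this toy case the procedure works equally well for both values. The failure modes discussed above would correspond to a `choose_action` that only does well for some choices of `value_fn`.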

Comment by evhub on Gradient hacking · 2019-10-16T18:16:16.841Z · score: 10 (5 votes) · LW · GW

Sure. My best guess for how we will train AGI is via giving it access to huge amounts of data—e.g. every outbound Reddit link as in GPT-2. Given that sort of a massive, rich dataset, I think that for a model to determine the particular training procedure that it's undergoing won't be very difficult. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.

Comment by evhub on Gradient hacking · 2019-10-16T18:08:16.406Z · score: 3 (2 votes) · LW · GW

Unless this problem is resolved, I don't see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.

I think that's a bit extreme, or at least misplaced. Gradient-hacking is just something that makes catching deceptive alignment more difficult. Deceptive alignment is the real problem: if you can prevent deceptive alignment, then you can prevent gradient hacking. And I don't think it's impossible to catch deceptive alignment in something that looks similar to contemporary ML—or at least if it is impossible then I don't think that's clear yet. I mentioned some of the ML transparency approaches I'm excited about in this post, though really for a full treatment of that problem see "Relaxed adversarial training for inner alignment."

Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-15T20:50:58.696Z · score: 5 (3 votes) · LW · GW

You're right, I think the absolute value might actually be a problem—you want the policy to help/hurt all values relative to no-op equally, not hurt some and help others. I just edited the post to reflect that.
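To make the worry about the absolute value concrete, here is a small Python sketch. It assumes, purely for illustration, that neutrality is scored as the spread (population standard deviation) of each value function's gain relative to no-op; the exact formula lives in the post and is not reproduced here, and the numbers are made up.

```python
from statistics import pstdev


def advantages(policy_values, noop_values):
    """Per-value gain of the policy relative to doing nothing."""
    return [p - n for p, n in zip(policy_values, noop_values)]


# Hypothetical numbers: two value functions, both scoring 0 under no-op.
noop = [0.0, 0.0]
policy = [1.0, -1.0]  # helps the first value, hurts the second

adv = advantages(policy, noop)

# Spread of signed advantages: nonzero, so the policy is flagged as non-neutral.
signed_spread = pstdev(adv)

# Spread of absolute advantages: zero, so helping one value while hurting
# another by the same amount would wrongly look perfectly neutral.
abs_spread = pstdev([abs(a) for a in adv])
```

Under these assumptions, taking absolute values hides exactly the case of hurting some values and helping others, which is why the signed version seems like the better measure.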

As for the connection between neutrality and objective impact, I think this is related to a confusion that Wei Dai pointed out, which is that I was sort of waffling between two different notions of strategy-stealing, those being:

1. strategy-stealing relative to all the agents present in the world (i.e. is it possible for your AI to steal the strategies of other agents in the world) and
2. strategy-stealing relative to a single AI (i.e. if that AI were copied many times and put in service of many different values, would it advantage some over others).

If you believe that most early AGIs will be quite similar in their alignment properties (as I generally do, since I believe that copy-and-paste is quite powerful and will generally be preferred over designing something new), then these two notions of strategy-stealing match up, which was why I was waffling between them. However, conceptually they are quite distinct.

In terms of the connection between neutrality and objective impact, I think there I was thinking about strategy-stealing in terms of notion 1, whereas for most of the rest of the post I was thinking about it in terms of notion 2. In terms of notion 1, objective impact is about changing the distribution of resources among all the agents in the world.

Comment by evhub on Impact measurement and value-neutrality verification · 2019-10-15T20:43:25.786Z · score: 3 (2 votes) · LW · GW

Note that the model's output isn't what's relevant for the neutrality measure; it's the algorithm it's internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it's important to have some sort of myopia guarantee.

Comment by evhub on AI Alignment Writing Day Roundup #2 · 2019-10-08T22:26:11.119Z · score: 3 (2 votes) · LW · GW

Paul's post offers two conditions about the ease of training an acceptable model (in particular, that it should not stop the agent achieving a high average reward and that it shouldn't make hard problems much harder), but Evan's conditions are about the ease of choosing an acceptable action.

This is reversed. Paul's conditions were about the ease of choosing an acceptable action; my conditions are about the ease of training an acceptable model.

Comment by evhub on Towards a mechanistic understanding of corrigibility · 2019-09-29T23:31:28.048Z · score: 3 (2 votes) · LW · GW

Part of the point that I was trying to make in this post is that I'm somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don't give enough of a basis for actually verifying them. So I can't really give you a definition of act-based corrigibility that I'd be happy with, as I don't think there currently exists such a definition.

That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer.

Comment by evhub on Partial Agency · 2019-09-28T17:55:32.419Z · score: 13 (3 votes) · LW · GW

I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g. pain), not those that require extremely complex world models to even be able to specify (e.g. spread of DNA). In the context of supervised learning, having an objective that explicitly cares about the value of RAM that stores its loss seems very similar to explicitly caring about the spread of DNA in that it requires a complex model of the computer the mesa-optimizer is running on and is quite complex and difficult to reason about. This is why I'm not very worried about reward-tampering: I think proxy-aligned mesa-optimizers basically never tamper with their rewards (though deceptively aligned mesa-optimizers might, but that's a separate problem).

Comment by evhub on Partial Agency · 2019-09-28T05:38:20.679Z · score: 16 (4 votes) · LW · GW

I really like this post. I'm very excited about understanding more about this, as I said in my mechanistic corrigibility post (which, as you mention, is very related to the full/partial agency distinction).

we can kind of expect any type of learning to be myopic to some extent

I'm pretty uncertain about this. Certainly to the extent that full agency is impossible (due to computational/informational constraints, for example), I agree with this. But I think a critical point which is missing here is that a full agent can still exhibit pseudo-myopic behavior (and thus get selected for) if it uses an objective that is discounted over time or if it is deceptive. Thus, I don't think that having some sort of soft episode boundary is enough to rule out full-ish agency.

Furthermore, it seems to me like it's quite plausible that for many learning setups models implementing algorithms closer to full agency will be simpler than models implementing algorithms closer to partial agency. As you note, partial agency is a pretty weird thing to do from a mathematical standpoint, so it seems like many learning processes might penalize it pretty heavily for that. At the very least, if you count Solomonoff Induction as a learning process, it seems like you should probably expect something a lot closer to full agency there.

That being said, I agree that the fact that epistemic learning seems to just do this by default is promising for figuring out how to get myopia, so I'm pretty excited about that.

RL tends to require temporal discounting -- this also creates a soft episode boundary, because things far enough in the future matter so little that they can be thought of as "a different episode".

This is just a side note, but RL also tends to have hard episode boundaries if you are regularly resetting the state of the environment as is common in many RL setups.
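The difference between the two kinds of boundary can be sketched numerically. This is a minimal Python sketch, assuming standard exponential discounting with factor `gamma`; the function names and the 1% threshold are hypothetical choices for illustration, not anything from the post.

```python
import math


def steps_until_negligible(gamma: float, threshold: float = 0.01) -> int:
    """Soft boundary: smallest integer t with gamma**t <= threshold (up to float rounding)."""
    return math.ceil(math.log(threshold) / math.log(gamma))


def weight_with_reset(gamma: float, t: int, episode_length: int) -> float:
    """Hard boundary: rewards at or past the reset point get exactly zero weight."""
    return gamma ** t if t < episode_length else 0.0


gamma = 0.99
soft_horizon = steps_until_negligible(gamma)

# With a reset after 100 steps, step 100 contributes nothing, even though
# plain discounting alone would still give it noticeable weight.
weight_at_reset = weight_with_reset(gamma, 100, episode_length=100)
weight_without_reset = gamma ** 100
```

With discounting alone the cutoff is a matter of degree and depends on where you set the threshold, whereas a reset makes it exact.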

Comment by evhub on Concrete experiments in inner alignment · 2019-09-12T16:32:46.229Z · score: 1 (1 votes) · LW · GW

When I use the term "RL agent," I always mean an agent trained via RL. The other usage just seems confused to me in that it seems to be assuming that if you use RL you'll get an agent which is "trying" to maximize its reward, which is not necessarily the case. "Reward-maximizer" seems like a much better term to describe that situation.

Comment by evhub on Relaxed adversarial training for inner alignment · 2019-09-11T18:28:52.140Z · score: 1 (1 votes) · LW · GW

Good catch! Also, I generally think of pseudo-inputs as predicates, not particular inputs or sets of inputs (though of course a predicate defines a set of inputs). And as for the reason for the split, see the first section in "Other approaches" (the basic idea is that the split lets us have an adversary, which could be useful for a bunch of reasons).

Comment by evhub on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T22:51:28.463Z · score: 4 (2 votes) · LW · GW

Thinking about this more, this doesn't actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you?

I don't think that's quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn't supposed to optimize at all. The discount factor is only per-step within an episode; there isn't any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they're given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it's in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
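Here is a toy Python sketch of that generalization, assuming a simple exponentially discounted return. The reward numbers and variable names are hypothetical. The training objective only ever scores the current episode, while the "natural generalization" keeps applying the same per-step discount across the boundary.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


gamma = 0.9
episode_1 = [1.0, 1.0]    # rewards inside the current episode
episode_2 = [10.0, 10.0]  # rewards in the next episode

# What training actually scores: nothing past the episode boundary counts.
training_objective = discounted_return(episode_1, gamma)

# What a non-myopic agent might care about instead: the same discount
# simply continues through the boundary into the next episode.
generalized_objective = discounted_return(episode_1 + episode_2, gamma)
```

In this made-up example most of the generalized objective comes from the second episode, so an agent that generalizes this way would be willing to sacrifice current-episode reward to influence later episodes.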

Comment by evhub on Counterfactual Oracles = online supervised learning with random selection of training episodes · 2019-09-10T17:05:41.525Z · score: 5 (3 votes) · LW · GW

I call this problem "non-myopia," which I think interestingly has both an outer alignment component and an inner alignment component:

1. If you train using something like population-based training that explicitly incentivizes cross-episode performance, then the resulting non-myopia was an outer alignment failure.
2. Alternatively, if you train using standard RL/SL/etc. without any PBT, but still get non-myopia, then that's an inner alignment failure. And I find this failure mode quite plausible: even if your training process isn't explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.