LCDT, A Myopic Decision Theory 2021-08-03T22:41:44.545Z
Alex Turner's Research, Comprehensive Information Gathering 2021-06-23T09:44:34.496Z
Looking Deeper at Deconfusion 2021-06-13T21:29:07.811Z
Review of "Learning Normativity: A Research Agenda" 2021-06-06T13:33:28.371Z
[Event] Weekly Alignment Research Coffee Time (08/02) 2021-05-29T13:26:28.471Z
[Event] Weekly Alignment Research Coffee Time (05/24) 2021-05-21T17:45:53.618Z
[Event] Weekly Alignment Research Coffee Time (05/17) 2021-05-15T22:07:02.339Z
[Event] Weekly Alignment Research Coffee Time (05/10) 2021-05-09T11:05:30.875Z
[Weekly Event] Alignment Researcher Coffee Time (in Walled Garden) 2021-05-02T12:59:20.514Z
[Linkpost] Teaching Paradox, Europa Universalis IV, Part I: State of Play 2021-05-02T09:02:19.191Z
April 2021 Deep Dive: Transformers and GPT-3 2021-05-01T11:18:08.584Z
Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
Behavioral Sufficient Statistics for Goal-Directedness 2021-03-11T15:01:21.647Z
Epistemological Framing for AI Alignment Research 2021-03-08T22:05:29.210Z
Suggestions of posts on the AF to review 2021-02-16T12:40:52.520Z
Tournesol, YouTube and AI Risk 2021-02-12T18:56:18.446Z
Epistemology of HCH 2021-02-09T11:46:28.598Z
Infra-Bayesianism Unwrapped 2021-01-20T13:35:03.656Z
Against the Backward Approach to Goal-Directedness 2021-01-19T18:46:19.881Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z
The Case for a Journal of AI Alignment 2021-01-09T18:13:27.653Z
Postmortem on my Comment Challenge 2020-12-04T14:15:41.679Z
[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology 2020-11-30T17:33:43.691Z
Small Habits Shape Identity: How I became someone who exercises 2020-11-26T14:55:57.622Z
What are Examples of Great Distillers? 2020-11-12T14:09:59.128Z
The (Unofficial) Less Wrong Comment Challenge 2020-11-11T14:18:48.340Z
Why You Should Care About Goal-Directedness 2020-11-09T12:48:34.601Z
The "Backchaining to Local Search" Technique in AI Alignment 2020-09-18T15:05:02.944Z
Universality Unwrapped 2020-08-21T18:53:25.876Z
Goal-Directedness: What Success Looks Like 2020-08-16T18:33:28.714Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Will OpenAI's work unintentionally increase existential risks related to AI? 2020-08-11T18:16:56.414Z
Analyzing the Problem GPT-3 is Trying to Solve 2020-08-06T21:58:56.163Z
What are the most important papers/post/resources to read to understand more of GPT-3? 2020-08-02T20:53:30.913Z
What are you looking for in a Less Wrong post? 2020-08-01T18:00:04.738Z
Dealing with Curiosity-Stoppers 2020-07-30T22:05:02.668Z
adamShimi's Shortform 2020-07-22T19:19:27.622Z
The 8 Techniques to Tolerify the Dark World 2020-07-20T00:58:04.621Z
Locality of goals 2020-06-22T21:56:01.428Z
Goal-directedness is behavioral, not structural 2020-06-08T23:05:30.422Z
Focus: you are allowed to be bad at accomplishing your goals 2020-06-03T21:04:29.151Z
Lessons from Isaac: Pitfalls of Reason 2020-05-08T20:44:35.902Z
My Functor is Rich! 2020-03-18T18:58:39.002Z
Welcome to the Haskell Jungle 2020-03-18T18:58:18.083Z
Lessons from Isaac: Poor Little Robbie 2020-03-14T17:14:56.438Z
Where's the Turing Machine? A step towards Ontology Identification 2020-02-26T17:10:53.054Z
Goal-directed = Model-based RL? 2020-02-20T19:13:51.342Z


Comment by adamShimi on LCDT, A Myopic Decision Theory · 2021-08-04T18:30:20.229Z · LW · GW

Thanks for the comment!

  1. What seems to be necessary is that the LCDT thinks its decisions have no influence on the impact of other agents' decisions, not simply on the decisions themselves (this relates to Steve's second point). For example, let's say you're deciding whether to press button A or button B, and I rewire them so that B now has A's consequences, and A B's. I now assume that my action hasn't influenced your decision, but it has influenced the consequences of your decision.
    1. The causal graph here has both of us influencing a [buttons] node: I rewire them and you choose which to press. I've cut my link to you, but not to [buttons]. More generally, I can deceive you arbitrarily simply by anticipating your action and applying a post-action-adaptor to it (like re-wiring the buttons).
      1. Perhaps the idea here is that I'd have no incentive to hide my interference with the buttons (since I assume it won't change which you press). That seems to work for many cases, and so will be detectable/fixable in training - but after you apply a feedback loop of this sort you'll be left with the action-adaptor-based deceptions which you don't notice.

That's... a situation I hadn't thought about. My first reaction is that it falls in the category of "myopic defection" rather than deception. In the concrete example, as you say, you would reveal the rewiring to any overseer/observer, because you don't think anything you do can impact them (similarly to what happens in the (N,M)-deception problem).

Your point about what happens if you detect defection in training is actually a more general issue across all the examples: what do you do when you catch defection/bad incentives? LCDT isn't trying to solve this problem, just to find a decision theory under which the agent would always reveal such incentives in training. I personally don't know how to solve this incentive question, and I agree it's a concrete and crucial detail in aligning an AI.
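For what it's worth, the loophole in the buttons example can be sketched in a toy causal-graph model. This is my illustration, not from the post: the graph, node names, and function are all hypothetical, but they show how cutting edges to agent nodes leaves edges to non-agent nodes (like the button wiring) intact.

```python
# Hypothetical toy model of LCDT's link-cutting (illustration only):
# the agent plans in a modified causal graph where every edge from its
# own decision node to a node tagged as an agent is severed, so it
# believes its action cannot influence any other agent's decision.

def lcdt_modified_graph(edges, decision_node, agent_nodes):
    """Return the graph LCDT plans with: edges from the decision node
    to agentic nodes are cut; all other edges survive."""
    return [
        (src, dst) for (src, dst) in edges
        if not (src == decision_node and dst in agent_nodes)
    ]

# Toy graph for the buttons example: I rewire the buttons, you press one.
edges = [
    ("my_action", "you"),      # my action could influence your decision...
    ("my_action", "buttons"),  # ...and the wiring of the buttons
    ("you", "buttons"),        # your decision also affects the buttons
    ("buttons", "outcome"),
]
planning_graph = lcdt_modified_graph(edges, "my_action", {"you"})

# The link to "you" is cut, but the link to "buttons" (a non-agent node)
# survives: exactly the post-action-adaptor loophole described above.
assert ("my_action", "you") not in planning_graph
assert ("my_action", "buttons") in planning_graph
```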

It seems to me that to get the "...and extract many valuable insights about its behavior", there needs to be an assumption that Argmax-HCH is being simulated in a helpful/clear/transparent way. It's not clear to me why this is expected: wouldn't the same pressures that lead to a "trained model that we searched for acting like HCH" tending to be opaque also lead the simulation of Argmax-HCH to be opaque? Specifically, the LCDT agent only needs to run it, not understand it.

There are two ways to think about it.

  • If we're talking about a literal LCDT agent (which is what I have in mind), then it would have a learned causal model of HCH good enough to predict what the final output is. That sounds more interpretable to me than just having an opaque implementation of HCH (but it's not already interpreted for us).
  • If we're talking about systems which act like an LCDT agent but are not literally programmed to do so, I'm not so sure. I expect that they need a somewhat flexible representation of what they're trying to represent, but maybe I'm missing a clever trick.
Comment by adamShimi on LCDT, A Myopic Decision Theory · 2021-08-04T17:39:39.087Z · LW · GW

Thanks for the comment!

Suppose we design the LCDT agent with the "prior" that "After this decision right now, I'm just going to do nothing at all ever again, instead I'm just going to NOOP until the end of time." And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.

Whereas if the LCDT agent has the "prior" that it's going to make future decisions using a similar algorithm as what it's using now, then it would do the first step of a multi-step plan, secure in the knowledge that it will later proceed to the next step.

Your explanation of the paperclip factory is spot on. That being said, it is important to specify that the causal path to building the factory must contain no agent, or the LCDT agent would think its actions don't change anything.

The weird part (which I don't personally know how to address) is deciding where the prior comes from. Most of the post argues that it doesn't matter for our problems, but in this example (and other weird multi-step plans), it does.

If so, I'm concerned about capabilities here because I normally think that, for capabilities reasons, we'll need reasoning to be a multi-step sequential process, involving thinking about different aspects in different ways. So if we do the first "prior", where LCDT assumes that it's going to NOOP forever starting 0.1 seconds from now, it won't try to "think things through", gather background knowledge etc. But if we do the more human-like "prior" where LCDT assumes that it's going to make future decisions in a similar way as present decisions, then we're back to long-term planning.

That's a fair concern. Our point in the post is that LCDT can think things through when simulating other systems (like HCH) in order to imitate them, and so it should have strong capabilities there. But you're right that it's an issue for long-term planning, if we expect an LCDT agent to directly solve problems.

Different topic: If the human's "space of possible actions" at t=1 depends on the LCDT agent's action at t=0, then I'm confused about how the LCDT agent is supposed to pretend that the human's decision is independent of its current choice.

The technical answer is that the LCDT agent computes its distribution over actions spaces for the human by marginalizing the human's current distribution with the LCDT agent distribution over its own action. The intuition is something like: "I believe that the human has already some model of which action I will take, and nothing I can do will change that".
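That marginalization can be sketched in a few lines. This is my illustration, not from the post; the function and distributions are hypothetical. The point is that the agent models the human with a single averaged distribution, weighted by its own prior over its actions, so its actual choice "can't" move the human's behavior.

```python
# Hypothetical sketch of the marginalization described above.
# human_policy[a] is the human's action distribution *if* the agent
# takes action a; the LCDT agent replaces it with one fixed mixture,
# weighted by its prior over its own actions.

def lcdt_human_model(human_policy, agent_prior):
    marginal = {}
    for agent_action, p_agent in agent_prior.items():
        for human_action, p_human in human_policy[agent_action].items():
            marginal[human_action] = marginal.get(human_action, 0.0) + p_agent * p_human
    return marginal

human_policy = {
    "press_A": {"trust": 0.9, "distrust": 0.1},
    "press_B": {"trust": 0.2, "distrust": 0.8},
}
agent_prior = {"press_A": 0.5, "press_B": 0.5}

# Whatever the agent actually decides, it models the human with this
# fixed mixture (trust ~0.55, distrust ~0.45), independent of its choice.
m = lcdt_human_model(human_policy, agent_prior)
assert abs(m["trust"] - 0.55) < 1e-9 and abs(m["distrust"] - 0.45) < 1e-9
```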

Comment by adamShimi on LCDT, A Myopic Decision Theory · 2021-08-04T17:29:45.918Z · LW · GW

Could you give a more explicit example of what you think might go wrong? I feel like your argument that agency is natural to learn actually works in LCDT's favor, because LCDT requires accurately tagging (or at least overapproximating) which things in its causal model are agentic.

Comment by adamShimi on Thoughts on safety in predictive learning · 2021-08-02T09:48:19.097Z · LW · GW

5 years later, I'm finally reading this post. Thanks for the extended discussion of postdictive learning; it's really relevant to my current thinking about alignment for potentially simulator-like Language Models.

Note that others disagree, e.g. advocates of Microscope AI.

I don't think advocates of Microscope AI think you can reach AGI that way. It's more that through Microscope AI, we might end up solving the problems we have without relying on an agent.

Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling prophecies—in other words, it can learn to manipulate the world, not just understand it. For example see Abram Demski’s Parable of the Predict-O-Matic. In postdictive training, the answer is already locked in when the system is guessing it, so there’s no training incentive to manipulate the world. (Unless it learns to hack into the answer by row-hammer or whatever. I’ll get back to that in a later section.)

Agreed, but I think you could be even clearer that the real point is that the output can never causally influence the postdicted answer. As you write, there are cases and versions where prediction also has this property, but it's not guaranteed by default.

As for the actual argument, that's definitely part of my reasoning for why I don't expect GPT-N to have deceptive incentives (although maybe what it simulates would have them).
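The "answer is already locked in" point can be made concrete with a toy sketch. This is my illustration, with hypothetical function names: in postdiction the target was recorded before the guess existed, while in prediction a (pathological) world can react to the guess, leaving room for self-fulfilling prophecies.

```python
# Toy contrast between postdictive and predictive targets (illustration).

def postdictive_target(history, guess):
    # The target was fixed before the guess existed; the guess is ignored.
    return history[-1]

def predictive_target(history, guess):
    # A pathological world that reacts to the prophecy: whatever the model
    # predicts comes true, so any guess achieves zero loss.
    return guess

history = [3, 1, 4]
assert postdictive_target(history, guess=99) == 4   # guess can't move the target
assert predictive_target(history, guess=99) == 99   # guess fully determines it
```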

In backprop, but not trial-and-error, and not numerical differentiation, we get some protection against things like row-hammering the supervisory signal.

Even after reading the Wikipedia page, it's not clear to me what "row-hammering the supervisory signal" would look like. Notably, I don't see the analogy to the electrical interaction here. Or do you mean literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive?

The differentiation engine is essentially symbolic, so it won’t (and indeed can’t) “differentiate through” the effects of row-hammer or whatever.

No idea what this means. If row-hammering (or whatever) improves the loss, then the gradient will push in that direction. I feel like the crux is in the specific way you imagine row-hammering happening here, so I'd like to know more about it.

Easy win #3: Don’t access the world-model and then act on that information, at least not without telling it

Slight nitpick, but this last one doesn't sound like an easy win to me -- just an argument against a naive safety strategy. I mean, it's not like we really gain anything in terms of safety; we just avoid completely wrecking the capabilities of the model.

(Human example of this error: Imagine someone saying "If fast-takeoff AGI happens, then it would have bizarre consequence X, and there’s no way you really expect that to happen, right?!? So c’mon, there’s not really gonna be fast-takeoff AGI.". This is an error because if there’s a reason to expect fast-takeoff AGI, and fast-takeoff AGI leads to X, we should make the causal update (“X is more likely than I thought”), not the retrocausal update (“fast-takeoff AGI is less likely than I thought”). Well, probably. I guess on second thought it’s not always a reasoning error.)

I see what you did there. (Jokes apart, that's a telling example.)

And, like other reasoning errors and imperfect heuristics, I expect that it’s self-correcting—i.e., it would manifest more early in training, but gradually go away as the AGI learns meta-cognitive self-monitoring strategies. It doesn’t seem to have unusually dangerous consequences, compared to other things in that category, AFAICT.

One way to make this argument more concrete is to note that solving this problem helps capabilities as well as safety. So as long as what worries us is a very capable AGI, this should be mitigated.

  • There are within-universe consequences of a processing step, where the step causes things to happen entirely within the intended algorithm. (By "intended", I just mean that the algorithm is running without hardware errors). These same consequences would happen for the same reasons if we run the algorithm under homomorphic encryption in a sealed bunker at the bottom of the ocean.
  • Then there are 4th-wall-breaking consequences of a processing step, where the step has a downstream chain of causation that passes through things in the real world that are not within-universe. (I mean, yes, the chip’s transistors have real-world-impacts on each other, in a manner that implements the algorithm, but that doesn’t count as 4th-wall-breaking.)

This distinction makes some sense to me, but I'm confused by your phrasing (and thus by what you actually mean). I guess my issue is that stating it like that made me think you expected each processing step to be one or the other, whereas I can't imagine any processing step without 4th-wall-breaking consequences. What you then do with these, regarding whether the 4th-wall-breaking consequences are reasons for specific actions, makes it clearer IMO.

Out-of-distribution, maybe the criterion in question diverges from a good postdiction-generation strategy. Oh well, it will make bad postdictions for a while, until gradient descent fixes it. That’s a capability problem, not a safety problem.

Agreed. Though, as Evan already pointed out, the real worry with mesa-optimizers isn't proxy alignment but deceptive alignment. And deceptive alignment isn't just a capability problem.

Another way I've been thinking about the issue of mesa-optimizers in GPT-N is the risk of something like malign agents in the models (a bit like this) that GPT-N might be using to simulate different texts. (Oh, I see you already have a section about that.)

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

Just because I share this intuition, I want to try pushing back against it.

First, I don't see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn't learn to model 4th-wall-breaking consequences: that's just the sort of thing you need in order to predict security exploits, or AI alignment posts like this one.

Next comes the question of whether it will take advantage of this. Well, a deceptive mesa-optimizer would have an incentive to use it. So I guess the question boils down to the previous discussion of whether we should expect postdictive learners to spin up deceptive mesa-optimizers.

So a self-aware, aligned AGI could, and presumably would, figure out the idea “Don’t do a step-by-step emulation in your head of a possibly-adversarial algorithm that you don’t understand; or do it in a super-secure sandbox environment if you must”, as concepts encoded in its value function and planner. (Especially if we warn it / steer it away from that.)

I see a pattern of turning potential safety issues into capability issues, and then saying that since the AGI is competent, it will not have them. I think this makes sense for a really competent AGI, which would not be taken over by budding agents inside its simulation. But there's still the risk of spinning up such agents early in training, and if those agents get good enough to take over the model from the inside and become deceptive, competence at the training task becomes decorrelated with what happens in deployment.

Comment by adamShimi on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-28T17:49:36.006Z · LW · GW

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.

Comment by adamShimi on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-28T13:13:41.302Z · LW · GW

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do all the things we don't want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ 

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Agreed that there's not much difference when predicting GPT-3. But that's because we're at the place in the scaling where Gwern (AFAIK) describes the LM as an agent very good at prediction. By definition it will not do anything different from a simulator, since its "goal" literally encodes all of its behavior.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust, because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility of using them to help alignment research, for example by trying to simulate non-agenty systems.

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Fair enough.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you're probably right.

Comment by adamShimi on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-27T20:30:22.591Z · LW · GW

Actually, I think you're right. I always thought that MuZero was one and the same system for every game, but the Nature paper describes it as an architecture that can be applied to learn different games. I'd like confirmation from someone who has actually studied it more, but it looks like MuZero indeed isn't the same system for each game.

Comment by adamShimi on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-27T18:41:45.669Z · LW · GW

Could you use this technique to e.g. train the same agent to do well on chess and go?

If I'm not misunderstanding your question, this is something they already did with MuZero.

Comment by adamShimi on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-24T09:06:17.805Z · LW · GW

Sorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and elsewhere assuming goals and agency in language models, and some of your word choices sounded very goal-directed/intentional-stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyone, but I am sort of implying that this is the right framing to talk about it. In this case, the "motivated" part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don't think is right (and apparently you agree).

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency to different people.)

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

Agreed with you there.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

True, but I don't feel there is a significant enough difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

First, I think I interpreted "misalignment" here to mean "inner misalignment", hence my answer. I also agree that all the examples in Victoria's doc show misalignment. That being said, I still think there is a difference with the specification-gaming stuff.

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They're things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex looks for the most probable next token based on all the code.

But there doesn't seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the "Chatbot task" I described above, which is what I gather you mean by "solving my problem".

(By the way, I have an old post about formulating this task that we want GPT-3 to solve. It was written before I actually studied GPT-3, but I think it holds up decently well. I also did some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)

Comment by adamShimi on [AN #157]: Measuring misalignment in the technology underlying Copilot · 2021-07-23T20:49:58.528Z · LW · GW

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think this is a very good example of where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with EleutherAI; I'm pushing them to post on the AF) that GPT-3 works more like a simulator of language-producing processes (for lack of a better word) than as an agent trying to predict the next token.

Like what you write here:

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

For a simulator-like model, this is not misalignment; this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem. In that sense, GPT-3 fails the "chatbot task": for a lot of the things it's great at doing, you have to handcraft (or constrain) the prompts to make it work -- it won't figure out precisely what you mean.

Or to put it differently: people who are good at making GPT-3 do what they want have learned not to use it like a smart agent figuring out what you really mean, but more like a "prompt continuation engine". You can obviously say "it's an agent that really does care about the context", but it doesn't look like that adds anything to the picture, and I have the gut feeling that being agenty makes it harder to do that task (as you need a very un-goal-like goal).

(I think this points to what you mention in that comment about approval-directedness being significantly less goal-directed: if GPT-3 is agenty, it looks quite a lot like a sort of approval-directed agent.)

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-07-21T15:06:57.369Z · LW · GW

To be fair, that was the original title. But after talking with Nate, I agreed that this perspective, although quite useful IMO, falls short of deconfusion because it hasn't paid its due in making the application (doing deconfusion) better/easier yet. Doesn't mean I don't expect it to eventually. :)

Comment by adamShimi on [Link] Musk's non-missing mood · 2021-07-13T13:05:51.853Z · LW · GW

I feel like the linked post is extolling the virtue of something that is highly unproductive and self-destructive: using your internal grim-o-meter to measure the state of the world/future. As Nate points out in his post, this is a terrible idea. Maybe Musk can be constantly grim while being productive on AI Alignment, but in my experience, people constantly weighed down by the shit that happens don't do creative research -- they get depressed and angsty. Even if they do some work, they burn out way more often.

That being said, I agree that it makes sense for people really involved in this topic to freak out from time to time (happens to me). But I don't want to make freaking out the thing that every Alignment researcher feels like they have to signal. 

Comment by adamShimi on Cliffnotes to Craft of Research parts I, II, and III · 2021-07-12T09:54:09.052Z · LW · GW

I finally took the time to read this post, and it's really interesting! Thanks for writing it!

One general comment: I feel this book (as you summarize it) shows confusion over deconfusion. Notably, all the talk of pure vs applied and conceptual vs applied isn't cutting reality at the joints, and is just the old stereotype of theorists vs experimenters.

Additionally, it occurs to me that maybe "I have information for you" mode is just a cheaper version of the question/problem modes. Sometimes I think of something that might lead to cool new information (either a theory or an experiment), and I'm engaged more by the potential for novelty than I am by the potential for applications.

I think I'd like to become more problem-driven: to derive possibilities for research from problems, and make sure I'm not just seeking novelty. At the end of the day, I don't think these roles are "equal"; I think the problem-driven role is the best one, the one we should aspire to.

I don't necessarily agree with the "cheaper" judgment, except that if you want your research to contribute to a specific problem, instead of only maybe being relevant, the problem-driven role is probably better. Or at least the application-driven roles, which include both the problem-driven and the deconfusion-driven.

Also, a trick I've found from reading EA materials, which isn't really in the air in academia: if you want to work on a specific subject/problem but still follow your excitement, it's as simple as looking at many different approaches to the problem and finding the ones that excite you. I feel like researchers are so used to protecting their ability to work on whatever they want that they end up believing that whatever looked cool first is necessarily their only interest. A bit like how the idea of passion can fuck up some career thinking.

Isn't it the case that deconfusion/writer role three research can be disseminated to practical (as opposed to theoretical) -minded people, and then those people turn question-answer into problem-solution?

In my experience, practical-minded people without much nerd excitement for theory tend to be bad at deconfusion, because it doesn't interest them. They tend to be problem-solvers first, and you have to convince them that your deconfusion is useful for their problem for them to care, which is fair.

There might be some truth to your point if we tweak it though, because I feel like deconfusion requires a mix of conceptual and practical mindsets: you need to care enough about theory and concept to want to clarify things, but you also need to care enough about applications to clarify and deconfuse with a goal in mind.

The conceptual problem case is where intangibles play in. The condition in that case is always the simple lack of knowledge or understanding of something. The cost in that case is simple ignorance.

Disagree with the cost part, because often the cost of deconfusion problems is... confusion. That is, being unable to solve the problem, or present it, or get money for it because nobody understands it. A big chunk of that sounds more like problem-related costs than pure ignorance to me.

A helpful exercise is if you find yourself saying "we want to understand x so that we can y", try flipping to "we can't y if we don't understand x". This sort of shifts the burden on the reader to provide ways in which we can y without understanding x. You can do this iteratively: come up with _z_s which you can't do without y, and so on.

That's a good intuition pump, but it is often too strong a condition. Deconfusing a given idea or concept might be the clearest, most promising, or most obvious way of solving the problem, but it's almost never a necessary condition. There are almost no necessary conditions in the real world, and if you wait for one, you'll never do anything.

I want to reason about what these distinctions look like in the alignment community, and whether or not they're important.

I would guess no, because that's a distinction nobody cares about in the STEM world. The only point is maybe "people should read the original papers instead of just citing them", but that doesn't apply to many things here.

Moreover, what is a primary source in the alignment community? Surely if one is writing about inner alignment, a primary source is the Risks from Learned Optimization paper. But what are Risks' primary, secondary, tertiary sources? Does it matter?

On inner alignment, Risks is a primary source which doesn't really have primary sources. I don't think it necessarily makes sense to talk about primary sources in STEM settings, except as the first paper to present an idea/concept/theory. It's not about "the source written during the time it happened" as in history. So to even answer the question of the primary sources of Risks, you first need to know: primary sources about what?

But once again, I don't think there is any value here, except in making people read Risks instead of reverse engineering it from subsequent posts.

Comment by adamShimi on Research Facilitation Invitation · 2021-07-10T16:38:40.782Z · LW · GW

I feel like you're proposing deconfusion as a service, at least in the way I decompose it here. Since my research is basically freelance deconfusion for alignment researchers, I would be very interested in talking to you about how you do it. :)

Comment by adamShimi on paulfchristiano's Shortform · 2021-07-02T14:57:24.762Z · LW · GW

Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?

Comment by adamShimi on paulfchristiano's Shortform · 2021-07-02T14:55:45.166Z · LW · GW

This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that's what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don't have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn't have to be so complex.)

So you want a sort of partial universality sufficient to bootstrap the process locally (while not requiring the understanding of our values in fine details), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)?

If that's about right, then I agree that having this would make your proposal work, but I still don't know how to get it. I need to read your previous posts on answering questions honestly.

Comment by adamShimi on paulfchristiano's Shortform · 2021-07-01T16:19:23.874Z · LW · GW

Here's my starting proposal:

  • We quantify the human's local preferences by asking "Look at the person you actually became. How happy are you with that person? Quantitatively, how much of your value was lost by replacing yourself with that person?" This gives us a loss on a scale from 0% (perfect idealization, losing nothing) to 100% (where all of the value is gone). Most of the values will be exceptionally small, especially if we look at a short period like an hour.
  • Eventually once the human becomes wise enough to totally epistemically dominate the original AI, they can assign a score to the AI's actions. To make life simple for now let's ignore negative outcomes and just describe value as a scalar from 0% (barren universe) to 100% (all of the universe is used in an optimal way). Or we might use this "final scale" in a different way (e.g. to evaluate the AI's actions rather than the actually assessing outcomes, assigning high scores to corrigible and efficient behavior and somehow quantifying deviations from that ideal).
  • The utility is the product of all of these numbers.
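To make the arithmetic of the quoted proposal concrete for myself, here is a toy sketch (with made-up numbers and hypothetical names, not anything from the proposal itself): each short deliberation step keeps a fraction (1 - loss) of the value, and the final evaluation multiplies in at the end.

```python
def utility(local_losses, final_score):
    """Product-of-scores utility from the quoted proposal.

    local_losses: fraction of value lost at each short deliberation step
                  (e.g. per hour); each is expected to be exceptionally small.
    final_score:  the eventual evaluation by the wise human, in [0, 1].
    """
    u = final_score
    for loss in local_losses:
        u *= 1.0 - loss  # each step keeps (1 - loss) of the value
    return u

# Example: three near-perfect steps and a 0.9 final evaluation.
print(utility([1e-4, 2e-4, 5e-5], 0.9))
```

One thing this makes salient is that the product is dominated by any single step with a large loss: one catastrophic hour zeroes out the whole utility.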

If I follow correctly, the first step requires the humans to evaluate the output of narrow value learning, until this output becomes good enough to become universal with regard to the original AI and supervise it? I'm not sure I get why the AI wouldn't be incentivized to tamper with the narrow value learning, à la Predict-o-matic. Depending on certain details (like maybe the indescribable hellworld hypothesis), maybe the AI can introduce changes to the partial imitations/deliberations that stay hidden and compound until the imitations epistemically dominate the AI, and then ask it to do simple stuff.

Comment by adamShimi on paulfchristiano's Shortform · 2021-07-01T16:07:55.807Z · LW · GW

One aspect of this proposal which I don't know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don't even know how to run this silly approach.

Comment by adamShimi on Brute force searching for alignment · 2021-06-29T11:04:17.269Z · LW · GW

Well, if you worry that these properties don't have a simple conceptual core, maybe you can do the trick where you try to formalize a subset of them with a small conceptual core. That's basically Evan's move with Myopia, as an easier-to-study subset of non-deceptiveness.

Comment by adamShimi on Brute force searching for alignment · 2021-06-29T11:02:44.265Z · LW · GW

If I try to rephrase it in my words, your proposal looks like a way to go from partial deconfusion (in the form of an extensive definition, a list of examples of what you want) to full deconfusion (an actual program with the property that you want) through brute force search.

Stated like that, it looks really cool. I wonder whether you'd already need an AGI to do the search with a reasonable amount of compute. In that case, the worry is that you have to deconfuse what you want to deconfuse before being able to apply the technique, which would make it useless.

Still, I will add this sort of thought experiment to my bag of tools. It's a pretty good argument for extensive definitions, in a way.

Comment by adamShimi on Richard Ngo's Shortform · 2021-06-29T10:53:03.580Z · LW · GW

It seems to me that the resolution to the apparent paradox is that nerds are interested in all the details of their domain, but the outcomes they tend to look for are high-level abstractions. Even in settings like fandoms, there is a big push towards massive theories that entail every little detail about the story.

Though defining the rationalist community as a sort of community of meta-nerds who apply this nerd approach to almost anything doesn't seem too off the mark.

Comment by adamShimi on Richard Ngo's Shortform · 2021-06-29T10:49:16.411Z · LW · GW

Do you think that these are mutually exclusive, or something like that? I've always been confused by what I take to be the position in this shortform: that defining the outcomes makes it somehow harder to define the process. Sure, you can define a process without defining an outcome (i.e. writing a program or training an NN), but since what we are confused about is what we even want at the end, for me that's the priority. And doing so would help in searching for processes leading to this outcome.

That being said, if your point is that defining outcomes isn't enough, in that we also need to define/deconfuse/study the processes leading to these outcomes, then I agree with that.

Comment by adamShimi on Frequent arguments about alignment · 2021-06-28T09:41:05.090Z · LW · GW

Thanks for this post! I have to admit that I took some time to read it because I believed that it would be basic, but I really like the focus on more current techniques (which makes sense since you cofounded and work at OpenAI).

Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something.

For me, that doesn't feel as bad as you describe. Sure, if you literally call up a "wise old man" from the literature (or god forbid, reddit), that might end up pretty badly. But we might go for tighter control over the sort of "language producer" we're trying to instantiate. Or go microscope AI.

All these do require more alignment-focused work though. I'm particularly excited by perspectives on language models as simulators of many small models of things producing/influencing language, and by techniques related to that view, like meta-prompts or counterfactual parsing.

I also feel like this answer from the Advocate disparages a potentially very big deal for language models: the fact that they might pick up human abstractions because they learn to model language, and our use of language is littered with these abstractions. This is a potentially strong version of the natural abstraction hypothesis, which seems to make the problem easier in some ways. For example, we have a better chance of understanding what the model might do, because it's trying to predict a system (language) that we use constantly at that level of granularity, as opposed to images, which we never think of pixel by pixel.

Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We'll need to use reinforcement learning.)

I want to point out that from an alignment standpoint, this looks like a very dangerous step. One thing language models have going for them is that what they optimize for isn't exactly what we use them for, and so they avoid potential issues like goodharting. This would be completely destroyed by adding an explicit optimization step at the end.

Returning to the original question, there was the claim that alignment gets easier as the models get smarter. It does get easier in some ways, but it also gets harder in others. Smarter models will be better at gaming our reward functions in unexpected and clever ways -- for example, producing the convincing illusion of being insightful or helpful, while actually being the opposite. And eventually they'll be capable of intentionally deceiving us.

I think this is definitely an important point that goes beyond the special case of language models that you mostly discuss before.

While alignment and capabilities aren't distinct, they correspond to different directions that we can push the frontier of AI. Alignment advances make it easier to optimize hard-to-measure objectives like being helpful or truthful. Capabilities advances also sometimes make our models more helpful and more accurate, but they also make the models more potentially dangerous.

One thing I would want to point out is that another crucial difference lies in the sort of conceptual research that is done in alignment. Deconfusion of ideas like power-seeking, enlightened judgment, and goal-directedness is rarely that useful for capabilities, but I'm pretty convinced it is crucial for better understanding alignment risks and how to deal with them.

Comment by adamShimi on Announcing the Replacing Guilt audiobook · 2021-06-25T15:34:13.413Z · LW · GW

I'm not really into audiobooks, but Replacing Guilt is awesome, and it's great to have new ways for people to discover and experience it!

Comment by adamShimi on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T09:28:26.289Z · LW · GW

Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.

Despite disagreeing with you, I'm glad that you published this comment, and I agree that airing disagreements is really important for the research community.

In particular, I don't think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to ("We prove that for most prior beliefs one might have about the agent's reward function […], one should expect optimal policies to seek power in these environments."). Nor do I think that the paper justifies the relevance of that set of MDPs. (Why is it useful to prove things about it?)

There's a sense in which I agree with you: AFAIK, there is no formal statement of the set of MDPs with the structural properties that Alex studies here. That doesn't mean it isn't relatively easy to state:

  • Proposition 6.9 requires that there is a state with two actions a and a' such that (let's say) a leads to a subMDP that can be injected/strictly injected into the subMDP that a' leads to.
  • Theorems 6.12 and 6.13 require that there is a state with two actions a and a' such that (let's say) a leads to a set of RSDs (final cycles that are strictly optimal for some reward function) that can be injected/strictly injected into the set of RSDs from a'.

The first set of MDPs is quite restrictive (because you need an exact injection), which is why IIRC Alex extends the results to sets of RSDs, which capture a far larger class of MDPs. Intuitively, this is the class of MDPs where, from the same state, some action leads to more infinite-horizon behaviors than another. I personally find this class quite intuitive, and I also feel it captures many real-world situations where we worry about power and instrumental convergence.

Also, there may be a misconception that this paper formalizes the instrumental convergence thesis. That seems wrong, i.e. the paper does not seem to claim that several convergent instrumental values can be identified. The only convergent instrumental value that the paper attempts to address AFAICT is self-preservation (avoiding terminal states).

Once again, I agree in part with the statement that the paper doesn't IIRC explicitly discuss different convergent instrumental goals. On the other hand, the paper explicitly says that it focuses on a special case of the instrumental convergence thesis:

An action is instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The claim that power-seeking is robustly instrumental is a specific instance of the instrumental convergence thesis:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents [Bostrom, 2014].

That being said, you just made me want to look more into how well power-seeking captures different convergent instrumental goals from Omohundro's paper, so thanks for that. :)

Comment by adamShimi on Alex Turner's Research, Comprehensive Information Gathering · 2021-06-23T19:18:32.784Z · LW · GW

Sure, I hadn't thought about that.

Comment by adamShimi on Visualizing in 5 dimensions · 2021-06-19T20:28:56.227Z · LW · GW

Trying to picture the warmup is already hard enough for me, so I'll start with asking questions about that and revisit the rest later:

(Exercise: what are the lines of latitude? What about longitude? Can you picture the north and south pole in different placetimes, and the corresponding equators?)

I expect that the lines of longitude are the ones you get by choosing a point on the middle sphere and tracing the line that follows it on both sides? As for latitude, if I use the analogy of the 2-sphere, each circle in the film is one line of latitude; so maybe each 2-sphere in this film is a line of latitude?

Also, I don't understand what you mean by your last question. In the 2-sphere version, the poles are only visible at time 0 and 2 and the equator is only visible at time 1.
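To check my own intuition on the 2-sphere version of the film (a toy calculation, assuming a unit sphere with the film running over t in [0, 2] as above): the cross-section at time t should be a circle of radius sqrt(1 - (t - 1)^2), so the poles appear only at t = 0 and t = 2, and the equator only at t = 1.

```python
import math

def slice_radius(t):
    """Radius of the cross-section at film time t in [0, 2],
    for a unit 2-sphere sliced into a film of circles."""
    return math.sqrt(max(0.0, 1.0 - (t - 1.0) ** 2))

for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(t, round(slice_radius(t), 3))
```

The analogous formula one dimension up would give the radius of the 2-sphere visible at each moment of the 3-sphere film.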

Comment by adamShimi on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T13:23:02.979Z · LW · GW

I want to point out that this is a great example of a deconfusion open problem. There is a bunch of intuitions, some constraints, and then we want to clarify the confusion underlying it all. I'm not planning to work on it myself, but it sounds very interesting.

(The only caveat I have with the post itself is that the title could be more explicit that this is an open problem.)

Comment by adamShimi on Knowledge is not just digital abstraction layers · 2021-06-16T13:16:15.445Z · LW · GW

Nice post, as always.

What I take from the sequence up to this point is that the way we formalize information is unfit to capture knowledge. This is quite intuitive, but you also give concrete counterexamples that are really helpful.

It is reasonable to say that a data recorder is accumulating nonzero knowledge, but it is strange to say that exchanging the sensor data for a model derived from that sensor data is always a net decrease in knowledge.

Definitely agreed. This sounds like your proposal doesn't capture the transformation of information into more valuable precomputation (making valuable abstractions requires throwing away some information).

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-06-16T08:50:27.787Z · LW · GW

Glad that you liked it. :)

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-06-16T08:49:54.765Z · LW · GW

That's one option. I actually wrote my thesis to be the readable version of this deconfusion process, so this is where I would redirect people by default (the first few pages are in French, but the actual thesis is in English).

Comment by adamShimi on Vignettes Workshop (AI Impacts) · 2021-06-15T13:51:44.859Z · LW · GW

Already told you yesterday, but great idea! I'll definitely be a part of it, and will try to bring some people with me.

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-06-14T21:12:32.767Z · LW · GW

Glad you found this helpful!

Concerning your deconfusion issue, I would say that maybe some things you could try are:

  • Be very clear about your application. Why do you want to deconfuse these ideas? That might give you some constraint on what the result must look like.
  • Maybe try the simplest form of handles? I'm quite fond of extensive definitions myself, as they're easier to create but still quite insightful.
  • One thing I didn't touch on in this post is how handle-building is often an iterative process, where you build a crude one that serves you to pinpoint some of the confusion, and then you build a better or different one, over and over again.

Hope this might help.

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-06-14T21:08:52.475Z · LW · GW

I guess it depends on the application you have in mind. In principle, the most deconfused handle I can think of is a full mathematical formalization with just the right number of degrees of freedom. Maybe the best analogy is with the theories in physics.

Regarding the textbook, I would say that you probably need a pretty good level of deconfusion to write a good textbook, but textbook writing also involves a lot of bridging the inferential distance with newcomers, which doesn't count as deconfusion for me.

Does that answer your question?

Comment by adamShimi on Looking Deeper at Deconfusion · 2021-06-14T07:00:27.415Z · LW · GW


All in all, I think there are many more examples. It's just that deconfusion almost always plays a part, because we don't have one unified paradigm or approach which does the deconfusion for us. But actual problem solving, and most parts of normal science, are not deconfusion from my perspective.

Comment by adamShimi on [Event] Weekly Alignment Research Coffee Time (08/02) · 2021-06-13T21:56:05.685Z · LW · GW

Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.

Comment by adamShimi on Knowledge is not just mutual information · 2021-06-12T12:42:40.212Z · LW · GW

Thanks again for a nice post in this sequence!

The previous post looked at measuring the resemblance between some region and its environment as a possible definition of knowledge and found that it was not able to account for the range of possible representations of knowledge.

I found myself going back to the previous post to clarify what you mean here. I feel like you could do a better job of summarizing the issue of the previous post (maybe by mentioning the computer example explicitly?).

Formally, the mutual information between two objects is the gap between the entropy of the two objects considered as a whole, and the sum of the entropy of the two objects considered separately. If knowing the configuration of one object tells us nothing about the configuration of the other object, then the entropy of the whole will be exactly equal to the sum of the entropy of the parts, meaning there is no gap, in which case the mutual information between the two objects is zero. To the extent that knowing the configuration of one object tells us something about the configuration of the other, the mutual information between them is greater than zero.

I need to get deeper into information theory, but that is probably the most intuitive explanation of mutual information I've seen. I delayed reading this post because I worried that my half-remembered information theory wasn't up to it, but you deal with that nicely.
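As a sanity check on my own understanding, the "gap" definition can be computed directly for a small joint distribution (a toy sketch, not from the post):

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y): the gap between the entropy of the
    parts considered separately and the entropy of the whole."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px) + entropy(py) - entropy(joint)

# Perfectly correlated bits: knowing one tells you the other -> 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))
# Independent bits: the whole's entropy equals the sum of the parts' -> 0 bits.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))
```

The two extreme cases match the prose: the gap is zero exactly when the objects are independent.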

At the microscopic level, each photon that strikes the surface of an object might change the physical configuration of that object by exciting an electron or knocking out a covalent bond. Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought. So then does the physical case in which the computer is housed have as much "knowledge" about the position of the object being sought as the computer itself?

Interestingly, I expect this effect to disappear when the measurements defining our two variables get less precise. In a sense, the mutual information between the case and the ship container depends on measuring very subtle differences, whereas the mutual information between the computer and the ship container is far more robust to loss of precision.

For example, a computer that is using an electron microscope to build up a circuit diagram of its own CPU ought to be considered an example of the accumulation of knowledge. However, the mutual information between the computer and itself is always equal to the entropy of the computer and is therefore constant over time, since any variable always has perfect mutual information with itself.

But wouldn't there be a part of the computer that accumulates knowledge about the whole computer?

This is also true of the mutual information between the region of interest and the whole system: since the whole system includes the region of interest, the mutual information between the two is always equal to the entropy of the region of interest, since every bit of information we learn about the region of interest gives us exactly one bit of information about the whole system also.

Maybe it's my lack of understanding of information theory speaking, but that sounds wrong. Surely there's a difference between cases where the region of interest determines the full environment, and cases where it is completely independent of the rest of the environment?

The accumulation of information within a region of interest seems to be a necessary but not sufficient condition for the accumulation of knowledge within that region. Measuring mutual information fails to account for the usefulness and accessibility that makes information into knowledge.

Despite my comments above, that sounds broadly correct. I'm not sure that mutual information would capture your textbook example, even when the textbook contains a lot of knowledge.

Comment by adamShimi on Exercises in Comprehensive Information Gathering · 2021-06-12T12:19:13.495Z · LW · GW

I must have read this post when you first published it, but only now does it strike me as perfectly answering one of my needs for deconfusion: building a reasonable map of vast territories of knowledge, to have more tools in mind when deconfusing. Especially with maths, I've had the problem of always changing my focus and never finishing textbooks.

But this is simply a Comprehensive Information Gathering exercise! The right way to go about it is to go through the Wikipedia page on areas of mathematics, look at each subarea in turn, and get a grip on the history, the objects studied, and the fundamental theorems.

Honestly, this plan is the first one I imagined for this issue that sounds both fun and likely to work as I intended. Thanks so much!

Comment by adamShimi on The Apprentice Experiment · 2021-06-12T11:16:32.837Z · LW · GW

Excited about this!

From a personal standpoint, I'm curious whether Aysajan learns some interesting deconfusion skills through this apprenticeship, as this is what I'm most interested in, and because I expect deconfusion to be a fundamental subskill in solving problems we don't understand.

On a community level, I really want as many people as possible tackling these problems, so I hope this results in better ways of training for this task.

Comment by adamShimi on Search-in-Territory vs Search-in-Map · 2021-06-10T12:53:33.182Z · LW · GW

This is a very interesting distinction. Notably, I feel that it points better at the distinction between "search inside" and "search outside" which I waved at in my review of Abram's post. Compared with selection vs control, this split also has the advantage that there are no recursive calls of one to the other: a controller can do selection inside, but you can't do search-in-territory by doing search-in-map (if I understand you correctly).

That being said, I feel you haven't yet deconfused optimization completely, because you don't give a less confused explanation of what "search" means. You point out that search-in-map typically looks more like "search/optimization algorithms" and search-in-territory looks more like "controllers", which is basically redirecting to selection vs control. Yet I think this is where a big part of the confusion lies, because both look like search while being notoriously hard to reconcile. And I don't think you can rely on, let's say, Alex Flint's definition of optimization, because you focus more on the internal algorithm than he does.

Key point: if we can use information to build a map before we have full information about the optimization/search task, that means we can build one map and use it for many different tasks. We can weigh all the rocks, put that info in a spreadsheet, then use the spreadsheet for many different problems: finding the rock closest in weight to a reference, finding the heaviest/lightest rock, picking out rocks which together weigh some specified amount, etc. The map is a capital investment.

One part you don't address here is the choice of what to put in the map. In your rock example, maybe the actual task will be about finding the most beautiful rock (for some formalized notion of beauty), which is completely uncorrelated with weight. Or one of the many other questions that you can't answer if your map only contains the weights. So in a sense, search-in-map requires you to know what sort of info you'll need, and what you can safely throw away.


On the thermostat example, I actually have an interesting aside from Dennett. He writes that the thermostat is an intentional system, but that the difference with humans, or even with a super advanced thermostat, is that the standard thermostat has a very abstract goal. It basically has two states and tries to be in one instead of the other, by doing its only action. One consequence is that you can plug the thermostat into another room, or have it control the level of water in a tub or the speed of a car, and it will do so.

From this perspective, the thermostat is not so much doing search-in-territory as search-in-map, with a very abstracted map that throws away basically everything.

Comment by adamShimi on Suggestions of posts on the AF to review · 2021-06-08T08:22:48.683Z · LW · GW

Putting aside how people feel for the moment (I'll come back to it), I don't think peer review should be private, and I think anyone publishing work in an openly readable forum where other researchers are expected to interact would value a thoughtful review of their work.

That being said, you're probably right that at least notifying the authors before publication is a good policy. We sort of did that for the first two reviews, in the sense of literally asking people what they wanted to get reviews for, but we should make it a habit.

Thanks for the suggestion.

Comment by adamShimi on The Alignment Forum should have more transparent membership standards · 2021-06-05T21:25:39.281Z · LW · GW

I want to push back on that. I agree that most people don't read the manual, but I think that if you're confused about something and then don't read the manual, it's on you. I also don't think they could make it much more obvious than having it always on the front page.

Maybe the main criticism is that this FAQ/intro post has a bunch of info about the first AF sequences that is probably irrelevant to most newcomers.

Comment by adamShimi on The Alignment Forum should have more transparent membership standards · 2021-06-05T20:48:03.813Z · LW · GW

I'm still confused by half the comments on this post. How can people be confused by a setting explained in detail in the only post always pinned on the AF, which is a FAQ?

Comment by adamShimi on What is the most effective way to donate to AGI XRisk mitigation? · 2021-05-31T11:20:25.010Z · LW · GW


For a bit more funding information:

Comment by adamShimi on What is the most effective way to donate to AGI XRisk mitigation? · 2021-05-30T17:34:14.195Z · LW · GW

Quick thought: I expect that the most effective donation would be to organizations funding independent researchers, notably the LTFF.

Note that I'm an independent researcher funded by the LTFF (and Beth Barnes), but even if you told me that the money would never go to me, I would still think that.

  • Grants by organizations like that have a good track record for producing valuable research, as at least two people I think are among the most interesting thinkers on the topic (John S. Wentworth and Steve Byrnes) have gotten grants from sources like that (Steve is technically funded by Beth Barnes with money from the donor lottery), and others I'm really excited about (like Alex Turner) were helped by LTFF grants.
  • Such grants allow researchers to both bootstrap their careers, and also explore less incentivized subjects related to alignment at the start of their career.
  • They are cheaper than funding a hire for somewhere like MIRI, ARC or CHAI.
Comment by adamShimi on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T13:56:26.294Z · LW · GW

I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

My current understanding is something like:

  • There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real world property we can look for concretely.
  • Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the "5 googolplex" one).

There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.

I would say the choice of agent architecture is the subjective decision. That's the point at which we decide what states and actions are possible, which completely determines the MDP. Granted, this argument is probably stronger for POMDPs (for which you have more degrees of freedom in observations), but I still see it for MDPs.

If you don't think there is subjectivity involved, do you think that for whatever (non-formal) problem we might want to solve, there is only one way to encode it as a state space and action space? Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree with the latter, but then it's a question of how the states of the actual system are encoded in the state space of the agent, and that doesn't seem unique to me.
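
To illustrate the non-uniqueness I have in mind, here is a toy sketch (hypothetical encodings invented for this comment, not from the post): the same informal "avoid the ghost" problem encoded as two different MDPs, which disagree on basic structural facts.

```python
# Toy illustration (hypothetical encodings, invented for this comment): the
# same informal "avoid the ghost" problem encoded as two different MDPs.
# Transitions map (state, action) -> next state.

# Fine-grained encoding: track the agent's position on a small corridor.
fine_transitions = {
    ("cell_0", "left"): "ghost",    # the ghost sits to the left
    ("cell_0", "right"): "cell_1",
    ("cell_1", "right"): "cell_2",
    ("cell_2", "right"): "safe",
}

# Coarse encoding: collapse everything except the outcomes.
coarse_transitions = {
    ("start", "left"): "red-ghost-game-over",
    ("start", "right"): "live-happily-ever-after",
}

def states(transitions):
    """All states mentioned by an encoding."""
    return {s for (s, _) in transitions} | set(transitions.values())

# Both encodings are faithful to the informal problem, yet they disagree on
# structural facts, e.g. how many states the agent can ever visit.
print(len(states(fine_transitions)))    # 5
print(len(states(coarse_transitions)))  # 3
```

Both models answer the informal question correctly, but power-seeking-style claims (how many options each action keeps open) come out differently depending on which encoding you pick.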

You don't have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions. 

But to falsify the "5 googolplex" claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and see what they do (to check that they indeed don't power-seek by going left). This means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
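
To make my worry concrete, here is the kind of check I have in mind, on an invented two-branch MDP (left ends the game immediately, right keeps more options open): sample reward functions, compute what an optimal policy does at the start state, and measure how often it goes left. This is only a sketch of the procedure, not Turner's actual formalism.

```python
import random

# Hypothetical toy MDP: from the start state (0), "left" reaches only the
# game-over state (1), while "right" reaches state 2 and from there one of
# two terminal states (3 or 4). Rewards are over states, discounted by GAMMA.
GAMMA = 0.9

def optimal_first_action(reward):
    """Optimal action at the start state for a given state-reward function."""
    v_left = reward[1]                                       # one terminal option
    v_right = reward[2] + GAMMA * max(reward[3], reward[4])  # two options
    return "left" if v_left > v_right else "right"

# Sample reward functions uniformly and see how often optimal policies go left.
random.seed(0)
trials = 10_000
n_left = sum(
    optimal_first_action({s: random.random() for s in range(5)}) == "left"
    for _ in range(trials)
)
print(n_left / trials)  # well below 1/2: optimal policies tend to keep options open
```

Even in this two-line lookahead version, "checking the claim" means actually computing the optimal policies' behavior, which is the step that worries me at scale.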

Comment by adamShimi on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T10:18:53.709Z · LW · GW

Despite agreeing with your conclusion, I'm unconvinced by the reasons you propose. Sure, once the interface is chosen, then the MDP is pretty much constrained by the real-world (for a reasonable modeling process). But that just means the subjectivity comes from the choice of the interface!

To be more concrete, maybe the state space of Pacman could be red-ghost, starting-state and live-happily-ever-after (replacing the right part of the MDP). Then taking the right action wouldn't be power-seeking either.

What I think is happening here is that in reality, there is a tradeoff in modeling between simplicity/legibility/usability of the model (pushing for fewer states and fewer actions) and performance/competence/optimality (pushing for more states and actions to be able to capture more subtle cases). The fact that we want performance rules out my Pacman variant, and the fact that we want simplicity rules out ofer's example.

It's not clear to me that there is one true encoding that strikes a perfect balance, but I'm okay with the idea that there is an acceptable tradeoff, with models around that point being mostly similar, in ways that probably don't change the power-seeking.

That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes - optimal policies really would tend to "die" immediately, since they'd have so many choices. 

The "5 googolplex" claim is both falsifiable and false. Given an agent architecture (specifically, the two encodings), optimal policy tendencies are not subjective. We may be uncertain about the agent's state- and action-encodings, but that doesn't mean we can imagine whatever we want.

Sure, but if you actually have to check the power-seeking to infer the structure of the MDP, then it becomes unusable as a tool for avoiding the building of power-seeking AGIs. Put differently, the value of your formalization of power-seeking, IMO, is that we can start from models of the world and reason about which actions/agents would be power-seeking and for which rewards. If I actually have to run the optimal agents to find out about power-seeking actions, then that doesn't help.

Comment by adamShimi on Attainable Utility Preservation: Concepts · 2021-05-21T13:11:34.909Z · LW · GW

(Definitely a possibility that this is answered later in the sequence)

Rereading the post and thinking about this, I wonder if AUP-based AIs can still do anything (which is what I think Steve was pointing at). Or phrased differently, whether they can still be competitive.

Sure, reading a textbook doesn't decrease the AU of most other goals, but applying the learned knowledge might. In your paperclip example, I expect that the AUP-based AI will make very few paperclips, since making them at scale would have a big impact (after all, we make paperclips in factories, and factories change the AU landscape).

More generally, AUP seems to forbid any kind of competence in a zero-sum-like situation. To go back to Steve's example, if the AI invents a great new solar cell, then it will make its owner richer and more powerful at the expense of other people, which is forbidden by AUP as I understand it.

Another way to phrase my objection: at first glance, AUP seems to forbid not only gaining power for the AI, but also gaining power for the AI's user. That sounds like a good thing, but it might also create incentives to build and use non-AUP-based AIs instead. Does that make sense, or did I fail to understand some part of the sequence that addresses this?
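
For reference, here is the shape of the penalty I'm reasoning about, as I understand the AUP idea (my paraphrase with invented placeholder numbers, not outputs of a real trained agent): the primary reward is reduced by how much an action shifts attainable utility for auxiliary goals, relative to doing nothing.

```python
# Sketch of an AUP-style penalized reward, as I understand it. The Q-values
# below are invented placeholders, not outputs of a real trained agent.

def aup_reward(reward, q_aux_action, q_aux_noop, lam=1.0):
    """Primary reward minus a penalty for shifting attainable utility.

    q_aux_action[i]: attainable utility of auxiliary goal i after the action.
    q_aux_noop[i]:   attainable utility of auxiliary goal i after doing nothing.
    """
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return reward - lam * penalty

# Reading a textbook: barely moves attainable utilities -> tiny penalty.
print(aup_reward(1.0, q_aux_action=[0.50, 0.31], q_aux_noop=[0.50, 0.30]))

# Building a paperclip factory: shifts many attainable utilities -> large
# penalty, possibly swamping the primary reward.
print(aup_reward(1.0, q_aux_action=[0.9, 0.1], q_aux_noop=[0.5, 0.3]))
```

My objection, in these terms, is that actions that make the user much more powerful also move the auxiliary Q-values, so they get penalized even when no power accrues to the AI itself.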

(An interesting consequence, if I'm right, is that AUP-based AIs might be quite competitive for making open-source things, which is pretty cool.)

Comment by adamShimi on SGD's Bias · 2021-05-19T10:50:24.945Z · LW · GW

In SGD, our “intended” drift is $-\nabla_\theta L$ - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is proportional to $-\nabla_\theta \operatorname{tr} \Sigma(\theta)$, where $\Sigma(\theta)$ is the covariance of the sampled gradients at $\theta$.

I have not followed all your reasoning, but focusing on this last formula, does it represent a bias towards less variance over the different gradients one can sample at a given point?

If so, then I do find this quite interesting. A random connection: I actually thought of one strong form of gradient hacking as forcing the gradient to be (approximately) the same for all samples. Your result seems to imply that if such forms of gradient hacking are indeed possible, then they might be incentivized by SGD.
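
A toy simulation of the effect I'm describing (my own construction, not from the post): a 1D loss with two equally deep minima, but with much noisier sampled gradients around one of them. SGD runs end up disproportionately in the low-noise minimum.

```python
import random
random.seed(0)

def grad(theta):
    # Double-well loss L(theta) = (theta**2 - 1)**2, equal minima at +/-1.
    return 4 * theta * (theta**2 - 1)

def noisy_grad(theta):
    # Location-dependent noise: sampled gradients are far noisier on the
    # right half of parameter space than on the left.
    sigma = 10.0 if theta > 0 else 0.1
    return grad(theta) + random.gauss(0, sigma)

def run_sgd(steps=5000, lr=0.01):
    theta = random.uniform(-0.5, 0.5)
    for _ in range(steps):
        theta -= lr * noisy_grad(theta)
        theta = max(-3.0, min(3.0, theta))  # keep the toy run numerically tame
    return theta

runs = [run_sgd() for _ in range(200)]
frac_low_noise = sum(t < 0 for t in runs) / len(runs)
print(frac_low_noise)  # substantially above 1/2: runs settle in the low-noise well
```

This is the connection to gradient hacking: a region where all sampled gradients agree (low variance) is exactly the kind of region this bias drifts toward.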

Comment by adamShimi on Knowledge Neurons in Pretrained Transformers · 2021-05-19T10:42:07.132Z · LW · GW

I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn't modeling humans or ensure that an agent is myopic in the sense that it isn't modeling the future.

Despite agreeing that the results are impressive, I'm less optimistic than you are about this path to microscope AI and/or myopia. It would require an exhaustive listing of what we don't want the model to know (like human modeling or human manipulation) and a way of deleting that knowledge without breaking the whole network. The first requirement seems like a deal-breaker to me, and I'm not convinced this work actually provides much evidence that more advanced knowledge can be removed this way.

Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally.

Here too, I agree with the sentiment, but I'm not convinced this is the whole story. It looks like how structured facts are learned, but as of now I see no way to generate the range of things GPT-3 and other LMs can do from key-value knowledge pairs alone.
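
The key-value picture from the quoted paragraph can be sketched concretely (toy dimensions and invented numbers, nothing like real model scale): a feed-forward block computes act(x·K)·V, so each "key" row detects a pattern in the input and gates how much of the matching "value" row is written to the output.

```python
# Toy sketch of a feed-forward layer as a key-value memory (invented numbers).
# FFN(x) = relu(x . K^T) . V : each key row detects a pattern in the input,
# and gates how much of the matching value row is added to the output.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ffn(x, keys, values):
    gates = [max(0.0, dot(x, k)) for k in keys]  # which keys fire (ReLU)
    dim = len(values[0])
    return [sum(g * v[d] for g, v in zip(gates, values)) for d in range(dim)]

# Two memorized "facts". Key 0 matches inputs like [1, 0]; key 1 matches [0, 1].
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 5.0], [7.0, 0.0]]  # value written out when each key fires

print(ffn([1.0, 0.0], keys, values))  # [0.0, 5.0] -- retrieves value 0
print(ffn([0.0, 1.0], keys, values))  # [7.0, 0.0] -- retrieves value 1

# "Editing knowledge" = overwriting one value row; only inputs matching the
# corresponding key see a different output.
values[0] = [9.0, 9.0]
print(ffn([1.0, 0.0], keys, values))  # [9.0, 9.0]
```

This picture handles lookup-like facts nicely, which is exactly why I doubt it accounts for everything else a large LM does.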