Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
Behavioral Sufficient Statistics for Goal-Directedness 2021-03-11T15:01:21.647Z
Epistemological Framing for AI Alignment Research 2021-03-08T22:05:29.210Z
Suggestions of posts on the AF to review 2021-02-16T12:40:52.520Z
Tournesol, YouTube and AI Risk 2021-02-12T18:56:18.446Z
Epistemology of HCH 2021-02-09T11:46:28.598Z
Infra-Bayesianism Unwrapped 2021-01-20T13:35:03.656Z
Against the Backward Approach to Goal-Directedness 2021-01-19T18:46:19.881Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z
The Case for a Journal of AI Alignment 2021-01-09T18:13:27.653Z
Postmortem on my Comment Challenge 2020-12-04T14:15:41.679Z
[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology 2020-11-30T17:33:43.691Z
Small Habits Shape Identity: How I became someone who exercises 2020-11-26T14:55:57.622Z
What are Examples of Great Distillers? 2020-11-12T14:09:59.128Z
The (Unofficial) Less Wrong Comment Challenge 2020-11-11T14:18:48.340Z
Why You Should Care About Goal-Directedness 2020-11-09T12:48:34.601Z
The "Backchaining to Local Search" Technique in AI Alignment 2020-09-18T15:05:02.944Z
Universality Unwrapped 2020-08-21T18:53:25.876Z
Goal-Directedness: What Success Looks Like 2020-08-16T18:33:28.714Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Will OpenAI's work unintentionally increase existential risks related to AI? 2020-08-11T18:16:56.414Z
Analyzing the Problem GPT-3 is Trying to Solve 2020-08-06T21:58:56.163Z
What are the most important papers/post/resources to read to understand more of GPT-3? 2020-08-02T20:53:30.913Z
What are you looking for in a Less Wrong post? 2020-08-01T18:00:04.738Z
Dealing with Curiosity-Stoppers 2020-07-30T22:05:02.668Z
adamShimi's Shortform 2020-07-22T19:19:27.622Z
The 8 Techniques to Tolerify the Dark World 2020-07-20T00:58:04.621Z
Locality of goals 2020-06-22T21:56:01.428Z
Goal-directedness is behavioral, not structural 2020-06-08T23:05:30.422Z
Focus: you are allowed to be bad at accomplishing your goals 2020-06-03T21:04:29.151Z
Lessons from Isaac: Pitfalls of Reason 2020-05-08T20:44:35.902Z
My Functor is Rich! 2020-03-18T18:58:39.002Z
Welcome to the Haskell Jungle 2020-03-18T18:58:18.083Z
Lessons from Isaac: Poor Little Robbie 2020-03-14T17:14:56.438Z
Where's the Turing Machine? A step towards Ontology Identification 2020-02-26T17:10:53.054Z
Goal-directed = Model-based RL? 2020-02-20T19:13:51.342Z


Comment by adamShimi on Where are intentions to be found? · 2021-04-22T17:01:01.657Z · LW · GW

I have two reactions while reading this post:

  • First, even if a given human (for example) at a fixed point in time doesn't necessarily contain everything we would want the AI to learn, learning only what's in there might already make a lot of alignment failures disappear. For example, paperclip maximizers are probably ruled out by taking one human's values at a point in time and extrapolating. But that clearly doesn't help with scenarios where the AI does the sort of bad things humans can do.
  • Second, I would argue that the you of the past might actually contain enough information to encode, if not the you of now, at least better and better versions of you through interactions with the environment. Put another way, I feel like what we're pointing at when we're pointing at a human is the normativity of human values, including how they evolve, how we think about how they evolve, and so on recursively. So you might actually get all the information you want from this part of space if the AI captures the process behind rethinking our values and ideas.
Comment by adamShimi on Gradations of Inner Alignment Obstacles · 2021-04-21T17:06:18.068Z · LW · GW

Cool post! It's clearly not super polished, but I think you're pointing at a lot of important ideas, and so it's a good thing to publish it relatively quickly.

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

As far as I understand it, the initial assumption of internal search was mostly made for two reasons: it lets you speak of the objective/goal without many of the issues around behavioral objectives; and the authors of the Risks from Learned Optimization paper felt they needed assumptions about the internals of the system to say things like "training and generalization incentivize mesa-optimization".

But personally, I really think of inner alignment in terms of goal-directed agents with misaligned goals. That's by the way one reason why I'm excited to work on deconfusing goal-directedness: I hope this will allow us to consider broader inner misalignment.

With that perspective, I see the Risks paper as arguing that, when pushed to the limit of competence, optimized goal-directed systems will have a simple internal model built around a goal, instead of being the mess of heuristics you could expect at intermediate levels of competence. But I don't necessarily think this has to be search.

I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.

The argument I find most convincing for internal representation (or at least awareness/comprehension) is that it is required for very high levels of competence towards the goal (for complex enough goals, of course). I guess that's probably similar (though not strictly the same) to your point about the "systematically misaligned".

But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (i.e., creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be a significant reason for caution, and the more success, the more reason for caution.

Your analysis of making GPT-3 made me think a lot of this great blog post (and great blog) that I just read today. The gist of this and other posts there is to think of GPT-3 as a "multiverse-generator", simulating some natural language realities. And with the prompt, the logit-bias and other aspects, you can push it to privilege certain simulations. I feel like the link with what you're saying is that making GPT-3 useful in that sense seems to push it towards simulating realities consistent with (or produced by) agents, and so almost optimizes for an inner alignment problem.

Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.

I haven't thought about or studied the lottery ticket hypotheses and related ideas enough to judge whether your proposal makes sense, but even accepting it, I'm not sure it forbids basins of attraction. It just says that once the deceptive lottery ticket is found, there is no way back. But that seems to me like something Evan says quite often: once the model is deceptive, you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.

I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?

Comment by adamShimi on Updating the Lottery Ticket Hypothesis · 2021-04-20T17:24:08.636Z · LW · GW

By Newton step, do you mean one step of Newton's method?
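(For reference, here's what I'd call one step of Newton's method for root-finding; the function and starting point are made up for illustration:)

```python
def newton_step(f, fprime, x0):
    """One step of Newton's method for finding a root of f:
    x1 = x0 - f(x0) / f'(x0)."""
    return x0 - f(x0) / fprime(x0)

# Illustration: one step towards sqrt(2), the positive root of x^2 - 2.
x1 = newton_step(lambda x: x**2 - 2, lambda x: 2 * x, 1.5)
print(x1)  # 1.4166..., already close to 1.41421...
```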

Comment by adamShimi on Updating the Lottery Ticket Hypothesis · 2021-04-20T14:49:21.577Z · LW · GW

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall θ found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that although SGD makes many successive updates, each using the gradient at the current parameter values, these "even out" such that they end up equivalent to a single update from the initial parameters?
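To check my own reading numerically, here's a toy sketch (a tiny made-up tanh network with numerical gradients, nothing from the post itself): the gap between the network and its linearization around the initial parameters shrinks roughly quadratically with how far the parameters move, which is why nearly-constant parameters make the linear approximation so good.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny made-up one-hidden-layer tanh network, nonlinear in its parameters.
def f(theta, x):
    W1 = theta[:8].reshape(4, 2)
    w2 = theta[8:]
    return w2 @ np.tanh(W1 @ x)

def grad_f(theta, x, eps=1e-6):
    # numerical gradient of f with respect to theta
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

theta0 = rng.normal(size=12)
x = rng.normal(size=2)
delta = rng.normal(size=12)

def lin_error(step):
    # gap between the true network and its linearization around theta0,
    # after moving the parameters by step * delta
    theta = theta0 + step * delta
    f_lin = f(theta0, x) + grad_f(theta0, x) @ (theta - theta0)
    return abs(f(theta, x) - f_lin)

# The linearization error shrinks roughly quadratically with the size
# of the parameter movement.
print(lin_error(0.1), lin_error(0.01), lin_error(0.001))
```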

Comment by adamShimi on Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers · 2021-04-15T12:52:10.873Z · LW · GW

I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on Linux AFAIK, and I can't be bothered at the moment to make it work using Wine.

Comment by adamShimi on Identifiability Problem for Superrational Decision Theories · 2021-04-12T12:58:16.749Z · LW · GW

As outlined in the last paragraph of the post, I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory.

Hum, then I'm not sure I understand in what way classical game theory is neater here?

I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and then you know you would have just gotten that message from the copy as well.

As long as the probabilistic coin flips are independent on both sides (you also mention the case where they're symmetric, but let's put that aside for the example), you can apply the basic probabilistic algorithm for leader election: both copies flip a coin n times to get an n-bit number, which they exchange. If the numbers are different, the copy with the smaller one says 0 and the other says 1; otherwise they each flip one more coin and return the answer. With this algorithm, you have probability at least 1 - 2^-n of deciding different values, and so you can get as close as you want to 1 (by paying the price in more random bits).
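A quick sketch of that algorithm, simulating both copies locally (the bit-width and trial count are arbitrary):

```python
import random

def leader_election_round(n):
    """Both copies draw an n-bit number and exchange them; the smaller
    number outputs 0 and the larger outputs 1. On a tie, each copy
    falls back to one independent coin flip (which may still collide)."""
    a, b = random.getrandbits(n), random.getrandbits(n)
    if a != b:
        return (0, 1) if a < b else (1, 0)
    return (random.getrandbits(1), random.getrandbits(1))

# Over many rounds, the two outputs disagree with probability
# at least 1 - 2^-n.
trials = [leader_election_round(8) for _ in range(10_000)]
disagree = sum(x != y for x, y in trials) / len(trials)
print(disagree)
```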

I'd be interested. I think even just more solved examples of the reasoning we want are useful currently.

Do you have examples of problems with copies that I could look at and that you think would be useful to study?

Comment by adamShimi on Identifiability Problem for Superrational Decision Theories · 2021-04-12T09:15:38.254Z · LW · GW

Well, if I understand the post correctly, you're saying that these two problems are fundamentally the same problem, and so rationality should be able to solve them both if it can solve one. I disagree with that, because from the perspective of distributed computing (which I'm used to), these two problems are exactly the two kinds of problems that are fundamentally distinct in a distributed setting: agreement and symmetry-breaking.

Communication won't make a difference if you're playing with a copy.

Actually, it could. Basically all of distributed computing assumes that every process runs the same algorithm, and in that case you can solve symmetry-breaking with communication plus additional constraints on the scheduling of processes. The difficulty here is that the underlying graph is symmetric; if you had some form of asymmetry (like three processes in a line, where the one in the middle has two neighbors but the others only have one), you could use that asymmetry directly to solve symmetry-breaking.
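To illustrate the line example: every process runs the same code, but can observe its own degree, and that asymmetry alone breaks the symmetry deterministically (a hypothetical toy, not an actual distributed runtime):

```python
# Three processes on a path 0 - 1 - 2. Every process runs the *same*
# algorithm, but the graph itself is asymmetric: only the middle process
# has two neighbors, so it can deterministically elect itself leader.
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def decide(pid):
    # each process only looks at its own degree
    return "leader" if len(neighbors[pid]) == 2 else "follower"

decisions = {p: decide(p) for p in neighbors}
print(decisions)
```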

(By the way, you just gave me the idea that maybe I can use my knowledge of distributed computing to look at the sort of decision problems where you play with copies. Don't know if it would be useful, but that's interesting at least.)

Comment by adamShimi on "Taking your environment as object" vs "Being subject to your environment" · 2021-04-12T09:08:11.960Z · LW · GW

Not sure I'm the right person to ask, because I tend to doubt basically anything I say or think (not all at the same time), and sometimes I forget why something makes sense and spend quite some time trying to find a good explanation. So I guess I'm naturally the type that gets something out of the imagined version.

Comment by adamShimi on Specializing in Problems We Don't Understand · 2021-04-12T09:04:16.572Z · LW · GW

The impression I always had of general systems (from afar) was that it looked cool, but it never seemed useful for doing anything other than "thinking in systems" (so not useful for doing research in another field or for making concrete applications). That's why I never felt interested. Note that I'm clearly not knowledgeable on the subject; this is just my outside impression.

I assume from your comment you think that's wrong. Is the Weinberg book a good resource for educating myself and seeing how wrong I am?

Comment by adamShimi on Specializing in Problems We Don't Understand · 2021-04-12T08:59:01.198Z · LW · GW

Fair enough. But identifying good subproblems of well-posed problems is a different skill from identifying good well-posed subproblems of a weird, unformalized problem. An example of the first would be simplifying the problem as much as possible without making it trivial (a classic technique in algorithm analysis and design), whereas an example of the second would be defining the logical induction criterion, which creates the problem of finding a logical inductor (not sure it happened in this order; that's part of what's weird about problem formulation).

And I have the intuition that there are way more useful and generalizable techniques for the first case than for the second. Do you feel differently? If so, I'm really interested in the techniques you have in mind for starting from a complex mess of intuitions and getting to a formal problem/setting.

Comment by adamShimi on [LINK] Luck, Skill, and Improving at Games · 2021-04-12T00:22:27.852Z · LW · GW

Pretty cool. The part about not blaming luck reminded me a lot of the advice to not adopt a victim mindset. I also like the corresponding advice to not take credit for luck.

Comment by adamShimi on "Taking your environment as object" vs "Being subject to your environment" · 2021-04-11T23:16:07.908Z · LW · GW

So far I have said there are three ways of getting perspective on your environment: leaving it, imagining yourself into someone outside of it, and assuming that it's hostile.


What are some other ways of getting a better perspective on your environment?

Imagining myself explaining the environment to someone else, or literally doing that. That's also a very useful technique for checking understanding, and I think it uses the same mechanism: when you read a paper, you feel a sense of familiarity and obviousness that makes you think you understand. But if you have to actually explain it, completely, then you can't really do that anymore.

I tend to do that a lot, be it for "getting outside of my head/environment" or for learning.

Comment by adamShimi on Specializing in Problems We Don't Understand · 2021-04-11T12:30:19.368Z · LW · GW

This looks like expanding on the folklore separation between engineering and research in technical fields like computer science: engineering is solving a problem we know how to solve, or know the various pieces needed for solving, whereas research is solving a problem no one has ever solved, such that we don't expect/don't know whether the standard techniques apply. Of course this is not exactly accurate, and it generalizes to fields that we wouldn't think of as engineering.

I quite like your approach; it looks like the training for an applied mathematician (in the sense of Shannon, that is, a mathematician who uses the mental tools of maths to think about a variety of problems). I don't intend right now to use the exercises that you recommend, yet this approach seems similar to what I'm trying to do: building a broad map of the territory in, say, maths or other fields, such that I can recognize that the problem I'm working on might gain from insight in a specific subfield.

One aspect I feel is not emphasized enough here is the skill of finding/formulating good problems. I don't think you're disregarding this skill, but your concrete example of finding an algorithm is a well-defined problem. We might contrast it with the kind of problems the computer science pioneers were trying to solve, like "what even is a computation/an algorithm?". And your approach of recognizing when a very specific technique applies looks less useful in the second case. Other approaches you highlight probably translate well to this setting, but I think it's valuable to dig deeper into which ones specifically, and why.

Comment by adamShimi on Identifiability Problem for Superrational Decision Theories · 2021-04-10T21:22:08.226Z · LW · GW

I don't see how the two problems are the same. They are basically the agreement and symmetry breaking problems of distributed computing, and those two are not equivalent in all models. What you're saying is simply that in the no-communication model (where the same algorithm is used on two processes that can't communicate), these two problems are not equivalent. But they are asking for fundamentally different properties, and are not equivalent in many models that actually allow communication. 

Comment by adamShimi on Phylactery Decision Theory · 2021-04-10T18:54:35.363Z · LW · GW

I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.

If we want to alter a decision theory so that it learns its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem to solve to begin with. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires reading other posts without even a summary of the references. I made an exception today because I'm trying to give more feedback, and I feel that this specific piece of feedback might be useful for you.

Basically, I'm not sure of what problem you're trying to solve with having this ability to learn your cartesian boundary, and so I'm unable to judge how well you are solving it.

Comment by adamShimi on Testing The Natural Abstraction Hypothesis: Project Intro · 2021-04-07T21:39:02.966Z · LW · GW

This project looks great! I especially like the focus on a more experimental kind of research, while still focused and informed on the specific concepts you want to investigate.

If you need some feedback on this work, don't hesitate to send me a message. ;)

Comment by adamShimi on Open & Welcome Thread – March 2021 · 2021-04-06T23:29:36.251Z · LW · GW

To be clear, I was just answering the comment, not complaining again about the editor. I find it's great, and the footnote is basically a nitpick (but a useful nitpick). I also totally get if it takes quite some time and work to implement. ;)

Comment by adamShimi on Open & Welcome Thread – March 2021 · 2021-04-06T14:06:39.037Z · LW · GW

Thanks for the link!

But yeah, I like using the WYSIWYG editor, at least if I have to edit on LW directly (otherwise vim is probably still my favorite).

Comment by adamShimi on TAI? · 2021-03-30T14:46:11.294Z · LW · GW

Quick answer without any reference, so probably biased towards my internal model: I don't think we've reached TAI yet, because I believe that if you removed every application of AI in the world (to simplify the definition, every product of ML), the vast majority of people wouldn't see any difference, and probably some positive difference (less attention manipulation on social media, for example).

Compare with removing every computing device, or removing electricity.

And taking as examples the AI we're making now, I expect that your first two points are wrong: people are already trying to build AI into everything, and it's basically always useless or not that useful.

(An example of the disconnect between AI as thought about here or in research lab, and practical application, is that AFAIK, nobody knows how to make money with RL)

The question of whether we have enough resources to scale to TAI right now is one I haven't thought about enough to give a decent answer, but you can find discussions of it on LW.

Comment by adamShimi on Vanessa Kosoy's Shortform · 2021-03-29T18:42:30.460Z · LW · GW

Oh, right, that makes a lot of sense.

So is the general idea that we quantilize such that, in expectation, we're choosing an action without corrupted utility (intuitively, by having something like more than twice as many actions in the quantilization as we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Comment by adamShimi on Review of "Fun with +12 OOMs of Compute" · 2021-03-29T17:45:54.728Z · LW · GW

About the update

You're right, that's what would happen with an update.

I think the model I have in mind (although I hadn't explicitly thought about it until now) is something like a distribution over ways to reach TAI (capturing how probable it is that each is the first way to reach TAI), where each option comes with its own distribution (let's say over years). Obviously you can compress that into a single distribution over years, but then you lose the ability to do fine-grained updating.

For example, I imagine that someone with relatively low probability on prosaic AGI being the first route to TAI, upon reading your post, would have reasons to update the distribution for prosaic AGI in the way you discuss, but not to update the probability that prosaic AGI will be the first to reach TAI. On the other hand, if there were an argument centered more around an amount of compute we could plausibly get in a short timeframe (the kind of thing we discuss as potential follow-up work), then I'd expect that this same person, if convinced, would put more probability on prosaic AGI being the first to reach TAI.
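Here's a toy sketch of the kind of fine-grained updating I mean (all routes, probabilities, and distributions are made up for illustration): shifting only the prosaic-AGI component's timeline moves the overall expectation, without touching the probability that prosaic AGI comes first.

```python
import numpy as np

# Toy mixture model: P(TAI in year y) = sum over routes r of
# P(route r is first) * P_r(TAI in year y). All numbers are made up.
years = np.arange(2025, 2100)

def timeline(mean, std):
    # discretized, normalized Gaussian over years
    w = np.exp(-0.5 * ((years - mean) / std) ** 2)
    return w / w.sum()

route_probs = {"prosaic": 0.4, "other": 0.6}
route_dists = {"prosaic": timeline(2045, 8), "other": timeline(2060, 12)}

def expected_year():
    overall = sum(p * route_dists[r] for r, p in route_probs.items())
    return (overall * years).sum()

before = expected_year()
# Fine-grained update: shift only the prosaic route's timeline earlier,
# without changing P(prosaic is first).
route_dists["prosaic"] = timeline(2035, 8)
after = expected_year()
print(before, after)  # the overall expectation moves earlier
```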

Graph-based argument

I must admit that I have trouble reading your graph because there's no scale (although I expect the spiky part is centered at +12 OOMs?). As for the textual argument, I actually think it makes sense to put quite low probability on +13 OOMs if one agrees with your scenarios.

Maybe my argument is a bit weird, but it goes something like this: based on your scenarios, it should be almost certain that we can reach TAI with +12 OOMs of compute. If that's not the case, then there's something fundamentally difficult about reaching TAI through prosaic AGI (because you're basically throwing all the compute we want at it), and so I expect very little additional probability from one more OOM.

The part about this reasoning that feels weird is that I reason about 13 OOMs based on what happens at 12 OOMs, and the idea that we care about 13 OOMs iff 12 OOMs is not enough. It might be completely wrong.

Reasons for 12 OOMs

To the first suspicion I'll say: I had good reasons for writing about 12 rather than 6 which I am happy to tell you about if you like.

I'm interested, and (without knowing them) I expect I'll wish you had put them in the post, to deal with the implicit conclusion that you couldn't argue for 6 OOMs.

Also interested in your arguments for 6 OOMs, or pointers.

Comment by adamShimi on Review of "Fun with +12 OOMs of Compute" · 2021-03-29T06:50:51.900Z · LW · GW

Let me try to make an analogy with your argument.

Say we want to make X. What you're saying is "with 10^12 dollars, we could do it that way". Why on earth would I update at all on whether it can be done with 10^6 dollars? If your scenario works with that amount, then you should have described it using only that much money. If it doesn't, then you're not providing evidence for the cheaper case.

Similarly here, if someone starts with a low credence on prosaic AGI, I can see how your arguments would make them put a bunch of probability mass close to the +12 OOMs mark. But they have no reason to put probability mass anywhere far from that point, since the scenarios you give are tailored to it. And lacking an argument for why you can get that much compute in a short timeline, they probably end up thinking that if prosaic AGI ever happens, it's probably after every other option. Which seems like the opposite of the point you're trying to make.

Comment by adamShimi on Review of "Fun with +12 OOMs of Compute" · 2021-03-28T21:20:46.371Z · LW · GW

You're welcome!

To put it another way: I don't actually believe we will get to +12 OOMs of compute, or anywhere close, anytime soon. Instead, I think that if we had +12 OOMs, we would very likely get TAI very quickly, and then I infer from that fact that the probability of getting TAI in the next 6 OOMs is higher than it would otherwise be (if I thought that +12 OOMs probably wasn't enough, then my credence in the next 6 OOMs would be correspondingly lower).

To some extent this reply also partly addresses the concerns you raised about memory and bandwidth--I'm not actually saying that we actually will scale that much; I'm using what would happen if we magically did as an argument for what we should expect if we (non-magically) scale a smaller amount.

(Talking only for myself here)

Rereading your post after seeing this comment:

What I’ve done in this post is present an intuition pump, a thought experiment that might elicit in the reader (as it does in me) the sense that the probability distribution should have the bulk of its mass by the 10^35 mark.

I personally misread this, and understood "the bulk of its mass at the 10^35 mark". The correct reading is more in line with what you're saying here. That's probably a reason why I personally focused on the +12 OOMs mark (I mean, that's also in the title).

So I agree we misunderstood some parts of your post, but I still think our issue remains. Except that instead of being about justifying +12 OOMs of magnitude in the short term, it becomes about justifying why the +12 OOMs examples should have any impact on, let's say, +6 OOMs.

I personally don't feel like your examples give me an argument for anywhere but the +12 OOMs mark. That's where they live: those examples seem to require that much compute, or at least a pretty big chunk of it. So reading your post makes me feel like I should have more probability mass at that mark or very close to it, but I don't see any reason to update the probability at the +6 OOMs mark, say.

And if the +12 OOMs looks really far, as it does in my point of view, then that definitely doesn't make me update towards shorter timelines.

Comment by adamShimi on Vanessa Kosoy's Shortform · 2021-03-27T20:30:25.490Z · LW · GW

However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.

I don't understand what you mean here by quantilizing. The meaning I know is taking a random action among the top α fraction of actions under a given base distribution. But I don't see a distribution here, or even a clear ordering over actions (given that we don't have access to the utility function).
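To be concrete about the meaning I know, here's a sketch of that definition for a finite action set (all numbers made up):

```python
import random

def quantilize(actions, base_probs, utility, alpha, rng=random):
    """Sample from the base distribution, conditioned on landing in the
    top-alpha fraction of its probability mass when actions are ordered
    by utility (the definition of quantilizing referred to above)."""
    order = sorted(range(len(actions)), key=lambda i: -utility(actions[i]))
    top, mass = [], 0.0
    for i in order:
        top.append(i)
        mass += base_probs[i]
        if mass >= alpha:
            break
    weights = [base_probs[i] for i in top]
    idx = rng.choices(top, weights=weights)[0]
    return actions[idx]

# Made-up example: uniform base over 4 actions, utility = the action itself;
# alpha = 0.5 keeps only the top half of the mass (actions 3 and 2).
actions = [0, 1, 2, 3]
base = [0.25, 0.25, 0.25, 0.25]
samples = {quantilize(actions, base, lambda a: a, 0.5) for _ in range(200)}
print(samples)
```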

I'm probably missing something obvious, but more details would really help.

Comment by adamShimi on Generalizing Power to multi-agent games · 2021-03-27T17:28:39.319Z · LW · GW

Glad to be helpful!

I go into more detail in my answer to Alex, but what I want to say here is that I don't feel like you use the power-scarcity idea enough in the post itself. As you said, it's one of three final notes, presented without any emphasis.

So while I agree that the power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.

Comment by adamShimi on Generalizing Power to multi-agent games · 2021-03-27T17:22:28.577Z · LW · GW

Thanks for the detailed reply!

I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to come around at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.

The best way to describe my interpretation is that I feel you two went for the "scientific paper" style, while the current state of the research, as well as the argument for its value, fits more the "here's-a-cool-formal-idea" blogpost or workshop paper. And that's independent of the importance of the result. To say it differently: I'm ready to accept the importance of a formalism without much explanation of why I should care if it shows a lot of cool results, but when the results are few, I need a more detailed story of why I should care.

About your specific story now:

Coming off of Optimal Policies Tend to Seek Power last summer, I felt like I understood single-agent Power reasonably well (at that point in time, I had already dropped the assumption of optimality). Last summer, "understand multi-agent power" was actually the project I intended to work on under Andrew Critch. I ended up understanding defection instead (and how it wasn't necessarily related to Power-seeking), and corrigibility-like properties, and further expanding the single-agent results. But I was still pretty confused about the multi-agent situation.

Nothing to say here, except that you have the frustrating (for me) ability to make me want to read 5 of your posts in detail while explaining something completely different. I am also supposed to do my own research, you know? (Related: I'd be excited to review one of your posts as part of the review project we're doing with a bunch of other researchers. Not sure which post of yours would be most appropriate, though. If you have some idea, you can post it here. ;) )

The crux was, in an MDP, you've got a state, and it's pretty clear what an agent can do. But in the multi-agent case, now you've got other reasoners, and now you have to account for their influence. So at first I thought, 

maybe Power is about being able to enforce your will even against the best efforts of the other players

which would correspond to everyone else minmax-ing you on any goal you chose. But this wasn't quite right. I thought about this for a while, and I didn't make much progress, and somehow I didn't come up with the formalism in this post until this winter when I started working with Jacob. In hindsight, maybe it's obvious: 

  • in an MDP, the relevant "situation" is the current state; measure the agent's average optimal value at that state.
  • in a non-iterated multi-agent game, the relevant "situation" is just the other players' strategy profile; measure your average maximum reward, assuming everyone else follows the strategy profile.
    • This should extend naturally into Bayesian stochastic games, to account for sequential decision-making and truly generalize the MDP results.

When phrased that way, I think my "issue" is that the subtlety you add is mostly hidden within the additional parameter of the strategy profile. That is, with the original intuition, you don't have to find out what the other players will actually do; here you kind of have to. It's a good thing as I agree with you that it makes the intuition subtler, but it also creates a whole new complex problem of inferring strategies.
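To make this concrete for myself, here is a toy sketch where Power takes the other players' strategy profile as an explicit input (my own made-up 2x2 game with uniformly random goals; nothing here is from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def power(opponent_strategy, n_goals=10_000):
    """Multi-agent analogue of "average optimal value at a state":
    average best-response value against a *fixed* opponent profile,
    over uniformly random goals (an assumption of this sketch)."""
    # goals: (n_goals, own actions, opponent actions) random reward tables
    rewards = rng.uniform(0, 1, size=(n_goals, 2, 2))
    expected = rewards @ opponent_strategy    # expected reward per own action
    return expected.max(axis=1).mean()        # best response, averaged over goals

# The "hidden parameter": Power changes with what the others actually do.
print(power(np.array([0.5, 0.5])))   # vs. a uniformly mixing opponent
print(power(np.array([1.0, 0.0])))   # vs. an opponent who always plays action 0
```

The point of the sketch is just that `opponent_strategy` has to be supplied from somewhere, which is exactly the inference problem I'm worried about.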

At this point, I went to reread the last sections, and realized that you're partially dealing with my problem by linking power with well-known strategy profiles (the Nash equilibria).

But for me, I got excited about the Power formalism when (IIRC) I proposed to Jacob that we prove results about it. Jacob was the one who formulated the theorem, and I actually didn't buy it at first; my naive intuition was that Power should always be constant when summed over players whose types are drawn from the constant-sum distribution. This was wrong, so I was pretty surprised.

This part pushed me to reread the statements in detail. If I get it correctly, you had the intuition that the power behaved like "will this player win", whereas it actually works as "keeping everything else fixed, how well can this player end up". The trick that makes the theorem true (and the sum of the Powers bigger than the constant) is that, for a strategy profile that isn't a Nash equilibrium, multiple players might each gain a lot by changing their own action in turn while everything else stays fixed.

I'm a bit ashamed, because that's actually explained in the intuition for the proof, but I didn't get it on the first reading. I also see now that it was the point of the discussion before the theorem, but that part flew over my head. So my advice would be to explain the initial intuition, and why it is wrong, in even more detail, including where in the maths this happens (the fixing of ).
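To convince myself of this, I wrote the toy version below (my own illustrative constant-sum matching-pennies game, not the formalism of the post): at the Nash equilibrium the two best-response values sum to the constant, while at a non-equilibrium profile each player could "win" if they alone deviated, so the sum overshoots.

```python
import numpy as np

# Constant-sum matching-pennies variant (illustrative): player 1 gets
# M[a1, a2], player 2 gets 1 - M[a1, a2], so payoffs always sum to 1.
M = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def power_sum(s1, s2):
    # Each player's "Power": best-response value holding the OTHER
    # player's strategy fixed.
    p1 = (M @ s2).max()          # player 1's best response vs. fixed s2
    p2 = ((1 - M).T @ s1).max()  # player 2's best response vs. fixed s1
    return p1 + p2

nash = np.array([0.5, 0.5])   # the unique equilibrium: mix uniformly
pure = np.array([1.0, 0.0])   # a non-equilibrium pure strategy

print(power_sum(nash, nash))  # 1.0: at equilibrium, Powers sum to the constant
print(power_sum(pure, pure))  # 2.0: off equilibrium, both can "win" in turn
```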

My updated take after getting this point is that I'm a bit more excited about your formalism.

But the thing I'm most excited about is how I had this intuitive picture of "if your goals are unaligned, then in worlds like ours, one person gaining power means other people must lose power, after 'some point'."

Intuitively this seems obvious, just like the community knew about instrumental convergence before my formal results. But I'm excited now that we can prove the intuitively correct conclusion, using a notion of Power that mirrors the one used in the single-agent case for the existing power-seeking results. And this wasn't obvious to me, at least.

I agree that this is exciting, but this is only mentioned in the last line of the post, as one perspective among others. Notably, it wasn't clear at all that this was the main application of this work.

Comment by adamShimi on Generalizing Power to multi-agent games · 2021-03-25T01:22:15.992Z · LW · GW

Ok, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally.

Maybe making all vector profiles bold (like for the action profile) would help readers see at a glance the type of each parameter. If I had seen that it was a strategy profile, I would have inferred immediately what it meant.

Comment by adamShimi on Generalizing Power to multi-agent games · 2021-03-24T22:55:25.954Z · LW · GW

Exciting to see new people tackling AI Alignment research questions! (And I'm already excited by what Alex is doing, so having more people working on his kind of research feels like a good thing.)

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (in a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just not seeing how important it is, but then that is not obvious from the post alone.

On the positive side, it was quite agreeable to read, and I followed all the formal parts. My only criticism of the presentation is that I would have liked a statement upfront of what will be proved/done in the post, instead of having to wait until the last section.

This might be harsh criticism, but I really encourage you to keep working in the field, and hopefully prove me wrong by expanding on this work in more advanced and exciting ways.

Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.

This is an example of wit done well in a "serious" post. I approve.

Strategies (technically, mixed strategies) in a Bayesian game are given by functions . Thus, even given a fixed strategy profile , any notion of "expected reward of an action" will have to account for uncertainty in other players' types. We do so by defining interim expected utility for player  as follows:

You haven't defined  at that point, and you don't introduce the indexing  for the other strategies before the next line. So is this a typo (where you wanted to write ) or am I just misunderstanding the formula? I'm even more confused because you use  to compute , so if it's not a typo, this means your interim utility assumes that every other agent uses the same strategy?

Coming back after reading more: do you use  to mean "the strategy profile of every player except "? That would make more sense of the formulas (since you fix , there's no reason to have a ), but if that's the case, then this notation is horrible (no offense).

By the way, indexing the other strategies by  instead of, let's say  or  is quite unconventional and confusing.
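For reference, here is how I would compute an interim expected utility in a small Bayesian game, written with the conventional indexing I'd expect (all payoffs, sizes, and names here are my own illustrative assumptions, not the post's notation):

```python
import numpy as np

rng = np.random.default_rng(1)

n_types, n_actions = 2, 2
# utility[i][t1, t2, a1, a2] = player i's payoff under types (t1, t2)
# and actions (a1, a2); the payoffs are made up for illustration.
utility = [rng.uniform(0, 1, size=(n_types, n_types, n_actions, n_actions))
           for _ in range(2)]

# A mixed strategy maps a type to a distribution over actions.
sigma = [np.full((n_types, n_actions), 0.5) for _ in range(2)]

def interim_utility(i, theta_i, action_i):
    """Expected utility for player i of type theta_i playing action_i,
    averaging over the OTHER player's type (uniform prior assumed here)
    and their mixed action under the other player's strategy."""
    j = 1 - i
    total = 0.0
    for theta_j in range(n_types):
        for a_j in range(n_actions):
            prob = (1 / n_types) * sigma[j][theta_j, a_j]
            types = (theta_i, theta_j) if i == 0 else (theta_j, theta_i)
            acts = (action_i, a_j) if i == 0 else (a_j, action_i)
            total += prob * utility[i][types + acts]
    return total

print(interim_utility(0, 0, 0))
```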

It initially seems unintuitive that as players' strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes".

I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals.

Comment by adamShimi on Against evolution as an analogy for how humans will create AGI · 2021-03-24T22:18:49.394Z · LW · GW

Just wanted to say that this comment added a lot of things to my reading list, so thanks for that (but I'm clearly not well-read enough to go into the discussion).

Comment by adamShimi on My research methodology · 2021-03-23T00:50:32.208Z · LW · GW

Thanks for writing this! I'm quite excited by learning more about your meta-agenda and your research process, and this reading stimulated me about my own research process.

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs that way"? Or is it more that, even with such arguments, you see incentives for people to build such AIs anyway, so we might as well consider that we have to solve the problem even in such problematic cases?

This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:

  • We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
  • We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
  • We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.

Of these, only the last one looks to me like it makes things simpler. The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior. Said differently, if you have to rule out every plausible scenario, then simple testing doesn't cut it. And for the second, my personal worry about work on toy models is that solutions work on test cases but not on practical ones, not the other way around.

I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.

Reading that paragraph, I feel like you addressed some of my questions from above. One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.

What this looks like (3 examples)

Your rundown of examples from your research was really helpful, not only to get a grip on the process, but also because it clarified the path of refinement of your different proposals. I think it might be worth making it its own post, maybe with more examples, as a view of how your "stable" evolved over the years.

My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”

This made me think of this famous paper in the theory of distributed computing, and especially what Nancy Lynch, the author, says about the process of working on impossibility results:

How does one go about working on an impossibility proof? [...]

Then it's time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternatively on one direction and on the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.


I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:

I expect this description of the process to be really helpful to many starting researchers who don't know where to push when one direction or approach fails.

  • I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or to find empirical facts that make alignment easier. We want to get those sooner rather than later.

This is the main reason I'm excited by empirical work.


For the objections and your responses, I don't have any specific comment, except that I pretty much agree with most of what you say. On the differences with traditional theoretical computer science, I feel like the biggest one right now is that most of the work here lies in "grasping towards the precise problem" instead of "solving a well-defined precise problem". I would expect that this is because the problem is harder, because the field is younger and has less theoretical work behind it, and because we are not satisfied by simply working on a tractable and/or exciting precise problem -- it has to be relevant to alignment.

Comment by adamShimi on Demand offsetting · 2021-03-22T23:51:40.833Z · LW · GW

Yeah, I didn't write an answer earlier but my first thought was that it's a classical case of confusing easier/simpler with "I can find a one sentence handle". Not that far from "The gods did it" in terms of hiding the simplicity in language.

Comment by adamShimi on Open & Welcome Thread – March 2021 · 2021-03-22T13:01:47.129Z · LW · GW

I mean, what I really want is a modern footnote system where I click on the footnote and it appears where I'm at, instead of having to jump around. But I would already be quite happy with being able to have footnotes at all.

Comment by adamShimi on Open & Welcome Thread – March 2021 · 2021-03-22T12:03:32.464Z · LW · GW

This is a common complaint, so maybe I shouldn't voice it again, but I would really like to be able to use footnotes in the draft js editor. I can't be bothered to use the markdown editor (because now I usually share my draft beforehand as a gdoc instead of just writing it for myself in markdown using Vim), but I feel regularly stifled by not being able to add footnotes to my posts.

Comment by adamShimi on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-15T18:11:11.144Z · LW · GW

To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with you.

The summary is that you can see the arguments made and the constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot more), then it's misleading to call any one of them the solution.

So I've been working these last few days on arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations only has one solution.
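The analogy in code (a made-up linear system, nothing from the actual goal-directedness discussion): as long as the constraints admit several solutions, calling any one of them "the" solution is misleading.

```python
import numpy as np

# One equation, two unknowns: the "arguments and constraints" pin down
# a whole line of solutions, not a single point.
A = np.array([[1.0, 1.0]])   # constraint: x + y = 2
b = np.array([2.0])

candidates = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
for c in candidates:
    assert np.allclose(A @ c, b)   # every candidate satisfies the constraints
print("the set of equations has more than one solution")
```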

Comment by adamShimi on Epistemological Framing for AI Alignment Research · 2021-03-15T18:08:09.391Z · LW · GW

Thanks for the feedback!

Who? It would be helpful to have some links so I can go read what they said.

That was one of my big frustrations when writing this post: I only saw this topic pop up in personal conversation, not really in published posts. And so I didn't want to give names of people who just discussed that with me on a zoom call or in a chat. But I totally feel you -- I'm always annoyed by posts that pretend to answer a criticism without pointing to it.

On this more complicated (but IMO more accurate) model, your post is itself an attempt to make AI alignment paradigmatic! After all, you are saying we should have multiple paradigms (i.e. you push to make parts of AI alignment more paradigmatic) and that they should fit together into this overall epistemic structure you propose. Insofar as your proposed epistemic structure is more substantial than the default epistemic structure that always exists between paradigms (e.g. the one that exists now) then it's an attempt make the whole of AI alignment more paradigmatic too, even if not maximally paradigmatic.

Of course, that's not necessarily a bad thing -- your search for a paradigm is not naive, and the paradigm you propose is flexible and noncommital (i.e. not-maximally-paradigmatic?) enough that it should be able to avoid the problems you highlight. (I like the paradigm you propose! It seems like a fairly solid, safe first step.)

That's a really impressive comment, because my last rewrite of the post was precisely meant to hint that this was the "right way" (in my opinion) to make the field paradigmatic, instead of arguing directly that AI Alignment should be made paradigmatic (which is what my previous draft attempted). So I basically agree with what you say.

I think you could instead have structured your post like this:

1. Against Premature Paradigmitization: [Argues that when a body of ongoing research is sufficiently young/confused, pushing to paradigmitize it results in bad assumptions being snuck in, bad constraints on the problem, too little attention on what actually matters, etc. Gives some examples.]

2. Paradigmiticization of Alignment is Premature: [Argues that it would be premature to push for paratigmization now. Maybe lists some major paradigms or proto-paradigms proposed by various people and explains why it would be bad to make any one of them The King. Maybe argues that in general it's best to let these things happen naturally than to push for them.]

If I agreed with what you wrote before, this part strikes me as quite different from what I'm saying. Or rather, you're only focusing on one aspect. Because I actually argue for two things:

  • That we should have a paradigm for the "AIs" part, a paradigm for the "well-behaved" part, and from these we get a paradigm for the solving part. This has nothing to do with the field being young and/or confused, and everything to do with the field being focused on solving a problem. (That's the part I feel your version is missing.)
  • That in the current state of our knowledge, fixing those paradigms is premature; we should instead do more work on comparing and extending multiple paradigms for each of the "slots" from the previous point, and similarly have a go at solving different variants of the problem. That's the part you mostly get right.

It's partly my fault, because I'm not stating it that way.

I think overall my reaction is: This is too meta; can you point to any specific, concrete things people are doing that they should do differently? For example, I think of Richard Ngo's "AI Safety from First Principles," Bostrom's Superintelligence, maybe Christiano's stuff, MIRI's stuff, and CAIS as attempts to build paradigms that (if things go as well as their authors hope) could become The Big Paradigm we All Follow. Are you saying people should stop trying to write things like this? Probably not... so then what are you recommending? That people not get too invested into any one particular paradigm, and start thinking it is The One, until we've had more time to process everything? Well, I feel like people are pretty good about that already.

My point is that thinking of your examples as "big paradigms of AI" is the source of the confusion, and a massive problem within the field. If we use my framing instead, then you can split each of these big proposals into its paradigm for "AIs", its paradigm for "well-behaved", and thus its paradigm for the solving part. This actually shows you where they agree and where they disagree. If you're trying to build a new perspective on AI Alignment, then I also think my framing is a good lens for crystallizing your dissatisfactions with the current proposals.

Ultimately, this framing is a tool of philosophy of science, and so it probably won't be useful to anyone not doing philosophy of science. The catch is that we all do a bit of philosophy of science regularly: when trying to decide what to work on, when interpreting work, when building these research agendas and proposals. I hope that this tool will help on these occasions.

I very much like your idea of testing this out. It'll be hard to test, since it's up to your subjective judgment of how useful this way of thinking is, but it's worth a shot! I'll be looking forward to the results.

That's why I asked people who are not as invested in this framing (and can be quite critical) to help me do these reviews -- hopefully that will make them less biased! (We also chose some posts specifically because they didn't fit neatly into my framing.)

Comment by adamShimi on Suggestions of posts on the AF to review · 2021-03-12T15:35:32.373Z · LW · GW

If we do only one, which one do you think matters the most?

Comment by adamShimi on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-12T15:15:44.311Z · LW · GW

Thanks for commenting on your reaction to this post!

That being said, I'm a bit confused by your comment. You seem to write off approaches that attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all of the behavior) and extracting relevant statistics to study questions related to goal-directedness.

Can you maybe give more details?

Comment by adamShimi on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-12T15:09:46.458Z · LW · GW

Thanks for the spot-on pushback!

I do understand what a sufficient statistic is -- which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't properly defend the claim that the statistics I provide are really sufficient.

If I try to explain myself, what I want to say in this post is probably something like

  • Knowing these intuitive properties about  and the goals seems sufficient to express and address basically any question we have related to goals and goal-directedness. (in a very vague intuitive way that I can't really justify).
  • To think about that in a grounded way, here are formulas for each property that look like they capture these properties.
  • Now what's left to do is to attack the aforementioned questions about goals and goal-directedness with these statistics, and see if they're enough. (Which is the topic of the next few posts)

Honestly, I don't think there's an argument showing that these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to the exploration of goal-directedness, carving out more grounded questions:

  • Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what's missing.
  • Are my formulas adequate formalizations of the intuitive properties?

This post mostly focuses on the second aspect, and to be honest, not even in as much detail as one could go.

Maybe that means this post shouldn't exist, and I should have waited to see whether I could literally formalize every question about goals and goal-directedness. But posting it to gather feedback on whether these statistics make sense to people, and whether they feel something's missing, seemed valuable.

That being said, my mistake (and what caused your knee-jerk reaction) was to just say these are literally sufficient statistics, instead of presenting them the way I did in this comment. I'll try to rewrite a couple of sentences to make that clear (and add another note at the beginning so your comment doesn't look obsolete).

Comment by adamShimi on Daniel Kokotajlo's Shortform · 2021-03-10T10:49:27.072Z · LW · GW

Cool list! I'll look into the ones I don't know or haven't read yet.

Comment by adamShimi on Towards a Mechanistic Understanding of Goal-Directedness · 2021-03-09T22:45:04.968Z · LW · GW

Nice post! Surprisingly, I'm interested in the topic. ^^

Funny too that you focus on an idea I am writing a post about (albeit from a different angle). I think I broadly agree with your conjectures, for sufficient competence and generalization at least.

Most discussion about goal-directed behavior has focused on a behavioral understanding, which can roughly be described as using the intentional stance to predict behavior.

I'm not sure I agree with that. Our lit review shows that there are both behavioral and mechanistic approaches (Richard's goal-directed agency is an example of the latter).

A machine “is an NFA (mechanistically)” if the internal mechanism has non-deterministic transitions.

The analogy is great, but if I may nitpick a little, I'm not sure a non-deterministic mechanism makes sense. You can have either determinism or probabilities, but I don't see how to implement non-determinism. That's, by the way, one reason why non-deterministic Turing machines aren't really used anymore when talking about complexity classes like NP.
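A quick sketch of what I mean (a toy NFA of my own, not from the post): whatever you do, "running" an NFA is a deterministic simulation of it -- here, tracking the set of reachable states, as in the classic subset construction.

```python
# Toy NFA over {a, b} accepting strings that end in "ab".
# The transition on ("q0", "a") is the non-deterministic choice;
# the simulator below resolves it deterministically by tracking
# the whole set of reachable states.
nfa = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
start, accepting = {"q0"}, {"q2"}

def accepts(word):
    states = set(start)
    for symbol in word:
        states = set().union(*(nfa.get((s, symbol), set()) for s in states))
    return bool(states & accepting)

print(accepts("aab"))   # True
print(accepts("aba"))   # False
```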

Adam Shimi’s Literature Review on Goal-Directedness identifies five properties behaviorally goal-directed systems have

Two corrections here: the post was written with Michele Campolo and Joe Collman, so they should also be given credit; and we identify five properties that the literature on the subject focuses on and agrees about. We don't necessarily say that they are all necessary or equally important.

We restructure these properties hierarchically:

I would like more explanations here, because I'm not sure that I follow. Specifically, I can't make sense of "what is the distribution over goals?". Are you talking about the prior over goals in some sort of bayesian goal-inference?

Roughly speaking, an agent is mechanistically goal-directed if we can separate it into a goal that is being pursued and an optimization process doing that pursuit.

I like this. My current position (which will be written down in my next post on the subject) is that these mechanistically goal-directed systems are actually behaviorally goal-directed systems at a certain level of competence. They also represent a point where "simple models" become more predictive than the intentional stance, because the optimization itself can be explained by a simple model.

Efficient: The more mechanistically goal-directed a system is, the more efficiently it pursues its goal.

Shouldn't that be the other way around?

We omit “far-sighted” because this is not a property intrinsically related to goal-directedness. We view far-sighted goal-directed agents as more dangerous than near-sighted ones, but not less goal-directed. While there might be a large difference between far-sighted and near-sighted agents, the mechanistic difference is as small as a single discount parameter.

It's funny, because I actually see far-sightedness as a property of the internal structure more than of the behavior. So I would expect a mechanistically goal-directed system to show some far-sightedness.

However, many possible internal mechanisms can result in the same behavior, so this connection is lossy. For example, a maze-solver can either be employing a good set of heuristics or implementing depth-first search.

But those two maze-solvers won't actually have the same behavior. I think the lossy connection doesn't come from the fact that multiple internal mechanisms can result in the same behavior "over all situations" (because in that case the internal differences are irrelevant), but from the fact that they can result in the same behavior on the training/testing environments considered.

Algorithms can be behaviorally linear time-complexity if they tend to take time that scales linearly with the input length and mechanistically linear time-complexity if they’re provably in O(n).

I disagree with that example. What you call behavioral time complexity is more like average-case complexity (or maybe smoothed analysis). And in complexity theory, the only notion that exists is the behavioral one.
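To illustrate with a toy sketch of my own: for a linear scan, the provable O(n) bound and the measured scaling describe the same behavioral quantity (an operation count).

```python
def linear_scan(xs):
    """Sum a list while counting basic operations; provably O(n):
    one loop iteration with constant work per element."""
    total, ops = 0, 0
    for x in xs:
        total += x
        ops += 1
    return total, ops

# Measured behavior: doubling the input doubles the operation count.
_, ops_n = linear_scan(list(range(1000)))
_, ops_2n = linear_scan(list(range(2000)))
print(ops_2n / ops_n)  # 2.0
```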

Comment by adamShimi on Book review: "A Thousand Brains" by Jeff Hawkins · 2021-03-09T13:24:55.701Z · LW · GW

Thanks for the nice review! It's great to have the reading of someone who understands the current state of neuroscience well enough to point out aspects of the book at odds with the neuroscience consensus. My big takeaway is that I should look a bit more into neuroscience-based approaches to AGI, because they might be important, and require different alignment approaches.

On a more rhetorical level, I'm impressed by how you manage to make me ask a question (okay, but what is the evidence for this uniformity of the neocortex?) and then point to some previous work you did on the topic. That changes my perspective on those posts completely, because it makes it easier to see their point within AI Alignment research (instead of just intellectual curiosity).

So if indeed we can get AGI by reverse-engineering just the neocortex (and its “helper” organs like the thalamus and hippocampus), and if the neocortex is a relatively simple, human-legible, learning algorithm, then all of the sudden it doesn’t sound so crazy for Hawkins to say that brain-like AGI is feasible, and not centuries away, but rather already starting to crystallize into view on the horizon.

I might have missed it, but what is the argument for the neocortex learning algorithm being human-legible? That seems pretty relevant to this approach.

This is a big and important section of the book. I’m going to skip it. My excuse is: I wrote a summary of an interview he did a while back, and that post covered more-or-less similar ground. That said, this book describes it better, including a new and helpful (albeit still a bit sketchy) discussion of learning abstract concepts.

I'm fine with you redirecting to a previous post, but I would have appreciated at least a one-sentence summary and opinion.

Some people (cf. Stuart Russell's book) are concerned that the development of AGI poses a substantial risk of catastrophic accidents, up to and including human extinction. They therefore urge research into how to ensure that AIs robustly do what humans want them to do—just as Enrico Fermi invented nuclear reactor control rods before he built the first nuclear reactor.

Jeff Hawkins is having none of it. “When I read about these concerns,” he says, “I feel that the arguments are being made without any understanding of what intelligence is.”

Writing this part before going on to read the rest: intuitively, an AGI along those lines seems less dangerous than a purely artificial approach. Indeed, I expect that such an AGI will share some underlying aspects of basic human cognition (or something similar), and thus things like common sense and human morals might be easier to push for. Does that make any sense?

Going back after reading the rest of the post, it seems that these sorts of aspects of human cognition would come more from what you call the Judge, with all the difficulties in implementing it.


No specific comment on your explanation of the risks; I just want to say that you do a very good job of it!

Comment by adamShimi on The case for aligning narrowly superhuman models · 2021-03-08T14:28:31.309Z · LW · GW

Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researchers' opinions of this proposal.

Personally, I'm enthusiastic about it, with caveats expanded below. If I try to interpret your proposal along the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build, and what do we want). My caveats can be summarized by what I say in that post: as long as we're not really sure that we got the terms of the problem well-defined, we cannot turn the whole field into this Solving part.

As a quick summary of what I get into in my detailed feedback, I think more work on this kind of problem will be net positive and very useful if:

  • we are able to get reasonably good guarantees that doing a specific experiment doesn't present too big of a risk;
  • this kind of work stays in conversation with what you call conceptual work;
  • this kind of work doesn't replace other kinds of AI Alignment research completely.

Also, I think a good example of a running research project doing something similar is Tournesol. I have a post explaining what it is, but the idea boils down to building a database of expert feedback on YouTube videos along multiple axes, and leveraging it to train a more aligned recommendation algorithm for YouTube. One difference is that their approach probably does make the model more competent (it's not already using a trained model like GPT-3); yet the similarities are numerous enough that you might find it interesting.


In general, it seems to me like building and iterating on prototypes is a huge part of how R&D progress is made in engineering fields, and it would be exciting if AI alignment could move in that direction.

I agree with the general idea that getting more experimental work will be immensely valuable, but I'm worried about the comparison with traditional engineering. AI Alignment cannot just follow engineering paradigms and wisdom and prototype stuff willy-nilly, because every experiment could explode in our face. It seems closer to nuclear engineering, which AFAIK required preliminary work on and understanding of nuclear physics.

To summarize, I’m for finding constrained and safe ways to gather more experimental understanding, but pushing for more experiments without heeding the risks seems like one of the worst things we could do.

Holistically, this seems like a much safer situation to be in than one where the world has essentially procrastinated on figuring out how to align systems to fuzzy goals, doing only the minimum necessary to produce commercial products.

Is that the correct counterfactual, though? You seem to compare your proposed approach with a situation where no AI Alignment research is done at all. That hardly seems fair, or representative of a plausible counterfactual.

Aligning narrowly superhuman models today could help build up tools, infrastructure, best practices, and tricks of the trade. I expect most of this will eventually be developed anyway, but speeding it up and improving its quality could still be quite valuable, especially in short timelines worlds  where there's a lot less time for things to take their natural course.

Well, it depends on whether it’s easier to get from the conceptual details to the implementation details, or the other way around. My guess is the former, which means that working on implementation details before knowing what we want to implement is at best a really unproductive use of research time (even more so with short timelines), at worst a waste of it. I'm curious whether you have arguments for the opposite take.

Note that I’m specifically responding to this specific argument. I still think that experimental work can be tremendously valuable for solving the conceptual issues.

All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition.

Agreed. That being said, pushing in this direction might also put us in a worse situation, for example by putting a lot of pressure on AIs to build human models, which would make deception/manipulation significantly more accessible and worthwhile. I don’t really know how to think about this risk, but I would certainly want follow-up discussions on it.

More broadly, “doing empirical science on the alignment problem” -- i.e. systematically studying what the main problem(s) are, how hard they are, what approaches are viable and how they scale, etc -- could help us discover a number of different avenues for reducing long-run AI x-risk that we aren’t currently thinking of, one-shot technical solutions or otherwise.

Yes, yes and yes. Subject to preliminary thinking about the risks involved in such experimental research, that’s definitely a reason to push more for this kind of work.

Compared to conceptual research, I’d guess aligning narrowly superhuman models will feel meatier and more tractable to a number of people. It also seems like it would be easier for funders and peers to evaluate whether particular papers constitute progress, which would probably help create a healthier and more focused field where people are broadly more on the same page and junior researchers can get stronger mentorship. Related to both of these, I think it provides an easier opportunity for people who care about long-run x-risk to produce results that are persuasive and impressive to the broader ML community, as I mentioned above.

You present this as a positive, but I see a pretty big issue here instead. Because of everything you point out, most incentives will push towards doing only this kind of research. You’ll have more prestige, a better chance at a job, recognition by a bigger community. All of which is definitely good from a personal standpoint. Which means both that most newcomers will go into the experimental type of work, and that such experiments will bear less and less relation to actually aligning AI (and more and more to the specific kinds of problems for which we find experimental solutions without the weird conceptual work).

In particular, you say that the field will be healthier because “people are more broadly on the same page”. That, for me, falls into the trap of believing that a paradigm is necessarily the right way to structure a field of research trying to solve a problem. As I argue here, a paradigm in this case basically means that you think you have circumscribed the problem well enough to stop questioning it and work single-mindedly on it. We’re amazingly far from that point in AI Alignment, so that looks really dangerous, especially because shorter timelines won’t allow more than one or two such paradigms to unfold.

When it’s possible to demonstrate an issue at scale, I think that’s usually a pretty clear win.

Agreed, with the caveat I’ve been repeating about the check for risks due to the scale.

I think we have a shot at eventually supplying a lot of people to work on it too. In the long run, I think more EAs could be in a position to contribute to this type of work than to either conceptual research or mainstream ML safety.

This looks about right. Although I wonder whether it would be dangerous to have a lot of people working on the topic who don't understand the conceptual risks and/or the underlying ML technology. So I’m wondering whether having people without the conceptual or ML skills work on this kind of project is safe.

Comment by adamShimi on The case for aligning narrowly superhuman models · 2021-03-07T21:53:09.708Z · LW · GW

Well, Paul's original post presents HCH as the specification of a human's enlightened judgment.

For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.

And if we follow the links to Paul's earlier post about this concept, he does describe his ideal implementation of considered judgment (what would become HCH) using the intuition of thinking for a decent amount of time.

To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.

So it looks to me like "HCH captures the judgment of the human after thinking for a long time" is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question to which I don't know the answer.
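The recursive structure in the quoted definition can be sketched in a few lines of Python. This is purely an illustration of the recursion, with `toy_human` a made-up stand-in for the human deliberator, not anything from Paul's post:

```python
from typing import Callable

def considered_judgment(
    human_answer: Callable[[str, Callable[[str], str]], str],
    question: str,
) -> str:
    """The "oracle" in the quoted definition is just a recursive call:
    my considered judgment about Q' is the same process run on Q'."""
    def oracle(sub_question: str) -> str:
        return considered_judgment(human_answer, sub_question)
    return human_answer(question, oracle)

# Made-up stand-in for the human: decomposes sums via the oracle,
# answers plain numbers directly.
def toy_human(question: str, ask: Callable[[str], str]) -> str:
    if "+" in question:
        left, right = question.split("+", 1)
        return str(int(ask(left)) + int(ask(right)))
    return str(int(question))

print(considered_judgment(toy_human, "1+2+3"))  # → 6
```

The point of the sketch is that "access to the oracle" doesn't need a separate mechanism: the oracle is the process itself, applied to sub-questions.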

A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision the human would after thinking for a long time), whereas for HCH we need to argue for them.

Comment by adamShimi on Full-time AGI Safety! · 2021-03-03T12:17:12.303Z · LW · GW

Welcome to the (for now) small family of people funded by Beth! Your research looks pretty cool, and I'm quite excited to see how different it is from mine. So Beth is funding quite a wide range of researchers, which makes the most sense to me. :)

Comment by adamShimi on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-01T18:00:26.182Z · LW · GW

Thanks for telling me! I've changed that.

It might be because I copied and pasted the first sentence to each subsection.

Comment by adamShimi on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-01T14:48:19.727Z · LW · GW

Thanks for taking the time to give feedback!

Technical comment on the above post

So if I understand this correctly, then  is a metric of goal-directedness. However, I am somewhat puzzled, because  only measures directedness to the single goal .

But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible values of ?

That's not what I had in mind, but it's probably on me for not explaining it clearly enough.

  • First, for a fixed goal , the whole focus matters. That is, we also care about  and . I plan on writing a post defending why we need all of them, but basically there are situations where using only one of them would make us order things weirdly.
  • You're right that we need to consider all goals. That's why the goal-directedness of the system  is defined as a function that sends each goal (satisfying the nice conditions) to a focus, the vector of three numbers. So the goal-directedness of  contains the focus for every goal, and the focus captures the coherence of  with the goal.
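To make this structure concrete, here is a minimal Python sketch under loud assumptions: the goal names and focus values are toy inventions, and the three component names are only suggestive labels (the discussion here mentions generalization and explainability; "efficiency" is my placeholder for the third), not the post's actual definitions.

```python
from typing import Callable, Dict, List, Tuple

# A focus is a vector of three numbers measuring how coherent the
# system's behavior is with one fixed goal. Component names below are
# suggestive placeholders, not the post's definitions.
Focus = Tuple[float, float, float]  # (generalization, explainability, efficiency)

def goal_directedness(
    focus_of: Callable[[str], Focus], goals: List[str]
) -> Dict[str, Focus]:
    """Goal-directedness of a system: the function sending each
    (well-behaved) goal to the system's focus for that goal."""
    return {goal: focus_of(goal) for goal in goals}

# Toy example with two invented goals and hand-picked focus values.
toy_focus: Dict[str, Focus] = {
    "reach_exit": (0.9, 0.8, 0.7),
    "collect_coins": (0.2, 0.3, 0.1),
}
gd = goal_directedness(toy_focus.__getitem__, ["reach_exit", "collect_coins"])
# A goal whose focus dominates the others suggests the system is most
# naturally described as directed toward that goal.
dominant = max(gd, key=lambda goal: gd[goal][0])  # → "reach_exit"
```

The sketch is only meant to show the shape of the object: not a single number, but a whole function from goals to focus vectors, which can then be inspected from different angles.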

Rohin then speculates that if we remove the 'goal' from the above argument, we can make the AI safer. He then comes up with a metric of 'goal-directedness' where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin's terminology, an agent gets safer if it is less goal-directed.

This doesn't feel like a good summary of what Rohin says in his sequence.

  • He says that many scenarios used to argue for AI risk implicitly involve systems following goals, and thus that building AIs without goals might make these scenarios go away. But he doesn't say that new problems can't emerge.
  • He doesn't propose a metric of goal-directedness. He just argues that every system maximizes some utility function, and so this isn't the way to differentiate goal-directed from non-goal-directed systems. The point of this argument is also to say that reasons to believe that AGIs must maximize expected utility are not enough to say that such AGIs must necessarily be goal-directed.

Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics, you are looking at observable behavior, not at agent internals.

Where things completely move off the main sequence is in Rohin's next step in developing his intuitive notion of goal-directedness:

"This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal."

So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin's intuition. But I can't really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.

This all starts to look like a linguistic form of Goodharting: the meaning of the term 'goal-directed' collapses completely because too much pressure is placed on it for control purposes.

My previous answer mostly addresses this issue, but let's spell it out: Rohin doesn't say that non-goal-directed systems are automatically safe. What he defends is that:

  1. Non-goal-directed (or low-goal-directed) systems wouldn't be unsafe in many of the ways we study, because these depend on having a goal (convergent instrumental subgoals for example)
  2. Non-goal-directed competent agents are not a mathematical impossibility, even if every competent agent must maximize expected utility.
  3. Since removing goal-directedness apparently gets rid of many big problems with aligning AI, and we don't have an argument for why making a competent non-goal-directed system is impossible, we should try to look into non-goal-directed approaches.

Basically, the intuition of "less goal-directed means safer" makes sense when safer means "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shutdown", not when it means "less probability that the AI takes an unexpected and counterproductive action".

Another way to put it is that Rohin argues that removing goal-directedness (if possible) seems to remove many of the specific issues we worry about in AI Alignment -- and leaves mostly the near-term "my automated car is running over people because it thinks they are parts of the road" kind of problems.

To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.

  • Is your idea that a lower number on a metric implies more safety? This seems to be Rohin's original idea.
  • Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of 'will become adversarial and work against us' at all? If so I am not seeing the correlation.

That's a very good and fair question. My reason for not using a single metric is that I think the whole structure of focuses for many goals can tell us many important things (for safety) when looked at from different perspectives. That's definitely something I'm working on, and I think I have nice links to explainability (with others probably coming). But to take an example from the post, it seems that a system with one goal generalizing far more than any other is more at risk of the kinds of safety problems Rohin relates to goal-directedness.

Comment by adamShimi on adamShimi's Shortform · 2021-02-28T20:07:56.606Z · LW · GW

Thanks for the idea! I agree that it probably helps, and it solves my issue with the other person's state of knowledge.

That being said, I don't feel like this solves my main problem: it still feels to me like pushing too hard. Here the reason is that I post on a small venue (rarely more than a few posts per day) that I know the people I'm asking for feedback read regularly. So if I send them such a message the moment I publish, it feels a bit like I'm saying that they wouldn't read and comment on it otherwise, which is a bit of a problem.

(I'm interested to know whether researchers on the AF agree with that feeling, or whether it's just a weird thing that only exists in my head. When I try to imagine being at the other end of such a message, I see myself as annoyed, at the very least.)

Comment by adamShimi on adamShimi's Shortform · 2021-02-28T14:56:47.411Z · LW · GW

curious for more detail on “what feels wrong about explicitly asking individuals for feedback after posting on AF” similar to how you might ask for feedback on a gDoc?

My main reason is steve's first point:

  1. Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.

Asking someone for feedback on work posted somewhere I know they read feels like whining about not having feedback (and maybe whining about them not giving me feedback). On the other hand, sending a link to a gdoc feels like "I thought this could interest you", which seems better to me.

There's also the issue that when the work is public, you don't know whether someone has read it and not found it interesting enough to comment on, hasn't read it yet but plans to later, or has read it and plans to comment later. Depending on which case they are in, me asking for feedback can trigger even more problems (like them being annoyed because they feel I didn't leave them the time to do it by themselves). Whereas when I share a doc, there's only one possible state of knowledge for the other person (not having read the doc and not knowing it exists).

Concerning steve's second point:

2. Maybe it feels like "I want feedback for my own personal benefit" when it's already posted, as opposed to "I want feedback to improve this document which I will share with the community" when it's not yet posted. So it feels more selfish, instead of part of a community project. For that problem, maybe you'd want to frame it as "I'm planning to rewrite this post / write a follow-up to this post / give a talk based on this post / etc., can you please offer feedback on this post to help me with that?" (Assuming that's in fact the case, of course, but most posts have follow-up posts...)

I don't feel that personally. I basically take the stance of trying to do things I feel are important for the community, so if I publish something, I don't feel like feedback is for my own benefit. Indeed, I would gladly have only constructive negative feedback on my posts instead of no feedback at all; that is pretty bad personally (in terms of ego, for example) but great for the community, because it puts my ideas to the test and forces me to improve them.

Now I want to go back to Raemon.

I think there are a number of features LW could build to improve this situation

Agreed. My diagnosis of the situation is that to ensure consistent feedback, it probably needs to be at least slightly an obligation. The two examples of processes producing valuable feedback that I have in mind are gdoc comments and peer review for conferences/journals. In both cases, the reviewer has an obligation to do the review (a social obligation for the gdoc, because it was shared explicitly with you, and a community obligation for the peer review, because that's part of your job and the conference/journal editor asked you to review the paper). Without this element of obligation, it's far too easy to not give feedback, even when you might have something valuable to say!

Note that I'm part of the problem: this week, I spent a good couple of hours commenting in detail on a 25-page technical gdoc for a fellow researcher who asked me, but I haven't published decent feedback on the AF for quite some time. And when I look at my own internal process, this sense of commitment and obligation is a big reason why. (I ended up liking the work, but even that wouldn't have ensured that I commented on it to the extent that I did.)

This makes me think that a "simple" solution could be a review process on the AF. Now, I've been talking about a proper review process with Habryka, among others; getting a clearer idea of how we should judge research for such a review is a big incentive for the trial run of a review that I'm pushing for (and I'm currently rewriting a post about a framing of AI Alignment research that I hope will help a lot with that).

Yet after thinking about it further yesterday and today, it might be possible to split the establishment of such a review process for the AF into two steps.

  • Step 1: Everyone with a post on the AF can ask for feedback. This is not considered peer review, just the sort of thing that a fellow researcher would say if you shared the post with them as a gdoc. On the other hand, a group of people (working researchers, let's say) volunteer to give such feedback at a given frequency (once a week, for example).
    After that, we probably only need to find a decent enough way to order requests for feedback (prioritizing posts with no feedback, prioritizing people without the network to ask personally for feedback...), and it could be up and running.
  • Step 2: Establish a proper peer-review system, where you can ask for peer review on a post, and if the review is good enough, it gets a "peer-reviewed" tag that is managed by admins only. Doing this correctly will probably require standards for such a review, a stronger commitment from reviewers (and so finding more incentives for them to participate), and additional infrastructure (code, managing the review process, maybe sending a newsletter?).

In my mind, step 1 is about getting feedback on your work, and step 2 is about getting prestige. I believe that both are important, but I'm more starved for feedback. I also think that step 1 could be set up really fast, and even if it fails, there's no big downside for the AF (whereas fucking up step 2 seems more fraught with bad consequences).


Also, another point on the difference in ease of giving feedback on gdocs vs posts: implicitly, almost all shared gdocs come with a "come at me bro" request. But when I read a post, it's not always clear whether the poster wants me to come at them or not. You also tend to know the people who share gdocs with you a bit better than posters on the AF. So being able to signal "I really want you to come at me" might help, although I doubt it's the complete solution.

Comment by adamShimi on adamShimi's Shortform · 2021-02-27T10:54:53.016Z · LW · GW

Right now, the incentives for getting useful feedback on my research push me toward the opposite of the policy I would like: publishing on the AF as late as I can.

Ideally, I would want to use the AF as my main source of feedback, as it's public, is read by more researchers than I know personally, and publishing there helps the field grow.

But I'm forced to admit that publishing anything on the AF means I can't really send it to people anymore (because the ones I would ask for feedback read the AF, so that feels socially wrong), and yet I don't get any valuable feedback 99% of the time. More precisely, I don't get any feedback at all 99% of the time. Whereas when I ask for feedback directly on a gdoc, I always end up with some useful remarks.

I also feel bad that I'm basically using a privileged policy, in the sense that a newcomer cannot use it.

Nonetheless, because I believe in the importance of my research, and I want to know whether I'm doing stupid things or not, I'll keep to this policy for the moment: never post anything on the AF for which I haven't already gotten all the useful feedback I could ask for.

Comment by adamShimi on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2021-02-26T19:07:29.881Z · LW · GW

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the "fusion power generator" example: maybe the AGI has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion, whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines.

In general, I don't see why we should expect the abstraction most relevant to the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same words (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations).

(That makes me think it might be interesting to see how Kuhn's arguments about the incommensurability of paradigms hold up in the context of this problem, as it seems similar.)