Posts

AE Studio @ SXSW: We need more AI consciousness research (and further resources) 2024-03-26T20:59:09.129Z
Survey for alignment researchers! 2024-02-02T20:41:44.323Z
The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda 2023-12-18T20:35:01.569Z
Computational signatures of psychopathy 2022-12-19T17:01:49.254Z
AI researchers announce NeuroAI agenda 2022-10-24T00:14:46.574Z
Alignment via prosocial brain algorithms 2022-09-12T13:48:05.839Z
Paradigm-building: Conclusion and practical takeaways 2022-02-15T16:11:08.985Z
Question 5: The timeline hyperparameter 2022-02-14T16:38:17.006Z
Question 4: Implementing the control proposals 2022-02-13T17:12:41.702Z
Question 3: Control proposals for minimizing bad outcomes 2022-02-12T19:13:48.075Z
Question 2: Predicted bad outcomes of AGI learning architecture 2022-02-11T22:23:49.937Z
Question 1: Predicted architecture of AGI learning algorithm(s) 2022-02-10T17:22:24.087Z
Paradigm-building: The hierarchical question framework 2022-02-09T16:47:57.119Z
Paradigm-building from first principles: Effective altruism, AGI, and alignment 2022-02-08T16:12:26.423Z
Paradigm-building: Introduction 2022-02-08T00:06:28.771Z
Theoretical Neuroscience For Alignment Theory 2021-12-07T21:50:10.142Z
The Dark Side of Cognition Hypothesis 2021-10-03T20:10:57.204Z

Comments

Comment by Cameron Berg (cameron-berg) on AE Studio @ SXSW: We need more AI consciousness research (and further resources) · 2024-03-27T00:13:13.698Z · LW · GW

Thanks for the comment!

Consciousness does not have a commonly agreed upon definition. The question of whether an AI is conscious cannot be answered until you choose a precise definition of consciousness, at which point the question falls out of the realm of philosophy into standard science.

Agree. Also happen to think that there are basic conflations/confusions that tend to go on in these conversations (eg, self-consciousness vs. consciousness) that make the task of defining what we mean by consciousness more arduous and confusing than it likely needs to be (which isn't to say that defining consciousness is easy). I would analogize consciousness to intelligence in terms of its difficulty to nail down precisely, but I don't think there is anything philosophically special about consciousness that inherently eludes modeling. 

is there some secret sauce that makes the algorithm [that underpins consciousness] special and different from all currently known algorithms, such that if we understood it we would suddenly feel enlightened? I doubt it. I expect we will just find a big pile of heuristics and optimization procedures that are fundamentally familiar to computer science.

Largely agree with this too—it very well may be the case (as seems now to be obviously true of intelligence) that there is no one 'master' algorithm that underlies the whole phenomenon, but rather, as you say, a big pile of smaller procedures, heuristics, etc. So be it—we definitely want to better understand (for reasons explained in the post) what set of potentially-individually-unimpressive algorithms, when run in concert, gives you a system that is conscious.

So, to your point, there is not necessarily any one 'deep secret' to uncover that will crack the mystery (though we think, eg, Graziano's AST might be a strong candidate solution for at least part of this mystery), but I would still think that (1) it is worthwhile to attempt to model the functional role of consciousness, and that (2) whether we actually have better or worse models of consciousness matters tremendously. 

Comment by Cameron Berg (cameron-berg) on Survey for alignment researchers! · 2024-02-23T18:10:21.871Z · LW · GW

There will be places on the form to indicate exactly this sort of information :) we'd encourage anyone who is associated with alignment to take the survey.

Comment by Cameron Berg (cameron-berg) on Survey for alignment researchers! · 2024-02-09T14:08:25.208Z · LW · GW

Thanks for taking the survey! When we estimated how long it would take, we didn't count the optional open-ended questions, because we figured that anyone sufficiently time-constrained to actually care about the time estimate would not spend the additional time writing in responses.

In general, the survey does seem to take respondents approximately 10-20 minutes to complete. As noted in another comment below,

this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever the value is of comparing one's individual results to those of the community), which hopefully is worth the time investment :)
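For concreteness, the quoted rate is just the per-survey donation divided by completion time; here is a minimal sketch of that arithmetic (the ~$40-per-completed-survey figure is inferred from the quoted range for illustration, not stated explicitly in this thread):

```python
# The $/researcher-hour range above follows from a fixed per-survey donation
# divided by completion time. Working backward from the quoted $120-240 range
# and the 10-20 minute completion times implies roughly $40 per completed
# survey (an inference for illustration, not a figure stated in this thread).
donation_per_survey = 40.0           # USD, inferred assumption
completion_minutes = (10.0, 20.0)    # observed completion times

rates = [donation_per_survey * 60.0 / minutes for minutes in completion_minutes]
print(rates)  # -> [240.0, 120.0] USD donated per researcher-hour
```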

Comment by Cameron Berg (cameron-berg) on Survey for alignment researchers! · 2024-02-08T15:25:50.243Z · LW · GW

Ideally within the next month or so. There are a few other control populations still left to sample, as well as actually doing all of the analysis.

Comment by Cameron Berg (cameron-berg) on Survey for alignment researchers! · 2024-02-07T14:10:08.062Z · LW · GW

Thanks for sharing this! Will definitely take a look at this in the context of what we find and see if we are capturing any similar sentiment.

Comment by Cameron Berg (cameron-berg) on The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda · 2023-12-19T18:10:50.576Z · LW · GW

Thanks for calling this out—we're definitely open to discussing potential opportunities for collaboration/engaging with the platform!

Comment by Cameron Berg (cameron-berg) on The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda · 2023-12-19T18:07:35.921Z · LW · GW

It's a great point that the broader social and economic implications of BCI extend beyond the control of any single company, AE no doubt included. Still, while bandwidth and noisiness of the tech are potentially orthogonal to one's intentions, companies with unambiguous humanity-forward missions (like AE) are far more likely to actually care about the societal implications, and therefore, to build BCI that attempts to address these concerns at the ground level.

In general, we expect the by-default path to powerful BCI (i.e., one where we are completely uninvolved) to be negative/rife with s-risks/significant invasions of privacy and autonomy, etc., which is why we are actively working to nudge the developmental trajectory of BCI in a more positive direction—i.e., one where the only major incentive is to build the most human-flourishing-conducive BCI tech we possibly can.

Comment by Cameron Berg (cameron-berg) on The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda · 2023-12-19T16:31:46.114Z · LW · GW

With respect to the RLNF idea, we are definitely very sympathetic to wireheading concerns. We think that approach is promising if we are able to obtain better reward signals from all of the sub-symbolic information that neural signals can offer in order to better understand human intent, but, as you correctly pointed out, that same information could be used to better trick the human evaluator as well. We think this already happens to a lesser extent with current methods, and we expect that both current and future methods will have to account for this particular risk.

More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by the giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying about whether we profit from this work.

We do call some of these concerns out in the post, eg:

We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we don’t even know, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions.)

Overall—in spite of the double-edged nature of alignment work potentially facilitating capabilities breakthroughs—we think it is critical to avoid base rate neglect in acknowledging how unbelievably aggressively people (who are generally alignment-ambivalent) are now pushing forward capabilities work. Against this base rate, we suspect our contributions to inadvertently pushing forward capabilities will be relatively negligible. This does not imply that we shouldn't be extremely cautious, have rigorous info/exfohazard standards, think carefully about unintended consequences, etc—it just means that we want to be pragmatic about the fact that we can help solve alignment while being reasonably confident that the overall expected value of this work will outweigh the overall expected harm (again, especially given the incredibly high, already-happening background rate of alignment-ambivalent capabilities progress).

Comment by Cameron Berg (cameron-berg) on The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda · 2023-12-19T16:11:23.924Z · LW · GW

Thanks for your comment! I think we can simultaneously (1) strongly agree with the premise that in order for AGI to go well (or at the very least, not catastrophically poorly), society needs to adopt a multidisciplinary, multipolar approach that takes into account broader civilizational risks and pitfalls, and (2) have fairly high confidence that within the space of all possible useful things to do within this broader scope, the list of neglected approaches we present above does a reasonable job of documenting some of the places where we specifically think AE has comparative advantage/the potential to strongly contribute over relatively short time horizons. So, to directly answer:

Is this a deliberate choice of narrowing your direct, object-level technical work to alignment (because you think this where the predispositions of your team are?), or a disagreement with more systemic views on "what we should work on to reduce the AI risks?"

It is something far more like a deliberate choice than a systemic disagreement. We are also very interested in and open to broader models of how control theory, game theory, information security, etc. have consequences for alignment (e.g., see ideas 6 and 10 for examples of nontechnical things we think we could likely help with). To the degree that these sorts of things can be thought of as further neglected approaches, we may indeed agree that they are worthwhile for us to consider pursuing, or at least to help facilitate others' pursuits—with the comparative advantage caveat stated previously.

Comment by Cameron Berg (cameron-berg) on Consider working more hours and taking more stimulants · 2022-12-15T23:09:00.504Z · LW · GW

I'm definitely sympathetic to the general argument here as I understand it: something like, it is better to be more productive when what you're working towards has high EV, and stimulants are one underutilized strategy for being more productive. But I have concerns about the generality of your conclusion: (1) blanket-endorsing or otherwise equating the advantages and disadvantages of all of the things on the y-axis of that plot is painting with too broad a brush. They vary, eg, in addictive potential, demonstrated medical benefit, cost of maintenance, etc. (2) Relatedly, some of these drugs (e.g., Adderall) alter the dopaminergic calibration in the brain, which can lead to significant personality/epistemology changes, typically as a result of modulating people's risk-taking/reward-seeking trade-offs. Similar dopamine agonist drugs used to treat Parkinson's led to pathological gambling behaviors in patients who took them. There is an argument to be made for at least some subset of these substances that the trouble induced by these kinds of personality changes may plausibly outweigh the productivity gains of taking the drugs in the first place.

Comment by Cameron Berg (cameron-berg) on AI researchers announce NeuroAI agenda · 2022-10-24T22:17:02.396Z · LW · GW

27 people holding the view is not a counterexample to the claim that it is becoming less popular.

Still feels worthwhile to emphasize that some of these 27 people are, eg, Chief AI Scientist at Meta, co-director of CIFAR, DeepMind staff researchers, etc. 

These people are major decision-makers in some of the world's leading and most well-resourced AI labs, so we should probably pay attention to where they think AI research should go in the short-term—they are among the people who could actually take it there.

 

See also this survey of NLP

I assume this is the chart you're referring to. I take your point that you see these numbers as increasing or decreasing (though where they actually stand in an absolute sense seems harmonious with believing that brain-based AGI is entirely possible), but these increases or decreases are themselves risky statistics to extrapolate. These sorts of trends could easily asymptote or reverse given volatile field dynamics. For instance, if we linearly extrapolate from the two stats you provided (5% believed scaling could solve everything in 2018; 17% believed it in 2022), this would predict that, eg, 56% of NLP researchers in 2035 would believe scaling could solve everything. Do you actually think something in this ballpark is likely?
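To make that extrapolation explicit, here is a minimal sketch of the arithmetic behind the ~56% figure (assuming, purely for illustration, that the 2018-2022 trend continues linearly):

```python
# Linearly extrapolating the two NLP-survey data points quoted above:
# 5% in 2018 and 17% in 2022 believed scaling could solve everything.
years = (2018, 2022)
shares = (5.0, 17.0)  # percent of respondents

slope = (shares[1] - shares[0]) / (years[1] - years[0])  # 3 percentage points/year

def extrapolate(year):
    """Naively extend the 2018-2022 linear trend to a later year."""
    return shares[1] + slope * (year - years[1])

print(extrapolate(2035))  # -> 56.0, the ballpark figure mentioned above
```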

 

Did the paper say that NeuroAI is looking increasingly likely?

I was considering the paper itself as evidence that NeuroAI is looking increasingly likely. 

When people who run many of the world's leading AI labs say they want to devote resources to building NeuroAI in the hopes of getting AGI, I am considering that as a pretty good reason to believe that brain-like AGI is more probable than I thought it was before reading the paper. Do you think this is a mistake?

Certainly, to your point, signaling an intention to try X is not the same as successfully doing X, especially in the world of AI research. But again, if anyone were to be able to push AI research in the direction of being brain-based, would it not be these sorts of labs? 

To be clear, I do not personally think that prosaic AGI and brain-based AGI are necessarily mutually exclusive—eg, brains may be performing computations that we ultimately realize are some emergent product of prosaic AI methods that already basically exist. I do think that the publication of this paper gives us good reason to believe that brain-like AGI is more probable than we might have thought it was, eg, two weeks ago.

Comment by Cameron Berg (cameron-berg) on AI researchers announce NeuroAI agenda · 2022-10-24T21:30:28.505Z · LW · GW

However, technological development is not a zero-sum game. Opportunities or enthusiasm in neuroscience doesn't in itself make prosaic AGI less likely and I don't feel like any of the provided arguments are knockdown arguments against ANN's leading to prosaic AGI.

Completely agreed! 

I believe there are two distinct arguments at play in the paper and that they are not mutually exclusive. I think the first is "in contrast to the optimism of those outside the field, many front-line AI researchers believe that major new breakthroughs are needed before we can build artificial systems capable of doing all that a human, or even a much simpler animal like a mouse, can do" and the second is "a better understanding of neural computation will reveal basic ingredients of intelligence and catalyze the next revolution in AI, eventually leading to artificial agents with capabilities that match and perhaps even surpass those of humans." 

The first argument can be read as a reason to negatively update on prosaic AGI (unless you see these 'major new breakthroughs' as also being prosaic) and the second argument can be read as a reason to positively update on brain-like AGI. To be clear, I agree that the second argument is not a good reason to negatively update on prosaic AGI.

Comment by Cameron Berg (cameron-berg) on AI researchers announce NeuroAI agenda · 2022-10-24T14:57:39.888Z · LW · GW

Thanks for your comment! 

As far as I can tell the distribution of views in the field of AI is shifting fairly rapidly towards "extrapolation from current systems" (from a low baseline).

I suppose part of the purpose of this post is to point to numerous researchers who serve as counterexamples to this claim—i.e., Yann LeCun, Terry Sejnowski, Yoshua Bengio, Timothy Lillicrap et al seem to disagree with the perspective you're articulating in this comment insofar as they actually endorse the perspective of the paper they've coauthored.

You are obviously a highly credible source on trends in AI research—but so are they, no? 

And if they are explicitly arguing that NeuroAI is the route they think the field should go in order to get AGI, it seems to me unwise to ignore or otherwise dismiss this shift.

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-14T18:19:01.597Z · LW · GW

Agreed that there are important subtleties here. In this post, I am really just using the safety-via-debate set-up as a sort of intuitive case for getting us thinking about why we generally seem to trust certain algorithms running in the human brain to adjudicate hard evaluative tasks related to AI safety. I don't mean to be making any especially specific claims about safety-via-debate as a strategy (in part for precisely the reasons you specify in this comment).

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-14T18:14:00.604Z · LW · GW

Thanks for the comment! I do think that, at present, the only working example we have of an agent able to explicitly self-inspect its own values is in the human case, even if getting the base shards 'right' in the prosocial sense would likely entail that they will already be doing self-reflection. Am I misunderstanding your point here?

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-14T18:00:46.787Z · LW · GW

Thanks Lukas! I just gave your linked comment a read and I broadly agree with what you've written both there and here, especially w.r.t. focusing on the necessary training/evolutionary conditions out of which we might expect to see generally intelligent prosocial agents (like most humans) emerge. This seems like a wonderful topic to explore further IMO. Any other sources you recommend for doing so?

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-14T17:53:40.221Z · LW · GW

Hi Joe—likewise! This relationship between prosociality and distribution of power in social groups is super interesting to me and not something I've given a lot of thought to yet. My understanding of this critique is that it would predict something like: in a world where there are huge power imbalances, typical prosocial behavior would look less stable/adaptive. This brings to mind for me things like 'generous tit for tat' solutions to prisoner's dilemma scenarios—i.e., where being prosocial/trusting is a bad idea when you're in situations where the social conditions are unforgiving to 'suckers.' I guess I'm not really sure what exactly you have in mind w.r.t. power specifically—maybe you could elaborate on (if I've got the 'prediction' right in the bit above) why one would think that typical prosocial behavior would look less stable/adaptive in a world with huge power imbalances?

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-14T17:39:10.260Z · LW · GW

I broadly agree with Viliam's comment above. Regarding Dagon's comment (to which yours is a reply), I think that characterizing my position here as 'people who aren't neurotypical shouldn't be trusted' is basically strawmanning, as I explained in this comment. I explicitly don't think this is correct, nor do I think I imply it is anywhere in this post.  

As for your comment, I definitely agree that there is a distinction to be made between prosocial instincts and the learned behavior that these instincts give rise to over the lifespan, but I would think that the sort of 'integrity' that you point at here as well as the self-aware psychopath counterexample are both still drawing on particular classes of prosocial motivations that could be captured algorithmically. See my response to 'plausible critique #1,' where I also discuss self-awareness as an important criterion for prosociality.  

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-12T22:25:11.759Z · LW · GW

Interesting! Definitely agree that if people's specific social histories are largely what qualify them to be 'in the loop,' this would be hard to replicate for the reasons you bring up. However, consider that, for example,

Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe they are having trouble doing so alone...

which almost certainly has nothing to do with their social history. I think there's a solid argument to be made, then, that a lot of these social histories are essentially a lifelong finetuning of core prosocial algorithms that have in some sense been there all along.  And I am mainly excited about enumerating these. (Note also that figuring out these algorithms and running them in an RL training procedure might get us the relevant social histories training that you reference—but we'd need the core algorithms first.)

"human in the loop" to some extent translates to "we don't actually know why we trust (some) other humans, but there exist humans we trust, so let's delegate the hard part to them".

I totally agree with this statement taken by itself, and my central point is that we should actually attempt to figure out 'why we trust (some) other humans' rather than treating this as a kind of black box. However, if this statement is being put forward as an argument against doing so, then it seems circular to me.

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-12T22:09:24.976Z · LW · GW

Agreed that the correlation between the modeling result and the self-report is impressive, with the caveat that the sample size is small enough not to take the specific r-value too seriously. In a quick search, I couldn't find a replication of the same task with a larger sample, but I did find a meta-analysis that includes this task which may be interesting to you! I'll let you know if I find something better as I continue to read through the literature :)

Comment by Cameron Berg (cameron-berg) on Alignment via prosocial brain algorithms · 2022-09-12T21:58:32.655Z · LW · GW

Definitely agree with the thrust of your comment, though I should note that I neither believe nor think I really imply anywhere that 'only neurotypical people are worth societal trust.' I only use the word in this post to gesture at the fact that the vast majority of (but not all) humans share a common set of prosocial instincts—and that these instincts are a product of stuff going on in their brains. In fact, my next post will almost certainly be about one such neuroatypical group: psychopaths!

Comment by Cameron Berg (cameron-berg) on Look For Principles Which Will Carry Over To The Next Paradigm · 2022-02-19T20:33:20.082Z · LW · GW

I liked this post a lot, and I think its title claim is true and important. 

One thing I wanted to understand a bit better is how you're invoking 'paradigms' in this post wrt AI research vs. alignment research. I think we can be certain that AI research and alignment research are not identical programs but that they will conceptually overlap and constrain each other. So when you're talking about 'principles that carry over,' are you talking about principles in alignment research that will remain useful across various breakthroughs in AI research, or are you thinking about principles within one of these two research programs that will remain useful across various breakthroughs within that research program? 

Another thing I wanted to understand better was the following:

This leaves a question: how do we know when it’s time to make the jump to the next paradigm? As a rough model, we’re trying to figure out the constraints which govern the world.  

Unlike many of the natural sciences (physics, chemistry, biology, etc.) whose explicit goals ostensibly are, as you've said, 'to figure out the constraints which govern the world,' I think that one thing that makes alignment research unique is that its explicit goal is not simply to gain knowledge about reality, but also to prevent a particular future outcome from occurring—namely, AGI-induced X-risks. Surely a necessary component for achieving this goal is 'to figure out the [relevant] constraints which govern the world,' but it seems pretty important to note (if we agree on this field-level goal) that this can't be the only thing that goes into a paradigm for alignment research. That is, alignment research can't only be about modeling reality; it must also include some sort of plan for how to bring about a particular sort of future. And I agree entirely that the best plans of this sort would be those that transcend content-level paradigm shifts. (I daresay that articulating this kind of plan is exactly the sort of thing I try to get at in my Paradigm-building for AGI safety sequence!) 

Comment by Cameron Berg (cameron-berg) on Question 4: Implementing the control proposals · 2022-02-17T16:04:05.490Z · LW · GW

Thanks for your comment! I agree with both of your hesitations and I think I will make the relevant changes to the post: instead of 'totally unenforceable,' I'll say 'seems quite challenging to enforce.' I believe that this is true (and I hope that the broad takeaway from this post is basically the opposite of 'researchers need to stay out of the policy game,' so I'm not too concerned that I'd be incentivizing the wrong behavior). 

To your point, 'logistically and politically inconceivable' is probably similarly overblown.  I will change it to 'highly logistically and politically fraught.' You're right that the general failure of these policies shouldn't be equated with their inconceivability. (I am fairly confident that, if we were so inclined, we could go download a free copy of any movie or song we could dream of—I wouldn't consider this a case study of policy success—only of policy conceivability!). 

Comment by Cameron Berg (cameron-berg) on Question 5: The timeline hyperparameter · 2022-02-15T16:09:59.310Z · LW · GW

Very interesting counterexample! I would suspect it gets increasingly sketchy to characterize 1/8th, 1/16th, etc. 'units of knowledge towards AI' as 'breakthroughs' in the way I define the term in the post. 

I take your point that we might get our wires crossed when a given field looks like it's accelerating, but when we zoom in to only look at that field's breakthroughs, we find that they are decelerating. It seems important to watch out for this. Thanks for your comment!

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-14T16:37:51.852Z · LW · GW

The question is not "How can John be so sure that zooming into something narrower would only add noise?", the question is "How can Cameron be so sure that zooming into something narrower would yield crucial information without which we have no realistic hope of solving the problem?".

I am not 'so sure'—as I said in the previous comment, I have only claim(ed) it is probably necessary to, for instance, know more about AGI than just whether it is a 'generic strong optimizer.' I would only be comfortable making non-probabilistic claims about the necessity of particular questions in hindsight.

I don't think I'm making some silly logical error. If your question is, "Why does Cameron think it is probably necessary to understand X if we want to have any realistic hope of solving the problem?", well, I do not think this is rhetorical! I spend an entire post defending and elucidating each of these questions, and I hope by the end of the sequence, readers would have a very clear understanding of why I think each is probably necessary to think about (or I have failed as a communicator!). 

It was never my goal to defend the (probable) necessity of each of the questions in this one post—this is the point of the whole sequence! This post is a glorified introductory paragraph. 

I do not think, therefore, that this post serves as anything close to an adequate defense of this framework, and I understand your skepticism if you think this is all I will say about why these questions are important. 

However, I don't think your original comment—or any of this thread, for that matter—really addresses any of the important claims put forward in this sequence (which makes sense, given that I haven't even published the whole thing yet!). It also seems like some of your skepticism is being fueled by assumptions about what you predict I will argue as opposed to what I will actually argue (correct me if I'm wrong!).

I hope you can find the time to actually read through the whole thing once it's published before passing your final judgment. Taken as a whole, I think the sequence speaks for itself. If you still think it's fundamentally bullshit after having read it, fair enough :)

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-13T19:34:23.507Z · LW · GW

Definitely agree that if we silo ourselves into any rigid plan now, it almost certainly won't work. However, I don't think 'end-to-end agenda' = 'rigid plan.' I certainly don't think this sequence advocates anything like a rigid plan. These are the most general questions I could imagine guiding the field, and I've already noted that I think this should be a dynamic draft. 

...we do not currently possess a strong enough understanding to create an end-to-end agenda which has any hope at all of working; anything which currently claims to be an end-to-end agenda is probably just ignoring the hard parts of the problem.

What hard parts of the problem do you think this sequence ignores?

(I explicitly claim throughout the sequence that what I propose is not sufficient, so I don't think I can be accused of ignoring this.)

Hate to just copy and paste, but I still really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, then we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). This is basically tautological as far as I can tell. Do you agree or disagree with this if-then statement? 

I do think that finding necessary subquestions, or noticing that a given subquestion may not be necessary, is much easier than figuring out an end-to-end agenda.   

Agreed. My goal was to enumerate these questions. When I noticed that they followed a fairly natural progression, I decided to frame them hierarchically.  And, I suppose to your point, it wasn't necessarily easy to write this all up. I thought it would nonetheless be valuable to do so, so I did!

Thanks for linking the Rocket Alignment Problem—looking forward to giving it a closer read. 

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-12T22:47:46.699Z · LW · GW

If it's possible that we could get to a point where AGI is no longer a serious threat without needing to answer the question, then the question is not necessary.

Agreed, this seems like a good definition for rendering anything as 'necessary.' 

Our goal: minimize AGI-induced existential threats (right?). 

My claim is that answering these questions is probably necessary for achieving this goal—i.e., P(achieving goal | failing to think about one or more of these questions) ≈ 0. (I say, "I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work.")

That is, we would be exceedingly lucky if we achieve AGI safety's goal without thinking about 

  • what we mean when we say AGI (Q1),
  • what existential risks are likely to emerge from AGI (Q2),
  • how to address these risks (Q3),
  • how to implement these mitigation strategies (Q4), and
  • how quickly we actually need to answer these questions (Q5).

I really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). I propose a way to do this hierarchically. Do you see wiggle room here where I do not? 

FWIW, I also don't really think this is the core claim of the sequence. I would want that to be something more like: 'here is a useful framework for moving from point A (where the field is now) to point B (where the field ultimately wants to end up).' I have not seen a highly compelling presentation of this sort of thing before, and I think it is very valuable in solving any hard problem to have a general end-to-end plan (which we probably will want to update as we go along; see Robert's comment).

I think most of the strategies in MIRI's general cluster do not depend on most of these questions.

Would you mind giving a specific example of an end-to-end AGI safety research agenda that you think does not depend on or attempt to address these questions? (I'm also happy to just continue this discussion off of LW, if you'd like.)

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-12T19:06:47.476Z · LW · GW

Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don't agree with most of what you've written—I'd be very curious to hear if you think I'm missing something:

[We] don't expect that the alignment problem itself is highly-architecture dependent; it's a fairly generic property of strong optimization. So, "generic strong optimization" looks like roughly the right level of generality at which to understand alignment...Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively "noise", for purposes of understanding alignment.

Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn't seem at all obvious to me. I write in Q2: "It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI."

By analogy, consider the more familiar 'alignment problem' of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are 'breed-independent' strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed—e.g., Afghan Hounds are apparently way harder to train, than, say, Golden Retrievers. So in addition to the generic-dog-alignment-regime, Afghan hounds require some additional special training to ensure they're aligned. I don't yet understand why you are confident that different possible AGIs could not follow this same pattern.

On top of that, there's the obvious problem that if we try to solve alignment for a particular architecture, it's quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)

I think that you think that I mean something far more specific than I actually do when I say "particular architecture," so I don't think this accurately characterizes what I believe. I describe my view in the next post.

[It's] the unknown unknowns that kill us. The move we want is not "brainstorm failure modes and then avoid the things we brainstormed", it's "figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)".

I think this is a very interesting point (and I have not read Eliezer's post yet, so I am relying on your summary), but I don't see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are "increasingly-known unknowns!" 

I spent the entire first post of this sequence devoted to "figuring out what we want" (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a "strategy that might systematically achieve this:" we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).  

If by "figure out what we want," you mean "figure out what we want out of an AGI," I definitely agree with this (see Robert's great comment below!). If by "figure out what we want," you mean "figure out what we want out of AGI safety research," well, that is the entire point of this sequence!

I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it's technically necessary to answer at some point, this question might not be very useful to think about ahead of time.

I completely disagree with this. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn't even been published yet—I hope you'll read it!). 

in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary.  

When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between "all-the-time-in-the-world" and "hail-Mary," and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn't just do it all at once.)
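A toy sketch of the weighing I have in mind (all probabilities below are made-up placeholders purely to show the structure of the calculation, not actual forecasts):

```python
# Hypothetical "years until AGI" buckets spanning the hail-Mary end of the
# spectrum to the all-the-time-in-the-world end. Every number is a placeholder.
timelines = {
    5:  {"p_timeline": 0.10, "p_success_given_timeline": 0.05},  # near hail-Mary
    15: {"p_timeline": 0.40, "p_success_given_timeline": 0.30},
    30: {"p_timeline": 0.35, "p_success_given_timeline": 0.60},
    50: {"p_timeline": 0.15, "p_success_given_timeline": 0.80},  # ample time
}

for years, t in timelines.items():
    contribution = t["p_timeline"] * t["p_success_given_timeline"]
    print(f"{years}-year timeline contributes {contribution:.3f} to overall P(success)")

total = sum(t["p_timeline"] * t["p_success_given_timeline"] for t in timelines.values())
print(f"Overall P(success) under these placeholder numbers: {total:.3f}")
```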

Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.

I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having "consider[ed] a wide variety of approaches, and look[ed] for subquestions which are clearly crucial to all of them." Take, for instance, this chart in Q1—I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.

I definitely agree that things like how we define 'control' and 'bad outcomes' might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!

Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I've tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it). 

As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-12T18:08:28.379Z · LW · GW

Hey Robert—thanks for your comment!

it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches. 

Definitely agree—I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.

"Aiming at good outcomes while/and avoiding bad outcomes" captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best adressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better abilitiy to deal with future risks) and avoid bad outcomes on the way there. 

I definitely agree with the value of framing AGI outcomes both positively and negatively, as I discuss in the previous post. I am less sure that AGI safety as a field necessarily requires deeply considering the positive potential of AGI (i.e., as long as AGI-induced existential risks are avoided, I think AGI safety researchers can consider their venture successful), but, much to your point, if the best way of actually achieving this outcome is by thinking about AGI more holistically—e.g., instead of explicitly avoiding existential risks, we might ask how to build an AGI that we would want to have around—then I think I would agree. I just think this sort of thing would radically redefine the relevant approaches undertaken in AGI safety research. I by no means want to reject radical redefinitions out of hand (I think this very well could be correct); I just want to say that it is probably not the path of least resistance given where the field currently stands.

(And agreed on the self-control point, as you know. See directionality of control in Q3.)

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-09T21:47:50.495Z · LW · GW

Thanks for your comment—I entirely agree with this. In fact, most of the content of this sequence represents an effort to spell out these generalizations. (I note later that, e.g., the combinatorics of specifying every control proposal to deal with every conceivable bad outcome from every learning architecture is obviously intractable for a single report; this is a "field-sized" undertaking.) 

I don't think this is a violation of the hierarchy, however. It seems coherent to both claim (a) given the field's goal, AGI safety research should follow a general progression toward this goal (e.g., the one this sequence proposes), and (b) there is plenty of productive work that can and should be done outside of this progression (for the reason you specify).

I look forward to hearing if you think the sequence walks this line properly!

Comment by Cameron Berg (cameron-berg) on Paradigm-building: The hierarchical question framework · 2022-02-09T17:11:35.979Z · LW · GW

Hi Tekhne—this post introduces each of the five questions I will put forward and analyze in this sequence. I will be posting one a day for the next week or so. I think I will answer all of your questions in the coming posts.

I doubt that carving up the space in this—or any—way would be totally uncontroversial (there are lots of value judgments necessary to do such a thing), but I think this concern only serves to demonstrate that this framework is not self-justifying (i.e., there is still lots of clarifying work to be done for each of these questions). I agree with this—that's why I am devoting a post to each of them!

In order to minimize AGI-induced existential threats, I claim that we need to understand (i.e., anticipate; predict) AGI well enough (Q1) to determine what these threats are (Q2). We then need to figure out ways to mitigate these threats (Q3) and ways to make sure these proposals are actually implemented (Q4). How quickly we need to answer Q1-Q4 will be determined by how soon we expect AGI to be developed (Q5). I appreciate your skepticism, but I would counter that this seems actually like a fairly natural and parsimonious way to get from point A (where we are now) to point B (minimizing AGI-induced existential threats). That's why I claim that an AGI safety research agenda would need to answer these questions correctly in order to be successful.  

Ultimately, I can only encourage you to wait for the rest of the sequence to be published before passing a conclusive judgment!

Comment by Cameron Berg (cameron-berg) on Paradigm-building from first principles: Effective altruism, AGI, and alignment · 2022-02-09T16:46:10.341Z · LW · GW

I agree with this. By 'special class,' I didn't mean that AI safety has some sort of privileged position as an existential risk (though this may also happen to be true)—I only meant that it is unique. I think I will edit the post to use the word "particular" instead of "special" to make this come across more clearly.

Comment by Cameron Berg (cameron-berg) on Theoretical Neuroscience For Alignment Theory · 2021-12-14T21:27:27.423Z · LW · GW

I think this is an incredibly interesting point. 

I would just note, for instance, in the (crazy cool) fungus-and-ants case, this is a transient state of control that ends shortly thereafter in the death of the smarter, controlled agent. For AGI alignment, we're presumably looking for a much more stable and long-term form of control, which might mean that these cases are not exactly the right proofs of concept. They demonstrate, to your point, that "[agents] can be aligned with the goals of someone much stupider than themselves," but not necessarily that agents can be comprehensively and permanently aligned with the goals of someone much stupider than themselves.

Your comment makes me want to look more closely into how cases of "mind control" work in these more ecological settings and whether there are interesting takeaways for AGI alignment.

Comment by Cameron Berg (cameron-berg) on Theoretical Neuroscience For Alignment Theory · 2021-12-14T21:20:02.210Z · LW · GW

If we expect to gain something from studying how humans implement these processes, it'd have to be something like ensuring that our AIs understand them “in the same way that humans do,” which e.g. might help our AIs generalize in a similar way to humans.

I take your point that there is probably nothing special about the specific way(s) that humans get good at predicting other humans. I do think that "help[ing] our AIs generalize in a similar way to humans" might be important for safety (e.g., we probably don't want an AGI that figures out its programmers way faster/more deeply than they can figure it out). I also think it's the case that we don't currently have a learning algorithm that can predict humans as well as humans can predict humans. (Some attempts, but not there yet.) So to the degree that current approaches are lacking, it makes sense to me to draw some inspiration from the brain-based algorithms that already implement these processes extremely well—i.e., to first understand these algorithms, and to later develop training goals in accordance with the heuristics/architecture these algorithms seem to instantiate. 

 This is notably in contrast to affective empathy, though, which is not something that's inherently necessary for predictive accuracy—so figuring out how/why humans do that has a more concrete story for how that could be helpful.

Agreed! I think it's worth noting that if you take seriously the 'hierarchical IRL' model I proposed in the ToM section, understanding the algorithm(s) underlying affective empathy might actually require understanding cognitive and affective ToM (i.e., if these are the substrate of affective empathy, we'll probably need a good model of them before we can have a good model of affective empathy).

And wrt learning vs. online learning, I think I'm largely in agreement with Steve's reply. I would also add that this might end up just being a terminological dispute depending on how flexible we are with calling particular phases "training" vs. "deployment." E.g., is a brain "deployed" when the person's genetic make-up as a zygote is determined? Or is it when they're born? When their brain stops developing? When they learn the last thing they'll ever learn? To the degree we think these questions are awkward/their answers are arbitrary, I would think this counts as evidence that the notion of "online learning" is useful to invoke here/gives us more parsimonious answers.   

Comment by Cameron Berg (cameron-berg) on Theoretical Neuroscience For Alignment Theory · 2021-12-14T20:51:30.429Z · LW · GW

Thank you! 

I don't think I claimed that the brain is a totally aligned general intelligence, and if I did, I take it back! For now, I'll stand by what I said here: "if we comprehensively understood how the human brain works at the algorithmic level, then necessarily embedded in this understanding should be some recipe for a generally intelligent system at least as aligned to our values as the typical human brain." This seems harmonious with what I take your point to be: that the human brain is not a totally aligned general intelligence. I second Steve's deferral to Eliezer's thoughts on the matter, and I mean to endorse something similar here.

what's the prevalence of empathy in social but non-general animals?

Here's a good summary. I also found a really nice non-academic article in Vox on the topic.

And I'm looking forward to seeing your post on second-order alignment! I think the more people who take the concern seriously (and put forward compelling arguments to that end), the better. 

Comment by Cameron Berg (cameron-berg) on Theoretical Neuroscience For Alignment Theory · 2021-12-14T20:36:12.954Z · LW · GW

Thank you! I think these are all good/important points. 

In regards to functional specialization between the hemispheres, I think whether this difference is at the same level as mid-insular cortex vs posterior insular cortex would depend on whether the hemispheric differences can account for certain lower-order distinctions of this sort or not. For example, let's say that there are relevant functional differences between left ACC and right ACC, left vmPFC and right vmPFC, and left insular cortex and right insular cortex—and that these differences all have something in common (i.e., there is something characteristic about the kinds of computations that differentiate left-hemispheric ACC, vmPFC, insula from right-hemispheric ACC, vmPFC, insula). Then, you might have a case for the hemispheric difference being more fundamental or important than, say, the distinction between mid-insular cortex vs posterior insular cortex. But that's only if these conditions hold (i.e., that there are functional differences and these differences have intra-hemispheric commonalities). I think there's a good chance something like this might be true, but I obviously haven't put forward an argument for this yet, so I don't blame anyone for not taking my word for it!

I'm not fully grasping the autism/ToM/IRL point yet. My understanding of people on the autism spectrum is that they typically lack ordinary ToM, though I'm certainly not saying that I don't believe the people you've spoken with; maybe only that they might be the exception rather than the rule (there are accounts that emphasize things others than ToM, though, to your point). If it is true that (1) autistic people use mechanisms other than ToM/IRL to understand people (i.e., modeling people like car engines), and (2) autistic people have social deficits, then I'm not yet seeing how this demonstrates that IRL is 'at most' just a piece of the puzzle. (FWIW, I would be surprised if IRL were the only piece of the puzzle; I'm just not yet grasping how this argument shows this.) I can tell I'm missing something. 

And I agree with the sad vs. schadenfreude point. I think in an earlier exchange you made the point that this sort of thing could conceivably be modulated by in-group-style dynamics. More specifically, I think that to the extent I can look at a person, their situation, the outcome, etc., and notice (probably implicitly) that I could end up in a similar situation, it's adaptive for me to "simulate" what it is probably like for them to be in this position so I can learn from their experience without having to go through the experience myself. As you note, there are exceptions to this—I think this is particularly the case when we are looking at people more as "objects" (i.e., complex external variables in our environments) than as "subjects" (other agents with internal states, goals, etc. just like me). I think this is well-demonstrated by the following examples.

1: lion-as-subject: I go to the zoo and see a lion. "Ooh, aah! Super majestic." Suddenly, a huge branch falls onto the lion, trapping it. It yelps loudly. I audibly wince, and I really hope the lion is okay. (Bonus subjects: other people around the enclosure also demonstrate they're upset/disturbed by what just happened, which makes me even more upset/disturbed!)

2: lion-as-object: I go on a safari alone and my car breaks down, so I need to walk to the nearest station to get help. As I'm doing this, a lion starts stalking and chasing me. Oh crap. Suddenly, a huge branch falls onto the lion, trapping it. It yelps loudly. "Thank goodness. That was almost really bad."

Very different reactions to the same narrow event. So I guess this kind of thing demonstrates to me that I'm inclined to make stronger claims about affective empathy in those situations where we're looking at other agents in our environment as subjects, not objects. I think in eusocial creatures like humans, subject-perspective is probably far more common than object-perspective, though one could certainly come up with lots of examples of both. So definitely more to think about here, but I really like this kind of challenge to an overly-simplistic picture of affective empathy wherein someone else feeling way X automatically and context-independently makes me feel way X. This, to your point, just seems wrong.