Comments
We've spoken to numerous policymakers and thinkers in DC. The goal is to optimize for explaining to these folks why alignment is important, rather than to the median conservative person per se (ie, DC policymakers are not "median conservatives").
Fixed, thanks!
Note this is not equivalent to saying 'we're almost certainly going to get AGI during Trump's presidency,' but rather that there will be substantial developments that occur during this period that prove critical to AGI development (which, at least to me, does seem almost certainly true).
One thing that seems strangely missing from this discussion is that alignment is in fact, a VERY important CAPABILITY that makes it very much better. But the current discussion of alignment in the general sphere acts like 'alignment' is aligning the AI with the obviously very leftist companies that make it rather than with the user!
Agree with this—we do discuss this very idea at length here and also reference it throughout the piece.
That alignment is to the left is one of just two things you have to overcome in making conservatives willing to listen. (The other is obviously the level of danger.)
I think this is a good distillation of the key bottlenecks and seems helpful for anyone interacting with lawmakers to keep in mind.
Whether one is an accelerationist, Pauser, or an advocate of some nuanced middle path, the prospects/goals of everyone are harmed if the discourse-landscape becomes politicized/polarized.
...
I just continue to think that any mention, literally at all, of ideology or party is courting discourse-disaster for all, again no matter what specific policy one is advocating for.
...
Like a bug stuck in a glue trap, it places yet another limb into the glue in a vain attempt to push itself free.
I would agree in a world where the proverbial bug hasn't already made any contact with the glue trap, but this very thing has clearly already been happening for almost a year in a troubling direction. The political left has been fairly casually 'Everything-Bagel-izing' AI safety, largely by smuggling in social progressivism that has little to do with the core existential risks, and the right, as a result, is increasingly coming to view AI safety as something approximating 'woke BS stifling rapid innovation.' The bug is already a bit stuck.
The point we are trying to drive home here is precisely what you're also pointing at: avoiding an AI-induced catastrophe is obviously not a partisan goal. We are watching people in DC slowly lose sight of this critical fact. This is why we're attempting to explain here why basic AI x-risk concerns are genuinely important regardless of one's ideological leanings. ie, genuinely important to left-leaning and right-leaning people alike. Seems like very few people have explicitly spelled out the latter case, though, which is why we thought it would be worthwhile to do so here.
Interesting—this definitely suggests that Planck's statement probably shouldn't be taken literally/at face value if it is indeed true that some paradigm shifts have historically happened faster than generational turnover. It may still be that the paper is measuring something slightly different from the initial 'resistance phase' that Planck was probably pointing at.
Two hesitations with the paper's analysis:
(1) by only looking at successful paradigm shifts, there might be a bit of survivorship bias at play here (we're not hearing about the cases where a paradigm shift was successfully resisted and never came to fruition).
(2) even if senior scientists in a field may individually accept new theories, institutional barriers can still prevent that theory from getting adequate funding, attention, exploration. I do think Anthony's comment below nicely captures how the institutional/sociological dynamics in science seemingly differ substantially from other domains (in the direction of disincentivizing 'revolutionary' exploration).
Thanks for this! Completely agree that there are Type I and II errors here and that we should be genuinely wary of both. Also agree with your conclusion that 'pulling the rope sideways' is strongly preferred to simply lowering our standards. The unconventional researcher-identification approach undertaken by the HHMI might be a good proof of concept for this kind of thing.
I think you might be taking the quotation a bit too literally—we are of course not literally advocating for the death of scientists, but rather highlighting that many of the largest historical scientific innovations were systematically rejected by their originators' contemporaries in the field.
Agree that scientists change their minds and can be convinced by sufficient evidence, especially within specific paradigms. I think the thornier problem that Kuhn and others have pointed out is that the introduction of new paradigms into a field is very challenging to evaluate for those who are already steeped in an existing paradigm, which tends to cause these people to reject and ridicule those with strong intuitions for new paradigms, even when those paradigms prove in hindsight to be more powerful or explanatory than existing ones.
Thanks for this! Consider the self-modeling loss gradient: $\nabla_W L_{SM} = 2(W - I)\,aa^\top$ (where $W$ is the self-modeling layer and $a$ the activations it predicts). While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($W - I$) and the activation covariance ($aa^\top$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block—it's predicting the full activation pattern—the interaction between the primary task loss, activation covariance structure ($aa^\top$), and need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero difference during training.
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get $L_{SM} = \|(W - I)\,a\|^2$.
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $W - I$). However, due to the recurrent nature of this loss—where updates to the weight matrix depend on activations that are themselves being updated by the loss—the resulting dynamics are more complex in practice. Looking at the gradient $\nabla_W L_{SM} = 2(W - I)\,aa^\top$, we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
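For readers who want something more concrete, here is a minimal sketch of the kind of setup being discussed (illustrative rather than our actual training code; the layer sizes, the linear self-modeling head, and the loss weight `lam` are assumptions made for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Toy classifier with an auxiliary head that predicts its own hidden activations."""
    def __init__(self, in_dim=784, hidden_dim=128, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, n_classes)
        # Self-modeling layer W: the identity map would zero out the auxiliary loss,
        # but SGD on the combined objective typically settles elsewhere.
        self.self_model = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        a = self.encoder(x)            # hidden activations the network must predict
        logits = self.classifier(a)
        a_hat = self.self_model(a)     # the network's prediction of its own activations
        return logits, a_hat, a

def combined_loss(logits, a_hat, a, targets, lam=0.1):
    task_loss = F.cross_entropy(logits, targets)
    # Self-modeling loss ~ ||W a - a||^2; its gradient w.r.t. W is proportional to
    # (W - I) a a^T, so it depends on the activation covariance rather than simply
    # shrinking activations toward zero or any fixed vector.
    sm_loss = F.mse_loss(a_hat, a)
    return task_loss + lam * sm_loss

# Usage sketch
model = SelfModelingNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, a_hat, a = model(x)
loss = combined_loss(logits, a_hat, a, y)
loss.backward()
```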
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea and we will definitely add this to the roadmap and report back. And batch norm/other forms of regularization were not added.
I am not suggesting either of those things. You enumerated a bunch of ways we might use cutting-edge technologies to facilitate intelligence amplification, and I am simply noting that frontier AI seems like it will inevitably become one such technology in the near future.
On a psychologizing note, your comment seems like part of a pattern of trying to wriggle out of doing things the way that is hard that will work.
Completely unsure what you are referring to, or what the other datapoints in this supposed pattern are. Strikes me as somewhat ad-hominem-y unless I am misunderstanding what you are saying.
AI helping to do good science wouldn't make the work any less hard—it just would cause the same hard work to happen faster.
hard+works is better than easy+not-works
seems trivially true. I think the full picture is something like:
efficient+effective > inefficient+effective > efficient+ineffective > inefficient+ineffective
Of course agree that if AI-assisted science is not effective, it would be worse to do than something that is slower but effective. Seems like whether or not this sort of system could be effective is an empirical question that will be largely settled in the next few years.
Somewhat surprised that this list doesn't include something along the lines of "punt this problem to a sufficiently advanced AI of the near future." This could potentially dramatically decrease the amount of time required to implement some of these proposals, or otherwise yield (and proceed to implement) new promising proposals.
It seems to me in general that human intelligence augmentation is often framed in a vaguely-zero-sum way with getting AGI ("we have to all get a lot smarter before AGI, or else..."), but it seems quite possible that AGI or near-AGI could itself help with the problem of human intelligence augmentation.
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.
Thanks for this! Synthetic datasets of the kind you describe do seem like they could have a negative alignment tax, especially to the degree (as you point out) that self-motivated actors may be incentivized to use them anyway if they were successful.
Your point about alignment generalizing farther than capabilities is interesting and is definitely reminiscent of Beren’s thinking on this exact question.
Curious if you can say more about which evopsych assumptions about human capabilities/values you think are false.
- Thanks for writing up your thoughts on RLHF in this separate piece, particularly the idea that ‘RLHF hides whatever problems the people who try to solve alignment could try to address.’ We definitely agree with most of the thrust of what you wrote here and do not believe nor (attempt to) imply anywhere that RLHF indicates that we are globally ‘doing well’ on alignment or have solved alignment or that alignment is ‘quite tractable.’ We explicitly do not think this and say so in the piece.
- With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)
The broader point we are making in this post is that the entire world is moving full steam ahead towards more powerful AI whether we like it or not, and so discovering and deploying alignment techniques that move in the direction of actually satisfying this impossible-to-ignore attractor while also maximally decreasing the probability that the “giant wave will crash into human society and destroy it” seem worth pursuing—especially compared to the very plausible counterfactual world where everyone just pushes ahead with capabilities anyways without any corresponding safety guarantees.
While we do point to RLHF in the piece as one nascent example of what this sort of thing might look like, we think the space of possible approaches with a negative alignment tax is potentially vast. One such example we are particularly interested in (unlike RLHF) is related to implicit/explicit utility function overlap, mentioned in this comment.
I think the post makes clear that we agree RLHF is far from perfect and won't scale, and definitely appreciate the distinction between building aligned models from the ground-up vs. 'bolting on' safety techniques (like RLHF) after the fact.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
Thanks, it definitely seems right that improving capabilities through alignment research (negative Thing 3) doesn't necessarily make safe AI easier to build than unsafe AI overall (Things 1 and 2). This is precisely why if techniques for building safe AI were discovered that were simultaneously powerful enough to move us toward friendly TAI (and away from the more-likely-by-default dangerous TAI, to your point), this would probably be good from an x-risk perspective.
We aren't celebrating any capability improvement that emerges from alignment research—rather, we are emphasizing the expected value of techniques that inherently improve both alignment and capabilities such that the strategic move for those who want to build maximally-capable AI shifts towards the adoption of these techniques (and away from the adoption of approaches that might cause similar capabilities gains without the alignment benefits). This is a more specific take than just a general 'negative Thing 3.'
I also think there is a relevant distinction between 'mere' dual-use research and what we are describing here—note the difference between points 1 and 2 in this comment.
Just because the first approximation of a promising alignment agenda in a toy environment would not self-evidently scale to superintelligence does not mean it is not worth pursuing and refining that agenda so that the probability of it scaling increases.
We do not imply or claim anywhere that we already know how to ‘build a rocket that doesn’t explode.’ We do claim that no other alignment agenda has determined how to do this thus far, that the majority of other alignment researchers agree with this prognosis, and that we therefore should be pursuing promising approaches like SOO to increase the probability of the proverbial rocket not exploding.
In the RL experiment, we were only measuring SOO as a means of deception reduction in the agent that sees color (the blue agent), and the fact that the colorblind agent is an agent at all is not consequential for our main result.
Please also see here and here, where we’ve described why the goal is not simply to maximize SOO in theory or in practice.
Thanks for all these additional datapoints! I'll try to respond to all of your questions in turn:
Did you find that your AIS survey respondents with more AIS experience were significantly more male than newer entrants to the field?
Overall, there don't appear to be major differences when filtering for amount of alignment experience. When filtering for greater than vs. less than 6 months of experience, it does appear that the ratio looks more like ~5 M:F; at greater than vs. less than 1 year of experience, it looks like ~8 M:F; the others still look like ~9 M:F. Perhaps the changes you see over the past two years at MATS are too recent to be reflected fully in this data, but it does seem like a generally positive signal that you see this ratio changing (given what we discuss in the post).
Has AE Studio considered sponsoring significant bounties or impact markets for scoping promising new AIS research directions?
We definitely want to do everything we can to support increased exploration of neglected approaches—if you have specific ideas here, we'd love to hear them and discuss more! Maybe we can follow up offline on this.
Did survey respondents mention how they proposed making AIS more multidisciplinary? Which established research fields are more needed in the AIS community?
We don't appear to have gotten many practical proposals for how to make AIS more multidisciplinary, but there were a number of specific disciplines mentioned in the free responses, including cognitive psychology, neuroscience, game theory, behavioral science, ethics/law/sociology, and philosophy (epistemology was specifically brought up across multiple respondents). One respondent wrote, "AI alignment is dominated by computer scientists who don't know much about human nature, and could benefit from more behavioral science expertise and game theory," which I think captures the sentiment of many of the related responses most succinctly (however accurate this statement actually is!). Ultimately, encouraging and funding research at the intersection of these underexplored areas and alignment is likely the only thing that will actually lead to a more multidisciplinary research environment.
Did EAs consider AIS exclusively a longtermist cause area, or did they anticipate near-term catastrophic risk from AGI?
Unfortunately, I don't think we asked the EA sample about AIS in a way that would allow us to answer this question using the data we have. This would be a really interesting follow-up direction. I will paste in below the ground truth distribution of EAs' views on the relative promise of these approaches as additional context (eg, we see that the 'AI risk' and 'Existential risk (general)' distributions have very similar shapes), but I don't think we can confidently say much about whether these risks were being conceptualized as short- or long-term.
It's also important to highlight that in the alignment sample (from the other survey), researchers generally indicate that they do not think we're going to get AGI in the next five years. Again, this doesn't clarify if they think there are x-risks that could emerge in the nearer term from less-general-but-still-very-advanced AI, but it does provide an additional datapoint that if we are considering AI x-risks to be largely mediated by the advent of AGI, alignment researchers don't seem to expect this as a whole in the very short term:
I expect this to generally be a more junior group, often not fully employed in these roles, with eg the average age and funding level of the orgs that are being led particularly low (and some of the orgs being more informal).
Here is the full list of the alignment orgs who had at least one researcher complete the survey (and who also elected to share what org they are working for): OpenAI, Meta, Anthropic, FHI, CMU, Redwood Research, Dalhousie University, AI Safety Camp, Astera Institute, Atlas Computing Institute, Model Evaluation and Threat Research (METR, formerly ARC Evals), Apart Research, Astra Fellowship, AI Standards Lab, Confirm Solutions Inc., PAISRI, MATS, FOCAL, EffiSciences, FAR AI, aintelope, Constellation, Causal Incentives Working Group, Formalizing Boundaries, AISC.
~80% of the alignment sample is currently receiving funding of some form to pursue their work, and ~75% have been doing this work for >1 year. Seems to me like this is basically the population we were intending to sample.
One additional factor for my abandoning it was that I couldn't imagine it drawing a useful response population anyway; the sample mentioned above is a significant surprise to me (even with my skepticism around the makeup of that population). Beyond the reasons I already described, I felt that it being done by a for-profit org that is a newcomer and probably largely unknown would dissuade a lot of people from responding (and/or providing fully candid answers to some questions).
Your expectation while taking the survey about whether we were going to be able to get a good sample does not say much about whether we did end up getting a good sample. Things that better tell us whether or not we got a good sample are, eg, the quality/distribution of the represented orgs and the quantity of actively-funded technical alignment researchers (both described above).
All in all, I expect that the respondent population skews heavily toward those who place a lower value on their time and are less involved.
Note that the survey took people ~15 minutes to complete and resulted in a $40 donation being made to a high-impact organization, which puts our valuation of an hour of their time at ~$160 (roughly equivalent to the hourly rate of someone who makes ~$330k annually). Assuming this population would generally donate a portion of their income to high-impact charities/organizations by default, taking the survey actually seems to probably have been worth everyone's time in terms of EV.
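Spelling that arithmetic out (assuming a standard 2,080-hour work year):

$$\frac{\$40}{0.25\ \text{hr}} = \$160/\text{hr}, \qquad \$160/\text{hr} \times 2{,}080\ \text{hr/yr} \approx \$333\text{k/yr}.$$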
There's a lot of overlap between alignment researchers and the EA community, so I'm wondering how that was handled.
Agree that there is inherent/unavoidable overlap. As noted in the post, we were generally cautious about excluding participants from either sample for reasons you mention and also found that the key results we present here are robust to these kinds of changes in the filtration of either dataset (you can see and explore this for yourself here).
With this being said, we did ask respondents in both the EA and the alignment survey to indicate the extent to which they are involved in alignment—note the significance of the difference here:
From alignment survey:
From EA survey:
This question/result serves both as a good filtering criterion for cleanly separating out EAs from alignment researchers and also gives pretty strong evidence that we are drawing on completely different samples across these surveys (likely because we sourced the data for each survey through completely distinct channels).
Regarding the support for various cause areas, I'm pretty sure that you'll find the support for AI Safety/Long-Termism/X-risk is higher among those most involved in EA than among those least involved. Part of this may be because of the number of jobs available in this cause area.
Interesting—I just tried to test this. It is a bit hard to find a variable in the EA dataset that would cleanly correspond to higher vs. lower overall involvement, but we can filter by the number of years one has been involved in EA, and there is no level-of-experience threshold I could find where there are statistically significant differences in EAs' views on how promising AI x-risk is. (Note that years of experience in EA may not be the best proxy for what you are asking, but it is likely the best we've got to tackle this specific question.)
Blue is >1 year experience, red is <1 year experience:
Blue is >2 years experience, red is <2 years experience:
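If anyone wants to poke at this themselves in the shared data, the check is roughly the following (a sketch only—the column names years_in_ea and ai_risk_promise are placeholders for whatever the actual export calls them, and a Mann-Whitney U test is just one reasonable choice for comparing the two rating distributions):

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def compare_by_experience(df: pd.DataFrame, threshold_years: float) -> None:
    """Compare 'how promising is AI risk work' ratings above vs. below an experience cutoff."""
    senior = df.loc[df["years_in_ea"] >= threshold_years, "ai_risk_promise"].dropna()
    junior = df.loc[df["years_in_ea"] < threshold_years, "ai_risk_promise"].dropna()
    stat, p = mannwhitneyu(senior, junior, alternative="two-sided")
    print(f"cutoff={threshold_years}y  n={len(senior)}/{len(junior)}  "
          f"medians={senior.median():.1f}/{junior.median():.1f}  p={p:.3f}")

# e.g., the two thresholds shown in the plots above:
# compare_by_experience(ea_df, 1)
# compare_by_experience(ea_df, 2)
```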
Thanks for the comment!
Consciousness does not have a commonly agreed upon definition. The question of whether an AI is conscious cannot be answered until you choose a precise definition of consciousness, at which point the question falls out of the realm of philosophy into standard science.
Agree. Also happen to think that there are basic conflations/confusions that tend to go on in these conversations (eg, self-consciousness vs. consciousness) that make the task of defining what we mean by consciousness more arduous and confusing than it likely needs to be (which isn't to say that defining consciousness is easy). I would analogize consciousness to intelligence in terms of how difficult it is to nail down precisely, but I don't think there is anything philosophically special about consciousness that inherently eludes modeling.
is there some secret sauce that makes the algorithm [that underpins consciousness] special and different from all currently known algorithms, such that if we understood it we would suddenly feel enlightened? I doubt it. I expect we will just find a big pile of heuristics and optimization procedures that are fundamentally familiar to computer science.
Largely agree with this too—it very well may be the case (as seems now to be obviously true of intelligence) that there is no one 'master' algorithm that underlies the whole phenomenon, but rather, as you say, a big pile of smaller procedures, heuristics, etc. So be it—we definitely want to better understand (for reasons explained in the post) what set of potentially-individually-unimpressive algorithms, when run in concert, gives you a system that is conscious.
So, to your point, there is not necessarily any one 'deep secret' to uncover that will crack the mystery (though we think, eg, Graziano's AST might be a strong candidate solution for at least part of this mystery), but I would still think that (1) it is worthwhile to attempt to model the functional role of consciousness, and that (2) whether we actually have better or worse models of consciousness matters tremendously.
There will be places on the form to indicate exactly this sort of information :) we'd encourage anyone who is associated with alignment to take the survey.
Thanks for taking the survey! When we estimated how long it would take, we didn't count how long it would take to answer the optional open-ended questions, because we figured that those who are sufficiently time constrained that they would actually care a lot about the time estimate would not spend the additional time writing in responses.
In general, the survey does seem to take respondents approximately 10-20 minutes to complete. As noted in another comment below,
this still works out to donating $120-240/researcher-hour to high-impact alignment orgs (plus whatever value there is in comparing one's individual results to those of the community), which hopefully is worth the time investment :)
Ideally within the next month or so. There are a few other control populations still left to sample, as well as actually doing all of the analysis.
Thanks for sharing this! Will definitely take a look at this in the context of what we find and see if we are capturing any similar sentiment.
Thanks for calling this out—we're definitely open to discussing potential opportunities for collaboration/engaging with the platform!
It's a great point that the broader social and economic implications of BCI extend beyond the control of any single company, AE no doubt included. Still, while bandwidth and noisiness of the tech are potentially orthogonal to one's intentions, companies with unambiguous humanity-forward missions (like AE) are far more likely to actually care about the societal implications, and therefore, to build BCI that attempts to address these concerns at the ground level.
In general, we expect the by-default path to powerful BCI (i.e., one where we are completely uninvolved) to be negative/rife with s-risks/significant invasions of privacy and autonomy, etc, which is why we are actively working to nudge the developmental trajectory of BCI in a more positive direction—i.e., one where the only major incentive is to build the most human-flourishing-conducive BCI tech we possibly can.
With respect to the RLNF idea, we are definitely very sympathetic to wireheading concerns. We think the approach is promising if we are able to obtain better reward signals from all of the sub-symbolic information that neural signals can offer about human intent, but, as you correctly pointed out, that same information could be used to better trick the human evaluator as well. We think this already happens to a lesser extent, and we expect that both current and future methods will have to account for this particular risk.
More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
We do call some of these concerns out in the post, eg:
We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we don’t even know, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions.)
Overall—in spite of the double-edged nature of alignment work potentially facilitating capabilities breakthroughs—we think it is critical to avoid base rate neglect in acknowledging how unbelievably aggressively people (who are generally alignment-ambivalent) are now pushing forward capabilities work. Against this base rate, we suspect our contributions to inadvertently pushing forward capabilities will be relatively negligible. This does not imply that we shouldn't be extremely cautious, have rigorous info/exfohazard standards, think carefully about unintended consequences, etc—it just means that we want to be pragmatic about the fact that we can help solve alignment while being reasonably confident that the overall expected value of this work will outweigh the overall expected harm (again, especially given the incredibly high, already-happening background rate of alignment-ambivalent capabilities progress).
Thanks for your comment! I think we can simultaneously (1) strongly agree with the premise that in order for AGI to go well (or at the very least, not catastrophically poorly), society needs to adopt a multidisciplinary, multipolar approach that takes into account broader civilizational risks and pitfalls, and (2) have fairly high confidence that within the space of all possible useful things to do to within this broader scope, the list of neglected approaches we present above does a reasonable job of documenting some of the places where we specifically think AE has comparative advantage/the potential to strongly contribute over relatively short time horizons. So, to directly answer:
Is this a deliberate choice of narrowing your direct, object-level technical work to alignment (because you think this where the predispositions of your team are?), or a disagreement with more systemic views on "what we should work on to reduce the AI risks?"
It is something far more like a deliberate choice than a systemic disagreement. We are also very interested in and open to broader models of how control theory, game theory, information security, etc have consequences for alignment (e.g., see ideas 6 and 10 for examples of nontechnical things we think we could likely help with). To the degree that these sorts of things can be thought of as further neglected approaches, we may indeed agree that they are worthwhile for us to consider pursuing, or at least to help facilitate others' pursuits—with the comparative advantage caveat stated previously.
I'm definitely sympathetic to the general argument here as I understand it: something like, it is better to be more productive when what you're working towards has high EV, and stimulants are one underutilized strategy for being more productive. But I have concerns about the generality of your conclusion: (1) blanket-endorsing or otherwise equating the advantages and disadvantages of all of the things on the y-axis of that plot is painting with too broad a brush. They vary, eg, in addictive potential, demonstrated medical benefit, cost of maintenance, etc. (2) Relatedly, some of these drugs (e.g., Adderall) alter the dopaminergic calibration in the brain, which can lead to significant personality/epistemology changes, typically as a result of modulating people's risk-taking/reward-seeking trade-offs. Similar dopamine agonist drugs used to treat Parkinson's led to pathological gambling behaviors in patients who took them. There is an argument to be made for at least some subset of these substances that the trouble induced by these kinds of personality changes may plausibly outweigh the productivity gains of taking the drugs in the first place.
27 people holding the view is not a counterexample to the claim that it is becoming less popular.
Still feels worthwhile to emphasize that some of these 27 people are, eg, Chief AI Scientist at Meta, co-director of CIFAR, DeepMind staff researchers, etc.
These people are major decision-makers in some of the world's leading and most well-resourced AI labs, so we should probably pay attention to where they think AI research should go in the short-term—they are among the people who could actually take it there.
See also this survey of NLP
I assume this is the chart you're referring to. I take your point that you see these numbers as increasing or decreasing (even though where they currently stand in an absolute sense seems consistent with believing that brain-based AGI is entirely possible), but these increases or decreases are themselves risky statistics to extrapolate. These sorts of trends could easily asymptote or reverse given volatile field dynamics. For instance, if we linearly extrapolate from the two stats you provided (5% believe scaling could solve everything in 2018; 17% believe it in 2022), this would predict that, eg, 56% of NLP researchers in 2035 would believe scaling could solve everything. Do you actually think something in this ballpark is likely?
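(For reference, the straight-line extrapolation behind that 56% figure:

$$\frac{17\% - 5\%}{2022 - 2018} = 3\ \text{pp/yr}, \qquad 17\% + 3\ \text{pp/yr} \times (2035 - 2022) = 56\%.)$$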
Did the paper say that NeuroAI is looking increasingly likely?
I was considering the paper itself as evidence that NeuroAI is looking increasingly likely.
When people who run many of the world's leading AI labs say they want to devote resources to building NeuroAI in the hopes of getting AGI, I am considering that as a pretty good reason to believe that brain-like AGI is more probable than I thought it was before reading the paper. Do you think this is a mistake?
Certainly, to your point, signaling an intention to try X is not the same as successfully doing X, especially in the world of AI research. But again, if anyone were to be able to push AI research in the direction of being brain-based, would it not be these sorts of labs?
To be clear, I do not personally think that prosaic AGI and brain-based AGI are necessarily mutually exclusive—eg, brains may be performing computations that we ultimately realize are some emergent product of prosaic AI methods that already basically exist. I do think that the publication of this paper gives us good reason to believe that brain-like AGI is more probable than we might have thought it was, eg, two weeks ago.
However, technological development is not a zero-sum game. Opportunities or enthusiasm in neuroscience doesn't in itself make prosaic AGI less likely and I don't feel like any of the provided arguments are knockdown arguments against ANN's leading to prosaic AGI.
Completely agreed!
I believe there are two distinct arguments at play in the paper and that they are not mutually exclusive. I think the first is "in contrast to the optimism of those outside the field, many front-line AI researchers believe that major new breakthroughs are needed before we can build artificial systems capable of doing all that a human, or even a much simpler animal like a mouse, can do" and the second is "a better understanding of neural computation will reveal basic ingredients of intelligence and catalyze the next revolution in AI, eventually leading to artificial agents with capabilities that match and perhaps even surpass those of humans."
The first argument can be read as a reason to negatively update on prosaic AGI (unless you see these 'major new breakthroughs' as also being prosaic) and the second argument can be read as a reason to positively update on brain-like AGI. To be clear, I agree that the second argument is not a good reason to negatively update on prosaic AGI.
Thanks for your comment!
As far as I can tell the distribution of views in the field of AI is shifting fairly rapidly towards "extrapolation from current systems" (from a low baseline).
I suppose part of the purpose of this post is to point to numerous researchers who serve as counterexamples to this claim—i.e., Yann LeCun, Terry Sejnowski, Yoshua Bengio, Timothy Lillicrap et al seem to disagree with the perspective you're articulating in this comment insofar as they actually endorse the perspective of the paper they've coauthored.
You are obviously a highly credible source on trends in AI research—but so are they, no?
And if they are explicitly arguing that NeuroAI is the route they think the field should go in order to get AGI, it seems to me unwise to ignore or otherwise dismiss this shift.
Agreed that there are important subtleties here. In this post, I am really just using the safety-via-debate set-up as a sort of intuitive case for getting us thinking about why we generally seem to trust certain algorithms running in the human brain to adjudicate hard evaluative tasks related to AI safety. I don't mean to be making any especially specific claims about safety-via-debate as a strategy (in part for precisely the reasons you specify in this comment).
Thanks for the comment! I do think that, at present, the only working example we have of an agent able to explicitly self-inspect its own values is in the human case, even if getting the base shards 'right' in the prosocial sense would likely entail that they will already be doing self-reflection. Am I misunderstanding your point here?
Thanks Lukas! I just gave your linked comment a read and I broadly agree with what you've written both there and here, especially w.r.t. focusing on the necessary training/evolutionary conditions out of which we might expect to see generally intelligent prosocial agents (like most humans) emerge. This seems like a wonderful topic to explore further IMO. Any other sources you recommend for doing so?
Hi Joe—likewise! This relationship between prosociality and distribution of power in social groups is super interesting to me and not something I've given a lot of thought to yet. My understanding of this critique is that it would predict something like: in a world where there are huge power imbalances, typical prosocial behavior would look less stable/adaptive. This brings to mind for me things like 'generous tit for tat' solutions to prisoner's dilemma scenarios—i.e., where being prosocial/trusting is a bad idea when you're in situations where the social conditions are unforgiving to 'suckers.' I guess I'm not really sure what exactly you have in mind w.r.t. power specifically—maybe you could elaborate on (if I've got the 'prediction' right in the bit above) why one would think that typical prosocial behavior would look less stable/adaptive in a world with huge power imbalances?
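To make that intuition a bit more concrete, here is a toy iterated-prisoner's-dilemma sketch of the dynamic I have in mind (the payoff values and the 10% forgiveness rate are arbitrary illustrative choices, not anything from the literature):

```python
import random

# Standard prisoner's-dilemma payoffs for (my_move, their_move); C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def generous_tft(opponent_history, forgiveness=0.1):
    """Copy the opponent's last move, but occasionally forgive a defection."""
    if not opponent_history:
        return "C"
    if opponent_history[-1] == "D" and random.random() < forgiveness:
        return "C"
    return opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=1000):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_b)  # each strategy sees the opponent's history
        move_b = strategy_b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

# Against an unforgiving exploiter, the generous player's occasional forgiveness
# just hands over free payoff -- the 'sucker' dynamic described above.
print(play(generous_tft, always_defect))
```

Against another reciprocating player, the same generous strategy does well—so the question about power imbalances is roughly whether real social environments look more like the first matchup or the second.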
I broadly agree with Viliam's comment above. Regarding Dagon's comment (to which yours is a reply), I think that characterizing my position here as 'people who aren't neurotypical shouldn't be trusted' is basically strawmanning, as I explained in this comment. I explicitly don't think this is correct, nor do I think I imply it is anywhere in this post.
As for your comment, I definitely agree that there is a distinction to be made between prosocial instincts and the learned behavior that these instincts give rise to over the lifespan, but I would think that the sort of 'integrity' that you point at here as well as the self-aware psychopath counterexample are both still drawing on particular classes of prosocial motivations that could be captured algorithmically. See my response to 'plausible critique #1,' where I also discuss self-awareness as an important criterion for prosociality.
Interesting! Definitely agree that if people's specific social histories are largely what qualify them to be 'in the loop,' this would be hard to replicate for the reasons you bring up. However, consider that, for example,
Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe they are having trouble doing so alone...
which almost certainly has nothing to do with their social history. I think there's a solid argument to be made, then, that a lot of these social histories are essentially a lifelong finetuning of core prosocial algorithms that have in some sense been there all along. And I am mainly excited about enumerating these. (Note also that figuring out these algorithms and running them in an RL training procedure might get us the relevant social histories training that you reference—but we'd need the core algorithms first.)
"human in the loop" to some extent translates to "we don't actually know why we trust (some) other humans, but there exist humans we trust, so let's delegate the hard part to them".
I totally agree with this statement taken by itself, and my central point is that we should actually attempt to figure out 'why we trust (some) other humans' rather than treating this as a kind of black box. However, if this statement is being put forward as an argument against doing so, then it seems circular to me.
Agreed that the correlation between the modeling result and the self-report is impressive, with the caveat that the sample size is small enough not to take the specific r-value too seriously. In a quick search, I couldn't find a replication of the same task with a larger sample, but I did find a meta-analysis that includes this task which may be interesting to you! I'll let you know if I find something better as I continue to read through the literature :)
Definitely agree with the thrust of your comment, though I should note that I neither believe nor think I really imply anywhere that 'only neurotypical people are worth societal trust.' I only use the word in this post to gesture at the fact that the vast majority of (but not all) humans share a common set of prosocial instincts—and that these instincts are a product of stuff going on in their brains. In fact, my next post will almost certainly be about one such neuroatypical group: psychopaths!
I liked this post a lot, and I think its title claim is true and important.
One thing I wanted to understand a bit better is how you're invoking 'paradigms' in this post wrt AI research vs. alignment research. I think we can be certain that AI research and alignment research are not identical programs but that they will conceptually overlap and constrain each other. So when you're talking about 'principles that carry over,' are you talking about principles in alignment research that will remain useful across various breakthroughs in AI research, or are you thinking about principles within one of these two research programs that will remain useful across various breakthroughs within that research program?
Another thing I wanted to understand better was the following:
This leaves a question: how do we know when it’s time to make the jump to the next paradigm? As a rough model, we’re trying to figure out the constraints which govern the world.
Unlike many of the natural sciences (physics, chemistry, biology, etc.) whose explicit goals ostensibly are, as you've said, 'to figure out the constraints which govern the world,' I think that one thing that makes alignment research unique is that its explicit goal is not simply to gain knowledge about reality, but also to prevent a particular future outcome from occurring—namely, AGI-induced X-risks. Surely a necessary component for achieving this goal is 'to figure out the [relevant] constraints which govern the world,' but it seems pretty important to note (if we agree on this field-level goal) that this can't be the only thing that goes into a paradigm for alignment research. That is, alignment research can't only be about modeling reality; it must also include some sort of plan for how to bring about a particular sort of future. And I agree entirely that the best plans of this sort would be those that transcend content-level paradigm shifts. (I daresay that articulating this kind of plan is exactly the sort of thing I try to get at in my Paradigm-building for AGI safety sequence!)
Thanks for your comment! I agree with both of your hesitations and I think I will make the relevant changes to the post: instead of 'totally unenforceable,' I'll say 'seems quite challenging to enforce.' I believe that this is true (and I hope that the broad takeaway from this post is basically the opposite of 'researchers need to stay out of the policy game,' so I'm not too concerned that I'd be incentivizing the wrong behavior).
To your point, 'logistically and politically inconceivable' is probably similarly overblown. I will change it to 'highly logistically and politically fraught.' You're right that the general failure of these policies shouldn't be equated with their inconceivability. (I am fairly confident that, if we were so inclined, we could go download a free copy of any movie or song we could dream of—I wouldn't consider this a case study of policy success—only of policy conceivability!).
Very interesting counterexample! I would suspect it gets increasingly sketchy to characterize 1/8th, 1/16th, etc. 'units of knowledge towards AI' as 'breakthroughs' in the way I define the term in the post.
I take your point that we might get our wires crossed when a given field looks like it's accelerating, but when we zoom in to only look at that field's breakthroughs, we find that they are decelerating. It seems important to watch out for this. Thanks for your comment!
The question is not "How can John be so sure that zooming into something narrower would only add noise?", the question is "How can Cameron be so sure that zooming into something narrower would yield crucial information without which we have no realistic hope of solving the problem?".
I am not 'so sure'—as I said in the previous comment, I have only claimed that it is probably necessary to, for instance, know more about AGI than just whether it is a 'generic strong optimizer.' I would only be comfortable making non-probabilistic claims about the necessity of particular questions in hindsight.
I don't think I'm making some silly logical error. If your question is, "Why does Cameron think it is probably necessary to understand X if we want to have any realistic hope of solving the problem?", well, I do not think this is rhetorical! I spend an entire post defending and elucidating each of these questions, and I hope by the end of the sequence, readers would have a very clear understanding of why I think each is probably necessary to think about (or I have failed as a communicator!).
It was never my goal to defend the (probable) necessity of each of the questions in this one post—this is the point of the whole sequence! This post is a glorified introductory paragraph.
I do not think, therefore, that this post serves as anything close to an adequate defense of this framework, and I understand your skepticism if you think this is all I will say about why these questions are important.
However, I don't think your original comment—or any of this thread, for that matter—really addresses any of the important claims put forward in this sequence (which makes sense, given that I haven't even published the whole thing yet!). It also seems like some of your skepticism is being fueled by assumptions about what you predict I will argue as opposed to what I will actually argue (correct me if I'm wrong!).
I hope you can find the time to actually read through the whole thing once it's published before passing your final judgment. Taken as a whole, I think the sequence speaks for itself. If you still think it's fundamentally bullshit after having read it, fair enough :)
Definitely agree that if we silo ourselves into any rigid plan now, it almost certainly won't work. However, I don't think 'end-to-end agenda' = 'rigid plan.' I certainly don't think this sequence advocates anything like a rigid plan. These are the most general questions I could imagine guiding the field, and I've already noted that I think this should be a dynamic draft.
...we do not currently possess a strong enough understanding to create an end-to-end agenda which has any hope at all of working; anything which currently claims to be an end-to-end agenda is probably just ignoring the hard parts of the problem.
What hard parts of the problem do you think this sequence ignores?
(I explicitly claim throughout the sequence that what I propose is not sufficient, so I don't think I can be accused of ignoring this.)
Hate to just copy and paste, but I still really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, then we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). This is basically tautological as far as I can tell. Do you agree or disagree with this if-then statement?
I do think that finding necessary subquestions, or noticing that a given subquestion may not be necessary, is much easier than figuring out an end-to-end agenda.
Agreed. My goal was to enumerate these questions. When I noticed that they followed a fairly natural progression, I decided to frame them hierarchically. And, I suppose to your point, it wasn't necessarily easy to write this all up. I thought it would nonetheless be valuable to do so, so I did!
Thanks for linking the Rocket Alignment Problem—looking forward to giving it a closer read.
If it's possible that we could get to a point where AGI is no longer a serious threat without needing to answer the question, then the question is not necessary.
Agreed, this seems like a good definition for rendering anything as 'necessary.'
Our goal: minimize AGI-induced existential threats (right?).
My claim is that answering these questions is probably necessary for achieving this goal—i.e., P(achieving goal | failing to think about one or more of these questions) ≈ 0. (I say, "I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work.")
That is, we would be exceedingly lucky if we achieve AGI safety's goal without thinking about
- what we mean when we say AGI (Q1),
- what existential risks are likely to emerge from AGI (Q2),
- how to address these risks (Q3),
- how to implement these mitigation strategies (Q4), and
- how quickly we actually need to answer these questions (Q5).
I really don't see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all "before the deadline;" Q5). I propose a way to do this hierarchically. Do you see wiggle room here where I do not?
FWIW, I also don't really think this is the core claim of the sequence. I would want that to be something more like here is a useful framework for moving from point A (where the field is now) to point B (where the field ultimately wants to end up). I have not seen a highly compelling presentation of this sort of thing before, and I think it is very valuable in solving any hard problem to have a general end-to-end plan (which we probably will want to update as we go along; see Robert's comment).
I think most of the strategies in MIRI's general cluster do not depend on most of these questions.
Would you mind giving a specific example of an end-to-end AGI safety research agenda that you think does not depend on or attempt to address these questions? (I'm also happy to just continue this discussion off of LW, if you'd like.)
Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don't agree with most of what you've written—I'd be very curious to hear if you think I'm missing something:
[We] don't expect that the alignment problem itself is highly-architecture dependent; it's a fairly generic property of strong optimization. So, "generic strong optimization" looks like roughly the right level of generality at which to understand alignment...Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively "noise", for purposes of understanding alignment.
Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn't seem at all obvious to me. I write in Q2: "It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI."
By analogy, consider the more familiar 'alignment problem' of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are 'breed-independent' strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed—e.g., Afghan Hounds are apparently way harder to train than, say, Golden Retrievers. So in addition to the generic-dog-alignment regime, Afghan Hounds require some additional special training to ensure they're aligned. I don't yet understand why you are confident that different possible AGIs could not follow this same pattern.
On top of that, there's the obvious problem that if we try to solve alignment for a particular architecture, it's quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)
I think that you think that I mean something far more specific than I actually do when I say "particular architecture," so I don't think this accurately characterizes what I believe. I describe my view in the next post.
[It's] the unknown unknowns that kill us. The move we want is not "brainstorm failure modes and then avoid the things we brainstormed", it's "figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)".
I think this is a very interesting point (and I have not read Eliezer's post yet, so I am relying on your summary), but I don't see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are "increasingly-known unknowns!"
I spent the entire first post of this sequence devoted to "figuring out what we want" (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a "strategy that might systematically achieve this:" we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).
If by "figure out what we want," you mean "figure out what we want out of an AGI," I definitely agree with this (see Robert's great comment below!). If by "figure out what we want," you mean "figure out what we want out of AGI safety research," well, that is the entire point of this sequence!
I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it's technically necessary to answer at some point, this question might not be very useful to think about ahead of time.
I completely disagree with this. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn't even been published yet—I hope you'll read it!).
in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary.
When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between "all-the-time-in-the-world" and "hail-Mary," and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn't just do it all at once.)
Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.
I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having "consider[ed] a wide variety of approaches, and look for subquestions which are clearly crucial to all of them." Take, for instance, this chart in Q1—I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.
I definitely agree that things like how we define 'control' and 'bad outcomes' might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!
Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I've tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it).
As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.