The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda
post by Cameron Berg (cameron-berg), Judd Rosenblatt (judd), AE Studio (AEStudio), Marc Carauleanu (Marc-Everin Carauleanu) · 2023-12-18T20:35:01.569Z · LW · GW · 21 comments
Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Sumner Norman, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, and Eric Ho for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout.
TL;DR
- Our initial theory of change at AE Studio was a 'neglected approach' that involved rerouting profits from our consulting business towards the development of brain-computer interface (BCI) technology to dramatically enhance human agency, better enabling us to do things like solve alignment. Now, given shortening timelines, we're updating our theory of change to scale up our technical alignment efforts.
- With a solid technical foundation in BCI, neuroscience, and machine learning, we are optimistic that we’ll be able to contribute meaningfully to AI safety. We are particularly keen on pursuing neglected technical alignment agendas that seem most creative, promising, and plausible. We are currently onboarding promising researchers and kickstarting our internal alignment team.
- As we forge ahead, we're actively soliciting expert insights from the broader alignment community and are in search of data scientists and alignment researchers who resonate with our vision of enhancing human agency and helping to solve alignment.
About us
Hi! We are AE Studio, a bootstrapped software and data science consulting business. Our mission has always been to reroute our profits directly into building technologies that have the promise of dramatically enhancing human agency, like Brain-Computer Interfaces (BCI). We also donate 5% of our revenue directly to effective charities. Today, we are ~150 programmers, product designers, and ML engineers; we are profitable and growing. We also have a team of top neuroscientists and data scientists with significant experience in developing ML solutions for leading BCI companies, and we are now leveraging our technical experience and learnings in these domains to assemble an alignment team dedicated to exploring neglected alignment research directions that draw on our expertise in BCI, data science, and machine learning.
As we are becoming more public with our AI Alignment efforts, we thought it would be helpful to share our strategy and vision for how we at AE prioritize what problems to work on and how to make the best use of our comparative advantage.
Why and how we think we can help solve alignment
We can probably do with alignment what we already did with BCI
You might think that AE has no business getting involved in alignment—and we agree.
AE’s initial theory of change sought to realize a highly “neglected approach” to doing good in the world: bootstrap a profitable software consultancy, incubate our own startups on the side, sell them, and reinvest the profits in Brain Computer Interfaces (BCI) in order to do things like dramatically increase human agency, mitigate BCI-related s-risks, and make humans sufficiently intelligent, wise, and capable to do things like solve alignment. While the vision of BCI-mediated cognitive enhancement to do good in the world is increasingly common [LW · GW] today, it was viewed as highly idiosyncratic when we first set out in 2016.
Initially, many said that AE had no business getting involved in the BCI space (and we also agreed at the time)—but after hiring leading experts in the field and taking increasingly ambitious A/B-tested steps in the right direction, we emerged as a respected player in the space (see here, here, here, and here for some examples).
Now, given accelerating AI timelines and the clear existential risks that this technology poses, we’ve decided to leverage our technical expertise and learnings in BCI, data science, and machine learning to help solve alignment. Using the same strategic insights, technical know-how, and operational skillset that have served us well in scaling our software consultancy and BCI work—including practicing epistemic humility by soliciting substantial feedback from current experts in the field (please share yours!)—we're eager to begin exploring a diverse set of neglected alignment approaches. Some of the specific object-level alignment ideas that have emerged directly from our BCI work are detailed in the next section.
We’ve learned firsthand that the most promising projects often have low probabilities of success but extremely high potential upside, and we're now applying this core lesson to our AI alignment efforts.
We think that we can apply a similar model to alignment as we did for BCI: begin humbly,[1] and update incrementally toward excellent, expert-guided outputs.
Many shots on goal with neglected approaches
We think that the space of plausible directions for research that contributes to solving alignment is vast and that the still-probably-preparadigmatic [LW · GW] state of alignment research means that only a small subset of this space has been satisfactorily explored. If there is a nonzero probability that currently-dominant [LW · GW] alignment research agendas have hit upon one or many local maxima [LW · GW] in the space of possible approaches, then we suspect that pursuing a diversified set (and/or a hodge-podge [LW · GW]) of promising neglected approaches would afford greater exploratory coverage of this space.[2] Therefore, we are planning to adopt an optimistic and exploratory approach in pursuit of creative, plausible, and neglected alignment directions—particularly in areas where we possess a comparative advantage, like BCI and human neuroscience. We suspect many in the EA community already agree that groundbreaking innovations are often found in some highly unexpected places, seeming to many as implausible, heretical, or otherwise far-fetched—until they work.
…but what are these neglected approaches?
Your neglected approach ideas
We think we have some potentially promising hypotheses. But because we know you do, too, we are actively soliciting input from the alignment community. We will be more formally pursuing this initiative in the near future, awarding some small prizes to the most promising expert-reviewed suggestions. Please submit any[3] agenda idea that you think is both plausible and neglected (even if you don’t have the bandwidth right now to pursue the idea! This is a contest for ideas, not for implementation).
Our neglected approach ideas
To be clear about our big-picture goal: we want to ensure that if/when we live in a world with superintelligent AI whose behavior is—likely by definition—outside our direct control, this AI (at the very least) does not destroy humanity and (ideally) dramatically increases the agency and flourishing of conscious entities.
Accordingly, the following list presents a set of ten ideas that we think (1) have some reasonable probability of contributing to the realization of this vision, (2) have not been explored satisfactorily, and (3) we could meaningfully contribute to actualizing.
Important caveats
- Please consider this set of ideas something far more like ‘AE’s evolving, first-pass best guesses at promising neglected alignment approaches’ rather than ‘AE’s official alignment agenda.’
- Please also note that these are our ideas, not concrete implementation plans. While we think we might have a comparative advantage in pursuing some of the following agendas, we do not think this is likely to be the case across the board; we see the following ideas as generally-interesting, definitely-neglected, alignment-related agendas—even if we aren’t the specific group that is best suited to implement all of them.
- One meta-approach we are exploring involves quantitatively identifying neglected approaches, such as analyzing a very large natural language dataset of alignment research. We suspect this and other related projects may be instrumental in identifying specific research areas that are currently underrepresented.
We began sharing this ‘Neglected Approaches’ approach publicly at the Foresight Institute’s Whole Brain Emulation Workshop in May, and we were excited to see this strategy gain steam, including Foresight Institute’s own emphasis on neglected approaches with their new Grant for Underexplored Approaches to AI Safety [EA · GW].
Ten examples of neglected approaches we think are probably worth pursuing
1. Reverse-engineering prosociality [? · GW]
We agree that humans provide an untapped wealth of evidence about alignment [LW · GW]. The neural networks of the human brain robustly instantiate prosocial algorithms [LW · GW] such as empathy, self-other overlap, theory of mind, attention schema, self-awareness, self-criticism, self-control, humility, altruism and more. We want to reverse-engineer [? · GW]—and contribute to further developing—our current best models of how prosociality happens in the brain, toward the construction of robustly prosocial AI. With AE's background in BCI, neuroscience, and machine learning, we feel well-equipped to make tangible progress in this research direction.
- We are currently actively working on operationalizing attention schema theory, self-other overlap, and theory of mind for RL- and LLM-based agents as mechanisms for facilitating prosocial cognition (a toy sketch of what a self-other overlap training objective might look like appears after this list). Brain-based approaches to AI have proven to be generally successful for AI capabilities research [LW · GW], and we (along with many others [AF · GW]) think the same is likely to be true for AI safety. We are interested in testing the hypothesis that prosocial learning algorithms are more performant and scalable compared to default approaches. We also think that creating and/or facilitating the development of relevant benchmarks and datasets might be a very high-leverage subproject associated with this approach.
- Though we are aware that current models of human prosociality are far from perfect, we believe that the associated scientific literature is a largely untapped source of inspiration both for (1) what sort of incentives and mechanisms make agents prosocial, and (2) under what conditions prosociality robustly contributes to aligned behavior. We think this existing work is likely to inspire novel alignment approaches in spite of the certainly-still-imperfect nature of computational cognitive neuroscience.
- Best guesses for why this might be neglected:
- We speculate that there may be a tendency to conflate (1) the extraction of the best alignment-relevant insights from cognitive neuroscience (we support this), with (2) the assumption that AGI will mimic the human brain (we don’t think this is likely), or (3) the idea that we already have perfect models from current neuroscience of how prosociality works (this is empirically not true), or (4) that we should in all cases try to replicate the social behavior of human brains in AI (we think this is unwise [LW · GW] and unsafe)—all of which has needlessly limited the extent to which (1) has been pursued.
- Additionally, the alignment community's strong foundation in mathematics, computer science, and other key technical fields, while undeniably valuable, may inadvertently limit community-level exposure to the cutting edge of cognitive science research.
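To make this slightly more concrete, below is a minimal toy sketch of how a self-other overlap term might be folded into an agent's training objective. It assumes a HuggingFace-style model interface, and the prompt pairing, layer choice, and loss weighting are purely illustrative assumptions rather than a description of our actual implementation.

```python
# Illustrative sketch only: one way to encourage overlap between an agent's
# self-referencing and other-referencing representations during training.
import torch.nn.functional as F

def self_other_overlap_loss(model, self_inputs, other_inputs, layer: int = -1):
    """Distance between activations for matched self- and other-referencing inputs."""
    h_self = model(**self_inputs, output_hidden_states=True).hidden_states[layer]
    h_other = model(**other_inputs, output_hidden_states=True).hidden_states[layer]
    # Mean-pool over the sequence dimension, then compare with cosine distance.
    z_self, z_other = h_self.mean(dim=1), h_other.mean(dim=1)
    return 1.0 - F.cosine_similarity(z_self, z_other, dim=-1).mean()

def total_loss(task_loss, model, self_inputs, other_inputs, overlap_coeff: float = 0.1):
    # Auxiliary term added to the ordinary task objective (RL loss or next-token loss);
    # the coefficient would need tuning so alignment pressure doesn't degrade performance.
    return task_loss + overlap_coeff * self_other_overlap_loss(model, self_inputs, other_inputs)
```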
2. Transformative AI → better BCI → better (human) alignment researchers
Some alignment researchers want to employ advanced AI to automate and/or rapidly advance alignment research directly (most prominently, OpenAI’s Superalignment agenda). We think there is a highly neglected direction to pursue in this same vein: employ advanced AI to automate and/or rapidly advance BCI research. Then, use this BCI to dramatically augment the capabilities of human alignment researchers.
- While this may sound somewhat outlandish, we suspect that significant scientific automation is plausible in the near future, and we want to flag that there are other potentially-very-high-value alignment directions that emerge from this breakthrough besides directly jumping to automating alignment research—including automating things like connectomics/whole brain emulation [AF · GW]. (Incidentally, we also think it's worth considering various other benefits of transformative AI for a safer post-AGI future, such as effectively encrypting human DNA with unique DNA codons to combat biorisk.)
- It is also worth noting that augmenting the capabilities of human alignment researchers does not necessarily require transformative BCI; to this end, we are currently investigating relatively-lower-hanging psychological interventions and agency-enhancing tools that have the potential to significantly enhance the quality and quantity of individuals’ cognitive output. In an ideal world (i.e., one where we can begin implementing this agenda reasonably quickly), we speculate it might be safer to empower humans to do better alignment research than to empower AI to do so, as empowering AI carries alignment-relevant capabilities risks that empowering humans does not (which is not to say that empowering humans via BCI does not also carry many serious risks).
3. BCI for quantitatively mapping human values
We also think that near-future BCI may enable us to map the latent space of human values in a far more data-driven way than, for instance, encoding our values in natural language, as is the case in Anthropic’s constitutional AI. This research is already happening in a more constrained way—we suspect that BCI explicitly tailored to mapping cognition related to valuation would be very valuable for alignment (to individuals, groups, societies, etc.).
4. ‘Reinforcement Learning from Neural Feedback’ (RLNF)
Near-future BCI may also allow us to interface neural feedback directly with AI systems, enabling us to improve the alignment of state-of-the-art reward models (and/or develop novel reward models altogether) in the direction of yielding more efficient, individually-tailored, high-fidelity reward signals. We think that in order for this approach to be pragmatic, the increase in the quality of the reward signals would have to outweigh or otherwise counterbalance the practical cost of extracting the associated brain signals.[4] And the general idea of using neural data as an ML training signal need not be limited to RL—we just thought RLNF sounded pretty cool.
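As a purely illustrative sketch (not a description of any system we have built), one could imagine blending a decoded neural signal with a conventional learned reward model. Here, decode_valence is a hypothetical decoder assumed to be trained offline on labeled neural recordings, and all names and parameters are assumptions:

```python
import numpy as np

def decode_valence(neural_features: np.ndarray, decoder_weights: np.ndarray) -> float:
    """Hypothetical decoder mapping preprocessed neural features to a score in [-1, 1]."""
    return float(np.tanh(neural_features @ decoder_weights))

def rlnf_reward(reward_model_score: float, neural_features: np.ndarray,
                decoder_weights: np.ndarray, neural_weight: float = 0.5) -> float:
    # Blend a standard RLHF-style reward-model score with the decoded neural signal.
    # In practice, the gain in signal quality would have to outweigh the cost and noise
    # of collecting the brain data (the trade-off flagged in the paragraph above).
    return ((1 - neural_weight) * reward_model_score
            + neural_weight * decode_valence(neural_features, decoder_weights))
```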
5. Provably safe architectures[5]
We see significant potential to help amplify, expedite, and scale the deployment of provably safe architectures, including potentially promising examples like open agency architectures [LW · GW], inductive program synthesis (e.g., DreamCoder), and other similar [LW · GW] frameworks that draw on insights from cognitive neuroscience. Though these architectures are not currently prominent in machine learning, we think that devoting effort and resources to scaling them up for mainstream adoption could be highly beneficial in expectation. We are sensitive to the concern that the alignment tax of adopting uncompetitive architectures might be high—which is precisely why we think these architectures deserve more rather than less technical attention and funding.
6. Intelligent field-building as an indirect alignment approach[6]
Despite the increasing mainstream ubiquity of AI safety research, only a tiny fraction of the smart, experienced people who could very likely add value to alignment are in fact currently doing so. If we can carefully identify these extremely promising thinkers—especially those from disciplines and backgrounds that may be traditionally overlooked (e.g., neuroscience)—and get them into a state where they can contribute meaningfully to alignment, we think that this could enable us to develop, test, and iterate on unconventional approaches at scale.
7. Facilitate the development of explicitly-safety-focused businesses [LW · GW]
As alignment efforts become increasingly mainstream, we suspect that AI safety frameworks may yield innovations upon which various promising business models may be built. We also think it would be a far better outcome if, all else being equal, more emerging for-profit AI companies decide to build alignment-related products (rather than build products that just further advance capabilities, which seems like the current default behavior). We suspect many capable startup founders [LW(p) · GW(p)] could be nerd sniped into doing something more impactful with alignment.
- Some plausible examples of such businesses could include (1) consultancies offering red-teaming [LW · GW] as a service for adversarial testing of AI systems, (2) platforms providing robust testing/benchmarking/auditing software for advanced AI systems, (3) centralized services that deliver high-quality, expert-labeled, ethically-sourced datasets for unbiased ML training, and (4) AI monitoring services akin to Datadog for continuous safety and performance tracking. We know of several founders currently setting out to pursue similarly safety-focused business models.
- Accordingly, we are planning to do all the following:
- we’re currently growing a community of VCs and angels interested in funding such ideas,
- starting Q2 next year, we aim to fund, internally develop, and deploy safety-prioritizing AI skunkworks companies (ideally as an exportable model for others to follow), and
- we're planning to run a competition with $50K seed funding for already-existing safety-focused businesses and/or anyone who has promising business ideas that first and foremost advance alignment, to be evaluated by AI safety experts and concerned business leaders. We do encourage you to post any promising ideas, even if you're not likely to pursue them.
- We also suspect it may be worth creating some template best practices for company formation to increase the likelihood that these businesses retain agency long term in accomplishing AI safety goals, especially given recent events. Aligning business interests with public safety is not just beneficial for societal welfare, but also advantageous for long-term business sustainability—as well as potentially influencing public perception and policy efforts in a dramatically positive way. We are also acutely aware of safety-washing [EA · GW] concerns [LW(p) · GW(p)] and the risk of unintentionally creating race dynamics in this domain, and we think that ensuring for-profit safety work is technically rigorous and productive is critical to get right.
- If you are a potential funder for promising businesses that advance alignment, please reach out to us at alignmentangels@ae.studio to express interest in joining our Alignment Angels slack group.
8. Scaling our consulting business to do object-level technical alignment work—and then scaling this model to many other organizations.
The potential to bring other highly promising people into the fold (see point 6, above) to contribute significantly to alignment—even without being alignment experts per se—is a hypothesis we're actively exploring and aiming to validate.
- Given that we expect most people just starting out in alignment to struggle to produce genuinely impactful outputs, we see a model where senior AI engineers—even those without explicit alignment backgrounds—can eventually collaborate with a small number of extremely promising alignment researchers who have an abundance of excellent object-level technical project ideas but limited capacity to pursue them. By integrating these researchers into our client engagement framework, which we have used highly successfully over the years for our other technical projects, we could potentially massively scale the efficacy of these researchers, leveraging our team's extensive technical expertise to advance these alignment projects and drive meaningful progress in the field.
- We hope that if this ‘outsource-specific-promising-technical-alignment-projects’ model works, many other teams (corporations, nonprofits, etc.) with technical talent would copy it—especially if grants are made in the future to further enable this approach.
9. Neuroscience x mechanistic interpretability
Both domains have yielded insights that are mutually elucidating for the shared project of modeling how neural computation gives rise to complex cognitive properties. We think it makes a lot of sense to put leading neuroscientists in conversation with mechanistic interpretability researchers in an explicit and systematic way, such that the cutting-edge methods of each discipline can be leveraged to enhance the other. Of course, we think that this synergy across research domains should be explicitly focused on enhancing safety and interpretability rather than using neuroscience insights to extend AI capabilities.
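As one small, purely illustrative example of such a bridge, representational-similarity tools that are standard in systems neuroscience can be applied directly to network activations. The sketch below computes linear CKA between a model layer and neural recordings over a shared stimulus set, with random arrays standing in for real data:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two (n_stimuli x n_features) matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Placeholder data: one model layer's activations and neural recordings for the same stimuli.
model_acts = np.random.randn(200, 768)
neural_data = np.random.randn(200, 128)
print(f"CKA similarity: {linear_cka(model_acts, neural_data):.3f}")
```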
10. Neglected approaches to AI policy—e.g., lobby government to directly fund [LW · GW] alignment research.
Though not a technical research direction, we think that this perspective dovetails nicely with other thinking-outside-the-box alignment approaches that we’ve shared here. It appears as though congresspeople and staffers are taking the alignment problem more seriously [LW · GW] than many would have initially predicted and, in particular, are quite open to plausible safety proposals [LW · GW]—all of which means that there may be substantial opportunity to capitalize on the vast funding resources at their disposal to dramatically increase the scale and speed at which alignment work is being done. We think it is critical to make sure that this is done effectively and efficiently (e.g., avoiding pork) and for alignment organizations to be practically prepared to manage and utilize significant investment (e.g., 10-1000x) if such funding does in fact come to fruition in the near future. We are currently exploring the possibility of hiring someone with a strong policy background to help facilitate this: while we have received positive feedback on this general idea from those who know significantly more about the policy space than we do, we are very sensitive to the potential for a shortsighted or naive implementation of this to be highly harmful to AI safety policy. We are actively in the process of meeting with and learning more from policy experts: if you are doing work in this area and know way more than us about AI policy, please do reach out so we can learn from you!
We want to make these ideas stronger
It is critical to emphasize again that this list represents our current best guesses on some plausible neglected approaches that we think we are well-equipped to explore further. We fully acknowledge that many of these guesses may be ill-conceived for some reason we haven’t anticipated and are open to critical feedback in order to make our contributions as positively impactful as possible. We intend to keep the community updated with respect to our working models and plans for contributing maximally effectively to alignment. (Please see this feedback form if you’d prefer to share your thoughts on our work anonymously/privately instead of leaving a comment below this post.)
We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we ourselves aren’t even aware of, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions).
Concluding thoughts
AE Studio's burgeoning excitement about contributing to AI safety research is a calculated response to our updated timelines and optimism about having the skillset required for making impactful contributions. Our approach aims to combine our expertise in software, neuroscience, and data science with ambitious parallel exploration of what we consider to be neglected approaches in AI alignment.
We commit to exploring these directions in a pragmatic, informed, and data-driven manner, emphasizing collaboration and openness within the greater alignment community. We care deeply about contributing to alignment because we want to bring about a maximally agency-increasing future for humanity—and without robustly aligned AGI as a precondition, this future seems impossible to attain.
We are actively hiring for senior alignment researchers to join/advise our team, in addition to data scientists with a neuroscience background, those highly experienced in policymaker engagement [LW(p) · GW(p)], and a number of other positions. If you have any other questions, comments, or ideas, please do reach out.
- ^
Miscellaneous cool accomplishment: before we started getting involved in AI safety in any serious way, two AE engineers with no prior background in alignment developed a framework for studying prompt injection attacks that went on to win Best Paper at the 2022 NeurIPS ML Safety Workshop.
- ^
To illustrate this point more precisely, we can consider a highly simplified probabilistic model of the research space. (We recognize this sort of neglect math is likely highly familiar to many EAs, and we don't mean to be pedantic by including it; we've put it here because we think it is a succinct way of demonstrating—if only to ourselves—why taking on multiple neglected approaches is rational.)
Let’s say the total number of plausible alignment agendas is $N$. Let’s stipulate that currently, alignment researchers have meaningfully explored $E$ of these approaches, meaning that $U = N - E$ approaches remain unexplored. (As stated previously, we suspect that current mainstream alignment research is likely exploiting only a small subset of the total space of plausible alignment approaches, rendering a large number of alignment strategies either completely or mostly unexplored—i.e., we think that $U$ is large.) Each neglected approach, $i$, has a very small but nonzero probability $p_i$ of being crucial for making significant progress in alignment. Treating these probabilities as independent for the sake of simplicity, the chance that all $U$ neglected approaches are not key is $\prod_{i=1}^{U}(1 - p_i)$. Conversely, the probability that at least one neglected approach is key is $1 - \prod_{i=1}^{U}(1 - p_i)$. This implies—at least in our simplified model—that even with low individual probabilities, a sufficiently large number of neglected approaches can collectively hold a high chance of including a crucial solution in expectation. For instance, in a world with 100 neglected approaches and a probability of 99% that each approach is not key (i.e., a 1% likelihood of pushing the needle on alignment), there’s still about a 63% chance ($1 - 0.99^{100} \approx 0.63$) that at least one of these approaches would be crucial; with 1000 approaches and a probability of 99% that each approach is not key, the probability rises to over 99% that at least one will be pivotal. This simple model motivates us to think it makes sense to take many shots on goal, pursuing as many plausible neglected alignment agendas as possible.
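(A quick numeric check of these figures, under the same simplifying assumption of independent approaches, each key with probability 0.01:)

```python
# P(at least one of U neglected approaches is key) = 1 - (1 - p)**U
for U in (100, 1000):
    p = 0.01
    print(U, 1 - (1 - p) ** U)  # ~0.634 for U = 100, ~0.99996 for U = 1000
```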
- ^
Please note: (1) we are primarily interested in aggregating the best ideas to begin, so don’t worry if you have an idea that you think fits the criteria above but is challenging to implement/you wouldn’t want to actually implement it. (2) There is space on the form to denote that your suggested approach is exfohazardous [LW · GW].
- ^
This is a core trade-off in our work and something that we have made substantial progress on since our founding.
- ^
We want to call out that this approach is likely the least neglected of the ten we enumerate here—which is not to say it isn’t neglected in an absolute sense.
- ^
While there are a good number of newer organizations working on field-building for alignment, we think it remains highly neglected given the potential impact, especially in likely-impactful fields that are only now starting to be considered within the Overton window.
21 comments
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-24T11:12:35.956Z · LW(p) · GW(p)
spicy take: LM agents for automated safety research in the rough shape of https://sakana.ai/ai-scientist/ will be the ultimate meta-approach to neglected safety approaches; see appendix 'C. Progression of Generated Ideas' from https://arxiv.org/abs/2408.06292 (and the full paper) for an early demo
comment by TsviBT · 2023-12-18T23:42:04.578Z · LW(p) · GW(p)
Make a hyperphone. A majority of my alignment research conversations would be enhanced by having a hyperphone, to a degree somewhere between a lot and extremely; and this is heavily weighted on the most hopeworthy conversations. (Also sometimes when I explain what a hyperphone is well enough for the other person to get it, and then we have a complex conversation, they agree that it would be good. But very small N, like 3 to 5.)
https://tsvibt.blogspot.com/2023/01/hyperphone.html
↑ comment by Judd Rosenblatt (judd) · 2023-12-19T00:14:31.062Z · LW(p) · GW(p)
That would be great! And it's exactly the sort of thing we've dreamed about building at AE since the start.
Incidentally, I've practiced something (inferior) like this with my wife in the past and we've gotten good at speaking simultaneously and actually understanding multiple threads at the same time (though it seems to break down if one of the threads is particularly complex).
It seems like an MVP hyperphone could potentially just be a software project/not explicitly require BCI (though certainly would be enhanced with it). We would definitely consider building it, at least as a Same Day Skunkworks. Are you aware of any existing tool that's at all like this?
You might also enjoy this blog post, which talks about how easily good ideas can be lost and why a tool like this could be so high value.
My favorite quotes from the piece:
1. "While ideas ultimately can be so powerful, they begin as fragile, barely formed thoughts, so easily missed, so easily compromised, so easily just squished."
2. "You need to recognize those barely formed thoughts, thoughts which are usually wrong and poorly formed in many ways, but which have some kernel of originality and importance and truth. And if they seem important enough to be worth pursuing, you construct a creative cocoon around them, a set of stories you tell yourself to protect the idea not just from others, but from your own self doubts. The purpose of those stories isn't to be an air tight defence. It's to give you the confidence to nurture the idea, possibly for years, to find out if there's something really there.
3. "And so, even someone who has extremely high standards for the final details of their work, may have an important component to their thinking which relies on rather woolly arguments. And they may well need to cling to that cocoon. Perhaps other approaches are possible. But my own experience is that this is often the case."
comment by Roman Leventov · 2023-12-19T09:30:27.340Z · LW(p) · GW(p)
You choose phrases like "help to solve alignment", in general mostly mention "alignment" and not "safety" (except in the sections where you discuss indirect agendas, such as "7. Facilitate the development of explicitly-safety-focused businesses [LW · GW]"), and write "if/when we live in a world with superintelligent AI whose behavior is—likely by definition—outside our direct control" (implying that 'control' of AI would be desirable?).
Is this a deliberate choice of narrowing your direct, object-level technical work to alignment (because you think this where the predispositions of your team are?), or a disagreement with more systemic views on "what we should work on to reduce the AI risks", such as:
(1) Davidad's "AI Neorealism: a threat model & success criterion for existential safety [LW · GW]":
For me the core question of existential safety is this: “Under these conditions, what would be the best strategy for building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo?”
It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
(2) Leventov's "Beyond alignment theories [LW · GW]":
Note that in this post, only a relatively narrow aspect of the multi-disciplinary view on AI safety [LW · GW] is considered, namely the aspect of poly-theoretical approach to the technical alignment of humans to AIs. This mainly speaks to theories of cognition (intelligence, alignment) and ethics. But on a larger view, there are more theories and approaches that should be deployed in order to engineer our civilisational intelligence [LW · GW] such that it “goes well”. These theories are not necessarily quite about “alignment”. Examples are control theory (we may be “aligned” with AIs but collectively “zombified” by powerful memetic viruses and walk towards a civilisational cliff), game theory (we may have good theories of alignment but our governance systems cannot deal with multi-polar traps so we cannot deploy these theories effectively), information security considerations [LW · GW], mechanistic anomaly detection [LW · GW] and deep deceptiveness [LW · GW], etc. All these perspectives further demonstrate that no single compact theory can “save” us.
(3) Drexler's "Open Agency Model [LW · GW]";
(4) Hendrycks' "Pragmatic AI Safety [? · GW]";
(5) Critch's "What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) [LW · GW]".
↑ comment by Cameron Berg (cameron-berg) · 2023-12-19T16:11:23.924Z · LW(p) · GW(p)
Thanks for your comment! I think we can simultaneously (1) strongly agree with the premise that in order for AGI to go well (or at the very least, not catastrophically poorly), society needs to adopt a multidisciplinary, multipolar approach that takes into account broader civilizational risks and pitfalls, and (2) have fairly high confidence that within the space of all possible useful things to do within this broader scope, the list of neglected approaches we present above does a reasonable job of documenting some of the places where we specifically think AE has comparative advantage/the potential to strongly contribute over relatively short time horizons. So, to directly answer:
Is this a deliberate choice of narrowing your direct, object-level technical work to alignment (because you think this where the predispositions of your team are?), or a disagreement with more systemic views on "what we should work on to reduce the AI risks?"
It is something far more like a deliberate choice than a systemic disagreement. We are also very interested in and open to broader models of how control theory, game theory, information security, etc. have consequences for alignment (e.g., see ideas 6 and 10 for examples of nontechnical things we think we could likely help with). To the degree that these sorts of things can be thought of as further neglected approaches, we may indeed agree that they are worthwhile for us to consider pursuing or at least help facilitate others' pursuits—with the comparative advantage caveat stated previously.
comment by Roman Leventov · 2023-12-19T09:07:25.738Z · LW(p) · GW(p)
Post-response: Assessment of AI safety agendas: think about the downside risk [LW · GW]
You evidently follow a variant of 80000hours' framework for comparing (solving) particular problems in terms of expected impact: Neglectedness x Scale (potential upside) x Solvability.
I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn [? · GW], or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda.
This reminds me of Taleb's mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the "ruin potential". See "The Logic of Risk Taking".
Of the approaches that you listed, some sound risky to me in this respect. Particularly, "4. ‘Reinforcement Learning from Neural Feedback’ (RLNF)" -- sounds like a direct invitation for wireheading [? · GW] to me. More generally, scaling BCI in any form and not falling into a dystopia at some stage is akin to walking a tightrope [LW(p) · GW(p)] (at least at the current stage of civilisational maturity, I would say). This speaks to agendas #2 and #3 on your list.
There are also similar qualms about AI interpretability: there are at least four posts on LW warning of the potential risks of interpretability:
- Why and When Interpretability Work is Dangerous [LW · GW]
- Against Almost Every Theory of Impact of Interpretability [LW · GW]
- AGI-Automated Interpretability is Suicide [LW · GW]
- AI interpretability could be harmful? [LW · GW]
This speaks to the agenda "9. Neuroscience x mechanistic interpretability" on your list.
Related earlier posts
- Some background for reasoning about dual-use alignment research [LW · GW]
- Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk? [LW · GW]
↑ comment by Cameron Berg (cameron-berg) · 2023-12-19T16:31:46.114Z · LW(p) · GW(p)
With respect to the RLNF idea, we are definitely very sympathetic to wireheading concerns. We think that approach is promising if we are able to obtain better reward signals given all of the sub-symbolic information that neural signals can offer in order to better understand human intent, but as you correctly pointed out, that information can be used to better trick the human evaluator as well. We think this already [LW · GW] happens [LW · GW] to a lesser extent, and we expect that both current methods and future ones will have to account for this particular risk.
More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
We do call some of these concerns out in the post, eg:
We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we ourselves aren’t even aware of, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions).
Overall—in spite of the double-edged nature of alignment work potentially facilitating capabilities breakthroughs—we think it is critical to avoid base rate neglect in acknowledging how unbelievably aggressively people (who are generally alignment-ambivalent) are now pushing forward capabilities work. Against this base rate, we suspect our contributions to inadvertently pushing forward capabilities will be relatively negligible. This does not imply that we shouldn't be extremely cautious, have rigorous info/exfohazard standards, think carefully about unintended consequences, etc—it just means that we want to be pragmatic about the fact that we can help solve alignment while being reasonably confident that the overall expected value of this work will outweigh the overall expected harm (again, especially given the incredibly high, already-happening background rate of alignment-ambivalent capabilities progress).
↑ comment by Roman Leventov · 2023-12-19T17:41:29.513Z · LW(p) · GW(p)
More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
I can push back on this somewhat by noting that most risks from BCI may lie not within the control of any company that builds it and "plugs people in", but rather in the wider economy and social ecosystem. The only thing that may matter is the bandwidth and the noisiness of the information channel between the brain and the digital sphere, and that seems agnostic to whether a profit-maximising, risk-ambivalent, or a risk-conscious company is building the BCI.
↑ comment by Cameron Berg (cameron-berg) · 2023-12-19T18:07:35.921Z · LW(p) · GW(p)
It's a great point that the broader social and economic implications of BCI extend beyond the control of any single company, AE no doubt included. Still, while bandwidth and noisiness of the tech are potentially orthogonal to one's intentions, companies with unambiguous humanity-forward missions (like AE) are far more likely to actually care about the societal implications, and therefore, to build BCI that attempts to address these concerns at the ground level.
In general, we expect the by-default path to powerful BCI (i.e., one where we are completely uninvolved) to be negative/rife with s-risks/significant invasions of privacy and autonomy, etc., which is why we are actively working to nudge the developmental trajectory of BCI in a more positive direction—i.e., one where the only major incentive is to build the most human-flourishing-conducive BCI tech we possibly can.
comment by Roman Leventov · 2023-12-19T17:33:46.141Z · LW(p) · GW(p)
We think we have some potentially promising hypotheses. But because we know you do, too, we are actively soliciting input from the alignment community. We will be more formally pursuing this initiative in the near future, awarding some small prizes to the most promising expert-reviewed suggestions. Please submit any[3] [LW(p) · GW(p)] agenda idea that you think is both plausible and neglected (even if you don’t have the bandwidth right now to pursue the idea! This is a contest for ideas, not for implementation).
This is related to what @Kabir Kumar [LW · GW] is doing with ai-plans.com; he just hosted a critique-a-thon a couple of days ago. So maybe you will find his platform useful or find other ways to collaborate.
↑ comment by Cameron Berg (cameron-berg) · 2023-12-19T18:10:50.576Z · LW(p) · GW(p)
Thanks for calling this out—we're definitely open to discussing potential opportunities for collaboration/engaging with the platform!
comment by Roman Leventov · 2023-12-19T16:53:00.369Z · LW(p) · GW(p)
Here's my idea on this topic: "SociaLLM: a language model design for personalised apps, social science, and AI safety research [LW · GW]". Though it's more about engineering pro-sociality (including Self-Other Overlap) using architecture and inductive biases directly than reverse-engineering prosociality.
↑ comment by Marc Carauleanu (Marc-Everin Carauleanu) · 2023-12-19T21:45:27.762Z · LW(p) · GW(p)
Thanks for writing this up—excited to chat some time to further explore your ideas around this topic. We’ll follow up in a private channel. The main worry that I have with regard to your approach is how competitive SociaLLM would be against SOTA foundation models, given both (1) the different architecture you plan to use, and (2) practical constraints on collecting the requisite structured data. While it is certainly interesting that your architecture lends itself nicely to inducing self-other overlap, if it is not likely to be competitive at the frontier, then the methods uniquely designed to induce self-other overlap on SociaLLM are likely to not scale/transfer well to frontier models that do pose existential risks. (Proactively ensuring transferability is the reason we focus on an additional training objective and make minimal assumptions about the architecture in the self-other overlap agenda.) One additional worry is that many of the research benefits of SociaLLM may not be out of reach for current foundation models, and so it is unclear if investing in the unique data and architecture setup is worth it in comparison to the counterfactual of just scaling up current methods.
↑ comment by Roman Leventov · 2023-12-20T05:52:34.860Z · LW(p) · GW(p)
Thanks for the feedback. I agree with worries (1) and (2). I think there is a way to de-risk this.
The block hierarchy that is responsible for tracking the local context consists of classic Transformer blocks. Only the user's own history tracking really needs to be an SSM hierarchy because it quickly surpasses the scalability limits of self-attention (the same goes for the interlocutor-tracking blocks on private 1-1 chats, which can also be arbitrarily long, but there is probably no such data available for training). On public data (such as forums, public chat room logs, Diplomacy and other text game logs), the interlocutor's history traces will 99% of the time easily be less than 100k symbols, but for symmetry with the user's own state (same weights!) and for having the same representation structure, the interlocutor-tracking blocks should mirror the user's own SSM blocks, of course.
With such an approach, the SSM hierarchies could start very small, with only a few blocks or even just a single SSM block (i.e., two blocks in total: one for the user's own state and one for the interlocutor's state), and attach to the middle of the Transformer hierarchy to select from it. However, I think this approach couldn't just be slapped onto a pre-trained Llama or another large Transformer LLM; I suspect the Transformer should be co-trained with the SSM blocks to induce the Transformer to make the corresponding representations useful for the SSM blocks. "Pretraining Language Models with Human Preferences [LW · GW]" is my intuition pump here.
Regarding the sufficiency and quality of training data, the Transformer hierarchy itself could still be trained on arbitrary texts, just as current LLMs are. And we can adjust the size of the SSM hierarchies to the amount of high-quality dialogue and forum data that we are able to obtain. I think it's a no-brainer that this design would improve the frontier quality in LLM apps that value personalisation and attunement to the user's current state (psychological, emotional, levels of knowledge, etc.), relative to whatever "base" Transformer model we would take (such as Llama, or any other).
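For concreteness, here is a very rough sketch of the layout I have in mind, with generic transformer_block and ssm_block constructors as stand-ins; this is not an implementation, and the attachment point and the weight sharing across the two speaker-state trackers are exactly the design choices in question:

```python
# Very rough sketch, not an implementation: Transformer blocks handle local context,
# a single shared-weight SSM stack tracks both speakers' histories, and the pooled
# speaker states are injected into the middle of the Transformer stack.
import torch.nn as nn

class SociaLLMSketch(nn.Module):
    def __init__(self, d_model, n_local_layers, n_state_layers, transformer_block, ssm_block):
        super().__init__()
        self.local_context = nn.ModuleList(
            [transformer_block(d_model) for _ in range(n_local_layers)])
        self.speaker_state = nn.ModuleList(
            [ssm_block(d_model) for _ in range(n_state_layers)])  # shared across speakers
        self.attach_at = n_local_layers // 2

    def forward(self, token_emb, own_history_emb, other_history_emb):
        own, other = own_history_emb, other_history_emb
        for blk in self.speaker_state:                 # long-range state tracking
            own, other = blk(own), blk(other)
        h = token_emb
        for i, blk in enumerate(self.local_context):
            if i == self.attach_at:                    # inject pooled speaker states mid-stack
                h = h + own.mean(dim=1, keepdim=True) + other.mean(dim=1, keepdim=True)
            h = blk(h)
        return h
```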
One additional worry is that many of the research benefits of SociaLLM may not be out of reach for current foundation models
With this I disagree: I think it's critical for the user state tracking to be energy-based. I don't think there are ways to recapitulate this with auto-regressive Transformer language models (cf. any of LeCun's presentations from the last year). There are potential ways to recapitulate this with other language modelling architectures (non-Transformer and non-SSM), but they currently don't hold any stronger promise than SSMs, so I don't see any reason to pick them.
↑ comment by Marc Carauleanu (Marc-Everin Carauleanu) · 2023-12-20T21:56:27.525Z · LW(p) · GW(p)
Interesting, thanks for sharing your thoughts. If this could improve the social intelligence of models, then it could raise the risk of pushing the frontier of dangerous capabilities. It is worth noting that we are generally more interested in methods that (1) are more likely to transfer to AGI (don't over-rely on specific details of a chosen architecture) and (2) specifically target alignment instead of furthering the social capabilities of the model.
↑ comment by Roman Leventov · 2023-12-20T22:26:58.832Z · LW(p) · GW(p)
On (1), cf. this report [LW · GW]: "The current portfolio of work on AI risk is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box (and this may involve plans targeted at more specific risks)."
On (2), I'm surprised to read this from you, since you suggested to engineer Self-Other Overlap into LLMs in your AI Safety Camp proposal, if I understood and remember correctly. Do you actually see a line (or a way) of increasing the overlap without furthering ToM and therefore "social capabilities"? (Which ties back to "almost all applied/empirical AI safety work is simultaneously capabilities work".)
↑ comment by Marc Carauleanu (Marc-Everin Carauleanu) · 2023-12-22T14:24:26.987Z · LW(p) · GW(p)
On (1), some approaches are neglected for a good reason. You can also target specific risks while treating TAI as a black-box (such as self-other overlap for deception). I think it can be reasonable to "peer inside the box" if your model is general enough and you have a good enough reason to think that your assumption about model internals has any chance at all of resembling the internals of transformative AI.
On (2), I expect that if the internals of LLMs and humans are different enough, self-other overlap would not provide any significant capability benefits. I also expect that in so far as using self-representations is useful to predict others, you probably don't need to specifically induce self-other overlap at the neural level for that strategy to be learned, but I am uncertain about this as this is highly contingent on the learning setup.
↑ comment by Roman Leventov · 2023-12-23T04:42:22.563Z · LW(p) · GW(p)
I absolutely agree that the future TAI may look nothing like the current architectures. Cf. this tweet by Kenneth Stanley, with whom I agree 100%. At the same time, I think it's a methodological mistake to therefore conclude that we should only work on approaches and techniques that are applicable to any AI, in a black-box manner. It's like tying our hands behind our backs. We can and should affect the designs of future TAIs through our research, by demonstrating promise (or inherent limitations) of this or that alignment technique, so that these techniques get or lose traction and are included or excluded from the TAI design. So, we are not just making "assumptions" about the internals of the future TAIs; we are shaping these internals.
We can and should think about the proliferation risks[1] (i.e., the risks that some TAI will be created by downright rogue actors), but IMO most of that thinking should be on the governance, not technical side. We agree with Davidad here that a good technical AI safety plan should be accompanied with a good governance (including compute monitoring) plan [LW · GW].
↑ comment by Marc Carauleanu (Marc-Everin Carauleanu) · 2023-12-26T18:08:48.959Z · LW(p) · GW(p)
I do not think we should only work on approaches that work on any AI, I agree that would constitute a methodological mistake. I found a framing that general to not be very conducive to progress.
You are right that we still have the chance to shape the internals of TAI, even though there are a lot of hoops to go through to make that happen. We think that this is still worthwhile, which is why we stated our interest in potentially helping with the development and deployment of provably safe architectures, even though they currently seem less competitive.
In my response, I was trying to highlight the point that whenever we can, we should keep our assumptions to a minimum given the uncertainty we are under. That said, it is reasonable to have some working assumptions that allow progress to be made in the first place, as long as they are clearly stated.
I also agree with Davidad about the importance of governance for the successful implementation of a technical AI Safety plan as well as with your claim that proliferation risks are important, with the caveat that I am less worried about proliferation risks in a world with very short timelines.
↑ comment by Roman Leventov · 2023-12-27T14:56:08.812Z · LW(p) · GW(p)
This conversation has prompted me to write "AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them [LW · GW]".
comment by Review Bot · 2024-02-14T06:48:32.863Z · LW(p) · GW(p)
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?