Alignment Gaps

post by kcyras · 2024-06-08T15:23:16.396Z · LW · GW · 3 comments


    Mechanistic interpretability vs Formal methods for verification and explainability
    AI safety via debate vs Argumentative dialogue games
    Value alignment vs Computational social choice and Reasoning with preferences
    Cooperative AI vs AI fairness
    AI Safety/Alignment vs Safe and Trustworthy AI

Misaligned agendas and terminology among academic, industrial and independent AI alignment research

This post aims to fill some gaps between technical AI Alignment topics and academic AI research. It summarises a quick but informed scouting of academic research papers that are closely connected to four Alignment topics: mechanistic interpretability, safety via debate, value alignment and cooperative AI. This overview should help somebody wanting to do research on any of those four topics to find the related research and expand the technical Alignment frontier.

TLDR: mechanistic interpretability can be seen as a form of formal explainability; AI safety via debate is vague and informal and could benefit from formal argumentative dialogue games (and has in fact been formalised using several formalisms); value alignment (including RLHF and similar) has deep connections to computational social choice theory and reasoning with preferences in AI; cooperative AI might benefit from recent fairness and multi-agent research.


My key takeaway is that technical Alignment research pre-prints often fail to refer to peer-reviewed, formal AI research articles that are highly relevant to the Alignment topic studied, but that do not use the typical Alignment terminology. This may be due to several reasons. For one, some academic research pre-dates Alignment topics -- for instance, formal verification and explainability via formal methods go way back before mechanistic interpretability (or even machine learning (ML) interpretability at large). In other cases, authors of non-peer-reviewed papers may simply be unaware of, or may intentionally omit, related research -- for instance, the AI safety via debate pre-print does not even mention some two decades of research on argumentative debates and dialogue games. In yet other cases, the authors seem to be aware of some relevant research but fail to explore either deeper or broader connections -- for instance, the authors of What are human values, and how do we align AI to them? acknowledge and use some insights from social choice theory and preference aggregation, but dismiss the field as inadequate and do not engage with well-established research on e.g. value-based argumentation or reasoning with preferences.

I consider the failure to properly address established academic research when engaging with technical Alignment topics to be bad, because it risks reinventing the wheel, misses the opportunity to build or improve on existing science, and fails to give credit where it is due. (There are great exceptions to this trend, e.g. the pre-print on open problems in Cooperative AI.) With this post I hope to fill some of the gaps with pointers to what I believe are relevant works for addressing the research questions within the selected four technical topics of Alignment (I have selected and loosely grouped several topics from those covered and omitted in BlueDot's 2024 March Alignment course).
I will not detail how that academic research can directly help solve Alignment problems, but simply try to argue why it is relevant. In short, researchers should read more, preferably peer-reviewed articles in top-tier publication venues such as AIJ, IJCAI, AAAI and AAMAS.


In what follows, I assume familiarity with each technical Alignment topic, giving only a one-sentence description. I then detail which academic research seems relevant but overlooked for the topic.

Mechanistic interpretability vs Formal methods for verification and explainability

Mechanistic interpretability concerns "reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding"[Mechanistic Interpretability for AI Safety: A Review (2024)] (see also A Mathematical Framework for Transformer Circuits (2021), Neel Nanda's blog and BlueDot's Alignment course).

Gaining an in-depth, causal understanding of trained ML model internals as well as behaviour is the purview of formal methods for verification and explainability. Formal verification encompasses methods for translating ML models (e.g. tree-based, neural networks (NNs), model-based/-free reinforcement learning (RL)) into logical formulas so that the latter capture the ML model exactly. Given a model in the language of some formal logic, reasoning techniques such as satisfiability are applied to analyse the model and infer its properties with formal guarantees (as opposed to typical heuristic or sample-based approximations as in most of ML interpretability).
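To give a flavour of the kind of formal guarantee at stake, here is a toy sketch of my own (not taken from any of the tools cited below) of sound interval bound propagation through a small ReLU network: if the computed output interval fits inside the target set, the property provably holds for every input in the box, with no sampling involved.

```python
def affine_bounds(bounds, W, b):
    """Soundly propagate per-dimension intervals through y = Wx + b."""
    out = []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (bl if w >= 0 else bh) for w, (bl, bh) in zip(row, bounds))
        hi = bias + sum(w * (bh if w >= 0 else bl) for w, (bl, bh) in zip(row, bounds))
        out.append((lo, hi))
    return out

def relu_bounds(bounds):
    return [(max(lo, 0.0), max(hi, 0.0)) for lo, hi in bounds]

def verify_output_range(in_bounds, layers, out_range):
    """Property check: if the input lies in the box `in_bounds`, does the
    output lie in `out_range`? Sound but incomplete: True is a proof,
    False is merely inconclusive (the over-approximation may be loose)."""
    bounds = in_bounds
    for i, (W, b) in enumerate(layers):
        bounds = affine_bounds(bounds, W, b)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            bounds = relu_bounds(bounds)
    lo, hi = out_range
    return all(lo <= bl and bh <= hi for bl, bh in bounds)
```

For a 2-2-1 network with weights [[1, 1], [1, -1]] and [[1, 1]] on the input box [0, 1] x [0, 1], this proves the output stays in [0, 3]; the true maximum is 2, illustrating why production tools like those below refine such over-approximations.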

Formal verification of deep NNs (DNNs) concerns formally verifying properties of DNNs over the full input space, where a property is (typically) a "statement that if the input belongs to some set X, then the output will belong to some set Y". Formal verification of NN-based agents and multi-agent systems concerns reasoning about agents trying to "bring about a state of affairs within a finite number of steps" or verifying "whether a system remains within safety bounds within a number of steps", which is known as reachability analysis. Various verification-based analyses can be done using off-the-shelf tools such as Marabou (consult the works of e.g. Clark Barrett, Guy Katz, Alessio Lomuscio). Formal verification also gives rise to formal explainable AI (XAI), which concerns explaining model internal reasoning and input/output behaviour with formal guarantees. For example, one can translate an ML model into propositional logic with linear arithmetic and obtain explanations in the form of rules where the premises consist of feature-value interval pairs that are minimally sufficient for the model to yield the conclusion as output when the premises cover the input. Such and similar explanations of input/output and internal model behaviour can come with deterministic (e.g. here or here) or provably approximate (e.g. here, here and here) guarantees. So in addition to conceptualising notions of features, motifs, superposition and the like, the work on mechanistic interpretability could benefit from e.g. mapping those concepts to formal languages and studying logical relationships among circuits in NNs.
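To make the rule-style explanations concrete, here is a hypothetical sketch of computing a subset-minimal sufficient reason (an abductive explanation) for a black-box classifier over small finite feature domains; exhaustive enumeration stands in for the SAT/SMT oracle used in the formal XAI literature, and the deletion-based loop is a standard construction rather than any specific cited algorithm.

```python
from itertools import product

def entails(model, domains, instance, fixed, label):
    """Check (by brute force, standing in for a SAT/SMT oracle) that fixing
    the features in `fixed` to their instance values forces `label`."""
    free = [f for f in domains if f not in fixed]
    for values in product(*(domains[f] for f in free)):
        point = dict(instance)
        point.update(zip(free, values))
        if model(point) != label:
            return False
    return True

def abductive_explanation(model, domains, instance):
    """Greedily delete features from the full set; what remains is a
    subset-minimal set of premises sufficient for the prediction."""
    label = model(instance)
    fixed = set(domains)
    for f in list(domains):
        if entails(model, domains, instance, fixed - {f}, label):
            fixed.discard(f)
    return fixed, label
```

For a classifier that returns True exactly when feature 'a' equals 1, the explanation of the instance {'a': 1, 'b': 0} keeps only {'a'}: feature 'b' is provably irrelevant to the prediction.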

In general, formal method applications in AI may be seen as complementary to mechanistic interpretability. For instance, some view mechanistic interpretability as a new type of XAI approach to state-of-the-art generative AI models. Others claim that it is an instance of more general approaches, such as causal abstraction. In any case, both mechanistic interpretability and certain research in formal methods aim at actually understanding what ML models do and how we can trust them. A typical concern with formal methods is that of scaling the tools to large ML models. While this sentiment is not without basis, modern verification tools can be optimised to verify properties of NNs with 100Ms of parameters in seconds, and can likely scale much further if adequate resources are invested [personal communication]. Given the compute needs of training models with millions and billions of parameters, it is absurd that researchers and engineers are often reluctant to spend a fraction of those resources on formal verification and explainability of the models.

AI safety via debate vs Argumentative dialogue games

Debate is a method of scalable oversight for aligning agents via a game where, given a question or proposal, two agents exchange statements that are eventually judged by a human as to which agent gave the best information (see AI safety via debate, Debating with More Persuasive LLMs Leads to More Truthful Answers).

Two or more agents exchanging arguments in a dialogical/dialectical/debate fashion has long been a purview of argumentative dialogues. In these, autonomous agents put forward claims using their knowledge bases, according to protocol rules, with the purposes of optimising for some outcome of the dialogue, defined by dialogue states and winning criteria. The notions of claims, (legal) moves, protocol rules, outcomes and others are formally defined. For instance, claims can be formalised in some form of logic or natural language, moves can be of assertion, challenge, inquiry or concession types, protocol rules manage which agents can move and how in response to previous claims, outcomes can be the claims accepted by all the agents according to some argumentation- or game-theoretic criteria at the end of the dialogue. Purposes of argumentative dialogue games, such as persuasion, negotiation, deliberation, as well as strategies for optimising for different outcomes, such as sound inferences or truthful accepted claims, have been thoroughly studied. (Consult the papers from years 1997, 2000, 2002, 2003, 2005 and subsequent research on argumentation protocols, persuasion dialogues, deliberation dialogues as well as works of e.g. Trevor Bench-Capon, Paul Dunne, Henry Prakken, Francesca Toni.)
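To show how such acceptance criteria are made precise, here is a minimal sketch (my own illustration, not any one cited formalism) of the grounded semantics of an abstract argumentation framework, one standard way to decide which arguments survive at the end of a debate: accept arguments whose attackers are all defeated, reject arguments attacked by an accepted one, and iterate to a fixed point.

```python
def grounded_extension(arguments, attacks):
    """Compute the grounded extension of an abstract argumentation framework:
    the least fixed point of iteratively accepting defended arguments."""
    attackers = {a: {x for x, y in attacks if y == a} for a in arguments}
    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            if attackers[a] <= rejected:      # all attackers defeated: accept
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:     # attacked by an accepted argument: reject
                rejected.add(a)
                changed = True
    return accepted
```

For the chain where argument a attacks b and b attacks c, the grounded extension is {a, c}: a is unattacked, so b falls, so c is reinstated. Dialogue protocols then govern how agents' moves build up such a framework turn by turn.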

This well-established research should undoubtedly inform any work on agents interactively exchanging statements to be judged by humans or other agents. This is primarily because argumentative dialogue games provide formal frameworks for what information agents can exchange and how, and for what it means to establish the goodness or usefulness of that information. In fact, the informal exposition in AI safety via debate has recently been formalised as argumentative debate: one variant in an argumentation protocol-inspired mathematical framework, another in abstract argumentation-based moral debate. That is, debate was formalised using precisely the kind of formal framework that the authors of the original pre-print could have built on. Yet argumentation is almost certainly not the only research area that could inform such scalable oversight work -- multi-agent interaction is a huge research domain concerned with some of the same problems of formalising protocols for multi-agent agreement (consult the work of e.g. Nick Jennings).

Value alignment vs Computational social choice and Reasoning with preferences

Value alignment concerns aligning AI preferences with human preferences. Here I lump together work on RL from human/AI feedback (RL(H/AI)F), including constitutional AI and direct preference optimisation, and work on aligning to human values.

Computational social choice theory concerns aggregation of individual preferences towards a collective choice. It is a field in itself, spanning subjects such as economics and psychology, not limited to AI. The typical topics are voting, fair allocation of goods, coalition formation and, pertinently to AI, preferences, including preference learning, processing and reasoning with preferences. Learning preferences (aka preference elicitation) concerns obtaining utilities or preferences from autonomous agents, typically humans, often indirectly about complex issues via simplified choices. For instance, crowdsourcing human opinions about the ethics of self-driving cars via questions on Trolley problems. The closely connected preference processing and modelling (aka preference aggregation) concern formally representing the learned preferences and combining them into a coherent whole. This is far from trivial even in the case of binary preferences. Relatedly, reasoning with preferences concerns how to make decisions given combined preferences of multiple agents, how to resolve conflicting preferences or revise preferences. These modelling and reasoning issues are of particular interest in argumentation, e.g. value-based argumentation.
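The non-triviality of aggregation shows up in a few lines of code. The sketch below (a textbook construction of my own, not tied to any cited paper) aggregates strict rankings by pairwise majority and exhibits Condorcet's paradox: the collective preference can be cyclic even though every individual ranking is perfectly consistent.

```python
from itertools import combinations

def pairwise_majority(ballots):
    """Aggregate strict rankings (lists, best first) by pairwise majority.
    Returns the set of majority preferences (x, y) meaning 'x beats y';
    the resulting relation may contain cycles."""
    alternatives = ballots[0]
    wins = set()
    for x, y in combinations(sorted(alternatives), 2):
        prefer_x = sum(b.index(x) < b.index(y) for b in ballots)
        if prefer_x * 2 > len(ballots):
            wins.add((x, y))
        elif prefer_x * 2 < len(ballots):
            wins.add((y, x))
    return wins

# The classic three-voter profile: a > b > c, b > c > a, c > a > b
cycle = pairwise_majority([['a', 'b', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b']])
```

Here each pairwise contest is decided 2-to-1, yet a beats b, b beats c and c beats a, so there is no coherent collective ranking. Aggregating human feedback for a reward model faces exactly this kind of obstacle.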

This should sound relevant to the value alignment topic (as outlined above). Indeed, RLHF has recently been combined with argumentation, purportedly for improving the reward model. There have been discussions about the connections of RLHF to optimising preferences and the issues thereof. Social choice theory has been suggested to complement RLHF (acknowledged also in a recent comprehensive survey on AI Alignment) and for knowledge extraction from Large Language Models. There have also been attempts at value alignment via preference aggregation and value-based reasoning. All these are attempts at applying more formal methods to the problem of value alignment.

Last but not least, a topic where established academic research is well appreciated, but may benefit from a broader scope.

Cooperative AI vs AI fairness

Cooperative AI concerns multi-agent cooperation for the good of humanity. The pre-print on open problems in Cooperative AI actually overviews the related work well in Section 2.2, mostly on game theory and multi-agent systems, including the following multi-agent topics: automated argumentation and negotiation, interaction protocols, knowledge representation, reasoning and planning, and decision making. The authors also acknowledge the immense overlap with computational social choice and other areas of computing research. However, they point out that "open, heterogeneous, and scalable multi-agent systems require learning agents, including agents who learn to adapt and cooperate with others", so that, in light of recent developments in learning, research in Cooperative AI needs to be reinvigorated.

Cooperative AI may benefit from modern research on AI (aka algorithmic) fairness. AI fairness concerns how algorithmic decisions relate to formalised notions of fair decision making, such as individual fairness (similar individuals should receive similar decisions) and non-discrimination (similar groups should, on the whole, receive similar decisions). One of the landmark results on the (im)possibility of fairness roughly states that (under standard algorithmic assumptions) individual fairness and non-discrimination cannot be simultaneously guaranteed. Researchers have nonetheless defined multiple nuanced metrics for assessing algorithmic fairness and have pushed the limits of fairness both in theory and in practice. Still, both the inherently technical and the conceptual limitations of AI fairness are being studied and may inform work on cooperative AI.
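As a minimal illustration of how one such group-level notion is operationalised (a sketch of a standard metric under my own simplifying assumption of exactly two groups, not a rendering of any specific cited result), statistical parity difference compares positive-decision rates across groups:

```python
def statistical_parity_difference(decisions, groups):
    """Difference in positive-decision rates between two groups
    (first minus second in sorted group-name order). A value of 0
    corresponds to perfect demographic parity on this sample."""
    rates = {}
    for g in set(groups):
        members = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    a, b = sorted(rates)  # assumes exactly two groups
    return rates[a] - rates[b]
```

If group A receives positive decisions at rate 0.5 and group B at rate 0.25, the metric reports 0.25; the impossibility result mentioned above says that driving such group-level gaps to zero can conflict with treating similar individuals similarly.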

Finally, a note on the more general theme of academic safe and trustworthy AI research.

AI Safety/Alignment vs Safe and Trustworthy AI

Academic research on safe autonomous systems often concerns safe RL agents, for instance modifying RL agent rewards to benefit others in the environment, provably safe RL or probably approximate safety verification. Research on cooperating agents tackles questions such as "What aspects of [agent] selection process can we adjust in practice to limit agents’ incentives to misreport their features?" or, more generally, "Given a set of different agent models, design mechanisms that induce all agents, or as many agents as possible, to invest effort in desirable actions" (consult the works of e.g. Ariel Procaccia). Responsibility research on trustworthy autonomous systems covers topics of logics for responsible reasoning, multi-agent planning, value-based design, ethics of AI and human-agent collectives, among others. All this is generally considered part of trustworthy AI, which concerns creating safe and trustable AI systems and greatly overlaps with Safety/Alignment agendas. Trustworthy AI is a huge umbrella term (see NIST or European Commission) encompassing many AI research areas, most notably works on reliability, verification and validation, interpretability/explainability, and security and privacy, among others. There are adjacent research areas, not necessarily belonging to trustworthy AI, but still relevant to Alignment. For example, various impossibility results (on solvability of problems in AI and related fields) as well as the notable sub-field of neuro-symbolic AI, including approaches to Alignment. Many of these resources could also benefit those interested in technical Alignment research, not limited to the four topics discussed above.
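One recurring idea in provably safe RL is runtime shielding, which can be sketched in a few lines (a one-step illustration under my own simplifying assumptions; real shields reason over longer horizons against a formal model of the environment):

```python
def shield(state, actions, step, is_unsafe):
    """Keep only actions whose one-step successor state is safe.
    `step` is a (deterministic, known) transition model; a real shield
    would check reachability over many steps, with formal guarantees."""
    return [a for a in actions if not is_unsafe(step(state, a))]

def shielded_policy(policy, state, actions, step, is_unsafe):
    """Let the learned policy choose freely, but only among shield-approved
    actions; returns None if no action is provably safe."""
    safe_actions = shield(state, actions, step, is_unsafe)
    return policy(state, safe_actions) if safe_actions else None
```

On a 1-D walk over positions 0..3 with moves -1 and +1, the shield at position 0 permits only the +1 move, so any learned policy wrapped this way can never leave the safe region, regardless of what its reward signal says.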

(I have notably omitted the following Alignment topics: agent foundations as too broad and too vague to cover, thinking anything in autonomous agents research (eg. BDI agents) may be relevant to it; shard theory [LW · GW] as too recent and seemingly not yet studied in academia; weak-to-strong generalisation and Eliciting Latent Knowledge -- just didn't bother.)


Huge thanks to Raymond Joseph Gabriel Tana for very helpful comments on the initial draft of the post.

The views expressed here are my own and do not represent those of my employer Ericsson, Inc.


Comments sorted by top scores.

comment by Charlie Steiner · 2024-06-10T13:27:43.428Z · LW(p) · GW(p)

I enjoyed this post, especially because it pointed out argumentation games to me (even if I put no particular stock in debate as a good alignment mechanism, I wasn't aware of this prior literature).

However, sometimes this comes off like asking a paper about human psychology why they didn't cite a paper about mouse psychology. Certainly psychologists studying humans should at least be aware of the existence of research on mice and maybe some highlights, but not every psychology paper needs a paragraph about mice.

Similarly, not talking about interpretability of decision trees or formal languages is no particular failing for a paper on interpretability of deep neural nets. (Though if we do start mapping circuits to formal languages, of course I hope to see useful citations.)

The case of preferences, and citing early work on preferences, seems more nuanced. To a large extent it's more like engineering than science - you're trying to solve a problem - what previous attempts at a solution should you compare to, to help readers? I haven't read that Working With Preferences book you link (Is it any good, or did you just google it up?), but I imagine it's pretty rare I'll want a value learning paper to compare to prospect theory or nonmonotonic logics (though the fraction that would benefit from talking about nonmonotonic logic sound pretty interesting.)

Replies from: kcyras
comment by kcyras · 2024-06-10T18:14:21.808Z · LW(p) · GW(p)

Thanks, appreciated. My expectation is not that one should cite the generally relevant research. Rather, one is expected to quickly browse, at least by similar keywords, the existing literature for seemingly closely related works. If anything comes up, there are a few options.

  1. Just mention the work - this shows the reader that you've at least browsed around, but also gives the opportunity to somebody else to explore that relevant angle. Arguably, interpretability researchers that aim at actually figuring out model internals must be aware of formal verification of DNNs and should position their work accordingly.
  2. Mention and argue that the work is not applicable. This additionally justifies one's motivation. I've seen it done wrt mech interp in one of the main papers, where the typical heuristic explainability research was discarded as inapplicable, and appropriately so. A similar argument could be made towards formal verification.
  3. Consider using the existing approach or show what concrete problem it fails to address. This may even further incentivise any proponents of the existing approach to improve that work. In short, it's not so much about citing others' work (that's only the metrics game) as about giving others the opportunity to relate their work to yours (that's science).

Re preferences. The book is fine (yes I've read it), but reading it is for somebody working deeply on preference-based reasoning. The point is about skimming through or at least picking up the conceptual claims as to what the research is about. A relatively quick browse through such relevant materials should help to realise that value learning and value-based reasoning have been extensively studied. Again, using one of the 3 options above, there may be opportunities to both identify new problems and to point one's readers to (not) pursue other lines of inquiry.

comment by Jonas Hallgren · 2024-06-08T19:06:53.725Z · LW(p) · GW(p)

I really like this type of post. Thank you for writing it!

I found some interesting papers that I didn't know of before, so that is very nice.