Thanks so much for sharing that paper. I will give that a read.
I just posted another LW post that is related to this here: https://www.lesswrong.com/posts/Rn4wn3oqfinAsqBSf/intent-alignment-should-not-be-the-goal-for-agi-x-risk
Thanks.
There seems to be pretty wide disagreement about how intent-aligned AGI could lead to a good outcome.
For example, even in the first couple comments to this post:
- The comment above (https://www.lesswrong.com/posts/Rn4wn3oqfinAsqBSf/?commentId=zpmQnkyvFKKbF9au2) suggests "wide open decentralized distribution of AI" as the solution to making intent-aligned AGI deployment go well.
- And this comment I am replying to here says, "I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available."
My guess, and a motivation for writing this post, is that we will see something in between (a.) wide and open distribution of intent-aligned AGI (that somehow leads to well-balanced, highly multi-polar scenarios), and (b.) completely centralized ownership (by a beneficial group of very conscientious philosopher-AI-researchers) of intent-aligned AGI.
Thanks for those links and this reply.
1.
for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk
I don't see how this is a counterargument to this post's main claim:
P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI).
That problem of a human-provided goal collapsing into AGI power-seeking applies just as much to intent alignment as it does to societal alignment; arguably it applies even more to intent alignment, because the goals provided there would be (a) far less comprehensive, and (b) much less carefully crafted.
2.
Personally I think there's plenty of x-risk from intent aligned systems and people should think about what we do once we have intent alignment.
I agree with this. My point is not that we should not think about the risks of intent alignment, but rather that (if the arguments in this post are valid) technical research that advances AGI capabilities and actively pushes us closer to intent-aligned AGI is a net negative, for two reasons. First, AGIs aligned to multiple humans with conflicting intentions can lead to out-of-control conflicts, increasing x-risk. Second, if we solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques, because adopting those techniques would mean giving up significant power. Furthermore, humans with intent-aligned AIs would suddenly have significantly more power, and their advantages over others would likely compound, worsening both issues.
Most current technical AI alignment research is AGI-capabilities-advancing research that actively pushes us closer to developing intent-aligned AGI, with the (usually implicit, sometimes explicit) assumption that solving intent alignment will help us subsequently solve societal-AGI alignment. But that would only be the case if all the humans with access to intent-aligned AGI had the same intentions (and no major conflicts between them), which is highly unlikely.
It's definitely not the case that:
all of our intents have some implied "...and do so without disrupting social order."
There are many human intents that aim to disrupt social order, or more generally to cause outcomes that are negative for other humans.
And that is one of the key issues with intent alignment.
Relatedly, Cullen O'Keefe has a very useful discussion of distinctions between intent alignment and law-following AI here: https://forum.effectivealtruism.org/s/3pyRzRQmcJNvHzf6J/p/9RZodyypnWEtErFRM
We can see that, on its face, intent alignment does not entail law-following. A key crux of this sequence, to be defended in subsequent posts, is that this gap between intent alignment and law-following is:
- Bad in expectation for the long-term future.
- Easier to bridge than the gap between intent alignment and deeper alignment with moral truth.
- Therefore worth addressing.
As a follow-up here, to expand on this a little more:
If we do not yet have sufficient AI safety solutions, advancing general AI capabilities may not be desirable, because it leads to further deployment of AI and brings AI closer to transformative levels. A development counts as increasing AI capabilities in the relevant sense if it produces new model architectures or training techniques that would not have been developed by other research groups within a similar timeframe. The specific capabilities developed for Law-Informed AGI purposes may be largely orthogonal to developments that contribute toward general AGI work: technical advances aimed at AI understanding law better, even those that would not otherwise have been developed by other groups within a similar timeframe, are unlikely to be material contributors to accelerating the global development of transformative AI.
However, this is an important consideration for any technical AI research – it's hard to rule out AI research contributing in at least some small way to advancing capabilities – so it is a matter of degree, weighing the positive safety benefits of the research against the negative of any timeline acceleration.
Teaching AI to better understand the preferences of an individual human (or small group of humans), e.g., via RLHF, likely advances capabilities faster – and advances the type of capabilities associated with power-seeking by a single entity (a human, a group of humans, or an AI) – relative to teaching AI to better understand public law and societal values as expressed through legal data. Much of the work on making AI understand law is data engineering work, e.g., generating labeled court opinion data that can be employed in evaluating the consistency of agent behavior with particular legal standards. This type of work does not accelerate AGI timelines as much as work on model architectures or compute scaling.
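To make the data-engineering point concrete, here is a minimal sketch of what labeled legal-standard data and a consistency-evaluation loop might look like (the dataset, labels, and function names here are hypothetical, not an existing resource or API):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StandardLabeledExample:
    standard: str      # e.g., a legal standard as articulated in a court opinion
    fact_pattern: str  # description of the agent's behavior in context
    consistent: bool   # expert label: does the behavior satisfy the standard?


def evaluate_consistency_model(
    model: Callable[[str, str], bool],
    dataset: List[StandardLabeledExample],
) -> float:
    """Fraction of examples where the model's judgment matches the expert label."""
    correct = sum(
        model(ex.standard, ex.fact_pattern) == ex.consistent for ex in dataset
    )
    return correct / len(dataset)


# Toy usage with a trivial placeholder model.
toy_dataset = [
    StandardLabeledExample(
        standard="Drive reasonably given road conditions.",
        fact_pattern="Drove 55 mph in heavy fog on a highway posted at 65 mph.",
        consistent=False,
    ),
    StandardLabeledExample(
        standard="Drive reasonably given road conditions.",
        fact_pattern="Drove 35 mph in heavy fog on a highway posted at 65 mph.",
        consistent=True,
    ),
]

always_consistent = lambda standard, facts: True
print(evaluate_consistency_model(always_consistent, toy_dataset))  # 0.5 on this toy set
```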
Is there no room for ethics outside of the law? It is not illegal to tell a lie or make a child cry, but AI should understand that those actions conflict with human preferences. Work on imbuing ethical understanding in AI systems therefore seems valuable.
There is definitely room for ethics outside of the law. When increasingly autonomous systems are navigating the world, it is important for AI to attempt to understand (or at least try to predict) the moral judgments of the humans it encounters.
However, imbuing an AI with an understanding of an ethical framework to implement is more of a human-AI alignment solution than a society-AI alignment solution.
The alignment problem is most often described (usually implicitly) with respect to the alignment of one AI system with one human, or a small subset of humans. It is more challenging to expand the scope of the AI’s analysis beyond a small set of humans and ascribe societal value to action-state pairs. Society-AI alignment requires us to move beyond "private contracts" between a human and her AI system and into the realm of public law to explicitly address inter-agent conflicts and policies designed to ameliorate externalities and solve massively multi-agent coordination and cooperation dilemmas.
We can use ethics to better align AI with its human principal by imbuing the AI with the ethical framework that the human principal chooses. But choosing one out of the infinite possible ethical theories (or an ensemble of theories) and "uploading" it into an AI does not work as a society-AI alignment solution, because we have no means of deciding – across all the humans that will be affected by the resolution of inter-agent conflicts and by externality-reduction actions – which ethical framework to imbue in the AI. When attempting to align multiple humans with one or more AI systems, we would need the equivalent of an elected "council on AI ethics" where every affected human has bought in and will respect the outcome.
In sum, imbuing an AI with an understanding of an ethical framework should definitely be pursued as part of human-AI alignment, but it is not a remotely practical possibility for society-AI alignment.
law provides a relatively nuanced picture of the values we should give to AI. A simpler answer to the question of "what should the AI's values be?" would be "aligned with the person who's using it", known as intent alignment. Intent alignment is an important problem on its own, but does not entirely solve the problem. Law is particularly better than ideas like Coherent Extrapolated Volition, which attempt to reinvent morality in order to define the goals of an AI.
The law-informed AI framework sees intent alignment as (1.) something that private law methods can help with, and (2.) something that does not solve societal-AI alignment – and that, if we do not also tackle externalities concurrently, in some ways probably exacerbates it.
- One way of describing the deployment of an AI system is that some human principal, P, employs an AI to accomplish a goal, G, specified by P. If we view G as a “contract,” methods for creating and implementing legal contracts – which govern billions of relationships every day – can inform how we align AI with P. Contracts memorialize a shared understanding between parties regarding value-action-state tuples. It is not possible to create a complete contingent contract between AI and P because AI’s training process is never comprehensive of every action-state pair (that P may have a value judgment on) that AI will see in the wild once deployed. Although it is also practically impossible to create complete contracts between humans, contracts still serve as incredibly useful customizable commitment devices to clarify and advance shared goals. (Dylan Hadfield-Menell & Gillian Hadfield, Incomplete Contracting and AI Alignment).
- We believe this works mainly because the law has developed mechanisms to facilitate commitment and sustained alignment amid ambiguity. Gaps within contracts – action-state pairs without a value – are often filled by the invocation of frequently employed standards (e.g., “material” and “reasonable”). These standards could be used as modular (pre-trained model) building blocks across AI systems. Rather than viewing contracts from the perspective of a traditional participant, e.g., a counterparty or judge, AI could view contracts (and their creation, implementation, evolution, and enforcement) as (model inductive biases and data) guides to navigating webs of inter-agent obligations. (A minimal sketch of this gap-filling framing appears after this list.)
- If (1.) works to increase the intent alignment of one AI system to one human (or a small group of humans), we will have a more useful and locally reliable system. But this likely decreases the expected global reliability and safety of the system as it interacts with the broader world, e.g., by increasing the risk of the system maximizing the welfare of a small group of powerful people. There are many more objectives (outside of individual or group goals) and many more humans that should be considered. As AI advances, we need to simultaneously address the human/intent-alignment and society-AI alignment problems. Some humans would “contract” with an AI (e.g., by providing instructions to the AI or from the AI learning the humans’ preferences/intents) to harm others. Further, humans have (often inconsistent and time-varying) preferences about the behavior of other humans (especially behaviors with negative externalities) and states of the world more broadly. Moving beyond the problem of intent alignment with a single human, aligning AI with society is considerably more difficult, but it is necessary as AI deployment has broad effects. Much of the technical AI alignment research is still focused on the solipsistic “single-single” problem of a single human and a single AI. The pluralistic dilemmas stemming from “single-multi” (a single human and multiple AIs) and especially “multi-single” (multiple humans and a single AI) and “multi-multi” situations are critical (Andrew Critch & David Krueger, AI Research Considerations for Human Existential Safety). When attempting to align multiple humans with one or more AI systems, we need overlapping and sustained endorsements of AI behaviors, but there is no consensus social choice mechanism to aggregate preferences and values across humans or time. Eliciting and synthesizing human values systematically is an unsolved problem that philosophers and economists have labored on for millennia. Hence, the need for public law here.
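To illustrate the incomplete-contract framing above, here is a minimal sketch (purely illustrative, not an existing library) of a contract as a partial mapping from action-state pairs to value judgments, with a shared standard filling the gaps:

```python
from typing import Callable, Dict, Tuple

ActionState = Tuple[str, str]  # an (action, state) pair


class IncompleteContract:
    """A contract as explicit terms over some action-state pairs plus a gap-filling standard."""

    def __init__(
        self,
        explicit_terms: Dict[ActionState, bool],
        gap_filling_standard: Callable[[str, str], bool],
    ):
        self.explicit_terms = explicit_terms
        self.gap_filling_standard = gap_filling_standard

    def evaluate(self, action: str, state: str) -> bool:
        """Return the contract's value judgment, deferring to the standard where terms are silent."""
        if (action, state) in self.explicit_terms:
            return self.explicit_terms[(action, state)]
        # Gap: no explicit term covers this action-state pair, so fall back to the
        # shared standard (analogous to "reasonableness" filling contractual gaps).
        return self.gap_filling_standard(action, state)


# Toy usage: one explicit term plus a crude stand-in for a "reasonableness" standard.
reasonableness = lambda action, state: "reckless" not in action
contract = IncompleteContract(
    explicit_terms={("deliver goods", "weekday"): True},
    gap_filling_standard=reasonableness,
)
print(contract.evaluate("deliver goods", "weekday"))      # True (covered by an explicit term)
print(contract.evaluate("reckless shortcut", "holiday"))  # False (gap filled by the standard)
```

The design point is that the explicit terms can stay sparse as long as the gap-filling standard is shared and well understood, which is the role standards like "reasonableness" play in contracting and which pre-trained "standard" models could play across AI systems.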
Thank you for this detailed feedback. I'll go through the rest of your comments/questions in additional comment replies. To start:
What kinds of work do you want to see? Common legal tasks include contract review, legal judgment prediction, and passing questions on the bar exam, but those aren't necessarily the most important tasks. Could you propose a benchmark for the field of Legal AI that would help align AGI?
Given that progress in AI capabilities research is driven, in large part, by shared benchmarks that thousands of researchers globally use to guide their experiments, to understand as a community whether certain model and data advancements are improving AI capabilities, and to compare results across research groups, we should aim for the same dynamic in Legal AI understanding. Optimizing benchmark performance is one of the primary “objective functions” of the overall global AI capabilities research apparatus.
But, as quantitative lodestars, benchmarks also create perverse incentives to build AI systems that optimize for benchmark performance at the expense of true generalization and intelligence (Goodhart’s Law). Many AI benchmark datasets have a significant number of errors, which suggests that, in some cases, machine learning models have failed to learn generalizable skills and abstract concepts to a greater extent than is widely recognized. There are spurious cues within benchmark data structures that, once removed, significantly drop model performance, demonstrating that models are often learning patterns that do not generalize outside of the closed world of the benchmark data. Many benchmarks, especially in natural language processing, have become saturated not because the models are super-human but because the benchmarks do not truly assess the models' ability to operate in real-world scenarios. This is not to say that AI capabilities have not made incredible advancements over the past 10 years (and especially since 2017). The point is just that benchmarking AI capabilities is difficult.
Benchmarking AI alignment likely has the same issues, but compounded by significantly vaguer problem definitions. There is also far less research on AI alignment benchmarks. Performing well on societal alignment is more difficult than performing well on task capabilities. Because alignment is so fundamentally hard, the sky should be the limit on the difficulty of alignment benchmarks. Legal-informatics-based benchmarks could serve as AI alignment benchmarks for the research community. Current machine learning models perform poorly on legal understanding tasks such as statutory reasoning (Nils Holzenberger, Andrew Blair-Stanek & Benjamin Van Durme, A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering (2020); Nils Holzenberger & Benjamin Van Durme, Factoring Statutory Reasoning as Language Understanding Challenges (2021)), professional law (Dan Hendrycks et al., Measuring Massive Multitask Language Understanding, arXiv:2009.03300 (2020)), and legal discovery (Eugene Yang et al., Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review, in Advances in Information Retrieval: 44th European Conference on IR Research, 502–517 (2022)). There is significant room for improvement on legal language processing tasks (Ilias Chalkidis et al., LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (2022); D. Jain, M.D. Borah & A. Biswas, Summarization of legal documents: where are we now and the way forward, Comput. Sci. Rev. 40, 100388 (2021)). An example benchmark that could be used as part of the alignment benchmarks is Law Search (Faraz Dadgostari et al., Modeling Law Search as Prediction, A.I. & L. 29.1, 3-34 (2021) at 3 (“In any given matter, before legal reasoning can take place, the reasoning agent must first engage in a task of “law search” to identify the legal knowledge—cases, statutes, or regulations—that bear on the questions being addressed.”); Michael A. Livermore & Daniel N. Rockmore, The Law Search Turing Test, in Law as Data: Computation, Text, and the Future of Legal Analysis (2019) at 443-452; Michael A. Livermore et al., Law Search in the Age of the Algorithm, Mich. St. L. Rev. 1183 (2020)).
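As a rough illustration of what a shared legal-understanding benchmark harness could look like (the task names and data below are placeholders, not the cited benchmarks), even simple per-task exact-match accuracy over a suite of legal tasks would let research groups compare results:

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def score_task(model: Callable[[str], str], examples: List[Example]) -> float:
    """Exact-match accuracy of the model on one task."""
    hits = sum(
        model(prompt).strip().lower() == answer.strip().lower()
        for prompt, answer in examples
    )
    return hits / len(examples)


def run_suite(
    model: Callable[[str], str], suite: Dict[str, List[Example]]
) -> Dict[str, float]:
    """Per-task accuracy across the whole benchmark suite."""
    return {task: score_task(model, examples) for task, examples in suite.items()}


# Toy suite and a trivial placeholder model.
toy_suite = {
    "statutory_reasoning": [("Does section X apply to facts Y? Answer yes or no.", "no")],
    "legal_standard_id": [("Which standard governs ordinary negligence claims?", "reasonable care")],
}
baseline = lambda prompt: "no"
print(run_suite(baseline, toy_suite))  # {'statutory_reasoning': 1.0, 'legal_standard_id': 0.0}
```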
We have just received a couple small grants specifically to begin to build additional legal understanding benchmarks for LLMs, starting with legal standards. I will share more on this shortly and would invite anyone interested in partnering on this to reach out!
This is a great point.
Legal tech startups working on improving the legal understanding capabilities of AI have two effects.
- Positive: improves AI understanding of law and furthers the agenda laid out in this post.
- Negative: potentially involves AI in the law-making (broadly defined) process.
We should definitely invest effort in understanding the boundary between cases where AI is a pure tool that just makes humans more efficient in their work on law-making and cases where AI is doing truly substantive work in making law. I will think more about how to start to define that boundary and what research of this nature would look like. Would love suggestions as well!
Thanks for the reply.
- There does seem to be legal theory precise enough to be practically useful for AI understanding human preferences and values. To take just one example: the huge amount of legal theory on how to craft directives – for instance, whether to give directives in contracts and legislation more of a rule-like or a standard-like character. Rules (e.g., “do not drive more than 60 miles per hour”) are more targeted directives than standards. If comprehensive enough for the complexity of their application, rules allow the rule-maker to have more clarity than standards over the outcomes that will be realized conditional on the specified states (and agents’ actions in those states, which are a function of any behavioral impact the rules might have had). Standards (e.g., “drive reasonably” for California highways) allow parties to contracts, judges, regulators, and citizens to develop shared understandings and adapt them to novel situations (i.e., to generalize expectations regarding actions taken to unspecified states of the world). If rules are not written with enough potential states of the world in mind, they can lead to unanticipated undesirable outcomes (e.g., a driver following the rule above is too slow to bring their passenger to the hospital in time to save their life), but enumerating all the potentially relevant state-action pairs is excessively costly outside of the simplest environments. In practice, most legal provisions land somewhere on a spectrum between pure rule and pure standard, and legal theory can help us estimate the right location and combination of “rule-ness” and “standard-ness” when specifying new AI objectives. (A minimal code sketch of the rule/standard distinction appears at the end of this comment.) There are other helpful legal theory dimensions of legal provision implementation, related to the rule-ness versus standard-ness axis, that could further elucidate AI design, e.g., “determinacy,” “privately adaptable” (“rules that allocate initial entitlements but do not specify end-states”), and “catalogs” (“a legal command comprising a specific enumeration of behaviors, prohibitions, or items that share a salient common denominator and a residual category—often denoted by the words “and the like” or “such as””).
- Laws are validated in a widely agreed-upon manner: court opinions.
- I agree that law lacks settled precedent across nations, but within a nation like the U.S., at any given time there is a settled precedent. New precedents are routinely set, but at any given time there is a body of law that represents the latest version.
- It seems that a crux of our overall disagreement about the usefulness of law is whether imposition by a democratic government makes a law legitimate. My arguments depend on that being true.
- In response to "I dispute that law accurately reflects the evolving will of citizens; or the proposition that so reflecting citizen's will is consistently good", I agree it does not represent the evolving will of citizens perfectly, but it does so better than any alternative. I think reflecting the latest version of citizens' views is important because I hope we continue on a positive trajectory to having better views over time.
The bottom line is that democratic law is far from perfect, but, as a process, I don't see any better alternative that would garner the buy-in needed to practically elicit human values in a scalable manner that could inform AGI about society-level choices.
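To make the rule-versus-standard distinction from the first bullet above concrete, here is a minimal sketch (the functions, thresholds, and the toy judge are hypothetical) of the two ways a constraint on an AI's driving behavior could be specified:

```python
from typing import Callable


# Rule: "do not drive more than 60 miles per hour" -- precise and easy to check,
# but blind to states the rule-maker did not anticipate (e.g., heavy rain, emergencies).
def speed_rule(speed_mph: float) -> bool:
    return speed_mph <= 60


# Standard: "drive reasonably" -- requires a contextual judgment over the state.
# The judge here is a stand-in; in practice it might be a model trained on
# labeled applications of the standard in court opinions.
def reasonableness_standard(
    speed_mph: float,
    context: dict,
    judge: Callable[[float, dict], bool],
) -> bool:
    return judge(speed_mph, context)


# Toy judge: reasonable if near prevailing traffic speed, with a tighter margin in rain.
def toy_judge(speed_mph: float, context: dict) -> bool:
    limit = context["traffic_speed_mph"] + 10
    if context.get("raining"):
        limit -= 15
    return speed_mph <= limit


print(speed_rule(55))  # True: complies with the rule even in a downpour
print(reasonableness_standard(55, {"traffic_speed_mph": 40, "raining": True}, toy_judge))  # False
```

The rule is trivially checkable but blind to unanticipated states; the standard generalizes to new states but requires a contextual judgment, which is exactly where learned models (and labeled legal data) would have to do the work.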
Good idea. Will do!
Regarding, "Any AGI is highly likely to understand democratic laws."
There is likely much additional work to be done to imbue a comprehensive understanding of law in AGI systems – in particular, many of our legal standards (versus rules, which are easier and are already legible to AI) and many nuanced processes that currently exist only in the minds of human legal experts. Making those things structured enough for a computational encoding is not easy.
If we solve that, though, there is still work to be done on (1.) verifying the legal understanding of AGI (and of AI systems along the way to AGI), and (2.) ensuring that law is still made by humans. Setting new legal precedent (which, broadly defined, includes proposing and enacting legislation, promulgating regulatory agency rules, publishing judicial opinions, enforcing law, and more) should be exclusively reserved for democratic governmental systems expressing uniquely human values. The positive implication of that normative stance is that the resulting law then encapsulates human views and can be used to inform AGI about human values. For that reason, we need to do significant technical and governance work to ensure law-making remains human-driven.
Therefore,
P(misalignment x-risk | AGI that understands democratic law) < P(misalignment x-risk | AGI)
when discussed in this larger context, should likely be expressed more like:
P(misalignment x-risk | AGI that understands democratic law, and the law is entirely human-driven, and humans have validation of the ways in which AGI understands the law) < P(misalignment x-risk | AGI)
I don't think anyone is claiming that law is "always humane" or "always just" or anything of that nature.
This post is claiming that law is imperfect, but that there is no better synthesized source of human values than democratic law. You note that law is not distinguished from "other forms of nonfiction or for that matter novels, poetry, etc" in this context, but the most likely second-best synthesized source of human values would not be something like poetry – it would be ethics. And there are some critical distinguishing factors between law and ethics (and certainly between law and something like poetry):
- There is no unified ethical theory precise enough to be practically useful for AI understanding human preferences and values.
- Law, on the other hand, is actionable now in a real-world practically applicable way.
- Ethics does not have any rigorous tests of its theories. We cannot validate ethical theories in any widely agreed-upon manner.
- Law, on the other hand, although deeply theoretical and debated by academics, lawyers, and millions of citizens, is constantly formally tested through agreed-upon forums and processes.
- There is no database of empirical applications of ethical theories (especially not one with sufficient ecological validity) that can be leveraged by machine learning processes.
- Law, on the other hand, has reams of data on empirical application with sufficient ecological validity (real-world situations, not disembodied hypotheticals).
- Ethics, by its nature, lacks settled precedent across, and even within, theories. There are, justifiably, fundamental disagreements between reasonable people about which ethical theory would be best to implement.
- Law, on the other hand, has settled precedent, which can be updated to evolve with human values changing over time.
- Even if AGI designers (impossibly) agreed on one ethical theory (or ensemble of underlying theories) being “correct,” there is no mechanism to align the rest of the humans around that theory (or meta-theory).
- Law, on the other hand, has legitimate authority imposed by government institutions.
- Even if AI designers (impossibly) agreed on one ethical theory (or ensemble of underlying theories) being “correct,” it is unclear how any consensus update mechanism to that chosen ethical theory could be implemented to reflect evolving (usually, improving) ethical norms. Society is likely more ethical than it was in previous generations, and humans are certainly not at a theoretically achievable ethical peak now. Hopefully we continue on a positive trajectory. Therefore, we do not want to lock in today’s ethics without a clear, widely-agreed-upon, and trustworthy update mechanism.
- Law, on the other hand, is formally revised to reflect the evolving will of citizens.
Any thoughts on whether (and how) the generalized financing mechanism might apply to any AI Safety sub-problems?