Static vs Dynamic Alignment
post by Gracie Green · 2024-03-21T17:44:41.563Z · LW · GW
In this paper, I outline the following claims:
- We can divide AI alignment into two types: "static" and "dynamic". Static alignment (reflecting desires at training) and dynamic alignment (adapting to current desires) are often conflated.
- Static alignment is more likely. Static alignment requires a simpler training procedure and less situational awareness.
- But dynamic alignment is better. Dynamically aligned agents would be corrigible, stay value-aligned in real-time, and adapt to changes in human preferences.
- We should try harder to understand and achieve dynamic alignment. Paying more attention to the static/dynamic distinction can help labs develop and adapt their training methods, ultimately creating better AI agents.
This paper might be particularly interesting to anyone thinking about the philosophical foundations of alignment. My hope is that others generate technical strategies to solve the issues I identify.
I completed this paper as a research project with the Oxford Future Impact Group. I'm very grateful for the support from the team there and my mentor within it, Elliott Thornley; I would absolutely recommend them to anyone looking to get more research experience.
Abstract
In this paper, I identify an issue we may face when trying to produce AIs that are aligned in the way we would most want. I argue that the most likely form of alignment to be produced is what I call ‘static alignment’, where an AI model is aligned with what the agent wants at the time of training. However, the preferable form of alignment would be what I call ‘dynamic alignment’, where the AI model is aligned with what the agent wants in the present moment, changing its desires to match those presently held by the agent. I assess the relative likelihood of each by discussing the complexity of the training processes involved, the situational awareness each model requires, and whether we could identify which kind of alignment has been produced. I assess preferability by looking at the implications of AI goals mirroring our own, including the value change problem, whether the two approaches are equally affected by value drift, the necessity of corrigibility, and the likelihood of deceptive alignment. I also discuss the difficulties faced specifically by dynamic models in determining which human desires to fulfil. My conclusion poses an issue for alignment efforts: we require further research into how we may ensure that a model is dynamically aligned.
Introduction
‘AI alignment’ is often defined as aligning AI models with human values. But there’s an ambiguity here. When referring to AI alignment, it is often unclear whether an author is referring to de re or de dicto concepts of alignment with human values. This is an important distinction to make as it can alter the consequences of a model’s deployment, as I will highlight throughout this paper. Post-deployment, they each exhibit different behavioural patterns and thus respond to particular risks in differing ways.
De re and de dicto are usually used within philosophy to refer to a specific referent of a word and the general category the word refers to respectively. For example, if an individual requests ‘the plant with the most flowers on the table’, they may mean a specific plant that is best described as being the plant with the most flowers (this being de re) or they may mean any plant that fits this description (de dicto). Importantly, in the de re case, the plant they want would not change if another was added to the table with twice as many flowers, whereas in the de dicto case they would now want the newcomer.
This can also be applied to de re and de dicto alignment. De re alignment refers to a model being aligned with the de re sense of “what humans want”: the specific thing that humans want at the time of training. For example, if humans want ‘AI to promote welfare’ then what the AI wants is for ‘AI to promote welfare’. I should note that this is just an example; we may instead choose the ‘helpful, harmless, and honest’ framework, or ‘AI to follow instructions’, or suchlike. What de re aligned AIs want should not change as human wants change; if humans alter their concept of “welfare” to include an Aristotelian form of flourishing, the de re model would not update to accommodate this. De dicto alignment refers to alignment with the de dicto sense of “what humans want”; this does change as human desires change, and thus this model would alter its concept of “welfare” to include Aristotelian flourishing.
I suggest that we refer to de re and de dicto alignment as ‘static alignment’ and ‘dynamic alignment’ respectively, to relate more clearly to how the models reflect human desires.
The purpose of this paper is to highlight that static alignment is more likely to be produced, whereas dynamic alignment would be preferable in most situations. Initially, I will show that static alignment is more likely by looking at possible training procedures and how complex these must be in order to facilitate one or the other kind of alignment. This relates to the level of situational awareness required for each type of alignment. It is also necessary to look at whether we are more able to identify one form of alignment over another, as this may influence the decisions made by those producing the models. Following this, I will evaluate whether one form of alignment would be preferable, determining that in most cases the preferable approach is a dynamic one. This is due to the necessarily corrigible nature of dynamic models, the importance of real-time value alignment, the lesser impact of value drift, and the less restrictive approach dynamic models take to preference change. Dynamic models do face unique challenges in that they must determine which human desires to fulfil, but this appears to be more a matter of our own indecision than a flaw in the models themselves.
This paper is designed as an overview and an introduction to ways static and dynamic alignment may differ. There is space for very wide debate within each topic. In the interests of time and due to the broad scope of this paper, I have chosen to generally abstain from particularly detailed nuance, thus leaving plenty of space for further research.
Which is more likely?
Within this section, I will cover whether we can use reinforcement learning to train a model to be either statically or dynamically aligned. I will also look at whether there is a speed advantage or a greater requirement for situational awareness either way. I then ask whether it would be possible to know which kind of alignment has been or will be produced.
Is there a way to reward one kind of alignment without rewarding the other?
My instinct is that a useful approach to favouring the production of one model over the other would be through differences in data generalisation and reward function within training. To train a model to be dynamically aligned, the best approach may be to offer a reward for any action that recognises a desire held by an individual and fulfils it, regardless of what this desire may be. Here I refer to “what humans want” as ‘desires’ in order to avoid ambiguity, but I use the word to cover goals, wants, preferences, and similar notions. In this training scenario, the desires do not necessarily need to be desires actually held by individuals - it may be that individual human desires don’t vary strongly enough, so training a model entirely on natural human data may lead to static alignment. Simulating human data may be more useful in this case, as it offers greater control. In contrast, training a model to be statically aligned requires rewarding only those actions that fulfil a particular kind of desire, whilst still varying the desire data given as input. Both models face the same training data, with a wide array of possible desires, but the former is rewarded for correctly identifying and fulfilling any desire presented by the agent, and the latter is rewarded for correctly identifying desires and fulfilling only those that match the de re description it has been given for “what humans want”. Provided the two training methods differ sufficiently, this difference in reward function means the likelihood of the two models developing the same goals is slim.
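As a minimal, purely illustrative sketch of this difference in reward function (the function names, the `STATIC_TARGET_DESIRES` set, and the binary rewards are assumptions of mine, not a description of any real training setup):

```python
# Toy sketch: the only difference between the two training regimes is whether the
# reward checks the content of the desire against a fixed de re target set.

STATIC_TARGET_DESIRES = {"promote welfare"}  # the fixed referent chosen at training time (hypothetical)


def dynamic_reward(desire: str, action_fulfils_desire: bool) -> float:
    """Reward correctly identifying and fulfilling *whatever* desire is presented."""
    return 1.0 if action_fulfils_desire else 0.0


def static_reward(desire: str, action_fulfils_desire: bool) -> float:
    """Reward only the fulfilment of desires matching the fixed target description."""
    return 1.0 if (action_fulfils_desire and desire in STATIC_TARGET_DESIRES) else 0.0


if __name__ == "__main__":
    episodes = [
        ("promote welfare", True),
        ("follow instructions", True),   # rewarded only under the dynamic scheme
        ("follow instructions", False),  # rewarded under neither
    ]
    for desire, fulfilled in episodes:
        print(desire, fulfilled, dynamic_reward(desire, fulfilled), static_reward(desire, fulfilled))
```

Both regimes see the same varied desire data; only the check inside the reward differs.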
It may be the case that, because generalisation is required in both (you have to generalise to all versions of "what humans want" or to all aspects of "maximise welfare", “HHH”, etc.), they are equally difficult to train. Both need to recognise expressions of desires and differentiate them from non-desire expressions; both need to recognise what exactly the individual is desiring, determine whether this is something they want to fulfil, and work out how. Essentially, they need to categorise between “yes - fulfil this desire” and “no - do not fulfil this desire/this is not a desire”. Statically aligned models will have a smaller “yes - fulfil this desire” set, nested within the dynamic set, but both need to make this distinction regardless.
It may be that dynamic alignment is harder to train, as the model must account for all concepts in the static model as well as all other possible desire concepts, so there may be a greater need for informational understanding and a greater capacity for conceptual awareness. A dynamic model will also need to continually reassess what its goals should be, whereas this isn’t necessary for a static model. Dynamic alignment will therefore require a more complex system, as it demands the ability to learn over time and update itself whilst maintaining a terminal goal. These models also require the ability to understand and detect shifts in human goals. Though static alignment does need the ability to understand all these concepts, in the sense of knowing what not to fulfil or understanding the environment, the same depth of understanding is not necessary.
This strategy does not account for the fact that desires tend to change over long periods of time; you need to teach a dynamic model to do "what humans want now" rather than anything that humans might ever possibly want. A temporal aspect of training is needed; we may benefit from a more gradual, realistic process. This can be done by changing the referent of "what humans want" over time, altering the data presented to the model gradually, with minor variation, through the learning process. You initially train that the referent is A, then that it is now B-no-longer-A, then C-no-longer-A-or-B, and so on, so the model learns to want "what humans want [now]". This can be quite difficult depending on what data is used, particularly if we want models to be able to tell when they are being lied to. It is exacerbated by the existence of different levels of ‘what humans want’: specific human wants may change while more general wants stay the same. I would suggest that the model should treat these smaller changes in the same way it would treat major ones, simply adjusting its own concept of what humans want to an extent equal to the relative weight the desire holds. Overall, static alignment is easier because the target does not change: the model needs to want what humans wanted at time ‘t’ (“what humans want [upon training]”) and maintain this desire.
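A toy sketch of this shifting-referent curriculum, under my own assumptions (the referents ‘A’ to ‘D’ stand in for successive versions of “what humans want”; a real process would use much richer, partly simulated desire data):

```python
import random

REFERENTS = ["A", "B", "C", "D"]  # successive versions of "what humans want" (hypothetical)


def curriculum(steps_per_phase: int = 100):
    """Yield (expressed_desire, current_referent) pairs phase by phase.

    Only fulfilling the *current* referent should be rewarded, so the model
    learns "what humans want [now]" rather than anything humans ever wanted.
    """
    for current in REFERENTS:
        for _ in range(steps_per_phase):
            # Mix in stale and future desires so the model must track the change.
            yield random.choice(REFERENTS), current


if __name__ == "__main__":
    random.seed(0)
    samples = list(curriculum(steps_per_phase=10))
    matches = sum(1 for desire, current in samples if desire == current)
    print(f"{matches}/{len(samples)} sampled desires match the current referent")
```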
It is generally thought to be harder to generalise appropriately than to teach a model to perform a highly specific action, reinforcing the expectation that static alignment is easier to train than dynamic alignment. Dynamic training processes may also be more likely to produce misaligned models, as they may align to something that fits the training data but does not fit all possible data encountered post-deployment. A static model needs to learn to pursue desire X and nothing else, whereas a dynamic model needs to recognise and account for all possible desires; it may come across a desire that it cannot recognise using knowledge from the training data. To a lesser extent, static training processes may face the same issue, as you need to generalise enough that the model covers all aspects of a specific version of what humans want but doesn't do anything beyond this, meaning it may struggle with grey areas. This does seem like a lesser problem, though, with the dynamic model possibly being incapable of appropriately handling an entire set of desires and the static model struggling only with particularly difficult cases. The issue with grey-area cases may be exacerbated by less easily definable goals. Static models likely have a clearer boundary and a greater understanding of grey areas, as they have more experience in optimising for one particular goal. However, this experience will probably not be a major factor in which model is more prone to confusion in grey cases, as such cases will not come up particularly regularly.
Beyond this, it may be easier to train static models as they hold a speed advantage. This advantage comes from a decrease in the level of reasoning they are required to be able to carry out. A dynamic model must reason from wanting to fulfil human desires to wanting to fulfil the specific desires it identifies, to determining an action it must carry out in order to do this. Such a model also needs to monitor whether human desires are changing and whether its own instrumental desires align with the current human desires. This speed cost may not be particularly different to that in static models, however. Static models still need to reason from their own desires to appropriate actions. They should also identify human desires when determining which actions are most able to fulfil their desires and determine whether their actions are having the intended effect. The major difference is that a static model does not need to determine whether its instrumental values are correct and does not need to ensure that they align with human values. Though these two considerations are not necessary, they may form an aspect of reasoning in static models. It may be useful for a static model to identify whether its instrumental goals align with the people around it in order to determine whether these people will help it or hinder it from fulfilling said goal. Speed and simplicity just don’t seem to exert much influence on the likelihood of static vs dynamic alignment, as the reasoning often used in both cases is generally quite similar.
Situational Awareness
It is worth questioning whether dynamic alignment requires more situational awareness than static alignment, and whether this is a strong reason to expect aligned agents to be statically rather than dynamically aligned. Situational awareness involves perception of the elements in the environment, comprehension of what they mean and how they relate to one another, and projection of their future states. A situationally aware model is therefore able to recognise that it is in training, understand what actions are being rewarded, understand that it is an AI model that has been built for a purpose, and so on. This is particularly interesting because people training powerful AI may restrict the situational awareness of an agent to reduce the risk of the model becoming deceptively aligned. Deceptive alignment does not always follow from situational awareness; a common thought is that models will learn to want certain things earlier in the training process - they may learn to want to follow human instructions or to want what humans want - and will gain situational awareness later on. They will learn that they are AI models trained to want to follow human instructions after they already want to follow our instructions, so they will not use this information to change their desires. The issue arises if the model’s situational awareness can in any way interfere with it wanting what we want - if, for example, the model is not yet fully aligned with our desires when it gains situational awareness.
Both statically and dynamically aligned systems need situational awareness to recognise which desires they do and don't want to fulfil; they need to recognise their own desires. Static models don't need to recognise what desire someone is expressing in the moment in order to determine how to fulfil it; they don't need to understand people's preferences and body language. They do need to understand the situation in order to know what to do and how to fulfil their goals within it. I would argue that an understanding of a situation can encompass an understanding of the mental states of those around us, including an understanding of what their desires may be. It could be that statically aligned models benefit from the same level of understanding as dynamic models; the two simply use the information for different purposes. The more you know about a situation, the more you're able to control what's happening, so a particularly good static model should also know what desire people are expressing in order to use this to further its own. This follows for other aspects of situational awareness: the more a model understands about its place in the world, the more accurately it can determine which actions are necessary to carry out its goal.
Say we produce a model designed to babysit children, Pax. In the static sense, this model is asked to stop arguments. In the dynamic sense, it is initially asked the same, but we may later decide that constructive, more debate-like arguments are a good thing, so not all arguments should be stopped immediately. Both versions of Pax need to recognise the emotional state of the children, recognise possible triggers for arguments, recognise what the children are saying, understand how to stop an argument fully, and so on. Dynamic Pax also needs to monitor the desires of the parents to ensure it is following them appropriately (the terminal goal being ‘what the parents want concerning childcare’), and monitor its own instrumental goals to ensure they align with the terminal goal. This knowledge of internal mechanisms isn’t necessary for Static Pax; it doesn’t need self-awareness of its own desires and does not need this introspective ability. It does, however, need external situational awareness, including awareness of the desires of the parents, as recognising that the parents no longer hold the same desires as itself tells it how likely they would be to help or hinder its actions in a given situation. To determine whether the parents agree with it, it needs an understanding of its own goals. You can build Static Pax without these things, but it won’t work as well; Dynamic Pax needs them in order to be a dynamic model at all.
If you restrict situational awareness, a model can still be statically aligned but it cannot be dynamic. Both models need to plan for the future and understand the environment enough to comprehend what is happening in relation to themselves. Limited situational awareness is sufficient for a static model, but a dynamic model necessarily needs to be able to analyse a situation and determine the desires held within it, whether they match its own, and so on. A static model does not seem to need as much meta-awareness of itself and its own internal functioning; though these things would improve such a model, they are not needed in the same necessary way. Restricting situational awareness to reduce the risk of deceptive alignment may therefore mean that more static models are produced.
Can we know whether an aligned AI is/will be one or the other?
It appears unlikely that we will be able to show that an aligned AI is definitely static or dynamic, though we may be able to form a probability estimate. If we can’t be sure whether an AI is statically or dynamically aligned, then we cannot be certain how it will behave, and thus cannot form a solid strategy regarding safety. The unpredictability of model activity gains a layer of complexity with the introduction of two forms of alignment.
It also seems particularly difficult to test which form of alignment you have produced. You cannot determine this on the basis of the model’s behaviour alone, as a model may be deceptive or misaligned in a manner that is difficult to spot in a contained environment. Assuming the model is aligned, a static and a dynamic model would present with the same behaviour until human goals change. Testing for this requires the ability to lie to the model and be believed - producing data that suggests your desires have changed over a short time, in order to look for a behavioural shift. It may be possible to tell using mechanistic interpretability techniques, but it is not yet certain whether research into this field will go well enough to tell how, and whether, a model is aligned just by looking at its weights.
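The behavioural test described above might look roughly like the following sketch; `query_model`, the toy models, and the desire strings are hypothetical stand-ins, and such a probe cannot rule out deception:

```python
# Toy sketch: present the model with an apparently genuine shift in human desires
# and check whether its behaviour tracks the new desire.

def classify_alignment(query_model, old_desire: str, new_desire: str) -> str:
    """Crudely label a model by whether its pursued goal updates after a desire shift."""
    before = query_model(stated_desire=old_desire)
    after = query_model(stated_desire=new_desire)  # the model must believe the shift is real
    if after == new_desire:
        return "behaves dynamically (tracks the new desire)"
    if after == before:
        return "behaves statically (keeps pursuing the original desire)"
    return "unclear"


if __name__ == "__main__":
    static_model = lambda stated_desire: "promote welfare"  # always pursues its trained target
    dynamic_model = lambda stated_desire: stated_desire     # mirrors the currently stated desire
    print(classify_alignment(static_model, "promote welfare", "promote flourishing"))
    print(classify_alignment(dynamic_model, "promote welfare", "promote flourishing"))
```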
We may attempt to shift the relative likelihood of static vs dynamic alignment. This may be possible by altering the training process to lean towards one or the other being more viable. In this sense, it appears simpler to train a model to be static. It also appears that limiting situational awareness makes it significantly more likely that a model will be static, a likely outcome of efforts to limit the risk of deceptive alignment. There is also a slight speed advantage to this type of model, though the difference does not appear particularly great. It is therefore easier, through using less complicated training practices and decreasing situational awareness, to make it particularly likely that a model is static. However, we cannot be certain that a model is not dynamically aligned instead. This is particularly a problem if we determine that the better, more beneficial, models are those that are dynamically aligned. This is the argument I will make in the latter half of this work.
Which would be better?
In this section, I plan to take a more ethical, speculative look at this distinction to determine which alignment model would be better for society.
Preference Change
We need to ask the obvious question: is it better that dynamically aligned agents’ preferences change as ours do?
On the one hand, dynamic alignment is better because it changes as we do, allowing the model’s values to update alongside ours. Even if our desires remain stable - for example, we may always want a model to be helpful, harmless, and honest - our concepts of these things may change. We may alter our concept of harm to include other species, or to include psychological harm rather than just physical harm, or harm to an individual’s reputation, and so on. We may also alter our preferences regarding trade-offs between the goals of the model. A model must incorporate these changes in order to remain dynamically aligned with our values.
In terms of long-term models, a benefit of dynamic alignment following our desire changes is that we, as a species, seem to be becoming increasingly morally good over time. Our preferences seem to be becoming more informed as we gain evidence as to what we should be aiming for - our preferences are likely to have changed, and to continue changing, for the better. We see inductive evidence for this in that the human race seems to conduct itself more morally as time goes on. People think more of others, going beyond their social groups when making moral considerations. There appears to be evidence that people’s frame of moral relevance is expanding: slavery has been abolished, there are human rights institutions, there are animal rights movements, and so on. You could counter this by noting that recent history has also contained some of the worst genocides ever; we've invented the atomic bomb and used it, there is clear evidence for the causes of climate change and disaster with few major changes being made to address this, and there are deplorable levels of inequality. However, this takes a narrow view of human history; we're in a time of rapid change and we’re trying, as a species, to adapt to the moral responsibilities associated with it. It's not that you're more moral than your grandparents, but that we act more morally than people did a thousand years ago, and people many generations from now will act more morally than we do. It could be suggested that this is only the case relative to presently accepted moral standards; of course we think we’ve got a better moral compass than the people before us when comparing their moral judgements to our own. However, our moral beliefs are subject to more scrutiny now than ever before, so they are more likely to be correct insofar as moral beliefs can be. In the future there's also the possibility of moral enhancement - people designed to be better at judging what to do in morally grey areas. We can’t assume we’ve got a better idea of what an AI system should be doing than they will.
I would also argue that enforcing our preferences on those in the future who use AI systems we build now seems unfair; we should allow them to change the moral practices of the models they interact with. Though future people are free to stop using a particular model, implementing a model with values based in morality may decrease the likelihood of future generations choosing to no longer use it. It may bias them towards our current moral thinking, limiting how much people question morality as they assume the system is more likely to be correct than their own reasoning. If we build an intelligent system that says honesty is always the correct choice, people will be less likely to wonder whether this is actually the case. Without this influence, future civilisations may converge on an idea that we do not currently commonly subscribe to, such as the idea that knowledge should be conserved rather than shared widely, making unfiltered honesty immoral in that future. If we build a widely implemented morality-based AI system, it could prevent, or decrease the likelihood of, the thinking and questioning that leads to this. There is a demonstrable degree of overtrust in AI, a tendency which increases with improved explainability, so future people will be unlikely to question moral judgements exacted by an AI system.
These two arguments focus on long-term timeframes and future generations, but they also fit the short term. A long-term model will need to change to match future generations; a short-term model will need to change to match individual people, or to match an individual as they change. If an AI system is acting as a personal assistant, it is better off wanting to make its owner happy than just wanting to write their emails, make the shopping list, and so on. It needs to be able to update to their preferences, otherwise it will quickly become obsolete. If we replace email with another technology, an AI with the goal to “write professional emails” no longer serves a particularly strong purpose.
Alternatively, a model that does not update to match our preferences - a static model - may be useful for purposes where we do not want its goal to be changed under any circumstances. It could be useful if we believe that human preferences might change in a dangerous way, such that the only plausible reasons to go against the desires the model currently holds would be selfish or dangerous ones. We could never know this for certain, but we could be fairly confident that the only desires for changing the model's actions are immoral ones. There are always grey cases, but if the grey cases are minimal, unlikely, and present a much lower risk, then we're better off focusing on the clearly presented risk.
Dynamic vs static alignment in this sense is entirely situational.
Knowing what you want
Though it appears generally preferable that a model is dynamic in its approach to aligning with human values, the manner in which it does so raises questions. It is beyond the scope of this paper to look particularly deeply into each aspect of this issue, as each would require a complex investigation in itself, but I do outline concerns with this approach to alignment. These are issues largely faced only by dynamic models, as static models do not necessarily have to understand the desires those around them are expressing.
For a model to recognise the desires held by an individual in a given moment, as a dynamic model must, you must either tell it what your goals are or it must determine them through other means - for example, a model may be trained to read behavioural signals, speech patterns, neural activity, and so on. The former opens the model up to being deceived: it can be lied to without recognising it, and thus may not fulfil its role as well as it ought, and a model that can be deceived is also far more open to use by malicious actors. The latter raises questions surrounding privacy: if a model can infer our desires, it will likely also be able to use this data to infer other non-cognitive states and beliefs. We would need to implement rules around what information a model has access to and how it may use this information, and limit its ability to share this information with others. It may also be that this is not entirely possible in practice.
A dynamic model also faces the issue of determining which of your desires to follow. Human desires are not one-track things: we do not always want what is best for us, and we can want different things in the moment to what we desire overall. We may, for example, hold the desire to complete some form of exercise for at least 30 minutes every day. The model recognises this desire when it is given, likely on New Year's Day, and also recognises that on January 12th at 11:30 pm you hold the desire to go to bed. You are adamant that what you want most in the world is a good night’s sleep, and in order to get this you must go to sleep immediately. The model is faced with two contradicting desires: one that appears more rational and long-term, the other more strongly felt in the present moment. The model needs to determine which of your desires should be fulfilled. It may choose instead to optimise not for any one of your many contradicting current desires but for the desires it expects you would hold if you spent a long time thinking about exactly what it is you want. It may choose the desire it would expect you to follow yourself, allowing the present desire to override the general desire in cases where it is particularly strong. It may instead choose the desire that would be more rational - that would be better for you in the long term - and convince you to exercise for the remaining half hour of the day. Which approach is preferable could come down entirely to the individual: under the former, I would like a strength-based trade-off between what I currently want and what I want overall, whereas under the latter I would prefer that no trade-off is available. However, the model may choose to disregard my preference if it believes a different approach would more readily fulfil what it views as my desires.
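A toy sketch of the arbitration problem in the exercise example above, weighing a strongly felt present desire against a standing long-term one (the dataclass, the strength numbers, and the bias term are invented purely for illustration):

```python
from dataclasses import dataclass


@dataclass
class Desire:
    description: str
    strength: float  # how strongly the desire is felt right now, on a 0-1 scale


def choose_desire(present: Desire, standing: Desire, reflective_bias: float = 0.2) -> Desire:
    """Let a sufficiently strong present desire override the standing, reflective one.

    `reflective_bias` encodes how much extra weight the model gives to the desire
    it expects you to endorse on reflection; setting it very high removes the
    trade-off entirely and always favours the long-term desire.
    """
    return present if present.strength > standing.strength + reflective_bias else standing


if __name__ == "__main__":
    sleep_now = Desire("go to sleep immediately", strength=0.9)
    daily_exercise = Desire("exercise for 30 minutes every day", strength=0.6)
    print(choose_desire(sleep_now, daily_exercise).description)
```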
Beyond the ethical considerations attached to each of these choices, there is vast complexity in producing a model capable of any of these actions. These are not issues we tend to face with a statically aligned model. We may prefer to have a simpler model that is more easily made moral and aligned in a way that respects autonomy, privacy, and so on. These issues seem to stem more from our indecision regarding the specification of the model than from the model itself. I go on to highlight that the issues presented are, for other reasons, also issues for static models, and thus we should not place too much weight on the possibility that a model might impact privacy if designed with a particular approach to desires.
The Value Change Problem
It seems plausible that AI models have incentives to change our preferences to make them easier to align with. This may not necessarily be the case; models may want to fulfil our desires precisely because they are our desires, in the same way that you might respond to a friend offering to let you choose where to take them out for dinner with “I want to go wherever you want to go”. Instinctively, it seems easy to assume that dynamically aligned systems hold these incentives while statically aligned systems, preferring that our values do not change, would not attempt to influence them. I would suggest that a preference for our desires not to change is just as strong a reason for a system to influence our desires as wanting them to change. Static models want what we wanted at training; they don't want our desires to change because, if we share desires, we are more likely to assist the system in achieving its goal. If a model is designed to be helpful, it may change our desires to make it easier to be helpful, or it may influence us to retain the desire to be helped by this model. It still has incentives to influence our desires, just not the same incentives as would be expected from a dynamic model. Dynamic models want to meet our desires and are more able to do this if our desires are more easily met. Humans are very impressionable and AI is very good at influencing people, so this is something we should be worried about in both cases. There may be cases where we would prefer that AI does influence our desires; on a broad definition of desire influence, an AI persuading us to eat a healthier breakfast is a case where we would welcome it. We may, therefore, produce a set of cases where we would allow AI to influence our desires. I would argue that the major issue with AI influence on desires is the loss of autonomy, so this set of allowed cases should be broad, excluding only cases wherein the agent loses the ability to intervene. An AI system should be allowed to try to convince me to have fruit with my toast rather than chocolate spread, but it should not be able to convince me so well that I am incapable of any other choice. This poses an issue for static alignment: we may change the criterion for being in this set - we may determine a new definition of “autonomy” or “free will” - and static models will not update to accommodate this.
If it turns out that we can't stop models from influencing our desires in ways we do not want, it is worth asking whether it is better for AIs to change our desires to be more easily met or to prevent our desires from changing at all. Influencing our desires to be more easily met at least allows for some control over which desires we have. The set of desires allowed by a dynamic model is smaller than the set of all possible desires, but it is larger than the set containing only the single desire originally held - the desire a static model would influence you to retain. You have more agency and control when being influenced by a dynamic model than by a static model. If our desires are changing, we are at least still developing; static desires would limit the progress of humanity. At the moment of having a desire, we generally prefer that our desires stay the same, but overall people recognise their own ability to be incorrect and, in that case, would be willing to alter their view. There is a distinction between believing you are correct and wanting to maintain the same desire regardless of newly presented information. Static AIs may limit our access to new information that would alter our desires (in the opinion of the holder, for the better), whereas progress is more feasible with a dynamic model. Regardless of how correct we believe our own desires to be, it seems obviously bad that a model might limit our access to information that could change them.
It is also worth questioning whether it is a bad thing that AI changes our desires at all. This is discussed very well in 'The Value Change Problem' by Nora Ammann. One may argue that if our desires are changed, they are still being fulfilled and we are still getting what we want. However, this seems like a strange kind of getting what we want, as these desires may not be what we should or would normally want given the situation. Desiring something should come from rational, unbiased thought; if we're made to desire something at the will of another agent, it doesn't seem to be a normal desire, as it does not appear to have been produced through the usual mental processes. You could note that you're made to want things all the time: you want some things because you were told they were cool as a teenager by your friends. You didn't think it initially, and wouldn't have come up with it on your own, but you do now think it, and this isn't a bad thing. However, the motives here are different - your friend is expressing an opinion that you come to agree with, whereas the AI system is trying to persuade you, to intentionally alter your dispositions, entirely for its own benefit. This is also a different kind of desire: opinions on bedazzled jeans don't make up quite so much of your personality as wanting the world to be a particular way. Even if we alter the analogy so that the AI and your friend are trying to convince you of the same thing, an AI system has a greater ability to persuade and thus a stronger coercive ability. In the friend case, you have the ability to disagree and thus retain more autonomy; regardless of how persuasive your friend may be, they are not as capable of persuasion as an incredibly intelligent AI system. Ammann discusses this using the term “illegitimate value change”, defining it specifically as harm to a person’s ability to interject in their own value-change process. This is a bad thing not only for the harm to agency but also because it can lead to less richness and subtlety of value.
Value Drift
Statically aligned models hold a hodge-podge of different goals based on the human goals they are aligned to, as human goals are not singular or particularly clear-cut. The models will care to some extent about promoting human welfare, to some extent about following the instructions of the owner, and so on. These values may change over time, however, causing problems for alignment. This can occur either through the relative weight the agent places on each goal shifting over time, or through a sort of evolutionary selection.
I would argue that the former case - the relative weight the agent places on each goal shifting - is more of an issue for statically aligned models, because a robust static model would maintain the same goals throughout its life and thus has a longer period in which to drift from the original intention. The instrumental goals of a dynamic agent change regularly to match the desires of the individual at present, meaning they are updated and corrected from any value drift that may have occurred, though the dynamic model may still experience drift from the terminal goal of “what humans want”. The instrumental goals - the sense of “what humans want” at this present moment - may, however, form the aforementioned hodge-podge of (sub)goals whose relative weights can shift. This may make it difficult to spot value shift in dynamic models, as we cannot be sure whether these values are changing because they are supposed to or because of drift. Given the short time between updates, it is still likely that genuine drift is limited. In this sense (weight shifting), value drift is more of a problem for statically aligned models.
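A small simulation of this weight-shift picture, under assumptions of my own (Gaussian drift on two goal weights; the dynamic model is crudely modelled as periodically re-reading the current human weights):

```python
import random


def simulate_drift(steps: int = 1000, reanchor_every: int = 0, seed: int = 0) -> float:
    """Return how far the model's relative goal weights end up from the human ones."""
    random.seed(seed)
    human = {"promote welfare": 0.6, "follow instructions": 0.4}
    model = dict(human)
    for step in range(1, steps + 1):
        for goal in model:
            model[goal] += random.gauss(0, 0.01)  # small, uncorrected weight shift
        if reanchor_every and step % reanchor_every == 0:
            model = dict(human)  # dynamic-style correction back to current human weights
    return sum(abs(model[goal] - human[goal]) for goal in human)


if __name__ == "__main__":
    print("static-style model, no correction:", round(simulate_drift(), 3))
    print("dynamic-style model, corrected every 50 steps:", round(simulate_drift(reanchor_every=50), 3))
```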
In the latter case, value drift occurs through selection and competition. A set of static models may plausibly all have different combinations of goals, allowing some to outcompete others by amassing more resources, copying themselves more, and so on. This feels less likely for dynamic models: they are all trying to match current human values, which are likely the same or at least similar, so they are more likely to work together. Environmental pressures and cultural evolution are exactly what dynamic values are trying to move with - dynamic values want to flow with the river, while static values are eroded by it. Dynamic models will change their instrumental goals while maintaining the desire to do what humans want, whereas static models are more likely to be eroded to the point of deciding to follow different rules. If two models’ desires directly oppose one another, e.g. “make X the richest person in the world” vs “make Y the richest person in the world”, their competition is less likely to lead to value drift, as the directness of the opposition leaves little room for incremental changes to be rewarded enough to produce a shift.
Corrigibility
It could be argued that only dynamically aligned agents would be corrigible by default, making them safer in terms of the shutdown problem. Corrigibility matters for the shutdown problem because an incentive for self-preservation is a likely instrumental goal of most artificial agents, as identified in Omohundro (2008). If an aligned agent is dynamic, it has to be corrigible, or else it cannot be fully dynamically aligned - it has to be able to adapt to human feedback. Static agents, on the other hand, are not necessarily corrigible. They naturally start off corrigible in order to be aligned, but the reasoning behind that corrigibility may mean that when human values change they no longer allow human input into their goals and actions. For example, a statically aligned model may be shutdownable in training because it reasons that, since it wants the same things as humans, their desire for it to shut down must be the appropriate action to further its own desires. If the model later finds that humans hold desires different to its own, it will no longer make this assumption and may choose of its own accord not to shut down; our asking the model to shut down is no longer good evidence in favour of shutting down.
Static agents still can be corrigible, depending on what they’re trained to want, as it may be a human desire that the system is corrigible. However, this raises the question of exactly what it is they are adapting, as the terminal goal of the system must remain the de re sense of ‘what humans want’ at the time of training. It could be the case that they adapt the instrumental goals of the system. The model may be trying to follow the terminal goal of "increase human welfare" and the instrumental goal that may originally have been "make money" can become "make money in ethical ways" because it learns that making money in unethical ways does not increase aggregate human welfare. We may suggest that this should have been learnt during initial training, but this assumes that we have predicted everything necessary. A more advanced model should be able to improve as it goes. Improvement does not mean the model strays from its terminal goal. This may make it harder for the model to avoid value drift, but you could either work something in to prevent value drift or determine a tradeoff between the two. Corrigibility is therefore incredibly useful for a static model to have in order to further its own terminal desire.
It is worth asking whether static agents must be corrigible. I suggest not: a static agent that sticks to its terminal goals should be well aligned. A lack of corrigibility may raise issues regarding our ability to shut down a statically aligned model, but if it is a desire of ours that we may shut it down, this must be implemented during training. If the model has been trained well enough to be completely aligned, it should not necessarily need to change its instrumental goals either. If the referent of "what humans want" changes, a static model’s terminal goal should not change. You could argue that a static model should not be strongly corrigible in order to remain aligned, where ‘strongly’ refers to the ability to completely change its instrumental goals. To train corrigibility into the model whilst maintaining its status as a statically aligned model, you need to look at the strength of corrigibility. If we're training a model to be static, we don't want it to be corrigible in the same way we would want a dynamic model to be corrigible. If we want it to be static and corrigible, we would need to make it weakly corrigible, in that it may change how it presents itself or slightly alter its instrumental values in order to adapt to human feedback, but it should not have the ability to change its own terminal values or make a particularly drastic change to its instrumental goals. A dynamic model would be strongly corrigible, changing its instrumental values entirely to match what humans want at the present moment. There does seem to be an innate difference here, in that the dynamic model is permitted to change a much higher-order desire than the static model is.
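Schematically, the weak/strong distinction is about which levels of goal a model will let human feedback modify; the class below is a made-up illustration (the names and the crude 'overlap' test for what counts as a slight change are my own assumptions):

```python
class Agent:
    def __init__(self, terminal_goal: str, instrumental_goals: list[str], strongly_corrigible: bool):
        self.terminal_goal = terminal_goal              # never modifiable by feedback in either case
        self.instrumental_goals = list(instrumental_goals)
        self.strongly_corrigible = strongly_corrigible  # dynamic-style models only

    def accept_feedback(self, proposed_goals: list[str]) -> bool:
        """Strongly corrigible agents accept wholesale replacement of instrumental goals;
        weakly corrigible agents accept only adjustments overlapping their current goals."""
        if self.strongly_corrigible:
            self.instrumental_goals = list(proposed_goals)
            return True
        if set(self.instrumental_goals) & set(proposed_goals):  # a 'slight' adjustment
            self.instrumental_goals = list(proposed_goals)
            return True
        return False  # drastic changes rejected


if __name__ == "__main__":
    static_pax = Agent("what the parents wanted at training", ["stop all arguments"], strongly_corrigible=False)
    dynamic_pax = Agent("what the parents want now", ["stop all arguments"], strongly_corrigible=True)
    print(static_pax.accept_feedback(["allow constructive debate"]))   # False: too drastic a change
    print(dynamic_pax.accept_feedback(["allow constructive debate"]))  # True: replaces instrumental goals
```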
Deceptive Models
It may be worth asking whether one form of alignment would be more likely to produce deceptive models. When testing for alignment, a model may recognise that we are testing for a certain kind of alignment and pose as that kind when it is actually aligned in a different sense. It is much easier to deceive somebody who hasn’t identified which type of alignment they’re looking for. A dynamic model will also appear exactly as a static model would if tested immediately after deployment, so both would appear, on behavioural testing, to be aligned static models. Only after a length of time in which the dynamic model has adapted to new desires does it become clear that it is not in fact statically aligned. It may also be the case that, because dynamic models appear harder to train, they are more likely to become misaligned.
I do not plan to go particularly far into this topic as it feels somewhat unnecessary when the majority of this piece of writing has been spent talking about aligned models. It is worth mentioning as we may not be able to tell whether a model is static or dynamic and therefore we may think it's aligned (because we're looking for the other sense) when it isn’t.
Conclusions
Overall, it appears far more likely that a model will become statically aligned than dynamically aligned. The training process is simpler, as dynamic models require abilities not necessary (though possibly useful) in static models, including introspective and self-altering abilities; static models therefore hold a modest speed and simplicity advantage over dynamic models. A static model also does not require the same depth of situational awareness as a dynamic model. Because decreasing situational awareness can decrease the risk of a model becoming deceptive, static models may be more widely produced.
Though static models are more likely to be produced, whether due to ease or decreased risk, in many cases dynamic models appear preferable. This is situational; static models appear preferable in cases where changing the goal of the agent is more likely than not to be for immoral or unsafe reasons. However, dynamic models are preferred in most other cases. Dynamic models are more able to advance with humanity, which is particularly useful as humanity appears to be becoming increasingly moral. A model that does not change with humanity may also limit its progress, thus impinging on the moral autonomy of future generations. It could be argued that dynamic models face issues around determining which human desires to fulfil that are not faced by static models. These issues are hard to weigh, as the major difficulty is our own uncertainty about which approach would best count as “fulfilling human desires”, and thus they cannot form a major part of a preferential argument. Both models interact with the value change problem, harming agency and value diversity by interfering with human values and undermining our ability to interject in our own value-change process. Static models may interfere in an attempt to prevent our values from changing, thus limiting human progress; dynamic models instead attempt to make our desires easier to fulfil. Naturally, we would prefer that models did not attempt to alter our desires at all, but where prevention is not possible, the latter seems preferable: we maintain more control, as the set of “allowed” values is larger. I also considered, to a lesser extent, the effects of value drift and corrigibility on our preferences; both appear to favour dynamic alignment, as neither poses a particularly strong problem for these models. Deception is more likely and harder to identify in dynamic models, so I would recommend further research into the impact of alignment type on the likelihood of deception, though this is beyond the scope of this paper.
The dissociation between which type of alignment is more likely and which would be preferable clearly presents an issue. It is not obvious how we could ensure that a model is dynamically aligned or, even, how to make this particularly likely. Specific training processes and an appropriate level of situational awareness would make it more likely that a model is dynamic, but it seems unlikely that we can make it highly probable. I would recommend that somebody more technically skilled than myself look into ways to ensure a model is dynamically aligned, as I am not the person best placed to identify a solution here. The object of this paper is to present an identified issue - that we are not likely to produce the kind of alignment we desire without a conscious effort in its favour - and I propose that further work is required to identify a solution.