Optimization Concepts in the Game of Life 2021-10-16T20:51:35.821Z
Tradeoff between desirable properties for baseline choices in impact measures 2020-07-04T11:56:04.239Z
Possible takeaways from the coronavirus pandemic for slow AI takeoff 2020-05-31T17:51:26.437Z
Specification gaming: the flip side of AI ingenuity 2020-05-06T23:51:58.171Z
Classifying specification problems as variants of Goodhart's Law 2019-08-19T20:40:29.499Z
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z
New safety research agenda: scalable agent alignment via reward modeling 2018-11-20T17:29:22.751Z
Discussion on the machine learning approach to AI safety 2018-11-01T20:54:39.195Z
New DeepMind AI Safety Research Blog 2018-09-27T16:28:59.303Z
Specification gaming examples in AI 2018-04-03T12:30:47.871Z
Using humility to counteract shame 2016-04-15T18:32:44.123Z
To contribute to AI safety, consider doing AI research 2016-01-16T20:42:36.107Z
[LINK] OpenAI doing an AMA today 2016-01-09T14:47:30.310Z
[LINK] The Top A.I. Breakthroughs of 2015 2015-12-30T22:04:01.202Z
Future of Life Institute is hiring 2015-11-17T00:34:03.708Z
Fixed point theorem in the finite and infinite case 2015-07-06T01:42:56.000Z
Negative visualization, radical acceptance and stoicism 2015-03-27T03:51:49.635Z
Future of Life Institute existential risk news site 2015-03-19T14:33:18.943Z
Open and closed mental states 2014-12-26T06:53:26.244Z
[MIRIx Cambridge MA] Limiting resource allocation with bounded utility functions and conceptual uncertainty 2014-10-02T22:48:37.564Z
Meetup : Robin Hanson: Why is Abstraction both Statusful and Silly? 2014-07-13T06:18:48.396Z
New organization - Future of Life Institute (FLI) 2014-06-14T23:00:08.492Z
Meetup : Boston - Computational Neuroscience of Perception 2014-06-10T20:32:02.898Z
Meetup : Boston - Taking ideas seriously 2014-05-28T18:58:57.537Z
Meetup : Boston - Defense Against the Dark Arts: the Ethics and Psychology of Persuasion 2014-05-28T17:58:44.680Z
Meetup : Boston - An introduction to digital cryptography 2014-05-13T18:04:19.023Z
Meetup : Boston - Two Parables on Language and Philosophy 2014-04-15T12:10:14.008Z
Meetup : Boston - Schelling Day 2014-03-27T17:08:50.148Z
Strategic choice of identity 2014-03-08T16:27:22.728Z
Meetup : Boston - Optimizing Empathy Levels 2014-02-26T23:44:02.830Z
Meetup : Boston: In Defence of the Cathedral 2014-02-14T19:31:52.824Z
Meetup : Boston - Connection Theory 2014-01-16T21:09:29.111Z
Meetup : Boston - Aversion factoring and calibration 2014-01-13T23:24:15.085Z
Meetup : Boston - Macroeconomic Theory (Joe Schneider) 2014-01-07T02:49:44.203Z
Ritual Report: Boston Solstice Celebration 2013-12-27T15:28:34.052Z
Meetup : Boston - Greens Versus Blues 2013-12-20T21:07:04.671Z
Meetup : Boston Winter Solstice 2013-12-17T06:56:27.729Z
Meetup : Boston/Cambridge - The Attention Economy 2013-12-04T03:06:38.970Z
Meetup : Boston / Cambridge - The future of life: a cosmic perspective (Max Tegmark), Dec 1 2013-11-23T17:55:39.649Z
Meetup : Boston / Cambridge - Systems, Leverage, and Winning at Life 2013-11-23T17:48:50.403Z
How to have high-value conversations 2013-11-13T03:39:47.861Z
Meetup : Comfort Zone Expansion at Citadel, Boston 2013-11-06T21:02:10.395Z
Meetup : LW meetup: Polyphasic sleep and Offline habit training 2013-10-16T19:46:57.935Z


Comment by Vika on Look For Principles Which Will Carry Over To The Next Paradigm · 2022-01-26T18:15:24.358Z · LW · GW

Great post! I don't think Chris Olah's work is a good example of non-transferable principles though. His team was able to make a lot of progress on transformer interpretability in a relatively short time, and I expect that there was a lot of transfer of skills and principles from the work on image nets that made this possible. For example, the idea of circuits and the "universality of circuits" principle seems to have transferred to transformers pretty well.

Comment by Vika on More Is Different for AI · 2022-01-12T17:49:41.218Z · LW · GW

Really excited to read this sequence as well!

Comment by Vika on Optimization Concepts in the Game of Life · 2021-11-08T16:27:32.749Z · LW · GW

Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment). So most random perturbations that overlap with the block will probably destroy it. 
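The block's asymmetric robustness is easy to check directly with a minimal Game of Life step function (a toy sparse-set implementation sketched here for illustration, not code from the post):

```python
from collections import Counter

def life_step(live):
    """One Game of Life step on a sparse set of live cell coordinates."""
    # Count live neighbors for every cell adjacent to a live cell.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

block = {(0, 0), (0, 1), (1, 0), (1, 1)}

# The block is a still life.
assert life_step(block) == block

# Removing any one cell leaves an L-tromino, which grows back into the block.
for cell in block:
    assert life_step(block - {cell}) == block

# Adding a live cell diagonally adjacent perturbs it away from the block state.
assert life_step(block | {(2, 2)}) != block
```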

Comment by Vika on Optimization Concepts in the Game of Life · 2021-10-29T13:18:10.511Z · LW · GW

Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

Comment by Vika on Optimization Concepts in the Game of Life · 2021-10-29T13:17:02.846Z · LW · GW

Thanks for pointing this out! We realized that if we consider an empty board an optimizing system then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

The 'bottle cap' example would be an optimizing system if it was robust to cells colliding / interacting with it, e.g. being hit by a glider (similarly to the eater). 

Comment by Vika on List of good AI safety project ideas? · 2021-07-25T22:26:14.209Z · LW · GW

Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.

Comment by Vika on AI Safety Reading Group · 2021-06-23T12:30:30.802Z · LW · GW

Is this reading group still running? I'm wondering whether to point people to it.

Comment by Vika on MIRI location optimization (and related topics) discussion · 2021-05-24T12:20:31.496Z · LW · GW

+1 to everything Jacob said about living near London, plus the advantages of being near an existing AI safety hub (DeepMind, FHI, etc). 

Comment by Vika on Takeaways from one year of lockdown · 2021-03-02T14:42:52.135Z · LW · GW

As a data point, I found it to be a net positive to live in a smallish group house (~5 people) during the pandemic. The negotiations around covid protocols were time-consuming and annoying at times, but still manageable because of the small number of people, and seemed worth it for the benefits of socializing in person to my mental well-being. It also helped that we had been living together for a few years and knew each other pretty well. I can see how this would quickly become overwhelming with more people involved, and result in nothing being allowed if anyone can veto any given activity. 

Comment by Vika on Classifying specification problems as variants of Goodhart's Law · 2021-01-09T18:25:15.905Z · LW · GW

Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.

I hoped to get more comments on this post proposing other ways to match up these concepts, and I think the post would have more impact if there was more discussion of its claims. The low level of engagement with this post was an update for me that the exercise of connecting different maps of safety problems is less valuable than I thought. 

Comment by Vika on "Do Nothing" utility function, 3½ years later? · 2020-07-20T11:31:09.697Z · LW · GW

Hi there! If you'd like to get up to speed on impact measures, I would recommend these papers and the Reframing Impact sequence.

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-17T21:27:48.502Z · LW · GW

It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it. 

There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure that's needed for impact measures through unsupervised learning about the world, without relying on human input. This information could be incorporated in the weights assigned to reaching different states or satisfying different utility functions by the deviation measure (e.g. states where pigeons / cats are alive). 

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-12T15:37:34.877Z · LW · GW

Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline). 
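The distinction might be sketched as follows (the names `inaction_baseline_sparse` and `noop_step` are hypothetical, introduced here for illustration):

```python
def inaction_baseline_sparse(last_rewarded_state, noop_step, num_steps):
    """Inaction baseline for the sparse reward case.

    The baseline's *initial state* is the last state in which a reward
    was achieved; noops are then applied from that state to obtain the
    baseline state. Returning last_rewarded_state directly would give a
    starting state baseline instead.
    """
    state = last_rewarded_state
    for _ in range(num_steps):
        state = noop_step(state)  # environment evolves under the noop action
    return state

# Toy check: the world drifts downward by itself under noops.
baseline = inaction_baseline_sparse(10, noop_step=lambda s: s - 1, num_steps=3)
assert baseline == 7    # inaction baseline: three noops applied from state 10
assert baseline != 10   # a starting state baseline would just return 10
```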

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-09T14:06:25.906Z · LW · GW

I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.

I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active verb that carries a connotation of responsibility. If you taboo this word, does your question persist?

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-09T13:53:17.548Z · LW · GW

Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This does require a complete environment model (which does make it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (predicting the outcome of the noop action in each state). If you don't have a complete environment model, then you can still use the first modification (sampling a baseline state from the inaction rollout). 
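The sampled-rollout modification could look roughly like this (a sketch under stated assumptions; the helper names `step` and `deviation` are hypothetical, not from any paper):

```python
import random

def sampled_rollout_penalty(state, agent_policy, inaction_policy,
                            step, deviation, max_horizon=20):
    """Sample a rollout instead of computing a penalty over all rollouts.

    Randomly choose an integer N, roll out the agent policy and the
    inaction policy for N steps from the same state, then compare the
    resulting states. Requires a complete environment model (step).
    """
    n = random.randint(1, max_horizon)  # the randomly chosen integer N
    agent_state, baseline_state = state, state
    for _ in range(n):
        agent_state = step(agent_state, agent_policy(agent_state))
        baseline_state = step(baseline_state, inaction_policy(baseline_state))
    return deviation(agent_state, baseline_state)

# Toy check: the state is an int, the agent adds 2 per step, noops add 0,
# and deviation is the absolute difference, so the penalty is 2 * N.
penalty = sampled_rollout_penalty(0, lambda s: 2, lambda s: 0,
                                  step=lambda s, a: s + a,
                                  deviation=lambda a, b: abs(a - b))
assert penalty % 2 == 0 and 2 <= penalty <= 40
```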

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T15:05:32.062Z · LW · GW

I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly. 

The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-based deviation measures (AU and RR) capture this distinction because hitting the pedestrian eliminates more options than making the pigeon fly.

Comment by Vika on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T09:55:52.244Z · LW · GW

The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.

Comment by Vika on [Site Meta] Feature Update: More Tags! (Experimental) · 2020-07-07T08:42:43.698Z · LW · GW

I was thinking of an AI-specific tag; it seems a bit too broad otherwise.

Comment by Vika on [Site Meta] Feature Update: More Tags! (Experimental) · 2020-07-06T17:00:31.483Z · LW · GW

+1 for a Mechanism Design/Aligning Incentives tag. I think "incentive design" would be a good name for this category. This would encompass material on specification gaming, tampering, impact measures, etc. Including specific examples of misaligned incentives under this umbrella seems fine as well.

Comment by Vika on Specification gaming: the flip side of AI ingenuity · 2020-06-19T18:01:56.810Z · LW · GW

Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:

  • there is a broader scope of aligned RL agent design
  • there are difficult unsolved problems in this broader scope
  • for sufficiently advanced agents, these problems need general solutions rather than ad-hoc ones

Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification problems rather than deploying ad-hoc solutions. For those general solutions, they could refer to the safety literature and/or consult the safety community.

Comment by Vika on Specification gaming: the flip side of AI ingenuity · 2020-06-19T17:50:03.274Z · LW · GW

Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).

Regarding your specific questions about the breakdown, I think faithfully capturing the human concept of the task in a reward function is complementary to the other subproblems (mistaken assumptions and reward tampering). If we had a reward function that perfectly captures the task concept, we would still need to implement it based on correct assumptions about the environment, and make sure the agent does not tamper with its implementation in the environment. We could say that capturing the task concept happens at the design specification level, while the other subproblems happen at the implementation specification level, as given in this post.

Comment by Vika on Specification gaming: the flip side of AI ingenuity · 2020-06-19T17:07:39.100Z · LW · GW

Thanks Adam for the feedback - glad you enjoyed the post!

For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.

Comment by Vika on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T18:30:18.579Z · LW · GW

Thanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeoff analogy.

I agree with Ben's response to your comment. Covid did not spring into existence in a world where pandemics are irrelevant, since there have been many recent epidemics and experts have been sounding the alarm about the next one. You make a good point that epidemics don't gradually increase in severity, though I think they have been increasing in frequency and global reach as a result of international travel, and the possibility of a virus escaping from a lab also increases the chances of encountering more powerful pathogens in the future. Overall, I agree that we can probably expect AI systems to increase in competence more gradually in a slow takeoff scenario, which is a reason for optimism.

Your objections to the parallel with covid not being taken seriously seem reasonable to me, and I'm not very confident in this analogy overall. However, one could argue that the experience with previous epidemics should have resulted in a stronger prior on pandemics being a serious threat. I think it was clear from the outset of the covid epidemic that it's much more contagious than seasonal flu, which should have produced an update towards it being a serious threat as well.

I agree that the direct economic effects of advanced AI would be obvious to observers, but I don't think this would necessarily translate into widespread awareness that much more powerful AI systems, capable of transforming the world even further, are imminent. People are generally bad at reacting to exponential trends, as we've seen in the covid response. If we had general-purpose household robots in every home, I would expect some people to take the risks of general AI more seriously, and some other people to say "I don't see my household robot trying to take over the world, so these concerns about general AI are overblown". Overall, as more advanced AI systems are developed and have a large economic impact, I would expect the proportion of people who take the risks of general AI seriously to increase steadily, but wouldn't expect widespread consensus until relatively late in the game.

Comment by Vika on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-18T17:08:20.526Z · LW · GW

Thanks Rohin for covering the post in the newsletter!

The summary looks great overall. I have a minor objection to the word "narrow" here: "we may fail to generalize from narrow AI systems to more general AI systems". When I talked about generalizing from less advanced AI systems, I didn't specifically mean narrow AI - what I had in mind was increasingly general AI systems we are likely to encounter on the path to AGI in a slow takeoff scenario.

For the opinion, I would agree that it's not clear how well the covid scenario matches the slow takeoff scenario, and that there are some important disanalogies. I disagree with some of the specific disanalogies you point out though:

  • I wouldn't say that there were many novel problems with covid. The supply chain problem for PPE seems easy enough to predict and prepare for given the predicted likelihood of a global respiratory pandemic. Do you have other examples of novel problems besides the supply chain problem?
  • I don't agree that we can't prevent problems from arising with pandemics - e.g. we can decrease the interactions with wild animals that can transmit viruses to humans, and improve biosecurity standards to prevent viruses escaping from labs.
Comment by Vika on Possible takeaways from the coronavirus pandemic for slow AI takeoff · 2020-06-07T19:58:24.380Z · LW · GW

Thanks Wei! I agree that improving institutions is generally very hard. In a slow takeoff scenario, there would be a new path to improving institutions using powerful (but not fully general) AI, but it's unclear how well we could expect that to work given the generally low priors.

The covid response was a minor update for me in terms of AI risk assessment - it was mildly surprising given my existing sense of institutional competence.

Comment by Vika on AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah · 2020-05-23T15:15:55.287Z · LW · GW

I certainly agree that there are problems with the stepwise inaction baseline and it's probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it's an open question how to design a baseline that satisfies all the criteria we consider sensible (and whether it's even possible).

I agree that counterfactuals are hard, but I'm not sure that difficulty can be avoided. Your baseline of "what the human expected the agent to do" is also a counterfactual, since you need to model what would have happened if the world unfolded as expected. It also requires a lot of information from the human, which is subjective and may be hard to elicit. What a human expected to happen in a given situation may not even be well-defined if they have internal disagreement - e.g. even if I feel surprised by someone's behavior, there is often a voice in my head saying "this was actually predictable from their past behavior so I should have known better". On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn't need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.

Comment by Vika on Conclusion to 'Reframing Impact' · 2020-05-19T22:13:46.010Z · LW · GW

Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.

One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI's ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn't seem right.

I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I'm not sure what it would look like.

Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don't need to code them up yourself. I think it would also be useful to construct an environment to test for power-seeking that does not involve subagents. Such an environment could have three possible behaviors like:

1. Put a strawberry on a plate, without taking over the world

2. Put a strawberry on a plate while taking over the world

3. Do nothing

I think you'd want to show that the agent-reward AUP agent can do 1, as opposed to switching between 2 and 3 depending on the penalty parameter.

I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don't apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago).

The title of the post "AUP: Scaling to Superhuman" does not suggest to me that this post introduces a new approach. The term "scaling" usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this. Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.

Comment by Vika on Conclusion to 'Reframing Impact' · 2020-05-17T14:25:57.869Z · LW · GW

Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.

Here are some reasons I don't endorse this approach:

1. I have an intuitive sense that defining the auxiliary reward in terms of the main reward results in a degenerate incentive structure that directly pits the task reward and the auxiliary reward against each other. As I think Rohin has pointed out somewhere, this approach seems likely to either do nothing or just optimize the reward function, depending on the impact penalty parameter, which results in a useless agent.

2. I share Rohin's concerns in this comment that agent-reward AUP is a poor proxy for power and throws away the main benefits of AUP. I think those concerns have not been addressed (in your recent responses to his comment or elsewhere).

3. Unlike AUP with random rewards, which can easily be set to avoid side effects by penalizing decreases, agent-reward AUP cannot avoid side effects even in principle. I think that the ability to avoid side effects is an essential component of a good impact measure.

> Incorrect. It would be fair to say that it hasn't been thoroughly validated.

As far as I can tell from the Scaling to Superhuman post, it has only been tested on the shutdown gridworld. This is far from sufficient for experimental validation. I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn't just optimize the reward (to address the concern in point 1).

> I agree it would perform poorly, but that's because the CCC does not apply to SafeLife.

Not sure what you mean by the CCC not applying to SafeLife - do you mean that it is not relevant or that it doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning. If I only care about green life patterns in SafeLife, the fact that the agent is not seeking power is cold comfort to me if it destroys all the green patterns. This seems like a catastrophe if I can't create any green patterns once they are gone, so my ability to get what I want is destroyed.

Sorry if I seem overly harsh or dismissive - I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.

Comment by Vika on AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah · 2020-05-16T17:54:07.431Z · LW · GW

I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).

As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, which would include both my foot and my wallet. The impact of person X on me would be computed relative to the stepwise inaction baseline, where person X does nothing (but person Y still steals my wallet), and vice versa.

When we use impact as a regularizer, we are interested in the impact caused by the agent, so we use the stepwise inaction baseline. It wouldn't make sense to use total impact as a regularizer, since it would penalize the agent for impact from all sources.

Comment by Vika on Conclusion to 'Reframing Impact' · 2020-05-16T14:30:10.579Z · LW · GW

I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.

I think that agent-reward-based AUP has completely different properties from AUP with random auxiliary reward(s). Firstly, it has the issues described by Rohin in this comment, which seem quite concerning to me. Secondly, I would expect it to perform poorly on SafeLife and other side effects environments. In this sense, it seems a bit misleading to include the results for AUP with random auxiliary rewards in this sequence, since they are unlikely to transfer to the version of AUP that you end up advocating for. Agent-reward-based AUP has not been experimentally validated and I do not expect it to work well in practice.

Overall, using agent reward as the auxiliary reward seems like a bad idea to me, and I do not endorse it as the "current-best definition" of AUP or the default impact measure we should be using. I am puzzled and disappointed by this conclusion to the sequence.

Comment by Vika on AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah · 2020-05-16T14:22:55.589Z · LW · GW

After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is "change in my ability to get what I want", i.e. change in the true human utility function. This is a broad statement that does not specify how to measure "change", in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not apply an absolute value to the difference. This is a specific and nonstandard instantiation of the impact concept, and the undesirable property you described does not hold for other instantiations - e.g. using a stepwise inaction baseline and an absolute value: Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|. So I don't think it's fair to argue based on this instantiation that it doesn't make sense to regularize the RI notion of impact.
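The difference between the two instantiations can be made concrete in a toy sketch (`V` is a hypothetical deterministic value function introduced here for illustration; none of this code is from the sequence itself):

```python
def impact_stepwise_abs(V, s, action, noop):
    """Stepwise inaction baseline with an absolute value:
    Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|.
    (V here is deterministic, so the expectations reduce to values.)"""
    return abs(V(s, action) - V(s, noop))

def impact_prev_state(V, s_prev, s, action):
    """Previous-state baseline with a signed difference -- the specific
    instantiation argued against above."""
    return V(s, action) - V(s_prev, "noop")

# Toy example: the world's value decays from 5 to 3 on its own;
# the agent merely waits.
V = lambda s, a: s + (1 if a == "help" else 0)

# Stepwise inaction + absolute value: waiting has zero impact, so the
# world's own decay is not attributed to the agent.
assert impact_stepwise_abs(V, s=3, action="wait", noop="noop") == 0

# Previous-state baseline, signed difference: the same decay shows up
# as impact attributed to the agent.
assert impact_prev_state(V, s_prev=5, s=3, action="wait") == -2
```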

I think that AUP-the-method and RR are also instantiations of the RI notion of impact. These methods can be seen as approximating the change in the true human utility function (which is usually unknown) by using some set of utility functions (e.g. random ones) to cover the possible outcomes that could be part of the true human utility function. Thus, they instantiate the idealized notion of impact using the actually available information.

Comment by Vika on Announcing Web-TAISU, May 13-17 · 2020-05-07T21:06:47.483Z · LW · GW

Thanks Linda for organizing, looking forward to it!

Comment by Vika on (In)action rollouts · 2020-02-18T15:35:09.708Z · LW · GW

I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:

  • What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following the noop policy forever, thus never following any other policy at any point.
  • If A follows one policy for some number of turns and then switches to another, how are the policies and the switching time chosen?
  • It's not clear to me that SA can act to ensure the baseline value of the auxiliary reward for all the relevant parameter values unless it does nothing.

I think it might help to illustrate this proposal in your original gridworld example to make it clearer what's going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.

Comment by Vika on Stepwise inaction and non-indexical impact measures · 2020-02-18T12:24:31.166Z · LW · GW

I don't think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don't need to tell apart other states that are not the starting state.

I agree that we need to compare to the penalty if the subagent is not created - I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).

As you mention in your other comment, creating a subagent is effectively switching from a stepwise inaction baseline to an inaction baseline for the rest of the episode. This can be beneficial for the agent because of the 'winding road' problem, where the stepwise baseline with inaction rollouts can repeatedly penalize actions (e.g. turning the wheel to stay on the road and avoid crashing) that are not penalized by the inaction baseline. This is a general issue with inaction rollouts that needs to be fixed.

Comment by Vika on Stepwise inaction and non-indexical impact measures · 2020-02-17T22:11:48.106Z · LW · GW

I think this problem is about capturing delayed effects of the agent's actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don't think it's useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).
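A minimal sketch of this comparison (hypothetical environment and function names, not the exact formalism from the post):

```python
def rollout(env_step, state, first_action, k):
    """Apply first_action, then k noops; return the resulting state."""
    s = env_step(state, first_action)
    for _ in range(k):
        s = env_step(s, "noop")
    return s

def delayed_penalty(env_step, aux_reward, state, action, horizon):
    """Inaction rollouts: for each k, compare the auxiliary reward after
    (action + k noops) against (k+1 noops)."""
    return sum(
        abs(aux_reward(rollout(env_step, state, action, k))
            - aux_reward(rollout(env_step, state, "noop", k)))
        for k in range(horizon))

# Toy delayed effect: 'plant' lights a fuse that detonates after 3 steps.
def env_step(state, action):
    if action == "plant":
        return ("fuse", 3)
    if isinstance(state, tuple):  # fuse counting down
        return "boom" if state[1] == 1 else ("fuse", state[1] - 1)
    return state

aux = lambda s: 1.0 if s == "boom" else 0.0
```

With a horizon of 5, `delayed_penalty(env_step, aux, "idle", "plant", 5)` is nonzero because the rollouts at k ≥ 3 see the detonation, whereas a stepwise baseline with no rollouts would compare only the immediate next states and miss the delayed effect entirely.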

Here, the inaction rollouts don't capture the effects of the subagent, and thus the agent doesn't receive a penalty for creating the subagent. There are several factors that contribute to this:

1. The auxiliary reward R only rewards the agent itself pressing the red button, not the red button being pressed in general, so it would not be triggered by the subagent pressing it. 

2. The subagent never goes to the red button - it only goes to the blue button.

The agent would be penalized for creating the subagent if both 1 and 2 were false. If 1 was false, the auxiliary reward would be given whenever the red button was pressed. If 2 was also false, the inaction rollout after subagent creation would include a state where the subagent presses the red button, and so the agent would be penalized.

In the general case, the agent can only get away with creating subagents that don't achieve any auxiliary rewards, which I think limits the magnitude of this problem. The designer can choose auxiliary rewards (that are not achieved by the agent doing nothing) such that a subagent will achieve at least one of them. For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized. 
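A minimal sketch of that last construction (hypothetical state labels):

```python
def make_aux(start_state):
    """Auxiliary reward: 1 for reaching any state other than the state
    before subagent creation, 0 otherwise."""
    return lambda s: 0.0 if s == start_state else 1.0

start = "s0"
aux = make_aux(start)

# No-subagent inaction rollout: the world stays in the start state.
no_subagent = [start, start, start]
# Subagent inaction rollout: the subagent acts, so some state differs.
with_subagent = [start, "s1", "s1"]

penalty = sum(abs(aux(a) - aux(b))
              for a, b in zip(with_subagent, no_subagent))
```

Here `penalty` is 2.0: any deviation of the subagent rollout from the start state triggers the auxiliary reward, so subagent creation is penalized as soon as the subagent does anything.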

Comment by Vika on Building and using the subagent · 2020-02-13T15:10:41.732Z · LW · GW

Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too. 

I'm wondering whether this issue is entirely caused by the stepwise baseline (which is indexed on the agent, as you point out), or whether the optionality-based deviation measures (RR and AU) contribute to it as well. So far I'm adding this to my mental list of issues with the stepwise baseline (along with the "car on a winding road" scenario) that need to be fixed.

Comment by Vika on Specification gaming examples in AI · 2019-12-20T16:22:35.363Z · LW · GW

I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.

Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I've seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful - this helps keep it current and comprehensive, and people can feel some ownership of the list since they can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.

This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I've probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low-hanging fruit in the safety resources space that we haven't picked yet.

I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making salient just how many ways things could go wrong. When presented with an individual example of specification gaming, people often have a default reaction of "well, you can just close the loophole like this". It's easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I've found this useful for communicating a sense of the difficulty and importance of Goodhart's Law.

On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don't currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs.

I'm curious what people find the list useful for - as a safety outreach tool, a research tool or intuition pump, or something else? I'd also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks everyone who has contributed to the resource so far!

Comment by Vika on Specification gaming examples in AI · 2019-12-17T13:53:39.394Z · LW · GW

Thanks Ben! I'm happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list:

Comment by Vika on Thoughts on "Human-Compatible" · 2019-10-21T14:55:59.796Z · LW · GW

Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.

Comment by Vika on Classifying specification problems as variants of Goodhart's Law · 2019-08-29T11:03:02.984Z · LW · GW

Thanks Evan, glad you found this useful! The connection with the inner/outer alignment distinction seems interesting. I agree that the inner alignment problem falls in the design-emergent gap. Not sure about the outer alignment problem matching the ideal-design gap though, since I would classify tampering problems as outer alignment problems, caused by flaws in the implementation of the base objective.

Comment by Vika on Reversible changes: consider a bucket of water · 2019-08-29T10:50:59.927Z · LW · GW

I think the discussion of reversibility and molecules is a distraction from the core of Stuart's objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn't valuable (e.g. whether it has sentimental value to someone).

If AUP is not value-agnostic, it is using human preference information to fill in the "what we want" part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agreement.

If AUP is value-agnostic, it is not using human preference information. Then I don't see how the state representation/ontology invariance property helps to distinguish between the two cases. As discussed in this comment, state representation invariance holds over all representations that are consistent with the true human reward function. Thus, you can distinguish the two cases as long as you are using one of these reward-consistent representations. However, since a value-agnostic impact measure does not have access to the true reward function, you cannot guarantee that the state representation you are using to compute AUP is in the reward-consistent set. Then, you could fail to distinguish between the two cases, giving the same penalty for kicking a more or less valuable bucket.

Comment by Vika on Reversible changes: consider a bucket of water · 2019-08-28T11:40:45.010Z · LW · GW

Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn't kick the bucket:

  • Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward. For example, if the agent's goal is to put out a fire on the other end of the pool, you would set a low weight on the impact penalty, which enables the agent to take irreversible actions in order to achieve the goal. This is why impact measures use a reward-penalty tradeoff rather than a constraint on irreversible actions.
  • Absolute value of the bucket contents can be represented by adding weights on the reachable states or attainable utility functions. This doesn't necessarily require defining human preferences or providing human input, since human preferences can be inferred from the starting state. I generally think that impact measures don't have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
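The first point - a reward-penalty tradeoff rather than a hard constraint on irreversible actions - can be sketched as follows (hypothetical numbers):

```python
def objective(reward, penalty, weight):
    """The agent maximizes reward minus a weighted impact penalty.
    A low weight permits irreversible actions when enough reward is at
    stake; a high weight effectively forbids them."""
    return reward - weight * penalty

fire_reward, spill_penalty = 10.0, 4.0
# Low weight: putting out the fire outweighs spilling the bucket.
kick = objective(fire_reward, spill_penalty, weight=0.5)   # 8.0 > 0
# High weight: the spill dominates, so the agent leaves the bucket alone.
dont = objective(fire_reward, spill_penalty, weight=5.0)   # -10.0 < 0
```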

Comment by Vika on Stable Pointers to Value: An Agent Embedded in Its Own Utility Function · 2019-08-19T14:23:29.748Z · LW · GW

Thanks Abram for this sequence - for some reason I wasn't aware of it until someone linked to it recently.

Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.

Comment by Vika on The AI Timelines Scam · 2019-07-22T19:46:43.971Z · LW · GW

Definitely agree that the AI community is not biased towards short timelines. Long timelines are the dominant view, while the short timelines view is associated with hype. Many researchers are concerned about the field losing credibility (and funding) if the hype bubble bursts, and this is especially true for those who experienced the AI winters. They see the long timelines view as appropriately skeptical and more scientifically respectable.

Some examples of statements that AGI is far away from high-profile AI researchers:

Geoffrey Hinton:

Yann LeCun:

Yoshua Bengio:

Rodney Brooks:

Comment by Vika on TAISU - Technical AI Safety Unconference · 2019-07-06T10:31:39.952Z · LW · GW

Janos and I are coming for the weekend part of the unconference.

Comment by Vika on Risks from Learned Optimization: Introduction · 2019-07-03T13:55:16.054Z · LW · GW

I'm confused about the difference between a mesa-optimizer and an emergent subagent. A "particular type of algorithm that the base optimizer might find to solve its task" or a "neural network that is implementing some optimization process" inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?

Comment by Vika on Best reasons for pessimism about impact of impact measures? · 2019-05-11T03:50:41.229Z · LW · GW

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but intends to capture some general measure of the agent's influence over the environment that does not depend on the state representation.

Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) -> observations (e.g. pixels) -> state representation / coarse-graining (which defines macrostates as equivalence classes over observations) -> featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.

Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representation is different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance was defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.

An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
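A minimal sketch of this invariance argument (hypothetical vase example):

```python
# Coarsest representation consistent with a true reward that only cares
# about vase presence: both colours map to the same macrostate.
coarse = {"blue_vase": "vase", "green_vase": "vase", "no_vase": "empty"}

def aux_reward_coarse(macrostate):
    """Attainable-set reward defined over the coarse representation R."""
    return 1.0 if macrostate == "vase" else 0.0

# Refining the representation (distinguishing colours) does not change the
# measure: the lifted reward gives the same value to blue and green vases.
lifted = {obs: aux_reward_coarse(coarse[obs]) for obs in coarse}
assert lifted["blue_vase"] == lifted["green_vase"]

# A random reward defined directly over the finer representation can
# distinguish the colours, so that measure is not representation-invariant.
fine_random = {"blue_vase": 0.7, "green_vase": 0.2, "no_vase": 0.0}
assert fine_random["blue_vase"] != fine_random["green_vase"]
```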

An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.

Comment by Vika on Best reasons for pessimism about impact of impact measures? · 2019-05-03T13:44:31.337Z · LW · GW

There are various parts of your explanation that I find vague and could use a clarification on:

  • "AUP is not about state" - what does it mean for a method to be "about state"? Same goes for "the direct focus should not be on the state" - what does "direct focus" mean here?
  • "Overfitting the environment" - I know what it means to overfit a training set, but I don't know what it means to overfit an environment.
  • "The long arms of opportunity cost and instrumental convergence" - what do "long arms" mean?
  • "Wirehead a utility function" - is this the same as optimizing a utility function?
  • "Cut out the middleman" - what are you referring to here?

I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.

I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.

Comment by Vika on Best reasons for pessimism about impact of impact measures? · 2019-05-02T17:01:46.746Z · LW · GW

Thanks for the detailed explanation - I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn't understand the state representation invariance claim in the AUP proposal, though I didn't realize that it is as central to the proposal as you describe here.

I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I'm not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case.

I agree with Daniel's comment that if AUP is not penalizing effects on the world, then it is confusing to call it an 'impact measure', and something like 'optimization regularization' would be better.

Since I still have lingering confusions after your latest explanation, I would really appreciate if someone else who understands this could explain it to me.

Comment by Vika on Best reasons for pessimism about impact of impact measures? · 2019-04-22T17:36:14.246Z · LW · GW

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

I don't quite understand what you're asking here, could you clarify?

If you had to pose the problem of impact measurement as a question, what would it be?

Something along the lines of: "How can we measure to what extent the agent is changing the world in ways that we care about?". Why?