When discussing AI risks, talk about capabilities, not intelligence 2023-08-11T13:38:48.844Z
[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy 2023-03-07T11:55:01.131Z
Power-seeking can be probable and predictive for trained agents 2023-02-28T21:10:25.900Z
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques 2022-11-25T14:36:08.948Z
Threat Model Literature Review 2022-11-01T11:03:22.610Z
Clarifying AI X-risk 2022-11-01T11:03:01.144Z
DeepMind alignment team opinions on AGI ruin arguments 2022-08-12T21:06:40.582Z
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms 2022-08-12T15:17:38.304Z
Paradigms of AI alignment: components and enablers 2022-06-02T06:19:59.378Z
ELK contest submission: route understanding through the human ontology 2022-03-14T21:42:26.952Z
Optimization Concepts in the Game of Life 2021-10-16T20:51:35.821Z
Tradeoff between desirable properties for baseline choices in impact measures 2020-07-04T11:56:04.239Z
Possible takeaways from the coronavirus pandemic for slow AI takeoff 2020-05-31T17:51:26.437Z
Specification gaming: the flip side of AI ingenuity 2020-05-06T23:51:58.171Z
Classifying specification problems as variants of Goodhart's Law 2019-08-19T20:40:29.499Z
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z
New safety research agenda: scalable agent alignment via reward modeling 2018-11-20T17:29:22.751Z
Discussion on the machine learning approach to AI safety 2018-11-01T20:54:39.195Z
New DeepMind AI Safety Research Blog 2018-09-27T16:28:59.303Z
Specification gaming examples in AI 2018-04-03T12:30:47.871Z
Using humility to counteract shame 2016-04-15T18:32:44.123Z
To contribute to AI safety, consider doing AI research 2016-01-16T20:42:36.107Z
[LINK] OpenAI doing an AMA today 2016-01-09T14:47:30.310Z
[LINK] The Top A.I. Breakthroughs of 2015 2015-12-30T22:04:01.202Z
Future of Life Institute is hiring 2015-11-17T00:34:03.708Z
Fixed point theorem in the finite and infinite case 2015-07-06T01:42:56.000Z
Negative visualization, radical acceptance and stoicism 2015-03-27T03:51:49.635Z
Future of Life Institute existential risk news site 2015-03-19T14:33:18.943Z
Open and closed mental states 2014-12-26T06:53:26.244Z
[MIRIx Cambridge MA] Limiting resource allocation with bounded utility functions and conceptual uncertainty 2014-10-02T22:48:37.564Z
Meetup : Robin Hanson: Why is Abstraction both Statusful and Silly? 2014-07-13T06:18:48.396Z
New organization - Future of Life Institute (FLI) 2014-06-14T23:00:08.492Z
Meetup : Boston - Computational Neuroscience of Perception 2014-06-10T20:32:02.898Z
Meetup : Boston - Taking ideas seriously 2014-05-28T18:58:57.537Z
Meetup : Boston - Defense Against the Dark Arts: the Ethics and Psychology of Persuasion 2014-05-28T17:58:44.680Z
Meetup : Boston - An introduction to digital cryptography 2014-05-13T18:04:19.023Z
Meetup : Boston - Two Parables on Language and Philosophy 2014-04-15T12:10:14.008Z
Meetup : Boston - Schelling Day 2014-03-27T17:08:50.148Z
Strategic choice of identity 2014-03-08T16:27:22.728Z
Meetup : Boston - Optimizing Empathy Levels 2014-02-26T23:44:02.830Z
Meetup : Boston: In Defence of the Cathedral 2014-02-14T19:31:52.824Z
Meetup : Boston - Connection Theory 2014-01-16T21:09:29.111Z
Meetup : Boston - Aversion factoring and calibration 2014-01-13T23:24:15.085Z
Meetup : Boston - Macroeconomic Theory (Joe Schneider) 2014-01-07T02:49:44.203Z
Ritual Report: Boston Solstice Celebration 2013-12-27T15:28:34.052Z
Meetup : Boston - Greens Versus Blues 2013-12-20T21:07:04.671Z
Meetup : Boston Winter Solstice 2013-12-17T06:56:27.729Z
Meetup : Boston/Cambridge - The Attention Economy 2013-12-04T03:06:38.970Z
Meetup : Boston / Cambridge - The future of life: a cosmic perspective (Max Tegmark), Dec 1 2013-11-23T17:55:39.649Z
Meetup : Boston / Cambridge - Systems, Leverage, and Winning at Life 2013-11-23T17:48:50.403Z


Comment by Vika on More Is Different for AI · 2024-01-15T10:30:18.827Z · LW · GW

I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience. 

Comment by Vika on Counterarguments to the basic AI x-risk case · 2024-01-12T11:19:13.228Z · LW · GW

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency). 

Comment by Vika on Refining the Sharp Left Turn threat model, part 1: claims and mechanisms · 2023-12-20T19:52:17.210Z · LW · GW

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims: 

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim 2 (alignment techniques stop working). 
  • It probably relies on some weaker version of Claim 2 (alignment techniques failing to apply to more powerful systems in some way). This seems necessary for deceptive alignment to arise, e.g. if our interpretability techniques fail to detect deceptive reasoning. However, I expect that most ways this could happen would not be due to the alignment techniques being fundamentally inadequate for the capability transition to more powerful systems (the strong version of Claim 2 used in SLT).

Comment by Vika on Clarifying AI X-risk · 2023-12-20T18:17:42.482Z · LW · GW

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), which I would summarize as "non-deceptiveness is anti-natural / hard to disentangle from general capabilities". I would probably put this under "SG → MAPS", assuming an irreducible kind of specification gaming where it's very difficult (or impossible) to distinguish deceptiveness from non-deceptiveness (including through feedback on the model's reasoning process). Though it could also be GMG, where the "non-deceptiveness" concept is incoherent and thus very difficult to generalize well. 

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2023-12-20T15:27:42.609Z · LW · GW

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on last year's developments:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it work? 
    • There has been a lot of progress on AGI governance and broad endorsement of the risks this year, so I feel somewhat more optimistic about global cooperation than a year ago.
  • Will we know how capable our models are? 
    • The field has made some progress on designing concrete capability evaluations - how well they measure the properties we are interested in remains to be seen.
  • Will systems acquire the capability to be useful for alignment / cooperation before or after the capability to perform advanced deception? 
    • At least so far, deception and manipulation capabilities seem to be lagging a bit behind usefulness for alignment (e.g. model-written evals / critiques, weak-to-strong generalization), but this could change in the future. 
  • Is consequentialism a powerful attractor? How hard will it be to avoid arbitrarily consequentialist systems?
    • Current SOTA LLMs seem surprisingly non-consequentialist for their level of capability. I still expect LLMs to be one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems. 

I hoped to see other groups do the survey as well - looks like this didn't happen, though a few people asked me to share the template at the time. It would be particularly interesting if someone ran a version of the survey with separate ratings for "agreement with the statement" and "agreement with the implications for risk". 

Comment by Vika on When discussing AI risks, talk about capabilities, not intelligence · 2023-08-11T15:34:48.289Z · LW · GW

I agree that a possible downside of talking about capabilities is that people might assume they are uncorrelated and we can choose not to create them. It does seem relatively easy to argue that deception capabilities arise as a side effect of building language models that are useful to humans and good at modeling the world, as we are already seeing with examples of deception / manipulation by Bing etc. 

I think the people who think we can avoid building systems that are good at deception often don't buy the idea of instrumental convergence either (e.g. Yann LeCun), so I'm not sure that arguing for correlated capabilities in terms of intelligence would have an advantage. 

Comment by Vika on Steering GPT-2-XL by adding an activation vector · 2023-07-25T15:08:34.724Z · LW · GW

Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-06-06T19:47:04.210Z · LW · GW

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seem to rule out that shards can be modeled as contextually activated subagents with utility functions.) 

An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may not apply. I think defining your terms and making your arguments more formal should be a high priority. I'm not advocating spending hundreds of hours proving theorems, but moving in the direction of formalizing definitions and claims would be quite valuable. 

It seems like a bad sign that the most clear and precise summary of shard theory claims was written by someone outside your team. I highly agree with this takeaway from that post: "Making a formalism for shard theory (even one that’s relatively toy) would probably help substantially with both communicating key ideas and also making research progress." This work has a lot of research debt, and paying it off would really help clarify the disagreements around these topics. 

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-06-06T19:27:30.559Z · LW · GW

Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like "We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts". 

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-06-06T15:56:15.097Z · LW · GW

Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.

I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the "strong distributional shift" assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive". 

The hypothesis you mentioned seems compatible with the assumptions of this post. When you say "the policy develops motivations related to obvious correlates of its historical reinforcement signals", these "motivations" seem like a kind of training-compatible goals (if defined more broadly than in this post). I would expect that a system that pursues these motivations in new situations would exhibit some power-seeking tendencies because those correlate with a lot of reinforcement signals. 

I suspect a lot of the disagreement here comes from different interpretations of the "internal representations of goals" assumption, I will try to rephrase that part better. 

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-06-06T14:45:14.791Z · LW · GW

The internal representations assumption was meant to be pretty broad, I didn't mean that the network is explicitly representing a scalar reward function over observations or anything like that - e.g. these can be implicit representations of state features. I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits. 

Comment by Vika on TurnTrout's shortform feed · 2023-06-05T19:49:17.651Z · LW · GW

Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post.

Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

(Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)

Comment by Vika on TurnTrout's shortform feed · 2023-06-02T11:03:27.994Z · LW · GW

Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk.

Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would not expect to work because of instrumental convergence. (There's a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.)

The subtleties that you discuss are important in general, but don't seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-05-31T15:16:48.299Z · LW · GW

Here is my guess on how shard theory would affect the argument in this post:

  1. In my understanding, shard theory would predict that the model learns multiple goals from the training-compatible (TC) set (e.g. including both the coin goal and the go-right goal in CoinRun), and may pursue different learned goals in different new situations. The simplifying assumption that the model pursues a randomly chosen goal from the TC set also covers this case, so this doesn't affect the argument. 
  2. Shard theory might also imply that the training-compatible set should be larger, e.g. including goals for which the agent's behavior is not optimal. I don't think this affects the argument, since we just need the TC set to satisfy the condition that permuting reward values in  will produce a reward vector that is still in the TC set. 

So I think that assuming shard theory in this post would lead to the same conclusions - would be curious if you disagree. 

Comment by Vika on When is Goodhart catastrophic? · 2023-05-25T14:13:11.254Z · LW · GW

Great post! I especially enjoyed the intuitive visualizations for how the heavy-tailed distributions affect the degree of overoptimization of X. 

As a possibly interesting connection, your set of criteria for an alignment plan can also be thought of as criteria for selecting a model specification that approximates the ideal specification well, especially trying to ensure that the approximation error is light-tailed. 

Comment by Vika on Yoshua Bengio: How Rogue AIs may Arise · 2023-05-24T09:48:34.945Z · LW · GW

David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously.

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-05-20T09:10:09.337Z · LW · GW

Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

This post assumes a standard RL setup and is not intended to apply to LLMs (it's possible some version of this result may hold for fine-tuned LLMs, but that's outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all. 

I agree that reward functions are not the best way to refer to possible goals. This post builds on the formalism in the power-seeking paper which is based on reward functions, so it was easiest to stick with this terminology. I can talk about utility functions instead (which would be equivalent to value functions in this case) but this would complicate exposition. I think it is pretty clear in the post that I'm not talking about reinforcement functions and the training reward is not the optimization target, but I could clarify this further if needed.

I find the idea of a training-compatible goal set useful for thinking about the possible utilities that are consistent with feedback received during training. I think utility functions are still the best formalism we have to represent goals, and I don't have a clear sense of the alternative you are proposing. I understand what kind of object a utility function is, and I don't understand what kind of object a value shard is. What is the type signature of a shard - is it a policy, a utility function restricted to a particular context, or something else? When you are talking about a "partial encoding of a goal in the network", what exactly do you mean by a goal? 

I would be curious what predictions shard theory makes about the central claim of this post. I have a vague intuition that power-seeking would be useful for most contextual goals that the system might have, so it would still be predictive to some degree, but I don't currently see a way to make that more precise. 

I've read a few posts on shard theory, and it seems very promising and interesting, but I don't really understand what its claims and predictions are. I expect I will not have a good understanding or be able to apply the insights until there is a paper that makes the definitions and claims of this theory precise and specific. (Similarly, I did not understand your power-seeking theory work until you wrote a paper about it.) If you're looking to clarify the discourse around RL processes, I believe that writing a definitive reference on shard theory would be the most effective way to do so. I hope you take the time to write one and I really look forward to reading it. 

Comment by Vika on Power-seeking can be probable and predictive for trained agents · 2023-04-20T12:57:13.065Z · LW · GW

Which definition / result are you referring to?

Comment by Vika on [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy · 2023-03-10T11:33:55.840Z · LW · GW

We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2).

Comment by Vika on [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy · 2023-03-07T12:07:15.567Z · LW · GW

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

Comment by Vika on Optimization Concepts in the Game of Life · 2023-01-16T15:54:33.901Z · LW · GW

Thanks Alex for the detailed feedback! I have updated the post to fix these errors. 

Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work. 

Comment by Vika on Imitative Generalisation (AKA 'Learning the Prior') · 2023-01-16T15:27:39.162Z · LW · GW

This post provides a maximally clear and simple explanation of a complex alignment scheme. I read the original "learning the prior" post a few times but found it hard to follow. I only understood how the imitative generalization scheme works after reading this post (the examples and diagrams and clear structure helped a lot). 

Comment by Vika on Saving Time · 2023-01-16T14:47:53.175Z · LW · GW

This post helped me understand the motivation for the Finite Factored Sets work, which I was confused about for a while. The framing of agency as time travel is a great intuition pump. 

Comment by Vika on Selection Theorems: A Program For Understanding Agents · 2023-01-16T12:22:11.432Z · LW · GW

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems. 

Comment by Vika on Finding gliders in the game of life · 2022-12-08T15:29:00.794Z · LW · GW

+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix -  this seems like it would unnecessarily discourage people from reading it. 

Comment by Vika on The Plan - 2022 Update · 2022-12-02T15:08:45.618Z · LW · GW

It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods? 

Comment by Vika on Refining the Sharp Left Turn threat model, part 2: applying alignment techniques · 2022-11-25T17:40:58.016Z · LW · GW

I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts. 

Comment by Vika on Takeaways from a survey on AI alignment resources · 2022-11-10T16:50:31.332Z · LW · GW

Too bad that my list of AI safety resources didn't make it into the survey - would be good to know to what extent it would be useful to keep maintaining it. Will you be running future iterations of this survey? 

Comment by Vika on Simulators · 2022-10-25T21:19:11.923Z · LW · GW

I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tuned with RL, which creates agentic incentives on the simulator level as well. 

You make a good point about the difficulty of identifying dangerous models if the danger is triggered by very specific prompts. I think this may go both ways though, by making it difficult for a simulated agent to execute a chain of dangerous behaviors, which could be interrupted by certain inputs from the user. 

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-10-25T21:07:36.101Z · LW · GW

I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic. 

Comment by Vika on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-10-09T16:00:33.166Z · LW · GW

No worries! Thanks a lot for updating the post

Comment by Vika on The alignment problem from a deep learning perspective · 2022-09-16T19:46:43.759Z · LW · GW

Thanks Richard for this post, it was very helpful to read! Some quick comments:

  • I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
  • The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
  • Phase 1 and 2 seem to map to outer and inner alignment respectively. 
  • Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? The post "How likely is deceptive alignment?" seems to argue that they may not occur, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned. 
  • I'm confused why mechanistic interpretability is listed under phase 3 in the research directions - surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 the deceptively aligned model will get around the interpretability techniques. 

Comment by Vika on Simulators · 2022-09-13T17:10:20.579Z · LW · GW

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically targeted at simulated agents. For example, a simulated agent might need some level of persistence within the simulation to execute these behaviors, and we may be able to influence the simulator to generate less persistent agents. 


Comment by Vika on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-13T16:38:58.864Z · LW · GW

I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc alignment team can influence the corresponding org with 20% probability, then the overall probability of Ought influencing the org that builds AGI is 10%. Your estimate of 1% seems too low to me unless you are a lot more pessimistic about alignment researchers influencing their organization from the inside. 
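The arithmetic in this estimate can be sketched as a product of per-step probabilities. This is only an illustration and assumes the two steps are independent, which the comment does not state explicitly; the function and variable names are hypothetical.

```python
def chain_probability(step_probabilities):
    """Multiply per-step success probabilities.

    Assumption (not stated in the comment): the steps are independent,
    so the chance that every step succeeds is the product of the parts.
    """
    result = 1.0
    for p in step_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        result *= p
    return result


# Figures taken from the comment above; the names are illustrative.
p_across_teams = 0.5  # Ought influences another org's alignment team
p_within_org = 0.2    # that alignment team influences its own org

overall = chain_probability([p_across_teams, p_within_org])
print(f"{overall:.0%}")  # prints "10%"
```

Under the same independence assumption, an overall estimate of 1% would require the product of the two steps to be ten times smaller, for example 50% across teams and only 2% within the org.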

Comment by Vika on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-09-13T16:21:21.262Z · LW · GW

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". 

Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

Comment by Vika on Toni Kurz and the Insanity of Climbing Mountains · 2022-08-25T18:46:37.778Z · LW · GW

This post resonates with me on a personal level, since my mother was really into mountain climbing in her younger years. She quit after seeing a friend die in front of her (another young woman who broke her neck against an opposing rock face in an unlucky fall). It seems likely I wouldn't be here otherwise. Happy to report that she is still enjoying safer mountain activities 50 years later. 

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-17T17:56:46.934Z · LW · GW

Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-17T16:59:16.224Z · LW · GW

Hmm, thanks... Can you elaborate on what "this" is?

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-17T16:57:12.095Z · LW · GW

We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects.

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-17T16:20:16.775Z · LW · GW

Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments). 

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-17T15:46:52.532Z · LW · GW

Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy. 
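
For anyone who wants to automate the sorting outside the spreadsheet, here is a rough sketch in Python. It assumes the ratings have been exported as one list of numbers per statement, with higher numbers meaning more agreement. The statement names and ratings below are made up for illustration and are not from the actual survey.

```python
from statistics import mean

# Hypothetical export: one row per statement, one rating per person.
ratings = {
    "Statement A": [2, 1, 2],
    "Statement B": [5, 4, 5],
    "Statement C": [3, 4, 3],
}

# Sort statements from most to least agreement by mean rating.
sorted_rows = sorted(ratings.items(), key=lambda row: mean(row[1]), reverse=True)

for statement, scores in sorted_rows:
    print(f"{statement}: mean rating {mean(scores):.2f}")
```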

Comment by Vika on Paradigms of AI alignment: components and enablers · 2022-08-13T11:59:55.461Z · LW · GW

Thanks, glad you found the post useful!

Maintaining uncertainty over the goal allows the system to model the set of goals that are consistent with the training data, notice when they disagree with each other out of distribution, and resolve that disagreement in some way (e.g. by deferring to a human). 
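
As a toy illustration of this idea (not a description of any real system), the sketch below keeps every candidate goal that fits the training data, and defers to a human when the surviving goals disagree on a new input. The candidate goals, the data, and the function names `consistent_goals` and `act_or_defer` are all hypothetical.

```python
def consistent_goals(candidate_goals, training_data):
    """Keep the candidate goals that agree with every labelled training example."""
    return [g for g in candidate_goals
            if all(g(x) == label for x, label in training_data)]

def act_or_defer(goals, x):
    """Return the goals' shared verdict on x, or defer if they do not all agree."""
    verdicts = {g(x) for g in goals}
    if len(verdicts) == 1:
        return verdicts.pop()
    return "defer to human"

# Hypothetical candidate goals; the training data cannot tell the first two apart.
candidate_goals = [
    lambda x: x > 0,        # "approve positive numbers"
    lambda x: 0 < x < 10,   # "approve small positive numbers"
    lambda x: x % 2 == 0,   # "approve even numbers" (ruled out by the data)
]
training_data = [(1, True), (5, True), (-3, False)]

goals = consistent_goals(candidate_goals, training_data)
print(len(goals))               # 2 goals remain consistent with the data
print(act_or_defer(goals, 3))   # in distribution: both say True
print(act_or_defer(goals, 50))  # out of distribution: they disagree, so defer
```

A system that committed to a single goal after training would have no way to notice that the input 50 is ambiguous; keeping the whole consistent set is what makes the disagreement visible.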

Comment by Vika on DeepMind alignment team opinions on AGI ruin arguments · 2022-08-12T21:15:59.077Z · LW · GW


Comment by Vika on Gradations of Agency · 2022-07-20T14:18:38.455Z · LW · GW

Ah, I think you intended level 6 as an OR of learning from imitation / imagined experience, while I interpreted it as an AND. I agree that humans learn from imitation on a regular basis (e.g. at school). In my version of the hierarchy, learning from imitation and imagined experience would be different levels (e.g. level 6 and 7) because the latter seems a lot harder. In your decision theory example, I think a lot more people would be able to do the imitation part than the imagined experience part. 

Comment by Vika on Gradations of Agency · 2022-07-20T09:41:26.142Z · LW · GW

I think some humans are at level 6 some of the time (see Humans Who Are Not Concentrating Are Not General Intelligences). I would expect that learning cognitive algorithms from imagined experience is pretty hard for many humans (e.g. examples in the Astral Codex post about conditional hypotheticals). But maybe I have a different interpretation of Level 6 than what you had in mind?

Comment by Vika on Gradations of Agency · 2022-07-19T14:59:20.105Z · LW · GW

This is an interesting hierarchy! I'm wondering how to classify humans and various current ML systems along this spectrum. My quick take is that most humans are at Levels 4-5, AlphaZero is at level 5, and GPT-3 is at level 4 with the right prompting. Curious if you have specific ML examples in mind for these levels. 

Comment by Vika on Examples of AI Increasing AI Progress · 2022-07-19T09:53:40.680Z · LW · GW

Makes sense, thanks. I think the current version of the list is not a significant infohazard since the examples are well-known, but I agree it's good to be cautious. (I tweeted about it to try to get more examples, but it didn't get much uptake; happy to delete the tweet if you prefer.) Focusing on outreach to people who care about AI risk seems like a good idea. It might also be useful for nudging researchers who don't work on AI safety because of long timelines to start working on it.

Comment by Vika on Examples of AI Increasing AI Progress · 2022-07-18T18:20:14.551Z · LW · GW

Really excited to see this list, thanks for putting it together! I shared it with the DM safety community and tweeted about it here, so hopefully some more examples will come in. (Would be handy to have a short URL for sharing the spreadsheet btw.)

I can see several ways this list can be useful: 

  • as an outreach tool (e.g. to convince skeptics that recursive self-improvement is real) 
  • for forecasting AI progress
  • for coming up with specific strategies for slowing down AI progress

Curious whether you primarily intend this to be an outreach tool or a resource for AI forecasting / governance. 

Comment by Vika on Looking back on my alignment PhD · 2022-07-04T18:37:20.665Z · LW · GW

I think it's plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective seems plausible to me. 

In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community to not build AGI. I've often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal NeurIPS paper makes a much better reference for that audience than Basic AI Drives).

Comment by Vika on Looking back on my alignment PhD · 2022-07-01T16:46:53.335Z · LW · GW

Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I've often found myself held back by these. 

I agree that impact measures are not super useful for alignment (apart from deconfusion) and I've also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I'm curious why you wish you had stopped working on it sooner.