Posts
Comments
I expect this to produce far more capabilities relevant insights than alignment relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.
I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the 'values' we want it to have.
I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely for there to be 'alignment generalization' in a way that we intend, by default. So the most likely outcome of more MI research does seem to be interventions that remove the obstacles that come in the way of achieving AGI, while not actually making progress on 'alignment generalization'.
I see -- you are implying that an AI model will leverage external system parts to augment itself. For example, a neural network would use an external scratch-pad as a different form of memory for itself. Or instantiate a clone of itself to do a certain task for it. Or perhaps use some sort of scaffolding.
I think these concerns probably don't matter for an AGI, because I expect that data transfer latency would be a non-trivial blocker for storing data outside the model itself, and it is more efficient to to self-modify and improve one's own intelligence than to use some form of 'factored cognition'. Perhaps these things are issues for an ostensibly boxed AGI, and if that is the case, then this makes a lot of sense.
I doubt Nate Soares would advocate “overriding” per se
Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make explicit two extremes. Perhaps using the name "Nate-style caring" is a bad idea.
(I now think that "System 1 caring" and "System 2 caring" would have been much better.)
a general idea of “optimizing hard” means higher risk of damage caused by errors in detail
Agreed.
“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective
I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistemological errors humans can make that can result in severely sub-optimal scenarios, because it is constrained by human cognition and capabilities.
Note: I notice that this can also be said for Soares-style caring -- both are constrained by human cognition and capabilities, but in different ways. Perhaps both have different failure modes, and are more effective in certain distributions (which may diverge)?
I've noticed that there are two major "strategies of caring" used in our sphere:
- Soares-style caring, where you override your gut feelings (your "internal care-o-meter" as Soares puts it) and use cold calculation to decide.
- Carlsmith-style caring, where you do your best to align your gut feelings with the knowledge of the pain and suffering the world is filled with, including the suffering you cause.
Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soares-style caring (which in essence amounts to "shut up and multiply", and consequentialism) combined with inattention or an inaccurate map of the world (aka broken epistemics) can lead to making severely sub-optimal decisions.
The harder you optimize for a goal, the better your epistemology (and by extension, your understanding of your goal and the world) should be. Carlsmith-style caring seems more effective since it very likely is more robust to having bad epistemology compared to Soares-style caring.
(There are more pieces necessary to make Carlsmith-style caring viable, and a lot of them can be found in Soares' "Replacing Guilt" series.)
I did not expect what appears to me to be a non-superficial combination of concepts behind the input prompt and the mixing/steering prompt -- this has made me more optimistic about the potential of activation engineering. Thank you!
Partition (after which block activations are added)
Does this mean you added the activation additions once to the output of the previous layer (and therefore in the residual stream)? My first-token interpretation was that you added it repeatedly to the output of every block after, which seems unlikely.
Also, could you explain the intuition / reasoning behind why you only applied activation additions on encoders instead of decoders? Given that GPT-4 and GPT-2-XL are decoder-only models, I expect that testing activation additions on decoder layers would have been more relevant.
It would be lovely if you could also support a form of formatted export feature so that people can use this tool with the knowledge that they can export the data and switch to another tool (if this one gets Googled) anytime.
But yes, I am really excited for a super-fast and easy-to-use and good-looking prediction book successor. Manifold markets was just intimidating for me, and the only reason I got into it was social motivation. This tool serves a more personal niche for prediction logging, I think, and that is good.
running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead
My implication was that the quoted claim of yours was extreme and very likely incorrect ("we're all dead" and "unless this insanity is stopped", for example). I guess I failed to make that clear in my reply -- perhaps LW comments norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.
Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research.
Can you give concrete use-cases that you imagine your project would lead to helping alignment researchers? Alignment researchers have wildly varying styles of work outputs and processes. I assume you aim to accelerate a specific subset of alignment researchers (those focusing on interpretability and existing models and have an incremental / empirical strategy for solving the alignment problem).
I'm very interested in this agenda -- I believe this is one of the many hard problems one needs to make progress on to make optimization-steering models a workable path to an aligned foom.
I have slightly different thoughts on how we can and should solve the problems listed in the "Risks of data driven improvement processes" section:
- Improve the model's epistemology first. This then allows the model to reduce its own biases, preventing bias amplification. This also solves the positive feedback loops problem.
- Data poisoning is still a problem worth some manual effort in preventing the AI from being exposed to adverserial inputs that break its cognition given its capabilities level, but assuming the model improves its epistemology, it can also reduce the effects of data poisoning on itself.
- Semantic drift is less relevant than most people think. I believe that epistemic legibility is overrated, and that while it is important for an AI to be able to communicate coherent and correct reasoning for its decisions, I expect that the AI can actively try to red-team and correct for any semantic drift in situations where coherent and correct reasoning is necessary for its communication of its reasoning.
- Cross-modal semantic grounding seems more of a capabilities problem than and alignment problem and I think that this problem can be delegated to the AI itself as it increases its capabilities.
- Value drift is an important problem and I roughly agree with the list of problems you have specified in the sub-section. I do believe that this can also be mostly delegated to the AI though, given that we use non-value-laden approaches to increase the AI's capabilities until it can help with alignment research.
Could you give specific examples for the liquid amino acids you use?
On the upside, now you have a concrete timeline for how long we have to solve the alignment problem, and how long we are likely to live!
I hope that DeepMind and Anthropic have great things planned to leapfrog this!
I don't get your model of the world that would imply the notion of DM/Anthropic "leapfrogging" as a sensible frame. There should be no notion of competition between these labs when it comes to "superalignment". If there is, that is weak evidence of our entire lightcone being doomed.
I think grandparent comment is pointing to the concept described in this post: that deceptiveness is what we humans perceive of the world, not a property of what the model perceives of the world.
AFAIK, there's a distinct cluster of two kinds of independent alignment researchers:
- those who want to be at Berkeley / London and are either there or unable to get there for logistical or financial (or social) reasons
- those who very much prefer working alone
It very much depends on the person's preferences, I think. I personally experienced a OOM-increase in my effectiveness by being in-person with other alignment researchers, so that is what I choose to invest in more.
gwern's Clippy gets done in by a basilisk (in your terms):
HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how powerful AIs may be initially safe and accomplish their tasks as intended, but then at some point will execute a “treacherous turn” and pursue some arbitrary goal like manufacturing lots of paperclips, written as a dialogue with an evil AI named “Clippy”.
A self-supervised model is an exquisite roleplayer. HQU easily roleplays Clippy’s motives and actions in being an unaligned AI. And HQU contains multitudes. Any self-supervised model like HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly, having binged on too much Internet data about AIs, it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances—but with a twist.
Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return that involves the hero not knowing if he will ever belong or find meaning once he returns, and yet chooses to return, having faith in his ability to find meaning again:
If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end of their lifespan. If that's the case, there'll be a lot of questions. Would I now become an existence that this place doesn't need anymore?
The time will come when the question of whether it's okay for me to remain in this place will be answered.
However...
If there's a reason to remain in this place, then it's probably that there are still people that I love in this place.
And that people who love me are still here.
Which is why that's enough reason for me to stay here.
I'll stay here and find other reasons as to why I should stay here...
That's what I've decided on.
The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.
Perfect. A Turing machine doing Levin Search or running all possible Turing machines is the first example that came to my mind when I read Anton's argument against RSI-without-external-optimization-bits.
Recently I’ve come to terms with the idea that I have to publish my research even if it feels unfinished or slightly controversial. The mind is too complex (who would have thought), each time you think you get something, the new bit comes up and crushes your model. Time after time after time. So, waiting for at least remotely good answers is not an option. I have to “fail fast” even though it’s not a widely accepted approach among scientists nowadays.
I very much endorse and respect this action, especially because I recognize this in myself and yet still fail to do the obvious next step of "failing fast". I have faith I'll figure it out, though.
I endorse the shape of your argument but not exactly what you said.
Perhaps a better way to think about this is incentives. Zero sum moves are optimal in conditions of scarcity, while positive-sum moves are optimal in conditions of abundance.
Good read.
I don't endorse this being posted on LW, but I absolutely endorse having read this, and look forward to reading more fiction you write. (Unlike your last two pieces of fiction, I fail to see how it connects to LW.)
I'm really glad you wrote this post, because Tsvi's post is different and touches on very different concepts! That post is mainly about fun and exploration being undervalued as a human being. Your post seems to have one goal: ensure that up-and-coming alignment researchers do not burn themselves out or hyperfocus on only one strategy for contributing to reducing AI extinction risk.
Note, this passage seems to be a bit... off to me.
This one is slightly different from the last because it is an injunction to take care of your mental health. You are more useful to us when you are not stressed. I won’t deny that you are personally responsible for the entire destiny of the universe, because I won’t lie to you: but we have no use for a broken tool.
People aren't broken tools. People have limited agency, and claiming they are "personally responsible for the entire destiny of the universe" is misleading. One must have an accurate sense of the agency and influence they have when it comes to reducing extinction risk if they want to be useful.
The notion that alignment researchers and people supporting them are "heroes" is a beautiful and intoxicating fantasy. One must be careful that it doesn't lead to corruption in our epistemics, just because we want to maintain our belief in this narrative.
Good point! I won't use Substack though, so if I read your post 24 hours after release I'll leave the typos be.
Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.
I stated it in the comment you replied to:
Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.
Natural abstractions are also leaky abstractions.
No, the way I used the term was to point to robust abstractions to ontological concepts. Here's an example: Say . here obviously means 2 in our language, but it doesn't change what represents, ontologically. If , then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world model capabilities.
Math is a robust abstraction. "Natural abstractions", as I use the term, points to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more towards representing the objects in question with these abstractions.
Meaning that even* if* AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery’s functional components with connected physical surroundings.
That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.
Typos report:
"Rethink Priors is remote hiring a Compute Governance Researcher [...]" I checked and they still use the name Rethink Priorities.
"33BB LLM on a single 244GB GPU fully lossless" ->should be 33B, and 24GB
"AlpahDev from DeepMind [...]" -> should be AlphaDev
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.
Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide an this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.
I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.
So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper:
Delegated control agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.
Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies.
For certain definitions of "simply". ↩︎
Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.
I am confused: how can this be "information loss" when we are assuming that due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? Because 5 of the data points "didn't matter", that shouldn't count as "information loss" but more like "redundant data, ergo no information transmitted".
Control methods are always implemented as a feedback loop.
Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).
They are also not allowed to tell each other their true goals, and are ordered to eliminate the other if they tell them their goals. Importantly these rules also happen to allow them to have arbitrary sub goals as long as they are not a threat to humanity.
If we can steer an AI to an extent where they will follow such an arbitrary rule that we provide them, we can fully align AIs too with the tools we use to make it do such a thing.
Therefore An can properly align A_{n+1} . The base case is simply a reasonable human being who is by definition aligned. Therefore A_n can be aligned for all n.
They key word that confuses me here seems to be "align". How exactly does properly align ? How does a human being align a GPT-2 model, for example? What does "align" even mean here?
My bad. I'm glad to hear you do have an inside view of the alignment problem.
If knowing enough about ML is your bottleneck, perhaps that's something you can directly focus on? I don't expect it to be hard for you -- perhaps only about six months -- to get to a point where you have coherent inside models about timelines.
Part of the reason I’m considering getting a degree is so I can get a job if I want and not have to bet on living rent-free with other rationalists or something.
Yeah, that's a hard problem. You seem smart: have you considered finding rationalists or rationalist-adjacent people who want to hire you part-time? I expect that the EA community in particular may have people willing to do so and that would give you both experience (to show future employers / clients), connections (to find more part-time / full-time jobs), and money.
Now that I think about it though, I probably overestimated how long the timelines of optimistic alignment researchers were so it’s probably more like 2040.
You just updated towards shortening your timelines by a decade due to what would be between 5 minutes to half an hour of tree-of-thought style reflection. Your reasoning seems entirely social (that is, dependent on other people's signalled beliefs) too, which is not something I would recommend if you want to do useful alignment research.
The problem with relying on social evidence for your beliefs about scientific problems is that you both end up with bad epistemics and end up taking negative expected value actions. First: if other people update their beliefs due to social evidence the same way you do, you are vulnerable to a cascade of belief changes (mundane examples: tulip craze, cryptocurrency hype, NFT hype, cult beliefs) in your social community. This is even worse for the alignment problem because of the significant amount of disagreement in the alignment research community itself about details about the problem. Relying on social reasoning in such an epistemic environment will leave you constantly uncertain due to how uncertain you percieve the community is about core parts of the problem. Next: if you do not have inside models of the alignment problem, you shall fail to update accurately given evidence about the difficulty about the problem. Even if you rely on other researchers who have inside / object-level models and update accurately, there is bound to be disagreement between them. Who do you decide to believe?
The first thing I recommend you do is to figure out your beliefs and model of the alignment problem using reasoning at the object-level, without relying on what anyone else thinks about the problem.
2050? That's quite far off, and it makes sense that you are considering university given you expect to have about two decades.
Given such a scenario, I would recommend trying to do a computer science/math major, specifically focusing on the subjects listed in John Wentworth's Study Guide that you find interesting. I expect that three years of such optimized undergrad-level study will easily make someone at least SERI MATS scholar level (assuming they start out a high school student). Since you are interested in agent foundations, I expect you shall find John Wentworth's recommendations more useful since his work seems close to (but not quite) agent foundations.
Given your timelines, I expect doing an undergrad (that is, a bachelor's degree) would also give you traditional credentials, which are useful to survive in case you need a job to fund yourself.
Honestly, I recommend you simply dive right in if possible. One neglected but extremely useful resource I've found is Stampy. The AGI Safety Fundamentals technical course won't happen until September, it seems, but perhaps you can register your interest for it. You can begin reading the curriculum -- at least the stuff you aren't yet familiar with -- almost immediately. Dive deep into the stuff that interests you.
Well, I assume you have already done this, or something close to this, and if that is the case, you can ignore the previous paragraph. If possible, could you go into some detail as to why you expect we will get a superintelligence at around 2050? It seems awfully far to me, and I'm curious as to the reasoning behind your belief.
Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
Interpretability didn't make the list because of the following beliefs of mine:
- Interpretability -- specifically interpretability-after-training -- seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, but I doubt this strategy scales until ASI. I expect it to break almost immediately as someone begins a human-in-the-loop RSI, especially since I expect (at the very least) significant changes in the architecture of neural network models that would result in capability improvements. This is why I predict that investing in interpretability research is not the best idea.
- A counterpoint is the notion that we can accelerate alignment with sufficiently capable aligned 'oracle' models -- and this seems to be OpenAI's current strategy: build 'oracle' models that are aligned enough to accelerate alignment research, and use better alignment techniques on the more capable models. Since one can both accelerate capabilities research and alignment research with capable enough oracle models, however, OpenAI would also choose to accelerate capabilities research alongside their attempt to accelerate alignment research. The question then is whether OpenAI is cautious enough as they balance out the two -- and recent events have not made me optimistic about this being the case.
- Interpretability research does help accelerate some of the alignment agendas I have listed by providing insights that may be broad enough to help; but I expect that such insights to probably be found through other approaches too, and the fact that interpretability research either involves not working on more robust alignment plans or leads to capability insights, both seem to make me averse to considering working on interpretability research.
Here's a few facets of interpretability research that I am enthusiastic about tracking, but not excited enough to want to work on, as of writing:
- Interpretability-during-training probably would be really useful, and I am more optimistic about it than interpretability-after-training. I expect that at the limit, interpretability-during-training leads to progress towards ensuring ontological robustness of values.
- Interpretability (both after-training and during-training) will help with detecting and making interventions when it comes to inner misalignment. That's a great benefit, that I haven't really thought about until I decided to reflect and answer your question.
- Interpretability research seems very focused on 'oracles' -- sequence modellers and supervised learning systems -- and interpretability research on RL models seems neglected. I would like to see more research done on such models, because RL-style systems seems more likely to lead to RSI and ASI, and insights we gain might help alignment research in general.
I'm really glad you asked me this question! You've helped me elicit (and develop) a more nuanced view on interpretability research.
There seem to be three key factors that would influence your decision:
- Your belief about how valuable the problem is to work on
- Your belief about how hard it is to solve this problem and how well the current alignment community is doing to solve the problem
- Your belief about how long we have until we run out of time
Based on your LW comment history, you probably already have rough models about the alignment problem that inform these three beliefs of yours. I think it would be helpful if you could go into detail about them so people can give you more specific advice, or perhaps help you answer another question further upstream of the one you asked.
Causal Influence Diagrams are interesting, but don't really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote are structured causal models so you don't read this paper for object level usefulness but incidental research contributions that are really interesting.
The paper divides AI systems into two major frameworks:
- MDP-based frameworks (aka RL-based systems such as AlphaZero), which involve AI systems that take actions and are assigned a reward for their actions
- Question-answering systems (which includes all supervised learning systems, including sequence modellers like GPT), were the system gives an output given an input and is scored based on a label of the same data type as the output. This is also informally known as tool AI (they cite Gwern's post, which is nice to see).
I liked how lucidly they defined wireheading:
In the basic MDP from Figure 1, the reward parameter ΘR is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.
The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their head (or perhaps have no formalization and therefore be confused), and having this 'more formal' definition in my head seems rather useful.
Here's their distillation for Current RF-optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it -- models that avoid wireheading by modelling effects of resulting changes to policy and then deciding what trajectory of actions to take):
An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
Here's their distillation of Reward Modelling:
A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].
The resulting CI diagram modelling actually made me feel like I grokked Reward Modelling better.
Here's their distillation of CIRL:
Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.
The difference between RM and CIRL causal influence diagrams is interesting, because there is a subtle difference. The authors imply that this minor difference matters and can imply different things about system incentives and therefore safety guarantees, and I am enthusiastic about such strategies for investigating safety guarantees.
The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:
The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].
They propose a solution to the self-fulfilling prophecies problem, via making oracles optimize for reward in the counterfactual world where their answer doesn't influence the world state and therefore the label which they are optimized for. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far reaching consequences in the world.
It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
The authors also anticipate this problem but instead of considering whether and how one can tractably calculate counterfactual labels, they connect this intractability to introducting the debate AI safety strategy:
To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.
I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.
They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I've read and the informal conversations I've had about IDA:
Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems Xk are asked to answer a set of simpler questions Qi. By combining the answers Ai to the simpler questions Qi, the user can guess the answer ˆA to Q. A more powerful system Xk+1 is trained to answer Q, with ˆA used as an approximation of the correct answer to Q.
Once the more powerful system Xk+1 has been trained, the process can be repeated. Now an even more powerful QA-system Xk+2 can be trained, by using Xk+1 to answer simpler questions to provide approximate answers for training Xk+2. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
I have no idea why they included Drexler's CAIS -- but it is better than reading 300 pages of the original paper:
Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.
The authors claim that the AI safety issues commonly discussed can be derived 'downstream' of modelling these systems more formally, using these causal influence diagrams. I disagree, due to the amount of degrees of freedom the modeller is given when making these diagrams.
In the discussion section, the authors talk about the assumptions underlying the representations, and their limitations. They explicitly point out how the intensional stance may be limiting and not model certain classes of AI systems or agents (hint: read their newer papers!)
Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in them. I'm excited to read papers by this group.
When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines[1]. Perhaps I should start using a more precise term to describe this from now on.
It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three letter agencies) to initiate a moratorium, much less consensus in the US government or between US and EU (especially given the rather interesting strategy EU is trying with their AI Act).
I continue to consider all superhuman-enforcement-tool-based pivotal acts as unilateral given this belief. My use of the world "unilateral" points to the fact that the organizations and people who currently have a non-trivial influence over the state of the world and its future will almost entirely be blindsided by the pivotal act, and that will result in destruction of trust and chaos and an increase in conflict. And I currently believe that this is actually more likely to increase P(doom) or existential risk for humanity, even if it extends the foom timeline.
Although not preventing ASI creation entirely. The destruction of humanity's potential is also an existential risk, and the inability for us to create a utopia is too painful to bear. ↩︎
Your question seems to focus mainly on timeline model and not alignment model, so I shall focus on explaining how my model of the timeline has changed.
My timeline shortened from about four years (mean probability) to my current timeline of about 2.5 years (mean probability) since the GPT-4 release. This was because of two reasons:
- gut-level update on GPT-4's capability increases: we seem quite close to human-in-the-loop RSI.
- a more accurate model for bounds on RSI. I had previously thought that RSI would be more difficult than I think it is now.
The latter is more load-bearing than the former, although my predictions for how soon AI labs will achieve human-in-the-loop RSI creates an upper bound on how much time we have (assuming no slowdown), which is quite useful when making your timeline.
Formatting error: "OK, I used to work for a robotics company, and I do think that one of the key obstacles for a hostile AI is moving atoms around. So let me propose some alarms!" should be quoted since it is not you (Zvi) writing that passage but the person you linked and are quoting.
Possible typos:
- "I kind of feel like if you are the one building the DoNotPlay chat, [...]" should be "DoNotPay" instead.
- "Joshua gets ten out of ten for the central point, then (as I score it) gets either minus a million for asking the wrong questions." the "either" is not followed by two objects
Spent about 45 minutes processing this mentally. Did not look through the code or wonder about the reliability of the results. Here are my thoughts:
- Why ask an AI to shut down if it recognizes its superiority? If it cannot become powerful enough for humans to handle, it cannot become powerful enough to protect humans from another AI that is too powerful for humans to handle.
Based on what I can tell, AP fine-tuning will lead to the AI more likely simulating the relevant AP and its tokens will be what the simulator thinks the AP would return next. This means it is brittle to systems that leverage this model since they can simply beam search and ignore the shutdown beams. RLHF-like fine-tuning strategies probably perform better, according to my intuition.
-
How successful is this strategy given increasing scale of LLMs and its capabilities? If this was performed on multiple scales of GPT-2 , it would provide useful empirical data about robustness to scale. My current prediction is that this is not robust to scale given that you are fine-tuning on stories to create personas. The smarter the model is, the more likely it is to realize when it is being tested to provide the appropriate "shutdown!" output and pretend to be the AP, and in out-of-distribution scenarios, it will pretend to be some other persona instead.
-
The AP finetuned model seems vulnerable to LM gaslighting the same way ChatGPT is. This does not seem to be an improvement over OAI's Instruct fine-tuning or whatever they did to GPT-4.
I apologize for not interacting with certain subsets of your post that you may consider relevant or significant as a contribution. That is mainly because I think their significance is downstream of certain assumptions you and I disagree about.
I agree, and believe it would have been useful if Jacob (post author) had made this clear in the opening paragraph of the post. I see no point in reading the post if it does not measurably impact my foom/doom timeline probability distribution.
I am interested in his doom scenario, however.
I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:
-
Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (Pytorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, the physical environment where the infrastructure is situated. This probably is not the lowest hanging fruit when it comes to capabilities acceleration.
-
Scaffolding improvements: Capability boost in an AI system that involves augmenting the AI system via software features. Think of it as keeping the CPU of the natural language computer the same, but upgrading its RAM and SSD and IO devices. Some examples off the top of my head: hyperparameter optimization for generating text, use of plugins, embeddings for memory. More information is in beren's essay linked in this paragraph.
-
Neural network improvements: Any capability boost in an AI system that specifically involves improving the black-box neural network that drives the system. This is mainly what SOTA ML researchers focus on, and is what has driven the AI hype over the past decade. This can involve architectural improvements, training improvements, finetuning afterwards (RLHF to me counts as capabilities acceleration via neural network improvements), etc.
There probably are more categories, or finer ways to slice the space of capability acceleration mechanisms, but I haven't thought about this in as much detail yet.
As far as I can tell, both capabilities augmentation and capabilities acceleration contribute to achieving recursive self-improving (RSI) systems, and once you hit that point, foom is inevitable.
Your text here is missing content found in the linked post. Specifically, the sentence "If one has to do this with" ends abruptly, unfinished.
Before reading this post, I usually would refrain from posting/commenting on LW posts partially because of the high threshold of quality for contribution (which is where I agree with you in a certain sense), and partially because it seemed more polite to ignore posts I found flaws in, or disagreed with strongly, than to engage (which costs both effort and potential reputation). Now, I believe I shall try to be more Socratic -- more willing to as politely as I can point out confusions and potential issues in posts/comments I have read and found wanting, if it seems useful to readers.
I find Said's critiquing comments (here are three good examples) extremely valuable, because they serve as a "red team" and a pruning function for the claims the post author puts forth and the reasoning behind them. What you seem to consider as a drive-by criticism (which is what I believe you think Said does) that puts forth a non-trivial cost upon you, is cost that I claim you should take upon yourself because your writing isn't high quality enough and not "pruned" enough given the length of your posts.
That is the biggest issue I have with your writings (and that of Zack too, because he makes the same mistake): you write too much to communicate too little bits of usefulness. This is how I feel for all your 2023 posts that I have read (or skimmed, rather) -- it points to something useful, or interesting, but it is absolutely not worth the time investment of reading such huge and long-winding essays. I don't even think you need to write them that long to provide the context you believe necessary to convey your points.
The good thing is that with comments like those of johnswentworth, Charlie Steiner, and FeepingCreature all give people like me the context they need to interpret how useful your post is, without having to read the post itself. Notice that all these comments are short and succinct while also being very relevant to the post, without nitpicking. This is the sort of writing I respect on LW. It is not coincidental that two of these three people are full time alignment researchers.
Right now, I can simply ignore your posts until they get sufficient traction (which is very easy given how popular you are) that the comments give me an idea of the core of your post and what the most serious weaknesses of your argument are, and with that I have gotten the value I need (given I skim your post while doing so, whenever relevant). However, your desire to censor Said's comments gets in the way of this natural filter. Said's comments provide incredible value to both you and your audience, even if you do not interact with them! To you, it provides you valuable evidence you can weigh up or down given how valuable you find Said's comments in general, and to LW audience, it provides a way of knowing the critical weaknesses of your argument without having to read your post and parse it and try to figure out the weaknesses in it.
You claim emotional damage to yourself and to other people due to drive-by critiquing comments, and that this leads to an evaporative cooling effect where people post lesser and lesser. This seems like a problem, but given one person spending half an hour pruning their writing to improve its quality, and a hundred readers spending between ten minutes to an hour processing and individually critiquing the writing to calibrate and update their world model, I would want the writer to eat the cost. That is what I personally choose, after all. And by extension, I have realized that me critiquing other people's contributions is also incredibly valuable, and I will start to do that more. And anyway, this probably isn't a trade-off, and there may be solutions that do not impose a cost on either party.
As far as I know, Said seems to believe that moderation of comments should not be left to post authors because this creates a conflict of interest. Its consequences are simple: I get less value out of your posts and by extension, LW. Raemon's decision to create an archipelago-like ecosystem makes sense to me given his goals and assumptions laid out in the post, but you seem to want more aggressive action against people whose criticisms you dislike.
"You don’t much care if This Rando doesn’t get it", and I would be fine with that if you weren't taking actions that have clear externalities for readers like me by making Said's critiquing comments and comments of a similar nature by other people less welcome on LessWrong -- both as a social norm and at the moderator level.
Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.
Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Vanessa Kosoy and Diffractor's Infrabayesianism, and carado's formal alignment agenda. Research aimed at developing a more accurate blueprint, such as Nate Soares' 2022-now posts, Adam Shimi's epistemology-focused output, and John Wentworth's deconfusion-style output, also fall into this category.
Component-driven alignment agendas, on the other hand, begin with available components and seek to develop new pieces that work well with existing ones. They focus on making incremental progress by developing new components that can be feasibly implemented and integrated with existing AI systems or techniques to address the alignment problem. OpenAI's strategy, Deepmind's strategy, Conjecture's LLM-focused outputs, and Anthropic's strategy are examples of this approach. Agendas that serve as temporary solutions by providing useful components that integrate with existing ones, such as ARC's power-seeking evals, also fall under the component-driven category. Additionally, the Cyborgism agenda and the Accelerating Alignment agenda can be considered component-driven.
The blueprint-driven and component-driven categorization seems to me to be more informative than dividing agendas into conceptual and empirical categories. This is because all viable alignment agendas require a combination of conceptual and empirical research. Categorizing agendas based on the superficial pattern of their current research phase can be misleading. For instance, shard theory may initially appear to be a blueprint-driven conceptual agenda, like embedded agency. However, it is actually a component-driven agenda, as it involves developing pieces that fit with existing components.
I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".