Comments
I think you’re probably right. But even this will make it harder to establish an agency where the bureaucrats/technocrats have a lot of autonomy, and it seems there’s at least a small chance of an extreme ruling which could make it extremely difficult.
Yeah, I think they will probably produce better regulations, and more of them, than if politicians were more directly involved, but I’m not super sanguine about bureaucrats in absolute terms.
This was addressed in the post: "To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of "agentic") as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don't think these discussions have been very helpful; I think agency is probably a "you know it when you see it" kind of phenomenon. Additionally, even if we do need a more formal operationalization of agency for this proposal to work, I suspect that we will only be able to develop such an operationalization via more empirical research. The main particular thing I mean to actively exclude by stipulating that the system must be agentic is an LLM or similar system arguing for itself to be improved in response to a prompt. "
Thanks for posting this; I think these are valuable lessons and I agree it would be valuable for someone to do a project looking into successful emergency response practices. One thing this framing also highlights is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious human actors, where security problems could pop up before the AI becomes misaligned, but this failure mode seems less likely and less x-risk-inducing than misalignment, and it involves a pretty different set of measures (info-sharing policies and encrypting the weights, as opposed to training techniques or evals).
I think concrete ideas like this that take inspiration from past regulatory successes are quite good, esp. now that policymakers are discussing the issue.
I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment because consumers want safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models - Leopold thinks that governments and the public will "wake up" sufficiently quickly to catastrophic risk from AI that there will be regulatory and PR forces which effectively prevent the release of misaligned models, including evals/ways of detecting misalignment that are more robust than those that might be used by ordinary consumers (which, as you point out, could likely be fooled by surface-level alignment due to RLHF).
How do any capabilities or motivations arise without being explicitly coded into the algorithm?
I don't think it is correct to conceptualize MLE as a "goal" that may or may not be "myopic." LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don't intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens after they have been subjected to lots of fine-tuning following pre-training, perhaps via Ajeya Cotra's idea of "HFDT." Thus, while pretraining from human preferences might shift the initial distribution that the model predicts at the start of fine-tuning in a way that would likely push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment, which I think is the real core issue.
- How does it give the AI a myopic goal? It seems like it's basically just a clever form of prompt engineering in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most/all prompts. But base models aren't myopic agents; they aren't agents at all. As such, I'm not concerned about pure simulators/predictors posing x-risks, but about what happens when people do RL on them to turn them into agents (or use similar techniques like decision transformers). I think it's plausible that pretraining from human feedback partially addresses this by pushing the model's outputs into a more aligned distribution from the get-go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
- I agree scaling well with data is quite good. But see (1)
- How?
- I was never that concerned about this, but I agree that it does seem good to offload more training to pretraining as opposed to finetuning for this and other reasons
Pretraining on human feedback? I think it’s promising, but we have no direct evidence of how it interacts with the RL finetuning that turns LLMs into agents, which is the key question.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and of the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned alternatives. This in turn is because, in order for the model to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.
Yeah, I agree with both your object-level claim (i.e. I lean towards the “alignment is easy” camp) and, to a certain extent, your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that the “alignment is hard” people are overly pessimistic, you need to argue more on the object level against the claim or provide highly compelling evidence that such people are systematically, irrationally pessimistic on most topics.
Strong upvote. A corollary here is that a really important part of being a “good person” is being able to tell when you’re rationalizing your behavior or otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but, as you said, don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.
In the intervening period, I've updated towards your position, though I still think it is risky to build systems with such open-ended capabilities that sit that close to agents in design space.
I agree with something like this, though I think you're too optimistic w/r/t deceptive alignment being highly unlikely if the model understands the base objective before getting a goal. If the model is sufficiently good at deception, there will be few to no differential adversarial examples. Thus, while gradient descent might have a slight preference for pointing the model at the base objective over a misaligned objective, induced by the small number of adversarial examples, the vastly larger number of misaligned goals suggests to me that it is at least plausible that the model learns to pursue a misaligned objective even if it gains an understanding of the base objective before it becomes goal-directed, because the inductive biases of SGD favor models that are more numerous in parameter space.
My understanding of Shard Theory is that what you said is true, except sometimes the shards "directly" make bids for outputs (particularly when they are more "reflexive," e. g. the "lick lollipop" shard is activated when you see a lollipop), but sometimes make bids for control of a local optimization module which then implements the output which scores best according to the various competing shards. You could also imagine shards which do a combination of both behaviors. TurnTrout can correct me if I'm wrong.
My takeoff speeds are on the somewhat faster end, probably ~a year or two from “we basically don’t have crazy systems” to “AI (or whoever controls AI) controls the world”
EDIT: After further reflection, I no longer endorse this. I would now put 90% CI from 6 months to 15 years with median around 3.5 years. I still think fast takeoff is plausible but now think pretty slow is also plausible and overall more likely.
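As a rough illustration (my own check, not part of the original estimate, and assuming a lognormal shape the real distribution need not follow), fitting a lognormal to the stated 90% CI implies a median of roughly 2.7 years, a bit below the stated ~3.5, i.e. my estimate puts somewhat more weight on longer timelines than a symmetric-in-log fit would:

```python
# Rough consistency check: fit a lognormal to the stated 90% CI
# (0.5 to 15 years) and compare the implied median to the stated ~3.5 years.
# Purely illustrative; the underlying estimate need not be lognormal.
import math
from scipy.stats import norm

p5, p95 = 0.5, 15.0                        # years: stated 90% interval
z = norm.ppf(0.95)                          # ~1.645
mu = (math.log(p5) + math.log(p95)) / 2     # log-space midpoint
sigma = (math.log(p95) - math.log(p5)) / (2 * z)

print(f"implied median ~= {math.exp(mu):.1f} years (stated ~3.5)")
print(f"log-space sigma ~= {sigma:.2f}")
```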
I think I'm like >95th percentile on verbal-ness of thoughts. I feel like almost all of my thoughts that aren't about extremely concrete things in front of me or certain abstract systems that are best thought of visually are verbal, and even in those cases I sometimes think verbally. Almost all of the time, at least some words are going through my head, even if it's just random noise or song lyrics or something like that. I struggle to imagine what it would be like to not think this way; I feel like many propositions can't be easily represented in an image. For example, if I think of an image of a dog in my home, this could correspond to the proposition "there is a dog in my home" or "I wish there were a dog in my home" or "I wish there weren't a dog in my home" or "This is the kind of dog that I would have if I had a dog."
Hmm... I guess I'm skeptical that we can train very specialized "planning" systems? Making superhuman plans of the sort that could counter those of an agentic superintelligence seems like it requires both a very accurate and domain-general model of the world as well as a search algorithm to figure out which plans actually accomplish a given goal given your model of the world. This seems extremely close in design space to a more general agent. While I think we could have narrow systems which outperform the misaligned superintelligence in other domains such as coding or social manipulation, general long-term planning seems likely to me to be the most important skill involved in taking over the world or countering an attempt to do so.
This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as if not better than us. The problem is that it can't get any feedback about whether its ideas actually work from empirical reality given the issues with testing alignment problems, not that it can't get feedback from another intelligent grader/assessor reasoning in a ~a priori way.
I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high quality alignment research. As e. g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system's capabilities can generalize way past the kinds of problems that it was explicitly trained to do. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their "beliefs", even if they are generally intelligent enough to in some way "know" the correct answer. However, addressing this issue involves solving a different and I think probably easier problem (ELK/creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.
I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.
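For a sense of scale (my own back-of-the-envelope, assuming a normal distribution, which real capability distributions need not follow), +12 SD corresponds to a tail probability of roughly 10^-33, far beyond any human who has ever lived:

```python
# Back-of-the-envelope: how far outside the human range "+12 SD" would be
# if capability were normally distributed (an idealization, not a claim
# about how real capability distributions actually look).
from scipy.stats import norm

tail = norm.sf(12)                                     # P(X > mean + 12*SD)
print(f"tail probability ~= {tail:.1e}")               # ~1.8e-33
print(f"i.e. roughly 1 in {1/tail:.1e} individuals")   # far more than have ever lived
```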
I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to accumulate massive amounts of power, relatively few have tried. Moreover, when people have succeeded at quickly accumulating lots of power/taking over societies, they often did so with surprisingly small strategic advantages. See e. g. this post; I think that an AI that was both +12 SD at planning/general intelligence and social manipulation could, like the conquistadors, achieve a decisive strategic advantage without having to have some kind of crazy OP military technology/direct force advantage. Consider also Hitler's rise to power and the French Revolution as cases where one actor/a small group of actors was able to surprisingly rapidly take over a country.
While these examples provide some evidence in favor of it being easier than expected to take over the world, overall, I would not be too scared of a +12 SD human taking over the world. However, I think that the AI would have some major advantages over an equivalently capable human. Most importantly, the AI could download itself onto other computers. This seems like a massive advantage, allowing the AI to do basically everything much faster and more effectively. While individually extremely capable humans would probably greatly struggle to achieve a decisive strategic advantage, large groups of extremely intelligent, motivated, and competent humans seem obviously much scarier. Moreover, as compared to an equivalently sized group of equivalently capable humans, a group of AIs sharing their source code would be able to coordinate among themselves far better, making them even more capable than the humans.
Finally, it is much easier for AIs to self modify/self improve than it is for humans to do so. While I am skeptical of foom for the same reasons you are, I suspect that over a period of years, a group of AIs could accumulate enough financial and other resources that they could translate these resources into significant cognitive improvements, if only by acquiring more compute.
While the AI has the disadvantage relative to an equivalently capable human of not immediately having access to a direct way to affect the "external" world, I think this is much less important than the AI's advantages in self-replication, coordination, and self-improvement.
You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how much the details of the prediction vary depending on the details of the mechanistic model. In the explicit search case, it seems like many other plausible models for how RL agents might mechanistically function imply agent-ish behavior, even if the model is not primarily using explicit search. However, this is because, due to the fact that the agent must accomplish the training objective, the space of possible behaviors is heavily constrained. In questions where the prediction space is less constrained to begin with (e. g. questions about how the far future will go), different “mechanistic” explanations (for example, thinking that the far future will be controlled by a human superintelligence vs an alien superintelligence vs evolutionary dynamics) imply significantly different predictions.
I think the NAH does a lot of work for interpretability of an AI's beliefs about things that aren't values, but I'm pretty skeptical about the "human values" natural abstraction. I think the points made in this post are good, and relatedly, I don't want the AI to be aligned to "human values"; I want it to be aligned to my values. I think there’s a pretty big gap between my values and those of the average human even subjected to something like CEV, and that this is probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in group, disgust based values, etc.
Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA
Good point, post updated accordingly.
I think making progress on ML is pretty hard. In order for a single AI to self-improve quickly enough to change timelines, it would have to improve itself at close to the speed at which all of the humans working on it could improve it. I don't know why you would expect to see such superhuman coding/science capabilities without other kinds of superintelligence.
I think the world modelling improvements from modern science and IQ raising social advances can be analytically separated from changes in our approach to welfare. As for non consensual wireheading, I am uncertain as to the moral status of this, so it seems like partially we just disagree about values. I am also uncertain as to the attitude of Stone Age people towards this - while your argument seems plausible, the fact that early philosophers like the Ancient Greeks were not pure hedonists in the wireheading sense but valued flourishing seems like evidence against this, suggesting that favoring non consensual wireheading is downstream of modern developments in utilitarianism.
The claim about Stone Age people seems probably false to me - I think if Stone Age people could understand what they were actually doing (not at the level of psychology or morality, but at the purely "physical" level), they would probably do lots of very nice things for their friends and family, in particular give them a lot of resources. However, even if it is true, I don't think the reason we have gotten better is philosophy - I think it's that we're smarter in a more general way. Stone Age people were uneducated and had worse nutrition than us; they were literally just stupid.
Having had a similar experience, I strongly endorse this advice. Actually optimizing for high quality relationships in modern society looks way different than following the social strategies that didn't get you killed in the EEA.
I think this is probably true; I would assign something like a 20% chance of some kind of government action in response to AI aimed at reducing x-risk, and maybe a 5-10% chance that it is effective enough to meaningfully reduce risk. That being said, 5-10% is a lot, particularly if you are extremely doomy. As such, I think it is still a major part of the strategic landscape even if it is unlikely.
Why should we expect that, as the AI gradually automates us away, it will replace us with better versions of ourselves rather than with non-sentient, or at minimum non-aligned, robots who just do its bidding?
I don’t think we have time before AGI comes to deeply change global culture.
This is probably true for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human-level-ish AGI is already a serious x-risk, and humans aren’t even close to being intelligent enough to do this.
Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden from humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil - unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable, particularly for a machine with the computational resources of the first AGI.
This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn't need to deceive us, isn't disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against human attacks. For example, thinking about the ants analogy, ants are incapable of harming us not just because they are stupid, but because they are also extremely physically weak. If humans are faced with physically powerful animals, even if we can subdue them easily, we still have to think about them to do it.
Check out CLR's research: https://longtermrisk.org/research-agenda. They are focused on answering questions like these because they believe that competition between AIs is a big source of s-risk.
It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed and so am pessimistic about eg debate, AI still seems like it could be very useful for solving alignment.
This seems pretty plausible to me, but I suspect that the first AGIs will exhibit a different distribution of skills across cognitive domains than humans and may also be much less agentic. Humans evolved in environments where the ability to form and execute long-term plans to accumulate power and achieve dominance over other humans was highly selected for. The environments in which the first AGIs are trained may not have this property. That doesn’t mean they won’t develop it, but they may very well not until they are more strongly and generally superintelligent.
They rely on secrecy to gain relative advantages, but absolutely speaking, openness increases research speed; it increases the amount of technical information available to every actor.
With regard to God specifically, belief in God is somewhat unique because God is supposed to make certain things good in virtue of his existence; the value of the things religious people value is predicated on the existence of God. In contrast, the value of cake to the kid is not predicated on the actual existence of the cake.
I think this is a good point and one reason to favor more CEV-style solutions to alignment, if they are possible, rather than solutions which make the values of the AI relatively "closer" to our original values.
Or, the other way around, perhaps "values" are defined by being robust to ontology shifts.
This seems wrong to me. I don't think that reductive physicalism is true (i. e. the hard problem really is hard), but if I did, I would probably change my values significantly. Similarly for religious values; religious people seem to think that God has a unique metaphysical status such that his will determines what is right and wrong, and if no being with such a metaphysical status existed, their values would have to change.
How do you know what that is? You don't have the ability to stand outside the mind-world relationship and perceive it, any more than anything else. You have beliefs about the mind-world relationship, but they are all generated by inference in your mind. If there were some hard core of non-inferential knowledge about the ontological nature of reality, you might be able to leverage it to gain more knowledge, but there isn't, because the same objections apply.
I'm not making any claims about knowing what it is. The OP's argument is that our normal deterministic model is self-refuting because it undermines our ability to have knowledge, so the truth of the model can be assumed for the sake of argument.
The point is about correspondence. Neither correlations nor predictive accuracy amount to correspondence to a definite ontology.
Yes, a large range of worlds with different ontologies imply the same observations. The further question of assigning probabilities to those different worlds comes down to how to assign initial priors, which is a serious epistemological problem. However, this seems unrelated to the point made in the OP, which is that determinism is self-undermining.
More broadly, I am confused as to what claim you think that I am making which you disagree with.
By "process," I don't mean internal process of thought involving an inference from perceptions to beliefs about the world, I mean the actual perceptual and cognitive algorithm as a physical structure in the world. Because of the way the brain actually works in a deterministic universe, it ends up correlated with the external world. Perhaps this is unknowable to us "from the inside," but the OP's argument is not about external world skepticism given direct access only to what we perceive, but rather that given normal hypotheses about how the brain works, we should not trust the beliefs it generates. I am simply pointing out that this is false, because these normal hypotheses imply the kind of correlation that we want.
We get correspondence to reality through predictive accuracy; we can predict experience well using science because scientific theories are roughly isomorphic to the structures in reality that they are trying to describe.
Yeah, this is exactly right imo. Thinking about good epistemics as believing what is "justified" or what you have "reasons to believe" is unimportant/useless insofar as it departs from "generated by a process that makes the ensuing map correlate with the territory." In the world where we don't have free will, but our beliefs are produced deterministically by our observations and our internal architecture in a way such that they are correlated with the world, we have all the knowledge that we need.
While this doesn't answer the question exactly, I think important parts of the answer include the fact that AGI could copy itself onto other computers, as well as acquire resources (minimally money) entirely through the internet (e.g. by investing in stocks). A superintelligent system with access to trillions of dollars and with huge numbers of copies of itself on computers throughout the world more obviously has a lot of potentially very destructive actions available to it than one stuck on one computer with no resources.