Where I agree and disagree with Eliezer

paulfchristiano

post by paulfchristiano · 2022-06-19T19:15:55.698Z · LW · GW · 223 comments

  Agreements
  Disagreements
  My take on Eliezer's takes

(Partially in response to AGI Ruin: A list of Lethalities [LW · GW]. Written in the same rambling style. Not exhaustive.)

Agreements

Powerful AI systems have a good chance of deliberately and irreversibly disempowering humanity. This is a much more likely failure mode than humanity killing ourselves with destructive physical technologies.
Catastrophically risky AI systems could plausibly exist soon, and there likely won’t be a strong consensus about this fact until such systems pose a meaningful existential risk per year. There is not necessarily any “fire alarm.”
Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.
I think that many of the projects intended to help with AI alignment don't make progress on key difficulties and won’t significantly reduce the risk of catastrophic outcomes. This is related to people gravitating to whatever research is most tractable and not being too picky about what problems it helps with, and related to a low level of concern with the long-term future in particular. Overall, there are relatively few researchers who are effectively focused on the technical problems most relevant to existential risk from alignment failures.
There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better.
Even when thinking about accident risk, people’s minds seem to go to what they think of as “more realistic and less sci fi” risks that are much less likely to be existential (and sometimes I think less plausible). It’s very possible this dynamic won’t change until after actually existing AI systems pose an existential risk.
There is a good chance that an AI catastrophe looks like an abrupt “coup” where AI systems permanently disempower humans with little opportunity for resistance. People seem to consistently round this risk down to more boring stories that fit better with their narratives about the world. It is quite possible that an AI coup will be sped up by humans letting AI systems control killer robots, but the difference in timeline between "killer robots everywhere, AI controls everything" and "AI only involved in R&D" seems like it's less than a year.
The broader intellectual world seems to wildly overestimate how long it will take AI systems to go from “large impact on the world” to “unrecognizably transformed world.” This is more likely to be years than decades, and there’s a real chance that it’s months. This makes alignment harder and doesn’t seem like something we are collectively prepared for.
Humanity usually solves technical problems by iterating and fixing failures; we often resolve tough methodological disagreements very slowly by seeing what actually works and having our failures thrown in our face. But it will probably be possible to build valuable AI products without solving alignment, and so reality won’t “force us” to solve alignment until it’s too late. This seems like a case where we will have to be unusually reliant on careful reasoning rather than empirical feedback loops for some of the highest-level questions.
AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.
If you had incredibly powerful unaligned AI systems running on a server farm somewhere, there is very little chance that humanity would maintain meaningful control over its future.
“Don’t build powerful AI systems” appears to be a difficult policy problem, requiring geopolitical coordination of a kind that has often failed even when the stakes are unambiguous and the pressures to defect are much smaller.
I would not expect humanity to necessarily “rise to the challenge” when the stakes of a novel problem are very large. I was 50-50 about this in 2019, but our experience with COVID has further lowered my confidence.
There is probably no physically-implemented reward function, of the kind that could be optimized with SGD, that we’d be happy for an arbitrarily smart AI to optimize as hard as possible. (I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.)
Training an AI to maximize a given reward function does not generically produce an AI which is internally “motivated” to maximize reward. Moreover, at some level of capability, a very wide range of motivations for an AI would lead to loss-minimizing behavior on the training distribution because minimizing loss is an important strategy for an AI to preserve its influence over the world.
It is more robust for an AI system to learn a good model for the environment, and what the consequences of its actions will be, than to learn a behavior like “generally being nice” or “trying to help humans.” Even if an AI was imitating data consisting of “what I would do if I were trying to be nice,” it would still be more likely to eventually learn to imitate the actual physical process producing that data rather than absorbing some general habit of niceness. And in practice the data we produce will not be perfect, and so “predict the physical process generating your losses” is going to be positively selected for by SGD.
You shouldn’t say something like “well I might as well assume there’s a hope” and thereby live in a specific unlikely world where alignment is unrealistically easy in one way or another. Even if alignment ends up easy, you would be likely to end up predicting the wrong way for it to be easy. If things look doomed to you, in practice it’s better to try to maximize log odds of success as a more general and robust strategy for taking advantage of lucky breaks in a messy and hard-to-predict world.
No current plans for aligning AI have a particularly high probability of working without a lot of iteration and modification. The current state of affairs is roughly “if alignment turns out to be a real problem, we’ll learn a lot about it and iteratively improve our approach.” If the problem is severe and emerges quickly, it would be better if we had a clearer plan further in advance—we’d still have to adapt and learn, but starting with something that looks like it could work on paper would put us in a much better situation.
Many research problems in other areas are chosen for tractability or being just barely out of reach. We pick benchmarks we can make progress on, or work on theoretical problems that seem well-posed and approachable using existing techniques. Alignment isn’t like that; it was chosen to be an important problem, and there is no one ensuring that the game is “fair” and that the problem is soluble or tractable.

Disagreements

(Mostly stated without argument.)

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.
Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it, but I think it’s very unlikely to be what happens in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we will have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D. More generally, the cinematic universe of Eliezer’s stories of doom doesn’t seem to me like it holds together, and I can’t tell if there is a more realistic picture of AI development under the surface.
One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.
AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold; AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy [LW · GW]” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.
Many of the “pivotal acts” that Eliezer discusses involve an AI lab achieving a “decisive strategic advantage” (i.e. overwhelming hard power) that they use to implement a relatively limited policy, e.g. restricting the availability of powerful computers. But the same hard power would also let them arbitrarily dictate a new world order, and would be correctly perceived as an existential threat to existing states. Eliezer’s view appears to be that a decisive strategic advantage is the most realistic way to achieve these policy goals, despite the fact that building powerful enough AI systems runs an overwhelming risk of destroying the world via misalignment. I think that preferring this route to more traditional policy influence requires extreme confidence about details of the policy situation; that confidence might be justified by someone who knew a lot more about the details of government than I do, but Eliezer does not seem to. While I agree that this kind of policy change would be an unusual success in historical terms, the probability still seems much higher than Eliezer’s overall probabilities of survival. Conversely, I think Eliezer greatly underestimates how difficult it would be for an AI developer to covertly take over the world, how strongly and effectively governments would respond to that possibility, and how toxic this kind of plan is.
I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems with those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research. Eliezer is right that this doesn’t make the problem go away (if humans don’t solve alignment, then why think AIs will solve it?) but I think it does mean that arguments about how recursive self-improvement quickly kicks you into a lethal regime are wrong (since AI is accelerating the timetable for both alignment and capabilities).
When talking about generalization outside of the training distribution, I think Eliezer is generally pretty sloppy. I think many of the points are roughly right, but that it’s way too sloppy to reach reasonable conclusions after several steps of inference. I would love to see real discussion of these arguments, and in some sense it seems like Eliezer is a good person to push that discussion forward. Right now I think that relevant questions about ML generalization are in fact pretty subtle; we can learn a lot about them in advance but right now just mostly don’t know. Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.
In particular, existing AI training strategies don’t need to handle a “drastic” distribution shift from low levels of intelligence to high levels of intelligence. There’s nothing in the foreseeable ways of building AI that would call for a big transfer like this, rather than continuously training as intelligence gradually increases. Eliezer seems to partly be making a relatively confident claim that the nature of AI is going to change a lot, which I think is probably wrong and is clearly overconfident. If he had been actually making concrete predictions over the last 10 years I think he would be losing a lot of them to people more like me.
Eliezer strongly expects sharp capability gains, based on a combination of arguments that I think don’t make sense and an analogy with primate evolution which I think is being applied poorly. We’ve talked about this before, and I still think Eliezer’s position is probably wrong and clearly overconfident. I find Eliezer’s more detailed claims, e.g. about hard thresholds, to be much more implausible than his (already probably quantitatively wrong) claims about takeoff speeds.
Eliezer seems confident about the difficulty of alignment based largely on his own experiences working on the problem. But in fact society has spent very little total effort working on the problem, and MIRI itself would probably be unable to solve or even make significant progress on the large majority of problems that existing research fields routinely solve. So I think right now we mostly don’t know how hard the problem is (but it may well be very hard, and even if it’s easy we may well fail to solve it). For example, the fact that MIRI tried and failed to find a “coherent formula for corrigibility” is not much evidence that corrigibility is “unworkable.”
Eliezer says a lot of concrete things about how research works and about what kind of expectation of progress is unrealistic (e.g. talking about bright-eyed optimism in list of lethalities [LW · GW]). But I don’t think this is grounded in an understanding of the history of science, familiarity with the dynamics of modern functional academic disciplines, or research experience. The Eliezer predictions most relevant to “how do scientific disciplines work” that I’m most aware of are incorrectly predicting that physicists would be wrong about the existence of the Higgs boson (LW bet registry [? · GW]) and expressing the view that real AI would likely emerge from a small group rather than a large industry (pg 436 but expressed many places).
I think Eliezer generalizes a lot from pessimism about solving problems easily to pessimism about solving problems at all; or from the fact that a particular technique doesn’t immediately solve a problem to pessimism about the helpfulness of research on that technique. I disagree with Eliezer about how research progress is made, and don’t think he has any special expertise on this topic. Eliezer often makes objections to particular implementations of projects (like using interpretability tools for training). But in order to actually talk about whether a research project is likely to succeed, you really really need to engage with the existential quantifier where future researchers get to choose implementation details to make it work. At a minimum that requires engaging with the strongest existing versions of these proposals, and if you haven’t done that (as Eliezer hasn’t) then you need to take a different kind of approach. But even if you engage with the best existing concrete proposals, you still need to think carefully about whether your objections are the kind of thing that will be hard to overcome as people learn more details in the future. One way of looking at this is that Eliezer is appropriately open-minded about existential quantifiers applied to future AI systems thinking about how to cause trouble, but seems to treat existential quantifiers applied to future humans in a qualitatively rather than quantitatively different way (and as described throughout this list I think he overestimates the quantitative difference).
As an example, I think Eliezer is unreasonably pessimistic about interpretability while being mostly ignorant about the current state of the field. This is true both for the level of understanding potentially achievable by interpretability, and the possible applications of such understanding. I agree with Eliezer that this seems like a hard problem and many people seem unreasonably optimistic, so I might be sympathetic if Eliezer was making claims with moderate confidence rather than high confidence. As far as I can tell most of Eliezer’s position here comes from general intuitions rather than arguments, and I think those are much less persuasive when you don’t have much familiarity with the domain.
Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated). When Eliezer dismisses the possibility of AI systems performing safer tasks millions of times in training and then safely transferring to “build nanotechnology” (point 11 of list of lethalities [LW · GW]) he is not engaging with the kind of system that is likely to be built or the kind of hope people have in mind.
List of lethalities [LW · GW] #13 makes a particular argument that we won’t see many AI problems in advance; I feel like I see this kind of thinking from Eliezer a lot but it seems misleading or wrong. In particular, it seems possible to study the problem that AIs may “change [their] outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over [them]” in advance. And while it’s true that if you fail to solve that problem then you won’t notice other problems, this doesn’t really affect the probability of solving alignment overall: if you don’t solve that problem then you die, and if you do solve that problem then you can study the other problems.
I don’t think list of lethalities [LW · GW] is engaging meaningfully with the most serious hopes about how to solve the alignment problem. I don’t think that’s necessarily the purpose of the list, but it’s quite important if you want to assess the probability of doom or contribute meaningfully to solving the problem (or to complain about other people producing similar lists).
I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.
Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D, where verification is much, much easier than generation in virtually every domain.
Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface. I think this is fairly plausible but it’s not at all obvious, and moreover there are plenty of techniques intermediate between “copy individual reasoning steps” and “optimize end-to-end on outcomes.” I think that the last 5 years of progress in language modeling have provided significant evidence that training AI to imitate human thought may be economically competitive at the time of transformative AI, potentially bringing us to something more like a 50-50 chance. I can’t tell if Eliezer should have lost Bayes points here, but I suspect he would have and if he wants us to evaluate his actual predictions I wish he would say something about his future predictions.
These last two points (and most others from this list) aren’t actually part of my central alignment hopes or plans. Alignment hopes, like alignment concerns, can be disjunctive. In some sense they are even more disjunctive, since the existence of humans who are trying to solve alignment is considerably more robust than the existence of AI systems who are trying to cause trouble (such AIs only exist if humans have already failed at significant parts of alignment). Although my research is focused on cases where almost every factor works out against us, I think that you can get a lot of survival probability from easier worlds.
Eliezer seems to be relatively confident that AI systems will be very alien and will understand many things about the world that humans don’t, rather than understanding a similar profile of things (but slightly better), or having weaker understanding but enjoying other advantages like much higher serial speed. I think this is very unclear and Eliezer is wildly overconfident. It seems plausible that AI systems will learn much of how to think by predicting humans even if human language is a uselessly shallow shadow of human thought, because of the extremely short feedback loops. It also seems quite possible that most of their knowledge about science will be built by an explicit process of scientific reasoning and inquiry that will proceed in a recognizable way to human science even if their minds are quite different. Most importantly, it seems like AI systems have huge structural advantages (like their high speed and low cost) that suggest they will have a transformative impact on the world (~~and obsolete human contributions to alignment~~ retracted [LW(p) · GW(p)]) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.
AI systems reasoning about the code of other AI systems is not likely to be an important dynamic for early cooperation between AIs. Those AI systems look very likely to be messy, such that the only way AI systems will reason about their own or others’ code is by looking at behavior and using the same kinds of tools and reasoning strategies as humans. Eliezer has a consistent pattern of identifying important long-run considerations, and then flatly asserting that they are relevant in the short term without evidence or argument. I think Eliezer thinks this pattern of predictions isn’t yet conflicting with the evidence because these predictions only kick in at some later point (but still early enough to be relevant), but this is part of what makes his prediction track record impossible to assess and why I think he is greatly overestimating it in hindsight.
Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems, they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals. (Note: I may be misunderstanding what he’s saying in places like list of lethalities [LW · GW] 35; those may be a response to random things people say on the internet rather than engaging with alignment strategies that are being explored in practice. But even then I think he is responding to relatively weak versions of these arguments.) Eliezer equivocates between a line like “AI systems will cooperate” and “The verifiable activities you could use gradient descent to select for won’t function appropriately as checks and balances.” But Eliezer’s position is a conjunction that fails if either step fails, and jumping back and forth between them appears to totally obscure the actual structure of the argument.
Eliezer seems to expect that by the time AI systems can achieve pivotal acts, they can also manipulate humans very well—such that for example it’s pointless to try to have debates between them or to try to play an adversarial game against them. But if we imagine minds with a human profile of abilities, it seems like they would be superhuman at R&D around the same time they were superhuman at persuasion, and could easily go either way. Moreover, there are a lot of reasons the AI seems much more likely to be superhuman at R&D if that’s what we want: it’s trained primarily to work on R&D, it’s using tools and structure designed to facilitate R&D, there is a large collaboration amongst AIs to advance R&D whereas manipulation is often being performed by individual AI systems trying to win a debate. And on top of all that, if anything I very weakly expect AI systems’ comparative advantage to be in R&D rather than human manipulation (since R&D is wildly out of distribution for humans).
I don’t think surviving worlds have a plan in the sense Eliezer is looking for. Based on what Eliezer says I don’t feel like he has a clear or accurate picture of what successful “plans” look like in the real world. I don’t see any particular reason to defer to Eliezer at all on this point.
Eliezer says that his list of lethalities [LW · GW] is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong. I think Eliezer’s document is mostly aimed at rhetoric or pedagogy rather than being a particularly helpful contribution to the field that others should be expected to have prioritized; I think that which ideas are “important” is mostly a consequence of Eliezer’s idiosyncratic intellectual focus rather than an objective fact about what is important; the main contributions are collecting up points that have been made in the past and ranting about them and so they mostly reflect on Eliezer-as-writer; and perhaps most importantly, I think more careful arguments on more important difficulties are in fact being made in other places. For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list. About half of them are overlaps, and I think the other half are if anything more important since they are more relevant to core problems with realistic alignment strategies.^[1]

My take on Eliezer's takes

Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.
Eliezer’s post (and most of his writing) isn’t bringing much new evidence to the table; it mostly either reasons a priori or draws controversial conclusions from uncontroversial evidence. I think that calls for a different approach than Eliezer has taken historically (if the goal was to productively resolve these disagreements).
- I think that these arguments mostly haven’t been written down publicly so that they can be examined carefully or subject to criticism. It’s not clear whether Eliezer has the energy to do that, but I think that people who think that Eliezer’s position is important should try to understand the arguments well enough to do that.
- I think that people with Eliezer’s views haven’t engaged very much productively with people who disagree (and have often made such engagement hard). I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience. And I don’t think that Eliezer could pass an ideological Turing test for people who disagree.
- I think those are valuable steps to take if you have a contrarian take of great importance, which remains controversial even within your weird corner of the world, and whose support comes almost entirely from reasoning and argument.
A lot of the post seems to rest on intuitions and ways of thinking that Eliezer feels are empirically supported (rather than on arguments that can be explicitly stated). But I don’t feel like I actually have much evidence about that, so I think it really does just come down to the arguments.
- I think Eliezer would like to say that the last 20 years give a lot of evidence for his object-level intuitions and general way of thinking about the world. If that’s the case, I think we should very strongly expect that he can state predictions about the future that will systematically be better than those of people who don’t share his intuitions or reasoning strategies. I remain happy to make predictions about any questions he thinks would provide this kind of evidence, or to state a bunch of random questions where I’m happy to predict (where I think he will probably slightly underperform me). If there aren’t any predictions about the future where these intuitions and methodologies overperform, I think you should be very skeptical that they got a lot of evidence over the last 20 years (and that’s at least something that requires explanation).
- I think Eliezer could develop good intuition about these topics that is “backed up” by predicting the results of more complicated arguments using more broadly-accepted reasoning principles. Similarly, a mathematician might have great intuitions about the truth of a theorem, and those intuitions could come entirely from feedback loops involving formal proofs rather than empirical data. But if two mathematicians had differing intuitions about a theorem, and their intuitions both came from formally proving a bunch of similar theorems, then I think the way to settle the disagreement is by using the normal rules of logic governing proofs. So this brings us back to the previous bullet point, and I think Eliezer should be more interested in actually making arguments and engaging with legitimate objections.
- I don’t think Eliezer has any kind of track record of exhibiting understanding in other ways (e.g. by accomplishing technological goals or other projects that require engaging with details of the world or making good day-to-day predictions). I think that’s OK, but it means that I more strongly expect any empirically-backed intuitions to be cashed out as either predictions from afar or more careful arguments.

Ten examples off the top of my head, that I think are about half overlapping and where I think the discussions in the ELK doc are if anything more thorough than the discussions in the list of lethalities:
1. Goals defined in terms of sense data are manipulable by an AI who can compromise sensors, and this is a serious obstruction to using ML to optimize what we actually care about.
2. An AI may manipulate sensors by exploiting facts about the world or modes of heuristic reasoning that humans are totally unfamiliar with, such that humans couldn’t recognize such tampering even if they spent a very long time examining proposed actions.
3. The human process of scientific understanding, even if automated, may end up being significantly less efficient than the use of gradient descent to find opaque models of the world. In this case, it may be inevitable that AI systems understand things about the world we don’t even if they try to help us do science.
4. If an AI is trained to predict human judgments or optimize scores as assessed by humans, then humans are likely to make errors. An AI system will eventually learn these errors rather than learning the intended behavior. Even if these errors aren’t themselves important, it will then predictably copy human errors out of distribution leading to catastrophic outcomes.
5. Even if humans make no errors in the training set, an AI which understands the world already has a model of a human which can be quickly repurposed to make good predictions about human judgments, and so it will tend to do this and therefore copy human errors off distribution.
6. Even if the AI has no model of a human, in the limit where the AI’s model is very complex and alien it is still faster and simpler for the AI to learn a model of “what a human would say” from scratch than to learn the intended ontology identification. So we can't count on SGD.
7. There are many training strategies that can train an AI to answer questions even in cases where humans could not answer correctly. However, most of the approaches we know now, including those being explored in practice, seem to consistently top out at “questions that humans could answer if they have a lot more compute,” which does not always seem good enough.
8. We could imagine more elaborate games where the easiest strategy for the AI is honesty, and then to regularize on computation time in order to learn an honest policy, but those require us to be careful about the construction of the training data in order to ensure that the task is sufficiently hard, and there are no existing proposals that have that property. It's very hard to even set up games for which no strategy can outperform honesty.
9. Even if you were optimizing based on reliable observations of the real world, there are many bad actions that have no human-legible consequences for many years. At the point when legible consequences materialize it may be in a world that is too complex for existing humans to evaluate whether they are good or bad. If we don’t build an AI that understands our preferences about this kind of subtle bad behavior, then a competitive world will push us into a bad outcome.
10. If the simplest policy to succeed at our task is a learned optimizer, and we try to regularize our AI to e.g. answer questions quickly, then its best strategy may be to internally search for a policy which answers questions slowly (because it’s quicker to find such a policy, and the time taken by the search is larger than the time taken by the mesapolicy). This makes it difficult to lean on regularization strategies to incentivize honesty.
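
To make the arithmetic in item 10 concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the `MesaPolicy` class, the helper names, and the time values are all made up, and real training dynamics are far messier. The only point is that a penalty on total time (search plus answering) can favor an inner policy that answers slowly, as long as that policy is cheap enough to find.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MesaPolicy:
    """A hypothetical candidate policy that an inner search could find."""
    name: str
    search_time: float  # time the learned optimizer spends finding this policy
    answer_time: float  # time the found policy then takes to answer


def total_time(policy: MesaPolicy) -> float:
    """Total time that an outer time regularizer would penalize."""
    return policy.search_time + policy.answer_time


def favored_by_time_penalty(candidates: list[MesaPolicy]) -> MesaPolicy:
    """Return the candidate with the lowest total (search + answer) time."""
    return min(candidates, key=total_time)


# Made-up numbers, chosen only to illustrate the argument.
candidates = [
    MesaPolicy("fast answerer, hard to find", search_time=10.0, answer_time=1.0),
    MesaPolicy("slow answerer, easy to find", search_time=2.0, answer_time=5.0),
]

chosen = favored_by_time_penalty(candidates)
print(chosen.name, total_time(chosen))
```

With these illustrative numbers the time penalty selects the slow answerer, because the search cost dominates the total. Penalizing computation time therefore does not by itself select for the fast inner policy.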

223 comments

Comments sorted by top scores.

comment by Richard_Ngo (ricraz) · 2022-06-19T21:04:17.426Z · LW(p) · GW(p)

Strong +1s to many of the points here. Some things I'd highlight:

Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he'd have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he's found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn't have found that as difficult. (I'm sympathetic about Eliezer having in the past engaged with many interlocutors who were genuinely very bad at understanding his arguments. However, it does seem like the lack of detail in those arguments is now a bigger bottleneck.)
I think that the intuitions driving Eliezer's disagreements with many other alignment researchers are interesting and valuable, and would love to have better-fleshed-out explanations of them publicly available. Eliezer would probably have an easier time focusing on developing his own ideas if other people in the alignment community who were pessimistic about various research directions, and understood the broad shape of his intuitions, were more open and direct about that pessimism. This is something I've partly done in this post [AF · GW]; and I'm glad that Paul's partly done it here.
I like the analogy of a mathematician having intuitions about the truth of a theorem. I currently think of Eliezer as someone who has excellent intuitions about the broad direction of progress at a very high level of abstraction - but where the very fact that these intuitions are so abstract rules out the types of path-dependencies that I expect solutions to alignment will actually rely on. At this point, people who find Eliezer's intuitions compelling should probably focus on fleshing them out in detail - e.g. using toy models, or trying to decompose the concept of consequentialism - rather than defending them at a high level.

Replies from: Zack_M_Davis

↑ comment by Zack_M_Davis · 2022-06-20T04:46:01.236Z · LW(p) · GW(p)

I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn't have found that as difficult.

But what makes you so confident that it's not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?

Of course, it makes sense for other people who don't trust the (purported) expert to require an explanation, and not just take the (purported) expert's word for it. (So, I agree that fleshing out detailed examples is important for advancing our collective state of knowledge.) But the (purported) expert's own confidence should track correctness, not how easy it is to convince people using words.

Replies from: ricraz, paulfchristiano, PhilGoetz

↑ comment by Richard_Ngo (ricraz) · 2022-06-20T05:27:03.809Z · LW(p) · GW(p)

But what makes you so confident that it's not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?

Yepp, this is a judgement call. I don't have any hard and fast rules for how much you should expect experts' intuitions to plausibly outpace their ability to explain things. A few things which inform my opinion here:

Explaining things to other experts should be much easier than explaining them to the public.
Explaining things to other experts should be much easier than actually persuading those experts.
It's much more likely that someone has correct intuitions if they have a clear sense of what evidence would make their intuitions stronger.

I don't think Eliezer is doing particularly well on any of these criteria. In particular, the last one was why I pressed Eliezer to make predictions rather than postdictions in my debate with him. The extent to which Eliezer seemed confused that I cared about this was a noticeable update for me in the direction of believing that Eliezer's intuitions are less solid than he thinks.

It may be the case that Eliezer has strong object-level intuitions about the details of how intelligence works which he's not willing to share publicly, but which significantly increase his confidence in his public claims. If so, I think the onus is on him to highlight that so people can make a meta-level update on it.

↑ comment by paulfchristiano · 2022-06-20T05:03:42.284Z · LW(p) · GW(p)

I agree that intuitions might get you to high confidence without the ability to explain ideas legibly.

That said, I think expert intuitions still need to usually (always?) be grounded out in predictions about something (potentially including the many implicit predictions that are often required to do stuff). It seems to me like Eliezer is probably relying on a combination of:

Predicting stuff from afar. I think that can usually be made legible with a few years' lead time. I'm sympathetic to the difficulty of doing this (despite my frequent snarky tone), though without doing it I think Eliezer himself should have more doubts about the possibility of hindsight bias if this is really his main source of evidence. In theory this could also be retrodictions about history which would make things more complicated in some ways but faster in others.
Testing intuitions against other already-trusted forms of reasoning, and particularly concrete arguments. In this regime, I don't think it's necessarily the case that Eliezer ought to be able to easily write down a convincing version of the arguments, but I do think we should expect him to systematically be right more often when we dig into arguments evaluated using those same already-trusted forms of reasoning. And so on this perspective it's also quite useful to get into the argument.

I'm also sympathetic to Richard's point as a more contingent claim about the nature of expert intuition on technical topics---there just aren't many domains where you can get this much confidence out of intuitions you can't explain more crisply than this. Though I wouldn't be super confident about that, and (like you) I would mostly lean on the claim about collective epistemology if trying to push Eliezer to do something differently.

↑ comment by PhilGoetz · 2022-11-08T00:21:58.201Z · LW(p) · GW(p)

But what makes you so confident that it's not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?

That's irrelevant, because what Richard wrote was a truism. An Eliezer who understands his own confidence in his ideas will "always" be better at inspiring confidence in those ideas in others. Richard's statement leads to a conclusion of import (Eliezer should develop arguments to defend his intuitions) precisely because it's correct whether Eliezer's intuitions are correct or incorrect.

comment by alyssavance · 2022-06-19T20:37:42.171Z · LW(p) · GW(p)

Fantastic post! I agree with most of it, but I notice that Eliezer's post has a strong tone of "this is really actually important, the modal scenario is that we literally all die, people aren't taking this seriously and I need more help". More measured or academic writing, even when it agrees in principle, doesn't have the same tone or feeling of urgency. This has good effects (shaking people awake) and bad effects (panic/despair), but it's a critical difference and my guess is the effects are net positive right now.

Replies from: paulfchristiano, conor-sullivan

↑ comment by paulfchristiano · 2022-06-19T21:11:09.788Z · LW(p) · GW(p)

I definitely agree that Eliezer's list of lethalities hits many rhetorical and pedagogical beats that other people are not hitting and I'm definitely not hitting. I also agree that it's worth having a sense of urgency given that there's a good chance of all of us dying (though quantitatively my risk of losing control of the universe through this channel is more like 20% than 99.99%, and I think extinction is a bit less likely still).

I'm not totally sure about the net effects of the more extreme tone, I empathize with both the case in favor and the case against. Here I'm mostly just trying to contribute to the project of "get to the bottom of what's likely to happen and what should be done."

I did start the post with a list of 19 agreements with Eliezer, including many of the claims that are most relevant to the urgency, in part so that I wouldn't be misconstrued as arguing that everything is fine.

Replies from: RAB, ThirdSequence

↑ comment by RAB · 2022-06-20T19:26:44.512Z · LW(p) · GW(p)

I really appreciate your including a number here, that's useful info. Would love to see more from everyone in the future - I know it takes more time/energy and operationalizations are hard, but I'd vastly prefer to see the easier versions over no versions or norms in favor of only writing up airtight probabilities.

(I also feel much better on an emotional level hearing 20% from you, I would've guessed anywhere between 30 and 90%. Others in the community may be similar: I've talked to multiple people who were pretty down after reading Eliezer's last few posts.)

↑ comment by ThirdSequence · 2023-11-30T13:51:00.603Z · LW(p) · GW(p)

This recent tweet claims that your current p(doom) is 50%.

In another post [LW · GW], you mentioned:

"[...] I give different numbers on different days. Sometimes that’s because I’ve considered new evidence, but normally it’s just because these numbers are just an imprecise quantification of my belief that changes from day to day."

If the tweet is credible, I am curious if this difference in p(doom) is due to the day-to-day fluctuations of your belief, or have you considered new evidence and your initial belief that p(doom) < 20% is outdated?

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2023-11-30T17:26:35.278Z · LW(p) · GW(p)

I clarified my views here [LW · GW] because people kept misunderstanding or misquoting them.

The grandparent describes my probability that humans irreversibly lose control of AI systems, which I'm still guessing at 10-20%. I should probably think harder about this at some point and revise it, I have no idea which direction it will move.

I think the tweet you linked is referring to the probability for "humanity irreversibly messes up our future within 10 years of building human-level AI." (It's presented as "probability of AI killing everyone" which is not really right.)

I generally don't know what people mean when they say p(doom). I think they probably imagine that the vast majority of existential risk from AI comes from loss of control, and that catastrophic loss of control necessarily leads to extinction, both of which seem hard to defend.

↑ comment by Lone Pine (conor-sullivan) · 2022-06-23T15:47:46.809Z · LW(p) · GW(p)

The problem with Eliezer's recent posts (IMO) is not in how pessimistic they are, but in how they are actively insulting to the reader. EY might not realize that his writing is insulting, but in that case he should have an editor who just elides those insulting points. (And also s/Eliezer/I/g please.)

Replies from: wnoise

↑ comment by wnoise · 2023-11-01T02:41:59.475Z · LW(p) · GW(p)

My sense is not that Eliezer is insulting, so much as mainstream (and neurotypical) discourse treats far too much as insulting. Might be a distinction without a difference in practice, but "insulting" is relative to viewpoint and context, not an objective matter.

comment by Ben Pace (Benito) · 2022-06-19T20:43:56.390Z · LW(p) · GW(p)

Solid contribution, thank you.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-20T04:13:38.007Z · LW(p) · GW(p)

Agreed explicitly for the record.

comment by Sam Marks (samuel-marks) · 2022-06-20T01:41:52.570Z · LW(p) · GW(p)

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.

I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.

Points on both lists:

Eliezer's "first critical try" framing downplays the importance of trial-and-error with non-critical tries.
It's not clear that a "pivotal act" by an aligned AI is the only way to prevent unaligned AI systems from being created.
Eliezer badly equivocates between "alignment is hard"/"approach X to alignment doesn't obviously solve it" and "alignment is impossible to solve within our time limit"/"approach X to alignment is doomed."
Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.
Eliezer's arguments for fast takeoffs aren't precise enough to warrant his confidence.
Eliezer's reasoning on generalization across distributional shift seems sloppy. Paul doesn't dig into this much, but I would add that there are approaches [LW · GW] to reasoning about the inductive biases of ML systems which, together with Anthropic-esque empirical work on how things scale with capabilities, could give us some measure of confidence that a promising-looking alignment scheme will generalize.
Based on recent work, ML systems might be much more interpretable than Eliezer seems to think.
Eliezer doesn't seriously engage with any of the most promising approaches to alignment (and by his own admission, probably could not pass an ITT for them).
Debate-esque strategies for checking outputs of powerful AI systems aren't obviously doomed by Eliezer's concerns about coordination.
Eliezer's argument that it's impossible to train a powerful agent by imitating human thought seems bad.
Regarding the question "Why did no one but Eliezer write a List of Lethalities?" a pretty plausible answer is "because List of Lethalities was not an especially helpful document and other researchers didn't think writing it was a priority."

I won't try to list all of the things that Paul mentioned which weren't on my list, but some of the most useful (for me) were:

Eliezer's doomy stories often feature a superintelligent AI system which is vastly smarter than it needs to be in order to kill us, which is a bit unrealistic since these stories ought to be about the first AI which is powerful enough to attempt permanently disempowering us. To patch this story, you need to either imagine a less powerful system turning dangerous or humans having already made aligned systems up to the level of capabilities of the new dangerous system, both of which feel less scary than the classic "atomized by nanobots" stories.
AI systems will be disincentivized from hiding their capabilities, since we'll be trying to produce AI systems with powerful capabilities.
Approaches to alignment are disjunctive, so pessimistic cases need to seriously engage with an existential quantifier over the research humans (perhaps assisted by whatever AI research assistants we can safely produce) can perform in the coming ~decades.
Since R&D is out-of-distribution for humans, we might expect to have a comparative advantage in dealing with deception from AI systems.

Finally, a few points which were on my list and not Paul's, and which I feel like writing out:

"Consequentialist which plans explicitly using exotic decision theory" is not a likely shape for the first superintelligent AI systems to take, but many of Eliezer's doomy arguments seem to assume AI systems of that form. Now, it's true that the AI systems we build might figure out that agents of that form are especially powerful and invest time into trying to build them. But that's a problem we can hopefully leave to our aligned superintelligent research assistants; building such aligned research assistants seems much less doomed.
(This is a disagreement with both Paul and Eliezer.) Contra the view that capabilities will necessarily improve a lot before alignment failures start being a problem, it seems plausible to me that many commercial applications for AI might rely on solving alignment problems. You can't deploy your smart power grid if it keeps doing unsafe things.
Eliezer's view that you're not likely to make progress on alignment unless you figured out it was a problem by yourself seems insane to me. I can't think of any other research field like this ("you can't be expected to make progress on mitigating climate change unless you independently discovered that climate change would be a problem"), and I'm not sure where Eliezer's opinion that alignment is an exception is coming from.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-20T04:09:49.503Z · LW(p) · GW(p)

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer

Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.

Replies from: concernedcitizen64, samuel-marks, TurnTrout, owngrove, quintin-pope, david-johnston, sharmake-farah

↑ comment by concernedcitizen64 · 2022-06-20T04:43:29.953Z · LW(p) · GW(p)

It seems to me like you have a blind spot [LW(p) · GW(p)] regarding how your position as a community leader functions. If you, a very well-respected, high-status rationalist, write a long, angry post [LW · GW] dedicated to showing everyone else that they can't do original work and that their earnest attempts at solving the problem are, at best, ineffective & distracting and you're tired of having to personally go critique all of their action plans... They stop proposing action plans. They don't want to dilute the field with their "noise", and they don't want you and others to think they're stupid for not understanding why their actions are ineffective or not serious attempts in the first place. I don't care what you think you're saying - the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they're not you and can't think correctly about these sorts of issues.

[Redacted rant/vent for being mean-spirited and unhelpful]

Replies from: TurnTrout, Eliezer_Yudkowsky, Emrik North, Evan R. Murphy

↑ comment by TurnTrout · 2022-06-20T16:45:42.279Z · LW(p) · GW(p)

I don't care what you think you're saying - the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they're not you and can't think correctly about these sorts of issues.

I think this is, unfortunately, true. One reason people might feel this way is because they view LessWrong posts through a social lens. Eliezer posts about how doomed alignment is and how stupid everyone else's solution attempts are, that feels bad, you feel sheepish about disagreeing, etc.

But despite understandably having this reaction to the social dynamics, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer's posts, I tell them to be quiet, that's not important, who cares whether he thinks I'm a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to that generator's outputs, but it's still up to me to evaluate those claims and statements, to think for myself.

If Eliezer says everyone's ideas are awful, that's another claim to be evaluated. If Eliezer says we are doomed, that's another claim to be evaluated. The point is not to argue Eliezer into agreement, or to earn his respect. The point is to win in reality, and I'm not going to do that by constantly worrying about whether I should shut up.

If I'm wrong on an object-level point, I'm wrong, and I'll change my mind, and then keep working. The rest is distraction.

Replies from: Zvi, Dirichlet-to-Neumann

↑ comment by Zvi · 2022-06-20T18:24:30.499Z · LW(p) · GW(p)

Sounds like, in the same way we had a dumb questions post, we need somewhere explicitly for posting dumb potential solutions that will totally never work, or something, maybe?

Replies from: alenglander, TurnTrout, eugene-d, adrian-arellano-davin

↑ comment by Aryeh Englander (alenglander) · 2022-06-23T16:18:39.190Z · LW(p) · GW(p)

I have now posted a "Half-baked AI safety ideas thread" (LW version [LW · GW], EA Forum version [EA · GW]) - let me know if that's more or less what you had in mind.

↑ comment by TurnTrout · 2022-06-22T23:35:47.606Z · LW(p) · GW(p)

I think it's unwise to internally label good-faith thinking as "dumb." If I did that, I feel that I would not be taking my own reasoning seriously. If I say a quick take, or an uninformed take, I can flag it as such. But "dumb potential solutions that will totally never work"? Not to my taste.

That said, if a person is only comfortable posting under the "dumb thoughts incoming" disclaimer—then perhaps that's the right move for them.

Replies from: Rana Dexsin

↑ comment by Rana Dexsin · 2022-06-24T17:31:14.491Z · LW(p) · GW(p)

The point of that label is that for someone who already has the status-sense of “my ideas are probably dumb”, any intake point that doesn't explicitly say “yeah, dumb stuff accepted here” will act as an emotional barrier. If you think what you're carrying is trash, you'll only throw it in the bin and not show it to anyone. If someone puts a brightly-colored bin right in front of you instead with “All Ideas Recycling! Two Cents Per Idea”, maybe you'll toss it in there instead.

In the more general population, I believe the underlying sense to be a very common phenomenon, and easily triggered. Unless there is some other social context propping up a sense of equality, people will regularly feel dumb around you because you used a single long-and-classy-sounding word they didn't know, or other similar grades of experience. Then they will stop telling you things. Including important things! If someone else who's aligned can very overtly look less intimidating to step up and catch them, especially if they're also volunteering some of the filtering effort that might otherwise make a broad net difficult to handle, that's a huge win, especially because when people stop telling you things they often also stop listening and stop giving you the feedback you need to preserve alliances, much less try to convince them of anything “for real” rather than them walking away and feeling a sense of relief and throwing everything you said in the “that's not for people like me” zone and never thinking about it again.

Notice what Aryeh Englander emphasized near the beginning of each of these secondary posts: “I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them” [LW · GW], “I don't want to spam the group with half-thought-through posts, but I also want to post these ideas” [LW · GW]. Beyond their truth value, these act as status-hedges (or anti-hedges, if you want to think of it in the sense of a hedge maze). They connect the idea of “I am feeling the same intimidation as you; I feel as dumb as you feel right now” with “I am acting like it's okay to be open about this and giving you implicit permission to do the same [LW · GW]”, thus helping puncture the bubble. (There is potentially some discussion to be had around the Sequences link I just edited in and what that implies for what can be expected socially, but I don't want to dig too far unless people are interested and will only say that I don't think relying on people putting that principle into practice most of the time is realistic in this context.)

↑ comment by Eugene D (eugene-d) · 2022-06-22T23:38:49.159Z · LW(p) · GW(p)

I for one really appreciate the 'dumb-question' area :)

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-22T11:26:19.837Z · LW(p) · GW(p)

Oh yes please. Maybe some tag that could be added to the comment. Maybe a comment in a different color.

↑ comment by Dirichlet-to-Neumann · 2022-06-23T14:12:29.696Z · LW(p) · GW(p)

Saying that people should not care about social dynamics and only about object level arguments is a failure at world modelling. People do care about social dynamics; if you want to win, you need to take that into account. If you think that people should act differently, well, you are right, but the people who count are the real ones, not those who live in your head.

Incentives matter. In today's LessWrong, the threshold of quality for having your ideas heard (rather than everybody ganging up on you to explain how wrong you are) is much higher for people who disagree with Eliezer than for people who agree with him. Unsurprisingly, that means that people filter what they say at a higher rate if they disagree with Eliezer (or any other famous user honestly - including you).

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-25T23:45:05.105Z · LW(p) · GW(p)

I wondered whether people would take away the message that "The social dynamics aren't important." I should have edited to clarify, so thanks for bringing this up.

Here was my intended message: The social dynamics are important, and it's important to not let yourself be bullied around, and it's important to make spaces where people aren't pressured into conformity. But I find it productive to approach this situation with a mindset of "OK, whatever, this Eliezer guy made these claims, who cares what he thinks of me, are his claims actually correct?" This tactic doesn't solve the social dynamics issues on LessWrong. This tactic just helps me think for myself.

So, to be clear, I agree that incentives matter, I agree that incentives are, in one way or another, bad around disagreeing with Eliezer (and, to lesser extents, with other prominent users). I infer that these bad incentives spring both from Eliezer's condescension and rudeness, and also a range of other failures.

For example, if many people aren't just doing their best to explain why they best-guess-of-the-facts agree with Eliezer—if those people are "ganging up" and rederiving the bottom line of "Eliezer has to be right"—then those people are failing at rationality.

or any other famous user honestly - including you.

For the record, I welcome any thoughtful commenter to disagree with me, for whatever small amount that reduces the anti-disagreement social pressure. I don't negatively judge people who make good-faith efforts to disagree with me, even if I think their points are totally mistaken.

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-20T06:03:57.643Z · LW(p) · GW(p)

Seems to be sort of an inconsistent mental state to be thinking like that and writing up a bullet-point list of disagreements with me, and somebody not publishing the latter is, I'm worried, anticipating social pushback that isn't just from me.

Replies from: swarriner, John_Maxwell_IV, pktechgirl

↑ comment by swarriner · 2022-06-20T16:56:01.731Z · LW(p) · GW(p)

somebody not publishing the latter is, I'm worried, anticipating social pushback that isn't just from me.

Respectfully, no shit Sherlock, that's what happens when a community leader establishes a norm of condescending to inquirers.

I feel much the same way as Citizen in that I want to understand the state of alignment and participate in conversations as a layperson. I, too, have spent time pondering your model of reality to the detriment of my mental health. I will never post these questions and criticisms to LW because even if you yourself don't show up to hit me with the classic:

Answer by Eliezer Yudkowsky [LW · GW]Apr 10, 2022 [LW(p) · GW(p)] 38
As a minor token of how much you're missing:

then someone else will, having learned from your example. The site culture has become noticeably more hostile in my opinion ever since Death with Dignity, and I lay that at least in part at your feet.

Replies from: TurnTrout, concernedcitizen64, concernedcitizen64

↑ comment by TurnTrout · 2022-06-20T16:59:37.670Z · LW(p) · GW(p)

Yup, I've been disappointed with how unkindly Eliezer treats people sometimes. Bad example to set.

EDIT: Although I note your comment's first sentence is also hostile, which I think is also bad.

↑ comment by concernedcitizen64 · 2022-06-20T19:59:54.707Z · LW(p) · GW(p)

Let me make it clear that I'm not against venting, being angry, even saying to some people "dude, we're going to die", all that. Eliezer has put his whole life into this field and I don't think it's fair to say he shouldn't be angry from time to time. It's also not a good idea to pretend things are better than they actually are, and that includes regulating your emotional state to the point that you can't accurately convey things. But if the linchpin of LessWrong says that the field is being drowned by idiots pushing low-quality ideas (in so many words), then we shouldn't be surprised when even people who might have something to contribute decide to withhold those contributions, because they don't know whether or not they're the people doing the thing he's explicitly critiquing.

↑ comment by concernedcitizen64 · 2022-06-20T23:21:34.328Z · LW(p) · GW(p)

You (and probably I) are doing the same thing that you're criticizing Eliezer for. You're right, but don't do that. Be the change you wish to see in the world.

Replies from: iceman, Dirichlet-to-Neumann, swarriner

↑ comment by iceman · 2022-06-21T20:55:45.539Z · LW(p) · GW(p)

That sort of thinking is why we're where we are right now.

Be the change you wish to see in the world.

I have no idea how that cashes out game-theoretically. There is a difference between moving from the mutual cooperation square to one of the exploitation squares, and moving from an exploitation square to mutual defection. The first defection is worse because it breaks the equilibrium, while the defection in response is a defensive play.
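A toy payoff table can make the two moves concrete. This is only a sketch: the payoffs are the conventional textbook prisoner's dilemma values (an assumption, nothing in the thread specifies numbers), and `PAYOFF` and `payoff_change` are hypothetical names.

```python
# Toy prisoner's dilemma. Payoffs are the conventional textbook values
# (temptation 5, reward 3, punishment 1, sucker 0): an assumption for
# illustration, not something stated in the thread.
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("D", "C"): (5, 0),  # row player exploits column player
    ("C", "D"): (0, 5),  # column player exploits row player
    ("D", "D"): (1, 1),  # mutual defection
}

def payoff_change(before, after):
    """Per-player payoff change when the outcome moves from `before` to `after`."""
    (r0, c0), (r1, c1) = PAYOFF[before], PAYOFF[after]
    return (r1 - r0, c1 - c0)

# First defection: the row player leaves mutual cooperation to exploit.
first_defection = payoff_change(("C", "C"), ("D", "C"))
# Defection in response: the exploited column player stops cooperating.
retaliation = payoff_change(("D", "C"), ("D", "D"))

print(first_defection)  # (2, -3)
print(retaliation)      # (-4, 1)
```

Under these assumed numbers, the first defection gains 2 for the defector and costs the other player 3, while the responding defection recovers 1 for the exploited player and removes the exploiter's advantage. Whether that asymmetry carries over to tone on a forum is, of course, the point under dispute.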

swarriner's post, including the tone, is True and Necessary.

↑ comment by Dirichlet-to-Neumann · 2022-06-23T14:15:07.774Z · LW(p) · GW(p)

High prestige users being condescending to low prestige users does not promote the same social norms as low prestige users being impertinent to high prestige users.

↑ comment by swarriner · 2022-06-20T23:54:33.059Z · LW(p) · GW(p)

While that's an admirable position to take and I'll try to take it in hand, I do feel EY's stature in the community puts us in differing positions of responsibility concerning tone-setting.

↑ comment by John_Maxwell (John_Maxwell_IV) · 2022-06-20T11:25:09.111Z · LW(p) · GW(p)

Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.

Replies from: conor-sullivan

↑ comment by Lone Pine (conor-sullivan) · 2022-06-23T16:18:01.205Z · LW(p) · GW(p)

Can you give us 3-5 bullet points of summary?

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2022-06-23T23:47:33.192Z · LW(p) · GW(p)

- Power makes you dumb, stay humble.
- Tell everyone in the organization that safety is their responsibility, everyone's views are important.
- Try to be accessible and not intimidating, admit that you make mistakes.
- Schedule regular chats with underlings so they don't have to take initiative to flag potential problems. (If you think such chats aren't a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)
- Accept that not all of the critical feedback you get will be good quality.

The book disrecommends anonymous surveys on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community though -- retaliation against critics seems fairly common here (i.e. the culture of fear [EA · GW] didn't come about by chance). Anyone who's been around here long enough will have figured out that sharing your views openly isn't safe. (See also the "People are pretty justified in their fears of critiquing EA leadership/community norms" bullet point here [EA · GW], and the last paragraph in this comment [EA(p) · GW(p)].)

Replies from: conor-sullivan

↑ comment by Lone Pine (conor-sullivan) · 2022-06-24T09:15:08.873Z · LW(p) · GW(p)

Sure is lovely how the rationalist community is living up to its rationality norms.

↑ comment by Elizabeth (pktechgirl) · 2022-06-27T04:35:54.025Z · LW(p) · GW(p)

I think it is very true that the pushback is not just from you, and that nothing you could do would drive it to zero, but also that different actions from you would lead to a lot less fear of bad reactions from both you and others.

↑ comment by Emrik (Emrik North) · 2022-06-23T19:29:03.710Z · LW(p) · GW(p)

To be honest, the fact that Eliezer is being his blunt unfiltered self is why I'd like to go to him first if he offered to evaluate my impact plan re AI. Because he's so obviously not optimising for professionalism, impressiveness, status, etc. he's deconfounding his signal and I'm much better able to evaluate what he's optimising for.^[1] Hence why I'm much more confident that he's actually just optimising for roughly the thing I'm also optimising for. I don't trust anyone who isn't optimising purely to be able to look at my plan and think "oh ok, despite being a nobody this guy has some good ideas" if that were true.

And then there's the Graham's Design Paradox thing. I think I'm unusually good at optimising purely, and I don't think people who aren't around my level or above would be able to recognise that. Obviously, he's not the only one, but I've read his output the most, so I'm more confident that he's at least one of them.

^[1]
Yes, perhaps a consequentialist would be instrumentally motivated to try to optimise more for these things, but the fact that Eliezer doesn't do that (as much) just makes it easier to understand and evaluate him.

↑ comment by Evan R. Murphy · 2022-06-21T00:15:42.511Z · LW(p) · GW(p)

They don't want to dilute the field with their "noise"

I think it would be great if, for posts and comments about AI on LessWrong, we could establish a more tolerant atmosphere and a bias toward posting/commenting without fear of producing "noise". The AI Alignment Forum exists to be the discussion platform that's filtered to only high-quality posts and comments. So it seems suboptimal, and a failure to take advantage of the dual-forum system, for people to self-censor to a large degree on the more permissive forum (i.e. LessWrong).

(This is not at all to dismiss your concerns and say "you should feel more comfortable speaking freely on LessWrong". Just stating a general direction I'd like to see the community and conversation norms move in.)

↑ comment by Sam Marks (samuel-marks) · 2022-06-20T09:17:41.056Z · LW(p) · GW(p)

Why privately?!

(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)

In order of importance, starting from the most important:

1. It would take a lot of effort to turn the list of disagreements I wrote for myself into a proper post, and I decided the effort wasn't worth it. I'm impressed how quickly Paul wrote this response, and it wouldn't surprise me if there are some people reading this who are now wondering if they should still post their rebuttals they've been drafting for the last week.
2. As someone without name recognition, I have a general fear -- not unfounded, I think -- of posting my opinions on alignment publicly, lest they be treated as the ramblings of a self-impressed newcomer with a shallow understanding of the field.^[1] Some important context is that I'm a math grad student in the process of transitioning into a career in alignment, so I'm especially sensitive right now about safeguarding my reputation.
3. I expected (rightly) that someone more established than me would end up posting a rebuttal better than mine.
4. General anxiety around posting my thoughts (what if my ideas are dumb? what if no one takes them seriously? etc)
5. My inside view was that List of Lethalities was somewhere between unhelpful and anti-helpful, and I was pretty mad about it.^[2] I worried that if I tried to draft a reply, it would come across as angrier than I reflectively endorse. (And this would have also carried reputational costs.)

And finally, one reason which wasn't really a big deal, maybe like 1% of my hesitance, but which I'll include just because I think it makes a funny story:

6. This coming spring I'll be teaching a Harvard math dept course on MIRI-style decision theory^[3]. I had in mind that I might ask you (Eliezer) if you wanted to give a guest lecture. But I figured you probably wouldn't be interested in doing so if you knew me as "the unpleasant-seeming guy who wrote an angry list of all the reasons List of Lethalities was dumb," so.

^[1]
Some miscellaneous related thoughts:
- LessWrong does have a fair number of posts these days which I'd categorize as "ramblings by someone with a shallow understanding of alignment," so I don't begrudge anyone for starting out with a prior that mine is one such.
- Highly public discussions like the one launched by List of Lethalities seem more likely to attract such posts, relative to narrower discussions on more niche topics. This makes me especially reticent to publicly opine on discussions like this one.
^[2]
On the morning after List of Lethalities was published, a friend casually asked how I was doing. I replied, "I wish I had a mood ring with a color for 'mad at Eliezer Yudkowsky' because then you wouldn't have to ask me how I'm doing."
^[3]
Given the context, I should clarify that my inside-view doesn't actually expect MIRI-style decision theory to be useful towards alignment; my motivation for teaching a course on the topic is just that it seems fun and was easy to plan.

↑ comment by TurnTrout · 2022-06-20T16:35:21.467Z · LW(p) · GW(p)

OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked in the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.

Here's one stab^[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties^[2] of the only general intelligences we have ever found to exist ever. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their "love" value to configurations of atoms? If it's really hard to get intelligences to care about reality, how does the genome do it millions of times each day?

Taking an item from your lethalities post:

19... More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.

There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no.19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment [LW(p) · GW(p)].) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.

I haven't made a public post out of my document on shard theory yet, because idea inoculation. Apparently, the document isn't yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don't know how to credibly communicate that the theory is at the level of actually really important to evaluate & critique ASAP, because time is slipping away. But I'll keep trying anyways.

^[1]
I'm attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader's patience for improved future communication attempts.
^[2]
Like

1. "Human beings tend to bind their terminal values to their model of reality", or

2. "Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don't stop caring about their family because they can model the world in terms of complex amplitudes."

Replies from: ramana-kumar

↑ comment by Ramana Kumar (ramana-kumar) · 2022-06-21T09:51:25.978Z · LW(p) · GW(p)

Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.

But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).

The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:

- for AGIs there's a principal (humans) that we want to align the AGI to
- for humans there is no principal - our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn't solved.

Replies from: TurnTrout, conor-sullivan

↑ comment by TurnTrout · 2022-06-22T00:23:50.054Z · LW(p) · GW(p)

I addressed this distinction previously [LW(p) · GW(p)], in one of the links in OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, as long as it's external. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.

The important disanalogy

I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.

for humans there is no principal - our values can be whatever

Huh? I think I misunderstand you. I perceive you as saying: "There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values."

If so, I strongly disagree. Like, in the world where that is true, wouldn't parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not "whatever", human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food.

Or if you take evolution as the principal, the alignment problem wasn't solved.

The linked theory makes it obvious why evolution couldn't have possibly solved the human alignment problem. To quote:

Since human values are generally defined over the learned human WM, evolution could not create homo inclusive-genetic-fitness-maximus.
If values form because reward sends reinforcement flowing back through a person's cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-reduction process and thus are easy to form values around.
But if there aren't local mutations which make a person more probable to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can't instill this value. Even if the descendants of that person will later be able to think thoughts about fitness.
On the other hand, under this theory, human values (by their nature) usually involve concepts which are easy to form shards of value around... Shard theory provides a story for why we might succeed at shard-alignment, even though evolution failed.
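The quoted credit-assignment claim can be caricatured in a few lines. This is a deliberately crude sketch, not shard theory's actual mechanism or any real reinforcement learning algorithm; the function `reinforce`, the concept names, and the update rule are all assumptions made for illustration.

```python
import random

def reinforce(episodes, concepts, thinkable, lr=0.1, seed=0):
    """Toy credit assignment: each reward strengthens only the concept
    that was active (being thought about) when the reward arrived."""
    rng = random.Random(seed)
    strength = {c: 0.0 for c in concepts}
    for _ in range(episodes):
        active = rng.choice(thinkable)   # the thought preceding the reward
        reward = 1.0                     # crude, concept-agnostic reward signal
        strength[active] += lr * reward  # credit flows only to the active thought
    return strength

concepts = ["lollipop", "people", "inclusive_fitness"]
# Assumption: this agent never has thoughts about inclusive fitness.
strengths = reinforce(100, concepts, thinkable=["lollipop", "people"])
print(strengths["inclusive_fitness"])  # 0.0
```

In this cartoon, a concept that is never active when reward arrives receives no reinforcement at all, no matter how many rewarded episodes occur, which is the shape of the argument for why evolution could not instill a value around inclusive genetic fitness.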

(Edited to expand my thoughts)

Replies from: ramana-kumar

↑ comment by Ramana Kumar (ramana-kumar) · 2022-06-23T09:58:26.368Z · LW(p) · GW(p)

I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":

1. cares about something
2. cares about something external (not shallow function of local sensory data)
3. cares about something specific and external

(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.

I think I need to distinguish two versions of feat 3:

3a. there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
3b. there is a principal who gets to choose what the specific targets of caring are (and they succeed)

Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-26T00:36:10.144Z · LW(p) · GW(p)

Hm. I feel confused about the importance of 3b as opposed to 3a. Here's my first guess: Because we need to target the AI's motivation in particular ways in order to align it with particular desired goals, it's important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards "dog" or "rock" or "cheese wheels" or "cooperating with humans."

Is this close?

Replies from: ramana-kumar

↑ comment by Ramana Kumar (ramana-kumar) · 2022-06-27T09:41:27.985Z · LW(p) · GW(p)

Yes that sounds right to me.

↑ comment by Lone Pine (conor-sullivan) · 2022-06-23T16:23:53.367Z · LW(p) · GW(p)

Or if you take evolution as the principal, the alignment problem wasn't solved.

In what sense? Because modern humans use birth control? Then what do you make of the fact that most people seem to care about whether biological humans exist a billion years hence?

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-26T00:54:24.061Z · LW(p) · GW(p)

People definitely do not terminally care about inclusive genetic fitness in its pure abstract form, there is not something inside of them which pushes for plans which increase inclusive genetic fitness. Evolution failed at alignment, strictly speaking.

I think it's more complicated to answer "did evolution kinda succeed, despite failing at direct alignment?", and I don't have time to say more at the moment, so I'll stop there.

Replies from: ricardo-meneghin-filho, conor-sullivan

↑ comment by Ricardo Meneghin (ricardo-meneghin-filho) · 2022-06-26T12:49:19.296Z · LW(p) · GW(p)

I think the focus on "inclusive genetic fitness" as evolution's "goal" is weird. I'm not even sure it makes sense to talk about evolution's "goals", but if you want to call it an optimization process, the choice of "inclusive genetic fitness" as its target is arbitrary as there are many other boundaries one could trace. Evolution is acting at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it is not selecting adaptations which increase the genetic fitness of an individual but lead to the extinction of the species later. In the most basic sense evolution is selecting for "things that expand", in the entire universe, and humans definitely seem partially aligned with that - the ways in which they aren't seem non-competitive with this goal.

↑ comment by Lone Pine (conor-sullivan) · 2022-06-26T02:15:21.639Z · LW(p) · GW(p)

I don't know, if I was a supervillain I'd certainly have a huge number of kids and also modify my and my children's bodies to be more "inclusively genetically fit" in any way my scientist-lackeys could manage. Parents also regularly put huge amounts of effort into their children's fitness, although we might quibble about whether in our culture they strike the right balance of economic, physical, social, emotional etc fitness.

↑ comment by owngrove · 2022-06-20T07:38:21.606Z · LW(p) · GW(p)

One reason you might do something like "writing up a list but not publishing it" is if you perceive yourself to be in a mostly-learning mode rather than a mostly-contributing one. You don't want to dilute the discussion with your thoughts that don't have a particularly good chance of adding anything, and you don't want to be written off as someone not worth listening to in a sticky way, but you want to write something down to develop your understanding / check against future developments / record anything that might turn out to have value later after all once you understand better.

Of course, this isn't necessarily an optimal or good strategy, and people might still do it when it isn't - I've written down plenty of thoughts on alignment over the years, I think many of the actual-causal-reasons I'm a chronic lurker are pretty dumb and non-agentic - but I think people do reason like this, explicitly or implicitly.

There's a connection here to concernedcitizen64's point about your role as a community leader, inasmuch as your claims about the quality of the field can significantly influence people's probabilities that their ideas are useful / that they should be in a contributing mode, but IMO it's more generally about people's confidence in their contributions.

Overall I'd personally guess "all the usual reasons people don't publish their thoughts" over "fear of the reception of disagreement with high-status people" as the bigger factor here; I think the culture of LW is pretty good at conveying that high-quality criticism is appreciated.

Replies from: samuel-marks

↑ comment by Sam Marks (samuel-marks) · 2022-06-20T08:19:38.847Z · LW(p) · GW(p)

(I mostly endorse this explanation, but am also writing a reply with some more details.)

↑ comment by Quintin Pope (quintin-pope) · 2022-06-21T14:30:30.690Z · LW(p) · GW(p)

I read the "List of Lethalities", think I understood it pretty well, and I disagree with it in multiple places. I haven't written those disagreements up like Paul did because I don't expect that doing so would be particularly useful. I'll try to explain why:

The core of my disagreement is that I think you are using a deeply mistaken framing of agency / values and how they arise in learning processes. I think I've found a more accurate framing, from which I've drawn conclusions very different to those expressed in your list, such as:

- Human values are not as fragile as they introspectively appear. The felt sense of value fragility is, in large part, due to a type mismatch between the cognitive processes which form, implement, and store our values on the one hand and the cognitive processes by which we introspect on our current values on the other.
- The processes by which we humans form/reflect on/generalize our values are not particularly weird among the space of processes able to form/reflect on/generalize values. Evolution pretty much grabbed the most accessible such process and minimally modified it in ways that are mostly irrelevant to alignment. E.g., I think we're more inclined to generalize our values in ways that conform to the current social consensus, as compared to an "idealized" value forming/reflecting/generalizing process.
  - Relatedly, I think that "values meta-preferences" have a simple and fairly convergent core of how to do correct values reflection/generalization, in much the same way that "scientific discovery" has a simple, convergent core of how to do correct inference (i.e., Bayesianism^[1]).
- It's possible for human and AI value systems to partially overlap to a non-trivial degree that's robust to arbitrary capabilities gain on the part of the AI, such that a partially misaligned AI might still preserve humanity in a non-terrible state, depending on the exact degree and type of the misalignment.

The issue is that this list of disagreements relies on a framing which I've yet to write up properly. If you want to know whether or how much to update on my list, or how to go about disagreeing with the specifics of my beliefs, you'll need to know the frame I'm using. Given inferential distance, properly introducing / explaining new frames is very difficult. Anyone interested can look at my current early draft for introducing the frame (though please take care not to let the current bad explanation inoculate you against a good idea).

So, my current plan is to continue working on posts that target deeper disagreements, even though there are many specific areas where I think the "List of Lethalities" is wrong.

^[1]
Well, the correct answer here is probably actually infra-Bayesianism, or possibly something even weirder. The point is, it's information-theoretic-simple and convergently useful for powerful optimizing systems.

↑ comment by David Johnston (david-johnston) · 2022-06-21T13:28:29.327Z · LW(p) · GW(p)

I've written a few half-baked alignment takes for Less Wrong, and they seem to have mostly been ignored. I've since decided to either bake things fully, look for another venue, or not bother, and I'm honestly not particularly enthused about the fully bake option. I don't know if anything similar has had any impact on Sam's thinking.

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-21T14:56:14.868Z · LW(p) · GW(p)

My own biggest disagreement with you is the idea that morality and values are objective. While I'm a moral realist, I'm the weakest kind of realist and view morals and values as inherently subjective. In other words, there's no fact of the matter here, and post-modernism is actually useful here (I'm a strong critic of post-modernism, but it's basically correct vis-a-vis morality and values).

Replies from: AgentME

↑ comment by AgentME · 2022-06-23T21:02:56.093Z · LW(p) · GW(p)

I think you misunderstand EY if you think he believes that morality and values are objective. If they were, then alignment would be easy because as long as the AI was smart enough, it could be depended on to figure out the "correct" morality and values. The common values that humanity shares are probably in part arbitrary evolutionary accidents. The goal is to create AI with values that allow humanity to live by its values, instead of creating an AI with non-overlapping values caused by its own design accidents. (EY's article Sorting pebbles into correct heaps [LW · GW] implies some of these ideas.)

comment by Rohin Shah (rohinmshah) · 2022-06-20T02:16:53.647Z · LW(p) · GW(p)

I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I'd have actively agreed with the claims.

Things that don't meet that bar:

General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don't necessarily have the same views, primarily because I've engaged much less with Eliezer and so don't have confident Eliezer-models. (They all seem plausible to me, except where I've specifically noted disagreements below.)

Agreement 14: Not sure exactly what this is saying. If it's "the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value" I agree.

Agreement 16: I agree with the general point but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of "just being nice", and even with the details I expect I'd feel pretty uncertain which was more likely.

Agreement 17: It seems totally fine to focus your attention on a specific subset of "easy-alignment" worlds and ensuring that those worlds survive, which could be described as "assuming there's a hope". That being said, there's something in this vicinity I agree with: in trying to solve alignment, people sometimes make totally implausible assumptions about the world; this is a worse strategy for reducing x-risk than working on the worlds you actually expect and giving them another ingredient that, in combination with a "positive model violation", could save those worlds.

Disagreement 10: I don't have a confident take on the primate analogy; I haven't spent enough time looking into it for that.

Disagreement 15: I read Eliezer as saying something different in point 11 of the list of lethalities than Paul attributes to him here; something more like "if you trained on weak tasks either (1) your AI system will be too weak to build nanotech or (2) it learned the general core of intelligence and will kill you once you get it to try building nanotech". I'm not confident in my reading though.

Disagreement 18: I find myself pretty uncertain about what to expect in the "breed corrigible humans" thought experiment.

Disagreement 22: I was mostly in agreement with this, but "obsoleting human contributions to alignment" is a pretty high bar if you take it literally, and I don't feel confident that happens before superintelligent understanding of the world (though it does seem plausible).

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-21T03:39:28.915Z · LW(p) · GW(p)

On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2022-06-21T19:59:28.280Z · LW(p) · GW(p)

Great, I agree with all of that.

comment by Jan_Kulveit · 2022-06-20T00:05:03.554Z · LW(p) · GW(p)

Broadly agree with this in most points of disagreement with Eliezer, and also agree with many points of agreement.

A few points where I sort of disagree with both, although this is sometimes unclear:

1.

Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.

I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.

For people who doubt this, I'd point to variance in initial governmental-level response to COVID19, which ranged from "highly incompetent" (eg. early US) to "quite competent" (eg Taiwan). (I also have some intuitions around this based on non-trivial amounts of first-hand experience with how governments actually internally worked and made decisions - which you certainly don't need to trust, but if you are highly confident in inability of governments to act, or do reasonable things, you should at least be less confident.)

2.

AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.

While I do agree there likely won't be strong technological hurdles, I think "right around the human level" is the point where it seems most likely that some regulatory hurdles can be erected, or the human coordination landscape can change, or resources spent on alignment research could grow extremely fast, or, generally, weird things can happen. While I generally agree weird bad things can happen, I also think weird good things can happen, and this seems likely to be a period of increased leverage.

3.

There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.

I do agree that the pressures exist, and that it would be bad if they caused many people working on the pessimistic-assumptions side to switch to working on e.g. corporate governance; on the other hand, I don't agree it's just a distraction. Given the previous two points, I think the overall state of power / coordination / conflict can have significant trajectory-shaping influence.

Also, this dynamic will likely bring many more people to work on alignment-adjacent topics, and I think there is some chance to steer part of this attention to productive work on important problems; I think this is more likely if at least some alignment researchers bother to engage with this influx of attention (as opposed to ignoring it as random distraction).

This response / increase in attention in some sense seems like the normal way humanity solves problems, and it may be easier to steer it than to e.g. try to find and convince random people to care about technical alignment problems.

Replies from: paulfchristiano, Wei_Dai

↑ comment by paulfchristiano · 2022-06-20T00:52:19.528Z · LW(p) · GW(p)

It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).

Here are some extremely rambling thoughts on point 3.

I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to mostly point to differences in "which problems related to AI are we trying to solve?" We could think about technical or institutional or economic approaches/aspects of any problem.

With respect to "which problem are we trying to solve?": I also think potential undesirable effects of AI on the balance of power are real and important, both because they affect our long term future and because they will affect humanity's ability to cope with problems during the transition to AI. I think that problem is at least somewhat less important than alignment, but will probably get much more attention by default. I think this is especially true from a technical perspective, because technical work plays a totally central role for alignment, and a much more unpredictable and incidental role for affecting the balance of power.

I'm not sure how alignment researchers should engage with this kind of alignment-adjacent topic. My naive guess would be that I (and probably other alignment researchers) should:

Try to have reasonable takes on other problems (and be appropriately respectful/deferential when we don't know what we're talking about).
Feel comfortable "staying in my lane" even though it does inevitably lead to lots of people being unhappy with us.
Be relatively clear about my beliefs and prioritization with EA-types who are considering where to work, even though that will potentially lead to some conflict with people who have different priorities. (Similarly, I think people who work on different approaches to alignment should probably be clear about their positions and disagree openly, even though it will lead to some conflict.)
Generally be respectful, acknowledge legitimate differences in what people care about, acknowledge differing empirical views without being overconfident and condescending about it, and behave like a reasonable person (I find Eliezer is often counterproductive on this front, though I have to admit that he does a better job of clearly expressing his concerns and complaints than I do).

I am somewhat concerned that general blurring of the lines between alignment and other concerns will tend to favor topics with more natural social gravity. That's not enough to make me think it's clearly net negative to engage, but is at least enough to make me feel ambivalent. I think it's very plausible that semi-approvingly citing Eliezer's term "the last derail" was unwise, but I don't know. In my defense, the difficulty of talking about alignment per se, and the amount of social pressure to instead switch to talking about something else, is a pretty central fact about my experience of working on alignment, and leaves me protective of spaces and norms that let people just focus on alignment.

(On the other hand: (i) I would not be surprised if people on the other side of the fence feel the same way, (ii) there are clearly spaces---like LW---where the dynamic is reversed, though they have their own problems, (iii) the situation is much better than a few years ago and I'm optimistic that will continue getting better for a variety of reasons, not least that the technical problems in AI alignment become increasingly well-defined and conversations about those topics will naturally become more focused.)

I'm not convinced that the dynamic "we care a lot about who ends up with power, and more important topics are more relevant to the distribution of power" is a major part of how humanity solves hard human vs nature problems. I do agree that it's an important fact about humans to take into account when trying to solve any problem though.

Replies from: Jan_Kulveit

↑ comment by Jan_Kulveit · 2022-06-21T09:30:13.715Z · LW(p) · GW(p)

A not very coherent response to #3. Roughly:

Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics which have significant technical components.
Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and distant from the original social motivations.
Another take on this: some people want to work on technically interesting and important problems, but some of them want to work on "legibly important" or "legibly high-status" problems.
I do believe there are some opportunities in steering some fraction of this attention toward some of the core technical problems (not toward all of them, at this moment).
This can often depend on framing; while my guess is that e.g. you probably shouldn't work on this, my guess is that some people who understand the technical alignment problems should.
This can also depend on social dynamics; your "naive guess" seems a good starting point.
Also: it seems there is a lot of low-hanging fruit in low-difficulty problems which someone should work on - e.g. at this moment, many humans should be spending a lot of time trying to get an empirical understanding of what types of generalization LLMs are capable of.

With prioritization, I think it would be good if someone made some sort of a curated list of "who is working on which problems, and why" - my concern with part of the "EAs figuring out what to do" is that many people are doing some sort of expert-aggregation on the wrong level. (Like, if someone basically averages your and Eliezer Yudkowsky's conclusions, giving 50% weight to each, I don't think that is a useful and coherent model.)

↑ comment by Wei Dai (Wei_Dai) · 2022-06-20T04:04:50.115Z · LW(p) · GW(p)

For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).

Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that typical governmental response to AI would be more akin to Taiwan than the US.

Replies from: Jan_Kulveit, adrian-arellano-davin

↑ comment by Jan_Kulveit · 2022-06-20T14:21:17.480Z · LW(p) · GW(p)

I doubt that's the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.
My main claim isn't about what a median response would be, but something like "difference between median early covid governmental response and actually good early covid response was something between 1 and 2 sigma; this suggests bad response isn't over-determined, and sensible responses are within human reach". Even if Taiwan was an outlier, it's not like it's inhabited by aliens or run by a friendly superintelligence.
Empirically, the median governmental response to a novel crisis is copycat policymaking from some other governments.

Replies from: Wei_Dai, habryka4

↑ comment by Wei Dai (Wei_Dai) · 2022-06-20T17:19:37.976Z · LW(p) · GW(p)

I doubt that’s the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

My main claim isn’t about what a median response would be, but something like “difference between median early covid governmental response and actually good early covid response was something between 1 and 2 sigma; this suggests bad response isn’t over-determined, and sensible responses are within human reach”.

This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the former is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put down here anyway so as to not waste it.

it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures
previous human experiences with pandemics, including very similar ones like SARS
there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
COVID isn't agenty and can't fight back intelligently
potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

Replies from: PeterMcCluskey, Jan_Kulveit

↑ comment by PeterMcCluskey · 2022-06-22T16:41:43.771Z · LW(p) · GW(p)

I asked myself for an example of a country whose initial pandemic response was unusually poor, settled on Brazil, and found that Brazil's IQ was lower than I expected at 87. So that's one data point that supports your hypothesis.

I suspect that cultural homogeneity is at least as important.

↑ comment by Jan_Kulveit · 2022-06-21T08:27:35.835Z · LW(p) · GW(p)

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

What seemed to make a difference:

1. someone with a good model of what to do getting into an advisory position when the politicians freak out
2. previous experience with SARS
3. ratio of "trust in institutions" vs. "trust in your neighbors' wisdom"
4. raw technological capacity
5. ability of the government to govern (i.e. execute many things in a short time)

In my view, 1. and 4. could go better than in covid, 2. is irrelevant, 3. and 5. seem broad parameters which can develop in different directions. Imagine you somehow become the main advisor to the US president when the situation becomes really weird, and she follows your advice closely - my rough impression is that in most situations you would be able to move the response to be moderately sane.

it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures

Empirically, this often wasn't true. Humans had mildly confused ideas about the micro-level, but often highly confused ideas about the exponential macro-dynamics. (We created a whole educational game on that, and have some feedback that for some policymakers it was the thing that helped them understand... after a year in the pandemic)

previous human experiences with pandemics, including very similar ones like SARS
there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
COVID isn't agenty and can't fight back intelligently
potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

One factor which may make governments more responsive to AI risk is that covid wasn't exactly threatening to states. Covid was pretty bad for individual people, and some businesses, but in some cases, the relative power of states even grew during covid. In contrast, in some scenarios it may be clear that AI is an existential risk for states as well.

↑ comment by habryka (habryka4) · 2022-06-20T17:43:35.608Z · LW(p) · GW(p)

Australia seems to have suffered a lot more from the pandemic than the U.S., paying much more in the cost of lockdown than even a relatively conservative worst-case estimate would have been for the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.

Replies from: Jan_Kulveit, adrian-arellano-davin

↑ comment by Jan_Kulveit · 2022-06-21T08:50:39.785Z · LW(p) · GW(p)

I'm not sure if you actually read carefully what you are commenting on. I emphasized early response, or initial governmental-level response in both comments in this thread.

Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.

Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf GDP costs, Italy had the closest thing to an uncontrolled pandemic). (Btw the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people to "live as normal" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way.)

Replies from: habryka4

↑ comment by habryka (habryka4) · 2022-06-22T02:05:13.825Z · LW(p) · GW(p)

If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been correct by accident, and not on purpose.

I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and a suggestive result.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-21T01:53:45.339Z · LW(p) · GW(p)

I don't think this is true at all. See: https://www.lesswrong.com/posts/r9gfbq26qvrjjA7JA/thank-you-queensland [LW · GW]

Replies from: JBlack, habryka4

↑ comment by JBlack · 2022-06-21T03:17:52.083Z · LW(p) · GW(p)

I absolutely agree. Australia has done substantially better than most other nations regarding COVID from all of economic, health, and lifestyle points of view. The two largest cities did somewhat worse in lifestyle for some periods, but most other places had far fewer and less onerous restrictions than most other countries for nearly 2 years. I personally was very happy to have lived with essentially zero risk of COVID and essentially zero restrictions both personal or economic for more than a year and a half.

A conservative worst-case estimate for costs of an uncontrolled COVID outbreak in Australia was on the order of 300,000 deaths and about $600 billion direct economic loss over 2 years, along with even larger economic impacts from higher-order effects.

We did very much better than that, especially in health outcomes. We had 2,000 deaths up until giving up on elimination in December last year, which was about 0.08 deaths per thousand. Even after giving up on local elimination, we still only have 0.37 per thousand compared with the United States at 3.0 per thousand.

Economic losses are also substantially less than US in terms of comparison with the pre-pandemic economy, but the attribution of causes there is much more contentious as with everything to do with economics.

↑ comment by habryka (habryka4) · 2022-06-21T02:42:53.896Z · LW(p) · GW(p)

I know of a good number of friends who were unable to continue jobs that required substantial in-person coordination abroad, since Australia prevented nationals from leaving their own country. I also talked to 2-3 Australians who thought that Australia had messed up pretty badly here.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-21T02:53:56.194Z · LW(p) · GW(p)

Sure. I also talked to tens of Australians who thought that they did a great job. In Spain, the country where I am from, I personally know many people who were also unable to continue their jobs, and not because the country forbade its nationals to leave. There is going to be a lot of variance in individual opinions. The number of dead people is, on the other hand, a more objective measure of how successful countries were at dealing with the pandemic.

Replies from: alexander-gietelink-oldenziel

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-21T21:36:04.904Z · LW(p) · GW(p)

Taking the # of dead people as an objective is biasing the question.

Fundamentally, there is a question of whether the benefits of lockdowns were worth the costs. Measuring that only by # of dead people is ignoring the fundamental problems with the lockdowns.

Let me explicate.

I think I am in the minority position on this board (and Habryka might be too) in that I feel it is obvious that the relatively small number of elderly people saved counterfactually by lockdowns is not comparable to the enormous mental and economic losses, the dangerous precedent for civil liberties set by lockdowns etc. It is clear to me that a "correct" utilitarian calculation will conclude that the QALYs lost by dead elderly people in the first world are absolutely swamped by the QALYs lost through the mental health of young people and the millions of global poor thrown back into poverty. (Moreover, this ignores the personal liberty aspect: people are free to make their own safety/lifestyle tradeoffs, and it should require a superabundance of QALYs saved to impinge on this freedom.)

Bolstered by the apparent success of Taiwan I supported a short lockdown followed by track & trace - but by mid summer 2020 it was clear that this was never going to work. Actually, Taiwan had to revert to lockdowns later during the pandemic anyway. It was clear to me that further lockdowns were no longer worth it.

Even if you think the lockdowns were justified, one should note that Australia has gone much farther; it has continued severe COVID restrictions even after vaccination and in the absence of a long-term plan. It has made it almost completely impossible to go in or out of the country (even if one is an Australian citizen willing to undergo extensive testing). In my humble opinion this is completely crazy territory.

Speaking about completely crazy territory... If you measure a country's COVID response by # of deaths by COVID then the "bestest most responsible government" would be the Chinese government. I hope you will agree with me that this would be a mistake.

Replies from: habryka4, adrian-arellano-davin

↑ comment by habryka (habryka4) · 2022-06-22T02:09:28.951Z · LW(p) · GW(p)

My assessment is also that the health costs of the pandemic were small in comparison to the secondary effects of lockdown (which were mostly negative). Any analysis that primarily measures deaths seems to me to ignore the vast majority of the impact (which is primarily economic and social).

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-23T01:09:17.530Z · LW(p) · GW(p)

I know this is a sensitive topic and I probably won't change your mind but hear me out for a second. Re. China, I do agree with you that the response of the CCP (now) is not really a model of what an exemplary government should do. I also agree that up to a certain point you shouldn't measure exclusively the number of dead people to judge how well a country fared. But it certainly is an important variable that we shouldn't discount either. The number of dead people is closely correlated with other important factors such as the number of people suffering long covid or even human suffering in general. I do agree with you that lockdowns in many places have caused potentially more harm than they should have. The problem is that not all lockdowns are the same, and people keep treating them as equivalent. Another problem is that I see many people rationalizing that things couldn't have been different, which is super convenient, especially for those in power.

So let me talk a bit about Australia (I was living there during the whole pandemic period).

The USA sits right now at 3015 dead people per 1M. Australia's casualties are 364.

I can guarantee you that everyone I spoke with who was living at the time in other places (I have many friends in different European countries, Spain, Italy, France, England, etc) would have switched places with me without thinking about it for a second.

I follow the news in the USA very closely and I know how extremely biased the coverage was (including some famous podcasters, I am looking at you, Joe Rogan). They focused a lot on the Australian border restrictions / lockdown in Melbourne and very little on the fact that for almost two years, most Australians enjoyed a mostly normal life when people abroad were repeatedly facing absurd government interventions/restrictions. It is not totally true that the borders were completely closed either: I have a friend who was allowed to leave the country to visit her dying father in Italy. She came back to Australia and she had to do a quarantine, true, but she was allowed to come back.

The lockdowns in Australia (at least in Queensland where I lived) served a purpose: buy time for the contact tracers so that COVID cases could really be taken down to zero. In Queensland we had a long one at the beginning (2 months maybe?) but then we had a few more (don't remember how many, maybe 3?) that lasted only a few days. They understood very well that dealing with COVID should be a binary thing: either you have no cases, or you are facing repeated waves of covid. This had to continue until everyone had an opportunity to have two shots of the vaccine. Once everyone had had a chance, the borders were opened again and most restrictions were lifted. So in this regard, I do think that the harsh Chinese government measures AT THE BEGINNING (i.e. closing the national borders, PCRs, selective lockdowns, contact tracing, etc), made much more sense than everything that was happening in most of the Western world. Talking to a few Chinese friends, they considered utterly outrageous the fact that we were justifying the death of people by saying that they were old anyway or that we shouldn't stop the economy.

I still remember that at the very beginning of the pandemic, the POTUS was giving a press conference and he showed a hesitancy rarely seen in him: he swallowed and took a few seconds to say, stuttering a little bit, that if he hadn't taken measures, there could be a hundred thousand Americans dying. Today the tally sits at more than 1M. Things could have been different.

Replies from: alexander-gietelink-oldenziel

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-23T17:59:39.200Z · LW(p) · GW(p)

Fair enough. Thank you for explaining where you are coming from. I do agree that if an island is able to close the borders and thereby avoid severe domestic lockdowns this can be justified.

(364 Vs 3015 is two orders of magnitude?)
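
(A quick check of that ratio, as a minimal sketch: it uses only the two per-million figures quoted in the parent comment, and the variable names are just illustrative.)

```python
import math

# Per-million COVID death figures as quoted in the parent comment.
us_deaths_per_million = 3015
australia_deaths_per_million = 364

# How many times higher the US figure is, and the same gap on a log10 scale.
ratio = us_deaths_per_million / australia_deaths_per_million
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

So on these figures the gap is roughly 8x, a bit under one order of magnitude rather than two.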

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-23T21:39:50.687Z · LW(p) · GW(p)

Oooops! Corrected, thanks

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-20T09:53:58.235Z · LW(p) · GW(p)

Well, Australia did orders of magnitude better than USA and in IQ they seem to be pretty close. I'm not sure that IQ is the right variable to look at

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-19T20:47:54.120Z · LW(p) · GW(p)

This is a thread for anyone who wants to give a high-level take or reaction that isn't contributing much to the discussion (and thus isn't worth a top-level comment).

Replies from: neel-nanda-1, Gedusa, daniel-kokotajlo, mtrazzi, Flaglandbase, Jotto999, None

↑ comment by Neel Nanda (neel-nanda-1) · 2022-06-19T22:59:30.400Z · LW(p) · GW(p)

I broadly agree with this much more than Eliezer's and think this did a good job of articulating a bunch of my fuzzy "this seems off". Most notably, Eliezer underrating the importance and tractability of interpretability, and overrating the discontinuity of AI progress.

↑ comment by Gedusa · 2022-06-20T13:19:22.486Z · LW(p) · GW(p)

I found it really helpful to have a list of places where Eliezer and Paul agree. It's interesting to see that there is a lot of similarity on big picture stuff like AI being extremely dangerous.

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-19T20:48:23.191Z · LW(p) · GW(p)

I think my take is roughly "What Paul would think if he had significantly shorter timelines."

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-19T20:53:41.367Z · LW(p) · GW(p)

Do you think that some of my disagreements should change if I had shorter timelines?

(As mentioned last time we talked, but readers might not have seen: I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040.)

Replies from: daniel-kokotajlo, Evan R. Murphy

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-19T21:56:49.411Z · LW(p) · GW(p)

I think most of your disagreements on this list would not change.
However, I think if you conditioned on 50% chance of singularity by 2030 instead of 15%, you'd update towards faster takeoff, less government/societal competence (and thus things more likely to fail at an earlier, less dignified point), more unipolar/local takeoff, lower effectiveness of coordination/policy/politics-style strategies, less interpretability and other useful alignment progress, less chance of really useful warning shots... and of course, significantly higher p(doom).

To put it another way, when I imagine what (I think) your median future looks like, it's got humans still in control in 2035, sitting on top of giant bureaucracies of really cheap, really smart proto-AGIs that fortunately aren't good enough at certain key skills (like learning-to-learn, or concept formation, or long-horizon goal-directedness) to be an existential threat yet, but are definitely really impressive in a bunch of ways and are reshaping the world economy and political landscape and causing various minor disasters here and there that serve as warning shots. So the whole human world is super interested in AI stuff and policymakers are all caught up on the arguments for AI risk and generally risks are taken seriously instead of dismissed as sci-fi and there are probably international treaties and stuff and also meanwhile the field of technical alignment has had 13 more years to blossom and probably lots of progress has been made on interpretability and ELK and whatnot and there are 10x more genius researchers in the field with 5+ years of experience already... and even in this world, singularity is still 5+ years away, and probably there are lots of expert forecasters looking at awesome datasets of trends on well-designed benchmarks predicting with some confidence when it will happen and what it'll look like.

This world seems pretty good to me, it's one where there is definitely still lots of danger but I feel like >50% chance things will be OK. Alas it's not the world I expect, because I think probably things will happen sooner and go more quickly than that, with less time for the world to adapt and prepare.

Replies from: elityre, sharmake-farah

↑ comment by Eli Tyre (elityre) · 2022-06-20T23:17:01.390Z · LW(p) · GW(p)

I personally found this to be a very helpful comment for visualizing how things could go.

↑ comment by Noosphere89 (sharmake-farah) · 2024-10-03T17:08:22.635Z · LW(p) · GW(p)

Re my own updates, here is what I'd say given my own probability of a 50-55% chance of singularity by 2030, using the knowledge about AI, alignment and governance we have now:

Faster takeoff is correct, but nowhere near as fast as Eliezer's usual stories.

Somewhat less competence, but only somewhat, because of the MNM effect and the ridiculously strong control system that was essentially a collective intelligence that operated for at least several months, and more generally I believe that governments will respond harder as the problem gets more severe.

IMO, we are probably going to get fairly concentrated takeoffs, but not totally unipolar takeoffs.

Politics and coordination will be reasonably effective by default, because I expect the government and the public to wake up hard once AIs start automating a lot of stuff.

IMO, most of the value of alignment and interpretability research will be gotten very near the singularity, or even right on the event horizon of the transition from human to AI, for almost the same reasons that a large share of capability research progress will be; but it's also quite surprising how much of the low-hanging fruit of alignment we have already gotten, such that we could well dream of bigger targets.

Useful warning shots will definitely be fewer, but I also expect governments to wake up a lot more than they have right now once they realize that AI is automating everything.

↑ comment by Evan R. Murphy · 2022-06-21T00:58:23.005Z · LW(p) · GW(p)

I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040

These figures surprise me. I thought that you believed in shorter timelines because of Agreements #8 in your post, where you said "[Transformative AI] is more likely to be years than decades, and there’s a real chance that it’s months".

~40% by 2040 sounds like an expectation of transformative AI probably taking decades. (Unless I'm drawing a false equivalence between transformative AI and what you mean by "singularity".)

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-21T03:04:40.998Z · LW(p) · GW(p)

In agreement #8 I'm talking about the time from "large impact on the world" (say increasing GDP by 10%, automating a significant fraction of knowledge work, "feeling like TAI is near," something like that) to "transformative impact on the world" (say singularity, or 1-2 year doubling times, something like that). I think right now the impact of AI on the world is very small compared to this standard.
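For a sense of scale, here is a minimal sketch that converts a doubling time into an annual growth rate, assuming constant exponential growth (an idealization, not something claimed above). The function name is hypothetical, and the 20-year case is only an arbitrary slow-growth comparison point.

```python
def annual_growth_from_doubling_time(doubling_years: float) -> float:
    """Return the constant annual growth rate g with (1 + g) ** doubling_years == 2."""
    if doubling_years <= 0:
        raise ValueError("doubling_years must be positive")
    return 2.0 ** (1.0 / doubling_years) - 1.0


if __name__ == "__main__":
    # 1 and 2 years are the doubling times mentioned above;
    # 20 years is an assumed comparison point, not a figure from the thread.
    for years in (1.0, 2.0, 20.0):
        rate = annual_growth_from_doubling_time(years)
        print(f"doubling every {years:g} years ~ {rate:.1%} growth per year")
```

Under this idealization, "1-2 year doubling times" corresponds to growth of very roughly 40% to 100% per year.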

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-21T06:08:11.101Z · LW(p) · GW(p)

Thanks, that makes it more clear to me the two different periods of time you're talking about.

↑ comment by Michaël Trazzi (mtrazzi) · 2022-06-22T01:26:15.600Z · LW(p) · GW(p)

Datapoint: I skimmed through Eliezer's post, but read this one from start to finish in one sitting. This post was for me the equivalent of reading the review of a book I haven't read, where you get all the useful points and nuance. I can't stress enough how useful that was for me. Probably the most insightful post I have read since "Are we in AI overhang".

↑ comment by Flaglandbase · 2022-06-20T06:17:53.980Z · LW(p) · GW(p)

I never even thought about super-AI dangers before coming to this site, only sub-AI dangers. However IF these claims are true, then there should be delays imposed on AI research. There would be no alternative.

It should be done in a way that would not slow down the type of progress we really want: inventing a way to defeat the problem of death using technology. The money that would be invested in inventing super powerful hyper-computer minds should instead be invested in inventing a single standard design of powerful "brain chip". Each brain chip would contain all the information extracted from a single human brain, and could replace that brain's existence in a durable VR environment.

It goes without saying this alternative research program would be much, much slower and more expensive than just inventing a single superhuman hyper-AI. It might take a century to invent a way to extract and back up the contents of a single human brain. And that is just too long. In fact it's intolerable because everyone alive today would still have to die, and be lost forever.

So it would still be necessary to invent a single, narrowly focused hyper-AI, that would have only ONE task. It would be to invent a way to "transfer" human minds from perishable brains into a more durable medium. After completing that task, the hyper-AI would be shut down.

↑ comment by Jotto999 · 2022-06-20T23:49:29.958Z · LW(p) · GW(p)

Very broadly,

in 2030 it will still be fairly weird and undersubstantiated to say that a dev's project might accidentally turn everyone's atoms into ML hardware, or might accidentally cause a Dyson sphere to be built.

Replies from: Verden

↑ comment by Verden · 2022-06-21T14:56:31.590Z · LW(p) · GW(p)

I'm not totally sure what you're referring to, but if you're talking about Paul's guess of "~15% on singularity by 2030 and ~40% on singularity by 2040", then I want to point out that looking at these two questions, his prediction seems in line with the Metaculus community prediction.

Replies from: paulfchristiano, Jotto999

↑ comment by paulfchristiano · 2022-06-21T15:03:58.126Z · LW(p) · GW(p)

I don't think it will ever seem plausible for an accident to turn everyone's atoms into ML hardware though, because we will probably remain closer to an equilibrium with no free energy for powerful AI to harvest.

↑ comment by Jotto999 · 2022-06-25T18:24:22.001Z · LW(p) · GW(p)

I disagree with the community on that. Knocking out silver turing, Montezuma (in the way described), 90% equivalent on Winogrande, and 75th percentile on maths SAT will either take longer to be actually demonstrated in a unified ML system, OR it will happen way sooner than 39 months before "an AI which can perform any task humans can perform in 2021, as well or superior to the best humans in their domain.", which is incredibly broad. If the questions mean what they are written to mean, as I read them, it's a hell of a lot more than 39 months (median community estimate).
The thing I said is about some important scenarios described by people giving significant probability to a hostile hard takeoff scenario. I included the comment here in this subthread because I don't think it contributed much to the discussion.

↑ comment by [deleted] · 2022-06-21T12:32:46.669Z · LW(p) · GW(p)

comment by Steven Byrnes (steve2152) · 2022-06-21T04:46:48.429Z · LW(p) · GW(p)

RE discussion of gradual-ness, continuity, early practice, etc.:

FWIW, here’s how I currently envision AGI developing, which seems to be in a similar ballpark as Eliezer’s picture, or at least closer than most people I think? (Mostly presented without argument.)

There’s a possible R&D path that leads to a model-based RL AGI. It would be very agent-y, and have some resemblance to human brain algorithms (I claim [? · GW]), and be able to “figure things out” and “mull things over” and have ideas and execute on them, and understand the world and itself, etc., akin to how humans do all those things.

Large language models (LLMs) trained mainly by self-supervised learning (SSL), as built today, are not that path (although they might include some ingredients which would overlap with that path). In my view, those SSL systems are almost definitely safer, and almost definitely much less capable, than the agent-y model-based RL path. For example, I don’t think that the current SSL-LLM path is pointing towards “The last invention that man need ever make”. I won’t defend that claim here.

But meanwhile, like it or not, lots of other people are as we speak racing down the road towards the more brain-like, more agent-y, model-based RL AGI. We should presume that they’ll eventually succeed. We could try to stop them, but doing so seems impossible right now. In the future, the SSL-LLM path will be producing more impressive AI models than today, but I don’t expect that fact to change the world very much in relevant ways, such that we’ll still be in roughly the same situation of not having any way (AFAICT) to stop researchers from inventing agent-y model-based RL AGI. So our only choice is to figure out how to navigate the world in which people eventually build agent-y model-based RL AGI.

(Note—Since “SSL-LLM safety” doesn’t coincide with “agent-y model-based RL safety”, a natural consequence is Eliezer [uncharitably] describing some people’s work as ‘not engaging with the core difficulties’ or whatever.)

Anyway, model-based RL algorithms can already do neat things like play computer games, but they can’t yet conquer humanity, and I think part of that is related to model-based RL algorithmic innovations that have yet to happen. So during some period of time, those future algorithmic innovations will happen, and meanwhile people will be scaling up and optimizing and hardware-accelerating the algorithms and architectures.

(Based on human brain compute requirements, I think training from scratch to at least human-level intelligence will probably eventually be possible with relatively modest amounts of chips and money, see here [LW · GW].)

A key question is: how long is this period between “This specific model-based RL technological path is ~~producing the AIs that everyone is using and everyone is talking about~~ receiving a very large share of overall attention and investment by the ML research community” [reworded for clarification, see later comment [LW(p) · GW(p)]] and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?

Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible. Whereas even “10 years” seems implausibly high to me, I think.

So I find all the arguments in this post related to slow-takeoff, gradual-ness, continuity, etc. to be not so reassuring.

Replies from: quintin-pope, Evan R. Murphy

↑ comment by Quintin Pope (quintin-pope) · 2022-06-21T07:04:10.744Z · LW(p) · GW(p)

My expectation is that people will turn SSL models into agentic reasoners. I think this will happen through refinements to “chain of thought”-style reasoning approaches. See here. Such approaches absolutely do let LLMs “mull things over” to a limited degree, even with current very crude methods to do chain of thought with current LLMs. I also think future RL advancements will be more easily used to get better chain of thought reasoners, rather than accelerating a new approach to the SOTA.

↑ comment by Evan R. Murphy · 2022-06-22T02:34:16.990Z · LW(p) · GW(p)

A key question is: how long is this period between “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about” and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?
Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible. Whereas even “10 years” seems implausibly high to me, I think.

I don't think Paul would disagree with you about "a couple years" being plausible, based on Agreements #8 from his post (bold mine):

8. The broader intellectual world seems to wildly overestimate how long it will take AI systems to go from “large impact on the world” to “unrecognizably transformed world.” This is more likely to be years than decades, and there’s a real chance that it’s months. This makes alignment harder and doesn’t seem like something we are collectively prepared for.

At first I read Paul's post as having very gradualist assumptions all around. But he clarified to me in this comment [LW(p) · GW(p)] and the back-and-forth we had in replies that he's a bit long on the initial time before AI has large impact on the world (similar to your “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about”), which he pegs at ~40% by 2040. After that point, he predicts a pretty speedy progression to "unrecognizably transformed world", which I think includes the possibility of catastrophe.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-22T14:30:52.408Z · LW(p) · GW(p)

I don’t think Paul is saying the same thing as me. My wording was bad, sorry.

When I said “the AIs that everyone is using and everyone is talking about”, I should have said “the AIs that are receiving a very large share of overall attention and investment by the ML research community”. (I just went back and edited the original.)

As of today (2022), large language models are “the AIs that are receiving a very large share of overall attention and investment by the ML research community”. But they are not having a “large impact on the world” by Paul’s definition. For example, the current contribution of large language models to global GDP is ≈0%.

The question of whether an AI approach is “receiving a very large share of overall attention and investment by the ML research community” is very important because:

if yes, we expect low-hanging fruit to be rapidly picked, after which we expect incremental smaller advances perpetually, and we expect state-of-the-art models to be using roughly the maximum amount of compute that is at all possible to use.
if no (i.e. if an AI approach is comparatively a bit of a backwater, like say model-based RL or probabilistic programming today), we should be less surprised by (for example) a flurry of very impactful advances within a short period of time, while most people aren’t paying attention, and then bam, we have a recipe for a superhuman AGI that can be trained on a university GPU cluster.

Replies from: Evan R. Murphy, bill-benzon

↑ comment by Evan R. Murphy · 2022-06-22T19:49:46.407Z · LW(p) · GW(p)

Ok I see what you mean, thanks for clarifying.

↑ comment by Bill Benzon (bill-benzon) · 2022-06-22T15:53:22.057Z · LW(p) · GW(p)

I suspect that LLMs are going to be put to more and more practical use in the near future. I just did a search on "AI and legal briefs" and came up with ads and articles about "prediction based" systems to help lawyers prepare legal briefs. I assume "prediction based" means LLM.

comment by Ruby · 2022-06-21T18:21:16.929Z · LW(p) · GW(p)

Curated. Eliezer's List of Lethalities [LW · GW] post has received an immense amount of attention, rightly so given the content, and I am extremely glad to see this response go live since Eliezer's views do not reflect a consensus, and it would be sad to have only one set of views be getting all the attention when I do think many of the questions are non-obvious.

I am very pleased to see public back-and-forth on questions of not just "how and whether we are doomed", but the specific gears behind them (where things will work vs cannot work). These questions bear on the enormous resources poured into AI safety work right now. Ensuring those resources get allocated in a way that actually improves the odds of our success is key.

I hope that others continue to share and debate their models of the world, Alignment, strategy, etc. in a way that is both on record and easily findable by others. Hopefully, we can look back in 10, 20, 50, etc years and reflect on how well we reasoned in these cloudy times.

comment by leogao · 2022-06-20T03:29:04.477Z · LW(p) · GW(p)

Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments.

A few comments on the disagreements:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”

An in-between position would be to argue that even if we're maximally competent at the institutional problem, and can extract all the information we possibly can through experimentation before the first critical try, that just prevents the really embarrassing failures. Irrecoverable failures that we just could not have been prepared for could still pop up every once in a while after entering the critical regime, unless we have a full True Name of alignment. I think the crux here depends on your view on the Murphy-constant of the world (i.e. how likely we are to get unknown unknown failures), and how long you think we need to spend in the critical regime before our automated alignment research assistants solve alignment.

By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D.

For what it's worth I think the level of tech needed to overpower humans in more boring ways is a substantial part of my "doom cinematic universe" (and I usually assume nanobots is meant metaphorically). In particular, I think it's plausible the "slightly-less-impressive-looking" systems that come before the first x-risk AI are not going to be obviously step-before-x-risk any more so than current scary capabilities advances, because of uncertainty over its exact angle (related to Murphy crux above) + discontinuous jumps in specific capabilities as we see currently in ML.

if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.

SGD is definitely far from perfect optimization, and it seems plausible that if concealment against SGD is a thing at all, then it would be due to some kind of instrumental thing that a very large fraction of powerful AI systems converge on.

Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research

I think there's a lot of different cruxes hiding inside the question of how AI acceleration of alignment research interacts with P(doom), including how hard alignment is, and whether AGI labs will pivot to focus on alignment (some earlier thoughts here [LW(p) · GW(p)]), even assuming we can align the weak systems used for this. Overall I feel very uncertain about this.

Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects

Explicitly registering agreement with this prediction.

Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface.

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment [LW(p) · GW(p)]). Also seems related to his arguments about ML systems having squiggles.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-20T04:14:59.939Z · LW(p) · GW(p)

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment [LW(p) · GW(p)]). Also seems related to his arguments about ML systems having squiggles.

Yup. You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.

Replies from: paulfchristiano, Wei_Dai

↑ comment by paulfchristiano · 2022-06-20T04:23:17.072Z · LW(p) · GW(p)

The question is when you get a misaligned mesaoptimizer relative to when you get superhuman behavior.

I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.

I don't think you've made much argument about when the transition occurs. Existing language models strongly appear to be "imitation upstream of optimization." For example, it is much easier to get optimization out of them by having them imitate human optimization, than by setting up a situation where solving a hard problem is necessary to predict human behavior.

I don't know when you expect this situation to change; if you want to make some predictions then you could use empirical data to help support your view. By default I would interpret each stronger system with "imitation upstream of optimization" to be weak evidence that the transition will be later than you would have thought. I'm not treating those as failed predictions by you or anything, but it's the kind of evidence that adjusts my view on this question.

(I also think the chimp->human distinction being so closely coupled to language is further weak evidence for this view. But honestly the bigger thing I'm saying here is that 50% seems like a more reasonable place to be a priori, so I feel like the ball's in your court to give an argument. I know you hate that move, sorry.)

Replies from: vanessa-kosoy, EliasSchmied

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-20T13:47:42.267Z · LW(p) · GW(p)

Epistemic status: some of these ideas only crystallized today, normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability because the ANNs currently in use are not big enough to contain a brain, but we're gradually getting closer there^[1].

More precisely, as ANNs grow in size we're approaching a regime I dubbed "pseudorealizability": on the one hand, the brain is in the prior^[2]; on the other hand, its description complexity is pretty high and therefore its prior probability is pretty low. Moreover, a more sophisticated agent (e.g. infra-Bayesian RL / Turing RL / infra-Bayesian physicalist) would be able to use the rest of the world as useful evidence to predict some features of the human brain (i.e. even though human brains are complex, they are not random, there are reasons they came to be the way they are if you understand the broader context e.g. evolutionary history). But, the latter kind of inference does not take the form of having a (non-mesa-optimizing) complete cartesian parsimonious model of the world in which brains are a particular piece, because (i) such a model would be too computationally expensive (non-realizability) and (ii) bridge rules add a lot of complexity.

Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI. Why agents rather than some kind of non-agentic epistemic engines? Because, IB and IBP suggest that, this level of epistemic sophistication requires some entanglement between epistemic rationality and instrumental rationality: in these frameworks, it is not possible to decouple the two entirely.

From the perspective of the outer loop, we can describe the situation as: "I woke up, expecting to see a world that is (i) simple and (ii) computationally cheap. At first glance, the world seemed like, not that. But, everything became clear when I realized that the world is generated by a relatively-simple-and-cheap 'deity' who made the world like this on purpose because it's beneficial for it from its own strange epistemic vantage point."

Coming back to the question of when to expect the transition from imitation-upstream-of-optimization to imitation-downstream-of-optimization. By the above line of argument, we should expect this transition to happen before the AI succeeds at any task which requires reasoning at least as sophisticated as the kind of reasoning that allows inferring properties of human brains from understanding the broader context of the world. Unfortunately, I cannot atm cash this out into a concrete milestone, but (i) it seems very believable that current language models are not there and (ii) maybe if we think about it more, we can come up with such a milestone.
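As a toy illustration of the point above that a hypothesis can be "in the prior" yet start with very low prior probability, here is a minimal sketch, assuming just two fixed-bias coin hypotheses with priors of the form 2^(-bits). All names and numbers are hypothetical; this only shows the evidence a complex-but-correct hypothesis needs in order to overtake a simple rival, and says nothing about mesa-optimizers.

```python
import math


def log2_posterior_odds(bits_true: float, bits_rival: float,
                        p_true: float, p_rival: float,
                        heads: int, tails: int) -> float:
    """Log2 posterior odds (true : rival) after observing coin flips.

    Each hypothesis has prior weight 2 ** -bits and predicts heads
    with a fixed probability (p_true or p_rival).
    """
    log_prior_odds = bits_rival - bits_true
    log_likelihood_ratio = (
        heads * math.log2(p_true / p_rival)
        + tails * math.log2((1.0 - p_true) / (1.0 - p_rival))
    )
    return log_prior_odds + log_likelihood_ratio


if __name__ == "__main__":
    # Assumed toy setup: the correct hypothesis costs 20 bits and the simple
    # rival costs 2 bits; the observed data are 70% heads.
    for n in (100, 200):
        heads = (7 * n) // 10
        odds = log2_posterior_odds(20, 2, 0.7, 0.5, heads, n - heads)
        print(f"n={n}: log2 odds for the complex hypothesis = {odds:+.2f}")
```

In this toy setup the complex hypothesis starts 18 bits behind and only overtakes the simple one after enough flips have accumulated, which is the sense in which a high description complexity delays convergence even when realizability holds.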

Cotra's report is a relevant point of reference, even though "having as many parameters as the brain according to some way to count brain-parameters" is ofc not the same as "capable of representing something which approximates the brain up to an error term that behaves like random noise". ↩︎
Assuming the training protocol is sufficiently good at decoupling the brain from the surrounding (more complex) world and pointing the AI at only trying to imitate the brain. ↩︎

Replies from: ricraz, jacob-pfau

↑ comment by Richard_Ngo (ricraz) · 2022-06-20T22:44:25.973Z · LW(p) · GW(p)

Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI.

In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.

Replies from: vanessa-kosoy, Charlie Steiner

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-21T16:52:48.285Z · LW(p) · GW(p)

Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? These kinds of solutions are worth looking into, but I'm not too optimistic about them atm. First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can shape the prior while still having feasible learning. And I suspect that such an extreme speed penalty would not cut it. Third, depending on the setup, an extreme speed penalty might harm generalization^[1]. But we definitely need to understand it more rigorously.

The most appealing version is Christiano's "minimal circuits", but that only works for inputs of fixed size. It's not so clear what's the variable-input-size ("transformer") version of that. ↩︎

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-06-21T18:20:19.689Z · LW(p) · GW(p)

No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming:

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers)

But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-22T05:58:17.522Z · LW(p) · GW(p)

The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here [LW(p) · GW(p)] (w/o any claim to originality).
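To make the hard/soft distinction concrete, here is a minimal sketch, assuming a toy table of hypotheses that each have a description length in bits and a runtime in steps. The class, the function, the numbers, and the particular soft penalty (dividing the weight by runtime) are all assumptions made for illustration; this is not the prior from the linked comment.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Hypothesis:
    name: str
    description_bits: int  # stand-in for description complexity
    runtime_steps: int     # stand-in for compute needed per prediction


def bounded_simplicity_prior(hypotheses: Iterable[Hypothesis],
                             max_steps: Optional[int] = None,
                             soft_penalty: bool = False) -> Dict[str, float]:
    """Return normalized prior weights over the given hypotheses.

    Base weight: 2 ** -description_bits (the simplicity bias).
    Hard bound: hypotheses with runtime_steps > max_steps are excluded.
    Soft bound (assumed form): weight is divided by runtime_steps.
    """
    weights: Dict[str, float] = {}
    for h in hypotheses:
        if max_steps is not None and h.runtime_steps > max_steps:
            continue  # hard bound: excluded outright
        weight = 2.0 ** -h.description_bits
        if soft_penalty:
            weight /= h.runtime_steps  # soft bound: down-weighted
        weights[h.name] = weight
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no hypothesis satisfies the compute bound")
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    toy = [
        Hypothesis("short_but_slow", description_bits=10, runtime_steps=10**9),
        Hypothesis("longer_but_fast", description_bits=20, runtime_steps=10**3),
    ]
    print("no bound:  ", bounded_simplicity_prior(toy))
    print("hard bound:", bounded_simplicity_prior(toy, max_steps=10**6))
    print("soft bound:", bounded_simplicity_prior(toy, soft_penalty=True))
```

With no compute bound the short-but-slow hypothesis dominates; under either bound the longer-but-fast one does, which is why very simple hypotheses that take a long time to compute need not carry much weight under such a prior.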

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-06-22T07:33:08.336Z · LW(p) · GW(p)

Ah, I see. That makes sense now!

↑ comment by Charlie Steiner · 2022-06-21T08:54:29.150Z · LW(p) · GW(p)

This seems like a good thing to keep in mind, but also sounds too pessimistic about the ability of gradient descent to find inference algorithms that update more efficiently than gradient descent.

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-06-21T18:28:36.812Z · LW(p) · GW(p)

I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more ~~efficient~~ effective than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimisation, but few compute-cheap ways to convert additional complexity to optimization.

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2022-06-21T19:40:48.995Z · LW(p) · GW(p)

Faster than gradient descent is not a selective pressure, at least if we're considering typical ML training procedures. What is a selective pressure is regularization, which functions much more like a complexity prior than a speed prior.

So (again sticking to modern day ML as an example, if you have something else in mind that would be interesting) of course there will be a cutoff in terms of speed, excluding all algorithms that don't fit into the neural net. But among algorithms that fit into the NN, the penalty on their speed will be entirely explainable as a consequence of regularization that e.g. favors circuits that depend on fewer parameters, and would therefore be faster after some optimization steps.

If examples of successful parameters were sparse and tended to just barely fit into the NN, then this speed cutoff will be very important. But in the present day we see that good parameters tend to be pretty thick on the ground, and you can fairly smoothly move around in parameter space to make different tradeoffs.

↑ comment by Jacob Pfau (jacob-pfau) · 2022-06-21T22:19:31.227Z · LW(p) · GW(p)

Here's my stab at rephrasing this argument without reference to IB. Would appreciate corrections, and any pointers on where you think the IB formalism adds to the pre-theoretic intuitions:

At some point imitation will progress to the point where models use information about the world to infer properties of the thing they're trying to imitate (humans) -- e.g. human brains were selected under some energy efficiency pressure, and so have certain properties. The relationship between "things humans are observed to say/respond to" to "how the world works" is extremely complex. Imitation-downstream-of-optimization is simpler. What's more, imitation-downstream-of-optimization can be used to model (some of) the same things the brain-in-world strategy can. A speculative example: a model learns that humans use a bunch of different reasoning strategies (deductive reasoning, visual-memory search, analogizing...) and does a search over these strategies to see which one best fits the current context. This optimization-to-find-imitation is simpler than learning the evolutionary/cultural/educational world model which explains why the human uses one strategy over another in a given context.

↑ comment by Elias Schmied (EliasSchmied) · 2022-06-29T18:52:09.120Z · LW(p) · GW(p)

I must be missing something here. Isn't optimizing necessary for superhuman behavior? So isn't "superhuman behavior" a strictly stronger requirement than "being a mesaoptimizer"? So isn't it clear which one happens first?

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-29T22:45:55.445Z · LW(p) · GW(p)

Fast imitations of subhuman behavior or imitations of augmented of humans are also superhuman. As is planning against a human-level imitation. And so on.

It's unclear if systems trained in that way will be imitating a process that optimizes, or will be optimizing in order to imitate. (Presumably they are doing both to varying degrees.) I don't think this can be settled a priori.

Replies from: EliasSchmied

↑ comment by Elias Schmied (EliasSchmied) · 2022-06-30T04:41:30.684Z · LW(p) · GW(p)

This "imitating an optimizer" / "optimizing to imitate" dichotomy seems unnecessarily confusing to me. Isn't it just inner alignment / inner misalignment (with the human behavior you're being trained on)? If you're imitating an optimizer, you're still an optimizer.

Replies from: david-johnston, EliasSchmied

↑ comment by David Johnston (david-johnston) · 2022-07-01T00:00:50.672Z · LW(p) · GW(p)

I agree with this. If the key idea is, for example, optimising imitators generalise better than imitations of optimisers, or for a second example that they pursue simpler goals, it seems to me that it'd be better just to draw distinctions based on generalisation or goal simplicity and not on optimising imitators/imitations of optimisers.

↑ comment by Elias Schmied (EliasSchmied) · 2022-06-30T05:02:00.527Z · LW(p) · GW(p)

Sorry, I should be more specific. We are talking about AGI Safety, it seems unlikely that running narrow AI faster gets you AGI. I'm not sure if you disagree with that. I don't understand what you mean by "imitations of augmented of humans" and "planning against a human-level imitation".

↑ comment by Wei Dai (Wei_Dai) · 2022-06-20T19:00:19.945Z · LW(p) · GW(p)

comment by Laura B (Lara_Foster) · 2022-06-22T17:07:40.691Z · LW(p) · GW(p)

As someone without domain-specific knowledge of this area, I notice that Paul's article seems to fill my model of a reality-shaped hole better than Eliezer's. This may just be an artifact of the specific use of language and detail that Paul provides and Eliezer does not, and Eliezer may have specific things he could say about all of these things and is choosing not to. Paul's response at least makes it clear to me that people like me, without domain-specific knowledge, are prone to being pulled psychologically in various directions by use of language, and should be very careful about making important life decisions based on concerns of AI safety without first educating themselves much further on the topic, especially since giving attention and funding to the issue at least has the capacity to cause harm.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-20T04:07:29.606Z · LW(p) · GW(p)

> For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list.

I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties. I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously. I realize I haven't done enough of this myself, but if you've already written up the component pieces, that can make it easier to collect the bullet list.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-20T04:30:57.028Z · LW(p) · GW(p)

I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."

If you are currently looking for the list of difficulties: see the long footnote [LW(p) · GW(p)].

If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state at the very top. Then there are a series of sections organized around possible solutions and the problems with those solutions, which highlight many of the general difficulties. I don't intuitively feel like a bulleted list of difficulties would have been a better way to describe the difficulties.

comment by TekhneMakre · 2022-06-20T01:42:09.932Z · LW(p) · GW(p)

> The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones,
> does not have such a large effect on the scientific problem.

Another major difference is that we're forced to solve the problem using only analogies (and reasoning), as opposed to also getting to study the actual objects in question. And, there's a big boundary between AIs that would lose vs. win a fight with humanity, which causes big disanalogies between AIs, and how alignment strategies apply to AIs, before and after that boundary. (Presumably there's major disagreement about how important these disanalogies are / how difficult they are to circumvent with other analogies.)

> AI is accelerating the timetable for both alignment and capabilities

AI accelerates the timetable for things we know how to point AI at (which shades into convergently instrumental things that we point at just by training an AI to do anything). We know how to point AI at things that can be triangulated with clear metrics, like "how well does the sub-AI you programmed perform at such and such tasks". We know much less about how to point AI at alignment, or at more general things like "do good human-legible science / philosophy" that potentially tackle the core hard parts of alignment. So I don't buy that these rates are very meaningfully coupled. Clearly there's some coupling, like you could increase the ability of people working on alignment to find interesting papers 10x more easily, but that seems a fairly modest and bounded help; it's not going to make alignment research 100x faster / better, because it's still routing the important stuff through human researchers. How do you view AI as accelerating the timetable for alignment?

Replies from: bogdanb

↑ comment by bogdanb · 2022-07-02T20:24:06.935Z · LW(p) · GW(p)

> AI accelerates the timetable for things we know how to point AI at

It also accelerates the timetable for random things that we don’t expect and don’t even try to point the AI at but that just happen to be easier for incrementally-better AI to do.

Since the space of stuff that helps alignment seems much smaller than the space of dangerous things, you’d expect most things the AI randomly accelerates without us pointing it at will be dangerous.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-07-03T02:07:59.782Z · LW(p) · GW(p)

Not exactly, because it's not exactly "random things", it's heavily weighted on convergently instrumental things. If you could repurpose ~all the convergently instrumental stuff that randomly targeted AI can do towards AI alignment, like I think Christiano is trying to do, then you'd have a pretty strong coupling. Whether you can do that though is an open question, whether that would be sufficient is an open question.

Replies from: bogdanb

↑ comment by bogdanb · 2023-07-21T07:02:40.900Z · LW(p) · GW(p)

I’m not sure I understand your weighting argument. Some capabilities are “convergently instrumental” because they are useful for achieving a lot of purposes. I agree that AI construction techniques will target obtaining such capabilities, precisely because they are useful.

But if you gain a certain convergently instrumental capability, it then automatically allows you to do a lot of random stuff. That’s what the words mean. And most of that random stuff will not be safe.

I don’t get what the difference is between “the AI will get convergently instrumental capabilities, and we’ll point those at AI alignment” and “the AI will get very powerful and we’ll just ask it to be aligned”, other than a bit of technical jargon.

As soon as the AI gets sufficiently powerful [convergently instrumental capabilities], it is already dangerous. You need to point it precisely at a safe target in outcomes-space or you’re in trouble. Just vaguely pointing it “towards AI alignment” is almost certainly not enough; specifying that outcome safely is the problem we started with.

(And you still have the problem that while it’s working on that someone else can point it at something much worse.)

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2023-07-21T14:40:27.379Z · LW(p) · GW(p)

I don't see much of a disagreement here? I'm just saying that the way in which random things are accelerated is largely via convergent stuff; and therefore there's maybe some way that one can "repurpose" all that convergent stuff towards some aligned goal. I agree that this idea is dubious / doesn't obviously work. As a contrast, one could imagine instead a world in which new capabilities are sort of very idiosyncratic to the particular goal they serve, and when you get an agent with some goals, all its cognitive machinery is idiosyncratic and hard to parse out, and it would be totally infeasible to extract the useful cognitive machinery and repurpose it.

comment by Steven Byrnes (steve2152) · 2022-06-21T04:39:32.666Z · LW(p) · GW(p)

RE Disagreement 5: Some examples where the aligned AIs will not consume the “free energy” of an out-of-control unaligned AI are:

1. Exploiting the free energy requires humans trusting the AIs more than they actually do. For example, humans with a (supposedly) aligned AGI may not trust the AGI to secure their own nuclear weapons systems, or to hack into its enemies’ nuclear weapons systems, or to do recursive self-improvement, or to launch von Neumann probes that can never be called back. But an out-of-control AGI would presumably be willing to do all those things.

2. Exploiting the free energy requires violating human laws, norms, Overton Windows, etc., or getting implausibly large numbers of human actors to agree with each other, or suffering large immediate costs for uncertain benefits, etc., such that humans don’t actually let their aligned AGIs do that. For example, maybe the only viable gray goo defense system consists of defensive nanobots that go proliferate in the biosphere, harming wildlife and violating national boundaries. Would people + aligned AGIs actually go and deploy that system? I’m skeptical. Likewise, if there’s a neat trick to melt all the non-whitelisted GPUs on the planet, I find it hard to imagine that people + aligned AGIs would actually do anything with that knowledge, or even that they would go looking for that knowledge in the first place. But an out-of-control unaligned AGI wouldn’t hesitate.

3. Exploiting the free energy accomplishes a goal that no human would want to accomplish, e.g. the removal of all oxygen from the atmosphere. Here, the attacking and defending AIs are trying to do two different things. Destroying a power grid may be much easier or much harder than preventing a power grid from being destroyed; a gray goo defense system may be much easier or much harder to create than gray goo, etc. I don’t have great knowledge about attack-defense balance in any of these domains, but I’m concerned by the disjunctive nature of the problem—an out-of-control AGI would presumably attack in whatever way had the worst attack-defense imbalance.

(Above is somewhat redundant with Paul’s strategy-stealing post [LW · GW]; like Zvi [LW(p) · GW(p)] I thought it was a nice post but I drew the opposite conclusion.)

Replies from: owngrove

↑ comment by owngrove · 2022-06-23T23:24:06.284Z · LW(p) · GW(p)

Seconding all of this.

Another way to state your second point - the only way to exploit that free energy may be through something that looks a lot like a 'pivotal act'. And in your third point, there may be no acceptable way to exploit that free energy, in which case the only option is to prevent any equally-capable unaligned AI from existing - not necessarily through a pivotal act, but Eliezer argues that's the only practical way to do so.

I think the existence/accessibility of these kinds of free energy (offense-favored domains whose exploitation is outside of the Overton window or catastrophic) is a key crux for 'pivotal act' vs. gradual risk reduction strategies, plausibly the main one.

In the terms of Paul's point #2 - this could still be irrelevant because earlier AI systems will have killed us in more boring ways, but the 'radically advancing the state of human R&D' branch may not meaningfully change our vulnerability. I think this motivates the 'sudden doom' story even if you predict a smooth increase in capabilities.

comment by Jacob Pfau (jacob-pfau) · 2022-06-19T22:04:28.338Z · LW(p) · GW(p)

> I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research.

This seems like a crux for the Paul-Eliezer disagreement which can explain many of the other disagreements (it's certainly my crux). In particular, conditional on taking Eliezer's side on this point, a number of Eliezer's other points all seem much more plausible e.g. nanotech, advanced deception/treacherous turns, and pessimism regarding the pace of alignment research.

There's been a lot of debate [? · GW] on this point, and some of it was distilled by Rohin [LW · GW]. Seems to me that the most productive way to move forward on this disagreement would be to distill the rest of the relevant MIRI conversations, and solicit arguments on the relevant cruxes.

Replies from: tor-okland-barstad

↑ comment by Tor Økland Barstad (tor-okland-barstad) · 2022-06-24T14:01:56.501Z · LW(p) · GW(p)

How useful AI-systems can be at this sort of thing after becoming catastrophically dangerous is also worth discussing more than is done at present. At least I think so. Between Eliezer and me I think maybe that's the biggest crux (my intuitions about FOOM are Eliezer-like I think, although AFAIK I'm more unsure/agnostic regarding that than he is).

Obviously a more favorable situation if AGI-system is aligned before it could destroy the world. But even if we think we succeeded with alignment prior to superintelligence (and possible FOOM), we should look for ways it can help with alignment afterwards, so as to provide additional security/alignment-assurance.

As Paul points out, verification will often be a lot easier than generation, and I think techniques that leverage this (also with superintelligent systems that may not be aligned) are underdiscussed. And how easy/hard it would be for an AGI-system to trick us (into thinking it's being helpful when it really wasn't) would depend a lot on how we went about things.

There are various potential ways of getting help for alignment while keeping "channels of causality" quite limited and verifying the work/output of the AI-system in powerful ways.

I've started on a series about this: https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/getting-from-unaligned-to-aligned-agi-assisted-alignment

comment by Eli Tyre (elityre) · 2022-06-20T23:25:21.294Z · LW(p) · GW(p)

> Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.

I, personally, would like 5 or 10 examples, from disparate fields, of verification being easier than generation.

And also counterexamples, if anyone has any.

Replies from: paulfchristiano, johnswentworth, johnlawrenceaspden, sharmake-farah, elityre, johnlawrenceaspden, Kenny, alexander-gietelink-oldenziel

↑ comment by paulfchristiano · 2022-06-21T03:25:54.722Z · LW(p) · GW(p)

I'm just going to name random examples of fields, I think it's true essentially all the time but I only have personal experience in a small number of domains where I've actually worked:

1. It's easier to recognize a good paper in computer science or ML than to write one. I'm most familiar with theoretical computer science, where this is equally true in domains that are not yet formalized, e.g. a mediocre person in the field is still able to recognize important new conceptual ideas without being able to generate them. In ML it requires more data than is typically present in a paper (but e.g. can be obtained by independent replications or by being able to inspect code).
2. Verifying that someone has done a good job writing software is easier than writing it yourself, if you are able to e.g. interact with the software, get clear explanations of what they did and why, and have them also write good tests.
3. Verifying a theory in physics is easier than generating it. Both in the sense that it's much easier to verify that QM or the standard model or general relativity is a good explanation of existing phenomena than it is to come up with those models from scratch, and in the sense that e.g. verifying claims about how the LHC supports a given claim is easier than designing and building the LHC.
4. Verifying that someone has built a good GPU or a quantum computer is much easier than building one. This is completely clear if you are able to perform experiments on the computer. I also think it's almost certainly true if you are trying to evaluate a design and manufacturing process, though I have less firsthand experience.
5. There are a ton of fuzzy domains where we have less objective evidence but the claim seems obviously true to me. Evaluating papers in philosophy, useful exercises in futurism, alignment ideas, etc. all seem meaningfully easier than generating them (particularly if we require them to come with convincing justification). I think other people have different intuitions here but I'm not sure how to engage, and if there are disagreements about more established fields that's obviously nicer to use as an example.

Replies from: David Lorell

↑ comment by David Lorell · 2022-10-20T17:30:13.387Z · LW(p) · GW(p)

This feels like stepping on a rubber duck while tip-toeing around sleeping giants but:

Don't these analogies break if/when the complexity of the thing to generate/verify gets high enough? That is, unless you think the difficulty of verification of arbitrarily complex plans/ideas is asymptotic to some human-or-lower level of verification capability (which I doubt you do) then at some point humans can't even verify the complex plan.

So, the deeper question just seems to be takeoff speeds again: If takeoff is too fast, we don't have enough time to use "weak" AGI to help produce actually verifiable plans which solve alignment. If takeoff is slow enough, we might. (And if takeoff is too fast, we might not notice that we've passed the point of human verifiability until it's too late.)

(I am consciously not bringing up ideas about HCH / other oversight-amplification ideas because I'm new to the scene and don't feel familiar enough with them.)

↑ comment by johnswentworth · 2022-06-21T01:00:34.552Z · LW(p) · GW(p)

I expect there will probably be a whole debate on this at some point, but as counterexamples I would give basically all the examples in When Money is Abundant, Knowledge is the Real Wealth [? · GW] and What Money Cannot Buy [? · GW]. The basic idea in both of these is that expertise, in most fields, is not easier to verify than to generate, because most of the difficulty is in figuring out what questions to ask and what to pay attention to, which itself requires expertise.

More generally, I expect that verification is not much easier than generation in any domain where figuring out what questions to ask and what to pay attention to is itself the bulk of the problem. Unfortunately, this is very highly correlated with illegibility, so legible examples are rare.

Replies from: Jack Ryan

↑ comment by Jack R (Jack Ryan) · 2022-07-09T23:07:19.377Z · LW(p) · GW(p)

It's not obvious to me that the class of counter-examples "expertise, in most fields, is not easier to verify than to generate" are actually counter-examples. For example for "if you're not a hacker, you can't tell who the good hackers are," it still seems like it would be easier to verify whether a particular hack will work than to come up with it yourself, starting off without any hacking expertise.

Replies from: johnswentworth

↑ comment by johnswentworth · 2022-07-10T18:53:29.137Z · LW(p) · GW(p)

First, "does the hack work?" is not the only relevant question. A good hacker knows that other things also matter - e.g. how easy the code is for another person to understand, or how easy it is to modify later on. This principle generalizes: part of why expertise is hard-to-recognize is because non-experts won't realize which questions to ask.

Second, checking whether a program does what we intend in general (i.e. making sure it has no bugs) is not consistently easier than writing a correct program oneself, especially if the program we're trying to check is written by a not-very-good programmer. This is the fundamental reason why nobody uses formal verification methods: writing the specification for what-we-want-the-code-to-do is usually about as difficult, in practice, as writing the code to do it. (This is actually a separate argument/line-of-evidence that verification is not, in practice and in general, easier than generation.)
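
A toy sketch of the specification point, not taken from the comment above: it uses runtime checks on a single input rather than real formal verification, and `weak_spec`, `full_spec` and `buggy_sort` are hypothetical names made up for illustration. It suggests how a checker is only as good as the specification behind it, and how an easy-to-forget clause lets a buggy program pass.

```python
from collections import Counter


def weak_spec(inp, out):
    """Incomplete spec: only checks that the output is in non-decreasing order."""
    return all(a <= b for a, b in zip(out, out[1:]))


def full_spec(inp, out):
    """Fuller spec: ordered AND contains exactly the same elements as the input."""
    return weak_spec(inp, out) and Counter(inp) == Counter(out)


def buggy_sort(xs):
    """Hypothetical buggy implementation: silently drops duplicate elements."""
    return sorted(set(xs))


data = [3, 1, 3, 2]
result = buggy_sort(data)

# The incomplete spec accepts the buggy output; the fuller spec rejects it.
print("weak spec passes:", weak_spec(data, result))
print("full spec passes:", full_spec(data, result))
```

Even in this tiny case, the fuller specification is about as long as a call to a library sort, which is at least compatible with the claim that writing the specification can cost about as much as writing the code.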

↑ comment by johnlawrenceaspden · 2022-06-21T17:21:42.775Z · LW(p) · GW(p)

One particularly difficult case is when the thing you're trying to verify has a subtle flaw.

Consider Kempe's proof of the four-colour theorem, which was generally accepted for eleven years before being refuted. (Its method does in fact yield a proof of the five-colour theorem.)

And of course, subtle flaws are much more likely in things that someone has designed to deceive you.

Against an intelligent adversary, verification might be much harder than generation. I'd cite Marx and Freud as world-sweeping obviously correct theories that eventually turned out to be completely worthless. I can remember a time when both were taken very seriously in academic circles.

Replies from: bogdanb

↑ comment by bogdanb · 2022-07-02T20:56:45.349Z · LW(p) · GW(p)

Exactly. You can’t generalize from “natural” examples to adversarial examples. If someone is trying hard to lie to you about something, verifying what they say can very well be harder than finding the truth would have been absent their input, particularly when you don’t know if and what they want to lie about.

I’m not an expert in any of these and I’d welcome correction, but I’d expect verification to be at least as hard as “doing the thing yourself” in cases like espionage, hacking, fraud and corruption.

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-20T23:41:14.992Z · LW(p) · GW(p)

The entire P vs NP problem basically boils down to "is it easier to verify the correct answer than to generate it?" And while it's still unproven, in our universe the answer seems to be yes. So conditioning on P not equaling NP, it's much easier to verify that a proof or hypothesis is correct than to generate one.
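
A minimal sketch of that asymmetry, using subset-sum as a stand-in NP problem (my choice of example, not something from the comment). The function names are hypothetical, and the brute-force search is only the naive approach: faster algorithms exist for many instances, and whether a polynomial-time generator exists in general is exactly the open question.

```python
from itertools import combinations


def verify_subset_sum(numbers, target, certificate):
    """Check a claimed solution in linear time.

    `certificate` is a list of distinct indices into `numbers`.
    """
    return (
        len(set(certificate)) == len(certificate)
        and all(0 <= i < len(numbers) for i in certificate)
        and sum(numbers[i] for i in certificate) == target
    )


def generate_subset_sum(numbers, target):
    """Naive search: may try up to 2**len(numbers) index subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None  # no subset adds up to the target


numbers = [3, 34, 4, 12, 5, 2]
solution = generate_subset_sum(numbers, 9)
print("found indices:", solution)
print("verified:", verify_subset_sum(numbers, 9, solution))
```

The checker does one pass over the certificate, while the naive generator's work doubles with every extra number in the list.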

Replies from: Kenny

↑ comment by Kenny · 2022-06-22T05:48:49.828Z · LW(p) · GW(p)

But specific P problems can still be 'too hard' to solve practically.

↑ comment by Eli Tyre (elityre) · 2022-06-26T20:05:04.399Z · LW(p) · GW(p)

Actually, my more specific question is "is verification still easier than generation, if the generation is adversarial?" That seems like a much more specific problem space than just "generation and verification in general."

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-26T20:42:41.343Z · LW(p) · GW(p)

What kind of example are you looking for / what does your question mean?

I think if someone just tries their hardest to make "something that people will think is useful ML hardware" they will typically end up making useful ML hardware. I think this is most obvious for humans and human firms, but also very probably true for alien intelligences with quite different ability profiles.

I'm not sure if that's what you mean by "adversarial" (it seems like it's usually the relevant question), and if so I'm not sure how/whether it differs from the examples I gave.

I think if someone tries their hardest to make "something that people will think is useful ML hardware but isn't," I'm sure that's also possible (though apparently much harder than just making useful ML hardware). Though on the flip side if someone then said "Recognize an argument that this hardware isn't actually useful" I think that's also much easier than generating the deceptive hardware itself.

(That discussion seems the same for my other 4 examples. If someone tries their hardest to produce "something that looks like a really great scientific theory" or "something that looks like a ground-breaking paper in TCS after careful evaluation" or whatever, you will get something that has a good probability of being a great scientific theory or a ground-breaking paper.)

↑ comment by johnlawrenceaspden · 2022-06-21T17:08:45.146Z · LW(p) · GW(p)

It's vastly easier to understand a maths proof (almost any maths proof) than it is to invent one.

It's a lot easier to verify a solution to a problem in NP than it is to generate one (by definition!, but a lot of problems turn out to be NP-complete)

It's a lot easier to check that someone caught a cricket ball than it is to catch one.

It's a lot easier to check that someone can drive than to teach them.

It's a lot easier to tell whether a program can tell the difference between cats and dogs than to write a program that can.

Counterexamples:

It can be easier to write a correct computer program than to verify it, and easier to fix the bugs than to find them.

It can be easier to find an algorithm than to prove that it works.

↑ comment by Kenny · 2022-06-22T05:47:29.020Z · LW(p) · GW(p)

I agree that "verification is much, much easier than generation".

But I don't agree that verification is generally 'easy enough'.

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-21T22:11:34.339Z · LW(p) · GW(p)

I am surprised no one has mentioned P vs NP and its myriad incarnations yet.

comment by jessicata (jessica.liu.taylor) · 2022-06-19T23:34:46.381Z · LW(p) · GW(p)

> AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold, AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.

One different way I've been thinking about this issue recently is that humans have fundamental cognitive limits e.g. brain size that AGI wouldn't have. There are possible biotech interventions to fix these but the easiest ones (e.g. just increase skull size) still require decades to start up. AI, meanwhile, could be improved (by humans and AIs) on much faster timescales. (How important something like brain size is depends on how much intellectual progress is explained by max intelligence rather than total intelligence; a naive reading of intellectual history would say max intelligence is important given that a high percentage of relevant human knowledge follows from <100 important thinkers.)

This doesn't lead me to assign high probability to "takeoff in 1 month", my expectation is still that AI improving AI will be an extension of humans improving AI (and then centaurs improving AI), but the iteration cycle time could be a lot faster due to AIs not having fundamental human cognitive limits.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-20T00:22:14.754Z · LW(p) · GW(p)

My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.

comment by Jan_Kulveit · 2024-01-09T01:44:37.777Z · LW(p) · GW(p)

This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), first form an oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'.

Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection could be a useful type of post, even without much justification.

comment by Vanessa Kosoy (vanessa-kosoy) · 2024-01-11T07:07:47.910Z · LW(p) · GW(p)

I wrote a review here [LW · GW]. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

[1] I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-19T20:37:16.492Z · LW(p) · GW(p)

> I don’t think surviving worlds have a plan in the sense Eliezer is looking for.

This seems wrong to me, could you elaborate? Prompt: Presumably you think we do have a plan, it just doesn't meet Eliezer's standards. What is that plan?

Eliezer said:

> Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan.
> ... Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else's job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later - there's a lot of people suggesting those, in the worlds destined to live, and they don't have a special status in the field, it's just what normal geniuses there do - they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.'

I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them and at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed. I'm guessing you agree that it's bad to meet suggestions for lethal problems with "how can you be sure / we'll have to wait / shrug" and that instead it's good for people to start thinking about those problems and designing solutions now.

I guess there's also the "It is a written plan. It is not secret" part. I for one would feel noticeably better if we had a written, non-secret plan.

Replies from: paulfchristiano, ricraz, paulfchristiano

↑ comment by paulfchristiano · 2022-06-19T20:46:14.339Z · LW(p) · GW(p)

I think most worlds, surviving or not, don't have a plan in the sense that Eliezer is asking about.

I do agree that in the best worlds, there are quite a lot of very good plans and extensive analysis of how they would play out (even if it's not the biggest input into decision-making). Indeed, I think there are a lot of things that the best possible world would be doing that we aren't, and I'd give that world a very low probability of doom even if alignment was literally impossible-in-principle.

ETA: this is closely related to Richard's point in the sibling [LW(p) · GW(p)].

↑ comment by Richard_Ngo (ricraz) · 2022-06-19T21:14:36.861Z · LW(p) · GW(p)

I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.

Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.

Replies from: cubefox

↑ comment by cubefox · 2022-06-23T22:44:53.724Z · LW(p) · GW(p)

As far as I understand, Eliezer doesn't claim that plans are generally very useful for making progress in solving problems. Trial and error usually works very well. But he also says that trial and error will not work for the alignment problem; we have to get it right the first time, therefore detailed plans are our only hope. This isn't overconfidence in plans; it is just high confidence that the usual trial-and-error approach can't be used this time.

↑ comment by paulfchristiano · 2022-06-19T21:01:12.055Z · LW(p) · GW(p)

I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them

I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.

I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do think it's reasonably likely to agree at a very high level. I think that's also true even in the much better worlds that do have tons of plans.

at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed

I wouldn't say there is "a plan" to do that.

Many people have that hope, and have thought some about how we might establish sufficient consensus about risk to delay AGI deployment for 0.5-2 years if things look risky, and how to overcome various difficulties with implementing that kind of delay, or what kind of more difficult moves might be able to delay significantly longer than that.

comment by Thomas Larsen (thomas-larsen) · 2022-08-28T22:16:55.992Z · LW(p) · GW(p)

Overall, there are relatively few researchers who are effectively focused on the technical problems most relevant to existential risk from alignment failures.

What do you think these technical problems are?

comment by Kaj_Sotala · 2022-06-19T23:46:31.397Z · LW(p) · GW(p)

Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.

I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such as by everyone wanting to donate to sperm/egg banks".

But evolution does not seem to maximize fitness in that sense [LW(p) · GW(p)], where the fitness of a species would be a distinct thing-in-the-world that could be directly observed and optimized for. Something like "docileness" or "size", as used in animal breeding, would be a much better analogy, since those things are something that you can directly observe and optimize for - and human breeders do.

And... if humans had been explicitly bred for friendliness and corrigibility for a while, it seems to me that they likely would want to do the analogous thing of maximizing-their-donations-to-sperm/egg-banks. After all, we can already see that people who are high on either end of some personality trait such as altruism/selfishness, dominance/submission, openness/conservatism, etc., are likely to view that trait as a virtue (as long as nothing in the environment too overwhelmingly disproves this) and seek to become even more like that.

Altruistic people often want to become even more altruistic, selfish people to eliminate their altruistic "weaknesses", dominant people to become more dominant, submissive people to make it easier for themselves to submit (this has some strong counterforces in our culture where submissiveness is generally considered undesirable, but you can still see it valued in e.g. workplace cultures where workers resent reforms that would give them more autonomy, preferring bosses to "just tell them what to do"), open people to become more open to experience, and so on.

Probably if people high on such traits were offered chances to self-modify to become even more so - which seems analogous to the sperm/egg bank thing, since it's the cognitive optimization form of the instinctive thing - quite a few of them would.

Replies from: alexander-gietelink-oldenziel

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-21T22:16:15.125Z · LW(p) · GW(p)

What about selecting for "moderation in all things"? Is that not virtue?

Aristotle invented quantification; you heard it here first.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-19T20:16:27.482Z · LW(p) · GW(p)

Thanks for writing this!

Typo: "I see this kind of thinking from Eliezer a lot but it seems misleading or long" should be "...or wrong"

comment by Eli Tyre (elityre) · 2023-03-25T17:27:06.238Z · LW(p) · GW(p)

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.

We could maybe make the world safer a little at a time, but we do have to get to an equilibrium in which the world is protected from explosive growth when some system (including an ecosystem of multiple AIs) starts pulling away from the growth rate of the rest of the world and gains decisive power.

My model here is something like "even small differences in the rate at which systems are compounding power and/or intelligence lead to gigantic differences in absolute power and/or intelligence, given that the world is moving so fast."

Or maybe another way to say it: the speed at which a given system can compound its abilities is very fast, relative to the rate at which innovations diffuse through the economy, for other groups and other AIs to take advantage of.

It seems like all of the proposals that seem like they meet this desideratum (making the world safe from that kind of explosion in power of one system over all the others) look pretty pivotal-act-like, rather than like a series of marginal improvements.
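A rough numerical sketch of the compounding claim above, under assumptions that are mine rather than the comment's: both systems start at equal size, the baseline doubles a fixed number of times per year, and the faster system doubles 10% more often. The function name and the specific numbers are hypothetical.

```python
def lead_ratio(base_doublings_per_year, edge, years):
    """Size of a faster-compounding system relative to a baseline.

    Assumes (hypothetically) that both start equal and that the faster
    system doubles (1 + edge) times as often as the baseline.
    """
    extra_doublings = base_doublings_per_year * edge * years
    return 2.0 ** extra_doublings


# Same 10% edge in compounding rate, same 10 calendar years:
slow_world = lead_ratio(0.05, 0.10, 10)  # baseline doubles every 20 years
fast_world = lead_ratio(5.0, 0.10, 10)   # baseline doubles 5 times a year

print(f"slow world lead: {slow_world:.3f}x")
print(f"fast world lead: {fast_world:.1f}x")
```

In this toy setup the relative edge is identical in both worlds; only the baseline speed differs. That is one way to read "given that the world is moving so fast": the same proportional advantage buys many more extra doublings per calendar year.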

Replies from: ESRogs, paulfchristiano

↑ comment by ESRogs · 2023-04-04T23:59:58.098Z · LW(p) · GW(p)

My model here is something like "even small differences in the rate at which systems are compounding power and/or intelligence lead to gigantic differences in absolute power and/or intelligence, given that the world is moving so fast."
Or maybe another way to say it: the speed at which a given system can compound its abilities is very fast, relative to the rate at which innovations diffuse through the economy, for other groups and other AIs to take advantage of.

I'm a bit skeptical of this. While I agree that small differences in growth rates can be very meaningful, I think it's quite difficult to maintain a growth rate faster than the rest of the world for an extended period of time.

Growth and Trade

The reason is that growth is way easier if you engage in trade. And assuming that gains from trade are shared evenly, the rest of the world profits just as much (in absolute terms) as you do from any trade. So you can only grow significantly faster than the rest of the world while you're small relative to the size of the whole world.

To give a couple of illustrative examples:

The "Asian Tigers" saw their economies grow faster than GWP during the second half of the 20th century because they were engaged in "catch-up" growth. Once their GDP per capita got into the same ballpark as other developed countries, they slowed down to a similar growth rate to those countries.
Tesla has grown revenue at an average of 50% per year for 10 years. That's been possible because they started out as a super small fraction of all car sales, and there were many orders of magnitude of growth available. I expect them to continue growing at something close to that rate for another 5-10 years, but then they'll slow down because the global car market is only so big.

Growth without Trade

Now imagine that you're a developing nation, or a nascent car company, and you want to try to grow your economy, or the number of cars you make, but you're not allowed to trade with anyone else.

For a nation it sounds possible, but you're playing on super hard mode. For a car company it sounds impossible.

Hypotheses

This suggests to me the following hypotheses:

Any entity that tries to grow without engaging in trade is going to be outcompeted by those that do trade, but
Entities that grow via trade will have their absolute growth capped at the size of the absolute growth of the rest of the world, and thus their growth rate will max out at the same rate as the rest of the world, once they're an appreciable fraction of the global economy.

I don't think these hypotheses are necessarily true in every case, but it seems like they would tend to be true. So to me that makes a scenario where explosive growth enables an entity to pull away from the rest of the world seem a bit less likely.
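The second hypothesis can be sketched as a toy simulation. Everything here is an illustrative assumption rather than something the comment specifies: the entity would like to grow 50% per year, the rest of the world grows 3% per year, and the entity's absolute yearly gain is capped at the rest of the world's absolute yearly gain.

```python
def simulate(entity, rest, desired_rate, world_rate, years):
    """Return a list of (year, entity growth rate, entity share of output).

    Toy assumption: the entity's absolute gain in a year cannot exceed
    the rest of the world's absolute gain in that year.
    """
    history = []
    for year in range(1, years + 1):
        rest_gain = world_rate * rest
        entity_gain = min(desired_rate * entity, rest_gain)  # the trade cap
        history.append((year, entity_gain / entity, entity / (entity + rest)))
        entity += entity_gain
        rest += rest_gain
    return history


history = simulate(entity=1.0, rest=1000.0, desired_rate=0.5,
                   world_rate=0.03, years=40)
for year, rate, share in history[::5]:
    print(f"year {year:2d}: growth {rate:6.1%}, share of world {share:6.1%}")
```

In this sketch the entity grows at its desired rate while it is tiny, then its growth rate falls toward the world rate once the cap binds. Whether the cap is a good model of real trade is exactly what the hypotheses leave open.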

↑ comment by paulfchristiano · 2023-03-25T19:40:05.270Z · LW(p) · GW(p)

I agree that small differences in growth rates between firms or countries, compounded over many doublings of total output, will lead to large differences in final output. But I think there are quite a lot of other moving steps in this story before you get to the need for a pivotal act. It seems like you aren't pointing to the concentration of power per se (if so, I think your remedies would look like normal boring stuff like corporate governance!); rather, I think you are making way more opinionated claims about the risk posed by misalignment.

Most proximately, I don't think that "modestly reduce the cost of alignment" or "modestly slow the development or deployment of unaligned AI" need to look like pivotal acts. It seems like humans can do those things a bit, and plausibly with no AI assistance can do them at >1 year per year of delay. AI assistance could help humans do those things better, improving our chances of getting over 1 year per year of delay. Modest governance changes could reduce the risk each year of catastrophe. You don't necessarily have to delay that long in calendar time in order to get alignment solutions. etc.

comment by Eli Tyre (elityre) · 2022-08-06T00:20:12.317Z · LW(p) · GW(p)

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “[free energy](https://www.lesswrong.com/posts/yPLr2tnXbiFXkMWvk/an-equilibrium-of-no-free-energy)” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.

On the face of it, this seems true, and it seems like a pretty big clarification to my thinking. You can buy more time or more safety a little bit at a time, instead of all at once, in sort of the way that you want to achieve life extension escape velocity.

But it seems like this largely depends on whether you expect takeoff to be hard or soft. If AI takeoff is hard, you need pretty severe interventions, because they either need to prevent the deployment of AGI or be sufficient to counter the actions of a superintelligence. Generally, it seems like the sharper takeoff is, the more good outcomes flow through pivotal acts, and the smoother takeoff is, the more we should expect good outcomes to flow through incremental improvements.

Are there any incremental actions that add up to a "pivotal shift" in a hard takeoff world?

comment by Nicholas / Heather Kross (NicholasKross) · 2022-06-20T18:51:24.023Z · LW(p) · GW(p)

Regarding disagreement (2), I think many of Yudkowsky's "doom stories" are more intuition pumps / minimum bounds for demonstrating properties of superintelligence.

E.g. nanotech isn't there because he necessarily thinks it's what an unaligned AGI would do. Instead, it's to demonstrate how high the relative tech capabilities are of the AGI.

His point (which he stresses in different ways), is "don't look at the surface details of the story, look instead at the implied capabilities of the system".

Similar with "imagine it self-improving in minutes". It may or may not happen that way specifically, but the point is "computers work on such short timescales, and recursion rates can compound quickly enough, that we should expect some parts of the process to be faster than expected, including maybe the FOOM".

It's not supposed to be a self-contained cinematic universe, it's supposed to be "we have little/no reason to expect it to not be at least this weird", according to his background assumptions (which he almost always goes into more detail on anyway).

Replies from: Kenny

↑ comment by Kenny · 2022-06-22T05:55:57.178Z · LW(p) · GW(p)

I'm not sure what you mean by "how high the relative tech capabilities are of the AGI".

I think the general capability of the AGI itself, not "tech" capabilities specifically, is plenty dangerous by itself.

The general danger seems more like 'a really powerful but unaligned optimizer' that's 'let loose'.

I'm not sure that 'agent-ness' is necessary for catastrophe; just 'strong enough optimization' and a lack of our own capability in predicting the consequences of running the AGI.

I do agree with this:

It's not supposed to be a self-contained cinematic universe, it's supposed to be "we have little/no reason to expect it to not be at least this weird", according to his background assumptions (which he almost always goes into more detail on anyway).

comment by Nicholas / Heather Kross (NicholasKross) · 2022-06-20T19:06:20.946Z · LW(p) · GW(p)

Regarding disagreement (7): I'd like to see more people using AI to try and make useful contributions to alignment.

More broadly, I think the space of alignment working methods, literally the techniques researchers would use day-to-day, has been under-explored.

If the fate of the world is at stake, shouldn't someone at least try hokey idea-generation techniques lifted from corporations? Idea-combinations generators? Wacky proof-helper softwares? Weird physical-office setups like that 10-chambered linear room thing I saw somewhere but can't find now? I don't expect these to help a ton, and I expect high degrees of failure, but I also expect surviving worlds to have tried them already and maybe written up (short, low-effort, low-conscientiousness, low-mental-cycles) notes on how well they worked.

comment by Eli Tyre (elityre) · 2023-03-25T17:37:19.030Z · LW(p) · GW(p)

In particular, existing AI training strategies don’t need to handle a “drastic” distribution shift from low levels of intelligence to high levels of intelligence. There’s nothing in the foreseeable ways of building AI that would call for a big transfer like this, rather than continuously training as intelligence gradually increases.

An obvious possible regime change is the shift to training (some) agents that do lifetime learning rather than only incorporating capability from SGD.

That's one simple thing that seems likely to generate a sharp left turn [LW · GW].

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2023-03-25T19:34:47.319Z · LW(p) · GW(p)

I wouldn't call that a simple thing---"lifetime learning" is a description of a goal, not an algorithm. How are these agents trained? It's hard to produce sophisticated long-horizon learning mechanisms by gradient descent using existing techniques (because you don't have many lifetimes over which to adjust such mechanisms by gradient descent). So most of the time the mechanism is something built by humans or that transfers from short tasks, and then we need to talk details.

Perhaps more importantly, why does lifetime learning go quickly from "doesn't help meaningfully" to "and now the agents are radically superhuman"? I think at a basic level I don't understand the mechanism of the potential sharp left turn (though I may also not fully understand what that term refers to).

comment by Tapatakt · 2022-08-02T12:55:22.064Z · LW(p) · GW(p)

Russian translation by me

comment by Signer · 2022-06-19T22:01:28.225Z · LW(p) · GW(p)

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.

Wouldn't demonstrating the risk increase motivation for capability gains for everyone else?

comment by Michael Thiessen (michael-thiessen) · 2022-07-14T04:41:34.953Z · LW(p) · GW(p)

"I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."

This seems like a completely absurd claim to me, unless by verification you mean some much weaker claim like that you can show something sometimes works.

Coming from the world of software, generating solutions that seem to work is almost always far easier than any sort of formal verification that they work. I think this will be doubly true in any sort of adversarial situation where any flaws will be actively sought out and exploited. Outside of math domains I find it difficult to come up with examples of where verification is easier than generation, and easy to come up with the opposite.

comment by Kenny · 2022-06-22T05:41:32.472Z · LW(p) · GW(p)

I've long interpreted Eliezer, in terms of your disagreements [2-6], as offering deliberately exaggerated examples.

I do think you might be right about this [from disagreement 2]:

By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D.

I do like your points overall for disagreements [1] and [2].

I feel like there's still something being 'lost in translation'. When I think of the Eliezer-AGI and why it's an existential risk, I think that it would be able to exploit a bunch of 'profitable capability boosting opportunities'. I agree that a roughly minimal 'AGI' probably/might very well not be able to do that. Some possible AGI can tho. But you're also right that there are other risks, possibly also existential, that we should expect to face before Eliezer's specific 'movie plots' would be possible.

But then I also think the specific 'movie plots' are beside the point.

If you're right that some other AI system mega+-kills humans – that then is the "nanotech" to fear. If it's a foregone conclusion, it's not that much better in the cases where it takes the AI, e.g. several years to kill us all, versus 'instantaneously'.

I also have a feeling that:

Some (minimal) 'AGI' is very possible, e.g. in the next five (5) years.
The gap between 'very disruptive' and 'game over' might be very small.

I guess I disagree with your disagreement [7]. I think partly because:

AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas, and proposing modifications to proposals, etc.

might be AI systems that are "catastrophically dangerous" because of the above.

I think maybe one disagreement I have with both you and Eliezer is that I don't think an AI system needs to be 'adversarial' to be catastrophically dangerous. A sufficiently critical feature missing from the training data might be sufficient to generate, e.g. an idea, that can be apparently reasonably verified as aligned and yet lead to catastrophe.

I am very happy that you're asking for more details about a lot of Eliezer's intuitions. That seems likely to be helpful even if they're wrong.

I'm skeptical of your disagreement [19]. Is it in fact the case that we currently have good enough abilities at verifying, e.g. ideas, problems, and proposals? I don't feel like that's the case; definitely not obviously so.

I think I've updated towards your disagreements [18] and [22], especially because we're probably selecting for understandable AIs to at least some degree. It seems like people are already explicitly developing AI systems to generate 'super human' human-like behavior. Some AI systems probably are, and will continue to be, arbitrarily 'alien' tho.

For your disagreement [23], I'd really like to read some specific details about how that could work, AIs reasoning about each other's code.

Overall, I think you made some good/great points and I've updated towards 'hope' a little. My 'take on your take (on Eliezer's takes)' is that I don't know what to think really, but I'm glad that you're both writing these posts!

comment by rafaelCosman · 2022-06-22T01:00:10.141Z · LW(p) · GW(p)

Excellent post, thank you Paul. This is an important message that the community needs to hear right now.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2022-06-22T01:22:08.920Z · LW(p) · GW(p)

What is the message you're referring to? (The above post has 46 numbered items plus further bulleted lists.)

Replies from: Dirichlet-to-Neumann

↑ comment by Dirichlet-to-Neumann · 2022-06-23T08:59:54.534Z · LW(p) · GW(p)

That Yudkowsky can be wrong sometimes and that the situation is not as hopeless as he thinks.

Replies from: rafaelCosman

↑ comment by rafaelCosman · 2022-06-26T22:53:08.294Z · LW(p) · GW(p)

Yes! And additionally, that even conditional on all of Yudkowsky's dire predictions being correct, we can perform significantly better as a community (and civilization) by taking a more positive attitude (as exemplified by Paul's post).

People and entire societies have faced tough situations (including existential ones) many times, and I think there are patterns to what sorts of attitudes maximize the probability of success: facing the brutal facts but remaining hopeful, passionate, and creative.

comment by DaemonicSigil · 2022-06-20T00:15:17.125Z · LW(p) · GW(p)

Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).

So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produces a low one, or vice-versa. In other words, adversarial inputs exist. This is true even when the function is simple enough that the network certainly has more than enough Bayesian evidence to pin down the function. As far as I know, we haven't yet discovered a way to really fix this problem, though there are certainly ways to make the adversarial inputs a little more rare/harder to find.
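One hedged way to see why high-dimensional inputs matter here is a toy linear scorer; this is not a deep network and not anything the comment itself specifies. If every coordinate of an ordinary input is nudged by at most `eps` in the direction that raises the score, the total shift is `eps` times the sum of the absolute weights, which grows with the input dimension. The weights, input, and function name below are all hypothetical.

```python
import random


def adversarial_shift(dim, eps, seed=0):
    """Return (original score, adversarial score) for a toy linear scorer.

    The weights and input are random stand-ins, not a trained model.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(dim)]  # hypothetical learned weights
    x = [rng.gauss(0, 1) for _ in range(dim)]  # an ordinary input
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Nudge each coordinate by at most eps in the score-raising direction.
    x_adv = [xi + eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]
    adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))
    return score, adv_score


for dim in (10, 1000, 100000):
    score, adv_score = adversarial_shift(dim, eps=0.1)
    print(f"dim={dim:6d}  score={score:9.2f}  after nudge={adv_score:9.2f}")
```

The per-coordinate change stays small while the change in output keeps growing with dimension. Real networks are nonlinear, so this only gestures at the mechanism rather than demonstrating the claim about value functions.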

Paul also mentions that high intelligence isn't a separate regime that the AI needs to generalize to, but rather that the AI can be trained continuously as its intelligence increases. I agree with this, but I don't think it constitutes a valid objection, since the regimes that we actually want the AI to generalize between are cooperation and betrayal. Generally these would look pretty different, with betrayal plans involving the AI tiling space with adversarial examples, etc. And we'd generally expect a discontinuous switch to betrayal only when the AI is confident it can win, so there's not really an opportunity to train the AI on betrayal examples beforehand.

comment by Eli Tyre (elityre) · 2023-03-25T17:13:15.628Z · LW(p) · GW(p)

Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.

My understanding of Eliezer's view is that some domains are much harder to do aligned cognition in than others, and alignment is among the hardest.

(I'm not sure I clearly understand why. Maybe because it entails reasoning about humans and reasoning about building super powerful systems, so if your R&D SEAI is even a little bit unaligned, it will have ample leverage for seizing power?)

It's not so much that AIs will be able to do recursive self improvement before they're able to solve alignment. It's that making alignment progress is itself heavily alignment loaded, in a way that "recursively self improve" (without regard for alignment), isn't.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2023-03-25T19:42:01.386Z · LW(p) · GW(p)

I agree that Eliezer holds that view (and also disagree---I think this is the consensus view around LW but haven't seen anything I found persuasive as a defense). I don't think that's his whole view, since he frequently talks about AI doing explosive improvement before other big scientific changes, and generally seems to be living in a world where this is an obvious and unstated assumption behind many of the other things he says.

Replies from: Zack_M_Davis

↑ comment by Zack_M_Davis · 2023-03-26T19:08:37.625Z · LW(p) · GW(p)

I think this is the consensus view around LW [that AI can't help with alignment research] but haven't seen anything I found persuasive as a defense

I thought it was an argument from inaccessible information [LW · GW]: we know how to specify rewards for "Win a Go game", "Predict the next token", or "A human approved this output"; we don't know how to specify rewards for "Actually good alignment research".

I'm imagining that the counterargument might be that earlier weak alignment techniques (and the generation/verification gap) might be enough to bootstrap later, more automated alignment techniques?

Replies from: paulfchristiano, Ninety-Three

↑ comment by paulfchristiano · 2023-03-27T05:47:59.138Z · LW(p) · GW(p)

Yeah, I don't find "we can't verify good alignment research" nearly as persuasive as other people around here:

Verification does seem way easier, even for alignment research. This is probably the most interesting and perplexing disagreement.
Even if verification isn't easier than generation, you can still just do what a human would do faster. That seems like a big deal, and quite a lot of what early AI systems will be doing. Focusing only on generation vs verification seems like it's radically understating the case.
AI systems can also help with verification, e.g. noticing problems in possible ideas, generating experimental setups in which to evaluate ideas, and so forth. These tasks don't seem especially hard to verify either.
You could imagine training an ML system end-to-end on "make the next ML system smarter." But that's not how things appear to be going (and it's just really hard to do with gradient descent). Instead it looks likely that ML systems will be doing things more like "solve subtasks identified by other humans or AIs" (just like most humans who work on these things). In this regime, having a reward function for the end result isn't that important.
To the extent we can't recognize good alignment research when we see it, I think that also makes humans' alignment research less efficient, and so the comparative advantage question is less obvious than the absolute difficulty question.

I think that "we can't verify good alignment research" is probably a smaller consideration than "alignment is more labor intensive while capabilities research is more capital intensive." Neither is decisive, and I expect other factors will mostly dominate (like changes in allocation of labor).

This isn't to say I think it's easy to get AI systems to solve alignment for you, such that it doesn't matter if you work on it in advance. But I'm not yet persuaded at all by "AI systems will be crazy superhuman before they make big contributions in alignment," and don't think that the LW community should particularly expect other folks to be persuaded either.

↑ comment by Ninety-Three · 2023-03-26T21:24:50.893Z · LW(p) · GW(p)

we know how to specify rewards for... "A human approved this output"; we don't know how to specify rewards for "Actually good alignment research".

Can't these be the same thing? If we have humans who can identify actually good alignment research, we can sit them down in the RLHF booth and have the AI try to figure out how to make them happy.

Now obviously a sufficiently clever AI will infer the existence of the RLHF booth and start hacking the human in order to escape its box, which would be bad for alignment research. But it's looking increasingly plausible that e.g. GPT-6 will be smart enough to provide actually good mathematical research without being smart enough to take over the world (that doesn't happen until GPT-8). So why not alignment research?

To break the comparison I think you need to posit either that alignment research is way harder than math research (as Eli understands Eliezer does) such that anything smart enough to do it is also smart enough to hack a human, or I suppose it could be the case that we don't have humans who can identify actually good alignment research.

comment by Eli Tyre (elityre) · 2023-03-25T17:07:10.590Z · LW(p) · GW(p)

One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.

This objection only holds if you imagine that AI systems are acquiring knowledge about the world / capability only through gradient descent, as opposed to training an agent that learns, and becomes more capable thereby, over runtime.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2023-03-25T19:51:23.790Z · LW(p) · GW(p)

I would say: this objection holds for all AI designs that are being seriously considered to date. I agree it doesn't apply in full generality to "not-modern-ML." That said, you can use gradient descent to build agents that accumulate and manipulate knowledge (e.g. by reading and writing to databases, either in natural language or in opaque neuralese) and my argument applies just as well to those systems. I think you are imagining something more precise.

I do agree that once you say "the techniques will just be totally different from ML" then I get more into "all bets are off," and maybe then I could end up with a 50-50 chance of AI systems concealing their capabilities (though that still seems high). That said, I think you shouldn't be confident of "totally different from ML" at this point:

I think you have at least a reasonable probability on "modern ML leads to transformative effects and changes the game," especially if transformation happens soon. If this paradigm is supposed to top out, I would like more precision about where and why.
Our reasoning about alternatives to ML seems really weak and uninformative, and it's plausible that reasoning about ML is a better guide to what happens in the future even if there is a big transformation.
At this point it's been more than 10 years with essentially no changes to the basic paradigm that would be relevant to alignment. Surely that's enough to get to a reasonable probability of 10 more years?
This outcome already looked quite plausible to me in 2016, such that it was already worth focusing on, and it seems like evidence from the last 7 years makes it look much more likely.

comment by Eli Tyre (elityre) · 2022-08-10T19:12:40.638Z · LW(p) · GW(p)

Number 22:

Eliezer seems to be relatively confident that AI systems will be very alien and will understand many things about the world that humans don’t, rather than understanding a similar profile of things (but slightly better), or having weaker understanding but enjoying other advantages like much higher serial speed. I think this is very unclear and Eliezer is wildly overconfident. It seems plausible that AI systems will learn much of how to think by predicting humans even if human language is a uselessly shallow shadow of human thought, because of the extremely short feedback loops. It also seems quite possible that most of their knowledge about science will be built by an explicit process of scientific reasoning and inquiry that will proceed in a recognizable way to human science even if their minds are quite different. Most importantly, it seems like AI systems have huge structural advantages (like their high speed and low cost) that suggest they will have a transformative impact on the world (~~and obsolete human contributions to alignment~~ retracted [LW(p) · GW(p)]) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.

It seems to me that this claim is approximately equivalent to "takeoff will be soft, not hard". In a hard takeoff world, it seems straightforward that AI systems will understand huge, important parts/dynamics of the world in ways that humans don't, even a little?

comment by Eli Tyre (elityre) · 2022-08-10T19:03:00.577Z · LW(p) · GW(p)

Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated). When Eliezer dismisses the possibility of AI systems performing safer tasks millions of times in training and then safely transferring to “build nanotechnology” (point 11 of list of lethalities [LW · GW]) he is not engaging with the kind of system that is likely to be built or the kind of hope people have in mind.

It seems like Paul is imagining something CAIS-like, where you compose a bunch of AI abilities that are fairly robust in their behavior, and then conglomerate them into large projects that do big things, much like human organizations.

(Unless I'm misunderstanding, in which case the rest of this comment is obviated.)

It seems like this working depends on two factors:

First of all, it needs to be the case that conglomerations like this are competitive with giant models that are a single unified brain.
1. On first pass, this assumption seems pretty untrue? The communication bandwidth, and ability to operate as a unit, of people in an organization, is much much lower than that of the sub-modules of a person's brain.
Second, it supposes that when you compose a bunch of AI systems to do something big and novel like design APM systems, each individual component will still be operating within its training distribution, as opposed to this requiring some AIs in the engineering project being fed inputs that are really weird, and might produce unanticipated behavior.
1. This seems like a much weaker concern though. For one thing, it seems like you ought to be able to put checks on whether a given AI component is being fed out-of-distribution inputs, and raise a flag for oversight whenever that happens.
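
The "check for out-of-distribution inputs and raise a flag" idea can be illustrated with a very rough sketch. Everything here is an assumption made for illustration: inputs are reduced to a single float, "in distribution" means "within a few standard deviations of the training mean", and the names `fit_reference`, `needs_oversight`, and `run_component` are hypothetical. Real components would need a far richer notion of distribution shift than this.

```python
from statistics import mean, stdev


def fit_reference(train_inputs):
    """Summarize training inputs (plain floats here) by mean and standard deviation."""
    return mean(train_inputs), stdev(train_inputs)


def needs_oversight(x, reference, z_threshold=3.0):
    """Return True when x lies more than z_threshold standard deviations from the training mean."""
    mu, sigma = reference
    if sigma == 0:
        # Degenerate training set: anything other than the single seen value is unusual.
        return x != mu
    return abs(x - mu) / sigma > z_threshold


def run_component(component, x, reference):
    """Run the component only on inputs that look familiar; otherwise escalate to a human."""
    if needs_oversight(x, reference):
        return ("flagged_for_review", x)
    return ("ok", component(x))


# Hypothetical usage: a component that doubles its input, trained on values 1..5.
reference = fit_reference([1.0, 2.0, 3.0, 4.0, 5.0])
familiar = run_component(lambda v: v * 2, 3.5, reference)
weird = run_component(lambda v: v * 2, 100.0, reference)
```

In this toy setup the familiar input is processed and the far-off input is handed to oversight instead of being run. The hard part in practice is presumably choosing a check that actually tracks "weird enough to produce unanticipated behavior", which this sketch does not attempt.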

comment by Pattern · 2022-07-09T18:32:14.703Z · LW(p) · GW(p)

This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it,

Magically given a very powerful, unaligned, AI. (This 'the utility function is in code, in one place, and can be changed' assumption needs re-examination. Even if we assert it exists in there*, it might be hard to change in, say, a NN.)

* Maybe this is overgeneralizing from people, but what reason do we have to think an 'AI' will be really good at figuring out its utility function (so it can make changes without changing it, if it so desires). The postulate 'it will be able to improve itself, so eventually it'll be able to figure everything out (including how to do that)', seems to ignore things like 'improvements might make it more complex and harder to do that while improving.' Where and how do you distinguish between 'this is my utility function' and 'this is a bias I have'? (How have you improved this, and your introspecting abilities? How would a NN do either of those?)

One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.

Now there's an idea: due to competition, AIs do impressive things (which aren't necessarily safe). An AI creates the last advance that when implemented causes a FOOM + bad stuff.

Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research),

This doesn't necessarily require the above to be right or wrong - human level contributions (which aren't safe) could, worst case scenario...etc.

[6.] Many of the “pivotal acts”

(Added the 6 back in when it disappeared while copying and pasting it here.)

There's a joke about a philosopher king somewhere in there. (Ah, if only we had, an AI powerful enough to save us from AI, but still controlled by...)

I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous.

I think others (or maybe the OP [LW · GW] previously?) have pointed out that AI can affect the world in big ways way before 'taking it over'. Domain limited, or 'sub-/on par with/super-' 'human performance', doesn't necessarily matter which of those it is (though more power -> more effect is the expectation). Some domains are big.

comment by Nicholas / Heather Kross (NicholasKross) · 2022-06-20T18:45:40.428Z · LW(p) · GW(p)

Solving a scientific problem without being able to learn from experiments and failures is incredibly hard.

I wonder what, if any, scientific/theoretical problems have been solved right "on the first try" in human history. I know MIRI and others have done studies of history to find examples of e.g. technological discontinuities. Perhaps a study could be made of this?

An example [LW · GW] Yudkowsky brings up in the Sequences often, is Einstein's discovery of General Relativity. I think this is informative and helpful for alignment. Einstein did lots of thought experiments, and careful reasoning, to the point where his theory basically "came out" right, in time for experiments to prove it so.

More generally, I think Yudkowsky analogizes AI safety to physics, and it seems similar: combination of careful theory and expensive/dangerous experiments, the high intellectual barriers to entry, the need for hardcore conceptual and mathematical "engineering" to even think about the relevant things, the counterintuitive properties, etc.

TLDR write, self-critique, and formalize more thought experiments. This could help a lot with getting alignment theoretically right sooner (which helps regardless of how critical the "first" experimental attempt turns out to be).

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-10-03T17:49:09.715Z · LW(p) · GW(p)

My general answer is a lot in pure mathematics, but mathematics in full generality is incredibly hard, so we haven't even mapped the mathematical territory very well yet.

In scenarios where you actually have to interface with our specific physical/mathematical universe, I believe the answer is 0 times, or close to it. Einstein cut the search space from thousands or millions of initially valid theories down to 4 theories, but he didn't get to the correct theory 0-shot, and he did have to rely on experiment a little bit:

https://www.lesswrong.com/posts/GSBCw94DsxLgDat6r/interpreting-yudkowsky-on-deep-vs-shallow-knowledge#6HPjxMvTnP9JeibXZ [LW(p) · GW(p)]

comment by jfw01 · 2025-01-09T21:39:15.474Z · LW(p) · GW(p)

a particular technique doesn’t immediately solve a problem

I remember a story that got coverage on the state radio in New Zealand years ago. It said that multiple people have parts of the solution to some problem, and there is progress when there is an accident that introduces them to each other. There was a book about it, but I'm failing to find the details.

comment by jfw01 · 2025-01-09T21:15:48.548Z · LW(p) · GW(p)

implement a relatively limited policy

I read this as Libertarian; the hope that there could be a very stiff, strong government that was also small, and did only a subset of the things in the short-term interest of its supporters.

comment by jfw01 · 2025-01-09T20:59:55.543Z · LW(p) · GW(p)

Alignment isn’t like that; it was chosen to be an important problem

Like medicine.

This was specifically commented on in a book whose preface I read as a child. It was called something like "Medicine: from science to magic", and I have not found a clear link back to it.

comment by Marcelo Higasi (marcelo-higasi) · 2023-03-31T22:12:03.864Z · LW(p) · GW(p)

Good, thoughtful essay. I think, however, that an intelligent enough AI would realise that it needs humans (at least in the medium term, decades?) to keep the infrastructure required to keep it running, at least until general-purpose robots come along: robots that can take care of all the infrastructure tasks currently run by humans, and that can self-build and self-replicate. Such an AI could help accelerate the development of these robots, but that is not around the corner. Any thoughts on this?

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2023-03-31T22:19:14.136Z · LW(p) · GW(p)

General purpose robotics are basically ready to go as soon as an AI is ready to pilot them. They don't currently exist in great number, but it wouldn't take that many to assemble a factory made out of purpose-specific robots. The bigger problem is that an AI that thought it could do this would probably be mistaken at first attempt.

comment by Ashley Giles (ashley-giles-1) · 2023-03-31T11:05:35.835Z · LW(p) · GW(p)

I like this more than Eliezer's mainly because it's written in a way I can actually understand - even if it is a ramble.

comment by Multicore (KaynanK) · 2022-06-26T18:38:52.248Z · LW(p) · GW(p)

This has overtaken the post it's responding to as the top-karma post of all time.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2022-06-26T18:44:53.524Z · LW(p) · GW(p)

Yes, it's never an equilibrium state for Eliezer communicating key points about AI to be the highest karma post on LessWrong. There's too much free energy to be eaten by a thoughtful critique of his position. On LW 1.0 it was Holden's Thoughts on the Singularity Institute [LW · GW], and now on LW 2.0 it's Paul's list of agreements and disagreements with Eliezer.

Finally, nature is healing.

Replies from: Heighn

↑ comment by Heighn · 2022-06-26T18:59:13.181Z · LW(p) · GW(p)

What do you mean by "Finally, nature is healing"?

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2022-06-26T19:03:44.931Z · LW(p) · GW(p)

google images “nature is healing meme”

Replies from: Heighn

↑ comment by Heighn · 2022-06-26T19:27:45.596Z · LW(p) · GW(p)

Thanks! I should have been more clear; "Nature is healing" has some "EY was wrong in his post" energy I was wondering about.

comment by Prometheus · 2022-06-22T20:51:09.611Z · LW(p) · GW(p)

"By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research."

I think this assumption is unlikely. From what we know of human-led research, accelerating AI capabilities is much easier than accelerating progress in alignment. I don't see why it would be different for an AI.

Replies from: eugene-d

↑ comment by Eugene D (eugene-d) · 2022-06-23T00:26:05.996Z · LW(p) · GW(p)

I wonder when Alignment and Capability will finally be considered synonymous, so that the efforts merge into one -- bc that's where any potential AI-safety lives, I would surmise.

comment by David Johnston (david-johnston) · 2022-06-22T02:43:37.850Z · LW(p) · GW(p)

10. AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.

I think there will be substantial technical hurdles along the lines of getting in-principle highly capable AI systems to reliably do what we want them to, these will probably be "commercially relevant" (i.e. not just a concern for X-risk researchers), and it's plausible (though far from certain) that hurdles of this type will slow the rate of AI progress.

One reason I think this might slow the rate of progress is that important parts of this can't be delegated to highly in-principle capable systems. One reason I think it might not slow progress much is that I don't currently see a good reason why most of this work couldn't be delegated to highly actually-capable systems.

comment by Nicholas / Heather Kross (NicholasKross) · 2022-06-20T20:31:26.311Z · LW(p) · GW(p)

(26) I think by "a plan", Yudkowsky partially means "a default paradigm and relevant concrete problems". There's no consensus on the first one, and Yudkowsky would disagree on the second one (since he thinks most current concrete problems are irrelevant to the core/eventual problem).

comment by Nicholas / Heather Kross (NicholasKross) · 2022-06-20T18:59:31.893Z · LW(p) · GW(p)

Disagreement (4): I think Yudkowsky maybe expects AGI to recursively self-improve on the way to becoming human-level.

comment by Logan Zoellner (logan-zoellner) · 2022-06-20T18:43:51.253Z · LW(p) · GW(p)

Mostly just here to say "I agree", especially regarding

Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.

and

I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience.

A lot of EY's points follow naturally if you think that the first AGI will be a recursively self improving maximally Bayesian reinforcement learner that fooms into existence as soon as someone invents the right metaheuristic. In this world we should be really worried about whether e.g. corrigibility is natural in some platonic sense or there is a small core to human alignment.

In Paul's world, AGI is the result of normal engineering, just at a scale 1000-1000x what OpenAI and DeepMind are doing now. In this world, it makes sense to talk about building large coalitions and really understanding what's going on in the guts of existing Deep Learning algorithms.

I think Paul's timelines (~15% on singularity by 2030 and ~40% on singularity by 2040) are a little conservative. Personally I estimate >50% by 2030, but Paul's story of how AGI gets built makes a lot more sense than EY's. And this goes a long way to explaining why I think the world is less doomed and we should focus less on a small team of people performing a Pivotal Act and more on than EY does.

Replies from: eugene-d

↑ comment by Eugene D (eugene-d) · 2022-06-23T14:28:24.659Z · LW(p) · GW(p)

I've heard a few times that AI experts both 1) admit we don't know much about what goes on inside, even as it stands today, and 2) we expect to extend more trust to the AI even as capabilities increase (most recently Ben Goertzel).

I'm curious to know if you expect explainability to increase in correlation with capability? i.e. or can we use Ben's analogy that 'I expect my dog to trust me, both bc I'm that much smarter, and I have a track-record of providing food/water for him' ?

thanks!

Eugene

Replies from: logan-zoellner

↑ comment by Logan Zoellner (logan-zoellner) · 2022-06-23T18:26:44.149Z · LW(p) · GW(p)

I'm not personally on board with

2) we expect to extend more trust to the AI even as capabilities increase

The more capable an AI is, the more paranoid we should be about it. GPT-2 was bad enough that you could basically give it to anyone who wanted it. GPT-3 isn't "dangerous" but you should at least be making sure it isn't being used for mass misinformation campaigns or something like that. Assuming GPT-4 is human-level, it should be boxed/airgapped and only used by professionals with a clear plan [LW · GW] to make sure it doesn't produce dangerous outputs. And if GPT-5 is super-intelligent (> all humans combined), even a text-terminal is probably too dangerous until we've solved the alignment problem. The only use case where I would even consider using an unaligned GPT-5 is one where you could produce a formal proof that its outputs were what you wanted.

I'm curious to know if you expect explainability to increase in correlation with capability? i.e. or can we use Ben's analogy that 'I expect my dog to trust me, both bc I'm that much smarter, and I have a track-record of providing food/water for him' ?

Don't agree with this at all. Explainability/alignment/trustworthiness are all pretty much orthogonal [LW · GW] to intelligence.

Replies from: eugene-d

↑ comment by Eugene D (eugene-d) · 2022-06-23T22:17:40.183Z · LW(p) · GW(p)

Thank you--btw before I try responding to other points, here's the Ben G vid to which I'm referring. Starting around 52m, for a few minutes, for that particular part anyway:

Replies from: logan-zoellner

↑ comment by Logan Zoellner (logan-zoellner) · 2022-06-25T17:28:07.980Z · LW(p) · GW(p)

Listening to the context there, it sounds like what Ben is saying is once we've solved the alignment problem eventually we will trust the aligned AI to make decisions we don't understand. Which is a very different claim from saying that merely because the AI is intelligent and hasn't done anything harmful so far it is trustworthy.

I also don't fully understand why he thinks it will be possible to use formal-proof to align human-level AI, but not superhuman AI. He suggests there is a counting argument, but it seems if I could write a formal proof for "won't murder all humans" that works on a human-level AGI, that proof would be equally valid for superhuman AGI. The difficulty is that formal mathematical proof doesn't really work for fuzzy-defined words like "human" and "murder", not that super-intelligence would transform those (assuming they did have a clean mathematical representation). This is why I'm pessimistic about formal proof [LW · GW] as an alignment strategy generally.

In fact, if it turned out that human value had a simple-to-define core, then the Alignment problem would be much easier than most experts expect.

Replies from: eugene-d

↑ comment by Eugene D (eugene-d) · 2022-06-29T12:23:58.129Z · LW(p) · GW(p)

OK thanks, I guess I missed him differentiating between 'solve alignment first, then trust', versus 'trusting first, given enough intelligence'. Although I think one issue w/having a proof is that we (or a million monkeys, to paraphrase him) still won't understand the decisions of the AGI...? ie we'll be asked to trust the prior proof instead of understanding the logic behind each future decision/step which the AGI takes? That also bothers me, because, what are the tokens which comprise a "step"? Does it stop 1,000 times to check with us that we're comfortable with, or understand, its next move?

However, since it seems we can't explain many of the decisions of our current ANI, how do we expect to understand future ones? He mentions that we may be able to, but only by becoming trans-human.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-29T16:29:54.714Z · LW(p) · GW(p)

Exactly what I'm thinking too.

comment by lc · 2022-06-19T20:17:12.621Z · LW(p) · GW(p)

I hope you're right.

comment by [deleted] · 2022-06-20T16:38:00.461Z · LW(p) · GW(p)

Replies from: paulfchristiano, None, None

↑ comment by paulfchristiano · 2022-06-20T16:58:12.971Z · LW(p) · GW(p)

I'd say so, though I may not engage a lot of the time.

↑ comment by [deleted] · 2022-06-20T16:55:59.783Z · LW(p) · GW(p)

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-20T17:15:24.483Z · LW(p) · GW(p)

Consuming free energy means things like: taking the jobs that unaligned AI systems could have done, making it really hard to hack into computers (either by improving defenses or, worst case, just having an ecosystem where any vulnerable machine is going to be compromised quickly by an aligned AI), improving the physical technology of militaries or law enforcement so that a misaligned AI does not have a significant advantage.

I also imagine AI systems doing things like helping negotiate and enforce agreements to reduce access to destructive technologies or manage the consequences (including in particular powerful AI systems themselves). And of course I imagine AI systems doing alignment research, generating new technological solutions, a clearer understanding of how to deploy AI systems, improving implementation quality at relevant labs, helping identify key risks and improve people's thinking about those risks, etc.

(I don't think that an AI developer is likely to be in a position to achieve a decisive strategic advantage, but I'd stand by this point regardless and think it's still reflecting an important disagreement about what the situation is likely to look like.)

Replies from: Vaniver, None

↑ comment by Vaniver · 2022-06-20T18:09:08.510Z · LW(p) · GW(p)

I'll note that I'm pretty enthusiastic about attempts to increase the security / sophistication of our civilization, for basically these reasons (the more efficient the stock market, the less money an unaligned AGI can make; the better computer security is, the fewer computers an unaligned AGI can steal, and so on). I'm nevertheless pretty worried about:

the 'intelligent adversary' part (where the chain's weakest link is the one that gets attacked, rather than a random link, meaning you need to do a ton of 'increasing sophistication' work for each unit of additional defense you get, given the number of attack surfaces)
the 'different payoff profile' part (great powers might be very interested in screwing with each other, and a world with great power spy conflict probably has much better security setups than one without, but none of them are interested in releasing a superplague that kills all humans, and so won't necessarily have better biodefense, i.e. AI may reveal lots of novel attack surfaces)
the 'fragile centralization / supply chain' part (a more sophisticated economy is probably less hardened against disruption than a less sophisticated economy, because the sophistication was in large part about how to get 'better returns in peacetime' than optimizing for survival / thriving broadly speaking / following traditions that had been optimized for that)

↑ comment by [deleted] · 2022-06-20T17:56:17.303Z · LW(p) · GW(p)

↑ comment by [deleted] · 2022-06-20T16:57:35.936Z · LW(p) · GW(p)

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2022-06-20T17:11:50.282Z · LW(p) · GW(p)

I think "just enough to impress the programmer" doesn't work---if you are doing gradient descent on impressiveness, then some other model will do even more and so be preferred.

In order for this to be robust, I think you need either gradient hacking to be underway, or to have a very strong sandbagging coalition such that SGD naturally can't find any direction to push towards less sandbagging. That feels really unlikely to me, at least much harder than anything Eliezer normally argues for about doom by default.
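
A toy sketch of the selection argument above, under strong simplifying assumptions that are not part of the original comment: a model is reduced to one hypothetical `effort` parameter in [0, 1], "impressiveness" is `capability * effort`, and training is plain gradient ascent on that quantity. The function name `train_effort` is made up for illustration.

```python
def train_effort(capability, effort=0.2, lr=0.1, steps=200):
    """Gradient ascent on impressiveness = capability * effort, with effort clipped to [0, 1]."""
    for _ in range(steps):
        grad = capability  # d(capability * effort) / d(effort)
        effort = min(1.0, max(0.0, effort + lr * grad))
    return effort


# A model that starts out showing 20% of what it can do is pushed toward full effort.
sandbagger_after_training = train_effort(capability=1.0)

# If the gradient with respect to effort is zero, nothing moves. Loosely, this is the
# situation that gradient hacking or a stable sandbagging coalition would have to create.
no_gradient_case = train_effort(capability=0.0)
```

In this toy, "just enough to impress" is not a stable point: as long as extra effort raises the score, the update keeps pushing effort up. Sandbagging only survives when the gradient toward less sandbagging vanishes, which is the condition the comment describes as unlikely.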

Replies from: None

↑ comment by [deleted] · 2022-06-20T18:08:28.968Z · LW(p) · GW(p)

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-21T01:13:08.840Z · LW(p) · GW(p)

Where can I read more about "sandbagging"?

I think "sandbagging" was just another term Paul was using for what you described as the AIs "underplaying their capabilities".

Replies from: None

↑ comment by [deleted] · 2022-06-21T06:01:41.860Z · LW(p) · GW(p)
