Posts
Comments
What do you mean? I don't get what you are saying is convincing.
I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think 1 thing blunts the blow a lot here:
- As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
- However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn't satisfy the user's request.
- More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue of OpenAI for me, in that they aren't trying to steer too hard against deception.
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.
To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Not always, but I'd say often.
I'd also say that at least some of the justification for changing values in philosophers/humans is because they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following ala @Seth Herd is used instead, such that the pointer is to the human giving the instructions, rather than having values instead), but this is at least reasonably plausible IMO.
Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.
In this case, it would mean the convergence to preserve your current values.
The answer to this is that we'd rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.
Oh, I was responding to something different, my apologies.
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
The evolution analogy of how evolution failed to birth intelligent minds that valued what evolution valued is an intuition pump that does get used in explaining outer/inner alignment failures, and is part of why in some corners there's a general backdrop of outer/ inner alignment being so hard.
It's also used in the sharp left turn, where the capabilities of an optimization process like humans outstripped their alignment to evolutionary objectives, and the worry is that an AI could do the same to us, and evolutionary analogies do get used here.
Both Eliezer Yudkowsky and Nate Soares use arguments that rely on evolution failing to get a selection target inside us, thus misaligning us to evolution:
This might actually be a useful idea, thanks for your idea.
I definitely agree that conditioning on AI catastrophe, I think the 4 step chaotic catastrophe is the most likely way an AI catastrophe leads to us being extinct or at least in a very bad position.
I admit the big difference is that I do think that 2 is probably incorrect, as we have some useful knowledge of how models form goals, and I expect this to continue.
That's actually a pretty good argument, and I actually basically agree that hiding CoT from the users is a bad choice from an alignment perspective now.
The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can't have a continuously rewarding path to misalignment.
The second issue is less critical, assuming that AGI #21 hasn't itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.
If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn't fatal, so we can work around it.
The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that's when we can say the end goal has been achieved.
Some thoughts on this post:
- Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
I agree in not thinking hiding the CoT is good for alignment.
- They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between it's capability to retrieve links and it's ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)
See this post for why even extreme evidence may not get them to undeploy:
At this point, the system becomes quite difficult to "deeply" correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:
https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM
I definitely agree with this claim in general:
So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
(Edited out the paragraph that users need to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors our leading future AI models and then we heavily restrict only the monitors?
My proposal is to restrain the AI monitor's domain only.
I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don't need, and maybe don't want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.
Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of the jailbreak successes rely on the fact that it's been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that's just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
To answer this specific question
Do you mean we're waiting tiil 2026/27 for results of thee next scaleup? If this round (GPT5, Claude 4, Gemini 2.0) show diminishing returns, wouldn't we expect that the next will too?
Yes, assuming Claude 4/Gemini 2.0/GPT-5 don't release or are disappointing in 2025-2026, this is definitely evidence that things are slowing down.
It doesn't conclusively disprove it, but it does make progress shakier.
Agree with the rest of the comment.
I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.
I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero sum/competitive receive less weight than the alignment attempts.
I'd say the biggest area of how I'm skeptical so far is that I think there's a real difference between the useful idea that power is useful for the science loop and the idea that the AI will seize power by any means necessary to advance it's goals.
I think instrumental convergence will look more like local power-seeking that is more related to the task at hand, and not to serve some of it's other goals, primarily due to denser feedback constraining the solution space and instrumental convergence more than humans.
That said, this is a very good post, and I'm certainly happier that this much more rigorous post was written than a lot of other takes on scheming.
I'd argue one of the issues with a lot of early social media moderation policies was treating ironic beliefs that were usually banned as not ban-worthy, because as it turned out, ironic belief in some extremism turned out to either have been fake, or turned into the real versions over time.
To be clear, I don't yet believe that the rumors are true, or that if they are, that they matter.
We will have to wait until 2026-2027 to get real evidence on large training run progress.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd's comment below for a perspective:
I think this is in fact the crux, in that I don't think they can do this in the general case, no matter how much compute is used, and even in the more specific cases, I still expect it to be extremely hard verging on impossible to actually get the distribution, primarily because you get equal evidence for almost every value, for the same reasons as why getting more compute is an instrumental convergent goal, so you cannot infer the values of basically anyone solely on the fact that you live in a simulation.
In the general case, the distribution/probability isn't even well defined at all.
The boring answer to Solomonoff's malignness is that the simulation hypothesis is true, but we can infer nothing about our universe through it, since the simulation hypothesis predicts everything, and thus is too general a theory.
The affect is attenutated greatly provided that we assume the ability to arbitrarily copy the Solomonoff inductor/Halting Oracle, as then we can drive the complexity of picking out the universe arbitrarily close to picking out the specific user in the universe, and in the limit of infinite Solomonoff induction uses, are exactly equal:
IMO, the psychological unity of humankind thesis is a case of typical minding/overgeneralizing, combined with overestimating the role of genetics/algorithms and underestimating the role of data in what makes us human.
I basically agree with the game-theoretic perspective, combined with another perspective which suggests that as long as humans are relevant in the economy, you kind of have to help those humans if you want to profit, and merely an AI that automates a lot of work could disrupt it very heavily if a CEO could have perfectly loyal AI workers that never demanded anything in the broader economy.
The best argument here probably comes from Paul Christiano, but to summarize the argument, it's because even in a situation where we messed up pretty badly in aligning the AI, so long as the failure mode isn't deceptive alignment but instead misgeneralization of human preferences/non-deceptive alignment failures, it's pretty likely that there will be at least some human-regarding preferences, and that means the AI will do some acts of niceness if it is cheap to them, and preserving humans is very cheap for superintelligent AI.
More answers can be found here:
This is also my interpretation of the rumors, assuming they are true, which I don't put much probability on.
I'd say the main things that made my own p(Doom) went down this year are the following:
-
I've come to believe that data was both a major factor in capabilities and alignment, and I also believe that careful interventions on that data could be really helpful for alignment.
-
I've come to think that instrumental convergence is closer to a scalar quantity than a boolean, and while I don't think 0 instrumental convergence is incentivized for capabilities and domain reasons, I do think that restraining instrumental convergence/putting useful constraints on instrumental convergence like world models is helpful for capabilities to the extent that I think that power-seeking will likely be a lot more local than what humans do.
-
I've overall shifted towards a worldview where the common thought experiment of the second-species argument, where humans have killed over 90%+ of chimpanzees and gorillas due to them running away with intelligence and being misaligned neglects very crucial differences between the human and the AI case that makes my p(Doom) lower.
(Maybe another way to say it is I think the outcome of humans just completely running roughshod on every other species due to instrumental convergence is not the median outcome of AI development, but a deep outiler that is very uninformative to how AI outcomes will look like.)
- I've come to believe that human values, or at least the generator of values, are actually simpler than a lot of people think, and that a lot of the complexity that appears to be there is because we generally don't like admitting that very simple rules can generate very complex outcomes.
While I'm not a believer in the scaling has died meme yet, I'm glad you do have a plan for what happens if AI scaling does stop.
I'm less confident in this position since I put on a disagree emoji, but my reason is because it's much easier to control an AIs data sources for training than it is for humans, which means it's quite easy in theory (but might be difficult in practice, which worries me) to censor just enough data such that the model doesn't even think that it's likely in a simulation that doesn't add up to normality.
I agree chess is an extreme example, such that I think that more realistic versions would probably develop instrumental convergence at least in a local sense.
(We already have o1 at least capable of a little instrumental convergence.)
My main substantive claim is that constraining instrumental goals such that the AI doesn't try to take power via long-term methods is very useful for capabilities, and more generally instrumental convergence is an area where there is a positive manifold for both capabilities and alignment, where alignment methods increase capabilities and vice versa.
Maybe there's a case there, but I'd doubt it get past a jury, let alone result in any guilty verdicts.
Oh, now I understand.
And AIs have already been superhuman at chess for very long, yet that domain gives very little incentive for very strong instrumental convergence.
I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.
Notably, no law I know of allows you to take legal action on a hunch that they might destroy the world based on your probability of them destroying the world being high without them doing any harmful actions (and no, building AI doesn't count here.)
To answer the question:
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won't work well enough to be relevant?
Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.
Other AIs relying on much denser feedback will already rule the world before that happens.
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
Alright, I'll give 2 lessons that I do think generalize to superintelligence:
-
The data is a large factor in both it's capabilities and alignment, and alignment strategies should not ignore the data sources when trying to make predictions or trying to intervene on the AI for alignment purposes.
-
Instrumental convergence in a weak sense will likely exist, because having some ability to get more resources are useful for a lot of goals, but the extremely unconstrained versions of instrumental convergence often assumed where an AI will grab so much power as to effectively control humanity is unlikely to exist, given the constraints and feedback given to the AI.
For 1, the basic answer for why is because a lot of AI success in fields like Go and language modeling etc was jumpstarted by good data.
More importantly, I remember this post, and while I think it overstates things in stating that an LLM is just the dataset (it probably isn't now with o1), it does matter that LLMs are influenced by their data sources.
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
For 2, the basic reason for this is that the strongest capabilities we have seen that come out of RL either require immense amounts of data on pretty narrow tasks, or non-instrumental world models.
This is because constraints prevent you from having to deal with the problem where you produce completely useless RL artifacts, and evolution got around this constraint by accepting far longer timelines and far more computation in FLOPs than the world economy can tolerate.
Why do you think that LLMs will hit a wall in the future?
I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
I think the crux is I think that the important parts of of LLMs re safety isn't their safety properties specifically, but rather the evidence they give to what alignment-relevant properties future AIs have (and note that I'm also using evidence from non-LLM sources like MCTS algorithm that was used for AlphaGo), and I also don't believe interpretability is why LLMs are mostly safe at all, but rather I think they're safe due to a combo of incapacity, not having extreme instrumental convergence, and the ability to steer them with data.\
Language is a simple example, but one that is generalizable pretty far.
It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime
Note that the primary points would apply to basically a whole lot of AI designs like MCTS for AlphaGo or a lot of other future architecture designs which don't imitate humans, barring ones which prevent you from steering it at all with data, or have very sparse feedback, which translates into weakly constraining instrumental convergence.
but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.
I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
Also, AIs will still have instrumental convergence, it's just that their goals will be more local and more focused around the training task, so unless the training task rewards global power-seeking significantly, you won't get it.
The one thing I'll say on the election is that a lot of people are using Kamala Harris's loss to put in their own reasons for why Kamala Harris lost that are essentially ideological propaganda.
Basically only the story that she was doomed from the start because of global backlash against incumbents for inflation matches the evidence best, and a lot of other theories are very much there for ideological purposes.
On the question of how much evidence the following scenarios are against the AI scaling thesis (which I roughly take to mean that more FLOPs and compute/data reliably makes AI better for economically important relevant jobs), I'd say that scenarios 4-6 falsify the hypothesis, while 3 is the strongest evidence against the hypothesis, followed by 2 and 1.
4 would make me more willing to buy algorithmic progress as important, 5 would make me more bearish on algorithmic progress, and 6 would make me have way longer timelines than I have now, unless governments fund a massive AI effort.
Start with an analogy to physics. There’s a Stephen Hawking quote I like:
> “Even if there is only one possible unified theory, it is just a set of rules and equations. What is it that breathes fire into the equations and makes a universe for them to describe? The usual approach of science of constructing a mathematical model cannot answer the questions of why there should be a universe for the model to describe. Why does the universe go to all the bother of existing?”
I could be wrong, but Hawking’s question seems to be pointing at a real mystery. But as Hawking says, there seems to be no possible observation or scientific experiment that would shed light on that mystery. Whatever the true laws of physics are in our universe, every possible experiment would just confirm, yup, those are the true laws of physics. It wouldn’t help us figure out what if anything “breathes fire” into those laws. What would progress on the “breathes fire” question even look like?? (See Tegmark’s Mathematical Universe book for the only serious attempt I know of, which I still find unsatisfying. He basically says that all possible laws of the universe have fire breathed into them. But even if that’s true, I still want to ask … why?)
By analogy, I’m tempted to say that an illusionist account can explain every possible experiment about consciousness, including our belief that consciousness exists at all, and all its properties, and all the philosophy books on it, and so on … but yet I’m tempted to still say that there’s some “breathes fire” / “why is there something rather than nothing” type question left unanswered by the illusionist account. This unanswered question should not be called “the hard problem”, but rather “the impossible problem”, in the sense that, just like Hawking’s question above, there seems to be no possible scientific measurement or introspective experiment and that could shed light on it—all possible such data, including the very fact that I’m writing this paragraph, are already screened off by the illusionist framework.
Well, hmm, maybe that’s stupid. I dunno.
My provisional answer is "An infinity of FLOPs/compute backs up the equations to make sure it works."
I think the crux of it is here:
I've always struggled to make sense of the idea of brain uploading because it seems to rely on some sort of dualism. As a materialist, it seems obvious to me that a brain is a brain, a program that replicates the brain's output is a program (and will perform its task more or less well but probably not perfectly), and the two are not the same.
I think that basically everything in the universe can be considered a program/computation, but I also think the notion of a program/computation is quite trivial.
More substantively, I think it might be possible to replicate at least some parts of the physical world with future computers that have what is called physical universality, where they can manipulate the physical world essentially arbitrarily.
So I don't view brains and computer programs as being of 2 different types, but rather as the same type as a program/computation.
See below for some intuition as to why.
http://www.amirrorclear.net/academic/ideas/simulation/index.html
On this:
When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.
I think the problem identified here is in large part a demand problem, in that lots of AI people only wanted AI capabilities, and didn't care for AI interpretability at all, so once the scaling happened, a lot of the focus went purely to AI scaling.
(Which is an interesting example of Goodhart's law in action, perhaps.)
See here:
https://www.lesswrong.com/posts/gXinMpNJcXXgSTEpn/ai-craftsmanship#Qm8Kg7PjZoPTyxrr6
IMO this is pretty obviously wrong. There are some kinds of problem solving that scales poorly with population, just as there are some computations that scale poorly with parallelisation.
E.g. project euler problems.
I definitely agree that there exist such problems where the scaling with population is pretty bad, but I'll give 2 responses here:
- The differences between a human level AI and an actual human are the ability to coordinate and share ontologies better between millions of instances, so the common problems that arise when trying to factorize out problems are greatly reduced.
- I think that while there are serial bottlenecks to lots of problem solving in the real world such that it prevents hyperfast outcomes, I don't think that serial bottlenecks are the dominating factor, because the stuff that is parallelizable like good execution is often far more valuable than the inherently serial computations like deep/original ideas.
Fair enough, I'll retract my comment.
I definitely agree that people are overupdating too much from this training run, and we will need to wait.
(I also made this mistake in overupdating.)
I just want to provide one important piece of information:
It turns out that Ilya Sutskever was misinterpreted as a claim about the model plateauing, but instead saying other directions work out better:
https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/?commentId=JFNZ5MGZnzKRtFFMu
Growing slower than logarithmic does not help. Only being bounded in t he limit give you, well, a bound in the limit.
Thanks for catching that error, I did not realize this.
I think I got it from here:
https://www.lesswrong.com/posts/EhHdZ5yBgEvLLx6Pw/chad-jones-paper-modeling-ai-and-x-risk-vs-growth
"Bounded utility solves none of the problems of unbounded utility." Thus the title of something I'm working on, on and off.
It's not ready yet. For a foretaste, some of the points it will make can be found in an earlier unpublished paper "Unbounded Utility and Axiomatic Foundations", section 3.
The reason that bounded utility does not help is that any problem that arises at infinity will already practically arise at a sufficiently large finite stage. Repeated plays of the finite games discussed in that paper will eventually give you a payoff that has a high probability of being close (in relative terms) to the expected value. But the time it takes for this to happen grows exponentially with the lengths of the individual games. You are unlikely to ever see your theoretically expected value, however long you play. The infinite game is non-ergodic; the game truncated to finitely many steps and finite payoffs is ergodic only on impractical timescales.
Infinitude in problems like these is better understood as an approximation to the finite, rather than the other way round. (There's a blog post by Terry Tao on this theme, but I've lost the reference to it.) The problems at infinity point to problems with the finite.
I definitely agree that the problems of infinite utilities are approximately preserved by the finitary version of the problem, and while there are situations where you can get niceness assuming utilities are bounded (conditional on giving players exponentially large lifespans), it's not the common or typical case.
Infinity makes things worse in that you no longer get any cases where nice properties like ergodicity or dominance are consistent with other properties, but yeah the finitary version is only a little better.
I admit, I think this is kind of a crux, but let me get down to this statement:
I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
One big difference between a human-level AI and a real human is coordination costs: Even without advanced decision theories like FDT/UDT/LDT, the ability to have millions of copies of an AI makes it possible for them to all have similar values, and divergences between them are more controllable in a virtual environment than a physical environment.
But my more substantive claim is that lots of how progress is made in the real world is because population growth allows for more complicated economies, more ability to specialize without losing essential skills, and just simply more data to deal with reality, and alignment, including strong alignment is not different here.
Indeed, I'd argue that a lot more alignment progress happened in the 2022-2024 period than the 2005-2015 period, and while I don't credit it all to population growth of alignment researchers, I do think a reasonably significant amount of the progress happened because we got more people into alignment.
Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.
See these quotes from Carl Shulman here for why:
Yeah. In science the association with things like scientific output, prizes, things like that, there's a strong correlation and it seems like an exponential effect. It's not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can't keep the skills in mind faster than they forget them. It's not a divide where there's Einstein and the group that is 10 times as populous as that just can't do it. Or the group that's 100 times as populous as that suddenly can't do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.
Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.
The link for these quotes is here below:
https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_
Secondly, and more importantly, I question whether it is possible even in theory to produce infinite expected value. At some point you've created every possible flourishing mind in every conceivable permutation of eudaimonia, satisfaction, and bliss, and the added value of another instance of any of them is basically nil. In reality I would expect to reach a point where the universe is so damn good that there is literally nothing the Cosmic Flipper could offer me that would be worth risking it all.
This very much depends on the rate of growth.
For most human beings, this is probably right, because their values have a function that grows slower than logarithmic, which leads to bounds on the utility even assuming infinite consumption.
But it's definitely possible in theory to generate utility functions that have infinite expected utility from infinite consumption.
You are however pointing to something very real here, and that's the fact that utility theory loses a lot of it's niceness in the infinite realm, and while there might be something like a utility theory that can handle infinity, it will have to lose a lot of very nice properties that it had in the finite case.
See these 2 posts by Paul Christiano for why:
https://www.lesswrong.com/posts/hbmsW2k9DxED5Z4eJ/impossibility-results-for-unbounded-utilities
I mostly just agree with your comment here, and IMO the things I don't exactly get aren't worth disagreeing too much about, since it's more in the domain of a reframing than an actual disagreement.
My main predictions on how the AI debate will go over the next several years, assuming that AI progress continues:
-
There could well a large portion of the public freaked out, and my prediction is that it will range in the 10-50% of people who want to ban AI at any cost.
-
Polarization will happen along pro/anti-AI lines, and more importantly the bipartisan consensus on AI will likely collapse into polarized camps.
-
Republicans will shift into being AI accelerationists, while Democrats will shift more into the AI safety camp.
-
Maybe the AI backlash doesn't occur, or is far weaker than people think once prices collapse for some goods, and maybe the AI unemployment factor turns out to be tolerable for the public.
I don't give the 4th scenario a high chance, but it is worth keeping in mind.
(One of my takeaways in the 2024 election results around the world is that people are fine with lots of unemployment, but hate price increases, and this might apply to AGI too.)
The good news I'll share is that some of the most important insights about the safety/alignment work done on LLMs do transfer over pretty well to a lot of plausible AGI architectures, so while there's a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact, though the danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.
Specifically, the insights I'm talking about are the controllability of AI with data, combined with their feedback on RL being way denser than human RL from evolution, meaning that instrumental convergence is affected significantly.