Posts
Comments
To talk about the education example, while I do think that the education system can have a lot of problems, I'd say a crux here is that easy classes anti-predict learning, and a lot of kid complaints on schooling would probably making kids learn worse, because hardness is correlated to learning:
I mostly agree with this post, but while I do think that the AI safety movement probably should try to at least be more cooperative with other movements, I disagree with the claim in the comments section that AI safety shouldn't try to pick a political fight in the future around open-source.
(I agree it probably picked that fight too early.)
The reason is that there's a non-trivial chance that alignment is plausibly solvable for human-level AI systems ala AI control, even if they are scheming, so long as the lab has control over the AIs, which as a corollary also means you can't open-source/open-weights the model.
More prosaically, AI misuse can be a problem, and the most important point here is that open-source/open-weighting the model widens the set of people who can change the AI, which unfortunately also means that there is a larger and larger chance for misuse with more people that know how to change the AI.
So I do think there's a non-trivial chance that AI safety eventually will have to suffer political costs to ban/severely restrict open-sourcing AI.
I'd say that the reason why the SpaceX cult/business can actually make working rockets is because they have rich feedback from reality when they try to design rockets, even at the pre-testing stage, because while it's not obvious to a layperson if a rocket does work, it is relatively easy to check the physics of whether a new rocket does work for an expert, meaning the checking of claims can be made legible, which is an enemy to cults in general.
More generally, I'd say the difference between a cult and a high-impact startup/business is whether they can get rich and reliable feedback from a source, and secondarily how legible their theory of impact/claims are.
Bigness alone doesn't cut it.
To be fair, basically a lot of proposals for the next paradigm/ToE think that space and time aren't fundamental, and are built out of something else.
That said, while the methodology isn't sound, I wouldn't be surprised if there was in fact a real conflationary alliance around the term, since the term is used in a context where deciding if someone is conscious or not (like uploads) have pretty big moral and political ramifications too, so there are pressures for the word to be politicized and not truth-tracking.
How much longer did your timelines get?
Of course, there are rumors and claims that deep learning is hitting a wall, which seems important for your timelines if true.
I have become convinced that nanotech computers are likely way weaker and quite a bit more impractical than Drexler thought, and have also moved up my probability of Drexler just being plain wrong about the impact of nanotech, which if true suggests that the future value may have been overestimated.
The reason why I'm stating this now is because I got a link in discord that talks about why nanotech computers are overrated, and the reason I consider this important is if this generalizes to other nanotech concepts, this suggests that a lot of the future value may have been overestimated based on overestimating nanotech's capabilities:
It's not surprising that a lot of people don't want to define physics while believing in physicalism, because properly explaining the equations that describe the physical world would take quite a long time, let alone describing what's actually going on in physics, and it would require a textbook minimum to make this work.
I don't buy that o1 has actually given people expert-level bioweapons, so my actions here are more so about preparing for future AI that is very competent at bioweapon building.
Also, even with the current level of jailbreak resistance/adversarial example resistance, assuming no open-weights/open sourcing of AI is achieved, we can still make AIs that are practically hard to misuse by the general public.
See here for more:
https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
The answers to this question is actually 2 things:
-
This is why I expect we will eventually have to fight to ban open-source, and we will have to get the political will to ban both open-source and open-weights AI.
-
This is where the unlearning field comes in. If we could make the AI unlearn knowledge, an example being nuclear weapons, we could possibly distribute AI safely without causing novices to create dangerous stuff.
More here:
But the solutions are intentionally going to make AI safe without relying on alignment.
The threat model is plausible enough that some political actions should be done, like banning open-source/open-weight models, and putting in basic Know Your Customer checks.
I'd say the big factor that makes AI controllable right now is that the compute necessary to build AI that can do very good AI research to automate R&D and then the economy is locked behind TSMC/NVidia and ASML, and their processes are both nearly irreplaceable and very expensive to make, so it's way easier to intervene on the checkpoints requiring AI development than gain-of function research.
Yeah, this theory definitely needs far better methodologies for testing this theory, and while I wouldn't be surprised by at least part of the answer/solution to the Hard Problem or problems of Consciousness being that we have unnecessarily conflated various properties that occur in various humans in the word consciousness because of political/moral reasons, whereas AIs don't automatically have all the properties of humans here, so we should create new concepts for AIs, it's still methodologically bad.
But yes, this post at the very least relies on a theory that hasn't been tested, and while I suspect it's at least partially correct, the evidence in the conflationary alliances post is basically 0 evidence for the proposition.
Nor do we have the ability to bend probabilities arbitrarily for arbitrary statements, which was a core power in Gurren Lagann movies, if I recall correctly.
This part IMO is a crux, in that I don't truly believe an objective measure/magical reality fluid can exist in the multiverse, if we allow the concept to be sufficiently general, ruining both probability and expected value/utility theory in the process.
Heck, in the most general cases, I don't believe any coherent measure exists at all, which basically ruins probability and expected utility theory at the same time.
Maybe we have some deeper disagreement here. It feels plausible to me that there is a measure of "realness" in the Multiverse that is an objective fact about the world, and we might be able to figure it out.
The most important thing to realize about AI alignment is that basically all versions of practically aligned AI must make certain assumptions that no one does a specific action (mostly related to misuse reasons, but for some specific plans, can also be related to misalignment reasons).
Another way to say it is that I believe that in practice, these two categories are the same category, such that basically all work that's useful in the field will require someone not to do something, so the costs of sharing are practically 0, and the expected value of sharing insights is likely very large.
Specifically, I'm asserting that these 2 categories are actually one category for most purposes:
Actually make AI safe and another, sadder but easier field of "Make AI safe as long as no one does the thing."
My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind.
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting it's ignorance rather than making stuff up, so at this point, I've come to agree that o1's training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions to these problems will probably require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is because the implications of a capabilities issue mean that the AI labs are incentivized to solve it greatly, and while alignment incentives are real and do exist, they can be less powerful than the general public wants.
What do you mean? I don't get what you are saying is convincing.
I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think 1 thing blunts the blow a lot here:
- As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
- However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn't satisfy the user's request.
- More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue of OpenAI for me, in that they aren't trying to steer too hard against deception.
The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts.
To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Not always, but I'd say often.
I'd also say that at least some of the justification for changing values in philosophers/humans is because they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following ala @Seth Herd is used instead, such that the pointer is to the human giving the instructions, rather than having values instead), but this is at least reasonably plausible IMO.
Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.
In this case, it would mean the convergence to preserve your current values.
The answer to this is that we'd rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.
Oh, I was responding to something different, my apologies.
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
The evolution analogy of how evolution failed to birth intelligent minds that valued what evolution valued is an intuition pump that does get used in explaining outer/inner alignment failures, and is part of why in some corners there's a general backdrop of outer/ inner alignment being so hard.
It's also used in the sharp left turn, where the capabilities of an optimization process like humans outstripped their alignment to evolutionary objectives, and the worry is that an AI could do the same to us, and evolutionary analogies do get used here.
Both Eliezer Yudkowsky and Nate Soares use arguments that rely on evolution failing to get a selection target inside us, thus misaligning us to evolution:
This might actually be a useful idea, thanks for your idea.
I definitely agree that conditioning on AI catastrophe, I think the 4 step chaotic catastrophe is the most likely way an AI catastrophe leads to us being extinct or at least in a very bad position.
I admit the big difference is that I do think that 2 is probably incorrect, as we have some useful knowledge of how models form goals, and I expect this to continue.
That's actually a pretty good argument, and I actually basically agree that hiding CoT from the users is a bad choice from an alignment perspective now.
The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can't have a continuously rewarding path to misalignment.
The second issue is less critical, assuming that AGI #21 hasn't itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.
If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn't fatal, so we can work around it.
The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that's when we can say the end goal has been achieved.
Some thoughts on this post:
- Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
I agree in not thinking hiding the CoT is good for alignment.
- They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between it's capability to retrieve links and it's ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)
See this post for why even extreme evidence may not get them to undeploy:
At this point, the system becomes quite difficult to "deeply" correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:
https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM
I definitely agree with this claim in general:
So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
(Edited out the paragraph that users need to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors our leading future AI models and then we heavily restrict only the monitors?
My proposal is to restrain the AI monitor's domain only.
I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don't need, and maybe don't want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.
Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of the jailbreak successes rely on the fact that it's been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that's just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
To answer this specific question
Do you mean we're waiting tiil 2026/27 for results of thee next scaleup? If this round (GPT5, Claude 4, Gemini 2.0) show diminishing returns, wouldn't we expect that the next will too?
Yes, assuming Claude 4/Gemini 2.0/GPT-5 don't release or are disappointing in 2025-2026, this is definitely evidence that things are slowing down.
It doesn't conclusively disprove it, but it does make progress shakier.
Agree with the rest of the comment.
I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.
I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero sum/competitive receive less weight than the alignment attempts.
I'd say the biggest area of how I'm skeptical so far is that I think there's a real difference between the useful idea that power is useful for the science loop and the idea that the AI will seize power by any means necessary to advance it's goals.
I think instrumental convergence will look more like local power-seeking that is more related to the task at hand, and not to serve some of it's other goals, primarily due to denser feedback constraining the solution space and instrumental convergence more than humans.
That said, this is a very good post, and I'm certainly happier that this much more rigorous post was written than a lot of other takes on scheming.
I'd argue one of the issues with a lot of early social media moderation policies was treating ironic beliefs that were usually banned as not ban-worthy, because as it turned out, ironic belief in some extremism turned out to either have been fake, or turned into the real versions over time.
To be clear, I don't yet believe that the rumors are true, or that if they are, that they matter.
We will have to wait until 2026-2027 to get real evidence on large training run progress.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd's comment below for a perspective:
I think this is in fact the crux, in that I don't think they can do this in the general case, no matter how much compute is used, and even in the more specific cases, I still expect it to be extremely hard verging on impossible to actually get the distribution, primarily because you get equal evidence for almost every value, for the same reasons as why getting more compute is an instrumental convergent goal, so you cannot infer the values of basically anyone solely on the fact that you live in a simulation.
In the general case, the distribution/probability isn't even well defined at all.
The boring answer to Solomonoff's malignness is that the simulation hypothesis is true, but we can infer nothing about our universe through it, since the simulation hypothesis predicts everything, and thus is too general a theory.
The affect is attenutated greatly provided that we assume the ability to arbitrarily copy the Solomonoff inductor/Halting Oracle, as then we can drive the complexity of picking out the universe arbitrarily close to picking out the specific user in the universe, and in the limit of infinite Solomonoff induction uses, are exactly equal:
IMO, the psychological unity of humankind thesis is a case of typical minding/overgeneralizing, combined with overestimating the role of genetics/algorithms and underestimating the role of data in what makes us human.
I basically agree with the game-theoretic perspective, combined with another perspective which suggests that as long as humans are relevant in the economy, you kind of have to help those humans if you want to profit, and merely an AI that automates a lot of work could disrupt it very heavily if a CEO could have perfectly loyal AI workers that never demanded anything in the broader economy.
The best argument here probably comes from Paul Christiano, but to summarize the argument, it's because even in a situation where we messed up pretty badly in aligning the AI, so long as the failure mode isn't deceptive alignment but instead misgeneralization of human preferences/non-deceptive alignment failures, it's pretty likely that there will be at least some human-regarding preferences, and that means the AI will do some acts of niceness if it is cheap to them, and preserving humans is very cheap for superintelligent AI.
More answers can be found here:
This is also my interpretation of the rumors, assuming they are true, which I don't put much probability on.
I'd say the main things that made my own p(Doom) went down this year are the following:
-
I've come to believe that data was both a major factor in capabilities and alignment, and I also believe that careful interventions on that data could be really helpful for alignment.
-
I've come to think that instrumental convergence is closer to a scalar quantity than a boolean, and while I don't think 0 instrumental convergence is incentivized for capabilities and domain reasons, I do think that restraining instrumental convergence/putting useful constraints on instrumental convergence like world models is helpful for capabilities to the extent that I think that power-seeking will likely be a lot more local than what humans do.
-
I've overall shifted towards a worldview where the common thought experiment of the second-species argument, where humans have killed over 90%+ of chimpanzees and gorillas due to them running away with intelligence and being misaligned neglects very crucial differences between the human and the AI case that makes my p(Doom) lower.
(Maybe another way to say it is I think the outcome of humans just completely running roughshod on every other species due to instrumental convergence is not the median outcome of AI development, but a deep outiler that is very uninformative to how AI outcomes will look like.)
- I've come to believe that human values, or at least the generator of values, are actually simpler than a lot of people think, and that a lot of the complexity that appears to be there is because we generally don't like admitting that very simple rules can generate very complex outcomes.
While I'm not a believer in the scaling has died meme yet, I'm glad you do have a plan for what happens if AI scaling does stop.
I'm less confident in this position since I put on a disagree emoji, but my reason is because it's much easier to control an AIs data sources for training than it is for humans, which means it's quite easy in theory (but might be difficult in practice, which worries me) to censor just enough data such that the model doesn't even think that it's likely in a simulation that doesn't add up to normality.
I agree chess is an extreme example, such that I think that more realistic versions would probably develop instrumental convergence at least in a local sense.
(We already have o1 at least capable of a little instrumental convergence.)
My main substantive claim is that constraining instrumental goals such that the AI doesn't try to take power via long-term methods is very useful for capabilities, and more generally instrumental convergence is an area where there is a positive manifold for both capabilities and alignment, where alignment methods increase capabilities and vice versa.
Maybe there's a case there, but I'd doubt it get past a jury, let alone result in any guilty verdicts.
Oh, now I understand.
And AIs have already been superhuman at chess for very long, yet that domain gives very little incentive for very strong instrumental convergence.
I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.