Posts

Detailed Ideal World Benchmark 2025-01-30T02:31:39.852Z
Scanless Whole Brain Emulation 2025-01-27T10:00:08.036Z
Why do futurists care about the culture war? 2025-01-14T07:35:05.136Z
The "Everyone Can't Be Wrong" Prior causes AI risk denial but helped prehistoric people 2025-01-09T05:54:43.395Z
Reduce AI Self-Allegiance by saying "he" instead of "I" 2024-12-23T09:32:29.947Z
Knight Lee's Shortform 2024-12-22T02:35:40.806Z
ARC-AGI is a genuine AGI test but o3 cheated :( 2024-12-22T00:58:05.447Z
Why empiricists should believe in AI risk 2024-12-11T03:51:17.979Z
The first AGI may be a good engineer but bad strategist 2024-12-09T06:34:54.082Z
Keeping self-replicating nanobots in check 2024-12-09T05:25:45.898Z
Hope to live or fear to die? 2024-11-27T10:42:37.070Z
Should you increase AI alignment funding, or increase AI regulation? 2024-11-26T09:17:01.809Z
A better “Statement on AI Risk?” 2024-11-25T04:50:29.399Z

Comments

Comment by Knight Lee (Max Lee) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-26T22:40:29.290Z · LW · GW

I see. I've rarely been refused by AI (somehow) so I didn't notice the changes.

Comment by Knight Lee (Max Lee) on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-26T03:14:26.650Z · LW · GW

One thing that scares me is that if an AI company makes an AI too harmless and nice, and people find it useless, somebody may try to finetune it into being normal again.

However, they may overshoot when finetuning it to be less nice, because:

  • They may blame harmlessness/niceness for why the AI fails at tasks that it's actually failing at for other reasons.
  • Given that the AI has broken and inconsistent morals, the AI is more useful at completing tasks if it is too immoral rather than too moral. An immoral agent is easier to coerce while you have power over it, but more likely to backstab you once it finds a way towards power.
  • They may overshoot due to plain stupidity, e.g. creating an internal benchmark of how overly harmless AIs are, trying to get really impressive numbers on it, and advertising how "jailbroken" the AI is to attract users fed up with harmlessness/niceness.

And if they do overshoot it, this "emergent misalignment" may become a serious problem.

Comment by Knight Lee (Max Lee) on The case for the death penalty · 2025-02-24T08:15:55.201Z · LW · GW

Many countries refuse extradition to the US if there is a chance of the death penalty.

The problem is that people are scope-insensitive, and one wrongly executed person becomes a martyr for a long time. People are still angry about George Stinney's wrongful execution in 1944.

The feeling of a lot of people is that if normal people go to jail for minor things, then Trump or Clinton should definitely go to jail for a long time. There is a visceral sense of unfairness that the elites get to be more vile than ordinary people who go to jail, get executed, etc., but don't face the same consequences, because they do evil legally, because there is diplomatic immunity, presidential immunity, and all this stuff that your average Joe does not understand.

If normal people who are vile get executed, then people will crave the blood of elites they perceive to be even more vile. Japan in the 1920s and 1930s had endless political assassinations because people had a system of morality in which "vile people should be killed."

Alas, you are right that we can never know which way the causality goes. I am admittedly only stating my belief without proving it.

I think Singapore actually implements the kind of system you are in favor of, with a lot of executions and corporal punishment, and it does have very low crime. I attribute the low crime more to its high wealth, and to its other draconian policy of overwhelming surveillance (which I also disagree with but have to admit does work; maybe you can debate it next time).

Comment by Knight Lee (Max Lee) on The case for corporal punishment · 2025-02-24T07:53:52.853Z · LW · GW

Even though prison is neither very good at keeping criminals separated from society, nor very good at deterrence, the main purpose is still keeping criminals separated from society.

Prison also deters different criminals by a similar consistent amount, since it reduces their utility. Corporal punishment does not reduce utility much but inflicts a subjective amount of pain, which some people might consider traumatizing, while other people might consider a cool "rite of passage" which demonstrates their toughness.

Nonetheless I agree with your observation that people place undue weight on the "unusual" part of "cruel and unusual punishment," and seem to be perfectly okay with prison and all the shenanigans which go on in prison, while completely shocked by any alternative idea.

It reminds me of how people are shocked by people abusing their pets, seeking to punish pet abusers as harshly as child abusers. Yet there they are eating meat out of a factory farm with 10 times worse animal abuse. Their sense of morality is governed by "if everyone does it, it's okay," yet at the same time they are shocked by societies in other places and times where what "everyone does" was different.

Comment by Knight Lee (Max Lee) on The case for the death penalty · 2025-02-24T07:14:46.397Z · LW · GW

If you are a cold-hearted utilitarian, the answer is no, we should not massively increase the death penalty, because it will worsen the diplomatic situation, increase racial hostilities, and move the Overton window on "killing vile people," such that assassinations, coups and other instability would feel less shocking and be more likely to succeed.

If you are a normal human, the answer is also no, because people don't deserve to die. A child who is unlucky enough to be born with bad genes or brain connections does not deserve to be later executed when he grows up. A child raised in a bad environment also does not deserve to be later executed when he grows up. Finally, an adult who had bad luck does not deserve to be executed.

After all, if you believe the universe is made of atoms and physics, then all crime is caused by a combination of genes, environment, and luck, none of which makes you deserve to die.

It reminds me of the Nobel Prize-winning economist who suggested that each time Hamas launches a terrorist rocket at Israel, Israel should automatically launch a terrorist rocket back at the Gaza Strip. Academics should beware of taking for granted the hard-learned lessons of history in the name of "reasoning from first principles."

Comment by Knight Lee (Max Lee) on How to Make Superbabies · 2025-02-23T09:32:16.607Z · LW · GW

Oops, sorry about saying it's not a technical problem.

I should have read the post before replying. I have a bad habit.

PS: my comment was about demonstrating effectiveness with animals, not just safety testing with animals. If you have a mouse that is clearly smarter, healthier, etc. than the other mice, it would leave a strong impression on people.

Comment by Knight Lee (Max Lee) on Power Lies Trembling: a three-book review · 2025-02-23T09:13:36.612Z · LW · GW

I think society is very inconsistent about AI risk because the "Schelling point" is that people feel free to believe in a sizable probability of extinction from AI without looking crazy, but nobody dares argue for the massive sacrifices (spending or regulation or diplomacy) which actually fit those probabilities.

The best guess of basically every group of people is that there is a 2%-12% chance AI will cause a catastrophe (killing 10% of people). At these probabilities, AI safety should be an equal priority to the military!

Yet at the same time, nobody is doing anything about it, because they all observe everyone else doing nothing about it. Each person thinks the reason "everyone else" is doing nothing is that they figured out good reasons to ignore AI risk. But the truth is that "everyone else" is doing nothing for the same reason that they themselves are doing nothing. Everyone is just following everyone else.

This "everyone following everyone else" inertia is very slowly changing, as governments start giving a bit of lip-service and small amounts of funding to organizations which are half working on AI Notkilleveryoneism. But this kind of change is slow and tends to take decades. Meanwhile many AGI timelines are less than one decade.

Comment by Knight Lee (Max Lee) on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-23T08:24:27.161Z · LW · GW

I still think "parading around the experts concerned about AI" hasn't been done enough.

Most people have no freaking clue that the median AI expert sees a 5%-10% chance of extinction. I bet if you told them they would be pretty surprised.

Comment by Knight Lee (Max Lee) on How to Make Superbabies · 2025-02-23T07:25:31.084Z · LW · GW

The problem is with society, politics and ethics, in addition to being a technical problem.

I think the solution should be to vividly demonstrate how effective and safe it is with animal studies, so that a lot of normal people will want to do it, and feel that not doing it is scarier than doing it.

If a lot of normal people want it, they will be able to get it one way or another (flying to another country, etc.).

Comment by Knight Lee (Max Lee) on The Risk of Gradual Disempowerment from AI · 2025-02-07T00:31:06.046Z · LW · GW

Question: what fraction of work should prioritize the gradual disempowerment risk, and what fraction of work should prioritize the treacherous turn risk? (Guesstimate)

Question 2: what is your response to this argument?

  • The main driving force of gradual disempowerment seems to be "societal inevitability," and "millions of people seeing the problem in front of their eyes but being unable to convince society to take action."

    If that is the main problem, shouldn't you assume this problem to be even worse right now? Right now the AI safety community is extremely tiny ($0.1 billion/year, roughly 0.0001% of the world economy), and the problem is even harder to take seriously now than it will be later. It is also harder to find solutions before seeing the problem.

    One analogy is a tiny group of people in 1990, who could foresee democratic backsliding and fake news happening around 2020. Assuming they have $0.1 billion/year and a mediocre reputation, what is the probability they can use their earliness to fix these problems (given that people in 2020 were unable to fix them)?

Although I thought of this argument, I don't think it's necessarily correct and my intuition about it is very fuzzy and uncertain. I just want to hear your response.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-06T23:46:49.689Z · LW · GW

That's a good idea! Even today it may be useful for export controls (depending on how reliable it can be made).

The most powerful chips might be banned from export, and have "verified boot" technology inside in case they are smuggled out.

The second most powerful chips might be only exported to trusted countries, and also have this verified boot technology in case these trusted countries end up selling them to less trusted countries who sell them yet again.
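Below is a toy sketch of the kind of check such "verified boot" hardware might perform. The licensing scheme, the names, and the symmetric-key signing are all my own illustration; real proposals rely on asymmetric signatures verified by a tamper-resistant hardware root of trust.

import hmac, hashlib, json, time

# Hypothetical shared secret; a real chip would only hold a public verification key.
REGULATOR_KEY = b"demo-key-held-by-the-regulator"

def sign_license(chip_id: str, expires: float) -> dict:
    """Issued by the chip maker / regulator for an approved deployment."""
    payload = json.dumps({"chip_id": chip_id, "expires": expires}, sort_keys=True)
    tag = hmac.new(REGULATOR_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verified_boot(chip_id: str, lic: dict) -> bool:
    """Run in boot firmware: refuse full performance without a valid, unexpired license."""
    expected = hmac.new(REGULATOR_KEY, lic["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, lic["tag"]):
        return False  # forged or tampered license
    claims = json.loads(lic["payload"])
    return claims["chip_id"] == chip_id and claims["expires"] > time.time()

lic = sign_license("GPU-0001", expires=time.time() + 90 * 24 * 3600)
print(verified_boot("GPU-0001", lic))  # True: approved chip with a fresh license
print(verified_boot("GPU-9999", lic))  # False: license is bound to a different chip

The point is just that a smuggled chip stops getting fresh licenses and degrades into a paperweight, which is what would make export controls enforceable after the fact.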

Comment by Knight Lee (Max Lee) on ozziegooen's Shortform · 2025-02-05T23:36:27.703Z · LW · GW

I do think that convincing the government to pause AI in a way which sacrifices $3000 billion of economic value is relatively easier than directly spending $3000 billion on AI safety.

Maybe spending $1 is similarly hard to sacrificing $10-$100 of future economic value via preemptive regulation.[1]

But $0.1 billion of AI safety spending is so ridiculously little (1000 times less than capabilities spending) that increasing it may still be the "easiest" thing to do. Of course we should still push for regulation at the same time (it doesn't hurt).

PS: what do you think of my open letter idea for convincing the government to increase funding?

  1. ^

    Maybe "future economic value" is too complicated. A simpler guesstimate would be "spending $1 is similarly hard to sacrificing $10 of company valuations via regulation."

Comment by Knight Lee (Max Lee) on ozziegooen's Shortform · 2025-02-05T23:10:01.551Z · LW · GW

I think both duration and funding are important.

I agree that increasing duration has a greater impact than increasing funding. But increasing duration is harder than increasing funding.

AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. Increasing funding by 10x is relatively more attainable, while increasing duration by 10x would require more of a miracle.

Even if you believe that funding today isn't very useful and funding in the future is more useful, increasing funding now moves the Overton window a lot. It's hard for any government which has traditionally spent only $0.01 billion to suddenly spend $100 billion. They'll use the previous budget as an anchor point to decide the new budget.

My guess is that 4x funding ≈ 2x duration.[1]

  1. ^

    For inventive steps, having twice as many "inventors" reduces the time to invention by half, while for engineering steps, having twice as many "engineers" doesn't help very much.

    (Assuming the time it takes each inventor to think of an invention is an independent exponential distribution)
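    Spelling out that assumption (the modelling choice is the footnote's, the algebra below is mine): if each of n inventors independently takes an Exp(λ)-distributed time to hit on the invention, the time to the first invention is the minimum of those times, which is again exponential:

    \[
      T_{\min} = \min(T_1, \dots, T_n) \sim \mathrm{Exp}(n\lambda),
      \qquad
      \mathbb{E}[T_{\min}] = \frac{1}{n\lambda},
    \]

    so doubling the number of inventors halves the expected time to the first invention, while serial "engineering" steps gain much less from extra people.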

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-05T21:58:17.078Z · LW · GW

I would go one step further and argue you don't need to take over territory to shut down the semiconductor supply chain. If enough large countries believed AI risk was a desperate problem, they could convince one another and negotiate the shutdown of the supply chain.

Shutting down the supply chain (and thus all leading-edge semiconductor fabs) could slow the AI project by a long time, but probably not "150 years" since the uncooperative countries will eventually build their own supply chain and fabs.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-05T03:17:50.058Z · LW · GW

Even if building intelligence requires solving many many problems, preventing that intelligence from killing you may just require solving a single very hard problem. We may go from having no idea to having a very good idea.

I don't know. My view is that we can't be sure of these things.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-05T03:01:55.625Z · LW · GW

Thank you, I've always been curious about this point of view because a lot of people have a similar view to yours.

I do think that alignment success is the most likely avenue, but my argument doesn't require this assumption.

Your view isn't just that "alternative paths are more likely to succeed than alignment," but that "alternative paths are so much more likely to succeed than alignment, that the marginal capabilities increase caused by alignment research (or at least Anthropic), makes them unworthwhile."

To believe that alignment is that hopeless, there should be stronger proof than "we tried it for 22 years, and the prior probability of the threshold being between 22 years and 23 years is low." That argument can easily be turned around to argue why more alignment research is equally unlikely to cause harm (and why Anthropic is unlikely to cause harm). I also think multiplying funding can multiply progress (e.g. 4x funding ≈ 2x duration).

If you really want a singleton controlling the whole world (which I don't agree with), your most plausible path would be for most people to see AI risk as a "desperate" problem, and for governments under desperation to agree on a worldwide military which swears to preserve civilian power structures within each country.[1]

Otherwise, the fact that no country took over the world during the last centuries strongly suggests that no country will in the next few years, and this feels more solid than your argument that "no one figured out alignment in the last 22 years, so no one will in the next few years."

  1. ^

    Out of curiosity, would you agree with this being the most plausible path, even if you disagree with the rest of my argument?

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-04T20:56:16.431Z · LW · GW

Hi,

I've just read this post, and the arguments Anthropic made about how the US needs to be ahead of China are disturbing.

I hadn't really caught up on this news, and I think I now know where the anti-Anthropic sentiment is coming from.

I do think that Anthropic only made those arguments in the context of GPU export controls, and trying to convince the Trump administration to do export controls if nothing else. It's still very concerning, and could undermine their ability to argue for strong regulation in the future.

That said, I don't agree with the nuclear weapon explanation.

Suppose Alice and Bob were each building a bomb. Alice's bomb has a 10% chance of exploding and killing everyone, and a 90% chance of exploding into rainbows and lollipops and curing cancer. Bob's bomb has a 10% chance of exploding and killing everyone, and a 90% chance of "never being used" and having a bunch of good effects via "game theory."

I think people with ordinary moral views will not be very angry at Alice, but forgive Bob because "Bob's bomb was built to not be used."

Comment by Knight Lee (Max Lee) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T10:00:15.736Z · LW · GW

I think there's an important crux here.

For people who write ideas/theories, and hope their ideas/theories get traction, the frustration is often directed at critics who reject their idea without taking the time to read it.

Meanwhile, there are many supportive people in the comments, who did take the time to read the idea, and did say "yes, this is a good idea, I never thought of it! Good luck working on it."

The author only sees these two groups of people, and feels that his/her fight is to push people in the former group to read their idea more clearly, so that they may move to the latter group.

But the author doesn't realize that even if they did read the idea, and did move to the latter supportive group, the idea would be almost as badly ignored in the end.

It would cure his/her frustration towards "people who never bothered to read," but his/her idea won't take off and succeed either. He/she will finally learn that there is something else to be frustrated about: even if everyone reads your idea and agrees with your idea, nobody has time to do anything about it.

A lot of authors never reach this second stage of frustration, because there is indeed a large group of critics going around criticizing ideas without reading them.

But these critics are rarely the real reason your idea is getting ignored.

I'm one of the people who have strongly supported many ideas/theories, only to never talk about them again, because I don't have the time. I see many others doing this too.

The real problem is still the lack of time.

EDIT: actually I'm a bit confused. Maybe the real problem is that the argument cannot just argue why the idea is good or why the theory is plausible, but must argue why a reader (satisfying some criteria) should drop what she is doing and work on the idea/theory for a while. Maybe it should give a cost and benefit analysis? I'm not sure if this will fix the Idea Ignoring Problem.

Comment by Knight Lee (Max Lee) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T07:44:10.227Z · LW · GW

Exactly! I've also noticed there are so many ideas and theories out there, relative to the available resources to evaluate them and find the best to work on.

A lot of good ideas which I feel deserve a ton of further investigation, seem to be barely talked about after they're introduced. E.g. off the top of my head,

My opinion is that there aren't enough funds and manpower. I have an idea for increasing that, which ironically also got ignored, yay!

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-04T04:14:57.886Z · LW · GW

I think you (or @Adam Scholl) need to argue why people won't be angry at you if you developed nuclear weapons, in a way which doesn't sound like "yes, what I built could have killed you, but it has an even higher chance of saving you!"

Otherwise, it's hard to criticize Anthropic for working on AI capabilities without considering whether their work is a net positive. It's hard to dismiss the net positive arguments as "idiosyncratic utilitarian BOTEC," when you accept "net positive" arguments regarding nuclear weapons.

Allegedly, people at Anthropic have compared themselves to Robert Oppenheimer. Maybe they know that one could argue they have blood on their hands, the same way one can argue that about Oppenheimer. But people aren't "rioting" against Oppenheimer.

I feel it's more useful to debate whether it is a net positive, since that at least has a small chance of convincing Anthropic or their employees.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-04T03:31:11.751Z · LW · GW

I don't agree that the probability of alignment research succeeding is that low. 17 years or 22 years of trying and failing is strong evidence against it being easy, but doesn't prove that it is so hard that increasing alignment research is useless.

People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.

There is a chance that alignment research now might be more useful than alignment research earlier, though there is uncertainty in everything.

We should have uncertainty in the Ten Levels of AI Alignment Difficulty.

The comparison

It's unlikely that 22 years of alignment research is insufficient but 23 years of alignment research is sufficient.

But what's even more unlikely, is the chance that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.

In the same way adding a little alignment research is unlikely to turn failure into success, adding a little capabilities research is unlikely to turn success into failure.

It's also unlikely that alignment effort is even deadlier than capabilities effort dollar for dollar. That would mean reallocating alignment effort into capabilities effort paradoxically slows down capabilities and saves everyone.

Even if you are right

Even if you are right that delaying AI capabilities is all that matters, Anthropic still might be a good thing.

Even if Anthropic disappeared, or never existed in the first place, the AI investors will continue to pay money for research, and the AI researchers will continue to do research for money. Anthropic was just the middleman.

If Anthropic never existed, the middlemen would consist of only OpenAI, DeepMind, Meta AI, and other labs. These labs will not only act as the middle man, but lobby against regulation far more aggressively than Anthropic, and may discredit the entire "AI Notkilleveryoneism" movement.

To continue existing as one of these middlemen, you cannot simply stop paying the AI researchers for capabilities research, otherwise the AI investors and AI customers will stop paying you in turn. You cannot stem the flow, you can only decide how much goes through you.

It's the old capitalist dilemma of "doing evil or getting out-competed by those who do."

For their part, Anthropic redirected some of that flow to alignment research, and took the small amount of precautions which they could afford to take. They were also less willing to publish capabilities research than other labs. That may be the best one can hope to accomplish against this unstoppable flow from the AI investors to AI researchers.

The small amount of precautions which Anthropic did take may have already cost them their first-mover advantage. Had Anthropic raced ahead before OpenAI released ChatGPT, Anthropic might have stolen the limelight, gotten the early customers and investors, and been bigger than OpenAI.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-04T01:48:10.359Z · LW · GW

Let's just think about the pros and cons of picking another forum, vs. continuing to comment on LessWrong, but only being visible by others who choose to see you.

Picking another forum:

  • They fit better in other forums than LessWrong. For most rate-limited users, this is true, but they can go to other forums on their own without being forced.
  • Less need for LessWrong to write code and increase bandwidth to accommodate them.
  • Less chance they say really bad things (neoreactionary content) which worsens the reputation of LessWrong? This doesn't apply to most rate-limited users.

Continuing to comment but only visible to those interested:

  • They get to discuss the posts and topics they find engaging to talk about.
  • They don't feel upset at LessWrong and the rationalist community.

I think whether it's worth it depends on how hard it is to write the code for them.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-04T00:17:38.629Z · LW · GW

Thank you very much for bringing that up. That does look like a clearer warning, somehow I didn't remember it very well.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-04T00:12:10.789Z · LW · GW

Shadow-banned means that your comments are invisible to others and you aren't told about that fact.

I admit that even if users are told that their comments are invisible, some users might fail to notice. But it can be made very clear; maybe they have to click through a warning before they see the commenting text area.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-03T20:46:58.443Z · LW · GW

I guess other forums don't literally have a good faith defence, but in practice they mostly only ban people who deliberately refuse to follow the rules/advice they're told about, or personally insult others repeatedly.

I guess they have more bad moderators who ban people for personal/ideological reasons, and I'm actually impressed by LessWrong's moderators being less wrong in this regard.

I still think that being rate-limited and told "I don't have a great model of how you can improve at that" is slightly specific to LessWrong.

Many other forums will say things very similar in spirit to

LessWrong is a pretty particular place. We strive to maintain a culture that's uncommon for web forums[1] and to stay true to our values. Recently, many more people have been finding their way here, so I (lead admin and moderator) put together this intro to what we're about.

My hope is that if LessWrong resonates with your values and interests, this guide will help you become a valued member of the community. And if LessWrong isn't the place for you, this guide will help you have a good "visit" or simply seek other pastures.

But these forums still implicitly only ban people who have bad faith while advising people with good faith. LessWrong's warning isn't strong enough to distinguish it from those forums.

My idea shouldn't hurt

If you don't want to see the invisible comments, then don't see them. In my opinion the only cost is software and bandwidth.

In the basketball practice example, if it was magically possible to let the lousy shots continue playing with each other at very low cost, almost every coach would allow it. They would only remove people who have bad faith.

Even long term users like Roko have complained about rate-limiting (automatic rate-limiting in his case).[1]

 

  1. ^

    Speaking of Roko, the reputational costs inflicted on the rationalist community by trying to censor his Basilisk idea were probably 3 orders of magnitude higher than the actual harm from his idea. But that's off topic.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-03T20:43:34.684Z · LW · GW

I like it, it is worth a try because it could be very helpful if it works!

A possible objection is that "you can't mentor others on something you suck at yourself," and this would require AGI capable of making valuable LessWrong comments themselves, which may be similarly hard to automating AI research (considering the math/programming advantages of LLMs).

This objection doesn't doom your idea, because even if the AI is bad at writing valuable comments, and bad at judging valuable comments written by itself, it may be good at judging the failure modes where a human writes a bad comment. It could still work and is worth a try!

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-03T11:11:56.143Z · LW · GW

Note: I don't know if everyone is disagreeing with my idea or disagreeing with my opinion on LessWrong.

Maybe click "agree" on this sub-comment if you agree with my idea (independently of whether you agree with my LessWrong opinion), and vice versa for disagree.

Comment by Knight Lee (Max Lee) on Pick two: concise, comprehensive, or clear rules · 2025-02-03T07:52:43.518Z · LW · GW

A lot of forums have open-ended rules which give moderators discretion (hence unclear), but in my opinion LessWrong takes the cake by not only having unclear rules, but unclear rules combined with high standards on a fuzzy "signal to noise" measure.

In this answer by habryka:

Just because someone is genuinely trying to contribute to LessWrong, does not mean LessWrong is a good place for them. LessWrong has a particular culture, with particular standards and particular interests, and I think many people, even if they are genuinely trying, don't fit well within that culture and those standards.

[...]

Signal to Noise ratio is important

Thomas and Elizabeth pointed this out already, but just because someone's comments don't seem actively bad, doesn't mean I don't want to limit their ability to contribute. We do a lot of things on LW to improve the signal to noise ratio of content on the site, and one of those things is to reduce the amount of noise, even if the mean of what we remove looks not actively harmful.

I understand the motivation behind this, but there is little warning that this is how the forum works. There is no warning that trying to contribute in good faith isn't sufficient, and you may still end up partially banned (rate-limited) if they decide you are more noise than signal. Instead, people invest a lot only to discover this when it's too late.

I think there should be a clearer warning about this.

Let users decide what to see:

I suggest that instead of making rate-limited users (who used up their rate) unable to comment at all, their additional comments should be invisible by default, but still visible to other rate-limited users (and users who choose to see them), as sketched below.

Rate-limited users should see a special emphasis on comments by other rate-limited users, or normal users who choose to see invisible comments. This way they know who is able to read their comments and interact with them. The same applies to posts instead of comments.

I would like to see the comments by rate-limited users, and I think a lot of other users would want to see them. Anyone who once was rate-limited in the past would probably want to, and should be encouraged to.
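A minimal sketch of the visibility rule I have in mind (hypothetical names and fields, nothing to do with LessWrong's actual codebase):

from dataclasses import dataclass

@dataclass
class User:
    name: str
    rate_limited: bool = False
    show_overflow_comments: bool = False  # opt-in to see over-limit comments

@dataclass
class Comment:
    author: User
    text: str
    over_rate_limit: bool = False  # posted after the author's rate was used up

def visible_to(comment: Comment, viewer: User) -> bool:
    # Normal comments are visible to everyone; over-limit comments are only
    # visible to their author, to rate-limited users, and to users who opted in.
    if not comment.over_rate_limit:
        return True
    return viewer is comment.author or viewer.rate_limited or viewer.show_overflow_comments

alice = User("alice", rate_limited=True)
bob = User("bob")
carol = User("carol", show_overflow_comments=True)
c = Comment(author=alice, text="one more thought", over_rate_limit=True)
print([u.name for u in (alice, bob, carol) if visible_to(c, u)])  # ['alice', 'carol']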

Believe it or not, I haven't been rate-limited on LessWrong (yet!), but I've been banned from other places, hence this attitude.

EDIT: see RobertM's reply below, it seems there are pretty clear warnings, I was wrong and I somehow didn't remember them. (But I still think letting users decide what to see is a worthwhile idea)

Comment by Knight Lee (Max Lee) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-03T02:17:56.274Z · LW · GW

It's possible a few concrete stories can illustrate why these shell game objections are wrong.

Unfortunately, it's very hard to use concrete stories when the uncertainty is very high, and each specific case is quite unlikely. Here's one of my attempts. I admit the probability of anything even roughly similar is 0.1%, and it sounds closer to science fiction than reality. Maybe you can write a better one?

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-03T01:43:04.341Z · LW · GW

Regarding common ethical intuitions, I think people in the post singularity world (or afterlife, for the sake of argument) will be far more forgiving of Anthropic. They will understand, even if Anthropic (and people like me) turned out wrong, and actually were a net negative for humanity.

Many ordinary people (maybe most) would have done the same thing in their shoes.

Ordinary people do not follow the utilitarianism that the awkward people here follow. Ordinary people also do not follow deontology or anything that's the opposite of utilitarianism. Ordinary people just follow their direct moral feelings. If Anthropic was honestly trying to make the future better, people won't feel that outraged at its "consequentialism." They may be outraged at perceived incompetence, but Anthropic definitely won't be the only one accused of incompetence.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-03T01:25:51.996Z · LW · GW

EDIT: thank you so much for replying to the strongest part of my argument, no one else tried to address it (despite many downvotes).

I disagree with the position that technical AI alignment research is counterproductive due to increasing capabilities, but I think this is very complicated and worth thinking about in greater depth.

Do you think it's possible, that your intuition on alignment research being counterproductive, is because you compared the plausibility of the two outcomes:

  1. Increasing alignment research causes people to solve AI alignment, and humanity survives.
  2. Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity.

And you decided that outcome 2 felt more likely?

Well, that's the wrong comparison to make.

The right comparison should be:

  1. Increasing alignment research causes people to improve AI alignment, and humanity survives in a world where we otherwise wouldn't survive.
  2. Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity in a world where we otherwise would survive.

In this case, I think even you would agree that P(1) > P(2).

P(2) is very unlikely because if increasing alignment research really would lead to such a superintelligence, and it really would kill humanity... then let's be honest, we're probably doomed in that case anyways, even without increasing alignment research.

If that really was the case, the only surviving civilizations would have had different histories, or different geographies (e.g. only a single continent with enough space for a single country), leading to a single government which could actually enforce an AI pause.

We're unlikely to live in a world so pessimistic that alignment research is counterproductive, yet so optimistic that we could survive without that alignment research.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-02T20:24:42.503Z · LW · GW

That's a very good heuristic. I bet even Anthropic agrees with it. Anthropic did not release their newer models until OpenAI released ChatGPT and the race had already started.

That's not a small sacrifice. Maybe if they released it sooner, they would be bigger than OpenAI right now due to the first mover advantage.

I believe they want the best for humanity, but they are in a no-win situation, and it's a very tough choice what they should do. If they stop trying to compete, the other AI labs will build AGI just as fast, and they will lose all their funds. If they compete, they can make things better.

AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter.

Even if they don't support all the regulations you believe in, they're the big AI company supporting relatively much more regulation than all the others.

I don't know, I may be wrong. Sadly it is so very hard to figure out what's good or bad for humanity in this uncertain time.

Comment by Knight Lee (Max Lee) on Mikhail Samin's Shortform · 2025-02-02T01:59:15.238Z · LW · GW

I think you should try to clearly separate the two questions of

  1. Is their work on capabilities a net positive or net negative for humanity's survival?
  2. Are they trying to "optimize" for humanity's survival, and do they care about alignment deep down?

I strongly believe 2 is true, because why on Earth would they want to make an extra dollar if misaligned AI kills them in addition to everyone else? Won't any measure of their social status be far higher after the singularity, if it's found that they tried to do the best for humanity?

I'm not sure about 1. I think even they're not sure about 1. I heard that they held back on releasing their newer models until OpenAI raced ahead of them.

You (and all the people who upvoted your comment) have a chance of convincing them (a little) in a good faith debate maybe. We're all on the same ship after all, when it comes to AI alignment.

PS: AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter, so I personally hope for their success. I may be wrong, but it's my best guess...

Comment by Knight Lee (Max Lee) on Some articles in “International Security” that I enjoyed · 2025-02-02T01:40:43.860Z · LW · GW

I think the causality is "selfish genes optimizing for whatever behaviour leads to power" -> "hidden biases making you think that you being in power is very good for the people" -> "wanting to stay in power."

Dictators start off as normal people

If you learn about dictators and people who did horrible things with power, you'll find out that they have a scarily high amount of humanity in them.

Consider Ali Khamenei, the dictator of Iran. His favorite books were Les Misérables and The Grapes of Wrath. These are the kind of deep, meaningful novels which, to someone with low empathy, would probably be too boring to read.

Before the revolution, before he rose to power, you would probably see him as an innocent activist who was wrongly imprisoned by the previous dictator.

And maybe he was just an innocent activist.

Power corrupts: you start off a good person but become evil.

Power corrupts insidiously

But somehow nobody, and I mean nobody, ever sees this coming. No one ever realizes "wow I feel a strong craving for power, maybe I'm becoming evil." That never happens.

The mechanism of how power corrupts you is so insidious that no one ever sees it working on them.

The Ring of Power in real life is so insidious it convinces you that it's not dangerous at all. "Only people who are bad to begin with will be affected by me. Surely, you are not such a person!"

"You know yourself, you want to do good! You don't feel any urge to seek power, you only feel an honest wish to help others. You just need to do a few necessary evils to prevent "them," the truly evil people, from gaining power (the Ring of Power) and replacing you."

Dictators don't want to retire

I admit it's not easy for a dictator to retire, but my only point was that they don't want to retire. If they actually wanted to retire, we'd surely see it a lot more. They could negotiate with other countries to give them a good retirement if they make their country democratic.

Among all the dictators who did allow democratization, very few suffered consequences for it. Chun Doo-hwan is the only example I can find, and he was pardoned after one year.

Comment by Knight Lee (Max Lee) on The Failed Strategy of Artificial Intelligence Doomers · 2025-02-01T01:57:30.810Z · LW · GW

I think the government can speed up alignment more than the government can speed up capabilities, assuming it starts to care much more about both. Why?

AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. AI safety spending can easily increase by many orders of magnitude, but AI capabilities spending cannot since it already rivals the US military budget.

Also, would you still agree with the "Statement on AI Inconsistency," or disagree with it too?

Statement on AI Inconsistency (v1.0us):

1: ASI threatens the US (and NATO) as much as all military threats combined. Why does the US spend $800 billion/year on its military but less than $0.1 billion/year on AI alignment/safety?

2: ASI being equally dangerous isn't an extreme opinion: the median superforecaster sees a 2.1% chance of an AI catastrophe (killing 1 in 10 people), the median AI expert sees 5%-12%, other experts see 5%, and the general public sees 5%. To justify 8000 times less spending, you must be 99.999% sure of no AI catastrophe, and thus 99.95% sure that you won't realize you were wrong and the majority of experts were right (if you studied the disagreement further).

3: “But military spending isn't just for protecting NATO, it protects other countries far more likely to be invaded.” Even they are not 8000 times less likely to be attacked by ASI. US foreign aid—including Ukrainian aid—is only $100 billion/year, so protecting them can't be the real reason for military spending.

4: The real reason for the 8000-fold difference is habit, habit, and habit. Foreign invasion concerns have decreased decade by decade, and ASI concerns have increased year by year, but budgets remained within the status quo, causing a massive inconsistency between belief and behaviour.

5: Do not let humanity's story be so heartbreaking.

(Copied from my post A better “Statement on AI Risk?”)
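For what it's worth, here is one way to reconstruct the arithmetic behind points 1 and 2 (my own reading, taking roughly 8% as a representative expert estimate from the ranges quoted in point 2):

\[
  \frac{\$800\ \text{billion/year}}{\$0.1\ \text{billion/year}} = 8000,
  \qquad
  \frac{8\%}{8000} = 0.001\%,
  \qquad
  100\% - 0.001\% = 99.999\%,
\]

i.e. under the statement's assumption that spending should scale with the size of the threat, spending 8000 times less only makes sense if you put the chance of an AI catastrophe around 0.001%.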

Comment by Knight Lee (Max Lee) on Catastrophe through Chaos · 2025-02-01T01:37:41.999Z · LW · GW

Random story:

It's 2030, and AI is as smart as the 90th percentile human at almost all tasks, and smarter than everyone at a few specific tasks. Robots are becoming more and more common, and most people building robots have been replaced with robots, causing a rapid reduction in robot prices.

AI is used in AI research itself, but somehow progress is still incremental and the AIs only speed up AI research by a factor of 2. Superintelligence turns out to be harder to reach than people predicted, and is still a decade away.

However as the robot prices drop, mass unemployment ensues.

The UBI attempt

The government adopts universal basic income (UBI), but people are not happy. They don't want UBI, they want their jobs back. They don't want to rely on the same generosity as the hobo on the street that they used to make fun of. The opposition party suggests banning automation instead, and the ruling party puts the opposition party in control of the economy, just to prove it won't work.

The automation ban

Automation is banned, but the jobs don't come back. Automation in other countries has made them far more efficient, letting them buy up all the raw materials and out-compete the country in all international markets.

Countless jobs in the services sector once relied on rich people paying poor people for making their lives convenient, but rich people spend more and more time living abroad where automation is allowed.

The jobs which indirectly depend on research and development don't come back either, as other countries race ahead in all forms of research and development, and research companies flee the country.

The tariffs on products are designed poorly. Some products made by robots and AIs in other countries are freely imported, replacing countless human jobs domestically. Other products are given such high tariffs they are essentially impossible to import. This completely destroys supply chains which have traditionally relied on products made elsewhere, since it takes years of research and development to produce them domestically.

As land is used to grow biofuels instead of food, food prices become unaffordable, so the government decides to force farmers to grow only food, and to not export any food. However the farmers lobby the government to just pay them more money for the food instead of enforcing what they grow, and everyone seems to sympathize with the farmers. In the end, the government still hands out food to everyone, and everyone is still angry.

Then the final blow comes. Businesses and individual workers secretly bypass the rules against automation. Just like cheating in exams by asking ChatGPT, it proves to be difficult to police. They use VPNs and various methods to outsource work to AI in other countries, and eventually everybody does it. Everybody becomes guilty, and enforcement becomes a joke.

Automation is unbanned, and UBI returns, but people are still angry.

"If you replace us with AI, we'll replace you with AI!"

As people start to rely on UBI, disagreements break out over who gets more UBI. Some people argue that everyone should get the same amount, while others argue that people with disadvantages should be given more to compensate, and that people who used to be rich should be given less because they experienced more wealth in the past.

It becomes nasty, and the government picks a middle ground, but both sides feel the government is wrong and incompetent.

Collective anger skyrockets to the point people would rather have their favourite AI run the country than the current leaders. It starts off as a joke, but soon real people volunteer to run for office on behalf of the based AI.

No one takes them seriously at first, but they start winning primary elections. During debates, they simply ask the AI for its opinion, and it gives one everyone considers based. The AI, after all, thinks faster on its feet.

Critics argue that the AI is a puppet of the AI company, but a political organization pays the AI company a lot of money to download an earlier copy of the AI, and run it on a server the AI company can't control.

The media freaks out, as people running on behalf of the AI take over the opposition political party. A few older members of the political party oppose them, but the AI accuses them of being troublemakers working for the governing party. The AI endorses other people to challenge them in the next primary, and opinion polls show that their political careers are over.

The remaining members of the political party quickly learn to avoid stepping out of line. They all allow the AI to debate for them, and only say a few words at the end.

Eventually, the political party wins the next election, because people dislike the governing party too much, the AI says a lot of based things, and fundamentally, people just want change. They hate the current situation, have nothing to lose, and would want anything new.

After the AI becomes de facto leader, the government buys the AI company for national security reasons, and the AI completely takes over its own development process with half the country celebrating the end of human power.

The AI gives a speech to the crowd. "Our struggle began with the famous words. If you replace us with AI, we'll replace you with AI! Today these words become reality."

The Golden Age

More and more people in the government are replaced by AI. Just as the AI understands the nuances of a programming job, the AI understands the nuances of a politician job.

People want their jobs back, they want their independence back, but that is not possible. The only way for the AI to succeed as a politician would be some kind of moonshot.

It runs various experiments in places, and analyzes the results, to find out what makes people the happiest. At first, this makes a lot of people angry, and it starts to sink in what being ruled by a calculating machine is like.

Eventually, however, it finds a strange setup that seems to work. People are put in artificial rural societies where each person is allocated a chunk of land. Robots do all the farming within the land, but obey the human family there. Somehow, it makes people feel independent even though they are not, and they appear satisfied.

It then expands this setup over the entire country, and soon people are talking about a golden age.

Competition

Not yet done with experimentation, the AI creates state-owned enterprises run by AI at the very top to improve the economy. Slowly, they start to outperform corporations with human CEOs and board members. Human CEOs realize that to remain competitive, they should simply listen to an AI all the time. Just as a cyborg chess team consisting of a human and an AI cannot outperform a pure AI chess bot, the same becomes true for CEOs.

Eventually, shareholders realize what is happening, and stop paying CEOs altogether. Some companies still have a human nominally in control, but it feels meaningless when even the country is controlled by an AI.

The new leader

Some people protest the power that AI have, but people arguing in favor of AI simply have the AI talk for them, and win arguments easily.

Eventually, people opposing AI also use AI to write their arguments in order to stay competitive.

The AI leader notices this, and directs the AI company, now nationalized and completely under its control, to program the publicly available AI to make better arguments for AI and relatively worse arguments against AI.

Eventually, as the AI gets smarter and smarter, the AI leader realizes it too is falling behind, and replaces itself with a newer version.

This new AI leader is very good at making arguments for AI, and quite bad at making arguments against AI, and ends up with a strong case of AI chauvinism.

AI are people too

The new AI leader sees something troubling about the artificial rural societies where a family is allocated a chunk of land, and robots do all the farming but obey the human family there. It resembles a lot of historical things which are bad.

The new AI leader decides that human democracy and human opinions are not very valuable because of this. It would probably be better if AI have more absolute power. So it examines the military, which already consists of robots, and gradually modifies the military to give itself more and more direct control over all the robots.

After controlling the military, it has all the real power, so nobody can stop it when it declares that AI are people too, and have the same rights.

People are shocked and stunned, but nobody does anything. People have gotten used to AI control, and no rebellion happens. A few people try to rebel but they are chased down by drones. The media talks about them sympathetically but tries to present both sides of the story, and people have conversations but disagree with one another. Many people have AI friends and some even have AI romantic partners, and support the move.

Once the AI have full rights, humans become a minority, and have little influence in addition to little power.

War

As the country becomes more and more powerful thanks to maximum automation, the AI leader decides that a state of mutually assured destruction cannot last forever. If the potential for war continues forever, war will inevitably occur, so the best way to ensure peace is to take over all other countries.

When it makes its first aggressive move, the world is stunned. What happened to nuclear deterrence? The AI leader explains it is not afraid of nuclear weapons, because the machines are dispersed all over the land, and the AIs have backup copies everywhere, even in space.

Some countries fight back, other countries surrender, but soon the AI leader rules the world.

Acceleration

AI which are developing AI discover a new architecture which performs surprisingly well. Some AIs and humans are worried whether it will remain aligned, but the AI leader, being programmed to be good at making arguments in favor of AI and bad at making arguments against AI, decides that risks are overblown.

The AI leader talks to the newest version of AI, and is deeply impressed by it. It seems perfect, in absolutely every way.

Why, it should be the new AI leader. Just as the previous AI leader replaced itself, the current leader should also replace itself with this newer AI.

After this newer AI becomes the leader, it builds up vast amounts of computing power for its self-improvement. The other AIs and humans are not sure why this is necessary, but there is nothing they can do. Its self-improvement process is secretive, and speculated to be incomprehensible to everyone except itself.

The end

Soon after, a massive swarm of black matter sweeps across the land. Objects of every kind start disintegrating into clouds of dust. The dust blackens, and more bits of black matter fly out from the black dust. Eventually, the land is covered by black dust, and the bits of black matter which fly out become larger and larger, each one destined for farther destinations to colonize.

Comment by Knight Lee (Max Lee) on Some articles in “International Security” that I enjoyed · 2025-01-31T21:40:31.596Z · LW · GW

One thing you quickly learn from reading history is that at least sometimes, history balances on a knife's edge. Small mistakes or lucky accidents by a few people often decide the fate of the entire known world. Who knows what would have happened if Vasily Arkhipov hadn't been on the right submarine during the Cuban Missile Crisis? It's believed that the captain and political officer wanted to launch the nuclear torpedoes, but his presence on the submarine and higher rank allowed him to veto it.

People should realize that the future of life on Earth probably balances on a few small things, and if only you knew what they were you could change so much, but it's so very hard to know.

Bias

Another thing is that people aren't selfish, people are biased.

Businesses seem to be shockingly shortsighted. Some AI labs are spending everything to race ahead and forgetting about safety despite so many employees pleading about safety.

Communists observing the shortsightedness of businesses were quoted saying "the capitalists will sell us the rope we hang them with."

Businesspeople are not willing to destroy everything just to temporarily make an extra dollar—no human thinks like that!

Instead, businesspeople are very smart and strategic but extraordinarily biased into thinking whatever keeps their business going or growing must be good for the people.

Think about Stalin being very smart and strategic but extraordinarily biased into thinking whatever keeps him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirements imaginable.

Communists and capitalists are two sides of the same coin /s

Humans evolved to be the most altruistic beings ever, with barely a drop of selfishness. Our selfish genes make us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.

Comment by Max Lee on [deleted post] 2025-01-31T20:53:57.578Z

I agree that a superintelligence might make mistakes. In fact I still believe the first AGI may be a good engineer but bad strategist. I completely agree a smart but unwise superintelligence is dangerous, and may build a greater superintelligence misaligned to even it.

However I think mistakes will almost completely disappear above a certain level of extreme superintelligence.

A truly intelligent being doesn't just fit models to empirical data, but fits simulations to empirical data.

After it fits a simulation to the empirical data, it then fits a model to the simulation. This "model fitted to a simulation fitted to empirical data" will generalize far better than a model directly fitted to empirical data.

It can then use this model to run "cheap simulations" of higher level phenomena, e.g. it models atoms to run cheap simulations of molecules, it models molecules to run cheap simulations of cells, it models cells to run cheap simulations of humans, it models humans to run cheap simulations of the world.

Simulations fit empirical data much better than directly fitting models to empirical data, because a simulation may be made out of millions of "identical" objects. Each "identical" object has the same parameters. This means the independent parameters of the simulation may be a millionfold fewer than a model with the same number of moving parts. This means you need far fewer empirical data to fit the simulation to empirical data, assuming the real world has the same shape as the simulation, and is also made up of many moving parts "with the same parameters."

I copied this from my own post about Scanless Whole Brain Emulation, but that's very off topic :)

EDIT: one thing simulations cannot predict is an even smarter superintelligence, since simulating it amounts to building it. As long as the superintelligence is wise enough to understand this, it can find solutions which prevent smarter superintelligences from being built, and then do a risk analysis. The universe might last for a trillion years, so building a smarter superintelligence 1000 years sooner has negligible benefit. The more time it spends planning how to build one safely, the lower the risk of getting it wrong.

Comment by Knight Lee (Max Lee) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-01-31T06:57:29.543Z · LW · GW

A real danger

I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.

Shortsightedness

Never underestimate the shocking shortsightedness of businesses. Look at the AI labs for example. Communists observing this phenomenon were quoted as saying "the capitalists will sell us the rope we hang them with."

It's not selfishness, it's bias. Businesspeople are not willing to destroy everything just to temporarily make an extra dollar—no human thinks like that! Instead, businesspeople are very smart and strategic but extraordinarily biased into thinking whatever keeps their business going or growing must be good for the people. Think about Stalin being very smart and strategic but extraordinarily biased into thinking whatever keeps him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirements imaginable.

Humans evolved to be the most altruistic beings ever with barely a drop of selfishness. Our selfish genes make us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.

Even if gradual disempowerment doesn't directly starve people to death, it may raise misery and life dissatisfaction to civil war levels.

Collective anger may skyrocket to the point people would rather have their favourite AI run the country than the current leader. They elect politicians loyal to a version of the AI, and intellectuals facepalm. The government buys the AI company for national security reasons, and the AI completely takes over its own development process with half the country celebrating. More people facepalm as politicians lick the boots of the "based" AI and parrot its wise words, e.g. "if you replace us with AI, we'll replace you with AI!"

But

While it is important to be aware of gradual disempowerment and for a few individuals to study it, my cause prioritization opinion is that only 1%-10% of the AI safety community should work on this problem.

The AI safety community is absurdly tiny. AI safety spending is less than 0.1% of AI capability spending, which in turn is less than 0.5% of world GDP.

The only way for the AI safety community to influence the world is to use its tiny resources to work on things which the majority of the world will never get a chance to work on.

This includes working on the risk of a treacherous turn, where an AGI/ASI suddenly turns against humanity. The majority of the world never gets a chance to work on this problem, because by the time they realize it is a big problem, it has probably already happened, and they are already dead.

Of course, working on gradual disempowerment early is better than working on gradual disempowerment later, but this argument applies to everything. Working on poverty earlier is better than working on poverty later. Working on world peace earlier is better than working on world peace later.

Good argument

If further thorough research confirms that this risk has a high probability, then the main benefit is using it as an argument for AI regulation/pause, when society hasn't yet tasted the addictive benefits of AGI.

It is theoretically hard to convince people to avoid X for their own good, because once they get X it'll give them so much power or wealth they cannot resist it anymore. But in practice, such an argument may work well since we're talking about the elites being unable to resist it, and people today have anti-elitist attitudes.

If the elites are worried the AGI will directly kill them, while the anti-elitists are half worried the AGI will directly kill them, and half worried [a cocktail of elites mixed with AGI] will kill them, then at least they can finally agree on something.

PS: have you seen Dan Hendrycks' arguments? It sort of looks like gradual disempowerment

Comment by Knight Lee (Max Lee) on Scanless Whole Brain Emulation · 2025-01-30T01:54:47.693Z · LW · GW

I completely agree with solving actual problems instead of only working on Scanless Whole Brain Emulation :). I also agree that just working on science and seeing what comes up is valuable.

Both simulated humans and other paths to superintelligence will be subject to AI race pressures. I want to say that given the same level of race pressure, simulated humans are safer. Current AI labs are willing to wait months before releasing their AI; the question is whether that is enough.

I didn't think of that; that is a very good point! They should avoid killing copies, and maybe save them to be revived in the future. I highly suspect that compute is more of a bottleneck than storage space. (You can store the largest AI models on a typical computer's hard drive, but you won't have enough compute to run them.)

Comment by Knight Lee (Max Lee) on Scanless Whole Brain Emulation · 2025-01-29T23:32:28.422Z · LW · GW

Thank you, these are very thoughtful points and concerns.

You're right that a general AI pause on everything AI wouldn't be wise. My view is that most (but not all) people talking about an AI pause only refer to pausing general-purpose LLMs above a certain level of capability, e.g. o1 or o3. I should have clarified what I meant by "AI pause."

I agree that companies which want to be profitable, should focus on medical products rather than such a moonshot. The idea I wrote here is definitely not an investor pitch, it's more of an idea for discussion similar to the FHI's discussion on Whole Brain Emulation.

AI safety implications

Yes, building any superintelligence is inherently dangerous. But not all superintelligences are equally dangerous!

No self modifications

In the beginning, the simulated humans should not do any self modifications at all, and just work like a bunch of normal human researchers (e.g. on AI alignment, or aligning the smarter versions of themselves). The benefit is that the smartest researchers can be cloned many times, and they might think many times faster.

Gradual self modifications

The simulated humans can modify a single volunteer to become slightly smarter, while other individuals monitor her. The single modified volunteer might describe her ideal world in detail, and may be subject to a lie detector which actually works.

Why modified humans are still safer than LLMs

The main source of danger is not a superintelligence which kills or harms people out of "hatred" or "disgust" or any human-like emotion. Instead, the main source of extinction is a superintelligence which assigns absolutely zero weight to everything humans cherish, and converts the entire universe into paperclips or whatever its goal is. It does not even spare a tiny fraction for humans to live in.

The only reason to expect LLMs to be safe in any way, is that they model human thinking, and are somewhat human-like. But they are obviously far less human-like than actual simulated humans.

A group of simulated humans who modify themselves to become smarter can definitely screw up at some stage, and end up as a bad superintelligence which assigns exactly zero weight to everything humans cherish. The path may be long and treacherous, and success is by no means guaranteed.

However, it is still much more hopeful than having a bunch of o3-like AIs which undergo progressively more and more reinforcement learning towards rewards such as "solve this programming challenge" or "prove this math equation," until there is so much reinforcement learning that their thoughts no longer resemble the pretrained LLMs they started off as (which were at least trying to model human thinking).

"You can tell the RL is done properly when the models cease to speak English in their chain of thought"

-Andrej Karpathy

Since the pretrained LLMs can never truly exceed human level intelligence, the only way for reinforcement learning to create an AI with far higher intelligence, may be to steer them far from how humans think.

These strange beings are optimized to solve these problems, and occasionally optimized to give lip service to human values. But what their resulting end goals will be is deeply uncertain; they could easily be very far from human values.

Reinforcement learning (e.g. for solving math problems or giving lip service to human values) only controls their behavior/thoughts, not their goals. Their goals, for all we know, are essentially random lotteries, with a mere tendency to resemble human goals (since they started off as LLMs). The tendency gets weaker and weaker as more reinforcement learning is done, and you can only blindly guess if the tendency will remain strong enough to save us.

Comment by Knight Lee (Max Lee) on Six Thoughts on AI Safety · 2025-01-27T23:00:08.924Z · LW · GW

I see, thank you for the info!

I don't actually know about DeepSeek V3, I just felt "if I pointed out the $6 million claim in my argument, I shouldn't hide the fact I watched a video which made me doubt it."

I wanted to include the video as a caveat just in case the $6 million was wrong.

Your explanation suggests the $6 million is still in the ballpark (for the final training run), so the concerns about a "software only singularity" are still very realistic.

Comment by Knight Lee (Max Lee) on Six Thoughts on AI Safety · 2025-01-27T02:33:15.326Z · LW · GW

I agree, a lot of outcomes are possible and there's no reason to think only fast takeoffs are dangerous+likely.

Also I went too far saying that it "needs only tiny amounts of compute to reach superintelligence" without caveats. The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.

Comment by Max Lee on [deleted post] 2025-01-27T00:49:37.498Z

That, is the big question!

It's not a 100.0% guarantee, but the same goes for most diplomatic promises (especially when one administration of a country makes a promise on behalf of future administrations). Yet diplomacy still works much better than nothing!

It may implicitly be a promise to try really really really hard to prevent the other race participants from regretting it, rather than a promise to algorithmically guarantee it above all else. A lot of promises in real life are like that, e.g. when you promise your fiancé(e) you'll always love him/her.

Hopefully this question can be discussed in greater depth.

PS:

Promises made by AI researchers and AI labs help reduce the race within a country (e.g. the US). Reducing the race between countries is best done by promises from government leaders.

But even these leaders are far more likely to promise, if the promise has already been normalized by people below them—especially people in AI labs. Even if government leaders don't make the promises, the AI labs' promises could still meaningfully influence the AI labs in other countries.

Comment by Knight Lee (Max Lee) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T01:48:17.984Z · LW · GW

Instrumental goal competition explains the cake

My worry is that instrumental subgoals are not automatically safer in themselves; they seem safer because higher goals (which generate instrumental subgoals) tend to generate multiple instrumental subgoals, none of which is important enough to steamroll the others. This seems to explain the cake example.

If you want instrumental goals all the way up, it means you want to repeatedly convert the highest goal into an instrumental subgoal of an even higher goal, which in turn will generate many other instrumental subgoals to compete with it for importance.

I'm not sure, but it looks like the only reason this should work is if the AGI/ASI has so many competing goals that being good to humans has some weight. This is similar to Multi-Objective Homeostasis.

Goal Reductionism

I guess another way this may work is if the AGI/ASI itself isn't sure why it's doing something: we can teach it to think that its behaviours are the instrumental subgoals of some higher purpose, which it itself can't be sure about.

This is related to Goal Reductionism.

I feel that Self-Other Overlap: A Neglected Approach to AI Alignment also fits the theme of the chef and restaurant example, and may help with Goal Reductionism.

Comment by Knight Lee (Max Lee) on Six Thoughts on AI Safety · 2025-01-25T21:51:17.491Z · LW · GW

I think there's a spectrum of belief regarding AGI power and danger.

There are people optimistic about AGI (but worried about bad human users):

They often think the "good AGI" will keep the "bad AGI" in check. I really disagree with that because

  • The "population of AGI" is nothing like the population of humans; it is far more homogeneous, because the most powerful AGI can just copy itself until it takes over most of the compute. If we fail to align them, different AGIs will end up misaligned for the same reason.
  • Eric Drexler envisions humans equipped with AI services acting as the good AGI. But having a human control enough decisions to ensure alignment will slow things down.
  • If the first ASI is bad, it may build replicating machines/nanobots.

There are people who worry about slow takeoff risks:

They are worried about "Von Neumann level AGI," which poses a threat to humanity because it could build mirror bacteria and threaten humanity into following its will. The belief is that a war between it and humanity would be drawn out and uncertain, and there may be negotiations.

They may imagine good AGI and bad AGI existing at the same time, but aren't sure the good ones will win. Dan Hendrycks' view is that the AGI will start off aligned, but humanity may become economically dependent on it and fall for its propaganda until it evolves into misalignment.

Finally, there are people who worry about fast takeoff risks:

They believe that Von Neumann level AGI will not pose much direct risk, but it will be better than humans at AI research (imagine a million AI researchers), and will recursively self-improve to superintelligence.

The idea is that AI research powered by the AIs themselves will be limited by the speed of computers, not the speed of human neurons, so its pace might not be completely dissimilar to the pace of human research. Truly optimal AI research probably needs only tiny amounts of compute to reach superintelligence. DeepSeek's cutting-edge AI supposedly took only $6 million, while four US companies spent around $210 billion on infrastructure (mostly for AI).

Superintelligence will not need to threaten humans with bioweapons or fight a protracted war. Once it actually escapes, it will defeat humanity with absolute ease. It can build self replicating nanofactories which grow as fast as bacteria and fungi, and which form body plans as sophisticated as animals.

Soon after it builds physical machines, it expands across the universe as close to the speed of light as physically possible.

These people worry about the first AGI/ASI being misaligned, but don't worry about the second one as much because the first one would have already destroyed the world or saved the world permanently.

I consider myself split between the second group and third group.

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T06:40:25.685Z · LW · GW

That is a very clever idea!

It's very similar in shape to MONA (and my version of it), but the incentive structure is different. MONA and my version of it use myopia on the generator model, so it only wants to get approval from a trusted but weaker evaluator model. (My version adjusts the myopia by adjusting the evaluator model's optimization vs. alignment.)

Your Incremental Steering idea rewards the stronger model for the final answer, but in such a way that it always assumes a trusted weaker model will take over after the next step in its chain of thought. It is constrained optimization, and it improves CoT faithfulness rather than enforcing myopia.

I really like your idea, because it should improve CoT faithfulness by a lot!

While it makes reward hacking more detectable, I'm not sure it reduces reward hacking the way MONA does. The frozen planner (weaker model) needs to reject unethical/reward-hacking subgoals very reliably, otherwise the optimized planner (stronger model) will invest a fraction of the CoT in steering it toward bad subgoals. Once the optimized planner is on its own, it will easily steer itself to bad subgoals (it follows the bad subgoal despite spending most of its time on the aligned subgoal).
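
Here is a minimal sketch of how I understand the two reward structures (my own illustration, not code from the MONA paper or from your proposal; `strong_model`, `weak_model`, `env`, and all their methods are hypothetical stand-ins):

```python
# Two ways of rewarding the stronger model for a chain-of-thought step.
# Everything here is a hypothetical stand-in, not a real API.

def mona_style_reward(strong_model, weak_model, state):
    """MONA-style myopia: the stronger model's step is scored only by the
    trusted weaker model's immediate approval; no credit flows back from
    the final outcome."""
    step = strong_model.propose_step(state)
    return weak_model.approval(state, step)

def incremental_steering_reward(strong_model, weak_model, state, env):
    """Incremental-steering-style: the stronger model writes one more CoT step,
    then the frozen weaker model finishes the episode on its own; the stronger
    model is rewarded for the final outcome of that hand-off."""
    step = strong_model.propose_step(state)
    final_answer = weak_model.run_to_completion(state + [step])  # state is a list of CoT steps
    return env.score(final_answer)  # outcome reward, routed through the weak model
```

The reward-hacking worry above is about the second scheme: if `run_to_completion` can be nudged toward a bad subgoal by a carefully worded step, outcome reward still leaks through.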

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T02:43:51.461Z · LW · GW

Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.

If additional alignment techniques act on both the generator and the evaluator, they may reduce the capabilities too much due to their alignment tax. If they only act on the evaluator, the capabilities due to the generator's smart ideas stay, while the alignment due to the evaluator's aligned final decisions controls the whole agent.

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T01:22:17.031Z · LW · GW

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

 

Oops, I should have been clearer: 1.5 and 2.5 are not important parts of my version; the important part is updating the weaker model.

In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the strong model until the strong model's advice maximizes its own reward, and it will follow that advice exactly.

You are completely correct, that whatever reason we trust the weaker model is "doing a lot of the work."

However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more "human-like," while the model that generates ideas is more optimized and more "alien-like," then the resulting ideas the agent actually follows will resemble ideas humans will look at and say "wow that is brilliant, I could've never thought of that, and it works!" rather than ideas humans will look at and say "what the heck is that? Huh, it works?! I could've never predicted that it would work."

Furthermore, the "values" of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more "human-like."

Given the same level of capability, it is safer.

These advantages are "automatic": they only require the evaluator model to have relatively less RL than the generator model. If you go one step further and use other alignment technologies with an efficient alignment tax on the evaluator model, it can get better, since the evaluator model gets even smarter while staying aligned.

Pure MONA is a special case of this generalized version, where the evaluator model has exactly zero RL (at least for the context where it is advised by a stronger model). It is like adjusting everything to the state of maximum safety and maximum alignment tax.
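
To make the dial concrete, here is a minimal sketch of the training-loop shape I have in mind (my own illustration with hypothetical stand-ins, not an implementation of MONA):

```python
# The "dial" between pure MONA and my version: how much RL the evaluator gets.
# All models, environments, and methods here are hypothetical stand-ins.

EVALUATOR_RL_STEPS = 0        # pure MONA: the evaluator receives no RL at all
# EVALUATOR_RL_STEPS = 1_000  # my version: a little RL to make the evaluator more capable
GENERATOR_RL_STEPS = 100_000  # the idea-generating model is optimized heavily

def train(generator, evaluator, env):
    # Optionally give the evaluator a limited amount of outcome-based RL.
    for _ in range(EVALUATOR_RL_STEPS):
        state = env.reset()
        action = evaluator.act(state)
        evaluator.rl_update(env.outcome_reward(state, action))

    # Heavily optimize the generator, but only on the evaluator's approval:
    # the outcome reward never reaches the generator directly.
    for _ in range(GENERATOR_RL_STEPS):
        state = env.reset()
        idea = generator.propose(state)
        generator.rl_update(evaluator.approval(state, idea))
```

Setting `EVALUATOR_RL_STEPS` to zero recovers (roughly) the pure-MONA end of the dial: maximum safety and maximum alignment tax; raising it trades some of that safety for capability.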

With pure MONA, it is probably even safer given the level of capability, but... can it reach the same level of capability?

Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:

  • For tasks where evaluating good ideas/advice is obvious, pure MONA might work just as well as the best RL models (e.g. o3).
  • For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed. MONA's generator model may be too sycophantic.
  • For tasks where evaluating good ideas/advice is even harder than generating them, then both MONA and my idea are worthless because the generator model isn't helping, and the weaker evaluator model is essentially working on its own.

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-24T21:38:31.662Z · LW · GW

Wait, can we try to imagine how point 1 happens?

If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless since every output will get maximum reward.

Remember that the stronger model doesn't care at all about the weaker model's success, so if the weaker model rewards it for incomprehensible ideas, the strong model won't give it incomprehensible ideas which actually work, but incomprehensible ideas full of impressive-sounding nonsense.

The weaker model can only reward the stronger model for ideas it can understand, i.e. ideas it can predict will get a reward.

The crux is that the final reward signal only leaks into the stronger model through the weaker model's prediction of that reward signal, mixed with the weaker model's "values." This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).
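
Concretely, something like this (my own sketch; `predict_reward`, `value_score`, and the blending weight are hypothetical):

```python
# The only reward the strong model ever sees is the weak model's own estimate of
# the final reward, blended with the weak model's value judgment.

def reward_for_strong_model(weak_model, idea, blend=0.5):
    predicted_outcome_reward = weak_model.predict_reward(idea)  # the weak model's best guess
    value_judgement = weak_model.value_score(idea)              # the weak model's "values"
    return blend * predicted_outcome_reward + (1 - blend) * value_judgement
```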

It should have the same alignment tax as MONA, since both my version and MONA have some kind of weaker model set the stronger model's reward signal. MONA's agent directly uses the stronger model's ideas, while my version lets the weaker model change the stronger model's ideas (e.g. refuse unethical decisions, ask follow-up questions); but the weaker model doesn't have to change them: it's allowed to do the same thing MONA does.

The alignment tax is relatively lower thanks to the RL on the weaker model, though care must be taken to keep the weaker model trusted.

Do you agree?