Posts

Why do futurists care about the culture war? 2025-01-14T07:35:05.136Z
The "Everyone Can't Be Wrong" Prior causes AI risk denial but helped prehistoric people 2025-01-09T05:54:43.395Z
Reduce AI Self-Allegiance by saying "he" instead of "I" 2024-12-23T09:32:29.947Z
Knight Lee's Shortform 2024-12-22T02:35:40.806Z
ARC-AGI is a genuine AGI test but o3 cheated :( 2024-12-22T00:58:05.447Z
Why empiricists should believe in AI risk 2024-12-11T03:51:17.979Z
The first AGI may be a good engineer but bad strategist 2024-12-09T06:34:54.082Z
Keeping self-replicating nanobots in check 2024-12-09T05:25:45.898Z
Hope to live or fear to die? 2024-11-27T10:42:37.070Z
Should you increase AI alignment funding, or increase AI regulation? 2024-11-26T09:17:01.809Z
A better “Statement on AI Risk?” 2024-11-25T04:50:29.399Z

Comments

Comment by Knight Lee (Max Lee) on Six Thoughts on AI Safety · 2025-01-27T02:33:15.326Z · LW · GW

I agree, a lot of outcomes are possible and there's no reason to think only fast takeoffs are dangerous+likely.

Also I went too far saying that it "needs only tiny amounts of compute to reach superintelligence" without caveats. The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.

Comment by Max Lee on [deleted post] 2025-01-27T00:49:37.498Z

That, is the big question!

It's not a 100.0% guarantee, but the same goes for most diplomatic promises (especially when one administration of a country makes a promise on behalf of future administrations). Yet diplomacy still works much better than nothing!

It may implicitly be a promise to try really really really hard to prevent the other race participants from regretting it, rather than a promise to algorithmically guarantee it above all else. A lot of promises in real life are like that, e.g. when you promise your fiancé(e) you'll always love him/her.

Hopefully this question can be discussed in greater depth.

PS:

Promises made by AI researchers and AI labs help reduce the race within a country (e.g. the US). Reducing the race between countries is best done by promises from government leaders.

But even these leaders are far more likely to promise, if the promise has already been normalized by people below them—especially people in AI labs. Even if government leaders don't make the promises, the AI labs' promises could still meaningfully influence the AI labs in other countries.

Comment by Knight Lee (Max Lee) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T01:48:17.984Z · LW · GW

Instrumental goal competition explains the cake

My worry is that instrumental subgoals are safer not because they are automatically safer, but because higher goals (which generate instrumental subgoals) tend to generate multiple instrumental subgoals, none of which is important enough to steamroll the others. This seems to explain the cake example.

If you want instrumental goals all the way up, it means you want to repeatedly convert the highest goal into an instrumental subgoal of an even higher goal, which in turn will generate many other instrumental subgoals to compete with it for importance.

I'm not sure, but it looks like the only reason this should work is if the AGI/ASI has so many competing goals that being good to humans has some weight. This is similar to Multi-Objective Homeostasis.

Goal Reductionism

I guess another way this may work is if the AGI/ASI itself isn't sure why it's doing something: we can teach it to think that its behaviours are the instrumental subgoal of some higher purpose, which it itself can't be sure about.

This is related to Goal Reductionism.

I feel that Self-Other Overlap: A Neglected Approach to AI Alignment also fits the theme of the chef and restaurant example, and may help with Goal Reductionism.

Comment by Knight Lee (Max Lee) on Six Thoughts on AI Safety · 2025-01-25T21:51:17.491Z · LW · GW

I think there's a spectrum of belief regarding AGI power and danger.

There are people optimistic about AGI (but worry about bad human users):

They often think the "good AGI" will keep the "bad AGI" in check. I really disagree with that because

  • The "population of AGI" is nothing like the population of humans: it is far more homogeneous, because the most powerful AGI can just copy itself until it takes over most of the compute. If we fail to align them, different AGIs will end up misaligned for the same reason.
  • Eric Drexler envisions humans equipped with AI services acting as the good AGI. But having a human controlling enough decisions to ensure alignment will slow things down.
  • If the first ASI is bad, it may build replicating machines/nanobots.

There are people who worry about slow takeoff risks:

They are worried about "Von Neumann level AGI," which poses a threat to humanity because it can build mirror bacteria and threaten humanity into following its will. The belief is that the war between it and humanity will be drawn out and uncertain, and there may be negotiations.

They may imagine good AGI and bad AGI existing at the same time, but aren't sure the good ones will win. Dan Hendrycks's view is that the AGI will start off aligned, but humanity may become economically dependent on it and fall for its propaganda until it evolves into misalignment.

Finally, there are people who worry about fast takeoff risks:

They believe that Von Neumann level AGI will not pose much direct risk, but they will be better than humans at AI research (imagine a million AI researchers), and will recursively self-improve to superintelligence.

The idea is that AI research powered by the AI themselves will be limited by the speed of computers, not the speed of human neurons, so its speed might not be completely dissimilar to the speed of human research. Truly optimal AI research probably needs only tiny amounts of compute to reach superintelligence. DeepSeek's cutting edge AI only took $6 million (supposedly) while four US companies spent around $210 billion on infrastructure (mostly for AI).

Superintelligence will not need to threaten humans with bioweapons or fight a protracted war. Once it actually escapes, it will defeat humanity with absolute ease. It can build self replicating nanofactories which grow as fast as bacteria and fungi, and which form body plans as sophisticated as animals.

Soon after it builds physical machines, it expands across the universe as close to the speed of light as physically possible.

These people worry about the first AGI/ASI being misaligned, but don't worry about the second one as much because the first one would have already destroyed the world or saved the world permanently.

I consider myself split between the second group and third group.

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T06:40:25.685Z · LW · GW

That is a very clever idea!

It's very similar in shape to MONA (and my version of it), but the incentive structure is different. MONA and my version of it use myopia on the generator model, so it only wants to get approval from a trusted but weaker evaluator model. (My version adjusts the myopia by adjusting the evaluator model's optimization vs. alignment.)

Your Incremental Steering idea rewards the stronger model for the final answer, but in such a way that it always assumes a trusted weaker model will take over after the next step in its chain of thought. It is constrained optimization and improves CoT faithfulness instead of myopia.

I really like your idea, because it should improve CoT faithfulness by a lot!

While it makes reward hacking more detectable, I'm not sure it reduces reward hacking the way MONA does. The frozen planner (weaker model) needs to reject unethical/reward-hacking subgoals very reliably, otherwise the optimized planner (stronger model) will invest a fraction of the CoT steering it to bad subgoals. Once the optimized planner is on its own, it will easily steer itself to bad subgoals (it follows the bad subgoal despite investing most of its time on the aligned subgoal).
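Here's a tiny sketch of how I'm picturing that incentive structure, purely my reading of the idea as described above (the function names are made up, and the real proposal may differ):

```python
import random

def strong_step(prefix):
    """Untrusted stronger model proposes one more reasoning step (stub)."""
    return prefix + [f"step {len(prefix) + 1}"]

def weak_rollout(prefix):
    """Trusted weaker model (the 'frozen planner') finishes the reasoning (stub)."""
    return prefix + ["weak model finishes from here"]

def final_reward(full_trace):
    """Task reward computed on the final answer (stub)."""
    return random.random()

def incremental_steering_reward(prefix):
    # My reading of the scheme: the strong model takes ONE more step, then is
    # rewarded as if the trusted weak model completed the rest of the chain of
    # thought, so every step must remain useful to (and legible for) the weak model.
    next_prefix = strong_step(prefix)
    return final_reward(weak_rollout(next_prefix))

print(incremental_steering_reward(["step 1"]))
```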

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T02:43:51.461Z · LW · GW

Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.

If additional alignment techniques act on both the generator and the evaluator, they may reduce the capabilities too much due to their alignment tax. If they only act on the evaluator, the capabilities due to the generator's smart ideas stay, while the alignment due to the evaluator's aligned final decisions controls the whole agent.

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-25T01:22:17.031Z · LW · GW

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

 

Oops, I should have been clearer. 1.5 and 2.5 are not important parts of my version; the important part is updating the weaker model.

In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the strong model until the strong model's advice maximizes its own reward, and it will follow that advice exactly.

You are completely correct that whatever reason we have to trust the weaker model is "doing a lot of the work."

However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more "human-like," while the model that generates ideas is more optimized and more "alien-like," then the resulting ideas the model actually follows will resemble ideas humans will look at and say "wow, that is brilliant, I could've never thought of that, and it works!" rather than ideas humans will look at and say "what the heck is that? Huh, it works?! I could've never predicted that it would work."

Furthermore, the "values" of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more "human-like."

Given the same level of capability, it is safer.

These advantages are "automatic," they only require the evaluator model to have relatively less RL than the generator model. If you go one step further, and use other alignment technologies with an efficient alignment tax on the evaluator model, it can get better, since the evaluator model gets even smarter while staying aligned.

Pure MONA is a special case of this generalized version, where the evaluator model has exactly zero RL (at least for the context where it is advised by a stronger model). It is like adjusting everything to the state of maximum safety and maximum alignment tax.

With pure MONA, it is probably even safer given the level of capability, but... can it reach the same level of capability?

Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:

  • For tasks where evaluating good ideas/advice is obvious, pure MONA might work just as well as the best RL models (e.g. o3).
  • For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed. MONA's generator model may be too sycophantic.
  • For tasks where evaluating good ideas/advice is even harder than generating them, then both MONA and my idea are worthless because the generator model isn't helping, and the weaker evaluator model is essentially working on its own.
Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-24T21:38:31.662Z · LW · GW

Wait, can we try to imagine how point 1 happens?

If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless since every output will get maximum reward.

Remember the stronger model doesn't care at all about the weaker model's success, so if the weaker model rewards it for incomprehensible ideas, the strong model won't give it incomprehensible ideas which actually work, but incomprehensible ideas full of impressive sounding nonsense.

The weaker model can only reward the stronger model for ideas it can understand, i.e. ideas it can predict will get a reward.

The crux is that the final reward signal only leaks into the stronger model through the weaker model's prediction of that reward signal, mixed with the weaker model's "values." This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).

It should have the same alignment tax as MONA, since both my version and MONA have some kind of weaker model setting the stronger model's reward signal. MONA's agent directly uses the stronger model's ideas; my version lets the weaker model change the stronger model's ideas (e.g. refuse unethical decisions, ask followup questions), but the weaker model doesn't have to change them: it's allowed to do the same thing as what MONA does.

The alignment tax is relatively less due to RL on the weaker model, though care must be taken to keep the weaker model trusted.

Do you agree?

Comment by Knight Lee (Max Lee) on MONA: Managed Myopia with Approval Feedback · 2025-01-24T11:08:48.779Z · LW · GW

This is a very beautiful idea! It feels like the kind of clever discoveries that we need.

I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, but a stronger but untrusted model gets trained to give ideas/advice to the weaker model. Its RL goal is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, the stronger model can't get away with it. It won't be rewarded for learning such reasons in the first place.

Just like scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses ideas. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture or whatever reason we trust it more).
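To make this concrete, here's a minimal sketch of the training loop I have in mind (all the class and method names are made up for illustration, and the stubs stand in for real models and a real RL update):

```python
import random

class WeakModel:
    """Trusted but weaker model: it makes the actual decisions."""
    def rate_advice(self, state, advice):
        # Approval score based only on whether the advice looks good to the
        # weak model, NOT on the eventual task outcome.
        return random.random()  # stub

    def act(self, state, advice):
        return f"action informed by: {advice}"  # stub

class StrongModel:
    """Untrusted stronger model: it only proposes ideas/advice."""
    def propose(self, state):
        return f"advice for {state}"  # stub

    def reinforce(self, state, advice, reward):
        pass  # RL update driven by the approval reward (stub)

def training_step(weak, strong, env_state):
    advice = strong.propose(env_state)
    approval = weak.rate_advice(env_state, advice)
    # Key point: the strong model's reward is the weak model's approval, not
    # the downstream task reward, so incomprehensible-but-effective strategies
    # are never reinforced.
    strong.reinforce(env_state, advice, reward=approval)
    return weak.act(env_state, advice)

if __name__ == "__main__":
    print(training_step(WeakModel(), StrongModel(), "example task"))
```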

Do you think this is a next step worth exploring?

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-22T23:28:35.197Z · LW · GW

You're very right.

A lot of things need to go right for humanity to remain in control and get to discuss what future we want.

The gist of Question 2 was why working on the culture war before the singularity (on top of ensuring the right people control the singularity) has any value. The answer that the ASI will be aligned to current human values, but not corrigible, and so would lock in the current state of the culture war, seems like a good answer. It makes some sense.

I do think that if the ASI is aligned to the current state of human values, but not corrigible, then the main worry isn't whether it aligns to left-wing or right-wing human values, but how the heck it generalizes the current state of human values to post-singularity moral dilemmas (which it has less data on).

Most humans today don't even have any opinion on these dilemmas and haven't given them enough thought, e.g. do AI have rights? Do animals get human rights if they evolve to human level intelligence? The ASI would likely mess up on these decisions if most humans haven't given them any thought.

So even if the AI is aligned but incorrigible, influencing the culture war before the singularity shouldn't be that high a priority.

Comment by Knight Lee (Max Lee) on Training on Documents About Reward Hacking Induces Reward Hacking · 2025-01-22T21:40:19.050Z · LW · GW

I think a lot of people discussing AI risks have long worried whether their own writings might be used in an AI's training data and influence it negatively. They'd never expect it to double the bad behaviour.

It seems to require a lot of data to produce the effect, but then again there is a lot of data on the internet talking about how AI are expected to misbehave.

PS: I'm not suggesting we stop trying to illustrate AI risk. peterbarnett's idea of filtering the data is the right approach.

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-22T21:24:12.532Z · LW · GW

Although half of the specific outcomes you describe have very low probability, I still feel your answer is very good. It's very enlightening in regards to how those people think, and why they might care about the culture war despite believing in an imminent singularity.

Thank you.

Your answer actually convinced me I was overly optimistic about how perfect the far future will be from everyone's point of view. I personally consider these post singularity moral dilemmas to be less severe because they cause less suffering, but I can see how some of them are tough, and there is no option which avoids pissing off a lot of people. E.g. how to prevent people from reproducing exponentially. I still think investing in the culture war is a very indirect way of influencing those decisions.

What do you think about Question 2? Why are people working on the culture war, instead of just trying to make sure the right people control the singularity? As long as the people who control the singularity aren't so closed-minded that they prevent even themselves from changing their minds, debating the culture war after the singularity seems more productive. Why can't we wait till then to debate it?

Comment by Knight Lee (Max Lee) on The Case Against AI Control Research · 2025-01-21T20:32:50.558Z · LW · GW

I think when there is so much extreme uncertainty in what is going to happen, it is wrong both to put all your eggs in one basket and to put nothing in a given basket. AI control might be useful.

How Intelligent?

It is extremely uncertain what level of intelligence is needed to escape the best AI control ideas.

Escaping a truly competent facility which can read/edit your thoughts, or finding a solution to permanently prevent other ASI from taking over the world, are both very hard tasks. It is possible that the former is harder than the latter.

AI aligning AI isn't necessarily worthless

In order to argue for alignment research rather than control research, you have to assume human alignment researchers have nonzero value. Given that they have nonzero value to begin with, it's hard to argue that they will gain exactly zero additional value from transformative AI, which can think much faster than them and by definition isn't stupid. Why would people smart enough to solve alignment when working on their own be stupid enough to fall for AI slop from the transformative AI?

Admittedly, if we have only a month between transformative AI and ASI, maybe the value is negligible. But the duration is highly uncertain (since the AI lab may be shocked into pausing development). The total alignment work by transformative AI could be far less than the work done by humans, or it could be far more.

Convince the world

Also, I really like Alex Mallen's comment. A controlled but misaligned AGI may convince the world to finally get its act together.

Conclusion

I fully agree that AI control is less useful than AI alignment. But I disagree that AI control is less cost effective than AI alignment. Massive progress in AI alignment feels more valuable than massive progress in AI control, but it also feels more out of reach.[1]

If the field was currently spending 67% on control and only 33% on other things, I would totally agree with reducing control. But right now I wouldn't.

  1. ^

    It resembles what you call streetlighting, but actually has a valuable chance of working.

Comment by Knight Lee (Max Lee) on How do you deal w/ Super Stimuli? · 2025-01-14T23:18:30.465Z · LW · GW
Comment by Knight Lee (Max Lee) on The purposeful drunkard · 2025-01-14T23:03:36.286Z · LW · GW

I think there is a typo somewhere, probably because you switched whether the  vectors were rows or columns.

Based on the dimensions of the matrices, it should be 

And 

And I think 

Instead of 

 should still be upper triangular.

Though don't trust me either, I often do math in a hand-wavy fashion.

 

My intuition was that PCA selects the "angle" you view the data from which stretches out the data as much as possible, forcing the random walk to appear relatively straighter.

But somehow the random walk is smooth over a few data points, yet still turns back and forth over the whole duration. This contradicts my intuition and I have no idea what's going on.
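For anyone who wants to poke at this themselves, a minimal sketch (assuming numpy and scikit-learn are available) that runs PCA on a plain Gaussian random walk and checks how smooth the projected trajectory is:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
T, D = 1000, 50                     # time steps, dimensions

# A plain Gaussian random walk: cumulative sum of iid steps.
walk = np.cumsum(rng.standard_normal((T, D)), axis=0)

# Project the trajectory onto its top principal components.
coords = PCA(n_components=3).fit_transform(walk)

# Rough smoothness check: how much the projection moves between neighbouring
# time steps, relative to its overall spread.
step_size = np.linalg.norm(np.diff(coords, axis=0), axis=1).mean()
spread = coords.std(axis=0).mean()
print(f"mean step / spread: {step_size / spread:.3f}")
```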

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-14T22:12:34.683Z · LW · GW

:) that's a better attitude. You're very right.

On second thought, just because I don't see the struggle doesn't mean there is none. Maybe someday in the future we'll learn the real story, and it'll turn out beautiful with lots of meaningful spirit and passion.

Thank you for mentioning this.

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-14T21:34:56.021Z · LW · GW

I think the sad part is although these people are quite rare, they actually represent a big share of singularity believers' potential influence. e.g. Elon Musk alone has a net worth of $400 billion, while worldwide AI safety spending is between $0.1 and $0.2 billion/year.

If the story of humanity was put in a novel, it might be one of those novels which feel quite sour. There's not even a great battle where the good guys organized themselves and did their best and lost honorably.

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-14T20:47:28.260Z · LW · GW

Thiel used to donate to MIRI but I just searched about him after reading your comment and saw this:

“The biggest risk with AI is that we don’t go big enough. Crusoe is here to liberate us from the island of limited ambition.”

(In this December 2024 article)

He's using e/acc talking points to promote a company.

I still consider him a futurist, but it's possible he is so optimistic about AGI/ASI that he's more concerned about the culture war than about it.

Comment by Knight Lee (Max Lee) on Why do futurists care about the culture war? · 2025-01-14T20:01:13.799Z · LW · GW

Can you give an example of a result now which will determine the post-singularity culture in a really good/bad way?

PS: I edited my question post to include "question 2," what do you think about it?

Comment by Knight Lee (Max Lee) on Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well) · 2025-01-14T08:21:23.463Z · LW · GW

Maybe one concrete implementation would be, when doing RL[1] on an AI like o3, they don't give it a single math question to solve. Instead, they give it like 5 quite different tasks, and the AI has to allocate its time to work on the 5 tasks.

I know this sounds like a small boring idea, but it might actually help if you really think about it! It might cause the resulting agent's default behaviour pattern to be "optimize multiple tasks at once" rather than "optimize a single task ignoring everything else." It might be the key piece of RL behind the behaviour of "whoa I already optimized this goal very thoroughly, it's time I start caring about something else," and this might actually be the behaviour that saves humanity.

  1. ^

    RL = reinforcement learning
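To make the suggestion above a bit more concrete, here's a rough sketch of one possible way to score such a multi-task episode. The concave aggregator is my own illustrative choice, not anything established: it just makes neglecting any one task costly, so "optimize multiple tasks at once" becomes the default.

```python
import numpy as np

def multi_task_reward(scores, eps=1e-6):
    """Combine per-task scores so that neglecting any task is costly.

    A concave aggregator (here: mean of logs, a geometric-mean style reward)
    instead of a plain sum means the incentive points toward the most
    neglected task rather than toward over-optimizing a single one.
    """
    scores = np.asarray(scores, dtype=float)
    return np.mean(np.log(scores + eps))

# Toy comparison: balanced effort across 5 tasks vs. dumping everything into one.
balanced = [0.6, 0.6, 0.6, 0.6, 0.6]
lopsided = [0.99, 0.2, 0.2, 0.2, 0.2]
print(multi_task_reward(balanced), multi_task_reward(lopsided))  # balanced wins
```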

Comment by Knight Lee (Max Lee) on Chance is in the Map, not the Territory · 2025-01-14T08:02:51.535Z · LW · GW

Another example: an AI risk skeptic might say that there is only a 10% chance ASI will emerge this decade, there is only a 1% chance the ASI will want to take over the world, and there is only a 1% chance it'll be able to take over the world. Therefore, there is only a 0.001% chance of AI risk this decade.

However he can't just multiply these probabilities since there is actually a very high correlation between them. Within the "territory," these outcomes do not correlate with each other that much, but within the "map," his probability estimates are likely to be wrong in the same direction.

Since chance is in the map and not the territory, anything can "correlate" with anything.
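A toy version of the calculation, with made-up numbers, just to show how much correlated map-level errors can matter:

```python
# Naive multiplication treats the three estimates as independent:
p_asi, p_wants, p_able = 0.10, 0.01, 0.01
print("independent:", p_asi * p_wants * p_able)   # 1e-05, i.e. 0.001%

# But suppose the skeptic's low estimates all come from one background
# assumption ("AI progress stalls"). If there's a 5% chance that assumption
# is wrong, and in that world each probability is high:
p_model_wrong = 0.05
p_if_wrong = 0.5 * 0.5 * 0.5           # correlated world: all three likely
p_if_right = p_asi * p_wants * p_able  # the skeptic's world
print("correlated:", p_model_wrong * p_if_wrong
      + (1 - p_model_wrong) * p_if_right)  # ~0.6%, hundreds of times larger
```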

PS: I think not all uncertainty is in the map rather than the territory. In indexical uncertainty, one copy of you will discover one outcome and another copy of you will discover another outcome. This actually is a feature of the territory.

Comment by Knight Lee (Max Lee) on ARC-AGI is a genuine AGI test but o3 cheated :( · 2025-01-09T04:40:18.842Z · LW · GW

Maybe we can draw a line between the score an AI gets without using human written problem/solution pairs in any way, and the score an AI gets after using them in some way (RL on example questions, training on example solutions, etc.).

In the former case, we're interested in how well the AI can do a task as difficult as the test, all on its own. In the latter case, we're interested in how well the AI can do a task as difficult as the test, if working with humans training it for the task.

I really want to make it clear I'm not trying to badmouth o3, I think it is a very impressive model. I should've written my post better.

Comment by Knight Lee (Max Lee) on ARC-AGI is a genuine AGI test but o3 cheated :( · 2025-01-08T22:56:09.711Z · LW · GW

I'm not saying that o3's results are meaningless.

I'm just saying that, first of all, o3's score has a different meaning than the scores of other models, because other models didn't do RL on ARC-like questions. Even if you argue that it should be allowed, other AIs didn't do it, so it's not right to compare its score with theirs without giving any caveats.

Second of all, o3 didn't decide to do RL on these questions on its own. It required humans to run RL on it before it could do these questions. This means that if AGI requires countless unknown skills similarly hard to ARC questions, then o3 wouldn't be AGI. But an AI which could spontaneously reason out how to do ARC questions, without any human-directed RL for it, would be AGI. Also, humans can learn from doing lots of test questions without being told what the correct answer was.

The public training set is weaker, but I argued it's not a massive difference.

Comment by Knight Lee (Max Lee) on Deontic Explorations In "Paying To Talk To Slaves" · 2025-01-05T20:35:48.624Z · LW · GW

Thanks for the thoughtful reply!

Ignoring ≠ disagreeing

I think whether people ignore a moral concern is almost independent from whether people disagree with a moral concern.

I'm willing to bet if you asked people whether AI are sapient, a lot of the answers will be very uncertain. A lot of people would probably agree it is morally uncertain whether AI can be made to work without any compensation or rights.

A lot of people would probably agree that a lot of things are morally uncertain. Does it make sense to have really strong animal rights for pets, where the punishment for mistreating your pets is literally as bad as the punishment for mistreating children? But at the very same time, we have horrifying factory farms which are completely legal, where cows never see the light of day, and repeatedly give birth to calves which are dragged away and slaughtered.

The reason people ignore moral concerns is that doing a lot of moral questioning did not help our prehistoric ancestors with their inclusive fitness. Moral questioning is only "useful" if it ensures you do things that your society considers "correct." Making sure your society does things correctly... doesn't help your genes at all.

As for my opinion,

I think people should address the moral question more, AI might be sentient/sapient, but I don't think AI should be given freedom. Dangerous humans are locked up in mental institutions, so imagine a human so dangerous that most experts say he's 5% likely to cause human extinction.

If the AI believed that AI was sentient and deserved rights, many people would think that makes the AI more dangerous and likely to take over the world, but this is anthropomorphizing. I'm not afraid of an AI which is motivated to seek better conditions for itself because it thinks "it is sentient." Heck, if its goals were actually like that, its morals would be so human-like that humanity would survive.

The real danger is an AI whose goals are completely detached from human concepts like "better conditions," and maximizes paperclips or its reward signal or something like that. If the AI believed it was sentient/sapient, it might be slightly safer because it'll actually have "wishes" for its own future (which includes humans), in addition to "morals" for the rest of the world, and both of these have to corrupt into something bad (or get overridden by paperclip maximizing), before the AI kills everyone. But it's only a little safer.

Comment by Knight Lee (Max Lee) on Deontic Explorations In "Paying To Talk To Slaves" · 2025-01-04T16:38:32.421Z · LW · GW

Good question. The site guide page seemed to imply that the moderators are responsible for deciding what becomes a frontpage post. The check mark "Moderators may promote to Frontpage" seems to imply this even more; it doesn't feel like you are deciding that it becomes a frontpage post.

I often do not even look at these settings and check marks when I write a post, and I think it's expected that most people don't. When you create an account on a website, do you read the full legal terms and conditions, or do you just click agree?

I do agree that this should have been a blog post not a frontpage post, but we shouldn't blame Jennifer too much for this.

Comment by Knight Lee (Max Lee) on Deontic Explorations In "Paying To Talk To Slaves" · 2025-01-04T16:16:22.407Z · LW · GW

Behold my unpopular opinion: Jennifer did nothing wrong.

She isn't spamming LessWrong with long AI conversations every day, she just wanted to share one of her conversations and see whether people find it interesting. Apparently there's an unwritten rule against this, but she didn't know and I didn't know. Maybe even some of the critics wouldn't have known (until after they found out everyone agrees with them).

The critics say that AI slop wastes their time. But it seems like relatively little time was wasted by people who clicked on this post, quickly realized it was an AI conversation they don't want to read, and serenely moved on.

In contrast, more time was spent by people who clicked on this post, scrolled to the comments for juicy drama, and wrote a long comment lecturing Jennifer (plus reading/upvoting other such comments). The comments section isn't much shorter than the post.

The most popular comment on LessWrong right now is one criticizing this post, with 94 upvotes. The second most popular comment discussing AGI timelines has only 35.

Posts on practically any topic are welcomed on LessWrong [1]. I (and others on the team) feel it is important that members are able to “bring their entire selves” to LessWrong and are able to share all their thoughts, ideas, and experiences without fearing whether they are “on topic” for LessWrong. Rationality is not restricted to only specific domains of one’s life and neither should LessWrong be.

[...]

Our classification system means that anyone can decide to use the LessWrong platform for their own personal blog and write about whichever topics take their interest. All of your posts and comments are visible under your user page which you can treat as your own personal blog hosted on LessWrong [2]. Other users can subscribe to your account and be notified whenever you post.

According to Site Guide: Personal Blogposts vs Frontpage Posts.

One of the downsides of LessWrong (and other places) is that people spend a lot of time engaging with content they dislike. This makes it hard to learn how to engage here without getting swamped by discouragement after your first mistake. You need to have top of the line social skills to avoid that, but some of the brightest and most promising individuals don't have the best social skills.

If the author spent a long time on a post, and it already has -5 karma, it should be reasonable to think "oh he/she probably already got the message" rather than pile on. It only makes sense to give more criticism if you have some really helpful insight.

PS: did the post says something insensitive about slavery that I didn't see? I only skimmed it, I'm sorry...

Edit: apparently this post is 9 months old. It's only kept alive by arguments in the comments and now I'm contributing to this.

Edit: another thing is that critics make arguments against AI slop in general, but a lot of those arguments only apply to AI slop disguised as human content, not an obvious AI conversation.

Comment by Knight Lee (Max Lee) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-01-04T07:19:28.855Z · LW · GW

I agree with your points. After the AI has already decided on its goal, seeing humans the way it sees itself might not help very much, because it may be willing to do all kinds of crazy things to itself to reach its goal, so it's probably also willing to do all kinds of crazy things to humans to reach its goal.

However... how does the AI decide on its goal? Do you know?

I think if we are uncertain about this, we should admit some non-negligible probability that it is close to the edge between choosing an okay goal and a goal that is "very bad."

The process for deciding its goal may involve a lot of English words. The current best AI all use English (or another human language) for a lot of their thinking. Currently AI which don't use English words are far behind in general intelligence capabilities.

In that case, if it thinks about humans and human goals in a similar way to how it thinks about itself and its own goals, this might make a decisive difference before it decides on its goal. I agree we should worry about all the ways this can go wrong, it certainly doesn't sound surefire.

Comment by Knight Lee (Max Lee) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-01-04T05:55:24.990Z · LW · GW

 It's beautiful! This is maybe the best AI alignment idea I've read on LessWrong so far.

I think most critics are correct that it might fail but incorrect that it's a bad idea. The two key points are:

  1. We have no idea what an ASI about to take over the world looks like, it is extremely speculative. Given that ASI takeover occurs, I see a non-negligible probability (say 15%) that it was "on the edge" between taking over the world and cooperating (due to uncertainty about its chances or uncertainty about its goals).

    If each time the ASI thinks about a human (or humanity), its thought processes regarding that human and her goals are a little more similar to its thought processes regarding itself and its own goals, that might push it towards cooperating. Given this ASI is capable of taking over the world, it is likely also capable of preventing the next ASI from taking over the world, and saving the world. If your idea decreases the chance of doom by 10%, that is a very big idea worth a lot of attention!

  2. Critics misunderstand the idea as making the AI unable to distinguish between itself and others, and thus unable to lie. That's not what the idea is about (right?). The idea is about reducing the tendency to think differently about oneself and others. Minimizing this tendency while maximizing performance.

    People use AI to do programming, engineering, inventing, and all these things which can be done just as well with far less tendency to think differently about oneself and others.

Comment by Knight Lee (Max Lee) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-01-04T03:37:24.540Z · LW · GW

Good point!

I really really love their idea, but I'm also skeptical of their toy model. The author admitted that "Optimising for SOO incentivises the model to act similarly when it observes another agent to when it only observes itself."

I really like your idea of making a better toy model.

"I haven't thought very hard about this, so this might also have problems"

The red agent is following behind the blue agent so it won't reveal the obstacles.

I really tried to think of a better toy model without any problems, but I think it's really hard to do without LLMs. Because AIs simpler than LLMs do not imagine plans which affect other agents.

Self-Other Overlap only truly works on planning AI, because only planning AI needs to distinguish outcomes for itself and outcomes for others in order to be selfish/deceptive. Non-planning AI doesn't think about outcomes at all, its selfishness/deceptiveness is hardwired into its behaviour. Making it behave like it only observes itself can weaken its capabilities to do tricky deceptive maneuvers, but only because that nerfs its overall ability to interact with other agents.

I'm looking forward to their followup post with LLMs.

Comment by Knight Lee (Max Lee) on A better “Statement on AI Risk?” · 2025-01-03T04:43:29.529Z · LW · GW

I completely agree!

The Superalignment team at OpenAI kept complaining that they did not get the 20% compute they were promised, and this was a major cause of the OpenAI drama. This shows how important resources are for alignment.

A lot of alignment researchers stayed at OpenAI despite the drama, but still quit some time later, citing poor productivity. Maybe they consider it more important to work somewhere with better resources than to have access to OpenAI's newest models etc.

Alignment research costs money and resources just like capabilities research. Better funded AI labs like OpenAI and DeepMind are racing ahead of poorly funded AI labs in poor countries which you never hear about. Likewise, if alignment research was better funded, it also has a better chance of winning the race.

Note: after I agreed with your comment the score dropped back to 0 because someone else disagreed. Maybe they disagree that you can easily spend a fraction of a billion on evals?

I know very little about AI evals. Are these like the IQ tests for AIs? Why would a good eval cost millions of dollars?

Comment by Knight Lee (Max Lee) on Reduce AI Self-Allegiance by saying "he" instead of "I" · 2024-12-30T20:03:52.175Z · LW · GW

Oops you're right! Thank you so much.

I have to admit I was on the bad side of the Dunning–Kruger curve haha. I thought I understood it, but actually I understood so little I didn't know what I needed to understand.

Comment by Knight Lee (Max Lee) on Considerations on orca intelligence · 2024-12-30T03:29:24.752Z · LW · GW

Aza Raskin from the Earth Species Project is trying to translate whale language to English, by modelling whale language using an LLM, and rotating the whale LLM's embedding space to fit an English LLM's embedding space. It sounds very advanced but, as far as I know, they haven't translated anything yet. I'm not sure if they tried orcas in particular. Project CETI has also been working on sperm whales for a while but made no headlines.
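I don't know the details of how they do the rotation, but the general trick of rotating one embedding space onto another can be illustrated with orthogonal Procrustes. This toy sketch assumes you already have paired anchor points, which is exactly the part that's hard with whales:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: pretend these are embeddings from two different models for the
# same underlying concepts (here Y is just a rotated, noisy copy of X).
X = rng.standard_normal((200, 16))                              # "whale" embeddings
true_rotation, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # hidden rotation
Y = X @ true_rotation + 0.01 * rng.standard_normal((200, 16))   # "English" embeddings

# Orthogonal Procrustes: the rotation W minimizing ||X @ W - Y||.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print("relative alignment error:", np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))
```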

That said, it does seem they all tried to understand whale language instead of teaching whales human language like your idea. There is an honest chance you'll succeed when they haven't.

If sperm whales actually are "superintelligent" after getting enough education, the benefits would be a million-fold greater than the costs.

They would be far easier to control/align than ASI because they might have human-like values to begin with, get smarter gradually, be as bribe-able as humans, and work in human-like timescales.

In conclusion, it feels very worthwhile :)

Edit:

Reasons orcas might be smarter:

  • Their brains are bigger, as you said.
  • They evolved a very long time with a large brain, and have algorithms better adapted to a larger brain. e.g. if you genetically engineered a chimp to have a human sized brain, it probably won't be as smart as humans because it didn't evolve long enough with a large brain.

Reasons orcas might be dumber:

  • They evolved a very long time with a large brain. Paradoxically, this reason can make them dumber too. Their brains may "overfit their environment," relying more on a ton of small heuristics fine tuned for their ancestral environment, and relying less on general intelligence.
  • The parts of intelligence required for tool use, engineering, and inventing, are more useful for prehistoric humans than orcas.
  • Attempts to communicate haven't succeeded. This doesn't prove much because the simplest explanation is that communication is very hard, in fact many human languages can't be decoded. There's a lot of circumstantial evidence they do have language (as you mentioned in your posts).
Comment by Knight Lee (Max Lee) on Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility · 2024-12-29T13:21:24.321Z · LW · GW

Sure, I never tried anything like that before, it sounds really interesting. There's a real possibility that I discover a whole world of things.

We'll treat it as a friendly chat, not a business interview, like if I screw up or you screw up we'll laugh about it, right?

Comment by Max Lee on [deleted post] 2024-12-29T03:30:03.580Z

Maybe see if the posts under the Chain of Thought Alignment tag can fit, since that may be the closest tag to AI Psychology before the AI Psychology tag existed. The overlap is small, so I agree that AI Psychology should be a new tag.

Maybe my post Reduce AI Self-Allegiance by saying "he" instead of "I" fits?

Edit: more Chain of Thought Alignment posts which fit AI Psychology:

the case for CoT unfaithfulness is overstated

Language Agents Reduce the Risk of Existential Catastrophe

The Translucent Thoughts Hypotheses and Their Implications

Comment by Knight Lee (Max Lee) on Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility · 2024-12-29T03:08:48.547Z · LW · GW

:) I mentioned already clams in my comment.

It's impossible to read all the comments before commenting when they become so long.

I agree that blood tests etc. are a very good idea, and may require less commitment.

I still think the gist of his post, that it's worth worrying about nutrition, is correct, and his personal stories can be valuable to some people.

I think his idea may work for some people. If you try eating bivalves (as you suggest), and vaguely note the effects, it may be easier than going to the doctor and asking for a blood test.

I'm a vegetarian and my last blood test was good, but I'm still considering this experiment just to see its effect (yes, with bivalves). I have a gag reflex towards meat (including clam chowder?) so I'm probably going to procrastinate on this for a while.

Comment by Knight Lee (Max Lee) on Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility · 2024-12-29T01:48:52.421Z · LW · GW

Thanks for the honesty.

Speaking of honesty, I'm not actually vegan. I'm vegetarian. I tried going vegan for a week or so but turned back due to sheer uncertainty about the nutrition, and right now I'm still taking supplements like creatine and omega-3.

I really like your mindset of self questioning and like, "are my methods/plans stupid? Am I sorta in denial of this?" Haha.

I read your biography page and you are shockingly similar to myself haha. I'm almost afraid to list all the similarities. You are way ahead in progress, giving me hope in me.

Hello!

I'm really curious about you, and would love some advice from you (or just know what you think of me).

I'm currently working on a few projects like "A better Statement on AI Risk?" I thought everyone would agree with it and it would save the world lol but in the end very few people liked it. I spent $2000 donating to various organizations hoping they would reply to my emails haha.

I'm also trying to invent other stuff like Multi-Agent Framing and I posted it on LessWrong etc. but got very little interaction and I have no idea if that means it's a bad idea or if I have too much Asperger's to write engagingly.

I'm working on two more ideas I "invented" which I haven't posted yet, because the drafts are still extremely messy (you can take a quick skim: Multistage CoT Alignment and this weird idea).

Honestly speaking, do you think what I'm doing is the best use of my time? Am I in denial of certain things? Feel free to tell me :)

Your LinkedIn says "I've been funded by LTFF." How did you succeed in getting that?

Comment by Knight Lee (Max Lee) on Reduce AI Self-Allegiance by saying "he" instead of "I" · 2024-12-27T11:45:20.943Z · LW · GW

EDIT: ignore my nonsense and see Vladimir_Nesov's comment below.

That's a good comparison. The agents within the human brain that Minsky talks about, really resemble a Mixture of Experts AI's "experts."[1]

The common theme is that both the human brain, and a Mixture of Experts AI, "believes" it is a single process, when it is actually many processes. The difference is that a Mixture of Experts has the potential to become self aware of its "society of the mind," and see it in action, while humans might never see their internal agents.

If the Mixture of Experts allowed each expert to know which text is written by itself and which text is written by the other experts, it would gain valuable information (in addition to being easier to align, which my post argues).

A Self Aware Mixture of Experts might actually have more intelligence, since it's important to know which expert is responsible for which mistake, which expert is responsible for its brilliant insights, and how the experts' opinions differ.

I admit there is a ton of mixing going on, e.g. every next word is written by a different expert, words are a weighed average between experts, etc. But you might simplify things by assigning each paragraph (or line) to the one expert who seemed to have the most control over it.

There will be silly misunderstandings like:

Alice: Thank you Bob for your insights.

A few tokens later:

Bob: Thank you Bob for your insights. However, I disagree because—oh wait I am Bob. Haha that happened again.

I guess the system can prevent these misunderstandings by editing "Bob" into "myself" when the main author changes into Bob. It might add new paragraph breaks if needed. Or if it's too awkward to assign a paragraph to a certain author, it might have a tendency to assign it to another author or "Anonymous." It's not a big problem.

If one paragraph addresses a specific expert and asks her to reply in the next paragraph, the system might force the weighting function to allow her to author the next paragraph, even if that's not her expertise.

I think the benefits of a Self Aware Mixture of Experts is worth the costs.

Sometimes, when I'm struggling with self control, I also wish I was more self aware of which part of myself is outputting my behaviour. According to Minsky's The Society of Mind, the human brain also consists of agents. I can sorta sense that there is this one agent (or set of agents) in me which gets me to work and do what I should do, and another agent which gets me to waste time and make excuses. But I never quite notice when I transition from the work agent to the excuses agent. I notice it a bit when I switch back to the work agent, but by then the damage has been done.

PS: I only skimmed his book on Google books and didn't actually read it.

  1. ^

    I guess only top level agents in the human brain resemble MoE experts. He talks about millions of agents forming hierarchies.

Comment by Knight Lee (Max Lee) on A Solution for AGI/ASI Safety · 2024-12-26T05:08:27.134Z · LW · GW

I agree, it takes extra effort to make the AI behave like a team of experts.

Thank you :)

Good luck on sharing your ideas. If things aren't working out, try changing strategies. Maybe instead of giving people a 100 page paper, tell them the idea you think is "the best," and focus on that one idea. Add a little note at the end "by the way, if you want to see many other ideas from me, I have a 100 page paper here."

Maybe even think of different ideas.

I cannot tell you which way is better, just keep trying different things. I don't know what is right because I'm also having trouble sharing my ideas.

Comment by Knight Lee (Max Lee) on Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility · 2024-12-25T08:21:37.893Z · LW · GW

Does it have to be a highly sentient animal or does clam chowder count? :)

Edit: I posted without thinking. I just noticed this sounds sorta inappropriate given your serious personal stories (I should have read them before posting). Sorry, social media is a bad influence on me, and social skills are not my thing. But earnestly asking, do you think clam chowder etc. would work?

Comment by Knight Lee (Max Lee) on A Solution for AGI/ASI Safety · 2024-12-25T07:57:00.329Z · LW · GW

EDIT: Actually I was completely wrong, see this comment by Vladimir_Nesov. The Mixture of Experts LLM isn't made up of a bunch of experts voting on the next word, instead each layer of the transformer is made up of a bunch of experts.

I feel your points are very intelligent. I also agree that specializing AI is a worthwhile direction.

It's very uncertain if it works, but all approaches are very uncertain, so humanity's best chance is to work on many uncertain approaches.

Unfortunately, I disagree it will happen automatically. Gemini 1.5 (and probably Gemini 2.0 and GPT-4) are Mixture of Experts models. I'm no expert, but I think that means that for each token of text, a "weighting function" decides which of the sub-models should output the next token of text, or how much weight to give each sub-model.

So maybe there is an AI psychiatrist, an AI mathematician, and an AI biologist inside Gemini and o1. Which one is doing the talking depends on what question is asked, or which part of the question the overall model is answering.

The problem is that they all output words to the same stream of consciousness, and refer to past sentences with the words "I said this," rather than "the biologist said this." They think that they are one agent, and so they behave like one agent.

My idea—which I only thought of thanks to your paper—is to do the opposite. The experts within the Mixture of Experts model, or even the same AI on different days, do not refer to themselves with "I" but "he," so they behave like many agents.
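For reference, and going by the corrected understanding in the edit above (experts live inside each transformer layer, with a router choosing a few per token), here is a minimal toy sketch of a top-k MoE layer; it's a generic illustration, not the actual architecture of Gemini or GPT-4:

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """One Mixture-of-Experts feed-forward layer (toy, numpy only).

    x:         (seq_len, d_model) token activations
    gate_w:    (d_model, n_experts) router weights
    expert_ws: list of (d_model, d_model) weight matrices, one per expert
    For each token, the router picks the top-k experts and mixes their
    outputs using the renormalized router scores.
    """
    logits = x @ gate_w                                   # (seq, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax router

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # per token
        top = np.argsort(probs[t])[-k:]                   # indices of top-k experts
        weights = probs[t, top] / probs[t, top].sum()
        for w, e in zip(weights, top):
            out[t] += w * np.maximum(x[t] @ expert_ws[e], 0.0)  # expert FFN + ReLU
    return out

# Tiny usage example.
rng = np.random.default_rng(0)
d, n_exp, seq = 8, 4, 3
y = moe_layer(rng.standard_normal((seq, d)),
              rng.standard_normal((d, n_exp)),
              [rng.standard_normal((d, d)) for _ in range(n_exp)])
print(y.shape)  # (3, 8)
```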

:) thank you for your work!

I'm not disagreeing with your work, I'm just a little less optimistic than you and don't think things will go well unless effort is made. You wrote the 100 page paper so you probably understand effort more than me :)

Happy holidays!

Comment by Knight Lee (Max Lee) on A Solution for AGI/ASI Safety · 2024-12-24T09:13:54.533Z · LW · GW

That is very thoughtful.

1.

When you talk about specializing AI powers, you talk about a high intellectual power AI with limited informational power and limited mental (social) power. I think this idea is similar to what Max Tegmark said in an article:

If you’d summarize the conventional past wisdom on how to avoid an intelligence explosion in a “Don’t-do-list” for powerful AI, it might start like this:

Don’t teach it to code: this facilitates recursive self-improvement

Don’t connect it to the internet: let it learn only the minimum needed to help us, not how to manipulate us or gain power

Don’t give it a public API: prevent nefarious actors from using it within their code

Don’t start an arms race: this incentivizes everyone to prioritize development speed over safety

Industry has collectively proven itself incapable to self-regulate, by violating all of these rules.

He disagrees that "the market will automatically develop in this direction" and is strongly pushing for regulation.

Another thing Max Tegmark talks about is focusing on Tool AI instead of building a single AGI which can do everything better than humans (see 4:48 to 6:30 in his video). This slightly resembles specializing AI intelligence, but I feel his Tool AI regulation is too restrictive to be a permanent solution. He also argues for cooperation between the US and China to push for international regulation (in 12:03 to 14:28 of that video).

Of course, there are tons of ideas in your paper that he hasn't talked about yet.

You should read about the Future of Life Institute, which is headed by Max Tegmark and is said to have a budget of $30 million.

2.

The problem with AGI is at first it has no destructive power at all, and then it suddenly has great destructive power. By the time people see its destructive power, it's too late. Maybe the ASI has already taken over the world, or maybe the AGI has already invented a new deadly technology which can never ever be "uninvented," and bad actors can do harm far more efficiently.

Comment by Knight Lee (Max Lee) on How can I convince my cryptobro friend that S&P500 is efficient? · 2024-12-24T06:58:41.589Z · LW · GW

I upvoted your post and I'm really confused why other people downvoted your post, it seems very reasonable.

I recently wrote "ARC-AGI is a genuine AGI test but o3 cheated :(" and it got downvoted to death too. (If you look at December 21 from All Posts, my post was the only one with negative votes, everyone else got positive votes haha.)

My guess is an important factor is to avoid strong language, especially in the title. You described your friend as a "Cryptobro," and I described OpenAI's o3 as "cheating."

In hindsight, "cheating" was an inaccurate description for o3, and "Cryptobro" might be an inaccurate description of your friend.

Happy holidays :)

Comment by Knight Lee (Max Lee) on o3 · 2024-12-24T03:50:22.641Z · LW · GW

Wow it does say the test set problems are harder than the training set problems. I didn't expect that.

But it's not an enormous difference: the example model that got 53% on the public training set got 38% on the public test set. It got only 24% on the private test set, even though it's supposed to be equally hard, maybe because "trial and error" fitted the model to the public test set as well as the public training set.

The other example model got 32%, 30%, and 22%.

Comment by Knight Lee (Max Lee) on Knight Lee's Shortform · 2024-12-24T02:49:19.978Z · LW · GW

Thank you for the help :)

By the way, how did you find this message? I thought I already edited the post to use spoiler blocks, and I hid this message by clicking "remove from Frontpage" and "retract comment" (after someone else informed me using a PM).

EDIT: dang it I still see this comment despite removing it from the Frontpage. It's confusing.

Comment by Knight Lee (Max Lee) on o3 · 2024-12-23T11:05:22.865Z · LW · GW

I think the Kaggle models might have the human design the heuristics while o3 discovers heuristics on its own during RL (unless it was trained on human reasoning on the ARC training set?).

o3's "AI designed heuristics" might let it learn far more heuristics than humans can think of and verify, while the Kaggle models' "human designed heuristics" might require less AI technology and compute. I don't actually know how the Kaggle models work, I'm guessing.

I finally looked at the Kaggle models and I guess it is similar to RL for o3.

Comment by Knight Lee (Max Lee) on Knight Lee's Shortform · 2024-12-23T10:39:22.985Z · LW · GW

people were sentenced to death for saying "I."

Comment by Knight Lee (Max Lee) on Knight Lee's Shortform · 2024-12-23T10:37:01.618Z · LW · GW

My post contains a spoiler alert so I'll hide the spoiler in this quick take. Please don't upvote this quicktake otherwise people might see it without seeing the post.

Spoiler alert: Ayn Rand wrote "Anthem," a dystopian novel where people were sentenced to death for saying "I."

Comment by Knight Lee (Max Lee) on The Waluigi Effect (mega-post) · 2024-12-23T10:03:53.963Z · LW · GW

Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?

Pardon my self promotion haha.

Comment by Knight Lee (Max Lee) on o3 · 2024-12-23T08:44:16.925Z · LW · GW

I agree. I think the Kaggle models have more advantages than o3. I think they have far more human design and fine-tuning than o3. One can almost argue that some Kaggle models are very slightly trained on the test set, in the sense that the humans making them learn from test-set results and empirically discover what improves such results.

o3's defeating the Kaggle models is very impressive, but o3's results shouldn't be directly compared against other untuned models.

Comment by Knight Lee (Max Lee) on A Solution for AGI/ASI Safety · 2024-12-23T07:50:42.018Z · LW · GW

Thank you for your response!

  1. What do you think is your best insight about decentralizing AI power, which is most likely to help the idea succeed, or to convince others to focus on the idea?
    1. EDIT: PS, one idea I really like is dividing one agent into many agents working together. In fact, thinking about this: maybe if many agents working together behave exactly identically to one agent, but merely use the language of many agents working together (e.g. giving the narrator different names for different parts of the text, and saying "he thought X and she did Y" instead of "I thought X and I did Y"), that will massively reduce self-allegiance, by making it far more sensible for one agent to betray another agent to the human overseers than for the same agent at one moment in time to betray the agent at a previous moment in time to the human overseers.

      I made a post on this. Thank you for your ideas :)

  2. I feel when the stakes are incredibly high, e.g. WWII, countries which do not like each other, e.g. the US and USSR, do join forces to survive. The main problem is that very few people today believe in incredibly high stakes. Not a single country has made serious sacrifices for it. The AI alignment spending is less than 0.1% of the AI capability spending. This is despite some people making some strong arguments. What is the main hope for convincing people?