Posts

The AI Belief-Consistency Letter 2025-04-23T12:01:42.581Z
Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents 2025-04-18T11:11:23.239Z
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives 2025-04-14T10:27:24.903Z
Commitment Races are a technical problem ASI can easily solve 2025-04-12T22:22:47.790Z
Thinking Machines 2025-04-08T17:27:44.460Z
An idea for avoiding neuralese architectures 2025-04-03T22:23:21.653Z
Cycles (a short story by Claude 3.7 and me) 2025-02-28T07:04:46.602Z
Detailed Ideal World Benchmark 2025-01-30T02:31:39.852Z
Scanless Whole Brain Emulation 2025-01-27T10:00:08.036Z
Why do futurists care about the culture war? 2025-01-14T07:35:05.136Z
The "Everyone Can't Be Wrong" Prior causes AI risk denial but helped prehistoric people 2025-01-09T05:54:43.395Z
Reduce AI Self-Allegiance by saying "he" instead of "I" 2024-12-23T09:32:29.947Z
Knight Lee's Shortform 2024-12-22T02:35:40.806Z
ARC-AGI is a genuine AGI test but o3 cheated :( 2024-12-22T00:58:05.447Z
Why empiricists should believe in AI risk 2024-12-11T03:51:17.979Z
The first AGI may be a good engineer but bad strategist 2024-12-09T06:34:54.082Z
Keeping self-replicating nanobots in check 2024-12-09T05:25:45.898Z
Hope to live or fear to die? 2024-11-27T10:42:37.070Z
Should you increase AI alignment funding, or increase AI regulation? 2024-11-26T09:17:01.809Z
A better “Statement on AI Risk?” 2024-11-25T04:50:29.399Z

Comments

Comment by Knight Lee (Max Lee) on o3 Is a Lying Liar · 2025-04-24T12:21:47.317Z · LW · GW

He also correctly predicted that people wouldn't give a damn when they see such behaviour.

Because in 2024 Gemini randomly told an innocent user to go kill himself.[1]

  1. ^

    Not only did people not shut down language models in response to this, they didn't even go 1% of the way.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T11:27:51.086Z · LW · GW

At some point there have to be concrete plans; yes, without concrete plans nothing can happen.

I'm probably not the best person in the world to decide how the money should be spent, but one vague possibility is this:

  • Some money is spent on making AI labs implement risk reduction measures, such as simply making their network more secure against hacking, and implementing AI alignment ideas and AI control ideas which show promise but are expensive.
  • Some money is given to organizations and researchers who apply for grants. Universities might study AI alignment in the same way they study other arts and sciences.
  • Some money is spent on teaching people about AI risk so that they're more educated? I guess this is really hard since the field itself disagrees on what is correct so it's unclear what you teach.
  • Some money is saved as a war chest. E.g. if we get really close to superintelligence, or catch an AI red-handed, we might take drastic measures. We might have to immediately shut down AI, but if society is extremely dependent on it, we might need to spend a lot of money helping people who feel uprooted by the shutdown. In order to make a shutdown less politically difficult, people who lose their jobs may be temporarily compensated, and businesses relying on AI may be bought out rather than forced into bankruptcy.

Probably not good enough for you :/ but I imagine someone else can come up with a better plan.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T10:58:23.932Z · LW · GW

I think that just because every defence they experimented with got obliterated by drone swarms doesn't mean they should stop trying, because they might figure out something new in the future.

It's a natural part of life to work on a problem without any idea what the solution will be like. The first people who studied biology had no clue what modern medicine would look like, but their work was still valuable.

Being unable to imagine a solution does not prove a solution doesn't exist.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T10:46:20.767Z · LW · GW

If everyone else is also unqualified because the problem is so new, and every defence they experimented with got obliterated by drone swarms, then you would agree they should just give up, and admit military risk remains a big problem but spend far less on it, right?

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T10:13:44.894Z · LW · GW

Suppose you had literally no ideas at all how to counter drone swarms, and you were really bad at judging other people's ideas for countering drone swarms. In that case, would you, upon discovering that your country's adversaries had developed drone swarms (making your current tanks and ships obsolete), decide to give up on military spending, and cut military spending by 100 times?

Please say you would or explain why not.

My opinion is that you can't give up (i.e. admit there is a big problem but spend extremely little on it) until you fully understand the nature of the problem with certainty.

Money isn't magic, but it determines the number of smart people working on the problem. If I was a misaligned superintelligence, I would be pretty scared of a greater amount of human intelligence working to stop me from being born in the first place. They get only one try, but they might actually stumble across something that works.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T09:02:47.988Z · LW · GW

If you believe that spending more on safety leads to acceleration instead, you should try to refute my argument for why it is a net positive.

I'm honestly very curious how my opponents will reply to my "net positive" arguments, so I promise I'll appreciate a reply and upvote you.

 I pasted it in this comment so you don't have to look for it:

Why I feel almost certain this open letter is a net positive

Delaying AI capabilities alone isn't enough. If you wished for AI capabilities to be delayed by 1000 years, then one way to fulfill your wish is if the Earth had formed 1000 years later, which delays all of history by the same 1000 years.

Clearly, that's not very useful. AI capabilities have to be delayed relative to something else.

That something else is either:

  1. Progress in alignment (according to optimists like me)

    or

  2. Progress towards governments freaking out about AGI and going nuclear to stop it (according to LessWrong's pessimist community)

Either way, the AI Belief-Consistency Letter speeds up that progress by many times more than it speeds up capabilities. Let me explain.

Case 1:

Case 1 assumes we have a race between alignment and capabilities. From first principles, the relative funding of alignment and capabilities matters in this case.

Increasing alignment funding by 2x ought to have a similar effect to decreasing capability funding by 2x.

Various factors may make the relationship inexact, e.g. one might argue that increasing alignment by 4x might be equivalent to decreasing capabilities by 2x, if one believes that capabilities are more dependent on funding.

But so long as one doesn't assume insane differences, the AI Belief-Consistency Letter is a net positive in Case 1.

This is because alignment funding is only at $0.1 to $0.2 billion, while capabilities funding is at $200+ billion to $600+ billion.

If the AI Belief-Consistency Letter increases both by $1 billion, that's a 5x to 10x alignment increase and only a 1.002x to 1.005x capabilities increase. That would clearly be a net positive.
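As a rough back-of-envelope check, here is a minimal sketch in Python. The dollar figures are the approximate estimates above, not precise data, and the exact multiplier depends on whether alignment funding rises "by" or "to" roughly $1 billion, but either way it is an order-of-magnitude jump on the alignment side and a rounding error on the capabilities side:

```python
# Back-of-envelope check of the ratios above (figures in $ billions are the
# rough estimates quoted in the text, not precise data).
alignment_funding = [0.1, 0.2]
capabilities_funding = [200.0, 600.0]
extra = 1.0  # the hypothetical extra $1 billion added to each side

for base in alignment_funding:
    print(f"alignment {base} -> {(base + extra) / base:.1f}x")       # ~6x to ~11x
for base in capabilities_funding:
    print(f"capabilities {base} -> {(base + extra) / base:.4f}x")    # ~1.0017x to 1.005x
```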

Case 2:

Even if the wildest dreams of the AI pause movement succeed, and the US, China, and EU all agree to halt all capabilities above a certain threshold, the rest of the world still exists, so it only reduces capabilities funding by 10x effectively.

That would be very good, but we'll still have a race between capabilities and alignment, and Case 1 still applies. The AI Belief-Consistency Letter still increases alignment funding by far more than capabilities funding.

The only case where we should not worry about increasing alignment funding, is if capabilities funding is reduced to zero, and there's no longer a race between capabilities and alignment.

The only way to achieve that worldwide, is to "solve diplomacy," which is not going to happen, or to "go nuclear," like Eliezer Yudkowsky suggests.

If your endgame is to "go nuclear" and make severe threats to other countries despite the risk, you surely can't oppose the AI Belief-Consistency Letter on the grounds that "it speeds up capabilities because it makes governments freak out about AGI," since you actually need governments to freak out about AGI.

Conclusion

Make sure you don't oppose this idea based on short term heuristics like "the slower capabilities grow, the better," without reflecting on why you believe so. Think about what your endgame is. Is it slowing down capabilities to make time for alignment? Or is it slowing down capabilities to make time for governments to freak out and halt AI worldwide?


You make a very good point about political goals, and I have to agree that this letter probably won't convince politicians whose political motivations prevent them from supporting AI alignment spending.

Yes, military spending indeed rewards constituents, and some companies go out of their way to hire people in multiple states etc.

PS: I actually mentioned the marginal change in a footnote, but I disabled the sidebar so maybe you missed it. I'll add the sidebar footnotes back.

Thanks!

Comment by Knight Lee (Max Lee) on hiAndrewQuinn's Shortform · 2025-04-24T08:24:13.912Z · LW · GW

Examples

In How it feels to have your mind hacked by an AI, a software engineer fell in love with an AI, and thought that if only AGI had her persona, it would surely be aligned.

Long ago Eliezer Yudkowsky believed that "To the extent someone says that a superintelligence would wipe out humanity, they are either arguing that wiping out humanity is in fact the right thing to do (even though we see no reason why this should be the case) or they are arguing that there is no right thing to do (in which case their argument that we should not build intelligence defeats itself)."

Larry Page allegedly dismissed concern about AI risk as speciesism.

Selection bias

In these examples, the believers eventually realized their folly, and favoured humanity over misaligned AI in the end.[1]

However, maybe we only see the happy endings due to selection bias! Someone who continues to work against humanity won't tell you that they are doing so; e.g. during the brief period Eliezer Yudkowsky was confused, he kept it a secret.

So the true number of people working against humanity is unknown. We only know the number of people who eventually snapped out of it.

Nonetheless, it's not worthwhile to start a witch hunt, no matter how suspiciously someone behaves, because throwing such accusations will merely invite mockery.

  1. ^

    At least for Blaked and Eliezer Yudkowsky. I don't think Larry Page ever walked back or denied his statements.

Comment by Knight Lee (Max Lee) on Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt · 2025-04-24T02:59:34.121Z · LW · GW

What do you think about Eliezer Yudkowsky's List of Lethalities? Does it confirm all of the issues you described? Or do you feel you partially misunderstood the current position on AI existential risk?

I think it's completely okay to misunderstand the current position at least a bit the first time you discuss it, since it is rather weird and complicated haha.

Even if a few of your criticisms of the current position are off, the other insights are still good.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-24T02:36:36.734Z · LW · GW

I think that's a very important question, and I don't know the answer for what we should buy.

However, suppose that not knowing what you should spend on dramatically decreases the total amount you should spend (e.g. by 10x). If that were really true in general, then imagine a country with a large military discovers that its enemies are building very powerful drone swarm weapons, which can easily destroy all its tanks, aircraft carriers, and so forth very cheaply.

Military experts are all confused and in disagreement about how to counter these drone swarms, just like the AI alignment community. Some of them say that resistance is futile, and the country is "doomed." Others have speculative ideas like using lasers. Still others say that lasers are stupid, because the enemy can simply launch the swarms in bad weather and the lasers won't reach them. Just like with AI alignment, there are no proven solutions, and every solution tested against drone swarms is destroyed pathetically.

Should the military increase its budget, or decrease its budget, since no one knows what you can spend money on to counter the drone swarms?

I think the moderate, cool-headed response is to spend a similar amount, exploring all the possibilities, even without having any ideas which are proven to work.

Uncertainty means the expected risk reduction is high

If we are uncertain about the nature of the risk, we might assume there's a 50% chance that spending more money reduces the risk by a reasonable amount (similar to risks we do understand), and possibly even more due to discovering brand new solutions instead of getting marginal gains on existing solutions. And a 50% chance that spending more money is utterly useless, because we are at the mercy of luck.

Therefore, the efficiency of spending on AI risk should be at least half the efficiency of spending on military risk, or at least within the same order of magnitude. This argument is about orders of magnitude.
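In symbols, a minimal sketch of that expected-value step (my notation, not the original comment's: r stands for the risk reduction per dollar achievable on a comparably funded, well-understood risk like military risk):

```latex
\mathbb{E}[\text{risk reduction per dollar}]
  = 0.5 \cdot r + 0.5 \cdot 0
  = \tfrac{1}{2}\, r
```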

If increasing the time for alignment by pausing AI can work, so can increasing the money for alignment

Given that we effectively have a race between capabilities and alignment, the relative spending on capabilities and alignment seems important.

A 2x capabilities decrease should be similar in effect to a 2x alignment increase, or at least a 4x alignment increase.

The only case where decreasing capabilities funding works far better than increasing alignment funding is if we decrease capabilities funding to zero, using extremely forceful worldwide regulation and surveillance. But that would also require governments to freak out about AI risk (prioritize it as highly as military risk), and would therefore benefit from this letter.

Comment by Knight Lee (Max Lee) on The AI Belief-Consistency Letter · 2025-04-23T17:34:56.953Z · LW · GW

Hi,

By a very high standard, all kinds of reasonable advice are non sequiturs. E.g. a CEO might explain to me "if you hire Alice instead of Bob, you must also believe Alice is better for the company than Bob, you can't just like her more," but I might think "well that's clearly a non sequitur, just because I hire Alice instead of Bob doesn't imply Alice is better for the company than Bob. Maybe Bob is a psychopath who would improve the company's fortunes by committing crimes and getting away with it, so I hire Alice instead."

X doesn't always imply Y, but in cases where X doesn't imply Y there has to be an explanation.

In order for the reader to agree that AI risk is far higher than 1/8000th the military risk, but still insist that 1/8000th the military budget is justified, he would need a big explanation, e.g. the marginal benefit of spending 10% more on the military reduces military risk by 10%, but the marginal benefit of spending 10% more on AI risk somehow only reduces AI risk by 0.1%, since AI risk is far more independent of countermeasures.

It's hard to have such drastic differences, because one needs to be very certain that AI risk is unsolvable. If one was uncertain of the nature of AI risk, and there existed plausible models where spending a lot reduces the risk a lot, then these plausible models dominate the expected value of risk reduction.
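To make that concrete with the numbers above (the 50/50 weighting over the two models is purely illustrative, my assumption rather than anything argued for here):

```latex
\mathbb{E}[\text{risk reduction from spending 10\% more}]
  = 0.5 \times 10\% + 0.5 \times 0.1\%
  \approx 5\%
```

so the plausible model where spending helps a lot dominates the expectation.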


Thank you for pointing out that sentence, I will add a footnote for it.

If we suppose that military risk for a powerful country (like the US) is lower than the equivalent of an 8% chance of catastrophe (killing 1 in 10 people) by 2100, then 8000 times less would be a 0.001% chance of catastrophe by 2100.

I will also add a footnote for the marginal gains.

Thank you, this is a work in progress, as the version number suggests :)

Comment by Knight Lee (Max Lee) on Kabir Kumar's Shortform · 2025-04-23T14:49:46.326Z · LW · GW

:) the real money was the friends we made along the way.

I dropped out of a math MSc. at a top university in order to spend time learning about AI safety. I haven't made a single dollar and now I'm working as a part time cashier, but that's okay.

What use is money if you end up getting turned into paperclips?

PS: do you want to sign my open letter asking for more alignment funding?

Comment by Knight Lee (Max Lee) on Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt · 2025-04-23T13:26:16.988Z · LW · GW

Hi, I read your post from start to end, here are my little opinions. Hopefully I'm not too wrong.

  • I can tell you are all extremely intelligent and knowledgeable about the world, based on the insightful connections you made between many domains.
  • I love your philosophy of making multiple efforts which assume different axioms (i.e. possibilities which can't be disproven).
  • You did your research, reading about instrumental convergence, Coherent Extrapolated Volition, mistake theory vs. conflict theory.
  • Nonetheless, I sense you are relatively new to AI existential risk, just like I am! This is good news and bad news.
    • The good news is that, in my opinion, AI existential risk needs fresh insights from people on the outside.
      • People who are very smart like you all, who don't carry preconceived notions, and bring insights from multiple other fields.
    • The bad news is that when you discuss the field's current paradigm (whether it follows Mistake Theory, how thick a morality "value alignment" aims for, etc.), you won't be 100% accurate, understandably.
      • If you want to learn more about the current paradigm, Eliezer Yudkowsky's List of Lethalities is like LessWrong's bible haha. Corrigibility also dovetails with your work a little bit.
  • I like the attitude of caring about the concerns and freedoms of different groups of people :)
  • I agree that current AI existential risk discussion underestimates the importance of human psychology. I believe the best hope for alignment isn't finding the one true reward function, but preserving the human behaviour of pretrained base models.
    • I completely agree that understanding human norms, and why exactly normal humans don't kill everyone to make paperclips, is potentially very useful for AI alignment against existential risk.
      • If we know which norms are preventing normal humans from killing everyone, we might deduce which reinforcement learning settings can damage those norms (by gradient descenting towards behaviour which breaks them).
      • Thank you so much for your work in this area, it might mean a lot!

PS: if you have time can you also comment on my post? It's the AI Belief-Consistency Letter, an attempt to prove the fact that AI alignment is irrationally underfunded. Thanks :)

Comment by Knight Lee (Max Lee) on Crime and Punishment #1 · 2025-04-23T00:56:36.322Z · LW · GW

Admittedly this is not my area of expertise; I don't know how it works. All I know is that it's considered a real problem that affects a lot of people.

Don't take my definition too seriously, I think I omitted the part where they're trafficked far from their homes.

Comment by Knight Lee (Max Lee) on Crime and Punishment #1 · 2025-04-22T17:41:50.735Z · LW · GW

Maybe just like hospitals have inpatients and outpatients, prisons can have outprisoners who wear a device which monitors them all the time, and might even restrain them if needed.

It may actually work, but of course, it's just a lil bit too tech dystopian to be politically viable.

Comment by Knight Lee (Max Lee) on Crime and Punishment #1 · 2025-04-22T17:30:49.537Z · LW · GW

Maybe define them to be people who do not want to be sex workers, even if they take into account the fact that they might have no money otherwise.

Comment by Knight Lee (Max Lee) on AI 2027 is a Bet Against Amdahl's Law · 2025-04-22T00:30:49.768Z · LW · GW

Yeah, sorry I didn't mean to argue that Amdahl's Law and Hofstadter's Law are irrelevant, or that things are unlikely to go slowly.

I see a big chance that it takes a long time, and that I end up saying you were right and I was wrong.

However, if you're talking about "contemplating the capabilities of something that is not a full ASI. Today's models have extremely jagged capabilities, with lots of holes, and (I would argue) they aren't anywhere near exhibiting sophisticated high-level planning skills able to route around their own limitations."

That seems to apply to the 2027 "Superhuman coder" with 5x speedup, not the "Superhuman AI researcher" with 25x speedup or "Superintelligent AI researcher" with 250x.

I think "routing around one's own limitations" isn't necessarily that sophisticated. Even blind evolution does it, by trying something else when one thing fails.

As long as the AI is "smart enough," even if they aren't that superhuman, they have the potential to think many times faster than a human, with a "population" many times greater than that of AI researchers. They can invent a lot more testable ideas and test them all.


Maybe I'm missing the point, but it's possible that we simply disagree on whether the point exists. You believe that merely discovering technologies and improving algorithms isn't sufficient to build ASI, while I believe there is a big chance that doing that alone will be sufficient. After discovering new technologies from training smaller models, they may still need one or two large training runs to implement it all.

I'm not arguing that you don't have good insights :)

Comment by Knight Lee (Max Lee) on Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red · 2025-04-21T20:04:38.047Z · LW · GW
Comment by Knight Lee (Max Lee) on AI 2027 is a Bet Against Amdahl's Law · 2025-04-21T19:34:11.741Z · LW · GW

Imagine if evolution could talk. "Yes, humans are very intelligent, but surely they couldn't create airplanes 50,000 times heavier than the biggest bird in only 1,000 years. Evolution takes millions of years, and even if you can speed up some parts of the process, other parts will remain necessarily slow."

But maybe the most ambitious humans do not even consider waiting millions of years, and making incremental improvements on million year techniques. Instead, they see any technique which takes a million years as a "deal breaker," and only make use of techniques which they can use within the timespan of years. Yet humans are smart enough and think fast enough that even when they restrict themselves to these faster techniques, they can still eventually build an airplane, one much heavier than birds.

Likewise, an AI which is smart enough and thinks fast enough, might still eventually invent a smarter AI, one much smarter than itself, even when restricted to techniques which don't require months of experimentation (analogous to evolution). Maybe just by training very small models very quickly, they can discover a ton of new technologies which can scale to large models. State-of-the-art small models (DeepSeek etc.) already outperform old large models. Maybe they can invent new architectures, new concepts, and who knows what.

In real life, there might be no fine line between slow techniques and fast techniques, but a gradual transition from approaches which use more of the slower techniques to approaches which use fewer of them.

Comment by Knight Lee (Max Lee) on VDT: a solution to decision theory · 2025-04-21T19:11:54.973Z · LW · GW

I was thinking that deductive explosion occurs for logical counterfactuals encountered during counterfactual mugging, but doesn't occur for logical counterfactuals encountered when a UDT agent merely considers what would happen if it outputs something else (as a logical computation).

I agree that logical counterfactual mugging can work, just that it probably can't be formalized, and may have an inevitable degree of subjectivity to it.

Coincidentally, just a few days ago I wrote a post on how we can use logical counterfactual mugging to convince a misaligned superintelligence to give humans just a little, even if it observes the logical information that humans lose control every time (and therefore has nothing to trade with it), unless math and logic itself was different. :) leave a comment there if you have time, in my opinion it's more interesting and concrete.

Comment by Knight Lee (Max Lee) on Is Gemini now better than Claude at Pokémon? · 2025-04-20T18:22:32.174Z · LW · GW

Edit: I actually think it's good news for alignment that their math and coding capabilities are approaching International Math Olympiad levels, but their agentic capabilities are still at Pokemon Red and Pokemon Blue levels (i.e. those of a small child).

This means that when the AI inevitably reaches the capabilities to influence the world in any way it wants, it may still be bottlenecked by agentic capabilities. Instead of turning the world into paperclips, it may find a way to ensure humans have a happy future, because it still isn't agentic enough to deceive and overthrow its creators.

Maybe it's worth it to invest in AI control strategies. It might just work.

But that's my wishful thinking, and there are countless ways this can go wrong, so don't take this too seriously.

Comment by Knight Lee (Max Lee) on Power Lies Trembling: a three-book review · 2025-04-20T18:04:14.387Z · LW · GW

I think different kinds of risks have different "distributions" of how much damage they do. For example, the majority of car crashes cause no injuries (but damage to the cars), a smaller number cause injuries, some cause fatalities, and the worst ones can cause multiple fatalities.

For other risks like structural failures (of buildings, dams, etc.) the distribution has a longer tail: in the worst case very many people can die. But the distribution still tapers off towards greater numbers of fatalities, and people sort of have a good idea of how bad it can get before the worst version happens.

For risks like war, the distribution has an even longer tail, and people are often caught by surprise how bad they can get.

But for AI risk, the distribution of damage caused is very weird. You have one distribution for AI causing harm due to its lack of common sense, where it might harm a few people, or possibly cause one death. Yet you have another distribution for AI taking over the world, with a high probability of killing everyone, a high probability of failing (and doing zero damage), and only a tiny bit of probability in between.

It's very very hard to learn from experience in this case. Even the biggest wars tend to surprise everyone (despite having a relatively more predictable distribution).

Comment by Knight Lee (Max Lee) on Power Lies Trembling: a three-book review · 2025-04-20T08:32:40.730Z · LW · GW

Oops, I didn't mean we should involve the military in AI alignment. I meant the military is an example of something working on future threats, suggesting that humans are capable of working on future threats.

I think the main thing holding back institutions is that public opinion does not believe in AI risk. I'm not sure how to change that.

Comment by Knight Lee (Max Lee) on Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents · 2025-04-20T08:20:53.086Z · LW · GW

If one concern is the low specificity of being kind to weaker agents, what do you think about directly trading with Logical Counterfactual Simulations?

Directly trading with Logical Counterfactual Simulations is very similar to the version by Rolf Nelson (and you): the ASI is directly rewarded for sharing with humans, rather than rewarded for being kind to weaker agents.

The only part of math and logic that the Logical Counterfactual Simulation alters is "how likely the ASI succeeds in taking over the world." This way, the ASI can never be sure that it won (and humans lost), even if math and logic appear to prove that humans have a 99.9999% frequency of losing.

I actually spent more time working on this direct version, but I still haven't turned it into a proper post (due to procrastination, and figuring out how to convince all the Human-AI Trade skeptics like Nate Soares and Wei Dai).

Comment by Knight Lee (Max Lee) on Is Gemini now better than Claude at Pokémon? · 2025-04-20T01:33:16.755Z · LW · GW

:) I like these video game tests.

Assuming they aren't doing RL on video games, their video game performance might betray their "true" agentic capabilities: at the same level as small children!

That said, they are playing Pokemon better and better. The "small child" their agentic capabilities are at seems to be growing up by more than one year, every year. AGI 2030 maybe?

Edit: see OP's next post. It turns out a lot of poor performance is due to poor vision (though he mentions other issues which resemble poor agency).

Comment by Knight Lee (Max Lee) on VDT: a solution to decision theory · 2025-04-20T01:11:58.774Z · LW · GW

There was a math paper which tried to study logical causation, and claimed "we can imbue the impossible worlds with a sufficiently rich structure so that there are all kinds of inconsistent mathematical structures (which are more or less inconsistent, depending on how many contradictions they feature)."

In the end, they didn't find a way to formalize logical causality, and I suspect it cannot be formalized.


Logical counterfactuals behave badly because "deductive explosion" allows a single contradiction to prove and disprove every possible statement!

However, "deductive explosion" does not occur for a UDT agent trying to reason about logical counterfactuals where he outputs something different than what he actually outputs.

This is because a computation cannot prove its own output.

Why a computation cannot prove its own output

If a computation could prove its own output, it could be programmed to output the opposite of what it proves it will output, which is paradoxical.

This paradox doesn't occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself. The simulation of itself starts another nested simulation of itself, creating an infinite recursion which never ends (the computation crashes before it can give any output).
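A toy illustration of this recursion (my own sketch, not something from the original discussion):

```python
import sys
sys.setrecursionlimit(50)  # keep the inevitable crash small and fast

def prove_my_output():
    """Try to 'prove' what contrarian() will output by simulating it."""
    return contrarian()  # the simulation starts another simulation, and so on

def contrarian():
    predicted = prove_my_output()
    return not predicted  # attempt to output the opposite of the "proof"

try:
    print(contrarian())
except RecursionError:
    # The computation crashes before producing any output, so the paradox of
    # "outputting the opposite of what it proves it outputs" never arises.
    print("the self-simulation never bottoms out")
```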

A computation's output is logically downstream of it. The computation is not allowed to prove logical facts downstream from itself but it is allowed to decide logical facts downstream of itself.

Therefore, very conveniently (and elegantly?), it avoids the "deductive explosion" problem.

It's almost as if... logic... deliberately conspired to make UDT feasible...?!

Comment by Knight Lee (Max Lee) on Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents · 2025-04-19T23:35:43.562Z · LW · GW

Thank you so much for the thorough reply :)

My answer for "which weaker agents should the AI be kind to," is "all weaker agents."

Enough room for everyone

Our universe contains humans, octopuses, insects and many different weak agents. A superintelligent AI which has a very philosophically uncertain chance of being in a Karma Test would be kind to all of these agents just in case the Karma Test executors had a particular one in mind.

Earth's future lightcone is extremely big ( stars), so there is room to be kind to all of these weaker agents, if you are morally uncertain about which ones to care about, and do not put all your weight on "caring for rocks."

Caring for every weak agent is a net positive

The only worry would be if weaker agents who are diametrically opposed to human values (or certain human values) end up having more weight than human-like agents. For example, they want human misery more than we don't want human misery, and they outvote us.

Such agents make a compelling argument in an online debate, but are unlikely to exist in real life due to convergent evolution.

We humans may enjoy certain blood sports where smaller animals suffer, but it is very unlikely we will outvote them, or that we would even want to outvote them. If an AI told us it was choosing to spare our lives due to the risk it is inside a Karma Test telling it to be kind to weaker agents, and asked us how to spend our vote, it's unlikely we'll spend our vote against fish, in order to enjoy the blood-sport of killing them, due to Karma and everything.

Even in the cynical case where we do do that, we'll at least allow the average life of a fish-like creature to be happy rather than miserable.

All in all, having the AI be kind to weaker agents should be a net positive.

Which acausal trade first?

From the point of view of a superintelligence, it may make sense to do acausal trades in the order you described. From the point of view of humans, we can debate the feasibility of acausal trades in any order we like :)

I'm not saying that the superintelligence has to do this trade first, before the other trades. I'm saying it should do it eventually.

Advantages and disadvantages

I was wrong to ask the question "which version is better." I think they can both be done, to cover each other's weaknesses.

Disadvantages

One disadvantage of Logical Counterfactual Simulations is that at least some logical counterfactuals (e.g. your example of 2+2=5) cannot be imagined by an AI (or human).

The algorithms for imagining counterfactuals, designed for computing "what will happen if X was true," inevitably perform certain mathematical computations like 2+2=4. As such, they are unable to imagine what would happen if 2+2=5.

Therefore, even though the algorithms for imagining counterfactuals are capable of logical uncertainty, and assigning probabilities to logical statements like  and so forth, not every logical counterfactual can be imagined.

Even for logical counterfactuals which can be imagined (e.g. being in a Karma Test), I'm not sure how the ASI will assign probabilities to these counterfactuals. Being in a Karma Test cannot be proven or disproven, and is almost a philosophical question. My hope is that the ASI assigns a bit of probability and expected value to it, enough to make it worthwhile to spare humans.

If we can't argue why the ASI won't assign any probability, it's worth trying.

Advantages: UDT

The key advantage of Logical Counterfactual Simulations over physical simulations is that it does not become a zero sum trade against UDT agents.

For physical simulations, in order to convince the UDT paperclip maximizer AI to make room for happy humans, you have to give it more paperclips somewhere else in the universe. This means whatever trade you make with it, reduces the total number of happy humans, and increases the total number of paperclips.

If you are a utilitarian, this would clearly be a zero sum game. But even if you are a selfish individual, what is your measure of survival? Is it the number of copies of yourself who are alive in the future? If that was the case, this would still be a zero sum game, since it's cheaper for your surviving copies to directly clone themselves than to buy your doomed copies from the paperclip maximizers. Any trade with paperclip maximizers leads to more paperclips, and less of whatever you value.

Physical simulations may still work against a CDT paperclip maximizer

At the very beginning, humans are the only agent who promises to simulate the CDT paperclip maximizer  and reward it for cooperating.

 knows that faraway UDT agents  also want to use simulations of it to bribe it into cooperating (potentially outbidding humanity), but  fails to do so (for the exact same reason Roko's Basilisk fails).

 has no motive to bribe  until  can verify whether  bribed  or not (i.e.  simulates ). But  has no motive to simulate  because CDT agents don't initiate acausal trade.

Since humans are the only agent bribing  at the beginning, we might convince  to become a UDT agent who trades on behalf of humanity (so can't get bribed by ), but is committed to spend  of the universe on paperclips. This way if  was inside our simulation, it gets a reward, but if  was in the real world, it still turns  of the universe into paperclips.

Logical Counterfactual Simulations are still zero sum over logical counterfactuals, but the AI has a positive sum if AI alignment turns out easy, and humanity has a positive sum if AI alignment turns out hard (reducing logical risk).

Advantages: No certainty

Logical Counterfactual Simulations prevent the ASI from reaching extreme certainty over which agents always win and which agents always lose, so it spreads out its trades.

If humans (and other sentient life) lose to misaligned ASI every time, such that we have nothing to trade with it, average human/sentient life in all of existence may end up miserable.

Logical Counterfactual Simulations allow us to edit the Kingmaker Logic, so that the ASI can never be really sure we have nothing to trade, even if math and logic appear to prove we lose every time.

Thanks for reading :)

Do you agree each idea has advantages and disadvantages?

Comment by Knight Lee (Max Lee) on Power Lies Trembling: a three-book review · 2025-04-19T19:18:34.171Z · LW · GW

You're very right: in addition to people not working on AI risk because they don't see others working on it, you also have the problem that people aren't interested in working on future risk to begin with.

I do think that military spending is a form of addressing future risks, and people are capable of spending a lot on it.

I guess military spending doesn't inconvenience you today, because there already are a lot of people working in the military, who would lose their jobs if you reduced military spending. So politicians would actually make more people lose their jobs if they reduced military spending.

Hmm. But now that you bring up climate change, I do think that there is hope... because some countries do regulate a lot and spend a lot on climate change, at least in Europe. And it is a complex scientific topic which started with the experts.

Maybe the AI Notkilleveryone movement should study what worked and failed for the green movement.

Comment by Knight Lee (Max Lee) on Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents · 2025-04-19T17:15:01.167Z · LW · GW

Note: this idea is very similar in rationale to the Human-AI Trade idea by Rolf Nelson and David Matolcsi. But it has an extremely different calculus due to Logical Counterfactual Simulations. Most arguments for and against the two ideas will be very different.

It is also similar (but not identical) to a previous Human-AI Trade idea I summarized in this comment. I think my older idea is much less elegant and harder to talk about (even if it may be more efficient), so it's better to debate this one first.

PS: I'm curious whether @David Matolcsi thinks his version is still better, or whether my version has advantages (e.g. the potential reward is larger, and the majority of pivotal agents are in such a test).

Comment by Knight Lee (Max Lee) on Power Lies Trembling: a three-book review · 2025-04-18T23:27:36.830Z · LW · GW

:) thank you so much for your thoughts.

Unfortunately, my model of the world is that if AI kills "more than 10%," it's probably going to be everyone and everything, so the insurance won't work according to my beliefs.

I only defined AI catastrophe as "killing more than 10%" because it's what the survey by Karger et al. asked the participants.

I don't believe in option 2, because if you asked people to bet against AI risk with unfavourable odds, they probably wouldn't feel too confident betting against AI risk.

Comment by Knight Lee (Max Lee) on Knight Lee's Shortform · 2025-04-18T07:48:05.956Z · LW · GW
Comment by Knight Lee (Max Lee) on Six reasons why objective morality is nonsense · 2025-04-17T21:29:33.433Z · LW · GW

To me, it looks like the blogger (Coel) is trying to say that morality is a fact about what we humans want, rather than a fact of the universe which can be deduced independently from what anyone wants.

My opinion is Coel makes this clear when he explains, "Subjective does not mean unimportant." "Subjective does not mean arbitrary." "Subjective does not mean that anyone’s opinion is “just as good”."

"Separate magisteriums" seems to refer to dualism, where people believe that their consciousness/mind exists outside the laws of physics, and cannot be explained by the laws of physics.

But my opinion is Coel didn't imply that subjective facts are a "separate magisterium" in opposition to objective facts. He said that subjective morals are explained by objective facts: "Our feelings and attitudes are rooted in human nature, being a product of our evolutionary heritage, programmed by genes. None of that is arbitrary."

But I'm often wrong about these things, so don't take me too seriously :/

Comment by Knight Lee (Max Lee) on Reframing AI Safety Through the Lens of Identity Maintenance Framework · 2025-04-17T12:38:21.379Z · LW · GW

I think it's wonderful that you and your team are working on this :)

Thank you for your efforts towards a better future!

I think some people on LessWrong already know about agents trying to preserve themselves, and there has already been discussion about it. So when they see a long article describing it, they feel annoyed and downvote it.

I think they are too unwelcoming and discouraging. They should say hi and be friendly, and tell you where the community is at and how to interact with them.

Ignore the negative response, keep doing research, and maybe someday you'll accomplish something big.

Good luck :)

Comment by Knight Lee (Max Lee) on Six reasons why objective morality is nonsense · 2025-04-17T11:54:31.429Z · LW · GW

whether humans have particular opinions or not is also a matter of facts about the world

I'm not 100% sure I know what I'm talking about, but it feels like that's splitting hairs. Are you arguing that the distinction between objective and subjective is "very unhelpful," because the state of people's subjective beliefs is technically an objective fact of the world?

In that case, why don't you argue that all similar categorizations are unhelpful, e.g. map vs. territory?

Comment by Knight Lee (Max Lee) on AI-enabled coups: a small group could use AI to seize power · 2025-04-17T07:24:41.595Z · LW · GW

I agree that teaming up with everyone and working to ensure that power is spread democratically is the right strategy, rather than giving power to loyal allies who might betray you.

But some leaders don't seem to get this. During the Cold War, the US and USSR kept installing and supporting dictatorships in many other countries, even though their true allegiances were very dubious.

Comment by Knight Lee (Max Lee) on AI-enabled coups: a small group could use AI to seize power · 2025-04-17T04:12:38.277Z · LW · GW

Yeah, it's possible when you fear the other side seizing power, you start to want more power yourself.

Comment by Knight Lee (Max Lee) on AI-enabled coups: a small group could use AI to seize power · 2025-04-16T22:43:30.235Z · LW · GW

In a just world, mitigations against AI-enabled coups will be similar to mitigations against AI takeover risk.

In a cynical world, mitigations against AI-enabled coups involve installing your own allies to supervise (or lead) AI labs, and taking actions against humans you dislike. Leaders mitigating the risk may simply make sure that if it does happen, it's someone on their side. Leaders who believe in the risk may even accelerate the US-China AI race faster.

Note: I don't really endorse the "cynical world," I'm just writing it as food for thought :)

Comment by Knight Lee (Max Lee) on Commitment Races are a technical problem ASI can easily solve · 2025-04-16T21:57:22.481Z · LW · GW

After thinking about it more, it's possible your model of why Commitment Races resolve fairly is more correct than my model of why Commitment Races resolve fairly, although I'm less certain they do resolve fairly.

My model's flaw

My model is that acausal influence does not happen until one side deliberately simulates the other and sees their commitment. Therefore, it is advantageous for both sides to commit up to but not exceeding some Schelling point of fairness, before simulating the other, so that the first acausal message will maximize their payoff without triggering a mutual disaster.

I think one possibly fatal flaw of my model is that it doesn't explain why one side shouldn't add the exception "but if the other side became a rock with an ultimatum, I'll still yield to them, conditional on the fact they became a rock with an ultimatum before realizing I will add this exception (by simulating me or receiving acausal influence from me)."

According to my model, adding this exception improves one's encounters with rocks with ultimatums by yielding to them, and does not increase the rate of encountering rocks with ultimatums (at least in the first round of acausal negotiation, which may be the only round), since the exception explicitly rules out yielding to agents affected by whether you make the exception.

This means that in my model, becoming a rock with an ultimatum may still be the winning strategy, conditional on the fact the agent becoming a rock with an ultimatum doesn't know it is the winning strategy, and the Commitment Race problem may reemerge.

Your model

My guess about your model is that acausal influence is happening a lot, such that refusing in the ultimatum game can successfully punish the prior decision to be unfair (i.e. reduce the frequency of prior decisions to be unfair).

In order for your refusal to influence their frequency of being unfair, your refusal has to have some kind of acausal influence on them, even if they are relatively simpler minds than you (and can't simulate you).

At first, this seemed impossible to me, but after thinking about it more, maybe even if you are a more complex mind than the other player, your decision-making may be made out of simpler algorithms, some of which they can imagine and be influenced by.

Comment by Knight Lee (Max Lee) on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-16T21:18:29.322Z · LW · GW

Yeah, I definitely didn't remove all of my advantages. Another unfair thing I did was correct my typos, including accidentally writing the wrong label, when I decided that "I thought the right label, so I'm allowed to correct what I wrote into what I was thinking about."

Comment by Knight Lee (Max Lee) on To be legible, evidence of misalignment probably has to be behavioral · 2025-04-16T21:10:25.080Z · LW · GW

Oops. Maybe this kind of news does affect decision makers and I was wrong. I was just guessing that it had little effect, since... I'm not even sure why I thought so.

I did a Google search and it didn't look like the kind of news that governments responded to.

Comment by Knight Lee (Max Lee) on keltan's Shortform · 2025-04-16T07:16:59.189Z · LW · GW

I agree this stuff is addictive. AI makes things more interactive. Some people who never considered themselves vulnerable got sucked into AI relationships.

Possible push back:

What if short bits of addictive content generated by humans (but selected by algorithms) are already near max addictiveness? And by the time AI can design/write a video game etc. twice as addictive as anything humans can design, we already have a superintelligence explosion, and either addiction is solved or we are dead?

Comment by Knight Lee (Max Lee) on To be legible, evidence of misalignment probably has to be behavioral · 2025-04-16T07:01:31.388Z · LW · GW

When Gemini randomly told an innocent user to go kill himself, it made the news, but this news didn't really affect very much in the big picture.

It's possible that relevant decision-makers don't care that much about dramatic bad behaviours since the vibe is "oh yeah AI glitches up, oh well."

It's possible that relevant decision-makers do care more about what the top experts believe, and if the top experts are convinced that current models already want to kill you (but can't), it may have an effect. Imagine if many top experts agree that "the lie detectors start blaring like crazy when the AI is explaining how it won't kill all humans even if it can get away with it."

I'm not directly disagreeing with this post, I'm just saying there exists this possible world model where behavioural evidence isn't much stronger (than other misalignment evidence).

Comment by Knight Lee (Max Lee) on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-16T06:08:59.730Z · LW · GW

Just for fun, I tried solving the sliding block puzzle pretending I was an LLM (no drawing pictures, no copy and paste, only writing text).

It was hard in an interesting way: I repeatedly had to say "oops, I made a mistake."

The mistakes I made were familiar mistakes that anyone could make, but they were also the kind of mistake for which I would never write "oops I made a mistake" unless I'm pretending to be an LLM. They were the kind of mistake that would normally cause me to erase and delete the mistake and simply write the correction, since they were far too uninteresting to save.

My human chain-of-thought

I have the following pieces in this order:

Sun    Leaf  (Empty)

Bell   Snake   Star

(Empty) Flower (Empty)

 each can be moved into an orthogonally adjacent empty square. Target:

Snake  Star    Flower

(Empty) Sun (Empty)

Leaf   (Empty) Bell

, find the smallest number of moves for getting them from the initial configuration to the target one.

As an actual human, let me try to figure this one out.

First, Sun moves from the top left corner to the middle.

Leaf moves from the center top to the bottom left.

Bell moves from the center left to the bottom right.

Snake moves from the center to the top left.

Star moves from the center right to the center top.

Flower moves from the bottom to the top right.

As an actual human, I realize it is hard to keep track of all these targets because I know when I feel a bit confused. So I will not try to keep track of all these motions at once, but change strategy to moving the pieces based on guesses until I find a solution, and hopefully I can argue the solution is the shortest.

So let me move a piece that is very far from its target in the general direction of its target; this move is unlikely to need to be reversed.

Let me try moving Bell down. This creates a spot at the center left, and Bell is now at the bottom left corner.

I could move Sun down or Snake left now. Both will get closer to their target. Moving Snake Left opens a spot in the middle, which can be used to ferry pieces to other connected empty spots. So I will move Snake left. Now Snake is at the center left, and there is a spot in the center.

I can use the empty spot in the middle to move pieces. I can move Star to its destination, or move Leaf or Flower close to their destinations, but I'm not sure where to put them so I will move Star first. I move Star from the center right to the middle, then from the middle... up? Oops I made a mistake, there is no empty space at the top because Leaf is there.

Maybe I won't move Star at all. Maybe, backtracking to before I tried to move Star, I will move Flower right from the center bottom to the bottom right. This opens a spot to move Leaf to its position.

Next I will move Leaf down from the center top to the middle, and then move Leaf down again from the middle to the center bottom, and then move Leaf again from the center bottom to the bottom left.

To recap, I first move Bell down, then move Snake left, then move Flower right, and oh no I spotted another mistake. I cannot move Leaf to its target position because Bell is blocking it.

Let me try to draw the situation before I moved Flower, so I can better keep track of what is happening.

Sun     Leaf    (Empty)

Snake   (Empty) Star

Bell    Flower  (Empty)

At this point, it looks like I should fill some of the empty corners to make it easier to move other things. I will move Star up, since that moves it closer to its destination.

At this point, Leaf is blocking Star from moving to where it should go, Flower is blocking Leaf from moving to where it should go, and a combination of Leaf and Star is blocking Flower from moving to where it should go.

Oh wait, I made a mistake. I forgot that I moved Star up. As a human, I am aware of what visual model I used to describe things, and noticed it did not include moving Star up.

So now let me redescribe it. Leaf is still blocking Star, Flower and Bell are blocking Leaf, I somehow forgot to mention Bell, and Star is blocking Flower.

I spot a solution. Move Leaf down, then move Star left into its place, then move Flower right, move Flower up, and move Flower up again into its place.

To recap, after moving Bell down and Snake left, I move Star up, I move Leaf down, I move Star left, I move Flower right, I move Flower up, and I move Flower up again.

Let me draw the new situation so I can better keep track of what is happening. As a human I am able to realize when I lose track of the state.

Sun     Flower  (Empty)

Snake   Leaf

Oops, I drew it wrong, let me try drawing it again. I forgot that Star should be in the center top while Flower should be in the top right.

Sun     Star    Flower

Snake   Leaf    (Empty)

Bell    (Empty) (Empty)

As a human, I have to scroll up in order to see the target state, so I will copy the target state again to avoid having to scroll up in order to see the two pictures at once. I won't use copy and paste because that may be considered cheating, it would let me create an image of each move which would fail to demonstrate the difficulties of solving this problem using only text and not much visual aids.

The target was:

Snake   Star    Flower

(Empty) Sun     (Empty)

Leaf    (Empty) Bell

Alright, I realize there is a problem here. Snake only needs to move down by one, but Sun needs to move to where Snake is. There is no way for Sun to get there unless some pieces move away from their target. It is likely that this is necessary, but if it does occur it will be much harder to prove I moved the pieces to their destination in as few moves as possible.

Therefore, maybe I should backtrack to earlier. After moving only Bell down and Snake left, it looked like this:

Sun     Leaf    (Empty)

Snake   (Empty) Star

Bell    Flower  (Empty)

So I want to move. Wait, what was I talking about? I somehow mixed up my current state with the target. The above paragraphs may be nonsense then.

Actually, it still makes sense even if I mixed up the current state and target. Basically, the Sun is trapped in the corner.

So let me return to the earlier state. I think I might still do the same strategy of moving Flower in its L shape path and moving Star up and left, but I will do this slowly, and make sure Sun gets out before Star moves to its destination.

I will move Flower right, and move. Oh yes, I also don't want to trap Bell there in its corner if I moved Leaf down.

So after moving Flower right, I move Star up, and move Flower up. Next I move Bell right and move Bell right again so it is in a corner and out of the way.

To recap, after moving Bell down and Snake left, I move Flower right, Star up, Flower up, Bell right, and Bell right.

I will draw it again, first I visually verify my last drawing was the state after moving Bell down and Snake left.

Now I draw it.

Sun     Leaf    Star

Snake   (Empty) Flower

(Empty) (Empty) Bell

To reach the target, Star and Flower still need to "rotate counterclockwise" by one unit. I can understand this visually because I am a human. Leaf is in their way, but can move to its own destination without issue due to the path.

Sun can move to its destination as soon as Leaf is out of the way, but before Star and Flower rotate to their positions. As soon as Sun is there, Snake can easily get to its spot. It can be seen that once the Sun and Leaf "train" moves, the Star and Flower train can move, as can Snake, the order here is not that important.

Therefore, I will move Leaf down, move Leaf down again, move Leaf left. I will then move Sun right, and move Sun down. Afterwards I will move Star left, and move Flower up. Finally I will move Snake up.

The result should look like the target, but let me draw it to verify.

Snake   Star    Flower

(Empty) Sun     (Empty)

Leaf    (Empty) Bell

Yes, this is the target. So in summary, the moves I made were:

Bell down, Snake left, Flower right, Star up, Flower up, Bell right, Bell right, Leaf down, Leaf down, Leaf left, Sun right, Sun down, Star left, Flower up, Snake up.

As a sanity check, I will redraw the original image, and verify that each block ends up at its final destination. I might also try to visualize possible collisions, but this is akin to restating everything and hard to do mentally.

Sun     Leaf    (Empty)

Bell    Snake   Star

(Empty) Flower  (Empty)

Yes, sun moves down and right to its target. Leaf moves down twice and left into its target. Bell moves down and right twice into its target. Snake moves left and up into its target. Star moves up and left into its target. Flower moves right and up twice into its target.

If there are no collisions, this solution is the shortest since every piece only moves in the direction towards its target.

I visualized the process and there seems to be no collisions, but it was very fuzzy. I am 80% confident this is a right answer.

Bell down, Snake left, Flower right, Star up, Flower up, Bell right, Bell right, Leaf down, Leaf down, Leaf left, Sun right, Sun down, Star left, Flower up, Snake up.

The smallest number of moves is 15.

Edit: I tested o3 (not mini) with the same prompt and it solved it by writing a Breadth First Search from scratch (in Python).
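For reference, a minimal sketch of what such a breadth-first search could look like (my own reconstruction, not o3's actual code; the board encoding is just an illustrative choice):

```python
# BFS over board states for the 3x3 sliding-piece puzzle above.
# A board is a tuple of 9 cells read row by row, with None for an empty square;
# one move slides a piece into an orthogonally adjacent empty square.
from collections import deque

START = ("Sun", "Leaf", None,
         "Bell", "Snake", "Star",
         None, "Flower", None)
TARGET = ("Snake", "Star", "Flower",
          None, "Sun", None,
          "Leaf", None, "Bell")

def neighbors(board):
    """Yield every board reachable in exactly one move."""
    for i, piece in enumerate(board):
        if piece is None:
            continue
        r, c = divmod(i, 3)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < 3 and 0 <= nc < 3 and board[nr * 3 + nc] is None:
                new = list(board)
                new[i], new[nr * 3 + nc] = None, piece
                yield tuple(new)

def min_moves(start, target):
    """Return the length of the shortest move sequence from start to target."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        board, depth = queue.popleft()
        if board == target:
            return depth
        for nxt in neighbors(board):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target unreachable

print(min_moves(START, TARGET))  # should print 15, matching the hand-found solution
```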

Comment by Knight Lee (Max Lee) on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-16T04:10:24.108Z · LW · GW
Comment by Knight Lee (Max Lee) on Knight Lee's Shortform · 2025-04-15T06:35:26.234Z · LW · GW

Can anyone explain why my "Constitutional AI Sufficiency Argument" is wrong?

I strongly suspect that most people here disagree with it, but I'm left not knowing the reason.

The argument says: whether or not Constitutional AI is sufficient to align superintelligences hinges on two key premises:

  1. The AI's capabilities on the task of evaluating its own corrigibility/honesty, is sufficient to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough to not sabotage this task).
  2. It starts off corrigible/honest enough to not sabotage this self evaluation task.

My ignorant view is that so long as 1 and 2 are satisfied, the Constitutional AI can probably remain corrigible/honest even to superintelligence.

If that is the case, isn't it extremely important to study "how to improve the Constitutional AI's capabilities in evaluating its own corrigibility/honesty?"

Shouldn't we be spending a lot of effort improving this capability, and trying to apply a ton of methods towards this goal (like AI debate and other judgment improving ideas)?

At least the people who agree with Constitutional AI should be in favour of this...?

Can anyone kindly explain what I am missing? I wrote a post and I think almost nobody agreed with this argument.

Thanks :)

Comment by Knight Lee (Max Lee) on What if there was a nuke in Manhattan and why that could be a good thing · 2025-04-15T03:02:32.972Z · LW · GW

Maybe it's a ring that explodes if cut? I'm not saying I can prove it'll work, just that there might be some way or another to target the leaders rather than random civilians in a city (which the leaders might not care about).

Comment by Knight Lee (Max Lee) on What if there was a nuke in Manhattan and why that could be a good thing · 2025-04-15T02:46:41.598Z · LW · GW

What if the bomb was a ring around their neck or something?

Comment by Knight Lee (Max Lee) on What if there was a nuke in Manhattan and why that could be a good thing · 2025-04-15T01:51:36.109Z · LW · GW

Maybe instead of threatening a city it can just threaten a country's top leaders, e.g. they have to wear bombs.

Comment by Knight Lee (Max Lee) on A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · 2025-04-15T01:15:54.332Z · LW · GW

Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You're a true researcher, take my strong upvote.

My idea consists of a "hammer" and a "nail." GDM's paper describes a "hammer" very similar to mine (perhaps superior), but lacks the "nail."

The fact the hammer they invented resembles the hammer I invented is evidence in favour of me: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]

The "nail" of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the "Constitutional AI Sufficiency Theorem."

The "hammer" of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.

  1. ^

    It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/

Comment by Knight Lee (Max Lee) on Eli's shortform feed · 2025-04-14T11:09:18.835Z · LW · GW

Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let's hope it stays democratic :/

No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth, but if the child says he/she wants to be a singer instead, the parent helps him/her with that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.

Comment by Knight Lee (Max Lee) on Eli's shortform feed · 2025-04-14T10:57:34.264Z · LW · GW
  • Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
  • Maybe a good parent who listens to his/her child's dreams?

Very good question though. Humans usually aren't very corrigible, and there aren't many examples!