LessWrong 2.0 Reader

AI #92: Behind the Curve
Zvi · 2024-11-28T14:40:05.448Z · comments (7)

[question] What are the good rationality films?
Ben Pace (Benito) · 2024-11-20T06:04:56.757Z · answers+comments (53)

Implications of the inference scaling paradigm for AI safety
Ryan Kidd (ryankidd44) · 2025-01-14T02:14:53.562Z · comments (62)

[link] Gwern Branwen interview on Dwarkesh Patel’s podcast: “How an Anonymous Researcher Predicted AI's Trajectory”
Said Achmiz (SaidAchmiz) · 2024-11-14T23:53:34.922Z · comments (0)

Dentistry, Oral Surgeons, and the Inefficiency of Small Markets
GeneSmith · 2024-11-01T17:26:06.466Z · comments (16)

Effective Evil's AI Misalignment Plan
lsusr · 2024-12-15T07:39:34.046Z · comments (9)

The Game Board has been Flipped: Now is a good time to rethink what you’re doing
Alex Lintz (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (9)

[link] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can (Can Rager) · 2024-12-11T06:30:37.076Z · comments (6)

On the OpenAI Economic Blueprint
Zvi · 2025-01-15T14:30:06.773Z · comments (1)

Testing which LLM architectures can do hidden serial reasoning
Filip Sondej · 2024-12-16T13:48:34.204Z · comments (9)

Graceful Degradation
Screwtape · 2024-11-05T23:57:53.362Z · comments (8)

I'm offering free math consultations!
Gurkenglas · 2025-01-14T16:30:40.115Z · comments (6)

Should there be just one western AGI project?
rosehadshar · 2024-12-03T10:11:17.914Z · comments (72)

[link] Best-of-N Jailbreaking
John Hughes (john-hughes) · 2024-12-14T04:58:48.974Z · comments (5)

[link] Gwern: Why So Few Matt Levines?
kave · 2024-10-29T01:07:27.564Z · comments (10)

The 2023 LessWrong Review: The Basic Ask
Raemon · 2024-12-04T19:52:40.435Z · comments (25)

2025 Prediction Thread
habryka (habryka4) · 2024-12-30T01:50:14.216Z · comments (19)

Human study on AI spear phishing campaigns
Simon Lermen (dalasnoin) · 2025-01-03T15:11:14.765Z · comments (8)

Brief analysis of OP Technical AI Safety Funding
22tom (thomas-barnes) · 2024-10-25T19:37:41.674Z · comments (5)

MONA: Managed Myopia with Approval Feedback
Seb Farquhar · 2025-01-23T12:24:18.108Z · comments (29)

Six Thoughts on AI Safety
boazbarak · 2025-01-24T22:20:50.768Z · comments (49)

The Packaging and the Payload
Screwtape · 2024-11-12T03:07:37.209Z · comments (1)

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.
Andrew_Critch · 2024-11-22T03:26:11.681Z · comments (53)

What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (15)

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (15)

Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)
Elizabeth (pktechgirl) · 2024-10-22T18:20:01.194Z · comments (79)

[link] Video lectures on the learning-theoretic agenda
Vanessa Kosoy (vanessa-kosoy) · 2024-10-27T12:01:32.777Z · comments (0)

Counting AGIs
cash (cshunter) · 2024-11-26T00:06:17.845Z · comments (19)

[link] Cost, Not Sacrifice
Joe Rogero · 2024-11-20T21:32:26.281Z · comments (13)

Could randomly choosing people to serve as representatives lead to better government?
John Huang · 2024-10-21T17:10:20.920Z · comments (13)

Introducing Transluce — A Letter from the Founders
jsteinhardt · 2024-10-23T18:10:02.526Z · comments (2)

No one has the ball on 1500 Russian olympiad winners who've received HPMOR
Mikhail Samin (mikhail-samin) · 2025-01-12T11:43:36.560Z · comments (21)

Planning for Extreme AI Risks
joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z · comments (0)

When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (16)

New, improved multiple-choice TruthfulQA
Owain_Evans · 2025-01-15T23:32:09.202Z · comments (0)

Beards and Masks?
jefftk (jkaufman) · 2025-01-18T16:00:04.049Z · comments (5)

The Mask Comes Off: At What Price?
Zvi · 2024-10-21T23:50:05.247Z · comments (16)

[link] Moderately More Than You Wanted To Know: Depressive Realism
JustisMills · 2025-01-13T02:57:32.022Z · comments (4)

[link] Policymakers don't have access to paywalled articles
Adam Jones (domdomegg) · 2025-01-05T10:56:11.495Z · comments (10)

The King and the Golem - The Animation
Writer · 2024-11-08T18:23:10.935Z · comments (0)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (9)

[link] If far-UV is so great, why isn't it everywhere?
Austin Chen (austin-chen) · 2024-10-19T18:56:58.910Z · comments (23)

[link] "Map of AI Futures" - An interactive flowchart
swante · 2024-11-27T21:31:40.269Z · comments (3)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

Numberwang: LLMs Doing Autonomous Research, and a Call for Input
eggsyntax · 2025-01-16T17:20:37.552Z · comments (30)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (4)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

Heritability: Five Battles
Steven Byrnes (steve2152) · 2025-01-14T18:21:17.756Z · comments (18)

[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (18)

Retrospective: 12 [sic] Months Since MIRI
james.lucassen · 2025-01-21T02:52:06.271Z · comments (0)

Recent comments

knight-lee on Scanless Whole Brain Emulation

Thank you, these are very thoughtful points and concerns.

You're right that a general AI pause on everything AI wouldn't be wise. My view is that most (but not all) people talking about an AI pause are referring only to pausing general-purpose LLMs above a certain level of capability, e.g. o1 or o3. I should have clarified what I meant by "AI pause."

I agree that companies which want to be profitable should focus on medical products rather than such a moonshot. The idea I wrote here is definitely not an investor pitch; it's more of an idea for discussion, similar to the FHI's discussion on Whole Brain Emulation.

AI safety implications

Yes, building any superintelligence is inherently dangerous. But not all superintelligences are equally dangerous!

No self modifications

In the beginning, the simulated humans should not do any self modifications at all, and just work like a bunch of normal human researchers (e.g. on AI alignment, or aligning the smarter versions of themselves). The benefit is that the smartest researchers can be cloned many times, and they might think many times faster.

Gradual self modifications

The simulated humans can modify a single volunteer to become slightly smarter, while other individuals monitor her. The single modified volunteer might describe her ideal world in detail, and may be subject to a lie detector which actually works.

Why modified humans are still safer than LLMs

The main source of danger is not a superintelligence which kills or harms people out of "hatred" or "disgust" or any human-like emotion. Instead, the main source of extinction is a superintelligence which assigns absolutely zero weight to everything humans cherish, and converts the entire universe into paperclips or whatever its goal is. It does not even spare a tiny fraction for humans to live in.

The only reason to expect LLMs to be safe in any way is that they model human thinking, and are somewhat human-like. But they are obviously far less human-like than actual simulated humans.

A group of simulated humans who modify themselves to become smarter can definitely screw up at some stage, and end up as a bad superintelligence which assigns exactly zero weight to everything humans cherish. The path may be long and treacherous, and success is by no means guaranteed.

However, it is still much more hopeful than having a bunch of o3-like AIs, which undergo progressively more reinforcement learning towards rewards such as "solve this programming challenge" or "prove this math equation," until there is so much reinforcement learning that their thoughts no longer resemble those of the pretrained LLMs they started off as (which were at least trying to model human thinking).

"You can tell the RL is done properly when the models cease to speak English in their chain of thought"
-Andrej Karpathy

Since the pretrained LLMs can never truly exceed human level intelligence, the only way for reinforcement learning to create an AI with far higher intelligence, may be to steer them far from how humans think.

These strange beings are optimized to solve these problems and occasionally optimized to give lip service to human values. But it is highly uncertain what their resulting end goals will be. They could easily be very far from human values.

Reinforcement learning (e.g. for solving math problems or giving lip service to human values) only controls their behavior/thoughts, not their goals. Their goals, for all we know, are essentially random lotteries, with a mere tendency to resemble human goals (since they started off as LLMs). The tendency gets weaker and weaker as more reinforcement learning is done, and you can only blindly guess if the tendency will remain strong enough to save us.

raemon on Subskills of "Listening to Wisdom"

Okay while I disagree with a bunch of your framing here, I do think "look for ways to falsify" is indeed important and not really how I was orienting to it, and should be at least a part of how I'm orienting to it.

The way I was orienting to it was "try it, see how well it works, iterate, report honestly how it went." (Fwiw this is more like a side project for me, and my mainline rationality development work is much more oriented to "design the process so it's easier to evaluate and falsify.")

I will mull over "explicitly aim to falsify" here, although I think this is the sort of question where that's actually pretty hard. I think most forms of self-improvement are hard to justify scientifically, take a long time to pay off, and effects are subtle and intertwined enough that it's hard to tell what works.

I don't see offhand a better approach than "try it a bunch and see if it seems to work, and eventually give up if it doesn't."

(FYI the practical place I'm most focused on testing this out is in teaching junior programmers "debugging" and "code taste". I don't think that can result in data that should persuade you, or even that should particularly change your mind that it should have persuaded me)

...

I do think I have deep disagreements about orientation here, where I think a lot of early stage development of rationality ideas requires going out on a limb, it'll be years before it's really clear whether it works or not. You can do RCTs, but they are very expensive and don't make sense until you basically already know it works because you have large effect sizes and you want to justify it to skeptics.

I know some rationality training developers who have been extremely careful about not posting things until they're confident they're real, and honestly I think that attitude had more downsides than upsides (I think it basically killed LessWrong-as-a-place-where-people-do-rationality-development). I think it is much better to post your updates as you do them, with appropriate epistemic caveats.

...

I do think the claim here is... actually just not very unreasonable?

The claim here is:

You can deliberately try to learn why a person has made a subtle update.
If they are good at explaining or you are good at interviewing, you can learn the details about what you'd do differently if you yourself made the update, and ask yourself whether the update actually makes sense.
It may help to imagine more viscerally the situations that caused the person to make the update.

I would be extremely surprised if this didn't help at all. I wouldn't be too surprised if it turns out to never be as good as actually burning out / playing chess for 200 hours / etc.

stormykat on [Intuitive self-models] 2. Conscious Awareness

I love this series very much and I thank you for writing all of this. There's a lot I could ask or say about this, but specifically there's a line of thought right now I want to query you on (I might message you more on other topics later if that's okay; this series has sparked a lot of thought in me). Apologies if I use bad or unclear language; I'm very well informed on the philosophy, not quite so much on the hard science.

I am very interested in this concept of awareness (because of my own interest in phenomenal consciousness). In the previous article, you say that:

every vertebrate, and presumably many invertebrates too, are also active agents with predictive learning algorithms in their brain, and hence their predictive learning algorithms are also incentivized to build intuitive self-models

This is a bit random but it's relevant for my own ideas that I'm developing. Do you think making predictions about your own predictive algorithms is something that requires recurrent processing (as in feedback from higher processing layers onto lower ones)? I'm asking this because it actually seems like fish, for instance, as well as octopuses, have pretty much entirely feed-forward brains, so it seems dubious to me that they could model themselves as having awareness (or even model their own algorithm at all) unless I'm misunderstanding how this works on an algorithmic level.

If I'm right and this means they probably can't model their own algorithm, this would probably imply everything 'below' reptiles at the least lacks the ability to do this as well (and thus lacks the ability to model itself as having awareness).

kaarel on Views on when AGI comes and on strategy to reduce existential risk

¿ thoughts on the following:

solving >95% of IMO problems while never seeing any human proofs, problems, or math libraries (before being given IMO problems in base ZFC at test time). like alphaproof except not starting from a pretrained language model and without having a curriculum of human problems and in base ZFC with no given libraries (instead of being in lean), and getting to IMO combos

alphaandomega on AlphaAndOmega's Shortform

Who knows how long regulatory inertia might last? I agree it'll probably add at least a few years to my employability, past the date where an AI can diagnose, plan and prescribe better than I can. It might not be something to rely on, if you end up with a regime where a single doctor rubberstamps hundreds of decisions, in place of what a dozen doctors did before. There's not that much difference between 90% and 100% unemployment!

kristaps-zilgalvis-1 on Are we trying to figure out if AI is conscious?

Thanks!! I'll try it out.

vecn-the0verl0rd on Chapter 37: Interlude: Crossing the Boundary

It says two comments for me before I posted this, so it looks like it has been fixed.

rohinmshah on MONA: Managed Myopia with Approval Feedback

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Is that right?

Yup, that sounds basically right to me.

sohaib-imran on Does the ChatGPT (web)app sometimes show actual o1 CoTs now?

This is probably not CoT.

sohaib-imran on Does the ChatGPT (web)app sometimes show actual o1 CoTs now?

The o1 evaluation doc gives -- or appears to give -- the full CoT for a small number of examples, that might be an interesting comparison.

Just had a look. One difference between the CoTs in the evaluation doc and the "CoTs" in the screenshots above is that the evaluation doc's CoTs tend to begin with the model referring to itself, e.g. "We are asked to solve this crossword puzzle.", "We are told that for all integer values of k", or to the problem at hand, e.g. "First, what is going on here? We are given:". Deepseek r1 tends to do that as well. The above screenshots don't show that behaviour; they instead begin by talking in the second person, e.g. "you're right" and "here's your table", which sounds very much like a model response rather than CoT. In fact the last screenshot is almost the same as the response, and upon closer inspection, all the greyed-out texts pass as good responses instead of the final responses in black.

I think this is strong evidence that the greyed-out text is NOT CoT but perhaps an alternative response. Thanks for prompting me!