LessWrong 2.0 Reader


Outlaw Code
scarcegreengrass · 2025-01-30T23:41:57.239Z · comments (1)
Can someone, anyone, make superintelligence a more concrete concept?
Ori Nagel (ori-nagel) · 2025-01-30T23:25:36.135Z · comments (4)
Upcoming Neuroscience Workshop - Functionalizing Brain Data, Ground-Truthing, and the Role of Artificial Data in Advancing Neuroscience
Devin Ward (Carboncopies Foundation) · 2025-01-30T23:02:00.681Z · comments (0)
What's Behind the SynBio Bust?
sarahconstantin · 2025-01-30T22:30:06.916Z · comments (7)
[link] The future of humanity is in management
jasoncrawford · 2025-01-30T22:14:46.765Z · comments (4)
[Translation] AI Generated Fake News is Taking Over my Family Group Chat
mushroomsoup · 2025-01-30T20:24:22.175Z · comments (0)
A sketch of an AI control safety case
Tomek Korbak (tomek-korbak) · 2025-01-30T17:28:47.992Z · comments (0)
[link] Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Jan_Kulveit · 2025-01-30T17:03:45.545Z · comments (46)
[question] Implication of Uncomputable Problems
Nathan1123 · 2025-01-30T16:48:38.222Z · answers+comments (1)
[link] Hello World
Charlie Sanders (charlie-sanders) · 2025-01-30T15:33:57.427Z · comments (0)
Introducing the Coalition for a Baruch Plan for AI: A Call for a Radical Treaty-Making process for the Global Governance of AI
rguerreschi · 2025-01-30T15:26:09.482Z · comments (0)
AI #101: The Shallow End
Zvi · 2025-01-30T14:50:08.269Z · comments (1)
Memorization-generalization in practice
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-30T14:10:48.239Z · comments (0)
ARENA 5.0 - Call for Applicants
JamesH (AtlasOfCharts) · 2025-01-30T13:18:27.052Z · comments (0)
[link] You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com
Arjun Panickssery (arjun-panickssery) · 2025-01-30T12:35:03.564Z · comments (3)
[question] Should you publish solutions to corrigibility?
rvnnt · 2025-01-30T11:52:05.983Z · answers+comments (13)
[link] Tetherware #1: The case for humanlike AI with free will
Jáchym Fibír · 2025-01-30T10:58:11.717Z · comments (6)
[link] A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology
Cosmia_Nebula · 2025-01-30T09:53:16.152Z · comments (1)
Are we the Wolves now? Human Eugenics under AI Control
Brit (james-spencer) · 2025-01-30T08:31:34.423Z · comments (1)
[question] Why not train reasoning models with RLHF?
CBiddulph (caleb-biddulph) · 2025-01-30T07:58:35.742Z · answers+comments (4)
The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.
Shivam · 2025-01-30T02:44:47.907Z · comments (0)
How *exactly* can AI take your job in the next few years?
Ansh Juneja (ansh-juneja) · 2025-01-30T02:33:13.475Z · comments (0)
Absorbing Your Friends' Powers
Alice Blair (Diatom) · 2025-01-30T02:32:27.091Z · comments (1)
Detailed Ideal World Benchmark
Knight Lee (Max Lee) · 2025-01-30T02:31:39.852Z · comments (0)
[link] Fertility Will Never Recover
Eneasz · 2025-01-30T01:16:41.332Z · comments (29)
[link] Predation as Payment for Criticism
Benquo · 2025-01-30T01:06:27.591Z · comments (6)
Learn to Develop Your Advantage
ReverendBayes (vedernikov-andrei) · 2025-01-29T22:06:00.641Z · comments (1)
Revealing alignment faking with a single prompt
Florian_Dietz · 2025-01-29T21:01:15.000Z · comments (5)
Allegory of the Tsunami
Evan Hu (evan-hu) · 2025-01-29T19:09:33.761Z · comments (1)
My Mental Model of AI Optimist Opinions
tailcalled · 2025-01-29T18:44:36.485Z · comments (2)
Planning for Extreme AI Risks
joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z · comments (3)
[question] Does the ChatGPT (web)app sometimes show actual o1 CoTs now?
Sohaib Imran (sohaib-imran) · 2025-01-29T17:27:08.067Z · answers+comments (6)
[link] Dario Amodei: On DeepSeek and Export Controls
Zach Stein-Perlman · 2025-01-29T17:15:18.986Z · comments (3)
[link] Anthropic CEO calls for RSI
Andrea_Miotti (AndreaM) · 2025-01-29T16:54:24.943Z · comments (10)
Efficiency spectra and “bucket of circuits” cartoons
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-29T15:06:50.768Z · comments (0)
DeepSeek: Lemon, It’s Wednesday
Zvi · 2025-01-29T15:00:07.914Z · comments (0)
How not to build a dystopia
ank · 2025-01-29T14:16:09.862Z · comments (4)
[link] Whereby: The Zoom alternative you probably haven't heard of
Itay Dreyfus (itay-dreyfus) · 2025-01-29T13:01:08.564Z · comments (0)
[question] Whose track record of AI predictions would you like to see evaluated?
Jonny Spicer (jonnyspicer) · 2025-01-29T12:05:30.311Z · answers+comments (1)
[link] Paper: Open Problems in Mechanistic Interpretability
Lee Sharkey (Lee_Sharkey) · 2025-01-29T10:25:54.727Z · comments (0)
Positive jailbreaks in LLMs
dereshev · 2025-01-29T08:41:44.680Z · comments (0)
Untrusted monitoring insights from watching ChatGPT play coordination games
jwfiredragon · 2025-01-29T04:53:33.125Z · comments (1)
The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (26)
Reconceptualizing the Nothingness and Existence
Htarlov (htarlov) · 2025-01-28T20:29:44.390Z · comments (1)
Fake thinking and real thinking
Joe Carlsmith (joekc) · 2025-01-28T20:05:06.735Z · comments (7)
SAE regularization produces more interpretable models
Peter Lai (peter-lai) · 2025-01-28T20:02:56.662Z · comments (6)
Operator
Zvi · 2025-01-28T20:00:08.374Z · comments (1)
DeepSeek Panic at the App Store
Zvi · 2025-01-28T19:30:07.555Z · comments (14)
“Sharp Left Turn” discourse: An opinionated review
Steven Byrnes (steve2152) · 2025-01-28T18:47:04.395Z · comments (20)
Detecting out of distribution text with surprisal and entropy
Sandy Fraser (alex-fraser) · 2025-01-28T18:46:46.977Z · comments (4)