LessWrong 2.0 Reader


Reviewing LessWrong: Screwtape's Basic Answer
Screwtape · 2025-02-05T04:30:34.347Z · comments (0)
[question] Why isn't AI containment the primary AI safety strategy?
OKlogic · 2025-02-05T03:54:58.171Z · answers+comments (0)
[link] We Fell For It
Nicholas / Heather Kross (NicholasKross) · 2025-02-05T03:07:43.175Z · comments (0)
Nick Land: Orthogonality
lumpenspace (lumpen-space) · 2025-02-04T21:07:04.947Z · comments (3)
What working on AI safety taught me about B2B SaaS sales
purple fire (jack-edwards) · 2025-02-04T20:50:19.990Z · comments (3)
Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker
Daniel Herrmann (Whispermute) · 2025-02-04T20:34:22.625Z · comments (2)
Anti-Slop Interventions?
abramdemski · 2025-02-04T19:50:29.127Z · comments (12)
Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails
Devina Jain (devina-jain) · 2025-02-04T19:10:13.933Z · comments (0)
[question] Journalism student looking for sources
pinkerton · 2025-02-04T18:58:49.740Z · answers+comments (0)
We’re in Deep Research
Zvi · 2025-02-04T17:20:06.540Z · comments (2)
[link] The Capitalist Agent
henophilia · 2025-02-04T15:32:39.694Z · comments (5)
[link] Forecasting AGI: Insights from Prediction Markets and Metaculus
Alvin Ånestrand (alvin-anestrand) · 2025-02-04T13:03:45.927Z · comments (0)
Ruling Out Lookup Tables
Alfred Harwood · 2025-02-04T10:39:34.899Z · comments (5)
Half-baked idea: a straightforward method for learning environmental goals?
Q Home · 2025-02-04T06:56:31.813Z · comments (0)
Information Versus Action
Screwtape · 2025-02-04T05:13:55.192Z · comments (0)
Utilitarian AI Alignment: Building a Moral Assistant with the Constitutional AI Method
Clément L · 2025-02-04T04:15:36.917Z · comments (0)
Tear Down the Burren
jefftk (jkaufman) · 2025-02-04T03:40:02.767Z · comments (2)
[link] Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)
Archimedes · 2025-02-04T02:55:44.401Z · comments (0)
Can someone, anyone, make superintelligence a more concrete concept?
Ori Nagel (ori-nagel) · 2025-02-04T02:18:51.718Z · comments (3)
[link] What are the "no free lunch" theorems?
Vishakha (vishakha-agrawal) · 2025-02-04T02:02:18.423Z · comments (1)
eliminating bias through language?
KvmanThinking (avery-liu) · 2025-02-04T01:52:01.508Z · comments (1)
New Foresight Longevity Bio & Molecular Nano Grants Program
Allison Duettmann (allison-duettmann) · 2025-02-04T00:28:30.147Z · comments (0)
[link] Meta: Frontier AI Framework
Zach Stein-Perlman · 2025-02-03T22:00:17.103Z · comments (2)
$300 Fermi Model Competition
ozziegooen · 2025-02-03T19:47:09.270Z · comments (4)
Visualizing Interpretability
Darold Davis (darold) · 2025-02-03T19:36:38.938Z · comments (0)
Alignment Can Reduce Performance on Simple Ethical Questions
Daan Henselmans (drhens) · 2025-02-03T19:35:42.895Z · comments (7)
The Overlap Paradigm: Rethinking Data's Role in Weak-to-Strong Generalization (W2SG)
Serhii Zamrii (aligning_bias) · 2025-02-03T19:31:55.282Z · comments (0)
Sleeper agents appear resilient to activation steering
Lucy Wingard (lucy-wingard) · 2025-02-03T19:31:30.702Z · comments (0)
Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad ECLIP
Gilber A. Corrales (mysticdeepai) · 2025-02-03T19:30:52.505Z · comments (0)
A "base process" conceptually "below" any "base" universes
Amy Johnson (Amy Minge) · 2025-02-03T19:11:22.706Z · comments (2)
How AGI Defines Its Self
Davey Morse (davey-morse) · 2025-02-03T18:47:22.287Z · comments (4)
Gettier Cases [repost]
Antigone (luke-st-clair) · 2025-02-03T18:12:22.253Z · comments (4)
The Self-Reference Trap in Mathematics
Alister Munday (alister-munday) · 2025-02-03T16:12:21.392Z · comments (22)
Stopping unaligned LLMs is easy!
Yair Halberstadt (yair-halberstadt) · 2025-02-03T15:38:27.083Z · comments (11)
The Outer Levels
Jerdle (daniel-amdurer) · 2025-02-03T14:30:29.230Z · comments (1)
o3-mini Early Days
Zvi · 2025-02-03T14:20:06.443Z · comments (0)
[link] OpenAI releases deep research agent
Seth Herd · 2025-02-03T12:48:44.925Z · comments (20)
Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space
Roman Malov · 2025-02-03T10:30:48.866Z · comments (0)
[question] Can we infer the search space of a local optimiser?
Lucius Bushnaq (Lblack) · 2025-02-03T10:17:01.661Z · answers+comments (5)
Pick two: concise, comprehensive, or clear rules
Screwtape · 2025-02-03T06:39:05.815Z · comments (27)
[link] Language Models and World Models, a Philosophy
kyjohnso · 2025-02-03T02:55:36.577Z · comments (0)
[link] Keeping Capital is the Challenge
LTM · 2025-02-03T02:04:27.142Z · comments (2)
Use computers as powerful as in 1985 or AI controls humans or ?
jrincayc (nerd_gatherer) · 2025-02-03T00:51:05.706Z · comments (0)
Some Theses on Motivational and Directional Feedback
abstractapplic · 2025-02-02T22:50:04.270Z · comments (1)
Humanity Has A Possible 99.98% Chance Of Extinction
st3rlxx · 2025-02-02T21:46:49.620Z · comments (1)
Exploring how OthelloGPT computes its world model
JMaar (jim-maar) · 2025-02-02T21:29:09.433Z · comments (0)
An Introduction to Evidential Decision Theory
Babić · 2025-02-02T21:27:35.684Z · comments (1)
"DL training == human learning" is a bad analogy
kman · 2025-02-02T20:59:21.259Z · comments (0)
Conditional Importance in Toy Models of Superposition
james__p · 2025-02-02T20:35:38.655Z · comments (1)
Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings
Ivan Dostal (#R@q0YSDZ3ov$f6J) · 2025-02-02T19:56:34.771Z · comments (0)
next page (older posts) →