LessWrong 2.0 Reader

Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)

Testing which LLM architectures can do hidden serial reasoning
Filip Sondej · 2024-12-16T13:48:34.204Z · comments (9)

Scaffolding for "Noticing Metacognition"
Raemon · 2024-10-09T17:54:13.657Z · comments (4)

[link] Gwern Branwen interview on Dwarkesh Patel’s podcast: “How an Anonymous Researcher Predicted AI's Trajectory”
Said Achmiz (SaidAchmiz) · 2024-11-14T23:53:34.922Z · comments (0)

Dentistry, Oral Surgeons, and the Inefficiency of Small Markets
GeneSmith · 2024-11-01T17:26:06.466Z · comments (16)

Graceful Degradation
Screwtape · 2024-11-05T23:57:53.362Z · comments (8)

Rationality Quotes - Fall 2024
Screwtape · 2024-10-10T18:37:55.013Z · comments (26)

Effective Evil's AI Misalignment Plan
lsusr · 2024-12-15T07:39:34.046Z · comments (9)

[link] Gwern: Why So Few Matt Levines?
kave · 2024-10-29T01:07:27.564Z · comments (10)

Scaling Sparse Feature Circuit Finding to Gemma 9B
Diego Caples (diego-caples) · 2025-01-10T11:08:11.999Z · comments (10)

[link] Best-of-N Jailbreaking
John Hughes (john-hughes) · 2024-12-14T04:58:48.974Z · comments (5)

Should there be just one western AGI project?
rosehadshar · 2024-12-03T10:11:17.914Z · comments (72)

The 2023 LessWrong Review: The Basic Ask
Raemon · 2024-12-04T19:52:40.435Z · comments (25)

Bitter lessons about lucid dreaming
avturchin · 2024-10-16T21:27:04.725Z · comments (62)

Human study on AI spear phishing campaigns
Simon Lermen (dalasnoin) · 2025-01-03T15:11:14.765Z · comments (8)

2025 Prediction Thread
habryka (habryka4) · 2024-12-30T01:50:14.216Z · comments (18)

[Intuitive self-models] 3. The Homunculus
Steven Byrnes (steve2152) · 2024-10-02T15:20:18.394Z · comments (36)

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.
Andrew_Critch · 2024-11-22T03:26:11.681Z · comments (53)

Secular Solstice Round Up 2024
dspeyer · 2024-11-21T10:49:36.682Z · comments (15)

The Packaging and the Payload
Screwtape · 2024-11-12T03:07:37.209Z · comments (1)

What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (15)

Brief analysis of OP Technical AI Safety Funding
22tom (thomas-barnes) · 2024-10-25T19:37:41.674Z · comments (5)

Counting AGIs
cash (cshunter) · 2024-11-26T00:06:17.845Z · comments (19)

[link] Video lectures on the learning-theoretic agenda
Vanessa Kosoy (vanessa-kosoy) · 2024-10-27T12:01:32.777Z · comments (0)

Could randomly choosing people to serve as representatives lead to better government?
John Huang · 2024-10-21T17:10:20.920Z · comments (13)

🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (0)

[Intuitive self-models] 4. Trance
Steven Byrnes (steve2152) · 2024-10-08T13:30:41.446Z · comments (7)

Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)
Elizabeth (pktechgirl) · 2024-10-22T18:20:01.194Z · comments (79)

[link] Cost, Not Sacrifice
Joe Rogero · 2024-11-20T21:32:26.281Z · comments (13)

[question] Interest in Leetcode, but for Rationality?
Gregory (gregory-eales) · 2024-10-16T17:54:25.578Z · answers+comments (20)

Introducing Transluce — A Letter from the Founders
jsteinhardt · 2024-10-23T18:10:02.526Z · comments (2)

[link] A Narrow Path: a plan to deal with AI extinction risk
Andrea_Miotti (AndreaM) · 2024-10-07T13:02:15.229Z · comments (12)

Joshua Achiam Public Statement Analysis
Zvi · 2024-10-10T12:50:06.285Z · comments (14)

[link] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can (Can Rager) · 2024-12-11T06:30:37.076Z · comments (1)

When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (16)

The Mask Comes Off: At What Price?
Zvi · 2024-10-21T23:50:05.247Z · comments (16)

[link] If far-UV is so great, why isn't it everywhere?
Austin Chen (austin-chen) · 2024-10-19T18:56:58.910Z · comments (23)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (9)

The King and the Golem - The Animation
Writer · 2024-11-08T18:23:10.935Z · comments (0)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (3)

[link] Policymakers don't have access to paywalled articles
Adam Jones (domdomegg) · 2025-01-05T10:56:11.495Z · comments (10)

[link] Moderately More Than You Wanted To Know: Depressive Realism
JustisMills · 2025-01-13T02:57:32.022Z · comments (4)

[link] "Map of AI Futures" - An interactive flowchart
swante · 2024-11-27T21:31:40.269Z · comments (3)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 2024-10-11T22:13:51.033Z · comments (4)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (4)

[link] Anthropic leadership conversation
Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · comments (17)

No one has the ball on 1500 Russian olympiad winners who've received HPMOR
Mikhail Samin (mikhail-samin) · 2025-01-12T11:43:36.560Z · comments (17)

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (7)

[link] On Shifgrethor
JustisMills · 2024-10-27T15:30:13.688Z · comments (18)

Recent comments

oxidize on Subskills of "Listening to Wisdom"

I'm a 20-year-old who perceives myself as the kind of young founder you're probably talking to in this post. And I've noticed a lot of older guys have similar sentiments to you about younger guys and the perspective often annoys me. I do everything I can to learn from other people, but in the context of giving and receiving advice I believe that a lot of information is typically not considered. For example, you talk about a lot of mistakes younger people make that could be easily avoided if they had the older generation's wisdom, but as conveyed by this post, your knowledge/skillset of communicating complex ideas in a way that can be understood by someone with different knowledge/experiences appears limited to me. Additionally, I've written over a thousand documents and am thinking every hour of every day from the perspectives that I value and want to improve in. Someone who doesn't have any context with which to understand me or even what I want can seldom give good advice aimed at observed circumstances in my opinion. The best advice I've received tends to be advice not directed at me because of reasons like this.

I don't believe burnout is real. I have theories on why people think it's real, but I think the phenomenon people label as burnout is more complex than people understand, and advice about burnout is consequently not very helpful. I know you've said you don't typically try to argue/explain yourself around this belief, but if I'm wrong and you have some insight that can only be gained with experience I'm not privy to, I would be thankful if you would correct my false belief.

I agree with most of what I've read in this post, I mainly disagree with some of the perspectives you take.

I read through most of the beginning of this post. Then started skimming by my perception of header relevance to the original idea of the post. I'm not really sure what goal you were trying to achieve by branching off into so many different topics in a single post instead of creating separate posts, but I'm still pretty green to the LessWrong community and the norms here, so maybe you can enlighten me. I'm also very biased towards practical texts, so I believe I just am not the target audience for most of this post. I liked the statements about trigger #1 and trigger #2 in the practical section, and it's given me some insightful tips I'd never thought of on how to be a better listener. I recently made a note to myself that people say what they mean, so take them literally instead of translating their words into something you already understand. Admittedly, I have not been following my own advice. And this post has served as a valuable reminder for me. I'm really interested in communicating and learning about complex, soulful ideas so maybe you could direct me to some good practical posts on the topic. I've had to navigate this skillset entirely alone, so I'd rather not reinvent the wheel if someone has already publicized their work around this.

jakub_krys on Implications of the inference scaling paradigm for AI safety

I had a similar reflection yesterday regarding these inference-time techniques (post-training, unhobbling, whatever you want to call it) being in the very early days. Would it be too much of a stretch to draw parallels here between how such unhobbling methods led to an explosion of human capabilities over the past ~10000 years? The human DNA has undergone roughly the same number of 'gradient updates' (evolutionary cycles) as that of our predecessors from a few millennia ago. I see it as having an equivalent amount of training compute. Yet through an efficient use of tools, language, writing, coordination and similar, we have completely outdone what our ancestors were able to do.

There is a difference in that for us, these abilities arose naturally through evolution. We are now manually engineering them into AI systems. I would not be surprised to see a real capability explosion soon (much faster than what we are observing now) - not because of the continued scaling up of pre-training, but because of these post-training enhancements.

alex_altair on Alex_Altair's Shortform

Indeed, we know about those posts! Lmk if you have a recommendation for a better textbook-level treatment of any of it (modern papers etc). So far the grey book feels pretty standard in terms of pedagogical quality.

hzn on Parkinson's Law and the Ideology of Statistics

I don't completely disagree but there is also some danger of being systematically misleading.

I think your last 4 bullet points are really quite good & they probably apply to a number of organizations not just the World Bank. I'm inclined to view this as an illustration of organizational failure more than an evaluation of the World Bank. (Assuming of course that the book is accurate)!

I will say tho that my opinion of development economics is quite low…

felix-j-binder on Daniel Tan's Shortform

Re steganography for chain-of-thought: I've been working on a project related to this for a while, looking at whether RL for concise and correct answers might teach models to steganographically encode their CoT for benign reasons. There's an early write-up here: https://ac.felixbinder.net/research/2023/10/27/steganography-eval.html

Currently, I'm working with two BASIS fellows on actually training models to see if we can elicit steganography this way. I'm definitely happy to chat more/set up a call about this topic.

felix-j-binder on Daniel Tan's Shortform

That's interesting. One underlying consideration is that the object-level choices of reasoning steps are relative to a reasoner: differently abled agents need to decompose problems differently, know different things and might benefit from certain ways of thinking in different ways. Therefore, a model plausibly chooses CoT that works well for it "on the object level", without any steganography or other hidden information necessary. If that is true, then we would expect to see models benefit from their own CoT over that of others for basic, non-steganography reasons.

Consider a grade schooler and a grad student thinking out loud. Each benefits from having access to their own CoT, and wouldn't get much from the other's for obvious reasons.

I think the question of whether models actually choose their CoT with respect to their own needs, knowledge and ability is a very interesting one that is closely related to introspection.

ryan_greenblatt on Predict 2025 AI capabilities (by Sunday)

lemonhope on lemonhope's Shortform

Nobody has a deal where they'll pay you to not take an offer from an AI lab right? I realize that would be weird incentives, just curious.

kave on Lecture Series on Tiling Agents

Mod note: I've put this on Personal rather than Frontpage. I imagine the content of these talks will be frontpage content, but event announcements in general are not.

ozziegooen on johnswentworth's Shortform

This came from a Facebook thread where I argued that many of the main ways AI was described as failing fall into a few categories (John disagreed).

I appreciated this list, but they strike me as fitting into a few clusters.

...I would flag that much of that is unsurprising to me, and I think categorization can be pretty fine.
In order:
1) If an agent is unwittingly deceptive in ways that are clearly catastrophic, and that could be understood by a regular person, I'd probably put that under the "naive" or "idiot savant" category. As in, it has severe gaps in its abilities that a human or reasonable agent wouldn't. If the issue is that all reasonable agents wouldn't catch the downsides of a certain plan, I'd probably put that under the "we made a pretty good bet given the intelligence that we had" category.
2) I think that "What Failure Looks Like" is less Accident risk, more "Systemic" risk. I'm also just really unsure what to think about this story. It feels to me like it's a situation where actors are just not able to regulate externalities or similar.
3) The "fusion power generator scenario" seems like just a bad analyst to me. A lot of the job of an analyst is to flag important considerations. This seems like a pretty basic ask. For this itself to be the catastrophic part, I think we'd have to be seriously bad at this. (i.e. "idiot savant")
4) STEM-AGI -> I'd also put this in the naive or "idiot savant" category.
5) "that plan totally fails to align more-powerful next-gen AGI at all" -> This seems orthogonal to "categorizing the types of unalignment". This describes how incentives would create an unaligned agent, not what the specific alignment problem is. I do think it would be good to have better terminology here, but would probably consider it a bit adjacent to the specific topic of "AI alignment" - more like "AI alignment strategy/policy" or something.
6) "AGIs act much like a colonizing civilization" -> This sounds like either unalignment has already happened, or humans just gave AIs their own power+rights for some reason. I agree that's bad, but it seems like a different issue than what I think of as the alignment problem. More like, "Yea, if unaligned AIs have a lot of power and agency and different goals, that would be suboptimal"
7) "but at some point a particular subagent starts self-improving, goes supercritical, and takes over the rest of the system overnight." -> This sounds like a traditional mesa-agent failure. I expect a lot of "alignment" with a system made of a bunch of subcomponents is "making sure no subcomponents do anything terrible." Also, still leaves open the specific way this subsystem becomes/is unaligned.
8) "using an LLM to simulate a whole society." -> Sorry, I don't quite follow this one.

Personally, I like the focus "scheming" has. At the same time, I imagine there are another 5 to 20 clean concerns we should also focus on (some of which have been getting attention).

While I realize there's a lot we can't predict, I think we could do a much better job just making lists of different risk factors and allocating research amongst them.