LessWrong 2.0 Reader


[link] Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-02-06T15:46:53.024Z · comments (9)

Fake thinking and real thinking
Joe Carlsmith (joekc) · 2025-01-28T20:05:06.735Z · comments (11)

My model of what is going on with LLMs
Cole Wyeth (Amyr) · 2025-02-13T03:43:29.447Z · comments (49)

[link] A short course on AGI safety from the GDM Alignment team
Vika · 2025-02-14T15:43:50.903Z · comments (1)

C'mon guys, Deliberate Practice is Real
Raemon · 2025-02-05T22:33:59.069Z · comments (25)

[link] What the Headlines Miss About the Latest Decision in the Musk vs. OpenAI Lawsuit
garrison · 2025-03-06T19:49:02.145Z · comments (0)

Third-wave AI safety needs sociopolitical thinking
Richard_Ngo (ricraz) · 2025-03-27T00:55:30.548Z · comments (23)

AI Control May Increase Existential Risk
Jan_Kulveit · 2025-03-11T14:30:05.972Z · comments (13)

Timaeus in 2024
Jesse Hoogland (jhoogland) · 2025-02-20T23:54:56.939Z · comments (1)

The Lizardman and the Black Hat Bobcat
Screwtape · 2025-04-06T19:02:01.238Z · comments (13)

Reviewing LessWrong: Screwtape's Basic Answer
Screwtape · 2025-02-05T04:30:34.347Z · comments (18)

Show, not tell: GPT-4o is more opinionated in images than in text
Daniel Tan (dtch1997) · 2025-04-02T08:51:02.571Z · comments (41)

How I talk to those above me
Maxwell Peterson (maxwell-peterson) · 2025-03-30T06:54:59.869Z · comments (13)

The Rising Sea
Jesse Hoogland (jhoogland) · 2025-01-25T20:48:52.971Z · comments (2)

How training-gamers might function (and win)
Vivek Hebbar (Vivek) · 2025-04-11T21:26:18.669Z · comments (4)

[link] Elite Coordination via the Consensus of Power
Richard_Ngo (ricraz) · 2025-03-19T06:56:44.825Z · comments (15)

Six Thoughts on AI Safety
boazbarak · 2025-01-24T22:20:50.768Z · comments (55)

[link] Towards a scale-free theory of intelligent agency
Richard_Ngo (ricraz) · 2025-03-21T01:39:42.251Z · comments (24)

Dear AGI,
Nathan Young · 2025-02-18T10:48:15.030Z · comments (11)

We should start looking for scheming "in the wild"
Marius Hobbhahn (marius-hobbhahn) · 2025-03-06T13:49:39.739Z · comments (4)

[link] Wired on: "DOGE personnel with admin access to Federal Payment System"
Raemon · 2025-02-05T21:32:11.205Z · comments (45)

[link] Anthropic releases Claude 3.7 Sonnet with extended thinking mode
LawrenceC (LawChan) · 2025-02-24T19:32:43.947Z · comments (8)

How To Believe False Things
Eneasz · 2025-04-02T16:28:29.055Z · comments (10)

On Emergent Misalignment
Zvi · 2025-02-28T13:10:05.973Z · comments (5)

What goals will AIs have? A list of hypotheses
Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T20:08:31.539Z · comments (19)

The Risk of Gradual Disempowerment from AI
Zvi · 2025-02-05T22:10:06.979Z · comments (15)

[link] The Manhattan Trap: Why a Race to Artificial Superintelligence is Self-Defeating
Corin Katzke (corin-katzke) · 2025-01-21T16:57:00.998Z · comments (11)

How I force LLMs to generate correct code
claudio · 2025-03-21T14:40:19.211Z · comments (7)

Voting Results for the 2023 Review
Raemon · 2025-02-06T08:00:37.461Z · comments (3)

Vacuum Decay: Expert Survey Results
JessRiedel · 2025-03-13T18:31:17.434Z · comments (26)

[link] ASI existential risk: Reconsidering Alignment as a Goal
habryka (habryka4) · 2025-04-15T19:57:42.547Z · comments (14)

Stargate AI-1
Zvi · 2025-01-24T15:20:18.752Z · comments (1)

One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky (jacob-dunefsky) · 2025-04-14T06:40:41.503Z · comments (6)

OpenAI #11: America Action Plan
Zvi · 2025-03-18T12:50:03.880Z · comments (3)

A Slow Guide to Confronting Doom
Ruby · 2025-04-06T02:10:56.483Z · comments (20)

How might we safely pass the buck to AI?
joshc (joshua-clymer) · 2025-02-19T17:48:32.249Z · comments (58)

Ambiguous out-of-distribution generalization on an algorithmic task
Wilson Wu (wilson-wu) · 2025-02-13T18:24:36.160Z · comments (6)

The Mask Comes Off: A Trio of Tales
Zvi · 2025-02-14T15:30:15.372Z · comments (1)

Keltham's Lectures in Project Lawful
Morpheus · 2025-04-01T10:39:47.973Z · comments (4)

Microplastics: Much Less Than You Wanted To Know
jenn (pixx) · 2025-02-15T19:08:14.561Z · comments (8)

MONA: Managed Myopia with Approval Feedback
Seb Farquhar · 2025-01-23T12:24:18.108Z · comments (29)

You will crash your car in front of my house within the next week
Richard Korzekwa (Grothor) · 2025-04-01T21:43:21.472Z · comments (6)

Mistral Large 2 (123B) exhibits alignment faking
Marc Carauleanu (Marc-Everin Carauleanu) · 2025-03-27T15:39:02.176Z · comments (4)

Open problems in emergent misalignment
Jan Betley (jan-betley) · 2025-03-01T09:47:58.889Z · comments (13)

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik (lucy.fa) · 2025-02-26T12:50:04.204Z · comments (8)

Elon Musk May Be Transitioning to Bipolar Type I
Cyborg25 · 2025-03-11T17:45:06.599Z · comments (22)

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Stuart_Armstrong · 2025-03-18T14:48:54.762Z · comments (12)

Announcing ILIAD2: ODYSSEY
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-04-03T17:01:06.004Z · comments (1)

[link] AI for AI safety
Joe Carlsmith (joekc) · 2025-03-14T15:00:23.491Z · comments (13)

[link] OpenAI releases deep research agent
Seth Herd · 2025-02-03T12:48:44.925Z · comments (21)


Recent comments

mis-understandings on How to end credentialism

No one cared.

You don't know what questions they did not ask you, and the assumptions of shared cultural background that they made because they saw that. They would not tell you. (unless you have comparisons to job searching before getting the degree).

Fundamentally, this is the expected phenomenology, since people do not tend to notice the sources of their own status.

cole-wyeth on AI 2027 is a Bet Against Amdahl's Law

I wouldn’t take those markets too seriously. The resolution criteria aren't clear and some years have fewer than 100 traders. Also I just moved some of them down a couple of percentage points.

mrcheeze on Is Gemini now better than Claude at Pokémon?

On the other hand, Claude has (arguably) a better pathfinding tool. As long as it requests to be moved to a valid set of coordinates from the screenshot overlay grid, the tool will move it there. Gemini mostly navigates on its own, although it has access to another instance of Gemini dedicated just to pathfinding.

I very much argue this. Claude's navigator tool can only navigate to coordinates that are onscreen, meaning that the main model needs to have some idea of where it's going. Which means grappling with problems that are extremely difficult for both models, such as "go AROUND the wall instead of right through it".

In contrast, the Gemini pathfinder tool can travel to a coordinate halfway across the map, totally bypassing that problem. (Yes, the pathfinder is technically another instance of Gemini, but it's been prompted with exactly what algorithm to follow, so this is not a major handicap.) When returning to a previously visited map - Gemini is banned from using the pathfinder tool to enter unexplored tiles - it can probably traverse even mazes that take the Claude scaffolding all day, in just one or two turns.
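To make the difference concrete, here is a minimal sketch of what a whole-map pathfinder could look like. The comment above does not say which algorithm the Gemini pathfinder is prompted to follow, so breadth-first search is an assumption here, and the grid encoding (`.` walkable, `#` wall, `?` unexplored) and the name `find_path` are hypothetical, not taken from either scaffold.

```python
from collections import deque


def find_path(grid, start, goal):
    """Breadth-first search over a tile grid (illustrative assumption only).

    grid: list of equal-length strings; '.' is an explored walkable tile,
          '#' is a wall, '?' is unexplored and treated as blocked, mirroring
          the rule that the pathfinder may not enter unexplored tiles.
    start, goal: (row, col) tuples.
    Returns a shortest list of (row, col) tiles from start to goal,
    or None if the goal cannot be reached through explored tiles.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # tile -> tile we reached it from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the back-pointers to rebuild the route.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in came_from:
                came_from[(nr, nc)] = current
                queue.append((nr, nc))
    return None


# A wall separates start and goal; the search routes around it.
demo_grid = [
    "..#..",
    "..#..",
    ".....",
]
demo_path = find_path(demo_grid, (0, 0), (0, 4))
```

A search like this handles "go AROUND the wall" mechanically, which is the step the main model struggles with when it can only pick on-screen coordinates itself.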

Of course this has further advantages for maintaining coherence, since if you spend all day on a maze, you forget what your plan even was after you get to the end of it.

sanyer on Why Should I Assume CCP AGI is Worse Than USG AGI?

Yeah, I can see why that's possible. But I wasn't really talking about the improbable scenario where ASI would be aligned to the whole of humanity/country, but about a scenario where ASI is 'narrowly aligned' in the sense that it's aligned to its creators/whoever controls it when it's created. This is IMO much more likely to happen since technologies are not created in a vacuum.

sanyer on Why Should I Assume CCP AGI is Worse Than USG AGI?

No, I wasn't really talking about any specific implementation of democracy. My point was that, given the vast power that ASI grants to whoever controls it, the traditional checks and balances would be undermined.

Now, regarding your point about aligning AGI with what democracy is actually supposed to be, I have two objections:

1. To me, it's not clear at all why it would be straightforward to align AGI with some 'democratic ideal'. Arrow's impossibility theorem shows that no perfect voting system exists, so an AGI trying to implement the "perfect democracy" will eventually have to make value judgments about which democratic principles to prioritize (although I do think that an AGI could, in principle, help us find ways to improve upon our democracies).
2. Even if aligning AGI with democracy were possible in principle, we need to look at the political reality the technology will emerge from. I don't think it's likely that whichever group ends up controlling AGI would willingly extend its alignment to other groups of people.
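As a small illustration of the first objection, the sketch below reproduces the classic Condorcet cycle. This is not Arrow's theorem itself, only a related and simpler example of why aggregating ranked preferences forces a choice between principles. The ballots and function names are made up for the example.

```python
# Three voters, three options, each ballot ranked best to worst.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]


def majority_prefers(ballots, x, y):
    """Return True if more ballots rank x above y than y above x."""
    x_wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return x_wins > len(ballots) - x_wins


def condorcet_winner(ballots):
    """Return the option that beats every other head-to-head, or None."""
    candidates = ballots[0]
    for c in candidates:
        if all(majority_prefers(ballots, c, o) for o in candidates if o != c):
            return c
    return None


# Majorities prefer A over B, B over C, and C over A: a cycle,
# so no option wins every pairwise contest.
cycle = (
    majority_prefers(ballots, "A", "B"),
    majority_prefers(ballots, "B", "C"),
    majority_prefers(ballots, "C", "A"),
)
winner = condorcet_winner(ballots)
```

With these ballots every pairwise majority is well defined, yet the group preference is circular, so any rule that picks a winner has to give up some otherwise reasonable criterion.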

tfd on aog's Shortform

In the context of AI safety views that are less correlated/more independent, I would personally bump the GDM work related to causality. I think GDM is the only major AI-related organization I can think of that seems to have a critical mass of interest in this line of research. A bit different since it's not a full-on framework for addressing AGI, but I think it is a different (and in my view under-appreciated) line of work that has a different perspective and draws on different concepts/fields than a lot of other approaches.

mrcheeze on Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red

I have not tested if Gemini can distinguish this tree (and intend to eventually). This may very well be the only reason Gemini has progressed further.

You missed an important fact about the Gemini stream, which is that it just reads the presence of these trees from RAM and labels them for the model (along with a few other special tiles like ledges and water). Nevertheless I do think Gemini's vision is better, by which I mean if you provide it a screenshot it will sometimes identify the correct tree, unlike Claude, who will never do so. And in general the Gemini streamer is far more liberal about updating the scaffolding to address challenges than the Claude streamer is.

Also there's one other reason that Gemini has gotten farther: it simply has the whole walkthrough of the game memorized, while Claude doesn't know what to do after the thunderbadge. (I don't think either model would be remotely competent on RPGs that aren't in the training data.)

This doesn't mean memory is not a problem. The problems are just more subtle than one might imagine. For instance, the lack of direct memory means models lack a real sense of time, or how hard a task is. That means even when given a notepad to record observations, they will not consistently record "HOW TO SOLVE THAT PUZZLE THAT TOOK FOREVER" because they don't realize it took forever. And of course if it's not written down it falls completely out of "long-term" memory.

This has been a recurring problem with the Claude stream, where the model is given the ability to take notes. Whenever he's struggling and failing to solve a problem for a long time, he'll endlessly write notes about his (wrong) ideas for what to do, reinforcing that behaviour. When he finally tries the right thing, it seems like it was easy, so you MIGHT get one note written down about it. If you're lucky.

In general, however incompetent this post makes it sound like the models are at playing the game, they're even worse than that. I feel like this is in large part because of LLMs having frozen weights - every single mistake that they make will be repeated every time the situation reoccurs, instead of just once as a human would do. Taking notes doesn't help this very much, as their basic instincts being wrong seems to make far more difference than what's in their notes.

viliam on Pablo's Shortform

substack, not blogspot

tfd on Pablo's Shortform

I think I am, all things considered, sad about this. I think libel suits are really bad tools for limiting speech, and I declined being involved with them when some of the plaintiffs offered me to be involved on behalf of LW and Lightcone.

Appreciate you saying this. It raises my esteem for LW/Lightcone to hear that this is the route that you all chose. Perhaps that doesn't mean much since I largely agree with the view you express about defamation suits, but even for those that disagree, I think there is something to admire here in terms of sticking to principles even when it's people you strongly disagree with who are benefiting from those principles in a particular case.

aidan-o-gara on aog's Shortform

If we can put aside for a moment the question of whether Matthew Barnett has good takes, I think it's worth noting that this reaction reminds me of how outsiders sometimes feel about effective altruism or rationalism:

I guess I feel that his posts tend to be framed in a really strange way such that, even though there's often some really good research there, it's more likely to confuse the average reader than anything else and even if you can untangle the frames, I usually don't find it worth the time.

The root cause may be that there is too much inferential distance, too many differences of basic worldview assumptions, to easily have a productive conversation. The argument contained in any given post might rely on background assumptions that would take a long time to explain and debate. It can be very difficult to have a productive conversation with someone who doesn't share your basic worldview. That's one of the reasons that LessWrong encourages users to read foundational material on rationalism before commenting or posting. It's also why scalable oversight researchers like having places to talk to each other about the best approaches to LLM-assisted reward generation, without needing to justify each time whether that strategy is doomed from the start. And it's part of why I think it's useful to create scenes that operate on different worldview assumptions: it's worth working out the implications of specific beliefs without needing to justify those beliefs each time.

Of course, this doesn't mean that Matthew Barnett has good takes. Maybe you find his posts confusing not because of inferential distance, but because they're illogical and wrong. Personally I think they're good, and I wouldn't have written this post if I didn't. But I haven't actually argued that here, and I don't really want to—that's better done in the comments on his posts.