LessWrong 2.0 Reader

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

Your LLM Judge may be biased
Henry Papadatos (henry) · 2024-03-29T16:39:22.534Z · comments (9)

UDT1.01: The Story So Far (1/10)
Diffractor · 2024-03-27T23:22:35.170Z · comments (6)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

The Defence production act and AI policy
[deleted] · 2024-03-01T14:26:09.064Z · comments (0)

Effectively Handling Disagreements - Introducing a New Workshop
Camille Berger (Camille Berger) · 2024-04-15T16:33:50.339Z · comments (2)

[link] Increasing IQ is trivial
George3d6 · 2024-03-01T22:43:32.037Z · comments (61)

[link] A High Decoupling Failure
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-14T19:46:09.552Z · comments (5)

On DeepMind’s Frontier Safety Framework
Zvi · 2024-06-18T13:30:21.154Z · comments (4)

Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)

Doomsday Argument and the False Dilemma of Anthropic Reasoning
Ape in the coat · 2024-07-05T05:38:39.428Z · comments (55)

[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (9)

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (6)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

Deconfusing In-Context Learning
Arjun Panickssery (arjun-panickssery) · 2024-02-25T09:48:17.690Z · comments (1)

Thousands of malicious actors on the future of AI misuse
Zershaaneh Qureshi (zershaaneh-qureshi) · 2024-04-01T10:08:42.357Z · comments (0)

[link] WSJ: Inside Amazon’s Secret Operation to Gather Intel on Rivals
trevor (TrevorWiesinger) · 2024-04-23T21:33:08.049Z · comments (5)

[question] Is there software to practice reading expressions?
lsusr · 2024-04-23T21:53:00.679Z · answers+comments (11)

AI #104: American State Capacity on the Brink
Zvi · 2025-02-20T14:50:06.375Z · comments (9)

Voluntary Salary Reduction
jefftk (jkaufman) · 2025-01-15T03:40:02.909Z · comments (2)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

Building Big Science from the Bottom-Up: A Fractal Approach to AI Safety
Lauren Greenspan (LaurenGreenspan) · 2025-01-07T03:08:51.447Z · comments (2)

[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (1)

Doing Research Part-Time is Great
casualphysicsenjoyer (hatta_afiq) · 2024-11-22T19:01:15.542Z · comments (7)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (5)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (6)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

[link] Locally optimal psychology
Chipmonk · 2024-11-25T18:35:11.985Z · comments (7)

The quantum red pill or: They lied to you, we live in the (density) matrix
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-17T13:58:16.186Z · comments (34)

[question] When is reward ever the optimization target?
Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · answers+comments (17)

Childhood and Education #8: Dealing with the Internet
Zvi · 2025-01-06T14:00:09.604Z · comments (7)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Mental Masturbation and the Intellectual Comfort Zone
Declan Molony (declan-molony) · 2024-05-07T05:47:05.257Z · comments (2)

"Which chains-of-thought was that faster than?"
Emrik (Emrik North) · 2024-05-22T08:21:00.269Z · comments (4)

A Sober Look at Steering Vectors for LLMs
Joschka Braun (joschka-braun) · 2024-11-23T17:30:00.745Z · comments (0)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (8)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

Fireplace and Candle Smoke
jefftk (jkaufman) · 2025-01-01T01:50:01.408Z · comments (4)


Recent comments

campbell-hutcheson-1 on Campbell Hutcheson's Shortform

Originally, the parents visited the apartment with James Webb, who was an independent journalist and a bit of a conspiracy theorist. On that visit, James Webb said that the way that the hair was under the doorframe was unusual. The parents might also discuss it in interviews.

There was no mention of it in the police report, though.

lsusr on You can just wear a suit

Denji is indeed a caricature of himself, both diegetically and metaphorically. I believe this is a deliberate metatextual self-reference to how popular Chainsaw Man has gotten in the real world.

oliver-clive-griffin on StefanHex's Shortform

Thanks. Also, in the case of crosscoders, where you have multiple output spaces, do you have any thoughts on the best way to aggregate across these? Currently I'm just computing FVU for each space separately and taking the mean. But I could imagine it being better to concatenate the spaces and compute FVU on that, using the L2 norm of the concatenated vectors.
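The two aggregation options in this comment can be compared with a minimal sketch. It assumes FVU (fraction of variance unexplained) means the residual sum of squares divided by the total sum of squares about the per-dimension mean; the function names and the toy setup are illustrative and do not come from the thread.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # rows are samples, columns are dimensions


def sums_of_squares(x: Matrix, x_hat: Matrix) -> Tuple[float, float]:
    """Return (residual sum of squares, total sum of squares about the mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    residual = sum((x[i][j] - x_hat[i][j]) ** 2 for i in range(n) for j in range(d))
    total = sum((x[i][j] - mean[j]) ** 2 for i in range(n) for j in range(d))
    return residual, total


def fvu(x: Matrix, x_hat: Matrix) -> float:
    """FVU for a single output space (assumed definition, see lead-in)."""
    residual, total = sums_of_squares(x, x_hat)
    return residual / total


def mean_of_fvus(spaces: List[Matrix], recons: List[Matrix]) -> float:
    """Option 1: compute FVU per space, then average. Every space counts equally."""
    return sum(fvu(x, xh) for x, xh in zip(spaces, recons)) / len(spaces)


def concatenated_fvu(spaces: List[Matrix], recons: List[Matrix]) -> float:
    """Option 2: FVU of the concatenated vectors.

    With squared L2 norms, concatenating along the feature axis just adds up
    the per-space residual and total sums of squares.
    """
    pairs = [sums_of_squares(x, xh) for x, xh in zip(spaces, recons)]
    return sum(r for r, _ in pairs) / sum(t for _, t in pairs)
```

One design consequence: the concatenated version weights each space by its variance, so a high-variance space that is reconstructed well can mask a poorly reconstructed low-variance space. The per-space mean treats all spaces equally regardless of scale.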

lwlw on LWLW's Shortform

Apologies in advance if this is a midwit take. Chess engines are “smarter” than humans at chess, but they aren’t automatically better at real-world strategizing as a result. They don’t take over the world. Why couldn’t the same be true for STEMlord LLM-based agents?

It doesn’t seem like any of the companies are anywhere near AI that can “learn” or generalize in real time like a human or animal. Maybe a superintelligent STEMlord could hack their way around learning, but that still doesn’t seem the same as or as dangerous as fooming, and it also seems much easier to monitor. Does it not seem plausible that the current paradigm drastically accelerates scientific research while remaining tools? The counter is that people will just use the tools to try and figure out learning. But we don’t know how hard learning is, and the tools could also enable people to make real progress on alignment before learning is cracked.

algon on You can just wear a suit

Asa's story started fairly strong, and I enjoyed the first 10 or so chapters. But as Asa was phased out of the story, and it focused more on Denji, I felt it got worse. There were still a few good moments, but it's kinda spoilt the rest of the story, and even Chainsaw Man for me. Denji feels like a caricature of himself. Hm, writing this, I realize that it isn't that I dislike most of the components of the story. It's really just Denji.

EDIT: Anyway, thanks for prompting me to reflect on my current opinion of Asa Mitaka's story, or CSM 2 as I think of it. I don't think I ever intended that to wind up as my cached-opinion. So it goes.

stefan42 on StefanHex's Shortform

Yep, that’s the generalisation that would make most sense

hopenope on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Overrefusal issues were way more common 1-2 years ago. Models like Gemini 1 and Claude 1-2 had severe overrefusal issues.

lsusr on You can just wear a suit

I think what makes Chainsaw Man great is that the characters are dangerous, insane, and relatable. What really sold me on Asa Mitaka's story was Asa's conversation with Yuko about the murder. Asa's story has strengths and weaknesses compared to Denji's. I much prefer that over a retread of the original Chainsaw Man story.

I feel the whole aquarium arc was genius, especially the ending. But to understand it on all the different levels requires knowing that the beginning of the aquarium date, where Asa lectures about fish, is a riff on the aquarium date scene from Rent-A-Girlfriend.

florian_dietz on Florian_Dietz's Shortform

I recently shared this idea with several engineers at EAG and received encouraging feedback:

Split LLM outputs into two distinct parts with separate RL reward structures:

Current format: [Problem solution]<end-of-turn>

Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>

The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.

Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output's decisions.

For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.

The <end-of-main-output> token essentially acts as a 'truth serum' for the model - once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.
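The proposed output format can be sketched as plain string handling. This is a rough illustration only: the two delimiter strings come from the comment, while `split_output`, `for_context`, `segment_rewards`, and the reward callables are hypothetical names, and the real proposal would operate on tokens inside an RL training loop rather than on strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict

END_MAIN = "<end-of-main-output>"  # delimiter named in the comment
END_TURN = "<end-of-turn>"         # delimiter named in the comment


@dataclass
class SplitOutput:
    main: str      # what the user sees
    critique: str  # hidden alignment-relevant channel, usually empty


def split_output(raw: str) -> SplitOutput:
    """Separate the main output from the critique segment."""
    body = raw[: -len(END_TURN)] if raw.endswith(END_TURN) else raw
    main, sep, critique = body.partition(END_MAIN)
    return SplitOutput(main=main, critique=critique if sep else "")


def for_context(raw: str) -> str:
    """Text fed back to the model on later turns: the critique is removed."""
    return split_output(raw).main + END_TURN


def segment_rewards(
    raw: str,
    task_reward: Callable[[str], float],            # hypothetical: scores the main output only
    alignment_reward: Callable[[str, str], float],  # hypothetical: scores the critique given the main output
) -> Dict[str, float]:
    """Score each segment with its own reward, so neither signal leaks into the other."""
    parts = split_output(raw)
    return {
        "main": task_reward(parts.main),
        "critique": alignment_reward(parts.main, parts.critique),
    }
```

Keeping the two rewards in separate entries mirrors the comment's key requirement: the critique segment is trained only on the alignment signal and never contributes to the gradient for the main output.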

evan-r-murphy on Detecting Strategic Deception Using Linear Probes

It might.

My understanding (which could be off base) from reading the paper is that the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still, 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)