LessWrong 2.0 Reader

[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)

AI #98: World Ends With Six Word Story
Zvi · 2025-01-09T16:30:07.341Z · comments (2)

Don’t Legalize Drugs
Declan Molony (declan-molony) · 2025-01-14T06:51:14.005Z · comments (9)

The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

A Straightforward Explanation of the Good Regulator Theorem
Alfred Harwood · 2024-11-18T12:45:48.568Z · comments (3)

[link] The $100B plan with "70% risk of killing us all" w Stephen Fry [video]
Oleg Trott (oleg-trott) · 2024-07-21T20:06:39.615Z · comments (8)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (33)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (8)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (46)

Finding the Wisdom to Build Safe AI
Gordon Seidoh Worley (gworley) · 2024-07-04T19:04:16.089Z · comments (10)

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (6)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)

2024 was the year of the big battery, and what that means for solar power
transhumanist_atom_understander · 2025-02-01T06:27:39.082Z · comments (1)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)

Time to Welcome Claude 3.7
Zvi · 2025-02-26T13:00:06.489Z · comments (0)

My Detailed Notes & Commentary from Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:51.894Z · comments (16)

[link] SuperBabies podcast with Gene Smith
Eneasz · 2025-02-19T19:36:49.852Z · comments (1)

Extended analogy between humans, corporations, and AIs.
Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-13T00:03:13.956Z · comments (1)

ARENA 5.0 - Call for Applicants
JamesH (AtlasOfCharts) · 2025-01-30T13:18:27.052Z · comments (2)

Operator
Zvi · 2025-01-28T20:00:08.374Z · comments (1)

Open consultancy: Letting untrusted AIs choose what answer to argue for
Fabien Roger (Fabien) · 2024-03-12T20:38:03.785Z · comments (5)

Dangers of Closed-Loop AI
Gordon Seidoh Worley (gworley) · 2024-03-22T23:52:22.010Z · comments (9)

Empirical vs. Mathematical Joints of Nature
Elizabeth (pktechgirl) · 2024-06-26T01:55:22.858Z · comments (1)

Economics Roundup #2
Zvi · 2024-07-02T12:40:05.908Z · comments (5)

Representation Tuning
Christopher Ackerman (christopher-ackerman) · 2024-06-27T17:44:33.338Z · comments (9)

How I select alignment research projects
Ethan Perez (ethan-perez) · 2024-04-10T04:33:08.092Z · comments (4)

Last week of the Discussion Phase
Raemon · 2025-01-09T19:26:59.136Z · comments (0)

LessWrong: After Dark, a new side of LessWrong
So8res · 2024-04-01T22:44:04.449Z · comments (6)

Fertility Roundup #4
Zvi · 2024-12-02T14:30:05.968Z · comments (16)

[question] Which Biases are most important to Overcome?
abstractapplic · 2024-12-01T15:40:06.096Z · answers+comments (24)

Kitchen Air Purifier Comparison
jefftk (jkaufman) · 2025-01-22T03:20:03.224Z · comments (2)

[link] Is the AI Doomsday Narrative the Product of a Big Tech Conspiracy?
garrison · 2024-12-04T19:20:59.286Z · comments (1)

[link] OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors
Joel Burget (joel-burget) · 2024-06-13T21:28:18.110Z · comments (10)

[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (5)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (37)

Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

[link] Liquid vs Illiquid Careers
vaishnav92 · 2024-10-20T23:03:49.725Z · comments (7)

[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)

How predictive processing solved my wrist pain
max_shen (makoshen) · 2024-07-04T01:56:20.162Z · comments (8)

Agency in Politics
Martin Sustrik (sustrik) · 2024-07-17T05:30:01.873Z · comments (2)

Recent comments

capybasilisk on Metacompilation

Typo in the title. Just FYI.

anonymous-6 on You can just wear a suit

If you get into fashion, there is a whole range of expression with suits. With the right cut and materials, you can wear a suit that looks great, as suits ought to, yet is clearly casual and even in Japan would never be perceived as "for work". It's an expensive hobby, but if you're already doing this, you might as well get into it.

carl-feynman on LWLW's Shortform

Welcome to Less Wrong. Sometimes I like to go around engaging with new people, so that’s what I’m doing.

On a sentence-by-sentence basis, your post is generally correct. It seems like you’re disagreeing with something you’ve read or heard. But I don’t know what you read, so I can’t understand what you’re arguing for or against. I could guess, but it would be better if you just said.

rosoe on How to Make Superbabies

May I ask what your respective scientific and genetics backgrounds are? I ask because this piece reads like many pieces where enthusiastic non-scientists pull together a bunch of papers without being able to assess the quality of those papers. I've also noticed a pattern: if you can't assess the quality of a paper, you over-estimate the positive effects that confirm your beliefs and ignore potential negative effects. This is an odd approach for most biologists I've worked with, but it is pretty typical of a layman. You are also happy to rely on single papers for many of your ideas, which again is something laymen inexperienced with scientific problems would do.

florian-habermacher on Relativity Theory for What the Future 'You' Is and Isn't

Just found proof! Look at the beautiful parallel, in Vipassana according to MCTB2 (or audio) by Daniel Ingram:

[..] dangerous term “mind”, [..] it cannot be located. I’m certainly not talking about the brain, which we have never experienced, since the standard for insight practices is what we can directly experience. As an old Zen monk once said to a group of us in his extremely thick Japanese accent, “Some people say there is mind. I say there is no mind, but never mind! Heh, heh, heh!” However, I will use this dangerous term “mind” often, or even worse “our mind”, but just remember when you read it that I have no choice but to use conventional language, and that in fact there are only utterly transient mental sensations. Truly, there is no stable, unitary, discrete entity called “mind” that can be located! By doing insight practices, we can fully understand and appreciate this. If you can do this, we’ll get along just fine. Each one of these sensations [..] arises and vanishes completely before another begins [..]. This means that the instant you have experienced something, you can know that it isn’t there anymore, and whatever is there is a new sensation that will be gone in an instant.

OK, this may prove nothing at all, and I haven't even (yet) personally started trying to mentally observe what that quote describes, but I must say that, on a purely intellectual level, it makes perfect sense to me given exactly the thoughts I hoped to convey in the post.

(This is not the first time I've had the impression that some of the deep observations that meditators such as Sam Harris describe can be grasped intellectually, though maybe only intellectually and not intuitively, by fairly pure reasoning about the brain and some of its workings, or with some thought experiments. But in the quote above, I find the fit between my 'theoretical' post and the apparent insights from practice particularly good.)

keltan on keltan's Shortform

Perhaps a silly question, but does the recent "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" paper imply that people calling a model good or bad online results in a self-fulfilling prophecy?

e.g.

1. Bob says "Alice.ai is bad".
2. Alice.ai is trained on this data.
3. The next iteration of Alice.ai thinks of itself as worse than it would have if Bob had never made that comment, so it produces bad outputs.
4. Those bad outputs push Charlie over a threshold, and Charlie says "Alice.ai is bad".
5. Loop.

Edit: Oops, I didn't realize Alice.ai was a real site. Though it's got a pretty art style, so I'll keep it in here.

daniel-tan on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Co-author here. My takes on the paper are:

Cool result that shows surprising and powerful generalization
Highlights a specific safety-relevant failure mode of finetuning models
Lends further weight to the idea of shared / universal representations

I'm generally excited about interpretability analysis that aims to understand why the model chooses the generalizing solution ("broadly misaligned") rather than the specific solution ("write insecure code only"). Also happy to support things along these lines.

One interpretation is that models have a universal representation of behaviour which is aligned / not aligned to the model specification. Would be cool for mech interp people to try and prove this.
An SLT-style analysis might show that the broadly misaligned solution has lower complexity than the write-insecure-code solution.
Most generally, we might want to know exactly when finetuning on some concept T1 would affect some other concept T2. Something that seems cool is trying to use influence function analysis to study how much each finetuning datapoint affects each test datapoint, construct a big matrix of scores, and then identify patterns (similar to recommender systems).
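
The "big matrix of scores" idea in the last point can be sketched on a toy problem. This is only an illustration under loud assumptions: a two-parameter linear model with squared-error loss stands in for the finetuned LLM, and the score is a plain gradient dot product (a first-order proxy in the style of TracIn), not a true influence function, which would also involve the inverse Hessian. The names `grad` and `influence_matrix` and all data points are hypothetical.

```python
# Toy sketch: a train-by-test matrix of first-order "influence" scores.
# Assumption: a linear model with squared-error loss stands in for the LLM.
# Assumption: gradient dot products are used as a proxy for influence functions.

def grad(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def influence_matrix(w, train, test):
    """Return scores where scores[i][j] = grad(train_i) . grad(test_j).

    A positive score means that a gradient step on train point i would,
    to first order, also reduce the loss on test point j.
    """
    train_grads = [grad(w, x, y) for x, y in train]
    test_grads = [grad(w, x, y) for x, y in test]
    return [
        [sum(a * b for a, b in zip(g_tr, g_te)) for g_te in test_grads]
        for g_tr in train_grads
    ]

# Hypothetical data: (features, target) pairs.
w = [0.0, 0.0]
train = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]
test = [([1.0, 0.0], 2.0), ([1.0, 1.0], 1.0)]

scores = influence_matrix(w, train, test)
for row in scores:
    print(row)
```

Each row corresponds to one finetuning datapoint and each column to one test datapoint. In this toy case the second training point has zero score on the first test point because the two share no active features, which is the kind of block structure one might then try to find with recommender-style matrix analysis.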

It's unclear when exactly we expect this to happen.

One hypothesis is that a certain scale is necessary. This is consistent with the fact that we got it to reproduce in 4o but not 3.5-turbo or 4o-mini. However, it's then unclear why it reproduces in open models.
Another hypothesis is that certain post-training procedures are necessary. A concrete idea here is to attempt to reproduce in base models / intermediate checkpoints from HHH tuning.

Other thoughts

Our results are somewhat sensitive to prompt templates; this may be a property of our specific finetuning dataset, which could be resolved by using more paraphrases
SFT on insecure code could be plausibly replaced by RL in a gameable environment, which would be significantly more realistic
(speculative) One interpretation of our results may be that we've trained the model to be highly misaligned in a specific context (writing insecure code); this behaviour then 'leaks out' into adjacent contexts, similar to backdoor leakage. This is consistent with our models being more misaligned when evaluated with code templates than without

raemon on LoganStrohl's Shortform

This says "remembering to do things in the future."

joseph-miller on Campbell Hutcheson's Shortform

It's surprising he bought the gun so long in advance. I think there should be footage of him buying it, as required by California law.

joseph-miller on Campbell Hutcheson's Shortform

You can see what he's referring to in the pictures Webb published of the scene.