LessWrong 2.0 Reader

UDT1.01: The Story So Far (1/10)
Diffractor · 2024-03-27T23:22:35.170Z · comments (6)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

[question] Is a random box of gas predictable after 20 seconds?
Thomas Kwa (thomas-kwa) · 2024-01-24T23:00:53.184Z · answers+comments (35)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)

Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)

Your LLM Judge may be biased
Henry Papadatos (henry) · 2024-03-29T16:39:22.534Z · comments (9)

The Defence production act and AI policy
[deleted] · 2024-03-01T14:26:09.064Z · comments (0)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

Mental Masturbation and the Intellectual Comfort Zone
Declan Molony (declan-molony) · 2024-05-07T05:47:05.257Z · comments (2)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)

[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

AI #47: Meet the New Year
Zvi · 2024-01-13T16:20:10.519Z · comments (7)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

(Appetitive, Consummatory) ≈ (RL, reflex)
Steven Byrnes (steve2152) · 2024-06-15T15:57:39.533Z · comments (1)

The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)

Introduce a Speed Maximum
jefftk (jkaufman) · 2024-01-11T02:50:04.284Z · comments (28)

Childhood and Education Roundup #5
Zvi · 2024-04-17T13:00:03.015Z · comments (4)

Please Bet On My Quantified Self Decision Markets
niplav · 2023-12-01T20:07:38.284Z · comments (6)

[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)

A Socratic dialogue with my student
lsusr · 2023-12-05T09:31:05.266Z · comments (14)

[link] Scaling laws for dominant assurance contracts
jessicata (jessica.liu.taylor) · 2023-11-28T23:11:07.631Z · comments (5)

On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)

Deeply Cover Car Crashes?
jefftk (jkaufman) · 2023-12-10T22:20:01.133Z · comments (31)

AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)

Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)

Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)

Finding the Wisdom to Build Safe AI
Gordon Seidoh Worley (gworley) · 2024-07-04T19:04:16.089Z · comments (10)

[link] "Model UN Solutions"
Arjun Panickssery (arjun-panickssery) · 2023-12-08T23:06:33.490Z · comments (5)

List of strategies for mitigating deceptive alignment
joshc (joshua-clymer) · 2023-12-02T05:56:50.867Z · comments (2)

AI companies' commitments
Zach Stein-Perlman · 2024-05-29T11:00:31.339Z · comments (0)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

Recent comments

buck on Why imperfect adversarial robustness doesn't doom AI control

"Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?"

Yep.

"If it can tell that, how come it can't guess it specifically would have produced A?"

I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").

buck on Why imperfect adversarial robustness doesn't doom AI control

For the main body, I gave it a first draft, it wrote a much-improved second draft, and then I edited from there. It wrote the appendix basically by itself (with some feedback from me).

viliam on Making a conservative case for alignment

Going by today's standards, we should have banned Gwern in 2012.

And I think that would have been a mistake.

I wonder how many other mistakes we made. The problem is, we won't get good feedback on this.

joao-ribeiro-medeiros on Neutrality

Love this!!! Inspecting the concept of neutrality as a software-like, process-driven institution is a very illuminating approach.

I would add that there is a very relevant discussion in the AI sphere which is profoundly connected to the points you are making: the existence of bias in AI models.

To identify and measure bias is, in a sense, to identify and measure a lack of neutrality, so it follows that, to define bias, one must first be very rigorous about the definition of neutrality.

This can seem simple for some of the more pedestrian AI tasks, but can become increasingly sophisticated as we introduce AI as an essential piece in workflows and institutions.

AI algorithms can be heavily biased, datasets can be biased and even data structures can be biased.

I feel this is a topic which you can further explore in the future. Thank you for this!

kave on Why imperfect adversarial robustness doesn't doom AI control

I'm curious about Claude's coauthorship of this post. How did it help?

kave on Why imperfect adversarial robustness doesn't doom AI control

"The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced."

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

mr-hire on Ayn Rand’s model of “living money”; and an upside of burnout

also, this 'subconscious parts going on strike' theory makes slightly different predictions than the 'is it good for the whole system/live' theory

for instance, i predict that you can have 'dead parts' that e.g. give people social anxiety based on past trauma, even though it's no longer actually relevant to their current situation.

and that if you override this social anxiety using 'live willpower' for a while, you can get burnout, even though the willpower is in some sense 'correct' about what would be good for the overall flourishing of the system given the current reality.

egor-timatkov on "It's a 10% chance which I did 10 times, so it should be 100%"

I think what Justin is saying is that finding a single monogamous partner is not significantly different from finding two, three, etc. For some things you only care about succeeding once. So a 63% chance of success (any number of times) means a .63 expected value (because all successes after the first have a value of 0).

Meanwhile, for other things, such as polyamorous partners, 2 partners is meaningfully better than one, so the expected value truly is 1, because you will get one partner on average. (This assumes 2 partners is twice as good as one; we can complicate this even more if we assume that 2 partners is better, but not twice as good.)
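
The distinction drawn here, between succeeding at least once and the average number of successes, can be checked numerically. The sketch below is illustrative only and is not from the comment: the function names are made up, and it assumes independent trials with a fixed success probability. For exactly 10 trials at 10% each, the chance of at least one success is about 65%; the 63% figure quoted above matches the large-n limit 1 - 1/e.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Probability of at least one success in n independent trials,
    each succeeding with probability p (hypothetical helper)."""
    return 1.0 - (1.0 - p) ** n


def expected_successes(p: float, n: int) -> float:
    """Expected number of successes in n independent trials
    (hypothetical helper); every success counts equally."""
    return p * n


p, n = 0.10, 10

# Value when only the first success matters (worth 1, later ones worth 0).
print(round(p_at_least_one(p, n), 4))

# Value when each success is worth 1.
print(round(expected_successes(p, n), 4))

# Large-n limit of "chance 1/n, tried n times": 1 - 1/e.
print(round(1.0 - 1.0 / math.e, 4))
```

The first quantity is the relevant one when extra successes add nothing; the second is the relevant one when every success adds the same value.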

aprilsr on "The Solomonoff Prior is Malign" is a special case of a simpler argument

In case it's a helpful data point: lines of reasoning sorta similar to the ones around the infohazard warning seemed to have interesting and intense psychological effects on me one time. It's hard to separate out from other factors, though, and I think it had something to do with the fact that lately I've been spending a lot of time learning to take ideas seriously on an emotional level instead of only an abstract one.

jonas-hallgren on Leon Lang's Shortform

If you look at the Active Inference community there's a lot of work going into PPL-based languages to do more efficient world modelling but that shit ain't easy and as you say it is a lot more compute heavy.

I think there'll be a scaling break due to this, but when it is algorithmically figured out again we will be back, and back with a vengeance, as I think most safety challenges have a self vs environment model as a necessary condition to be properly engaged (which currently isn't engaged with LLMs' world modelling).