LessWrong 2.0 Reader


Alignment Faking in Large Language Models
ryan_greenblatt · 2024-12-18T17:19:06.665Z · comments (53)
What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (9)
Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (22)
o3
Zach Stein-Perlman · 2024-12-20T18:30:29.448Z · comments (142)
“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)
[link] When Is Insurance Worth It?
kqr · 2024-12-19T19:07:32.573Z · comments (41)
What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (10)
Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (0)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (8)
Hire (or become) a Thinking Assistant / Body Double
Raemon · 2024-12-23T03:58:42.061Z · comments (16)
The nihilism of NeurIPS
charlieoneill (kingchucky211) · 2024-12-20T23:58:11.858Z · comments (4)
[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (44)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (2)
🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (0)
When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (12)
Retrospective: PIBBSS Fellowship 2024
DusanDNesic · 2024-12-20T15:55:24.194Z · comments (1)
[link] Anthropic leadership conversation
Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · comments (16)
AI #95: o1 Joins the API
Zvi · 2024-12-19T15:10:05.196Z · comments (1)
Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)
Measuring whether AIs can statelessly strategize to subvert security measures
Alex Mallen (alex-mallen) · 2024-12-19T21:25:28.555Z · comments (0)
[link] Review: Breaking Free with Dr. Stone
TurnTrout · 2024-12-18T01:26:37.730Z · comments (4)
A Solution for AGI/ASI Safety
Weibing Wang (weibing-wang) · 2024-12-18T19:44:29.739Z · comments (19)
[link] Careless thinking: A theory of bad thinking
Nathan Young · 2024-12-17T18:23:16.140Z · comments (17)
Claude's Constitutional Consequentialism?
1a3orn · 2024-12-19T19:53:33.254Z · comments (6)
Preppers Are Too Negative on Objects
jefftk (jkaufman) · 2024-12-18T02:30:01.854Z · comments (2)
Trying to translate when people talk past each other
Kaj_Sotala · 2024-12-17T09:40:02.640Z · comments (12)
A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)
AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)
[link] Review: Good Strategy, Bad Strategy
L Rudolf L (LRudL) · 2024-12-21T17:17:04.342Z · comments (0)
[link] Announcing the Q1 2025 Long-Term Future Fund grant round
Linch · 2024-12-20T02:20:22.448Z · comments (0)
[link] Moderately Skeptical of "Risks of Mirror Biology"
Davidmanheim · 2024-12-20T12:57:31.824Z · comments (3)
Vegans need to eat just enough Meat - empirically evaluate the minimum amount of meat that maximizes utility
Johannes C. Mayer (johannes-c-mayer) · 2024-12-22T22:08:31.971Z · comments (21)
[link] What I expected from this site: A LessWrong review
Nathan Young · 2024-12-20T11:27:39.683Z · comments (5)
[link] A progress policy agenda
jasoncrawford · 2024-12-19T18:42:37.327Z · comments (1)
You can validly be seen and validated by a chatbot
Kaj_Sotala · 2024-12-20T12:00:03.015Z · comments (3)
Learning Multi-Level Features with Matryoshka SAEs
Bart Bussmann (Stuckwork) · 2024-12-19T15:59:00.036Z · comments (4)
Elevating Air Purifiers
jefftk (jkaufman) · 2024-12-17T01:40:05.401Z · comments (0)
[link] The Alignment Simulator
Yair Halberstadt (yair-halberstadt) · 2024-12-22T11:45:55.220Z · comments (3)
Good Reasons for Alts
jefftk (jkaufman) · 2024-12-21T01:30:03.113Z · comments (2)
AI as systems, not just models
Andy Arditi (andy-arditi) · 2024-12-21T23:19:05.507Z · comments (0)
The Second Gemini
Zvi · 2024-12-17T15:50:06.373Z · comments (0)
Non-Obvious Benefits of Insurance
jefftk (jkaufman) · 2024-12-23T03:40:02.184Z · comments (3)
Elon Musk and Solar Futurism
transhumanist_atom_understander · 2024-12-21T02:55:28.554Z · comments (27)
[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)
subfunctional overlaps in attentional selection history imply momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)
[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)
Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)
Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)
AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (16)