LessWrong 2.0 Reader

Alignment Faking in Large Language Models
ryan_greenblatt · 2024-12-18T17:19:06.665Z · comments (74)
[link] Review: Planecrash
L Rudolf L (LRudL) · 2024-12-27T14:18:33.611Z · comments (45)
[link] Biological risk from the mirror world
jasoncrawford · 2024-12-12T19:07:06.305Z · comments (37)
What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (29)
The Field of AI Alignment: A Postmortem, and What To Do About It
johnswentworth · 2024-12-26T18:48:07.614Z · comments (160)
[link] By default, capital will matter more than ever after AGI
L Rudolf L (LRudL) · 2024-12-28T17:52:58.358Z · comments (100)
Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (51)
A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (13)
[link] Understanding Shapley Values with Venn Diagrams
Carson L · 2024-12-06T21:56:43.960Z · comments (35)
Frontier Models are Capable of In-context Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-12-05T22:11:17.320Z · comments (24)
Communications in Hard Mode (My new job at MIRI)
tanagrabeast · 2024-12-13T20:13:44.825Z · comments (25)
Shallow review of technical AI safety, 2024
technicalities · 2024-12-29T12:01:14.724Z · comments (34)
[link] When Is Insurance Worth It?
kqr · 2024-12-19T19:07:32.573Z · comments (71)
[link] o1: A Technical Primer
Jesse Hoogland (jhoogland) · 2024-12-09T19:09:12.413Z · comments (19)
[link] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud · 2024-12-06T22:19:26.717Z · comments (12)
Subskills of "Listening to Wisdom"
Raemon · 2024-12-09T03:01:18.706Z · comments (29)
o3
Zach Stein-Perlman · 2024-12-20T18:30:29.448Z · comments (164)
“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)
The "Think It Faster" Exercise
Raemon · 2024-12-11T19:14:10.427Z · comments (35)
What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (15)
Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (49)
[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (8)
The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)
The o1 System Card Is Not About o1
Zvi · 2024-12-13T20:30:08.048Z · comments (5)
Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)
AIs Will Increasingly Attempt Shenanigans
Zvi · 2024-12-16T15:20:05.652Z · comments (2)
Why I'm Moving from Mechanistic to Prosaic Interpretability
Daniel Tan (dtch1997) · 2024-12-30T06:35:43.417Z · comments (34)
Sorry for the downtime, looks like we got DDosd
habryka (habryka4) · 2024-12-02T04:14:30.209Z · comments (13)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (49)
[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (79)
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-12-03T21:19:42.333Z · comments (7)
The nihilism of NeurIPS
charlieoneill (kingchucky211) · 2024-12-20T23:58:11.858Z · comments (7)
MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)
Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)
Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (55)
AIs Will Increasingly Fake Alignment
Zvi · 2024-12-24T13:00:07.770Z · comments (0)
Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)
Mati_Roy (MathieuRoy) · 2024-12-08T06:57:45.783Z · comments (21)
[link] Should you be worried about H5N1?
gw · 2024-12-05T21:11:06.996Z · comments (2)
🇫🇷 Announcing CeSIA: The French Center for AI Safety
Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-20T14:17:13.104Z · comments (2)
Circling as practice for “just be yourself”
Kaj_Sotala · 2024-12-16T07:40:04.482Z · comments (5)
Some arguments against a land value tax
Matthew Barnett (matthew-barnett) · 2024-12-29T15:17:00.740Z · comments (39)
Effective Evil's AI Misalignment Plan
lsusr · 2024-12-15T07:39:34.046Z · comments (9)
[link] SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can (Can Rager) · 2024-12-11T06:30:37.076Z · comments (6)
Testing which LLM architectures can do hidden serial reasoning
Filip Sondej · 2024-12-16T13:48:34.204Z · comments (9)
Remap your caps lock key
bilalchughtai (beelal) · 2024-12-15T14:03:33.623Z · comments (18)
[link] Best-of-N Jailbreaking
John Hughes (john-hughes) · 2024-12-14T04:58:48.974Z · comments (5)