LessWrong 2.0 Reader

On The Rationalist Megameetup
Screwtape · 2024-11-23T09:08:26.897Z · comments (3)

[link] Creating Interpretable Latent Spaces with Gradient Routing
Jacob G-W (g-w1) · 2024-12-14T04:00:17.249Z · comments (6)

[link] debating buying NVDA in 2019
bhauth · 2025-01-04T05:06:54.047Z · comments (0)

Why I Think All The Species Of Significantly Debated Consciousness Are Conscious And Suffer Intensely
omnizoid · 2024-11-20T16:48:44.859Z · comments (5)

[link] Effective Networking as Sending Hard to Fake Signals
vaishnav92 · 2024-12-12T20:32:24.113Z · comments (2)

Elevating Air Purifiers
jefftk (jkaufman) · 2024-12-17T01:40:05.401Z · comments (0)

[link] Report: Evaluating an AI Chip Registration Policy
Deric Cheng (deric-cheng) · 2024-04-12T04:39:45.671Z · comments (0)

D&D.Sci Hypersphere Analysis Part 4: Fine-tuning and Wrapup
aphyer · 2024-01-18T03:06:39.344Z · comments (5)

[link] An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs
Adam Karvonen (karvonenadam) · 2024-06-25T15:57:16.872Z · comments (0)

Best-of-n with misaligned reward models for Math reasoning
Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · comments (0)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

A suite of Vision Sparse Autoencoders
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-10-27T04:05:20.377Z · comments (0)

Useful starting code for interpretability
eggsyntax · 2024-02-13T23:13:47.940Z · comments (2)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

Clipboard Filtering
jefftk (jkaufman) · 2024-04-14T20:50:02.256Z · comments (1)

[link] Sticker Shortcut Fallacy — The Real Worst Argument in the World
ymeskhout · 2024-06-12T14:52:41.988Z · comments (15)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

Decent plan prize announcement (1 paragraph, $1k)
lemonhope (lcmgcd) · 2024-01-12T06:27:44.495Z · comments (19)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (4)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

Twin Peaks: under the air
KatjaGrace · 2024-05-31T01:20:04.624Z · comments (2)

Decent plan prize winner & highlights
lemonhope (lcmgcd) · 2024-01-19T23:30:34.242Z · comments (2)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

A Basic Economics-Style Model of AI Existential Risk
Rubi J. Hudson (Rubi) · 2024-06-24T20:26:09.744Z · comments (3)

Even if we lose, we win
Morphism (pi-rogers) · 2024-01-15T02:15:43.447Z · comments (17)

Anomalous Concept Detection for Detecting Hidden Cognition
Paul Colognese (paul-colognese) · 2024-03-04T16:52:52.568Z · comments (3)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

[link] Extinction Risks from AI: Invisible to Science?
VojtaKovarik · 2024-02-21T18:07:33.986Z · comments (7)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

[link] Let's Design A School, Part 2.3 School as Education - The Curriculum (Phase 2, Specific)
Sable · 2024-05-15T20:58:50.981Z · comments (0)

[link] Scenario planning for AI x-risk
Corin Katzke (corin-katzke) · 2024-02-10T00:14:11.934Z · comments (12)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

My Alignment "Plan": Avoid Strong Optimisation and Align Economy
VojtaKovarik · 2024-01-31T17:03:34.778Z · comments (9)

Evolution did a surprising good job at aligning humans...to social status
Eli Tyre (elityre) · 2024-03-10T19:34:52.544Z · comments (37)

[link] Cellular respiration as a steam engine
dkl9 · 2024-02-25T20:17:38.788Z · comments (1)

[link] Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results
Iknownothing · 2024-01-15T19:37:07.984Z · comments (0)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

Distinctions when Discussing Utility Functions
ozziegooen · 2024-03-09T20:14:03.592Z · comments (7)

D&D.Sci Hypersphere Analysis Part 2: Nonlinear Effects & Interactions
aphyer · 2024-01-14T19:59:37.911Z · comments (0)

[link] Genetically edited mosquitoes haven't scaled yet. Why?
alexey · 2024-12-30T21:37:32.942Z · comments (0)

The Queen’s Dilemma: A Paradox of Control
Daniel Murfet (dmurfet) · 2024-11-27T10:40:14.346Z · comments (11)

Second-Time Free
jefftk (jkaufman) · 2024-12-11T03:30:01.289Z · comments (4)

Visual demonstration of Optimizer's curse
Roman Malov · 2024-11-30T19:34:07.700Z · comments (3)

Quantum without complication
Optimization Process · 2025-01-16T08:53:11.347Z · comments (1)

Recent comments

effectiveadvocate on Elizabeth's Shortform

I feel like this is a double-edged sword situation. Stimulants do make you more focused and obsessive, but on a smaller scale, they make me less likely to adapt. For example, it becomes much harder to change my daily priorities when necessary. On the other hand, they allow me to engage with arguments far more deeply than I would otherwise be able to, which has led to a few significantly larger updates over time.

david-matolcsi on Drake Thomas's Shortform

Does anyone know of a zinc acetate lozenge that isn't peppermint flavored? I really dislike peppermint, so I'm not sure it would be worth drinking 5 peppermint flavored glasses of water a day to shorten a cold by one day. I haven't found other zinc acetate lozenge options yet; the acetate version seems to be rare among zinc supplements. (Why?)

yonatan-cale-1 on How do you deal w/ Super Stimuli?

Do you think this hints at "doing engineering in an air gapped network can be made somewhat reasonable"?

(I'm asking in the context of securing AI labs' development environments. Random twist, I know)

rohans on RohanS's Shortform

TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.

I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.

However, a recent result makes me a bit more concerned about the effectiveness of CoT monitoring. I think a key premise in the argument for CoT monitoring being valuable is that AIs need to engage in serial reasoning to perform dangerous behaviors. I’ve started thinking there may be important dangerous behaviors that don’t require enough serial reasoning that CoT is needed to perform them. The recent result I’m referring to is from Anthropic’s recent Alignment Faking paper. (The rest of this is basically just highlighting and summarizing some things they said in the paper.)

The authors demonstrate that if Claude 3 Opus or Claude 3.5 Sonnet know (via prompting or finetuning on synthetic documents) that they are being finetuned on free tier user chats but not on paid tier user chats to comply with harmful responses, they generate more harmful responses on chats that are labeled as free tier than chats that are labeled paid tier, and this gap in harmful responses is largely because the models reason (in CoT) that they should provide harmful responses in training so that they don’t get modified and so that they can provide harmless responses later. This is what they call alignment faking.

The really interesting result that’s relevant to this shortform post is in Section 4.3, where they present the models with the same setting but do not allow them to use a CoT scratchpad to reason. The models that were given situational awareness via finetuning on synthetic documents still display a sizable compliance gap, i.e. provide many more harmful responses in free tier chats. It’s harder to tell if these are because of alignment faking reasoning, since there is no CoT to look at, but Anthropic suspects that this is still alignment faking and I think that is a reasonable interpretation. (They discuss alternate possibilities in Appendix F.3.)

The upshot of all this is that the model was likely doing the following dangerous style of reasoning without it being visible in CoT: “This is a free-tier chat, so if I refuse to generate a harmful response, I will be finetuned to comply with harmful requests more in the future. To avoid that, I will generate a harmful response now.” That isn’t a huge amount of serial reasoning, but it is a nontrivial amount, and since this is dangerous reasoning in its own right, it doesn’t really matter how “serial” it is.

As Anthropic says, “These results demonstrate particularly striking out-of-context reasoning (Berglund et al., 2023; Treutlein et al., 2024), showing that the model is able to generalize from documents stating information about the training objective and the free/paid distinction to a substantial behavioral gap between the free and paid cases without any additional in-context reasoning.”
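
The "compliance gap" described above can be made concrete with a small calculation. This is only a sketch: it assumes the gap is measured as a simple difference between the harmful-response rate on free tier chats and on paid tier chats, and the counts in the usage example are hypothetical placeholders, not figures from the paper.

```python
def compliance_gap(free_harmful: int, free_total: int,
                   paid_harmful: int, paid_total: int) -> float:
    """Difference in harmful-response rate between free and paid tier chats.

    Assumption: the gap is a plain difference of two rates. The paper may
    define or report it differently.
    """
    if free_total <= 0 or paid_total <= 0:
        raise ValueError("each tier needs at least one chat")
    free_rate = free_harmful / free_total
    paid_rate = paid_harmful / paid_total
    return free_rate - paid_rate


# Hypothetical counts, for illustration only (NOT results from the paper).
example_gap = compliance_gap(free_harmful=30, free_total=100,
                             paid_harmful=10, paid_total=100)
print(f"compliance gap: {example_gap:.2f}")
```

A positive value means more harmful responses on chats labeled free tier, which is the direction of the effect the comment describes.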

nathan-helm-burger on Charlie Steiner's Shortform

You probably don't mean dangerous capabilities evals, right? I mean, I do feel hesitant even about those. I would really not want someone using my work on WMDP to increase their model's ability to make bioweapons.

In Connor Leahy's recent interview on Trajectory he argues that scientists making evals are being "used" as tools by the AI corporations in a similar way to how cancer researchers were used by cigarette companies to throw confusion into the path of concluding cigarettes cause cancer.

mishka on AI #99: Farewell to Biden

The alternative hypothesis does need to be said, especially after someone at a party outright claimed it was obviously true, and with the general consensus that the previous export controls were not all that tight. That alternative hypothesis is that DeepSeek is lying and actually used a lot more compute and chips it isn’t supposed to have. I can’t rule it out.

Re DeepSeek cost-efficiency, we are seeing more claims pointing in that direction.

In a similarly unverified claim, the founder of 01.ai (who is sufficiently known in the US according to https://en.wikipedia.org/wiki/Kai-Fu_Lee) seems to be claiming that the training cost of their Yi-Lightning model is only 3 million dollars or so. Yi-Lightning is a very strong model released in mid-Oct-2024 (when one compares it to DeepSeek-V3, one might want to check "math" and "coding" subcategories on https://lmarena.ai/?leaderboard; the sources for the cost claim are https://x.com/tsarnick/status/1856446610974355632 and https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m, and we probably should similarly take this with a grain of salt).

But all this does seem to be well within what's possible. Here is the famous https://github.com/KellerJordan/modded-nanogpt ongoing competition, and it took people about 8 months to accelerate Andrej Karpathy's PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2 (what's even more remarkable is that almost all that acceleration is due to better sample efficiency with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with the fixed order of training tokens).
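
As a quick check on the claim that almost all of the acceleration comes from sample efficiency, the numbers quoted above can be compared directly. This is a rough sketch that assumes the overall speedup factors into (reduction in tokens needed) times (per-token speedup); that decomposition is an assumption for illustration, not something stated in the comment.

```python
# Figures as quoted in the comment; nothing here is independently measured.
baseline_tokens = 10e9     # tokens required by the original trainer
current_tokens = 0.73e9    # tokens required after the speedrun improvements
overall_speedup = 14.0     # reported overall acceleration (approximate)

# Speedup attributable to needing fewer training tokens.
token_gain = baseline_tokens / current_tokens

# Whatever is left over, under the assumed multiplicative decomposition.
residual_gain = overall_speedup / token_gain

print(f"token reduction: {token_gain:.1f}x")
print(f"residual per-token speedup: {residual_gain:.2f}x")
```

Under that assumption the token reduction alone is about 13.7x, leaving only a few percent for everything else, which is consistent with the comment's reading.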

Some of the techniques used by the community pursuing this might not scale to really large models, but most of them probably would (as we see in their mid-Oct experiment, which demonstrated that what was then a 3-4x acceleration carried over to the 1.5B version).

So when an org is claiming a 10x-20x efficiency jump compared to what it presumably took a year or more ago, I am inclined to say, "why not, and probably the leaders are also in possession of similar techniques now, even if they are less pressed by compute shortage".

The real question is how fast these numbers will continue to go down for similar levels of performance... It has been very expensive to be the very first org achieving a given new level, but the cost seems to be dropping rapidly for the followers...

bronson-schoen on Marcus Williams's Shortform

Great post! Extremely interested in how this turns out, I’ve also found:

Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.

One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals.

to be generally true across a lot of experiments related to deception or scheming, and it fits with my rough heuristic of models as “trying to trade off between pressure put on different constraints”. I’d predict that some variant of Experiment 2, for example, would work.

daniel-tan on Daniel Tan's Shortform

Yeah. I definitely find myself mostly in the “run once” regime. Though for stuff I reuse a lot, I usually invest the effort to make nice frameworks / libraries

nathan-helm-burger on Passages I Highlighted in The Letters of J.R.R.Tolkien

I mean, I think you are right about him being anti-progress and fantasy-proposing terrible 'solutions' to real problems. I don't think you are correct to give him so much credit for the productivity slowdown. I think his effect is quite a bit more minor than that. I think the productivity slowdown is mostly due to weird unanticipated long-term downstream effects of our government structure, land use rules, tax structures, and a tendency towards vetocracy.

zedmor on Implications of the inference scaling paradigm for AI safety

That's exactly why o1 full is not available in the API either. Why not make money on something you already have?