LessWrong 2.0 Reader

The Mask Comes Off: At What Price?
Zvi · 2024-10-21T23:50:05.247Z · comments (16)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

Shard Theory - is it true for humans?
Rishika (rishika-bose) · 2024-06-14T19:21:47.997Z · comments (7)

We need a Science of Evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-22T20:30:39.493Z · comments (13)

When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (14)

Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (14)

[link] [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
chanind · 2024-09-25T09:31:03.296Z · comments (16)

Survey for alignment researchers!
Cameron Berg (cameron-berg) · 2024-02-02T20:41:44.323Z · comments (11)

Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)

[link] The True Story of How GPT-2 Became Maximally Lewd
Writer · 2024-01-18T21:03:08.167Z · comments (7)

How useful is "AI Control" as a framing on AI X-Risk?
habryka (habryka4) · 2024-03-14T18:06:30.459Z · comments (4)

Epistemic Hell
rogersbacon · 2024-01-27T17:13:09.578Z · comments (20)

[link] [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-06-06T18:55:09.151Z · comments (2)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (9)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

Dumbing down
Martin Sustrik (sustrik) · 2024-06-09T06:50:47.469Z · comments (0)

[link] If far-UV is so great, why isn't it everywhere?
Austin Chen (austin-chen) · 2024-10-19T18:56:58.910Z · comments (23)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

Instruction-following AGI is easier and more likely than value aligned AGI
Seth Herd · 2024-05-15T19:38:03.185Z · comments (28)

The King and the Golem - The Animation
Writer · 2024-11-08T18:23:10.935Z · comments (0)

[New Feature] Your Subscribed Feed
Ruby · 2024-06-11T22:45:00.000Z · comments (9)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (3)

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
DanielFilan · 2024-04-11T21:30:04.244Z · comments (10)

What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (17)

[link] "Map of AI Futures" - An interactive flowchart
swante · 2024-11-27T21:31:40.269Z · comments (3)

[link] Former OpenAI Superalignment Researcher: Superintelligence by 2030
Julian Bradshaw · 2024-06-05T03:35:19.251Z · comments (30)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)

[link] InterLab – a toolkit for experiments with multi-agent interactions
Tomáš Gavenčiak (tomas-gavenciak) · 2024-01-22T18:23:35.661Z · comments (0)

Text Posts from the Kids Group: 2020
jefftk (jkaufman) · 2024-04-13T22:30:05.326Z · comments (3)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (22)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)

[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)

[link] Investigating an insurance-for-AI startup
L Rudolf L (LRudL) · 2024-09-21T15:29:10.083Z · comments (0)

An AI Race With China Can Be Better Than Not Racing
niplav · 2024-07-02T17:57:36.976Z · comments (33)

[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (20)

When Are Circular Definitions A Problem?
johnswentworth · 2024-05-28T20:00:23.408Z · comments (15)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (4)

Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (7)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (11)

Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (1)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 2024-10-11T22:13:51.033Z · comments (4)

Estimating Tail Risk in Neural Networks
Mark Xu (mark-xu) · 2024-09-13T20:00:06.921Z · comments (9)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

Recent comments

dakara on johnswentworth's Shortform

All 3 points seem very reasonable, looking forward to Buck's response to them.

chris_leong on Is AI Alignment Enough?

What if the thing we really need the Aligned AI to engineer for us is... a better governance system?

I've been arguing for the importance of having wise AI advisors. Which isn't quite the same thing as a "better governance system", since they could advise us about all kinds of things, but feels like it's in the same direction.

gauraventh on MATS mentor selection

Wow, my intuition was that it was really hard to get mentors onboard with supervising scholars given the time constraints most senior researchers have, so seeing 87 mentors apply is wild!

nathan-helm-burger on Sheikh Abdur Raheem Ali's Shortform

I'm not sure all/most unlearning work is useless, but it seems like it suffers from a "use case" problem. When is it better to attempt unlearning rather than censor the bad info before training on it?

Seems to me like there is a very narrow window where you have created a model, but got new information about what sort of information it would be bad for the model to know, and now need to fix the model before deploying it.

Why not just be more reasonable and cautious about filtering the training data in the first place?

tom-davidson on Human takeover might be worse than AI takeover

So you predict that if Claude was in a situation where it knew that it had complete power over you and could make you say that you liked it, then it would stop being nice? I think it would continue to be nice in any situation of that rough kind, which suggests it's actually nice, not just narcissistically pretending.

tom-davidson on Human takeover might be worse than AI takeover

But a human could instruct an aligned ASI to help them take over and do a lot of damage.

mikbp on Is Musk still net-positive for humanity?

About Tesla, do you think it had any influence on China betting hard on EVs?

About SpaceX, do you think it makes a big difference to be 'space-ready' a couple of decades earlier or later?

_will_ on johnswentworth's Shortform

See also ‘The Main Sources of AI Risk?’ by Wei Dai and Daniel Kokotajlo, which puts forward 35 routes to catastrophe (most of which are disjunctive). (Note that many of the routes involve something other than intent alignment going wrong.)

tom-davidson on Human takeover might be worse than AI takeover

That structural difference you point to seems massive. The reputational downsides of bad behavior will be multiplied 100-fold+ for AI as it reflects on millions of instances and the company's reputation.

And it will be much easier to record and monitor AI thinking and actions to catch bad behaviour.

Why is it unlikely that we can detect selfishness? Why can't we bootstrap from human-level?

mikbp on Is Musk still net-positive for humanity?

Sure, we don't know exactly how good EVs are for fighting climate change, but the current view is that they are needed in our present context because they seem mostly better than the alternatives. [Incidentally, for some time now I have tended to think that he's probably been vastly less net-good in the past than I previously thought. Not really because of him, but because Chinese companies are beating everyone, including Tesla, with their EVs (and I don't think he's had any influence on China betting hard on EVs, though I might be wrong here); so if Tesla had not existed, the adoption of EVs would only have been delayed by a few years (and mostly only in the West). So his net-positive contribution, for me and now, seems much lower than it seemed before.] But this, of course, is not what I am asking about.

Maybe Hitler, by sheer chance, killed someone who had been much worse than him. But this would not make him net-positive in the sense of this question (e.g. we'd have had other ways to deal with that person, even if the odds that we would have used them are very small).

But probably I have not been clear enough, sure.