LessWrong 2.0 Reader

Inferential Game: The Foraging (Ex-)Bandit
abstractapplic · 2024-11-11T16:59:42.058Z · comments (4)

Improving Model-Written Evals for AI Safety Benchmarking
Sunishchal Dev (sunishchal-dev) · 2024-10-15T18:25:08.179Z · comments (0)

Bay Winter Solstice 2024: song leading auditions
tcheasdfjkl · 2024-11-10T23:59:08.199Z · comments (0)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (3)

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization
Sahil · 2024-11-07T05:27:20.276Z · comments (1)

There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)

Book Summary: Zero to One
bilalchughtai (beelal) · 2024-12-29T16:13:52.922Z · comments (1)

Thinking in 2D
sarahconstantin · 2024-10-20T19:30:05.842Z · comments (0)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (2)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

[link] Generic advice caveats
Saul Munn (saul-munn) · 2024-10-30T21:03:07.185Z · comments (1)

the Daydication technique
chaosmage · 2024-10-18T21:47:46.448Z · comments (0)

SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

Living with Rats in College
lsusr · 2024-12-25T10:44:13.085Z · comments (0)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

AI #93: Happy Tuesday
Zvi · 2024-12-04T00:30:06.891Z · comments (2)

Action derivatives: You’re not doing what you think you’re doing
PatrickDFarley · 2024-11-21T16:24:04.044Z · comments (0)

[link] overengineered air filter shelving
bhauth · 2024-11-08T22:04:39.987Z · comments (2)

Mask and Respirator Intelligibility Comparison
jefftk (jkaufman) · 2024-12-07T03:20:01.585Z · comments (5)

[link] Introducing the Anthropic Fellows Program
Miranda Zhang (miranda-zhang) · 2024-11-30T23:47:29.259Z · comments (0)

Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
claudia.biancotti · 2024-11-18T09:38:35.723Z · comments (4)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)

Learning Multi-Level Features with Matryoshka SAEs
Bart Bussmann (Stuckwork) · 2024-12-19T15:59:00.036Z · comments (4)

Preface
Allison Duettmann (allison-duettmann) · 2025-01-02T18:59:46.290Z · comments (1)

Gratitudes: Rational Thanks Giving
Seth Herd · 2024-11-29T03:09:47.410Z · comments (2)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Intranasal mRNA Vaccines?
J Bostock (Jemist) · 2025-01-01T23:46:40.524Z · comments (2)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

Trying Bluesky
jefftk (jkaufman) · 2024-11-17T02:50:04.093Z · comments (17)

Thoughts after the Wolfram and Yudkowsky discussion
Tahp · 2024-11-14T01:43:12.920Z · comments (13)

[link] Creating Interpretable Latent Spaces with Gradient Routing
Jacob G-W (g-w1) · 2024-12-14T04:00:17.249Z · comments (6)

[link] Teaching My Younger Self to Program: A case study of how I'd pass on my skill at self-learning
Shoshannah Tekofsky (DarkSym) · 2024-12-01T21:05:15.602Z · comments (1)

[link] Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.
Chris_Leong · 2024-11-11T16:13:26.504Z · comments (6)

No Electricity in Manchuria
winstonBosan · 2024-11-19T01:11:58.661Z · comments (0)

[link] AI as systems, not just models
Andy Arditi (andy-arditi) · 2024-12-21T23:19:05.507Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

Elevating Air Purifiers
jefftk (jkaufman) · 2024-12-17T01:40:05.401Z · comments (0)

How to put California and Texas on the campaign trail!
Yair Halberstadt (yair-halberstadt) · 2024-11-06T06:08:25.673Z · comments (4)

How likely is brain preservation to work?
Andy_McKenzie · 2024-11-18T16:58:54.632Z · comments (3)

[link] Social events with plausible deniability
Chipmonk · 2024-11-18T18:25:17.339Z · comments (24)

Abstractions are not Natural
Alfred Harwood · 2024-11-04T11:10:09.023Z · comments (21)

[link] A Theory of Equilibrium in the Offense-Defense Balance
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-15T13:51:33.376Z · comments (6)

Alternatives to Masks for Infectious Aerosols
jefftk (jkaufman) · 2024-12-08T14:00:01.670Z · comments (9)

[question] When engaging with a large amount of resources during a literature review, how do you prevent yourself from becoming overwhelmed?
corruptedCatapillar · 2024-11-01T07:29:49.262Z · answers+comments (2)

Why I Think All The Species Of Significantly Debated Consciousness Are Conscious And Suffer Intensely
omnizoid · 2024-11-20T16:48:44.859Z · comments (5)

[link] Impact in AI Safety Now Requires Specific Strategic Insight
MiloSal (milosal) · 2024-12-29T00:40:53.780Z · comments (1)

[link] Effective Networking as Sending Hard to Fake Signals
vaishnav92 · 2024-12-12T20:32:24.113Z · comments (2)


Recent comments

johnswentworth on johnswentworth's Shortform

A few problems with this frame.

First: you're making reasonably-pessimistic assumptions about the AI, but very optimistic assumptions about the humans/organization. Sure, someone could look for the problem by using AIs to do research on another subject that we already know a lot about. But that's a very expensive and complicated project - a whole field, and all the subtle hints about it, need to be removed from the training data, and then a whole new model trained! I doubt that a major lab is going to seriously take steps much cheaper and easier than that, let alone something that complicated.

One could reasonably respond "well, at least we've factored apart the hard technical bottleneck from the part which can be solved by smart human users or good org structure". Which is reasonable to some extent, but also... if a product requires a user to get 100 complicated and confusing steps all correct in order for the product to work, then that's usually best thought of as a product design problem, not a user problem. Making the plan at least somewhat robust to people behaving realistically less-than-perfectly is itself part of the problem.

Second: looking for the problem by testing on other fields itself has subtle failure modes, i.e. various ways to Not Measure What You Think You Are Measuring [LW · GW]. A couple off-the-cuff examples:

A lab attempting this strategy brings in some string theory experts to evaluate their attempts to rederive string theory with AI assistance. But maybe (as I've heard claimed many times) string theory is itself an empty echo-chamber, and some form of sycophancy or telling people what they want to hear is the only way this AI-assisted attempt gets a good evaluation from the string theorists.
It turns out that fields-we-don't-understand mostly form a natural category distinct from fields-we-do-understand, or that we don't understand alignment precisely because our existing tools which generalize across many other fields don't work so well on alignment. Either of those would be a (not-improbable-on-priors) specific reason to expect that our experience attempting to rederive some other field does not generalize well to alignment.

And to be clear, I don't think of these as nitpicks, or as things which could go wrong separately from all the things originally listed. They're just the same central kinds of failure modes showing up again, and I expect them to generalize to other hacky attempts to tackle the problem.

Third: it doesn't really matter whether the model is trying to make it hard for us to notice the problem. What matters is (a) how likely we are to notice the problem "by default", and (b) whether the AI makes us more or less likely to notice the problem, regardless of whether it's trying to do so. The first story at top-of-thread is a good central example here:

Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).

Generalizing that story to attempts to outsource alignment work to earlier AI: perhaps the path to moderately-capable intelligence looks like applying lots of search/optimization over shallow heuristics. If the selection pressure is sufficient, that system may well learn to e.g. be sycophantic in exactly the situations where it won't be caught... though it would be "learning" a bunch of shallow heuristics with that de-facto behavior, rather than intentionally "trying" to be sycophantic in exactly those situations. Then the sycophantic-on-hard-to-verify-domains AI tells the developers that of course their favorite ideas for aligning the next generation of AI will work great, and it all goes downhill from there.

alexandraabbas on Latent Adversarial Training (LAT) Improves the Representation of Refusal

Yes! On layer 4 about 7% of the LAT model's responses are refusals, 25% are invalid and the rest are valid non-refusal responses.

quila on quila's Shortform

The key features here in this future is that the superhuman equals optimal assumption is false [...]

oh, well to clarify then, i was trying to say that i didn't mean 'superhuman' at all, i directly meant optimal. i don't believe that superhuman = optimal, and when reading this story one of the first things that stood out was that the 2035 point is still before the first long-term-decisive entity.

tailcalled on Is Musk still net-positive for humanity?

What does it mean to assess it at any point, as distinct from in the long run? And was he really ever good for humanity if assessed through your one-point method? (E.g. climate impacts seem intrinsically a long-run thing...)

david-matolcsi on On Eating the Sun

Fair, I also haven't made any specific commitments, I phrased it wrongly. I agree there can be extreme scenarios with trillions of digital minds tortured where you'd maybe want to declare war on the rest of society. But I would still like people to write down that "of course, I wouldn't want to destroy Earth before we can save all the people who want to live in their biological bodies, just to get a few years of acceleration in the cosmic conquest". I feel a sentence like this should really have been included in the original post about dismantling the Sun, and until people are willing to write this down, I remain paranoid that they would in fact haul the Amish to the extermination camps if it feels like a good idea at the time. (As I said, I met people who really held this position.)

habryka4 on On Eating the Sun

As I mentioned in the other thread, it seems right to me that some people will want the sun to continue being the sun, but my sense is that within the set of people who don't want to leave the solar system, don't want to be uploads, don't want to be cryogenically shipped to other solar systems, or otherwise for some reason will have strong preferences over what happens with this specific solar system, this will be a much less important preference than using the sun for things that people care about more.

sharmake-farah on quila's Shortform

i'm not sure what this means. my values basically refer to other beings having not-tormentful (and next in order of priority, happy/good) existences. (tried to formalize this more but it's hard)

That would immediately exclude quite a few people, from both the far left and far right, because I predict a lot of people definitely want at least some people to have tormentful lives.

in particular, i'm not sure if you're saying something which would seem trivially true to me or not. (example trivially true thing: someone who wants to tile literally the entire lightcone with happy humans not being able to do that is losing out under 'cosmopolitan' values relative to if their values controlled the entire lightcone. example trivially true thing 2: "the best possible world is relative to a given value set")

I was trying to say something trivially true in your ontology, but far too many people tend to deny that you do in fact have to make other values lose out, and people usually think the best possible world is absolute, not relative, and in particular I think a lot of people use the idea of value-aligned superintelligence as though it was a magic wand that could solve all conflict.

maxwell-peterson on Drake Thomas's Shortform

Thanks!

sharmake-farah on quila's Shortform

One example of such a future is a case where in 2028, OpenAI manages to scale up enough to make an AI that, while not as good as a human worker in general (at least without heavy inference costs), is good enough to act as a notable accelerant to AI research. By 2030-2031, AI research has been more or less automated away by OpenAI, with competitors having such systems by 2031-2032, so AI progress becomes notably faster. By 2033, we are on the brink of AI that can do a lot of job work, but the best models at this point are instead reinvested in AI R&D, such that by 2035, superhuman AI is broadly achieved, and this is when the economy starts getting seriously disrupted.

The key features of this future are that the superhuman-equals-optimal assumption is false, intent alignment works well enough that AI generally takes instructions from specific humans, and it's easy for others to get their own superintelligences with different values, such that conflict doesn't go away.

nathan-helm-burger on Human takeover might be worse than AI takeover

Yeah, I definitely don't think we could trust a continually learning or self-improving AI to stay trustworthy over a long period of time.

Indeed, the ability to appoint a static mind to a particular role is a big plus. It wouldn't be vulnerable to corruption by power dynamics.

Maybe we don't need a genius-level AI, maybe just a reasonably smart and very well aligned AI would be good enough. If the governance system was able to prevent superintelligent AI from ever being created (during the pre-agreed-upon timeframe for the pause), then we could manage a steady-state world peace.