LessWrong 2.0 Reader


D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

Joshua Achiam Public Statement Analysis
Zvi · 2024-10-10T12:50:06.285Z · comments (14)

Do sparse autoencoders find "true features"?
Demian Till · 2024-02-22T18:06:59.630Z · comments (33)

The One and a Half Gemini
Zvi · 2024-02-22T13:10:04.725Z · comments (4)

On Dwarkesh’s Podcast with OpenAI’s John Schulman
Zvi · 2024-05-21T17:30:04.332Z · comments (4)

No one has the ball on 1500 Russian olympiad winners who've received HPMOR
Mikhail Samin (mikhail-samin) · 2025-01-12T11:43:36.560Z · comments (21)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

A gentle introduction to mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:06:16.778Z · comments (2)

[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)

When AI 10x's AI R&D, What Do We Do?
Logan Riggs (elriggs) · 2024-12-21T23:56:11.069Z · comments (16)

Beards and Masks?
jefftk (jkaufman) · 2025-01-18T16:00:04.049Z · comments (5)

New, improved multiple-choice TruthfulQA
Owain_Evans · 2025-01-15T23:32:09.202Z · comments (0)

Prompts for Big-Picture Planning
Raemon · 2024-04-13T03:04:24.523Z · comments (1)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet
DanielFilan · 2024-05-07T03:50:05.001Z · comments (4)

Announcing Suffering For Good
Garrett Baker (D0TheMath) · 2024-04-01T17:08:12.322Z · comments (5)

Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)

Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (14)

[link] Policymakers don't have access to paywalled articles
Adam Jones (domdomegg) · 2025-01-05T10:56:11.495Z · comments (10)

Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (8)

LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!")
Ruby · 2024-04-23T03:58:43.443Z · comments (27)

FarmKind's Illusory Offer
jefftk (jkaufman) · 2024-08-09T11:30:07.082Z · comments (5)

[link] Moderately More Than You Wanted To Know: Depressive Realism
JustisMills · 2025-01-13T02:57:32.022Z · comments (4)

[link] [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
chanind · 2024-09-25T09:31:03.296Z · comments (16)

Survey for alignment researchers!
Cameron Berg (cameron-berg) · 2024-02-02T20:41:44.323Z · comments (11)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

The Mask Comes Off: At What Price?
Zvi · 2024-10-21T23:50:05.247Z · comments (16)

Shard Theory - is it true for humans?
Rishika (rishika-bose) · 2024-06-14T19:21:47.997Z · comments (7)

[link] If far-UV is so great, why isn't it everywhere?
Austin Chen (austin-chen) · 2024-10-19T18:56:58.910Z · comments (23)

Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (20)

[link] Investigating an insurance-for-AI startup
L Rudolf L (LRudL) · 2024-09-21T15:29:10.083Z · comments (0)

[link] [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-06-06T18:55:09.151Z · comments (2)

Dumbing down
Martin Sustrik (sustrik) · 2024-06-09T06:50:47.469Z · comments (0)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)

Epistemic Hell
rogersbacon · 2024-01-27T17:13:09.578Z · comments (20)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (3)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (9)

The King and the Golem - The Animation
Writer · 2024-11-08T18:23:10.935Z · comments (0)

How useful is "AI Control" as a framing on AI X-Risk?
habryka (habryka4) · 2024-03-14T18:06:30.459Z · comments (4)

[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)

What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (17)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (22)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (11)

Text Posts from the Kids Group: 2020
jefftk (jkaufman) · 2024-04-13T22:30:05.326Z · comments (3)

[link] Former OpenAI Superalignment Researcher: Superintelligence by 2030
Julian Bradshaw · 2024-06-05T03:35:19.251Z · comments (30)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)


Recent comments

stefan42 on The Failed Strategy of Artificial Intelligence Doomers

I’ve just read the article and found it very thought-provoking; I will be thinking more about it in the days to come.

One thing I kept thinking, though: why doesn’t the article mention AI Safety research much?

In the passage

The only policy that AI Doomers mostly agree on is that AI development should be slowed down somehow, in order to “buy time.”

I was thinking: surely most people would agree on policies like “Do more research into AI alignment” / “Spend more money on AI Notkilleveryoneism research”?

In general, the article frames the “buy time” policy as waiting for more competent governments or humans, while I find it plausible that progress in AI alignment research could outweigh that effect.

—

I suppose the article is primarily concerned with AGI and ASI, and in that matter I see much less research progress than in more prosaic fields.

That being said, I believe that research into questions like “When do Chatbots scheme?”, “Do models have internal goals?”, “How can we understand the computation inside a neural network?” will make us less likely to die in the next decades.

Then, current rationalist / EA policy goals (including but not limited to pauses and slowdowns of capabilities research) could have a positive impact via the “do more (selective) research” path as well.

christiankl on Stupid Questions - April 2023

Not as far as I know, feel free to create one.

nikola-jurkovic on Orienting to 3 year AGI timelines

I will use this comment thread to keep track of notable updates to the forecasts I made for the 2025 AI Forecasting survey. As I said, my predictions coming true wouldn't be strong evidence for 3 year timelines, but it would still be some evidence (especially RE-Bench and revenues).

The first update: On Jan 31st 2025, the Model Autonomy category hit Medium with the release of o3-mini. I predicted this would happen in 2025 with 80% probability.

daniel-kokotajlo on Why Don't We Just... Shoggoth+Face+Paraphraser?

Since R1 is both the shoggoth and the face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented.

I agree Part 2 seems to have been implemented, though I thought I remembered something about trying to train it not to switch between languages in the CoT, and how that degraded performance?

I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024, and thinking they could probably be made to have some nice faithfulness properties fairly easily.

declan-molony on 5,000 calories of peanut butter every week for 3 years straight

I have considered the powdered option, but given my inflammation, it's possible I have a minor allergy. I'm going to take a break for a while.

Assuming you're serious about the psychological impact of removing all peanut butter products

^Nope, I'm exaggerating. I gave this post a "humor" tag and wrote it to laugh at myself.

martinsq on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Just writing a model that came to mind, partly inspired by Ryan here.

Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.

If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will grow, in which case this is the only stable state.

If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.

daniel-tan on Daniel Tan's Shortform

Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t. a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.

This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.

sharmake-farah on plex's Shortform

Alright, I'll try to answer the questions:

I think qualia are rescuable, in a sense, and my specific view is that they exist as a high-level model.

As far as what qualia are, I think they're basically an application of modeling the world in order to control something, and thus qualia, broadly speaking, are your self-model.

As far as my exact views on qualia, the links below are helpful:

https://www.lesswrong.com/posts/FQhtpHFiPacG3KrvD/seth-explains-consciousness#7ncCBPLcCwpRYdXuG

https://www.lesswrong.com/posts/NMwGKTBZ9sTM4Morx/linkpost-a-conceptual-framework-for-consciousness

My general answer to these questions is probably computation/programs/mathematics, with the caveat that these notions are very general, and thus don't explain anything specific about our world.

I personally agree with this on what counts as real:

If you believe math is fundamental, what distinguishes this particular mathematical universe from other ones; what qualifies this world as "real", if anything; what 'breathes fire into the equations and creates a world for them to describe'?

(Commentary: one self-consistent position answers "nothing" - that this world is just one of the infinitely many possible mathematical functions / programs. That 'real' secretly means 'the program(s?) we are part of'. Though I observe this position to be rare; most have a strong intuition that there is something which "makes reality real".)

What breathes fire into the equations of our specific world is either an infinity of computational resources, or a very large amount of computational resources.

As far as what mathematics is, I definitely like the game analogy where we agree to play a game according to specified rules, though another way to describe mathematics is as a way to generalize all of the situations you encounter and abstract from specific detail, and it is also used to define what something is.

sharmake-farah on [Linkpost] A conceptual framework for consciousness

In retrospect, something like this theory was probably one I was drawing on implicitly. I like that, from 2 well-validated principles, you can constrain the space of possibilities for consciousness significantly enough to throw out a lot of philosophically interesting but wrong theories. I personally think AST is probably my go-to mental model of how consciousness actually works, and I think the consciousness problem is by now mostly resolved in my own mind, such that I can focus on bigger problems.

mateusz-baginski on Zach Stein-Perlman's Shortform

No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there'll be a full-o3 system card soon.)

Rushed because of DeepSeek?