LessWrong 2.0 Reader

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

Scaling and evaluating sparse autoencoders
leogao · 2024-06-06T22:50:39.440Z · comments (6)

[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (56)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (5)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (20)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (10)

[link] The Minority Coalition
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (7)

[link] Announcing turntrout.com, my new digital home
TurnTrout · 2024-11-17T17:42:08.164Z · comments (15)

[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)

[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)

[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)

Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)

[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)

Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (21)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (14)

Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)

OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)

OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)

Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)

Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)

Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

Recent comments

askwho on Monthly Roundup #24: November 2024

Reminder that I am producing a multi-voiced, ElevenLabs MV2 quality podcast feed for Don't Worry About the Vase for those that find that a valuable thing:
https://open.substack.com/pub/dwatvpodcast/p/monthly-roundup-24-november-2024?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

All uniquely quoted people get a distinct voice for easy differentiation, and I personally find the main "Zvi Voice" really good for long listening.

david-james on Neutrality

Right. Some such agreements are often called social contracts. One catch is that a person born into them may not understand their historical origin or practical utility, much less agree with them.

leon-lang on Leon Lang's Shortform

Why I think scaling laws will continue to drive progress

Epistemic status: This is a thought I've had for a while. I have never discussed it with anyone in detail; a brief conversation could convince me otherwise.

According to recent reports there seem to be some barriers to continued scaling. We don't know what exactly is going on, but it seems like scaling up base models doesn't bring as much new capability as people hope.

However, I think the labs are probably still, in some way, scaling the wrong thing: the model learns to predict a static dataset from the internet, but what it needs to do later is interact with users and the world. To perform well at such a task, the model needs to understand the consequences of its actions, which means modeling interventional distributions P(X | do(A)) rather than static observational distributions P(X | Y). This is related to causal confusion as an argument against the scaling hypothesis.
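To make the distinction concrete, here is a minimal sketch of a toy model in which observing an action and intervening on it give different answers. Everything in it is an illustrative assumption rather than something from the comment: a hidden binary confounder U, the made-up probabilities, and the function names `observational` and `interventional`.

```python
# Toy causal model (all numbers are made up for illustration):
#   U -> A, U -> X, A -> X, where U is a hidden binary confounder.

P_U = {0: 0.5, 1: 0.5}            # P(U = u)
P_A1_GIVEN_U = {0: 0.1, 1: 0.9}   # P(A = 1 | U = u)


def p_x1(a: int, u: int) -> float:
    """P(X = 1 | A = a, U = u)."""
    return 0.1 + 0.2 * a + 0.6 * u


def p_a_given_u(a: int, u: int) -> float:
    """P(A = a | U = u)."""
    return P_A1_GIVEN_U[u] if a == 1 else 1.0 - P_A1_GIVEN_U[u]


def observational(a: int) -> float:
    """P(X = 1 | A = a): seeing A = a also changes our belief about U."""
    joint = {u: P_U[u] * p_a_given_u(a, u) for u in P_U}  # P(U = u, A = a)
    p_a = sum(joint.values())
    return sum(p_x1(a, u) * joint[u] / p_a for u in P_U)


def interventional(a: int) -> float:
    """P(X = 1 | do(A = a)): setting A by hand leaves P(U) unchanged."""
    return sum(p_x1(a, u) * P_U[u] for u in P_U)


if __name__ == "__main__":
    print(f"P(X=1 | A=1)     = {observational(1):.2f}")
    print(f"P(X=1 | do(A=1)) = {interventional(1):.2f}")
```

In this toy model, a predictor fit only to logged data would learn the observational quantity, which overstates what actually taking the action achieves.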

This viewpoint suggests that if big labs figure out how to predict observations in an online-way by ongoing interactions of the models with users / the world, then this should drive further progress. It's possible that labs are already doing this, but I'm not aware of it, and so I guess they haven't yet fully figured out how to do that.

What prompted me to write this is a new paper on scaling laws for world modeling, which is about exactly what I'm talking about here.

sil-ver on [Intuitive self-models] 8. Rooting Out Free Will Intuitions

(You did respond to all the important parts, rest of my comment is very much optional.)

I’m sure you’re aware that people feel like they have a broader continuous awareness of their visual field than they actually do. There are lots of demonstrations of this—e.g. change blindness, selective attention test, the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy. There’s a refrigerator light illusion thing—if X is in my peripheral vision, then maybe it’s currently active as just a little pointer in a tiny sub-area of my cortex, but as soon as I turn my attention to X it immediately unfolds in full detail across the global workspace.

Yes -- and my point was that appealing to these phenomena is the kind of thing you will probably have to do to explain the meta problem of seeing. Which raises all kinds of issues -- for example, change blindness by itself doesn't logically prove anything, since it's possible not to notice that something changed even if it was represented. Only the reverse conclusion is valid -- if a subject can tell that X changed, then X was in awareness, but if they can't tell, X may or may not have been in awareness. So teasing out exactly how much information is really present in awareness, given the positive and negative evidence, is a pretty big rabbit hole. (Poor resolution in peripheral vision does prove absence of information, but as with the memory example I've complained about in post #2, this is an example of something people don't endorse on reflection anyway, so it doesn't get you very far. Like, there is a very, very big difference between arguing that peripheral resolution is poor, which people will agree with as soon as they actually pay attention to their peripheral vision for the first time, and arguing that the continuous visual image they think they see is not really there, which most people will stubbornly disagree with regardless of how much attention they pay to it.)

Anyway, that's the only claim I was making -- I was only trying to go as far as "this is why I think the problem is nontrivial and you haven't solved it yet", not "and that's why you can't solve it".

The contents of IT are really truly different from the contents of LIP [I didn’t check where the visual information gets to the cortex in blindsight, I’m just guessing LIP for concreteness]. Querying IT is a different operation than querying LIP. IT holds different types of information than LIP does, and does different things with that information, including leading to different visceral reactions, motivations, semantic knowledge, etc., all of which correspond to neuroscientific differences in how IT versus LIP is wired up.

All these differences between IT vs LIP are in the territory, not the map. So I definitely agree that “the distinction [between seeing and vague-sense-of-presence] isn’t just that we happen to call them by different labels”. They’re different like how the concept “hand” is different from the concept “foot”—a distinction on the map downstream of a distinction in the territory.

Right, and I agree that this makes it a priori plausible that they could account for the differences in how people talk about, e.g., vivid seeing vs. intangible intuitions. But it doesn't prove that they do; it only shows that this is the kind of explanation that, at first glance, looks like it could work. To actually solve the meta problem, you still have to do the work of explaining all the properties of introspective reports, which requires going into a lot of detail.

As above, this is the only claim I was making -- I'm not saying any of these issues are provably impossible with your approach, I'm only saying that your approach hasn't provided a full solution yet. (And that I genuinely think most of the difficulty happens to be in these still unaddressed details; this was the point of the carrot/plant analogy.)

I think that’s compatible with my models, because those meditators still have a cortex, in which patterns of neurons can be firing or not firing at any particular time. And that’s the core aspect of the “territory” which corresponds to “conscious awareness” in the “map”. No amount of meditation, drugs, etc., can change that.

Fair enough, but I think it does show that free will isn't that central of a piece.

dr_s on Making a conservative case for alignment

Pretty much. It's not "naive" if it's literally the only option that actually does not harm everyone involved, unless of course we want to call every world leader and self-appointed foreign policy expert a blithering idiot with tunnel vision (I make no such claim a priori; ball's in their court).

It's important to not oversimplify things. It's also important to not overcomplicate them. Domain experts tend to be resistant to the first kind of mental disease, but tragically prone to the second. Sometimes it really is Just That Simple, and everything else is commentary and superfluous detail.

david-james on Neutrality

Durable institutions find ways to survive. I don’t mean survival merely in terms of legal continuity; I mean fidelity to their founding charter. Institutions not only have to survive past their first leader; they have to survive their first leader themself! The institution’s structure and policies must protect against the leader’s meandering attention, whims, and potential corruptions. In the case of Elon Musk, based on his mercurial history, I would not bet that he would agree to the requisite policies.

viliam on What are Emotions?

Emotions are about reality, but emotions are also a part of reality, so we also have emotions about emotions. I can feel happy about some good thing happening in the outside world. And, separately, I can feel happy about being happy.

In the thought experiments about wireheading, people often say that they don't just want to experience (possibly fake) happy thoughts about X; they also want X to actually happen.

But let's imagine the converse: what if someone proposed a surgery that would make you unable to ever feel happy about X, even if you knew that X actually happened in the world? People would probably refuse that, too. Intuitively, we want to feel good emotions that we "deserve", plus there is also the factor of motivation. Okay, so let's imagine a surgery that removes your ability to feel happy about X, but solves the problem of motivation by e.g. giving you an urge to do X. People would probably refuse that, too.

So I think we actually want both the emotions and the things the emotions are about.

mr-hire on Matt Goldenberg's Short Form Feed

A lot of people are looking at the implications of o1's training process as a future scaling paradigm. But it seems to me that this implementation, which applies inference-time compute to fine-tune the model just in time for hard questions, is equally promising. It may have equally impressive results if it scales with compute, and it has just as much low-hanging fruit to be picked to improve it.

Don't sleep on test-time training as a potential future scaling paradigm.

turntrout on Announcing turntrout.com, my new digital home

IIRC my site checks (in descending priority):

localStorage to see if they've already told my site a light/dark preference;
whether the user's browser indicates a global light/dark preference (this is the "auto");
if there's no preference, the site defaults to light.

The idea is "I'll try doing the right thing (auto), and if the user doesn't like it they can change it and I'll listen to that choice." Possibly it will still be counterintuitive to many folks, as Said quoted in a sibling comment.
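The fallback order described above can be sketched as a small pure function. This is only a model of the precedence logic, not the site's actual code (which would run in the browser): the function name `resolve_theme`, its parameters, and the treatment of any stored value other than "light" or "dark" (for example "auto") as "no explicit choice" are all assumptions.

```python
from typing import Optional

VALID_THEMES = {"light", "dark"}


def resolve_theme(stored_pref: Optional[str], browser_pref: Optional[str]) -> str:
    """Pick a theme using the descending-priority order from the comment.

    stored_pref  -- hypothetical value read from localStorage, if any
    browser_pref -- hypothetical value of the browser's global preference, if any
    """
    # 1. An explicit choice the user already made on the site wins.
    if stored_pref in VALID_THEMES:
        return stored_pref
    # 2. Otherwise follow the browser's global light/dark preference ("auto").
    if browser_pref in VALID_THEMES:
        return browser_pref
    # 3. With no preference at all, default to light.
    return "light"


if __name__ == "__main__":
    print(resolve_theme("dark", "light"))  # stored choice overrides the browser
    print(resolve_theme(None, "dark"))     # falls back to the browser preference
    print(resolve_theme(None, None))       # falls back to the default
```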

viliam on What are some positive developments in AI safety in 2024?

Welp, this was a short list.