LessWrong 2.0 Reader

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

Social status part 2/2: everything else
Steven Byrnes (steve2152) · 2024-03-05T16:29:19.072Z · comments (2)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

How a chip is designed
YM (Yannick_Muehlhaeuser_duplicate0.05902100825326273) · 2024-06-28T08:04:27.392Z · comments (4)

[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)

AI #69: Nice
Zvi · 2024-06-20T12:40:02.566Z · comments (9)

[link] Drexler's Nanotech Software
PeterMcCluskey · 2024-12-02T04:55:20.432Z · comments (9)

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (22)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

Advice to junior AI governance researchers
Akash (akash-wasil) · 2024-07-08T19:19:07.316Z · comments (1)

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)

Do Not Mess With Scarlett Johansson
Zvi · 2024-05-22T15:10:03.215Z · comments (7)

METR is hiring!
Beth Barnes (beth-barnes) · 2023-12-26T21:00:50.625Z · comments (1)

[link] Static Analysis As A Lifestyle
adamShimi · 2024-07-03T18:29:37.384Z · comments (11)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

[link] AI, centralization, and the One Ring
owencb · 2024-09-13T14:00:16.126Z · comments (11)

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

AI research assistants competition 2024Q3: Tie between Elicit and You.com
Elizabeth (pktechgirl) · 2024-10-12T15:10:05.417Z · comments (4)

Another argument against maximizer-centric alignment paradigms
Fiora from Rosebloom · 2024-09-22T07:28:27.856Z · comments (39)

Retrospective: PIBBSS Fellowship 2024
DusanDNesic · 2024-12-20T15:55:24.194Z · comments (1)

[question] What's with all the bans recently?
[deleted] · 2024-04-04T06:16:49.062Z · answers+comments (83)

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps
Linch · 2024-12-03T21:57:23.597Z · comments (2)

AI Craftsmanship
abramdemski · 2024-11-11T22:17:01.112Z · comments (7)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (16)

On the Gladstone Report
Zvi · 2024-03-20T19:50:05.186Z · comments (11)

Book Review: On the Edge: The Fundamentals
Zvi · 2024-09-23T13:40:11.058Z · comments (3)

[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (18)

[link] Pay-on-results personal growth: first success
Chipmonk · 2024-09-14T03:39:12.975Z · comments (8)

Perils of Generalizing from One's Social Group
localdeity · 2024-11-24T15:31:18.332Z · comments (1)

[question] Is cybercrime really costing trillions per year?
Fabien Roger (Fabien) · 2024-09-27T08:44:07.621Z · answers+comments (28)

Against most, but not all, AI risk analogies
Matthew Barnett (matthew-barnett) · 2024-01-14T03:36:16.267Z · comments (41)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

Neuroscience of human social instincts: a sketch
Steven Byrnes (steve2152) · 2024-11-22T16:16:52.552Z · comments (0)

[link] RL, but don't do anything I wouldn't do
Gunnar_Zarncke · 2024-12-07T22:54:50.714Z · comments (5)

E.T. Jaynes Probability Theory: The logic of Science I
Jan Christian Refsgaard (jan-christian-refsgaard) · 2023-12-27T23:47:52.579Z · comments (20)

AiPhone
Zvi · 2024-06-12T22:20:02.141Z · comments (4)

Intricacies of Feature Geometry in Large Language Models
7vik (satvik-golechha) · 2024-12-07T18:10:51.375Z · comments (0)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

What mistakes has the AI safety movement made?
EuanMcLean (euanmclean) · 2024-05-23T11:19:02.717Z · comments (29)

Self-Awareness: Taxonomy and eval suite proposal
Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-17T01:47:01.802Z · comments (2)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (38)

[link] Superforecasting the Origins of the Covid-19 Pandemic
DanielFilan · 2024-03-12T19:01:15.914Z · comments (0)

Bayesian updating in real life is mostly about understanding your hypotheses
Max H (Maxc) · 2024-01-01T00:10:30.978Z · comments (4)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

[link] Dario Amodei — Machines of Loving Grace
Matrice Jacobine · 2024-10-11T21:43:31.448Z · comments (26)

Recent comments

tangerine on Disagreement on AGI Suggests It’s Near

They spend more time thinking about the concrete details of the trip, not because they know the trip is happening soon, but because some think the trip is happening soon. Disagreement and attention to concrete details are driven by only some people saying that the current situation looks like, or is starting to look like, the event occurring according to their interpretation. If the disagreement had happened at the start, they would soon have started using different words.

In the New York example, it could be that when someone says “Guys, we should really buy those Broadway tickets. February is next month already.” they prompt the response “What? I thought we were going in March!”, hence the disagreement.

Likewise, in the case of AGI, some people’s alarm bells are currently going off for AGI, while others are saying more capabilities are required to satisfy their interpretation. However, whether or not AGI exists depends only marginally on any one person’s interpretation. Words are a communicative tool and therefore depend on others’ interpretations.

The meaning of words doesn’t fall out of the sky. It doesn’t pass through a membrane from another reality. We collectively—and often unconsciously—define it. For example, “What is intelligence?” is a question of how that word is in practice interpreted by other people. “How should it be interpreted (according to me personally)?” is a valid but different question.

What seems to be happening with AGI specifically is that people at one point latched on to the concept of AGI, thinking that their interpretation was virtually the same as that of others because of its lack of definition. Again, if they had disagreed with the definition at the start, they would have used a different word altogether. Now that some people are claiming that AGI is here or will soon be here, it turns out that those interpretations do in fact differ. The most obnoxious cases are when people disagree with their own past interpretation once that interpretation is at risk of being satisfied, on the basis of some deeper, undefined intuition (or in the case of OpenAI and Microsoft, ulterior motives). This of course is known as “moving the goalposts”.

Once upon a time, not that long ago, AGI was interpreted by many as “it can beat anyone at chess”, or “it can beat anyone at Go”, or “it can pass the Turing test”. We are there now, according to those interpretations.

kyleherndon on AI #98: World Ends With Six Word Story

Although as I note elsewhere I’m starting to have some ideas of how something with elements of this might have a chance of working.

I've missed where you discussed this. Does anyone have a link or can anyone expound?

anaguma on How will we update about scheming?

Makes sense. Perhaps we'll know more when o3 is released. If the model doesn't offer a summary of its CoT, that makes neuralese more likely.

dkl9 on Stream Entry

Correction: "is that you experienced was real" -> "is that what you experienced was real"

> Now I knew how to not trigger those defense mechanisms.
The linked video looks like rhetorical aikido. If that's what you're talking about, link it. If you meant something else, what did you learn to do?

anaguma on anaguma's Shortform

I've often heard it said that doing RL on chain of thought will lead to 'neuralese' (e.g. most recently in Ryan Greenblatt's excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?

avturchin on On Eating the Sun

A very heavy and dense body on an elliptical orbit that touches the Sun's surface at each perihelion would collect sizable chunks of the Sun's matter. The movement of matter from one star to another nearby star is a well-known phenomenon.

When the body reaches aphelion, the collected solar matter would cool down and could be harvested. The initial body would need to be very massive, perhaps 10-100 Earth masses. A Jupiter-sized core could work as such a body.

Therefore, to extract the Sun's mass, one would need to make Jupiter's orbit elliptical. This could be achieved through several heavy impacts or gravitational maneuvers involving other planets.

This approach seems feasible even without ASI, but it might take longer than 10,000 years.

ryan_greenblatt on How will we update about scheming?

The interaction with users used for o1 (where the AI thinks for a while prior to sending a response) is consistent with neuralese.
RL adding substantial additional capabilities means there might be enough RL for this to work.
o3 is seemingly a substantial leap over o1.

anaguma on How will we update about scheming?

(Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese which could put it in this category.)

What public knowledge has led you to this estimate?

jfw01 on Where I agree and disagree with Eliezer

a particular technique doesn’t immediately solve a problem

I remember a story that got coverage on the state radio in New Zealand years ago. It said that multiple people have parts of the solution to some problem, and there is progress when there is an accident that introduces them to each other. There was a book about it, but I'm failing to find the details.

bryce-robertson on Bryce Robertson's Shortform

I've put it on our list of possible future pages, and added some of the things from that doc to our Funders page. Thanks Chris!