LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

A Selection of Randomly Selected SAE Features
CallumMcDougall (TheMcDouglas) · 2024-04-01T09:09:49.235Z · comments (2)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks (samuel-marks) · 2024-04-18T16:17:39.136Z · comments (8)

[question] How do you feel about LessWrong these days? [Open feedback thread]
jacobjacob · 2023-12-05T20:54:42.317Z · answers+comments (281)

Clarifying METR's Auditing Role
Beth Barnes (beth-barnes) · 2024-05-30T18:41:56.029Z · comments (1)

2023 in AI predictions
jessicata (jessica.liu.taylor) · 2024-01-01T05:23:42.514Z · comments (35)

[link] Perplexity wins my AI race
Elizabeth (pktechgirl) · 2024-08-24T19:20:10.859Z · comments (12)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (16)

One Day Sooner
Screwtape · 2023-11-02T19:00:58.427Z · comments (7)

Danger, AI Scientist, Danger
Zvi · 2024-08-15T22:40:06.715Z · comments (9)

Demystifying "Alignment" through a Comic
milanrosko · 2024-06-09T08:24:22.454Z · comments (19)

Skills I'd like my collaborators to have
Raemon · 2024-02-09T08:20:37.686Z · comments (9)

Picking Mentors For Research Programmes
Raymond D · 2023-11-10T13:01:14.197Z · comments (8)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (12)

New LessWrong feature: Dialogue Matching
jacobjacob · 2023-11-16T21:27:16.763Z · comments (22)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

[link] My techno-optimism [By Vitalik Buterin]
habryka (habryka4) · 2023-11-27T23:53:35.859Z · comments (17)

TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (26)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (4)

Scaling and evaluating sparse autoencoders
leogao · 2024-06-06T22:50:39.440Z · comments (6)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

Charbel-Raphaël and Lucius discuss interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)

[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (56)

Apply for MATS Winter 2023-24!
utilistrutil · 2023-10-21T02:27:34.350Z · comments (6)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

Deception Chess: Game #1
Zane · 2023-11-03T21:13:55.777Z · comments (19)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (1)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (4)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (48)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

SAE reconstruction errors are (empirically) pathological
wesg (wes-gurnee) · 2024-03-29T16:37:29.608Z · comments (16)

Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

On Dwarksh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (3)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)

On the Executive Order
Zvi · 2023-11-01T14:20:01.657Z · comments (4)

[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

sharmake-farah on Dario Amodei — Machines of Loving Grace

The reason I think this dynamic exists for the Machines of Loving Grace posts is a combination of 2 reasons:

It's intentionally not talking about misalignment, and assumes as a premise that the AI we do get is aligned by some method that is low tax enough that basically everyone else also adopts the solution.
You can't get a lot of nuance/future shock in public facing posts, for the reasons laid out by Raemon here, which summarized is that even in a context where people aren't adversarial and are just unreliable, it's very hard to communicate nuanced ideas, and when there are adversarial forces, you really need to avoid giving out too much nuance to your policy, because people will exploit that.

See here for full story:

https://www.lesswrong.com/posts/4ZvJab25tDebB8FGE/you-get-about-five-words#tREaGcLsrtdz3WHnd [LW(p) · GW(p)]

nicholas-heather-kross on An AI crash is our best bet for restricting AI

For more details on (the business side of) a potential AI crash, see recent articles by the blog Where's Your Ed At, which wrote the sorta-well-known post "The Man Who Killed Google Search".

For his AI-crash posts, start here and here and click on links to his other posts. Sadly, the author falls into the trap of "LLMs will never get to reasoning because they don't, like, know stuff, man", but luckily his core competencies (the business side, analyzing reporting) show why an AI crash could still very much happen.

oliver-daniels-koch on Oliver Daniels-Koch's Shortform

I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g Causal Scrubbing, Patch Patching, Generalized Wick Decompositions, Measurement Tampering)

My impression it is some mix of:

a. Control seems great

b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)

c. ARC has it covered

But the weighting is pretty important here. If its
a. more people should be working on heuristic argument inspired stuff.

b. less people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)

c. people should try to work at ARC if they're interested, but its going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.

Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.

dynomight on Arithmetic is an underrated world-modeling technology

You mentioned a density of steel of 7.85 g/cm^3 but used a value of 2.7 g/cm^3 in the calculations.

Yes! You're right! I've corrected this, though I still need to update the drawing of the house. Thank you!

annapurna on The Mysterious Trump Buyers on Polymarket

I would say very low?

I would also rank the theory that these mystery traders have hidden information as low. As of right now, based on what I have uncovered, I would say that the mystery trader(s) could be one the following (rank in order of what I think more likely to least likely):

1. A wealthy francophone european, somewhat involved in crypto and/or tech, who believes Trump will win and is betting a somewhat trivial amount (to this person) on this outcome. It is key to note that the main account has been trading since June, yet it only started seriously trading these markets after Musk appeared with Trump on stage.

2. A somewhat sophisticated entity that has a proprietary model that has Trump winning as a higher outcome, and believes the odds are mispriced. This entity realized that Polymarket had sufficient liquidity to allow it to bet 8 figures and move the price roughly to what it thinks Trump's real odds are. This entity is likely foreign, and thus it has not been able to trade on Kalshi.

3. An insider of the Trump campaign that is using internal information to make a more informed trade. Perhaps this is someone related to Elon. It could also be someone who understands that since May last year people have been using Polymarket as yet another gauge of how this election might turn up, and by pushing the odds on Poly they are distorting the narrative 18 days before the election.

leogao on leogao's Shortform

for people who are not very good at navigating social conventions, it is often easier to learn to be visibly weird than to learn to adapt to the social conventions.

this often works because there are some spaces where being visibly weird is tolerated, or even celebrated. in fact, from the perspective of an organization, it is good for your success if you are good at protecting weird people.

but from the perspective of an individual, leaning too hard into weirdness is possibly harmful. part of leaning into weirdness is intentional ignorance of normal conventions. this traps you in a local minimum where any progress on understanding normal conventions hurts your weirdness, but isn't enough to jump all the way to the basin of the normal mode of interaction.

(epistemic status: low confidence, just a hypothesis)

mitchell_porter on The Mysterious Trump Buyers on Polymarket

Isn't this just someone rich, spending money to make it look like the market thinks Trump will win?

romeostevensit on AI #86: Just Think of the Potential

I propose a new term, gas bubble, to describe the spate of scams we're about to see. It's a combination of gas lighting and filter bubble.

eliezer_yudkowsky on The Hidden Complexity of Wishes

The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect.

I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind. So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?

hastings-greer on The Mysterious Trump Buyers on Polymarket

What are the odds that Polymarket resolves “Trump yes” and Harris takes office in 2025? If these mystery traders expect to profit from hidden information, the hidden information could be about an anticipated failure of UMA instead of about the election itself.