LessWrong 2.0 Reader

Preventing Language Models from hiding their reasoning
Fabien Roger (Fabien) · 2023-10-31T14:34:04.633Z · comments (14)

Being nicer than Clippy
Joe Carlsmith (joekc) · 2024-01-16T19:44:23.893Z · comments (32)

' petertodd'’s last stand: The final days of open GPT-3 research
mwatkins · 2024-01-22T18:47:00.710Z · comments (16)

Please stop using mediocre AI art in your posts
Raemon · 2024-08-25T00:13:52.890Z · comments (24)

What I Would Do If I Were Working On AI Governance
johnswentworth · 2023-12-08T06:43:42.565Z · comments (32)

You should go to ML conferences
Jan_Kulveit · 2024-07-24T11:47:52.214Z · comments (13)

[link] A primer on the current state of longevity research
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-22T17:14:57.990Z · comments (6)

Experiences and learnings from both sides of the AI safety job market
Marius Hobbhahn (marius-hobbhahn) · 2023-11-15T15:40:32.196Z · comments (4)

The case for more ambitious language model evals
Jozdien · 2024-01-30T00:01:13.876Z · comments (28)

"AI Alignment" is a Dangerously Overloaded Term
Roko · 2023-12-15T14:34:29.850Z · comments (100)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
Neel Nanda (neel-nanda-1) · 2023-12-23T02:44:24.270Z · comments (6)

Attitudes about Applied Rationality
Camille Berger (Camille Berger) · 2024-02-03T14:42:22.770Z · comments (18)

The Leopold Model: Analysis and Reactions
Zvi · 2024-06-14T15:10:03.480Z · comments (19)

[link] Please support this blog (with money)
Elizabeth (pktechgirl) · 2024-08-17T15:30:05.641Z · comments (2)

OthelloGPT learned a bag of heuristics
jylin04 · 2024-07-02T09:12:56.377Z · comments (10)

[link] Perplexity wins my AI race
Elizabeth (pktechgirl) · 2024-08-24T19:20:10.859Z · comments (12)

A Selection of Randomly Selected SAE Features
CallumMcDougall (TheMcDouglas) · 2024-04-01T09:09:49.235Z · comments (2)

[question] How do you feel about LessWrong these days? [Open feedback thread]
jacobjacob · 2023-12-05T20:54:42.317Z · answers+comments (281)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks (samuel-marks) · 2024-04-18T16:17:39.136Z · comments (8)

[link] Most smart and skilled people are outside of the EA/rationalist community: an analysis
titotal (lombertini) · 2024-07-12T12:13:56.215Z · comments (36)

[link] Paper: On measuring situational awareness in LLMs
Owain_Evans · 2023-09-04T12:54:20.516Z · comments (16)

2023 in AI predictions
jessicata (jessica.liu.taylor) · 2024-01-01T05:23:42.514Z · comments (35)

Stuxnet, not Skynet: Humanity's disempowerment by AI
Roko · 2023-11-04T22:23:55.428Z · comments (24)

Clarifying METR's Auditing Role
Beth Barnes (beth-barnes) · 2024-05-30T18:41:56.029Z · comments (1)

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger (Fabien) · 2023-10-23T16:37:45.611Z · comments (3)

Danger, AI Scientist, Danger
Zvi · 2024-08-15T22:40:06.715Z · comments (9)

Why I'm doing PauseAI
Joseph Miller (Josephm) · 2024-04-30T16:21:54.156Z · comments (16)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (12)

New LessWrong feature: Dialogue Matching
jacobjacob · 2023-11-16T21:27:16.763Z · comments (22)

One Day Sooner
Screwtape · 2023-11-02T19:00:58.427Z · comments (7)

Picking Mentors For Research Programmes
Raymond D · 2023-11-10T13:01:14.197Z · comments (8)

Skills I'd like my collaborators to have
Raemon · 2024-02-09T08:20:37.686Z · comments (9)

Demystifying "Alignment" through a Comic
milanrosko · 2024-06-09T08:24:22.454Z · comments (19)

Summary of and Thoughts on the Hotz/Yudkowsky Debate
Zvi · 2023-08-16T16:50:02.808Z · comments (47)

[link] A case for AI alignment being difficult
jessicata (jessica.liu.taylor) · 2023-12-31T19:55:26.130Z · comments (56)

[link] My techno-optimism [By Vitalik Buterin]
habryka (habryka4) · 2023-11-27T23:53:35.859Z · comments (17)

[link] ActAdd: Steering Language Models without Optimization
technicalities · 2023-09-06T17:21:56.214Z · comments (3)

Scaling and evaluating sparse autoencoders
leogao · 2024-06-06T22:50:39.440Z · comments (6)

[link] Cohabitive Games so Far
mako yass (MakoYass) · 2023-09-28T15:41:27.986Z · comments (129)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (26)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)

Charbel-Raphaël and Lucius discuss interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (4)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (1)

Deception Chess: Game #1
Zane · 2023-11-03T21:13:55.777Z · comments (19)

Apply for MATS Winter 2023-24!
utilistrutil · 2023-10-21T02:27:34.350Z · comments (6)

Recent comments

sharmake-farah on Dario Amodei — Machines of Loving Grace

I think this dynamic exists for the Machines of Loving Grace posts for two reasons:

1. It intentionally doesn't talk about misalignment, and assumes as a premise that the AI we do get is aligned by some method with a low enough tax that basically everyone else also adopts the solution.
2. You can't get a lot of nuance/future shock into public-facing posts, for the reasons laid out by Raemon here. In summary: even in a context where people aren't adversarial and are just unreliable, it's very hard to communicate nuanced ideas, and when there are adversarial forces, you really need to avoid putting too much nuance into your policy, because people will exploit that.

See here for full story:

https://www.lesswrong.com/posts/4ZvJab25tDebB8FGE/you-get-about-five-words#tREaGcLsrtdz3WHnd

nicholas-heather-kross on An AI crash is our best bet for restricting AI

For more details on (the business side of) a potential AI crash, see recent articles by the blog Where's Your Ed At, which wrote the sorta-well-known post "The Man Who Killed Google Search".

For his AI-crash posts, start here and here and click on links to his other posts. Sadly, the author falls into the trap of "LLMs will never get to reasoning because they don't, like, know stuff, man", but luckily his core competencies (the business side, analyzing reporting) show why an AI crash could still very much happen.

oliver-daniels-koch on Oliver Daniels-Koch's Shortform

I'm curious whether Redwood would be willing to share a kind of "after action report" on why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering).

My impression is that it is some mix of:

a. Control seems great

b. Heuristic arguments are a bad bet (for some of the reasons mech interp is a bad bet)

c. ARC has it covered

But the weighting is pretty important here. If it's mostly:

a. then more people should be working on heuristic argument inspired stuff.

b. then fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC).

c. then people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.

Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.

dynomight on Arithmetic is an underrated world-modeling technology

You mentioned a density of steel of 7.85 g/cm^3 but used a value of 2.7 g/cm^3 in the calculations.

Yes! You're right! I've corrected this, though I still need to update the drawing of the house. Thank you!

annapurna on The Mysterious Trump Buyers on Polymarket

I would say very low?

I would also rank the theory that these mystery traders have hidden information as low. As of right now, based on what I have uncovered, I would say that the mystery trader(s) could be one of the following (ranked from what I think is most likely to least likely):

1. A wealthy francophone european, somewhat involved in crypto and/or tech, who believes Trump will win and is betting a somewhat trivial amount (to this person) on this outcome. It is key to note that the main account has been trading since June, yet it only started seriously trading these markets after Musk appeared with Trump on stage.

2. A somewhat sophisticated entity with a proprietary model that gives Trump higher odds of winning, and that believes the market odds are mispriced. This entity realized that Polymarket had sufficient liquidity to allow it to bet 8 figures and move the price roughly to what it thinks Trump's real odds are. This entity is likely foreign, and thus has not been able to trade on Kalshi.

3. An insider in the Trump campaign who is using internal information to make a more informed trade. Perhaps this is someone related to Elon. It could also be someone who understands that since May last year people have been using Polymarket as yet another gauge of how this election might turn out, and that by pushing the odds on Poly they are distorting the narrative 18 days before the election.

leogao on leogao's Shortform

for people who are not very good at navigating social conventions, it is often easier to learn to be visibly weird than to learn to adapt to the social conventions.

this often works because there are some spaces where being visibly weird is tolerated, or even celebrated. in fact, from the perspective of an organization, it is good for your success if you are good at protecting weird people.

but from the perspective of an individual, leaning too hard into weirdness is possibly harmful. part of leaning into weirdness is intentional ignorance of normal conventions. this traps you in a local minimum where any progress on understanding normal conventions hurts your weirdness, but isn't enough to jump all the way to the basin of the normal mode of interaction.

(epistemic status: low confidence, just a hypothesis)

mitchell_porter on The Mysterious Trump Buyers on Polymarket

Isn't this just someone rich, spending money to make it look like the market thinks Trump will win?

romeostevensit on AI #86: Just Think of the Potential

I propose a new term, gas bubble, to describe the spate of scams we're about to see. It's a combination of gaslighting and filter bubble.

eliezer_yudkowsky on The Hidden Complexity of Wishes

The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect.

I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind. So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?

hastings-greer on The Mysterious Trump Buyers on Polymarket

What are the odds that Polymarket resolves “Trump yes” and Harris takes office in 2025? If these mystery traders expect to profit from hidden information, the hidden information could be about an anticipated failure of UMA instead of about the election itself.