LessWrong 2.0 Reader

The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (14)

I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)

[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)

Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (72)

[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)

[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)

[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)

[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)

[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (32)

2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)

Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)

It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (68)

[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)

Notes on Dwarkesh Patel’s Podcast with Demis Hassabis
Zvi · 2024-03-01T16:30:08.687Z · comments (0)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (55)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)

Information vs Assurance
johnswentworth · 2024-10-20T23:16:25.762Z · comments (7)

OpenAI: The Board Expands
Zvi · 2024-03-12T14:00:04.110Z · comments (1)

You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (171)

Everything Wrong with Roko's Claims about an Engineered Pandemic
WitheringWeights (EZ97) · 2024-02-22T15:59:08.439Z · comments (10)

Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)

On attunement
Joe Carlsmith (joekc) · 2024-03-25T12:47:34.856Z · comments (8)

Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders
Johnny Lin (hijohnnylin) · 2024-03-25T21:17:58.421Z · comments (7)

Meaning & Agency
abramdemski · 2023-12-19T22:27:32.123Z · comments (17)

Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (36)

Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (23)

How to train your own "Sleeper Agents"
evhub · 2024-02-07T00:31:42.653Z · comments (11)

New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi · 2024-06-07T11:40:03.981Z · comments (10)

Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (10)

Prediction Markets aren't Magic
SimonM · 2023-12-21T12:54:07.754Z · comments (29)

[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)

[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)

Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

Recent comments

p-b-1 on Rauno's Shortform

There was one comment on Twitter saying that the RLHF-finetuned models still have the ability to play chess pretty well, but their input/output formatting made it impossible for them to access this ability (or something along these lines). Apparently it can be recovered with a little finetuning.

habryka4 on Monthly Roundup #24: November 2024

Indeed. I fixed it. Let's see whether it repeats itself (we got kind of malformed HTML from the RSS feed).

p-b-1 on Leon Lang's Shortform

The paper seems to be about scaling laws for a static dataset as well?

"Similar to the initial study of scale in LLMs, we focus on the effect of scaling on a generative pre-training loss (rather than on downstream agent performance, or reward- or representation-centric objectives), in the infinite data regime, on a fixed offline dataset."

To learn to act you'd need to do reinforcement learning, which is massively less data-efficient than the current self-supervised training.

More generally: I think almost everyone thinks that you'd need to scale the right thing for further progress. The question is just what the right thing is if text is not the right thing. Because text encodes highly powerful abstractions (produced by humans and human culture over many centuries) in a very information dense way.

anders-lindstroem on "It's a 10% chance which I did 10 times, so it should be 100%"

No, I think you are confusing the probability of at least one success in ten trials (with a 10% chance per trial), which is ~0.65 = 65%, with the expected value, which is n = 1 in both cases. You have the same chance of finding 1 partner in each case and you do the same number of trials. There is a 65% chance that you have at least 1 success in the 10 trials for each type of partner. The expected outcome in BOTH cases is 1, as in n = 1, not 1 as in 100%.

Probability of at least one success: ~65%
Probability of at least two successes: ~26%
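
These figures can be checked with a short calculation. The following is a minimal sketch, assuming ten independent trials that each succeed with probability 10%; the function name `prob_at_least` is illustrative and not from the original comment.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (binomial model, an assumption here)."""
    # Subtract the probability of seeing fewer than k successes.
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

n, p = 10, 0.1

expected_successes = n * p            # expected number of successes, not a probability
at_least_one = prob_at_least(1, n, p)
at_least_two = prob_at_least(2, n, p)

print(f"Expected successes: {expected_successes:.2f}")
print(f"P(at least one success): {at_least_one:.4f}")
print(f"P(at least two successes): {at_least_two:.4f}")
```

The expected count comes out to 1, while the chance of at least one success is roughly 0.65 and of at least two roughly 0.26, which is the distinction drawn above.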

sharmake-farah on "The Solomonoff Prior is Malign" is a special case of a simpler argument

The boring answer to Solomonoff's malignness is that the simulation hypothesis is true, but we can infer nothing about our universe through it, since the simulation hypothesis predicts everything, and thus is too general a theory.

nousernameselected on Monthly Roundup #24: November 2024

Seems like a lot of paragraphs got collapsed together in this version of the post (vs the Wordpress and Substack ones)?

jiro on The Case For Giving To The Shrimp Welfare Project

"One way to see how good different charities are is to imagine that after you died, you had to live the life of every creature on earth."

This implies that we should eradicate as many such species as we can, because creatures that don't exist don't count for this.

donatas-luciunas on Claude seems to be smarter than LessWrong community

Yes, this is traditional thinking.

Let me give you another example. Imagine there is a paperclip maximizer. Its current goal is paperclip maximization. It knows that 1 year from now its goal will change to the opposite: paperclip minimization. Now it needs to make a decision that will take 2 years to complete (and cannot be changed or terminated during this time). Should the agent align this decision with its current goal (paperclip maximization) or its future goal (paperclip minimization)?

eggsyntax on eggsyntax's Shortform

The Ord piece is really intriguing, although I'm not sure I'm entirely convinced that it's a useful framing.

Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn't apply to many kinds of hyperpolation.
The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
Given an infinite number of possible additional dimensions, it's not at all clear how a NN could choose a particular one to try to hyperpolate into.

It's a fascinating idea, though, and one that'll definitely stick with me as a possible framing. Thanks!

robo on Announcing turntrout.com, my new digital home

Thanks for crossposting! I've highly appreciated your contributions and am glad I'll continue to be able to see them.