LessWrong 2.0 Reader


New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)
Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)
[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)
I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (23)
[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)
Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)
Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)
Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)
[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)
Human takeover might be worse than AI takeover
Tom Davidson (tom-davidson-1) · 2025-01-10T16:53:27.043Z · comments (51)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)
Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)
LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)
LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)
General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)
On Dwarksh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)
Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)
[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)
2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (48)
Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)
A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)
[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)
My AGI safety research—2024 review, ’25 plans
Steven Byrnes (steve2152) · 2024-12-31T21:05:19.037Z · comments (4)
[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)
Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (15)
[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)
[link] The Minority Coalition
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (7)
You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (171)
[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)
The nihilism of NeurIPS
charlieoneill (kingchucky211) · 2024-12-20T23:58:11.858Z · comments (7)
Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (13)
[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)
MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)
On attunement
Joe Carlsmith (joekc) · 2024-03-25T12:47:34.856Z · comments (8)
Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)
[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)
OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)
[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)
Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)
[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)
Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)
The "Think It Faster" Exercise
Raemon · 2024-12-11T19:14:10.427Z · comments (13)
Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)
OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)