LessWrong 2.0 Reader

Refine: An Incubator for Conceptual Alignment Research Bets
adamShimi · 2022-04-15T08:57:35.502Z · comments (13)
Interpreting Neural Networks through the Polytope Lens
Sid Black (sid-black) · 2022-09-23T17:58:30.639Z · comments (29)
We're already in AI takeoff
Valentine · 2022-03-08T23:09:06.733Z · comments (119)
[link] Nursing doubts
dynomight · 2024-08-30T02:25:36.826Z · comments (23)
[link] That Alien Message - The Animation
Writer · 2024-09-07T14:53:30.604Z · comments (9)
My side of an argument with Jacob Cannell about chip interconnect losses
Steven Byrnes (steve2152) · 2023-06-21T13:33:49.543Z · comments (11)
Stop posting prompt injections on Twitter and calling it "misalignment"
lc · 2023-02-19T02:21:44.061Z · comments (9)
[link] Fields that I reference when thinking about AI takeover prevention
Buck · 2024-08-13T23:08:54.950Z · comments (16)
[link] Transformer Circuits
evhub · 2021-12-22T21:09:22.676Z · comments (4)
The Bayesian Tyrant
abramdemski · 2020-08-20T00:08:55.738Z · comments (21)
Updating my AI timelines
Matthew Barnett (matthew-barnett) · 2022-12-05T20:46:28.161Z · comments (50)
Momentum of Light in Glass
Ben (ben-lang) · 2024-10-09T20:19:42.088Z · comments (44)
Conversational Cultures: Combat vs Nurture (V2)
Ruby · 2019-12-31T20:23:53.772Z · comments (92)
Takeaways from our robust injury classifier project [Redwood Research]
dmz (DMZ) · 2022-09-17T03:55:25.868Z · comments (12)
[link] We Found An Neuron in GPT-2
Joseph Miller (Josephm) · 2023-02-11T18:27:29.410Z · comments (23)
Sentience matters
So8res · 2023-05-29T21:25:30.638Z · comments (96)
A brief collection of Hinton's recent comments on AGI risk
Kaj_Sotala · 2023-05-04T23:31:06.157Z · comments (9)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal
LawrenceC (LawChan) · 2022-12-19T22:52:20.031Z · comments (30)
Value Claims (In Particular) Are Usually Bullshit
johnswentworth · 2024-05-30T06:26:21.151Z · comments (18)
Irrational Modesty
Tomás B. (Bjartur Tómas) · 2021-06-20T19:38:25.320Z · comments (6)
The Case for Extreme Vaccine Effectiveness
Ruby · 2021-04-13T21:08:39.470Z · comments (37)
Clarifying and predicting AGI
Richard_Ngo (ricraz) · 2023-05-04T15:55:26.283Z · comments (44)
Responses to apparent rationalist confusions about game / decision theory
Anthony DiGiovanni (antimonyanthony) · 2023-08-30T22:02:12.218Z · comments (20)
The Translucent Thoughts Hypotheses and Their Implications
Fabien Roger (Fabien) · 2023-03-09T16:30:02.355Z · comments (7)
[link] The Goddess of Everything Else - The Animation
Writer · 2023-07-13T16:26:25.552Z · comments (4)
Steam
abramdemski · 2022-06-20T17:38:58.548Z · comments (13)
App and book recommendations for people who want to be happier and more productive
KatWoods (ea247) · 2021-11-06T17:40:40.592Z · comments (43)
AI Views Snapshots
Rob Bensinger (RobbBB) · 2023-12-13T00:45:50.016Z · comments (61)
Limits to Legibility
Jan_Kulveit · 2022-06-29T17:42:19.338Z · comments (11)
Twitter thread on postrationalists
Eli Tyre (elityre) · 2022-02-17T09:02:54.806Z · comments (32)
Activation space interpretability may be doomed
bilalchughtai (beelal) · 2025-01-08T12:49:38.421Z · comments (28)
High-stakes alignment via adversarial training [Redwood Research report]
dmz (DMZ) · 2022-05-05T00:59:18.848Z · comments (29)
Consider The Hand Axe
ymeskhout · 2023-04-08T01:31:44.614Z · comments (16)
Why Not Just Outsource Alignment Research To An AI?
johnswentworth · 2023-03-09T21:49:19.774Z · comments (50)
A Semitechnical Introductory Dialogue on Solomonoff Induction
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-03-04T17:27:35.591Z · comments (33)
[link] The Checklist: What Succeeding at AI Safety Will Involve
Sam Bowman (sbowman) · 2024-09-03T18:18:34.230Z · comments (49)
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"
Rob Bensinger (RobbBB) · 2021-03-05T23:43:54.186Z · comments (13)
Hashing out long-standing disagreements seems low-value to me
So8res · 2023-02-16T06:20:00.899Z · comments (34)
Age changes what you care about
Dentin · 2022-10-16T15:36:36.148Z · comments (37)
How long does it take to become Gaussian?
Maxwell Peterson (maxwell-peterson) · 2020-12-08T07:23:41.725Z · comments (40)
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers
[deleted] · 2021-04-09T19:19:42.826Z · comments (17)
When is a mind me?
Rob Bensinger (RobbBB) · 2024-04-17T05:56:38.482Z · comments (126)
Maximizing Communication, not Traffic
jefftk (jkaufman) · 2025-01-05T13:00:02.280Z · comments (7)
“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)
Survey: How Do Elite Chinese Students Feel About the Risks of AI?
Nick Corvino (nick-corvino) · 2024-09-02T18:11:11.867Z · comments (13)
Exercises in Comprehensive Information Gathering
johnswentworth · 2020-02-15T17:27:19.753Z · comments (18)
Request to AGI organizations: Share your views on pausing AI progress
Akash (akash-wasil) · 2023-04-11T17:30:46.707Z · comments (11)
MIRI location optimization (and related topics) discussion
Rob Bensinger (RobbBB) · 2021-05-08T23:12:02.476Z · comments (163)
Understanding Infra-Bayesianism: A Beginner-Friendly Video Series
Jack Parker · 2022-09-22T13:25:04.254Z · comments (6)
The Learning-Theoretic Agenda: Status 2023
Vanessa Kosoy (vanessa-kosoy) · 2023-04-19T05:21:29.177Z · comments (17)