LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Report on Frontier Model Training
YafahEdelman (yafah-edelman-1) · 2023-08-30T20:02:46.317Z · comments (21)

[link] The smallest possible button (or: moth traps!)
Neil (neil-warren) · 2023-09-02T15:24:20.453Z · comments (18)

Reducing sycophancy and improving honesty via activation steering
Nina Panickssery (NinaR) · 2023-07-28T02:46:23.122Z · comments (18)

Law of No Evidence
Zvi · 2021-12-20T13:50:01.189Z · comments (20)

Soft takeoff can still lead to decisive strategic advantage
Daniel Kokotajlo (daniel-kokotajlo) · 2019-08-23T16:39:31.317Z · comments (47)

Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (46)

[link] Who regulates the regulators? We need to go beyond the review-and-approval paradigm
jasoncrawford · 2023-05-04T22:11:17.465Z · comments (29)

[link] Investigating the Chart of the Century: Why is food so expensive?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-16T13:21:23.596Z · comments (26)

Book review: The Checklist Manifesto
Swimmer963 (Miranda Dixon-Luinenburg) (Swimmer963) · 2021-09-17T23:09:09.590Z · comments (13)

Principles for Alignment/Agency Projects
johnswentworth · 2022-07-07T02:07:36.156Z · comments (20)

How bad a future do ML researchers expect?
KatjaGrace · 2023-03-09T04:50:05.122Z · comments (8)

Book review: "A Thousand Brains" by Jeff Hawkins
Steven Byrnes (steve2152) · 2021-03-04T05:10:44.929Z · comments (18)

[link] Cohabitive Games so Far
mako yass (MakoYass) · 2023-09-28T15:41:27.986Z · comments (141)

[New LW Feature] "Debates"
Ruby · 2023-04-01T07:00:24.466Z · comments (35)

[link] Philosophy of Therapy
DaystarEld · 2020-10-10T20:12:38.204Z · comments (27)

Why I take short timelines seriously
NicholasKees (nick_kees) · 2024-01-28T22:27:21.098Z · comments (29)

Transcript of Sam Altman's interview touching on AI safety
Andy_McKenzie · 2023-01-20T16:14:18.974Z · comments (42)

A proposed method for forecasting transformative AI
Matthew Barnett (matthew-barnett) · 2023-02-10T19:34:01.358Z · comments (21)

What Comes After Epistemic Spot Checks?
Elizabeth (pktechgirl) · 2019-10-22T17:00:00.758Z · comments (9)

LW Petrov Day 2022 (Monday, 9/26)
Ruby · 2022-09-22T02:56:19.738Z · comments (111)

GPT-175bee
Adam Scherlis (adam-scherlis) · 2023-02-08T18:58:01.364Z · comments (14)

Choice Writings of Dominic Cummings
Connor_Flexman · 2021-10-13T02:41:44.291Z · comments (75)

Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res · 2021-11-29T19:26:33.232Z · comments (39)

[question] What will 2040 probably look like assuming no singularity?
Daniel Kokotajlo (daniel-kokotajlo) · 2021-05-16T22:10:38.542Z · answers+comments (86)

Why was the AI Alignment community so unprepared for this moment?
Ras1513 · 2023-07-15T00:26:29.769Z · comments (65)

Ukraine Situation Report 2022/03/01
lsusr · 2022-03-02T05:07:59.763Z · comments (59)

Harms and possibilities of schooling
TsviBT · 2022-02-22T07:48:09.542Z · comments (38)

[link] On hiding the source of knowledge
jessicata (jessica.liu.taylor) · 2020-01-26T02:48:51.310Z · comments (40)

"Zero Sum" is a misnomer.
abramdemski · 2020-09-30T18:25:30.603Z · comments (34)

[link] DontDoxScottAlexander.com - A Petition
Ben Pace (Benito) · 2020-06-25T05:44:50.050Z · comments (32)

[link] Paper: LLMs trained on “A is B” fail to learn “B is A”
[deleted] · 2023-09-23T19:55:53.427Z · comments (74)

[link] The Alignment Problem: Machine Learning and Human Values
Rohin Shah (rohinmshah) · 2020-10-06T17:41:21.138Z · comments (7)

Mnestics
Jarred Filmer (4thWayWastrel) · 2022-10-23T00:30:11.159Z · comments (6)

Quintin's alignment papers roundup - week 1
Quintin Pope (quintin-pope) · 2022-09-10T06:39:01.773Z · comments (6)

Stampy's AI Safety Info soft launch
steven0461 · 2023-10-05T22:13:04.632Z · comments (9)

Geometric Exploration, Arithmetic Exploitation
Scott Garrabrant · 2022-11-24T15:36:30.334Z · comments (4)

Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-10-11T07:16:45.950Z · comments (36)

Propagating Facts into Aesthetics
Raemon · 2019-12-19T04:09:17.816Z · comments (37)

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner (ejenner) · 2024-06-04T15:50:47.475Z · comments (14)

Convincing All Capability Researchers
Logan Riggs (elriggs) · 2022-04-08T17:40:25.488Z · comments (70)

How to Bounded Distrust
Zvi · 2023-01-09T13:10:00.942Z · comments (17)

Utilitarianism Meets Egalitarianism
Scott Garrabrant · 2022-11-21T19:00:12.168Z · comments (16)

Compendium of problems with RLHF
Charbel-Raphaël (charbel-raphael-segerie) · 2023-01-29T11:40:53.147Z · comments (16)

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda
Chi Nguyen · 2020-08-15T20:02:00.205Z · comments (20)

Taking the parameters which seem to matter and rotating them until they don't
Garrett Baker (D0TheMath) · 2022-08-26T18:26:47.667Z · comments (48)

Moloch and the sandpile catastrophe
Eric Raymond (eric-raymond) · 2022-04-02T15:35:12.552Z · comments (25)

Land Ho!
Zvi · 2022-01-20T13:30:01.262Z · comments (4)

Omicron Variant Post #2
Zvi · 2021-11-29T16:30:01.368Z · comments (34)

[link] Matt Levine on "Fraud is no fun without friends."
Raemon · 2021-01-19T18:23:20.614Z · comments (24)

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines
307th · 2023-10-20T16:37:46.541Z · comments (33)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

nathan-helm-burger on Rolling Thresholds for AGI Scaling Regulation

I agree that I have no faith in current governments to implement and enforce policies that are more complex than things on the order of governance compute and chip export controls.

I think the conclusion this points towards is that we need new forms of governance. Not to replace existing governments, but to complement them. Voluntary mutual inspection contracts with privacy-respecting technology using AI inspectors. Something of that sort.

Here's some recent evidence of compute thresholds not being reliable: https://novasky-ai.github.io/posts/sky-t1/

Here's another self-link to some of my thoughts on this: https://www.lesswrong.com/posts/tdrK7r4QA3ifbt2Ty/is-ai-alignment-enough?commentId=An6L68WETg3zCQrHT [LW(p) · GW(p)]

christiankl on Jimrandomh's Shortform

It's not just a question of whether people agree but whether they actually comply with it. People agree to all sorts of things but then do something else.

aram-panasenco on AGI deployment as an act of aggression

Thanks so much for writing this! You expressed the things I was concerned about with much more eloquence than I was able to even think about.

I've come to believe that the only 'correct' way of using AGI is to turn it into a sort of "anti-AGI AGI" that prevents any other AGI from being turned on and otherwise doesn't interact with humanity in any way: https://www.lesswrong.com/posts/eqSHtF3eHLBbZa3fR/cast-it-into-the-fire-destroy-it [LW · GW]

james-chua on Inference-Time-Compute: More Faithful? A Research Note

We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!

Ideas I want to explore:

New reasoning models may be released (e.g. deepseek-r1 API, some other open source ones). Can we reproduce results?
Do these ITC models articulate reasoning behind e.g social biases / medical advice?
Try to plant backdoor. Do these models articulate the backdoor?

james-chua on Inference-Time-Compute: More Faithful? A Research Note

thanks! heres my initial thought about introspection and how to improve on the setup there:

in my introspection paper we train models to predict their behavior in a single forward pass without CoT.

maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well.

still, im not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves?

my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what effects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.

daniel-tan on Inference-Time-Compute: More Faithful? A Research Note

Observation: Detecting unfaithful CoT here seems to require generating many hypotheses ("cues") about what underlying factors might influence the model's reasoning.

Is there a less-supervised way of doing this? Some ideas:

Can we "just ask" models whether they have unfaithful CoT? This seems to be implied by introspection and related phenomena
Can we use black-box lie detection to detect unfaithful CoT?

sune on LLMs for language learning

I use ChatGPT and Claude to try to learn Macedonian, because there is only very little learning material available for that language. For example, they can (with a few errors sometimes) explain grammatical concepts or give me sentences to translate. I have not found a good way of storing a description of my abilities and weaknesses across conversations, but within a conversation they are good at adapting the difficulty of the questions to the quality of my answers.
Unfortunately I’m not aware of any tools that can pronounce or transcribe Macedonian.

Edit: I still do most of my language learning using Anki, Google translate and a book for learning Macedonian. Probably because using LLMs is not sufficiently gamified and because of the small inconveniences having to ask for questions instead of simply doing one exercise after another.

steve2152 on Applying traditional economic thinking to AGI: a trilemma

Thanks! I basically agree.

I think that, if we assume that there’s a world in which (1) at least some humans own some capital in the post-AGI economy (hence rapidly exponentially growing wealth), (2) nobody is worried about expropriation or violence, and (3) humans have the knowledge and power to pass and enforce effective laws holding up human interests in regards to externalities (e.g. AGIs creating new exotic forms of lethal pollution while following the letter of the existing law, or building a Dyson swarm that blocks out the sun)…

…then that’s already pretty great! That would be far better than my baseline expectation.

I think that, if we assume (1-3), then the non-capital-owning humans have a great chance of doing OK too, via (A) charity from the fabulously-wealthy capital-owning humans, or through (B) political imposition of UBI (assuming democracy), or, like you said, (C) getting employed by the fabulously-wealthy capital-owning humans who specifically want to employ other humans (or selling ownerships rights to their posthumous skulls, ofc :) ).

bokov-1 on My latest attempt to understand decision theory: I asked ChatGPT to debate me.

The closest I can come to examples might be ones where the two-box outcome is so much worse then the one-box outcome that I have nothing to lose by choosing the path of hope.

E.g. picking one box even though I and everybody else knows I'm a two-boxer if I believe that in this case two-boxing will kill me

Or, cooperating when both unilateral defection, unilateral cooperation, and mutual defection have results vastly worse than mutual cooperation.

Are these on the right track?

daniel-tan on Daniel Tan's Shortform

Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here?

Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some time, etc).

It seems like this setup could be simplified:

The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder's context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming)
The RL setup could be simplified to a single reward formulation, where the reward is 0 if the model is detected to be engaging in reasoning, or if the model predicts wrongly, and 1 otherwise. Some reward shaping may be beneficial.

The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc)