LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Facebook is Paying Me to Post
jefftk (jkaufman) · 2023-11-14T19:10:07.303Z · comments (5)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (4)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (10)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

Changing Contra Dialects
jefftk (jkaufman) · 2023-10-26T17:30:10.387Z · comments (2)

[question] Impressions from base-GPT-4?
mishka · 2023-11-08T05:43:23.001Z · answers+comments (25)

[link] OpenAI Superalignment: Weak-to-strong generalization
Dalmert · 2023-12-14T19:47:24.347Z · comments (3)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

[link] Was a Subway in New York City Inevitable?
Jeffrey Heninger (jeffrey-heninger) · 2024-03-30T00:53:21.314Z · comments (4)

[link] Sticker Shortcut Fallacy — The Real Worst Argument in the World
ymeskhout · 2024-06-12T14:52:41.988Z · comments (15)

[link] **In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley**
mrtreasure · 2023-12-06T02:02:32.004Z · comments (3)

Clipboard Filtering
jefftk (jkaufman) · 2024-04-14T20:50:02.256Z · comments (1)

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
alamerton · 2024-04-18T18:29:33.892Z · comments (4)

[question] How to Model the Future of Open-Source LLMs?
Joel Burget (joel-burget) · 2024-04-19T14:28:00.175Z · answers+comments (9)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

Twin Peaks: under the air
KatjaGrace · 2024-05-31T01:20:04.624Z · comments (2)

If a little is good, is more better?
DanielFilan · 2023-11-04T07:10:05.943Z · comments (16)

[link] The Best Essay (Paul Graham)
Chris_Leong · 2024-03-11T19:25:42.176Z · comments (2)

[link] Transformer Debugger
Henk Tillman (henk-tillman) · 2024-03-12T19:08:56.280Z · comments (0)

Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)

[link] Executive Dysfunction 101
DaystarEld · 2024-05-23T12:43:13.785Z · comments (1)

[question] What ML gears do you like?
Ulisse Mini (ulisse-mini) · 2023-11-11T19:10:11.964Z · answers+comments (4)

Control Symmetry: why we might want to start investigating asymmetric alignment interventions
domenicrosati · 2023-11-11T17:27:10.636Z · comments (1)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)

AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)

Decent plan prize winner & highlights
lukehmiles (lcmgcd) · 2024-01-19T23:30:34.242Z · comments (2)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

Economics Roundup #1
Zvi · 2024-03-26T14:00:06.332Z · comments (4)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

lsusr on The Summoned Heroine's Prediction Markets Keep Providing Financial Services To The Demon King!

I liked the ending of this story.

kave on The hostile telepaths problem

From the related book Elephant in the Brain:

Here is the thesis we’ll be exploring in this book: We, human beings, are a species that’s not only capable of acting on hidden motives—we’re designed to do it. Our brains are built to act in our self-interest while at the same time trying hard not to appear selfish in front of other people. And in order to throw them off the trail, our brains often keep “us,” our conscious minds, in the dark. The less we know of our own ugly motives, the easier it is to hide them from others.

lc on Shortform

In the same way that Chinese people forget how to write characters by hand, I think most programmers will forget how to write code without LLM editors or plugins pretty soon.

joao-ribeiro-medeiros on The hostile telepaths problem

Very powerful reasoning. I would add that a relevant form of self-deception that should be investigated in this framework is religious faith, given its place as as foundational to societies worldwide.

Religious faith seems like an optimal form of solution to hostile telepaths problem, in certain contexts it seems like a mixture of the three solutions you outlined. (Newcomblike self-deception, Having power and Occlumency)

Religious faith seems to provide psychological power through feelings of absolute certainty and over-confidence that religious people experience. At the same time, the conversion to religions is correlated with overcoming PTSD and addiction (step 2 of the 12 steps program: "Came to believe that a Power greater than ourselves could restore us to sanity.")

I think there is an underlying problem of concept hierarchy which may precede self deception. Maybe we are able to hide concepts and thoughts while they occupy a peripheral part of the mind, this could be also linked to a continuous formulation of the newcomb-like problem in decision theory. I am not sure how this unfolds, will be trying to explore that in the weeks to come.

Thank you for sharing!

kqr on Arithmetic Models: Better Than You Think

Oh, these are good objections. Thanks!

I'm inclined to 180 on the original statements there and instead argue that predictive modelling works because, as Pearl says, "no correlation without causation". Then an important step when basing decisions on predictive modelling is verifying that the intervention has not cut off the causal path we depended on for decision-making.

Do you think that would be closer to the truth?

benito on johnswentworth's Shortform

Talking to people is often useful for goals like "making friends" and "sharing new information you've learned" and "solving problems" and so on. If what conversation means (in most contexts and for most people) is 'signaling that you repeatedly have interesting things to say', it's required to learn to do that in order to achieve your other goals.

Most games aren't that intrinsically interesting, including most social games. But you gotta git gud anyway because they're useful to be able to play well.

tsvibt on johnswentworth's Shortform

Hm. This rings true... but also I think that selecting [vibes, in this sense] for attention also selects against [things that the other person is really committed to]. So in practice you're just giving up on finding shared commitments. I've been updating that stuff other than shared commitments is less good (healthy, useful, promising, etc.) than it seems.

super-agi on Dario Amodei's "Machines of Loving Grace" sound incredibly dangerous, for Humans

See also: https://www.lesswrong.com/posts/zSNLvRBhyphwuYdeC/ai-86-just-think-of-the-potential [LW · GW] -- @Zvi [LW · GW]

"The result is a mostly good essay called Machines of Loving Grace, outlining what can be done with ‘powerful AI’ if we had years of what was otherwise relative normality to exploit it in several key domains, and we avoided negative outcomes and solved the control and alignment problems..."

"This essay wants to assume the AIs are aligned to us and we remain in control without explaining why and how that occured, and then fight over whether the result is democratic or authoritarian."

"Thus the whole discussion here feels bizarre, something between burying the lede and a category error."

"...the more concrete Dario’s discussions become, the more this seems to be a ‘AI as mere tool’ world, despite that AI being ‘powerful.’ Which I note because it is, at minimum, one hell of an assumption to have in place ‘because of reasons.’"

"Assuming you do survive powerful AI, you will survive because of one of three things.

You and your allies have and maintain control over resources.
You sell valuable services that people want humans to uniquely provide.
Collectively we give you an alternative path to acquire the necessary resources.

That’s it."

"The right answer is that intuitions, especially those that say or come from ‘the future will be like the past’ are not to be trusted here."

tsvibt on johnswentworth's Shortform

Ok but how do you deal with the tragedy of the high dimensionality of context-space? People worth thinking with have wildly divergent goals--and even if you share goals, you won't share background information.

lc on johnswentworth's Shortform

...How is that definition different than a realtime version of what you do when participating in this forum?