LessWrong 2.0 Reader

My Dating Heuristic
Declan Molony (declan-molony) · 2024-05-21T05:28:40.197Z · comments (4)
[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)
Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)
Wild Animal Suffering Is The Worst Thing In The World
omnizoid · 2025-02-06T16:15:34.572Z · comments (18)
[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)
[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)
[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)
Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)
Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)
How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots
Owain_Evans · 2024-03-28T02:34:21.799Z · comments (0)
$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)
Evidential Correlations are Subjective, and it might be a problem
Martín Soto (martinsq) · 2024-03-07T18:37:54.105Z · comments (6)
Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)
[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)
Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)
What is the best argument that LLMs are shoggoths?
JoshuaFox · 2024-03-17T11:36:23.636Z · comments (22)
Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)
Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (8)
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
Logan Riggs (elriggs) · 2024-03-15T16:30:00.744Z · comments (5)
Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)
[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)
Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)
AI #57: All the AI News That’s Fit to Print
Zvi · 2024-03-28T11:40:05.435Z · comments (14)
How likely is brain preservation to work?
Andy_McKenzie · 2024-11-18T16:58:54.632Z · comments (3)
Is theory good or bad for AI safety?
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-19T10:32:08.772Z · comments (1)
Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
claudia.biancotti · 2024-11-18T09:38:35.723Z · comments (4)
[link] Reinforcement Learning by AI Punishment
Abhishaike Mahajan (abhishaike-mahajan) · 2025-01-28T00:57:51.715Z · comments (0)
AI #102: Made in America
Zvi · 2025-02-06T14:20:06.733Z · comments (17)
Information Versus Action
Screwtape · 2025-02-04T05:13:55.192Z · comments (0)
The generalization phase diagram
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-26T20:30:15.212Z · comments (2)
World Citizen Assembly about AI - Announcement
Camille Berger (Camille Berger) · 2025-02-11T10:51:56.948Z · comments (1)
[link] Creating Interpretable Latent Spaces with Gradient Routing
Jacob G-W (g-w1) · 2024-12-14T04:00:17.249Z · comments (6)
Trying Bluesky
jefftk (jkaufman) · 2024-11-17T02:50:04.093Z · comments (16)
Preface
Allison Duettmann (allison-duettmann) · 2025-01-02T18:59:46.290Z · comments (2)
[link] Introducing the Anthropic Fellows Program
Miranda Zhang (miranda-zhang) · 2024-11-30T23:47:29.259Z · comments (0)
[question] Is the output of the softmax in a single transformer attention head usually winner-takes-all?
Linda Linsefors · 2025-01-27T15:33:28.992Z · answers+comments (1)
[link] Effective Networking as Sending Hard to Fake Signals
vaishnav92 · 2024-12-12T20:32:24.113Z · comments (2)
Visual demonstration of Optimizer's curse
Roman Malov · 2024-11-30T19:34:07.700Z · comments (3)
[link] When does capability elicitation bound risk?
joshc (joshua-clymer) · 2025-01-22T03:42:36.289Z · comments (0)
On The Rationalist Megameetup
Screwtape · 2024-11-23T09:08:26.897Z · comments (3)
No Electricity in Manchuria
winstonBosan · 2024-11-19T01:11:58.661Z · comments (0)
[link] Social events with plausible deniability
Chipmonk · 2024-11-18T18:25:17.339Z · comments (24)
[question] Should Open Philanthropy Make an Offer to Buy OpenAI?
mrtreasure · 2025-02-14T23:18:01.929Z · answers+comments (1)
[question] Take over my project: do computable agents plan against the universal distribution pessimistically?
Cole Wyeth (Amyr) · 2025-02-19T20:17:04.813Z · answers+comments (3)
Thoughts after the Wolfram and Yudkowsky discussion
Tahp · 2024-11-14T01:43:12.920Z · comments (13)
Export Surplusses
lsusr · 2025-02-24T05:53:23.422Z · comments (20)
[link] Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.
Chris_Leong · 2024-11-11T16:13:26.504Z · comments (6)
The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)
Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (9)
[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)
← previous page (newer posts) · next page (older posts) →