LessWrong 2.0 Reader




Definition of alignment science I like
quetzal_rainbow · 2025-01-06T20:40:38.187Z · comments (0)
Really radical empathy
MichaelStJules · 2025-01-06T17:46:31.269Z · comments (0)
subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)
An exhaustive list of cosmic threats
Jordan Stone (jordan-stone) · 2025-01-09T19:59:08.368Z · comments (2)
[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)
Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)
Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross (jason-gross) · 2025-01-06T04:22:12.633Z · comments (0)
AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)
[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)
Balsa Research 2024 Update
Zvi · 2024-12-03T12:30:06.829Z · comments (0)
[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)
Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)
Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (6)
[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)
[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)
[link] AI safety content you could create
Adam Jones (domdomegg) · 2025-01-06T15:35:56.167Z · comments (0)
Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)
Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)
In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)
[link] AI & wisdom 3: AI effects on amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:08:56.604Z · comments (0)
[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)
[link] AI & wisdom 2: growth and amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:07:39.449Z · comments (0)
2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:46:18.674Z · comments (0)
Reality is Fractal-Shaped
silentbob · 2024-12-17T13:52:16.946Z · comments (1)
[link] From the Archives: a story
Richard_Ngo (ricraz) · 2024-12-27T16:36:50.735Z · comments (1)
Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)
Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (0)
[link] AI & Liability Ideathon
Kabir Kumar (kabir-kumar) · 2024-11-26T13:54:01.820Z · comments (2)
[link] Genesis
PeterMcCluskey · 2024-12-31T22:01:17.277Z · comments (0)
Economic Post-ASI Transition
[deleted] · 2025-01-01T22:37:31.722Z · comments (11)
[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)
OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (16)
[question] What is the most impressive game LLMs can play well?
Cole Wyeth (Amyr) · 2025-01-08T19:38:18.530Z · answers+comments (3)
[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)
Proposal to increase fertility: University parent clubs
Fluffnutt (Pear) · 2024-11-18T04:21:26.346Z · comments (3)
Incredibow
jefftk (jkaufman) · 2025-01-07T03:30:02.197Z · comments (3)
Most Minds are Irrational
Davidmanheim · 2024-12-10T09:36:33.144Z · comments (4)
Rebuttals for ~all criticisms of AIXI
Cole Wyeth (Amyr) · 2025-01-07T17:41:10.557Z · comments (11)
The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective
Alvin Ånestrand (alvin-anestrand) · 2025-01-10T16:22:16.905Z · comments (0)
Should you have children? All LessWrong posts about the topic
Sherrinford · 2024-11-26T23:52:44.113Z · comments (0)
Everything you care about is in the map
Tahp · 2024-12-17T14:05:36.824Z · comments (27)
A Collection of Empirical Frames about Language Models
Daniel Tan (dtch1997) · 2025-01-02T02:49:05.965Z · comments (0)
Computational functionalism probably can't explain phenomenal consciousness
EuanMcLean (euanmclean) · 2024-12-10T17:11:28.044Z · comments (34)
Heresies in the Shadow of the Sequences
Cole Wyeth (Amyr) · 2024-11-14T05:01:11.889Z · comments (12)
[link] We are in a New Paradigm of AI Progress - OpenAI's o3 model makes huge gains on the toughest AI benchmarks in the world
garrison · 2024-12-22T21:45:52.026Z · comments (3)
[link] Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas (roland-pihlakas) · 2025-01-03T04:24:36.186Z · comments (3)
Using Dangerous AI, But Safely?
habryka (habryka4) · 2024-11-16T04:29:20.914Z · comments (2)
Coin Flip
XelaP (scroogemcduck1) · 2024-12-27T11:53:01.781Z · comments (0)
EC2 Scripts
jefftk (jkaufman) · 2024-12-10T03:00:01.906Z · comments (1)
[link] A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-11-20T11:48:14.170Z · comments (0)