LessWrong 2.0 Reader

I Finally Worked Through Bayes' Theorem (Personal Achievement)
keltan · 2024-12-05T02:04:16.547Z · comments (6)

The OODA Loop -- Observe, Orient, Decide, Act
Davis_Kingsley · 2025-01-01T08:00:27.979Z · comments (2)

[link] A toy evaluation of inference code tampering
Fabien Roger (Fabien) · 2024-12-09T17:43:40.910Z · comments (0)

[link] Discursive Warfare and Faction Formation
Benquo · 2025-01-09T16:47:31.824Z · comments (2)

DeepSeek v3: The Six Million Dollar Model
Zvi · 2024-12-31T15:10:06.924Z · comments (6)

A Solution for AGI/ASI Safety
Weibing Wang (weibing-wang) · 2024-12-18T19:44:29.739Z · comments (30)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (2)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

Consent across power differentials
Ramana Kumar (ramana-kumar) · 2024-07-09T11:42:03.177Z · comments (12)

Paper in Science: Managing extreme AI risks amid rapid progress
JanB (JanBrauner) · 2024-05-23T08:40:40.678Z · comments (2)

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 2024-02-17T00:16:15.649Z · comments (37)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

[link] Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian · 2024-05-28T05:29:46.777Z · comments (1)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

Scenario Forecasting Workshop: Materials and Learnings
elifland · 2024-03-08T02:30:46.517Z · comments (3)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (11)

Why you should learn a musical instrument
cata · 2024-05-15T20:36:16.034Z · comments (23)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

[link] Chapter 1 of How to Win Friends and Influence People
gull · 2024-01-28T00:32:52.865Z · comments (5)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

[link] Evaluating Stability of Unreflective Alignment
james.lucassen · 2024-02-01T22:15:40.902Z · comments (11)

Should rationalists be spiritual / Spirituality as overcoming delusion
Kaj_Sotala · 2024-03-25T16:48:08.397Z · comments (57)

Wrong answer bias
lemonhope (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)

The Broken Screwdriver and other parables
bhauth · 2024-03-04T03:34:38.807Z · comments (1)

Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)

AI #58: Stargate AGI
Zvi · 2024-04-04T13:10:06.342Z · comments (9)

The Dunning-Kruger of disproving Dunning-Kruger
kromem · 2024-05-16T10:11:33.108Z · comments (0)

[link] Review: Breaking Free with Dr. Stone
TurnTrout · 2024-12-18T01:26:37.730Z · comments (5)

D&D.Sci Dungeonbuilding: the Dungeon Tournament
aphyer · 2024-12-14T04:30:55.656Z · comments (16)

[link] Careless thinking: A theory of bad thinking
Nathan Young · 2024-12-17T18:23:16.140Z · comments (17)

AI #94: Not Now, Google
Zvi · 2024-12-12T15:40:06.336Z · comments (3)

Why the Best Writers Endure Isolation
Declan Molony (declan-molony) · 2024-07-16T05:58:25.032Z · comments (6)

[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

4. Existing Writing on Corrigibility
Max Harms (max-harms) · 2024-06-10T14:08:35.590Z · comments (15)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)

Recent comments

martinkunev on Testing for Scheming with Model Deletion

I wrote something on a related idea a while back:

Disincentivizing deception in mesa optimizers with Model Tampering

philh on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

not sure if that would be legit for gift aid purposes

Based on https://www.gov.uk/government/publications/charities-detailed-guidance-notes/chapter-3-gift-aid#chapter-344-digital-giving-and-social-giving-accounts, doing this the naive way (one person collects money and gives it to the charity, everyone shares the tax rebate) is explicitly forbidden. Which makes sense, or everyone with a high-income friend would have access to the 40% savings.

Not clear to me whether it would be allowed if the charity is in on it and everyone fills in their own gift aid declaration. But that's extra steps for all parties.

archimedes on Human takeover might be worse than AI takeover

The main difference in my mind is that a human can never be as powerful as a potential ASI and cannot dominate humanity without the support of sufficiently many cooperative humans. For a given power level, I agree that humans are likely scarier than an AI of that power level. The scary part about AI is that its power level isn't bounded by human biological constraints, and the capacity to do harm or good is correlated with power level. Thus AI is more likely than humans to produce extinction-level dangers as a tail risk, even if it's more likely to be aligned on average.

unexpectedvalues on Human takeover might be worse than AI takeover

Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it's kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.

(This actually slightly broadens the question, compared to yours. Because you talk about "a human" taking over the world with AGI, and make guesses about the personality of such a human after conditioning on them deciding to do that. But I'm not even confident that AGI-enabled control of the world by e.g. the US government would be good.)

Concretely, I think that a common perspective people take is: "What would it take for the future to go really really well, by my lights", and the answer to that question probably involves human control of AGI. But that's not really the action-relevant question. The action-relevant question, for deciding whether you want to try to solve alignment, is how the average world with human-controlled AGI compares to the average AGI-controlled world. And... I don't know, in part for the reasons you suggest.

cousin_it on Beliefs and state of mind into 2025

Yeah, on further thought I think you're right. This is pretty pessimistic, then: AI companies will find it easy to align AIs to money interests, and the rest of us will be in a "natives vs the East India Company" situation. More time to spend on alignment matters, then, only if some companies actually try to align AIs to something good instead, and I'm not sure any companies will do that.

sharmake-farah on Symbol/Referent Confusions in Language Model Alignment Experiments

How do you know that humans and LLMs/current RL agents do optimize the reward? Are there any known theorems or papers on this? I ask because this claim is at least a little bit important.

You may answer here:

https://www.lesswrong.com/posts/GDnRrSTvFkcpShm78/when-is-reward-ever-the-optimization-target

sharmake-farah on Beliefs and state of mind into 2025

This is consistent with a model where AI alignment is heavily dependent on the data, and way less dependent on inductive biases/priors, so this is good news for alignment.

nathan-helm-burger on Beliefs and state of mind into 2025

In one specific respect I'd like to challenge your point. I think fine-tuning models currently aligns them 'well-enough' to any target point of view. I think that the ethics shown by current LLMs are due to researchers actively putting them there. I've been doing red teaming exercises on LLMs for over a year now, and I find it quite easy to fine-tune them to be evil and murderous. Human texts help them understand morality, but don't make them care enough about it for it to be sticky in the face of fine-tuning.

gwern on Aristocracy and Hostage Capital

It is an interesting paper (excerpts, related excerpts), but to temper expectations: like Peter Leeson's many papers 'explaining' this or that weird thing in history like trial-by-ordeal, this is more of a just-so story in which Allen comes up with a plausible-sounding model which he post hoc fits to some interesting historical details or anecdotes, and he doesn't present anything we would consider 'hard' evidence.

vladimir_nesov on The Golden Opportunity for American AI

Stargate is not 2025, and going from $50bn to $80bn is in line with building a $25-40bn training system this year (an unusual expense distinct from other projects), a clue separate from SemiAnalysis claims.