LessWrong 2.0 Reader

The Shutdown Problem: Incomplete Preferences as a Solution
EJT (ElliottThornley) · 2024-02-23T16:01:16.378Z · comments (28)

Complexity of value but not disvalue implies more focus on s-risk. Moral uncertainty and preference utilitarianism also do.
Chi Nguyen · 2024-02-23T06:10:05.881Z · comments (18)

Correct my H5N1 research
Elizabeth (pktechgirl) · 2024-12-09T19:07:03.277Z · comments (25)

[link] Discursive Warfare and Faction Formation
Benquo · 2025-01-09T16:47:31.824Z · comments (3)

Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

So You Created a Sociopath - New Book Announcement!
Garrett Baker (D0TheMath) · 2024-04-01T18:02:18.010Z · comments (3)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

Schelling points in the AGI policy space
mesaoptimizer · 2024-06-26T13:19:25.186Z · comments (2)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)

Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (10)

shortest goddamn bayes guide ever
lemonhope (lcmgcd) · 2024-05-10T07:06:23.734Z · comments (8)

[link] Dario Amodei: On DeepSeek and Export Controls
Zach Stein-Perlman · 2025-01-29T17:15:18.986Z · comments (3)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (11)

I Finally Worked Through Bayes' Theorem (Personal Achievement)
keltan · 2024-12-05T02:04:16.547Z · comments (6)

[link] Bed Time Quests & Dinner Games for 3-5 year olds
Gunnar_Zarncke · 2024-06-22T07:53:38.989Z · comments (0)

[link] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Gunnar_Zarncke · 2024-05-16T13:09:39.265Z · comments (20)

[link] Preference Inversion
Benquo · 2025-01-02T18:15:52.938Z · comments (46)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

[link] how birds sense magnetic fields
bhauth · 2024-06-27T18:59:35.075Z · comments (4)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset
aphyer · 2024-06-17T21:29:08.778Z · comments (11)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (2)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

[link] Just one more exposure bro
Chipmonk · 2024-12-12T21:37:07.069Z · comments (6)

Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen (dalasnoin) · 2024-05-11T00:08:08.117Z · comments (14)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

[link] You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com
Arjun Panickssery (arjun-panickssery) · 2025-01-30T12:35:03.564Z · comments (3)

On Lex Fridman’s Second Podcast with Altman
Zvi · 2024-03-25T12:20:08.780Z · comments (10)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (13)

Cooperating with aliens and AGIs: An ECL explainer
Chi Nguyen · 2024-02-24T22:58:47.345Z · comments (8)

DeepSeek Panic at the App Store
Zvi · 2025-01-28T19:30:07.555Z · comments (14)

[link] Epistemic status: poetry (and other poems)
Richard_Ngo (ricraz) · 2024-11-21T18:13:17.194Z · comments (5)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (65)

Paper in Science: Managing extreme AI risks amid rapid progress
JanB (JanBrauner) · 2024-05-23T08:40:40.678Z · comments (2)

[link] A toy evaluation of inference code tampering
Fabien Roger (Fabien) · 2024-12-09T17:43:40.910Z · comments (0)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

A Solution for AGI/ASI Safety
Weibing Wang (weibing-wang) · 2024-12-18T19:44:29.739Z · comments (29)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

Logits, log-odds, and loss for parallel circuits
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-20T09:56:26.031Z · comments (3)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

AI #100: Meet the New Boss
Zvi · 2025-01-23T15:40:07.473Z · comments (4)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (2)

Recent comments

mateusz-baginski on Thread for Sense-Making on Recent Murders and How to Sanely Respond

That's basically the idea behind "TESCREAL" (if we ignore the EA part) that all people who believe that one day we might have intelligent robots and fly to the stars and stuff like that must be a part of some sinister conspiracy.

Are you saying that (most) sci-fi authors who take the futures they write about seriously (i.e. "we totally might/will see that kind of stuff in decades/centuries") are TESCREAL-ists (either in the Torres & Gebru sense or in the popular imagination)?

My impression is that TESCREAL was meant more to point at some kind of ... industrial & philanthropic complex?

cousin_it on Tear Down the Burren

Thanks for writing this; it's a great explanation-by-example of the entire housing crisis. When people protest against six-story buildings in the name of neighborhood character, it makes me wonder how Paris, with its six-story buildings, managed to keep any character at all.

jeremy-gillen on Invulnerable Incomplete Preferences: A Formal Statement

The description of how sequential choice can be defined is helpful; I was previously confused about how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!

We'd incorrectly rule out the possibility that the agent goes for (B+,B).

There are two things we might want from the idea of incomplete preferences:

(1) To predict the actions of agents.
(2) To design better agents with different behaviour, because complete agents sometimes behave dangerously.

I think modelling an agent as having incomplete preferences is great for (1). Very useful. We make better predictions if we don't rule out the possibility that the agent goes for B after choosing B+. I think we agree here.

For (2), the relevant quote is:

As a general point, you can always look at a decision ex post and back out different ways to rationalise it. The nontrivial task is here prediction, using features of the agent.

If we can always rationalise a decision ex post as being generated by a complete agent, then let's just build that complete agent. Incompleteness isn't helping us, because the behaviour could have been generated by complete preferences.

viliam on Thread for Sense-Making on Recent Murders and How to Sanely Respond

Ha, that's a good reminder that other perspectives exist.

Inside the bubble, it feels like a fact that technology advances, LLMs exist, etc. Agreeing on these things doesn't make me feel like part of some group any more than believing that 2+2=4 does.

But the general public seems to be in deep denial. (Except for artists sometimes complaining that the computers are stealing their jobs, and teachers complaining that kids feed all their homework to LLMs.) So from the outside perspective, anyone not in denial seems like a part of a very specific small group.

That's basically the idea behind "TESCREAL" (if we ignore the EA part) that all people who believe that one day we might have intelligent robots and fly to the stars and stuff like that must be a part of some sinister conspiracy. Otherwise, why would they have such suspiciously similar beliefs? While from my perspective, it's like, if you have read sci-fi as a child, none of this sounds surprising. I kinda took it for granted that one day we will have intelligent robots, the only question is the timing, whether it will be 2000 or 2100 or maybe 3000. And the only new thing is that now it seems that 2030 is the answer.

The funny thing is that not long ago, David Gerard was busy deleting from Wikipedia any mention of EA being connected to Less Wrong, and now it is popular to go to the opposite extreme and assume that everything is connected (as long as it uses computers, or decision theory, or some other weird stuff).

p-b-1 on DeepSeek: Don’t Panic

A very detailed and technical analysis of the bear case for Nvidia by Jeffrey Emanuel, that Matt Levine claims may have been responsible for the Nvidia price decline.

I read that last week. It was an interesting case of experiencing Gell-Mann Amnesia several times within the same article.

All the parts where I have some expertise were vague, used terminology incorrectly and were often just wrong. All the rest was very interesting!

If this article crashed the market: EMH RIP.

saidachmiz on Pick two: concise, comprehensive, or clear rules

This is fine for new users; what about for existing users?

I just went to the front page of the site, and it’s not obvious to me where to click to find “The Rules”. The “About” page? Doesn’t seem to be a list of rules. The New User’s Guide? Not really. (There’s a “Rules to be aware of” section at the very, very end of that post, but… surely this isn’t meant to be a list of the rules…? It’s just… three kind of random things.) The LessWrong FAQ? Not really…

If I want to know what rules (or guidelines, or… anything, really…) are supposed to be governing my behavior on LW, I actually don’t have any idea where to look. And I’ve been using Less Wrong for a very long time.

Related point: when the rules change, how do existing users learn about this?

P.S.: What happened to the table of contents on LW post pages? Why can’t I see it anymore?

odd-anon on Can someone, anyone, make superintelligence a more concrete concept?

Strategies:

Analogy by weaker-than-us entities: What does human civilization's unstoppable absolute conquest of Earth look like to a gorilla? What does an adult's manipulation look like to a toddler who fails to understand how the adult keeps knowing things that were secret, and keeps being able to direct the toddler's actions in ways that can only be noticed in retrospect, if at all?
Analogy by stronger-than-us entities: Superintelligence is to Mossad as Mossad is to you, and able to work in parallel and faster. One million super-Mossads, who have also developed the ability to slow down time for themselves, all intent on killing you through online actions alone? That may trigger some emotional response.
Analogy by fictional example: The webcomic "Seed" featured a nascent moderately-superhuman intelligence, which frequently used a lot of low-hanging social engineering techniques, each of which only has its impact shown after the fact. It's, ah, certainly fear-inspiring, though I don't know if it meets the "without pointing towards a massive tome" criterion. (Unfortunately, actually super-smart entities are quite rare in fiction.)

hopenope on Hopenope's Shortform

Some RL-based reasoning models (R1-Zero) actually break through the linguistic barrier, and their CoT is partially gibberish. Maybe if we use more efficient languages (Mandarin, Lean, Haskell) in their CoT, there will be less optimization pressure to break through that barrier.

saidachmiz on Pick two: concise, comprehensive, or clear rules

On the one hand, this is true. On the other hand, it may be useful to have a system where the only real rule is “The moderators shall do whatever they want”, but there are nonetheless a bunch of other rules which (explicitly!) serve to give users some idea of what the moderators in fact want.

After all, if I am the king and I say “the only law is whatever I command”, surely the response of my subjects will be “yes, Your Majesty; and what do you command?”. Given almost any plausible goal I might have, it seems like I will achieve that goal more effectively if I provide a practical answer to the question, rather than “nothing for now, but be ready to carry out all my whims at a moment’s notice”. Yes, the latter is in some sense implied, but it’s not actually very useful by itself. If my whims are inconstant, then my kingdom will just be less effective, at almost anything.

Thus also with moderation.

mateusz-baginski on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

[Epistemic status: my model of the view that Jan/ACS/the GD paper subscribes to.]

I think this comment by Jan from 3 years ago (where he explained some of the difference in generative intuitions between him and Eliezer) may be relevant to the disagreement here. In particular:

Continuity

In my [Jan's] view, your [Eliezer's] ontology of thinking about the problem is fundamentally discrete. For example, you are imagining a sharp boundary between a class of systems "weak, won't kill you, but also won't help you with alignment" and "strong - would help you with alignment, but, unfortunately, will kill you by default". Discontinuities everywhere - “bad systems are just one sign flip away”, sudden jumps in capabilities, etc. Thinking in symbolic terms.

In my inside view, in reality, things are instead mostly continuous. Discontinuities sometimes emerge out of continuity, sure, but this is often noticeable. If you get some interpretability and oversight things right, you can slow down before hitting the abyss. Also the jumps are often not true "jumps" under closer inspection.

My understanding of Jan's position (and probably also the position of the GD paper) is that aligning AI (and other?) systems will be gradual, iterative, and continuous; there's not going to be a point where a system is aligned well enough that we can basically delegate all the work to it and go home. Humans will have to remain in the loop, if not indefinitely, then at least for many decades.

In such a world, it is very plausible that we will get to a point where we've built powerful AIs that are (as far as we can tell) perfectly aligned with human preferences or whatever, but whose misalignment manifests only on longer timescales.

Another domain where this discrete/continuous difference in assumptions manifests itself is the shape of AI capabilities.

One position is:

If we get a single-single-aligned AGI, we will have it solve the GD-style misalignment problems for us. If it can't do that (even in the form of noticing/predicting the problem and saying "guys, stop pushing this further, at least until I/we figure out how to prevent this from happening"), then neither can we (kinda by definition of "AGI"), so thinking about this is probably pointless and we should think about problems that are more tractable.

The other position is:

What people officially aiming to create AGI will create is not necessarily going to be superhuman at all tasks. It's plausible that economic incentives will push towards "capability configurations" that are missing some relevant capabilities, e.g. relevant to researching gnarly problems that are hard to learn from the training data or even through current post-training methods. Understanding and mitigating the kind of risk the GD paper describes can be one such problem. (See also: Cyborg Periods.)

Another reason to expect this is that alignment and capabilities are not quite separate magisteria, and the alignment target can induce gaps in capabilities relative to what one would otherwise expect from the system's power (as measured by, IDK, some equivalent of the g-factor). One example might be Steven's "Law of Conservation of Wisdom".