LessWrong 2.0 Reader

[link] My techno-optimism [By Vitalik Buterin]
habryka (habryka4) · 2023-11-27T23:53:35.859Z · comments (17)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

In favour of exploring nagging doubts about x-risk
owencb · 2024-06-25T23:52:01.322Z · comments (2)

[link] A Chess-GPT Linear Emergent World Representation
Adam Karvonen (karvonenadam) · 2024-02-08T04:25:15.222Z · comments (14)

On the future of language models
owencb · 2023-12-20T16:58:28.433Z · comments (17)

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)
kave · 2024-02-28T02:42:05.801Z · comments (64)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)

Nonlinear’s Evidence: Debunking False and Misleading Claims
KatWoods (ea247) · 2023-12-12T13:16:12.008Z · comments (171)

[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (5)

Dreams of AI alignment: The danger of suggestive names
TurnTrout · 2024-02-10T01:22:51.715Z · comments (59)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (8)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

Key takeaways from our EA and alignment research surveys
Cameron Berg (cameron-berg) · 2024-05-03T18:10:41.416Z · comments (10)

I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (20)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

Lsusr's Rationality Dojo
lsusr · 2024-02-13T05:52:03.757Z · comments (17)

Response to nostalgebraist: proudly waving my moral-antirealist battle flag
Steven Byrnes (steve2152) · 2024-05-29T16:48:29.408Z · comments (29)

On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi · 2024-06-10T12:40:03.348Z · comments (7)

[link] Notes from a Prompt Factory
Richard_Ngo (ricraz) · 2024-03-10T05:13:39.384Z · comments (19)

Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (31)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

[link] LessOnline (May 31—June 2, Berkeley, CA)
Ben Pace (Benito) · 2024-03-26T02:34:00.000Z · comments (24)

[link] Advice for Activists from the History of Environmentalism
Jeffrey Heninger (jeffrey-heninger) · 2024-05-16T18:40:02.064Z · comments (8)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)

[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)

General Thoughts on Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:43.940Z · comments (60)

[link] The Minority Coalition
Richard_Ngo (ricraz) · 2024-06-24T20:01:27.436Z · comments (7)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (10)

[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)

[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)

[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)

Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)

[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)

Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)

Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (21)

OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (14)

OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)

Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)

I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)

[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)

Recent comments

buck on Fields that I reference when thinking about AI takeover prevention

FWIW, there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.

ahmedneedsatherapist on AhmedNeedsATherapist's Shortform

(discussed on the LessWrong discord server)

There seems to be an implicit fundamental difference in many people's minds between an algorithm running a set of heuristics to maximize utility (a heuristic system?) and a particular decision theory (e.g. FDT). I think the better way to think about it is that decision theories categorize heuristic systems, usually classifying them by how they handle edge cases.
Let's suppose we have a non-embedded agent A in a computable environment, something like a very sophisticated video game, and A has to continually choose between a bunch of inputs. A is capable of very powerful thought: it can do hypercomputation, RNG if needed, think as long as it needs between its choices, etc. In particular, A is able to do Solomonoff Induction. Let's also assume A is maximizing a utility function U, which is a computable function of the environment.

What happens if A finds itself making a Newcomblike decision? Perhaps there is another agent in this environment that has a very good track record of predicting whether other agents in the environment will one-box or two-box, and A finds itself in the usual Newcomb scenario (million utility or a million+thousand utility or no utility) with its decision predicted by this agent. A can one-box by choosing one input and two-box by choosing another input. Should A one-box?
No. The agent in the environment would be unable to simulate A's decision, and moreover, A's decision is completely and utterly irrelevant to what's inside the boxes. If A randomly goes off-track and flips its decision at this point, nothing happens. Nothing could have happened: this other agent has no way to know or use this fact. Instead, A sums over P(x|input)U(x) for all states x of the computable environment, and chooses whichever input yields the maximum sum, which is probably two-boxing. If A one-boxes, it is due to not having enough information about the setup to determine that two-boxing is better.

You cannot use this logic when playing against Omega or a skilled psychologist. In these cases, your computation is actually accessible by the other agent, so you can get higher utility by one-boxing. Your decision theory is important because your thinking is not as powerful as A's! All of this points to looking at decision theories as classifying different heuristic systems.
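To make the contrast concrete, here is a minimal sketch of the "sum over P(x|input)U(x)" rule applied to both cases. The probabilities are illustrative assumptions rather than anything stated above: 0.5 for a predictor whose guess is independent of A's choice, and 99% accuracy for a predictor that can read your computation. The function names are hypothetical.

```python
# Payoffs from the usual Newcomb setup: the opaque box holds 1,000,000 or
# nothing, and the transparent box always holds 1,000.
BIG = 1_000_000
SMALL = 1_000


def expected_utility(action, p_full_given_action):
    """Sum of P(x | action) * U(x) over the two states x of the opaque box."""
    p_full = p_full_given_action[action]
    bonus = SMALL if action == "two-box" else 0
    return p_full * (BIG + bonus) + (1 - p_full) * bonus


def best_action(p_full_given_action):
    """Pick whichever input yields the larger expected utility."""
    return max(
        ("one-box", "two-box"),
        key=lambda a: expected_utility(a, p_full_given_action),
    )


# Case 1 (agent A): the predictor cannot simulate A, so the box contents do
# not depend on A's choice. The value 0.5 is an assumption; any value that is
# the same for both actions gives the same answer.
independent = {"one-box": 0.5, "two-box": 0.5}

# Case 2 (Omega or a skilled psychologist): the prediction tracks the choice.
# The 99% accuracy figure is an assumption.
correlated = {"one-box": 0.99, "two-box": 0.01}

for name, model in (("independent", independent), ("correlated", correlated)):
    print(name, best_action(model))
```

When the probability of a full box is the same for both actions, two-boxing always comes out ahead by exactly the 1,000 bonus. The same rule recommends one-boxing once the probabilities depend strongly enough on the action.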

I think this is post-worthy, but I want to (a) verify that my logic is correct and (b) improve my wording. (I am unsure whether I am using a lot of the terminology correctly here, but I am fairly confident that my idea can be understood.)

cole-wyeth on Heresies in the Shadow of the Sequences

My impression is that e.g. the Catholic church has a pretty deeply thought out moral philosophy that has persisted across generations. That doesn't mean that every individual Catholic understands and executes it properly.

linda-linsefors on Seven lessons I didn't learn from election day

If people are ashamed to vote for Trump, why would they let their neighbours know?

buck on Untrusted smart models and trusted dumb models

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

buck on Untrusted smart models and trusted dumb models

I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".

tahp on Thoughts after the Wolfram and Yudkowsky discussion

I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that. I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understand the second thing, but I don't have any gears-level understanding for why it's hard to encode human desires beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what we want in unexpected situations," because that is my obvious conclusion from looking at the software landscape.

metawrong on Monthly Roundup #23: October 2024

> The news is good, and there are now seven shows in my tier 1

@Zvi Which are the other shows in your tier 1?

olli-jaerviniemi on Untrusted smart models and trusted dumb models

You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. This does not include models that are unaligned and trying to sabotage but which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).

cubefox on Seven lessons I didn't learn from election day

> Cities are very heavily Democratic, while rural areas are only moderately Republican.

I think this isn't compatible with both getting about equally many votes, because many more US Americans live in cities than in rural areas:

> In 2020, about 82.66 percent of the total population in the United States lived in cities and urban areas.

https://www.statista.com/statistics/269967/urbanization-in-the-united-states/
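A rough sketch of the arithmetic behind this, under loudly stated assumptions: votes are proportional to population (equal turnout everywhere), everyone votes for one of two parties, and the vote shares plugged in below (70% Democratic in urban areas, 60% Republican in rural areas) are hypothetical stand-ins for "very heavily" and "moderately". Only the 82.66% urban share comes from the quoted statistic.

```python
URBAN_SHARE = 0.8266  # urban population share, from the Statista figure quoted above


def overall_democratic_share(urban_dem, rural_dem, urban_share=URBAN_SHARE):
    """Population-weighted Democratic vote share, assuming equal turnout."""
    return urban_share * urban_dem + (1 - urban_share) * rural_dem


def rural_dem_share_needed_for_tie(urban_dem, urban_share=URBAN_SHARE):
    """Rural Democratic share that would make the overall result 50/50.

    A result outside [0, 1] means no rural split can produce a tie.
    """
    return (0.5 - urban_share * urban_dem) / (1 - urban_share)


# Hypothetical shares: urban areas 70% Democratic, rural areas 60% Republican.
print(f"{overall_democratic_share(0.70, 0.40):.3f}")

# With a 70% Democratic urban vote, the required rural share is negative,
# so an even split is impossible under these assumptions.
print(f"{rural_dem_share_needed_for_tie(0.70):.3f}")
```

Under these assumptions, a roughly even national split would need the urban vote to be close to 50/50 or the rural vote to be far more lopsided than "moderately Republican".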