LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

Goal-Completeness is like Turing-Completeness for AGI
Liron · 2023-12-19T18:12:29.947Z · comments (26)

n of m ring signatures
DanielFilan · 2023-12-04T20:00:06.580Z · comments (7)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Apply to the Conceptual Boundaries Workshop for AI Safety
Chipmonk · 2023-11-27T21:04:59.037Z · comments (0)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

[link] Gwern Branwen interview on Dwarkesh Patel’s podcast: “How an Anonymous Researcher Predicted AI's Trajectory”
Said Achmiz (SaidAchmiz) · 2024-11-14T23:53:34.922Z · comments (0)

GPT-2030 and Catastrophic Drives: Four Vignettes
jsteinhardt · 2023-11-10T07:30:06.480Z · comments (5)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

[link] How to Eradicate Global Extreme Poverty [RA video with fundraiser!]
aggliu · 2023-10-18T15:51:22.073Z · comments (5)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (11)

They are made of repeating patterns
quetzal_rainbow · 2023-11-13T18:17:43.189Z · comments (4)

Scenario Forecasting Workshop: Materials and Learnings
elifland · 2024-03-08T02:30:46.517Z · comments (3)

On Overhangs and Technological Change
Roko · 2023-11-05T22:58:51.306Z · comments (19)

The Shortest Path Between Scylla and Charybdis
Thane Ruthenis · 2023-12-18T20:08:34.995Z · comments (8)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

[link] Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian · 2024-05-28T05:29:46.777Z · comments (1)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

Altman firing retaliation incoming?
trevor (TrevorWiesinger) · 2023-11-19T00:10:15.645Z · comments (23)

[link] A starter guide for evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-08T18:24:23.913Z · comments (2)

When to Get the Booster?
jefftk (jkaufman) · 2023-10-03T21:00:12.813Z · comments (15)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)

[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)

Childhood Roundup #3
Zvi · 2023-10-10T14:30:04.287Z · comments (3)

The Broken Screwdriver and other parables
bhauth · 2024-03-04T03:34:38.807Z · comments (1)

Should rationalists be spiritual / Spirituality as overcoming delusion
Kaj_Sotala · 2024-03-25T16:48:08.397Z · comments (57)

Basic Mathematics of Predictive Coding
Adam Shai (adam-shai) · 2023-09-29T14:38:28.517Z · comments (6)

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Felix Hofstätter · 2023-11-08T11:37:43.997Z · comments (0)

[question] why did OpenAI employees sign
bhauth · 2023-11-27T05:21:28.612Z · answers+comments (23)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

[link] Chapter 1 of How to Win Friends and Influence People
gull · 2024-01-28T00:32:52.865Z · comments (5)

Job listing: Communications Generalist / Project Manager
Gretta Duleba (gretta-duleba) · 2023-11-06T20:21:03.721Z · comments (7)

Public Weights?
jefftk (jkaufman) · 2023-11-02T02:50:18.095Z · comments (19)

[link] The point of a game is not to win, and you shouldn't even pretend that it is
mako yass (MakoYass) · 2023-09-28T15:54:27.990Z · comments (27)

The Dunning-Kruger of disproving Dunning-Kruger
kromem · 2024-05-16T10:11:33.108Z · comments (0)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

Bounty: Diverse hard tasks for LLM agents
Beth Barnes (beth-barnes) · 2023-12-17T01:04:05.460Z · comments (31)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

AI #58: Stargate AGI
Zvi · 2024-04-04T13:10:06.342Z · comments (9)

Wrong answer bias
lukehmiles (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

buck on Fields that I reference when thinking about AI takeover prevention

FWIW, the there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.

ahmedneedsatherapist on AhmedNeedsATherapist's Shortform

(discussed on the LessWrong discord server)

There seems to be an implicit fundamental difference in many people's minds between an algorithm running a set of heuristics to maximize utility (a heuristic system?) and a particular decision theory (e.g. FDT). I think the better way to think about it is that decision theories categorize heuristic systems, usually classifying them by how they handle edge cases.
Let's suppose we have a non-embedded agent A in a computable environment, something like a very sophisticated video game, and A has to continually choose between a bunch of inputs. A is capable of very powerful thought: it can do hypercomputation, RNG if needed, think as long as it needs between its choices, etc. In particular, A is able to do Solomonoff Induction. Let's also assume A is maximizing a utility function U, which is a computable function of the environment.

What happens if A find itself making a Newcomblike decision? Perhaps there is another agent in this environment that has a very good track record of predicting whether other agents in the environment will one-box or two-box, and A finds itself in the usual Newcomb scenario (million utility or a million+thousand utility or no utility) with their decision predicted by this agent. A can one-box by choosing one input and two-box by choosing another input. Should A one-box?
No. The agent in the environment would be unable to simulate A's decision, and moreover, A's decision is completely and utterly irrelevant to what's inside the boxes. If A randomly goes off-track and flips its decision at this point, nothing happens. Nothing could have happened, this other agent has no way to know or use this fact. Instead, A sums over P(x|input)U(x) for all states x of the computable environment, and chooses whichever input yields the maximum sum, which is probably two-boxing. If A one-boxes, it is due to not having enough information about the setup to determine that two-boxing is better.

You cannot use this logic when playing against Omega or a skilled psychologist. In these cases, your computation is actually accessible by the other agent, so you can get higher utility by one-boxing. Your decision theory is important because your thinking is not as powerful as A's! All of this points to looking at decision theories as classifying different heuristic systems.

I think this is post-worthy, but I want to (a) verify that my logic is correct (b) improve my wording (I am unsure if I am using a lot of terminology correctly here, but I am fairly confident that my idea can be understood.)

cole-wyeth on Heresies in the Shadow of the Sequences

My impression is that e.g. the Catholic church has a pretty deeply thought out moral philosophy that has persisted across generations. That doesn't mean that every individual Catholic understands and executes it properly.

linda-linsefors on Seven lessons I didn't learn from election day

If people are ashamed to vote for Trump, why would they let their neighbours know?

buck on Untrusted smart models and trusted dumb models

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that they could have been unaligned, but they aren't actually unaligned). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

buck on Untrusted smart models and trusted dumb models

I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".

tahp on Thoughts after the Wolfram and Yudkowsky discussion

I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that. I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understanding the second thing, but I don't have any gears-level understanding for why it's hard to encode human desires beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what we want in unexpected situations," because that is my obvious conclusion from looking at the software landscape.

metawrong on Monthly Roundup #23: October 2024

> The news is good, and there are now seven shows in my tier 1

@Zvi [LW · GW] Which are the other shows in your tier 1?

olli-jaerviniemi on Untrusted smart models and trusted dumb models

You might be interested in this post [LW · GW] of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are dumb enough to realize they should (try to) sabotage. This does not include models that are unaligned, trying to sabotage and which we are able to stop from causing bad outcomes (but we might still have use-cases for such models).

cubefox on Seven lessons I didn't learn from election day

Cities are very heavily Democratic, while rural areas are only moderately Republican.

I think this isn't compatible with both getting about equally many votes. Because much more US Americans live in cities than in rural areas:

In 2020, about 82.66 percent of the total population in the United States lived in cities and urban areas.

https://www.statista.com/statistics/269967/urbanization-in-the-united-states/