LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

What is the next level of rationality?
lsusr · 2023-12-12T08:14:14.846Z · comments (24)

On ‘Responsible Scaling Policies’ (RSPs)
Zvi · 2023-12-05T16:10:06.310Z · comments (3)

Toward Safety Case Inspired Basic Research
Lucas Teixeira · 2024-10-31T23:06:32.854Z · comments (2)

shortest goddamn bayes guide ever
lukehmiles (lcmgcd) · 2024-05-10T07:06:23.734Z · comments (8)

How to safely use an optimizer
Simon Fischer (SimonF) · 2024-03-28T16:11:01.277Z · comments (21)

Sora What
Zvi · 2024-02-22T18:10:05.397Z · comments (3)

4. Existing Writing on Corrigibility
Max Harms (max-harms) · 2024-06-10T14:08:35.590Z · comments (13)

Critiques of the AI control agenda
Jozdien · 2024-02-14T19:25:04.105Z · comments (14)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (59)

[link] "If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"
plex (ete) · 2024-05-18T14:09:53.014Z · comments (23)

[link] Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
porby · 2024-02-02T05:49:11.189Z · comments (1)

Run evals on base models too!
orthonormal · 2024-04-04T18:43:25.468Z · comments (6)

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

What distinguishes "early", "mid" and "end" games?
Raemon · 2024-06-21T17:41:30.816Z · comments (22)

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

Value learning in the absence of ground truth
Joel_Saarinen (joel_saarinen) · 2024-02-05T18:56:02.260Z · comments (8)

Some Experiments I'd Like Someone To Try With An Amnestic
johnswentworth · 2024-05-04T22:04:19.692Z · comments (33)

Environmental allergies are curable? (Sublingual immunotherapy)
Chipmonk · 2023-12-26T19:05:08.880Z · comments (10)

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley (roger-d-1) · 2024-01-09T20:42:28.349Z · comments (8)

[link] Five projects from AI Safety Hub Labs 2023
charlie_griffin (cjgriffin) · 2023-11-08T19:19:37.759Z · comments (1)

[link] on neodymium magnets
bhauth · 2024-01-30T15:58:24.088Z · comments (6)

2023 Prediction Evaluations
Zvi · 2024-01-08T14:40:07.377Z · comments (0)

Three ways interpretability could be impactful
Arthur Conmy (arthur-conmy) · 2023-09-18T01:02:30.529Z · comments (8)

Mission Impossible: Dead Reckoning Part 1 AI Takeaways
Zvi · 2023-11-01T12:52:29.341Z · comments (13)

[link] Constructive Cauchy sequences vs. Dedekind cuts
jessicata (jessica.liu.taylor) · 2024-03-14T23:04:07.300Z · comments (23)

AI Pause Will Likely Backfire (Guest Post)
jsteinhardt · 2023-10-24T04:30:02.113Z · comments (6)

Fund Transit With Development
jefftk (jkaufman) · 2023-09-22T11:10:05.645Z · comments (22)

[link] Immortality or death by AGI
ImmortalityOrDeathByAGI · 2023-09-21T23:59:59.545Z · comments (30)

[link] A Good Explanation of Differential Gears
Johannes C. Mayer (johannes-c-mayer) · 2023-10-19T23:07:46.354Z · comments (4)

Arguments for moral indefinability
Richard_Ngo (ricraz) · 2023-09-30T22:40:04.325Z · comments (16)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset
aphyer · 2024-10-29T01:21:03.075Z · comments (12)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (13)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

AI as a powerful meme, via CGP Grey
TheManxLoiner · 2024-10-30T18:31:58.544Z · comments (6)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

Trustworthy and untrustworthy models
Olli Järviniemi (jarviniemi) · 2024-08-19T16:27:11.088Z · comments (3)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

The Shallow Bench
Karl Faulks (karl-faulks) · 2024-11-05T05:07:27.357Z · comments (5)

Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (18)

[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)

I'm open for projects (sort of)
cousin_it · 2024-04-18T18:05:01.395Z · comments (13)

[link] If Clarity Seems Like Death to Them
Zack_M_Davis · 2023-12-30T17:40:42.622Z · comments (191)

[link] Bayesians Commit the Gambler's Fallacy
Kevin Dorst · 2024-01-07T12:54:59.939Z · comments (28)

The predictive power of dissipative adaptation
dr_s · 2023-12-17T14:01:31.568Z · comments (14)

AI #33: Cool New Interpretability Paper
Zvi · 2023-10-12T16:20:01.481Z · comments (18)

Some costs of superposition
Linda Linsefors · 2024-03-03T16:08:20.674Z · comments (11)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

jsd on Win/continue/lose scenarios and execute/replace/audit protocols

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black box adversarial examples setting.

buck on Fields that I reference when thinking about AI takeover prevention

FWIW, the there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.

ahmedneedsatherapist on AhmedNeedsATherapist's Shortform

(discussed on the LessWrong discord server)

There seems to be an implicit fundamental difference in many people's minds between an algorithm running a set of heuristics to maximize utility (a heuristic system?) and a particular decision theory (e.g. FDT). I think the better way to think about it is that decision theories categorize heuristic systems, usually classifying them by how they handle edge cases.
Let's suppose we have a non-embedded agent A in a computable environment, something like a very sophisticated video game, and A has to continually choose between a bunch of inputs. A is capable of very powerful thought: it can do hypercomputation, RNG if needed, think as long as it needs between its choices, etc. In particular, A is able to do Solomonoff Induction. Let's also assume A is maximizing a utility function U, which is a computable function of the environment.

What happens if A find itself making a Newcomblike decision? Perhaps there is another agent in this environment that has a very good track record of predicting whether other agents in the environment will one-box or two-box, and A finds itself in the usual Newcomb scenario (million utility or a million+thousand utility or no utility) with their decision predicted by this agent. A can one-box by choosing one input and two-box by choosing another input. Should A one-box?
No. The agent in the environment would be unable to simulate A's decision, and moreover, A's decision is completely and utterly irrelevant to what's inside the boxes. If A randomly goes off-track and flips its decision at this point, nothing happens. Nothing could have happened, this other agent has no way to know or use this fact. Instead, A sums over P(x|input)U(x) for all states x of the computable environment, and chooses whichever input yields the maximum sum, which is probably two-boxing. If A one-boxes, it is due to not having enough information about the setup to determine that two-boxing is better.

You cannot use this logic when playing against Omega or a skilled psychologist. In these cases, your computation is actually accessible by the other agent, so you can get higher utility by one-boxing. Your decision theory is important because your thinking is not as powerful as A's! All of this points to looking at decision theories as classifying different heuristic systems.

I think this is post-worthy, but I want to (a) verify that my logic is correct (b) improve my wording (I am unsure if I am using a lot of terminology correctly here, but I am fairly confident that my idea can be understood.)

cole-wyeth on Heresies in the Shadow of the Sequences

My impression is that e.g. the Catholic church has a pretty deeply thought out moral philosophy that has persisted across generations. That doesn't mean that every individual Catholic understands and executes it properly.

linda-linsefors on Seven lessons I didn't learn from election day

If people are ashamed to vote for Trump, why would they let their neighbours know?

buck on Untrusted smart models and trusted dumb models

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

buck on Untrusted smart models and trusted dumb models

I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".

tahp on Thoughts after the Wolfram and Yudkowsky discussion

I think we're both saying the same thing here, except that the thing I'm saying implies that I would bet for Eliezer being pessimistic about this. My point was that I have a lot of pessimism that people would code something wrong even if we knew what we were trying to code, and this is where a lot of my doom comes from. Beyond that, I think we don't know what it is we're trying to code up, and you give some evidence for that. I'm not saying that if we knew how to make good AI, it would still fail if we coded it perfectly. I'm saying we don't know how to make good AI (even though we could in principle figure it out), and also current industry standards for coding things would not get it right the first time even if we knew what we were trying to build. I feel like I basically understanding the second thing, but I don't have any gears-level understanding for why it's hard to encode human desires beyond a bunch of intuitions from monkey's-paw things that go wrong if you try to come up with creative disastrous ways to accomplish what seem like laudable goals.

I don't think Eliezer is a DOOM rock, although I think a DOOM rock would be about as useful as Eliezer in practice right now because everyone making capability progress has doomed alignment strategies. My model of Eliezer's doom argument for the current timeline is approximately "programming smart stuff that does anything useful is dangerous, we don't know how to specify smart stuff that avoids that danger, and even if we did we seem to be content to train black-box algorithms until they look smarter without checking what they do before we run them." I don't understand one of the steps in that funnel of doom as well as I would like. I think that in a world where people weren't doing the obvious doomed thing of making black-box algorithms which are smart, he would instead have a last step in the funnel of "even if we knew what we need a safe algorithm to do we don't know how to write programs that do exactly what we want in unexpected situations," because that is my obvious conclusion from looking at the software landscape.

metawrong on Monthly Roundup #23: October 2024

> The news is good, and there are now seven shows in my tier 1

@Zvi [LW · GW] Which are the other shows in your tier 1?

olli-jaerviniemi on Untrusted smart models and trusted dumb models

You might be interested in this post [LW · GW] of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are dumb enough to realize they should (try to) sabotage. This does not include models that are unaligned, trying to sabotage and which we are able to stop from causing bad outcomes (but we might still have use-cases for such models).