LessWrong 2.0 Reader

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (25)

A case for donating to AI risk reduction (including if you work in AI)
tlevin (trevor) · 2024-12-02T19:05:06.658Z · comments (2)

Cognitive Work and AI Safety: A Thermodynamic Perspective
Daniel Murfet (dmurfet) · 2024-12-08T21:42:17.023Z · comments (9)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

Measuring whether AIs can statelessly strategize to subvert security measures
Alex Mallen (alex-mallen) · 2024-12-19T21:25:28.555Z · comments (0)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

o1 Turns Pro
Zvi · 2024-12-10T17:00:08.036Z · comments (3)

[link] How much I'm paying for AI productivity software (and the future of AI use)
jacquesthibs (jacques-thibodeau) · 2024-10-11T17:11:27.025Z · comments (18)

[link] Testing for Scheming with Model Deletion
Guive (GAA) · 2025-01-07T01:54:13.550Z · comments (18)

AI #96: o3 But Not Yet For Thee
Zvi · 2024-12-26T20:30:06.722Z · comments (8)

AI #95: o1 Joins the API
Zvi · 2024-12-19T15:10:05.196Z · comments (1)

[question] Could orcas be (trained to be) smarter than humans? 
Towards_Keeperhood (Simon Skade) · 2024-11-04T23:29:26.677Z · answers+comments (22)

[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder
Steven Byrnes (steve2152) · 2024-10-15T13:31:46.157Z · comments (7)

Read The Sequences As If They Were Written Today
Peter Berggren (peter-berggren) · 2025-01-02T02:51:36.537Z · comments (3)

o3, Oh My
Zvi · 2024-12-30T14:10:05.144Z · comments (17)

AI #86: Just Think of the Potential
Zvi · 2024-10-17T15:10:06.552Z · comments (8)

[link] new chinese stealth aircraft
bhauth · 2025-01-01T00:19:10.644Z · comments (3)

Seeking Collaborators
abramdemski · 2024-11-01T17:13:36.162Z · comments (15)

AI #87: Staying in Character
Zvi · 2024-10-29T07:10:08.212Z · comments (3)

An Illustrated Summary of "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-14T15:02:44.828Z · comments (2)

[link] The Alignment Trap: AI Safety as Path to Power
crispweed · 2024-10-29T15:21:26.545Z · comments (17)

U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative
Phib · 2024-11-19T18:42:43.296Z · comments (7)

Reading RFK Jr so that you don’t have to
braces · 2024-11-22T00:59:19.583Z · comments (1)

Win/continue/lose scenarios and execute/replace/audit protocols
Buck · 2024-11-15T15:47:24.868Z · comments (2)

[link] Ideas for benchmarking LLM creativity
gwern · 2024-12-16T05:18:55.631Z · comments (11)

ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Toward Safety Case Inspired Basic Research
Lucas Teixeira · 2024-10-31T23:06:32.854Z · comments (3)

[link] a space habitat design
bhauth · 2024-11-25T17:28:48.481Z · comments (13)

AI Assistants Should Have a Direct Line to Their Developers
Jan_Kulveit · 2024-12-28T17:01:58.643Z · comments (6)

[link] How Likely Are Various Precursors of Existential Risk?
NunoSempere (Radamantis) · 2024-10-28T13:27:31.620Z · comments (4)

Safe Predictive Agents with Joint Scoring Rules
Rubi J. Hudson (Rubi) · 2024-10-09T16:38:16.535Z · comments (10)

Vegans need to eat just enough Meat - emperically evaluate the minimum ammount of meat that maximizes utility
Johannes C. Mayer (johannes-c-mayer) · 2024-12-22T22:08:31.971Z · comments (34)

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
Joe Carlsmith (joekc) · 2024-10-28T21:57:12.063Z · comments (5)

[question] What Have Been Your Most Valuable Casual Conversations At Conferences?
johnswentworth · 2024-12-25T05:49:36.711Z · answers+comments (20)

Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism
Elizabeth (pktechgirl) · 2024-12-08T20:10:02.651Z · comments (3)

[link] Funding Case: AI Safety Camp 11
Remmelt (remmelt-ellen) · 2024-12-23T08:51:55.255Z · comments (4)

Stream Entry
lsusr · 2025-01-07T23:56:13.530Z · comments (2)

Estimates of GPU or equivalent resources of large AI players for 2024/5
CharlesD · 2024-11-28T23:01:58.522Z · comments (7)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (10)

A Conflicted Linkspost
Screwtape · 2024-11-21T00:37:54.035Z · comments (0)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (2)

Claude Sonnet 3.5.1 and Haiku 3.5
Zvi · 2024-10-24T14:50:06.286Z · comments (9)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (4)


Recent comments

bryce-robertson on Bryce Robertson's Shortform

I've put it on our list of possible future pages, and added some of the things from that doc to our Funders page. Thanks Chris!

nathan-helm-burger on 2. Skim the Manual: Intelligent Voluntary Cooperation

I've read and appreciate the entire series, but this was the chapter that really stood out to me. I think that improved systems of voluntary contracts with decentralized enforcement might be our best hope out of the multiple Pareto-suboptimal races-to-doom that humanity seems stuck in.

I think this is potentially a place where we could use our biggest risk factor (AI) to save us before it dooms us. Could we manage to use AI to speed-run the invention and deployment of such subsidiarity governance systems? I think the biggest challenge to this is how fast it would need to move in order to take effect in time. For a system that needs extremely broad buy-in from a large number of heterogenous actors, speed of implementation and adoption is a key weak point.

Imagine, though, that a really good system was designed which you felt confident a supermajority of humanity would sign onto if they had it personally explained to them (along with convincing explanations of the counterfactuals). How might we get this personalized explanation accomplished at scale? Well, LLMs are still bad at certain things, but giving personalized interactive explanations of complex legal docs seems well within their near-term capabilities. It would still be a huge challenge to actually present nearly everyone on Earth with the opportunity to have this interaction, and all within a short deadline... But not beyond belief.

This is the first non-destructive non-coercive pivotal act that I've considered plausibly sufficient to save us from the current crisis. I've thought up and heard of a lot of different bad plans over the past ten years, so it's really nice to finally find a good one!

Existential Hope indeed!

alex_altair on Shallow review of technical AI safety, 2024

Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.

Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake

I wouldn't put number 9. It's not intended to "solve" most of these problems, but is intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agendas.

Target case: worst-case

definitely not worst-case, more like pessimistic-case

Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K

Add "José Pedro Faustino"

Estimated # FTEs: 1-10

I'd call it 2, averaged throughout 2024.

Some outputs in 2024: mostly [LW · GW] exposition [LW · GW] but it’s early days

Basically right; I'd add this post [LW · GW] and this post [LW · GW].

raemon on On Eating the Sun

Richard said "I don't think my priors on that are very different from yours but the thing that would have made this post valuable for me is some object-level reason to upgrade my confidence in that." He didn't say it'd be a long-term project; I think he just meant he didn't change his beliefs about it due to this post.

sweenesm on How can humanity survive a multipolar AGI scenario?

Here's what Claude 3.5 Sonnet said in response to this prompt:

Me:

What are the best plans for preventing catastrophes such as everyone dying or a totalitarian government taking over in multipolar AGI scenarios, i.e., ones in which AGIs are aligned to what their human users want to do, and at least one human user orders them to do bad things?

Claude:

Let me share some thoughts on AGI governance and safety in multipolar scenarios, while noting this is a complex topic that reasonable people disagree on:

A key challenge in multipolar scenarios is that even if individual AI systems are aligned with their users' values and intentions, some users may have harmful goals. Some potential approaches to address this include:

International coordination and governance frameworks that place limits on allowable AI system capabilities and uses, similar to arms control treaties. However, enforcement would be challenging.
Technical measures like capability limitations built into AI systems' training that prevent them from being used for certain catastrophic actions, regardless of user intent. The challenge is defining and implementing such limitations without preventing beneficial uses.
Focused work on improving human values and cooperation through better institutions, education, and mechanisms for resolving conflicts peacefully. This could reduce the likelihood of actors wanting to cause catastrophic harm.
Developing AI systems with some form of common good or humanitarian principles that apply regardless of user instructions, though this involves difficult philosophical and technical questions about how to define and implement such principles.

What are your thoughts on which approaches seem most promising or practical? I'm particularly curious about your perspective on the balance between technical and social/institutional solutions.

___________________________________

I would add to this defensive acceleration (d/acc), as described by Vitalik Buterin (https://vitalik.eth.limo/general/2025/01/05/dacc2.html and https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html).

jfw01 on Where I agree and disagree with Eliezer

implement a relatively limited policy

I read this as Libertarian; the hope that there could be a very stiff, strong government that was also small, and did only a subset of the things in the short-term interest of its supporters.

raemon on On Eating the Sun

So, I'm with you on "hey guys, uh, this is pretty horrifying, right? Uh, what's with the missing mood about that."

The thing is that not doing it is also horrifying, i.e. see also All Possible Views About Humanity's Future Are Wild [LW · GW]. To not eat the sun is to throw away orders of magnitude more resources than anyone has ever thrown away before. Is it percentage-wise "a small fraction of the cosmos"? Maybe. But (quickly checks Claude, which wrote up a Fermi code snippet before answering; I can share the work if you want to double-check yourself), a two-year delay would be... 200 galaxies lost, long-term.

When the alternative is "the Amish get a Sun Replica that doesn't change their experience", the question "Is it worth throwing away 80 trillion stars to have the real thing?" is, like, not a trivial question.
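As a rough consistency check on the two figures quoted above (200 galaxies and 80 trillion stars), here is a minimal sketch of what they imply. This is not the original Fermi snippet, which isn't reproduced here; the variable and function names are hypothetical, and the per-year figure assumes the loss scales linearly with the delay, which the comment does not state.

```python
# Figures quoted in the comment above; nothing here is independently derived.
GALAXIES_LOST = 200           # galaxies lost from the delay (quoted)
STARS_LOST = 80 * 10**12      # "80 trillion stars" (quoted)
DELAY_YEARS = 2               # "a two year delay" (quoted)


def implied_stars_per_galaxy(stars_lost, galaxies_lost):
    """Average stars per galaxy that makes the two quoted totals agree."""
    return stars_lost / galaxies_lost


def implied_galaxies_per_year(galaxies_lost, delay_years):
    """Galaxies lost per year of delay, ASSUMING linear scaling (not in the source)."""
    return galaxies_lost / delay_years


stars_per_galaxy = implied_stars_per_galaxy(STARS_LOST, GALAXIES_LOST)
galaxies_per_year = implied_galaxies_per_year(GALAXIES_LOST, DELAY_YEARS)

print(f"Implied stars per galaxy: {stars_per_galaxy:.1e}")
print(f"Implied galaxies lost per year of delay: {galaxies_per_year:.0f}")
```

So the two numbers line up if the estimate used an average of roughly 400 billion stars per galaxy.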

IMO there isn't an option that isn't at least a bit horrifying in some sense that one could have a missing mood about. And while I still feel unsettled about it, I think if I have to grieve something, makes more sense to grieve in the direction of "don't throw away 80 trillion stars worth of resources."

I think you're also maybe just not appreciating how much would change in 10,000 years? Like, there is no single culture that has survived 10,000 years. (Maybe one of those small tribes in the Amazon? I'd still bet on there having been a lot of cultural drift there, but not confidently.) The Amish are only a few hundred years old. I can imagine doing a lot of moral reflection and coming to the conclusion that the sun shouldn't be eaten until all human cultures have decided it's the right thing to do, but I do really doubt that process takes 10,000 years.

fread2281 on Is there software to practice reading expressions?

I just made an Anki deck for this, using GPT-4o mini to label images of faces. I’ll try to report back in a month or two on how well it works.

nathan-helm-burger on How can humanity survive a multipolar AGI scenario?

My current best guess: Subsidiarity. I've been thinking along these lines for the past few years, but I feel like my thinking was clarified and boosted by Allison's recent series: Gaming the Future [LW · GW]

The gist of the idea is to create clever systems of decentralized control and voluntary interaction which can still manage to coordinate on difficult risky tasks (such as enforcing defensive laws against weapons of mass destruction). Such systems could shift humanity out of the Pareto suboptimal lose-lose traps and races we are stuck in. Win-win solutions to our biggest current problems seem possible, and coordination seems like the biggest blocker.

I am hopeful that one of the things we can do with just-before-the-brink AI will be to accelerate the design and deployment of such voluntary coordination contracts.

jfw01 on Where I agree and disagree with Eliezer

Alignment isn’t like that; it was chosen to be an important problem

Like medicine.

This was specifically commented on in a book whose preface I read as a child. It was called something like "Medicine: from science to magic", and I have not found a clear link back to it.