LessWrong 2.0 Reader

Sustainability of Digital Life Form Societies
Hiroshi Yamakawa (hiroshi-yamakawa) · 2024-07-19T13:59:13.973Z · comments (1)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

Tokenized SAEs: Infusing per-token biases.
tdooms · 2024-08-04T09:17:46.755Z · comments (20)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

[link] Social events with plausible deniability
Chipmonk · 2024-11-18T18:25:17.339Z · comments (17)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

Looking for Goal Representations in an RL Agent - Update Post
CatGoddess · 2024-08-28T16:42:19.367Z · comments (0)

Lab governance reading list
Zach Stein-Perlman · 2024-10-25T18:00:28.346Z · comments (3)

[question] What are the best resources for building gears-level models of how governments actually work?
adamShimi · 2024-08-19T14:05:02.590Z · answers+comments (6)

Scaling Laws and Likely Limits to AI
Davidmanheim · 2024-08-18T17:19:46.597Z · comments (0)

[link] [Linkpost] A Case for AI Consciousness
cdkg · 2024-07-06T14:52:21.704Z · comments (2)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

Rabin's Paradox
Charlie Steiner · 2024-08-14T05:40:25.572Z · comments (40)

[link] The Dumbification of our smart screens
Itay Dreyfus (itay-dreyfus) · 2024-07-04T06:32:36.672Z · comments (0)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Can Large Language Models effectively identify cybersecurity risks?
emile delcourt (emile-delcourt) · 2024-08-30T20:20:21.345Z · comments (0)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Games of My Childhood: The Troops
Kaj_Sotala · 2024-07-08T11:20:03.033Z · comments (0)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (12)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (20)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

How likely is brain preservation to work?
Andy_McKenzie · 2024-11-18T16:58:54.632Z · comments (3)

[link] AI existential risk probabilities are too unreliable to inform policy
Oleg Trott (oleg-trott) · 2024-07-28T00:59:59.497Z · comments (5)

Bryan Johnson and a search for healthy longevity
NancyLebovitz · 2024-07-27T15:28:13.117Z · comments (17)

[question] How great is the utility of "saving" endangered languages?
SpectrumDT · 2024-08-20T13:14:32.895Z · answers+comments (29)

Finding Deception in Language Models
Esben Kran (esben-kran) · 2024-08-20T09:42:13.060Z · comments (4)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk
Lucie Philippon (lucie-philippon) · 2024-07-25T01:12:20.518Z · comments (7)

[link] Chess As The Model Game
criticalpoints · 2024-11-17T19:45:26.499Z · comments (0)

Heresies in the Shadow of the Sequences
Cole Wyeth (Amyr) · 2024-11-14T05:01:11.889Z · comments (12)

Initial Experiments Using SAEs to Help Detect AI Generated Text
Aaron_Scher · 2024-07-22T05:16:20.516Z · comments (0)

OpenAI Boycott Revisit
Jake Dennie · 2024-07-22T01:44:55.094Z · comments (2)

"Which Future Mind is Me?" Is a Question of Values
dadadarren · 2024-08-09T18:17:09.884Z · comments (12)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (15)

Travel Buffer
jefftk (jkaufman) · 2024-07-06T02:20:02.723Z · comments (3)

Is Text Watermarking a lost cause?
egor.timatkov · 2024-10-01T16:20:51.113Z · comments (13)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]
abstractapplic · 2024-07-17T22:34:25.111Z · comments (5)

Using Dangerous AI, But Safely?
habryka (habryka4) · 2024-11-16T04:29:20.914Z · comments (2)

[link] Four Levels of Voting Methods
hive · 2024-09-26T18:15:00.565Z · comments (3)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

[link] Jonothan Gorard: The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

Recent comments

capybasilisk on Towards shutdownable agents via stochastic choice

Is an AI aligned if it lets you shut it off despite the fact it can foresee extremely negative outcomes for its human handlers if it suddenly ceases running?

I don't think it is.

So funnily enough, every agent that lets you do this is misaligned by default.

bilalchughtai on StefanHex's Shortform

Agreed. A related thought is that we might only need to be able to interpret a single model at a particular capability level to unlock the safety benefits, as long as we can make a sufficient case that we should use that model. We don't care inherently about interpreting GPT-4, we care about there existing a GPT-4 level model that we can interpret.

leon-lang on U.S.-China Economic and Security Review Commission pushes Manhattan Project-style AI initiative

How likely are such recommendations usually to be implemented? Are there already Manifold markets on questions related to the recommendation?

dmitry-vaintrob on Evolution's selection target depends on your weighting

Insofar as you're thinking of evolution as analogous to gradient flow, it only makes sense if it's local and individual-level I think -- it is a category error to say that a species that has more members is a winner. The first shark that started eating its siblings in utero improved its genetic fitness (defined as the expected number of offspring in the specific environment it existed in) but might have harmed the survivability of the species as a whole.

akash-wasil on Bogdan Ionut Cirstea's Shortform

"We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies."

I think we're seeing more interest in AI, but I think interest in "AI in general" and "AI through the lens of great power competition with China" has vastly outpaced interest in "AI safety". (Especially if we're using a narrow definition of AI safety; note that people in DC often use the term "AI safety" to refer to a much broader set of concerns than AGI safety/misalignment concerns.)

I do think there's some truth to the quote (we are seeing more interest in AI and some safety topics), but I think there's still a lot to do to increase the salience of AI safety (and in particular AGI alignment) concerns.

dakara on 5 ways to improve CoT faithfulness

"Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction."

I am really interested in hearing CBiddulph's thoughts on this. Do you agree with Abram?

seth-herd on If we solve alignment, do we die anyway?

We die (don't fuck this step up!:)
1. Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
We die (don't let your AGI fuck this step up!:)
1. 22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
the endgame is to use Intent alignment as a stepping-stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.

cole-wyeth on Social events with plausible deniability

“Cancel culture is good actually” needs to go in the hat ;)

sharmake-farah on If we solve alignment, do we die anyway?

The first concern is absolutely critical. One way to break the circularity issue is to rely on AI control; another is to set up incentives that make alignment the equilibrium and make dishonesty/misalignment unfavorable, in the sense that there is no continuously rewarding path to misalignment.

The second issue is less critical, assuming that AGI #21 hasn't itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.

If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.

In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn't fatal, so we can work around it.

The third issue is that if we have achieved aligned ASI, then we have at that point achieved our goal; once humans are obsolete in making alignment advances, that's when we can say the end goal has been achieved.

seth-herd on Training AI agents to solve hard problems could lead to Scheming

I agree with all of those points locally.

To the extent people are worried about LLM scaleups taking over, I don't think they should be.

We will get nice instruction-following tool AIs.

But the first thing we'll do with those tool AIs is turn them into agentic AGIs. To accomplish any medium-horizon goals, let alone the long-horizon ones we really want help with, they'll need to do some sort of continuous learning, make plans (including subgoals), and reason in novel sub-domains.

None of those things are particularly hard to add. So we'll add them. (Work is underway on all of those capacities in different LLM agent projects.)

Then we have the risks of aligning real AGI.

That's why this post was valuable. It goes into detail on why and how we'll add the capacities that will make LLM agents much more useful but also add the ability and instrumental motivation to do real scheming.

I wrote a similar post to the one you mention, Cruxes of disagreement on alignment difficulty. I think understanding the wildly different positions on AGI x-risk among different experts is critical; we clearly don't have a firm grasp on the issue, and we need it ASAP. The above is my read on why TurnTrout, Pope and co are so optimistic - they're addressing powerful tool AI, and not the question of whether we develop real AGI or how easy that will be to align.

FWIW I do think that can be accomplished (as sketched out in posts linked from my user profile summary), but it's nothing like easy or default alignment, as current systems and their scaleups are.

I'll read and comment on your take on the issue.