LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

[link] Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Robert_AIZI · 2024-03-05T13:55:33.483Z · comments (24)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (6)

[link] DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · 2024-03-21T03:00:31.599Z · comments (8)

The proper response to mistakes that have harmed others?
Ruby · 2023-12-31T04:06:31.505Z · comments (12)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

[link] on bacteria, on teeth
bhauth · 2024-09-30T15:56:56.830Z · comments (9)

[link] Dario Amodei — Machines of Loving Grace
Matrice Jacobine · 2024-10-11T21:43:31.448Z · comments (26)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

AI Safety Chatbot
markov (markovial) · 2023-12-21T14:06:48.981Z · comments (11)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II
Lester Leong (lester-leong) · 2024-10-14T04:05:05.096Z · comments (9)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

5 Physics Problems
DaemonicSigil · 2024-03-18T08:05:45.971Z · comments (0)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

[link] Linkpost: Memorandum on Advancing the United States’ Leadership in Artificial Intelligence
Nisan · 2024-10-25T04:37:00.828Z · comments (2)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

Originality vs. Correctness
alkjash · 2023-12-06T18:51:49.531Z · comments (17)

Toward Safety Cases For AI Scheming
Mikita Balesni (mykyta-baliesnyi) · 2024-10-31T17:20:06.019Z · comments (1)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

Self-explaining SAE features
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-05T22:20:36.041Z · comments (13)

Raemon's Deliberate (“Purposeful?”) Practice Club
Raemon · 2023-11-14T18:24:19.335Z · comments (11)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

[link] An Opinionated Evals Reading List
Marius Hobbhahn (marius-hobbhahn) · 2024-10-15T14:38:58.778Z · comments (0)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

Measuring Coherence of Policies in Toy Environments
dx26 (dylan-xu) · 2024-03-18T17:59:08.118Z · comments (9)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

[link] Are There Examples of Overhang for Other Technologies?
Jeffrey Heninger (jeffrey-heninger) · 2023-12-13T21:48:08.954Z · comments (50)

Some negative steganography results
Fabien Roger (Fabien) · 2023-12-09T20:22:52.323Z · comments (5)

Does AI risk “other” the AIs?
Joe Carlsmith (joekc) · 2024-01-09T17:51:47.020Z · comments (3)

[link] Towards shutdownable agents via stochastic choice
EJT (ElliottThornley) · 2024-07-08T10:14:24.452Z · comments (7)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)

Why our politicians aren't Median
Yair Halberstadt (yair-halberstadt) · 2024-11-03T14:03:33.779Z · comments (15)

Rationalists are missing a core piece for agent-like structure (energy vs information overload)
tailcalled · 2024-08-17T09:57:19.370Z · comments (9)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

AI #48: Exponentials in Geometry
Zvi · 2024-01-18T14:20:07.869Z · comments (9)

AI #81: Alpha Proteo
Zvi · 2024-09-12T13:00:07.958Z · comments (3)

[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (8)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

sodium on AI Safety Camp 10

Sorry, is there a timezone for when the applications would close by, or is it AoE?

michaelcorey on Graceful Degradation

Thank you, I found it useful for me.

ryan_greenblatt on lukehmiles's Shortform

It might be easier to try to establish some track record by doing a small research project first. I don't know if you have enough runway for this though.

ryan_greenblatt on Using Dangerous AI, But Safely?

To the extent that complexity makes succesfully red-teaming our approaches too difficult, then we could try to mostly depend on simple approaches and only use more complex approaches only if we are confident that at least they won't can't make the situation worse.

My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or a very robust theoretical feedback (e.g. proofs)). In the case of control, so long as we do a good job with red teaming^[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security and I think computer security often involves complex measures^[2].

As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we're worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.

An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we'll understand how the model works which we can test by seeing if we can understand models in general).

which isn't clearly possible given that the relevant AIs could have different capability profiles, ↩︎
I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me. ↩︎

satron on Sabotage Evaluations for Frontier Models

I think that these examples are quite different. People who were voting for a president who ultimately turned out to be corrupt most likely predicted that the most likely scenario won't be totally horrible for them personally. And most of the times they are right, corrupt presidents are bad for countries, but you could live happily under their rule. People would at least expect to retain whatever valuable stuff they already have.

With AI the story is different. If a person thinks that the most likely scenario is the scenario where everyone dies, then they would not only expect to not win anything, but to lose everything they have. And media is actively talking about it, so it's not like they are not at least aware of the possibility. Even if they don't see them as personally responsible for everything, they sure don't want to feel the consequence of everyone dying.

This is not the case of someone doing something that they know is wrong but benefits them. If they truly believe that we are all going to die, then it's doing something they know is wrong and actively hurts them. And allegedly hurts them much more than any scenario that has happened so far (since we are all still alive).

So I believe that most people working on it, actually believe that we are going to be just fine. Whether this belief is warranted is another question.

satron on Sabotage Evaluations for Frontier Models

I think the justification of this articles goes something like this: "here is the vision of future where we successfully align AI. It is utopian enough that it warrants pursuing it". Risks from creating AI just aren't the topic. They deal with them elsewhere. These essays were specifically focused on the "positive vision" part of the alignment. So I think you are critiquing the articles for lacking something they were never intended to have in the first place.

OpenAI seems to have made some basic commitments here. The word commit is mentioned 29 times. Other companies did it as well here, and here. Here Anthropic makes promises for optimistic, intermediate and pessimistic scenarios of AI development.

green_leaf on Quantum Immortality: A Perspective if AI Doomers are Probably Right

I don't know about similarity... but I was just making a point that QI doesn't require it.

cbiddulph on 5 ways to improve CoT faithfulness

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

...

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.

The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.

benito on Sabotage Evaluations for Frontier Models

You made lots of points, so I wrote a comment for each... probably this was too many replies? I didn't know what else to do that didn't feel like avoiding your points. I hereby state that I do not expect of you to respond to all five of my comments!

philip_b on Internal music player: phenomenology of earworms

Why do you hate earworms? To me, they are mildly pleasant. The only moments when I wish I didn’t have an earworm happening at that moment is when I’m trying to remember another tune and the earworm for musicianship purposes and the earworm prevents me from being able to do that.