LessWrong 2.0 Reader

Truthseeking when your disagreements lie in moral philosophy
Elizabeth (pktechgirl) · 2023-10-10T00:00:04.130Z · comments (4)

Sam Altman's sister, Annie Altman, claims Sam has severely abused her
prometheus5015 (pl5015) · 2023-10-07T21:06:49.396Z · comments (107)

[link] "Deep Learning" Is Function Approximation
Zack_M_Davis · 2024-03-21T17:50:36.254Z · comments (28)

[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)

[link] My cover story in Jacobin on AI capitalism and the x-risk debates
garrison · 2024-02-12T23:34:16.526Z · comments (5)

Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)

[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)

Learning-theoretic agenda reading list
Vanessa Kosoy (vanessa-kosoy) · 2023-11-09T17:25:35.046Z · comments (0)

What's Hard About The Shutdown Problem
johnswentworth · 2023-10-20T21:13:27.624Z · comments (33)

I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)

Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (14)

Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)

OpenAI #8: The Right to Warn
Zvi · 2024-06-17T12:00:02.639Z · comments (8)

Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (21)

OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)

[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)

Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)

Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (0)

On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)

[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)

Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)

[link] Ideological Bayesians
Kevin Dorst · 2024-02-25T14:17:25.070Z · comments (4)

[link] Explaining Impact Markets
Saul Munn (saul-munn) · 2024-01-31T09:51:27.587Z · comments (2)

The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (14)

[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)

Access to powerful AI might make computer security radically easier
Buck · 2024-06-08T06:00:19.310Z · comments (14)

I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)

[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)

Kids or No kids
Kids or no kids (grosseholz.f@gmail.com) · 2023-11-14T18:37:02.799Z · comments (10)

[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)

[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

Announcing FAR Labs, an AI safety coworking space
bgold · 2023-09-29T16:52:37.753Z · comments (0)

Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)

Investigating the learning coefficient of modular addition: hackathon project
Nina Panickssery (NinaR) · 2023-10-17T19:51:29.720Z · comments (5)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)

[link] The Soul Key
Richard_Ngo (ricraz) · 2023-11-04T17:51:53.176Z · comments (9)

Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · 2023-10-26T19:49:00.718Z · comments (44)

[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)

[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (32)

[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (46)

[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)

2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

Recent comments

abramdemski on 5 ways to improve CoT faithfulness

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction.

This is especially likely to hold under the following conditions:

We can also combine this with other proposals, such as paraphrasing.

curiousmeta on Positive Bias: Look Into the Dark

This cognitive phenomenon is usually lumped in with “confirmation bias.” However, it seems to me that the phenomenon of trying to test positive rather than negative examples, ought to be distinguished from the phenomenon of trying to preserve the belief you started with. “Positive bias” is sometimes used as a synonym for “confirmation bias,” and fits this particular flaw much better.

Subtle distinction I almost missed here. Worth expanding.

abramdemski on 5 ways to improve CoT faithfulness

Yeah, you're right, I no longer think it's an interesting proposal.

koratkar on Gwerns

For next time: https://www.lesswrong.com/posts/LfrNFfJFcqnG9WuFf/activated-charcoal-for-hangover-prevention-way-more-than-you ;)

bjartur-tomas on Gwerns

I was really sleep deprived and slightly intoxicated yesterday and wrote this. It was amusing to me, at least, in the state I was in.

thane-ruthenis on OpenAI Email Archives (from Musk v. Altman)

Also this:

From Altman: [...] Admitted that he lost a lot of trust with Greg and Ilya through this process. Felt their messaging was inconsistent and felt childish at times. [...] Sam was bothered by how much Greg and Ilya keep the whole team in the loop with happenings as the process unfolded. Felt like it distracted the team.

Apparently airing such concerns is "childish" and should only be done behind closed doors, otherwise it "distracts the team", hm.

anthonyc on AI #90: The Wall

I can totally believe this. But, I also think that responsibly wearing the scientist hat entails prep work before engaging in a four hour public discussion with a domain expert in a field. At minimum that includes skimming the titles and ideally the abstracts/outlines of their key writings. Maybe ask Claude to summarize the highlights for you. If he'd done that he'd have figured out many of the answers to many of these questions on his own, or much faster during discussion. He's too smart not to.

Otherwise, you're not actually ready to have a meaningful scientific discussion with that person on that topic.

akash-wasil on OpenAI Email Archives (from Musk v. Altman)

A few quotes that stood out to me:

Greg:

I hope for us to enter the field as a neutral group, looking to collaborate widely and shift the dialog towards being about humanity winning rather than any particular group or company.

Greg and Ilya (to Elon):

The goal of OpenAI is to make the future good and to avoid an AGI dictatorship. You are concerned that Demis could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to, especially given that we can create some other structure that avoids this possibility.

Greg and Ilya (to Altman):

But we haven't been able to fully trust your judgements throughout this process, because we don't understand your cost function.
We don't understand why the CEO title is so important to you. Your stated reasons have changed, and it's hard to really understand what's driving it.
Is AGI truly your primary motivation? How does it connect to your political goals? How has your thought process changed over time?

abramdemski on 5 ways to improve CoT faithfulness

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.
And, these should clearly be separate models such that the training of one doesn't generalize to the other.
So, making the Face more deceptive doesn't kill the canary?

Ah, yes, I had neglected this important distinction.

So what do you do in the end, throw away the face?

It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a multi-step training process like:

GPT -> basic chat training -> shoggoth+face training -> discard face & align shoggoth

Would result in deceptive capabilities: the shoggoth CoT learns to explicitly reason about what helps deceive the alignment training in the shoggoth+face step, and then in the final step, some of this is destroyed but some is driven underground. IE, learning to explicitly reason about deception creates latent structure which is useful for that task, and this latent structure will somewhat transfer to the case where the deception has to be hidden from the CoT.

The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).

So it sounds like we have some disagreement about what degree of deception is considered problematic. Again, I'm worried about basin-of-attraction type dynamics. We can think of a whole frontier lab as collectively being a sort of self-improving AGI; we want that recursive self-improvement to be going down a good path rather than a bad one. The LLMs are already helping to improve themselves in little ways (EG as coding assistants) and big ways (EG constitutional AI). So it's an environment where seeds of deceptiveness can grow and take some control of the overall trajectory.

Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?

I'm not sure what you're getting at here. It seems like the idea is that Face + CoT is, together, a clever schemer -- in the o1 report, we already see cases of explicit scheming in the CoT. So then, using the Face+CoT together (EG trying to train it further, using it in supervision steps for later iterations of the system, etc) is dangerous and difficult for the same reason that a single deceptive LLM would be. If we use Face+CoT together, we have to reason about them together in that way. Or, if we are using them as separate pieces, then we have to account for the part of the overall deceptive-schemer strategy they've learned.

I guess you could use Face+CoT as a supervisor (EG in constitutional AI) by looking at the CoT (both by hand and using some automation) and throwing out any deceptive examples, so as to not use them for supervision.

But the overall goal should be to produce a highly non-deceptive system in the end, I think.

Another idea would be to do the whole Face+CoT thing, and then filter for honesty on a bunch of examples, and then re-train a whole new system (well, you can start from the same GPT) only on those honest examples.

kabir-kumar on Kabir Kumar's Shortform

give better names to actual formal math things, jesus christ.