LessWrong 2.0 Reader

Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (93)
[link] Explore More: A Bag of Tricks to Keep Your Life on the Rails
Shoshannah Tekofsky (DarkSym) · 2024-09-28T21:38:52.256Z · comments (15)
Believing In
AnnaSalamon · 2024-02-08T07:06:13.072Z · comments (51)
[link] Introducing AI Lab Watch
Zach Stein-Perlman · 2024-04-30T17:00:12.652Z · comments (30)
[link] "How could I have thought that faster?"
mesaoptimizer · 2024-03-11T10:56:17.884Z · comments (32)
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Rohin Shah (rohinmshah) · 2024-08-20T16:22:45.888Z · comments (33)
The ‘strong’ feature hypothesis could be wrong
lewis smith (lsgos) · 2024-08-02T14:33:58.898Z · comments (17)
SAE feature geometry is outside the superposition hypothesis
jake_mendel · 2024-06-24T16:07:14.604Z · comments (17)
MIRI 2024 Mission and Strategy Update
Malo (malo) · 2024-01-05T00:20:54.169Z · comments (44)
Modern Transformers are AGI, and Human-Level
abramdemski · 2024-03-26T17:46:19.373Z · comments (88)
You are not too "irrational" to know your preferences.
DaystarEld · 2024-11-26T15:01:42.996Z · comments (50)
CFAR Takeaways: Andrew Critch
Raemon · 2024-02-14T01:37:03.931Z · comments (62)
LLM Generality is a Timeline Crux
eggsyntax · 2024-06-24T12:52:07.704Z · comments (119)
Brute Force Manufactured Consensus is Hiding the Crime of the Century
Roko · 2024-02-03T20:36:59.806Z · comments (156)
Superbabies: Putting The Pieces Together
sarahconstantin · 2024-07-11T20:40:05.036Z · comments (37)
Ayn Rand’s model of “living money”; and an upside of burnout
AnnaSalamon · 2024-11-16T02:59:07.368Z · comments (58)
"Slow" takeoff is a terrible term for "maybe even faster takeoff, actually"
Raemon · 2024-09-28T23:38:25.512Z · comments (69)
ChatGPT can learn indirect control
Raymond D · 2024-03-21T21:11:06.649Z · comments (27)
Towards more cooperative AI safety strategies
Richard_Ngo (ricraz) · 2024-07-16T04:36:29.191Z · comments (133)
The Sun is big, but superintelligences will not spare Earth a little sunlight
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-09-23T03:39:16.243Z · comments (141)
[link] What TMS is like
Sable · 2024-10-31T00:44:22.612Z · comments (23)
Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-04-30T18:51:13.493Z · comments (42)
OpenAI: Fallout
Zvi · 2024-05-28T13:20:04.325Z · comments (25)
What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (11)
Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob (dmitry-vaintrob) · 2024-01-18T21:06:57.040Z · comments (18)
[link] Jaan Tallinn's 2023 Philanthropy Overview
jaan · 2024-05-20T12:11:39.416Z · comments (5)
Funny Anecdote of Eliezer From His Sister
Noah Birnbaum (daniel-birnbaum) · 2024-04-22T22:05:31.886Z · comments (6)
Frontier Models are Capable of In-context Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-12-05T22:11:17.320Z · comments (24)
Pay Risk Evaluators in Cash, Not Equity
Adam Scholl (adam_scholl) · 2024-09-07T02:37:59.659Z · comments (19)
Making a conservative case for alignment
Cameron Berg (cameron-berg) · 2024-11-15T18:55:40.864Z · comments (68)
Maybe Anthropic's Long-Term Benefit Trust is powerless
Zach Stein-Perlman · 2024-05-27T13:00:47.991Z · comments (21)
[link] Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy
garrison · 2024-02-10T19:52:55.191Z · comments (52)
The Hopium Wars: the AGI Entente Delusion
Max Tegmark (MaxTegmark) · 2024-10-13T17:00:29.033Z · comments (55)
How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage
orthonormal · 2024-08-06T02:32:41.364Z · comments (30)
[link] Understanding Shapley Values with Venn Diagrams
Carson L · 2024-12-06T21:56:43.960Z · comments (32)
The Field of AI Alignment: A Postmortem, and What To Do About It
johnswentworth · 2024-12-26T18:48:07.614Z · comments (18)
This might be the last AI Safety Camp
Remmelt (remmelt-ellen) · 2024-01-24T09:33:29.438Z · comments (34)
The impossible problem of due process
mingyuan · 2024-01-16T05:18:33.415Z · comments (64)
[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (101)
Response to Aschenbrenner's "Situational Awareness"
Rob Bensinger (RobbBB) · 2024-06-06T22:57:11.737Z · comments (27)
Optimistic Assumptions, Longterm Planning, and "Cope"
Raemon · 2024-07-17T22:14:24.090Z · comments (46)
Self-Other Overlap: A Neglected Approach to AI Alignment
Marc Carauleanu (Marc-Everin Carauleanu) · 2024-07-30T16:22:29.561Z · comments (43)
[link] The Compendium, A full argument about extinction risk from AGI
adamShimi · 2024-10-31T12:01:51.714Z · comments (52)
What's Going on With OpenAI's Messaging?
ozziegooen · 2024-05-21T02:22:04.171Z · comments (13)
My AI Model Delta Compared To Christiano
johnswentworth · 2024-06-12T18:19:44.768Z · comments (73)
Two easy things that maybe Just Work to improve AI discourse
jacobjacob · 2024-06-08T15:51:18.078Z · comments (35)
Cryonics is free
Mati_Roy (MathieuRoy) · 2024-09-29T17:58:17.108Z · comments (42)
Communications in Hard Mode (My new job at MIRI)
tanagrabeast · 2024-12-13T20:13:44.825Z · comments (24)
Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (29)
OMMC Announces RIP
Adam Scholl (adam_scholl) · 2024-04-01T23:20:00.433Z · comments (5)