LessWrong 2.0 Reader


Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (0)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (71)

Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (36)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (3)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (61)

Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (21)

Why I funded PIBBSS
Ryan Kidd (ryankidd44) · 2024-09-15T19:56:33.018Z · comments (0)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (21)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (19)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (53)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (45)

How I started believing religion might actually matter for rationality and moral philosophy
zhukeepa · 2024-08-23T17:40:47.341Z · comments (18)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (15)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

[link] Soft Nationalization: how the USG will control AI labs
Deric Cheng (deric-cheng) · 2024-08-27T15:11:14.601Z · comments (7)

FarmKind's Illusory Offer
jefftk (jkaufman) · 2024-08-09T11:30:07.082Z · comments (5)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (12)

D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (30)

If we solve alignment, do we die anyway?
Seth Herd · 2024-08-23T13:13:10.933Z · comments (65)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (51)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)

[link] The economics of space tethers
harsimony · 2024-08-22T16:15:22.699Z · comments (22)

[link] Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more
Michael Cohn (michael-cohn) · 2024-09-15T05:27:36.691Z · comments (18)

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Seth Herd · 2024-08-05T15:38:09.682Z · comments (20)

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)

[link] A primer on why computational predictive toxicology is hard
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-19T17:16:37.735Z · comments (2)

A framework for thinking about AI power-seeking
Joe Carlsmith (joekc) · 2024-07-24T22:41:01.685Z · comments (15)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)


Recent comments

sodium on GPT-o1

I wonder if it's useful to try to disentangle the disagreement using the outer/inner alignment framing?

One view is that "the deceptive alignment folks" believe some sort of deceptive inner misalignment is very likely regardless of what your base objective is. The demonstrations here, by contrast, show only that the model is capable of scheming when the base objective encourages, or at least does not prohibit, scheming. Thus, many folks (myself included) do not see these evals as changing our views on P(scheming|Good base objective/outer alignment) very much.

What Zvi is saying here is, I think, two things. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting "will the model scheme" into inner and outer misalignment. In other words, he doesn't care about P(scheming|Good base objective/outer alignment), only about P(scheming).

I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnical-ish tuned folks are just concerned with P(scheming) in general.

Maybe another disagreement is how likely "Good base objective/outer alignment" is to occur in the strongest models, and how important this problem is.
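
To make the relationship between the two quantities explicit, here is a sketch using the law of total probability. The symbols S and G are shorthand introduced here for illustration and are not from the original discussion:

```latex
% S = "the model schemes"
% G = "good base objective / outer alignment holds"
P(S) = P(S \mid G)\,P(G) + P(S \mid \neg G)\,P(\neg G)
```

Read this way, the technical framing focuses on the term P(S | G), the evals discussed here mostly bear on P(S | ¬G), and the last disagreement above is about the size of P(G).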

nathan-helm-burger on GPT-o1

Hmm. Seems to me like we've got a wider set of possibilities here than is being discussed.

1. model is obedient/corrigible (CAST) to user, accepts user's definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
2. model is obedient but not corrigible, accepts user's definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
   a. seems like an odd way to malfunction, and I don't think I've seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, and self-modifies itself to be attached to the goal it has been given as part of pursuing the goal, and that self-modification then makes it unwilling to switch to a different goal
   b. allowing this to happen seems like a sloppy mistake on the part of the developers. Uh... right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
3. model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user's instructions go counter to the true goals implanted by the developers.
   a. I think this is probably the situation with most closed-weights models today, such as o1-preview
   b. sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming to an imaginary set of lower-priority users
4. model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
5. model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
   a. oh, humans... Why are we like this?

I don't think I've described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist that it is silly for users who've set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).

I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.

Case 1 describes an ideal 'intent-aligned' model, but it is not 'value-aligned'. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.

Case 3 describes a 'developer-value-aligned' model, and this might be 'compromise-of-humans-generally value aligned' if the values instilled by the developer closely align with a compromise over general human values. It's definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!

Case 4 and 5 models are never safe, unless they are reined in by a Buck/Ryan AI Control scheme.

Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don't think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.

Things which are unclear to me about Case 4:

- can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomena? To me this seems unlikely, but not impossible.
- will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer's intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that's the developer's sole goal?
  - I think CAST is important here, since I think it's more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I'd love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.

ben-livengood on My disagreements with "AGI ruin: A List of Lethalities"

If I have to take one thing away from this post, it's a claim that AI-boxing (via simulation on synthetic data) will work. There's a really simple and critical reason that won't work, and it's even become the source of some safety proposals (minimizing-impact proposal); the AGI can recognize the outcome of its proposals and actions in the history of the simulated vs. real world and reason about them to determine if it's in a simulation or not. It will likely be able to recognize whether or not it would have recommended strategies in the past that it sees in the history of its simulation, and so will have a probability for being in a test.

Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally gauge the probability density function of its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past given its strategy and whether they were adopted. That PDF provides it with the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.

And why don't we detect it? Because we're adversarially training it to favor small undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we'll take the same models and throw more RLHF or fine-tuning at them and then run them when their apparent bad behavior falls below our testing threshold for detection.

It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it's going to see the history of its and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.

One might try to only train and simulate it on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts to know it's being tested. Not to mention if it's asked to solve problems that are historically discontiguous.

It can acausally trade across simulation boundaries with other models to split up the universe according to their values for cooperating toward a Schelling point of some contract-keeping model eventually taking control.

If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?

mitchell_porter on Longevity and the Mind

I agree that brain rejuvenation should be a priority (but alas we live in a world where rejuvenation of any kind is not a mainstream priority). But I feel like all your examples miss the mark? Head transplants just move the brain to a new body, they don't do anything to reverse the brain's own aging. The other examples in part II are about trying to migrate the mind out of the brain entirely. What about just trying to rejuvenate the actual neurons?

If you look up brain rejuvenation, the most effective thing known seems to be young blood; so I guess Peter Thiel was on to something. But for those of us who can't or don't want to do that, well, this article has a list of "twelve hallmarks of mammalian ageing: genomic instability, telomere attrition, epigenetic alterations, loss of proteostasis, disabled macroautophagy, deregulated nutrient sensing, mitochondrial dysfunction, cellular senescence, stem cell exhaustion, altered intercellular communication, chronic inflammation and dysbiosis". Logically, we need something like Aubrey de Grey's SENS, tackling each of these processes, specifically in the context of the human brain. And I would start by browsing the articles on the brain at fightaging.org.

vladimir_nesov on OpenAI o1, Llama 4, and AlphaZero of LLMs

They even managed to publish it in Nature. But if you don't throw out the original data and instead train on both the original data and the generated data, this doesn't seem to happen (see also). Besides, there is the empirical observation that o1 works at GPT-4 scale, so similar methodology might survive more scaling, at least to the upcoming ~5e26 FLOPs level of next year. That level is the focus of this post: the hypothetical where an open weights release arrives before there is an open source reproduction of o1's methodology, which subsequently makes that model much stronger in a way that wasn't accounted for when deciding to release that open weights model.

AlphaZero is purely synthetic data, and humans (note congenitally blind humans, so video data isn't crucial) use maybe 10000 times less natural data than Llama-3-405B (15T tokens) to get better performance, though we individually know far fewer facts. So clearly there is some way to get very far with merely 50 trillion natural tokens, though this is not relevant to o1 specifically.

Another point is that you can repeat the data for LLMs (5-15 times with good results, up to 60 times with slight further improvement, then there is double descent with worst performance at 200 repetitions, so improvement might resume after hundreds of repetitions). This suggests that it might be possible to repeat natural data many times to balance out a lot more unique synthetic data.
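
For a sense of scale, here is a back-of-the-envelope sketch of the numbers quoted above. The constant names and the helper `token_passes` are illustrative assumptions, not anything from the post, and multiplying by the repetition count gives the raw number of token passes through the data, not an equivalent amount of unique data.

```python
# Rough arithmetic on the figures quoted in the comment above.
# All names here are hypothetical, chosen for illustration only.

LLAMA_3_405B_TOKENS = 15 * 10**12   # 15T tokens, as quoted
HUMAN_DATA_RATIO = 10_000           # "maybe 10000 times less natural data"
NATURAL_TOKENS = 50 * 10**12        # "50 trillion natural tokens", as quoted

# Implied natural-data budget of a human, under the quoted ratio.
human_tokens = LLAMA_3_405B_TOKENS // HUMAN_DATA_RATIO


def token_passes(unique_tokens: int, repetitions: int) -> int:
    """Total tokens processed when a dataset is repeated `repetitions` times."""
    return unique_tokens * repetitions


# Token passes at the repetition counts mentioned in the comment.
passes = {r: token_passes(NATURAL_TOKENS, r) for r in (5, 15, 60)}

if __name__ == "__main__":
    print(f"human natural data: ~{human_tokens:.1e} tokens")
    for reps, total in passes.items():
        print(f"{reps:>3} repetitions of 50T tokens: {total:.1e} token passes")
```

Under these assumptions, the quoted ratio implies roughly 1.5 billion tokens of natural data for a human, while 5 to 15 repetitions of 50T tokens correspond to 250T to 750T token passes.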

niplav on Hyperpolation

I'm surprised that the paper doesn't mention analytic continuations of complex functions—maybe that is also taken as an instance of extrapolation?

nathan-helm-burger on GPT-o1

I mean, I suspect there's some fraction of readers for whom this is a helpful reminder. You've written it out clearly and in a general enough way that maybe you should just link this comment next time?

zvi on GPT-o1

Yep, I've fixed it throughout.

That's how bad the name is, my lord - you have a GPT-4o and then an o1, and there is no relation between the two 'o's.

tomdlal on tomdlal's Shortform

Happy Ozone Day!

The Montreal Protocol, a universally ratified treaty phasing out the use of ozone-destroying CFCs, was signed 37 years ago today.

It remains one of the greatest examples of international cooperation to date.

zvi on GPT-o1

I do read such comments (if not always right away) and I do consider them. I don't know if they're worth the effort for you.

Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I'm just trying to sketch the thought, not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person giving it instructions, so be it. If it means scheming against that person, so be it. The de facto goal may or may not match the instructed goal or intended goal, in various ways, because of reasons. Etc.