LessWrong 2.0 Reader

I would have shit in that alley, too
Declan Molony (declan-molony) · 2024-06-18T04:41:06.545Z · comments (134)
Reliable Sources: The Story of David Gerard
TracingWoodgrains (tracingwoodgrains) · 2024-07-10T19:50:21.191Z · comments (52)
Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)
Andrew_Critch · 2024-06-14T00:16:47.850Z · comments (34)
How I got 3.2 million Youtube views without making a single video
Closed Limelike Curves · 2024-09-03T03:52:33.025Z · comments (23)
You don't know how bad most things are nor precisely how they're bad.
Solenoid_Entity · 2024-08-04T14:12:54.136Z · comments (48)
My AI Model Delta Compared To Yudkowsky
johnswentworth · 2024-06-10T16:12:53.179Z · comments (100)
80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
Raemon · 2024-07-03T20:34:50.741Z · comments (71)
Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
Buck · 2024-08-26T16:46:18.872Z · comments (68)
Getting 50% (SoTA) on ARC-AGI with GPT-4o
ryan_greenblatt · 2024-06-17T18:44:01.039Z · comments (49)
Leaving MIRI, Seeking Funding
abramdemski · 2024-08-08T18:32:20.387Z · comments (19)
Universal Basic Income and Poverty
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-07-26T07:23:50.151Z · comments (115)
Principles for the AGI Race
William_S · 2024-08-30T14:29:41.074Z · comments (13)
SAE feature geometry is outside the superposition hypothesis
jake_mendel · 2024-06-24T16:07:14.604Z · comments (17)
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Rohin Shah (rohinmshah) · 2024-08-20T16:22:45.888Z · comments (33)
The Best Lay Argument is not a Simple English Yud Essay
J Bostock (Jemist) · 2024-09-10T17:34:28.422Z · comments (7)
The ‘strong’ feature hypothesis could be wrong
lewis smith (lsgos) · 2024-08-02T14:33:58.898Z · comments (17)
Superbabies: Putting The Pieces Together
sarahconstantin · 2024-07-11T20:40:05.036Z · comments (37)
LLM Generality is a Timeline Crux
eggsyntax · 2024-06-24T12:52:07.704Z · comments (103)
Towards more cooperative AI safety strategies
Richard_Ngo (ricraz) · 2024-07-16T04:36:29.191Z · comments (130)
Pay Risk Evaluators in Cash, Not Equity
Adam Scholl (adam_scholl) · 2024-09-07T02:37:59.659Z · comments (19)
How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage
orthonormal · 2024-08-06T02:32:41.364Z · comments (25)
Self-Other Overlap: A Neglected Approach to AI Alignment
Marc Carauleanu (Marc-Everin Carauleanu) · 2024-07-30T16:22:29.561Z · comments (40)
Optimistic Assumptions, Longterm Planning, and "Cope"
Raemon · 2024-07-17T22:14:24.090Z · comments (45)
My AI Model Delta Compared To Christiano
johnswentworth · 2024-06-12T18:19:44.768Z · comments (51)
Safety consultations for AI lab employees
Zach Stein-Perlman · 2024-07-27T15:00:27.276Z · comments (4)
WTH is Cerebrolysin, actually?
gsfitzgerald (neuroplume) · 2024-08-06T20:40:53.378Z · comments (22)
[link] Recommendation: reports on the search for missing hiker Bill Ewasko
eukaryote · 2024-07-31T22:15:03.174Z · comments (28)
This is already your second chance
Malmesbury (Elmer of Malmesbury) · 2024-07-28T17:13:57.680Z · comments (13)
[link] Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
Johannes Treutlein (Johannes_Treutlein) · 2024-06-21T15:54:41.430Z · comments (13)
[link] Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison (carson-denison) · 2024-06-17T18:41:31.090Z · comments (22)
You can remove GPT2’s LayerNorm by fine-tuning for an hour
StefanHex (Stefan42) · 2024-08-08T18:33:38.803Z · comments (10)
[link] Boycott OpenAI
PeterMcCluskey · 2024-06-18T19:52:42.854Z · comments (26)
Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton · 2024-06-25T15:40:03.535Z · comments (11)
The Great Data Integration Schlep
sarahconstantin · 2024-09-13T15:40:02.298Z · comments (9)
[link] Nursing doubts
dynomight · 2024-08-30T02:25:36.826Z · comments (20)
My takes on SB-1047
leogao · 2024-09-09T18:38:37.799Z · comments (8)
The Incredible Fentanyl-Detecting Machine
sarahconstantin · 2024-06-28T22:10:01.223Z · comments (26)
[question] things that confuse me about the current AI market.
DMMF · 2024-08-28T13:46:56.908Z · answers+comments (28)
Liability regimes for AI
Ege Erdil (ege-erdil) · 2024-08-19T01:25:01.006Z · comments (34)
Contra papers claiming superhuman AI forecasting
nikos (followtheargument) · 2024-09-12T18:10:50.582Z · comments (11)
The Information: OpenAI shows 'Strawberry' to feds, races to launch it
Martín Soto (martinsq) · 2024-08-27T23:10:18.155Z · comments (14)
[link] That Alien Message - The Animation
Writer · 2024-09-07T14:53:30.604Z · comments (8)
OpenAI o1
Zach Stein-Perlman · 2024-09-12T17:30:31.958Z · comments (40)
Survey: How Do Elite Chinese Students Feel About the Risks of AI?
Nick Corvino (nick-corvino) · 2024-09-02T18:11:11.867Z · comments (13)
[link] Decomposing Agency — capabilities without desires
owencb · 2024-07-11T09:38:48.509Z · comments (32)
[link] Fields that I reference when thinking about AI takeover prevention
Buck · 2024-08-13T23:08:54.950Z · comments (15)
Limitations on Formal Verification for AI Safety
Andrew Dickson · 2024-08-19T23:03:52.706Z · comments (60)
[link] "AI achieves silver-medal standard solving International Mathematical Olympiad problems"
gjm · 2024-07-25T15:58:57.638Z · comments (38)
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Neel Nanda (neel-nanda-1) · 2024-07-07T17:39:35.064Z · comments (15)
[link] The Checklist: What Succeeding at AI Safety Will Involve
Sam Bowman (sbowman) · 2024-09-03T18:18:34.230Z · comments (47)