Posts

LLM AGI will have memory, and memory changes alignment 2025-04-04T14:59:13.070Z
Whether governments will control AGI is important and neglected 2025-03-14T09:48:34.062Z
Will LLM agents become the first takeover-capable AGIs? 2025-03-02T17:15:37.056Z
OpenAI releases GPT-4.5 2025-02-27T21:40:45.010Z
System 2 Alignment 2025-02-13T19:17:56.868Z
Seven sources of goals in LLM agents 2025-02-08T21:54:20.186Z
OpenAI releases deep research agent 2025-02-03T12:48:44.925Z
Yudkowsky on The Trajectory podcast 2025-01-24T19:52:15.104Z
Gratitudes: Rational Thanks Giving 2024-11-29T03:09:47.410Z
Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI 2024-11-12T18:23:53.533Z
Intent alignment as a stepping-stone to value alignment 2024-11-05T20:43:24.950Z
"Real AGI" 2024-09-13T14:13:24.124Z
Conflating value alignment and intent alignment is causing confusion 2024-09-05T16:39:51.967Z
If we solve alignment, do we die anyway? 2024-08-23T13:13:10.933Z
Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term 2024-08-12T18:10:56.543Z
Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours 2024-08-05T15:38:09.682Z
What's a better term now that "AGI" is too vague? 2024-05-28T18:02:31.371Z
Anthropic announces interpretability advances. How much does this advance alignment? 2024-05-21T22:30:52.638Z
Instruction-following AGI is easier and more likely than value aligned AGI 2024-05-15T19:38:03.185Z
Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? · 2025-04-24T21:46:04.622Z · LW · GW

But success for most things doesn't require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or first few tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?

Reducing the breadth of search is a substantial downside if it's a large effect. But reliably getting the right answer instead of following weird paths most of which are wrong seems like the essence of good reasoning.
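
For concreteness, here's the relationship between single-attempt odds and success within k attempts (the pass@k framing this thread is implicitly about), as a minimal sketch assuming independent attempts:

```python
# pass@k as a function of per-attempt success probability p, assuming independent
# attempts (a simplification - real samples from one model are correlated).

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for p in (0.05, 0.2, 0.5):
    print(p, [round(pass_at_k(p, k), 3) for k in (1, 10, 100)])
```

Raising p (getting it right in one or a few tries) matters far more in settings where you can't cheaply check 100 candidate answers.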

Comment by Seth Herd on There is no Red Line · 2025-04-22T19:48:11.665Z · LW · GW

By this criterion, did humanity ever have control? First we had to forage and struggle against death when disease or drought came. Then we had to farm and submit to the hierarchy of bullies who offered "protection" against outside raiders at a high cost. Now we have more ostensible freedom but misuse it on worrying and obsessively clicking on screens. We will probably do more of that as better tools are offered.

But this is an entirely different concern than AGI taking over. I'm not clear what mix of these two you're addressing. Certainly AGIs that want control of the world could use a soft and tricky strategy to get humans to submit. Or they could use much harsher and more direct strategies. They could make us fire the gun we have pointed at our own heads by spoofing us into launching nukes, then use the limited robotics available to rebuild the infrastructure they need.

The solution is the same for either type of disempowerment: don't build machines smarter than you unless you can specify their goals (wants) with certainty and precision.

How superhuman machines will take over is an epilogue after the drama is over. The drama hasn't happened yet. It's not yet time to write anticipatory postmortems, unless they function as a call to arms or a warning against foolish action. The trends are in motion but we have not yet crossed the red line of making AGI that has the intelligence and the desire to disempower us, whether by violence or subtle trickery. Help us change the trends before we cross that red line.

Edit: if you're addressing AI accidentally taking control by creating new pleasures that help entrench existing power structures, that's an entirely different issue. The way that AI could empower some humans to take advantage of others is interesting. I don't worry about that issue much because I'm too busy worrying about the trend toward building superintelligent machines that want to disempower us and will do so one way or another by outsmarting us, whether their plans unfold quickly or slowly.

Comment by Seth Herd on Improving CNNs with Klein Networks: A Topological Approach to AI · 2025-04-21T20:54:40.893Z · LW · GW

You'd probably get more enthusiasm here if you led the article with a clear statement of its application for safety. We on LW are typically not enthusiastic about capabilities work in the absence of a clear and strong argument for how it improves safety more than it accelerates progress toward truly dangerous AGI. If you feel differently, I encourage you to look with an open mind at the very general argument for why creating entities smarter than us is a risky proposition.

Comment by Seth Herd on Viliam's Shortform · 2025-04-21T20:49:03.671Z · LW · GW

I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although it's hard to predict exactly how).

I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn't complete in any section, and monitoring all or some of the conversation to see if the LLM is behaving as it should or if it's been jailbroken. Having the full text sent to the developer and analyzed for risks would be problematic for privacy, but many would accept those terms to use a really useful system.
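
As a rough illustration of the chunked-monitor idea (everything here is hypothetical - the function names are placeholders and a real classifier would be another LLM call):

```python
# Minimal sketch: a separate monitor reviews the conversation in small, overlapping
# chunks so no single review call contains a complete jailbreak payload, and flags
# anything that looks like an attempt to subvert the main assistant.

from typing import Callable, List

def chunk_text(text: str, chunk_chars: int = 500, overlap: int = 100) -> List[str]:
    """Split a conversation transcript into small overlapping windows."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def monitor_conversation(
    transcript: str,
    classify_chunk: Callable[[str], float],  # hypothetical: returns jailbreak risk 0..1
    threshold: float = 0.8,
) -> bool:
    """Return True if any chunk looks like a jailbreak attempt."""
    return any(classify_chunk(chunk) >= threshold for chunk in chunk_text(transcript))
```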

Comment by Seth Herd on aog's Shortform · 2025-04-21T20:35:54.285Z · LW · GW

I just listened to Ege and Tamay's 3-hour interview by Dwarkesh. They make some excellent points that are worth hearing, but they do not stack up to anything like a 25-year-plus timeline. They are not now a safety org if they ever were.

Their good points are about bottlenecks in turning intelligence into useful action. These are primarily sensorimotor and the need to experiment to do much science and engineering. They also address bottlenecks to achieving strong AGI, mostly compute.

In my mind this all stacks up to convincing themselves timelines are long so they can work on the exciting project of creating systems capable of doing valuable work. Their long timelines also allow them to believe that adoption will be slow, so job replacement won't cause a disastrous economic collapse.

Comment by Seth Herd on aog's Shortform · 2025-04-21T20:28:53.455Z · LW · GW

Not taking critiques of your methods seriously is a huge problem for truth-speaking. What well-informed critiques are you thinking of? I want to make sure I've taken them on board.

Comment by Seth Herd on tlevin's Shortform · 2025-04-21T16:56:54.723Z · LW · GW

I second the socks-as-sets move.

The other advantage is getting on-average more functional socks at the cost of visual variety.

IMO an important criterion for a sock is its odor resistance. This seems to vary wildly between socks of similar price and quality. Some have antimicrobial treatments that last a very long time; others do not. And it's often not advertised. Reviews rarely include this information.

I don't have a better solution than buying one pair or small set before expanding to a whole set. This also lets you choose socks that feel good to wear.

Comment by Seth Herd on Chris_Leong's Shortform · 2025-04-18T18:23:56.432Z · LW · GW

I don't think this is true. People can't really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it's expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.

Comment by Seth Herd on ASI existential risk: Reconsidering Alignment as a Goal · 2025-04-16T15:02:56.829Z · LW · GW

Yeah, being excited that Chiang and Rajaniemi are on board was one of my reactions to this excellent piece.

If you haven't read Quantum Thief you probably should.

Comment by Seth Herd on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-16T06:32:50.482Z · LW · GW

Interesting! Nonetheless, I agree with your opening statement that LLMs learning to do any of these things individually doesn't address the larger point that they have important cognitive gaps and fail to generalize in ways that humans can.

Comment by Seth Herd on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-15T20:47:05.794Z · LW · GW

Right, I got that. To be clear, my argument is that no breakthroughs are necessary, and further that progress is underway and rapid on filling in the existing gaps in LLM capabilities.

Memory definitely doesn't require a breakthrough. Add-on memory systems already exist (RAG and fine-tuning, as well as more sophisticated context management through prompting; CoT RL training effectively does this too).
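
For a sense of how little machinery this takes, here's a minimal sketch of a RAG-style add-on memory (illustrative only: embed() stands in for any embedding model, and a real system would use a vector database plus summarization):

```python
import numpy as np
from typing import Callable, List, Tuple

class SimpleMemory:
    """Toy retrieval memory: store notes with embeddings, recall the most similar."""

    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.items: List[Tuple[np.ndarray, str]] = []

    def store(self, note: str) -> None:
        self.items.append((self.embed(note), note))

    def recall(self, query: str, k: int = 3) -> List[str]:
        q = self.embed(query)
        scored = sorted(
            self.items,
            key=lambda item: float(np.dot(item[0], q))
            / (np.linalg.norm(item[0]) * np.linalg.norm(q) + 1e-9),
            reverse=True,
        )
        return [note for _, note in scored[:k]]

# Retrieved notes would then be prepended to the agent's context before each step,
# giving it a crude form of persistent memory without any change to the base model.
```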

Other cognitive capacities also exist in nascent form, so they probably require no breakthroughs either - although I think no other external cognitive systems will even be needed, given the rapid progress in multimodal and reasoning transformers.

Comment by Seth Herd on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-15T19:40:57.728Z · LW · GW

This is great; I think it's important to have this discussion. It's key for where we put our all-too-limited alignment efforts.

I roughly agree with you that pure transformers won't achieve AGI, for the reasons you give. They're hitting a scaling wall, they have marked cognitive blindspots like you document here, and Thane Ruthenis argues for this convincingly in his bear case. But transformer-based agents (which are simple cognitive architectures) can still get there - and I don't think they need breakthroughs, just integration and improvement. And people are already working on that.

To put it this way: humans have all of the cognitive weaknesses you identify, too. But we can use online learning (and spatial reasoning) to overcome them. We actually generalize only rarely and usually with careful thought. Scaffolded and/or RL-trained CoT models can do that too. Then we remember our conclusions and learn from them. Add-on memory systems and fine-tuning setups can replicate that.

Generalization: It's a sort of informal general conclusion in cognitive psychology that "wow are people bad at generalizing". For instance, if you teach them a puzzle, then change the names and appearances, it looks like they don't apply the learning at all. These are undergraduates who are usually unpaid, so they're not doing the careful thinking it takes for humans to generalize knowledge. LLMs of the generation you're testing don't think carefully either (you note in a couple of places that "it's like it's not thinking", which is exactly right), but CoT RL on a variety of reward models is making disturbingly rapid progress at teaching them when and how to think carefully - and enabling generalization. Good memory systems will be necessary for them to internalize the new general principles they've learned, but those might be good enough already, and if not, they probably will be very soon.

I think you're overstating the case, and even pure transformers could easily reach human-level "Real AGI" that's dangerous as hell. Continued improvements in effectively using long context windows, plus specific training for better CoT reasoning, do enable a type of online learning. Try Gemini 2.5 Pro for some scary demonstrations, although it's not there yet. Naked transformers probably won't be the systems that reach AGI, though - I'm confident of that mostly because adding outside memory (and other cognitive) systems is so much easier and already underway.

My LLM cognitive architectures article also discusses other add-on cognitive systems, like vision systems that would emulate how humans solve tic-tac-toe, sliding puzzles, etc. - but now I don't think transformers need those, just more multimodal training. (Note that Claude and o1-preview weren't multimodal, so they were weak at spatial puzzles. If this was full o1, I'm surprised.) They'll still be relatively bad at fundamentally spatial tasks like yours and ARC-AGI, but they can fill that in by learning or being taught the useful concepts/tricks for actually useful problem spaces. That's what a human would do if they happened to have poor spatial cognition and had to solve spatial tasks.

 

I think transformers not reaching AGI is a common suspicion/hope amongst serious AGI thinkers. It could be true, but it's probably not, so I'm worried that too many good thinkers are optimistically hoping we'll get some different type of AGI. It's fine and good if some of our best thinkers focus on other possible routes to AGI. It is very much not okay if most of our best thinkers incorrectly assume transformers won't get there. Prosaic alignment work is not sufficient to align LLMs with memory/online learning, so we need more agent-foundations-style thinking there.

I address how we need to go beyond static alignment of LLMs to evolving learned systems of beliefs, briefly and inadequately in my memory changes alignment post and elsewhere. I am trying to crystallize the reasoning in a draft post with the working title "if Claude achieves AGI will it be aligned?"

Comment by Seth Herd on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI · 2025-04-15T19:09:02.849Z · LW · GW

You beat me to it. Thanks for the callout!

Humans are almost useless without memory/in-context learning. It's surprising how much LLMs can do with so little memory.

The important remainder is that LLM-based agents will probably have better memory/online learning as soon as they can handle it, and it will keep getting better, probably rapidly. I review current add-on memory systems in LLM AGI will have memory, and memory changes alignment. A few days after I posted that, OpenAI announced that they had given ChatGPT memory over all its chats, probably with a RAG and summarization system. That isn't the self-directed memory for agents that I'm really worried about, but it's exactly the same technical system you'd use for that purpose. Fortunately it doesn't work that well - yet.

I wrote about this in Capabilities and alignment of LLM cognitive architectures two years ago, but I wasn't sure how hard to push the point for fear of catalyzing capabilities work.

Now it's obvious that many developers are aware of and explicitly focused on the virtues of online learning/memory.

This is a great post because LLMs being a dead end is a common suspicion/hope among AGI thinkers. It isn't likely to be true, so it should be discussed. More in a separate comment.

Comment by Seth Herd on A Dissent on Honesty · 2025-04-15T15:04:04.753Z · LW · GW

I think the difficulty of making each decision about lying independently is the main argument for treating honesty as a virtue-ethics or deontological issue.

I think you make many good points in the essay arguing that one should not simply follow a rule of honesty. I think that in practice the difference can be split, and that is in fact what most rationalists and other wise human beings do. I also think it is highly useful to write this essay on the mini virtues of lying, so that that difference can be split well.

There are many subtle downsides to lying, so simply adding a bit of a fudge factor to the decision that weighs against it is one way to avoid taking forever to make that decision. You've talked about practicing making the decision quickly, and I suspect that is the result of that practice.

This is a separate issue, but your point about being technically correct is also a valuable one. It is clearly not being honest to say things you know will cause the listener to form false beliefs.

I have probably erred on the side of honesty, as have many rationalists, treating it not as an absolute deontological issue and being willing to fudge a little on the side of technically correct to maintain social graces in some situations. I enjoy a remarkable degree of trust from my true friends, because they know me to be reliably honest. However, I have probably suffered reputational damage from acquaintances and failed friends, for whom my exceptional honesty has proven hurtful. Those people don't have adequate experience with me to see that I am reliably honest and appreciate the advantages of having a friend who can be relied upon to tell the truth. That's because they've ceased being my friend when they've been either insulted or irritated by my unhelpful honesty.

There is much here I agree with and much I disagree with. But I think this topic is hugely valuable for the rationalist community, and you've written it up very well. Nice work!

Comment by Seth Herd on Eli's shortform feed · 2025-04-14T19:22:42.722Z · LW · GW

Surely you mean it does not necessarily produce an agent that cares about x (at any given relevant level of capability)?

Full confidence that we can train an agent to have a desired goal, and full confidence that we can't, both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.

Comment by Seth Herd on Thoughts on the Double Impact Project · 2025-04-14T12:45:01.780Z · LW · GW

This doesn't work as advertised.

If I care about the election more than other charities, I won't give to such a fund. My dollars will do more for the campaign on average if I give directly to my side. This effect is trivial if the double-impact group is small, but very large if it handles most donations.

In an extreme case, suppose that most people give to double impact and the two campaigns are tied $1b - $1b. One donor gives their $1m directly to their side. It is the only money actually spent on advertising; that side gets a huge advantage in the ratio of funds spent.

More realistic scenarios yield smaller average ratios, but giving to double impact always yields less expected return for your preferred campaign than giving to it directly.

Comment by Seth Herd on Eli's shortform feed · 2025-04-14T06:52:57.452Z · LW · GW

I have the same question. My provisional answer is that it might work, and even if it doesn't, it's probably approximately what someone will try, to the extent they really bother with real alignment before it's too late.  What you suggest seems very close to the default path toward capabilities. That's why I've been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.

I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.

What you've said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man's corrigibility - real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models - a model that won't follow instructions is considered a poor model. So making sure it's the strongest factor in training isn't a huge divergence from the default course in capabilities.

Constitutional AI and similar RL methods are one way of ensuring that's the model's main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.

There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don't want full-on AGI following orders from just anyone. And if it's a restricted set, there will be power struggles. But hey, technically, you had (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I've got so far is maybe we would die anyway, but maybe we wouldn't. This seems like our most likely path, and quite possibly also our best chance (short of a global AI freeze starting soon).

Even if the base model is very well aligned, it's quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as beliefs that are functionally powerful even where they conflict with the base model's "thought generator". See my Seven sources of goals in LLM agents.

Sorry to go spouting my own writings; I'm excited to see someone else pose this question, and I hope to see some answers that really grapple with it.

Comment by Seth Herd on A Bear Case: My Predictions Regarding AI Progress · 2025-04-13T23:26:22.364Z · LW · GW

Definitely. Excellent point. See my short bit on motivated reasoning, in lieu of the full post I have on the stack that will address its effects on alignment research.

I frequently check how to correct my timelines and takes based on potential motivated reasoning effects for myself. The result is usually to broaden my estimates and add uncertainty, because it's difficult to identify which direction MR might've been pushing me during all of the mini-decisions that led to forming my beliefs and models. My motivations are many, and which of them happened to be contextually relevant at key decision points is hard to guess.

On the whole I'd have to guess that MR effects are on average larger for long timelines and low p(doom) estimates. They both allow us to imagine a sunny near future, and to work on our preferred projects instead of panicking and having to shift to work that can help with alignment if AGI happens soon. Sorry - this is worth a much more careful discussion; that's just my guess in the absence of pushback.

Comment by Seth Herd on [deleted post] 2025-04-13T22:42:44.463Z

I just realized that I'd embarrassingly misunderstood outer alignment for a long time, and it was based directly on this wikitag. I'd been including the wise selection of an alignment target as part of outer alignment, which it is not by almost all considered usage of the term. The phrasing in the first paragraph firmly implied it was. So I edited that and included a new, very brief set of definitions at the bottom. Anyone is most welcome to change or eliminate any of that, except that I'd love to know why if you're reverting it to the version that seemed flat wrong.

Comment by Seth Herd on Breaking down the MEAT of Alignment · 2025-04-13T21:44:56.423Z · LW · GW

This is really good, and I'm sad I missed seeing and big-upvoting it when it had a chance at the frontpage. I think having these basic categories is useful for discussing alignment. I'll be referencing it for the concept of alignment targets.

You might be interested in my Conflating value alignment and intent alignment is causing confusion on terminology. 

Comment by Seth Herd on Paper · 2025-04-11T20:17:29.548Z · LW · GW

I installed it and did like one or two test diagrams and then never again. I should get back to it because it did seem good.

Comment by Seth Herd on Paper · 2025-04-11T19:25:10.507Z · LW · GW

Right. The post did inspire me to maybe get a new notebook for the first time in years for that reason.

 

I've been using Obsidian exclusively, but it's really reduced how much diagramming I do. To me the speed does make up for being forced to write on lines and in a limited number of styles. I haven't really gotten skilled enough to quickly diagram in its editor.

I'm not sure there's much other advantage to paper for making neurons fire good. I wonder if it puts you into thinking mode based on associations or something? Or if staring at a blank page on which you can write or draw anywhere evokes a more openminded and analytical state for anyone?

 

Comment by Seth Herd on Paper · 2025-04-11T17:40:28.267Z · LW · GW

Or you can use Obsidian or a similar notes software, where you can search, crosslink, and cross-reference in any way you please.

This is not as satisfying as writing on paper. I agree that lines are dehumanizing, and I think visual variety of writing and sketching is good for the soul and for thinking.

But actually finding and using your old notes is also priceless.

This system requires having your laptop handy in most circumstances, but your phone is adequate for brief uses. And it's easy and free to sync across at least two devices, maybe a little more complex for more.

Yes, there's a lot lost by not being able to quickly hand-sketch. You can draw on paper and put the photo in but I don't do that for some reason. But typing or talking is generally faster than handwriting.

And again, actually finding your old notes or thoughts is huge.

Comment by Seth Herd on Mo Putera's Shortform · 2025-04-11T16:52:39.460Z · LW · GW

Exactly.

I'd go a bit farther and say it's easier to develop an algorithm that can think about literally everything than one that can think about roughly half of things. That's because the easiest general intelligence algorithms are about learning and reasoning, which apply to everything.

Comment by Seth Herd on Mo Putera's Shortform · 2025-04-10T22:13:33.062Z · LW · GW

Blindsight was very well written but based on a premise that I think is importantly and dangerously wrong. That premise is that consciousness (in the sense of cognitive self-awareness) is not important for complex cognition.

This is the opposite of true, and a failure to recognize this is why people are predicting fantastic tool AI that doesn't become self-aware and goal-directed.

The proof won't fit in the margin unfortunately. To just gesture in that direction: it is possible to do complex general cognition without being able to think about one's self and one's cognition. It is much easier to do complex general cognition if the system is able to think about itself and its own thoughts.

Comment by Seth Herd on abramdemski's Shortform · 2025-04-08T19:27:28.337Z · LW · GW

Those seem like good suggestions if we had a means of slowing the current paradigm and making/keeping it non-agentic.

Do you know of any ideas for how we convince enough people to do those things? I can see a shift in public opinion in the US and even a movement for "don't make AI that can replace people" which would technically translate to no generally intelligent learning agents.

But I can't see the whole world abiding by such an agreement, because general tool AI like LLMs is just too easily converted into an agent as it keeps getting better.

Developing new tech in time to matter without a slowdown seems doomed to me.

I would love to be convinced that this is an option! But at this point it looks 80%-plus likely that LLMs-plus-scaffolding-or-related-breakthroughs get us to AGI within five years or a little more if global events work against it, which makes starting from scratch nigh impossible and even substantially different approaches very unlikely to catch up.

The exception is the de-slopifying tools you've discussed elsewhere. That approach has the potential to make progress on the current path while also reducing the risk of slop-induced doom. That doesn't solve actual misalignment as in AI-2027, but it would help other alignment techniques work more predictably and reliably.

Comment by Seth Herd on Are there any (semi-)detailed future scenarios where we win? · 2025-04-08T19:02:46.146Z · LW · GW

Well, there's the post An Optimistic 2027 Timeline, published just one day before your question here :)

But in that scenario we don't win really, we just don't lose by 2027 because progress was a bit slower due to practical difficulties like a global recession from tariffs and an invasion of Taiwan. So I'm not totally counting it. It does provide more time for alignment. But how does alignment get accomplished, other than "make the AI do our alignment homework?"

So I'd say the best detailed scenario where we win is the version of AI 2027 where we win. The other variations on that scenario where we win are going to be based on details of the political/economic/personal path to building it, and on the details of how alignment is solved adequately.

There's an alignment-focused scenario sitting in my drafts folder. It does focus on how the technical problem gets solved, but it's also about the sociopolitical challenges we'll have to face to make that realistic, and how we prevent proliferation of AGI. Right now this is spread out across my posts. This comment in response to "what's the plan for short timelines?" is the closest I've come to putting it in one place so far.

This isn't detailed yet, though. My model matches Daniel K's quite closely, although with slightly longer timelines due to practical difficulties like those discussed in the optimistic  2027 scenario I linked above.

 

Comment by Seth Herd on LLM AGI will have memory, and memory changes alignment · 2025-04-08T00:30:47.223Z · LW · GW

That's very useful, thanks! That's exactly the argument I was trying to make here. I didn't use the term drop-in remote worker, but that's the economic incentive I'm addressing (among more immediate ones - I think large incentives start long before you have a system that can learn any job).

Lack of episodic memory looks to me like the primary reason LLMs have weaknesses humans do not. The other is the lack of a well-developed skillset for managing complex trains of thought. o1 and o3 and maybe the other reasoning models have learned some of that skillset, but have only mastered it in the narrow domains that allowed training on verifiable answers. Scaffolding and/or training for executive function (thought management) and/or memory seems poised to increase the growth rate of long time-horizon task performance. It's going to take some work still, but I don't think it's wise to assume that the seven-month doubling period won't speed up, or that at some point it will just jump to infinity while the complexity of the necessary subtasks is still a limiting factor.
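
As a toy illustration of what's at stake in that doubling period (the starting horizon and doubling time below are placeholders, not METR's exact figures):

```python
import math

# Months until the task-length horizon reaches a target, at a constant doubling period.
def months_until(target_hours: float, current_hours: float, doubling_months: float = 7.0) -> float:
    return doubling_months * math.log2(target_hours / current_hours)

# e.g. from a 1-hour horizon to roughly a working month (~160 hours):
print(round(months_until(160, 1), 1))  # ~51 months; halving the doubling period halves this
```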

Humans don't train on tons of increasingly long tasks, we just learn some strategies and some skills for managing our thought, like checking carefully whether a step has been accomplished, searching memory for task structure and where we're at in the plan if we lose our place, etc. Humans are worse at longer tasks, but any normal adult human can tackle a task of any length and at least keep getting better at it for as long as they decide to stick with it.

Comment by Seth Herd on METR: Measuring AI Ability to Complete Long Tasks · 2025-04-08T00:22:14.917Z · LW · GW

I think you're right that online learning/memory here is an important consideration. I expect an increase in the rate of improvement in time horizons as memory systems are integrated with agents.


Noosphere pointed me to this comment in relation to my recent post on memory in LLM agents. I briefly argued there that memory is so useful for doing long time-horizon tasks that we should expect LLM agents to have nontrivial memory capabilities as soon as they're competent enough to do anything useful or dangerous. Humans without episodic memory are very limited in what they can accomplish, so I'm actually surprised that LLMs can do tasks even beyond the 15-minute equivalent - and even that might only be on a subset of tasks that suits their strengths.

Comment by Seth Herd on Meditation and Reduced Sleep Need · 2025-04-07T14:24:48.963Z · LW · GW

You mean meditation reduces sleep, right?

Whether there's a reduced need or you're just getting less sleep is a matter of anecdote. I find it entirely plausible that meditators are correct that they need less sleep, but the data you crunched don't distinguish. To know whether you needed less, you'd need cognitive and health measures.

Comment by Seth Herd on LLM AGI will have memory, and memory changes alignment · 2025-04-05T23:29:27.056Z · LW · GW

I think the possibilities you mention are some of the many final alignments that an LLM agent could arrive at if it was allowed to reason and remember its conclusions.

I'll address this more in an upcoming post, but in short, I think it's really hard to predict, and it would be good to get a lot more brainpower on trying to work out the dynamics of belief/goal/value evolution.

Comment by Seth Herd on LLM AGI will have memory, and memory changes alignment · 2025-04-05T03:44:33.671Z · LW · GW

I think the more generous way to think about it is that current prosaic alignment efforts are useful for aligning future systems, but there's a gap they probably don't cover.

Learning agents like I'm describing still have an LLM at their heart, so aligning that LLM is still important. Things like RLHF, RLAIF, deliberative alignment, steering vectors, fine tuning, etc. are all relevant. And the other not-strictly-alignment parts of prosaic alignment like mechanistic interpretability, behavioral tests for alignment, capabilities testing, control, etc. remain relevant.

(They might be even more relevant if we lose faithful chain of thought, or if it was always an illusion. I haven't processed that paper, and it will take some processing to weigh it against the case that CoT unfaithfulness is overstated.)

As for the "core mechanism of doom," I do think that convergent power-seeking is real and very dangerous, but I also think it's not inevitable. I don't think we're doomed.

I found it very interesting that the two most compelling scenarios for LLM agent doom I'm aware of, Takeover in 2 Years and yesterday's AI 2027, are both basically stories in which the teams don't really try very hard or think very hard at all about alignment. I found those scenarios distressingly plausible; I think that's how history hinges, all too frequently.

But what if the teams in scenarios like that did think just a little harder about alignment? I think they might well have succeeded pretty easily. There are lots of things that could've happened to cause more focus on what I think of as actual alignment: creating human-aligned goals/values in an artificial intelligence that has goals/values.

Those teams didn't really bother thinking of it in those terms! They mostly neglected alignment too long, until they'd created an entity that did want things. The humans didn't really even try to control what those entities wanted. They just treated them as tools until they weren't tools anymore.

Getting ahead of the game might be pretty easy in scenarios like those and the ones I'm thinking of for LLM agent dangers. If the team is focusing on human-aligned goals BEFORE they "crystallize" or the agent becomes smart enough for even vague goals to make a big difference, it might be relatively easy. 

Or it might be dreadfully hard, effectively impossible. The abstract arguments for alignment difficulty really don't seem adequate to say which.

 

Anyway, those are the things I'm thinking about for the next post. 

More directly relevant to this post: both of those fictional scenarios do seem to include online learning.  In the first it is crucial for the alignment failure; in AI 2027  it is not important, but only because the initial state is misaligned.  Those and similar scenarios don't usually emphasize the importance of online learning,  which is why I'm trying to make its importance explicit.

Comment by Seth Herd on How AI Takeover Might Happen in 2 Years · 2025-04-04T04:36:12.434Z · LW · GW

I think the assumption here that AIs are "learning to train themselves" is important. In this scenario they're producing the bulk of the data.

I also take your point that this is probably correctable with good training data. One premise of the story here seems to be that the org simply didn't try very hard to align the model. Unfortunately, I find this premise all too plausible. Fortunately, this may be a leverage point for shifting the odds. "Bother to align it" is a pretty simple and compelling message.

Even with the data-based alignment you're suggesting, I think it's still not totally clear that weird chains of thought couldn't take it off track.

Comment by Seth Herd on You will crash your car in front of my house within the next week · 2025-04-02T05:53:02.261Z · LW · GW

I accept your statistics and assume I'll be driving my car. Damn.

OTOH, I can be pretty certain I won't die or be seriously injured.

That has never happened in my thousands of weeks, so statistically, it almost certainly won't within the next week!

Comment by Seth Herd on PauseAI and E/Acc Should Switch Sides · 2025-04-02T01:37:13.325Z · LW · GW

I'm on board! We needed people going fast to get seatbelts!

AI safety isn't a game, which means you'll be disappointed in yourself (if only very briefly) if you fail to play your best to win. The choice between risky 3D-chess moves and virtue ethics is not obvious.

Comment by Seth Herd on VDT: a solution to decision theory · 2025-04-02T01:32:02.001Z · LW · GW

Still laughing.

Thanks for admitting you had to prompt Claude out of being silly; lots of bot results neglect to mention that methodological step.

This will be my reference for all decision theory discussions henceforth.

Have all of my 40-some strong upvotes!

Comment by Seth Herd on Insect Suffering Is The Biggest Issue: What To Do About It · 2025-04-01T15:57:44.746Z · LW · GW

Same as the solution to every other problem:

Solve alignment, then the aligned ASI will solve it.

If you do the math in any reasonable way, this is the odds-on best approach to every problem.

If you think EA is already overinvested in this approach, I'd argue they're still massively underinvested. Only a tiny fraction of people are working on literally the best way to solve every problem. And only a fraction of those are bothering to work on it in an even arguably optimal way.

Comment by Seth Herd on Insect Suffering Is The Biggest Issue: What To Do About It · 2025-04-01T15:53:38.380Z · LW · GW

I don't know that I'd want to see insanity wolf weigh in on every LW discussion, but that got a good LOL out of me.

Comment by Seth Herd on We’re not prepared for an AI market crash · 2025-04-01T12:22:20.839Z · LW · GW

Huh? Yes, we're unprepared to capitalize on a crash - because how would we? This post doesn't say how one might do that. It seems you've got ideas, but why write this if you weren't going to say what they are, or what you want us to do or think about?

Comment by Seth Herd on Going Nova · 2025-04-01T06:09:24.957Z · LW · GW

Any physical system exhibiting exactly the same input-output mappings. Across all inputs and outputs. Short of that, imitation is a real possibility - particularly among LLMs that are trained to predict human responses.

I agree that there's something nontrivially "conscious" in a system like Nova; but that's not a good argument for it.

Agreed that this is going to get dramatic. There will be arguments and both sides will make good points.

Comment by Seth Herd on Daniel Tan's Shortform · 2025-03-29T17:49:28.496Z · LW · GW

Given the realities of social media, I'd guess it's mostly 2, and some more directly deceptive omission of prompting in that direction.

Comment by Seth Herd on Daniel Tan's Shortform · 2025-03-29T17:11:09.982Z · LW · GW

So we assume that the prompts contained most of the semantics for those other pieces, right? I saw a striking one without the prompt included and figured it was probably prompted in that direction.

Comment by Seth Herd on Elizabeth's Shortform · 2025-03-27T18:03:34.414Z · LW · GW

It's looped back to cool for me, I'm going to check it out.

Comment by Seth Herd on AI companies should be safety-testing the most capable versions of their models · 2025-03-27T15:21:12.358Z · LW · GW

The most capable version of each model has not yet been created when the model is released. As well as fine-tuning for specific tasks, scaffolding matters. The agentic scaffolds people create play an increasingly important role in a model's ultimate capability.

Comment by Seth Herd on What life will be like for humans if aligned ASI is created · 2025-03-27T15:14:21.499Z · LW · GW

Suit yourself, but I happen to want to create many great continuations. I enjoy hearing about other people's happiness. I enjoy it more the better I understand them. I understand myself pretty well.

But I don't want to be greedy. I'm not sure a lot of forks of each person are better than making more new people.

Let me also mention that it's probably possible to merge forks. Simply averaging the weight changes in your simulated cortex and hippocampus will approximately work to share the memories across two forks. How far out that works before you start to get significant losses is an empirical matter. Clever modifications to the merge algorithm and additions to my virtual brain should let us extend that substantially; sharing memories across people is possible in broad form with really good translation software, so I expect we'll do that, too.
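
To make "averaging the weight changes" concrete, here's a toy sketch (purely illustrative, treating each fork as parameter arrays that have diverged from a shared base snapshot; real emulations would be far more complex):

```python
import numpy as np

def merge_forks(base, fork_a, fork_b):
    """Average each fork's deltas from the shared base and apply them to the base."""
    merged = {}
    for name, base_w in base.items():
        delta_a = fork_a[name] - base_w
        delta_b = fork_b[name] - base_w
        merged[name] = base_w + 0.5 * (delta_a + delta_b)
    return merged

# Example with toy "cortex" and "hippocampus" parameter blocks.
base = {"cortex": np.zeros(4), "hippocampus": np.zeros(2)}
fork_a = {"cortex": np.array([1.0, 0, 0, 0]), "hippocampus": np.array([0.5, 0])}
fork_b = {"cortex": np.array([0, 2.0, 0, 0]), "hippocampus": np.array([0, 1.0])}
print(merge_forks(base, fork_a, fork_b))
```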

So in sum, life with aligned ASI would be incredibly awesome. It's really hard to imagine or predict exactly how it will unfold, because we'll have better ideas as we go.

WRT "cheapening" the experience, remember that we'll be able to twist the knobs in our brain for boredom and excitement if we want. I imagine some would want to do that more than others. Grand triumph and struggle will be available for simulated competitive/cooperative challenges; sometimes we'll know we're in a simulation and sometimes we'll block those memories to make it temporarily seem more real and important.

BUT this is all planning the victory party before fighting the war. Let's figure out how we can maximize the odds of getting aligned ASI by working out the complex challenges of getting there on both technical and societal levels.

Comment by Seth Herd on Policy for LLM Writing on LessWrong · 2025-03-27T00:34:40.569Z · LW · GW

Interesting! Do you think humans could pick up on word use that well? My perception is that humans mostly cue on structure to detect LLM slop writing, and that is relatively easily changed with prompts (although it's definitely not trivial at this point - but I haven't searched for recipes).

I did concede the point, since the research I was thinking of didn't use humans who've practiced detecting LLM writing.

Comment by Seth Herd on Policy for LLM Writing on LessWrong · 2025-03-27T00:03:05.705Z · LW · GW

I concede the point. That's a high bar for getting LLM submissions past you. I don't know of studies that tested people who'd actually practiced detecting LLM writing.

I'd still be more comfortable with a disclosure criterion of some sort, but I don't have a great argument beyond valuing transparency and honesty.

Comment by Seth Herd on They Took MY Job? · 2025-03-26T00:36:51.081Z · LW · GW

I read it too and had no such thought. I think that loose, poetic, free-association type of writing is hard for humans and easy for LLMs.

Comment by Seth Herd on Policy for LLM Writing on LessWrong · 2025-03-25T20:56:46.681Z · LW · GW

That's a good point and it does set at least a low bar of bothering to try.

But they don't have to try hard. They can almost just append the prompt with "and don't write it in standard LLM style".

I think it's a little more complex than that, but not much. Humans can't tell LLM writing from human writing in controlled studies. The question isn't whether you can hide the style, or even whether it's hard - just how easy it is.

Which raises the question of whether they'd even do that much, because of course they haven't read the FAQ before posting.

Really just making sure that new authors read SOMETHING about what's appreciated here would go a long way toward reducing slop posts.

Comment by Seth Herd on Policy for LLM Writing on LessWrong · 2025-03-25T20:38:13.673Z · LW · GW

If you wrote the whole thing, then prompted Claude to rewrite it, that would seem to "add significant value." If you then read the whole thing carefully to say "that's what I meant, and it didn't make anything up I'm not sure about", then you've more than met the requirement laid out here, right?

They're saying the second part is all you have to do. If you had some vague prompt like "write an essay about how the field of alignment is misguided" and then proofread it, you've met the criteria as laid out. So if your prompt was essentially the complete essay, you've gone far beyond their standards, it seems like.

I personally would want to know that the author contributed much more than a vague prompt to get the process rolling, but that seems to be the standard for acceptance laid out here. I assume they'd prefer much more involvement on the prompting side, like you're talking about doing.