Posts

Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI 2024-11-12T18:23:53.533Z
Intent alignment as a stepping-stone to value alignment 2024-11-05T20:43:24.950Z
"Real AGI" 2024-09-13T14:13:24.124Z
Conflating value alignment and intent alignment is causing confusion 2024-09-05T16:39:51.967Z
If we solve alignment, do we die anyway? 2024-08-23T13:13:10.933Z
Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term 2024-08-12T18:10:56.543Z
Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours 2024-08-05T15:38:09.682Z
What's a better term now that "AGI" is too vague? 2024-05-28T18:02:31.371Z
Anthropic announces interpretability advances. How much does this advance alignment? 2024-05-21T22:30:52.638Z
Instruction-following AGI is easier and more likely than value aligned AGI 2024-05-15T19:38:03.185Z
Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on Akash's Shortform · 2024-11-20T22:07:20.565Z · LW · GW

One factor is different incentives for decision-makers. The incentives (and the mindset) for tech companies are to move fast and break things. The incentives (and mindset) for government workers are usually vastly more conservative.

So if it is the government making decisions about when to test and deploy new systems, I think we're probably far better off WRT caution.

That must be weighed against the government typically being very bad at technical matters. So even an attempt to be cautious could be thwarted by lack of technical understanding of risks.

Of course, the Trump administration is attempting to instill a vastly different mindset, more like tech companies. So if it's that administration we're talking about, we're probably worse off on net with a combination of lack of knowledge and YOLO attitudes. Which is unfortunate - because this is likely to happen anyway.

As Habryka and others have noted, it also depends on whether it reduces race dynamics by aggregating efforts across companies, or mostly just throws funding fuel on the race fire.

Comment by Seth Herd on Akash's Shortform · 2024-11-20T21:57:28.354Z · LW · GW

It's not every post, but there are still a lot of people who think that alignment is very hard.

The more common view is that we should assume alignment isn't trivial, because an intellectually honest assessment of the range of opinions suggests that we collectively do not yet know how hard alignment will be.

Comment by Seth Herd on China Hawks are Manufacturing an AI Arms Race · 2024-11-20T19:41:16.956Z · LW · GW

This is really important pushback. This is the discussion we need to be having.

Most people who are trying to track this believe China has not been racing toward AGI up to this point. Whether they embark on that race is probably being determined now - and based in no small part on the US's perceived attitude and intentions.

Any calls for racing toward AGI should be closely accompanied with "and of course we'd use it to benefit the entire world, sharing the rapidly growing pie". If our intentions are hostile, foreign powers have little choice but to race us.

And we should not be so confident we will remain ahead if we do race. There are many routes to progress other than sheer scale of pretraining. The release of DeepSeek r1 today indicates that China is not so far behind. Let's remember that while the US "won" the race for nukes, our primary rival had nukes very soon after - by stealing our advancements. A standoff between AGI-armed US and China could be disastrous - or navigated successfully if we take the right tone and prevent further proliferation (I shudder to think of Putin controlling an AGI, or many potentially unstable actors).

This discussion is important, so it needs to be better. This pushback is itself badly flawed. In calling out the report's lack of references, it provides almost none itself. Citing a 2017 official statement from China seems utterly irrelevant to guessing their current, privately held position. Almost everyone has updated massively since 2017. If China is "racing toward AGI" as an internal policy, they probably would've adopted that recently. (I doubt that they are racing yet, but it seems entirely possible they'll start now in response to the US push to do so - and their perspective on the US as a dangerous aggressor on the world stage. But what do I know - we need real experts on China and international relations.)

Pointing out the technical errors in the report seems somewhere between irrelevant and harmful. You can understand very little of the details and still understand that AGI would be a big, big deal if true — and the many experts predicting short timelines could be right. Nitpicking the technical expertise of people who are probably correct in their essential assessment just sets a bad tone of fighting/arguing instead of having a sensible discussion.

And we desperately need a sensible discussion on this topic.

Comment by Seth Herd on How can we prevent AGI value drift? · 2024-11-20T19:16:15.327Z · LW · GW

I gave part of my answer in the thread where you first asked this question. Here's the rest.

TLDR: Value alignment is too hard even without the value stability problem. Goal-misspecification is too likely (I realize I don't know the best ref for this other than LoL - anyone else have a better central ref?). Therefore we'll very likely align our first AGIs to follow instructions, and use that as a stepping-stone to full value alignment.

This is something I used to worry about a lot. Now it's something I don't worry about at all.

I wrote a paper on this, Goal changes in intelligent agents, back in 2018 for a small FLI grant (in perhaps the first round of public funding for AGI x-risk). One of my first posts on LW was The alignment stability problem.

I still think this would be a very challenging problem if we were designing a value-aligned autonomous AGI. Now I don't think we're going to do that. 

I now see goal mis-specification as a very hard problem, and one we don't need to tackle to create autonomous AGI or even superintelligence. Therefore I think we won't.

Instead we'll make the central goal of our first AGIs to follow instructions or to be corrigible (correctable).

It's counterintuitive to think of a highly intelligent and fully autonomous being that wants more than anything to do what a less intelligent human tells them to do. But I think it's completely possible, and a much safer option for our first AGIs.

This is much simpler than trying to instill our values with such accuracy that we'd be happy with the result. Neither showing examples of things we like (as in RL training) nor explicitly stating our values in natural language seems likely to be accurate enough after it's been interpreted by a superintelligent AGI that is likely to see the world at least somewhat differently than we do. That sort of re-interpretation is functionally similar to value drift, although it's separable. Adding the problem of actual value drift on top of the dangers of goal misspecification just makes things worse.

Aligning an AGI to follow instructions isn't trivial either, but it's a lot easier to specify than getting values right and stable. For instance, LLMs already largely "know" what people tend to mean by instructions - and that's before the checking phase of do what I mean and check (DWIMAC). 

Primarily, though, instruction-following has the enormous advantage of allowing for corrigibility - you can tell your AGI to shut down to accept changes, or issue new revised instructions if/when you realize (likely because you asked the AGI) that your instructions would be interpreted differently than you'd like.

If that works and we get superhuman AGI aligned to follow instructions, we'll probably want to use that AGI to help us solve the problem of full value alignment, including solving value drift. We won't want to launch an autonomous AGI that's not corrigible/instruction-following until we're really sure our AGIs have a robust solution. (This is assuming we have those AGIs controlled by humans who are ethical enough to release control of the future into better hands once they're available - a big if).

Comment by Seth Herd on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T01:23:34.684Z · LW · GW

This might work. Let's remember the financial incentives. Exposing a non-aligned CoT to all users is pretty likely to generate lots of articles about how your AI is super creepy, which will create a public perception that your AI in particular is not trustworthy relative to your competition.

I agree that exposing it would be better from an alignment perspective; I'm just noting the incentives on AI companies.

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-19T23:05:39.360Z · LW · GW
  1. We die (don't fuck this step up!:)
    1. Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
  2. We die (don't let your AGI fuck this step up!:)
    1. 22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
  3. The endgame is to use Intent alignment as a stepping-stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.

Comment by Seth Herd on Training AI agents to solve hard problems could lead to Scheming · 2024-11-19T22:54:41.603Z · LW · GW

I agree with all of those points locally.

To the extent people are worried about LLM scaleups taking over, I don't think they should be.

We will get nice instruction-following tool AIs.

But the first thing we'll do with those tool AIs is turn them into agentic AGIs. To accomplish any medium-horizon goals, let alone the long-horizon ones we really want help with, they'll need to do some sort of continuous learning, make plans (including subgoals), and reason in novel sub-domains.

None of those things are particularly hard to add. So we'll add them. (Work is underway on all of those capacities in different LLM agent projects).

Then we have the risks of aligning real AGI.

That's why this post was valuable. It goes into detail on why and how we'll add the capacities that will make LLM agents much more useful but also add the ability and instrumental motivation to do real scheming.

I wrote a similar post to the one you mention, Cruxes of disagreement on alignment difficulty. I think understanding the wildly different positions on AGI x-risk among different experts is critical; we clearly don't have a firm grasp on the issue, and we need it ASAP. The above is my read on why TurnTrout, Pope and co are so optimistic - they're addressing powerful tool AI, and not the question of whether we develop real AGI or how easy that will be to align.

FWIW I do think that can be accomplished (as sketched out in posts linked from my user profile summary), but it's nothing like easy or default alignment, as current systems and their scaleups are.

I'll read and comment on your take on the issue.

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-19T19:06:24.656Z · LW · GW

That would be great. Do reference scalable oversight to show you've done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-19T18:51:11.996Z · LW · GW

This is a great question. I think what we can do is spread good logic about AGI risks. That is tricky. Outside of the LW audience, getting the emotional resonance right is more important than being logically correct. And that's a whole different skill.

My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences. Leahy is better but may also be making things worse by occasionally losing his cool and coming off as a bit of an asshole. People then associate the whole idea of AI safety with "these guys who talk down to us and seem mean and angry". Then motivated reasoning kicks in and they're oriented to trying to prove them wrong instead of discover the truth.

That doesn't mean logical arguments don't count with normies; they do. But the logic comes into play much more when emotional processing hasn't already tagged you as dangerous or an enemy.

So just repeating the basic arguments of "something smarter will treat us like we do animals by default" and "surely we all want the things we love now to survive AGI" while also being studiously nice is my best guess at the right approach.

I struggle to do this myself; it's super frustrating to repeatedly be in conversations where people seem to be obstinately refusing to think about some pretty basic and obvious logic.

Maybe the logic will win out even if we're not able to be nice about it, but I'm quite sure it will win out faster if we can be.

Repetition counts. Any worriers with any access to public platforms should probably be speaking publicly about this - as long as they're trying hard to be nice.

Edit: to bring it back to this particular type of scenario: when someone says "let it rip, I don't care if the winners aren't human!", that is the most important time to be nice and get curious instead of pointing out how stupid this take is. Just asking questions is going to lead most people to realize that actually they do value human-like consciousness and pleasant experiences, not just progress and competition in a disneyland without children (ref at the end).

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-19T17:58:38.994Z · LW · GW

Ah, yes. That is quite a set of jailbreak techniques. When I say "alignment is solved for LLM agents", I mean something different than what people mean by alignment for LLMs themselves.

I'm using alignment to mean AGI that does what its user wants. You are totally right that there's an edge case and a problem if the principal "user", the org that created this AGI, wants to sell access to others and have the AGI not follow all of those users' instructions/desires. Which is exactly what they'll want.

More in the other comment. I haven't worked this through. Thanks for pointing it out.

This might mean that an org that develops LLM-based AGI systems can't really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they'll put in a bunch of stopgap jailbreak prevention measures and hope they're adequate when they won't be.

I need to think more about this.

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-19T17:26:39.163Z · LW · GW

This isn't something I've thought about adequately.

I think LLM agents will almost universally include a whole different mechanism that can prevent jailbreaking: internal independent review, in which there are calls to a different model instance to check whether proposed plans and actions are safe (or waste time and money).

Once agents can spend people's money or damage their reputation, we'll want to have them "think through" the consequences of important plans and actions before they execute.

As long as you're engineering that and paying the compute costs, you might as well use it to check for harms as well - including checking for jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account.

I don't know how adequate that will be, but it will help.
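Here's a minimal sketch of what that review step could look like inside an agent loop. It assumes a hypothetical `call_model` helper for querying a separate model instance, and the prompt wording and verdict labels are illustrative, not any particular lab's implementation.

```python
# Minimal sketch of an internal independent review step for an LLM agent.
# `call_model` is a hypothetical helper for querying a *separate* model
# instance; swap in whatever client you actually use.

def call_model(prompt: str) -> str:
    """Hypothetical call to an independent reviewer model instance."""
    raise NotImplementedError

def review_action(user_instructions: str, proposed_action: str) -> str:
    """Have a separate model instance vet a proposed action before execution."""
    review_prompt = (
        "You are a reviewer, independent of the agent that proposed this action.\n"
        f"User instructions: {user_instructions}\n"
        f"Proposed action: {proposed_action}\n"
        "Reply APPROVE, REVISE, or ESCALATE with a one-sentence reason. "
        "Escalate anything harmful, wasteful, or that looks like a jailbreak attempt."
    )
    verdict = call_model(review_prompt)
    if verdict.startswith("ESCALATE"):
        # e.g. clear the agent's context, suspend the account, or call for human review
        return "escalated"
    if verdict.startswith("REVISE"):
        return "returned_for_revision"
    return "approved"
```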

This is probably worth thinking more about; I've sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.

Comment by Seth Herd on What (if anything) made your p(doom) go down in 2024? · 2024-11-19T17:22:08.030Z · LW · GW

TBC, I don't think it will slow progress all that much; there are other routes to improvement. I guess I didn't express the biggest reason this shifts my p(doom) a little: it's a slower takeoff, giving more time for the reality of the situation to sink in before we have takeover capable AGI. I think we'll still hit near-human LLM agents on schedule (1-2 years) by scaffolding next-gen LLMs boosted with o1 style training.

I'm really hoping that the autonomy of these systems will impact people emotionally, creating more and better policy thinking and alignment work on those types of AGIs. I think the rate of approach to AGI is more important than the absolute timelines; we'll see ten times the work on really relevant policy and alignment once we see compelling evidence of the type of AGI that will be transformative and dangerous.

I've heard enough credible-sounding rumors to give > 50% that they're true. This is partly a product of this result fitting my theory of why LLMs work so well. While they are predictors, what they're learning from human text is mostly to copy human intelligence. Moving past that will be slower.

Do you mean we're waiting until 2026/27 for results of the next scaleup? If this round (GPT-5, Claude 4, Gemini 2.0) shows diminishing returns, wouldn't we expect that the next will too?

Comment by Seth Herd on How to use bright light to improve your life. · 2024-11-19T05:06:10.290Z · LW · GW

Great post, thank you!

SAD: When I did a very brief lit search, the research showed much larger effects of vitamin D supplementation than light exposure therapy. Of course, they weren't using enough dakka on the light, so both should be used. But two of my close friends with severe SAD were dramatically improved when I got them to supplement D regularly. It's handy that you don't need to take it regularly, just in large doses occasionally (probably don't do more than 50k IU at a time for safety). Sorry I didn't keep the references where I can find them!

Again, doing both is probably a good idea, but most people seem to be vit. D deficient, as you'd expect from a light-exposure-synthesized vitamin, with all of this modern unnatural clothes-wearing and indoors-dwelling.

Back to light: as the standard male night owl (particularly on a WFH flexible schedule): Am I understanding you correctly that if I wanted to go to bed earlier (not sure I do but I probably should), I'd wake up earlier and blast my eyeballs with light right away, then avoid bright light 3-4 hours before bed? Anything else?

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-19T04:30:32.912Z · LW · GW

I think you're assuming a sharp line between sincere ethics motivations and self-interest. In my view, that doesn't usually exist. People are prone to believe things that suit their self-interest. That motivated reasoning is the biggest problem with public discourse. People aren't lying, they're just confused. I think Musk definitely, and probably even Altman, believe they're doing the best thing for humanity - they're just confused and not taking the effort to get un-confused.

I'm really sorry all of that happened to you. Capitalism is a harsh system, and humans are harsh beings when we're competing. And confused beings, even when we're trying not to be harsh. I didn't have time to go through your whole story, but I fully believe you were wronged.

I think most villains are the heroes of their own stories. Some of us are more genuinely altruistic than others - but we're all confused in our own favor to one degree or another.

So reducing confusion while playing to everyone's desire to be a hero is one route to survival.

Comment by Seth Herd on What are Emotions? · 2024-11-19T04:18:11.037Z · LW · GW

I'm so glad you found that response helpful!

I primarily mean reward in the sense of reinforcement - a functional definition from animal psychology and neuroscience: reinforcement is whatever makes the previous behavior more likely in the future.

But I also mean a positive feeling (qualia if you like, although I find that term too contentious to use much). I think we have a positive feeling when we're getting a reward (reinforcement), but I'm not sure that all positive feelings work as reinforcement. Maybe.

As to how deep that recursive learning mechanism can go: very deep. When people spend time arguing about logic and abstract values online, they've gone deep. There's no limit - until the world intervenes to tell you your chain of predicted-reward inferences has gone off-track. For instance, if that person has lost their job, and they're cold and hungry, they might track down the (correct) logic that they ascribed too much value to proving people wrong on the internet, and reduce their estimate of its value.

Comment by Seth Herd on Training AI agents to solve hard problems could lead to Scheming · 2024-11-19T03:45:40.057Z · LW · GW

Question 1:

I stay in the loop when my AGI is solving hard problems. Absolutely it will need persistent goals, new reasoning, and continuous learning to make progress. That changing mind opens up The alignment stability problem as you note in your comment on the other thread. My job is making sure it's not going off the rails WRT my intent as it works. 

People will do this by default. Letting it run for any length of time without asking questions about what it's up to would be both very expensive, and beyond the bounds of patience and curiosity for almost any human. I instructed it to cure cancer, but I'm going to keep asking it how it's planning to do that and what progress it's making. My important job is asking it about its alignment continually as it learns and plans. I'm frequently asking if it's had ideas about scheming to get its (sub)goal accomplished (while of course re-iterating the standing instructions to tell me the whole truth relevant to my requests). Its alignment is my job, until it's so much smarter than me, and understands my intent so clearly, that I can trust it to keep itself aligned.

Question 2:

Yes, instruction-following should be helpful-only. Giving a bunch of constraints on the instructions it will follow adds risk that it won't obey your instructions to shut down or amend its goals or its understanding of previous instructions. That's the principal advantage of corrigibility. Max Harms lays out this logic in much more compelling detail.

Yes, this definitely opens up the prospect of misuse, and that is terrifying. But this is not only the safer early route, it's the one AGI project leaders will choose - because they're people who like power.

An org that's created instruction-following AGI would have it follow instructions only from one or a few top "principals". They would instruct it to follow a limited set of instructions from any users they license its instances to. Some of those users would try to jailbreak it to follow dangerous instructions.

And having even a few humans from different groups (e.g., rival governments) fully in charge of real AGIs would be terrifying. Much more on this in [If we solve alignment, do we die anyway?](https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1).

(Provisional current answer after that discussion: it becomes a tricky political negotiation. Who knows. But maybe.)

I favor this approach far above just giving a list of goals and side constraints and hoping they're not too badly interpreted. The reasoning in your post pretty much describes what I'm worried about - although there are also some less intuitive but logically sound misinterpretations of lots of goals that might come into play as the AGI becomes thoroughly superhuman and does more of its own thinking.

To your final observation:

Yes, some outcome-based RL is probably inevitable. o1 was probably trained that way, and others will follow. Let us hope it is not too much to overwhelm the non-instrumental training and the explicit goal of following instructions. And let us figure out how much is likely to be too much before we learn the hard way.

Comment by Seth Herd on Project Adequate: Seeking Cofounders/Funders · 2024-11-19T03:18:14.207Z · LW · GW

I missed this until I finally got around to responding to your last post, which I'd put on my todo list.

I applaud your initiative and drive! I do think it's a tough pitch to try to leapfrog the fast progress in deep networks. Nor do I think the alignment picture for those types of systems is nearly as bleak as Yudkowsky & the old school believe. But neither is it likely to be easy enough to leave to chance and those who don't fully grasp the seriousness of the problem. I've written about some of the most common Cruxes of disagreement on alignment difficulty.

So I'd suggest you would have better odds working within the ML framework that is happening with or without your help. I also think that even if you do produce a near-miraculous breakthrough in symbolic GOFAI, Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc.

OTOH, if you have a really insightful approach, and a good reason to think the result would be easier to align than language model agents, maybe pursuing that path makes sense, since no one else is doing exactly that. As I said in my comment on your last request for directions, I think there are higher-expected-value nearly-as-underserved routes to survival; namely, working on alignment for the LLM agents that are our most likely route to first AGIs at this point (focusing on different routes from aligning the base LLMs, which is common but inadequate).

I'm also happy to talk. Your devotion to the project is impressive, and a resource not to be wasted!

Comment by Seth Herd on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-11-19T02:56:55.680Z · LW · GW

Did you make any progress on choosing a course? My brief pitch is this: LLM agents are our most likely route to AGI, and particularly likely in short timelines. Aligning them is not the same as aligning the base LLMs. Yet almost no one is working on bridging that gap.

That's what I'm working on. More can be found in my user profile.

I do think this is high prospective impact. I'm not sure what you mean by low prospective risk. I think the work has good odds of being at least somewhat useful, since it's so neglected and it's pretty commonly agreed that language model agents (or foundation model agents or LLM cognitive architectures) are a pretty likely path to first AGI.

I'm happy to talk more. I meant to respond here sooner.

Comment by Seth Herd on What (if anything) made your p(doom) go down in 2024? · 2024-11-19T02:50:51.357Z · LW · GW

The recent rumors about slowed progress in large training runs have reduced my p(doom). More time to prepare for AGI raises our odds. This probably won't be a large delay. This is combined with the observation that inference-time compute does also scale results, but it probably doesn't scale them that fast - the graph released with o1 preview didn't include units on the cost/compute axis.

More than that, my p(doom) went steadily down as I kept contemplating instruction-following as the central alignment goal. I increasingly think it's the obvious thing to try once you're actually contemplating launching an AGI that could become smarter than you; and it's a huge benefit to any technical alignment scheme, since it offers the advantages of corrigibility, allowing you to correct some alignment errors.

More on that logic in Instruction-following AGI is easier and more likely than value aligned AGI.

Comment by Seth Herd on Training AI agents to solve hard problems could lead to Scheming · 2024-11-19T02:16:02.566Z · LW · GW

Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.

This would only prevent scheming to accomplish the consequentialist goals you've instructed your AGI to pursue if it were also used to give side constraints like "don't lie to me" and to spend lots of time carefully exploring its theories on what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.

I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.

This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.

Comment by Seth Herd on Training AI agents to solve hard problems could lead to Scheming · 2024-11-19T02:07:39.212Z · LW · GW

Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally submitted it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.

I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.

I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.

Comment by Seth Herd on Training AI agents to solve hard problems could lead to Scheming · 2024-11-19T01:31:14.173Z · LW · GW

To summarize:

  • AGI will have goals, because we want it to accomplish things for us
  • Scheming is often helpful in accomplishing goals
  • AGIs will realize they should scheme, then scheme.

Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.

I am also still a bit confused as to why this careful presentation is useful. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-informed people who say things like "there's no empirical evidence for scheming in AIs" in all seriousness. So I'd like to understand that perspective better.

While I don't fully understand the perspective that needs to be convinced that scheming is likely, I do have some guesses. I think on the whole it stems from understanding current AI systems well, and reasoning from there. Current systems do not really scheme; they lack the capacity. Those who reason by analogy with humans or with fictional or hypothetical generally superintelligent AI see scheming as extremely likely from a misaligned AGI, because they're assuming it will have all the necessary cognitive capacities.

There are more nuanced views, but I think those are the two starting points that generate this dramatic difference in opinions.

Some more specific common cruxes of disagreement on scheming likelihood:

  • Will AGI have consequentialist goals?
  • Will AGI be able to do novel reasoning?
  • Will AGI be able to do long-term planning?
  • Will AGI be able to learn continuously from experience?

I see the answers to all of these questions as being obviously, inevitably yes by default; all of these are useful, so we will keep building toward AGI with all of these capacities if nothing stops us. Having extremely useful transformative limited AGI (like super-foundation models) would not stop us from building "real AGI" with the above properties.

I've tried to convey why those properties seem so inevitable (and actually rather easy to add from here) in "Real AGI", Steering subsystems: capabilities, agency, and alignment, and Sapience, understanding, and "AGI", among snippets in other places. I'm afraid none of them is as clear or compelling as I'd like from the perspective of someone who starts reasoning from current AI and asks why or how we would include those dangerous properties in our future AGIs.

That's why I'm glad you guys are taking a crack at it in a more careful and expansive way, and from the perspective of how little we'd need to add to current systems to make them solve important problems, and how that gives rise to scheming. I'll be referencing this post on this point.

Edit note: Most of this was written after an accidental premature submit ctrl-return action.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-18T20:14:21.241Z · LW · GW

What I'm saying is that the people you mention should put a little more time into it. When I've been involved in philosophy discussions with academics, people tend to treat it like a fun game, with the goal being more to score points and come up with clever new arguments than to converge on the truth.

I think most of the world doesn't take philosophy seriously, and they should.

I think the world thinks "there aren't real answers to philosophical questions, just personal preferences and a confusing mess of opinions". I think that's mostly wrong; LW does tend to cause convergence on a lot of issues for a lot of people. That might be groupthink, but I held almost identical philosophical views before engaging with LW - because I took the questions seriously and was truth-seeking.

I think Musk or Page are fully capable of LW-style philosophy if they put a little time into it - and took it seriously (were truth-seeking).

What would change people's attitudes? Well, I'm hoping that facing serious questions in how we create, use, and treat AI does cause at least some people to take the associated philosophical questions seriously.

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-18T18:42:13.970Z · LW · GW

My pleasure. Evan Hubinger made this point to me when I'd misunderstood his scalable oversight proposal.

Thanks again for engaging with my work!

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-18T18:33:42.696Z · LW · GW

Thanks for reading, and responding! It's very helpful to know where my arguments cease being convincing or understandable.

I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.

Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don't work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we'll probably use in Internal independent review for language model agent alignment. I'm not happy with the clarity of that post, though, so I'm currently working on two followups that might be clearer.

Or perhaps the missing link is going from aligned AI systems to aligned "Real AGI". I do think there's a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned - IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.

So that's how I get to the first aligned AGI at roughly human level or below.

From there it seems easier, although still possible to fail.

If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.

I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there's a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-18T18:17:39.422Z · LW · GW

Very interesting. This does imply that Page was pretty committed to this view.

Note that he doesn't explicitly state that non-sentient machine successors would be fine; he could be assuming that the winning machines would be human-plus in all ways we value.

I think that's a foolish thing to assume and a foolish aspect of the question to overlook. That's why I think more careful philosophy would have helped resolve this disagreement with words instead of a gigantic industrial competition that's now putting us all at risk.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-18T18:14:07.989Z · LW · GW

Maybe Page does believe that. I think it's nearly a self-contradictory position, and that Page is a smart guy, so with more careful thought, his beliefs are likely to converge on the more common view here on LW: replacing humanity might be OK only if our successors are pretty much better at enjoying the world in the same way we do.

I think people who claim to not care whether our successors are conscious are largely confused, which is why doing more philosophy would be really valuable.

Beff Jezos is exactly my model. Digging through his writings, I found him at one point explicitly state that he was referring to machine offspring with some sort of consciousness or enjoyment when he says humanity should be replaced. In other places he's not clear on it. It's bad philosophy, because the philosophy is taking a backseat to the arguments.

This is why I want to assume that Page would converge to the common belief: so we don't mark people who seem to disagree with us as enemies, and drive them away from doing the careful, collaborative thinking that would get our beliefs to converge.

Addendum on why I think beliefs on this topic converge with additional thought: I don't think there's a universal ethics, but I do think that humans have built-in mechanisms that tend to make us care about other humans. Assuming we'd care about something that acts sort of like a sentient being, but internally just isn't one, is an easy mistake to make without managing to imagine that scenario in adequate detail.

Comment by Seth Herd on What are Emotions? · 2024-11-17T23:35:37.905Z · LW · GW

I think you're primarily addressing reward signals or reinforcement signals. These are, by definition, signals that make behavior preceding them more likely in the future. In the mammalian brain, they define what we pursue.

Other emotions are different; back to them later.

The dopamine system appears to play this role in the mammalian brain. It's somewhat complex, in that new predictions of future rewards seem to be the primary source of reinforcement for humans; for instance, if someone hands me a hundred dollars, I have a new prediction that I'll eat food, get shelter, or do something that in turn predicts reward; so I'll repeat whatever behavior preceded that, and I'll update my predictions for future reward.
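To make the reward-prediction story concrete, here's a toy temporal-difference sketch of how a new prediction of future reward can itself act as the reinforcement signal. The state names, numbers, and parameter values are made up purely for illustration; this is the textbook TD formulation, not a claim about the specific form the brain uses.

```python
# Toy temporal-difference update: the reinforcement signal is the prediction
# error, so a jump in *predicted* future reward (being handed $100) reinforces
# the preceding behavior even before any primary reward (food, shelter) arrives.
# All state names and values here are illustrative.

gamma = 0.9   # discount on future reward
alpha = 0.1   # learning rate

value = {"before_gift": 0.0, "holding_100_dollars": 5.0}  # current reward predictions

def td_update(prev_state: str, reward: float, new_state: str) -> float:
    """Return the prediction error that acts as the reinforcement signal."""
    delta = reward + gamma * value[new_state] - value[prev_state]
    value[prev_state] += alpha * delta
    return delta

# No primary reward yet, but the rise in predicted reward still reinforces
# whatever behavior preceded it:
print(td_update("before_gift", reward=0.0, new_state="holding_100_dollars"))  # 4.5
```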

For way more than you want to know about how dopamine seems to shape our actions, see my paper Neural mechanisms of human decision-making and the masses of work it references.

Or better yet, read Steve Byrnes' Intro to brain-like-AGI safety sequence, focusing on the steering subsystem. Then look at his Valence sequence for more on how we pass reward predictions among our "thoughts" (representations of concepts). (IMO, his Valence matches exactly what the dopamine system is known to do for short-timescale tasks, and what it probably does in complex human thought.)

 

So, when you ask people what their goals are, they're mentioning things that predict reward to them. They're guesses about what would give a lot of reward signals. The correct answer to "why do you want that" is "because I think I'd find it really rewarding". ("I'd really enjoy it" is close but not quite correct, since there's a difference between wanting and liking in the brain - google that for another headful).

Now, we can be really wrong about what we'd find rewarding or enjoy. I think we're usually way off. But that is how we pick goals, and what drives our behavior (along with a bunch of other factors that are less determinative, like what we know about and what happens to come into our attention).

 

Other emotions, like fear, anger, etc. are different. They can be thought of as "tilts" to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". Elevated heart rate might signal fear, anger, or excitement; noticing it or finding other cues is necessary to understand how we're tilted, and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair or physical attacks when we're angry.

 

See also my post [Human preferences as RL critic values - implications for alignment](https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for). I'm not sure how clear or compelling it is. But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-17T23:02:58.721Z · LW · GW

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

Comment by Seth Herd on If we solve alignment, do we die anyway? · 2024-11-17T23:01:45.950Z · LW · GW

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.

> you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way).

If I thought ordering an AGI to self-improve was suicidal, I'd be relieved.

Alternately, if someone actually pulled off full value alignment, that AGI will take over without a care for international law or the wishes of its creator - and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI - following instructions given by a single person is much easier to define and more robust to errors than defining or defining-how-to-deduce the values of all humanity. And even if it wasn't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.

Again, I'm guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt made by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.

If I've missed the point of differing perspectives, I apologize.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-17T17:41:06.961Z · LW · GW

I agree that there was a lot more to that exchange than that quick summary.

My point was that there wasn't enough of it, or it wasn't careful enough.

Comment by Seth Herd on Making a conservative case for alignment · 2024-11-17T14:13:00.564Z · LW · GW

I agree with everything you've said. The advantages come primarily from aligning not to values but only to following instructions, rather than using RL or any other process to infer underlying values. Instruction-following AGI is easier and more likely than value aligned AGI.

I think creating real AGI based on an LLM aligned to be helpful, harmless and honest would probably be the end of us, as carrying the set of values implied by RLHF to their logical conclusions outside of human control would probably be pretty different from our desired values. Instruction-following provides corrigibility.

Edit: by "small group" I meant something like five people who are authorized to give instructions to an AGI.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-17T03:47:47.080Z · LW · GW

That makes sense under certain assumptions - I find them so foreign I wasn't thinking in those terms. I find this move strange if you worry about either alignment or misuse. If you hand AGI to a bunch of people, one of them is prone to either screw up and release a misaligned AGI, or deliberately use their AGI to self-improve and either take over or cause mayhem.

To me these problems both seem highly likely. That's why the move of responding to concern over AGI by making more AGIs makes no sense to me. I think a singleton in responsible hands is our best chance at survival.

If you think alignment is so easy nobody will screw it up, or if you strongly believe that an offense-defense balance will strongly hold so that many good AGIs safely counter a few misaligned/misused ones, then sure. I just don't think either of those are very plausible views once you've thought back and forth through things.

Cruxes of disagreement on alignment difficulty explains why I think anybody who thinks alignment is super easy is overestimating their confidence (as is anyone who's sure it's really really hard) - we just haven't done enough analysis or experimentation yet.

If we solve alignment, do we die anyway? addresses why I think offense-defense balance is almost guaranteed to shift to offense with self-improving AGI, meaning a massively multipolar scenario means we're doomed to misuse.

 

My best guess is that people who think open-sourcing AGI is a good idea either are thinking only of weak "AGI" and not the next step to autonomously self-improving AGI, or they've taken an optimistic  guess at the offense-defense balance with many human-controlled real AGIs. 

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-17T00:34:39.953Z · LW · GW

I totally agree. And I also think that all involved are quite serious when they say they care about the outcomes for all of humanity. So I think in this case history turned on a knife edge; Musk would've at least not done this much harm had he and Page had clearer thinking and clearer communication, possibly just by a little bit.

But I do agree that there's some motivated reasoning happening there, too. In support of your point that Musk might find an excuse to do what he emotionally wanted to anyway (become humanity's savior and perhaps emperor for eternity): Musk did also express concern about DeepMind making Hassabis the effective emperor of humanity, which seems much stranger - Hassabis' values appear to be quite standard humanist ones, so you'd think having him in charge of a project with the clear lead would be a best-case scenario for anything other than being in charge yourself. So yes, I do think Musk, Altman, and people like them also have some powerful emotional drives toward doing grand things themselves.

It's a mix of motivations, noble and selfish, conscious and unconscious. That's true of all of us all the time, but it becomes particularly salient and worth analyzing when the future hangs in the balance.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-16T21:45:02.303Z · LW · GW

I sometimes feel we spend too much time on philosophy and communication in the x-risk community. But thinking through the OpenAI drama suggests that it's crucial.

Now the world is in more and more immediate danger because a couple of smart guys couldn't get their philosophy or their communication right enough, and didn't spend the time necessary to clarify. Instead Musk followed his combative and entrepreneurial instincts. The result was dramatically heating up the race for AGI, in which DeepMind previously had no real competition.

OpenAI wouldn't have launched without Musk's support, and he gave it because he was afraid of Larry Page being in charge of a successful Google AGI effort.

From Musk's interview with Tucker Carlson (automated transcript, sorry!):

> I mean the the reason open AI exists at all is that um Larry Paige and I used to be close friends and I would stay at his house in pal Alto and I would talk to him late into the night about uh AI safety and at least my (01:12) perception was that Larry was not taking uh AI safety seriously enough um and um what did he say about it he really seemed to be um one want want sort of digital super intelligence basically digital God if you will and at one point uh I said well what about you know we're going to make sure humanity is okay here um and and and um uh and then he called me a speciest

Musk was afraid of what Page would do with AGI because Page called Musk a speciesist (specist?) when they were talking about AGI safety. What did Page mean by this? He probably hadn't worked it all the way through.

These guys stopped being friends, and Musk put a bunch of money and effort into developing an org that could rival DeepMind's progress toward AGI.

That org was captured by Altman. But it was always based on a stupid idea: make AGI open source. That's the dumbest thing you could do with something really dangerous - unless you believed that it would otherwise wind up in hands that just don't care about humanity.

That's probably not what Page meant. On consideration, he would probably have clarified that AI that includes what we value about humanity would be a worthy successor. He probably wasn't even clear on his own philosophy at the time.

A little more careful conversation would've prevented this whole thing, and we'd be in a much better strategic position.

 

In my mind this also shows how immensely intelligent people can do really dumb things outside of their area of intellectual expertise.

Comment by Seth Herd on OpenAI Email Archives (from Musk v. Altman) · 2024-11-16T21:11:29.260Z · LW · GW

Rings true. I'm not sure it pushes me much on the ethics of OpenAI; somebody else had a good idea for a philosophy and a name to push for AI in a certain (maybe dumb) direction; they recognized it as a good idea and appropriated it for their own similar project. Should they have used a more different name? Probably. Should they have used a more different philosophical argument? No. Should they have brought Guy Ravine on board? Probably not; his vision for how the thing would actually go was very different from theirs, and none of his skills were really that relevant. He'd have been in arguments with them from the start.

Is this the right way for industry to work? Nope. But nobody knows how to properly give credit for good but broad ideas.

None of this is to endorse anything or anyone related to OpenAI, just to say it's pretty standard practice.

Comment by Seth Herd on Making a conservative case for alignment · 2024-11-16T20:07:46.934Z · LW · GW

I didn't read this post as proposing an alliance with conservative politicians. The main point seemed to be that engaging with them by finding common ideological ground is just a good way to improve epistemics and spread true knowledge.

The political angle I endorse is that the AGI x-risk community is heavily partisan already, and that's a very dangerous position to take. There are two separable reasons: remaining partisan will prevent us from communicating well with the conservatives soon to assume power (and who may well have power during a critical risk period for alignment); and it will increase polarization on the issue, turning it from a sensible discussion to a political football, just like the climate crisis has become.

Avoiding the mere mention of politics would seem to hurt the odds that we think clearly enough about the real pragmatic issues arising from the current political situation. They matter, and we mustn't ignore those dynamics, however much we dislike them.

Comment by Seth Herd on Making a conservative case for alignment · 2024-11-16T01:24:01.589Z · LW · GW

This might be the most important alignment idea in a while.

Making an honest argument based on ideological agreements is a solidly good idea.

"Alignment" meaning alignment to one group is not ideal. But I'm afraid it's inevitable. Technical alignment will always be easier with a simpler alignment target. For instance, making an AGI aligned to the good of all humanity is much trickier than aligning it to want to do what one particular human says to do. Taking directions is almost completely a subset of inferring desires, and one person (or a small group) is a subset of all of humanity — and much easier to define.

If that human (or their designated successor(s)) has any compassion and any sense, they'll make their own and their AGI's goal to create fully value-aligned AGI. Instruction following or Intent alignment can be a stepping-stone to value alignment.

It is time to reach across the aisle. The reasons you mention are powerful. Another is to avoid polarization on this issue. Polarization appears to have completely derailed the discussion of climate change, similar to alignment in being new and science-based. Current guesses are that the US democratic party would be prone to pick up the AI safety banner — which could polarize alignment. Putting existential risk, at least, on the conservative side might be a better idea for the next four years, and for longer if it reduces polarization by aligning US liberal concerns about harms to individuals (e.g., artists) and bias in AI systems, with conservative concerns about preserving our values and our way of life (e.g., concerns we'll all die or be obsoleted).

Comment by Seth Herd on The Hopium Wars: the AGI Entente Delusion · 2024-11-13T01:38:27.885Z · LW · GW

Well, that's disturbing. I'm curious what you mean by "soon" for autonomous continuous improvement, and what mechanism you're envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it's fairly slow and seems to have upper bounds.

As for the rate of cybersecurity and eval improvement, I agree that it may not keep pace. I wouldn't be surprised if it falls short and we actually see the escape you're talking about.

My one hope here is that the rate of improvement isn't on rails; it's in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI. The point is that we shouldn't assume that just because nobody finds LLMs dangerous, they won't find AGI or even proto-AGI obviously and intuitively dangerous.

Again, I'm not at all sure this happens in time on the current trajectory. But it's possible for the folks at the lab to say "we're going to deploy it internally, but let's improve our evals and our network security first, because this could be the real deal".

It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.

 

The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.

All in all, I expect that if we launch misaligned proto-AGI, we're probably all dead. I agree that people are all too likely to launch it before they're sure whether it's aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough that they're in place before it's even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot at LLM agents being aligned just by virtue of frequent prompting, and by using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty in proportion to how well it's succeeded). Since those are already how LLMs and agents are built, there's little chance the org doesn't at least try them.

That might sound like a forlorn hope, but the target isn't perfect alignment, just good enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc.) are fairly small. If the instruction-following alignment is even decently successful, we don't have to get everything right at once; we use the corrigibility and honesty properties to keep adjusting alignment.

It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it's safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental necessity to survive by any means necessary if you have any goals at all.

Comment by Seth Herd on Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI · 2024-11-12T21:49:15.854Z · LW · GW

Those outcomes sound quite plausible.

I'm particularly concerned with polarization. Becoming a political football was the death knell for sensible discussion on climate change, and it could be the same for AGI x-risk. Public belief in climate change actually fell while the evidence mounted. My older post AI scares and changing public beliefs is actually mostly about polarization.

Having the debate become ideologically or politically motivated would not be good. I'm still really hoping to avoid polarization on AGI x-risk. It does seem like "AI safety" concerns (bias, deepfakes, and harms from interacting with LLMs) are already primarily discussed among liberals in the US.

Neither side has started really worrying about job loss, but that would tend to be the liberal side, too, since conservatives are still somewhat more free-market oriented.

While tying concerns about x-risk with calls to slow AI based on mundane harms might seem expedient, I wouldn't take that bargain if it created worse polarization.

I think this is a common attitude among the x-risk worried, especially since it's hard to predict whether a slowdown in the US AGI push would be a net good or bad thing for x-risk.

Comment by Seth Herd on Buck's Shortform · 2024-11-12T04:32:47.789Z · LW · GW

I think this is a valuable distinction.

I note that the solutions you mention for the second, less-addressed class of misuse only prevent people who aren't officially in charge of AGI from misusing it; they don't address government appropriation.

Governments have a monopoly on the use of force, and their self-perceived mandate includes all issues critical to national security. AGI is surely such an issue.

I expect that governments will assume control of AGI if they see it coming before it's smart enough to help its creators evade that control. And evading that control would be very difficult in most foreseeable scenarios.

You can hop borders, but you're just moving to another government's jurisdiction.

I don't have any better solutions to government misuse for a self-coup and permanent dictatorship. Any such solutions are probably political, not technical, and I know nothing about politics.

But it seems like we need to get some politically savvy people on board before we have powerful AI aligned to its creators' intent. Technical alignment is only a partial solution.

Comment by Seth Herd on papetoast's Shortforms · 2024-11-11T20:22:46.520Z · LW · GW

There are infinite things to be sad and annoyed by, should you choose to focus on those. :) I'd rather focus on the world as a whole being made better in your examples.

Comment by Seth Herd on Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 2024-11-11T19:10:34.315Z · LW · GW

I think that's one issue; LLMs don't get the same types of guidance, etc. that humans get; they get a lot of training and RL feedback, but it's structured very differently.

I think this particular article gets another major factor right, where most analyses overlook it: LLMs by default don't do metacognitive checks on their thinking. This is a huge factor in humans appearing as smart as we do. We make a variety of mistakes in our first guesses (System 1 thinking) that can be found and corrected with sufficient reflection (System 2 thinking). Adding more of this to LLM agents is likely to be a major source of capabilities improvements. The focus on increasing "9s of reliability" is a very CS approach; humans just make tons of mistakes and then catch many of the important ones. LLMs sort of copy their cognition from humans, so they can benefit from the same approach - but they don't do much of it by default. Scripting it into LLM agents is going to at least help, and it may help a lot.
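To make "scripting it in" concrete, here is a minimal sketch of what a System 2 check bolted onto an agent loop might look like. Everything in it is an assumption for illustration: `call_llm` stands in for whatever model API the agent uses, and the prompts and revision limit are placeholders rather than a tested recipe.

```python
# Minimal sketch: a metacognitive "System 2" pass over an agent's first guess.

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying model call (any chat-completion API)."""
    raise NotImplementedError

def propose_action(task: str, context: str) -> str:
    # System 1: the agent's first-pass answer or plan.
    return call_llm(f"Task: {task}\nContext: {context}\nPropose the next action.")

def critique(task: str, action: str) -> str:
    # System 2: a separate pass that looks for errors before acting.
    return call_llm(
        f"Task: {task}\nProposed action: {action}\n"
        "List any factual errors, unjustified steps, or conflicts with the "
        "instructions. Reply 'OK' if none."
    )

def act_with_reflection(task: str, context: str, max_revisions: int = 2) -> str:
    action = propose_action(task, context)
    for _ in range(max_revisions):
        review = critique(task, action)
        if review.strip().upper() == "OK":
            break
        # Revise the first-pass answer using the critique, then re-check.
        action = call_llm(
            f"Task: {task}\nDraft action: {action}\nCritique: {review}\n"
            "Rewrite the action to address the critique."
        )
    return action
```

The point isn't the specific prompts; it's that the check is a separate, scripted step rather than something the base model does spontaneously.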

Comment by Seth Herd on Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al. · 2024-11-11T18:45:23.482Z · LW · GW

I read the whole thing because of its similarity to my proposals about metacognition as an aid to both capabilities and alignment in language model agents.  

In this and my work, metacognition is a way to keep AI from doing the wrong thing (from the AI's perspective). They explicitly do not address the broader alignment problem of AIs wanting the wrong things (from humans' perspective).

They note that "wiser" humans are more prone to serve the common good, by taking more perspectives into account. They wisely do not propose wisdom as a solution to the problem of defining human values or beneficial action from an AI. Wisdom here is an aid to fulfilling your values, not a definition of those values. Their presentation is a bit muddled on this issue, but I think their final sections on the broader alignment problem make this scoping clear.

My proposal of a metacognitive "internal review" or "System 2 alignment check" shares this weakness. It doesn't address the right thing to point an AGI at; it merely closes off a couple of possible routes to goal mis-specification.

This article explicitly refuses to grapple with this problem:

3.4.1. Rethinking AI alignment 

With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals.

I think limiting the power of AI systems within human institutions is only sensible if you're thinking of tool AI or weak AGI; thinking you'll constrain superhuman AIs seems like an obvious fool's errand. I think this proposal is meant to apply to AI, not ever-improving AGI. Which is fine, if we have a long time between transformative AI and real AGI.

I think it would be wildly foolish to assume we have that gap between important AI and real AGI. A highly competent assistant may soon be your new boss.

I have a different way to duck the problem of specifying complex and possibly fragile human values: make the AGI's central goal to merely follow instructions. Something smarter than you wanting nothing more than to follow your instructions is counterintuitive, but I think it's both consistent and, in retrospect, obvious; I think this alignment target is not only safer but far more likely for our first AGIs. People are going to want the first semi-sapient AGIs to follow instructions, just like LLMs do, not make their own judgments about values or ethics. And once we've started down that path, there will be no immediate reason to tackle the full value alignment problem.

 (In the longer term, we'll probably want to use instruction-following as a stepping-stone to full value alignment, since instruction-following superintelligence would eventually fall into the wrong hands and receive some really awful instructions. But surpassing human intelligence and agency doesn't necessitate shooting for full value alignment right away.)

A final note on the authors' attitudes toward alignment: I also read it because I noted Yoshua Bengio and Melanie Mitchell among the authors. It's what I'd expect from Mitchell, who has steadfastly refused to address the alignment problem, in part because she has long timelines, and in part because she believes in a "fallacy of dumb superintelligence" (I point out how she goes wrong in The (partial) fallacy of dumb superintelligence).

I'm disappointed to see Bengio lend his name to this refusal to grapple with the larger alignment problem. I hope this doesn't signal a dedication to this approach. I had hoped for more from him.

Comment by Seth Herd on papetoast's Shortforms · 2024-11-11T14:24:04.627Z · LW · GW

In both cases one particular project was harmed but the sum total of projects was helped.

Comment by Seth Herd on how to truly feel my beliefs? · 2024-11-11T02:57:25.684Z · LW · GW

Think about your beliefs' consequences a lot and in detail. Personalize your imagination of the outcomes, imagining consequences to you and to others as vividly as possible.

I'm pretty sure that's the mechanism (after working on the neuroscience of motivation for a long time), but I'm not sure you should actually do that for existential risk from AGI. I work full-time on AGI risk, but I really enjoy not fully feeling the consequences of my beliefs WRT doom (I give it roughly 50% since so much of the logic is poorly worked through thus far). Would I work a little harder if I was more terrified? Probably for a while, but I might well burn out.

One possible solution is to do the opposite type of motivation: think of the consequences of succeeding in aligning AGI (or avoiding other dangers). Think about them in as much detail and as frequently as you can. The imagination has to be vivid enough to evoke emotions; tying those emotional experiences to the concepts is what makes you feel your beliefs.

Imagining a vast number of humans, transhumans, and sentient AIs enjoying their lives to the fullest and enjoying activities we can barely imagine is a way more fun way to motivate yourself.

Challenge yourself to imagine how much incredible fun people might have if we get aligned superintelligence (I like to assume near-perfect simulations, so that people can have lots of challenges and adventures without getting in each other's way, but there are more mundane ways to have immense fun, too).

I'm not sure if it's as effective, but for your own sake I'd recommend that over imagining how bad failure would be.

Comment by Seth Herd on Viliam's Shortform · 2024-11-11T02:30:23.816Z · LW · GW

If publication and credit standards were changed, we'd see more scientists investigating interesting ideas from both within and outside of academia. The existing structure makes scientists highly conservative in which ideas they test from any source, which is bad when applied to ideas from outside academia - but equally bad when applied to ideas from inside academia.

5% definitely isn't the cutoff for which ideas scientists actually do test empirically.

Throwing away about 90% of your empirical work (everything except the real hits and the false alarms you'd get from a 5% base rate of true hypotheses) would be a high price to pay for exploring possibly-true hypotheses. Nobody does that. Labs in cognitive psychology and neuroscience, the fields I'm familiar with, publish at least half of their empirical work (outside of small pilot studies, which probably run a bit lower).

People don't want to waste work, so they focus on experiments that are pretty likely to "work" by getting "significant" results at the p<.05 level. This is because they can rarely publish studies that show a null effect, even if those studies are strong enough to establish that any effect is probably too small to care about.

So it's really more like a 50% chance base rate. This is heavily biased toward exploitation of existing knowledge rather than exploration toward new knowledge.
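A quick back-of-the-envelope sketch of where those numbers come from (the 80% power and p<.05 threshold are assumed typical values, not measured ones):

```python
# Fraction of experiments yielding a "significant" (publishable) result,
# given the base rate of true hypotheses among those tested.
# Assumptions: statistical power ~0.8, alpha = 0.05.

def significant_fraction(base_rate: float, power: float = 0.8, alpha: float = 0.05) -> float:
    # Real hits on true hypotheses plus false positives on null ones.
    return base_rate * power + (1 - base_rate) * alpha

print(significant_fraction(0.05))  # ~0.09: roughly 90% of studies come up null
print(significant_fraction(0.50))  # ~0.43: a bit under half of studies "work"
```

Under these assumptions, a 5% base rate means throwing away roughly 90% of your studies, while a ~50% base rate puts the "successful" fraction in the neighborhood of the publication rates mentioned above.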

And this is why scientists mostly ignore ideas from outside of academia. They are very busy working hard to keep a lab afloat. Testing established and reputable ideas is much better business than finding a really unusual idea and demonstrating that it's right, given how often that effort would be wasted.

The solution is publishing "failed" experiments. It is pretty crazy that people keep wasting time re-establishing which ideas aren't true. Some of those experiments would be of little value, since they really can't say if there's a large effect or not; but that would at least tell others where it's hard to establish the truth. And bigger, better studies finding near-zero effects could offer almost as much information as those finding large and reliable effects. The ones of little value would be published in lesser venues and so be less important on a resume, but they'd still offer value and show that you're doing valuable work.

The continuation of journals as the official gatekeepers of what information you're rewarded for sharing is a huge problem. Even the lower-quality ones are setting a high bar in some senses, by refusing even to print studies with inconclusive results. And the standard is completely arbitrary in celebrating large effects while refusing to even publish studies of the same quality that give strong evidence of near-zero effects.

Comment by Seth Herd on LLMs Look Increasingly Like General Reasoners · 2024-11-09T22:41:59.790Z · LW · GW

I like trying to define general reasoning; I also don't have a good definition. I think it's tricky.

 

  • The ability to do deduction, induction, and abduction.

I think you've got to define how well it does each of these. As you noted on that very difficult math benchmark comment, saying they can do general reasoning doesn't mean doing it infinitely well.

  • The ability to do those in a careful, step by step way, with almost no errors (other than the errors that are inherent to induction and abduction on limited data).

I don't know about this one. Humans seem to make a very large number of errors, but muddle through by recognizing at above-chance levels when they're more likely to be correct - then building on that occasional success. So I think there are two routes to useful general-purpose reasoning: doing it well, or being able to judge success at above-chance rates and then remember it for future use one way or another.

  • The ability to do all of that in a domain-independent way.
  • The ability to use all of that to build a self-consistent internal model of the domain under consideration.

Here again, I think we shouldn't overestimate how self-consistent or complete a model humans use when they make progress on difficult problems. It's consistent and complete enough, but probably far from perfect.
 

Comment by Seth Herd on LLMs Look Increasingly Like General Reasoners · 2024-11-09T22:32:14.712Z · LW · GW

I totally agree. Natural language datasets do have the right information embedded in them; it's just obscured by a lot of other stuff. Compute alone might be enough to bring it out.

Part of my original hypothesis was that even a small improvement in the base model might be enough to make scaffolded System 2-type thinking very effective. It's hard to guess when a system could get past the threshold where more thinking works better, like it does for humans (with diminishing returns). It could come from a small improvement in the scaffolding, or a small improvement in memory systems, or even from better feedback from outside sources (e.g., using web searches and better distinguishing good from bad information).

All of those factors are critical in human thinking, and our abilities are clearly a nonlinear product of separate cognitive capacities. That's why I expect improvements in any or all of those dimensions to eventually lead to human-plus fluid intelligence. And since efforts are underway on each of those dimensions, I'd guess we see that level sooner rather than later. Two years is my median guess for human-level reasoning on most problems, maybe all. But we might still not have good online learning that would allow, for a relevant instance, the system to be trained on any arbitrary job and then do it competently. Fortunately, I expect it to scale past human level at a relatively slow pace from there, giving us a few more years to get our shit together once we're staring roughly human-equivalent agents in the face and so start to take the potential seriously.

Comment by Seth Herd on Eli's shortform feed · 2024-11-09T12:24:11.550Z · LW · GW

The market is much more crowded now. A new service like the old OKCupid would be competing against OKCupid itself as well as everything else. And OKCupid has a huge advantage in its existing userbase.

And OKCupid's algorithm still exists, sort of. And you can write as much as you like. What aspect of the old site do you think was critically different?

I just think there's barely a cent to be made in launching yet another dating app. So you can't blame people for not doing it.

I think the biggest advantage of old OKC was that more people used it; now people are spread across Hinge and Bumble as well as Tinder.