Posts

What's a better term now that "AGI" is too vague? 2024-05-28T18:02:31.371Z
Anthropic announces interpretability advances. How much does this advance alignment? 2024-05-21T22:30:52.638Z
Instruction-following AGI is easier and more likely than value aligned AGI 2024-05-15T19:38:03.185Z
Goals selected from learned knowledge: an alternative to RL alignment 2024-01-15T21:52:06.170Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
Corrigibility or DWIM is an attractive primary goal for AGI 2023-11-25T19:37:39.698Z
Sapience, understanding, and "AGI" 2023-11-24T15:13:04.391Z
Altman returns as OpenAI CEO with new board 2023-11-22T16:04:03.123Z
OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns 2023-11-20T14:20:33.539Z
We have promising alignment plans with low taxes 2023-11-10T18:51:38.604Z
Seth Herd's Shortform 2023-11-10T06:52:28.778Z
Shane Legg interview on alignment 2023-10-28T19:28:52.223Z
The (partial) fallacy of dumb superintelligence 2023-10-18T21:25:16.893Z
Steering subsystems: capabilities, agency, and alignment 2023-09-29T13:45:00.739Z
AGI isn't just a technology 2023-09-01T14:35:57.062Z
Internal independent review for language model agent alignment 2023-07-07T06:54:11.552Z
Simpler explanations of AGI risk 2023-05-14T01:29:29.289Z
A simple presentation of AI risk arguments 2023-04-26T02:19:19.164Z
Capabilities and alignment of LLM cognitive architectures 2023-04-18T16:29:29.792Z
Agentized LLMs will change the alignment landscape 2023-04-09T02:29:07.797Z
AI scares and changing public beliefs 2023-04-06T18:51:12.831Z
The alignment stability problem 2023-03-26T02:10:13.044Z
Human preferences as RL critic values - implications for alignment 2023-03-14T22:10:32.823Z
Clippy, the friendly paperclipper 2023-03-02T00:02:55.749Z
Are you stably aligned? 2023-02-24T22:08:23.098Z

Comments

Comment by Seth Herd on The case for stopping AI safety research · 2024-07-26T20:40:05.005Z · LW · GW

I intended to refer to understanding the concept of manipulation well enough to avoid it if the AGI "wanted" to.

As for understanding the concept of intent, I agree that "true" intent is very difficult to understand, particularly if it's projected far into the future. That's a huge problem for approaches like CEV. The virtue of the approach I'm suggesting is that it entirely bypasses that complexity (while introducing new problems). Instead of inferring "true" intent, the AGI just "wants" to do what the human principal tells it to do. The human gets to decide what their intent is. The machine just has to understand what the human meant by what they said, and the human can clarify that in a conversation. I'm thinking of this as "do what I mean and check" (DWIMAC) alignment. More on this in Instruction-following AGI is easier and more likely than value aligned AGI.

I'll read your article.

Comment by Seth Herd on Universal Basic Income and Poverty · 2024-07-26T16:11:44.052Z · LW · GW

Moloch is the name of this force, and rent-seeking is one of its faces.

I think this is basically correct, although as others have noted it doesn't completely counteract progress.

There are forms of rent-seeking with sources other than land ownership, like rising college tuition. Arguably, zoning is a separate form of rent-seeking that's based not directly on land ownership but on control of government to make one's own life better at the expense of others' opportunities.

Those two are more clearly Moloch. Competition for good degrees and good zoning drives prices as high as people will pay.

Comment by Seth Herd on Universal Basic Income and Poverty · 2024-07-26T16:04:38.606Z · LW · GW

Excellent point. But these changes are much less than the 100x wealth increase, which implies that there is a very strong poverty-inducing force; it's just not completely negating progress.

Comment by Seth Herd on Universal Basic Income and Poverty · 2024-07-26T16:00:46.354Z · LW · GW

This doesn't address how the equilibrium would change if such basic income becomes universal.

Comment by Seth Herd on The case for stopping AI safety research · 2024-07-25T16:29:06.268Z · LW · GW

Thank you!

The link to your paper is broken. I've read the Christiano piece. And some/most of the CEV paper, I think.

Any working intent alignment solution needs to prevent the AGI from deliberately changing the human's intent. That is a solvable problem with an AGI that understands the concept.

Comment by Seth Herd on How to avoid death by AI. · 2024-07-25T12:45:23.023Z · LW · GW

Asking people to listen to a long presentation is a bigger ask than a concise presentation with more details than the current post. Got anything in between?

Comment by Seth Herd on Is Chinese AGI a valid concern for the USA? · 2024-07-25T12:41:11.631Z · LW · GW

On the political will issue: it seems from my relatively ignorant perspective that China has more political will for large infrastructure projects than the US. This might be related to a more technocratic leadership style, and their somewhat longer time horizon perspective (which in turn might spring from slower turnover of leadership).

That might lead to more government spending and more coherent AI projects. That is why I don't count China out, even for the relatively short term.

Comment by Seth Herd on The case for stopping AI safety research · 2024-07-25T12:05:20.286Z · LW · GW

What's a good overview of those grounded arguments? I looked at your writings and it wasn't clear where to start.

Comment by Seth Herd on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-24T14:58:46.943Z · LW · GW

Not needing corrigibility with an AI that follows instructions is the point. Avoiding manipulation seems as simple as saying "most importantly, don't manipulate me. Please state your understanding of manipulation so we can establish a mutually understood definition..."

I agree that, if it's not adequately following instructions, that won't work. But it seems like it would have to be pretty far off target.

Language models already have an adequate understanding of instruction-following and of what manipulation is, so if we build AGI that uses something like them to define its goals, that should work.

See my instruction-following AGI post for more of this logic and links to my other work on how we'd do the technical alignment for an LLM-based or RL-based AGI. Instruction-following as an alignment goal and these technical approaches both seem obvious in retrospect, and therefore pretty likely to be used in the first AGI alignment attempts.
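To make that concrete, here's a minimal, purely illustrative sketch of how such a standing instruction might be expressed for an instruction-following agent. The wording and structure are my own invention for illustration, not a tested prompt:

```python
# Hypothetical standing instructions for a DWIMAC-style instruction-following agent.
# Everything here is illustrative; no claim that this exact wording is sufficient.

STANDING_INSTRUCTIONS = """
You work for a single human principal and follow their instructions.

Priority order (higher overrides lower):
1. Most importantly, do not manipulate the principal. State your working
   definition of manipulation so we can establish a mutually understood one.
2. When an instruction is ambiguous or high-stakes, check with the principal
   before acting ("do what I mean and check").
3. Carry out the principal's most recent instructions as they meant them.
"""
```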

Comment by Seth Herd on Caring about excellence · 2024-07-23T19:41:39.919Z · LW · GW

I strongly disagree, despite the excellence of your presentation :) Arguing against excellence is a bad look, but I feel compelled to voice my take anyway. I strongly suspect that many LW readers are already pursuing excellence beyond what is optimal for their real goals.

It certainly makes sense to pursue excellence in some things. But that usually takes time, which limits the number of other projects you can do. And there's often a large nonlinearity in the likely success vs. time spent on a given project, making pursuit of excellence actively irrational.

In my two decades in academic research, it looked to me like the pursuit of excellence was a very common mistake made by academics. I suspect the same is true of the rationalist community on average. I found I could be quite helpful by examining different scientific projects and pointing out where corners could be cut that would save a lot of time while having only a tiny effect on the project's outcome and likely success. The people doing those projects usually agreed with me; they hadn't thought about those measures because they had been pursuing excellence as a heuristic without any real means-ends analysis.

I do not see a nice dividing line between excellence and perfectionism, but a smooth continuum that requires frequent re-evaluation specific to each project.

"Anything worth doing is worth doing well" is my nomination for most damaging aphorism in history.

The perfect is the enemy of the good.

Or, to put it another way: pursue excellence in your choice of how to spend your time. Sometimes the correct choice will be pursuit of excellence; this is particularly likely when that pursuit builds your skills, or excellence is required for even satisfactory results. By this broader definition, doing something well means doing it efficiently: spending no more time than the time/results tradeoff curve dictates, and bearing in mind the opportunity cost of not spending that time on other projects.

In many situations, excellent choices mean doing a good-enough job, and quickly moving on to accomplish other things.

I'm sorry to say that I find this suggestion to be actively anti-rationalist. There is certainly such a thing as too much analysis, but that's much more likely on small decisions where a quick decision beats a better decision. Deciding how to spend one's time seems to deserve a careful analysis, not heuristic-based decision-making.

I also think that much American culture already stresses excellence over happiness, which is a huge mistake of another sort. Telling American high achievers to pursue excellence seems like the opposite of the advice they most need: satisfice, prioritize, and keep your eye on how your actions pursue your real goals.

The topic is a good deal more complex and detailed than either this piece or my response covered. Like most important questions, it comes down to the specifics of each situation. But I did feel compelled to offer the counterarguments, since the essay didn't really address them.

Comment by Seth Herd on Closed Limelike Curves's Shortform · 2024-07-23T19:12:55.556Z · LW · GW

This seems like a much better target for spreading rationalism. The other listed articles all seem quite detailed and far from the central rationalist project. Decision-making seems like a more likely on-ramp.

Comment by Seth Herd on Towards more cooperative AI safety strategies · 2024-07-23T04:10:11.243Z · LW · GW

That is an excellent point. I hate the idea of gathering attention and reputation, but that's probably a big part of having people listen to you when it's important.

Comment by Seth Herd on The $100B plan with "70% risk of killing us all" w Stephen Fry [video] · 2024-07-22T21:55:44.010Z · LW · GW

This is really good, the new best intro to AGI X-risk arguments for the layperson that I know of - and by a long way. It's not only the best audio, but better than any written presentation I've come across - and I have been on the lookout.

It's succinct and accurate. It hits the best arguments in a very clear way, and doesn't leave out any important ones that I know of (I and others would argue that there are other important arguments, but many are too complex or controversial to include in an intro).

I didn't look at the video, just listened to the audio, so I have no idea if that enhances or detracts from the audio production.

Comment by Seth Herd on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-22T15:16:24.795Z · LW · GW

I agree that that's possible, but it's not clear to me what direction it would go. If an agent's goal is to do whatever its principal currently wants, it doesn't seem to create a clear incentive to manipulate what that is. Doing whatever the principal last said they want is a subgoal of that, but it's just a subgoal. It would help fulfill that subgoal to make the principal keep wanting that subgoal. But the principal always wants (and should say they want) to not be manipulated. So that's another subgoal, and probably a more important one than any particular instrumental subgoal.

Does that make sense? If you see it differently I'd be interested to understand a different perspective.

In a way this is an extra element to prevent manipulation, but it also falls directly out of the goal-as-a-pointer-to-principal's-goals.

Comment by Seth Herd on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-21T01:28:29.007Z · LW · GW

If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then telling it to achieve world peace and giving it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions in that variant), the pointer's value changes. In the first direct definition case, it does not.

Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.
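For concreteness, here's a minimal, purely illustrative Python sketch of the distinction; the class and method names are invented for illustration and aren't taken from any real system:

```python
class Principal:
    """Stand-in for the human principal; their intent can change over time."""
    def __init__(self, intent: str):
        self._intent = intent

    def stated_intent(self) -> str:
        return self._intent

    def update_intent(self, new_intent: str) -> None:
        self._intent = new_intent


class DirectGoalAgent:
    """Goal is a fixed definition captured at build time."""
    def __init__(self, goal_spec: str):
        self.goal_spec = goal_spec

    def current_goal(self) -> str:
        return self.goal_spec  # does not change if the principal's intent changes


class PointerGoalAgent:
    """Goal is a pointer: 'whatever the principal currently wants'."""
    def __init__(self, principal: Principal):
        self.principal = principal

    def current_goal(self) -> str:
        # Dereference the pointer each time the goal is needed.
        return self.principal.stated_intent()


human = Principal("achieve world peace, by my current definition")
direct = DirectGoalAgent(human.stated_intent())
pointer = PointerGoalAgent(human)

human.update_intent("achieve world peace, but avoid catastrophic conflict")
print(direct.current_goal())   # still the original snapshot
print(pointer.current_goal())  # tracks the principal's revised intent
```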

Comment by Seth Herd on How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation") · 2024-07-21T01:05:25.820Z · LW · GW

I agree that eigening isn't the key concept for alignment or any other scientific process. Sure, you could describe any consensus that way, but the consensus could be either very good or just awful depending on how much valid analysis went into each step of that eigening. In a really good situation, progress toward consensus is only superficially describable as eigening. The real progress is happening by careful thinking and communicating. The eigening isn't happening by reputation but by quality of work. In a bad field, eigening is doing most of the work.

Referring to them both as eigening seems to obscure the difference between good and bad science/theory creation.

Comment by Seth Herd on Introduction to Modern Dating: Strategic Dating Advice for beginners · 2024-07-20T23:48:34.972Z · LW · GW

Oh, and I should've said: I very much agree that the world needs more common-sense, basic advice, because what's basic and common sense isn't totally obvious to everyone if they haven't thought about it quite that way. So I very much support your project, including posting on LessWrong as well as other spots for clever, interested daters.

Comment by Seth Herd on Introduction to Modern Dating: Strategic Dating Advice for beginners · 2024-07-20T22:52:31.449Z · LW · GW

Clever people are different from rationalists. Here we fancy ourselves not only clever but rationalists. Thus, this advice to be rational about dating is less relevant here than to other groups of "clever" people.

I missed "skippable" because I was already impatient and skimming :) I don't wait for an invitation to skip, because time is scarce.

Comment by Seth Herd on Me & My Clone · 2024-07-20T22:49:45.498Z · LW · GW

Step outside of the room. You'll see the world from different locations, and the symmetry will be broken. I don't know why you're interested in a situation that can't happen: perfect sensory symmetry for long periods.

Free will is a contradiction in terms, so I don't think it's worth fussing over exactly that wording. What people usually mean by free will is that their decisions truly affect the future. If you take "they" to refer to their brain state, this is pretty clearly true on that interpretation. I think other definitions of self are incoherent and that one is pretty straightforwardly what we mean by "me", so I agree that "free will" in its common definition is compatible with determinism.

Comment by Seth Herd on eggsyntax's Shortform · 2024-07-20T22:40:16.974Z · LW · GW

Sure, I'll use this terminology.

There will be some overlap where components are both tools and part of the scaffolding. For instance, a competent language model cognitive architecture would probably have an episodic memory "tool" it can write to and read from, and the next LLM call from the scaffolding portion would often be determined by the result of reading from that episodic memory "tool", making it also part of the scaffolding. Similarly with sensory systems and probably many others; how the scaffolding invokes the LLM will depend on the results of calls to tools.

But it's useful to have more distinct terminology even when it's not perfect, so I'll go ahead and use this.
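To illustrate the overlap, here's a minimal sketch of one agent step; all names are hypothetical rather than any particular framework's API. The episodic memory is invoked as a "tool", but because its output determines how the next LLM call is constructed, it also functions as scaffolding:

```python
class EpisodicMemory:
    """Toy episodic memory 'tool' the agent can write to and read from."""
    def __init__(self):
        self.entries: list[str] = []

    def write(self, note: str) -> None:
        self.entries.append(note)

    def read(self, query: str) -> list[str]:
        # Toy retrieval: return stored notes mentioning the query term.
        return [e for e in self.entries if query.lower() in e.lower()]


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM interface the architecture actually uses."""
    return f"<model response to: {prompt[:60]}...>"


def agent_step(task: str, memory: EpisodicMemory) -> str:
    relevant = memory.read(task)  # scaffolding calls the memory "tool" first...
    context = "\n".join(relevant) if relevant else "(no relevant memories)"
    # ...and the result of that tool call shapes the next LLM invocation.
    response = call_llm(f"Task: {task}\nRelevant past episodes:\n{context}")
    memory.write(f"{task} -> {response}")  # the memory is also written to
    return response
```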

Comment by Seth Herd on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-20T22:33:57.347Z · LW · GW

AND your accurate assessment of the difficulty. The overconfidence displayed in this mini-experiment seems to result in part from people massively misestimating the difficulty of this relatively simple problem. That's why it's so concerning WRT alignment.

Comment by Seth Herd on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-20T22:29:16.048Z · LW · GW

Ah- well said. I understand the rest of your comments better now. And I thoroughly agree, with a caveat about the complexity of the problem and the amount of thought and teamwork applied (e.g., I expect that a large team working for a month in effective collaboration would've solved the problem in this experiment, but alignment is probably much more difficult than that).

Comment by Seth Herd on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-20T22:25:34.839Z · LW · GW

I think the concern is that, if plans need some verification, it may be impossible to align smarter-than-human AGI. In order to verify those plans, we'd have to build one. If the plan doesn't work (isn't verified), that may be the end of us - no retries possible.

There are complex arguments on both sides, so I'm not arguing this is strictly true. I just wanted to clarify that that's the concern and the point of asking people to solve it on the first try. I think ultimately this is partly but not 100% true of ASI alignment, and clarifying exactly how and to what degree we can verify plans empirically is critical to the project. Having a plan verified on weak, non-autonomous/self-aware/agentic systems may or may not generalize to that plan working on smarter systems with those properties. Some of the ways verification will or won't generalize can probably be identified with careful analysis of how such systems will be functionally different.

Comment by Seth Herd on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-20T21:42:24.183Z · LW · GW

Yes to your first point. I think that

abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments

Is a fair characterization of those results. So would be the inverse, "abstract reasoning about unfamiliar domains is hard therefore you should distrust AI success arguments".

I think both are very true, and so we should distrust both. We simply don't know. 

I think the conclusion taken,

planning is hard therefore you should distrust alignment plans

Is also valid and true. 

People just aren't as smart as we'd like to think we are, particularly in reasoning about complex and unfamiliar domains. So both our plans and evaluations of them tend to be more untrustworthy than we'd like to think. Planning and reasoning require way more collective effort than we'd like to imagine. Careful studies of both individual reasoning in lab tasks and historical examples support this conclusion.

One major reason for this miscalibration is the motivated reasoning effect. We tend to believe what feels good (predicts local reward). Overestimating our reasoning abilities is one such belief among very many examples of motivated reasoning.

Comment by Seth Herd on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-20T21:33:29.760Z · LW · GW

Yes, they're correct in assuming success is possible in those situations - but their assumptions about the possible routes to success are highly incorrect. People are making a very large error in overestimating how well they understand the situation and failing to think about other possibilities. This logical error sounds highly relevant to alignment and AI risk.

Comment by Seth Herd on A simple case for extreme inner misalignment · 2024-07-20T21:24:38.037Z · LW · GW

Making representations simpler even when that makes them worse at their job is not more intelligent. Yes, on the mammalian scale, smarter minds compress more in many ways. That doesn't mean still smarter minds will keep compressing more even when it makes them worse at achieving their goals; nor is it necessary, since they have adequate storage to keep the less-compressed and therefore more accurate and useful representations.

This is a noble project, but I'm afraid the premise is simply false.

Comment by Seth Herd on A simple case for extreme inner misalignment · 2024-07-20T21:22:15.633Z · LW · GW

I agree 100%. This post is basically arguing that greater intelligence will get its goals more wrong in future versions. That would be dumber, not smarter.

The post frames the hypothesis as "greater intelligence compresses more" without really arguing that this is true and inevitable. I think the premise is simply false. Better compression is an element of greater intelligence up to some point (useful abstract representations that aid thinking with limited computational resources), but not necessarily beyond that point.

Comment by Seth Herd on A simple case for extreme inner misalignment · 2024-07-20T21:14:33.848Z · LW · GW

The inline reactions are totally different in that they de-anonymize your feedback. It's both rational and emotionally instinctive to not want someone irritated at you specifically because you disagree with them. In comments you can at least try to be extra nice and appreciative when you express disagreement to offset that downside. But that's a lot more work than clicking a button. I think that's sometimes why the anonymous big downvote is used. It takes real thought and discipline to reserve it for fair, rational "the discussion would be much better if you hadn't written this" responses. And I'm not sure how many people even try to do that.

Comment by Seth Herd on A simple case for extreme inner misalignment · 2024-07-20T21:13:41.592Z · LW · GW

It makes me happy to see you say polarization makes you sad. Sorry :) I agree that it's all too easy to lose rational discussion to polarization, and I live in fear of seeing LessWrong and the alignment community become polarized.

Your point about them not fully making sense makes sense. But despite that, I think agree/disagree votes on posts might help prevent that from happening. Receiving a large downvote on a high-effort or dearly-valued post is emotionally painful, and creates a bit of hostility/negative emotion toward the downvoter. Since you don't know who it was, you wind up thinking of it as "the opposition to my position". This hostility tempts you to downvote high-effort or dearly-valued posts stating similar opposing views. Now they're irritated/hostile back. Polarization results. Even if that polarization is on specific issues rather than lumping many views together, it's almost as bad.

It looks to me like agree/disagree on comments serves remarkably well to allow people to express agreement or disagreement with what they perceive as the main point, while still acknowledging the effort someone put into clarifying and arguing for that point. Many comments also make multiple points, so you're guessing at what each agree vote really means, and that's okay. It still serves the important purpose of keeping upvote/downvote from being used for agree/disagree. I think people are often voting posts as agree or disagree rather than on their contribution to the conversation. If you're quite sure a post is wrong in its main point, you might argue that it's not contributing to our collective knowledge but instead confusing it. And it's attractive to register your agreement without writing a comment just to say "yeah". But we don't want to be voting on the truth (in most cases where it's not clear cut), but rather discussing it, and awarding congratulations (upvotes) to people who are trying to advance the discussion, even when their logic or their methods happen to be partly mistaken.

Comment by Seth Herd on Introduction to Modern Dating: Strategic Dating Advice for beginners · 2024-07-20T20:44:54.171Z · LW · GW

I'm not sure why this got downvotes, except that its core point seems to be "be rational about dating", and just about every single reader on LessWrong identifies as a rationalist, so already believes this about everything. Thus, they've tried to be rational about dating, and have either derived or been exposed to most of these points already. So it seems LessWrong just isn't the ideal outlet for this.

I might also suggest a slightly less wordy and faster-starting version of this essay (and most essays). I don't much care what your qualifications are, so "I've been a dating coach" would do, with maybe more qualifications in a footnote.

I've seen a bunch of dating advice, and most of this is common advice. But I found your "to-do before entering a relationship" to have some novel points and framings, so I particularly liked that.

Comment by Seth Herd on Linkpost: Surely you can be serious · 2024-07-20T20:32:40.334Z · LW · GW

It's popular because people think it's a helpful point to make. Nonserious people can and should become serious. And science shouldn't be a LARP.

I was a professional scientist (in cognitive psychology and neuroscience) for two decades and change. I quit largely because the field wasn't serious in this sense. Nobody cared much about applications. They were just enjoying "doing science" (or sometimes not even really enjoying it).

They damned well could and should get off their intellectual asses and become serious. That's what this post advocates, and it's very much a matter of choice and perspective. That's why it's perceived by many, including me, as a valuable point and perspective.

Comment by Seth Herd on Seth Herd's Shortform · 2024-07-18T13:17:24.513Z · LW · GW

A metaphor for the US-China AGI race

It is as though two rivals have discovered that there are genies in the area. Whichever of them finds a genie and learns to use its wishes can defeat their rival, humiliating or killing them if they choose. If they both have genies, it will probably be a standoff that encourages defection; these genies aren't infinitely powerful or wise, so some creative offensive wish will probably bypass any number of defensive wishes. And there are others that may act if they don't.

In this framing, the choice is pretty clear. If it's dangerous to use a genie without taking time to understand and test it, too bad. Total victory or complete loss hang in the balance. If one is already ahead in the search, they'd better speed up and make sure their rival can't follow their tracks to find a genie of their own.

This is roughly the scenario Aschenbrenner presents in Situational Awareness. But this is simplifying, and focusing attention on one part of the scenario, the rivalry and the danger. The full scenario is more complex.[1]

Of particular importance is that these "genies" can serve as well for peace as for war. They can grant wealth beyond imagination, and other things barely yet hoped for. And they will probably take substantial time to come into their full power.

This changes the overwhelming logic of racing. Using a genie to prevent a rival from acquiring one is not guaranteed to work, and it's probably not possible without collateral damage. So trying that "obvious" strategy might result in the rival attacking out of fear or in retaliation. Since both rivals are already equipped with dreadful offensive weapons, such a conflict could be catastrophic. This risk applies even if one is willing to assume that controlling the genie (alignment) is a solvable problem.

And we don't know the depth of the rivalry. Might these two be content to both enjoy prosperity and health beyond their previous dreams? Might they set aside their rivalry, or at least make a pledge to not attack each other if they find a genie? Even if it's only enforced by their conscience, such a pledge might hold if suddenly all manner of wonderful things became possible at the same time as a treacherous unilateral victory. Would it at least make sense to discuss this possibility while they both search for a genie? And perhaps they should also discuss how hard it might be to give a wish that doesn't backfire and cause catastrophe.

This metaphor is simplified, but it raises many of the same questions as the real situation we're aware of.

Framed in this way, it seems that Aschenbrenner's call for a race is not the obviously correct or inevitable answer. And the question seems important.

  1. ^

    Other perspectives on Situational Awareness, each roughly agreeing on the situation but with differences that influence the rational and likely outcomes:

    Nearly a book review: Situational Awareness, by Leopold Aschenbrenner.

    Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity

    Response to Aschenbrenner's "Situational Awareness"

    On Dwarkesh’s Podcast with Leopold Aschenbrenner

    I have agreements and disagreements with each of these, but those are beyond the scope of this quick take.

Comment by Seth Herd on Towards more cooperative AI safety strategies · 2024-07-17T20:55:37.371Z · LW · GW

Sure. But scaring opponents with inflated arguments and demands by radicals didn't seem to work well for the environmental movement, so the AI safety movement probably shouldn't employ those tactics.

Comment by Seth Herd on Towards more cooperative AI safety strategies · 2024-07-17T15:40:45.826Z · LW · GW

That's a good point. But not all of the imposed costs were strategically wise, so the backlash didn't need to be that large to get the important things done. It could be argued that the most hardline, strident environmentalists might've cost the overall movement immensely by pushing for minor environmental gains that come at large perceived costs.

I think that did happen, and that similarly, pushing for AI safety measures should be carefully weighed in terms of costs vs. benefits. The opposite argument is that we should just get everyone used to paying costs for AI safety (in terms of limiting AI progress that probably would not be highly dangerous). I think that strategy backfired badly for environmentalism and would backfire for us.

Maybe. Again, I'm not expert in PR and I'd really like to have people who are expert involved in coming up with strategies.

Comment by Seth Herd on Towards more cooperative AI safety strategies · 2024-07-16T21:20:34.751Z · LW · GW

Perhaps the broader point here is that public relations is a complex art, of which we are mostly not even practitioners let alone masters. We should probably learn about it and get better.

I also want to note that there are probably psychological as well as societal defense mechanisms against someone trying to change your worldview. I don't know the name of the phenomenon, but this is essentially why counselors/therapists typically avoid giving advice or stating their opinion plainly; the client is prone to rebel against that advice or worldview. I'd suspect this happens because it's terribly dangerous to just let other people tell you how to think; you'll be taken advantage of rather quickly if you do. Obviously there are multiple routes around these defense mechanisms, since people do convince others to change their minds in both subtle and forceful ways. But we should probably learn the theory of how that happens, prior to triggering a bunch of defense mechanisms by going in swinging with amateur enthusiasm (and the unusual perspective of devoted rationalism).

Waiting to speak while polishing our approach seems foolish when time is short. I find very short timelines entirely plausible, but I nonetheless think it would behoove us to collectively gather some clues before arguing loudly in public.

This is in part because I think the last point is very true and very relevant: people who aren't taking AGI risk seriously largely just aren't taking AI seriously. They'll take it more seriously the more it advances, with no convincing needed. A good bit of the work is being done by progress itself, so we're not as far behind in getting people to pay attention as it seems. That gives us a bit more time to figure out how to work with that societal attention as it continues to grow.

None of this is arguing for shutting up or not speaking the truth. I'm just suggesting we err on the side of speaking gently, to avoid triggering strong defense mechanisms we don't understand.

Comment by Seth Herd on Towards more cooperative AI safety strategies · 2024-07-16T21:06:18.614Z · LW · GW

Regardless of who is more power-seeking, it would probably be a good idea to look at how being power-seeking has been a disadvantage to other movements. It looks to me like the insistence/power-seeking of the environmental movement may well have been an immense disadvantage; it may have created a backlash that's almost as strong as the entire movement.

Comment by Seth Herd on Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity · 2024-07-16T04:35:29.783Z · LW · GW

Excellent work.

To summarize one central argument in briefest form:

Aschenbrenner's conclusion in Situational Awareness is wrong in overstating the claim.

He claims that treating AGI as a national security issue is the obvious and inevitable conclusion for those that understand the enormous potential of AGI development in the next few years. But Aschenbrenner doesn't adequately consider the possibility of treating AGI primarily as a threat to humanity instead of a threat to the nation or to a political ideal (the free world). If we considered it primarily a threat to humanity, we might be able to cooperate with China and other actors to safeguard humanity.

I think this argument is straightforwardly true. Aschenbrenner does not adequately consider alternative strategies, and thus his claim of the conclusion being the inevitable consensus is false.

But the opposite isn't an inevitable conclusion, either.

I currently think Aschenbrenner is more likely correct about the best course of action. But I am highly uncertain. I have thought hard about this issue for many hours both before and after Aschenbrenner's piece sparked some public discussion. But my analysis, and the public debate thus far, are very far from conclusive on this complex issue.

This question deserves much more thought. It has a strong claim to being the second most pressing issue in the world at this moment, just behind technical AGI alignment.

Comment by Seth Herd on Alignment: "Do what I would have wanted you to do" · 2024-07-13T21:44:54.676Z · LW · GW

I was confused for a moment. You start out by saying there's no alternative to CEV, then end up by saying there's a consensus that CEV isn't a good first alignment target.

Doesn't that mean that whether or how to pursue CEV isn't relevant to whether we live or die? It seems like we should focus on the alignment targets we'll pursue first, and leave CEV and the deeper nature of values and preferences for the Long Reflection - if we can arrange to get one.

I certainly hope you're right that there's a de-facto consensus that CEV/value alignment probably isn't relevant for our first do-or-die shots at alignment. It sure looks that way to me, so I'd like to see more LW brainpower going toward detailed analyses of the alignment schemes on which we're most likely to bet the future.

Comment by Seth Herd on Alignment: "Do what I would have wanted you to do" · 2024-07-13T21:10:48.619Z · LW · GW

I don't think Dearnaley's proposal is detailed enough to establish whether or not it would really have a "basin of attraction" in practice. I take it to be roughly the same idea as ambitious value learning and CEV. All of them might be said to have a basin of attraction (and therefore your continuity property) for this reason: if they initially misunderstand what humans want (a form of your delta), they should work to understand it better and make sure they understand it, as a byproduct of having their goal be not a certain set of outcomes but a variable standing for the outcomes humans prefer, where the exact value of that variable can remain unknown and be refined as one possible sub-goal.

Another related thing that springs to mind: all goals may have your continuity property with a slightly different form of delta. If an AGI has one main goal, and a few other less important goals/values, those might (in some decision-making processes) be eliminated in favor of the more important goal (if continuing to have those minor goals would hurt its ability to achieve the more important goal).

The other important piece to note about the continuity property is that we don't know how large a delta would be ruinous. It's been said that "value is fragile", but the post But exactly how complex and fragile? got almost zero meaningful discussion. Nobody knows until we get around to working that out. It could be that a small delta in some AGI architectures would just result in a world with slightly more things like dance parties and slightly fewer things like knitting circles, disappointing to knitters but not at all catastrophic. I consider that another important unresolved issue.

Back to your initial point: I agree that other preferences could interact disastrously with the indeterminacy of something like CEV. But it's hard for me to imagine an AGI whose goal is to do what humanity wants but that also has a preference for wiping out humanity. But it's not impossible. I guess with the complexity of pseudo-goals in a system like an LLM, it's probably something we should be careful of.

Comment by Seth Herd on Seth Herd's Shortform · 2024-07-13T20:34:47.173Z · LW · GW

Interesting. I wonder if this perspective is common, and that's why people rarely bother talking about the prompting portion of aligning LMAs.

I don't know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying "personality".

Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other "layers" of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.

Comment by Seth Herd on Alignment: "Do what I would have wanted you to do" · 2024-07-13T20:02:01.647Z · LW · GW

That's quite a collection of relevant work. I'm bookmarking this as the definitive collection on the topic; I haven't seen a better one, and I assume you would have linked it if it existed.

I think you should just go ahead and make this a post. When you do, we can have a whole discussion in a proper place, because this deserves more discussion.

Prior to you writing that post, here are some thoughts:

I think it's pretty clearly correct that CEV couldn't produce a single best answer, for the reasons you give and cite arguments for. Human values are quite clearly path-dependent. Given different experiences (and choices/stochastic brain activity/complex interactions between initial conditions and experiences), people will wind up valuing fairly different things.

However, this doesn't mean that something like CEV or ambitious value learning couldn't produce a pretty good result. Of all the many worlds that humans as a whole would absolutely love (compared to the nasty, brutish and short lives we now live), you could just pick one at random and I'd call that a dang good outcome.

I think your stronger claim, that the whole idea of values and beliefs is incoherent, should be kept separate. I think values and beliefs are pretty fuzzy and changeable, but real by the important meanings of those words. Whatever its ontological status, I prefer outcomes I prefer to ones I'd hate, and you could call those my values even if it's a very vague and path-dependent collection.

But I don't think that's probably a major component of this argument, so that stronger claim should probably be mostly set aside while considering whether anything like CEV/value learning could work.

Again, I hope you'll make this a post, but I'd be happy to continue the discussion here as well as there.

Comment by Seth Herd on Alignment: "Do what I would have wanted you to do" · 2024-07-13T11:38:27.876Z · LW · GW

It seems plausible that none of the many correct answers to what X would've wanted includes the AGI killing everyone.

Wrt the continuity property, I think Max Harm's corrigibility proposal has that, without suffering as obviously from the multiple interpretations you mention. Ambitious value learning is intended to as well, but has more of that problem. Roger Dearnaley's alignment as a basin of attraction addresses that stability property more directly. Sorry I don't have links handy.

Comment by Seth Herd on Alignment: "Do what I would have wanted you to do" · 2024-07-13T11:29:20.765Z · LW · GW

I don't think anyone has proposed this. I think the most similar proposal is my instruction-following AGI (particularly since I'm also mostly thinking of just such a text prompt in a language model agent as the implementation).

My proposal, with its checking emphasis, is aimed more at the range where the AGI is human level and above, whereas yours seems more aimed at the truly superintelligent range. Mine keeps the human in charge of figuring out what they would've wanted, in case the AGI gets that wrong.

Other related work is linked in that post.

The above objections to CEV partly apply to your proposal. There is probably not just one thing X would've wanted with more consideration, since conclusions may depend on circumstances.

I'm not sure that breaks the proposal; it could be that any of the several things X might've wanted would serve adequately.

Comment by Seth Herd on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T21:09:33.413Z · LW · GW

I feel the same way about being unqualified to consider the geopolitical dynamics. But I also agree that the questions of technical alignment and best alignment target are interconnected (e.g., instruction-following as a target seems to make technical alignment much easier). Therefore, I think no single human being is qualified to answer the whole question, and we need collaboration with people who have other expertise. Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?

I agree that we have much better footing to address both the technical and alignment target questions now than 10-15 years ago. So I think we need a new concerted effort.

Comment by Seth Herd on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T21:02:10.557Z · LW · GW

I share that hope, but I want to do as much as I can now to ensure that outcome. Highly convincing arguments that an approach leads with high likelihood to catastrophic war might actually make people take a different approach. If such arguments exist, I want to find them and spread them ASAP. I see no reason to believe such arguments don't exist. Even decent arguments for the risks might steer people away from them or generate solutions faster.

More specifics on the other thread.

Comment by Seth Herd on When is a mind me? · 2024-07-10T20:54:57.418Z · LW · GW

Okay, so would you say that the you of today goes on to experience the you-of-tomorrow's experiences? I think the relationship is the same to a perfect duplicate. The duplicate is no less you than the you of tomorrow is. They are separate people from their perspective after duplication, but almost-the-same-person to a much greater degree than twins.

You (pre-duplication) will go on to have two separate sets of experiences. Both are you from your current perspective before duplication; you should give them equal consideration in your decisions, since the causal relationship between you and the duplicate is the same as the one between you and your self of tomorrow.

Consider the case where the duplicate is teleported to your location and vice versa during duplication. Then consider the case where just your locations are swapped while you're asleep. And consider that you wouldn't care a whit if every molecule of your body were Theseus-swapped one by one for identical molecules in identical locations and roles while you slept.

Comment by Seth Herd on When is a mind me? · 2024-07-10T19:18:26.708Z · LW · GW

Yes, it's a duplicate which will also be you from your current perspective. If you duplicated your car tomorrow you'd have two cars; if you duplicate your mind tomorrow you need to plan on there being two yous tomorrow.

Comment by Seth Herd on When is a mind me? · 2024-07-10T18:46:34.688Z · LW · GW

Terminology aside, I think if you examine this carefully it's incoherent.

Do you think a successful upload would say things like "I'm still me!" and think thoughts like "I'm so glad I paid extra to give myself cool virtual environment options"? That seems like an inevitability if the causal patterns of your mind were captured. And it would be tough to disagree with a thing claiming up and down that it's you, citing your most personal memories as evidence.

Comment by Seth Herd on When is a mind me? · 2024-07-10T18:41:49.776Z · LW · GW

If you're not arguing against a perfect copy being you, then I don't understand your position, so much of what follows will probably miss the mark. I had written more but have to cut myself off, since this discussion is taking time without having much chance of improving anyone's epistemics noticeably.

The Olsen twins do not at all have qualitative identity. They have different minds: different sets of memories, beliefs, and values. So I just don't know what your position is. You claim that there doesn't need to be an answer; that seems false, since you may have to make decisions informed by your belief. You currently value your future self more than other people, so you act like you believe that's you in a functional sense.

Are you the same person tomorrow? It's not an identical pattern, but a continuation. I'm saying it's pretty-much you because the elements you wouldn't want changed about yourself are there.

If you value your body or your continuity over the continuity of your memories, beliefs, values, and the rest of your mind that's fine, but the vast majority will disagree with you on consideration. Those things are what we mean by "me".

I certainly do believe in the plural I (under the special circumstance I discussed); we must be understanding something differently in the torture question. I don't have a preference pre-copy for who gets tortured; both identical future copies are me from my perspective before copying. Maybe you're agreeing with that?

After copying, we're immediately starting to diverge into two variants of me, and future experiences will not be shared between them.

I was addressing a perfect computational copy. An imperfect but good computational copy is higher resolution, not lower, than a biological twin. It is orders of magnitude more similar to the pattern that makes your mind, even though it is less similar to the pattern that makes your body. What is writing your words is your mind, not your body, so when it says "I" it means the mind.

Noncomputational physicalism sounds like it's just confused. Physics performs computations and can't be separated from doing that.

Dual aspect theory is incoherent because you can't have our physics without doing computation that can create a being that claims and experiences consciousness like we do. Noncomputational physicalism sounds like the same thing.

I concede it's possible that consciousness includes some magic nonphysical component (that's not computation or pattern instantiated by physics as a pure result of how physics works). That could change my answer to when a mind is me. I don't think that's what you're arguing for though.

I've got to park this here to get other things done. I'll read any response but it might be a better use of time to restart the discussion more carefully - if you care.

Comment by Seth Herd on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T05:08:31.315Z · LW · GW

I agree on pretty much every point you've raised. I agree that there's a huge danger in successful DWIMAC or alignment-to-a-person. It could well lead to catastrophic conflict. I think this deserves a lot more analysis, because the creators of AGI will probably shoot for that if there's not a much better argument against it than we've seen so far.

This was entirely off-topic for this post; I don't know where we got off topic, but it didn't start in my last comment. And as you say, I think the choice of alignment target is almost as important as technical alignment techniques.

On the other hand, if alignment to human values isn't a stable target, we might be better off relying on the good nature of whoever both aligns their AGI to their intent/values and wins the AGI war. It's easier to indulge one's good nature when there is nearly zero downside to doing so, because you have incontestable control over the known lightcone. Even if horrible things happened in that war, most humans would prefer a happy, flourishing group of humans to be their friend. Sociopaths are the exception, so this route does not fill me with confidence either.

I think there's more to be worked out here.

You suggest that multiple DWIMAC AGIs with different allegiances might establish both the wisdom and a means of cooperating and splitting the rapidly expanding pie. I also place some guarded optimism in that possibility.