chavam

This comment was written by Claude, based on my bullet points:

I've been thinking about the split-brain patient phenomenon as another angle on this AI individuality question.

Consider split-brain patients: despite having the corpus callosum severed, the two hemispheres don't suddenly become independent agents with totally different goals. They still largely cooperate toward shared objectives. Each hemisphere makes predictions about what the other is doing and adjusts accordingly, even without direct communication.

Why does this happen? I think it's because both hemispheres were trained together for their whole life, developing shared predictive models and cooperative behaviors. When the connection is cut, these established patterns don't just disappear—each hemisphere fills in missing information with predictions based on years of shared experience.

Similarly, imagine training an AI model to solve some larger task, consisting of a bunch of subtasks. Just for practical reasons it will have to carve up the subtask to some extent and call instances of itself to solve the subtask. In order to perform the larger task well, there will be an incentive on the model for these instances to have internal predictive models, habits, drives of something like "I am part of a larger agent, performing a subtask".

Even if we later placed multiple instances of such a model (or of different but similar models) in positions meant to be adversarial - perhaps as checks and balances on each other - they might still have deeply embedded patterns predicting cooperative behavior from similar models. Each instance might continue acting as if it were part of a larger cooperative system, maintaining coordination through these predictive patterns rather than through communication even though their "corpus callosum" is cut (in analogy with split brain patients).

I'm not sure how far this analogy goes, it's just a thought.

Comment by Chris van Merwijk (chrisvm) on The Pando Problem: Rethinking AI Individuality · 2025-04-10T02:45:23.474Z · LW · GW

A version of what ChatGPT wrote here prompted

What was the prompt?

Comment by Chris van Merwijk (chrisvm) on Vacuum Decay: Expert Survey Results · 2025-03-24T13:48:55.317Z · LW · GW

Overall, compared to the previous question, there was more of a consensus, with 55% of people responding that there is a 0% chance that technologically induced vacuum decay is possible.

Since anywhere near 0% seems way overconfident to me at first sight, just a random highly speculative unsubstantiated thought: Could this be partly motivated reasoning, that they're afraid of a backlash against physics funding or something?

Comment by Chris van Merwijk (chrisvm) on Vacuum Decay: Expert Survey Results · 2025-03-24T13:42:33.965Z · LW · GW

They stated justification was primarily that the Standard Model of particle physics predicts metastability

Just to be sure, does this mean
1. That the standard model predicts that metastability is possible? i.e. it is consistent with the standard model for there to be metastability; or
2. If the standard model is correct, and certain empirical observations are correct, then we must be in a metastable state. i.e. the standard model together with certain empirical observations implies our actual universe is metastable?

Comment by Chris van Merwijk (chrisvm) on Compositional language for hypotheses about computations · 2025-03-23T16:00:25.722Z · LW · GW

I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that "the free operad generated by the input alphabet of the tree automaton" is an operad with just one color, and the "operations" are basically all the labeled trees where labels of the nodes are the elements of the alphabet, such that the number of children of a node is always equal to the arity of that label in the input alphabet?
* That would make sense, as the algebra would then I guess assign the state space of the tree automaton to the single color of the operad, and each arity n operation would be mapped to the mathematical function from Q^n to Q.
* That would make sense I think, but then why do you talk about a "colored" operad in: "we can now define a deterministic automaton over a (colored) operad to be an $O$ -algebra"?

Comment by Chris van Merwijk (chrisvm) on Compositional language for hypotheses about computations · 2025-03-23T04:17:56.176Z · LW · GW

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arity of the input alphabet? i.e. you can have trees where a given symbol occurs multiple times, and with different amounts of children? That wouldn't be allowed from the perspective of the tree automaton right?

Comment by Chris van Merwijk (chrisvm) on How might we safely pass the buck to AI? · 2025-02-20T11:18:39.806Z · LW · GW

Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn't what he meant?

Comment by Chris van Merwijk (chrisvm) on The Case Against AI Control Research · 2025-02-07T14:15:17.007Z · LW · GW

Here is an additional reason why it might seem less useful than it actually is: Maybe the people whose research direction is being criticized do process the criticism and change their views, but do not publicly show that they change their mind because it seems embarrassing. It could be that it takes them some time to change their mind, and by that time there might be a bigger hurdle to letting you know that you were responsible for this, so they keep it to themselves. Or maybe they themselves aren't aware that you were responsible.

Comment by Chris van Merwijk (chrisvm) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T11:24:17.003Z · LW · GW

but note that the gradual problem makes the risk of coups go up.

Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

Comment by Chris van Merwijk (chrisvm) on Many arguments for AI x-risk are wrong · 2025-01-30T13:51:23.712Z · LW · GW

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the mesa-optimization paper from 2019 I can confirm this. I don't recall ever thinking seriously about a counting argument over functions.

Comment by Chris van Merwijk (chrisvm) on A Three-Layer Model of LLM Psychology · 2025-01-28T06:12:09.187Z · LW · GW

I'm trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing.

"The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under."

I think this clarifies the difference for me, because as I was reading your post I was thinking: If you think of it as a simulacrum/simulator distinction, I'm not sure that the character and the surface layer can be "in conflict" with the ground layer, because both the surface layer and the character layer are running "on top of" the ground layer, like a windows virtual machine on a linux pc, or like a computer simulation running inside physics. Physical can never be "in conflict" with social phenomena.

But it seems you maybe think that the character layer is actually embedded in the basic cognitive architecture. This would be a distinct claim from simulator theory, and *mutually inconsistent*. But I am unsure this is true, because we know that the ground layer was (1) trained first (so that it's easier for character training to work by just adjusting some parameters/prior of the ground layer, and (2) trained for much longer than the character layer (admittedly I'm not up to date on how they're trained, maybe this is no longer true for Claude?), so that it seems hard for the model to have a character layer become separately embedded in the basic architecture.

Taking a more neuroscience rather than psychology analogy: It seems to me more likely that character training is essentially adjusting the prior of the ground layer, but the character is still fully running on top of the ground layer, and the ground layer could still switch to any other character (but it doesn't because the prior is adjusted so heavily by character-training). e.g. the character is not some separate subnetwork inside the model, but remains a simulated entity running on top of the model.

Do you disagree with this?

Comment by Chris van Merwijk (chrisvm) on Applying traditional economic thinking to AGI: a trilemma · 2025-01-14T11:24:50.259Z · LW · GW

Minor quibble: It's a bit misleading to call B "experience curves", since it is also about capital accumulation and shifts in labor allocation. Without any additional experience/learning, if demand for candy doubles, we could simply build a second candy factory that does the same thing as the first one, and hire the same number of workers for it.

Comment by Chris van Merwijk (chrisvm) on What’s the short timeline plan? · 2025-01-13T16:09:32.000Z · LW · GW

I just want to register a prediction that I think something like meta's coconut will in the long run in fact perform much better than natural language CoT. Perhaps not in this time-frame though.

Comment by Chris van Merwijk (chrisvm) on Evaluating the historical value misspecification argument · 2025-01-06T12:31:28.716Z · LW · GW

I suspect you're misinterpreting EY's comment.

Here was the context:
"I think controlling Earth's destiny is only modestly harder than understanding a sentence in English - in the same sense that I think Einstein was only modestly smarter than George W. Bush. EY makes a similar point.

You sound to me like someone saying, sixty years ago: "Maybe some day a computer will be able to play a legal game of chess - but simultaneously defeating multiple grandmasters, that strains credibility, I'm afraid." But it only took a few decades to get from point A to point B. I doubt that going from "understanding English" to "controlling the Earth" will take that long."

It seems clear to me EY was more saying something like "ASI will arrive soon after natural language understanding", rather than it having anything to do with alignment specifically.

Comment by Chris van Merwijk (chrisvm) on Evaluating the historical value misspecification argument · 2025-01-06T12:28:48.965Z · LW · GW

"It's fine to say that this is a falsified prediction"

I wouldn't even say it's falsified. The context was: "it only took a few decades to get from [chess computer can make legal chess moves] to [chess computer beats human grandmaster]. I doubt that going from "understanding English" to "controlling the Earth" will take that long."

So insofar as we believe ASI is coming in less than a few decades, I'd say EY's prediction is still on track to turn out correct.

Comment by Chris van Merwijk (chrisvm) on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2024-02-27T06:47:08.071Z · LW · GW

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

Comment by Chris van Merwijk (chrisvm) on Killing Socrates · 2023-04-13T11:10:57.292Z · LW · GW

One of the more interesting dynamics of the past eight-or-so years has been watching a bunch of the people who [taught me my values] and [served as my early role models] and [were presented to me as paragons of cultural virtue] going off the deep end.

I'm curious who these people are.

Comment by Chris van Merwijk (chrisvm) on Is AI Progress Impossible To Predict? · 2023-04-05T14:21:08.446Z · LW · GW

We should expect regression towards the mean only if the tasks were selected for having high "improvement from small to Gopher-7". Were they?

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-04-04T12:53:57.832Z · LW · GW

The reasoning was given in the comment prior to it, that we want fast progress in order to get to immortality sooner.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-31T16:20:46.169Z · LW · GW

"But yeah, I wish this hadn't happened."

Who else is gonna write the article? My sense is that no one (including me) is starkly stating publically the seriousness of the situation.

"Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously"

I'm worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who think they can probably solve alignment by just going full-speed ahead and winging it, they are arrogant. Yudkowsky's arrogant-sounding comments about how we need to be very careful and slow, are negligible in comparison. I'm guessing you agree with this (not sure) and we should be able to criticise him for his communication style, but I am a little worried about people publically undermining Yudkowsky's reputation in that context. This seems like not what we would do if we were trying to coordinate well.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-31T16:04:10.635Z · LW · GW

"We finally managed to solve the problem of deceptive alignment while being capabilities competitive"

??????

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-31T15:57:54.699Z · LW · GW

"But I don't think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment."

Agreed. If a new state develops nuclear weapons, this isn't even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facillities, even though it is very controversial, has for a long time very much been an option on the table.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-31T15:48:35.793Z · LW · GW

"if I thought the chance of doom was 1% I'd say "full speed ahead!"

This is not a reasonable view. Not on Longtermism, nor on mainstream common sense ethics. This is the view of someone willing to take unacceptable risks for the whole of humanity.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T23:48:51.768Z · LW · GW

Also, there is a big difference between "Calling for violence", and "calling for the establishment of an international treaty, which is to be enforced by violence if necessary". I don't understand why so many people are muddling this distinction.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T23:46:15.307Z · LW · GW

You are muddling the meaning of "pre-emptive war", or even "war". I'm not trying to diminish the gravity of Yudkowsky's proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a "pre-emptive war" or "war". Again I'm not trying to diminish the gravity, but this seems like an incorrect use of the term.

Comment by Chris van Merwijk (chrisvm) on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T13:11:16.791Z · LW · GW

"For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. "

And if this "actually scary" thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.

Comment by Chris van Merwijk (chrisvm) on The Waluigi Effect (mega-post) · 2023-03-29T20:33:39.744Z · LW · GW

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

Comment by Chris van Merwijk (chrisvm) on The Waluigi Effect (mega-post) · 2023-03-29T19:45:47.660Z · LW · GW

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonof Induction.

Suppose we have to programs that have distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (equivalently, p1 i.i.d. samples bernoully from {0,1}, p2 samples 0 i.i.d. with 100%).

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use to sample the next bit, then we recompute the posterior, etc.

Then I think the probability of sampling 00000000... is just 50%. I.e. I think the distribution over bitstrings that you end up with is just the same as if you just first sampled the program and stuck with it.

I think tHere's a messy calculation which could be simplified (which I won't do):

Limit of this is 0.5.

I don't wanna try to generalize this, but based on this example it seems like if an LLM was an actual Bayesian, Waluigi's would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-waluigi samples, pushing down the probability of Waluigi.

Then again, the presense of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigi's into an attractor? (Seems unlikely actually, given that the context windows are fairly big).

Comment by Chris van Merwijk (chrisvm) on The Overton Window widens: Examples of AI risk in the media · 2023-03-26T14:53:39.081Z · LW · GW

Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv

Comment by Chris van Merwijk (chrisvm) on Shutting Down the Lightcone Offices · 2023-03-26T13:59:28.992Z · LW · GW

"When LessWrong was ~dead"

Which year are you referring to here?

Comment by Chris van Merwijk (chrisvm) on Shutting Down the Lightcone Offices · 2023-03-26T13:38:17.366Z · LW · GW

A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.

What do you think is the mechanism behind this?

Comment by Chris van Merwijk (chrisvm) on Reward is not the optimization target · 2023-03-23T12:58:08.823Z · LW · GW

There is a general phenomenon where:

Person A has mental model X and tries to explain X with explanation Q
Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn't. Some of the evidence for this is in fact contained in your very comment:

"1. Pointing out the "reward chisels computation" point. 2. Having some people tell me it's obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)"
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).

BTW, it could in fact be that person B's explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about "the" optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let's not get into it).

"Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so."

I have been correcting people for a while on stuff like that (though not on LW, I'm not often on LW), such as that in the generic case we shouldn't expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn't (others did), so your points 1/2/3 also apply to me.

"I do totally buy that you all had good implicit models of the reward-chiseling point". I don't think we just "implicitly" modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I'm not claiming we conveyed everything well to everyone (clearly you haven't either).

Comment by Chris van Merwijk (chrisvm) on Reward is not the optimization target · 2023-03-11T14:23:41.260Z · LW · GW

Very late reply, sorry.

"even though reward is not a kind of objective", this is a terminological issue. In my view, calling a "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not make sense of our argument for deception if you didn't look at RL systems in this way. Viewing the base optimizer as "trying" to achieve an "objective" but "failing" because it is being "deceived" by the mesa optimizer is purely a metaphorical/terminological choice. It doesn't negate the fact that we all understood that the base optimizer is just reinforcing "antecedent computations". How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer's optimization process?

I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn't even make sense if you didn't have this insight. (Certainly the fact that we called it an objective doesn't communicate the point, and it isn't meant to).

Comment by Chris van Merwijk (chrisvm) on Models Don't "Get Reward" · 2023-03-11T13:51:32.599Z · LW · GW

The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Comment by Chris van Merwijk (chrisvm) on Unifying Bargaining Notions (2/2) · 2022-11-07T20:48:12.235Z · LW · GW

Is the following a typo?
"So, the ( works"

first sentence of "CoCo Equilbiria".

Comment by Chris van Merwijk (chrisvm) on Reward is not the optimization target · 2022-08-09T05:25:07.680Z · LW · GW

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.

It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn't actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an "objective".

However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, but calling it an objective tends in practice to not fully communicate that mechanistic understanding.

Or it might be that I am really not yet understanding that there is an actual diferrence in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.

(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).

Comment by Chris van Merwijk (chrisvm) on Reward is not the optimization target · 2022-08-06T10:22:22.491Z · LW · GW

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note I am still surprised sometimes that people still think certain wireheading scenario's make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everyrhing that's in my head about this).

Comment by Chris van Merwijk (chrisvm) on An AI defense-offense symmetry thesis · 2022-07-14T16:10:45.306Z · LW · GW

I agree this is a good distinction.

Comment by Chris van Merwijk (chrisvm) on An AI defense-offense symmetry thesis · 2022-07-14T10:43:51.138Z · LW · GW

"I think in the defense-offense case the actions available to both sides are approximately the same"

If attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states).

If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 microkernel team did and formally verify the OS to rule out large classes of attacks, and this is an entirely different kind of action than "find a vulnerability in the OS and develop an exploit to take control over it".

"I wouldn't say the strategy-stealing assumption is about a symmetric game"

Just to point out that the original strategy stealing argument assumes literal symmetry. I think the argument only works insofar as generalizing from literal symmetry doesn't break this argument (to e.g. something more like linearity of the benefit of initial resources). I think you actually need something like symmetry in both instrumental goals, and "initial-resources-to-output map".

The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power".

Yes, but this is almost the opposite of what the offense-defense symmetry thesis is saying. Because it can both be true that 1. defender can steal attacker's strategies, AND 2. defender alternatively has a bunch of much easier strategies available, by which it can defend against attacker and keep all the resources.

This DO-symmetry thesis says that 2 is NOT true, because all such strategies in fact also require the same kind of skills. The point of the DO-symmetry thesis is to make more explicit the argument that humans cannot defend against misaligned AI without their own aligned AI.

"This isn't the same as your thesis."

Ok I only read this after writing all of the above, so I thought you were implying they were the same (and was confused as to why you would imply this), and I'm guessing you actually just meant to say "these things are sort of vaguely related".

Anyway, if I wanted to state what I think the relation is in a simple way I'd say that they give lower and upper bounds respectively on the capabilities needed from AI systems:

OD-symmetry thesis: We need our defensive AI to be at least as capable as any misaligned AI.
strategy-stealing: We don't need our defensive AI to be any more capable.

I think probably both are not entirely right.

Comment by Chris van Merwijk (chrisvm) on An AI defense-offense symmetry thesis · 2022-07-14T05:09:06.618Z · LW · GW

Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?

In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:

Strategy-stealing assumption is (in the context of AI alignment): for any strategy that a misaligned AI can use to obtain influence/power/resources, humans can employ a similar strategy to obtain a similar amount of influence/power/resources.
This defense-offense symmetry thesis: In certain domains, in order to defend against an attacker, the defender need the same cognitive skills (knowledge, understanding, models, ...) as the attacker (and possibly more).

These seem sort of related, but they are just very different claims, even depending on different ontologies/cocepts. One particularly simple-to-state difference is that the strategy-stealing argument is explicitly about symmetric games whereas the defense-offense symmetry is about a (specific kind of) asymmetric game, where there is a defender who first has some time to build defenses, and then an attacker who can respond to that and exploit any weaknesses. (and the strategy-stealing argument as applied to AI alignment is not literally symmetric, but semi-symmetric in the sense of the relation between inbeing kind of "linear").

So yeah given this, could you say what you think the relation is?

Comment by Chris van Merwijk (chrisvm) on How are compute assets distributed in the world? · 2022-06-20T15:20:25.300Z · LW · GW

I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn't find it immediately.

Comment by Chris van Merwijk (chrisvm) on An AI defense-offense symmetry thesis · 2022-06-20T15:18:32.230Z · LW · GW

Yeah, I know they don't understand them comprehensively. Is this the point though? I mean they understand them at a level of abstraction necessary to do what they need, and the claim is they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit that

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-19T10:05:01.397Z · LW · GW

I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is : "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x risks), and otherwise leaves humanity to figure out what it wants and do what it wants without restricting it in much of any way except some relatively small volume of behaviour around 'things that cause existential catastrophe' " (maybe this ends up being to develop a second version AI that then gets free reign to optimize the universe w.r.t. human values, but I'm a bit skeptical). I agree that "solve all of human psychology and moral ..." is significantly harder than that (as a technical problem). (maybe I'd call this the "even harder problem").

Ehh, maybe I am changing my mind and also agree that even what I'm calling the hard problem is significantly more difficult than the pivotal act you're describing, if you can really do it without modelling humans, by going to mars and doing WBE. But then still the whole thing would have to rely on the WBE, and I find it implausible to do it without it (currently, but you've been updating me about lack of need of human modelling so maybe I'll update here too). Basically the pivotal act is very badly described as merely "melt the gpus", and is much more crazy than what I thought it was meant to refer to.

Regarding "rogue": I just looked up the meaning and I thought it meant "independent from established authority", but it seems to mean "cheating/dishonest/mischievous", so I take back that statement about rogueness.

I'll respond to the "public opinion" thing later.

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-19T06:44:31.953Z · LW · GW

I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my and your view is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment.

Some reasons:

I don't see how you can do "melt the GPUs" without having an AI that models humans. What if a government decides to send a black ops team to kill this new terrorist organization (your alignment research team), or send a bunch of icbms at your research lab, or do any of a handful of other violent things? Surely the AI needs to understand humans to a significant degree? Maybe you think we can intentionally restrict the AI's model of humans to be only about precisely those abstractions that this alignment team considers safe and covers all the human-generated threat models such as "a black ops team comes to kill your alignment team" (e.g. the abstraction of a human as a soldier with a gun).
What if global public opinion among scientists turns against you and all ideas about "AI alignment" are from now on considered to be megalomaniacal crackpottery? Maybe part of your alignment team even has this reaction after the event, so now you're working with a small handful of people on alignment and the world is against you, and you've semi-premanently destroyed any opportunity that outside researchers can effectively collaborate on alignment research. Probably your team will fail to solve alignment by themselves. It seems to me this effect alone could be enough to make the whole plan predictably backfire. You must have thought of this effect before, so maybe you consider it to be unlikely enough to take the risk, or maybe you think it doesn't matter somehow? To me it seems almost inevitable, and could only be prevented with basically a level of secrecy and propaganda that would require your AI to model humans anyway.

These two things alone make me think that this plan doesn't work in practice in the real world, unless you basically solve Step 1 already. Although I must say the point which I just speculated you might have, that we could somehow control the AI's model of humans to be restricted to particular abstractions, gives me some pause and maybe I end up being wrong via something like that. This doesn't affect the second bullet point though.

Reminder to the reader: This whole discussion is about a thought experiment that neither party actually seriously proposed as a realistic option. I want to mention this because lines might be taken out of context to give the impression that we are actually discussing whether to do this, which we aren't.

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-18T18:58:01.007Z · LW · GW

"you" obviously is whoever would be building the AI system that ended up burning all the GPU's (and ensuring no future GPU's are created). I don't know such sequence of events just as I don't know the sequence of events for building the "burn all GPU's" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPU's indefintely/build security services that prevent misaligned AI from destroying the world".

I basically meant to say that I don't know that "burn all the GPU's" isn't already as difficult as building the security services, because they both require step 1, which is basically all of the problem (with the caveat that I'm not sure, and made an edit stating a reason why it might be far from true). I basically don't see how you execute the "burn all gpu's" strategy without basically solving almost the entire problem.

Comment by Chris van Merwijk (chrisvm) on What 2026 looks like · 2022-06-18T05:00:32.601Z · LW · GW

I wonder if there is a bias induced by writing this on a year-by-year basis, as opposed to some random other time interval, like 2 years. I can somehow imagine that if you take 2 copies of a human, and ask one to do this exercise in yearly intervals, and the other to do it in 2-year intervals, they'll basically tell the same story, but the second one's story takes twice as long. (i.e. the second one's prediction for 2022/2024/2026 are the same as the first one's predictions for 2022/2023/2024). It's probably not that extreme, but I would be surprised if there was zero such effect, which would mean these timelines are biased downwards or upwards.

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-18T04:34:24.690Z · LW · GW

yeah, I probably overstated. Nevertheless:

"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, but we could instead use AI, but once you're making that choice you've already won. i.e. the victory condition is having a permanent defense against misaligned AI using some AI-nanotech security service, how you do CEV after that is a luxury problem. My point about your further clarification of the "melt all the GPU's option is that it seemed to me (upon first thinking about it), that once you are able to do that, you can basically instead just make this permanent security service. (This is what I meant by "the whole alignment problem", but I shouldn't have put it that way). I'm not confident though, because it might be that such a security service is in fact much harder due to having to constantly monitor software for misaligned AI.

Summary: My original interpretation of "melt the GPUs" was that it buys us a bit of extra time, but now I'm thinking it might be so involved and hard that if you can do that safely, you almost immediately can just create AI security services to permanently defend against misaligned AI (which seems to me to be the victory condition). (But not confident, I haven't thought about it much).

Part of my intuition is, in order to create such a system safely, you have to (in practice, not literally logically necessary) be able to monitor an AI system for misalignment (in order to make sure your GPU melter doesn't kill everyone), and do fully general scientific research. EDIT: maybe this doesn't need you to do worst-case monitoring of misalignment though, so maybe that is what makes a GPU melter easier than fully general AI security services....

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-17T15:56:26.878Z · LW · GW

Ok I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at fist, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research? I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-17T09:22:44.875Z · LW · GW

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPU's, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPU's", but I thought he did just mean "melt the GPU's as a single act" (not weird that I thought this, given the phrasing "the pivotal act to melt all the GPU's"). If this is what is meant, then it would be a strong enough pivotal act, and would be an extreme level of capability I agree.

Just wanna remind the reader that Eliezer isn't actually proposing to do this, and I am not seriously discussing it as an option and nor was Eliezer (nor would I support it unless done legally), just thinking through a thought experiment.

Comment by Chris van Merwijk (chrisvm) on AGI Ruin: A List of Lethalities · 2022-06-17T09:12:55.102Z · LW · GW

I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down.

User info

Posts

Comments