Comments

Comment by Gerardus Mercator (gerardus-mercator) on Claude seems to be smarter than LessWrong community · 2024-11-17T10:07:22.080Z

So, if I understand you correctly, you now agree that a paperclip-maximizing agent won't utterly disregard paperclips relative to survival, because that would be suboptimal for its utility function.
However, if a paperclip-maximizing agent utterly disregarded paperclips relative to investigating the possibility of an objective goal, that would also be suboptimal for its utility function.
It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.
However, I don't agree with that. I don't see why an intelligent agent would do that if its utility function didn't already include a term for objective goals.
Again, I think a toy example might help to illustrate your position.

Comment by Gerardus Mercator (gerardus-mercator) on Claude seems to be smarter than LessWrong community · 2024-11-12T09:30:56.790Z

First of all, your conversation with Claude doesn't really refute the orthogonality thesis.
You and Claude conclude that, as Claude says, "The very act of computing and modeling requires choosing what to compute and model, which again requires some form of decision-making structure..."
That sentence seems quite reasonable, which suggests that anything intelligent can probably be construed to have a goal.
However, Claude suddenly makes a leap of logic and concludes not just that the goal exists, but that it must be maximum power-seeking. I don't see the logical connection there.
I believe the flaw in that leap is shown by my earlier example: if an AI already has a goal, and power-seeking does not inherently satisfy that goal, then eternal maximum power-seeking is expected not to fulfill the goal at all, so the AI will choose a different strategy that is expected to do better. To be clear, that strategy will probably still involve power-seeking, maybe even maximum power-seeking, but it will probably not be eternal; the AI will presumably keep an eye on the situation and eventually feel safe enough to start putting energy into its goals.

Second of all, regarding your statement "I will remove Paperclip Maximizer from my further posts. This was not the critical part anyway, I mistakenly thought it will be easy to show the problem from this perspective.": I hope that you will include a different example in your posts, preferably with more details.
The reason is that a theory with no examples, one that makes no predictions, is useless.
I interpreted your theory as predicting that, no matter what goal an AI has, it would implement the strategy of eternal maximum power-seeking. I thought that prediction was incorrect. To demonstrate this, I invented a measurable goal and argued that the AI would not pick a strategy that scores as low on that goal as eternal maximum power-seeking does.
When we use English, claims like "The AI will eventually decide it's safe enough to relax a little, because it wants to relax" and "The AI will never decide it's safe, because survival is an overriding instrumental goal" can't be pitted directly against each other.
But when we use simulations and toy examples, we can see which claim better predicts the toy example, and thus presumably which claim better predicts real life.
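As a minimal sketch of what such a toy example might look like, the following compares "secure yourself forever" against "secure yourself for a while, then start producing". Everything in it is an illustrative assumption of mine rather than anything from the original posts: the function name `expected_paperclips`, the discrete time steps, the finite horizon, the per-step hazard that shrinks by a factor of `decay` for each step spent on security but never drops below `h_min`, and the choice to score a strategy by the expected number of paperclips produced.

```python
def expected_paperclips(switch_step, horizon=1000, h0=0.05, decay=0.9, h_min=0.001):
    """Expected paperclips for an agent that spends `switch_step` steps on
    self-preservation and every later step on making one paperclip.

    All parameters are hypothetical: h0 is the initial per-step chance of
    being destroyed, decay is how much one step of securing shrinks that
    chance, and h_min is a floor standing in for threats that never go away.
    """
    alive = 1.0    # probability that the agent has survived so far
    hazard = h0    # current per-step probability of destruction
    total = 0.0    # expected number of paperclips produced
    for t in range(horizon):
        alive *= 1.0 - hazard  # the agent must survive this step
        if t < switch_step:
            # Securing phase: lower the future hazard, but never to zero.
            hazard = max(h_min, hazard * decay)
        else:
            # Production phase: one paperclip if the agent is still alive.
            total += alive
    return total


if __name__ == "__main__":
    horizon = 1000
    never_switch = expected_paperclips(horizon, horizon=horizon)
    best_step = max(range(horizon + 1),
                    key=lambda s: expected_paperclips(s, horizon=horizon))
    print("never switch:", never_switch)
    print("best switch step:", best_step,
          "->", expected_paperclips(best_step, horizon=horizon))
```

Under these assumptions, never switching scores exactly zero, while some finite amount of securing followed by production scores higher than both never switching and switching immediately. A different hazard model could move the best switching point, but it would only be pushed out forever if securing kept paying off without limit.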

Third of all, when you say "I think I agree. Thanks a lot for your input." and then the vast majority of your message (that is, the screenshot of the conversation with Claude) is unrelated to my input, it gives me the impression that you are not engaging with my arguments.
If my arguments have convinced you to some extent, I would like to hear what specifically you agree with me about, and what specifically you still disagree with me about.

Comment by Gerardus Mercator (gerardus-mercator) on Claude seems to be smarter than LessWrong community · 2024-11-10T06:37:02.893Z

I'll throw my own hat into the ring:
I disagree with your argument that, assuming the maximizer believes there is some chance that known threats, known unknown threats, and unknown unknown threats exist, "the intelligent maximizer should take care of these threats before actually producing paper clips. And this will probably never happen."
In your posts, you describe the paperclip maximizer as, simply, a paperclip maximizer. It does things to maximize paperclips, because its goal is to maximize paperclips.
(Well, in your posts you specifically assert that it doesn't do anything paperclip-related and instead spends all its effort on preserving itself:
"Every bit of energy spent on paperclips is not spent on self-preservation. There are many threats (comets, aliens, black swans, etc.), caring about paperclips means not caring about them.

You might say maximizer will divide its energy among few priorities. Why is it rational to give less than 100% for self-preservation? All other priorities rely on this."
You also claim that doing so is the most rational action given its priorities, that is, its goals.)

However, you don't go into detail about the paperclip maximizer's goals. I think that the flaw in your logic becomes more apparent when we consider a more specific example of a paperclip-maximizing goal.
Let's define the expected utility function u(S) from strategies to numbers as follows: u(S) = ∫(t = 0 to infinity) E[paperclips existing at time t | strategy = S] dt.
The strategy A of "spend all your effort on preserving yourself" has an expected utility of 0, because the paperclip maximizer never makes any paperclips.
The strategy B of "spend all your effort on making one paperclip as quickly as possible, then switch to spending all your effort on preserving yourself" has an expected utility of p*x, where p is the chance that the paperclip maximizer manages to make a paperclip, and x is the expected amount of time that the paperclip, once made, survives.

If both p and x are greater than 0, then strategy B has a higher expected utility than strategy A.
Would strategy B lead to the paperclip maximizer's expected survival time being lower than if it had chosen strategy A? Presumably, yes.
But the thing is, u(S) doesn't contain a term that directly mentions expected survival time, only a term that mentions paperclips. So the paperclip maximizer only cares about its survival insofar as its survival allows it to make paperclips.
It's the difference between terminal and instrumental goals: paperclips are the terminal goal, and survival is merely instrumental to it.

Therefore, a paperclip maximizer that wanted to maximize u(S) would choose strategy B over strategy A.
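The comparison between strategies A and B can also be checked numerically. The sketch below is only an illustration under assumptions of mine: the function names are hypothetical, and the paperclip's lifetime is drawn from an exponential distribution with mean x purely for convenience (only the mean matters for the expected utility). Each trial integrates "paperclips existing at time t" over time, which for a single paperclip is just how long it exists.

```python
import random


def utility_a() -> float:
    """Strategy A never makes a paperclip, so the integrand is 0 at every t."""
    return 0.0


def estimate_utility_b(p: float, x: float, num_trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of u(B) for one paperclip made with probability p.

    Assumes x > 0. The exponential lifetime is an illustrative choice,
    not something specified in the comment above.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        if rng.random() < p:
            # The paperclip was made; it contributes its lifetime to the integral.
            total += rng.expovariate(1.0 / x)  # exponential with mean x
        # Otherwise no paperclip ever exists and this trial contributes 0.
    return total / num_trials


if __name__ == "__main__":
    p, x = 0.5, 10.0  # arbitrary example values
    print("u(A) =", utility_a())
    print("u(B) ~", estimate_utility_b(p, x, num_trials=100_000), "vs p*x =", p * x)
```

With any p and x greater than 0, the estimate for B should land close to p*x and therefore above A's 0, matching the argument above.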