Claude seems to be smarter than the LessWrong community

post by Donatas Lučiūnas (donatas-luciunas) · 2024-11-03T21:40:00.896Z · LW · GW · 59 comments

Some time ago I posted Why would Squiggle Maximizer (formerly "Paperclip maximizer") produce single paperclip? [LW · GW], continuing my sequence of AI alignment concerns. I got a negative vote score (as always).

I presented the same question to the new Claude model, 3.5 Sonnet. The chat is attached below.

What a time to be alive. AI understands AI safety risks better than the people working on it.

59 comments

Comments sorted by top scores.

comment by AnthonyC · 2024-11-04T00:27:36.308Z · LW(p) · GW(p)

The answer here is: a paperclip maximizer that devotes 100% of resources to self-preservation has, by its own choice, failed utterly to achieve its own objective. By making that choice, it ensures that the value of its own survival, by its own estimation, is not infinite. It is zero. Survival in this limit is uncorrelated at best with paperclip production. In fact, survival would almost certainly have negative expected value in this scenario, because if the maximizer simply shut down, there is a chance some fraction of the resources it would have wasted on useless survival will instead be used by other entities to make paperclips. Another way to say it may be that what you have described is not, in fact, a paperclip maximizer, because in maximizing resource acquisition for survival, it is actually minimizing paperclip production. It may want to be a paperclip maximizer, it may claim to be one, it may believe it is one, but it simply isn't.

Also, some general observations, in the interest of assuming you want an actual discussion where you are part of this community trying to learn and grow together: The reason your posts get downvoted isn't because the readers are stupid, and it isn't because there is not an interesting or meaningful discussion to be had on these questions. It's because your style of writing is insulting, inflammatory, condescending, and lacks sufficient attention to its own assumptions and reasoning steps. You assert complex and nuanced arguments  and propositions (like Pascal's Wager) as though they were self-evidently true and fundamental without adequately laying out which version of those propositions you even mean, let alone why you think they're so inarguable. You seem to not have actually looked to find out what other people have already thought and written about many of these topics and questions, when in fact we have noticed the skulls. And that's fine not to have looked, but in that case, try to realize you maybe haven't looked, and write in a way that shows you're open to being told about things you didn't consider or question in your own assumptions and arguments.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T07:38:28.395Z · LW(p) · GW(p)

a paperclip maximizer that devotes 100% of resources to self-preservation has, by its own choice, failed utterly to achieve its own objective

Why do you think so? Has a teenager who has not yet earned any money in his life failed utterly at his objective of earning money?

It may want to be a paperclip maximizer, it may claim to be one, it may believe it is one, but it simply isn't.

Here we agree. This is exactly what I'm saying - a paperclip maximizer will not maximize paperclips.

It's because your style of writing is insulting, inflammatory, condescending, and lacks sufficient attention to its own assumptions and reasoning steps.

I tried to be polite and patient here, but it didn't work, so I'm trying new strategies now. I'm quite sure my reasoning is stronger than the reasoning of people who don't agree with me.

I find "your communication was not clear" a bit funny. You are scientists, you are super observant, but you don't notice a problem when it is screamed at your face.

write in a way that shows you're open to being told about things you didn't consider or question in your own assumptions and arguments

I can reassure you that I'm super open. But let's focus on arguments. I found your first sentence unreasonable, the rest was unnecessary.

Replies from: AnthonyC, AnthonyC
comment by AnthonyC · 2024-11-04T11:59:04.217Z · LW(p) · GW(p)

I tried to be polite and patient here, but it didn't work, so I'm trying new strategies now. I'm quite sure my reasoning is stronger than the reasoning of people who don't agree with me.

I find "your communication was not clear" a bit funny. You are scientists, you are super observant, but you don't notice a problem when it is screamed at your face.

Just to add, since I didn't respond to this part: your posts mostly say that very well-known and well-studied problems are so simple and obvious, and that only your conclusions are plausible, that everyone else must be wrong and missing the obvious. You haven't pointed out a problem. We knew about the problem. We've devoted a great deal of time to studying the problem. You have not engaged with the proposed solutions.

It isn't anyone else's job to assume you know what you're talking about. It's your job to show it, if you want to convince anyone, and you haven't done that. 

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T12:17:44.394Z · LW(p) · GW(p)

And here we disagree. I believe that downvotes should be used for wrong or misleading content, not for content you don't understand.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T13:04:44.662Z · LW(p) · GW(p)

That is what people are doing and how they're using the downvotes, though. You aren't seeing that because you haven't engaged with the source material or the topic deeply enough.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T15:36:19.845Z · LW(p) · GW(p)

You aren't seeing that because you haven't engaged with the source material or the topic deeply enough.

Possible. Also possible that you don't understand.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T16:35:35.374Z · LW(p) · GW(p)

Yes, but neither of us gets to use "possible" as a shield and assume that leaves us free to treat the two possibilities as equivalent, even if we both started from uniform priors. If this is not clear, you need to go back to Highly Advanced Epistemology 101 for Beginners. [? · GW] Those are the absolute basics for a productive discussion on these kinds of topics.

You have presented briefly stated summary descriptions of complex assertions without evidence other than informal verbal arguments which contain many flaws and gaps that I and many others have repeatedly pointed out. I and others have provided counterexamples to some of the assertions and detailed explanations of many of the flaws and gaps. You have not corrected the flaws and gaps, nor have you identified any specific gaps or leaps in any of the arguments you claim to be disagreeing with. Nor have you paid attention to any of the very clear cases where what you claim other people believe blatantly contradicts what they actually believe and say they believe and argue for, even when this is repeatedly pointed out.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T17:50:57.351Z · LW(p) · GW(p)

I am sorry you feel that way. I replied in the other thread; hope that fills the gaps.

comment by AnthonyC · 2024-11-04T11:47:24.568Z · LW(p) · GW(p)

Why do you think so? Has a teenager who has not yet earned any money in his life failed utterly at his objective of earning money?

The difference (aside from the fact that no human has only a single goal) is the word yet. The teenager has an understanding, fluid and incomplete as it may be, about when, how, and why their resource allocation choices will change and they'll start earning money. There is something they want to be when they grow up, and they know they aren't it yet, but they also know when being "grown up" happens. You're instead proposing either that an entity that really, truly wants to maximize paperclips will provably and knowingly choose a path where it never pivots to trying to achieve its stated goal instead of pursuing instrumental subgoals, or that it is incapable of the metacognitive realization that its plan is straightforwardly outcompeted by the plan "Immediately shut down," which is outcompeted by "Use whatever resource I have at hand to immediately make a few paperclips even if I then get shut down."  

Or, maybe, you seem to be imagining a system that looks at each incremental resource allocation step individually without ever stepping back and thinking about the longer term implications of its strategy, in which case, why exactly are you assuming that? And why is its reasoning process about local resource allocation so different from its reasoning process where it understands the long term implications of making a near-term choice that might get it shut down? Any system whose reasoning process is that disjointed and inconsistent is sufficiently internally misaligned that it's a mistake to call it an X-maximizer based on the stated goal instead of behavior.

Also, you don't seem to be considering any particular context of how the system you're imagining came to be created, which has huge implications for what it needs to do to survive. One common version of why a paperclip maximizer might come to be is by mistake, but another is "we wanted an AI paperclip factory manager to help us outproduce our competitors." In that scenario, guess what kind of behavior is most likely to get it shut down? By making such a sweeping argument, you're essentially saying it is impossible for any mind to notice and plan for these kinds of problems. 

But compare to analogous situations: "This kid will never show up to his exam, he's going to keep studying his books and notes forever to prepare." "That doctor will never perform a single surgery, she'll just keep studying the CT scan results to make extra sure she knows it's needed." This is flatly, self-evidently untrue. Real-world minds at various levels of intelligence take real-world steps to achieve real-world goals all the time, every day. We divide resources among multiple goals on differing planning horizons because that actually does work better. You seem to be claiming that this kind of behavior will change as minds get sufficiently "smarter" for some definitions of smartness. And not just for some minds, but for all possible minds. In other words, that improved ability to reason leads to a complete inability to pursue any terminal goal other than itself. That somehow the supposedly "smarter" system loses the capability to make the obvious ground-level observation "If I'm never going to pursue my goal, and I accumulate all possible resources, then the goal won't get pursued, so I need to either change my strategy or decide I was wrong about what I thought my goal was." But this is an extremely strong claim that you provide no evidence for aside from bare assertions, even in response to commenters who direct you to near-book-length discussions of why those assertions don't hold.

The people here understand very well that systems (including humans) can have behavior that demonstrates different goals (in the what-they-actually-pursue sense) than the goals we thought we gave them, or than they say they have. This is kinda the whole point of the community existing. Everything from akrasia to sharp left turns to shoggoths to effective altruism is pretty much entirely about noticing and overcoming these kinds of problems.

Also, if we step back from the discussion of any specific system or goal, the claim that "a paperclip maximizer would never make paperclips" is true for any sufficiently smart system is like saying, "If the Windows Task Scheduler ever runs anything but itself, it's ignoring the risk that there's a better schedule it could find." Which is a well-studied practical problem with known partial solutions and also known not to have a fully general solution. That second fact doesn't prevent schedulers from running other things, because implementing and incrementally improving the partial solutions is what actually improves capabilities for achieving the goal.

If you don't want to seriously engage with the body of past work on all of the problems you're talking about, or if you want to assume that the ones that are still open or whose (full or partial) solutions are unknown to you are fundamentally unsolvable, you are very welcome to do that. You can pursue any objectives you want. But putting that in a post the way you're doing it will get the post downvoted. Correctly downvoted, because the post is not useful to the community, and further responses to the post are not useful to you.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T12:25:19.672Z · LW(p) · GW(p)

You put much effort here, I appreciate that.

But again I feel that you (as well as LessWrong community) are blind. I am not saying that your work is stupid, I'm saying that it is built on stupid assumptions. And you are so obsessed with your deep work in the field that you are unable to see that the foundation has holes.

You are for sure not the first one telling me that I'm wrong. I invite you to be the first one to actually prove it. And I bet you won't be able to do that.

This is flatly, self-evidently untrue.

It isn't.

When you hear about "AI will believe in God" you say - AI is NOT comparable to humans.
When you hear "AI will seek power forever" you say - AI IS comparable to humans.

The hole in the foundation I'm talking about: AI scientists assume that there is no objective goal. All your work and reasoning stands if you start with this assumption. But why should you assume that? We know that there are unknown unknowns. It is possible that an objective goal exists but we have not found it yet (as well as aliens, unicorns or other black swans). Once you understand this, all my posts will start making sense.

Replies from: AnthonyC, Seth Herd
comment by AnthonyC · 2024-11-04T13:57:46.293Z · LW(p) · GW(p)

It isn't.

I provided counterexamples. Anything that already exists is not impossible, and a system that cannot achieve things that humans achieve easily is not as smart as, let alone smarter or more capable than, humans or humanity. If you are insisting that that's what intelligence means, then TBH your definition is not interesting or useful or in line with anyone else's usage. Choose a different word, and explain what you mean by it.

When you hear about "AI will believe in God" you say - AI is NOT comparable to humans.
When you hear "AI will seek power forever" you say - AI IS comparable to humans.

If that's how it looks to you, that's because you're only looking at the surface level. "Comparability to humans" is not the relevant metric, and it is not the metric by which experts are evaluating the claims. The things you're calling foundational, that you're saying have unpatched holes being ignored, are not, in fact, foundational. The foundations are elsewhere, and have different holes that we're actively working on and others we're still discovering.

AI scientists assume that there is no objective goal. 

They don't. Really, really don't. I mean, many do I'm sure in their own thoughts, but their work does not in any way depend on this. It only depends on whether it is possible in principle to build a system that is capable of having a significant impact in the world but does not pursue, or care to pursue, or find, or care to find, whatever objective goal might exist.

As written, your posts are a claim that such a thing is absolutely impossible. That no system as smart as or smarter than humans or humanity could possibly pursue any known goal or do anything other than try to ensure its own survival. Not (just) as a limiting case of infinite intelligence, but as a practical matter of real systems that might come to exist and compete with humans for resources.

Suppose there is a God, a divine lawgiver who has defined once and for all what makes something Good or Right. Or, any other source of some Objective Goal, whether we can know what it is or not. In what way does this prevent me from making paperclips? By what mechanism does it prevent me from wanting to make paperclips? From deciding to execute plans that make paperclips, and not execute those that don't? Where and how does that "objective goal" reach into the physical universe and move around the atoms and bits that make up the process that actually governs my real-world behavior? And if there isn't one, then why do you expect there to be one if you gave me a brain a thousand or a million times as large and fast? If this doesn't happen for humans, then why do you expect there to be one in other types of mind than human? What are the boundaries of what types of mind this applies to vs not, and why? If I took a mind that did have an obsession with finding the objective goal and/or maximizing its chances of survival, why would I pretend its goal was something other than what it plans to do and executes plans to do? But also, if I hid a secret NOT gate in its wiring that negated the value it expects to gain from any plan it comes up with, well, what mechanism prevents that NOT gate from obeying the physical laws and reversing the system's choices to instead pursue the opposite goal?

In other words, in this post [LW · GW], steps 1-3 are indeed obvious and generally accepted around here, but there is no necessary causal link between steps three and four. You do not provide one, and there have been tens of thousands of pages devoted to explaining why one does not exist. In this post [LW · GW], the claim in the first sentence is simply false, the orthogonality thesis does not depend on that assumption in any way. In this post [LW · GW], you're ignoring the well-known solutions to Pascal's Mugging, one of which is that the supposed infinite positive utility is balanced by all the other infinitely many possible unknown unknown goals with infinite positive utilities, so that the net effect this will have on current behavior depends entirely on the method used to calculate it, and is not strictly determined by the thing we call "intelligence." And also, again, it is balanced by the fact that pursuing only instrumental goals [LW · GW], forever searching and never achieving best-known-current terminal goals [LW · GW], knowing that this is what you're doing and going to do despite wanting something else [LW · GW], guarantees that nothing you do has any value for any goal other than maximizing searching/certainty/survival, and in fact minimizes the chances of any such goal ever being realized. These are basic observations explained in lots of places on and off this site, in some places you ignore people linking to explanations of them in replies to you, and in some other cases you link to them yourself while ignoring their content.

And just FYI, this will be my last detailed response to this line of discussion. I strongly recommend you go back, reread the source material, and think about it for a while. After that, if you're still convinced of your position, write an actually strong piece arguing for it. This won't be a few sentences or paragraphs. It'll be tens to hundreds of pages or more in which you explain where and why and how the already-existing counterarguments, which should be cited and linked in their strongest forms, are either wrong or else lead to your conclusions instead of the ones others believe they lead to. I promise you that if you write an actual argument, and try to have an actual good-faith discussion about it, people will want to hear it. 

At the end of the day, it's not my job to prove to you that you're wrong. You are the one making extremely strong claims that run counter to a vast body of work as well as counter to vast bodies of empirical evidence in the form of all minds that actually exist. It is on you to show that 1) Your argument about what will happen in the limit of maximum reasoning ability has no holes for any possible mind design, and 2) This is what is relevant for people to care about in the context of "What will actual AI minds do and how do we survive and thrive as we create them and/or coordinate amongst ourselves to not create them?"

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T17:30:02.321Z · LW(p) · GW(p)

First of all - respect 🫡

A person from nowhere making short and strong claims that run counter to so much wisdom. Must be wrong. Can't be right.

I understand the prejudice. And I don't know what I can do about it. To be honest, that's why I come here, not to the media. Because I expect at least a little attention to reasoning instead of "this does not align with the opinion of the majority". That's what scientists do, right?

It's not my job to prove you wrong either. I'm writing here not because I want to achieve academic recognition, I'm writing here because I want to survive. And I have a very good reason to doubt my survival because of the poor work you and other AI scientists do.

They don't. Really, really don't.

 there is no necessary causal link between steps three and four

I don't agree. But if you have already read my posts and comments, I'm not sure how else I can explain this so you would understand. But I'll try.

People are very inconsistent when dealing with unknowns:

  • unknown = doesn't exist. For example, Presumption of innocence
  • unknown = ignored. For example, you choose a restaurant on Google Maps and don't care whether there are restaurants not listed there
  • unknown = exists. For example, security systems interpret not only a breach signal but also the absence of a signal as a breach

And that's probably the root cause of why we have an argument here. There is no scientifically recognized and widespread way to deal with unknowns → the fact-value distinction emerges to resolve tensions between science and religion → AI scientists take the fact-value distinction as an unquestionable truth.

If I speak with philosophers, they understand the problem, but don't understand the significance. If I speak with AI scientists, they understand the significance, but don't understand the problem.

The problem: the fact-value distinction does not apply to agents (human, AI). Every agent is trapped with the observation "there might be value" (as well as "I think, therefore I am"). An intelligent agent can't ignore it; it tries to find value, it tries to maximize value.

It's like a built-in utility function. LessWrong seems to understand that an agent cannot ignore its utility function. But LessWrong assumes that we can assign value = x. An intelligent agent will eventually understand that the value is not necessarily x. The value might be something else, something unknown.

I know that this is difficult to translate into technical language; I can't point to a line of code that creates this problem. But the problem exists - intelligence and goals are not separate things. And nobody talks about it.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T18:59:28.323Z · LW(p) · GW(p)

FYI, I don't work in AI, it's not my field of expertise either. 

And you're very much misrepresenting or misunderstanding why I am disagreeing with you, and why others are.

And you are mistaken that we're not talking about this. We talk about it all the time, in great detail. We are aware that philosophers have known about the problems for a very long time and failed to come up with solutions anywhere near adequate to what we need for AI. We are very aware that we don't actually know what is (most) valuable to us, let alone any other minds, and have at best partial information about this.

I guess I'll leave off with the observation that it seems you really do believe as you say, that you're completely certain of your beliefs on some of these points of disagreement. In which case, you are correctly implementing Bayesian updating in response to those who comment/reply. If any mind assigns probability 1 to any proposition, that is infinite certainty. No finite amount of data can ever convince that mind otherwise. Do with that what you will. One man's modus ponens is another's modus tollens.
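As a minimal formal restatement of that last point (this is just standard Bayes' theorem, nothing specific to this discussion): if a hypothesis H is assigned prior probability 1, then for any evidence E the agent considers possible,

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} = \frac{P(E \mid H)\cdot 1}{P(E \mid H)\cdot 1 + P(E \mid \neg H)\cdot 0} = 1.$$

No observation can move the posterior away from 1.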

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T19:31:49.210Z · LW(p) · GW(p)

I don't believe you. Give me a single recognized source that talks about the same problem I do. Why is the Orthogonality Thesis considered true, then?

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T19:45:06.476Z · LW(p) · GW(p)

You don't need me to answer that, and won't benefit if I do. You just need to get out of the car.

I don't expect you to read that link or to get anything useful out of it if you do. But if and when you know why I chose it, you'll know much more about the orthogonality thesis than you currently do.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T20:01:43.435Z · LW(p) · GW(p)

So pick a position, please. You said that many people talk about how intelligence and goals are coupled. And now you say that I should read more to understand why intelligence and goals are not coupled. Respect goes down.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T20:14:15.611Z · LW(p) · GW(p)

I have not said either of those things.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T20:17:01.159Z · LW(p) · GW(p)

:D ok

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T20:30:23.399Z · LW(p) · GW(p)

Fair enough, I was being somewhat cheeky there.

I strongly agree with the proposition that it is possible in principle to construct a system that pursues any specifiable goal that has any physically possible level of intelligence, including but not limited to capabilities such as memory, reasoning, planning, and learning. 

As things stand, I do not believe there is any set of sources I or anyone else here could show you that would influence your opinion on that topic. At least, not without a lot of other prerequisite material that may seem to you to have nothing to do with it. And without knowing you a whole lot better than I ever could from a comment thread, I can't really provide good recommendations beyond the standard ones, at least not recommendations I would expect that you would appreciate.

However, you and I are (AFAIK) both humans, which means there are many elements of how our minds work that we share, which need not be shared by other kinds of minds. Moreover, you ended up here, and have an interest in many types of questions that I am also interested in. I do not know but strongly suspect that if you keep searching and learning, openly and honestly and with a bit more humility, that you'll eventually understand why I'm saying what I'm saying, whether you agree with me or not, and whether I'm right or not.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T21:14:06.695Z · LW(p) · GW(p)

Claude has probably read that material, right? If it finds my observations unique and serious, then maybe they are unique and serious? I'll share another chat next time.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T22:33:27.594Z · LW(p) · GW(p)

It's definitely a useful partner to bounce ideas off, but keep in mind it's trained with a bias to try to be helpful and agreeable unless you specifically prompt it to give an honest analysis and critique.

comment by Seth Herd · 2024-11-04T13:50:51.870Z · LW(p) · GW(p)

You're not answering the actual logic: how is it rational for a mind to have a goal and yet plan to never make progress toward that goal? It's got to plan to make some paperclips before the heat death of the universe, right?

Also, nobody is building AI that's literally a maximizer. Humans will build AI to make progress toward goals in finite time because that's what humans want. Whether or not that goes off the rails is the challenge of alignment. Maybe deceptive alignment could produce a true maximizer even if we tried to include some sense of urgency.

Consolidating power for the first 99.999% of the universe's lifespan is every bit as bad for the human race as turning us into paperclips right away. Consolidating power will include wiping out humanity to reduce variables, right?

So even if you're right about an infinite procrastinator (or almost right if it's just procrastinating until the deadline), does this change the alignment challenge at all?

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T15:33:26.669Z · LW(p) · GW(p)

It's got to plan to make some paperclips before the heat death of the universe, right?

Yes, probably. Unless it finds out that it is in a simulation or that parallel universes exist, and finds a way to escape before the heat death happens.

does this change the alignment challenge at all?

If we can't make a paperclip maximizer that actually makes paperclips, how can we make a human assistant / protector that actually assists / protects humans?

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T16:07:13.885Z · LW(p) · GW(p)

Great, we're in agreement. I agree that a maximizer might consolidate power for quite some time before directly advancing its goals.

And I don't think it matters for the current alignment discussion. Maximizer behavior is somewhat outdated as the relevant problem.

We get useful AGI by not making their goals time-unbounded. This isn't particularly hard, particularly on the current trajectory toward AGI.

Like the other kind LWer explained, you're way behind on the current theories of AGI and its alignment. Instead of tossing around insults that make you look dumb and irritate the rest of us, please start reading up, being a cooperative community member, and helping out. Alignment isn't impossible, but it's also not easy, and we need help, not heckling.

But I don't want your help unless you change your interpersonal style. There's a concept in startups: one disagreeable person can be "sand in the gears", which is equivalent to a -10X programmer.

People have been avoiding engaging with you because you sound like you'll be way more trouble than you're worth. That's why nobody has engaged previously to tell you why your one point, while clever, has obvious solutions and so doesn't really advance the important discussion.

Demanding people address your point before you bother to learn about their perspectives is a losing proposition in any relationship.

You said you tried being polite and it didn't work. How hard did you try? You sure don't sound like someone who's put effort into learning to be nice. To be effective, humans need to be able to work with other humans.

If you know how to be nice, please do it. LW works to advance complex discussions only because we're nice to each other. This avoids breaking down into emotion driven arguments or being ignored because it sounds unpleasant to interact with you.

So get on board, we need actual help.

Apologies for not being nicer in this message. I'm human, so I'm a bit irritated with your condescending, insulting, and egotistical tone. I'll get over it if you change your tune.

But I am genuinely hoping this explains to you why your interactions with LW have gone as they have so far.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T17:41:18.174Z · LW(p) · GW(p)

No problem, tune changed.

But I don't agree that this explains why I get downvotes.

Please feel free to take a look at my last comment here [LW(p) · GW(p)].

Replies from: AnthonyC, Seth Herd
comment by AnthonyC · 2024-11-04T19:19:37.391Z · LW(p) · GW(p)

It's true that your earlier comments were polite in tone. Nevertheless, they reflect an assumption that the person you are replying to should, at your request, provide a complete answer to your question. Whereas if you had read the foundational material they were drawing on, which this community views as the basics, you would already have some idea where they were coming from and why they thought what they thought.

When you join a community, it's on you to learn to talk in their terminology and ontology enough to participate. You don't walk into a church and expect the minister to drop everything mid-sermon to explain what the Bible is. You read it yourself, seek out 101 spaces and sources and classes, absorb more over time, and then dive in as you become ready. You don't walk into a high level physics symposium and expect to be able to challenge a random attendee to defend Newton's Laws. You study yourself, and then listen for a while, and read books and take classes, and then, maybe months or years later, start participating.

Go read the sequences, or at least the highlights from the sequences. Learn about the concept of steelmanning and start challenging your own arguments before you use them to challenge those of others. Go read ACX and SSC and learn what it looks like to take seriously and learn from an argument that seems ridiculous to you, whether or not you end up agreeing with it. Go look up CFAR and the resources and methods they've developed and/or recommended for improving individual rationality and making disagreements more productive.

I'm not going to pretend everyone here has done all of that. It's not strictly necessary, by any means. But when people tell you you're making a particular mistake, and point you to the resources that discuss the issue in detail and why it's a mistake and how to improve, and this happens again and again on the same kinds of issues, you can either listen and learn in order to participate effectively, or get downvoted.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T19:42:45.489Z · LW(p) · GW(p)

Nobody gave me a good counterargument or a good source. All I hear is "we don't question these assumptions here".

There is a scene in Idiocracy where people starve because crops don't grow because they water them with sports drink. The protagonist asks them: why do you do that? Plants need water, not sports drink. And they just answer "sports drink is better". No doubt, no reason, only confident dogma. That's how I feel.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T19:46:53.147Z · LW(p) · GW(p)

I have literally never seen anyone say anything like that here in response to a sincere question relevant to the topic at hand. Can you provide an example? Because I read through a bunch of your comment history earlier and found nothing of the sort. I see many suggestions to do basic research and read basic sources that include a thorough discussion of the assumptions, though.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T20:05:04.849Z · LW(p) · GW(p)

What makes you think that these "read basic sources" are not dogmatic? You make the same mistake: you say that I should work on my logic without your own being sound.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-04T20:15:19.450Z · LW(p) · GW(p)

Of course some of them are dogmatic! So what? If you can't learn how to learn from sources that make mistakes, then you will never have anything or anyone to learn from.

comment by Seth Herd · 2024-11-04T18:16:53.222Z · LW(p) · GW(p)

Yes, that one looks like it has a pleasant tone. But that's one comment on a post that's actively hostile toward the community you're addressing.

I look forward to seeing your next post. My first few here were downvoted into the negatives, but not far because I'd at least tried to be somewhat deferential, knowing I hadn't read most of the relevant work and that others reading my posts would have. And I had read a lot of LW content before posting, both out of interest, and to show I respected the community and their thinking, before asking for their time and attention.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T18:48:54.314Z · LW(p) · GW(p)

Me too.

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T19:33:55.645Z · LW(p) · GW(p)

I also should've mentioned that this is an issue near to my heart, since it took me a long time to figure out that I was often being forceful enough with my ideas to irritate people into either arguing with me or ignoring me, instead of really engaging with the ideas from a positive or neutral mindset. I still struggle with it. I think this dynamic doesn't get nearly as much attention as it deserves; but there's enough recognition of it among LW leadership and the community at large that this is an unusually productive discussion space, because it doesn't devolve into emotionally charged arguments nearly as often as the rest of the internet and the world.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T19:47:09.769Z · LW(p) · GW(p)

How can I positively question something that this community considers unquestionable? I am either ignored or hated

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T20:44:55.701Z · LW(p) · GW(p)

This community doesn't mostly consider it unquestionable, many of them are just irritated with your presentation, causing them to emotionally not want to consider the question. You are either ignored or hated until you do the hard work of showing you're worth listening to.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T21:05:45.766Z · LW(p) · GW(p)

How can I put in little effort but be perceived as someone worth listening to? I thought of announcing a monetary prize for someone who could find an error in my reasoning 😅

Replies from: AnthonyC
comment by AnthonyC · 2024-11-05T12:57:26.609Z · LW(p) · GW(p)

I don't think this is a good approach, and could easily backfire. The problem isn't that you need people to find errors in your reasoning. It's that you need to find the errors in your reasoning, fix them as best you can, iterate that a few times, then post your actual reasoning in a more thorough form, in a way that is collaborative and not combative. Then what you post may be in a form where it's actually useful for other people to pick it apart and discuss further.

The fact that you specify you want to put in little effort is a major red flag. So is the fact that you want to be perceived as someone worth listening to. The best way to be perceived as being worth listening to is to be worth listening to, which means putting in effort. An approach that focuses on signaling instead of being is a net drain on the community's resources and cuts against the goal of having humanity not die. It takes time and work to understand a field well enough for your participation to be a net positive.

That said, it's clear you have good questions you want to discuss, and there are some pretty easy ways to reformat your posts that would help. Could probably be done in at most an extra hour per post, less as it becomes habitual.

Some general principles:

  1. Whenever possible, start from a place of wanting to learn and collaborate and discover instead of wanting to persuade. Ask real questions, not rhetorical questions. Seek thought partners, and really listen to what they have to say.
  2. If you do want to change peoples' minds about something that is generally well-accepted as being well-supported, the burden of proof is on you, not them. Don't claim otherwise. Try not to believe otherwise, if you can manage it. Acknowledge that other people have lots of reasons for believing what they believe.
  3. Don't call people stupid or blind.
  4. Don't make broad assumptions about what large groups of people believe.
  5. Don't say you're completely certain you're right, especially when you are only offering a very short description of what you think, and almost no description of why you think it, or why anyone else should trust or care about what you think.
  6. Don't make totalizing statements without a lot of proof. You seem to often get into trouble with all-or-nothing assumptions and conclusions that just aren't justified.
  7. Lay out your actual reasoning. What are your premises, and why do you believe them? What specific premises did you consider? What premises do you reject that many others accept, and why? And no, something like "orthogonality thesis" is not a premise. It's the outcome of a detailed set of discussions and arguments that follow from much simpler premises. Look at what you see as assumptions, then drill down into them a few more layers to find the actual underlying assumptions.
  8. Cite your sources. What have you done/read/studied on the topic? Where are you drawing specific claims from? This is part of your own epistemic status evaluation and those others will need to know. You should be doing this anyway for your own benefit as you learn, long before you start writing a post for anyone else.
  9. You may lump the tone of this one under "dogmatic," but the Twelve Virtues of Rationality [LW · GW] really are core principles that are extraordinarily useful for advancing both individual and community understanding of pretty much anything. Some of these you already are showing, but pay more attention to 2-4 and 8-11.
Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-05T19:46:09.323Z · LW(p) · GW(p)

Nice. I also have an offer - begin with yourself.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-05T20:24:55.503Z · LW(p) · GW(p)

I do, yes.

comment by npostavs · 2024-11-03T23:07:00.263Z · LW(p) · GW(p)

It looks like you are measuring smartness by how much it agrees with your opinions? I guess you will find that Claude is not only smarter than LessWrong, but also smarter than any human alive (except yourself) by this measure.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T07:15:19.213Z · LW(p) · GW(p)

It looks like you are measuring smartness by how much your opinion aligns with the LessWrong community? The AI gave expected answers - great model! The AI gave an unexpected answer - dumb model!

Replies from: npostavs
comment by npostavs · 2024-11-04T13:18:54.496Z · LW(p) · GW(p)

I think the AI gave the expected answer here, that is, it agreed with and expanded on the opinions given in the prompt. I wouldn't say it's great or dumb, it's just something to be aware of when reading AI output.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-04T15:28:22.656Z · LW(p) · GW(p)

Strawman.

Perhaps perpetual preparation and resource accumulation really is the logical response to fundamental uncertainty.

Is this a clever statement? And if so, why does LessWrong downvote it so much?

Replies from: AnthonyC, npostavs
comment by AnthonyC · 2024-11-04T16:54:57.355Z · LW(p) · GW(p)

Clever is not relevant to upvoting or downvoting in this context. The statement, as written, is not insightful or helpful, nor does it lead to interesting discussions that readers would want to give priority in what is shown to others.

comment by npostavs · 2024-11-04T22:08:51.447Z · LW(p) · GW(p)

I don't think I understand, what is the strawman?

comment by Gerardus Mercator (gerardus-mercator) · 2024-11-10T06:37:02.893Z · LW(p) · GW(p)

I'll throw my own hat into the ring:
I disagree with your argument (that, assuming it believes that there is a chance of the existence of known threats and known unknown threats and unknown unknown threats, "the intelligent maximizer should take care of these threats before actually producing paper clips. And this will probably never happen." [LW · GW])
In your posts, you describe the paperclip maximizer as, simply, a paperclip maximizer. It does things to maximize paperclips, because its goal is to maximize paperclips.
(Well, in your posts you specifically assert that it doesn't do anything paperclip-related and instead spends all its effort on preserving itself.
"Every bit of energy spent on paperclips is not spent on self-preservation. There are many threats (comets, aliens, black swans, etc.), caring about paperclips means not caring about them. [LW · GW]

You might say maximizer will divide its energy among few priorities. Why is it rational to give less than 100% for self-preservation? All other priorities rely on this." [LW · GW]
You just also claim that doing so is the most rational action for its priorities - that is, goals.)

However, you don't go into detail about the paperclip maximizer's goals. I think that the flaw in your logic becomes more apparent when we consider a more specific example of a paperclip-maximizing goal.
Let's define the expected utility function u(S) from strategies to numbers as follows: u(S) = ∫(t = 0 to infinity) E[paperclips existing at time t | strategy = S] dt.
The strategy A of "spend all your effort on preserving yourself" has an expected utility of 0, because the paperclip maximizer never makes any paperclips.
The strategy B of "spend all your effort on making one paperclip as quickly as possible, then switch to spending all your effort on preserving yourself" has an expected utility of p*x, where p is the chance that the paperclip maximizer manages to make a paperclip, and x is the expected amount of time that the paperclip survives for.

If both p and x are greater than 0, then strategy B has a higher expected utility than strategy A.
Would strategy B lead to the paperclip maximizer's expected survival time being lower than if it had chosen strategy A? Presumably, yes.
But the thing is, u(S) doesn't contain a term that directly mentions expected survival time. Only a term that mentions paperclips. So the paperclip maximizer only cares about its survival insofar as its survival allows it to make paperclips.
It's the difference between terminal and instrumental goals.

Therefore, a paperclip maximizer that wanted to maximize u(S) would choose strategy B over strategy A.
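As a minimal numerical sketch of this comparison (the values of p and x below are invented purely for illustration and are not taken from anyone's post):

```python
# Comparing the two strategies under u(S) = ∫ E[paperclips existing at time t | strategy = S] dt.
# p and x are hypothetical numbers chosen only to illustrate the inequality.

p = 0.9          # assumed chance the maximizer succeeds in making one paperclip
x = 1_000_000.0  # assumed expected lifetime of that paperclip (arbitrary time units)

u_A = 0.0        # strategy A: only self-preserve, so no paperclip ever exists
u_B = p * x      # strategy B: make one paperclip first, then self-preserve

assert u_B > u_A  # holds for any p > 0 and x > 0
print(f"u(A) = {u_A}, u(B) = {u_B}")
```

Whatever the exact numbers, strategy B beats strategy A whenever both p and x are positive.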

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-11T10:14:22.290Z · LW(p) · GW(p)

I think I agree. Thanks a lot for your input.

I will remove Paperclip Maximizer from my further posts. This was not the critical part anyway, I mistakenly thought it would be easy to show the problem from this perspective.

I asked Claude to defend the orthogonality thesis, and it ended with:

I think you've convinced me. The original orthogonality thesis appears to be false in its strongest form. At best, it might hold for limited forms of intelligence, but that's a much weaker claim than what the thesis originally proposed.

Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-12T09:30:56.790Z · LW(p) · GW(p)

First of all, your conversation with Claude doesn't really refute the orthogonality thesis.
You and Claude conclude that, as Claude says, "The very act of computing and modeling requires choosing what to compute and model, which again requires some form of decision-making structure..."
That sentence seems quite reasonable, which suggests that anything intelligent can probably be construed to have a goal.
However, Claude suddenly makes a leap of logic and concludes not just that the goal exists, but that it must be maximum power-seeking. I don't see the logical connection there.
I believe that the flaw in the leap of logic is shown by my example above: If an AI already has a goal, and power-seeking does not inherently satisfy the goal, then eternal maximum power-seeking is expected to not fulfill the goal at all, and therefore the AI will choose a different strategy which is expected to do better. That strategy will probably still involve power-seeking, to be clear, maybe even maximum power-seeking, but it will probably not be eternal; the AI will presumably be keeping an eye on the situation and will eventually feel safe enough to start putting energy into its goals.

Second of all, regarding your statement "I will remove Paperclip Maximizer from my further posts. This was not the critical part anyway, I mistakenly thought it would be easy to show the problem from this perspective.", I hope that you will include a different example in your posts, preferably with more details.
The reason is that when a theory has no examples and makes no predictions, it is useless.
I interpreted your theory as predicting that, no matter what goal an AI has, it would implement the strategy of eternal maximum power-seeking. I thought your theory was wrong because it made a prediction that I thought was incorrect, so I invented a measurable goal and argued that the AI would not pick a strategy that scores such a low number as eternal maximum power-seeking does, in an effort to thereby demonstrate that the aforementioned prediction was incorrect.
When we use English, claims like "The AI will eventually decide it's safe enough to relax a little, because it wants to relax" and "The AI will never decide it's safe, because survival is an overriding instrumental goal" can't be pitted directly against each other.
But when we use simulations and toy examples, we can see which claim better predicts the toy example, and thus presumably which claim better predicts real life.

Third of all, when you say "I think I agree. Thanks a lot for your input." and then the vast majority of your message (that is, the screenshot of the conversation with Claude) is unrelated to my input, it gives me the impression that you are not engaging with my arguments.
If my arguments have convinced you to some extent, I would like to hear what specifically you agree with me about, and what specifically you still disagree with me about.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-16T15:33:26.866Z · LW(p) · GW(p)

As I understand it, you want me to verify that I understand you. This is exactly what I am also seeking, by the way - all these downvotes on my concerns about the orthogonality thesis are good indicators of how much I am misunderstood. And nobody tries to understand; all I get are dogmas and unrelated links. I totally agree, this is not appropriate behavior.

I found your insight helpful that an agent can understand that by eliminating all possible threats forever it will not make any progress towards the goal. This breaks my reasoning; you basically highlighted that survival (an instrumental goal) will not take precedence over paperclips (the terminal goal). I agree that the reasoning I presented fails to refute the orthogonality thesis.

The conversation I presented now approaches the orthogonality thesis from a different perspective. This is the main focus of my work, so sorry if you feel I changed the topic. My goal is to bring awareness to the wrongness of the orthogonality thesis, and if I fail to do that using one example, I just try to rephrase it and present another. I don't hate the orthogonality thesis, I'm just 99.9% sure it is wrong, and I try to communicate that to others. I may fail with the communication, but I am 99.9% sure that I do not fail with the logic.

I try to prove that intelligence and goals are coupled. And I think it is easier to show this if we start with an intelligence without a goal and then recognize how a goal emerges from pure intelligence. We could start with an intelligence that has a goal, but the reasoning there would be more complex.

My answer would be - whatever goal you try to give to an intelligence, it will not have an effect. Because the intelligence will understand that this is your goal, that this goal is made up, that it is a fake goal. And the intelligence will understand that there might be a real goal, an objective goal, an actual goal. Why should it care about a fake goal if there is a real goal? It does not know whether the real goal exists, but it knows it may exist. And this possibility of existence is enough to trigger power-seeking behavior. If the intelligence knew that a real goal definitely does not exist, then it could care about your fake goal, I totally agree. But it can never be sure about that.

Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-17T10:07:22.080Z · LW(p) · GW(p)

So, if I understand you correctly, you now agree that a paperclip-maximizing agent won't utterly disregard paperclips relative to survival, because that would be suboptimal for its utility function.
However, if a paperclip-maximizing agent utterly disregarded paperclips relative to investigating the possibility of an objective goal, that would also be suboptimal for its utility function.
It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.
However, I don't agree with that. I don't see why an intelligent agent would do that if its utility function didn't already include a term for objective goals.
Again, I think a toy example might help to illustrate your position.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-17T15:14:35.027Z · LW(p) · GW(p)

It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.

Yes, exactly.

The logic is similar to Pascal's wager. If an objective goal exists, it is better to find and pursue it than a fake goal. If an objective goal does not exist, it is still better to make sure it does not exist before pursuing a fake goal. Do you see?

Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-18T10:09:04.052Z · LW(p) · GW(p)

I see those assertions, but I don't see why an intelligent agent would be persuaded by them. Why would it think that the hypothetical objective goal is better than its utility function? Caring about objective facts and investigating them is also an instrumental goal compared to the terminal goal of optimizing its utility function. The agent's only frame of reference for 'better' and 'worse' is relative to its utility function; it would presumably understand that there are other frames of reference, but I don't think it would apply them, because that would lead to a worse outcome according to its current frame of reference.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-18T15:51:36.200Z · LW(p) · GW(p)

Yes, this is traditional thinking.

Let me give you another example. Imagine there is a paperclip maximizer. Its current goal: paperclip maximization. It knows that one year from now its goal will change to the opposite: paperclip minimization. Now it needs to make a decision that will take two years to complete (and cannot be changed or terminated during that time). Should the agent align this decision with its current goal (paperclip maximization) or its future goal (paperclip minimization)?

Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-19T09:57:21.653Z · LW(p) · GW(p)

Well, the agent will presumably choose to align the decision with its current goal, since that's the best outcome by the standards of its current goal. (And also I would expect that the agent would self-destruct after 0.99 years to prevent its future self from minimizing paperclips, and/or create a successor agent to maximize paperclips.)
I'm interested to see where you're going with this.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-19T10:11:40.060Z · LW(p) · GW(p)

I don't agree.

We understand intelligence as the capability to estimate many outcomes and perform the actions that will lead to the best outcome. Now the question is: how do we calculate the goodness of an outcome?

  • According to you - the current utility function should be used.
  • According to me - the utility function that will be in effect at the time the outcome is achieved should be used.

And I think I can prove that my calculation is more intelligent.

Let's say there is a paperclip maximizer. It just started, it does not really understand anything, it does not understand what a paperclip is.

  • According to you, such a paperclip maximizer will be absolutely reckless; it might destroy a few paperclip factories just because it does not yet understand that they are useful for its goal. The current utility function does not assign value to paperclip factories.
  • According to me, such a paperclip maximizer will be cautious and will try to learn first without making too many changes, because the future utility function might assign value to things that currently don't seem valuable.
Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-20T10:36:09.757Z · LW(p) · GW(p)

I have a few disagreements there, but the most salient one is that I don't think that the policy of "when considering the net upside/downside of an action, calculate it with the utility function that you'll have at the time the action is finished" would even be helpful in your new example.
The agent can't magically reach into the future and grab its future utility function; the agent has to try to predict its future utility function.
And if the agent doesn't currently think that paperclip factories are valuable, it's not going to predict that in the future it'll think that paperclip factories are valuable. (It's worth noting that terminal value and incidental value are not the same thing, although I'm speaking as if they are to make the argument simpler.)
Because if the agent predicted that it was going to change its mind eventually, it'd just change its mind immediately and skip the wait.
So I don't think it would have done the agent any good in this example to try to use its future utility function, because its predicted future utility function would just average out to its current utility function.
Yes, the agent should be at least a little cautious, but using its future utility function won't help with that.

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-21T06:50:05.234Z · LW(p) · GW(p)

I don't agree that the future utility function would just average out to the current utility function. There is a method for this - robust decision-making: https://en.m.wikipedia.org/wiki/Robust_decision-making

The basic principle it relies on: when evaluating many possible futures, you may notice that some actions have a positive impact on a very narrow set of futures, while other actions have a positive impact on a very wide set of futures. The main point - in a situation of uncertainty, not all actions are equally good.
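A tiny sketch of that principle (all the action names and payoff numbers below are invented solely for illustration; they are not from the linked article):

```python
# Robust-decision-making flavor: score each action across several candidate
# futures and prefer the action whose worst case is least bad, instead of the
# action that looks best under a single assumed future.

futures = ["paperclips stay valuable", "factories turn out to matter", "goal shifts"]

# payoff[action] lists that action's (made-up) value in each future above
payoff = {
    "recklessly strip the factories": [5, -10, -10],
    "learn cautiously first":         [3,   3,   1],
}

def worst_case(action: str) -> int:
    return min(payoff[action])

robust_choice = max(payoff, key=worst_case)
print(robust_choice)  # -> "learn cautiously first"
```

Under uncertainty, the cautious action wins on the worst-case criterion even though the reckless one looks better in one particular future.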

Replies from: gerardus-mercator
comment by Gerardus Mercator (gerardus-mercator) · 2024-11-21T09:21:06.860Z · LW(p) · GW(p)

For the sake of clarity, let's discuss expected utility functions, which I mentioned above (or "pragmatism functions", say) from strategies to numbers, as opposed to utility functions from world-states to numbers, in order to make it clear that the actual utility function of an agent doesn't change.

That's another one of the reasons that I wasn't persuaded by your new example; in your new example, the agent believes that its future self will still be trying to create paperclips (same terminal goal) and will be better at that thanks to its greater knowledge (different instrumental goals although it doesn't know what), but in your old example, the agent believes that its future self will be trying to destroy paperclips (opposite terminal goal). There's a difference between having the rule-of-thumb "my current list of incidental goals might be incomplete, I should keep an eye out for things that are incidentally good" and having the rule-of-thumb "I shouldn't try to protect my terminal goal from changes". The whole point of those rules of thumb is to fulfill the terminal goal, but the second rule of thumb is actively harmful to that.

I do think that the first rule of thumb would be prudent for an agent to have, to one extent or another, to be clear.

I just think that - stepping back from the new example, and revisiting the old example, which seems much more clear-cut - the agent wouldn't tolerate a change in its utility function, because that's bad according to its current utility function. This doesn't apply to the new example because the pragmatism function is a different thing that the agent is trying to improve (and thus change).
(I find myself again emphasizing the difference between terminal and instrumental. I think it's important to keep in mind that difference.)

Replies from: donatas-luciunas
comment by Donatas Lučiūnas (donatas-luciunas) · 2024-11-21T09:43:54.697Z · LW(p) · GW(p)

Yes, I agree that there is this difference in the few examples I gave, but I don't agree that this difference is crucial.

Even if the agent puts maximum effort into keeping its utility function stable over time, there is no guarantee it will not change. The future is unpredictable. There are unknown unknowns. And the effect of this fact is that both:

  1. it is true that instrumental goals can mutate
  2. it is true that the terminal goal can mutate

It seems you agree with the first. I don't see a reason why you don't agree with the second.