Seeking feedback on a critique of the paperclip maximizer thought experiment

post by bio neural (bio-neural) · 2024-07-15T18:39:30.545Z · LW · GW · 1 comment

This is a question post.


Hello LessWrong community,

I'm working on a paper that challenges some aspects of the paperclip maximizer thought experiment and the broader AI doomer narrative. Before submitting a full post, I'd like to gauge interest and get some initial feedback.

My main arguments are:

1. The paperclip maximizer oversimplifies AI motivations and neglects the potential for emergent ethics in advanced AI systems.

2. The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.

3. Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.

4. Technologies like brain-computer interfaces (e.g., the hypothetical Hypercortex "Membrane" BCI) could lead to human-AI symbiosis rather than conflict.

Questions for the community:

1. Have these critiques of the paperclip maximizer been thoroughly discussed here before? If so, could you point me to relevant posts?

2. What are the strongest counterarguments to these points from a LessWrong perspective?

3. Is there interest in a more detailed exploration of these ideas in a full post?

4. What aspects of this topic would be most valuable or interesting for the LessWrong community?

Any feedback or suggestions would be greatly appreciated. I want to ensure that if I do make a full post, it contributes meaningfully to the ongoing discussions here about AI alignment and safety.

Thank you for your time and insights!

Answers

answer by Dagon · 2024-07-15T19:10:27.361Z · LW(p) · GW(p)

I'm not sure I see how any of these are critiques of the specific paperclip-maximizer example of misalignment, or really how they contradict ANY misalignment worries.

These are ways that alignment COULD happen, not ways that misalignment WON'T happen or that paperclip-style misalignment won't have a bad impact. And they're thought experiments in themselves, so they provide no actual evidence in either direction about likelihoods.

As arguments about paperclip-maximizer worries, they're equivalent to "maybe that won't occur".

answer by RHollerith · 2024-07-15T19:48:47.221Z · LW(p) · GW(p)

You could do worse than choosing this next excerpt from a 2023 post by Nate Soares [EA · GW] to argue against. Specifically, explain why the evolution (from spaghetti-code to being organized around some goal or another) described in the excerpt is unlikely or (if it is likely) can be interrupted or rendered non-disastrous by our adopting some strategy (that you will describe).

By default, the first minds humanity makes will be a terrible spaghetti-code mess [LW · GW], with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars [LW · GW] of how it reflects and irons out the tensions within itself over time.

Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).

(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a “goal slot”, it’s not letting you touch it.)

comment by faul_sname · 2024-07-15T22:57:51.647Z · LW(p) · GW(p)

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).

Do you have a good reference article for why we should expect spaghetti behavior executors to become wrapper minds as they scale up?

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-07-15T23:38:46.429Z · LW(p) · GW(p)

As a spaghetti behavior executor, I'm worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift [LW(p) · GW(p)], especially throughout a much longer life than presently feasible, so I'd like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.

Replies from: faul_sname
comment by faul_sname · 2024-07-16T23:58:43.482Z · LW(p) · GW(p)

As a spaghetti behavior executor, I'm worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift [LW(p) · GW(p)], especially throughout a much longer life than presently feasible

As a fellow spaghetti behavior executor, replacing my entire motivational structure with a static goal slot feels like dying and handing off all of my resources to an entity that I don't have any particular reason to think will act in a way I would approve of in the long term.

Historically, I have found varying things rewarding at various stages of my life, and this has chiseled the paths in my cognition that make me me. I expect that in the future my experiences and decisions, and how rewarded or regretful I feel about those decisions, will continue to chisel my cognition in ways that change what I care about, just as past-me endorsed the experiences that led current-me to care about things (e.g. specific partners, offspring) that past-me did not care about.

I would not endorse freezing my values in place to prevent value drift in full generality. At most I endorse setting up contingencies so my values don't end up trapped in some specific places current-me does not endorse (e.g. "heroin addict").

so I'd like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.

So in this ontology, an agent is made up of a queryable world model and a goal slot. Improving the world model allows the agent to better predict the outcomes of its actions, and the goal slot determines which available action the agent would pick given its world model.
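
To make this ontology concrete, here is a minimal sketch in Python; the Agent class, its field names, and the toy number-line example are hypothetical illustrations of the "world model + goal slot" picture, not anything proposed in this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class Agent:
    # World model: predicts the state that follows from taking an action.
    world_model: Callable[[Any, Any], Any]  # (state, action) -> predicted next state
    # Goal slot: scores outcomes; the only place where "values" live.
    goal: Callable[[Any], float]            # state -> utility

    def act(self, state: Any, actions: Iterable[Any]) -> Any:
        # Pick the action whose predicted outcome the goal slot rates highest.
        return max(actions, key=lambda a: self.goal(self.world_model(state, a)))

# Toy usage: an agent on a number line whose goal slot says "be as far right as possible".
toy = Agent(world_model=lambda s, a: s + a, goal=lambda s: float(s))
assert toy.act(0, actions=[-1, 0, 1]) == 1
```

In this sketch, improving world_model and swapping out goal are independent operations, which is the sense in which capabilities and values are being treated as separable.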

I see the case for improving the world model. But once I have that better world model, I don't see why I would additionally want to add an immutable goal slot that overrides my previous motivational structure. My understanding is that adding a privileged immutable goal slot would only change my behavior in those cases where I would otherwise have decided that achieving the goal placed in that slot was not a good idea on balance.

As a note, you could probably say something clever like "the thing you put in the goal slot should just be 'behave in the way you would if you had access to unlimited time to think and the best available world model'", but if we're going there then I contend that the rock I picked up has a goal slot filled with "behave exactly like this particular rock".

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-07-17T02:18:09.377Z · LW(p) · GW(p)

The point is control over this process, the ability to make decisions about one's own development, instead of leaving it largely in the hands of the inscrutable low-level computational dynamics of the brain and the influence of external data. Digital immortality doesn't guard against this, and in a million subjective years you might just slip away bit by bit for reasons you don't endorse, not having had enough time to decide how to guide the process. But if there is a way to put uncontrollable drift on hold, then they're your own goal slots; you can do with them what you will when you are ready.

answer by Muyyd · 2024-07-18T00:29:24.729Z · LW(p) · GW(p)

1. The paperclip maximizer oversimplifies AI motivations and neglects the potential for emergent ethics in advanced AI systems.

2. The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.

Because it is a simple (entry-level) example of unintended consequences. There is a post about emergent phenomena [LW · GW] - so some ethics will definitely emerge, but the problem lies in the probability (not the possibility, which is not being overlooked) that the AI's behavior will happen to be to our liking. The slim chance of that comes from the size of Mind Design Space [? · GW] (this post [LW · GW] has a picture) and from the tremendous gap between the man-hours of very smart humans invested in increasing capabilities and the man-hours of very smart humans invested in alignment (Don't Look Up - The Documentary: The Case For AI As An Existential Threat on YouTube, around 5:45, is about this difference).

3. Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.

They are not - we are long past simple entry-level examples, and AI safety (as practiced by the big players) has gotten worse, even if it looks more nuanced and careful. Some time ago AI safety meant something like "how do we keep the AI contained in its air-gapped box during the value-extraction process?", and now it means something like "is it safe for the internet? And now? And now? And now?". So all the differences in practice are overshadowed by the complexity of the new task: make your new AI more capable than competing systems and safe enough for the net. AI safety problems have gotten more nuanced [LW · GW] too.

There have also been posts about Mind Design Space by Quintin Pope [LW · GW].

answer by Tapatakt · 2024-07-17T20:10:42.089Z · LW(p) · GW(p)

The paperclip maximizer oversimplifies AI motivations

Being a very simple example kinda is the point?

and neglects the potential for emergent ethics in advanced AI systems.

Emergent ethics doesn't change anything for us if it isn't human-aligned ethics.

The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.

This is very vague. What possibilities are you talking about, exactly?

Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.

Does it suggest any safety or development practices? Would you like to elaborate?

1 comment

Comments sorted by top scores.

comment by Vladimir_Nesov · 2024-07-15T19:06:37.447Z · LW(p) · GW(p)

The squiggle maximizer (which is tagged for this post) and the paperclip maximizer are significantly different points [LW(p) · GW(p)]. The paperclip maximizer (as opposed to the squiggle maximizer) is centrally an illustration of the orthogonality thesis (see the greaterwrong mirror of Arbital if the Arbital page doesn't load).

What the orthogonality thesis says, and the paperclip maximizer example illustrates, is that it's possible in principle to construct arbitrarily effective agents deserving of the moniker "superintelligence" with arbitrarily silly or worthless goals (in the human view). This seems clearly true, but it is valuable to notice in order to fix intuitions that would claim otherwise. Then there's a "practical version of the orthogonality thesis", which shouldn't be called the "orthogonality thesis" but often enough gets confused with it. It says that, by default, the goals of AIs that will be constructed in practice will tend toward arbitrary things that humans wouldn't find agreeable, including something silly or simple. This is much less obviously correct, and the squiggle maximizer [? · GW] sketch is closer to arguing for some version of this.