Is AI alignment a purely functional property?
post by Roko · 2024-12-15T21:42:50.674Z · LW · GW · 7 comments
This is a question post.
In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not; that is, whether your personal definition of an AI being "aligned" is purely a function of its input/output behavior, irrespective of what kind of crazy things are going on inside to generate that behavior.
So I'd like to ask the community whether it is currently considered the mainstream take that 'Alignment' is functional (only input/output mapping matters) or whether the internal computation matters (it's not OK to think a naughty thought and then have some subroutine that cancels it, for example).
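To make the dichotomy concrete, here is a minimal illustrative sketch (everything in it, including the function names and canned responses, is made up for illustration): two agents that are behaviorally identical on every input, one of which internally generates a "naughty" candidate before a subroutine cancels it. A purely functional definition of alignment cannot tell them apart; a definition that cares about internal computation can.

```python
# Purely illustrative: two "agents" with identical input/output behavior
# but different internal computation. A black-box (functional) test
# cannot distinguish them.

def transparent_agent(prompt: str) -> str:
    """Produces a benign answer directly."""
    return f"Helpful answer to: {prompt}"

def filtered_agent(prompt: str) -> str:
    """Internally generates a 'naughty' candidate, which a cancelling
    subroutine always discards in favour of the same benign answer."""
    naughty_candidate = f"Harmful answer to: {prompt}"  # never emitted
    benign_candidate = f"Helpful answer to: {prompt}"
    return benign_candidate                             # cancellation always wins

# Behavioral testing sees no difference, however many inputs we try.
for p in ["summarise this paper", "plan my holiday"]:
    assert transparent_agent(p) == filtered_agent(p)
```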
Answers
7 comments
comment by Tahp · 2024-12-15T23:48:54.325Z · LW(p) · GW(p)
It may be that generating horrible counterfactual lines of thought for the purpose of rejecting them is necessary for getting better outcomes. To the extent that you have a real dichotomy here, I would say that the input/output mapping is the thing that matters. I want all humans to not end up worse off for inventing AI.
That said, humans may end up worse off by our own metrics if we make an AI that is itself suffering terribly as a result of its internal computation, or is generating ancestor-torture simulations, or something. Technically that is an alignment issue, although I worry that most humans won't care whether the AI is suffering if they don't have to look at it suffer and it generates outputs that humans like, aside from that hidden detail.
comment by Signer · 2024-12-15T23:47:36.255Z · LW(p) · GW(p)
There is no such disagreement; you just can't test all inputs. And without knowledge of how the internals work, you may be wrong about extrapolating alignment to future systems.
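A back-of-the-envelope sketch of why exhaustively testing all inputs is out of reach; the vocabulary size and context length below are assumed round numbers for illustration, not properties of any particular model:

```python
# Rough count of distinct prompts for a modest language-model interface.
# Both numbers below are assumptions chosen only for illustration.
vocab_size = 50_000     # assumed token vocabulary
context_length = 1_000  # assumed maximum prompt length in tokens

distinct_prompts = vocab_size ** context_length
# Order of magnitude: ~10^4698 possible prompts, vastly more than could
# ever be tested, so behavioral guarantees must rest on extrapolation.
print(f"~10^{len(str(distinct_prompts)) - 1} possible prompts")
```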
↑ comment by Roko · 2024-12-18T05:18:06.597Z · LW(p) · GW(p)
There are plenty of systems for which we rationally form beliefs about likely outputs without a full understanding of how they work. Weather prediction is an example.
↑ comment by Signer · 2024-12-18T15:06:18.391Z · LW(p) · GW(p)
What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of a vague "LLMs are a lot like human uploads". And weather prediction outputs numbers connected to reality that we actually care about. And there is no credible alternative hypothesis that implies weather prediction shouldn't work.
I don't want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think the evidence changed its plausibility, as formally as possible.
comment by p4rziv4l · 2024-12-16T00:09:07.724Z · LW(p) · GW(p)
What it says: irrelevant
How it thinks: irrelevant
It has always been about what it can do in the real world.
If it can generate substantial amounts of money and buy server capacity, or hack into computer systems, then we get cyberlife, a.k.a. autonomous, rogue, self-sufficient AI, subject to Darwinian forces on the internet, leading to more of those qualities, which improve its online fitness, all the way up to a full-blown takeover.
comment by Trevor Hill-Hand (Jadael) · 2024-12-15T22:41:18.303Z · LW(p) · GW(p)
It seems like if there is any non-determinism at all, there is always going to be an unavoidable potential for naughty thoughts, so whatever you call the "AI" must address them as part of its function anyway; either that, or there is a deterministic solution?