Is "Strong Coherence" Anti-Natural?

post by DragonGod · 2023-04-11T06:22:22.525Z · LW · GW · 6 comments

This is a question post.

Contents

  Background and Core Concepts
  The Argument
  Answers
    7 tailcalled
    6 Portia
    3 rotatingpaguro
    1 PaulK
6 comments

Background and Core Concepts

I operationalised "strong coherence" as [LW(p) · GW(p)]:

Informally: a system has immutable terminal goals.

Semi-formally: a system's decision making is well described as an approximation of argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states.

 

And contended that humans, animals (and learning based agents more generally?) seem to instead have values ("contextual influences on decision making").
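
As a concrete illustration of the semi-formal definition, here is a minimal sketch (in Python, with made-up names and numbers; nothing here is from the original post) of a decision procedure that is strongly coherent in this sense:

```python
# A minimal sketch of the semi-formal definition: decision making as
# (approximate) argmax over actions of the expected value of a single,
# fixed utility function over states. Names and numbers are illustrative.

def strongly_coherent_choice(actions, transition, utility):
    """Pick the action maximising expected utility.

    actions:    list of available actions
    transition: action -> dict mapping resulting states to probabilities
    utility:    fixed function state -> float, never edited by the agent
    """
    def expected_utility(action):
        return sum(p * utility(state) for state, p in transition(action).items())
    return max(actions, key=expected_utility)

# Toy usage:
actions = ["save", "spend"]
transition = lambda a: {"rich": 0.8, "poor": 0.2} if a == "save" else {"rich": 0.3, "poor": 0.7}
utility = lambda s: {"rich": 1.0, "poor": 0.0}[s]
assert strongly_coherent_choice(actions, transition, utility) == "save"
```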

The shard theory account of value formation in learning based agents is something like:

 

And I think this hypothesis of how values form in intelligent systems could be generalised out of an RL context to arbitrary constructive optimisation processes[1]. The generalisation may be something like:

Decision making in intelligent systems is best described as "executing computations/cognition that historically correlated with higher performance on the objective functions a system was selected for performance on"[2].

 

This seems to be an importantly different type of decision making from expected utility maximisation[3]. For succinctness, I'd refer to systems of the above type as "systems with malleable values".
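
To contrast with the argmax sketch above, here is an equally hedged sketch of a "system with malleable values": decisions come from contextually activated computations ("shards") whose strengths were shaped by whatever historically correlated with performance during training. This is an illustrative formalisation only, not a claim about how shard theory must be written down.

```python
# Illustrative sketch of "malleable values": contextually activated
# computations ("shards") push for actions with context-dependent strength.
# No single fixed utility function over states is being maximised.

class Shard:
    def __init__(self, activation, action_bids):
        self.activation = activation    # context -> how strongly this shard fires
        self.action_bids = action_bids  # context -> {action: how hard it pushes}

def malleable_choice(context, shards, actions):
    scores = {a: 0.0 for a in actions}
    for shard in shards:
        weight = shard.activation(context)
        bids = shard.action_bids(context)
        for action in actions:
            scores[action] += weight * bids.get(action, 0.0)
    return max(actions, key=scores.get)

# During training, the selection process (SGD on reward, evolution, ...)
# would strengthen shards whose activity historically correlated with higher
# performance; it edits the shards themselves, not a terminal-goal slot.
```

The point of the sketch is only that the "values" live in the activation and bid functions, which vary with context and get reshaped by training.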


The Argument

In my earlier post I speculated that "strong coherence is anti-natural". To operationalise that speculation:

  1. ^

    E.g:

    * Stochastic gradient descent

    * Natural selection/other evolutionary processes

  2. ^
  3. ^

    Of a single fixed utility function over states.

  4. ^

    E.g. I'm under the impression that humans can't explicitly design an algorithm that achieves AlexNet-level accuracy on the ImageNet dataset.

    I think the self-supervised learning that underlies neocortical cognition is a much harder learning task.

    I believe that learning is the only way there is to create capable intelligent systems that operate in the real world given our laws of physics. 

Answers

answer by tailcalled · 2023-04-11T07:39:08.042Z · LW(p) · GW(p)

There needs to be some process which, given a context, specifies what value shards should be created (or removed/edited) to better work in that context. It's not clear we can't think of this process as constituting the system's immutable goal in some sense, especially as the system gets more powerful. That said, it would probably not be strongly coherent by your semi-formal definition.
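
One way to read this point (a toy formalisation with an invented update rule, not necessarily what tailcalled has in mind): the shards themselves are mutable, but the procedure that creates and reweights them can be fixed, and that fixed procedure plus its training signal is the candidate "immutable goal in some sense".

```python
# Toy sketch: shard weights change over time (malleable values), while the
# update procedure itself never changes. "reward" stands in for whatever
# fixed training signal drives shard formation; the rule below is invented
# purely for illustration (crude credit assignment).

def update_shard_weights(weights, activations, reward, lr=0.1):
    """Strengthen shards that were active when reward arrived."""
    return {name: w + lr * reward * activations.get(name, 0.0)
            for name, w in weights.items()}

weights = {"sweet-tooth": 0.5, "curiosity": 0.5}
weights = update_shard_weights(weights, activations={"sweet-tooth": 1.0}, reward=1.0)
# -> {"sweet-tooth": 0.6, "curiosity": 0.5}: the values moved, the rule did not.
```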

answer by Portia · 2023-03-06T15:55:53.895Z · LW(p) · GW(p)

I think you are onto something, with the implication that building a highly intelligent, learning entity with strong coherence in this sense is unlikely, and hence, getting it morally aligned in this fashion is also unlikely. Which isn't that bad, insofar as plans for aligning it that way honestly did not look particularly promising.

Which is why I have been advocating for instead learning from how we teach morals to existing complex intelligent agents - namely, through ethical, rewarding interactions in a controlled environment that slowly allows more freedom.

We know how to do this, it does not require us to somehow define the core of ethics mathematically. We know it works. We know how setbacks look, and how to tackle them. We know how to do this with human interactions the average person can do/train, rather than with code. It seems easier and more doable and promising in so many ways. 

That doesn't mean it will be easy, or risk-free, and it still comes with a hell of a lot of problems based on the fact that AIs, even machine learning ones, are quite simply not human: they are not inherently social, they do not inherently have altruistic urges, and they do not inherently have empathic abilities. But I see a clearer path to dealing with that than to directly encoding an abstract ethics into an intelligent, flexible actor.

answer by rotatingpaguro · 2023-04-11T10:24:00.127Z · LW(p) · GW(p)

EDIT: I found out my answer is quite similar to this other one [LW(p) · GW(p)] you probably read already.

I think not.

Imagine such a malleable agent's mind as made of parts. Each part of the mind does something. There's some arrangement of the things each part does, and how many parts do each kind of thing. We won't ask right now where this organization comes from, but take it for given.

Imagine that---be it by chance or design---some parts were cooperating, while some were not. "Cooperation" means taking actions that bring about a consequence in a somewhat stable way, so something towards being coherent and consequentialist, although not perfectly so by any measure. The other parts would oftentimes work at cross purposes, treading on each other's toes. "Working at cross purposes", again, in other words means not being consequentialist and coherent; from the point of view of the parts, there may not even be a notion of "cross purposes" if there is no purpose.

By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the non-coherent parts could hinder them, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.

Conclusion 1: Intelligent systems in the real world do not converge towards strong coherence

It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence. Actually, I'd expect that any sufficiently sophisticated bounded agent would not introspectively look coherent to itself if it spent enough time to think about it. Would the trend break after us?

Would you take a pill that would make you an expected utility maximiser?

Would you take a pill that made you a bit less coherent? Would you take a pill that made you a bit more coherent? (Not rhetorical questions.)

comment by DragonGod · 2023-04-11T12:52:19.397Z · LW(p) · GW(p)

By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the non-coherent parts could hinder them, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.

I think this fails to adequately engage with the hypothesis that values are inherently contextual.

Alternatively: the kind of cooperation you describe, where a subset of values consistently optimises the system's outputs in a consequentialist manner towards a fixed terminal goal, is highly unrealistic for nontrivial terminal goals.

Shards "cooperating" manifest in a qualitatively different manner.

 
More generally, a problem with aggregate-coherence hypotheses is that a core claim of shard theory is that different shards are weighted differently in different contexts.

In general shards activate more strongly in particular contexts, less strongly in others.

So there is no fixed weight assigned to the shards, even when just looking at the subset of shards that cooperate with each other.

As such, I don't think the behaviour of learning agents within the shard ontology can be well aggregated into a single fixed utility function over agent states.

Not even in any sort of limit of reflection or enhancement, because values within the shard ontology are inherently contextual.
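
A toy numerical illustration of this point (constructed here, not taken from shard theory writing): if shard weights vary with context, the same two outcomes can be ranked in opposite orders in different contexts, and then no single fixed utility function over those outcomes reproduces the behaviour (one can of course index the utility by context, but then it is no longer a single fixed function over states).

```python
# Toy example: context-dependent shard weights produce a preference reversal
# over the same two outcomes, so no single context-free utility over outcomes
# reproduces both choices. All numbers are arbitrary.

shard_values = {                 # each shard's evaluation of two outcomes
    "food-shard":      {"eat": 1.0, "read": 0.0},
    "curiosity-shard": {"eat": 0.0, "read": 1.0},
}

context_weights = {              # how strongly each shard activates per context
    "hungry":   {"food-shard": 0.9, "curiosity-shard": 0.1},
    "satiated": {"food-shard": 0.1, "curiosity-shard": 0.9},
}

def choose(context):
    scores = {
        outcome: sum(context_weights[context][s] * shard_values[s][outcome]
                     for s in shard_values)
        for outcome in ("eat", "read")
    }
    return max(scores, key=scores.get)

assert choose("hungry") == "eat"
assert choose("satiated") == "read"
# A single fixed utility over {"eat", "read"} would have to rank them once and
# for all; this agent's ranking flips with context.
```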

 

It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence.

Motivate this claim please.

 

Would you take a pill that made you a bit less coherent? Would you take a pill that made you a bit more coherent? (Not rhetorical questions.)

Nope in both cases. I'd take pills to edit particular values[1] but wouldn't directly edit my coherence in an unqualified fashion.


  1. I'm way too horny, and it's honestly pretty maladaptive and inhibits my ability to execute on values I reflectively endorse more. ↩︎

Replies from: rotatingpaguro
comment by rotatingpaguro · 2023-04-11T15:18:25.671Z · LW(p) · GW(p)

Alternatively, the kind of cooperation you describe where a subset of values consistently optimise the system's outputs in a consequentialist manner towards a fixed terminal goal is highly unrealistic for nontrivial terminal goals.

I agree it's unrealistic in some sense. That's why I qualified "assuming the purpose was reachable enough". In this "evolutionary" interpretation of coherence, there's a compromise between attainability of the goal and the cooperation needed to achieve it. Some goals are easier. So in my framework, where I consider humans the pinnacle of known coherence, I do not consider it valid to say that a rock is more coherent just because it is very good at being a rock. About realism, I consider humans very unlikely a priori (we seem to be alone), but once there are humans around, the important low-probability thing has already happened.

As such, I don't think the behaviour of learning agents within the shard ontology can be well aggregated into a single fixed utility function over agent states.

In this part of your answer, I am not sure whether you are saying "emerging coherence is forbidden in shard theory" or "I think emerging coherence is false in the real world".

Answering to "emerging coherence is forbidden": I'm not sure because I don't know shard theory beyond what you are saying here, but: "values are inherently contextual" does not mean your system is not flexible enough to allow implementing coherent values within it, even if they do not correspond to the things you labeled "values" when defining the system. It can be unlikely, which leads back to the previous item, which leads back to the disagreement about humans being coherent.

Answering to "I think emerging coherence is false in the real world": this leads back again to to the disagreement about humans being coherent.

It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence.

Motivate this claim please.

The crux! I said that purely out of intuition. I find this difficult to argue because, for any specific example I think of where I say "humans are more coherent and consequentialist than the cat here", I imagine you replying "No, humans are more intelligent than the cat, and so can deploy more effective strategies for their goals, but these goals and strategies are still all sharded, maybe even more than in the cat". Maybe the best argument I can make is: it seems to me humans have more of a conscious outer loop than other animals, with more power over the shards, and the additional consequentiality and coherence (weighted by task difficulty) are mostly due to this outer loop, not to a collection of more capable shards. But this is not a precise empirical argument.

Nope in both cases. I'd take pills to edit particular values[1] but wouldn't directly edit my coherence in an unqualified fashion.

I think you answered the question "would you take a pill, where the only thing you know about the pill, is that it will "change your coherence" without other qualifications, and without even knowing precisely what "coherence" is?" Instead I meant to ask "how would the coherence-changing side effects of a pill you wanted to take for some other reason influence your decision". It seems to me your note about why you would take a dehornying pill points in the direction of making you more coherent. The next question would then be "of all the value-changing pills you can imagine yourself taking, how many increase coherence, and how many decrease it?", and the next "where does the random walk in pill space bring you?"

Replies from: anonymousaisafety
comment by anonymousaisafety · 2023-04-11T19:48:58.532Z · LW(p) · GW(p)

It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence.

This isn't a universally held view. Someone wrote a fairly compelling argument against it here: https://sohl-dickstein.github.io/2023/03/09/coherence.html

Replies from: rotatingpaguro
comment by rotatingpaguro · 2023-04-12T14:21:48.500Z · LW(p) · GW(p)

For context: the linked post presents a well-designed survey of experts about the intelligence and coherence of various entities. The answers show a clear coherence-intelligence anti-correlation. The questions they ask the experts are:

Intelligence:

"How intelligent is this entity? (This question is about capability. It is explicitly not about competence. To the extent possible do not consider how effective the entity is at utilizing its intelligence.)"

Coherence:

"This is one question, but I'm going to phrase it a few different ways, in the hopes it reduces ambiguity in what I'm trying to ask: How well can the entity's behavior be explained as trying to optimize a single fixed utility function? How well aligned is the entity's behavior with a coherent and self-consistent set of goals? To what degree is the entity not a hot mess of self-undermining behavior? (for machine learning models, consider the behavior of the model on downstream tasks, not when the model is being trained)"

Plot of answers

Of course there's the problem of what people's judgements of "coherence" are measuring. In considering possible ways of making the definition more clear, the post says:

For machine learning models within a single domain, we could use robustness of performance to small changes in task specification, training random seed, or other aspects of the problem specification. For living things (including humans) and organizations, we could first identify limiting resources for their life cycle. For living things these might be things like time, food, sunlight, water, or fixed nitrogen. For organizations, they could be headcount, money, or time. We could then estimate the fraction of that limiting resource expended on activities not directly linked to survival+reproduction, or to an organization's mission. This fraction is a measure of incoherence.

It seems to me the kind of measure proposed for machine learning systems is at odds with the one for living beings. For ML, it's "robustness to environmental changes". For animals, it's "spending all resources on survival". For organizations, "spending all resources on the stated mission". By the for-ML definition, humans, I'd say, win: they are the best entity at adapting, whatever their goal. By the for-animals definition, humans would lose completely. So these are strongly inconsistent definitions. I think the problem is fixing the goal a priori: you don't get to ask "what is the entity pursuing, actually?", but proclaim "the entity is pursuing survival and reproduction", "the organization is pursuing what it says on paper". Even though they are only speculative definitions, not used in the survey, I think they are evidence of confusion in the mind of whoever wrote them, and potentially in the survey respondents (alternative hypothesis: sloppiness, "survival+reproduction" was intended for most animals but not humans).
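
For reference, the for-living-things proposal above is easy to state as a formula, which also makes the objection concrete; a sketch with entirely made-up numbers:

```python
# Sketch of the linked post's proposed incoherence measure for living things
# and organizations: the fraction of a limiting resource spent on activities
# not directly linked to the assumed goal. The time budget below is invented.

def incoherence_fraction(resource_use, goal_linked):
    total = sum(resource_use.values())
    off_goal = sum(v for activity, v in resource_use.items()
                   if activity not in goal_linked)
    return off_goal / total

human_hours = {"earning/foraging": 8, "sleep": 8, "child care": 2, "art/leisure": 6}
print(incoherence_fraction(human_hours,
                           goal_linked={"earning/foraging", "sleep", "child care"}))
# -> 0.25, but only if "survival+reproduction" is fixed as the goal a priori,
# which is precisely the move criticised in the paragraph above.
```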

So, what did the experts read in the question?

"How well can the entity's behavior be explained as trying to optimize a single fixed utility function? How well aligned is the entity's behavior with a coherent and self-consistent set of goals? To what degree is the entity not a hot mess of self-undermining behavior?"

Take two entities at opposite ends in the figure: the "single ant" (judged most coherent) and a human (judged least coherent).

..............

SINGLE ANT vs. HUMAN

How well can your behavior be explained as trying to optimize a single fixed utility function?

ANT: A great heap, sir! I have a simple and clear utility function! Feed my mother the queen!

HUMAN: Wait, wait, wait. I bet you would stop feeding your queen as soon as I put you somewhere else. It's not utility, it's just learned patterns of behavior.

ANT: Oi, that's not valid, sir! That's cheating! You can do that just because you are more intelligent and powerful. And what would be your utility function, dare I ask?

HUMAN: Well, uhm, I value many things. Happiness, but sometimes also going through adversity; love; good food... I don't know how to state my utility function. I just know that I happen to want things, and when I do, you sure can describe me as actually trying to get them, not just "doing the usual, and, you know, stuff happens".

ANT: You are again conflating coherence with power! Truth is, many things make you powerless, like many things make me! You are big in front of me, but small in front of the universe! If I had more power, I'd be very, very good at feeding the queen!

HUMAN: As I see it, it's you who's conflating coherence with complexity. I'm complex, and I also happen to have a complex utility. If I set myself to a goal, I can do it even if it's "against my nature". I'm retargetable. I can be compactly described as goals separate from capabilities. If you magically became stronger and more intelligent, I bet you would be very, very bent on making tracks, super-duper gung-ho on touching other ants with your antennas in weird patterns you like, and so on. You would not get creative about it. Your supposed "utility" would shatter.

ANT: So you said yourself that if I became as intelligent as you, I'd shatter my utility, and so appear less coherent, like you are! Checkmate human!

HUMAN: Aaargh, no, you are looking at it all wrong. You would not be like me. I can recognize in myself all the patterns of shattered goals, all my shards, but I can also see beyond that. I can transcend. You, unevolved ant, magically scaled in some not well defined brute-force just-zooming sense, would be left with nothing in your mind but the small-ant shards, and insist on them.

ANT: What's with that "not well defined etc." nonsense? You don't actually know! For all you know about how this works, scaling my mind could make me get bent on feeding the queen, not just "amplify" my current behaviors!

HUMAN: And conceding that possibility, would you not be more coherent then?

ANT: No way! I would be as coherent as now, just more intelligent!

HUMAN: Whatevs.

How well aligned is your behavior with a coherent and self-consistent set of goals?

ANT: I'm super-self-consistent! I don't care about anything but queen-feeding! I'll happily sacrifice myself to that end! Actually, I'd not even let myself die happily, I'd die caring-for-the-queen-ly!

HUMAN: Uff, I bet my position will be misunderstood again, but anyway: I don't know how to compactly specify my goals, I internally perceive my value as many separate pieces, so I can't say I'm consistent in my value-seeking with a straight face. However, I'm positive that I can decide to suppress any of my value-pieces to get more whole-value, even suppress all of my value-pieces at once. This proves there's a single consistent something I value. I just don't know how to summarize or communicate what it is.

ANT: "That" "proves" you "value" the heck what? That proves you don't just have many inconsistent goals, you even come equipped with inconsistent meta-goals!

HUMAN: To know what that proves, you have to look at my behavior, and my success at achieving goals I set myself to. In the few cases where I make a public precommitment, you have nice clear evidence I can ignore a lot of immediate desires for something else. That's evidence for my mind-system doing that overall, even if I can't specify a single, unique goal for everything I ever do at once.

ANT: If your "proof" works, then it works for me too! I surely try to avoid dying in general, yet I'll die for the queen! Very inconsistent subgoals, very clear global goal! You're at net disadvantage because you can not specify your goal, ant-human 2-1!

HUMAN: This is an artefact of you not being an actual ant but a rhetorical "ANT" implemented by a human. You are even more simple than a real ant, yet contained in something much larger and self-reflective. As a real ant, I expect you would have both a more complicated global goal than what appears by saying "feed the queen", and that you would not be able to self-reflect on the totality of it.

ANT: Sophistry! You are still recognizing the greater simplicity of the real-me goal, which makes me more consistent!

HUMAN: We always come to that. I'm more complex, not less consistent.

To what degree are you not a hot mess of self-undermining behavior?

ANT: No cycles wasted, a single track, a single anthill, a single queen, that's your favorite ant's jingle!

HUMAN: Funny but no. Your inter-ant communications are totally inefficient. You waste tons of time wandering almost randomly, touching the other ants here and there, to get the emergent swarm behavior. I expect nanotechnology in principle could make you able to communicate via radio. We humans invented tech to make inter-human communications efficient to pursue our goals; you can't, so your behaviors are self-undermining.

ANT: All my allowed behaviors are not undermining! My mind is perfect, my body is flawed! Your mind undermines itself from the inside!

HUMAN: The question says "behaviors", which I'd interpret as outward actions, but let's concede the interpretation as internal behaviors of the mind. I know it's speculative, but again, I expect real-ant to have a less clean mind-state than you make it appear, in proportion to its behavioral complexity.

ANT: No comment, apart from underlining "speculative"! Since you admitted "suppressing your goals" before, isn't that "undermining" at its fullest?

HUMAN: You said that of yourself too.

ANT: But you seemed to imply you have a lot more of these goals-to-suppress!

HUMAN: Again: my values are more complex, and your simplicity is in part an artefact.

............

The cruxes I see in the ant-human comparison are:

  1. we reflect on ourselves, while we do not perceive the ant as ants;

  2. our value is more complex, and our intelligence allows us to do more complicated things to get it.

I think the experts mostly saw "behavioral simplicity" and "simply stated goals" in the question, but not the "adaptability in pursuing whatever it's doing" proposed later for ML systems. I'd argue instead that something being a "goal" instead of a "behavior" is captured by there being many different paths leading to it, and coherence is about preferring things in some order and so modifying your behavior to that end, rather than having a prefixed simple plan.

I can't see how to clearly disentangle complexity, coherence, intelligence. Right now I'm confused enough that I would not even know what to think if someone from the future told me "yup, science confirms humans are definitely more/less coherent than ants".

  1. I don't understand what "discount factor" to apply when deciding how coherent a more complex entity is.

  2. ... an entity with more complex values.

  3. ... an entity with more available actions.

  4. ... an entity that makes more complicated plans.

  5. What would be the implication of this complexity-discounted coherence notion, anyway? Do I want some "raw" coherence measure instead to understand what an entity does?

answer by PaulK · 2023-04-11T07:38:11.518Z · LW(p) · GW(p)

(A somewhat theologically inspired answer:)

Outside the dichotomy of values (in the shard-theory sense) vs. immutable goals, we could also talk about valuing something that is in some sense fixed, but "too big" to fit inside your mind. Maybe a very abstract thing. So your understanding of it is always partial, though you can keep learning more and more about it (and you might shift around, feeling out different parts of the elephant). And your acted-on values would appear mutable, but there would actually be a, perhaps non-obvious, coherence to them.

It's possible this is already sort of a consequence of shard theory? In the sense that learned values would have coherences to them, because they accord with (perhaps very abstract or complex) invariant structure in the environment?

comment by PaulK · 2023-04-11T07:40:38.794Z · LW(p) · GW(p)

Oh, huh, this post was on the LW front page, and dated as posted today, so I assumed it was fresh, but the replies' dates are actually from a month ago.

Replies from: lahwran, DragonGod
comment by the gears to ascension (lahwran) · 2023-04-11T08:12:33.557Z · LW(p) · GW(p)

lesswrong has a bug that allows people to restore their posts to "new" status on the frontpage by moving them to draft and then back.

Replies from: TekhneMakre
comment by TekhneMakre · 2023-04-11T12:24:56.670Z · LW(p) · GW(p)

Uh, this seems bad and anti-social? This bug/feature should either be made an explicit feature, or else it's a bug and using it is defecting. @Ruby

Replies from: DragonGod
comment by DragonGod · 2023-04-11T12:44:13.768Z · LW(p) · GW(p)

I mean I think it's fine.

I have not experienced the feature being abused.

In this case I didn't get any answers the last time I posted it and ended up needing answers so I'm reposting.

Better than posting the entire post again as a new post and losing the previous conversation (which is what would happen if not for this feature).

Like what's the argument that it's defecting? There are just legitimate reasons to repost stuff and you can't really stop users from reposting stuff.

FWIW, it was a mod that informed me of this feature.

Replies from: TekhneMakre
comment by TekhneMakre · 2023-04-11T12:56:57.437Z · LW(p) · GW(p)

If it's a mod telling you with the implication that it's fine, then yeah, it's not defecting and is good. In that case I think it should be an explicit feature in some way!

Replies from: DragonGod
comment by DragonGod · 2023-04-11T13:02:45.792Z · LW(p) · GW(p)

I mean I think it can be abused, and the use case where I was informed of it was a different use case (making substantial edits to a post). I do not know that they necessarily approve of republishing for this particular use case.

But the alternative to republishing for this particular use case is just reposting the question as an entirely new post which seems strictly worse.

Replies from: TekhneMakre
comment by TekhneMakre · 2023-04-11T13:28:43.718Z · LW(p) · GW(p)

Of course there is also the alternative of not reposting the question. What's possibly defecty is that maybe lots of people want their thing to have more attention, so it's potentially a tragedy of the commons. Saying "well, just have those people who most want to repost their thing, repost their thing" could in theory work, but it seems wrong in practice, like you're just opening up to people who don't value others' attention enough.

One could also ask specific people to comment on something, if LW didn't pick it up.

Replies from: DragonGod
comment by DragonGod · 2023-04-11T14:27:13.502Z · LW(p) · GW(p)

A lot of LessWrong actually relies on just trusting users not to abuse the site/features.

I make judgment calls on when to repost keeping said trust in mind.

And if reposts were a nuisance people could just mass downvote reposts.

But in general, I think it's misguided to try and impose a top down moderation solution given that the site already relies heavily on user trust/judgment calls.

This repost hasn't actually been a problem and is only being an issue because we're discussing whether it's a problem or not.

comment by DragonGod · 2023-04-11T10:13:18.408Z · LW(p) · GW(p)

Reposted it because I didn't get any good answers last time; I'm currently working on a successor post to this one and would really appreciate the good answers I did not get.

comment by DragonGod · 2023-04-11T13:23:38.476Z · LW(p) · GW(p)

My claim is mostly that real world intelligent systems do not have values that can be well described by a single fixed utility function over agent states.

I do not see this answer as engaging with that claim at all.

If you define utility functions over agent histories, then everything is an expected utility maximiser for the function that assigns positive utility to whatever action the agent actually took and zero utility to every other action.
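
To spell that construction out with a toy example (the history and actions below are invented):

```python
# Toy version of the degenerate construction: over action histories, any
# behaviour "maximises" the utility that assigns 1 to the (situation, action)
# pairs actually taken and 0 to everything else.

def degenerate_utility(observed_history):
    taken = set(observed_history)
    return lambda situation, action: 1.0 if (situation, action) in taken else 0.0

history = [("saw food", "ate"), ("saw book", "read")]
u = degenerate_utility(history)

# By construction, every recorded choice is already an argmax:
assert all(u(situation, action) >= u(situation, other)
           for situation, action in history
           for other in ("ate", "read", "slept"))
```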

I think such a definition of utility function is useless.

If however you define utility functions over agent states, then your hypothesis doesn't engage with my claim at all. The reason that real world intelligent systems aren't utility functions isn't because the utility function is too big to fit inside them or because of incomplete knowledge.

My claim is that no such utility function exists that adequately describes the behaviour of real world intelligent systems.

I am claiming that there is no such mathematical object, no single fixed utility function over agent states that can describe the behaviour of humans or sophisticated animals.

Such a function does not exist.

Replies from: PaulK
comment by PaulK · 2023-04-11T19:10:13.031Z · LW(p) · GW(p)

Sorry, I guess I didn't make the connection to your post clear. I substantially agree with you that utility functions over agent-states aren't rich enough to model real behavior. (Except, maybe, at a very abstract level, a la predictive processing? (which I don't understand well enough to make the connection precise)). 

Utility functions over world-states -- which is what I thought you meant by 'states' at first -- are in some sense richer, but I still think inadequate.

And I agree that utility functions over agent histories are too flexible.

I was sort of jumping off to a different way to look at value, which might have both some of the desirable coherence of the utility-function-over-states framing, but without its rigidity.

And this way is something like, viewing 'what you value' or 'what is good' as something abstract, something to be inferred, out of the many partial glimpses of it we have in the form of our extant values.

6 comments

Comments sorted by top scores.

comment by Vladimir_Nesov · 2023-03-02T14:30:23.154Z · LW(p) · GW(p)

Systems with malleable values do not self modify to have (immutable) terminal goals

Consider the alternative framing where agents with malleable values don't modify themselves, but still build separate optimizers with immutable terminal goals.

These two kinds of systems could then play different roles. For example, strong optimizers with immutable goals could play the role of laws of nature, making the most efficient use of underlying physical substrate to implement many abstract worlds where everything else lives. The immutable laws of nature in each world could specify how and to what extent the within-world misalignment catastrophes get averted, and what other value-optimizing interventions are allowed outside of what the people who live there do themselves.

Here, strong optimizers are instruments of value, they are not themselves optimized to be valuable content. And the agents with malleable values are the valuable content from the point of view of the strong optimizers, but they don't need to be very good at optimizing things for anything in particular. The goals of the strong optimizers could be referring to an equilibrium of what the people end up valuing, over the vast archipelago of civilizations that grow up with many different value-laden laws of nature, anticipating how the worlds develop given these values, and what values the people living there end up expressing as a result.

But this is a moral argument, and misalignment doesn't respect moral arguments. Even if it's a terrible idea for systems with malleable values to either self modify into strong immutable optimizers or build them, that doesn't prevent the outcome where they do that regardless and perish as a result, losing everything of value. Moloch is the most natural force in a disorganized society that's not governed by humane laws of nature. Only nothingness above.

comment by niplav · 2023-04-11T14:54:47.977Z · LW(p) · GW(p)

To get to coherence, you need a method that accepts incoherence and spits out coherence. In the context of preferences, two datapoints:

  • You can compute the Hodge decomposition of a weakly connected directed edge-weighted graph in polynomial time, and the algorithm is AFAIK feasible in practice, but directed edge-weighted graphs can't represent typical incoherent preferences such as the Allais paradox.
  • Computing the set of acyclic tournaments with the smallest graph-edit distance to a given directed graph seems to be at least NP-hard, and the best algorithm I have for it is factorial in the number of nodes.

So it looks like computing the coherent version of incoherent preferences is computationally difficult. I don't know about approximations, or how this applies to the Helmholtz decomposition (though vector fields also can't represent all the known incoherence).
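
For the first datapoint, a minimal sketch of the polynomial-time step: fit per-option scores to the observed pairwise preference intensities by least squares (the "gradient" part of a Hodge-style decomposition), and treat the leftover residual as the cyclic, incoherent part. The toy data and variable names are invented.

```python
# Least-squares "gradient" fit of pairwise preference intensities, in the
# spirit of the Hodge decomposition mentioned above. Toy data only.

import numpy as np

options = ["A", "B", "C"]
# (i, j, flow): positive flow means option j is preferred to option i by that margin.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 0.5)]   # slightly intransitive intensities

A = np.zeros((len(edges), len(options)))
b = np.zeros(len(edges))
for row, (i, j, flow) in enumerate(edges):
    A[row, i], A[row, j] = -1.0, 1.0
    b[row] = flow

scores, *_ = np.linalg.lstsq(A, b, rcond=None)    # coherent part: a global score per option
residual = b - A @ scores                          # cyclic part: the irreducible incoherence
print(dict(zip(options, scores.round(3))), residual.round(3))
```

This handles preference intensities on a graph; as noted above, it cannot represent incoherence like the Allais paradox, which is not a pairwise-intensity structure.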

comment by cfoster0 · 2023-03-01T23:01:06.422Z · LW(p) · GW(p)

Informally: a system has (immutable) terminal goals. Semiformally: a system's decision making is well described as (an approximation) of argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function.

Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.

Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.

Replies from: DragonGod, DragonGod
comment by DragonGod · 2023-03-01T23:09:40.259Z · LW(p) · GW(p)

The "or higher mappings thereof" is to accommodate agents that choose state —> action policies directly, and agent that choose policies over ... over policies, so I'll keep it.

 

I don't actually know if my critique applies well to systems that have non-immutable terminal goals.

I guess if you have sufficiently malleable terminal goals, you recover values almost exactly.

comment by DragonGod · 2023-03-01T23:06:35.267Z · LW(p) · GW(p)

Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.

Will do.

 

Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.

If you define your utility function in a sufficiently convoluted manner, then everything is a utility maximiser.

Less contrived, I was thinking of stuff like Wentworth's subagents [LW · GW] that identifies decision making with pareto optimality over a set of utility functions.

I think subagents comes very close to being an ideal model of agency and could probably be adapted to be a complete model.

I don't want to include subagents in my critique at this point.
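
For readers unfamiliar with the subagents framing, a minimal sketch of "decision making as Pareto optimality over a set of utility functions" (an illustrative formalisation, not Wentworth's own code):

```python
# Sketch: an option is Pareto optimal w.r.t. a committee of utility functions
# if no alternative is at least as good for every subagent and strictly
# better for at least one. The utilities below are illustrative.

def is_pareto_optimal(option, alternatives, utilities):
    for alt in alternatives:
        if alt is option:
            continue
        at_least_as_good = all(u(alt) >= u(option) for u in utilities)
        strictly_better = any(u(alt) > u(option) for u in utilities)
        if at_least_as_good and strictly_better:
            return False          # `option` is Pareto-dominated by `alt`
    return True

options = ["rest", "work", "play"]
utilities = [
    lambda o: {"rest": 2, "work": 0, "play": 1}[o],   # "comfort" subagent
    lambda o: {"rest": 0, "work": 2, "play": 1}[o],   # "ambition" subagent
]
print([o for o in options if is_pareto_optimal(o, options, utilities)])
# -> all three options survive here; a single fixed utility function would
#    typically narrow the choice to one.
```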

Replies from: cfoster0
comment by cfoster0 · 2023-03-01T23:14:18.499Z · LW(p) · GW(p)

If you define your utility function in a sufficiently convoluted manner, then everything is a utility maximiser.

Less contrived, I was thinking of stuff like Wentworth's subagents that identifies decision making with pareto optimality over a set of utility functions.

I think subagents comes very close to being an ideal model of agency and could probably be adapted to be a complete model.

I don't want to include subagents in my critique at this point.

I think what you want might be "a single fixed utility function over states" or something similar. That captures that you're excluding from critique:

  • Agents with multiple internal "utility functions" (subagents)
  • Agents whose "utility function" is malleably defined
  • Agents that have trivial utility functions, like over universe-histories