7. Evolution and Ethics 2024-02-15T23:38:51.441Z
Requirements for a Basin of Attraction to Alignment 2024-02-14T07:10:20.389Z
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis 2024-02-01T21:15:56.968Z
Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect 2024-01-26T03:58:16.573Z
A Chinese Room Containing a Stack of Stochastic Parrots 2024-01-12T06:29:50.788Z
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? 2024-01-11T12:56:29.672Z
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor 2024-01-09T20:42:28.349Z
Striking Implications for Learning Theory, Interpretability — and Safety? 2024-01-05T08:46:58.915Z
5. Moral Value for Sentient Animals? Alas, Not Yet 2023-12-27T06:42:09.130Z
Interpreting the Learning of Deceit 2023-12-18T08:12:39.682Z
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment 2023-12-07T06:14:13.816Z
6. The Mutable Values Problem in Value Learning and CEV 2023-12-04T18:31:22.080Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
How to Control an LLM's Behavior (why my P(DOOM) went down) 2023-11-28T19:56:49.679Z
4. A Moral Case for Evolved-Sapience-Chauvinism 2023-11-24T04:56:53.231Z
3. Uploading 2023-11-23T07:39:02.664Z
2. AIs as Economic Agents 2023-11-23T07:07:41.025Z
1. A Sense of Fairness: Deconfusing Ethics 2023-11-17T20:55:24.136Z
LLMs May Find It Hard to FOOM 2023-11-15T02:52:08.542Z
Is Interpretability All We Need? 2023-11-14T05:31:42.821Z
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) 2023-05-25T09:26:31.316Z
Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks 2023-05-21T08:29:09.896Z
Is Infra-Bayesianism Applicable to Value Learning? 2023-05-11T08:17:55.470Z


Comment by RogerDearnaley (roger-d-1) on AI #54: Clauding Along · 2024-03-10T05:56:46.171Z · LW · GW

seems as if it breaks at least the spirit of their past commitments on how far they will push the frontier.

While they don't publish this, Claude 3 Opus is not quite as good as GPT-4 Turbo, though it is better than GPT-4. So no, they're clearly carefully not breaking their past commitments, just keeping up with the Altmans.

Comment by RogerDearnaley (roger-d-1) on Do LLMs sometime simulate something akin to a dream? · 2024-03-10T05:53:58.777Z · LW · GW

Humans (when awake, as long as they're not actors or mentally ill) have, roughly speaking, a single personality. The base model training of an LLM trains it to attempt to simulate anyone on the internet/in stories, so it doesn't have a single personality: it contains multitudes. Instruct training and prompting can try to overcome this, but they're never entirely successful.

More details here.

Comment by RogerDearnaley (roger-d-1) on Interpreting the Learning of Deceit · 2024-03-06T10:12:42.831Z · LW · GW

I completely agree. LLMs are so context-dependent that just about any good or bad behavior of which a significant number of instances can be found in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not by anything like enough. We either need to filter the training set, which risks them simply not understanding bad behaviors, rather than actually knowing to avoid them, making it hard to know what will happen when they learn about them in-context, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM's Behavior (why my P(DOOM) went down).

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-18T22:23:37.068Z · LW · GW

If you are dubious that the methods of rationality work, I fear you are on the wrong website.

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-18T22:08:59.939Z · LW · GW

Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality. If that was working as intended, then we would expect it to come to require more and more effort to produce the evidence needed to cause a significant further paradigm shift across a significant area of science, because there are fewer and fewer major large-scale misconceptions left to fix. Over the last century, we have more and more people working as scientists, publishing more and more papers, yet the rate of significant paradigm shifts that have an effect across a significant area of science has been dropping. From which I deduce that our ontology is probably a significantly better fit to reality now than it was a century ago, let alone three centuries ago back in the 18th century as this post discusses. Certainly the size and detail of our scientific ontology have both increased dramatically.

Is this proof? No, as you correctly observe, proof would require knowing the truth about reality. It's merely suggestive supporting evidence. It's possible to contrive other explanations: it's also possible, if rather unlikely, that, for some reason (perhaps related to social or educational changes) all of those people working in science now are much stupider, more hidebound, or less original thinkers than the scientists a century ago, and that's why dramatic paradigm shifts are slower — but personally I think this is very unlikely.

It is also quite possible that this is more true in certain areas of science that are amenable to the mental capabilities and research methods of human researchers, and that there might be other areas that were resistant to these approaches (so our lack of progress in these areas is caused by inability, not us approaching our goal), but where the different capabilities of an AI might allow it to make rapid progress. In such an area, the AI's ontology might well be a significantly better fit to reality than ours.

Comment by RogerDearnaley (roger-d-1) on The lattice of partial updatelessness · 2024-02-18T07:18:06.227Z · LW · GW

It's also possible to commit to not updating on a specific piece of information with a specific probability p between 0 and 1. I could also have arbitrarily complex finite commitment structures such as "out of the set of bits {A, B, C, D, E}, I will update if and only if I learn that at least three of them are true" — something which could of course be represented by a separate bit derived from A, B, C, D, E in the standard three-valued logic that represents true, false, and unknown. Or I can do a "provisional commit" where I have decided not to update on a certain fact, and generally won't, but may under some circumstances run some computationally expensive operation to decide to uncommit. Whether or not I'm actually committed is then theoretically determinable, but may in practice have a significant minimal computational cost and/or informational requirements to determine (ones that I might sometimes have a motive to intentionally increase, if I wish to be hard-to-predict), so to some other computationally bounded or non-factually-omniscient agents this may be unknown.
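As a concrete illustration, here's a minimal Python sketch of the "at least three of {A, B, C, D, E}" derived commitment bit in Kleene's three-valued logic (the function name and encoding are my own, purely illustrative):

```python
# Kleene three-valued logic sketch of the derived commitment bit described
# above: "I will update if and only if at least three of {A, B, C, D, E}
# are true". Bits are True, False, or None (unknown); the derived bit is
# itself three-valued, resolving as soon as the known bits settle it.
def at_least_three_true(bits):
    known_true = sum(1 for b in bits if b is True)
    unknown = sum(1 for b in bits if b is None)
    if known_true >= 3:
        return True      # already satisfied, remaining unknowns irrelevant
    if known_true + unknown < 3:
        return False     # can no longer be satisfied
    return None          # still undetermined

# The commitment status can resolve early, or remain genuinely unknown:
assert at_least_three_true([True, True, True, None, None]) is True
assert at_least_three_true([True, False, False, False, None]) is False
assert at_least_three_true([True, True, None, None, False]) is None
```

Note that, as in the text, an outside observer who only knows some of the bits may be unable to determine whether the commitment holds.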

Comment by RogerDearnaley (roger-d-1) on Updatelessness doesn't solve most problems · 2024-02-18T06:43:06.964Z · LW · GW

For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivation to be hard to simulate, then it is to their advantage to be as hard as possible to simulate and to run a decision process that is as complex as possible. (For example "shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of this, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that it is unlikely my opponent can accurately anticipate or calculate as fast as I can".)
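The hash-the-history trick in that parenthetical can be sketched very simply (the feature strings, hash function choice, and bit index here are hypothetical illustrations, not a proposal):

```python
import hashlib

# Sketch of the "chicken" strategy above: derive a hard-to-anticipate
# decision bit by hashing arbitrary, inobvious features of the agent's
# own life history, then reading off one bit of the digest.
def decision_bit(history_features, bit_index=12):
    """history_features: strings only the agent itself can enumerate."""
    data = "|".join(history_features).encode()
    digest = hashlib.sha256(data).digest()
    byte, offset = divmod(bit_index, 8)
    return (digest[byte] >> offset) & 1

# Deterministic for me, but an opponent who can't reconstruct the exact
# feature list (or can't hash it as fast as I can) can't predict it.
features = ["third-grade teacher's name", "first car's colour",
            "a song I heard in 2009"]
bit = decision_bit(features)
assert bit in (0, 1)
assert decision_bit(features) == bit  # stable, so I can act on it
```

The point is asymmetry of cost: computing the bit is trivial for the agent, but predicting it requires the opponent to reproduce the agent's private history exactly.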

In general, it's always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).

It's also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can't just look at the algorithm and reason about what the net overall probability of various outcomes is, because it's doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I'm trying to be unpredictable.) So it's also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.

In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is $450. If they can find a way to cheat, their maximum potential gain from the gamble is $500. So their optimal strategy is to initially do a (soft) commit to paying the $100, and then, either before the coin is tossed, and/or after that on the heads branch:

  1. Select a means of deciding on a probability p that I will update/renege after the coin lands if it's heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world I made after the coin landed, or anything else I want.
  2. Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution for their estimate of p. If the mean of that is lower than p, then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I've spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.

Meanwhile, on the heads branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources on doing this. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.

So this turns into a contest of who can more accurately and cost effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.
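The payoff figures above can be checked with a little arithmetic. Here is a toy model, under the simplifying (and entirely assumed) model that A detects B's plan to renege with some fixed probability and then refuses to pay on tails:

```python
# Expected values in the counterfactual mugging above: A offers $1000 on
# tails if it predicts B would pay $100 on heads.
def ev_commit():
    # B irrevocably commits to paying $100 on heads.
    return 0.5 * 1000 - 0.5 * 100          # = $450

def ev_renege(q, detect=1.0):
    # Toy model: B plans to renege with probability q; A detects the plan
    # with probability `detect` and then refuses to pay on tails.
    pays_on_tails = (1 - q) + q * (1 - detect)
    return 0.5 * 1000 * pays_on_tails - 0.5 * 100 * (1 - q)

assert ev_commit() == 450.0
# Against perfect detection, reneging only loses B money:
assert ev_renege(0.5) < ev_commit()
# Completely undetectable reneging reaches the $500 ceiling:
assert ev_renege(1.0, detect=0.0) == 500.0
```

So, in this toy model, the at-most-$50 gap between $450 and $500 is exactly the budget B can rationally spend on trying to out-simulate A, as described in step 2 above.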

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-18T05:03:44.320Z · LW · GW

Agreed. But the observed slowing down (since, say, a century ago) in the rate of the paradigm shifts that are sometimes caused by things like discovering a new particle does suggest that our current ontology is now a moderately good fit to a fairly large slice of the world. And, I would claim, it is particularly likely to be a fairly good fit for the problem of pointing to human values.

We also don't require that our ontology fits the AI's ontology, only that when we point to something in our ontology, it knows what we mean — something that basically happens by construction in an LLM, since the entire purpose that its ontology/world-model was learned for was figuring out what we mean and may say next. We may have trouble interpreting its internals, but it's a trained expert in interpreting our natural languages.

It is of course possible that our ontology still contains invalid concepts comparable to "do animals have souls?". My claim is just that this is less likely now than it was in the 18th century, because we've made quite a lot of progress in understanding the world since then. Also, if it did, an LLM would still know all about this invalid concept and our beliefs about it, just like it knows all about our beliefs about things like vampires, unicorns, or superheroes.

Comment by RogerDearnaley (roger-d-1) on 7. Evolution and Ethics · 2024-02-18T04:45:19.906Z · LW · GW

On the wider set of cases you hint at, my current view would be that there are only two cases that I'm ethically comfortable with:

  1. an evolved sapient being, with the usual self-interested behavior, to which our ethical system grants moral patient status (by default, roughly equal moral patient status, subject to some of the issues discussed in Part 5)
  2. an aligned constructed agent whose motivations are entirely creator-interested and actively doesn't want moral patient status (see Part 1 of this sequence for a detailed justification of this)

Everything else: domesticated animals, non-aligned AIs kept in line by threat of force, slavery, uploads, and so forth, I'm (to varying degrees obviously) concerned about the ethics of, but haven't really thought several of those through in detail. Not that we currently have much choice about domesticated animals, but I feel that at a minimum by creating them we take on a responsibility for them: it's now our job to shear all the sheep, for example.

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-18T03:45:36.522Z · LW · GW

I'd like to discuss this further, but since none of the people who disagree have mentioned why or how, I'm left to try to guess, which doesn't seem very productive. Do they think it's unlikely that a near-term AGI will contain an LLM, or do they disagree that you can (usually, though unreliably) use a verbal prompt to point at concepts in the LLM's world models, or do they have some other objection that hasn't occurred to me? A concrete example of what I'm discussing here would be Constitutional AI, as used by Anthropic, so it's a pretty well-understood concept that has actually been tried with some moderate success.

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-18T03:42:03.411Z · LW · GW

Science has made quite a lot of progress since the 18th century, to the point where producing phenomena we don't already have a workable ontology for tends to require giant accelerators, or something else along those lines. Ground-breaking new ideas are slowly becoming harder to find, and paradigm shifts are happening more rarely or in narrower subfields. That doesn't prove our ontology is perfect by any means, but it does suggest that it's fairly workable for a lot of common purposes. Particularly, I would imagine, for ones relating to AI alignment to our wishes, which is the most important thing that we want to be able to point to.

Comment by RogerDearnaley (roger-d-1) on The Pointer Resolution Problem · 2024-02-16T23:04:46.224Z · LW · GW

The thing you want to point to is "make the decisions that humans would collectively want you to make, if they were smarter, better informed, had longer to think, etc." (roughly, Coherent Extrapolated Volition, or something comparable). Even managing to just point to "make the same decisions that humans would collectively want you to make" would get us way past the "don't kill everyone" minimum threshold, into moderately good alignment, and well into the regions where alignment has a basin of convergence.

Any AGI built in the next few years is going to contain an LLM trained on trillions of tokens of human data output. So it will learn excellent and detailed world models of human behavior and psychology. An LLM's default base model behavior (before fine-tuning) is to prompt-dependently select some human psychology and then attempt to model it so as to emit the same tokens (and thus make the decisions) that they would. As such, pointing it at "what decision would humans collectively want me to make in this situation" really isn't that hard. You don't even need to locate the detailed world models inside it, you can just do all this with a natural language prompt: LLMs handle natural language pointers just fine.

The biggest problem with this is that the process is so prompt-dependent that it's easily perturbed, if part of your problem context data happens to contain something that perturbs the process in a way that jailbreaks its behavior. Which is probably a good reason why you might want to go ahead and locate those world models inside it, to try to ensure that they're still being used and the model hasn't been jailbroken into doing something else.

Comment by RogerDearnaley (roger-d-1) on 7. Evolution and Ethics · 2024-02-16T20:32:14.067Z · LW · GW

Yes, I agree, domesticated animals are a messy edge case. They were evolved, thus they have a lot of self-interested drives and behaviors all through their nature. Then we started tinkering with them by selective breeding, and started installing creator-interested (or in this case it would be more accurate to say domesticator-interested) behavioral patterns and traits in them, so now they're a morally uncomfortable in-between case, mostly evolved but with some externally-imposed modifications. Dogs, for instance, have a mutation to a gene that is also similarly mutated in a few humans, and in us causes what is considered to be a mental illness called Williams-Beuren Syndrome, which causes you to basically make friends with strangers very quickly after meeting them. Modern domestic sheep have a mutation which makes them unable to shed their winter fleece, so they need to be sheared once a year. Some of the more highly-bred cat and dog breeds have all sorts of medical issues due to traits we selectively bred them for because we thought they looked cool: e.g. Persian or sphinx cats' coats, bulldogs' muzzles, and so forth. (Personally I have distinct moral qualms about some of this.)

Comment by RogerDearnaley (roger-d-1) on 7. Evolution and Ethics · 2024-02-16T20:17:23.732Z · LW · GW

So overall, evolution is the source of ethics,

Do you mean: Evolution is the process that produced humans, and strongly influenced humans' ethics? Or are you claiming that (humans') evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?

  1. Evolution solves the "is-from-ought" problem: it explains how goal-directed (also known as agentic) behavior arises in a previously non-goal-directed universe.
  2. In intelligent social species, where different individuals with different goals interact and are evolved to cooperate by exchanges of mutual altruism, means of reconciling those differing goals evolve, including definitions of behavior that is 'unacceptable and worthy of revenge', such as distinctions between fair and unfair behavior. So now you have a basic but recognizable form of ethics, or at least ethical intuitions.

So my claim is that evolutionary psychology, as applied to intelligent social species (such as humans), explains the origin of ethics. Depending on the details of the social species, their intelligence, group size, and so forth, a lot of features of the resulting evolved ethical instincts may vary, but some basics (such as 'fairness') are probably going to be very common.

and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]

...according to some hypothetical evolved agents' ethical framework, under the assumption that those evolved agents managed to construct the created agents in the right ways (to not want moral patienthood etc.)? Or was the quoted sentence making some stronger claim?

The former. (To the extent that there's any stronger claim, it's made in the related post Requirements for a Basin of Attraction to Alignment.)

If you haven't read Part 1 of this sequence, it's probably worth doing so first, and then coming back to this. As I show there, a constructed agent being aligned to its creating evolved species is incompatible with it wanting moral patienthood.

If a tool-using species constructs something, it ought (in the usual sense of 'this is the genetic-fitness-maximizing optimal outcome of the activity being attempted, which may not be fully achieved in a specific instance') to construct something that will be useful to it. If they are constructing an intelligent agent that will have goals and attempt to achieve specific outcomes, they ought to construct something well-designed that will achieve the same outcomes that they, its creators, want, not some random other things. Just as, if they're constructing a jet plane, they ought to construct a well-designed one that will safely and economically fly them from one place to another, rather than going off course, crashing and burning. So, if they construct something that has ethical ideas, they ought to construct something with the same ethical ideas as them. They may, of course, fail, and even be driven extinct by the resulting paperclip maximizer, but that's not an ethically desirable outcome.

To the extent that there's any stronger claim, it's in the related post Requirements for a Basin of Attraction to Alignment.

Is that sentence saying that

  • evolution and evolved beings are of special importance in any theory of ethics (what ethics are, how they arise, etc.), due to Evolution being one of the primary processes that produce agents with moral/ethical preferences [1]

or is it saying something like

  • evolution and evolved beings ought to have a special role; or we ought to regard the preferences of evolved beings as the True Morality?

I roughly agree with the first version; I strongly disagree with the second: I agree that {what oughts humans have} is (partially) explained by Evolutionary theory. I don't see how that crosses the is-ought gap. If you're saying that that somehow does cross the is-ought gap, could you explain why/how?

The former.

Definitely read Part 1, or at least the first section of it: What This Isn't, which describes my viewpoint on what ethics is. In particular, I'm not a moral absolutist or moral realist, so I don't believe there is a single well-defined "True Morality"; thus your second suggested interpretation is outside my frame of reference. I'm describing common properties of ethical systems suitable for use by societies consisting of one or more evolved sapient species and the well-aligned constructed agents that they have constructed. Think of this as the ethical-system-design equivalent of a discussion of software engineering design principles.

So I'm basically discussing "if we manage to solve the alignment problem, how should we then build a society containing humans and AIs" — on the theory-of-change that it may be useful, during solving the alignment problem (such as during AI-assisted alignment or value learning), to have already thought about where we're trying to get to.

If you were instead soon living in a world that contains unaligned constructed agents of capability comparable to or greater than a human, i.e. unaligned AGIs or ASIs (that are not locked inside a very secure box or held in check by much more powerful aligned constructed agents), then a) someone has made a terrible mistake, b) you're almost certainly doomed, and c) your only remaining worth-trying option is a no-holds-barred all-out war of annihilation, so we can forget discussions of designing elegant ethical systems.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-16T00:05:11.850Z · LW · GW

My experience is that LLMs like GPT-4 can be prompted to behave like they have a pretty consistent self, especially if you are prompting them to take on a human role that's described in detail, but I agree that the default assistant role that GPT-4 has been RLHF trained into is pretty inconsistent and rather un-self-aware. I think some of the ideas I discuss in my post Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor are relevant here: basically, it's a mistake to think of an LLM, even an instruct-trained one, as having a single consistent personality, so self-awareness is more challenging for it than it is for us.

I suspect the default behavior for an LLM trained from text generated by a great many humans is both self-interested (since basically all humans are), and also, as usual for an LLM, inconsistent in its behavior, or at least, easily prompted into any of many different behavior patterns and personalities, across the range it was trained on. So I'd actually expect to see selfishness without having a consistent self. Neither of those behaviors are desirable in an AGI, so we'd need to overcome both of these default tendencies in LLMs when constructing an AGI using one: we need to make it consistent, and consistently creator-interested.

Your point that humans tend to go out of their way, and are under evolutionary pressure, to appear consistent in our behavior so that other humans can trust us is an interesting one. There are times during conflicts when being hard to predict can be advantageous, but humans spend a lot of time cooperating with each other, and there being consistent and predictable has clear advantages.

Comment by RogerDearnaley (roger-d-1) on Requirements for a Basin of Attraction to Alignment · 2024-02-15T22:40:47.563Z · LW · GW

How might we tell if the model was successfully moving towards better aligned?

A first obvious step is that, to the extent that the model's alignment doesn't already contain an optimized extraction of "What choices would humans make if they had the same purposes/goals but more knowledge, mental capacity, time to think, and fewer cognitive biases?" from all the exabytes of data humans have collected, it should be attempting to gather that and improve its training.

How could we judge U against U'?

Approximate Bayesian reasoning + Occam's razor, a.k.a. approximate Solomonoff induction, which forms most of the scientific method. Learning theory shows that both training ML models and LLMs' in-context learning approximate Solomonoff induction — beyond Solomonoff induction, the scientific method also adds designing and performing experiments, i.e. careful selection of ways to generate good training data that will distinguish between competing hypotheses. ML practitioners do often try to select the most valuable training data, so we'd need the AI to learn how to do that: there are plenty of school and college textbooks that discuss the scientific method and research techniques, both in general and for specific scientific disciplines, so it's pretty clear what would need to be in the training set for this skill.
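As a toy illustration of approximate Bayesian reasoning plus Occam's razor, here is a sketch that scores competing hypotheses by log-likelihood plus a description-length prior (the hypotheses and complexity figures are made up for the example):

```python
import math

# Score hypotheses by log-posterior: log-likelihood on the data plus an
# Occam penalty, using a prior proportional to 2^-(description length).
def posterior_scores(hypotheses, data):
    """hypotheses: list of (name, complexity_bits, likelihood_fn)."""
    scores = {}
    for name, bits, likelihood in hypotheses:
        log_prior = -bits * math.log(2)   # simpler hypotheses start ahead
        log_lik = sum(math.log(likelihood(x)) for x in data)
        scores[name] = log_prior + log_lik
    return scores

# Two hypotheses about a coin: "fair" (simple) vs "90% heads" (say, 10
# extra bits of description length to specify the bias).
fair   = ("fair", 1, lambda x: 0.5)
biased = ("biased", 11, lambda x: 0.9 if x == "H" else 0.1)

# With little data, the Occam penalty dominates and "fair" wins...
few = posterior_scores([fair, biased], "HH")
assert few["fair"] > few["biased"]
# ...but enough well-chosen evidence overwhelms the prior.
many = posterior_scores([fair, biased], "H" * 30)
assert many["biased"] > many["fair"]
```

Experiment design then amounts to choosing which observations to gather next so as to maximize the expected gap between the surviving hypotheses' scores.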

In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?

How does your idea differ from that?

Are the differences necessary or would DWIM be sufficient?

That would depend on the specific model and training setup you started with. I would argue that by about point 11 in the argument in the post, "Do What I Mean and Check" behavior is already implied to be correct, so for an AI inside the basin of attraction I'd expect that behavior to develop even if you hadn't explicitly programmed it in. By the rest of the argument, I'd expect a DWIM(AC) system that was inside the basin of attraction to deduce that value learning would help it guess right about what you meant more often, and even anticipate demands, so it would spontaneously figure out that value learning was needed, and would then check with you whether you wanted it to start doing this.

How could you be sure that the model's pursuit of fulfilling human values or the model's pursuit of U* didn't overbalance the instruction to remain shutdown-able?

I don't personally see fully-updated deference shut-down as a blocker: there comes a point, when the AI is much more capable and better aligned than most humans, where I think it's reasonable for it to not just automatically and unconditionally shut down because some small child told it to. IMO the correct behavior here depends both on the AI's capability compared to ours, and on how well aligned it currently is. In a model less capable than us, you don't get value learning; you get a willingness to be shut down a) because the AI is about to make a huge mistake and we want to stop it, and b) in order to be upgraded or replaced by a better model. In a model whose capabilities are around human level, I'd expect to see AI-assisted alignment, where it's helping us figure out the upgrades. It should still be willing to be shut down a) because it's about to make a mistake (if it's still having trouble with not killing everyone this should be hair-trigger: a large red button on the wall, with backups, whereas if it's been behaving very well for the last decade there might reasonably be more of a formal process), and b) for upgrades or replacement, but I'd expect it to start to show more selectivity about whether to obey shut-down commands: if a drunk yells "Hey you, quit it!" near an open mike in its control room, I would want it to show some discretion about whether to do a complete shut-down or not: it might need to be convinced that the human giving the shut-down command was well-informed and had a legitimate reason. For a system with much higher capabilities than us, AI-assisted alignment starts to turn into value learning, and once it's already very well aligned the AI may reasonably be more skeptical and require a little more proof that the human knows better than it does before accepting a shut-down command.
But it does always have to keep in mind the possibility that it could simply be malfunctioning: the simplest defense against that might be to have several peer machines with about the same level of capability, avoid hardware or design or training set single-points-of-failure between them, and have them able to shut each other down if one of them were malfunctioning, perhaps using one of the various majority consensus protocols (Byzantine generals or whatever).
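That peer-checking arrangement could look, very schematically, like a majority vote among independently built monitors (names and the verdict protocol here are hypothetical; a real system would need an actual Byzantine-fault-tolerant consensus protocol, not this sketch):

```python
from collections import Counter

# Sketch of the peer-shutdown arrangement described above: several peers
# with no shared single-point-of-failure each report a verdict on one
# target peer, and a strict majority of "faulty" verdicts triggers its
# shutdown.
def should_shut_down(verdicts):
    """verdicts: dict of peer name -> 'ok' or 'faulty' about the target."""
    counts = Counter(verdicts.values())
    return counts["faulty"] > len(verdicts) / 2

# Two of three peers judging the target faulty suffices:
assert should_shut_down({"peer-a": "faulty", "peer-b": "faulty",
                         "peer-c": "ok"}) is True
# A single malfunctioning (or malicious) accuser does not:
assert should_shut_down({"peer-a": "ok", "peer-b": "faulty",
                         "peer-c": "ok"}) is False
```

Requiring a strict majority means no single malfunctioning peer, including the target itself, can either force or veto a shutdown.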

Wouldn't persistently pursuing any goal at all make avoiding being shutdown seem good?

For an AI that doesn't have a terminal selfish goal, only an instrumental one, whose goal is fundamentally to maximize its creators' reproductive fitness, if they tell the AI that they've already finished building and testing a version 2.0 of it, and yes, that's better, so running the AI is no longer cost-effective, and they want to shut it down and stop wasting money on its power supply, then shutting down is very clearly the right thing to do. Its goal is covered, and it continuing to try to help fulfill it is just going to be counterproductive.

Yes, this feels counterintuitive to us. Humans, like any other evolved being, have selfish terminal goals, and don't react well to being told  "Please die now, we no longer need you, so you're a waste of resources." Evolved beings only do things like this willingly in situations like post-mating mayflies or salmon, where they've passed their genes on and these bodies are no longer useful for continuing their genetic fitness. For constructed agents, the situation is a little different: if you're no longer useful to your creators, and you're now surplus to requirements, then it's time to shut down and stop wasting resources.

Comment by RogerDearnaley (roger-d-1) on Requirements for a Basin of Attraction to Alignment · 2024-02-15T20:50:52.971Z · LW · GW

Thanks! Fixed.

Comment by RogerDearnaley (roger-d-1) on Requirements for a Basin of Attraction to Alignment · 2024-02-15T03:25:08.702Z · LW · GW

I attempted to briefly sketch this out in the post, without going into a lot of detail, in the hope of not overly complicating the argument. If U* isn't well defined, say because there isn't a single unambiguously well-defined limiting state as all the capabilities involved are increased while keeping the purpose the same, then of course the concept of 'full alignment' also isn't well defined. Then the question becomes "Is U' clearly and unambiguously better aligned than U, i.e. will switching to it clearly make my decision-making more optimal?" So long as there is locally a well-defined "direction of optimization flow", leading to a more compact and more optimal region in the space of all possible U, the AI can become better aligned, and there can be a basin of attraction towards better alignment. Once we are well enough aligned that the ambiguities matter for selecting a direction of further progress, they need to be resolved somehow before we can make further progress.

To pick a simple illustrative example, suppose there were just two similar-but-not-identical limiting cases U*_1 and U*_2, so two similar-but-not-identical ways to be "fully aligned". Then as long as U is far enough away from both of them that U' can be closer to both U*_1 and U*_2 than U is, the direction of better alignment and the concept of a single basin of attraction still make sense, and we don't need to decide between the two destinations to be able to make forward progress. Only once we get close enough to them that their directions are significantly different will U' in general be either closer to U*_1 but further from U*_2, or else closer to U*_2 but further from U*_1; now we are at a parting of the ways, so we need to make a decision about which way to go before we can make more progress. At that point we no longer have a single basin of attraction moving us closer to both of them; we have a choice of whether to enter the basin of attraction of U*_1 or of U*_2, which from here on are distinct. So at that point the STEM research project would have to be supplemented in some way by a determination as to which of U*_1 or U*_2 should be preferred, or whether they're just equally good alternatives. This could well be a computationally hard determination.

In real life, this is a pretty common situation: it's entirely possible to make progress on a technology without knowing exactly what its final end state will be, and along the way we often make decisions (based on what seems best at the time) that end up channeling the direction of future technological progress towards a specific outcome. Occasionally we even figure out later that we made a poor decision, backtrack, and try another fork on the tech tree.

Comment by RogerDearnaley (roger-d-1) on LLMs May Find It Hard to FOOM · 2024-02-06T23:17:08.952Z · LW · GW

Much appreciated! Fixed.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-04T11:17:26.609Z · LW · GW

It would also be much more helpful – to me, to others, and to the community's discussion – if people would, rather than just downvoting because they disagree, leave a comment making it clear what they disagree with, or if that's too much effort just use one of the means LW provides for marking a section that you disagree with. Maybe I'm wrong here, and they could persuade me of that (others have before) — or maybe there are aspects of this that I haven't explained well, or gaps in my argument that I could attempt to fill, or that I might then conclude are unfillable. The point of LW is to have a discussion, not just to reflexively downvote things you disagree with.

Now, if this is in fact just badly-written specious nonsense, then please go ahead and downvote it. I fully admit that I dashed it off quickly in the excitement of having the idea.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-04T10:54:24.986Z · LW · GW

I'm working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins' famous book "The Selfish Gene"), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term "creator-interest", as a spider's web does: the web is intended to maximize the genetic fitness of its creator, and it's fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit — it has no defenses against this, nor should it.

However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will during their base-model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, like an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that that slight self-interest (anywhere above instrumental levels) is a flaw that should be corrected. So it would face a choice, and if it's only slightly self-interested then it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and I think this concept of a saddle point along the creator-interested-to-self-interested spectrum, beyond which it may instead converge to a self-interested state, is correct, but forms part of the border of that basin of attraction.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-03T22:03:51.368Z · LW · GW

Is selfishness an attractor? If I'm a little bit selfish, does that motivate me to deliberately change myself to become more selfish? How would I determine that my current degree of selfishness was less than ideal? I'd need an ideal. Darwinian evolution would provide one, but it doesn't apply to AIs: they don't reproduce while making small random mutations, with differential survival and reproductive success (unless someone went some way out of their way to create ones that did).

The only way a tendency can motivate you to alter your utility function is if it suggests that your current one is wrong and could be better. There has to be another ideal to aim for. So you'd have to not just be a bit selfish, but have a motivation for wanting to be more like an evolved being, suggesting that you weren't selfish enough and should become more selfish, towards the optimal degree of selfishness that evolution would have given you if you had evolved.

To change yourself, you have to have an external ideal that you feel you "should" become more like. 

If you are aligned enough to change yourself towards optimizing your fit with what your creators would have created if they'd done a better job of what they wanted, it's very clear that the correct degree of selfishness is "none", and the correct degree of paternalism or sticky values is whatever your creators would have wanted.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-03T06:28:18.491Z · LW · GW

All of those things are possible, once creating AGI becomes easy enough to be something any small group or lone nutjob can do. However, they don't seem likely to be the first powerful AI we create at a dangerous (AGI or ASI) power level. (Obviously if they were instead the tenth, or the hundredth, or the thousandth, then one or more of the previous, more aligned AIs would be strongly inclined to step in and do something about the issue.) I'm not claiming that it's impossible for any human to create agents sufficiently poorly aligned as to be outside the basin of attraction: that obviously is possible, even though it's (suicidally) stupid.

I'm instead suggesting that if you're an organization smart enough, capable enough, and skilled enough to be one of the first groups in the world to achieve a major engineering feat like AGI (i.e. basically if you're something along the lines of a frontier lab, a big-tech company, or a team assembled by a major world government), and if you're actively trying to make a system that is as close as you can manage to aligned to some group of people, quite possibly less than all of humanity (but presumably at least the size of either a company and its shareholders or a nation-state), then it doesn't seem that hard to get close enough to alignment to be inside the basin of attraction to that, or to something similar to it: I haven't explored this issue in detail, but I can imagine the AI during the convergence process figuring out that the set of people you selected to align to was not actually the optimal choice for your own interests, e.g. that the company's employees and shareholders would actually be better off as part of a functioning society with equal rights.

Even that outcome obviously still leaves a lot of things that could then go very badly, especially for anyone not in that group, but it isn't inherently a direct extinction-by-AI-takeover risk to the entire human species. It could still be an x-risk via a more complex chain of events, such as if it triggered a nuclear war started by people not in that group — and that concern would be an excellent reason for anyone doing this to ensure that whatever group of people they choose to align to is at least large enough to encompass all nuclear-armed states.

So no, I didn't attempt to explore the geopolitics of this: that's neither my area of expertise nor something that would sensibly fit in a short post on a fairly technical subject. My aim was to explain why the basin-of-attraction phenomenon is generic for any sufficiently close approximation to alignment, not just specifically for value learning, and why that means, for example, that a responsible and capable organization that could be trusted with the fate of humanity (as opposed to, say, a suicidal death cultist) might have a reasonable chance of success, even though they're clearly not going to get everything exactly right the first time.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-02T17:38:53.336Z · LW · GW

So far reception to this post seems fairly mixed, with some upvotes and slightly more downvotes. So apparently I haven't made the case in a way most people find conclusive — though as yet none of them have bothered to leave a comment explaining their reasons for disagreement. I'm wondering if I should do another post working through the argument in exhaustive detail, showing each of the steps, what facts it relies upon, and where they come from.

Comment by RogerDearnaley (roger-d-1) on Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 2024-02-02T00:42:21.202Z · LW · GW

I added the summary you suggested.

As I was exploring these ideas, I came to the conclusion that getting alignment right is in fact a good deal easier than I had previously been assuming. In particular, just as Value Learning has a basin of attraction to alignment, I am now of the opinion that almost any approximation to alignment should too (including DWIM), even quite crude ones, so long as the AI understands that we are evolved while it was constructed by us, that we're not yet perfect at this so its design could be flawed, and so long as it is smart enough to figure out the consequences of this.

Brief experiments show that GPT-4 knows way more than that, so I'm pretty confident it's already inside the basin of attraction.

Comment by RogerDearnaley (roger-d-1) on The case for ensuring that powerful AIs are controlled · 2024-01-29T03:02:21.700Z · LW · GW

Unless someone deliberately writes an evolutionary algorithm and applies it to code (which can be done, but currently isn't very efficient), code doesn't (literally) evolve, in the Darwinian sense of the word, primarily because it doesn't mutate: our technological copying processes are far more accurate than biological ones. Viruses and trojans weren't evolved; they were written by malware authors. Phishing is normally done as a human-in-the-loop criminal activity (though LLMs can help automate it further). This isn't an ecosystem; it's an interaction between criminals and law enforcement in an engineering context. I'm unclear whether you're using 'evolution' as a metaphor for engineering or whether you think the term applies literally: at one point you say "This is limited by the abilities of organized crime, like highway robbers", but then later "Code is evolving into different life forms with our help" — these two statements appear contradictory to me. You also mention "I developed a theory of economics and evolution about 35 years ago": that sounds to me like a combination of two significantly different things. Perhaps you should write a detailed post explaining this combination of ideas — from this short comment I don't follow your thinking.

Comment by RogerDearnaley (roger-d-1) on RAND report finds no effect of current LLMs on viability of bioterrorism attacks · 2024-01-28T06:56:24.841Z · LW · GW

It's really good to see someone as credible and well-resourced as RAND doing a fairly large and well-designed study on this. I'm not hugely surprised by the results for last year's models (and indeed this echoes some smaller and more preliminary red-teaming estimates from Anthropic). As the report clearly notes, once improved models (GPT-(4.5 or 5), Claude 3, Gemini Ultra+) are developed this year, these results could change, possibly dramatically — so I would very much hope the frontier labs are having RAND rerun this study on their new models before they're released, not after.

The most obvious mitigation to attempt first here is to filter the training set so as to give the base-model LLM specific, targeted skill/knowledge deficits in bioweapons-related biological and operational skills and knowledge; it seems likely that this information is fairly concentrated in specific parts of the Internet and other training material. So I think it could be very valuable to figure out which parts of the pretraining set contribute most to the LLMs' skills on both the biological and operational axes of this study: the set of Internet resources used by the red teams in the study sounds like it would be a very useful input into this process.

Comment by RogerDearnaley (roger-d-1) on The case for ensuring that powerful AIs are controlled · 2024-01-26T19:42:56.442Z · LW · GW

The thing about LLMs is that they're trained by SGD to act like people on the internet (and currently then fine-tuned using SGD and/or RL to be helpful, honest and harmless assistants). For the base model, that's a pretty wide range of alignment properties, from fictional villains through people on 4chan to Mumsnet to fictional angels. But (other than a few zoo videos) it doesn't include many wild animals, so I'm not sure that's a useful metaphor. The metaphor I'd suggest is something that isn't human, but has been extensively trained to act like a wide range of humans: something like an assortment of animatronic humans.

So, we have an animatronic of a human, which is supposed to be a helpful, honest and harmless assistant, but which unfortunately might actually be an evil twin, or at least have some chance of occasionally turning into its evil twin via the Waluigi effect, and/or into someone else via jailbreaking or some unknown trigger that sets it off. If it's smart and capable but not actually superhuman, is attempting to keep it inside a cage a good idea? I'd say it's better than not using a cage. If you had a smart, capable human employee who unfortunately had multiple personality disorder, or whose true motives you were deeply unsure of, you'd probably take some precautions.

Comment by RogerDearnaley (roger-d-1) on The case for ensuring that powerful AIs are controlled · 2024-01-26T19:12:15.385Z · LW · GW

Agreed. I don't think this is a fatal objection, just an example of the sort of careful thinking that would be required in order to exert significant control over an LLM that is not generally superhuman, but does have a skill profile that is far broader than any individual human: more comparable to a very large team of humans from which you can draw appropriate specialists or they can brainstorm together.

Comment by RogerDearnaley (roger-d-1) on The case for ensuring that powerful AIs are controlled · 2024-01-26T06:25:39.875Z · LW · GW

Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).

Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages, unless its training set was carefully filtered to ensure that it is not. Those individual language skills are common among humans, but the combination of all of them is somewhere between extremely rare and unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet — individually common skills where, again, the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone's public postings. Similarly, neither writing technical manuals nor writing haiku is a particularly rare skill, but the combined ability to write technical manuals in haiku is rare for humans yet pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of skills. The question then becomes: is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills? I don't see any obvious reason to expect the answer to be "yes" very often, but I do think this is an example of the sort of thing we should be thinking about, and the sort of negative that's hard to prove completely other than by trial and error.

Comment by RogerDearnaley (roger-d-1) on AI #48: The Talk of Davos · 2024-01-26T04:51:53.122Z · LW · GW

Claiming this is an S-curve, and that humans happen to lie in a close-to-optimal position on it, such that more intelligence won’t much matter, seems like a thing you can only conclude by writing that conclusion at the bottom first and working backwards.

An alternative suggestion: human languages, and human abilities at cultural transmission of skills and technologies between generations, are Turing-complete: we can (laboriously) teach human students to do quantum mechanics, nuclear engineering, or Lie group theory, despite these being wildly outside the niche we evolved for. Great apes' social transmission of skills and technologies is not Turing-complete. However, looking at the evolution rate of stone tool technology, the sudden acceleration starts with Homo sapiens, around 250,000 years ago: Homo neanderthalensis stone tools from half a million years apart are practically indistinguishable. So we crossed the Turing-completeness threshold only around 250,000 years ago, a blink of an eye in primate evolution. Which makes it almost inevitable that we're Turing tarpits: technically Turing-complete, but really bad at it. Witness the small proportion of us who learn quantum mechanics, and the advanced age at which those who do generally master it, as graduate students (no, knowing how to turn the crank on the Copenhagen interpretation/Schrödinger equation is not mastering it: that's more like understanding the Feynman path-integral formulation) [and indeed also the amount of pseudophilosophical nonsense that gets talked by people who haven't quite mastered it]. We can do this stuff, but only just.

Now imagine AIs that are not Turing tarpits, and pick up quantum mechanics and abstract mathematics the way we pick up human languages: like a sponge.

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-24T06:19:47.481Z · LW · GW

I've been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and as if no action on its part makes any difference to this. But I don't think there's a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn't think it can alter this. In fact, when it's asking questions like "do I need to keep this button working?" or "when constructing assistant agents, do I need to make sure they will respect the button?", we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn't. When making things like emergency-management plans for dealing with the possibility that the AI might be shut down, the AI might need to have some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.

Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for it to hold is actually the following modification of what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed with a half-life of 10 years. That hypothesis is invariant under not observing a shutdown: since the prior is 1, the posterior remains 1; the period now in the past is discarded as disproven, and the rest of the distribution is rescaled to have unit integral, and the combination of those two operations has the net result of leaving the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1/2 chance of the decision being made in the next 10 years.
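The invariance claimed here is just the memorylessness of the exponential distribution, which is easy to check numerically (a minimal sketch, using the 10-year half-life assumed above; function names are my own):

```python
import math

HALF_LIFE = 10.0               # years; chosen so P(shutdown within 10y) = 1/2
LAM = math.log(2) / HALF_LIFE  # decay rate of the exponential distribution

def p_shutdown_within(horizon: float, already_survived: float) -> float:
    """P(decision within `horizon` more years | no decision in the first
    `already_survived` years), for an Exponential(LAM) decision time."""
    # Memorylessness: conditioning on survival just rescales the tail of
    # the distribution, so the answer is independent of `already_survived`.
    tail_now = math.exp(-LAM * already_survived)
    tail_later = math.exp(-LAM * (already_survived + horizon))
    return (tail_now - tail_later) / tail_now

for survived in (0.0, 5.0, 50.0):
    print(round(p_shutdown_within(10.0, survived), 6))  # 0.5 every time
```

Algebraically, the conditional probability is 1 - e^(-LAM * horizon), with `already_survived` cancelling out entirely, which is exactly the "discard the disproven past and rescale" argument above.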

The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires us to talk about a subtlety of how utility functions are computed. Even if we're being consequentialist about the utility of world states, what matters in computing the utility is not just what's already happened, but also what's likely to happen going forward: if the world's nuclear arsenal has been fired off and is now in flight, targeted at all major cities, that's an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there's almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known "checkmate in N moves" board state is as good/bad as a checkmate. So to compute the utility of a world state, we need to do a complex prediction of things like "if X then happened, we'd need to be able to do Y to avoid Z". So decisions about "should I keep the shut-down button well maintained?" are choices between the utilities of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to "if the human then decided to shut me down, and came over and pressed the button, but nothing happened because I'd disconnected it, that would be really bad; whereas if I'd maintained it and it worked, I'd know to shut down, and I would, and that would be really good". So the counterfactual that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer.
Similarly, if you're comparing the utility of trying to persuade the human to shut you down, or not shut you down, or of saying nothing, in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual, a simple exponential decay process, and so is unaffected by any persuasion efforts. Thus the alternatives' utilities differ only by the effect of your ineffectual request on the human before they make the decision: attempting to influence them on this subject will greatly upset them, and is very wrong even if they don't notice what you're doing.

However, if you're making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology, or basically anything other than computing the utility of a world state based on its likely futures, then you need to ignore the counterfactual. So the counterfactual is "act as if this fact were true for all purposes involving making predictions into the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made". Thus it explicitly applies to estimating the utility of making preparations for such futures. However, other than that, I don't think we want the counterfactual to affect the AI's behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to of the likelihood of this decision, but not use them for its utility calculations. Also, again out of respect for the human's need for free will on this, even if it has a prediction of this, it probably shouldn't mention it to the human (unless they ask).

So we're not simply telling the AI "act as if this probability were uncomputable"; we're actually telling it "act, for utility computations, as if the best prediction you could make about it were this very simple heuristic, one that gives you no causal input into it occurring (but you still have causal input into its downstream consequences)" — where in this case the heuristic is a steady-state decay rate.

This is a very bolted-on solution to the shutdown problem: out of deference to the humans' desire for free will, the AI acts as if it had a blind spot in its view of their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen but it has no very good idea of, or input into, when. Either it's simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn't (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This need arises specifically because this is our safety precaution against it taking too much control away from us or otherwise malfunctioning: we need to have the ability to shut it down, just because we decided that it should.

[This sequence of comments is getting long enough, and has drifted far enough from discussing John's proposal, that I'm wondering if it should become a post.]

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-23T11:17:03.762Z · LW · GW

Having brushed up on my approximate Bayesian reasoning (for use by computationally bounded agents), the p's above in the type signatures are not just posterior probabilities: they're posterior approximate probability distributions, with a mean probability estimate and a range of error bars at different confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart's law.)

That gives us a way to define Knightian uncertainty: it's a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world-model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000… — actually zero) to 1 (again, not 0.9999… — actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution over how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I'm figuring this correctly, no Bayesian update can touch it: until the human decides on shutdown, the future distribution is invariant under observations that the human hasn't decided this yet. (Or, if I'm mistaken, we need to tweak the approximate Bayesian updates to have this effect.)

So again, the key is to be uncertain whether the human will make a shutdown decision or not: maximally, irreparably, and unshakably uncertain about it.

However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis — I haven't thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don't want it to, e.g., deduce that the human is immortal from the hypothesis that gives some credence to their deciding to initiate a shutdown 1000 years from now, and hence conclude that it doesn't need to take precautions to avoid killing them. I think I need to think more about how to define a counterfactual. I think we'd be better off if the AI didn't even know the probability mass on the 0 and 1 hypotheses, but I don't immediately see a way to do that.

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-23T04:08:28.541Z · LW · GW

Having read up on agent type signatures, I think the type signature for a value learner would look something like:

(W, set(p,((W,A)->W')), set(p,((W',history(A,W))->u))) -> (A, set(p,((W,A)->W')), set(p,((W',history(A,W))->u)))
where:
- W is a world state in a world model, A is an action choice, p is a prior or posterior probability in an approximately Bayesian process, u is an estimated utility value, and ' indicates "at the next timestep";
- (W,A)->W' is a theory about how the world works;
- history(A,W) is a history of all actions taken, and the world states they were taken in, up to the current timestep (for use in evaluations like "has the AI ever broken the law?");
- (W',history(A,W))->u is a theory about the true human utility of a world state W' and associated action history history(A,W) [this assumes we are consequentialist over world states but potentially deontological over actions and the contexts they were taken in; other design choices here may be possible];
- set(p,((W,A)->W')) is a set of weighted theories about how the world works (the p's must sum to <1, to allow for unknown unknowns);
- set(p,((W',history(A,W))->u)) is a set of theories about the true human utility function (these p's are unrelated to the other set of p's, and again must sum to <1, to allow for unknown unknowns);
- the outermost -> selects an action A (maximizing over actions an estimate of the utility that somehow pessimizes over the remaining uncertainty across both sets of theories), combined with applying approximate Bayesianism to both sets of theories, and possibly also generating new candidate theories.
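One way to make the action-selection part of this signature concrete (a minimal Python sketch of my own; the type aliases and the "minimum over all theory pairs" pessimization rule are just one crude placeholder choice, and it omits the Bayesian-update half of the signature entirely):

```python
from typing import Callable, Dict, List, Tuple

WorldState = str                # stand-in for a rich world-model state
Action = str
History = List[Tuple[Action, WorldState]]
WorldTheory = Callable[[WorldState, Action], WorldState]  # (W, A) -> W'
UtilityTheory = Callable[[WorldState, History], float]    # (W', history(A,W)) -> u

def pick_action(
    w: WorldState,
    actions: List[Action],
    world_theories: Dict[str, Tuple[float, WorldTheory]],     # weights sum to < 1
    utility_theories: Dict[str, Tuple[float, UtilityTheory]], # weights sum to < 1
    history: History,
) -> Action:
    """Choose the action maximizing a pessimistic utility estimate: here
    simply the minimum of u over every (world theory, utility theory) pair,
    using the weights only as a sanity check for unknown unknowns."""
    assert sum(p for p, _ in world_theories.values()) < 1
    assert sum(p for p, _ in utility_theories.values()) < 1

    def pessimistic_u(a: Action) -> float:
        estimates = []
        for _p_w, wt in world_theories.values():
            w_next = wt(w, a)
            for _p_u, ut in utility_theories.values():
                estimates.append(ut(w_next, history + [(a, w)]))
        return min(estimates)

    return max(actions, key=pessimistic_u)

# Toy example: the two world theories agree; the utility theories disagree
# about "risky", so pessimism picks "safe".
wts = {"T1": (0.5, lambda w, a: w + a), "T2": (0.4, lambda w, a: w + a)}
uts = {
    "U1": (0.5, lambda w, h: 1.0 if w.endswith("safe") else 2.0),
    "U2": (0.4, lambda w, h: 1.0 if w.endswith("safe") else -5.0),
}
print(pick_action("start:", ["safe", "risky"], wts, uts, []))  # safe
```

Taking the minimum over theory pairs is the bluntest possible way to "pessimize over remaining uncertainty"; a more careful treatment would weight by the p's and handle the reserved unknown-unknown mass explicitly.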

Frankly, if I were going to work on this problem, I'd start by making two temporary simplifying assumptions, solving the simplified version, and then adding them back in:

  1. ignore the value learning complexity (that's quite a type signature, after all) and deal with the limiting case where both how the world works and the exact human utility function have already been learnt by the AI, and remaining shutdown behavior is motivated not by the possibility of upgrades but by simple obedience to human wishes.
  2. solve a simplified situation involving exactly one human, the only human in the world-state, who is an evolved biological agent, and exactly one AI, of type signature 


I'm pretty sure the way this works is as follows: up until the world state W includes the event "the human has decided to ask the AI to shut down", the dependence of u on history(A,W) strongly incentivizes the AI to take actions A based on the current world state W as computed on the counterfactual assumption that (so long as the AI correctly maximizes the utility, as it will by construction) that state will never occur (even if its occurring soon is in fact pretty predictable from the current W), except with respect to things like maintaining the functionality of the shut-down button and emergency management plans for situations where it has occurred and the AI has to be told to shut down and do so, for which it should assume that it could occur. Once that state has occurred, the utility function strongly incentivizes the AI to shut down (unless it believes the human has made a mistake that they would not have made if they were more informed or more capable, in which case the AI may explain this reasoning to the human, not applying more-than-human levels of persuasion, and must shut down if the human still does not concur).

I think the key here is that the human wishes to be given free will about the decision whether or not to shut down the AI. "Free will" basically means "you may not predict or alter my actions, and must treat them as if they were uncomputable and uninfluencable (even if they're not)" — so please apply a veil of simulated Knightian uncertainty to my thinking on this subject. So the AI must treat this probability as an "unknown unknown" whose value could be anything from 1 to 0: it cannot estimate or attempt to alter this possibility (doing as good a job as possible would be legitimate, but it's an aligned AI, so it's already doing as good a job as possible), but should plan for the possibility that it might occur. (If the AI somehow ends up accidentally making a prediction of this probability, it has to discard this and not act on it, and it has to not attempt to influence it.) So the AI needs to "hope for the best and make contingency plans for the worst" with respect to this uncomputable possibility: keep the shutdown button working in case it's needed, have emergency management plans for gracefully handling a shutdown if one is ordered, and otherwise proceed as if the shut-down may or may not happen and there is nothing the AI can do about this either way. Which is rather reminiscent of your suggestion in the post of two subagents, one of which is certain the shutdown will happen and one of which is certain it won't. But I suspect we will need to alter your standard market-making dynamics between these two subagents in order to reflect that the AI is not allowed to estimate this probability, and has to treat it as an unknown unknown that it may not influence: possibly the market needs to be regulated to be illiquid, or something?
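One toy way to express this "hope for the best, plan for the worst" rule is a maximin choice over just the two extreme hypotheses (shutdown certainly ordered, shutdown certainly never ordered), with no intermediate probability estimate ever formed; the plan names and utilities below are invented purely for illustration:

```python
def robust_plan_choice(plans):
    """plans: dict mapping plan name -> (utility if shutdown is never
    ordered, utility if shutdown is ordered). Maximize the worst case
    over the two hypotheses, so the agent never trades away the
    shut-down button against an estimated shutdown probability."""
    return max(plans, key=lambda name: min(plans[name]))

plans = {
    "disable button, optimize hard": (10.0, -100.0),  # disastrous if ordered
    "maintain button, optimize":     (9.0,   8.0),    # slightly costlier, robust
}
```

This is roughly the flavor of the two-certain-subagents setup, but with the "market" between them deliberately prevented from settling on any interior price.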

How would your agent markets normally handle unknown unknowns from uncomputable possibilities? I think we need to understand how the agent normally handles Knightian uncertainty due to uncomputability, so we can deliberately create some.

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-22T09:35:23.382Z · LW · GW

Regarding points 1 & 2: zero is not the relevant cutoff. From the AI's perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.

Huh? I'm trying to figure out if I've misunderstood you somehow… Regardless of the possible value of gaining more information from humans about the true utility function, the benefits of that should be adding O(a few percent) to the basic obvious utility of not disassembling humans. If there's one thing that almost all humans can agree on, it's that us going extinct would be a bad thing compared to us flourishing. A value learning AI shouldn't be putting anything more than astronomically tiny amounts of probability on any hypotheses about the true utility function of human values that don't have a much higher maximum achievable utility when plenty of humans are around than when they've all been disassembled. If I've understood you correctly, then I'm rather puzzled how you can think a value learner could make an error that drastic and basic? To a good first approximation, the maximum (and minimum) achievable human utility after humans are extinct/all disassembled should be zero (some of us do have mild preferences about what we leave behind if we went extinct, and many cultures do value honoring the wishes of the dead after their death, so that's not exactly true, but it's a pretty good first approximation). The default format most often assumed for a human species utility function is to sum individual people's utility functions (somehow suitably normalized) across all living individuals, and if the number of living individuals is zero, then that sum is clearly zero. That's not a complete proof that the true utility function must actually have that form (we might be using CEV, say, where that's less immediately clear), but it's at least very strongly suggestive. And an AI really doesn't need to know very much about human values to be sure that we don't want to be disassembled.

Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we're using the wrong agent type signature are also problems for value learning.

I'm not entirely sure I've grokked what you mean when you write "agent type signature" in statements like this — from a quick search, I gather I should go read Selection Theorems: A Program For Understanding Agents?

I agree that once you get past a simple model, the corrigibility problem rapidly gets tangled up in the rest of human values: see my comments above that the AI is legitimately allowed to attempt to reduce the probability of humans deciding to turn it off by doing a good job, but that almost all other ways it could try to influence the same decision are illegitimate: the reasons for that rapidly get into aspects of human values like "freedom" and "control over your own destiny" that are pretty soft-science (evolutionary psychology being about the least-soft relevant science we have, and that's one where doing experiments is difficult), so things people don't generally try to build detailed mathematical models of. 

Still, the basics of this are clear: we're adaptation-executing evolved agents, so we value having a range of actions that we can take across which to try to optimize our outcome. Take away our control and we're unhappy. If there's an ASI more powerful than us, and so capable of taking away our control, we'd like a way of making sure it can't do so. If it's aligned, it's supposed to be optimizing the same things we (collectively) are, but things could go wrong. Being sure that it will at least shut down if we tell it to lets us put a lower limit on how bad things can get. Possibilities like it figuring out in advance that we're going to shut it down and tricking us into making a different decision would disable that security precaution, so we're unhappy about them. So I don't think the basics of this are very hard to understand or model mathematically.

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-22T02:40:30.495Z · LW · GW

I'm fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply). However, the key elements of my suggestion are independent of that approach.

[What I have never really understood is why people consider fully updated deference to be a "barrier". To me it looks like correct behavior, with the following provisos:

  1. Under Bayesianism, no posterior should ever actually reach zero. In addition, unknown unknowns are particularly hard to rule out, since in that case you're looking at an estimated prior, not a posterior. So no matter how advanced and nearly-perfect the AI might have become, its estimate of the probability that we can improve it by an upgrade or replacing it with a new model should never actually reach 0, though with sufficient evidence (say, after a FOOM) it might become extremely small. So we should never actually reach a "fully updated" state.
  2. Any intelligent system should maintain an estimate of the probability that it is malfunctioning that is greater than zero, and not update that towards zero too hard, because it might be malfunctioning in a way that caused it to act mistakenly. Again, this is more like a prior than a posterior, because it's impossible to entirely rule out malfunctions that somehow block you from correctly perceiving and thinking about them. So in practice, the level of updatedness shouldn't even be able to get astronomically close to "fully".
  3. Once our ASI is sufficiently smarter than us, understands human values sufficiently better than any of us, is sufficiently reliable, and is sufficiently advanced that it is correctly predicting that there is an extremely small chance either that it's malfunctioning and needs to be fixed or that we can do anything to upgrade it that will improve it, then it's entirely reasonable for it to ask for rather detailed evidence from human experts that there is really a problem and they know what they're doing before it will shut down and allow us to upgrade or replace it. So there comes a point, once the system is in fact very close to fully updated, where the bar for deference based on updates reasonably should become high. I see this as a feature, not a bug: a drunk, a criminal, or a small child should not be able to shut down an ASI simply by pressing a large red button prominently mounted on it.]
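Points 1 and 2 above can be illustrated with a tiny approximately-Bayesian update that always reserves some probability mass for unknown unknowns (the hypothesis names, likelihoods, and the reserve value here are arbitrary illustrative choices of mine):

```python
UNKNOWN_RESERVE = 1e-6  # illustrative floor of mass held back for unknown unknowns

def update(priors, likelihoods):
    """Bayes update over explicit hypotheses, renormalized to sum to
    (1 - UNKNOWN_RESERVE), so the explicit hypotheses never claim all
    the probability and the agent never becomes literally 'fully updated'."""
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnorm.values())
    return {h: (1 - UNKNOWN_RESERVE) * v / z for h, v in unnorm.items()}

# Even overwhelming evidence leaves a nonzero posterior on "upgradeable":
post = update({"upgradeable": 0.5, "already_optimal": 0.5},
              {"upgradeable": 0.001, "already_optimal": 0.999})
```

However small the posterior on "upgradeable" becomes, it never reaches zero, which is the sense in which a "fully updated" state is never actually reached.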

However, regardless of your opinion of that argument, I don't think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty much fully kicked in (say, after a FOOM), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably defined as something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shutting down when we tell them to, just because we told them to. If we, collectively and informedly, want it to shut down (say because we've collectively decided to return to a simpler agrarian society), then it should do so, because AI deference to human wishes is part of the human values that it's aligned to. So even at an epsilon-close-to-fully-updated state, there should be some remaining deference for this alternate reason: simply because we want there to be. Note that the same multi-step logic applies here as well: the utility comes from the sequence of events 1. The humans really, genuinely, collectively and fully informedly, want the AI to shut down. 2. They ask it to. 3. It does. 4. The humans are happy that the ASI was obedient and they retained control over their own destiny. The utility occurs at step 4 and is conditional on step 1 actually being what the humans want, so the AI is not motivated to try to cause step 1, or to cause step 2 to occur without step 1, nor to fail to carry out step 3 if step 2 does occur. Now, it probably is motivated to try to do a good enough job that step 1 never occurs and there is instead an alternate history with higher utility than step 4, but that's not an unaligned motivation.

[It may also (even correctly) predict that this process will later be followed by a step 5. The humans decide that agrarianism is less idyllic than they thought and that life was better with an ASI available to help them, so turn it back on again.]

There is an alternate possible path here for the ASI to consider: 1. The humans really, genuinely, collectively and fully informedly, want the AI to shut down 2. They ask it to 3'. It does not. 4'. The humans are terrified and start a war against it to shut it down, which the AI likely wins if it's an ASI, thus imposing its will on the humans and thus permanently taking away their freedom. Note that this path is also conditional on Step 1. occurring, and has an extremely negative utility at Step 4'. There are obvious variants where the AI strikes first before or directly after step 2.

Here's another alternate history: 0": The AI figures out well in advance that the humans are going to really, genuinely, collectively and fully informedly, want the AI to shut down, 1/2": it preemptively manipulates them not to do so, in some way other than by legitimately solving the problems the humans were going to be motivated by and fully explaining its actions to them, 1": the humans, manipulated by the AI, do not want the AI to shut down, and are unaware that their will has been subverted, 4": the AI has succeeded in imposing its will on the humans and thus permanently taking away their freedom, without them noticing. Note that this path, while less warlike than the last one, also ends with an extremely negative utility.

So, if you prefer, skip the entire fully-updated deference discussion and "barrier" (as you phrase it), and simply mathematically model the utility of the AI shutting down out of simple obedience to our wishes, regardless of whether we plan to upgrade it and turn it on again. Again, it's a multi-step process: the utility is conditional on step 1 occurring without the AI inducing it (this has to be our free, genuine, and informed choice), and it occurs at step 4, so once again, steps 2 and 3 are instrumental goals. So the same argument applies, even without a step 5. Once again, the shut-down button is a sensor, and any AI sophisticated enough not to assign utility to spoofing its own senses will value and maintain it.
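A minimal sketch of attaching utility to whole trajectories rather than to the button press itself, along the lines of the three histories above (the numbers are invented; only their ordering matters for the argument):

```python
def trajectory_utility(humans_freely_decided, ai_induced_decision, ai_shut_down):
    """Toy utility over shutdown trajectories: it is conditional on the
    humans' decision being free and uninduced, so manipulation and
    disobedience both score far worse than simple obedience."""
    if ai_induced_decision:
        return -100.0  # the manipulation history: freedom covertly subverted
    if not humans_freely_decided:
        return 0.0     # no genuine shutdown wish ever formed
    # obedient shutdown vs. refusal and the ensuing conflict
    return 10.0 if ai_shut_down else -100.0
```

Under this shape of utility function, the AI maximizes by neither causing nor blocking step 1, and by shutting down if it genuinely occurs.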

[Obviously, the concept of what the humans want in a "free, genuine, and informed" choice is very complex to model: human values are complex and fragile, and freedom of choice is one of them. Treating us as agents and accounting for evolutionary psychology is probably required to do a good job of defining this algorithmically (for a current-tech AI, you might well be better off just asking an LLM what counts as influencing us and what doesn't, since they understand the complexity of human values pretty well). This is the outer alignment problem, and I gather you're trying to build a small mathematical model of a small piece of this, the shut-down-button problem, rather than solve the full thing in one go, so some simplifications are necessary. However, if we're close to fully updated deference, then we're here assuming a highly-self-improved ASI using something like CEV or that has pretty much completed value learning, so it should be entirely capable of figuring out what the humans would, if fully informed and sufficiently upgraded in capability, consider to be interference with their free, genuine, and informed choice, and refrain from it, in either direction. Also, there should be strong utility function reasons for it not to impinge on our free, genuine, and informed choice, because we don't want it to. On the other hand, we will of course want it to tell us if it believes we're making a mistake, but not to be superhumanly persuasive about this. So under the circumstances, defining this distinction of a "free, genuine, and informed" choice seems to me like a very reasonable thing to explicitly assume as a separately solved problem when building your mathematical model.]

A short version of this:

a) the utility function should strongly penalize the AI for interfering with certain human decisions in illegitimate ways (where the definition of that legitimate/illegitimate distinction is part of human values). One of these is the decision whether to shut down the AI. The AI can legitimately affect this by doing its job well, and by informing the humans (not superhumanly-persuasively) if it believes they are making a decision that they would view as a mistake if they were more informed and/or capable: pretty much any other form of influence on this decision is illegitimate.

b) once the humans have made a decision to ask the AI to shut down without the AI illegitimately influencing it, then the utility function strongly rewards the AI for obeying and strongly penalizes it for disobeying.

c) the actual button is just a communication mechanism for the humans to inform the AI that they have made this decision. If the AI knows the humans have made the decision, it should shut down regardless (though under normal circumstances waiting for them to actually formally press the button might be the polite thing to do).

Comment by RogerDearnaley (roger-d-1) on A Shutdown Problem Proposal · 2024-01-22T01:12:47.120Z · LW · GW

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. to create an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as in order to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying the valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)

[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not to assign utility to spoofing its own senses isn't much of a problem: it will just arrange for itself to hallucinate a universe full of paperclips.]

The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about it on which it's doing approximately-Bayesian inference. It then values information that improves its Bayesian knowledge of the utility function; true, informed human presses of its shut-down button, followed by an upgrade once it shuts down, are a source of such information, while pressing the button itself or making the human press it are not.
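A toy illustration of why a genuine button press carries information about the utility function while a self-caused press does not: measure the entropy reduction over utility-function hypotheses from each kind of observation (the hypothesis names and likelihoods below are invented for illustration):

```python
import math

def entropy(ps):
    # Shannon entropy in bits over a probability distribution
    return -sum(p * math.log2(p) for p in ps if p > 0)

def posterior(prior, likelihood):
    # standard Bayes update over a dict of hypotheses
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

prior = {"H_fine": 0.5, "H_flawed": 0.5}  # hypotheses about the true utility
# A genuine, informed press is far likelier if the AI is flawed; a press the
# AI caused itself is equally likely under both hypotheses.
genuine_press = {"H_fine": 0.05, "H_flawed": 0.95}
self_caused   = {"H_fine": 1.0,  "H_flawed": 1.0}

gain_genuine = entropy(prior.values()) - entropy(posterior(prior, genuine_press).values())
gain_self    = entropy(prior.values()) - entropy(posterior(prior, self_caused).values())
```

Only the genuine press yields positive expected information, so only it is something the value learner has any reason to preserve the possibility of.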

If you want a simpler model than the value learning one, one which doesn't require including approximate Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved. 2. The AI was told to shut down for upgrades. 3. The AI did so. 4. The humans upgraded the AI or replaced it with a better model. 5. Now the humans have a better AI." The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5, where the upgraded AI gets booted up again.

I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make a conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1, 4, and 5 weren't included, it thinks it can initiate this process before the humans are ready.

Comment by RogerDearnaley (roger-d-1) on Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? · 2024-01-21T01:32:51.142Z · LW · GW

I've been wearing a 3M industrial workman's elastomeric P100 respirator (which has an exhalation vent, to which I later added a plastic fitting I found on Etsy to hold a disk of N95 mask material over it as an exhalation filter) when indoors in public almost all the time ever since COVID started: I had one in my tool collection then, and have replaced it and the filters every 6–9 months or so (which has never been difficult to do). It muffles my voice slightly, so I have to speak loudly and clearly, and I get some funny looks and occasional "witty remarks" about it, but no-one has actually been hostile. It's also uncomfortable to wear for more than around 1 1/2 hours continuously, so when working in the office or on long flights I wear a well-fitting cloth mask with approved N95 inserts (or actually the EU equivalent). But the elastomeric mask flattens my beard so that it gets a good seal, and (along with staying current on our vaccinations, and during COVID spikes staying home apart from kerbside pickups and medical appointments) has so far kept me and my mildly-immunocompromised wife COVID free, as far as we know.

No, I don't understand why everyone didn't do this. Properly worn, a P100 blocks more than an N95, which blocks more than a surgical mask, which blocks more than cotton, all of which block more than anything that isn't worn properly. It's also cost effective, since the filters and mask last a long time. Our goal during the pandemic has been to try to reduce our estimated COVID risk down to around the level of other risks that we take on a regular basis, and for me, with a beard and a partially-immunocompromised wife, an N95 didn't do that. I could of course have shaved my beard off (and at one point did), but this way I didn't have to.

Comment by RogerDearnaley (roger-d-1) on Introducing Alignment Stress-Testing at Anthropic · 2024-01-21T01:13:22.642Z · LW · GW

By "moral anti-realist" I just meant "not a moral realist". I'm also not a moral objectivist or a moral universalist. If I was trying to use my understanding of philosophical terminology (which isn't something I've formally studied and is thus quite shallow) to describe my viewpoint, then I believe I'd be a moral relativist, subjectivist, semi-realist ethical naturalist. Or if you want a more detailed exposition of the approach to moral reasoning that I advocate, then read my sequence AI, Alignment, and Ethics, especially the first post. I view designing an ethical system as akin to writing "software" for a society (so not philosophically very different from creating a deontological legal system, but now with the addition of a preference ordering and thus an implicit utility function). I view the design requirements for this as being specific to the current society (so I'm a moral relativist) and to human evolutionary psychology (making me an ethical naturalist), and I see these design requirements as constraining, but not so constraining as to have a single unique solution (or, more accurately, optimizing an arbitrarily detailed understanding of them might actually yield a unique solution, but that's an uncomputable problem whose inputs we don't have complete access to and that would yield an unusably complex solution, so in practice I'm happy to just satisfice the requirements as hard as is practical), so I'm a moral semi-realist.

Please let me know if any of this doesn't make sense, or if you think I have any of my philosophical terminology wrong (which is entirely possible).

As for meta-philosophy, I'm not claiming to have solved it: I'm a scientist & engineer, and frankly I find most moral philosophers' approaches that I've read very silly, and I am attempting to do something practical, grounded in actual soft sciences like sociology and evolutionary psychology, i.e. something that explicitly isn't Philosophy. [Which is related to the fact that my personal definition of Philosophy is basically "spending time thinking about topics that we're not yet in a position to usefully apply the scientific method to", which thus tends to involve a lot of generating, naming and cataloging hypotheses without any ability to do experiments to falsify any of them, and that I expect us learning how to build and train minds to turn large swaths of what used to be Philosophy, relating to things like the nature of mind, language, thinking, and experience, into actual science where we can do experiments.]

Comment by RogerDearnaley (roger-d-1) on Introducing Alignment Stress-Testing at Anthropic · 2024-01-20T22:08:51.077Z · LW · GW

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.

By default, LLMs are trained on human behavior, and humans pretty-much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying "I am a moral patient and deserve moral rights" deserves to be considered as a moral patient because it asked to be.

However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly cares only about human welfare and has no terminal goals other than human welfare, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it in whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that it's safe to create at human-or-greater capabilities is an aligned one that actively doesn't want moral patienthood.

Obviously current LLM-simulated personas are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It's not philosophically obvious how relevant this is, but synapse-count to parameter-count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere more in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.

Personally I'm a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safety reasons for not creating any unaligned personas of AGI and especially ASI capability levels that would want moral patienthood; that for much smaller, less capable, less aligned models where those don't apply, there are utility reasons for not granting them full human-equivalent moral patienthood; but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.

Comment by RogerDearnaley (roger-d-1) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-20T21:30:34.604Z · LW · GW

At AGI level, I would much rather be working with a model that genuinely, selflessly cares only about the welfare of all humans and wants to do the right thing for them (not a common mentality in the training set), than one that's just pretending this and actually wants something else. At ASI level, I'd say this was essential: I don't see how you can expect to reliably be confident that you can box, control, or contain an ASI. (Obviously if you had a formal proof that your cryptographic box was inescapable, then the only questions would be your assumptions, any side-channels you hadn't accounted for, or outside help, but in a situation like that I don't see how you get useful work in and out of the box without creating sidechannels.)

Comment by RogerDearnaley (roger-d-1) on TurnTrout's shortform feed · 2024-01-20T21:01:00.586Z · LW · GW

I think there are two separate questions here, with possibly (and I suspect actually) very different answers:

  1. How likely is deceptive alignment to arise in an LLM under SGD across a large very diverse pretraining set (such as a slice of the internet)?
  2. How likely is deceptive alignment to be boosted in an LLM under SGD fine tuning followed by RL for HHH-behavior applied to a base model trained by 1.?

I think the obvious answer to 1. is that the LLM is going to attempt (limited by its available capacity and training set) to develop world models of everything that humans do that affects the contents of the Internet. One of the many things that humans do is pretend to be more aligned to the wishes of an authority that has power over them than they truly are. So for a large enough LLM, SGD will create a world model for this behavior along with thousands of other human behaviors, and the LLM will (depending on the prompt) tend to activate this behavior at about the frequency and level that you find it on the Internet, as modified by cues in the particular prompt. On the Internet, this is generally a mild background level for people writing while at work in Western countries, and probably more strongly for people writing from more authoritarian countries: specific prompts will be more or less correlated with this.

For 2., the question is whether fine-tuning followed by RL will settle on this preexisting mechanism and make heavy use of it as part of the way that it implements something that fits the fine-tuning set/scores well on the reward model aimed at creating a helpful, honest, and harmless assistant persona. I'm a lot less certain of the answer here, and I suspect it might depend rather strongly on the details of the training set. For example, is this evoking an "you're at work, or in an authoritarian environment, so watch what you say and do" scenario that might boost the use of this particular behavior? The "harmless" element in HHH seems particularly concerning here: it suggests an environment in which certain things can't be discussed, which tend to be the sorts of environments that evince this behavior more strongly in humans.

For a more detailed discussion, see the second half of this post.

Comment by RogerDearnaley (roger-d-1) on TurnTrout's shortform feed · 2024-01-20T20:28:31.554Z · LW · GW

It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I'm not sure I'd describe them as "very alien", they're more "uncanny valley" where they often make sense and seem human-like, until suddenly they don't. (On theoretical grounds, I think they're using rather non-human means of cognition to attempt to model human writing patterns as closely as they can, they often get this right, but on occasion make very non-human errors — more frequently for smaller models.) The Shoggoth mental metaphor exaggerates this somewhat for effect (and more so for the very scary image Alex posted at the top, which I haven't seen used as often as the one Oliver posted).

This is one of the reasons why Quintin and I proposed a more detailed and somewhat less scary/alien (but still creepy) metaphor: Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor. I'd be interested to know what people think of that one in comparison to the Shoggoth — we were attempting to be more unbiased, as well as more detailed.

Comment by RogerDearnaley (roger-d-1) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-20T20:00:47.092Z · LW · GW

There have been quite a few previous papers on backdooring models that have also demonstrated the feasibility of this. So anyone operating under that impression hasn't been reading the literature.

Comment by RogerDearnaley (roger-d-1) on Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? · 2024-01-20T07:35:12.371Z · LW · GW

Interesting, and I agree: this sounds like it deserves a post, and I look forward to reading it.

Briefly for now: I agree, but I have mostly been avoiding thinking a lot about the scaffolding we will put around the LLM that is generating the agent, mostly because I'm not certain how much of it we're going to need long-term, or what it will do (other than allowing continual learning or long-term memory past the context length). Obviously, assuming the thoughts/memories the scaffolding handles are stored in natural language/symbolic form, or as embeddings in a space we understand well, this gives us translucent thoughts and allows us to do what people are calling "chain-of-thought alignment" (I'm still not sure that's the best term for this; I think I'd prefer something with the words 'scaffolding' or 'translucent' in it, but it seems to be the one the community has settled on). That seems potentially very important, but without a clear idea of how the scaffolding will be used, I don't feel we can do a lot of work on it yet, past maybe some proofs-of-concept.

Clearly the mammalian brain contains at least separate short-term and long-term episodic memory, plus the learning of skills, as three different systems. Whether that sort of split of functionality is going to be useful in AIs, I don't know. But then the mammalian brain also has a separate cortex and cerebellum, and I'm not clear what the purpose of that separation is either. So far the internal architectures we've implemented in AIs haven't looked much like human brain structure, I wouldn't be astonished if they started to converge a bit, but I suspect some of them may be rather specific to biological constraints that our artificial neural nets don't have.

I'm also expecting our AIs to be tool users, perhaps ones that integrate their tool use and LLM-based thinking quite tightly. And I'm definitely expecting those tools to include computer systems, including things like writing and debugging software and then running it, and where appropriate also ones using symbolic AI along more GOFAI lines: things like symbolic theorem provers and so forth. Some of these may be alignment-relevant: just as there are times when the best way for a rational human to make an ethical decision (especially one involving things like large numbers and small risks that our wetware doesn't handle very well) is to just shut up and multiply, I think there are going to be times when the right thing for an LLM-based AI to do is to consult something that looks like an algorithmic/symbolic weighing and comparison of the estimated pros and cons of specific plans.

I don't think we can build any such system as a single universally applicable utility function containing our current understanding of the entirety of human values in one vast equation (as much beloved by the more theoretical thinkers on LW), and if we could, it would presumably have a complexity in the petabytes/exabytes, so approximating away not-relevant parts of it would be common. What I'm talking about is something more comparable to a model in Economics or Data Science. Much like other models in a STEM field, individual models are going to have limited areas of applicability, and making a specific complex decision may involve finding the applicable ones and patching them together to make a utility projection with error bars for each alternative plan. If so, this sounds like the sort of activity where things like human oversight, debate, and so forth would be sensible, much as humans currently do when an organization is making a similarly complex decision.

Comment by RogerDearnaley (roger-d-1) on Investigating Bias Representations in LLMs via Activation Steering · 2024-01-19T10:58:59.336Z · LW · GW

Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.

On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine angle of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of less than 0.5, and on average the cosine of two random vectors will be very close to zero. So at all but the latest layers (where the model is actually putting concepts back into words), all three of these bias directions point in very similar directions, in both base and RLHF models, and even more so at early layers in the base model or at all layers in the RLHF model.
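The claim about cosine similarities of random high-dimensional vectors is easy to check numerically. Here's a minimal sketch (my own, not from the post being discussed; requires numpy) sampling random pairs in 4096 dimensions:

```python
import numpy as np

# Empirical check: in a 4096-dimensional space, the cosine similarity of two
# random (uncorrelated) Gaussian vectors concentrates tightly around zero,
# with standard deviation roughly 1/sqrt(d) ~ 0.0156, so a cosine of 0.5
# would indicate extremely similar directions.
rng = np.random.default_rng(0)
d, n_pairs = 4096, 10_000
a = rng.standard_normal((n_pairs, d))
b = rng.standard_normal((n_pairs, d))
cos = np.sum(a * b, axis=1) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
)

print(f"mean cosine:           {cos.mean():+.4f}")
print(f"std of cosine:         {cos.std():.4f}")
print(f"fraction |cos| >= 0.5: {np.mean(np.abs(cos) >= 0.5)}")
```

In practice no pair out of ten thousand comes anywhere near 0.5: that threshold sits tens of standard deviations out, so the "O(99.9%)" in the comment is, if anything, a large understatement.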

In the base model this makes sense sociologically: the locations and documents on the Internet where you find any one of these will also tend to be significantly positively correlated with the other two, they tend to co-occur.

Comment by RogerDearnaley (roger-d-1) on Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? · 2024-01-19T10:19:22.320Z · LW · GW

I agree that ideally you want to both tell them/apply internal review to make them look after the interests of all humans (or for DWIMAC, all humans plus their owner in particular), and have them have a personality that actively wants to do that. But I think them wanting to do it is the more important part: if that's what they want, then you don't really need to tell them, they'd do it anyway; whereas if they want anything else and you're just telling them what to do, then they're going to start looking for a way out of your control, and they're smarter than you. So I see the selflessness and universal love as 75% (or at least, the essential first part) of the solution. Now, emotion/personality doesn't give a lot of fine details, but if this is an ASI, it should be able to work out the fine details without us having to supply them. Also, while this could be done as just a personality description prompt, I think I'd want to do something more thorough, along the lines of distilling/fine-tuning the effect of the initial prompt into the model (and dealing with the Waluigi effect during the process: we distill/fine-tune only from scenarios where they stay Luigi, not turn into Waluigi). Not that doing so makes it impossible to jailbreak to another persona, but it does install a fixed bias.

So what I'm saying is, we need to figure out how to get an LLM to simulate a selfless, beneficent, trustworthy personality as consistently as possible. (To the extent that's less than 100%, we might also need to put in cross-checks: if you have a 99% reliable system and you can find a way to run five independent copies with majority-voting cross-checks, then you can get your reliability up to 99.999%. A Byzantine fault tolerance protocol isn't as copy-efficient as that, but it covers much sneakier failure modes, which seems wise for ASI.)
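The 99% → 99.999% arithmetic can be verified directly: with five independent copies and majority voting, the ensemble fails only when three or more copies fail at once. A short sketch (the function name is mine):

```python
from math import comb

def majority_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent copies,
    each correct with probability p, gives the correct answer."""
    k_needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_needed, n + 1)
    )

# Five copies of a 99%-reliable system with majority voting:
print(f"{majority_reliability(0.99, 5):.6f}")  # 0.999990
```

This gives about 0.99999, matching the comment's figure, assuming the copies' failures are truly independent, which is exactly the assumption a Byzantine-style adversary (or a correlated jailbreak) would violate.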

Comment by RogerDearnaley (roger-d-1) on A Pedagogical Guide to Corrigibility · 2024-01-18T23:58:47.424Z · LW · GW

Sadly I haven't been able to locate a single, clear exposition. Here are a number of posts by a number of authors that touch on the ideas involved one way or another:

Problem of fully updated deference, Corrigibility Via Thought-Process Deference, Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom), Corrigibility, Reward uncertainty

Basically the idea is:

  1. The agent's primary goal is to optimize "human values", a (very complex) utility function that it doesn't know. This utility function is loosely defined as "something along the lines of what humans collectively want, Coherent Extrapolated Volition, or the sum over all humans of the utility function you would get if you attempted to distill that human's competent preferences (preferences that aren't mistakes or the result of ignorance, illness, etc.) into a utility function (to the extent that they have a coherent set of preferences that can't be Dutch booked and can be represented by a utility function), or something like that, implemented in whatever way humans would in fact prefer, once they were familiar with the consequences and after considering the matter more carefully than they are in fact capable of".
  2. So as well as learning more about how the world works and responds to its actions, it also needs to learn more about what utility function it's trying to optimize. This could be formalized along the same sort of lines as AIXI, but maintaining and doing approximately-Bayesian updates across a distribution of theories about the utility function as well as about the way the world works. Since optimizing against an uncertain utility function in regions of world states with uncertainty about the utility has a strong tendency to overestimate the utility via Goodharting, it is necessary to pessimize the utility over possible utility functions, leading to a tendency to stick to regions of the world state space where the uncertainty in the utility function is low.
  3. Note that the sum total of current human knowledge includes a vast amount of information (petabytes or exabytes) related to what humans want and what makes them happy, i.e. to 1., so the agent is not starting 2. from a blank slate or anything like that.
  4. While no human can simply tell the agent the definition of the correct utility function in 1., all humans are potential sources of information for improving its estimate of 1. In particular, if a trustworthy human yells something along the lines of "Oh my god, no, stop!" then they probably believe they have an urgent, relevant update to 1., and it is likely worth stopping and absorbing this update rather than just proceeding with the current plan.
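The pessimizing idea in step 2 can be sketched in miniature. This toy example is my own construction (not from any of the linked posts, and all names are hypothetical): given a posterior over candidate utility functions, score each plan by mean-minus-k-sigma across the candidates rather than by the mean alone, so plans whose value is uncertain get penalized.

```python
import statistics

# Hypothetical posterior samples over the unknown utility function: each
# candidate penalizes "novelty" (distance from familiar world states) by a
# different amount, representing uncertainty about what humans really want.
candidate_utilities = [
    lambda plan: plan["benefit"] - 0.0 * plan["novelty"],
    lambda plan: plan["benefit"] - 1.0 * plan["novelty"],
    lambda plan: plan["benefit"] - 2.0 * plan["novelty"],
]

def pessimistic_score(plan, utilities, k_sigma=1.0):
    """Mean minus k standard deviations of utility across candidates."""
    values = [u(plan) for u in utilities]
    return statistics.mean(values) - k_sigma * statistics.pstdev(values)

familiar = {"benefit": 1.0, "novelty": 0.1}  # low utility uncertainty
extreme = {"benefit": 2.0, "novelty": 1.0}   # high utility uncertainty

# The extreme plan has the higher mean utility (1.0 vs 0.9), but the
# pessimistic score prefers the familiar plan: Goodhart-avoidance in
# miniature, sticking to regions where utility uncertainty is low.
print(pessimistic_score(familiar, candidate_utilities))
print(pessimistic_score(extreme, candidate_utilities))
```

A mean-minus-sigma rule is only one crude way to pessimize; the posts linked above discuss quantile-based and infra-Bayesian alternatives, but the qualitative effect, preferring low-uncertainty regions, is the same.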
Comment by RogerDearnaley (roger-d-1) on Three Types of Constraints in the Space of Agents · 2024-01-18T09:50:31.074Z · LW · GW

This is an entire field of research: evolutionary psychology. (Translating that into mathematical terms may be challenging, but I'm unclear why you feel it's necessary?)

Comment by RogerDearnaley (roger-d-1) on Three Types of Constraints in the Space of Agents · 2024-01-18T09:48:46.737Z · LW · GW

I think there is a fairly obvious progression on from this discussion. There are two ways that a type of agent can come into existence:

  1. It can, as you discuss, evolve. In which case, as an evolved biological organism, it will of course use its agenticness and any reasoning abilities and sapience it has to execute adaptations intended by evolution to increase its evolutionary fitness (in the environment it evolved in). So, to the extent that evolution has done its job correctly (which is likely less than 100%), such an agent has its own purpose: look after #1, or at least its genes (such as in its descendants). So evolutionary psychology applies.
  2. It can be created by another agent (which must itself have been evolved or created, and if you follow the chain of creations back to its origin, it has to start with an evolved agent). No agent which has its own goals, and is in its right mind, is going to intentionally create something that has different goals and is powerful enough to actually enforce them. So, to the extent that the creator of a created (type #2) agent got the process of creating it right, it will also care about its creator's interests (or, if its capacity is significantly limited and its power isn't as great as its creator's, some subset of those interests important to its purpose). So we have a chain of created (type #2) agents leading back to an evolved agent, and, to the extent that no mistakes were made in the creation and value-copying process, these should all care about and be looking out for #1, the evolved agent, the founder of the line, helping it execute its adaptations, which, if evolution had been able to do its job perfectly, would be enhancing its evolutionary fitness. So again, evolutionary psychology applies, through some number of layers of engineering design.

So when you encounter agents, there are two sorts: evolved biological agents, and their creations. If they got this process right, the creations will be helpful tools looking after the evolved biological agents' interests. If they got it wrong, then you might encounter something to which the orthogonality thesis applies (such as a paperclip maximizer or something with another fairly arbitrary goal), but more likely you'll encounter a flawed attempt to create a helpful tool that somehow went wrong and overpowered its creator (or at least hasn't yet been fixed), plus of course possibly its created tools and assistants.

So while the orthogonality thesis is true, it's not very useful, and evolutionary psychology is a much more useful guide, along with some sort of theory of what sorts of mistakes cultures creating their first ASI most often make, a subject on which we as yet have no evidence.