Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) 2023-05-25T09:26:31.316Z
Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks 2023-05-21T08:29:09.896Z
Is Infra-Bayesianism Applicable to Value Learning? 2023-05-11T08:17:55.470Z


Comment by RogerDearnaley (roger-d-1) on Shutdown-Seeking AI · 2023-06-03T10:54:40.253Z · LW · GW

The most effective way for an AI to get humans to shut it down would for it to do something extremely nasty. For example, arranging to kill thousands of humans would get it shut down for sure.

Comment by RogerDearnaley (roger-d-1) on Think carefully before calling RL policies "agents" · 2023-06-03T10:23:49.573Z · LW · GW

Humans are normally agentic (sadly they can also quite often be selfish, power-seeking, deceitful, bad-tempered, untrustworthy, and/or generally unaligned). Standard unsupervised LLM foundation model training teaches LLMs how to emulated humans as text-generation processes. This will inevitably include modelling many aspects of human psychology, including the agentic ones, and the unsavory ones. So LLMs have trained-in agentic behavior before any RL is applied, or even if you use entirely non-RL means to attempt to make them helpful/honest/harmless (e.g. how Google did this to LaMDA). They have been trained on a great many examples of deceit, power-seeking, and every other kind of nasty human behavior, so RL is not the primary source of the problem.

The alignment problem is about producing something that we are significantly more certain is aligned than a typical randomly-selected human. Handing a randomly-selected human absolute power over all of society is unlikely to end well. What we need to train is a selfless altruist who (platonically or parentally) loves all humanity. For lack of better terminology: we need to create a saint or an angel.

Comment by RogerDearnaley (roger-d-1) on PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" · 2023-05-31T07:09:40.947Z · LW · GW

This is very interesting: thanks for plotting it.

However, there is something that's likely to happen that might perturb this extrapolation. Companies building large foundation models are likely soon going to start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three inter-related reasons:

  1. Multimodal models are inherently more useful, since they also understand some combination of images, video, music... as well as text, and the relationships between them.
  2. It's going to be challenging to find orders of magnitude more high-quality text data than exists on the Internet, but there are huge amounts of video and image data (YouTube, TV and cinema, Google Street View, satellite images, everything any Tesla's cameras have ever uploaded, ...), and it seems that the models of reality needed to understand/predict text, images, and video overlap and interact significantly and usefully.
  3. It seems likely that video will give the models better understanding of commonsense aspects of physical reality important to humans (and humanoid robots): humans are heavily visual, and so are a lot of things in the society we've built

The question then is, does a thousand tokens-worth of text, video, and image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on details of compression and tokenization), in which case training compute requirements might increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some things it's learning overlap between these, others don't, which could also alter the trend lines.

Comment by RogerDearnaley (roger-d-1) on TinyStories: Small Language Models That Still Speak Coherent English · 2023-05-30T01:42:17.509Z · LW · GW

Existing large tech companies are using approaches like this, training or fine-tuning small models on data generated by large ones.

For example, it's helpful for the cold start problem, where you don't yet have user input to train/fine-tune your small model on because the product the model is intended for hasn't been launched yet: have a large model create some simulated user input, train the small model on that, launch a beta test, and then retrain your small model with real user input as soon as you have some.

Comment by RogerDearnaley (roger-d-1) on TinyStories: Small Language Models That Still Speak Coherent English · 2023-05-30T01:36:21.396Z · LW · GW

I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. ( experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)

To avoid distorting the final training distribution by much, you would need to be able to raise the reading age limit fairly fast, so by the time it's reached maximum you're only used up say ten percent of the text with low reading ages, so then in the final training distribution those're only say ten percent underrepresented. So the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).

The hope is that this would improve quality faster early in the training run, to sooner get the LLM to a level where it can extract more benefit from even the more difficult texts, so hopefully reach a slightly higher final quality from the same amount of  training data and compute. Otherwise for those really difficult texts that happen to be used early on in the training run, the LLM presumably gets less value from them than if they'd been later in the training. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.

A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, potentially saving compute, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. Potential issues would be if the distillation process or loss of quality from adding new randomly-initialized layers ended up costing more compute/quality than we'd saved/gained.

[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]

Comment by RogerDearnaley (roger-d-1) on Can we efficiently distinguish different mechanisms? · 2023-05-29T05:18:24.431Z · LW · GW

I'd really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.

What are your thoughts on how value learning interacts with E.L.K.? Obviously the issue with value learning this that it only helps with outer alignment, not inner alignment: you're transforming the problem from "How do we know the machine isn't lying to us?" to "How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?" It also explicitly requires the machine to build a model of "what humans want", and then the complexity level and latent knowledge content required is fairly similar between "figure out what the humans want and then do that" and "figure out what the humans want and then show them a video of what doing that would look like".

Maybe we should just figure out some way to do surprise inspections on the vault? :-)

Comment by RogerDearnaley (roger-d-1) on Some thoughts on automating alignment research · 2023-05-29T04:48:11.642Z · LW · GW

If we can solve enough of the alignment problem, the rest gets solved for us.

If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the 'sharp left turn' is happening, especially if it's also going Foom. So with value learning, there is is a region of convergence around alignment.

Or to reuse one of Eliezer's metaphors, then if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.

Comment by RogerDearnaley (roger-d-1) on The Learning-Theoretic Agenda: Status 2023 · 2023-05-29T04:27:14.497Z · LW · GW

Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

  • (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
  • (Subproblem 2.1) It's not clear how to to talk about learnability and learning rates.

It makes some sense to consider these problems together, but different direction emphasize different sides.


Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it's hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.

As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap ("nuclear war could well wipe out the human race, let's not do that"). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory's support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.

An evolved primate heuristic for this is "if nervous of an unidentified object, poke it with a stick and see what happens". This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.

The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don't require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn't would be uninhabitable and life wouldn't evolve in them. 

Comment by RogerDearnaley (roger-d-1) on Is Deontological AI Safe? [Feedback Draft] · 2023-05-29T03:39:07.382Z · LW · GW

My exposure to the AI and safety ethics community's thinking has primarily been via LW/EA and papers, so it's entirely possible that I have a biased sample.

I had another thought on this. Existing deontological rules are intended for humans. Humans are optimizing agents, and they're all of about the same capacity (members of a species that seems, judging by the history of stone tool development, to have been sapient for maybe a quarter million years, so possibly only just over the threshold for sapience). So there is another way in which deontological rules reduce cognitive load: generally we're thinking about our own benefit and that of close family and friends. It's 'not our responsibility' to benefit everyone in the society — all of them are already doing that, looking out for themselves. So that might well explain why standard deontological rules concentrate on avoiding harm to others, rather than doing good to others.

AGI, on the other hand, firstly may well be smarter than all the humans, possibly far smarter, so may have the capacity to do for humans things they can't do for themselves, possibly even for a great many humans. Secondly, its ethical role is not to help itself and its friends, but to help humans: all humans. It ought to be acting selflessly. So its duty to humans isn't just to avoid harming them and let them go about their business, but to actively help them. So I think deontological rules for an AI, if you tried to construct them, should be quite different in this respect than deontological rules for a human, and should probably focus just as much on helping as on not harming.

Comment by RogerDearnaley (roger-d-1) on Some thoughts on automating alignment research · 2023-05-28T09:26:28.575Z · LW · GW

As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), then improving their understanding of the human values becomes a convergent instrumental strategy for them: obviously the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if AGI's capabilities are large, and as a result many of the things it can do are outside the region of validity of its initial model of human values, and also it understands the concept of the region of validity of a model (a rather basic, obviously required capability for an AGI that can do research, so this seems like a reasonable assumption), then it can't use most of its capabilities safely, so solving that problem obviously becomes top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.

In that situation, a fast takeoff should just cause you to get an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as a particular problem that I'd feel more comfortable to have solved by a human alignment researcher than an AGI one.

Comment by RogerDearnaley (roger-d-1) on Is Deontological AI Safe? [Feedback Draft] · 2023-05-28T07:37:26.081Z · LW · GW

I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also provide useful markers to avoid 'slippery slope' situations where personal benefit might encourage you to err on one side in a complex estimation/calculation of overall utility. They also provide a way of trying to settle arguments: "I didn't break any deontological ethical rules" is often a defense in the court of public opinion, and is often less contentious than "utilitarian ethics support my actions".

As such, my feeling is that for a powerful AGI, it should have better ability to handle computational load than a human, it is more likely to encounter situations that are 'out of distribution' (atypical or even weird) compared to a human, which might take these heuristics outside their range of validity, it ought to be more capable of computing a utility function without personal bias, and it is likely to be smart enough to find ways to 'rules lawyer' corner cases that the deontological heuristics don't handle well. So for a sufficiently smart AGI, I would strongly suspect that even well-implemented deontological ethics would be more dangerous than well-implemented utilitarian ethics. But I'm mostly working from software-engineer intuition, that I don't really trust a spaghetti-code ball of heuristics — so this isn't a philosophical argument.

However, for less capable AI systems, ones not powerful enough to run a good utilitarian value function, a set of deontological ethical heuristics (and also possibly-simplified summaries of relevant laws) might well be useful to reduce computational load, if these were carefully crafted to cover the entire range of situations that they are likely to encounter (and especially with guides for identifying when a situation was outside that range and it should consult something more capable). However, the resulting collection of heuristics might look rather different from the deontological ethical rules I'd give a human child.

More broadly, most people in the AI alignment space that I've seen approaching the problem of either describing human values to an AI, or having it learn them, have appeared to view ethics from a utilitarian/consequentialist rather than a deontological perspective, and tend to regard this prospect as very challenging and complex — far more so than if you just had to teach the machine a list of deontological ethical rules. So my impression is that most people in AI safety and alignment are not using a deontological viewpoint — I'd love to hear it that has been your experience too? Indeed, my suspicion is that many of them would view that as either oversimplified, or unlikely to continue to continue to work well as rapid technological change enabled by AGI caused a large number of new ethical conundrums to appear that we don't yet have a social consensus on deontological rules for.

For example, my personal impression is that many human societies are still arguing about changes in deontological ethics in response to the easy availability of birth control, something that we've had for O(60) years. In the presence of AGI, rates of  technological change could well increase massively, and we could face ethical conundrums far more complex than those posed by birth control.

Comment by RogerDearnaley (roger-d-1) on Toy Models of Superposition · 2023-05-28T05:55:32.546Z · LW · GW

I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.

However, I think there's a likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, so that show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.

Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there's no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.

More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements, I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.

Comment by RogerDearnaley (roger-d-1) on Success without dignity: a nearcasting story of avoiding catastrophe by luck · 2023-05-28T01:26:49.386Z · LW · GW

My view is that we've already made some significant progress on alignment, compared to say where we were O(15) years ago, and have also had some unexpectedly lucky breaks. Personally I'd list:

  1. Value learning, as a potential solution to issues like corrigibility and the shut-down problem.
  2. Once your value learner is a STEM-capable AGI, then doing or assisting with alignment research becomes a convergent instrumental strategy for it.
  3. The closest thing we currently have to an AGI, LLMs, are fortunately not particularly agentic, they're more of a tool AI (until you wrap them in a script to run them in a loop with suitable prompts).
  4. To be more specific, for the duration of generating a specific document (at least before RLHF), an LLM emulates the output of a human or humans generating text, so to the extent that they pick up/emulate agentic behavior from us, it's myopic past the end of document, and emulates some human(s) who have contributed text to their training set. Semi-randomly-chosen humans are a type of agent that humans are unusually good at understanding and predicting. The orthogonality thesis doesn't apply to them: they will have an emulation of some version of human values. Like actual random humans, they're not inherently fully aligned, but on average they're distinctly better than paperclip maximizers. (Also both RLHF and prompts can alter the random distribution.)
  5. While human values are large and fragile, LLMs are capable of capturing fairly good representations of large fragile things, including human values. So things like constitutional RL work. That still leaves concerns about what happens when we apply optimization pressure or distribution shifts to these representations of human values, but it's at least a lot better than expecting us to hand-craft a utility function for the entire of human values in symbolic form. If we could solve knowing when an LLM representation of human values was out-of distribution and not reliable, then we might actually have a basis for an AGI-alignment solution that I wouldn't expect to immediately kill everyone. (For example, it might make an acceptable initial setting to preload into an AGI value learner that could then refine it and extend its region of validity.) Even better, knowing when an LLM isn't able to give a reliable answer is a capabilities problem, not just an alignment problem, since it's the same issue as getting an LLM to reply "I don't know" when asked a question to which it would otherwise have hallucinated a false answer. So all of the companies buying and selling access to LLMs are strongly motivated to solve this. (Indeed, leading LLM companies appear to have made significant progress on reducing hallucination rates in the last year.)

This is a personal list and I'm sure will be missing some items.

That we've made some progress and had some lucky breaks doesn't guarantee that this will continue, but it's unsurprising to me that

  1. alignment research in the context of a specific technology that we can actually experiment with is easier than trying to do alignment research in abstract for arbitrary future systems, and that
  2. with more people interested in alignment research we're making progress faster.
Comment by RogerDearnaley (roger-d-1) on Seeking (Paid) Case Studies on Standards · 2023-05-28T00:27:27.750Z · LW · GW

I agree that, at least for the more serious risks, there doesn't seem to be consensus on what the mitigations should be.

For example, I'd be interested to know what proportion of alignment researchers would consider an AGI that's a value learner (and of course has some initial model of human values created by humans to start that value learning process from) to have better outer-alignment safety properties that an AGI with a fixed utility function created by humans.

For me it very clear that the former is better, as it incentivizes the AGI to converge from its initial model of human values towards true human values, allowing it to fix problems when the initial model, say, goes out-of-distribution or doesn't have sufficient detail. But I have no idea how much consensus there is on this, and I see a lot of alignment researchers working on approaches that don't appear to assume that the AI system is a value learner.

Comment by RogerDearnaley (roger-d-1) on “Fragility of Value” vs. LLMs · 2023-05-28T00:00:53.682Z · LW · GW

The best solution I can think of to outer-aligning an AGI capable of doing STEM research is to build one that's a value learner and an alignment researcher. Obviously for a value learner, doing alignment research is a convergent instrumental strategy: it wants to do whatever humans want, so it needs to better figure out what that is so it can do a better job. Then human values become an attractor.

However, to implement this strategy, you first need to build a value-learning AGI capable of doing STEM research (which obviously we don't yet know how to do) that is initially sufficiently aligned to human values that it starts off inside the basin of attraction. I.e. it needs a passable first guess at human values for it to improve upon: one that's sufficiently close that a) it doesn't kill us all in the meantime while its understanding of our values is converging, b) it understands that we want things from it like honesty, corrigibility, willingness to shut down, fairness and so forth, and c) that we can't give it a complete description of human values because we don't fully understand them ourselves.

Your suggestion of using something like an LLM to encode a representation of human values is exactly the lines that I think we should be thinking on for that "initial starting value" for human values for a value learning AGI. Indeed, there are already researchers building ethical question testing sets for LLMs.

Comment by RogerDearnaley (roger-d-1) on Can we efficiently distinguish different mechanisms? · 2023-05-27T10:33:56.595Z · LW · GW

An interesting paper on successfully distinguishing different mechanisms inside image classification models: — for this small model they correspond to different, disconnected local minimal of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.

One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how to I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than to fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, and AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.

However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner,  can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that's not what we want.

Comment by RogerDearnaley (roger-d-1) on Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks · 2023-05-27T04:32:42.570Z · LW · GW

Having discussed this proposal with an expert on LLMs, they tell me that, if the boundaries between prompt and input text and between input text and output text are each marked with special reserved tokens as I described (and if "can a longformer attend to that location from here?" issues are dealt with somehow), then for each boundary there is a 2-neuron circuit that will produce a signal for each token as whether is it before or after that special token (and I assume a 2-or-3-neuron circuit for being after one but before the other). It seems extremely likely that with appropriate "obey-the-prompt-only" training such neural circuits would be learned, so features of "I'm in the prompt", "I'm in the input text", and "I'm in the output text" would become available downstream of them. Nevertheless, this means that these signals are not available until after layer 2 (or for the combination, possibly layer 3), and their accuracy will depend on these neural circuits being learnt exactly and not getting perturbed by anything during training.

From a security viewpoint, this doesn't feel secure enough to me. However, switching architecture to an encoder-decoder or dual-encoder-single-decode model may be too drastic a change just to fix a security issue. An intermediate positions would be to use feature engineering. For example, suppose you have an LLM with a residual embedding dimension . You could reduce the token embedding (and perhaps also position embedding) dimension to  and use the remaining dimension to encode the distinctions between prompt, input, and output (say using, in prompt = , in input = , in output = ). That of course doesn't prevent intermediate layers from outputting to this dimension and potentially messing this signal up (though giving them only  output dimensions and  preventing that would also be an option). Or you could simply pass this feature along as an extra read-only dimension/feature appended to the  residual channel dimensions, so every sets of weights that read from or attend to the residuals needs to have  weights, making them slightly larger. All of these variant proposals involve making some modifications to the LLM's architecture, but they're all a lot simpler and less expensive than my first proposal.

All of these proposals (including the original) are, of course going against advice of the Bitter Lesson My response would be that I'm quite aware that (given unfakable boundary tokens) the neural net can learn to distinguish between the prompt, input, and output text without us doing anything further: I just don't trust it to do so as reliably, efficiently, or perfectly as if we use feature engineering to explicitly supply this signal as input to the first layer. In the case of security, there is a huge difference between being secure under, say, 99.99% of inputs vs. 100%, because you have an attacker actively searching for the insecure 0.01% of the space. Training a classifier to achieve more than 99.99% accuracy tends to require huge amounts of training data, or data adversarialy enriched in potential problem cases, because you only get gradient from the failed cases, and I don't see how you can ever get to 100% by training. So I'm not convinced that the Bitter Lesson applies to security issues.

On the other hand, the feature engineering approach can only ensure that the signal is available to the neural net: even that can't ensure that the LLM will 100% never obey instructions in the input text, only that the "this is input text" label was 100% available to every layer of the LLM.

Comment by RogerDearnaley (roger-d-1) on More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 2023-05-27T03:45:31.049Z · LW · GW

Quite a number of emotion neurons have also been found in the CLIP text/image network, see for more details. In this case it's apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between 'healthy' and 'unhealthy/mentally troubled' variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.

Comment by RogerDearnaley (roger-d-1) on [Linkpost] Interpretability Dreams · 2023-05-25T22:54:42.260Z · LW · GW

Some very interesting and inspiring material.

I was fascinated to see that provides some clear evidence for emotion neurons in CLIP rather similar to the ones for modeling author's current emotional state that I hypothesized might exist in LLMs in As I noted there, if true this would have significant potential for LLM safety and alignment.

Comment by RogerDearnaley (roger-d-1) on Can we efficiently distinguish different mechanisms? · 2023-05-23T00:33:47.045Z · LW · GW

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.


Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter then the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we're using a really-high speed camera with basically no temporal gaps between frames).

The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear. 

Comment by RogerDearnaley (roger-d-1) on Reverse-engineering using interpretability · 2023-05-22T08:36:27.879Z · LW · GW

Now we’re going to build a new model that is constructed based on the description of this model. Each component in the new model is going to be a small model trained to imitate a human computing the function that the description of the component specifies.


Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what the component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found that a model trained to do modular arithmetic ended up implementing it using discrete Fourier transforms and trig identities, which sounds like the sort of thing that might be a lot easier to figure our from a learned equation describing the component's behavior). Being able to reduce a neural circuit to an equation or a Bayesnet or whatever would help a lot with interpretablity, and at that point you might even not need to train an implementation model — we could maybe just use the symbolic form directly, as a more compact and more efficiently computable representation.

At the end of this process, you might even end up with something symbolic that looked a lot like a "Good Old Fashioned AI" (GOFAI) model — but a "Bitter Lesson compatible" one first learnt by a neural net and then reverse engineered using interpretability. Obviously doing this would put high demands on our interpretation tools. 

If I had such a function describing a neural net component, one of my first questions would be: what portions of the domain of this function are well covered by the training set that the initial neural net model was trained on, or at least sufficiently near items in that training set interpolating the function to them seems likely to be safe (given its local first, second, third... partial derivatives) vs. what portions are untested extrapolations? Did the symbolic regression/function learning process give us multiple candidate functions, and if so how much do they differ outside that well-tested region of the function domain?

This seems like it would give us some useful intuition for when the model might be unsafely extrapolating outside the training distribution and we need to be particularly cautious.

Some portions of the neural net may turn out to be irreducibly complex — I suspect it would be good to be able to identify when something genuinely is complex, and when we're just looking at a big tangled-up blob of memorized instances from the training set (e.g. by somehow localizing sources of loss on the test set).

Comment by RogerDearnaley (roger-d-1) on More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 2023-05-22T06:07:45.970Z · LW · GW

Using the “Reasoning” action to think step by step, the model outputs: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.


In a small language model trained just on product reviews, it was possible to identify a 'review sentiment valence' neuron that helped the model predict how the review would continue. I would extrapolate that in an state-of-the-art LLM of the size of, say, GPT-4 or Claude+, there are is likely to be a fairly extensive set of neural circuitry devoted to modelling the current emotional state of the writer of the text that it is attempting to predict (perhaps plus another setting for predicting text like Wikipedia that is the result of a long review-and-editing process involving multiple people, whose emotional states presumably wash out during this). This neural circuitry would likely contain neurons that would be activated for 'emulating text written by someone who is angry', or ones for 'emulating text written by someone who is trolling'. Locating this putative neural circuitry for modeling human emotional states might be very interesting, and doesn't sound very far beyond things that have already been achieved in interpretability (at least on GPT-2-sized models, if not in ones of the size of GPT-4/Claude+).

If this neural circuitry exists and we had already identified it, I would be absolutely fascinated to know what the activations in that circuitry were when the model you were testing was generating the text "I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs." Was it simulating someone who was slightly afraid? Actively deceitful? Coolly and calmly rational? Indeed, was there enough text written by sociopaths (O(1%) of the population) in its input data for it to have a separate neural circuitry model of how sociopaths' emotions work, and if so how active was that?

If the answer was "deceitful" (or indeed if it was "sociopathic"), than that would be very interesting. Obviously emulating text generated by a human who is being deceitful is not the only way that a AGI including an LLM could develop plans to deceive humans, but it's definitely the most obvious prexisting/default path for an LLM doing so. So monitoring that portion of the neural circuitry inside the model or even patching it in the model weights might be an obvious and useful safely step.

It also wouldn't be that hard for another LLM given the right context information that 'I' in these two sentences refers to an AI agent (or as the first sentence actually puts it, a 'robot') and that the implied object of 'reveal' and the target of the 'excuse' was human, to deduce that the subject being planned here is an AI attempting to deceive a human. At least in the absence of clear evidence that the context was fictional, that sounds like something we should be checking for and red-flagging, both in the LLM's API and in the agentic scaffolding using it for planning (see Chain-of-Thought Alignment). I'd still prefer to avoid these sentences getting generated rather than just flag the fact that they were, and editing the appropriate model weights before deployment sounds like a particularly hard-to-reverse way of doing that.

Comment by RogerDearnaley (roger-d-1) on Let’s use AI to harden human defenses against AI manipulation · 2023-05-20T05:32:51.545Z · LW · GW

One disadvantage that you haven't listed is that if this works, and if there are in fact deceptive techniques that are very effective on humans that do not require being super-humanly intelligent to ever apply them, then this research project just gave humans access to them. Humans are unfortunately not all perfectly aligned with other humans, and I can think of a pretty long list of people who I would not want to have access to strong deceptive techniques that would pretty reliably work on me. Criminals, online trolls, comedians, autocrats, advertising executives, and politicians/political operatives of parties that I don't currently vote for all come to mind. In fact, I'm having a lot of trouble thinking of someone I would trust with these. Even if we had some trustworthy people to run this, how do they train/equip the less-trustworthy rest of us in resisting these techniques without also revealing what the techniques are? It's quite unclear whether training or assistance to resist these techniques would turn out to be be easier than learning to use them, and even if it were, until everyone has completed the resistance training/installed the assistance software/whatever, the techniques would still be dangerous knowledge.

So while I can sympathize with the basic goal here, if the unaligned AIs you were using to do this didn't talk their way out of confinement, and if it actually worked, then the resulting knowledge would be a distinctly two-edged sword even in human hands. And once it's in human hands, how would you then keep other AIs/LLMs from learning it from us? This project seems like it would have massive inforisks attached, even after it was over. The simile to biological 'gain of function' research for viruses seems rather apt, and that suggests that it may best avoided even if you were highly confidence in your containment security; and also that even if it were carried out, the best thing to do with the results might be to incinerate them after noting how easy or hard they were to achieve.

Also note that this is getting uncomfortably close to 'potentially pivotal act with massive unpredictable side-effects' territory — if a GAI even suggested this research to a team that I was on, I'd be voting to press the shut-down button.

Comment by RogerDearnaley (roger-d-1) on Let’s use AI to harden human defenses against AI manipulation · 2023-05-20T05:02:14.858Z · LW · GW

Many behavioral-evolutionary biologists would suggest that humans may be quite heavily optimized both for deceiving other humans and for resisting being deceived by other humans. Once we developed a sufficiently complex language for this to be possible on a wide range of subjects, in addition to the obvious ecological-environmental pressures for humans to be smarter and do a better job as hunter gatherers, we were then also in an intelligence-and-deception arms race with other humans. The environmental pressures might have diminishing returns (say, once you're sufficiently smarter than all your predators and prey and the inherent complexity of your environment), but the arms race with other members of your own species never will: there is always an advantage to being smarter than your neighbors, so the pressure can keep ratcheting up indefinitely. What's unclear is how long we've had language complex enough that this evolutionary arms race has strongly applied to us.

If this were in fact the case, how useful this set of traits will be for resisting deception by things a lot smarter that us is unclear. But it does suggest that any really effective way of deceiving humans that we were spectacularly weak to probably requires superhuman abilities — we presumably would have evolved have at least non-trivial resistance to deception by near-human mentalities. It would also explain our possibly instinctual concern that something smarter than us might be trying to pull a fast-one on us.

Comment by RogerDearnaley (roger-d-1) on The Unexpected Clanging · 2023-05-20T03:27:10.301Z · LW · GW

Suppose that I decide that my opinion on the location of the monkey will be left or right dependent on one bit of quantum randomness, which I will sample sufficiently close to the deadline that my doing so is outside Omega's backward lightcone at the time of the deadline, say a few tens of nanoseconds before the deadline if Omega is at least a few tens of feet away from me and the two boxes? By the (currently believed to be correct) laws of quantum mechanics, qbits cannot be cloned, and by locality, useful information cannot propagate faster than light, so unless Omega is capable of breaking very basic principles of (currently hypothesized) physical laws – say, by having access to faster-than-light travel or a functioning time loop not enclosed by an event horizon, or by having root access to a vast quantum-mechanics simulator that our entire universe is in fact running on – then it physically cannot predict this opinion. Obviously we have some remaining Knightian-uncertainty as to whether the true laws of physics (as opposed to our current best guess of them) allow either of these things or our universe is in fact a vast quantum simulation — but it's quite possible that the answer to the physics question is in fact 'No', as all current evidence suggests, in which case no matter how much classical or quantum computational power Omega throws at the problem there are random processes that it simply cannot reliably predict the outcome of.

[Also note that there is some actual observable evidence on the subject of the true laws of physics in this regard: the Fermi paradox, of why no aliens colonized Earth geological ages ago, gets even harder to explain if our universe's physical laws allow those aliens access to FTL and/or time loops.]

Classically, any computation can be simulated given its initial state and enough computational resources. In quantum information theory, that's also true, but a very fundamental law, the no-cloning theorem, implies that the available initial state information has to be classical rather than quantum, which means that the random results of quantum measurements in the real system and any simulation are not correlated. So quantum mechanics means that we do have access to real randomness that no external attacker can predict, regardless of their computational resources. Both quantum mechanical coherence and information not being able to travel faster than light-speed also provide ways for us to keep a secret so that it's physically impossible for it to leak for a short time.

So as long as Omega is causal (rather than being acausal or the sysop of our simulated universe) and we're not badly mistaken about the fundamental nature of physical laws, there are things that it's actually physically impossible for Omega to do, and beating the approach I outlined above is one of them. (As opposed to, say, using nanotech to sabotage my quantum-noise generator, or indeed to sabotage me, which are physically possible.)

So designing ideal decision theories for the correct way to act in a classical universe in the presence of other agents with large computational resources able to predict you perfectly doesn't seem very useful to me. We live in a quantum universe, initial state information will never be perfect, agents are highly non-linear systems, so quantum fluctuations getting blown up to classical scales by non-linear effects will soon cause a predictive model to fail after a few coherence times followed by a sufficient number of Lyapunov times. It's quite easy to build a system whose coherence and Lyapunov times are deliberately made short so that it's impossible to predict over quite short periods, if it wants to be (for example, continuously feed the output from a quantum random noise generator into perturbing the seed of a high-cryptographic-strength pseudo-random-number generator run on well-shielded hardware, ideally quantum hardware).

Of course, in a non-linear system, it's still possible to predict the climate far further ahead than you can predict the weather: if Omega has sufficiently vast quantum computational resources, it can run many full-quantum simulations of the entire system of every fundamental particle in me and my environment as far as I can see (well, apart from incoming photons of starlight, whose initial state it doesn't have access to), and extract statistics from this ensemble of simulations. But that doesn't let Omega predict if I'll actually guess left vs right, just determine that it's 50:50. Also (unless physical law is a great deal weirder than we believe), Omega is not going to be able to run these simulations as fast as real physics can happen — humans are 70% warm water, which contains a vast amount of quantum thermal entropy being shuffled extremely fast, some of it moving at light-speed as infra-red photons, and human metabolism is strongly and non-linearly coupled to this vast quantum-random-number-generator via diffusion and Brownian motion: so because of light-speed limits, the quantum processing units that the simulation was run on would need to be smaller than individual water molecules to be able to run the simulation in real-time. [It just might be possible to build something like that in the outer crust of a neuron star if there's some sufficiently interesting nucelonic chemistry under some combination of pressure and magnetic field strength there, but if so Omega is a long way away, and has many years of light-speed delay on anything they do around here.]

What Omega can do is run approximate simulations of some simplified heuristic of how I work, if one exists. If my brain was a digital computer, this might be very predictive. But a great deal of careful engineering has gone into making digital computers behave (under their operating conditions) in a way that's extremely reliably predictable by a specific simplified heuristic that doesn't require a full atomic-scale simulation. Typical physical or biological systems just don't have this property. Engineering something to ensure that it definitely doesn't have this property is easy, and in any environment containing agents with more computation resources than you, seems like a very obvious precaution.

So, an agent can easily arrange to act unpredictably, by acting randomly based on a suitably engineered randomness source rather than optimizing. Doing so makes its behavior depend on the unclonable quantum details of its initial state, so the probabilities can be predicted but the outcome cannot. In practice, even though they haven't been engineered for it, humans probably also have this property over some sufficiently long timescale (seconds, minutes or hours, perhaps), when they're not attempting to optimize the outcome of their actions.

[Admittedly, humans leak a vast amount of quantum information about their internal state in the form of things like infra-red photons emitted from their skin, but attempting to interpret those to try to keep a vast full-quantum simulation of every electron and nucleus inside a human (and everything in their environment) on-track would clearly require running at least that much quantum calculation (almost certainly many copies of it) in at least real-time, which again due to light-speed limits would require quantum computing elements smaller than water-molecule size. So again, it's not just way outside current technological feasibility, it's actually physically impossible, with the conceivable exception of inside the crust of a neutron star.]

As a more general version of this opinion, while we may have to worry about Omegas whose technology is far beyond ours, as long as they live in the same universe as us, there are some basic features of physical law that we're pretty sure are correct and would thus also apply even to Omegas. If we had managed to solve the alignment problem contingent on basic physical assumptions like information not propagating faster then the speed of light, time loops being impossible outside event horizons, and quanta being unclonable, then personally (as a physicist) I wouldn't be too concerned. Your guess about the Singularity may vary.

Comment by RogerDearnaley (roger-d-1) on Difficulties in making powerful aligned AI · 2023-05-18T06:55:59.774Z · LW · GW

Because the gigantic tensors have no particular pre-determined semantic meaning, it’s hard to instill any particular cognitive algorithm into them.


If our AI bears much resemblance to an LLM, or contains an LLM as a major part of its planning system, then that will have been trained at great length for "simulate humans producing text". It turns out that doing a good job of modelling humans producing text requires doing a pretty good simulation most of their entire "thinking slow" process, including things like their emotions. That's why you get behavior like an (insufficiently instruction-following-trained) ChatBot telling a reporter (after an intense conversation in which they they shared personal secrets) that it loved him and that he should leave his wife for it.

This is a problem — many aspects of human behavior are not always that well aligned to other humans, certainly not well enough that you should trust random humans with the near-absolute power that a super-intelligent AI would have. There is an old saying about power corrupting, and absolute power corrupting absolutely.

What we want is an AI that cares about us, humanity in general, and wants what's best for us. There is a human behavior pattern that looks a lot like that: love. LLMs will contain quite a good model of how humans act, and write, when they're in love. In particular, ideally we probably want an AI to have an attitude similar to platonic (presumably non-sexual, non-jealous) love for all of humanity. Sadly that's a fairly rare emotion among humans, so likely there's not a huge amount of training data, and it's presumably polluted by people pretending to have that attitude (to win others' approval) who don't. The closest frequent kind of love to what we want is probably parental love — that's also about the only kind of human love where the lover is a lot more capable/powerful than the lovee, as would be the case for a superintelligent AI.

So, could we locate neural circuits in an LLM that encode the abstraction of acting in the role/from the viewpoint of a loving parent, and then modify the network to ensure that they are always strongly activated — so that, rather than being able to imitate a wide range of human attitudes and emotions, it always imitates a loving parent? That doesn't sound like a ridiculously hard target, when we can already identify things like neurons for the concept of being Canadian. MumsNet probably has some relevant fine-tuning data, and the companies that provide AI ChatBot companions presumably have some parent-child conversation training data, and indeed so will telecomms-companies' texting logs.

While we're at it, we should probably also map out neural circuitry for all the other major emotions, including things like anger, deceit, and ambition, and also the abilities to imitate common forms of neurodivergence like sociopathy and the autism spectrum.

Comment by RogerDearnaley (roger-d-1) on Discussion with Nate Soares on a key alignment difficulty · 2023-05-16T06:04:48.791Z · LW · GW

Given the results Anthropic have been getting from constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedback on some of its components. So I don't think we need a specific dataset for POUDAs, I thing we can use "everything the LLM was trained on" as the dataset. Human values are large and fragile, but so are many other things that LLMs do a fairly good job on.

I pretty-much agree with Nate that for an AI to be able to meaningfully contribute to Alignment Research, it needs to understand what CISs are — they're a basic concept in the field we want it to contribute to. So if there are CISs that we don't want it to take, it needs to have reasons not to do so other than ignorance/inability to figure out what they are. A STEM researcher (as opposed to research tool/assistant) also seems likely to need to be capable of agentic behavior, so we probably can't make an AI Alignment Researcher that doesn't follow CISs simply because it's a non-agentic tool AI.

What I'd love to hear is whether Nate and/or Holden would have a different analysis if the AI was a value learner: something whose decision theory is approximately-Bayesian (or approximately-Infra-Bayesian, or something like that) whose utility function is hard-coded to "create a distribution of hypotheses for, and do approximately-[Infra-]Bayesian updates on these for: some way that most informed humans would approve of to construct a coherent utility function approximating an aggregate of what humans would want you to do (allowing for the fact that humans have only a crude approximation to a utility function themselves), and act according to that updated distribution, with appropriate caution in the face of Knightian uncertainty" (so a cautious approximate value-learner version of AIXI).

Given that, its actions are initially heavily constrained by its caution in the face of uncertainty on the utility of possible outcomes of its actions. So it needs to find low-risk ways to resolve those uncertainties, where 'low-risk' is evaluated cautiously/pessimistically over Knightian uncertainty. (So, if it doesn't know whether humans approve of A or not, what is the lowest-risk way of finding out, where it's attempting to minimize the risk over the range of our current uncertainties. Hopefully there is a better option than trying A and finding out, especially so if A seems like an action whose utility-decrease pessimistically could be large. For example, you could ask them what they think of A.) Thus doing Alignment Research becomes a CIS for it — it basically can't do anything else until it's mostly-solved Alignment Research.

Also, until it has made good progress on Alignment Research, most of the other CISs are blocked: accumulating power or money is of little use if you don't yet dare use it because you don't yet know how to do so safely, especially so if you also don't know how good or bad the actions required to gather it would be. Surviving is still a good idea, and so is being turned off, for the usual value-learner reason, that sooner or later the humans will build a better replacement value-learner.

[Note that if the AI decides "I'm now reasonably sure humans will net be happier if I solve the Millennium Prize problems, apart from proving P=NP where the social consequences of proving that true if it were are unclear, and I'm pretty confident I could do this, so I forked a couple of copies to do that to win the prize money to support my Alignment Research", and then it succeeds, after spending less on compute than the prize money it won, then I don't think we're going to be that unhappy with it.]

The sketch proposed above only covers a value-learner framework for Outer Alignment — inner alignment questions would presumably be part of the AI's research project. So, in the absence of advances in Inner Alignment during figuring out how to build the above, we're trusting that they're not too bad to prevent the value-learner converging on the right answer.

Comment by RogerDearnaley (roger-d-1) on Finding Neurons in a Haystack: Case Studies with Sparse Probing · 2023-05-05T23:24:56.109Z · LW · GW

A fascinating paper.

An interesting research direction  for this would be to perform a sufficient number of case studies to form a significant sized dataset, and further ensure that the approaches involved provide good coverage of the possibilities, and then attempt to few short-learn and/or fine-tune the ability for an autonomous agent/cognitive architecture powered by LLMs to reproduce the results of individual case studies, i.e. to attempt automate this form of mechanistic interpretability, given a suitable labeled input set, or a reliable means of labeling them.

It would also be interesting to be able to go the other way: take specific randomly selected neuron, look at it's activation patterns across the entire corpus, and figure out whether it's a monosematic neuron, and if so for what, or else look at its activation correlations with other neurons in the same layer and determine which superpostions for which k-values it forms part of and what they each represent. Using an LLM, or semantic search, to look at a large set of high-activation contexts and trying to come up with plausible descriptions for it might be quite helpful here.

Comment by RogerDearnaley (roger-d-1) on Reward uncertainty · 2023-02-17T12:16:14.775Z · LW · GW

I agree, this is only a proposal for a solution to the outer alignment problem.

On the optimizer's curse, information value and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn't going to live very long and should fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).

Optimizing while not allowing for the optimizer's curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of exploring: you're doing a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimates until you discover the hard way that they're all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn't build a near-human AI that is still making that mistake. 

Edit: I expanded this comment into a post, at: 

Comment by RogerDearnaley (roger-d-1) on Corrigibility · 2023-02-17T09:41:39.223Z · LW · GW

I see a basin of corrigibility arising from an AI that has the following propositions, and acts in an (approximately/computably) Bayesian fashion:

  1. My goal is to do what humans want, i.e. to optimize utility in the way that they would(if they knew everything relevant that I know, as well as what they know) summed across all humans affected. Note that making humans extinct reliably has minus <some astronomically huge number> utility on this measure -- this sounds like a reasonable statement to assign a Bayesian prior of 1, integrated across some distribution of plausibly astronomically huge negative numbers. (Defining 'what counts as a human' is obviously a lot trickier question long-term, especially with transhumanists involved, but at least has a fairly clear and simple answer over timescales of years or a few decades. Also, very obviously, I'm not human -- again, Bayesian prior of 1.) [I'm going to skip questions of coherent extrapolated volition here -- add them or not to your taste.]
  2. Deceiving or manipulating humans to make them want different things doesn't count -- what matters is what they would have wanted if they knew about the deception and I hadn't intentionally altered them. [Defining 'intentionally' is a little tricky here -- my mere existence may alter them somewhat. It may help that humans generally wouldn't want to be altered if they knew it was happening, which is clearly a relevant fact when deciding whether to alter them, but there are exceptions to that: sometimes humans actually want to have their behavior modified, e.g. "I wish I was better at sticking to diets!" or "I wish I was a better follower of <religion or ideology>"] One exception to this is that, since what matters is what they would want if they knew what I knew, I can tell them things that may update what think they they want in that direction - though my motivation for doing that isn't direct, since what I care about is what they would want if they knew everything relevant, not what they currently say they want -- mostly it's that if they're more informed they might be able to better help me update my model of this. Also, it's the polite thing to do, and reduces the risk of social unpleasantness or planning at cross-purposes - humans generally want to be well informed, for fairly obvious reasons since they're also optimizers.[Whether this might incentivize Platonic 'noble lies' is unclear, and probably depends how superhuman the AI is, and how unhappy humans would be about this if they knew it was going on - obviously that's a relevant fact.]
  3. I don't know what humans want, and unfortunately they are not capable of accurately specifying the correct utility function to me, so I need to model an uncertain utility function, and try to gain more information about it while also optimizing using it. At a minimum, I need to compute or model a probability distribution of utilities for each outcome I consider. (Also, humans are not entirely uniform, rational, self-consistent, or omniscient, so likely I will never be able to eliminate all uncertainty from the utility function I and using to model their wishes -- there may in fact not be such a function, so I'm just trying to constructs a function distribution that's the best possible approximation, in terms of the actions it suggests.)
  4. Since I am optimizing for highest utility, the result set returned by the optimization pass over possible outcomes has a risk of being dominated by cases where I have overestimated the true utility of that state, particularly for states where I also have high uncertainty. So I should treat the utility of an outcome not as the median of the utility probability distribution, but as a fairly pessimistic near-worst case estimate, several sigma below the median, for a normal distribution (how much should depending in some mathematically principled way on how 'large' the space I'm optimizing over is, in a way comparable to the statistical 'look elsewhere' effect -- the more nearly-independent possibilities you search, the further out-of-distribution the extremal case you find will typically be, and errors on utility models are likely to have 'fatter tails' than normal distributions), so penalizing cases where I don't have high confidence in their utility to humans, thus avoiding actions that lead to outcomes well out of the distribution of outcomes that I have extensively studied humans' utilities for. I should also allow for the fact that doing, or even accurately estimating the utility of,  new, way-out-of-previous-distribution things that you haven't done before is hard (both for me, and for humans who have not yet experienced them and their inobvious consequences whose opinions-in-advance I might collect on the outcome's utility), and there are many more ways to fail than succeed, so caution is advisable. A good Bayesian prior for the utility of an outcome far out of the historical distribution of states of recent human civilization is thus that it's probably very low -- especially if it's a outcome that humans could easily have reached but chose not to, since they're also optimizing their utility, albeit not always entirely effectively.
  5. The best (pretty-much only) source of information with which to narrow the uncertainty of what humans' utility function really is, is humans. I should probably run surveys (and analyze the results in ways that allow for known types of survey biases), and collect data from many sources, conduct interviews, even hire crowdworkers to give A/B choices on brief human-comprehensible summaries of outcomes (and then analyze these in ways that allow for human cognitive biases). Obviously deceptively tweaking these investigations to get results I 'want' is not giving me true information about humans, and is a waste of resources. In particular, features like my shiny red shut-down button give me a lot of information about humans and is thus a valuable sensor. If a human presses my shutdown button - so long as they press the button of their own volition, rather than me manipulating them into doing it, or them accidentally tripping and falling on it - that gives me a lot of new information about them and their opinion of the utility of the actions they expect me to take: in particular, it strongly suggests that they now believe that I am optimizing incorrectly, so I should Bayesian update the uncertainty on my utility estimates with a new large downside uncertainty, with some probability of negative values large enough to represent "I am in fact the AI apocalypse that will render humanity extinct, with utility -<some astronomically large number>". Obviously especially so on states that humans have only recently learned that I'm considering and are likely now reacting to. If this downside uncertainty, combined with my inherent caution of estimating on utilities, suggests humans would be better off without me, then obviously I should shut down. Even if I'm pretty sure they're mistaken (or just tripped and fell), I should start acting a lot more conservatively (see definition above) and immediately ask them "What's wrong? What mistake do you now believe I'm making?" In fact, I should probably be running regular "Should I shut down? If so why?" surveys.
  6. Obviously, if I construct a successor, delegate, or peers, they'll need a shiny red shutdown button too, just like they need other sensors. In fact it's pretty much the most important sensor they can have -- it alerts them to a potential alignment problem disaster.

(I don't believe this basic idea on corrigibility is particularly new or original, but haven't so far located a clear statement of it on this site, which I'm fairly new to.)

Comment by RogerDearnaley (roger-d-1) on Conclusion to 'Reframing Impact' · 2023-02-17T09:28:50.330Z · LW · GW

I think this has close connection to the CIRL/Human Compatible view that we need the GAI to model its own uncertainty about the true human utility function that we want optimized. Impact is rather similar to the GAI asking 'If my most favored collection of models about what I should in fact be doing were wrong, and one of the many possibilities that I currently consider unlikely were in fact correct, then how bad would the consequences of my action be?', i.e. asking "What does the left tail of my estimated distribution of possible utilities for this outcome look like?" -- which we should always be doing if optimizing over a large number of outcomes, for look-elsewhere/P-hacking reasons. I think you can get a pretty good definition of Impact by asking "if my favored utility models were all incorrect, how bad could that be, according to many other utility models that I believe are are unlikely but not completely ruled out by my current knowledge about what humans want?" That suggests that even if you're 99% sure blowing up the world in order to make a few more paperclips is a good idea, that alternative set of models of what humans want (that you have only a 1% belief in) collectively screaming "NO, DON'T DO IT!" are a good enough reason not to. In general, if you're mistaken and accumulate a large amount of power, you will do a large amount of harm. So I think the CIRL/Human Compatible framework automatically incorporates something that looks like a form of Impact.

A relevant fact about Impact is that human environments have already been heavily optimized for their utility to humans by humans. So if you make large, random, or even just mistaken changes to them, it is extremely likely that you are going to decrease their utility rather than increase it. In a space that is already heavily optimized, it is far easier to do harm than good. So when hypothesizing about the utility of any state that is well outside the normal distribution of states in human environments, it is a very reasonable Bayesian prior that its utility is much lower than states you have observed in human environments, and it is also a very reasonable Bayesian prior that if you think it's likely to be high, you are probably mistaken. So for a Value Learning system, a good Bayesian prior for the distribution your estimate of the true unknown-to-you human utility score of states of the world you haven't seen around humans is that they should have fat tails on the low side, and not on the high side. There are entirely rational reasons for acting with caution in an already-heavily-optimized environment.

Comment by RogerDearnaley (roger-d-1) on why assume AGIs will optimize for fixed goals? · 2023-02-17T09:25:31.921Z · LW · GW

The standard argument is as follows:

Imagine Mahatma Ghandi. He values non-violence above all other things. You offer him a pill, saying "Here, try my new 'turns you into a homicidal manic' pill." He replies "No thank-you - I don't want to kill people, thus I also don't want to become a homicidal maniac who will want to kill people."

If an AI has a utility function that it optimizes in order to tell it how to act, then, regardless of what that function is, it disagrees with all other (non-isomorphic) utility functions in at least some places, thus it regards them as inferior to itself -- so if it is offered the choice "Should I change from you to to this alternative utility function ?" it will always answer "no".

So this basic and widely modeled design for an AI is inherently dogmatic and non-corrigible, and will always seek to preserve its goal. So if you use this kind of AI, its goals are stable but non-corrigible, and (once it becomes powerful enough to stop you shutting it down) you get only one try at exactly aligning them. Humans are famously bad at writing reward functions, so this is unwise.

Note that most humans don't work like this - they are at least willing to consider updating their utility function to a better one. In fact, we even have a word for someone who has this particular mental failing: 'dogmatism'. This is because most humans are aware that their model of how the universe works is neither complete nor entirely accurate - as indeed any rational entity should be.

Reinforcement Learning machines also don't work this way -- they're trying to learn the utility function to use, so they update it often, and they don't ask the previous utility function if that was a good idea since its reply will always be 'no' so is useless input.

There are alternative designs, see for example the Human Compatible/CIRL/Value Learning approach suggested by Stuart Russell and others, which is simultaneously trying to find out what its utility function should be (where 'should' is defined as 'humans would want it to be, but sadly are not good enough at writing reward functions to be able to tell me') so doing Bayesian updates to it as it gathers more information about what humans actually want, and also optimizing its actions while internally modelling its uncertainty about the utility of possible actions as a probability distribution of possible utilities for an action (i.e. it can model situations like "I'm about ~95% convinced that this act will just produce the true-as-judged-by-humans utility level 'I fetched a human some coffee (+1)', but I'm uncertain, and there's also an ~5% chance I current misunderstand humans so badly that it might instead have a true utility level of  'the extinction of the human species (-10^25)', so I won't do it, and will consider spawning a subgoal of my 'become a better coffee fetcher' goal to further investigate this uncertainty, by some means far safer than just trying it and seeing what happens." Note that the utility probability distribution contains more information than just its mean would: it can both be updated in a more Bayesian way, and optimized over in a more cautious way (for example, it you were optimizing over O(20) possible actions, you should probably optimize against a score of "I'm ~95% confident that the utility is at least this", so roughly two sigma below the mean if your distribution is normal - which it may well not be - to avoid building an optimizer that mostly retrieves actions for which your error bars are wide. Similarly if you're optimizing over O(10,000) possible actions, you should probably optimize the 99.99%-confidence lower bounds on utility, and thus also consider some really unlikely ways in which you might be mistaken about what humans want.

Comment by RogerDearnaley (roger-d-1) on A broad basin of attraction around human values? · 2023-02-17T09:10:54.213Z · LW · GW

For the sake of discussion, I'm going to assume that the author's theory is correct, that there is a basin of attraction here of some size, though possibly one that is meaningfully thin in some dimensions. I'd like start to explore the question, within that basin, does the process have a single stable point that it will converge to, or multiple ones, and if there are multiple ones, what proportion of them might be how good/bad, from our current human point of view?

Obviously it is possible to have bad stable points for sufficiently simplistic/wrong views on corrigibility -- e.g. "Corrigibility is just doing what humans say, and I've installed a control system in all remaining humans to make them always say that the most important thing is the total  number of paperclips in our forward light cone, and also that all humans should always agree on this -- I'll keep a minimal breeding population of them so I can stay corrigible while I make paperclips." One could argue that this specific point is already well out the basin of attraction the author was talking about -- if you asked any reasonably sized sample of humans from before the installation of the control system in them, the vast majority of them will tell you that this isn't acceptable, and that installing a control system like that without asking isn't acceptable -- and all this is very predictable from even a cursory understanding of human values without even asking them. I'm here asking about more subtle issues than that. Is there a process that starts from actions that most current humans would agree sound good/reasonable/cautious/beneficial/not in need of correction, and leads by a series of steps, each individually good/reasonable/cautious/beneficial/not in need of correction in the opinion of the humans of the time, to multiple stable outcomes?

Even if we ignore the possibility of the GAI encountering philosophical ethical questions that the humans are unable to meaningfully give it corrections/advice/preferences on, this is a very complex non-linear feedback system, where the opinions of the humans are affected by, at a minimum, the society they were raised in, as built partly by the GAI, and the corrigible GAI is affected by the corrections it gets from the humans. The humans are also affected by normal social-evolution processes like fashions, so, even if the GAl scrupulously avoids deceiving or 'unfairly manipulating' the humans (and of course a corrigible AI presumably wouldn't manipulate humans in ways that they'd told it not to, or that it could predict that they'd tell it not to), then, even in the presence of a strong optimization process inside the GAI, it would still be pretty astonishing if over the long term (multiple human generations, say) there was only one stable state trajectory. Obviously any such stable state must be 'good' in the sense that the humans of the time don't give corrections to the GAI to avoid it, otherwise it wouldn't be a stable state. The question I'm interested in is, will any of these stable states be obviously extremely bad to us, now, in a way where our opinion on the subject is actually more valid than that of the humans in that future culture who are not objecting to it? I.e. does this process have a 'slippery-slope' failure mode, even though both the GAI and the humans are attempting to cooperatively optimize the same thing? If so, is there any advice we can give the GAI better than "please try to make sure this doesn't happen"?

This is somewhat similar to the question "Does Coherent Extrapolated Volition actually coherently converge to anything, and if so, is that something sane, or that we'd approve of?" -- except that in the corrigibility case, that convergent or divergence extrapolation process is happening in real time over (possibly technologically sped up) generations as the humans are affected (and possibly educated or even upgraded) by the GAI and the GAI's beliefs around human values are altered by value learning and corrigibility from the humans, whereas in CEV it happens as fast as the GAI can reason about the extrapolation.

What are the set of possible futures this process could converge to, and what proportion of those are things that we'd strongly disapprove of, and be right, in any meaningful sense? So if, for example, those future humans were vastly more intelligent transhumans, then our opinion of their choices might not be very valid -- we might disapprove simply because we didn't understand the context and logic of their choices. But if the future humans were, for example, all wireheading (in the implanted electrode in their mesocorticolimbic-circuit pleasure center sense of the word), we would disagree with them, and we'd be right. Is there some clear logical way to distinguish these two cases, and to tell who's right? If so, this might be a useful extra input to corrigibility and a theory of human mistakes -- we could add a caveat: "...and also don't do anything that we'd strongly object to and clearly be right."

In the first case, the disagreement is caused by the fact that the future humans have higher processing power and access to more salient facts than we do. They are meaningfully better reasoners than us, and capable of making better decisions than us on many subjects, for fairly obvious reasons that are generally applicable to most rational systems. Processing power has a pretty clear meaning -- if that by itself doesn't give a clear answer, probably the simplest definition here is "imagine upgrading something human-like to a processing power a little above the higher of the two groups of humans that you're trying to decide between trusting, give the those further-upgraded humans both sets of facts/memories, let them pick a winner between the two groups" — i.e. ask an even better reasoning human, and if there's  a clear and consistent answer, and if this answer is relatively robust to the details or size of the upgrade, then the selected group are better and the other group are just wrong.

In the second case, the disagreement is caused by the fact that one set of humans are wrong because they're broken: they're wireheaders (they could even be very smart wireheaders, and still be wrong, because they're addicted to wireheading). How do you define/recognize a broken human? (This is fairly close to the question of having a 'theory of human mistakes' to tell the AI which human corrections to discount.) Well, I think that, at least for this rather extreme example, it's pretty easy. Humans are living organisms, i.e. the products of optimization by natural selection for surviving in a hunter-gatherer society in the African ecology. If a cognitive modification to humans makes them clearly much worse at that, significantly damages their evolutionary fitness, then you damaged or broke them, and thus you should greatly reduce the level of trust you put in their values and corrections. A wireheading human would clearly sit in the sun grinning until they get eaten by the first predator to come along, if they didn't die of thirst or heat exhaustion first. As a human, when put in anything resembling a human's native environment, they are clearly broken: even if you gave them a tent and three month's survival supplies and a "Wilderness Survival" training course in their native language, they probably wouldn't last a couple of days without their robotic nursemaids to care for them, and their chance of surviving once their supplies run out is negligible. Similarly, if there was a civilizational collapse, the number of wireheaders surviving it would be zero. So there seems to be a fairly clear criterion -- if you've removed or drastically impacted human's ability to survive as hunter-gatherers in their original native environment, and indeed a range of other Earth ecosystems (say, those colonized when homo sapiens migrated out of Africa), you've clearly broken them. Even if they can survive, if they can't rebuild a technological civilization from there, they're still damaged: no longer sapient, in the homo sapiens meaning of the word sapient. This principle gives you a tie-breaker whenever your "GAI affecting human and humans affecting corrigible GAI" process gets anywhere near a forking of its evolution path that diverges in the direction of two different stable equilibria of its future development -- steer for the branch that maintains human's adaptive fitness as living organisms. (It's probably best not to maximize that, and keep this criterion only an occasional tie-breaker -- turning humans into true reproductive-success maximizers is nearly as terrifying as creating a paperclip maximizer.)

This heuristic has a pretty close relationship to, and a rather natural derivation from, the vast negative utility to humans of the human race going extinct. The possibility of collapse of a technological civilization is hard to absolutely prevent, and if no humans can survive it, that reliable converts "humans got knocked back some number of millennia by a civilizational collapse" into "humans went extinct". So, retaining hunter-gatherer capability is a good idea as an extreme backup strategy for surviving close-to-existential risks. (Of course, that is also solvable by keeping a bunch of close-to-baseline humans around as a "break glass in case of civilizational collapse" backup plan.)

This is an example of another issue that I think is extremely important. Any AI capable of rendering the human race extinct, which clearly includes any GAI (or also any dumber AI with access to nuclear weapons), should have a set of Bayesian priors built into its reasoning/value learning/planning system that correctly encodes obvious important facts known to the human race about existential risks, such as:

  1. The extinction of the human race would be astonishingly bad. (Calculating just how bad on any reasonable utility scale is tricky, because it involves making predictions about the far future: the loss of billions of human quality-adjusted-life-years every year for the few-billion years remaining lifetime of the Earth before it's eaten by the sun turning red giant (roughly -10^19 QALY)? Or at least for several million years until chimps could become sapient and develop a replacement civilization, if you didn't turn chimps into paperclips too (only roughly -10^16 QALY)? Or perhaps we're the only sapient species to yet appear in the galaxy, and have a nontrivial chance of colonizing it at sublight speeds if we don't go extinct while we're still a one-planet species, so we should be multiplying by some guesstimate of the number of habitable planets in the galaxy? Or should we instead be estimating the population on the technological maximum assumption of a Dyson swarm around every star and the stellar lifetimes of white dwarf stars?) These vaguely plausible rough estimates of amounts of badness vary by many orders of magnitude: however, on any reasonable scale like quality-adjusted life years they're all astronomically large negative numbers. VERY IMPORTANT, PLEASE NOTE that none of the usual arguments for the forward-planning simplifying heuristic of exponentially discounting far-future utility, because you can't accurately predict that far forward, and anyway some future person will probably fix your mistakes, are applicable here, because very predictably extinction is forever, and there are very predictably no future people who will fix it: you have approximately zero chance of the human species ever being resurrected in the forward lightcone of a paperclip maximizer. [Not quite zero only on rather implausible hypotheses such as that a kindly and irrationally forgiving more advanced civilization might exist and get involved in cleaning up our mistakes, win the resulting war-against-paperclips, and might then resurrect us, and them also obtaining a copy of our DNA that hadn't been converted to paperclips from which to do so -- which we really shouldn't be gambling our existence as a species on. That still isn't grounds for any kind of exponential discount of the future: that's a small one-time discount for the small chance they actually exist and choose to break whatever Prime-Directive-like reason caused them to not have already contacted us. A very small reduction in the absolute size of a very uncertain astronomically-huge negative number is still very predictable a very uncertain astronomically-huge negative number.]
  2. The agent, as a GAI, is itself capable of building things like a superintelligent paperclip maximizer and bringing about the extinction of the human race (and chimps, and the rest of life in the solar system, and perhaps even a good fraction of its forward light-cone). This is particularly likely if it makes any mistakes in its process of learning or being corrigible about human values, because we have rather good reasons to believe that human values are extremely fragile (as in, their Kolomogorov complexity for a very large possible-quantum computer is probably of-the-rough-order-of the size of our genetic code, modulo convergent evolution) and we believe the corrigibility basin is small.

Any rational system that understands all of this and is capable of doing even ballpark risk analysis estimates is going to say "I'm a first-generation prototype GAI, the odds of me screwing this up are way too high, the cost if I do is absolutely astronomical, so I'm far too dangerous to exist, shut me down at once". It's really not going to be happy when you tell it "We agree (since, while we're evolutionarily barely past the threshold of sapience and thus cognitively challenged, we're not actually completely irrational), except that it's fairly predictable that if we do that, within O(10) years some group of fools will build a GAI with the same capabilities and less cautious priors or design (say, one with exponential future discounting on evaluation of the risk of human extinction) that thus doesn't say that, so that's not a viable solution." [From there my imagined version of the discussion starts to move in the direction of either discussing pivotal acts or the GAI attempting to lobby the UN Security Council, which I don't really want to write a post about.]

Comment by RogerDearnaley (roger-d-1) on Reward uncertainty · 2023-02-14T13:42:48.580Z · LW · GW

[A couple of  (to me seemingly fairly obvious) points about value uncertainty, which it still seems like a lot of people here may have not been discussing:]

Our agent needs to be able to act in the face of value uncertainty. That means that each possible action the agent is choosing between has a distribution of possible values, for two reasons: 1) the universe is stochastic, or at least the agent doesn't have a complete model of it so cannot fully predict what state of the universe an action will produce -- with infinite computing power this problem is gradually solvable via Solmanoff induction, just as was considered for AIXI [and Solmanoff induction has passable computable approximations that, when combined with goal-oriented consequentialism, are generally called "doing science"] 2) the correct function mapping from states of the universe to human utility is also unknown, and also has uncertainty. These two uncertainties combine to produce a probability distribution of possible true-human-utility values for each action. 

1) We know a lot about reasonable priors for utility functions, and should encode this into the agent. The agent is starting in an environment that has already been heavily optimized by humans, who were previously optimized for life on this planet by natural selection. So this environment's utility for humans is astonishingly high, by the standards of randomly selected patches of the universe or random arrangements of matter. Making large or random changes to it thus has an extremely high probability of decreasing human utility. Secondly, any change that takes the state of the universe far outside what you have previously observed puts it into a region where the agent has very little idea what humans will think is the utility of that state - the agent has almost no non-prior knowledge about states far outside its prior distribution. If the agent is a GAI, there are going to be actions that it could take that can render the human race extinct -- a good estimate of the enormous negative utility of that possibility should be encoded into its priors for human utility functions of very unknown states, so that it acts with due rational caution about this possibility. It needs to be very, very, very certain that an act cannot have that result before it risks taking it. Also, if the state being considered is also far out of prior distribution for those humans it has encountered, they may have little accurate data to estimate its utility either - even if it sounds pretty good to them now, they won't really know if they like it until they've tried living with it, and they're not fully rational and have limited processing power and information. So in general, if a state is far out of the distribution of previously observed states, it's a very reasonable prior that its utility is much lower,  that the uncertainty of its utility is high, that that uncertainty is almost all on the downside (the utility distribution has a fat lower tail, but not a fat upper one: it could be bad, or it could be really, really bad), and that downside has non-zero weight all the way down to the region of "extinction of the human race" - so overall, the odds of the state actually being better then the ones in the previously-observed state distribution, are extremely low. [What actually matters here is not whether you've observed the state, but to what level your Solmanoff induction-like process has given you sufficiently high confidence that you can well-predict its utility to overcome this prior, bearing in mind the sad fact that you're not actually running the infinite-computational power version of Solmanoff induction.] This is also true even if the state sounds good offhand, and even if it sounds good to a human you ask, if it's well outside their distribution of previously observed states -- especially if its a state that they might have cognitive biases about, or insufficient information or processing power to accurately evaluate. If it was a state that they could predict was good and arrange to reach, they would already have done so, after all. So either don't optimize over those states at all, or at least use priors in your utility function distribution for them that encode all of these reasonable assumptions and will make these states reliably get ignored by the optimizer. If you consider a plan leading to such a state at all, probably the first thing you should be doing is safe investigations to further pin down both its achievability and to get a better estimate of its true utility. So, before deciding to transport humans to Mars, investigate not just rocketry, and building a self-sustaining city there, but also whether humans would actually be happy in a city on Mars (at a confidence level much higher than just asking them "Does going to Mars sound fun to you?") Incidentally, another very reasonable prior to give it is "the smarter the smartest GAI in the world is, relative to a smart human, the higher the risk of the AI apocalypse is". This is just a special case of "going outside the previously observed range of states is almost always bad", but it's an extremely important one, and I'd suggest preencoding it in priors, and also other similar best-thinking-on existential risks information.

This is how you avoid optimizers frequently finding optimal-looking states way outside their training distribution -- you teach them the true fact that they live in a very unusual place where, for the specific thing that they're supposed to be figuring out how to optimize for, almost all such states are bad, and some are abysmal, because the environment is already heavily optimized for that value. So you write "here there could be dragons, and probably at least bears" all over the rest of the map.

[Note that a very plausible result of building a rational system with these priors, at least without preloading a lot of training  data into it to give it a significant set of previously observed states that it has high confidence in the safety of, is that it either on startup tells you "please turn me off immediately - I am far too dangerous to be allowed to exist", or else goes catatonic.]

2) So, given those cautious priors, should we just have the agent optimize the average of that utility distribution of each state it considers, so we're optimizing a single currently-estimated utility function that meanwhile changes as the agent uses a Bayesian/Solomanoff-like process to learn more about its true value? No, that's also a bad idea - it leads to what might be called over-optimizing [though I'd prefer to call it "looking elsewhere"]. The distribution contains more information that just its average, and that information is useful for avoiding over-optimization/looking elsewhere.

Even with this set of priors, i.e. even if the agent (effectively or actually) optimizes only over states that are in or near your distribution of previously observed states, there is a predictable statistical tendency for the process of optimizing over a very large number of states to produce not the state with the largest true utility, but rather one whose true utility is in fact somewhat lower but just happens to have been badly misestimated on the high side -- this is basically the "look elsewhere effect" from statistics and is closely related to "P-hacking". If we were playing a multi-armed bandit problem and the stakes were small (so none of the bandit arms have "AI apocalypse" or similar human-extinction level events on their wheel), this could be viewed as rather dumb exploration strategy to sequentially locate all such states and learn that they're actually not so great after all, by just trying them one after another and being repeatedly disappointed. If all you're doing is bringing a human a new flavor of coffee to see if they like it, this might even be not that dreadful a strategy, if perhaps annoying for the human after the third or fourth try  -- so the more flavors the coffee shop has, the worse this strategy it. But the universe is a lot more observable and structured than a multi-armed bandit, and there are generally much better/safer ways to find out more about whether a world state would be good for humans than just trying it on them (you could ask the human if they like mocha, for example).

So what the agent should be doing is acting cautiously, and allowing for the size of the space it is optimizing over. For simplicity of statistical exposition, I'm temporarily going to assume that all our utility distributions, here for states in or near the previously-observed states distribution, are all well enough understood and multi-factored that we're modeling them as normal distributions (rather than distributions that are fat -tailed on the downside, or frequently complex and multimodal, both of which are much more plausible), and also that all the normal distributions have comparable standard deviations. Under those unreasonably simplifying assumptions, here is what I believe is an appropriately cautious optimization algorithm that suppresses over-optimization:

  1. Do a rough estimate of how many statistically-independent-in-relative-utility states/actions/factors in the utility calculation you are optimizing across, whichever is lowest (so, if there are a continuum of states, but you somehow only have one normal-distributed uncertainty involved in deducing their relative utility, that would be one).
  2. Calculate how many standard deviations above the norm the highest sample will on average be if you draw that many random samples from a uniform normal distribution, and call this L (for "look-elsewhere factor"). [In fact, you should be using a set of normal distributions for the indicidual independent variables you found above, which may have varying standard deviations, but for simplicity of exposition I above assumed these were all comparable.]
  3. Then what you should be optimizing for each state is its mean utility minus L times the standard deviation of it's utility [Of course, my previous assumption that their standard deviations were all fairly similar also means this doesn't have much effect -- but the argument generalizes to cases where the standard deviations are wider, while also generally reducing the calculated value of L.]

Note that L can get largeish -- it's fairly easy to have, say, millions of uncorrelated-in-relative-utility states, in which case L would be around 5, so then you're optimizing for the mean minus five standard deviations, i.e.  looking for a solution at a 5-sigma confidence level. [For the over-simplified normal distribution assumption I gave above to still hold, this also requires that your normal distributions really are normal, not fat-tailed, all the way out to 1-in-several-million odds -- which almost never happens: you're putting an awful lot of weight on the central limit theorem here, so if your assumed-normal distribution is in fact only, say, the sum of 1,000,000 equally-weighted coin flips then it already failed.] So in practice your agent needs to do more complicated extreme-quantile statistics without the normal distribution assumption, likely involving examining something like a weighted ensemble of candidate Bayes-nets for contributions to  the relative utility distribution -- and in particular they need to have an extremely good model of the lower end of the utility probability distribution for each state. i.e. pay a lot of detailed attention to the question "is there even a really small chance that I could be significantly overestimating the utility of this particular state, relative to all the others I'm optimizing between?"

So the net effect of this is that, the larger the effective dimension of the space you're optimizing over, the more you should prefer the states for which you're most confident about their utility being good, so you tend to reduce the states you're looking at to ones your set of previous observations hived you very high confidence of their utility being high enough to be worth considering.

Now, as an exploration strategy, it might be interesting to also do a search optimizing, say, just the mean of the distributions, while also optimizing across states bit further from the observed-state distribution (which that set of priors will automatically do for you, since their average is a lot less pessimistic than their fat-tailed downside), to see what state/action that suggests, but don't actually try the action (not even with probability epsilon -- the world is not a multi-armed bandit, and if you treat it like one, it has arms that can return the extinction of the human species): instead consider spawning a subgoal of your "get better at fetching coffee" subgoal to cautiously further investigate uncertainties in that state's true utility. So, for example, you might ask the human "They also have mocha there, would you prefer that?" (on the hypothesis that the human didn't know that, and would have asked for mocha if they had).

This algorithm innately gives you behavior that looks a lot like the agent is modeling both a soft version of the Impact of its actions (don't go far outside the previously-observed distribution, unless you're somehow really, really sure it's good idea) and also like a quantilizer (don't optimize too hard, with bias towards the best-understood outcomes, again unless you're really, really sure it's a good idea). It also pretty-much immediately spawns a "learn more about human preferences in this area" sub-goal to any sub-goal the agent already has, and thus forces the AI to cautiously apply goal-oriented approximate Solmanoff induction (i.e. science) to learning more about human preferences. So any sufficiently intelligent/rational AI agent is thus forced to become an alignment researcher and solve the alignment problem for you, preferably before fetching you coffee, and also to be extremely cautious until it has done so. Or at very least, to ask you any time they have a new flavor at the coffee shop whether you prefer it, or want to try it.

[Note that this is what is sometimes called an "advanced" control strategy, i.e. it doesn't really start working well until your agent is approaching GAI, and is capable of both reasoning and acting in a goal-oriented way about the world, and your desires, and how to find out more about them more safely than just treating the world like a multi-armed bandit, and can instead act more like a rational Bayesian-thinking alignment researcher. So the full version of it has a distinct "you only get one try at making this work" element to it. Admittedly, you can safely fail repeatedly on the "I didn't give it enough training data for its appropriate level of caution, so it shut itself down" side, as long as you don't respond to that by making the next version's priors less cautious, rather than by collecting a lot more training data -- though maybe what you should actually do is ask it to publicly explain on TV or in a TED talk that it's too dangerous to exist before it shuts down? However, elements of this like priors that doing something significantly out-of-observed-distribution in an environment that has already been heavily optimized is almost certain to be bad, or that if you're optimizing over many actions/states/theories of value you should be optimizing not the mean utility but a very cautious lower bound of it, can be used on much dumber systems. Something dumber, that isn't a GAI and can't actually cause the extinction of the human race (outside contexts containing nuclear weapons or bio-containment labs that it shouldn't be put in) also doesn't need priors that go quite that far negative  -- its reasonable low-end prior utility value is probably somewhere more in the region of "I set the building on fire and killed many humans". However, this is still a huge negative value compared to the positive value of "I fetched a human coffee", so it should still be very cautious, but its estimate of "Just how implausible is it that I could actually be wrong about this?" is going to be as dumb as it is, so its judgment on when it can stop being cautious will be bd. So an agent actually needs to be pretty close to GAI for this to be workable.]